Fuzzy matching use case: improving data quality when loading data into BigQuery with Trifacta, which simplifies the process.
Category: Data Preparation
Data Preparation is the process of making data ready for use and enhancing its quality. It covers recurring logical data modelling, physical data modelling, and data sourcing and access scenarios.
To clarify, Data Preparation is one of the phases of Data Ingestion. The other phases are Data Acquisition and Data Landing.
We use Data Preparation:
- To improve the performance of future, repeatable data processes.
- To apply business rules that convert a set of data values. The goal is to transform the source data into an altered view in a destination data set. We include obfuscation (encryption) and masking (anonymization) in this area.
- To process data for presentation or consumption for different applications.
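As a minimal sketch of the obfuscation and masking rules mentioned above (the field names, the SHA-256 tokenization scheme and the rule of nulling out phone numbers are illustrative assumptions, not a prescribed standard):

```python
import hashlib

def mask_record(record):
    """Apply simple preparation rules before data lands in the target:
    tokenize the email with a deterministic hash and null out the phone.
    (Illustrative rules only; real pipelines would use keyed tokenization.)"""
    masked = dict(record)
    # Tokenization: replace the value with a deterministic hash token,
    # so records can still be joined on it without exposing the raw value
    masked["email"] = hashlib.sha256(record["email"].encode()).hexdigest()[:16]
    # Masking: remove the sensitive value entirely
    masked["phone"] = None
    return masked

record = {"name": "Vanilla Inc", "email": "ops@vanilla.example", "phone": "+1-555-0100"}
print(mask_record(record))
```

Because the token is a deterministic hash, two records with the same email still produce the same token, which keeps joins and deduplication possible after masking.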
Ideally, we first acquire data without transforming it. Once it has landed in the target, we prepare it for consumption. In this way, we eliminate the business and technical problems caused by inaccurate, contradictory and inconsistent data.
However, some scenarios require minimal transformations to make the ingestion technically possible, e.g., formatting a date (15.3.22) to unify all formats (15/3/2022). In other cases, we prefer to include transformations to ensure a certain level of data quality: for example, when we receive records such as “Vanilla Inc”, “vanila” or “vanilla In”, we may unify them to reduce duplicate records. In yet other scenarios, we need to blur, null, tokenize, substitute or encrypt the data before it lands, for security reasons.
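The two transformations above can be sketched in a few lines of Python, using only the standard library: `datetime` to unify date formats, and `difflib.SequenceMatcher` as a simple fuzzy-matching similarity measure to collapse near-duplicate company names. The 0.7 similarity threshold is an illustrative assumption, not a recommendation.

```python
from datetime import datetime
from difflib import SequenceMatcher

def unify_date(raw):
    """Parse a 'd.m.yy' date and emit the unified 'd/m/YYYY' format."""
    parsed = datetime.strptime(raw, "%d.%m.%y")
    return f"{parsed.day}/{parsed.month}/{parsed.year}"

def canonicalize(value, canonical_names, threshold=0.7):
    """Map a noisy name onto the closest canonical name when similar enough.
    (The 0.7 threshold is an illustrative choice; tune it per data set.)"""
    def similarity(candidate):
        return SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
    best = max(canonical_names, key=similarity)
    return best if similarity(best) >= threshold else value

print(unify_date("15.3.22"))                           # 15/3/2022
print(canonicalize("vanila", ["Vanilla Inc"]))         # Vanilla Inc
print(canonicalize("vanilla In", ["Vanilla Inc"]))     # Vanilla Inc
```

A value with no close canonical match (say, “Acme”) falls below the threshold and is kept as-is, so the rule only merges records it is reasonably confident about.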
In this category, there are how-to articles, tips, tricks, best practices and use cases for data preparation, including data validation transformations, data wrangling, fuzzy matching and others.
Fuzzy Matching or Approximate String Matching
Fuzzy matching is a technique for matching text strings that are similar but not identical. We use it in web searches, data quality processes, etc.
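One common similarity measure behind fuzzy matching is the Levenshtein (edit) distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal dynamic-programming sketch:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, using one rolling row of the
    classic dynamic-programming table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1  # substitution is free on a match
            curr.append(min(prev[j] + 1,        # delete ca
                            curr[j - 1] + 1,    # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("vanilla", "vanila"))  # 1: one deleted 'l'
```

A low edit distance relative to the string length is what lets “vanila” match “vanilla” even though the strings are not equal.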
A brief introduction to BigQuery’s architecture
This post explains BigQuery’s internal design and how to efficiently load, replicate or migrate data into it.