Fuzzy matching use case: improving data quality when loading data into BigQuery with Trifacta, which simplifies the process.
Category: Data Preparation
Data Preparation is the process of making data ready for use and enhancing its quality. It covers recurring logical data modelling, physical data modelling, and data sourcing and access scenarios.
To clarify, Data Preparation is one of the phases of Data Ingestion. The other phases are Data Acquisition and Data Landing.
We use Data Preparation:
- To improve the performance of future, repeatable data processes.
- To apply business rules that convert a set of data values. The goal is to transform the source data into an altered view in a destination data set. We include obfuscation (encryption) and masking (anonymization) in this area.
- To process data for presentation or consumption for different applications.
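As a minimal sketch of the obfuscation and masking rules mentioned above (the field names, the SHA-256 tokenization scheme and the rule of nulling out phone numbers are illustrative assumptions, not a prescribed standard):

```python
import hashlib

def mask_record(record):
    """Apply simple preparation rules before data lands in the target:
    tokenize the email with a deterministic hash and null out the phone.
    (Illustrative rules only; real pipelines would use keyed tokenization.)"""
    masked = dict(record)
    # Tokenization: replace the value with a deterministic hash token,
    # so records can still be joined on it without exposing the raw value
    masked["email"] = hashlib.sha256(record["email"].encode()).hexdigest()[:16]
    # Masking: remove the sensitive value entirely
    masked["phone"] = None
    return masked

record = {"name": "Vanilla Inc", "email": "ops@vanilla.example", "phone": "+1-555-0100"}
print(mask_record(record))
```

Because the token is a deterministic hash, two records with the same email still produce the same token, which keeps joins and deduplication possible after masking.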
Ideally, we first acquire data without transforming it. Once it has landed in the target, we prepare it for consumption. In this way, we eliminate the business and technical problems caused by inaccurate, contradictory and inconsistent data.
However, some scenarios require minimal transformations to make the ingestion technically possible, e.g., formatting a date (15.3.22) to unify all formats (15/3/2022). In other cases, we prefer to include transformations to ensure a certain level of data quality: for example, when we receive records such as “Vanilla Inc”, “vanila” or “vanilla In”, we may unify them to reduce duplicate records. In yet other scenarios, we need to blur, null, tokenize, substitute or encrypt the data before it lands, for security reasons.
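The two transformations above can be sketched in a few lines of Python, using only the standard library: `datetime` to unify date formats, and `difflib.SequenceMatcher` as a simple fuzzy-matching similarity measure to collapse near-duplicate company names. The 0.7 similarity threshold is an illustrative assumption, not a recommendation.

```python
from datetime import datetime
from difflib import SequenceMatcher

def unify_date(raw):
    """Parse a 'd.m.yy' date and emit the unified 'd/m/YYYY' format."""
    parsed = datetime.strptime(raw, "%d.%m.%y")
    return f"{parsed.day}/{parsed.month}/{parsed.year}"

def canonicalize(value, canonical_names, threshold=0.7):
    """Map a noisy name onto the closest canonical name when similar enough.
    (The 0.7 threshold is an illustrative choice; tune it per data set.)"""
    def similarity(candidate):
        return SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
    best = max(canonical_names, key=similarity)
    return best if similarity(best) >= threshold else value

print(unify_date("15.3.22"))                           # 15/3/2022
print(canonicalize("vanila", ["Vanilla Inc"]))         # Vanilla Inc
print(canonicalize("vanilla In", ["Vanilla Inc"]))     # Vanilla Inc
```

A value with no close canonical match (say, “Acme”) falls below the threshold and is kept as-is, so the rule only merges records it is reasonably confident about.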
In this category, there are how-to articles, tips, tricks, best practices and use cases for data preparation, including data validation transformations, data wrangling, fuzzy matching and others.
Fuzzy Matching or Approximate String Matching
Fuzzy matching is a technique for matching text strings that are similar but not identical. We use it in web searches, data quality processes, etc.
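One common similarity measure behind fuzzy matching is the Levenshtein (edit) distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal dynamic-programming sketch:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, using one rolling row of the
    classic dynamic-programming table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1  # substitution is free on a match
            curr.append(min(prev[j] + 1,        # delete ca
                            curr[j - 1] + 1,    # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("vanilla", "vanila"))  # 1: one deleted 'l'
```

A low edit distance relative to the string length is what lets “vanila” match “vanilla” even though the strings are not equal.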
A brief introduction to BigQuery’s architecture
This post explains BigQuery’s internal design and how to efficiently load, replicate or migrate data into it.