When we transfer data from one storage system (like a database, object storage, etc.) to another, whether we migrate, load, copy, or replicate it, we must check the quality of the copy by validating and reconciling it. These practices allow us to identify data corrupted during the transfer and correct it. For example:
- Our transfer process may be interrupted, leaving the data in an inconsistent state, i.e. in different states in the source and the target. E.g., if a network error disconnects the transfer.
- There may be an error in the data integration logic, or an intended effect of the application or business logic implemented in the source. E.g., the rows left unextracted on delta loads by the SAP incremental offload process.
- We can’t insert values in the target because they are incompatible with the data type.
- The data transfer process fails because of an error, e.g., insufficient space in the target.
Data Validation
Data Validation is the suite of processes we use to check or prove that the data we copied is correct and accurate. We implement it by enforcing rules and checks that ensure the logical consistency of input and stored data. For example:
- Data type check: To confirm that the data entered is correct. E.g., ensure we insert numbers in an integer field rather than text.
- Range and constraint check: To verify whether input data falls within a predefined range. For example, the values from a counter must be positive or zero, a text string must have a certain length, a latitude value must be between -90 and 90, and a longitude value between -180 and 180.
- Code and cross-reference check: To ensure that a field contains a value from a valid list or follows specific formatting rules. E.g., we can check a postal code against a list of valid codes.
- Structured check or format check: Certain data types can have several formats. E.g., dates can be ‘DD-MM-YYYY’, ‘YYYY-MM-DD’, ‘DD.MM.YYYY’, ‘MM/DD/YYYY’, etc. If we implement a rule that ensures all dates are in the same format, we maintain consistency across data and through time.
- Consistency check: To confirm the data has been entered in a logically consistent way. E.g., ensure the delivery date for a parcel is after the shipping date.
- Uniqueness check: Some fields, like email addresses or passport numbers, must have unique values. We use the uniqueness check to ensure that we don’t enter an item multiple times into a data storage.
The Data Validation article on Wikipedia describes the different kinds of data validation and the types of rules you can implement.
We can implement many of the Data Validation rules in the source and target systems and in the solution we use to transfer the data, so they check it automatically. For instance, many third-party data integration tools already include data validation rules, such as data type checks. Other validations, such as code or format checks, we must implement ourselves because they depend on the nature of the fields or on our requirements.
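To make these rules more concrete, here is a minimal sketch in Python of a few of the checks described above (data type, range, format, and uniqueness). The field names, the sample rows, and the chosen date format are hypothetical; adapt them to your own schema.

```python
import re
from datetime import datetime

def check_integer(value):
    """Data type check: the value must be an integer (not text, not a boolean)."""
    return isinstance(value, int) and not isinstance(value, bool)

def check_latitude(value):
    """Range check: latitude must be between -90 and 90."""
    return -90 <= value <= 90

def check_date_format(value, fmt="%Y-%m-%d"):
    """Format check: the date must follow the agreed format (here 'YYYY-MM-DD')."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def check_unique(values):
    """Uniqueness check: no value may appear more than once."""
    return len(values) == len(set(values))

# Example: validate a small batch of rows before loading them into the target.
rows = [
    {"id": 1, "latitude": 48.85, "created_at": "2023-05-01"},
    {"id": 2, "latitude": 95.00, "created_at": "01-05-2023"},  # fails range and format checks
]
for row in rows:
    errors = []
    if not check_integer(row["id"]):
        errors.append("id is not an integer")
    if not check_latitude(row["latitude"]):
        errors.append("latitude out of range")
    if not check_date_format(row["created_at"]):
        errors.append("date format is wrong")
    print(row["id"], errors or "OK")

if not check_unique([row["id"] for row in rows]):
    print("duplicate ids found")
```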
Data Reconciliation
In contrast, Data Reconciliation is the suite of procedures we employ to compare the data between the source and the target systems after we transfer it. It must detect any unexpected modification in the copy and ensure the data in the target is consistent with the data in the source. I.e., Data Reconciliation guarantees that any source transaction brings the target database from one valid state to another.
To reconcile the data, we must know the source and target systems, their data lifecycle, and how exactly we transfer it. I list below several scenarios for reconciling data. It is not an exhaustive list of all possibilities, but it will help you evaluate the best Data Reconciliation method for your situation.
1. If you copy all data on the source to the target, you can count the records at both ends. You can also see the number of rows you copied in the log of the Data Integration solution.
You can also use a checksum function, which detects accidental changes in a data sequence and thus protects its integrity. The data is transmitted along with its checksum, so you can recompute the value at the target and compare it with the value received; any discrepancy means the copy differs from the original.
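As an illustration of this first scenario, here is a minimal sketch in Python that compares a row count and a checksum computed at both ends. The sample rows stand in for the result sets you would fetch from the source and the target, and the hashing scheme (canonical strings, sorted, fed to SHA-256) is just one possible choice.

```python
import hashlib

def table_checksum(rows):
    """Compute an order-independent checksum over a list of rows.

    Each row is serialised to a canonical string, the strings are sorted,
    and the whole sequence is hashed with SHA-256.
    """
    canonical = sorted("|".join(str(v) for v in row) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

# Stand-ins for the result sets fetched from the source and the target.
source_rows = [(1, "Alice", "2023-05-01"), (2, "Bob", "2023-05-02")]
target_rows = [(2, "Bob", "2023-05-02"), (1, "Alice", "2023-05-01")]

print("row counts match:", len(source_rows) == len(target_rows))
print("checksums match:", table_checksum(source_rows) == table_checksum(target_rows))
```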
2. However, if you use a Change Data Capture (CDC) process, you identify the changes on the source and transfer them to the target. The log in the Data Integration solution won’t show the number of rows it copied but the number of operations it applied to the data. You may apply several operations to one record, such as inserting a new row, updating it several times, and then deleting it. There is no row at the end, even though several operations appear in the log.
Additionally, the order in which you apply the changes to the target in CDC mode may change the result. For example, inserting a row, updating it, and then deleting it is not the same as inserting a row, deleting it, and afterwards trying to update it.
When you copy data between two systems in CDC mode, you must reconcile the data by checking the source and the target. The log of the Data Integration solution won’t be helpful.
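The following sketch in Python shows why the operation count in the log is not the row count, and why the order of operations matters. It replays a hypothetical stream of CDC operations (insert, update, delete) against an in-memory table; the lenient policy of ignoring updates on missing keys is an assumption, not a standard behaviour.

```python
def apply_cdc(operations):
    """Replay a stream of CDC operations against an in-memory 'table'.

    Each operation is a tuple (op, key, value), where op is 'insert',
    'update', or 'delete'. Updates on missing keys are ignored, which is
    one possible (lenient) policy.
    """
    table = {}
    for op, key, value in operations:
        if op == "insert":
            table[key] = value
        elif op == "update" and key in table:
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table

# Four operations on one record: the log counts 4 operations,
# but no row remains at the end.
ops = [
    ("insert", 1, "a"),
    ("update", 1, "b"),
    ("update", 1, "c"),
    ("delete", 1, None),
]
print(apply_cdc(ops))  # {}

# The same two operations in a different order give a different result.
print(apply_cdc([("insert", 1, "a"), ("update", 1, "b")]))  # {1: 'b'}
print(apply_cdc([("update", 1, "b"), ("insert", 1, "a")]))  # {1: 'a'}
```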
3. Additionally, using your Data Integration solution to reconcile the data implies applying the same methods you used to copy it. If there is a logical error in the transfer process, you won’t be able to detect it, and your Data Reconciliation methodology will be inaccurate. You should use a different method for Data Reconciliation than for Data Integration.
4. If you purge the data on the source and only keep a small amount of historical data, you can’t compare the number of rows in the source and the target after some time. You need to reconcile the last chunk of data you moved as soon as you move it.
5. Some applications, like SAP ERP, aggregate the historical data and delete the details. You need to know this lifecycle process to reconcile the data you copy from SAP ERP to a Data Warehouse.
6. If there are schema changes in the source, such as new tables, dropped tables, or altered ones, you must account for them when you reconcile the data.
7. You can apply some optimisation, validation, or transformation during the copy that affects the data as you move it to the target and maintain it there. For example, you may have a BLOB field in the source that you split into several columns in the target. Whenever the BLOB field changes, the Data Integration solution will show an update on all the target columns, even though it may only change the data in the target columns that actually need it.
8. You must find the right field(s) to match the source and target data and evaluate whether the copy is correct. A Data Reconciliation method identifies which data is missing in the target and which data is extra (duplicates, rows that don’t exist in the source, etc.). Remember that sometimes the Data Reconciliation process may show your copy is incorrect because a lifecycle or maintenance process was running at one of the ends.
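As a final illustration, here is a minimal sketch in Python that matches source and target rows on a key field and reports missing rows, extra rows, and value mismatches. The key name, the row structure, and the sample data are hypothetical; in practice you would feed it the result sets of the queries you run against each system.

```python
def reconcile_by_key(source_rows, target_rows, key="id"):
    """Compare source and target rows on a key field.

    Returns the keys missing in the target, the keys that exist only in
    the target (extra rows), and the keys present in both but with
    different values (mismatches).
    """
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    missing = sorted(source.keys() - target.keys())
    extra = sorted(target.keys() - source.keys())
    mismatched = sorted(k for k in source.keys() & target.keys() if source[k] != target[k])
    return missing, extra, mismatched

source_rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]
target_rows = [
    {"id": 1, "amount": 10},
    {"id": 3, "amount": 99},  # value differs from the source
    {"id": 4, "amount": 40},  # extra row not present in the source
]

missing, extra, mismatched = reconcile_by_key(source_rows, target_rows)
print("missing in target:", missing)   # [2]
print("extra in target:", extra)       # [4]
print("values differ:", mismatched)    # [3]
```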
Data Quality
Both Data Validation and Data Reconciliation contribute to good overall Data Quality, which omni.sci defines as:
«The measure of how well suited a data set is to serve its specific purpose. Measures of data quality are based on data quality characteristics such as accuracy, completeness, consistency, validity, uniqueness, and timeliness.»
Data Quality is a vast topic, and I won’t discuss it today.