ETL Process in Data Warehouse

1.2 ETL Data Transformation Process

Data transformation is the important step in ETL process where some value addition happens. In this step changes to the data occur and we will confirm whether the data can be used for its intended purposes. Data transformation is performed in the staging area.

Data quality check

Data quality checks are run at two places - once after data extraction and another time after cleaning of data. Additional quality check is to confirm that data is as intended. A successful quality check assures that resulted data is:

  • Correct
  • Unambiguous
  • Consistent
  • Complete

ETL speed vs Data quality check

Above graph describes the paradox of ETL speed vs data quality screening. If we include more quality check then the ETL speed has to be compromised. Similarly if we need best possible ETL speed then data quality check has to be minimized.

Screening of data to achieve a desired quality contains running the staging data through multiple rules or checks. Some of those are listed below:

  • Detecting data anomaly by using record counts in tables. We may further do sampling of data for a particular column of a table.
  • Enforcing constraints on physical columns of a table. Data is verified for multiple aspects such as NOT NULL constraint, numeric values outside the boundary, invalid column lengths, invalid lookup values etc.
  • Enforcing structural constraints. We may ensure physical or logical primary and foreign keys constraints to verify referential integrity relationships.
  • Enforcing business rules. Depending on requirement, a set of business rules may be verified to confirm correctness of the data.

Data transformation workflow

 

ETL data transformation work flow

Data loaded into staging tables will be subjected to series of cleaning and confirming processes. Then the result of these processes will be used to decide next cource of action. If there are fatal errors then we may stop ETL process and fix the issues. Otherwise, we will continue to load the data to target data warehouse.