ETL is an abbreviation for Extraction Transformation Loading. Purpose of ETL is to get data out of the source systems and load it into the data warehouse. Simply a process of copying data from one place to other. Typically, data is extracted from an OLTP database, transformed to match the data warehouse schema and loaded into the data warehouse database.
Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading. Therefore, when defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation.
ETL is not one time event
ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development efforts and requires the skills of business analysts, database designers, and application developers. It is not a one time event as new data is added to the Data Warehouse periodically – monthly, daily, hourly.
ETL is an integral, ongoing, and recurring part of a data warehouse and therefore ETL process is:
- Well documented
- Easily changeable
ETL staging database
ETL operations should be performed on a relational database server separate from the source databases and the data warehouse database. This is to ensure that there is no or minimal impact on operational systems and BI applications. Therefore separate staging area is created for ET process.
Staging database creates a logical and physical separation between the source systems and the data warehouse. Therefore it minimizes the impact of the intense periodic ETL activity on source and data warehouse databases.
Why we need staging area?
A staging database is used as a "working area" for the ETL. This will help to load data from the sources, modify & cleanse them before loading them into the data warehouse.
One of the objectives of the staging area is to facilitate restartability and minimise the impact the extraction has on your source system. You extract data from your source system only once and store a copy of it in your staging database. Should your ETL fail further down the line, you do not need to impact your source system by extracting the data for a second time. If you store the results of each logical step of your transformation in staging tables, you can restart you ETL from the last successful staging step.
1.1 ETL Data Extraction Process
The integration of all of the disparate systems across the enterprise is the real challenge to getting the data warehouse to a state where it is usable
In extraction process data is extracted from heterogeneous data sources. Each data source has its distinct set of characteristics that need to be managed and integrated into the ETL system in order to effectively extract data. We have to integrate different:
- Operating Systems
- Communication protocols
ETL logical data mapping
Before starting with implementing data extraction, we need to have a logical data map before the physical data can be transformed. The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system usually presented in a table or spreadsheet. The content of the logical data mapping document has been proven to be the critical element required to efficiently plan ETL processes.
Above screenshot of the mapping document header gives us a glimse of how ETL mapping document looks like. The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process. The table type gives us our queue for the ordinal position of our data load processes—first dimensions, then facts.
The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL. The SQL may or may not be the complete statement.
Another important activity involved in the initial phase of ETL or data extraction is the analysis of source system. Broadly, we do couple of important tasks here:
- The data discovery phase
- The anomaly detection phase
The data discovery phase
Once you understand what the target needs to look like, you need to identify and examine the data in source systems. Key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it. It is up to the ETL team to drill down further into the data requirements to determine each and every source system, table, and attribute required to load the data warehouse.
During data discovery phase some of the below tasks are accomplished:
- Collecting and Documenting Source Systems
- Keeping track of source systems
- Determining the System of Record - Point of originating of data
- Definition of the system-of-record is important because in most enterprises data is stored redundantly across many different systems.
- Enterprises do this to make nonintegrated systems share data. It is very common that the same piece of data is copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise, resulting in varying versions of the same data
Understanding the content of the data is crucial for determining the best approach for retrieval. Examples include giving attention to NULL values and data fields.
NULL values: An unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns. Joining two or more tables based on a column that contains NULL values will cause data loss! Remember, in a relational database NULL is not equal to NULL. That is why those joins fail. Check for NULL values in every foreign key in the source database. When NULL values are present, you must outer join the tables.
Dates in non-date fields: Dates are very peculiar elements because they are the only logical elements that can come in various formats, literally containing different values and having the exact same meaning. Fortunately, most database systems support most of the various formats for display purposes but store them in a single standard format.
Change data capture
During the initial load, capturing changes to data content in the source data is unimportant because you are most likely extracting the entire data source or a potion of it from a predetermined point in time. Later, the ability to capture data changes in the source system instantly becomes priority. The ETL team is responsible for capturing data-content changes during the incremental load.
How to identify changed data?
There are multiple ways to implement change data captue logic. Below are some of those:
Audit columns are provided as part of database feature and are updated by triggers. Audit columns are appended to the end of each table to store the date and time a record was added or modified. You must analyze and test each of the columns to ensure that it is a reliable source to indicate changed data. If you find any NULL values, you must to find an alternative approach for detecting change – example using outer joins.
Process of Elimination
This is not the most efficient technique, but most reliable for capturing changed data.
Process of elimination preserves exactly one copy of each previous extraction in the staging area for future use. During the next run, the process takes the entire source table(s) into the staging area and makes a comparison against the retained data from the last process. Only differences (deltas) are sent to the data warehouse.
Initial and Incremental Loads
Here we create two separate tables: previous load and current load.
The initial process bulk loads into the current load table. Since change detection is irrelevant during the initial load, the data continues on to be transformed and loaded into the ultimate target fact table. When the process is complete, it drops the previous load table, renames the current load table to previous load, and creates an empty current load table. Since none of these tasks involve database logging, they are very fast!
The next time the load process is run, the current load table is populated.
Once data is loaded next time, we have to select the current load table MINUS the previous load table. Transform the data and then load the result set into the data warehouse.
1.2 ETL Data Transformation Process
Data transformation is the important step in ETL process where some value addition happens. In this step changes to the data occur and we will confirm whether the data can be used for its intended purposes. Data transformation is performed in the staging area.
Data quality check
Data quality checks are run at two places - once after data extraction and another time after cleaning of data. Additional quality check is to confirm that data is as intended. A successful quality check assures that resulted data is:
Above graph describes the paradox of ETL speed vs data quality screening. If we include more quality check then the ETL speed has to be compromised. Similarly if we need best possible ETL speed then data quality check has to be minimized.
Screening of data to achieve a desired quality contains running the staging data through multiple rules or checks. Some of those are listed below:
- Detecting data anomaly by using record counts in tables. We may further do sampling of data for a particular column of a table.
- Enforcing constraints on physical columns of a table. Data is verified for multiple aspects such as NOT NULL constraint, numeric values outside the boundary, invalid column lengths, invalid lookup values etc.
- Enforcing structural constraints. We may ensure physical or logical primary and foreign keys constraints to verify referential integrity relationships.
- Enforcing business rules. Depending on requirement, a set of business rules may be verified to confirm correctness of the data.
Data transformation workflow
Data loaded into staging tables will be subjected to series of cleaning and confirming processes. Then the result of these processes will be used to decide next cource of action. If there are fatal errors then we may stop ETL process and fix the issues. Otherwise, we will continue to load the data to target data warehouse.
1.3 ETL Data Loading Process
Loading data is the final step of ETL process. This is where data is finally moved to a data warehouse database. Data loading process can be further devided into two categories based on the nature of target tables - dimension tables and fact tables.
Loading data to dimension tables
Dimension tables store attributes, or dimensions, that describe the objects in a fact table. For example, a Dm_Time table may store details of days, months, and year values. Dimension tables have some unique characters as compared to fact tables:
- Physically built to have the minimal sets of components
- Often the primary key is a single field containing meaningless unique integer – it is also called surrogate Keys.
- Data warehouse system owns these keys and assigns the values on its own
- Dimension tables are de-normalized flat tables – all attributes in a dimension must take on a single value in the presence of a dimension primary key.
- Dimension table possessed one or more other fields that compose the natural key - also called business key - of the dimension
Above mentioned characteristic of a dimension table is only typical, it is not a mandatory. Some of the characteristics may not apply to a dimension table depending of design and business requirements.
Loading slowly changing dimensions (SCD)
The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.
When DW receives notification that an existing row in dimension has changed it gives out 3 types of responses
- Type 1
- Type 2
- Type 3
Type 1 SCD
In this type of data load, old value will be over written. In this method no history of dimension changes is kept in the database.
Type 2 SCD
In this methodology all history of dimension changes is kept in the database. For each changed record we add a new record and update the corresponding old record.
Type 3 SCD
In this type usually only the current and previous value of dimension is kept in the database. This is achieved through additional column in a target table.
Loading fact tables
Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple. If a measurement exists, it can be modeled as a fact table row. If a fact table row exists, it is a measurement.
Building fact table keys
The final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys. ETL maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity.
All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables.
Managing fact table indexes
Fact tables are indexed to optimize SELECT queries to assist fast response for reporting system. However, while loading the data these indexes will slow down the data load process. So, we need to take some actions to minimize the impact.
- Drop all indexes in pre-load time
- Segregate Updates from inserts
- Load updates
- Rebuild indexes
Managing fact table partitions
Partitions allow a table (and its indexes) to be physically divided into minitables for administrative purposes and to improve query performance. The most common partitioning strategy on fact tables is to partition the table by the date key. Because the date dimension is preloaded and static, you know exactly what the surrogate keys are.
Partition will be done on the fact table key that joins to the date dimension for the optimizer to recognize the constraint. The ETL team must be advised of any table partitions that need to be maintained.