ETL Process in Data Warehouse

1.3 ETL Data Loading Process

Loading data is the final step of ETL process. This is where data is finally moved to a data warehouse database. Data loading process can be further devided into two categories based on the nature of target tables - dimension tables and fact tables.

 

Fact and Dimension table structure
Fact and Dimension table structure

Loading data to dimension tables

Dimension tables store attributes, or dimensions, that describe the objects in a fact table. For example, a Dm_Time table may store details of days, months, and year values. Dimension tables have some unique characters as compared to fact tables:

  • Physically built to have the minimal sets of components
  • Often the primary key is a single field containing meaningless unique integer – it is also called surrogate Keys.
  • Data warehouse system owns these keys and assigns the values on its own
  • Dimension tables are de-normalized flat tables – all attributes in a dimension must take on a single value in the presence of a dimension primary key.
  • Dimension table possessed one or more other fields that compose the natural key - also called business key - of the dimension

Above mentioned characteristic of a dimension table is only typical, it is not a mandatory. Some of the characteristics may not apply to a dimension table depending of design and business requirements.

Loading slowly changing dimensions (SCD)

The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.

When DW receives notification that an existing row in dimension has changed it gives out 3 types of responses

  1. Type 1
  2. Type 2
  3. Type 3

Type 1 SCD

Type 1 SCD

In this type of data load, old value will be over written. In this method no history of dimension changes is kept in the database.

Type 2 SCD

Type 2 SCD

In this methodology all history of dimension changes is kept in the database. For each changed record we add a new record and update the corresponding old record.

Type 3 SCD

Type 3 SCD

In this type usually only the current and previous value of dimension is kept in the database. This is achieved through additional column in a target table.

Loading fact tables

Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple. If a measurement exists, it can be modeled as a fact table row. If a fact table row exists, it is a measurement.

Building fact table keys

The final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys. ETL maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity.

 

Building fact table keys
Building fact table keys

All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables.

Managing fact table indexes

Fact tables are indexed to optimize SELECT queries to assist fast response for reporting system. However, while loading the data these indexes will slow down the data load process. So, we need to take some actions to minimize the impact.

  • Drop all indexes in pre-load time
  • Segregate Updates from inserts
  • Load updates
  • Rebuild indexes

Managing fact table partitions

Partitions allow a table (and its indexes) to be physically divided into minitables for administrative purposes and to improve query performance. The most common partitioning strategy on fact tables is to partition the table by the date key. Because the date dimension is preloaded and static, you know exactly what the surrogate keys are.

Partition will be done on the fact table key that joins to the date dimension for the optimizer to recognize the constraint. The ETL team must be advised of any table partitions that need to be maintained.