ETL Process in Data Warehouse

ETL is an abbreviation for Extraction, Transformation, and Loading. The purpose of ETL is to get data out of the source systems and load it into the data warehouse; put simply, it is a process of copying data from one place to another. Typically, data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database.

Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading. Therefore, when defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation.
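The three steps above can be sketched in a few lines of code. This is a minimal illustration, not a production design: the table and column names (`orders`, `dim_orders`, `order_total`) are hypothetical, and in-memory SQLite databases stand in for the real OLTP source and warehouse servers.

```python
import sqlite3

def extract(source: sqlite3.Connection) -> list:
    # Extract: read raw rows from the operational (OLTP) source.
    return source.execute("SELECT id, qty, unit_price FROM orders").fetchall()

def transform(rows: list) -> list:
    # Transform: derive a total so rows match the warehouse schema.
    return [(oid, qty * price) for oid, qty, price in rows]

def load(warehouse: sqlite3.Connection, rows: list) -> None:
    # Load: insert the conformed rows into the warehouse table.
    warehouse.executemany(
        "INSERT INTO dim_orders (id, order_total) VALUES (?, ?)", rows
    )
    warehouse.commit()

# In-memory databases stand in for the real source and warehouse servers.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, qty INTEGER, unit_price REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 2, 9.5), (2, 1, 20.0)])

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE dim_orders (id INTEGER, order_total REAL)")

load(dw, transform(extract(src)))
print(dw.execute("SELECT id, order_total FROM dim_orders ORDER BY id").fetchall())
# → [(1, 19.0), (2, 20.0)]
```

The same shape applies whether the source is an OLTP database, a text file, or a spreadsheet: only the `extract` function changes.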

ETL is not a one-time event

ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers. It is not a one-time event, as new data is added to the data warehouse periodically: monthly, daily, or even hourly.

ETL is an integral, ongoing, and recurring part of a data warehouse, and therefore the ETL process should be:

  • Automated
  • Well documented
  • Easily changeable

ETL staging database

ETL operations should be performed on a relational database server that is separate from both the source databases and the data warehouse database. This ensures that there is no (or minimal) impact on the operational systems and BI applications. Therefore a separate staging area is created for the ETL process.

The staging database creates a logical and physical separation between the source systems and the data warehouse, which minimizes the impact of the intense periodic ETL activity on both the source and data warehouse databases.

Why do we need a staging area?

A staging database is used as a "working area" for the ETL process. It lets you load data from the sources, then modify and cleanse it before loading it into the data warehouse.
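The "working area" idea can be shown with a small sketch, assuming an in-memory SQLite database as the staging server and illustrative table names (`stg_customers`): raw rows are landed as-is, cleansed inside staging, and only then considered ready for the warehouse.

```python
import sqlite3

# Staging database standing in for a separate staging server.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_customers (id INTEGER, name TEXT)")

# 1. Land the raw extract as-is, including dirty values.
raw_rows = [(1, "  Alice "), (2, None), (3, "Bob")]
staging.executemany("INSERT INTO stg_customers VALUES (?, ?)", raw_rows)

# 2. Cleanse inside staging: trim whitespace, drop rows with no name.
staging.execute("UPDATE stg_customers SET name = TRIM(name) WHERE name IS NOT NULL")
staging.execute("DELETE FROM stg_customers WHERE name IS NULL")
staging.commit()

# 3. Only the cleansed rows move on to the warehouse.
clean = staging.execute("SELECT id, name FROM stg_customers ORDER BY id").fetchall()
print(clean)  # → [(1, 'Alice'), (3, 'Bob')]
```

Because the cleansing happens in staging, neither the source system nor the warehouse sees the intermediate, dirty state of the data.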

One of the objectives of the staging area is to facilitate restartability and minimize the impact the extraction has on your source system.  You extract data from your source system only once and store a copy of it in your staging database.  Should your ETL fail further down the line, you do not need to impact your source system by extracting the data a second time.  If you store the results of each logical step of your transformation in staging tables, you can restart your ETL from the last successful staging step.
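The restart behaviour described above can be sketched as follows. This is a simplified, file-based stand-in for staging tables, with illustrative step names: each logical step writes its result to the staging area and records completion in a checkpoint, so a rerun resumes from the last successful step instead of hitting the source system again.

```python
import json
import os
import tempfile

# File-based staging area standing in for staging tables.
STAGING_DIR = tempfile.mkdtemp()
CHECKPOINT = os.path.join(STAGING_DIR, "checkpoint.json")

def load_checkpoint() -> set:
    # Which steps completed in a previous run?
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def run_step(name, func, done):
    path = os.path.join(STAGING_DIR, f"{name}.json")
    if name in done:
        with open(path) as f:          # step already done: reuse staged result
            return json.load(f)
    result = func()
    with open(path, "w") as f:         # stage this step's result
        json.dump(result, f)
    done.add(name)
    with open(CHECKPOINT, "w") as f:   # record the step as complete
        json.dump(sorted(done), f)
    return result

calls = {"extract": 0}
def extract():
    calls["extract"] += 1              # counts how often the source is hit
    return [1, 2, 3]

# First run: both steps execute and their results are staged.
done = load_checkpoint()
raw = run_step("extract", extract, done)
clean = run_step("transform", lambda: [x * 10 for x in raw], done)

# Simulated rerun after a failure further down the line: the checkpoint
# shows "extract" already succeeded, so the source is not touched again.
done = load_checkpoint()
raw2 = run_step("extract", extract, done)
print(calls["extract"], raw2)  # → 1 [1, 2, 3]
```

In a real warehouse the staged results would live in staging tables rather than files, but the principle is the same: extract once, checkpoint each logical step, and restart from the last successful one.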