Term ETL refers to methods access and manipulate source data and load it into target data. ETL is the abbreviation of three words Extract, Transform and Load.
Purpose of ETL process is to extract data, mostly from different types of systems, transform it into a structure that’s more appropriate for reporting and analysis and then finally load it into the target database, usually a data warehouse.
We will briefly explore all the three steps of the ETL process and then will look at the big picture of where ETL stands in overall data warehouse architecture.
ETL process of Extract from source
This is the first step of ETL process. Here, we extract data from different internal and external sources, structured and/or unstructured. Each of these separate systems may also use a different data organization and/or format. Common data source formats are relational databases, spreadsheet file, and flat files.
We send queries to these source systems using native connections ODBC or OLEDB middleware. Often, these data are moved to a staging area in almost the same format as source systems have. In some cases we want only the data that is new or has been changed, the queries will only return the changes. Some ETL tools can do this automatically, providing a changed data capture (CDC) mechanism.
ETL process of Transform the data
Transform stage of ETL is often the most complicated one. Once we have the data in staging are we run series of rules to to derive the data to load into target database.
During this process we also check on data quality and cleans the data if necessary. Also join and union various tables, filter and sort the data using specific attributes, pivot to another structure and make business calculations.
Some transformation examples include:
- Deriving new sales amount as sales_amount * unit_price
- Encoding with standard values such as mapping "Male" to "M"
ETL process of Load into target
The final lstep of ETL process is to load the data to target table in the data warehouse. Based on the need, this process considers various options.
Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in an historical form at regular intervals—for example, hourly.
A typical ETL architecture in a data warehouse
As described in the diagram above, data is first gathered from various Operational Applications database such as sales database and customer database. This data is then moved to a staging area. Now, a series of functions or rules are applied on this data and then moved to a data warehouse. An extended part of the ETL process would be to load the data from an enterprise data warehouse to various data marts for use consumption.
ETL and beyond
But, today, ETL is much more than that. Most ETL software also covers data profiling, data quality control, monitoring and cleansing, real-time and on-demand data integration in a service oriented architecture (SOA), and master data management. We will not explore these topics in detail here as it is overwhelming for a beginner.