Building a data warehouse

We have a wide range of tools, techniques, and frameworks to build a data warehouse. They will certainly improve the quality of output and speed wth which a warehouse is built. But key factor for successful data warehose design is how well you know the data you are dealing with.

It is important to have the understanding that a data warehouse is not an individual data repository. It is an overall strategy, or process, for building decision support systems and a knowledge-based applications architecture and environment that supports both everyday tactical decision making and long-term business strategizing for an organization. Therefore, a solid understanding of the business is very important step before deisgning a data warehouse.

Key activities in data warehouseing process

Objective to build a data warehouse demands not just the simple collecting and reporting of data but meeting business expectations as well. Therefore the activities involved need both business experts as well as technical experts to come together. Below listed are some key activities done:

  • Identify and select the hardware/software/middleware components to implement it
  • Develop a scaleable architecture to serve as the Warehouse’s technical and application foundation
  • Identify the business information that must be contained in the Warehouse
  • Identifyand prioritize subject areas to be included in the Data Warehouse
  • Manage the scope of each subject area which will be implemented into the Warehouse on an iterative basis
  • Extract, cleanse, aggregate, transform the data to ensure accuracy and consistency
  • Establish a refresh program that is consistent with business needs, timing and cycles
  • Provide user-friendly, powerful tools at the desktop to access the data in the Warehouse
  • Educate the business community about the realm of possibilities that are available to them through Data Warehousing
  • Establish processes for maintaining, enhancing, and ensuring the ongoing success and applicability of the Warehouse

Let us explore high level design process of building a data warehouse.

Data warehouse design model

Data warehouse design process

Above block diagram shows three most important aspect of a data warehouse design. We need to design data warehouse, define metadata repository and deside on data extraction process. For now, let us not go into deeper level of inputs, outputs, check-points and many repetitions of each step within each aspect of the process.

To meet the integrity goals of the data warehouse concept, the right connections must be made between the three aspects of the design process. For example, the design of data edits in the extract process might suggest that initial column designs in the warehouse should be revised because a source data element is too complex (codes two distinct pieces of information); as a result, the original and the new column in the warehouse must be named appropriately and the element definitions in the metadata must be re-written.

Each of the three design processes is based on a slightly different understanding of the word "data." Throughout the rest of these guidelines the discussion tries to separate and differentiate each of the three aspects as much as possible.

Data warehouse design

Data warehouse table design process involves many considerations, important ones are listed here. While designing a enterprise data warehouse(EDW), deeper analysis has to be done on each of these aspects.

  • Understand the ER diagram of source tables
  • Understand source queries
  • Design target tables and the table relationships
  • Set appropriate data types for target fields
  • Follow understandable namings for tables, views and other elements of warehouse database
  • Plan and design for data warehouse security

ETL process

ETL process involves no less complexity compared to designing a data warehouse repository. A detailed ETL process may involve many aspects, but let us focus on some important aspects of it.

  • Understand the source application system
  • Choose data sources
  • Design table normalizations
  • Build source target mappings
  • Implement the ETL code
  • Develop ETL documentation
  • Test and verify the target data
  • Plan and move the code to production environment

Metadata repository design

A metadata describes the structures that contain the actual data and metadata repository is created to store such information. The purpose of the metadata repository is to provide a consistent and reliable means of access to data. Designing a metadata repository involves multiple considerations:

  • Mapping users language to technical entities
  • Mapping business and application understandings
  • Defining context, tables, elements, and values
  • Developing details of policies that govern, technical attributes, specifications that transform, and programs that manipulate each elements
  • Writing solutions to each business/user query
  • Delivering metadata repository to end users