Data Warehousing 101

Business Intelligence has been with us for quite a while now in one form or another, from the days of decision support systems of the 1970s through the coining of the term data warehousing in the early 1990s and the addition of OLAP (Online Analytical Processing) to our lexicon through to the current world of business intelligence.

By and large although technologies have come and gone during that time or undergone many changes, the processes that allow end users access to the data on which decisions are made have remained fairly constant.

In a nutshell, this process involves the extraction of data from operational systems and transforming the data into a format more suited for analysis.

The result is that organizations invest in a traditional BI stack or cherry pick best of breed solutions to suit their needs, and bypass those that they do not need.

The Traditional BI Stack

The term BI Stack refers to the multiple architectural layers inherent in the solution. These are described below at a high level.

ETL
This is the integration layer that enables the extraction of data from a whole raft of applications, databases and flat files. The key players in this space are Informatica, Ab Initio, Business Objects, Microsoft and IBM through their acquisition of Ascential.

Staging
There are not many BI projects in this day and age that do not require some form of staging area separate to the data warehouse. This staging area acts as

  • a landing area for incoming data from multiple source systems.
  • for the application of business rules upon the incoming data
  • cleansing and consolidation of data
  • transforming the data ready to load into the data warehouse.
The staging layer and ETL are frequently closely coupled as often the ETL tool is the mechanism used to carry out the majority of the steps outlined above.

Data Warehouse 
Whether you advocate the normalized approach of Inmon or the dimensional approach of Kimball, this stage is the key piece in the architectural jigsaw of the typical BI project. A data warehouse is a central repository for all, or a significant part of the data that can be collected by an enterprises various business systems. This is an area that all of the major database vendors are now onboard with and they enable their software to be configured specifically for decision support.

The key players are Oracle, Microsoft, Teradata and IBM.

Data Marts/OLAP Cubes 
Typically, there is a layer of abstraction between the end user and the data warehouse, and this is where there is the most diversification in terms of approach. The two main routes that can be taken are either relational or multi-dimensional.

Relational OLAP (ROLAP) uses a relational database to store data in either a star schema or snowflake data marts containing data aggregated to a higher level that in the atomic level data warehouse. The advantage of this is that the data is accessible by SQL meaning that you do not require a business intelligence tool to access the data. ROLAP databases tend to be very scalable and allow for the storage of vast amounts of data within their structures.

Multi-dimensional OLAP (MOLAP) The arrangement of data into cubes avoids a limitation of relational databases which are not best suited for near instantaneous analysis of large amounts of data. OLAP cubes conceptually are extensions to the two-dimensional array of a spreadsheet. For example an organisation might wish to analyse some financial data (measures) by product, by time-period, by city, by type of revenue and cost, and by comparing actual data with a budget. These additional methods of analysing the data are known as dimensions. Recently many OLAP vendors have embraced a standardised query language for cubes called Multidimensional Expressions (MDX).

Presentation Layer
This layer consists of how the user interacts with the data, from applications as simple as an Excel spreadsheet to complex data mining applications. At a high level, these business intelligence tools fall into one of two camps. There are query, reporting and analysis tools and advanced analytics.

Query, Reporting & Analysis tools
These tools include ad-hoc query and multidimensional analysis tools as well as dashboards. These tools offer rows and column reporting, the ability to show data in a chart , and the ability to be able to slice and dice data to easily change the layout of a report and drill through dimension hierarchies in order to show summary or detail figures.

Advanced Analytics
These tools include data mining and statistical analysis applications which use technologies such as neural networks, clustering and Bayesian Classifiers among others to establish relationships in the data that may be too complex to be extracted using query/analysis tools. This enables predictions to be made based on past behaviour.

Often solutions comprise of software sourced from multiple vendors to cover each layer of the stack and it is the job of IT and/or outside help to make them all fit together. While this gives you the ability to take best of breed solutions specifically designed from the ground up for that purpose, this can prove to be a difficult task, owing to different technologies and platforms and it is no surprise that recent changes has seen some vendors moving to a deep vertical often integrated product offering across the whole BI stack.

Among these it is the more traditional technology stack players such as Microsoft and Oracle who are moving more and more into this space offering a one-stop-shop for all things BI related. Oracle has Oracle 10g, Warehouse Builder at the back-end and following their purchase of Hyperion, they have a strong offering at the front-end too. Microsoft has the current release of SQL Server (2008) and its suite of integrated BI applications (Integration Services, Analysis Services and Reporting Services).