Introduction
Within the market space in which Montage operates there is a predominance of two particular applications, for performing ETL (Extraction , Transformation & Loading) , in use on data warehouse projects in New Zealand. The first is SSIS (Integration Services) from Microsoft and the second is a product called Red from an Auckland based company - Wherescape.This article attempts to give a brief summary of the main points of difference between these two and our thoughts around each. It should be noted that this is not an in-depth analysis but aims to provide a flavour of each tools strength and weaknesses as we have found through use in real data warehouse projects.
Microsoft had first tried their hand at ETL in SQL Server 2000 with the Data Transformation Services (DTS) application which was very limited in its functionality compared to other products on the market at that time. SSIS is part of the Business Intelligence suite of applications that are bundled with SQL Server 2005 (the others are Analysis Services and Reporting Services.)
In contrast, Wherescape position Red as a data warehouse development tool and not just an ETL product. The main aim of using this software is to be able to build data warehouses very quickly through prototyping methodologies and take those prototypes through to a production environment and then be able to react quickly to change.
Look & Feel
SSIS: The user interface used to develop the jobs is accessible through Visual Studio and is very much the traditional ETL tool which allows developers to build solutions through a visual representation of the workflow. New transformations can be dropped into the main diagram from a pick list and then attached to preceding and succeeding objects in the pipeline.RED: At first glance, the front end is not intuitive as it is not a graphical interface in the traditional sense. But rather the display is separated into three main panels. Down the left hand side is a list of objects grouped by project and object type. The middle pane is the one mainly used during development which allows the developer to manipulate fields on target tables and alter the transformations on a field by field basis.
Red does allow the user to view the workflow graphically once it has been built, this diagrammatic view allows the user to view the Star Schema, draw source tracking / track forward views, and to show linked tables and dependencies. Although these views do enhance the documentation, there is no facility to use them to perform automated impact analysis to view which objects would be affected by a change upstream which would be a very useful facility. This functionality is provided in an automated report available from the main screen.
Technology
SSIS: SSIS transformations are built into the application and are independent of whichever the underlying database is. All the transformation workflows are performed in memory. Which means that a pipeline which reads data from a table in a database, performs a number of transformations before inserting that data into a target table in the data warehouse can all be performed in memory. This reduces the I/O impact on the database but also requires a suitable level of memory to be available on the server.RED: Conversely, Red is totally dependent on the underlying database. Effectively, for each table that it creates, Red will also create an underlying procedure (eg PL/SQL in Oracle, Stored Procedures in SQL Server) in the database which reads the source table and performs the necessary field level transformations and writes the data to the target table. All of the stages where data is transformed will write the data to a table, this can build up multiple levels of staging tables which can be complicated to maintain, as when changes are made to upstream tables, they need to be propagated through subsequent tables.
Data Sources
SSIS: Integration Services connects to a wide range of data sources including: Relational databases through OLE DB or .NET connectivity, extracts straight from Excel spreadsheets, flat files and XML.RED: Red also connects to a large number of sources: Relational databases through native database drivers or via ODBC connectivity, flat files, Excel spreadsheets and additionally Microsoft Analysis Services cubes
Transformations
SSIS: In addition to the standard types of transformations (joins, merge / union operations, multicast and conditional splits) SSIS offers a number of additional powerful transformations as well as the ability to build custom objects to include in projects, plus the script component which allows the developer to write custom code in a VB or .NET script. Some of the advanced transformations are the ability to pivot/unpivot data - convert data from row based to column based or vice versa, and the Slowly Changing Dimension transform which provides the ability to maintain dimension tables (and can also be used to maintain fact tables)RED: As stated earlier, this is a purpose built data warehouse development tool. The software knows how to deal with dimensions and facts and can automatically create the code necessary to maintain both types of table with both initial and incremental loads. Transformation are able to be called as functions on a field, which provides all of the functions offered by the underlying database as well as those that can be custom created and stored in the database. It is possible to do most of the operations provided by SSIS, but this usually requires some creative development and multiple steps. For instance, as pivot is not a standard function in SQL Server 2000, it is necessary to create a job step for each column to populate into a separate table and then union them together at the end.
Summary
SSIS: Initially the user interface is difficult to use it as it was not at all intuitive in comparison to other high end ETL tools (Informatica & Data Stage) our staff have used on other projects; it appears that things were done in the "Microsoft Way" for the sake of doing them that way. With more exposure to the application this becomes less of an issue.However there are a number of issues with SSIS which can cause pain, namely that it is very fussy about data types when mapping between objects and they have to be in the exact same format between them. You would expect to be able to at perform a conversion at the time of mapping; however it is necessary to add in additional steps to perform these conversions as part of the data flow.
Another other issue is in regard to the Slowly Changing Dimension transform which while being a very powerful component has on numerous occasions managed to lose all of the mappings if a change is made to a single upstream object which might have an impact downstream.
An additional irritation is that the slowly changing dimension transform requires a click to add each field followed by click / drop down to set type 1 or type 2 change tracking, fair enough its not a big deal if your table has only a handful of fields, but not when you need to apply this to over 250 fields!
RED: As I stated earlier, on
first impression this application was not
intuitive. In our experience it is after using
it to go through an initial implementation that
the benefits of its approach become apparent.
The application is very powerful and is very
simple to use to build projects in a logical
manner. It is possible to build solutions very
quickly. Some of our staff have found that it
has allowed them to be more productive than with
any similar type tool that they have used in the
past.
The caveat with this is that you need to follow the "Red Methodology" which mirrors the Kimball approach to design and build of a data warehouse. Which may not be a problem for experienced data warehouse consultants, but others may well need some assistance. One of the main "gripes" with the application is around the scheduler component which is confusing when trying to understand the linkages between objects. Particularly knowing which ones are related and which are parent and child jobs.
- RH
