PDW 2012 | SQL Server 2005-2014
With database servers getting more and more powerful, some traditional concepts of Business Intelligence solution may be reconsidered. One of those concepts for which you can find a lot of lively debates recently is the question of whether to use ETL or ELT.
Here are just a few of the blog posts and discussions you can find on this topic:
- http://www.jamesserra.com/archive/2012/01/difference-between-etl-and-elt/
- http://it.toolbox.com/blogs/infosphere/so-what-is-better-etl-or-elt-13572
- http://www.iri.com/blog/data-transformation2/etl-vs-elt-we-posit-you-judge/
- many more …
In short, the main differences are shown in the table below:
ETL | ELT | |
Characteristics |
|
|
Pros |
|
|
Cons |
|
|
This comparison is by far not complete and if you read the links above (and many others that target this topic) you can find a lot more pros/cons and opinions. In fact, I don’t want to say one is better than the other. But here is what we recently found to work well in a large project using the Parallel Data Warehouse (PDW) for an initially 30TB (and growing) database. The following illustration which I recently used on SQL Conference 2014 in Germany shows the key concepts:
We’re using ETL (Microsoft SQL Server Integration Services) to
- Orchestrate the load process
- workflow management (make sure the right things happen in the right order)
- dealing with technical challenges (e.g. temporary tables, partition switching on PDW)
- implement configuration management (for example server and database names)
- logging and process monitoring (reports)
- Load dimensions (small amount of data)
- Collecting master data from source systems, merge and prepare this data
- Generation of surrogate keys for the dimensions
- Keeping track of historical changes (modeled as SCD2 or intermediate bridge tables)
- Building up all dimensions and transferring the data to the PDW (using a reload operation)
- Early arriving facts (create missing dimension rows, distinct counts run on PDW)
Why?
- Integration Services (SSIS) well suited for these tasks
- SMP SQL Server offers good support for dimension processing tasks (identity column, T-SQL merge statement etc.)
- Additional services like SQL Server Data Quality Service (DQS) and SQL Server Master Data Services (MDS) are currently not supported to run on the PDW
This is also true for more sophisticated tasks and the use of web services for example to find duplicate customers,to correct misspelled street names, to guess the gender from the first name. Also if you need to use custom assemblies, for example to access special source systems or include specific calculations, ETL tools are the better choice.
Then, we’re using ELT (distributed SQL, DSQL) on the Parallel Data Warehouse to
- process fact data (large amount of data) after it is bulk loaded with no modifications into a staging database on the PDW
- Data preparation (for example removing duplicate rows)
- Linking fact table data to dimensions
- Performing calculations (using SQL window functions intensively)
- Merge new data to archive
- store the data in the persisted stage area (without creating duplicates if the data was there already)
Why?
- Much better performance observed compared to SMP SQL Server/SSIS
- in our case, usually about 10-20 times faster, depending on the source data and the transformations
- In some cases (for example removing duplicate rows in the source data) even 100 times faster
- Faster loads allow us to fully reload many TB in case this is needed (this gives more options for backup strategies and for the dimensional modeling)
- Solution will directly benefit from future MPP scale up without any need of coding
CONCLUSION
ETL and ELT may work well together. In this scenario we did the dimension processing as well as the full workflow management using ETL tools (SSIS on an SMP SQL Server) and the processing of the large transactional tables using ELT (distributed SQL on PDW).
Kommentare (0)