PDW: Cube processing experience
Currently we are doing a migration of a data warehouse to PDW at one of our customers.
Here are some important features of the BI-environment to give you the understanding of the project:
- Data in DWH (MS SQL Server) + AS Cubes
- about 2 TB in all cubes
- daily updates of last months
- complex cube-loading queries
- short processing window
The last point actually means “every 10 minutes count”, so we did a lot of optimization on every stage of the daily data loading chain. The substantial part of the process are the SQL queries loading the cube facts. There we have to join up to 6 fact tables for the single measure group.
We had a powerful DWH machine (SMP ) that did its work great:
- MS SQL Server 2008 R2 Enterprise
- AMD Opteron 8431 (24 CPU, 2.4Ghz)
- 256GB RAM
- FusionIO for TempDB (critically important for our joins of fact tables!)
We migrate to the MS PDW v1 with ten compute nodes (each 12 cores Xeon X5670 2.93Ghz). We were able to distribute our fact tables in a way to have joins without sending any data between compute nodes.
The PDW v1 have some limitations compared with SQL Server Enterprise, so we had to:
- persist out computed columns
- get rid of table-valued functions in cube partition definitions
Both were actually not a problem at all. The table-valued functions for partition loading are actually a nice feature in a real-live production environment, but our cube management architecture makes it easy to handle also with complete SQL queries in partition definition.
In order to compare the performance we selected the heaviest measure group where for every partition six tables are joined (about 20-30Mio records are selected from each table).
Our old SMP machine could effectively deliver data simultaneously for up to 4 cube partitions. The first records reached the cube after about one minute from the start of queries. With PDW the data starts coming after same one minute, but for 32 partitions simultaneously! So we have potentially the order of 8 as a performance boost!
Unfortunately the network (1Gbit) becomes immediately the bottleneck for the cube processing, so we plan to build an Infiniband connection between PDW and AS server.
At another testing stand we had Analysis Services direct at the Landing Zone of the PDW v1, thus having Infiniband between PDW and AS. The result was the full speed loading with 32 threads!
So after first hands-on experience with the migration of DWH model, which was surprisingly easy, and loading cubes from PDW we are very optimistic seeing MS PDW as a cornerstone of the BI infrastructure. The PDW v2 brings even more possibilities. Especially interesting can be having in one PDW-appliance a relational DWH model and a Stage Area in Hadoop .