Parallel Data Warehouse (PDW) and ROLAP
PDW 2012 | SQL Server 2012 | SQL Server 2014
This post is about using the Parallel Data Warehouse as a ROLAP source for SSAS. For PDW v1 this wasn’t recommended but the quintessence of this post is, that this really works well with PDW 2012. In fact, this is the first time I saw MOLAP performance on large ROLAP tables (over a billion rows) and again, another big plus for the PDW with the column store index. In fact, I’m really excited about this (and I’ll tell you why in a minute), but maybe I wasn’t loud enough. So here again:
“Using SSAS ROLAP with PDW 2012 is working really well!!!”
But, and I have to lower my voice again, I have to agree with Chris Webb that there is almost no information about it out there. So enough reason to write about this truly amazing story.
Before I’m going into some relevant topics, let me briefly recap the benefits of ROLAP against MOLAP:
|LOW LATENCY||No need to process MOLAP partitions: low latency (data in the relational data warehouse tables are immediately available to the end users)|
|NO/LESS STORAGE REQUIRED||The ROLAP cube only contains the model, not the data. Therefore almost no disk space is required for storing the cube. It’s just the presentation of the model. The MOLAP/ROLAP is a technical implementation issue which is not visible to the end user. For both options, the end user gets an easy to use, highly interactive quick responding data model, which can be used from many tools including Excel pivot tables, Reporting Services, Tableau and other advanced analytical frontend tools.|
|LOWER PROJECT COSTS||No need to design and maintain partitions in the cube (see remarks regarding partitioning below): less development and maintenance afford (for example for daily delta updates)|
|MORE FLEXIBLE||In MOLAP, many changes to a dimension require a full processing of the dimensions which results in all the attached measure group partitions switching to the ‘unprocessed’ state and need to be processed again.
If you have a large cube, this process could take many hours. In ROLAP, all this is not necessary. Changes to cube dimensions are online immediately.
|EASY DEPLOYMENT||Development, testing and deployment to production is much easier since the data is immediately available to end users|
|SUPPORTS LARGE DIMENSIONS||Large dimensions (with many million rows) are difficult to handle for MOLAP SSAS. Processing takes a long time and query performance may go down. But ROLAP works well with large dimensions.|
|SUPPORTS VERY LARGE FACT TABLES||MOLAP cube sizes of 4 or 5 TB are possible and due to the compression in the cube storage, this corresponds to fact table sizes of 50 TB and more. However, if you go beyond, there is a point where only ROLAP cubes can solve the amount of data.|
So there are many advantages when using ROLAP partitions in SSAS. However, there always was a big disadvantage:
|BAD PERFORMANCE||Poor query performance for ROLAP partitions compared to MOLAP partitions.|
Now, with the memory optimized column store index, especially with the parallel query engine of the PDW, you can get an incredible good query performance from ROLAP partitions. Therefore, we have to cross out this disadvantage:
|Poor query performance for ROLAP partitions compared to MOLAP partitions.
With column store index, ROLAP partitions are really fast
And since column store index is also available on SQL Server 2012 (non-clustered, read-only) and 2014 (clustered, updatable) this should also apply to the SMP SQL Server (I haven’t tested it out with huge amounts of data though).
Here are some remarks/recommendations if you’re planning ROLAP on PDW:
Clustered columnstore index
As mentioned above, the clustered column store index of the PDW is the key to using ROLAP on PDW 2012 and maybe the most important reason why ROLAP is now a reliable option on PDW at all. So make sure, your (fact-) tables are stored in clustered column store mode.
Fast network connection between PDW and Analysis Services
Obviously, a fast network connection between the SSAS server and the PDW is important to get a good performance. Of course this is also true for MOLAP or mixed environments. As of today, I would recommend to add the SSAS server to the appliance’s infiniband network.
Table layout: distributed/replicated
Most of the ROLAP queries will basically look like
- select Dim1.Property1, Dim2.Property2, Sum(Fact1.Amount) SumOfAmount
- from Fact1
- inner join Dim1 on Fact1.Dim1Key=Dim1.Dim1Key
- inner join Dim2 on Fact1.Dim2Key=Dim2.Dim2Key
- group by Dim1.Property1, Dim2.Property2
In order to have queries like this respond well, the tables should be distribution-compatible. In many cases you can achieve this by turning the dimension tables into replicated tables. I have more detailed explanation on distribution and aggregation compatibility in some older posts and there is also a good post by Stephan Köppen about this topic here . An incompatible distribution when joining two large fact tables (for example a fact table with a many-to-many bridge table) results in shuffle move or even broadcast move operations, that are also fast, but not lightning fast as you would expect for online analytical applications. So my recommendation is to carefully choose the distribution keys so that the required joins can be resolved locally. Aggregation compatibility is more difficult to achieve for all types of queries. However, from my current experience PDW responded very fast even if the query was not aggregation compatible.
Having a large amount of data in MOLAP or ROLAP cubes usually requires partitioning of the MOLAP/ROLAP measure groups. For MOLAP, recommendations vary from about 20-50 million rows per partition. Therefore storing a billion rows results in at least 20-50 partitions. In practical scenarios you often end up with many more partitions in order to implement daily incremental loading. But because PDW is designed to run large queries it’s much better to use only one partition, instead of firing a small-shot charge of queries to the appliance. Internally the PDW uses distributions in order to run the query using all cores in parallel, so there is no need to create partitions for performance reasons.
By the way, since many small queries require more computing power then a few complex queries you should be careful with Excel’s pivot option “convert to formula”…
As described in this blog post by Anthony Mattas (and in many other posts), you should set EnableRolapDistinctCountOnDataSource in the Analysis Services properties in order to compute the distinct count calculation on the database instead of fetching the distinct rows to Analysis Services.
Please note that this property is not yet available in the server properties but must be set manually in the msmdsrv.ini file (which can be found below the instance in the OLAP\Config sub directory).
Having all your table statistics up to date
This generally is a very important thing when working with the PDW, not only when using ROLAP. While the compute nodes have auto create/auto update enabled, statistics are not (apart from very few cases) automatically created/updated on the control node. Without proper statistics, PDW cannot create an optimal distributed query plan. The simple thing is, that for most cases, where performance is an issue with PDW, incorrect statistics are the root cause.
Dealing with large dimensions
In some cases, having a measure group with many attached dimensions can cause problems if those dimensions are referenced in the query (on rows, columns, filter). I’m currently trying to narrow this down, but one possible reason could be the missing primary key constraints on the PDW together with large dimensions. Consider this simple query:
- select P.Color, Sum(S.SalesAmount) SumOfSalesAmount
- from [dbo].[FactInternetSales] S
- inner join [dbo].[DimProduct] P on S.ProductKey=P.ProductKey
- group by P.Color
If you have a primary key on the dbo.DimProduct.ProductKey, the optimizer knows that the inner join cannot produce more rows than exist in the fact table because for each row from the fact table we can only find at most one row in the dimension table. Without the primary key (which is the situation in the PDW) the optimizer has to consider density information from the statistics. This will work pretty well, but let’s say that for a larger dimension the statistics gives something like: “for each row from the fact table, you might be getting 1.3 rows from the dimension table”. Again, nothing much happened here. But assuming you have many dimensions, the effect may grow exponentially. With 8 dimensions and 30% over guess you would end up at 1.38 = 8.16. So instead of querying for example a billion rows, the optimizer thinks that we’re about to query 8 billion rows. This could have a huge effect on the query plan. If you encounter such issues, one option could be to convert the dimensions in the data source view to query binding. For example, the query for the product dimension may look like this:
- Min(Color) Color,
- Min(EnglishProductName) EnglishProductName,
- Min(ListPrice) ListPrice
- group by ProductKey
Since ProductKey is actually a logical primary key, rewriting the dimension query this way gives the same result as
- select ProductKey, Color, EnglishProductName, ListPrice from DimProduct
but because of the group by operation, the optimizer now know for sure, that the ProductKey is unique thus giving a better estimate of the resulting rows.
Again, I’m still investigating these cases and the benefit of the query rewrite, but if you encounter performance issues, this may be one option to try.
Aggregation design and proactive caching
Since ROLAP partitions rely on indexed views in the data source, you cannot use ROLAP aggregations on the PDW 2012. However, from the query response we got so far, there might not be much need for aggregations at all if your data is stored as a clustered columnstore index. If you need aggregations, you could try HOLAP aggregations. We haven’t tried this so far, but I’m planning to do more investigation.
Proactive caching is currently only supported in polling mode (not with the trace mechanism).
Since we’re still in the process of adjusting our PDW ROLAP environment I’m going to write more posts with tips and tricks, so stay tuned. Actually, we’re investigating these topics:
- How does ROLAP perform with role based security in the SSAS cube?
- How does ROLAP perform with many users?
- How does ROLAP work with more complicated MDX calculations involving PARALLELPERIOD, aggregates (AGGREGATE, SUM, MIN, MAX etc.) over dynamic sets etc.? Which MDX calculations are performing better / for which calculations shall we still use MOLAP?
Also some interesting recommendations (for example regarding the “count of rows” aggregation or the proper setup of your Analysis Services server) can be found in the SQLCAT Whitepaper Analysis Services ROLAP for SQL Server Data Warehouses .