BLOG

Some more thoughts on quantiles…

19.07.2008 Hilmar Buchta

SQL Server 2005

In a previous post some weeks ago I wrote about calculating a quantile in MDX. Of course you could also think of using a SSAS stored procedure for the calculation as stored procedures offer the full range of .Net programmability and libraries for doing the calculations. In my last post about quantiles I simplified the calculation a lot by assuming that our computation results in a single element. This made the calculation easy to be written in MDX. But if you need the quantile at a higher precision you could approximate the values between two discrete elements.

First let’s take a brief look at the simplification we did when calculating the quantile in MDX: The calculation of the quantile is simple in the case that (n-1) * p% gives an integer (n=number of cases, p%=percent of the quantile). In our example we had 10001 scenarios (n=10001) and computed a 5% quantile (p%=5%) giving (n-1)*p% = 500. So we simply have to take the 501th element from the sorted set of cases.

If y:=1+(n-1)*p% is not an integer, we have to use the following calculation:

Q = ( S(b) – S(a) ) * ( y – a )

where a is the integer part of (n-1)*p%+1, b=a+1 and S(a), S(b) the scenario values at index a or b. There are other special cases for the computation but I don’t want to get into too much detail here.

The following code just shows the simple calculation (sort the elements and take the Nth element) but it can easily be extended to a higher sophisticated calculation:

  1. using System;      
  2. using System.Collections.Generic;      
  3. using Microsoft.AnalysisServices.AdomdServer;      
  4. using Microsoft.AnalysisServices;
  5. namespace ASStatistics      
  6. {      
  7.     public class Statistic      
  8.     {      
  9.         public static double Quantile(Set inputSet, Expression inputExpression, double quantile)      
  10.         {
  11.             if (inputSet.Tuples.Count == 0) return 0;
  12.             System.Diagnostics.Debug.Assert(quantile >= 0 & quantile < 1, „Error in Quantile calculation“);                        
  13.             double[] q=new double[inputSet.Tuples.Count];      
  14.             int i=0;
  15.             foreach (Tuple tuple in inputSet)      
  16.             {      
  17.                 q[i] = inputExpression.Calculate(tuple).ToDouble();      
  18.                 i++;      
  19.             }
  20.             Array.Sort(q);
  21.             int position = (int)(quantile * i);      
  22.             return q[position];      
  23.         }
  24.     }      
  25. }

After installing this library with the stored procedure to our SSAS server (e.g. using SSMS), we can test it using MDX:

  1. with      
  2. member [Quantile_0] as ASStatistics.Quantile([Date].[Date].[Date],[Measures].[Gross Profit],0)      
  3. member [Quantile_5] as ASStatistics.Quantile([Date].[Date].[Date],[Measures].[Gross Profit],0.05)      
  4. member [Quantile_20] as ASStatistics.Quantile([Date].[Date].[Date],[Measures].[Gross Profit],0.20)
  5. select {[Quantile_0],[Quantile_5],[Quantile_20]} on 0      
  6. from [Adventure Works]

image

So, the worst profit at one day was -599576$ but only in 5% of the days the profit was below 986$ and only in 20% of the days the profit was below 4576$.

Now, the code above should be easy to adapt to all the cases for the quantile computation. If I find some time to do so, I’ll post it later on.

So it’s really easy to do the calculation in a stored procedure. But how about performance? I used the following queries to measure the performance (both formatted by the current release of Mosha’s MDX Studio ). First the query for the MDX calculation:

  1. WITH  
  2.   MEMBER [QuantileMDX] AS      
  3.     max(      
  4.       bottomcount(      
  5.         [Date].[Date].[Date],      
  6.         (int(((Count([Date].[Date].[Date]) 1) * 0.05)) + 1),      
  7.         [Measures].[Gross Profit]),      
  8.       [Measures].[Gross Profit])      
  9. SELECT      
  10.   [QuantileMDX] ON 0      
  11. FROM [Adventure Works]

And here’s the query for the stored procedure approach:

  1. WITH      
  2.   MEMBER [QuantileStoredProcedure] AS      
  3.     ASStatistics.Quantile(      
  4.     [Date].[Date].[Date],      
  5.     [Measures].[Gross Profit], 0.05)      
  6. SELECT      
  7.   [QuantileStoredProcedure] ON 0      
  8. FROM [Adventure Works]

Before measuring performance it’s important to take the OLAP cache into account. The cache can be cleared using a simple XMLA script like this (see here for more details ):

  1. <Batch xmlns=“http://schemas.microsoft.com/analysisservices/2003/engine“>
  2.   <ClearCache xmlns=“http://schemas.microsoft.com/analysisservices/2003/engine“>      
  3.     <Object>      
  4.       <DatabaseID>Adventure Works DW</DatabaseID>      
  5.     </Object>      
  6.   </ClearCache>      
  7. </Batch>

So here are the result of the above queries (tested on my notebook):

  Cold Cache Warm Cache
MDX 188ms 36ms
Stored Procedure 217ms 63ms

The values differed a little so I took the average over 10 executions (removing the best and worse case before). Of course the comparison is not absolutely accurate because the calculation method, the implementation of sort in both languages etc. also have an influence on the result. Furthermore, both functions could by optimized (maybe you want to use the MDX tail function instead of max or try another approach).

I still think the result is significant. While there is not a big difference with the cold cache, the difference becomes much bigger with the warm cache. The MDX variant takes more profit from the cache than the stored procedure approach.

While experimenting with the stored procedure I also added another variant that only takes a set and the quantile as parameters:

  1. public static double Quantile(Set inputSet, double quantile)

The only difference to the version above is that I evaluate the set results using

  1. MDXValue.FromTuple(tuple).ToDouble();

instead of

  1. inputExpression.Calculate(tuple).ToDouble();

This way, I can call the function in both ways (with two or three parameters):

  1. ASStatistics.Quantile(([Date].[Date].[Date],[Measures].[Gross Profit]), 0.05)

or

  1. ASStatistics.Quantile([Date].[Date].[Date],[Measures].[Gross Profit], 0.05)

It’s quite interesting (and I didn’t expected this) that the variant with the two parameters is much slower. Even with a warm cache it took about 1500ms to compute the above result (more than 9 sec on a cold cache).

So, although passing sets to stored procedure is no longer a bottleneck for performance, you should still be aware of performance issues.

Your email address will not be published. Required fields are marked *

Join #teamoraylispeople

Gestalte mit uns
die Welt der Daten