BLOG

Do you trust your Data Mining results? – Part 2/3

10.10.2010 Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

While the first part of this post was more about the idea and interpreting the results of the test, this part shows how to implement the Monte Carlo test.

First, we need a table with the predicted data mining probabilities. This is the output of the PredictProbability function from your mining result query. I’m using the same source data as in my previous post here. If you like you can easily create your own table and populate it with random probability values in order to test the code for the simulation below:

  1. CREATE TABLE [dbo].[Mining_Result](
  2.     [CaseKey] [int] NOT NULL,
  3.     [PredictScore] [float] NULL
  4. ) ON [PRIMARY]
  5. declare @i int=0
  6.     while (@i<10000) begin
  7.         insert into Mining_Result(CaseKey, PredictScore)
  8.         values(@i, convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5)
  9.         set @i=@i+1
  10. end

Don’t be confused by the convert(…cast…cast newid()…) expression. This is just my approach to calculate a random number within an SQL select statement.

Next we need a table for storing our Mining results:

  1. CREATE TABLE [dbo].[Mining_Histogram](
  2.     [NumCases] [int] NOT NULL,
  3.     [Count] [int] NULL,
  4.     [Perc] [float] NULL,
  5.     [RunningPerc] [float] NULL,
  6. CONSTRAINT [PK_DM_Histogram] PRIMARY KEY CLUSTERED
  7. (
  8.     [NumCases] ASC
  9. )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
  10. ) ON [PRIMARY]

Then this is how we’re doing our Monte Carlo test:

— perform a monte carlo test:

  1. while (@lp<@numtrials) begin
  2.     select @cnt=COUNT(*) from Mining_Result where PredictScore >
  3.         convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5
  4.     if exists(select NumCases from Mining_Histogram Where NumCases=@cnt)
  5.             update Mining_Histogram set [Count]=[Count]+1 Where NumCases=@cnt
  6.         else
  7.             insert into Mining_Histogram(NumCases,[Count]) values (@cnt, 1)
  8.     set @lp=@lp+1;
  9. end

I’m using the same trick for the random numbers as shown above. In this example, we’re doing 10,000 iterations. For each iterations we compute the number of cases for which the Predicted Score is higher than a random number. For example, if for a certain case the predict score is 0.8 it is more likely that a random number between 0.0 and 1.0 is below the score than for a prediction score of 0.1.

Next, we’re filling the gaps in our histogram table with zeros to make the histogram look nicer:

  1. declare @min int;
  2. declare @max int;
  3. select @min=MIN(NumCases), @max=MAX(NumCases) from Mining_Histogram
  4. set @lp=@min;
  5. while (@lp<@max) begin
  6.     if not exists(select NumCases From Mining_Histogram Where NumCases=@lp)
  7.         insert into Mining_Histogram(NumCases,[Count]) values (@lp, 0);
  8.     set @lp=@lp+1
  9. end

Finally we’re computing the row probability and the running total using this T-SQL:

  1. declare @maxcount float;
  2. select @maxcount=SUM([Count]) from Mining_Histogram;
  3. update Mining_Histogram Set Perc=[Count]/@maxcount;
  4. declare @CaseIdx int
  5. declare @perc float
  6. declare @RunningTotal float =0
  7. DECLARE rt_cursor CURSOR FOR select NumCases, Perc From Mining_Histogram
  8. OPEN rt_cursor
  9. FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc
  10. WHILE @@FETCH_STATUS = 0
  11. BEGIN
  12.   SET @RunningTotal = @RunningTotal + @perc
  13.   update Mining_Histogram set RunningPerc=@RunningTotal Where NumCases=@CaseIdx
  14.   FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc
  15. END
  16. CLOSE rt_cursor
  17. DEALLOCATE rt_cursor

After running the simulation this is how the plots of the result look like (using my own values). The first plot shows the value of the field NumCases on the x-axis and the value of the field perc on the y-axis. The second plot has the same x-axis but shows the RunningPerc field on the y-axis:

image

image

These two plots look very much the same as the plots from my last post (although I used C# code there to generate the histogram data).

If you used the randomly generated scores from above for testing, you will notice the peak being around 5000 cases (instead of 2800 cases in my example).

And if you like a smoother version of the density function  (as all the teeth and bumps mainly result from Monte Carlo approach), you could use this SQL query to compute a moving average:

  1. declare @minrange int=0
  2. declare @windowsize int = 50
  3. select @minrange=Min(NumCases) from Mining_Histogram
  4. SELECT     H.NumCases, AVG(H1.[Count]) [Count], AVG(H1.Perc) Perc
  5. FROM         Mining_Histogram H
  6. left join Mining_Histogram H1 on H1.NumCases between H.NumCases@windowsize and H.NumCases
  7. where H.NumCases>@minrange+@windowsize
  8. group by H.NumCases

image

In order to do the histogram computation automatically with prediction query I recommend putting the code in an SSIS script component. I would also use another type of random number generator. This also allows you to set the seed for the random number generator. For my implementation I used an asynchronous script component that first loads all cases into memory (ArrayList collection), then performs the Monte Carlo test on the in-memory data and then writes the results back to the output buffer. This allows you do work with more scenarios and to log the progress during the loading and testing phase of the component.

I’m planning to write a Books Online Community Technical article on this topic. This article will be more detailed regarding the implementation. I will post a link to this article in my blog then.

Your email address will not be published. Required fields are marked *

Join #teamoraylispeople

Gestalte mit uns
die Welt der Daten