07.04.2014

Discover missing rows of data

Technical Value

PDW 2012 | SQL Server 2012 | SQL Server 2014

If your source data contains a subsequent number without gaps it’s relatively easy to find out if data rows are missing. The approach I’m showing here uses window functions that are available since SQL Server 2012 and SQL Server Parallel Data Warehouse 2012.

In order to have some sample data for this post, I’m using the FactInternetSales table of the AdventureWorksDW2012 database. Let’s pretend the column SalesOrderNumber of that table should not have any gaps. I convert the column data to a numeric type and use only the rows having line item sequence number equal to 1 for my sample data.

  1. SELECT      
  2. SalesOrderNumber,      
  3. convert(int, substring(SalesOrderNumber,3,255)) SalesOrderIntNumber      
  4. FROM [FactInternetSales]      
  5. WHERE [SalesOrderLineNumber]=1      
  6. ORDER BY SalesOrderNumber

image

Usually the order number is sequentially but we find some gaps here. For example, the order following order number SO43842 is SO43918, so there are 43918 – 43842 – 1 = 75 rows missing.

Using window functions and a sub query, we can add the next number as a column to the query together with the distance:

  1. select *, NextSalesOrderIntNumber-SalesOrderIntNumber-1 MissingRows      
  2. from      
  3. (      
  4. select      
  5.   SalesOrderIntNumber,      
  6.   lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber      
  7. from      
  8. (SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))      
  9.   SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1      
  10. ) TransactionData      
  11. ) TransactionDataSequence

image

As you can see, the 75 missing rows are now being reported correctly by the query. The only task left to do now is to aggregate the amount of missing rows by replacing the outer query like this:

  1. select Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1) MissingRows    
  2. from    
  3. (    
  4. select    
  5.   SalesOrderIntNumber,    
  6.   lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber    
  7. from    
  8. (SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))    
  9.   SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1    
  10. ) TransactionData    
  11. ) TransactionDataSequence

image

As a quality measure you could show the ratio of the missing rows to the total rows (or 100% minus this ratio as a data completeness measure) and – assuming that the missing rows had an average sales amount – also the estimated missing amount. And it’s also useful to get the result on more granular level, for example per month. Here is the full query:

  1. select      
  2.  
  3. orderdatekey/100 [Month],      
  4.  
  5. Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1) MissingRows,      
  6.  
  7. convert(float,Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1))/count(*)      
  8. MissingRowsRatio,      
  9.  
  10. convert(float,Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1))/count(*) 
  11. * Sum([ExtendedAmount]) MissingRowsEstimatedValue      
  12.  
  13. from      
  14. (      
  15. select      
  16.   SalesOrderIntNumber,      
  17.   lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber,      
  18.   [ExtendedAmount], OrderDateKey      
  19. from      
  20. (SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))      
  21. SalesOrderIntNumber, [ExtendedAmount], OrderDateKey      
  22. FROM [FactInternetSales] where [SalesOrderLineNumber]=1      
  23. ) TransactionData      
  24. ) TransactionDataSequence
  25.  
  26. group by orderdatekey/100      
  27. order by orderdatekey/100

image

Plotting the result over the time gives a good overview. For my example data, quality improved a lot since August 2007.

image

Conclusion: This is another example how window functions provide an elegant solution for solving analytical data tasks. And since this works perfectly on a PDW, the approach works well even with billions of rows of data.

Teilen auf

Newsletter Anmeldung

Abonnieren Sie unseren Newsletter!
Lassen Sie sich regelmäßig über alle Neuigkeiten rundum ORAYLIS und die BI- & Big-Data-Branche informieren.

Jetzt anmelden