BLOG

Discover missing rows of data

07.04.2014 Hilmar Buchta

PDW 201 2 | SQL Server 2012 | SQL Server 2014

If your source data contains a subsequent number without gaps it’s relatively easy to find out if data rows are missing. The approach I’m showing here uses window functions that are available since SQL Server 2012 and SQL Server Parallel Data Warehouse 2012.

In order to have some sample data for this post, I’m using the FactInternetSales table of the AdventureWorksDW2012 database. Let’s pretend the column SalesOrderNumber of that table should not have any gaps. I convert the column data to a numeric type and use only the rows having line item sequence number equal to 1 for my sample data.

  1. SELECT
  2. SalesOrderNumber,
  3. convert(int, substring(SalesOrderNumber,3,255)) SalesOrderIntNumber
  4. FROM [FactInternetSales]
  5. WHERE [SalesOrderLineNumber]=1
  6. ORDER BY SalesOrderNumber

image

Usually the order number is sequentially but we find some gaps here. For example, the order following order number SO43842 is SO43918, so there are 43918 – 43842 – 1 = 75 rows missing.

Using window functions and a sub query, we can add the next number as a column to the query together with the distance:

  1. select *, NextSalesOrderIntNumberSalesOrderIntNumber1 MissingRows
  2. from
  3. (
  4. select
  5.   SalesOrderIntNumber,
  6.   lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber
  7. from
  8. (SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
  9.   SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1
  10. ) TransactionData
  11. ) TransactionDataSequence

image

As you can see, the 75 missing rows are now being reported correctly by the query. The only task left to do now is to aggregate the amount of missing rows by replacing the outer query like this:

  1. select Sum(NextSalesOrderIntNumberSalesOrderIntNumber1) MissingRows
  2. from
  3. (
  4. select
  5.   SalesOrderIntNumber,
  6.   lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber
  7. from
  8. (SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
  9.   SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1
  10. ) TransactionData
  11. ) TransactionDataSequence

image

As a quality measure you could show the ratio of the missing rows to the total rows (or 100% minus this ratio as a data completeness measure) and – assuming that the missing rows had an average sales amount – also the estimated missing amount. And it’s also useful to get the result on more granular level, for example per month. Here is the full query:

  1. select
  2. orderdatekey/100 [Month],
  3. Sum(NextSalesOrderIntNumberSalesOrderIntNumber1) MissingRows,
  4. convert(float,Sum(NextSalesOrderIntNumberSalesOrderIntNumber1))/count(*)
  5. MissingRowsRatio,
  6. convert(float,Sum(NextSalesOrderIntNumberSalesOrderIntNumber1))/count(*)
  7. * Sum([ExtendedAmount]) MissingRowsEstimatedValue
  8. from
  9. (
  10. select
  11.   SalesOrderIntNumber,
  12.   lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber,
  13.   [ExtendedAmount], OrderDateKey
  14. from
  15. (SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
  16. SalesOrderIntNumber, [ExtendedAmount], OrderDateKey
  17. FROM [FactInternetSales] where [SalesOrderLineNumber]=1
  18. ) TransactionData
  19. ) TransactionDataSequence
  20. group by orderdatekey/100
  21. order by orderdatekey/100

image

Plotting the result over the time gives a good overview. For my example data, quality improved a lot since August 2007.

image

Conclusion: This is another example how window functions provide an elegant solution for solving analytical data tasks. And since this works perfectly on a PDW, the approach works well even with billions of rows of data.

Your email address will not be published. Required fields are marked *

Join #teamoraylispeople

Gestalte mit uns
die Welt der Daten