PDW 201 2 | SQL Server 2012 | SQL Server 2014
If your source data contains a subsequent number without gaps it’s relatively easy to find out if data rows are missing. The approach I’m showing here uses window functions that are available since SQL Server 2012 and SQL Server Parallel Data Warehouse 2012.
In order to have some sample data for this post, I’m using the FactInternetSales table of the AdventureWorksDW2012 database. Let’s pretend the column SalesOrderNumber of that table should not have any gaps. I convert the column data to a numeric type and use only the rows having line item sequence number equal to 1 for my sample data.
-
SELECT
-
SalesOrderNumber,
-
convert(int, substring(SalesOrderNumber,3,255)) SalesOrderIntNumber
-
FROM [FactInternetSales]
-
WHERE [SalesOrderLineNumber]=1
-
ORDER BY SalesOrderNumber
Usually the order number is sequentially but we find some gaps here. For example, the order following order number SO43842 is SO43918, so there are 43918 – 43842 – 1 = 75 rows missing.
Using window functions and a sub query, we can add the next number as a column to the query together with the distance:
-
select *, NextSalesOrderIntNumber–SalesOrderIntNumber–1 MissingRows
-
from
-
(
-
select
-
SalesOrderIntNumber,
-
lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber
-
from
-
(SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
-
SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1
-
) TransactionData
-
) TransactionDataSequence
As you can see, the 75 missing rows are now being reported correctly by the query. The only task left to do now is to aggregate the amount of missing rows by replacing the outer query like this:
-
select Sum(NextSalesOrderIntNumber–SalesOrderIntNumber–1) MissingRows
-
from
-
(
-
select
-
SalesOrderIntNumber,
-
lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber
-
from
-
(SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
-
SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1
-
) TransactionData
-
) TransactionDataSequence
As a quality measure you could show the ratio of the missing rows to the total rows (or 100% minus this ratio as a data completeness measure) and – assuming that the missing rows had an average sales amount – also the estimated missing amount. And it’s also useful to get the result on more granular level, for example per month. Here is the full query:
-
select
-
orderdatekey/100 [Month],
-
Sum(NextSalesOrderIntNumber–SalesOrderIntNumber–1) MissingRows,
-
convert(float,Sum(NextSalesOrderIntNumber–SalesOrderIntNumber–1))/count(*)
-
MissingRowsRatio,
-
convert(float,Sum(NextSalesOrderIntNumber–SalesOrderIntNumber–1))/count(*)
-
* Sum([ExtendedAmount]) MissingRowsEstimatedValue
-
from
-
(
-
select
-
SalesOrderIntNumber,
-
lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber,
-
[ExtendedAmount], OrderDateKey
-
from
-
(SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
-
SalesOrderIntNumber, [ExtendedAmount], OrderDateKey
-
FROM [FactInternetSales] where [SalesOrderLineNumber]=1
-
) TransactionData
-
) TransactionDataSequence
-
group by orderdatekey/100
-
order by orderdatekey/100
Plotting the result over the time gives a good overview. For my example data, quality improved a lot since August 2007.
Conclusion: This is another example how window functions provide an elegant solution for solving analytical data tasks. And since this works perfectly on a PDW, the approach works well even with billions of rows of data.
Kommentare (0)