16.03.2014

Converting events to hourly based aggregations

Technical Value

PDW 2012 | SQL Server 2012

For today’s post I didn’t find a really good title. Here’s what this post is about: Sometimes you’ll find event based data in your source system (something happens at a specific point in time) but for the data warehouse you want to transform this data to match a given time dimension. The situation is similar to an older post I wrote about SQL Server window functions .

There are some approaches to accomplish this task. For today’s post I’d like to show a SQL-based approach we’re currently using in an ELT process on a Parallel Data Warehouse (PDW) 2012.

Let’s assume you’re working for a retailer who is interested in the number of cash desks being open at a time. A subset of the event data (single day, single store) may look like this:

t1

In this case cash desk 2 opened at 06:35, then desk 1 opens at 8:27, then desk 2 closes at 11:58 and so on. The question is, how many cash desks are open for example from 08:00 to 09:00? If a cash desk is only open for half an our in the given time range, it should be counted as 0.5, so between 08:00 and 09:00 approximately 1.5 cash desks where open (desk 2 for the full hour and desk 1 for half of an hour).

In order to get the number of cash desk being open, we first convert the transaction type to a delta: +1 means a desk opens, –1 means a desk closes. Here is the query together with the result:

 

  1. select      
  2.   TransactionDate      
  3.   , StoreID      
  4.   , TransactionTime      
  5.   , CashdeskID      
  6.   , case TransactionType      
  7.     when 'signon' then 1      
  8.     when 'signoff' then -1      
  9.   else 0      
  10.   end CashdeskDelta      
  11. from (      
  12.   select TransactionDate, StoreID, TransactionTime, CashdeskID,TransactionType      
  13.   from POSData where TransactionType in ('signon','signoff')) P      
  14. order by 1,2,3

The result (again a subset) may look like this:

t2

After transforming the transaction type to a numeric value, we can aggregate it using a window function. Therefore I’m using the query from above as a sub query:

 

  1. select *,      
  2. sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime]) OpenCashdesks,      
  3. datediff(s,[TransactionTime],lead([TransactionTime],1) over (partition by TransactionDate, StoreID order by [TransactionTime])) TimeDelta      
  4. from      
  5. (      
  6. -- query from above --      
  7. ) CheckoutDetails      
  8. order by 1,2,3

Again, this shows the power of the window functions. The query gives us the number of open cash desks together with the number of seconds to the next event.

t3

For example, from 8:27 to 11:58, 12622 seconds passed. During this time, 2 cash desks were open. This is a huge step towards the solution but we still have no hour based time frame in the data. However, this can easily be created by cross joining the dimensions for store and time. For my example, I have no store or time dimension (as you should usually have), so I’m using the table sys.all_objects here to generate a sufficient number of data rows:

 

  1. with      
  2. Timeslot AS (      
  3. select T2.TransactionDate, T2.StoreID, T1.TimeGrid from      
  4. (select top 24 convert(time,dateadd(hour,row_number () over (order by [name])-1,'00:00:00')) TimeGrid from sys.all_objects) T1      
  5. cross join      
  6. (select distinct TransactionDate, StoreID from POSData) T2      
  7. )

t4

The query creates one row per hour for each store and each date. Again, usually you would use your existing dimension tables instead of the sys.all_objects table here.

Now, let’s bring both queries together:

  1. with      
  2.  
  3. Timeslot AS      
  4. (      
  5. .. – timeslot query from above      
  6. ),      
  7.  
  8. CashDeskTimesDetails as      
  9. (
  10.      
  11. select *,      
  12. sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime])      
  13. CashdesksOpen      
  14. ,sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime])*      
  15. TimeDeltaSeconds CashdeskOpenSeconds      
  16. ,convert(time, dateadd(hour, datediff(hour, 0, TransactionTime),0)) TransactionHour      
  17. from      
  18. (      
  19. select      
  20. TransactionDate      
  21. , StoreID      
  22. , TransactionTime      
  23. , coalesce(        
  24. datediff(s,[TransactionTime],lead([TransactionTime],1) over (partition by        
  25. TransactionDate, StoreID order by [TransactionTime]))        
  26. ,        
  27. datediff(s,[TransactionTime],dateadd(day,1,0)) -- fill seconds to end of day        
  28. ) TimeDeltaSeconds        
  29. , CashdeskID      
  30. , case TransactionType      
  31. when 'signon' then 1      
  32. when 'signoff' then -1      
  33. else 0      
  34. end CashdeskDelta      
  35. from (      
  36. select TransactionDate, StoreID, TransactionTime, CashdeskID,TransactionType from      
  37. POSData where TransactionType in ('signon','signoff')      
  38. union all      
  39. select TransactionDate, StoreID, TimeGrid, 0, 'timeslot' from Timeslot      
  40. ) P      
  41. ) CheckoutDetails      
  42. )      
  43. select * from CashDeskTimesDetails      
  44. order by 1,2,3

The result shows the original data together with the fixed time frame (24 hours).

t5

Some things to pay special attention to:

  • The inserted timeslots are created with a Cashdesk delta value 0, so they do not modify the number of open/closed desks (column CashdeskOpen)
  • In blue: the last time slot has no subsequent timeslot. Therefore the window function returns null. Here we override this with the number of seconds until day’s end.
  • In red: We add the base hour to each row. This will be used for a group-by operation in the following step

Finally, we simply need to aggregate the last query result:

 

  1. select 
  2. TransactionDate,TransactionHour,StoreID,      
  3. Sum(convert(float,CashdeskOpenSeconds)) / Sum(convert(float,TimeDeltaSeconds)) CashdeskCount      
  4. from CashDeskTimesDetails      
  5. where TimeDeltaSeconds<>0      
  6. group by TransactionDate,TransactionHour,StoreID      
  7. order by 1,2,3      

 

Here is the final result for the sample data subset from above:

t6

Removing all filters (all dates, all stores) may result in a graph like this:

image

So this post showed how to transform event based data to a fixed time scale (hours in this case) to match a star schema join to the time dimension. Since we only used SQL this process can be easily used in an ELT loading scenario.

Teilen auf

Newsletter Anmeldung

Abonnieren Sie unseren Newsletter!
Lassen Sie sich regelmäßig über alle Neuigkeiten rundum ORAYLIS und die BI- & Big-Data-Branche informieren.

Jetzt anmelden