
Converting events to hour-based aggregations

16.03.2014 Hilmar Buchta

PDW 2012 | SQL Server 2012

I didn't find a really good title for today's post. Here's what it is about: sometimes you'll find event-based data in your source system (something happens at a specific point in time), but for the data warehouse you want to transform this data to match a given time dimension. The situation is similar to an older post I wrote about SQL Server window functions.

There are some approaches to accomplish this task. For today’s post I’d like to show a SQL-based approach we’re currently using in an ELT process on a Parallel Data Warehouse (PDW) 2012.

Let's assume you're working for a retailer who is interested in the number of cash desks that are open at any given time. A subset of the event data (single day, single store) may look like this:

[Screenshot: sample event data (single day, single store)]

In this case cash desk 2 opened at 06:35, then desk 1 opened at 08:27, then desk 2 closed at 11:58, and so on. The question is: how many cash desks were open, for example, from 08:00 to 09:00? If a cash desk was only open for half an hour in the given time range, it should be counted as 0.5, so between 08:00 and 09:00 approximately 1.5 cash desks were open (desk 2 for the full hour and desk 1 for half an hour).
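If you want to follow along without the original source system, here is a minimal sketch of what the POSData table and the sample rows above could look like on a regular SQL Server instance. Only the column names are taken from the queries below; the data types, the omitted PDW-specific options (such as the distribution clause), and all rows beyond the three events mentioned above are assumptions. The seconds part of the 08:27 event is chosen so that the 12,622-second gap shown later works out.

create table POSData (
  TransactionDate  date        not null,
  StoreID          int         not null,
  TransactionTime  time(0)     not null,
  CashdeskID       int         not null,
  TransactionType  varchar(20) not null  -- e.g. 'signon', 'signoff', other POS events
)

insert into POSData (TransactionDate, StoreID, TransactionTime, CashdeskID, TransactionType)
values
  ('2014-03-01', 1, '06:35:00', 2, 'signon'),   -- desk 2 opens
  ('2014-03-01', 1, '08:27:38', 1, 'signon'),   -- desk 1 opens
  ('2014-03-01', 1, '11:58:00', 2, 'signoff'),  -- desk 2 closes
  ('2014-03-01', 1, '14:05:00', 2, 'signon'),   -- made-up remainder of the day
  ('2014-03-01', 1, '18:45:00', 1, 'signoff'),
  ('2014-03-01', 1, '20:00:00', 2, 'signoff')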

In order to get the number of cash desks being open, we first convert the transaction type to a delta: +1 means a desk opens, -1 means a desk closes. Here is the query together with the result:

 

select
  TransactionDate
  , StoreID
  , TransactionTime
  , CashdeskID
  , case TransactionType
      when 'signon' then 1
      when 'signoff' then -1
      else 0
    end CashdeskDelta
from (
  select TransactionDate, StoreID, TransactionTime, CashdeskID, TransactionType
  from POSData where TransactionType in ('signon','signoff')) P
order by 1,2,3

The result (again a subset) may look like this:

[Screenshot: events with the derived CashdeskDelta column]

After transforming the transaction type to a numeric value, we can aggregate it using a window function. To do so, I'm using the query from above as a subquery:

 

select *,
  sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime]) OpenCashdesks,
  datediff(s, [TransactionTime], lead([TransactionTime], 1) over (partition by TransactionDate, StoreID order by [TransactionTime])) TimeDelta
from
(
  -- query from above --
) CheckoutDetails
order by 1,2,3

Again, this shows the power of window functions. The query gives us the number of open cash desks together with the number of seconds until the next event.

[Screenshot: running total OpenCashdesks and TimeDelta in seconds]

For example, from 8:27 to 11:58, 12622 seconds passed. During this time, 2 cash desks were open. This is a huge step towards the solution, but we still have no hour-based time frame in the data. However, this can easily be created by cross joining the store and time dimensions. For my example, I have no store or time dimension (as you usually would have), so I'm using the table sys.all_objects here to generate a sufficient number of rows:

 

with
Timeslot AS (
  select T2.TransactionDate, T2.StoreID, T1.TimeGrid from
  (select top 24 convert(time, dateadd(hour, row_number() over (order by [name]) - 1, '00:00:00')) TimeGrid from sys.all_objects) T1
  cross join
  (select distinct TransactionDate, StoreID from POSData) T2
)
select * from Timeslot
order by 1, 2, 3

[Screenshot: hourly time grid per store and date]

The query creates one row per hour for each store and each date. Again, usually you would use your existing dimension tables instead of the sys.all_objects table here.
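As a sketch of that variant: assuming an existing time dimension named DimTime with an integer column HourOfDay (both names are hypothetical and not from the original post), the Timeslot CTE could be built like this instead:

-- Sketch only: DimTime and HourOfDay are assumed names for an existing time dimension
with
Timeslot AS (
  select T2.TransactionDate, T2.StoreID, T1.TimeGrid from
  (select distinct convert(time, dateadd(hour, HourOfDay, '00:00:00')) TimeGrid
   from DimTime) T1   -- one row per hour of the day (0..23)
  cross join
  (select distinct TransactionDate, StoreID from POSData) T2
)
select * from Timeslot
order by 1, 2, 3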

Now, let’s bring both queries together:

with
Timeslot AS
(
  -- timeslot query from above
),
CashDeskTimesDetails as
(
  select *,
    sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime])
      CashdesksOpen
    , sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime])
      * TimeDeltaSeconds CashdeskOpenSeconds
    , convert(time, dateadd(hour, datediff(hour, 0, TransactionTime), 0)) TransactionHour
  from
  (
    select
      TransactionDate
      , StoreID
      , TransactionTime
      , coalesce(
          datediff(s, [TransactionTime], lead([TransactionTime], 1) over (partition by
            TransactionDate, StoreID order by [TransactionTime]))
          ,
          datediff(s, [TransactionTime], dateadd(day, 1, 0)) -- fill seconds to end of day
        ) TimeDeltaSeconds
      , CashdeskID
      , case TransactionType
          when 'signon' then 1
          when 'signoff' then -1
          else 0
        end CashdeskDelta
    from (
      select TransactionDate, StoreID, TransactionTime, CashdeskID, TransactionType from
      POSData where TransactionType in ('signon','signoff')
      union all
      select TransactionDate, StoreID, TimeGrid, 0, 'timeslot' from Timeslot
    ) P
  ) CheckoutDetails
)
select * from CashDeskTimesDetails
order by 1,2,3

The result shows the original data together with the fixed time frame (24 hours).

[Screenshot: events combined with the hourly time grid]

Some things to pay special attention to:

  • The inserted timeslots are created with a CashdeskDelta value of 0, so they do not modify the number of open/closed desks (column CashdesksOpen).
  • In blue: the last timeslot has no subsequent timeslot, so the lead window function returns NULL. Here we override this with the number of seconds until the end of the day.
  • In red: we add the base hour to each row. This will be used for a group-by operation in the following step (a small check of this expression follows below).
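As a quick, illustrative check of the hour-truncation expression used for TransactionHour (the timestamp is made up, not taken from the sample data):

-- Truncate a timestamp to its base hour: 11:58:00 becomes 11:00:00
select convert(time, dateadd(hour, datediff(hour, 0, convert(datetime, '2014-03-01 11:58:00')), 0)) as TransactionHour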

Finally, we simply need to aggregate the last query result:

 

select
  TransactionDate, TransactionHour, StoreID,
  Sum(convert(float, CashdeskOpenSeconds)) / Sum(convert(float, TimeDeltaSeconds)) CashdeskCount
from CashDeskTimesDetails
where TimeDeltaSeconds <> 0
group by TransactionDate, TransactionHour, StoreID
order by 1,2,3
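The measure is a time-weighted average: the accumulated "open desk seconds" per hour divided by the seconds covered in that hour. As an illustrative check for the 08:00 to 09:00 slot from the introduction, assuming the 08:27 event happened at 08:27:38 (so one desk was open for the first 1,658 seconds and two desks for the remaining 1,942 seconds of the hour):

-- Illustrative numbers only: 1 desk open for 1658 s, then 2 desks for 1942 s
select (1.0 * 1658 + 2.0 * 1942) / (1658 + 1942) as CashdeskCount  -- about 1.54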

 

Here is the final result for the sample data subset from above:

[Screenshot: hourly cash desk counts for the sample data subset]

Removing all filters (all dates, all stores) may result in a graph like this:

[Graph: open cash desks per hour across all dates and stores]

So this post showed how to transform event-based data to a fixed time scale (hours in this case) so that it matches a star-schema join to the time dimension. Since we only used SQL, this process can easily be used in an ELT loading scenario.

