BLOG

Data Vault Satellite Loads explained

04.07.2016 Jörn Ebbers

With this blog post i want to continue the series on explaining Data Vault Loads. Today i want to describe the load pattern for Data Vault Satellite. In a Data Vault model satellites are used to store descriptive attributes associated with Hubs or Links with tracking of all historical values. Therefore the primary key of a satellite is a two part key build by the Surrogate Key of the Hub/Link, which it is attached to, extended by a datetime-stamp. With these two mandantory fields tracking of historical data is possible.

The default Satellite has pros and cons. On the one hand the load of this default satellite is very simple because it works „Insert-Only“ – similar to Hub Loads. On the other hand the further work with this default satellite is very complex. To reduce complexity in general and to achieve an improve of speed, reusability, parallelism and other reasons i recommend to extend the default satellite with additional fields to prepare it for the use with „End-Dating“. For this purpose two metadata-fields for effective date/time (ValidFrom) and expiry date/time (ValidTo) are added. Because use cases often require current information without history, an additional field is added to allow an easy identification or current data. Therefore in this post i focus on loading satellites with „End-Dating“.

Data Vault Satellite Load 1.0

Data Vault Satellite Load Process (DV 1.0)
Data Vault Satellite Load Process (DV 1.0)

The function of a satellite load is to select the unique list of attributes, lookup the Surrogate Key of the attached Hub/Link and check for changes against the current record in the satellite. If no change is detected, discard  the record from the pipeline. If a change is deteced, insert the new record into the satellite as current record and update the previous records end date.

Example

To explain satellite loads i am using a well-known example from the Adventure Works 2012 Data Base. The example load is supposed to load product information into the Data Vault Model. The destination table SAT_ProductDetails has the following structure:

Data Vault 1.0 Satellite Table

The following T-SQL and SSIS Implementations will illustrate the concept of Data Vault Satellite 1.0 Loads:

T-SQL Implementation

In this example a stored procedure  Load_SAT_ProductDetails  represents the Satellite Load Pattern. Basis of this approach is the SQL Merge command. The Parameter @LoadProcess has to be set by the ETL Process that executes the procedure. It can be generated by any tool or scheduler that is running your ETL-Workflow. If you use a SSIS Package to execute your stored procedure i recommend to use the ServerExecutionID within SSIS.

First some variables are declared for values that are used several times. The variable @DefaultValidFrom stores the initial Effective Date for the first occurance of a record. In this example January, 1st 1900 is sufficient to handle all use cases for the adventure works example. Of course this can be adjusted if it is required.

For the selection of unique attributes from stage, the query uses a distinct statement. This works for a non persistent staging area where is ensured, that a change can only occure once per record. Otherwise the merge statement returns an error, because it cannot update the same row of the target table multiple times. In this case you have to extend the query to meet this requirement. This can be done in many different ways, which I will not describe in detail here because it is beyond the scope of this post.

The unique attributes from source are looked up by the Hub/Link-Key Product_seq and IsCurrent=1 to lookup against the current record. For existing records a change detection based on a comparison of each attribute is used. In this example i assume all attributes never can become NULL. If NULL-Values can occure, the compare needs to be extended to handle them.

Alternatively the comparison can be done by CDC-Hashes.

For detected changes the old record in the target table will be outdated. A current record is inserted by the overlying insert-statement, which uses the MergeOutput as source. New records are added.

For purpose of integration wihtin a workflow the stored procedure is extended with an errorhandling, which returns the result of the execution. This resultset can be customized to deliver further information about the execution, like rowcounts for the several operations within the merge statement.

  1. CREATE PROCEDURE DataVault.Load_SAT_ProductDetails
  2.     @LoadProcess BIGINT
  3. As
  4.  
  5. DECLARE @RecordSource        nvarchar(100)
  6. DECLARE @DefaultValidFrom    datetime2(0)     –use datetime2(0) to remove milliseconds
  7. Declare @DefaultValidTo      datetime2(0)  
  8. DECLARE @LoadDateTime        datetime2(0)  
  9.  
  10. SET @RecordSource            = N’AdventureWorks2012.Product.Product‘
  11. SET @DefaultValidFrom        = ‚1900-01-01‘
  12. SET @DefaultValidTo          = ‚9999-12-31‘
  13. SET @LoadDateTime            = GETDATE()
  14.  
  15. BEGIN TRY
  16. Begin Transaction
  17.  
  18. –Insert new current records for changed records
  19. INSERT INTO DataVault.SAT_ProductDetails
  20. (
  21.      Product_Seq,LoadTimestamp,Name,ListPrice, LoadProcess, RecordSource, ValidFrom,ValidTo, IsCurrent
  22. )
  23. SELECT
  24.      Product_Seq,
  25.      @LoadDateTime,                          –LoadDatetimeStamp
  26.      Name,
  27.      ListPrice,
  28.      @LoadProcess as LoadProcess,
  29.      @RecordSource as RecordSource,    
  30.      @LoadDateTime,                          –Actual DateTimeStamp
  31.      @DefaultValidTo,                        –Default Expiry DateTimestamp
  32.      1                                       –IsCurrent Flag
  33. FROM
  34. (
  35.      MERGE DataVault.SAT_ProductDetails AS Target     –Target: Satellite
  36.      USING
  37.      (
  38.           — Query distinct set of attributes from source (stage)
  39.           — includes lookup of business key by left outer join referenced hub/link
  40.           SELECT distinct
  41.           hub.Product_Seq,
  42.           stage.Name,                          
  43.           stage.ListPrice
  44.           FROM stage.Product_Product_AdventureWorks2012 as stage
  45.           LEFT OUTER JOIN DataVault.Product_Hub as hub on stage.ProductNumber=hub.Product_BK
  46.           WHERE hub.Product_Seq is not null
  47.      ) AS Source
  48.      ON Target.Product_Seq = Source.Product_Seq         –Identify Columns by Hub/Link Surrogate Key
  49.      AND Target.IsCurrent = 1                           –and only merge against current records in the target
  50.      –when record already exists in satellite and an attribute value changed
  51.      WHEN MATCHED AND
  52.      (
  53.           Target.Name <> Source.Name
  54.           OR Target.ListPrice <> Source.ListPrice
  55.      )
  56.      — then outdate the existing record
  57.      THEN UPDATE SET
  58.           IsCurrent  = 0,
  59.           ValidTo    = @LoadDateTime
  60.      — when record not exists in satellite, insert the new record
  61.      WHEN NOT MATCHED BY TARGET
  62.      THEN INSERT
  63.      (
  64.           Product_Seq, LoadTimestamp, Name, ListPrice, LoadProcess, RecordSource, ValidFrom, ValidTo, IsCurrent
  65.      )
  66.      VALUES
  67.      (
  68.           Source.Product_Seq,
  69.           @LoadDateTime,
  70.           Source.Name,
  71.           Source.ListPrice,
  72.           @LoadProcess,
  73.           @RecordSource,
  74.           @DefaultValidFrom,     –Default Effective DateTimeStamp
  75.           @DefaultValidTo,       –Default Expiry DateTimeStamp
  76.           1                      –IsCurrent Flag
  77.      )
  78.      — Output changed records
  79.      OUTPUT
  80.           $action AS Action
  81.           ,Source.*
  82. ) AS MergeOutput
  83. WHERE MergeOutput.Action = ‚UPDATE‘
  84. AND Product_Seq IS NOT NULL;
  85.  
  86. Commit
  87.      SELECT
  88.           ‚Success‘ as ExecutionResult
  89.      RETURN;
  90. END TRY
  91.  
  92. BEGIN CATCH
  93.  
  94.      IF @@TRANCOUNT > 0
  95.      ROLLBACK
  96.  
  97.      SELECT
  98.           ‚Failure‘ as ExecutionResult,
  99.           ERROR_MESSAGE() AS ErrorMessage;
  100.      RETURN;
  101. END CATCH
  102.  
  103. GO

SQL Server Integration Services – Implementation

When we model Data Warehouse Loads, we want to use an ETL Tool like Microsoft SQL Server Integration Services. This sample shows how the for pattern loading Data Vault Satellites 1.0 can be implemented in Integration Services.

Data Vault 1.0 Satellite SSIS Pattern

In this example the whole load process is implemented wihtin one data flow task. First a unique list of attributes is selected from the stage by the following query:

  1. SELECT DISTINCT
  2. ProductNumber as Product_BK,
  3. Name,
  4. ListPrice
  5. FROM stage.Product_Product_AdventureWorks2012

This data set is extended by the following derived and fixed attributes:

Data Vault Satellite Load - derived and fixed attributes

For Data Vault Satellite Loads 1.0 we need to lookup the current Surrogate Key for the related Hub/Link. This is done in the fist lookup.  Non matching rows are discarded by the lookup, because we can not load a Satellite wihtout having a related Hub or Link. The Surrogate Key (Product_Seq) is added to the matching rows. For the matching rows the second lookup checks if the record already exists in the satellite. If it not exists in the Satellite a new record can be written. For the existing records in the satellite the third lookup compares all satellite attributes. If no match is found, that means there is noch change and the record can be discarded. If the lookup can not find a match, an attribute has changed. In this case we need two operations to perform the change. First we need to insert a new current record with the actual LoadDateTime as effective date. For the old record we have to perform an update and set the expiry date to the actual LoadDateTime and IsCurrent to „0“ (zero).

In this example i show a very simple version of an SSIS implementation. It is recommended to extend this pattern to your own requirements by adding additional functionalities like integration of metadata or logging.

Data Vault Satellite Load 2.0

Data Vault Satellite Load 2.0
Data Vault Satellite Load Process (2.0)

The main difference between Satellite Loads in Data Vault 1.0 and Data Vault 2.0 ist, that lookup of a Surrogate Key for the referencing Hub/Link is no longer necessary. This is usually a bottleneck and represents a synchronization point during ETL-processing. With Data Vault 2.0 the Satellite can be loaded without any lookup and without waiting for the referencing Hub/Link to be processed.

Example

The difference in modeling between Data Vault 1.0 and Data Vault 2.0 is the replacement of sequences by hash keys. As opposed to sequences, a hash-key is not generated by the data warehouse regardless of the underlying data. It is derived by the business key and computed when entering the staging-area of the data warehouse.

Data Vault Satellite Table 2.0
Data Vault Satellite Table 2.0

The following T-SQL and SSIS Implementations will illustrate the concept of Data Vault Satellite 2.0 Loads:

T-SQL Implementation

In this example a stored procedure  Load_SAT_ProductDetails_DV20  represents the Satellite Load Pattern for Data Vault 2.0. It works similar to the DV 1.0 stored procedure. The Sequences Product_Seq are replaced by hash keys. The main difference is the Using within the Merge-statement, which only selects the input data from the staging-area without performing a lookup to its referenced Hub/Link.

  1. CREATE PROCEDURE DataVault.Load_SAT_ProductDetails_DV20
  2.     @LoadProcess BIGINT
  3. As
  4.  
  5. DECLARE @RecordSource        nvarchar(100)
  6. DECLARE @DefaultValidFrom    datetime2(0)     –use datetime2(0) to remove milliseconds
  7. Declare @DefaultValidTo      datetime2(0)  
  8. DECLARE @LoadDateTime        datetime2(0)  
  9.  
  10. SET @RecordSource            = N’AdventureWorks2012.Product.Product‘
  11. SET @DefaultValidFrom        = ‚1900-01-01‘
  12. SET @DefaultValidTo          = ‚9999-12-31‘
  13. SET @LoadDateTime            = GETDATE()
  14.  
  15. BEGIN TRY
  16. Begin Transaction
  17.  
  18. –Insert new current records for changed records
  19. INSERT INTO DataVault.SAT_ProductDetails_DV20
  20. (
  21.      Product_Hsk,LoadTimestamp,Name,ListPrice, LoadProcess, RecordSource, ValidFrom,ValidTo, IsCurrent
  22. )
  23. SELECT
  24.      Product_Hsk,                            –Hash Key
  25.      @LoadDateTime,                          –LoadDatetimeStamp
  26.      Name,
  27.      ListPrice,
  28.      @LoadProcess as LoadProcess,
  29.      @RecordSource as RecordSource,    
  30.      @LoadDateTime,                          –Actual DateTimeStamp
  31.      @DefaultValidTo,                        –Default Expiry DateTimestamp
  32.      1                                       –IsCurrent Flag
  33. FROM
  34. (
  35.      MERGE DataVault.SAT_ProductDetails_DV20 AS Target     –Target: Satellite
  36.      USING
  37.      (
  38.           — Query distinct set of attributes from source (stage)
  39.           SELECT distinct
  40.           stage.Product_Hsk,
  41.           stage.Name,                          
  42.           stage.ListPrice
  43.           FROM stage.Product_Product_AdventureWorks2012_DV20 as stage
  44.      ) AS Source
  45.      ON Target.Product_Hsk = Source.Product_Hsk         –Identify Columns by Hub/Link Hash Key
  46.      AND Target.IsCurrent = 1                           –and only merge against current records in the target
  47.      –when record already exists in satellite and an attribute value changed
  48.      WHEN MATCHED AND
  49.      (
  50.           Target.Name <> Source.Name
  51.           OR Target.ListPrice <> Source.ListPrice
  52.      )
  53.      — then outdate the existing record
  54.      THEN UPDATE SET
  55.           IsCurrent  = 0,
  56.           ValidTo    = @LoadDateTime
  57.      — when record not exists in satellite, insert the new record
  58.      WHEN NOT MATCHED BY TARGET
  59.      THEN INSERT
  60.      (
  61.           Product_Hsk, LoadTimestamp, Name, ListPrice, LoadProcess, RecordSource, ValidFrom, ValidTo, IsCurrent
  62.      )
  63.      VALUES
  64.      (
  65.           Source.Product_Hsk,
  66.           @LoadDateTime,
  67.           Source.Name,
  68.           Source.ListPrice,
  69.           @LoadProcess,
  70.           @RecordSource,
  71.           @DefaultValidFrom,     –Default Effective DateTimeStamp
  72.           @DefaultValidTo,       –Default Expiry DateTimeStamp
  73.           1                      –IsCurrent Flag
  74.      )
  75.      — Output changed records
  76.      OUTPUT
  77.           $action AS Action
  78.           ,Source.*
  79. ) AS MergeOutput
  80. WHERE MergeOutput.Action = ‚UPDATE‘
  81. AND Product_Hsk IS NOT NULL;
  82.  
  83. Commit
  84.      SELECT
  85.           ‚Success‘ as ExecutionResult
  86.      RETURN;
  87. END TRY
  88.  
  89. BEGIN CATCH
  90.  
  91.      IF @@TRANCOUNT > 0
  92.      ROLLBACK
  93.  
  94.      SELECT
  95.           ‚Failure‘ as ExecutionResult,
  96.           ERROR_MESSAGE() AS ErrorMessage;
  97.      RETURN;
  98. END CATCH
  99.  
  100. GO

SQL Server Integration Services – Implementation

This sample shows how the for pattern loading Data Vault Satellites 2.0 can be implemented in Integration Services.

Data Vault 2.0 Satellite SSIS Pattern

It is similar to the loading pattern for a Data Vault Satellite 1.0. The main difference is, that in this implementation we only need two lookups. We do not have to lookup a Surrogate Key of the related Hub or Link table. For the remaining two lookups the Hash-Key (Product_Hsk) delivered by the source (stage) is used as lookup column. Again, a simple version of the implementation has been presented, which can be extended individually.

Conclusion

The shown implementations are examples to explain how a Satellite Load works. Each individual project will require individual implementations of these patterns.

A Satellite Load is a simple pattern, which can be easily repeated for every entity in the data model. However, it is not recommend to repeat this manually. To take the most advantage of the Data Vault characteristics it is highly recommend to generate the Data Vault Satellite Loads to achieve a high degree of scalability and procutivity.

Your email address will not be published. Required fields are marked *

Join #teamoraylispeople

Gestalte mit uns
die Welt der Daten