Data Vault - Link Loads explained - Part One
The Data Vault Model is known for its flexibility and its ability to scale out. One of the reasons which makes the model so scalable is the way relationships are realized. Despite how the relationship is modelled in the source system, it will be a Many-to-Many-Table in the Data Vault Model. This Many-To-Many Tables are called Links. Links contain Surrogate Keys of two or more Hub Tables. They store a distinct list of Key Combinations of the relationship, they represent.
In today’s post I choose a source table SalesOrderDetail modelled as a Link Table SalesOrderDetail_Lnk. The SalesOrderDetail from the Adventure Works 2012 has a relationship to the Product and to the SalesOrder. The Link-Table SalesOrderDetail _Lnk is referencing the Product_Hub and the SalesOrder_Hub. Therefore the foreign keys are Product_Seq and SalesOrder_Seq. Like the Hub Table also a Link Table needs some Metadata Columns. So I added the columns LoadTimestamp, LoadProcess and the RecordSource to the Table. In my article about Hub Loads this columns are descripted more detailed.
Data Vault 1.0 Link Load
To load a Data Vault 1.0 Link Table a distinct list of Business Keys have to be selected from the Stage. More precisely the distinct list of Business Key - combinations have to be selected. For every single of those Business Keys a Lookup has to be made to get the Surrogate Key of the related Hub Table. Business Key combinations which already exist inside the Link Table will be dropped from the Data Flow. Before a new Row will be inserted into the Link Table, a Surrogate Sequence Key will be added to the Row.
A SQL Server Integration Services Package loads the SalesOrderDetail-Data into a Stage Table SalesOrderDetail _Sales_AdventureWorks2012. SalesOrderDetail (Source) >> SalesOrderDetail_Sales_AdventureWorks2012 (Stage) The Link Load Pattern is used to load the Data into the Table SalesOrderDetail_Lnk. SalesOrderDetail _Sales_AdventureWorks2012 (Stage) >> SalesOrderDetail _Lnk (Destination)
T-SQL – Implementation
The stored procedure Load_SalesOrderDetail_Link represents the Data Vault 1.0 Link Load Pattern. In here i use a Sequence as Default Constrain to generate the Surrogate Key for the Link Table. The Parameter @LoadProcess has to be set by the ETL Process that executes the procedure. It can be generated by any tool or scheduler that is running your ETL-Workflow. If you use a SSIS Package to execute your stored procedure i recommend to use the ServerExecutionID within SSIS.
SQL Server Integration Services – Implementation
The implementation in SQL Server Integration Service displays more visual the simplicity of this load pattern. Like in the T-SQL example above the Surrogate Key is generated by a Sequence used as Default Constrain. This method is described more detailed in my previous post about Hub Loads .
The implementations are examples to illustrate the pattern for loading a Data Vault 1.0 Link Table. Individual project conditions will lead to individual implementation of this pattern. The examples show how simple this pattern can be realized. The benefit of having simple pattern is the ability to automate those processes. Standardised simple and repeatable pattern are making ETL generating possible.
Link Tables form together with the Hub Tables the skeleton of the Data Vault model. The use of Many-To-Many Tables makes the Data Vault Model very flexible. When the Data Vault Model gets extended there will be almost zero refactoring effort. This abilities are making Enterprise Data Warehouse projects more agile and flexible.
Sources: Dan Linstedt , Data Vault 2.0 boot camp class