08.12.2014

Data Vault - Hub Loads explained

Technical Value

In a Data Vault Model, Business Keys are used to integrate data. A Hub Table contains a distinct list of Business Keys for one entity. This makes the Hub Table the "Key Player" of the Data Vault Data Warehouse. This blog post explains the pattern for loading a Hub Table. Moreover, I will explain the differences between Data Vault 1.0 Hub Loads and Data Vault 2.0 Hub Loads.

Data Vault 1.0 Hub Load

[Figure: Data Vault 1.0 Hub Load pattern]

The function of a Hub Load is to load a distinct list of new Business Keys into the Hub Table. In Data Vault 1.0 it also generates a Sequence number as the Surrogate Key. Business Keys that have already been loaded are dropped from the Data Flow.

Example

To explain Hub Loads, I am using a well-known example from the AdventureWorks2012 database. The example load brings product information into the Data Vault Model.

The destination table Product_Hub has the following structure; a DDL sketch follows the column descriptions.

Column Product_Seq
Surrogate Key for the Product

Column Product_Bk
Business Key of the Product - In this case we use the Product Number. Hub Tables have at least one column for the Business Key; when a composite Business Key is needed, there can be more.


Column LoadTimestamp
Time of the Load - This column indicates the data currency.

Column LoadProcess
ID of the ETL Load that has loaded the record

Column RecordSource
Source of the Record - At a minimum it needs the source system, but you can also add more specific values like the source table. The column could also contain the name of the business unit where the data originated. Choose what suits your requirements.
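Since the original table figure is not reproduced here, the following DDL is a minimal sketch of the destination table; the data types and the Business Key length are assumptions, as the post does not state them:

    CREATE TABLE dbo.Product_Hub (
        Product_Seq    INT           NOT NULL PRIMARY KEY, -- Surrogate Key from a Sequence
        Product_Bk     NVARCHAR(25)  NOT NULL,             -- Business Key (Product Number)
        LoadTimestamp  DATETIME      NOT NULL,             -- time of the load
        LoadProcess    INT           NOT NULL,             -- ID of the ETL load
        RecordSource   NVARCHAR(100) NOT NULL              -- origin of the record
    );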

The Data Source is the table Product.Product. This data will be loaded without any transformation into the Stage Table Product_Product_AdventureWorks2012. The Stage Load is performed by a SQL Server Integration Services Package.

Product (Source Table) >> Product_Product_AdventureWorks2012 (Stage Table)

From the Stage, the Hub Load Pattern is used to load the data into the destination table.

Product_Product_AdventureWorks2012 (Stage Table) >> Product_Hub (Destination Table)

The following T-SQL and SSIS implementations illustrate the concept of Hub Loads.

T-SQL – Implementation

In this example a stored procedure Load_Product_Hub represents the Hub Load Pattern. To generate the Surrogate Sequence Number, the procedure uses a SQL Server Sequence.

Creating a Sequence in T-SQL

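The screenshot is not reproduced here; a minimal sketch of such a Sequence, with an assumed name, could look like this:

    CREATE SEQUENCE dbo.Product_Hub_Seq
        AS INT
        START WITH 1
        INCREMENT BY 1;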

The stored procedure Load_Product_Hub shows how the load pattern could be implemented in T-SQL.

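The original procedure was only shown as a screenshot, so the following is a sketch of how it could look; the stage column name ProductNumber and the Sequence name dbo.Product_Hub_Seq are assumptions:

    CREATE PROCEDURE dbo.Load_Product_Hub
        @LoadProcess INT
    AS
    BEGIN
        -- Load a distinct list of Business Keys from the stage and
        -- drop keys that already exist in the Hub.
        INSERT INTO dbo.Product_Hub
            (Product_Seq, Product_Bk, LoadTimestamp, LoadProcess, RecordSource)
        SELECT
            NEXT VALUE FOR dbo.Product_Hub_Seq, -- Surrogate Key
            src.ProductNumber,                  -- Business Key
            GETDATE(),
            @LoadProcess,
            'AdventureWorks2012.Product.Product'
        FROM (SELECT DISTINCT ProductNumber
              FROM dbo.Product_Product_AdventureWorks2012) AS src
        WHERE NOT EXISTS (SELECT 1
                          FROM dbo.Product_Hub AS hub
                          WHERE hub.Product_Bk = src.ProductNumber);
    END;

Note that SQL Server does not allow NEXT VALUE FOR in a query that uses DISTINCT, which is why the distinct list of Business Keys is built in a derived table first.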

The NEXT VALUE FOR expression fetches a new Sequence Number for each row.

SQL Server Integration Services – Implementation

When we model Data Warehouse Loads, we want to use an ETL Tool like Microsoft SQL Server Integration Services.

This sample shows how the pattern can be implemented in Integration Services.

[Figure: SSIS Data Flow implementing the Hub Load pattern]

Generation of a Surrogate Sequence Key when using Integration Services

There are many ways to generate keys within SSIS. You can write your own Key Generator in C# or use a third-party generator. The database itself also has capabilities to generate keys. It is common to use an Identity Column for this, but if you truncate the table, the Identity Column will be reset. Depending on your ETL Architecture, this can cause duplicate key issues.

Recommendation
To avoid this behavior I recommend using a Sequence in a Default Constraint on the destination table. The benefit of a Sequence is that you do not have to deal with key generation within your ETL; leave this task to the database. The keys are independent of the loading process, which makes it possible to exchange the ETL Process or the ETL Tool if necessary. And the Sequence keeps incrementing even when the table gets truncated.

Using the Sequence as a Default Constraint on the destination table

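In T-SQL this could look as follows (the constraint name is an assumption):

    -- Let the database generate the Surrogate Key whenever
    -- the ETL process does not supply one explicitly.
    ALTER TABLE dbo.Product_Hub
        ADD CONSTRAINT DF_Product_Hub_Product_Seq
        DEFAULT (NEXT VALUE FOR dbo.Product_Hub_Seq) FOR Product_Seq;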

Every new row gets a new Sequence Number by default, but you can still set the value within your ETL Process if needed.

Data Vault 2.0 Hub Load

The main difference between a Hub Load in Data Vault 1.0 and in Data Vault 2.0 is the use of Hash Keys.

Hash Keys are a controversial topic. It is true that collisions can occur when using Hash Keys. On the upside, using Hash Keys can solve issues with late arriving data. More importantly, they can increase the load performance of the data model significantly: they make it possible to load the data into the Data Vault Model fully in parallel, because the Hash Keys are computed directly from the Business Keys instead of being looked up.

The pros and cons of Hash Keys are not the subject of this article, though. In later posts we will investigate how Data Warehouse Loads can benefit from Hash Keys.

The load pattern for loading a Data Vault 2.0 Table is basically the same as in Data Vault 1.0. Only the Surrogate Sequence Key Generator is replaced by a Hash Generator.

[Figure: Data Vault 2.0 Hub Load pattern]

In Data Vault 2.0, Hash Keys replace Sequence Keys, so we have to modify our data model a little. The Product_Seq column is replaced by a column Product_Hsk; using different suffixes here helps to differentiate Data Vault 1.0 and Data Vault 2.0 tables. The data type of the column has to be changed as well: it is recommended to use a CHAR(32) column to store an MD5 Hash Key.

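A sketch of the modified table; apart from the new Hash Key column, the structure stays the same (data types again assumed):

    CREATE TABLE dbo.Product_Hub (
        Product_Hsk    CHAR(32)      NOT NULL PRIMARY KEY, -- MD5 Hash Key instead of a Sequence
        Product_Bk     NVARCHAR(25)  NOT NULL,
        LoadTimestamp  DATETIME      NOT NULL,
        LoadProcess    INT           NOT NULL,
        RecordSource   NVARCHAR(100) NOT NULL
    );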

T-SQL – Implementation

Generating an MD5 Hash Key in T-SQL
To implement the modified Hub Load Pattern, a Hash Key Generator is needed. In T-SQL I have implemented a custom function that returns the Hash Key for a given Business Key. The function uses the SQL Server HASHBYTES function to generate the Hash Key.
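The function was only shown as a screenshot in the original post; this sketch upper-cases and trims the Business Key before hashing, which is a common Data Vault convention but an assumption here:

    CREATE FUNCTION dbo.GetHashKey (@BusinessKey NVARCHAR(255))
    RETURNS CHAR(32)
    AS
    BEGIN
        -- HASHBYTES('MD5', ...) returns 16 bytes; CONVERT with style 2
        -- renders them as 32 hexadecimal characters without the '0x' prefix.
        RETURN CONVERT(CHAR(32),
                       HASHBYTES('MD5', UPPER(LTRIM(RTRIM(@BusinessKey)))),
                       2);
    END;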

In the procedure Load_Product_Hub, the SQL Server Sequence call has been substituted by a call to the new Hash Generator function GetHashKey.
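The core of the modified procedure might then read like this sketch (same assumptions as above):

    INSERT INTO dbo.Product_Hub
        (Product_Hsk, Product_Bk, LoadTimestamp, LoadProcess, RecordSource)
    SELECT
        dbo.GetHashKey(src.ProductNumber), -- Hash Key instead of NEXT VALUE FOR
        src.ProductNumber,
        GETDATE(),
        @LoadProcess,
        'AdventureWorks2012.Product.Product'
    FROM (SELECT DISTINCT ProductNumber
          FROM dbo.Product_Product_AdventureWorks2012) AS src
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.Product_Hub AS hub
                      WHERE hub.Product_Bk = src.ProductNumber);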

SQL Server Integration Services – Implementation

In SSIS, the Data Flow Task has to be extended with a Script Component that generates the Hash Key.

[Figure: SSIS Data Flow with the "Generate Hash Key" Script Component]

Generating an MD5 Hash Key in Integration Services

Within the Data Flow Script Component "Generate Hash Key" I added the following C# script.
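The script itself was embedded as an image; the following sketch shows how it could look inside the component's ScriptMain class. The buffer columns Row.ProductNumber and Row.ProductHsk depend on the input and output columns configured on the component and are assumptions here:

    using System;
    using System.Text;
    using System.Security.Cryptography;

    // Runs once per row in the Data Flow and fills the Hash Key column.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        using (MD5 md5 = new MD5CryptoServiceProvider())
        {
            // Hash the Business Key and render it as 32 hex characters,
            // matching the CHAR(32) column in the Hub Table.
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(Row.ProductNumber));
            Row.ProductHsk = BitConverter.ToString(hash).Replace("-", "");
        }
    }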

The script uses the System.Security.Cryptography.MD5CryptoServiceProvider class to build the MD5 Hash.

Conclusion

The implementations shown here are examples to explain how a Hub Load works. Each individual project will require its own implementation of these patterns.

A Hub Load is a simple pattern that can easily be repeated for every entity in the data model. This is one reason why the Data Vault Model is so scalable.

Because Data Vault Loads are standardised, they can be generated and developed with a high degree of automation. As a result, Enterprise Data Warehouse projects can be developed faster and in a more agile way.

Sources: Dan Linstedt, Data Vault 2.0 boot camp class.

Comments

Christian Gräfe
Wed, 10.12.2014 - 11:56

Hello,

We also use the Data Vault modelling technique in our DWH. I was not familiar with the Hash Key approach, and it is not clear to me what the advantage is supposed to be.
We run the whole thing on an Oracle Exadata (comparable to PDW) with ODI as the ELT tool, which performs the transformations directly on the target database. So unlike MS SQL Server there is no explicit runtime, only execution agents.
In general I come from the MS SQL Server corner and am pleased that Data Vault is also being used on SQL Server, as I consider it very powerful. Especially in a constantly changing environment, you can adapt to new circumstances via new Links/Sats without having to change the complete model.

Regards
Christian

Daniel Piatkowski
Wed, 10.12.2014 - 22:54

Hello Christian,

I agree that Data Vault is very powerful and flexible, especially in an EDWH environment.
As for the use of Hash Keys, they come with a number of advantages. For me, the most important one is the reduction of dependencies during loading: lookups into Hub Tables to determine the Surrogate Keys are no longer necessary, and Hubs, Links and Satellites can be loaded in parallel. I will gladly explain how that works in detail in the next blog post, about Link Loads.

Regards

Daniel

Christian Gräfe
Thu, 11.12.2014 - 11:55

Hello Daniel,

Are there any handy approaches to the automated creation of SSIS packages for loading Hubs/Links and Sats? Because the loads are standardised, a lot can be automated.
Unfortunately, back then we had to decide against SQL Server because it did not comply with the corporate standard. In retrospect this was probably not the best decision, as the administration now eats up an enormous amount of time.

Regards
Christian

Daniel Piatkowski
Sun, 14.12.2014 - 10:47

Hello Christian,

There are several approaches to generating ETL for Data Vault models, but I would not call any of them handy.
When it comes to generating SSIS packages, BIML can be one way. Just recently I came across a video series on BIML and Data Vault: https://www.youtube.com/watch?v=QVS79-YcPsA These approaches have a primarily technical focus, though. They still have to prove themselves in a real project with real business challenges.

Regards
Daniel

Tim
Mon, 02.03.2015 - 23:17

Hello Dan

Do you have any patterns for loading Satellites and Links?

Regards

Tim

Daniel Piatkowski
Fri, 06.03.2015 - 13:53

Hello Tim

Next week I will publish the pattern for Link Loads.
Later you will also find the pattern for Satellites here.

If you have any specific questions, you are welcome to send an email.

Regards

Daniel

Ram
Fri, 08.05.2015 - 06:59

Hi,

In this blog you explained the single Hub load process. I want to load all the Hubs at a time. Is there any way to do this?

Regards,
Ram

Daniel Piatkowski
Wed, 13.05.2015 - 08:56

Hello Ram

There are ways to load all Hubs at a time. You need a proper ETL Tool like SQL Server Integration Services, which allows you to run multiple instances of this single pattern in parallel. How many of them you can run at a time is limited by the performance of your infrastructure.

Regards

Daniel

Dan Linstedt
Thu, 14.05.2015 - 23:24

Hi Daniel,

This is a nice overview, but please - next time you put my training materials on the web, please reference the source (ie: me).

Also, note that your pattern for DV2 load is not "quite" right - you put the hash key computation on the way to the Hub, this is too slow for most systems. The hashes are computed on the way IN to the STAGING tables, not on the way to the Data Vault.

On another note, I did enjoy your article, it was concise and clear. Nice Job! :)

To all others, I do cover these aspects and all the patterns in their complete and full notions in my Data Vault 2.0 boot camp class which will be available on-line very soon at http://KeyLDV.com

Cheers,
Dan Linstedt
