Hash function for dimension updates
SQL Server 2005-2012
While fact tables are usually tables with many rows but fewer columns, dimensions are just the opposite. Because of the denormalization we usually have dimension tables with quite a lot of columns. By this, end users can benefit from a large set of fields/attributes by which they can analyze their data. For example, for a date dimension we would have columns like Year, Month, Week, Day, Day Number of Year, Weekday, Holiday etc., all for the calendar and then again for the fiscal calendar. And if the DWH also supports more than one language, we may end up with different columns for each language (in star schema). So usually we have a large amount of columns in our dimension tables.
One way to speed up dimension handling with many columns and rows comparing with the standard SSIS component for handling slowly changing dimensions is to use a separate data base table (stage), load new data to this table and to use a SQL merge command to update the original dimension table. An SCD 1 update may look like this:
- MERGE dbo.DimTest as t
- USING stage.DimTest as s
- ON t.bk = s.bk
- WHEN MATCHED AND (t.attribute1 != s.attribute1 or t.attribute2 != s.attribute2 or t.attribute3 != s.attribute3)
- update set t.bk = s.bk, t.attribute1 = s.attribute1, t.attribute2 = s.attribute2, t.attribute3 = s.attribute3
- WHEN NOT MATCHED
- insert (bk, attribute1, attribute2)
- values (s.bk, s.attribute1, s.attribute2);
You can find more examples (and also a merge command for an SCD 2 update in Sascha Lorenz’ blog here ).
Let’s focus on the line in red: Here we have to compare each changing attribute in order to determine if a change has occurred. In the example above, we have a dimension with a business key (bk) and three attributes. But what can we do, if there are 20, 30 or even more attributes? One proposed solution is to use a hash code to determine the change. Thinking of a hash function, the first thing that comes into view is the SQL server built-in hash functions, checksum or binary_checksum. Pinal Dave wrote a good explanation about these two functions . For our scenerio, binary_checksum is a candidate. In order to use this, we could add a new computed column to our stage and dimension table from above:
- [SCDhash] AS (binary_checksum([attribute1],[attribute2],[attribute3]))
- Then, the red line could by changed to
- WHEN MATCHED AND (t.SCDhash != s.SCDhash)
But wait, is this a reliable and secure method? As we know, check sum functions in general are not collision-free. Two different input values may lead to the same check sum. If this happens here, we wouldn’t be able to detect the attribute change. However, is this really a problem in this case? Let’s find out.
The binary_hash function maps its input to a 32 bit integer value. This means that any input value is mapped to one of approx. 4.29 billion hash values. Assuming the binary hash function has a good distribution and therefore the chance to get any of those 4.29 billion values is equal, how likely are collisions?
4,29 billion seems to be a huge value. But when looking at collisions, the situation is different and also not intuitive. A similar topic is known as a the “birthday problem ”. One result is that within 23 people the probability that two of them have the same birthday (day/month) is already 50%, although there are 365 possible birthdays. What does this mean for our binary_checksum function with 4.29 billion possible results (instead of 365 possible birthdays)? The following table shows the probability for collisions, depending on the number of rows.
For example, if you compute a binary_checksum over 150,000 rows, the probability that two (or more) rows have the same checkum is already 92.7%. I’m always surprised by such results. Although a collision probability of exactly 100% is reached at 4.29 billion rows because of the pigeonhole principle , we almost get the same probabilty with only 400,000 rows!
So, with binary_checksum, collisions are likely to happen. But does this mean that binary_checksum cannot be used as a hash function for dimension updates?
There was a discussion about this topic in the Kimball Group’s forum in this this blog-entry . Especially interesting is the comment by user ngalemmo. The above probabilities for collisions are one thing, but in a change detection process we have already identified the source and destination row (by business key in SCD 1 and business key plus date from/to or current row indicator in SCD 2), so we’re only comparing two rows. This reduces the probability of errors a lot. For a single update, the chance of getting the same check sum is 1:4.29 billion, very low indeed.
Since the error probability in this case depends on the number of updates (not on the number of rows in the dimension table), the following table shows the probability for a checksum error depending on the number of updates:
For example, since the probability of a collision after 50 million updates is 1.15%, we would expect to have one out of 85 companies with 50 million updates on a dimension table, to actually get an error. Now, 50 million updates are really a lot for our “slowly” changing dimensions, so after all, this doesn’t sound much like a problem. Even 100,000 updates of dimension rows (“slowly changing”) are a lot and in this case, only one out of 40,000 companies would have a problem of a single dimension row not being updated.
Well, all these probability estimations above are pure theory. They are based on an ideal distribution of the binary_checksum function. The “real” probabilities depend on your data, the structure of your updates and the real distribution of binary_checksum. To demonstrate this, I took a list of US cities (43,628 rows in total) from http://www.geonames.org/ . According to the table from above, the probability of two of those cities sharing the same binary hash is about 20% (exactly 19.87%). The chance to find a collision in this data set is close to role a 6 with a dice in the first try. It may happen, but you wouldn’t be too much surprised if not.
After importing the list to a SQL server table, I used this query to determine the collisions:
- select binary_checksum(City), City, State, StateShort from USCities
- where BINARY_CHECKSUM(City) in (
- select BINARY_CHECKSUM(City) from dbo.USCities
- group by BINARY_CHECKSUM(City) having COUNT(*)>1 and MIN(City)<>MAX(City)
- order by binary_checksum(City)
Using this query I got 21 cases, where the binary checksum is identical but the city is different. This is much more, than I would have expected. Here are the cases:
What we can see is, that binary_checksum seems to be especially weak to separate minor changes, especially changes of two letters, like in Hennessey and Tennessee. Another interesting observation is, that the value of the binary checksum is dependent on the input length: Short words result in smaller values thus raising the probability for collisions.
But of course we would not only take the city name into the key. However, with more fields, changes in one field could potentially still be compensated by a change in another field. Here are some combinations:
|City, State||Three collisions:
|City, ZIP||Seven collisions:
|City, ZIP, State||Two collisions:
For example, when using City and ZIP code for the hash (binary_checksum), a relocation from Hood, 95639 to Erie, 58029 would not be detected.
And what we can see from the City+State checksum is also a weakness of the binary_checksum function. In certain cases, moving a word from one field to another, still results in the same checksum. In this case “Augusta, West Virginia” and “West Augusta, Virginia” result in the same check sum. Since binary_checksum weights characters depending on their position, this depends on the length of the words.
To continue my investigations I also tried a list of German street names. In some cases, even hyphens between words are ignored like in “Sankt Florian Straße” and “Sankt-Florian-Straße”. Also, totally different street names may result in the same check sum, for example “Heinrich-Heyd-Straße” and “Dardesheimer Straße”.
It seems that binary_checksum gives more collisions than expected when being used with real world data sets. Don’t be dazzled by the amazing 4.29 billion possible output values! Since update errors are hard to find for end users and even harder to be tracked down by the developers, I wouldn’t recommend using binary_checksum as the checksum method for dimension updates. If you really have a large dimension (where update performance counts) with lots of columns, you may want to use a more reliable hash function like MD5 or SHA-1(using the hashbytes-function) or go for a field-by-field comparison.