Hierarchies with text fields (attribute relationship)
SQL Server 2005-2012
When setting up hierarchies for multidimensional SSAS attribute relationships between the levels are highly recommended. There might be scenarios in which the source data breaks those attribute relationships and usually this can be solved by choosing a proper key, in some cases combined keys. I wrote about this topic in earlier posts (see here and here ).
In general you are in a good situation if the attribute relationship is modeled as a 1:n relationship between each corresponding level in the relational model of the source system. In this case you can rely on the keys of those tables. For example, in AdventureWorks we find tables like Product, ProductSubcategory and ProductCategory for the product hierarchy. Each table is a foreign key relationship to the next level, so we can be absolutely sure that we have true 1:n relation between products and their sub category and between the sub category and the category (using the keys from the relational databases). In our attribute relationship we can rely on those keys and no matter how the content of text fields changes in these tables, the attribute relationship for our hierarchies will still work.
However, and this is the topic of this post, you might be in a situation where the source systems only delivers some text fields and you have no information how these fields are linked in the source system. Maybe there isn’t any link at all and the text fields are nothing but additional columns at the same grain as your dimension key. If it’s a requirement to build a hierarchy using these text fields and if you want to set up attribute relationship correctly, changes in the source data might break your attribute relationship and therefore cause the dimension processing to fail, if you use the text fields as attribute key for these attributes.
To illustrate this situation, let’s take a look at the following data table which is just an example of an interface file (maybe a csv-file):
The requirement is to create the following drill down path (hierarchy) based on this data:
Manager —> Sales Territory —> State —> City —> Office
Checking the existing rows of data, the attribute relationship can be set up identically to this drill down path, as all of the following relationships are one-to-many:
- Office Id to city
- city to state
- state to sales territory
- sales territory to manager
So you could be tempted to create the attribute relationship based on those text fields and to use the text field itself as the key for the given attribute. The dimension will process correctly and everything seems to be fine. But in this case, with no proper keys from the source system, we have no guarantee at all, that the cardinality between those attributes will always be like this. For example, let’s assume, the following 3 new office are established:
The first row is a second office in New York. This line does not break any of the above relationship settings. But office 7 is managed by Linda Randall instead of Barbara Moreland. So the relationship between manager and sales territory is no longer 1:n but now n:m. And office 8 is newly-found in Florence, Kentucky, which is different from Florence, Alabama and therefore breaks the relationship between state and city. Within our four attribute relationships from above only two are still valid after inserting the rows for office 6-8. It may be possible to process the dimension correctly for some time, but at any time there is the risk that rows show up which break the relationship (offices 7 and 8 from the example above) causing the dimension process to fail with a duplicate key error.
The actual error will be somewhat like this:
Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_Office', Column: 'City', Value: 'Florence'. The attribute is 'City'.
You will find more information about his kind of error and also about other possible causes for this to happen here .
So, how can we prevent this to happen and how can we immunize our hierarchy against changes in the data? The solution is quite simple. If we don’t get reliable keys from our source system, we have to compute them by ourselves. One idea might be to use combined text fields, which would work, but for performance reasons we prefer integer keys for the attributes.
The easiest way to create these keys, is to establish a lookup table for each attribute as shown below. For each attribute, we build a look up key that contains all dependent attributes (from our hierarchy) and compute an artificial key (maybe by using an identity column). In our example, for the city, we have to include city, state, sales territory and manager in the lookup. I’m using a simple text field in this example, but in general a hash code (like MD5) would be a better solution.
Whenever a look up key is not found, a new row is inserted, thus resulting in a new key.
Using this look up table, we can enrich our original data with reliable keys as shown below:
And no matter how the data changes in our source table, we will still be able to create attribute keys that fulfill the 1:n requirement for attribute relationship.
Ok, I’m hearing you say, that this was an easy case. What can I do if I have multiple parallel hierarchies instead of only one? How can the keys be computed in such a case. Well, maybe you have already guessed it: Usually, if there are no reliable keys generated from the source system, there is no work around but to create the attributes per hierarchy. For example, if the users also wanted a simplified hierarchy like
we end up with a second attribute (column) for the sales territory. If you don’t create separate attributes you might end up with duplicated entries confusing the user of the hierarchy. This happens because elements in our dimension/hierarchy are shown by key. With a single column, our full hierarchy from above would be correct, while the simplified hierarchy would happen to look like this:
- Office 5
- Office 8
- Office 7
The reason for this is that using the hierarchy key, we end up the two different versions of “Southeast” (one for manager Barbara and one for Linda, in order to preserve the 1:n relationship to the manager). But of course, what the end user would expect is somewhat like this:
- Office 5
- Office 7
- Office 8
This explains, why we need separate attributes (and separate keys) for each of the hierarchies.
And again, all of the above is only necessary if we cannot fully rely on the 1:n relationship.