Automated Data Quality Checking, Part 2
As mentioned in my previous article about automated data quality checking in a BI environment, I recently needed to compare quite a number of sources with various restrictions / filters, but with very similar queries.
To reduce the effort required to maintain and perform the comparisons, I designed a solution that allowed the definition of highly parameterized queries (including source tables, column sets, etc.). These could be reused by simply setting the corresponding parameters.
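The idea of a reusable, parameterized query can be sketched as follows. This is a minimal illustration, not the actual implementation; the template, table and column names are assumptions, and real code would need proper parameter escaping rather than plain string substitution.

```python
# Hypothetical sketch of a parameterized query template. The placeholder
# names and the example table/column names are illustrative only.
# Note: plain string substitution is used here for brevity; production code
# should validate or escape parameter values.
QUERY_TEMPLATE = (
    "SELECT COUNT(*) AS cnt, SUM({measure_column}) AS total "
    "FROM {source_table} "
    "WHERE load_date BETWEEN '{date_from}' AND '{date_to}'"
)

def build_query(template, params):
    """Fill the placeholders of a query template with parameter values."""
    return template.format(**params)

query = build_query(QUERY_TEMPLATE, {
    "measure_column": "revenue",
    "source_table": "sales_fact",
    "date_from": "2014-01-01",
    "date_to": "2014-01-31",
})
```

The same template can then serve every source system by swapping only the parameter set.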
Since, in addition to the queries already mentioned, various other settings and values had to be configured and stored, I defined a relational database model that covered the whole configuration of the measuring process. It included tables for connections, queries, violation rules and, among other settings, the results of the performed checks. Additionally, I created a small configuration tool that helped to maintain all settings and their dependencies.

First, I maintained the connections for each environment that had to be considered during the comparisons. Then I maintained the queries (either in SQL or in MDX) and their parameters. With this done, I defined "measurements", which mapped the defined queries to connections and stored the required parameter values, such as source table, timespan, or certain flags / statuses / events that had to be considered.
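A simplified version of such a configuration model could look like the SQLite sketch below. The table and column names are my assumptions for illustration; the original solution's actual schema is not shown in the article.

```python
import sqlite3

# Simplified, illustrative sketch of the configuration model.
# All table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE connection (
    connection_id INTEGER PRIMARY KEY,
    environment   TEXT NOT NULL,          -- e.g. DEV / TEST / PROD
    conn_string   TEXT NOT NULL
);
CREATE TABLE query (
    query_id  INTEGER PRIMARY KEY,
    language  TEXT CHECK (language IN ('SQL', 'MDX')),
    template  TEXT NOT NULL               -- parameterized query text
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    query_id       INTEGER REFERENCES query,
    connection_id  INTEGER REFERENCES connection,
    parameters     TEXT                   -- e.g. source table, timespan, flags
);
CREATE TABLE comparison (
    comparison_id  INTEGER PRIMARY KEY,
    measurement_a  INTEGER REFERENCES measurement,
    measurement_b  INTEGER,               -- NULL when comparing to an expected value
    expected_value REAL,
    group_id       INTEGER
);
CREATE TABLE violation_rule (
    rule_id       INTEGER PRIMARY KEY,
    comparison_id INTEGER REFERENCES comparison,
    max_deviation REAL,                   -- acceptable limit
    is_relative   INTEGER,                -- 0 = absolute, 1 = relative
    severity      TEXT
);
CREATE TABLE result (
    result_id      INTEGER PRIMARY KEY,
    measurement_id INTEGER REFERENCES measurement,
    executed_at    TEXT,
    value          REAL
);
""")
```

Keeping connections, queries, measurements and rules in separate tables is what makes the later steps purely configuration-driven.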
In the next step, I maintained comparisons, either between two measurements or between a measurement and an expected value (e.g. the average of the previous month), and defined which deviations (absolute or relative) were still acceptable and which had to be considered a violation of a certain severity. To bundle comparisons that technically belonged together, I defined groups. Each group was assigned to a reporting cluster, which was later used within one of the reports.
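The deviation and severity logic can be sketched like this. The threshold values and severity labels are invented for illustration; the real limits lived in the configuration tables.

```python
# Illustrative sketch: classify a deviation against configured limits.
# Thresholds and severity labels below are assumptions, not the original config.
def deviation(actual, reference, relative):
    """Absolute difference, or relative difference w.r.t. the reference value."""
    diff = actual - reference
    return diff / reference if relative and reference != 0 else diff

def severity(dev, limits):
    """Return the severity whose limit is exceeded, or None if acceptable.
    `limits` must be sorted from highest to lowest threshold."""
    for threshold, label in limits:
        if abs(dev) > threshold:
            return label
    return None

# Example relative limits (assumed): >10% high, >5% medium, >1% low.
LIMITS = [(0.10, "high"), (0.05, "medium"), (0.01, "low")]

dev = deviation(actual=1050.0, reference=1000.0, relative=True)  # 0.05
print(severity(dev, LIMITS))  # prints "low" (0.05 exceeds 0.01 but not 0.05)
```

A 5% deviation is therefore recorded, but only with the lowest severity; the configuration decides where "acceptable" ends.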
With this done, I defined an execution plan that specified which group of comparisons had to be executed on which day of the week and at which time of day.
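Conceptually, an execution plan entry is just a (group, weekday, time) tuple, and evaluating it means matching the current timestamp against the plan. The entries below are made up for illustration.

```python
from datetime import datetime

# Sketch of an execution plan: which comparison group runs on which weekday
# at which time. Group IDs and schedule entries are illustrative assumptions.
EXECUTION_PLAN = [
    # (group_id, weekday [0 = Monday], hour, minute)
    (1, 0, 6, 0),    # group 1: Mondays at 06:00
    (2, 4, 18, 30),  # group 2: Fridays at 18:30
]

def due_groups(now, plan=EXECUTION_PLAN):
    """Return the IDs of all groups scheduled for the given timestamp."""
    return [g for g, wd, h, m in plan
            if wd == now.weekday() and h == now.hour and m == now.minute]

print(due_groups(datetime(2014, 3, 3, 6, 0)))  # a Monday 06:00 -> prints [1]
```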
So far so good: the basic configuration was complete. Now the comparisons had to be executed. To do so, I created SSIS packages that were called by a SQL Agent job, which ran every 30 minutes on every day of the week. At the beginning of the master package, all comparison groups due at the current time were determined. The resulting group IDs were then passed to the corresponding child package, which resolved the required queries, their parameters and connection strings, executed the queries and stored the results in the corresponding table. Once all measurements had been executed, a procedure calculated the differences between the defined measurements and stored them in a table as well. Finally, another procedure compared the calculated differences against the defined limits and, in case of a violation, determined the severity and updated the previously created deviation record accordingly.
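Put together, the control flow of the master and child packages can be sketched in plain Python. This stands in for the SSIS packages and stored procedures only conceptually; `execute_query` and the data shapes are placeholders, not the real components.

```python
# Pseudostructure of the SSIS control flow, rendered in Python.
# `execute_query` and the measurement/comparison shapes are placeholders.
def run_due_groups(group_ids, measurements, execute_query):
    """Execute every measurement of the due groups and collect the results."""
    results = {}
    for gid in group_ids:
        for m in measurements.get(gid, []):
            # child package: resolve query, parameters and connection, then run
            results[m["id"]] = execute_query(m["query"], m["params"], m["conn"])
    return results

def evaluate(results, comparisons, limit):
    """Compute differences between measurements and flag limit violations."""
    violations = []
    for a, b in comparisons:
        diff = results[a] - results[b]
        if abs(diff) > limit:
            violations.append((a, b, diff))
    return violations
```

In the real solution, the equivalent of `evaluate` was split across two procedures: one storing the differences, one rating them against the configured limits.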
To get a quick overview of the results, I created two parameterized reports in SSRS. The first was a high-severity dashboard, which displayed the number of violations of the highest severity, grouped by reporting cluster. It also included an overview of the number of occurrences in the past 31 days as well as a trend line per reporting cluster. The second report displayed all comparison results in detail, grouped by severity. The order of the records within a severity group could be adjusted individually by maintaining priorities on certain comparisons. By defining a subscription on this report, I received it via mail at certain times of the day. These subscriptions were parameterized as well, so each mail contained only the results of the most recently executed measuring job.
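The dashboard aggregation boils down to counting highest-severity violations per reporting cluster within a rolling window. The sketch below illustrates that with made-up records; the cluster names, severities and dates are purely illustrative.

```python
from collections import Counter
from datetime import date, timedelta

# Sketch of the dashboard aggregation: count highest-severity violations
# per reporting cluster over the past 31 days. All records are made up.
violations = [
    # (reporting_cluster, severity, occurred_on)
    ("Finance", "high", date(2014, 3, 1)),
    ("Finance", "high", date(2014, 3, 2)),
    ("Sales",   "low",  date(2014, 3, 2)),
    ("Sales",   "high", date(2013, 1, 1)),  # outside the 31-day window
]

def dashboard(records, today):
    """Count 'high' severity violations per cluster within the last 31 days."""
    cutoff = today - timedelta(days=31)
    return Counter(cluster for cluster, sev, d in records
                   if sev == "high" and d >= cutoff)

print(dashboard(violations, date(2014, 3, 10)))  # Counter({'Finance': 2})
```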
These reports were a great help when analysing missing or faulty data, and since the mails were sent automatically, the required information was available in time.