Understanding incompatible aggregations within PDW
One of the great benefits of the PDW is the parallel execution on each compute node. With the right distribution and a matching query PDW can execute the query and return the data right away. But there might be some scenarios where your query groups by non distribution columns and PDW starts to shuffle the data.
This time we are going to understand the reason why a shuffle move is necessary. In our example we image a Fact Table which we are going to aggregate and group. The Table shows a Clienttable with ID, Name and Town of the particular customer.
When we are going to execute the following query, one could think that it is going be fully resolved on each compute node since no movement is necessary, but this is not the case.
- Select count(*), ClientName, Town from Fakt_A group by ClientName, Town
What happens is the following:
PDW can not correlate the two group by columns and imagines the highest possible amount of cardinal values which would be 6 distinct client names and 5 distinct town names.
That the actual amount of distinct values between this combination might be way lower is not considered. What the PDW does, is create a table via CTAS on each node and aggregate for the different combination on the available compute nodes.
After the aggregation is done, it can send the rows towards the compute node. In most cases this is way faster, since the data doesn't need to be pre aggregated on each node and aggregated again on the compute node, but be directly streamed towards the control node, where it can be processed right away.