Optimize PDW Queries with Query Hints (Reading Explain Plans)
Since the PDW is a fairly new product, at some points you have to help the optimizer to build better execution plans. Within this article we will learn how to read the explain plan and give query hints to speed up the select Statement. Requirement is a working copy of the AdventureWorksDW Database for PDW. Preparing the Table by adding some additional data Within a DWH it happens quite often that a Fact table references to the same dimension multiple times. This might occur when you buy a piece of hardware in a computer chain. There is always the initial store where you bought your computer and for the follow up, like repair stuff or modification, you can go to another store. Therefore we add the SalesTerritoryKey again and fill it up with some random Data. We also modify our shipdatekey with some random information.
- ALTER TABLE factinternetsales
- ADD salesterritorykey2 INT
- UPDATE factinternetsales
- SET salesterritorykey2 = CASE
- WHEN salesterritorykey IN ( 5, 8 ) THEN
- salesterritorykey
- WHEN salesterritorykey IN
- (SELECT Max(salesterritorykey)
- FROM factinternetsales) THEN
- salesterritorykey - 1
- ELSE salesterritorykey + 1
- END
- UPDATE factinternetsales
- SET shipdatekey = CASE
- WHEN shipdatekey % 10 = 1 THEN 20130501
- WHEN shipdatekey % 10 = 2 THEN 20130601
- WHEN shipdatekey % 10 = 3 THEN 20130701
- ELSE 20130801
- END
For the following analysis and optimization we update the statistics first, so we can make sure the query optimizer has the latest information.
- UPDATE STATISTICS factinternetsales
- UPDATE STATISTICS dimsalesterritory
- UPDATE STATISTICS dimpromotion
Querying the PDW Now, that all of the preparation is done, lets start with a simple query we would find in our daily business.
- SELECT Count(*),
- Sum(fis.salesamount),
- dst1.salesterritorycountry AS CountryFrom,
- dst2.salesterritorycountry AS CountryTo,
- dp.englishpromotiontype
- FROM factinternetsales fis
- INNER JOIN dimsalesterritory dst1
- ON fis.salesterritorykey = dst1.salesterritorykey
- INNER JOIN dimsalesterritory dst2
- ON fis.salesterritorykey2 = dst1.salesterritorykey
- INNER JOIN dimpromotion dp
- ON fis.promotionkey = dp.promotionkey
- GROUP BY dst1.salesterritorycountry,
- dst2.salesterritorycountry,
- dp.englishpromotiontype,
- fis.shipdatekey,
- fis.discountamount,
- fis.customerponumber,
- fis.customerkey
As we can see, the statement includes a couple joins, aggregations and groupings. The query itself executes in 3 seconds. If we check the execution plan, we will find something very interesting. In our scenario, the optimizer decides to use the DiscountAmount Column as Shuffle Column. A simple count and group query shows us that the cardinality of this column is 1, which lets the whole shuffle operation take place on 1 node.
- SELECT Count(*),
- discountamount
- FROM factinternetsales
- GROUP BY discountamount
Now in this case it doesn't really matter since it is just 3 seconds, but in other cases it could lead to minutes if not even hours. Let's use a different column with a better cardinality e.g. the CustomerKey column. A quick test shows that it is a great column for distribution.
- SELECT Count(*),
- customerkey
- FROM factinternetsales
- GROUP BY customerkey
PDW offers some great query hints to influence the execution of statements. Therefore we use the "DISTRIBUTED_AGG" hint within the group by clause to tell the PDW what column to shuffle on.
- SELECT Count(*),
- Sum(fis.salesamount),
- dst1.salesterritorycountry AS CountryFrom,
- dst2.salesterritorycountry AS CountryTo,
- dp.englishpromotiontype
- FROM factinternetsales fis
- INNER JOIN dimsalesterritory dst1
- ON fis.salesterritorykey = dst1.salesterritorykey
- INNER JOIN dimsalesterritory dst2
- ON fis.salesterritorykey2 = dst1.salesterritorykey
- INNER JOIN dimpromotion dp
- ON fis.promotionkey = dp.promotionkey
- GROUP BY dst1.salesterritorycountry,
- dst2.salesterritorycountry,
- dp.englishpromotiontype,
- fis.shipdatekey,
- fis.discountamount,
- fis.customerponumber,
- fis.customerkey with (distribution_agg)
This is what the newly created execution plan looks like. After doing so, the execution plan shows that it has changed the column and the execution doubled the performance.
Neuen Kommentar schreiben