26.06.2013

Optimize PDW Queries with Query Hints (Reading Explain Plans)

Technical Value

Since the PDW is a fairly new product, at some points you have to help the optimizer to build better execution plans. Within this article we will learn how to read the explain plan and give query hints to speed up the select Statement. Requirement is a working copy of the AdventureWorksDW Database for PDW. Preparing the Table by adding some additional data Within a DWH it happens quite often that a Fact table references to the same dimension multiple times. This might occur when you buy a piece of hardware in a computer chain. There is always the initial store where you bought your computer and for the follow up, like repair stuff or modification, you can go to another store. Therefore we add the SalesTerritoryKey again and fill it up with some random Data. We also modify our shipdatekey with some random information.

  1. ALTER TABLE factinternetsales
  2.   ADD salesterritorykey2 INT
  3.  
  4. UPDATE factinternetsales
  5. SET    salesterritorykey2 = CASE
  6.                               WHEN salesterritorykey IN ( 5, 8 ) THEN
  7.                               salesterritorykey
  8.                               WHEN salesterritorykey IN
  9.                                    (SELECT Max(salesterritorykey)
  10.                                     FROM   factinternetsales) THEN
  11.                               salesterritorykey - 1
  12.                               ELSE salesterritorykey + 1
  13.                             END
  14.  
  15. UPDATE factinternetsales
  16. SET    shipdatekey = CASE
  17.                        WHEN shipdatekey % 10 = 1 THEN 20130501
  18.                        WHEN shipdatekey % 10 = 2 THEN 20130601
  19.                        WHEN shipdatekey % 10 = 3 THEN 20130701
  20.                        ELSE 20130801
  21.                      END 

For the following analysis and optimization we update the statistics first, so we can make sure the query optimizer has the latest information.

  1. UPDATE STATISTICS factinternetsales
  2. UPDATE STATISTICS dimsalesterritory
  3. UPDATE STATISTICS dimpromotion 

Querying the PDW Now, that all of the preparation is done, lets start with a simple query we would find in our daily business.

  1. SELECT Count(*),
  2.        Sum(fis.salesamount),
  3.        dst1.salesterritorycountry AS CountryFrom,
  4.        dst2.salesterritorycountry AS CountryTo,
  5.        dp.englishpromotiontype
  6. FROM   factinternetsales fis
  7.        INNER JOIN dimsalesterritory dst1
  8.                ON fis.salesterritorykey = dst1.salesterritorykey
  9.        INNER JOIN dimsalesterritory dst2
  10.                ON fis.salesterritorykey2 = dst1.salesterritorykey
  11.        INNER JOIN dimpromotion dp
  12.                ON fis.promotionkey = dp.promotionkey
  13. GROUP  BY dst1.salesterritorycountry,
  14.           dst2.salesterritorycountry,
  15.           dp.englishpromotiontype,
  16.           fis.shipdatekey,
  17.           fis.discountamount,
  18.           fis.customerponumber,
  19.           fis.customerkey 

As we can see, the statement includes a couple joins, aggregations and groupings. The query itself executes in 3 seconds. If we check the execution plan, we will find something very interesting. Key1 In our scenario, the optimizer decides to use the DiscountAmount Column as Shuffle Column. A simple count and group query shows us that the cardinality of this column is 1, which lets the whole shuffle operation take place on 1 node.

  1. SELECT Count(*),
  2.        discountamount
  3. FROM   factinternetsales
  4. GROUP  BY discountamount 

Key2 Now in this case it doesn't really matter since it is just 3 seconds, but in other cases it could lead to minutes if not even hours. Let's use a different column with a better cardinality e.g. the CustomerKey column. A quick test shows that it is a great column for distribution.

  1. SELECT Count(*),
  2.        customerkey
  3. FROM   factinternetsales
  4. GROUP  BY customerkey 

Key3 PDW offers some great query hints to influence the execution of statements. Therefore we use the "DISTRIBUTED_AGG" hint within the group by clause to tell the PDW what column to shuffle on.

  1. SELECT Count(*),
  2.        Sum(fis.salesamount),
  3.        dst1.salesterritorycountry AS CountryFrom,
  4.        dst2.salesterritorycountry AS CountryTo,
  5.        dp.englishpromotiontype
  6. FROM   factinternetsales fis
  7.        INNER JOIN dimsalesterritory dst1
  8.                ON fis.salesterritorykey = dst1.salesterritorykey
  9.        INNER JOIN dimsalesterritory dst2
  10.                ON fis.salesterritorykey2 = dst1.salesterritorykey
  11.        INNER JOIN dimpromotion dp
  12.                ON fis.promotionkey = dp.promotionkey
  13. GROUP  BY dst1.salesterritorycountry,
  14.           dst2.salesterritorycountry,
  15.           dp.englishpromotiontype,
  16.           fis.shipdatekey,
  17.           fis.discountamount,
  18.           fis.customerponumber,
  19.           fis.customerkey with (distribution_agg) 

This is what the newly created execution plan looks like. Key4 After doing so, the execution plan shows that it has changed the column and the execution doubled the performance.

Teilen auf

Newsletter Anmeldung

Abonnieren Sie unseren Newsletter!
Lassen Sie sich regelmäßig über alle Neuigkeiten rundum ORAYLIS und die BI- & Big-Data-Branche informieren.

Jetzt anmelden