High Performance String Aggregation on Microsoft APS (PDW)
On normal SMP-SQL Servers there are several ways to perform string aggregations via XML or Pivot and dynamic SQL, some more or less efficient.
On APS/PDW you have to do something different. We want to have a rapid solution so we like “JOIN” and simple additions and we hate anything like “CASE”, or “ISNULL”.
Let’s start with a simple table:
We have several items with their rownumbers in one table. In order to get them side-by-side we create a matrix/table which has an ID, an empty string at its diagonal und a NULL-string in all other fields. Next we perform a join and get a table with the items in its diagonal because addition with the NULL-value results in a NULL value and addition of empty string and item results in the item.
One max-aggregation and one concat-operation later we have the result: one row and all items concatinated.
Click to enlarge: Now we modify this attempt to perform an aggregation of items in multiple lines.
The matrix gets a space-character instead of the empty string from row 2 on (or any other delimeter). Next we insert a column with a partitioned rownumber and join the matrix by this. By keeping the line information we can group by this linenumber. Next the concat and we have our little christmas-song.
Click to enlarge: Thanks to Miriam Funke who implemented and tested it on the PDW.
We used this concept on a 4-node-pdw to aggregate receipt-items for side-by-side analysis. The matrix allows up to 200 items per receipt and we reduced about 60 mio lines to 5 mio lines in about 2 minutes even though we needed to perform a dense_rank and a row_number at once to eliminate same items of the receipt in one step.
Merry Christmas!
Here is a query to test the concept on standard smp including creation of sample data (remove "--" in last line to execute query):
- DECLARE @i INT
- DECLARE @i_max INT = 100 --max aggregate items
- DECLARE @query NVARCHAR(max)
- /***** 1 create matrix for join *****/
- SET @query =
- ,
- CREATE TABLE #T1
- (
- rowid INT
- ,
- SET @i = 1
- WHILE @i <= @i_max
- BEGIN
- SET @query +=
- ,
- 'T'+CAST(@i AS NVARCHAR(10)) + ,NVARCHAR(1) ,
- SET @i += 1
- END
- SET @query +=
- ,
- )
- ,
- SET @i = 1
- WHILE @i <= @i_max
- BEGIN
- SET @query +=
- ,
- INSERT INTO #T1 (rowid, T'+CAST(@i AS NVARCHAR(10))+') SELECT '+CAST(@i AS NVARCHAR(10)) + ',''''
- ,
- SET @i+= 1
- END
- /***** CREATE SAMPLE DATA *****/
- SET @query +=
- ,
- CREATE TABLE #T2
- (
- ID INT IDENTITY(1,1),
- VBELN INT,
- POSNR NVARCHAR(10)
- )
- DECLARE @i INT = 1
- DECLARE @imax INT = 1000000
- WHILE @i <= @imax
- BEGIN
- INSERT INTO #T2
- SELECT ROUND(RAND(@i) * 30000,0), NULL
- SET @i+= 1
- END
- UPDATE T2 SET POSNR = POSNR_SOLL
- FROM #T2 AS T2
- JOIN
- (
- SELECT ID, POSNR_SOLL = CAST(ROW_NUMBER() OVER (PARTITION BY VBELN ORDER BY VBELN) AS nvarchar(10))
- FROM #T2 AS T2
- ) AS SOLL
- ON
- SOLL.ID = T2.ID
- ,
- /*************** MAIN TASK ***********/
- SET @query +=
- ,
- SELECT
- SINGULAR.VBELN,
- C_POSNR = CONCAT(
- ,
- SET @i = 1
- WHILE @i <= @i_max
- BEGIN
- IF @i = 1 SET @query += 'MAX(T'+CAST(@i AS NVARCHAR(10))+'+POSNR)'
- ELSE SET @query += ',MAX(+'',''+T'+CAST(@i AS NVARCHAR(10))+'+POSNR)'
- SET @i += 1
- END
- SET @query +=,
- )
- FROM
- (
- SELECT *, RN = ROW_NUMBER() OVER (PARTITION BY VBELN ORDER BY POSNR) FROM #T2
- ) AS SING
- JOIN #T1 AS PIV
- ON
- PIV.ROWID = SING.RN
- GROUP BY
- SINGULAR.VBELN
- ORDER BY
- SINGULAR.VBELN
- ,
- SET @query +=
- ,
- DROP TABLE #T1
- DROP TABLE #T2
- ,
- SELECT @query
- EXEC (@query)
Kommentare
My hat is off to you. This is very clever.
Neuen Kommentar schreiben