
Removing duplicate rows in large tables

09.02.2014 Hilmar Buchta

PDW 2012 | SQL Server 2012

Removing duplicate rows in large tables can be a challenging operation. As long as the table does not contain too many rows, several approaches will lead to the desired result. Recently, I had a table of about 2.5 billion rows and 125 columns. The table is located on a Microsoft Parallel Data Warehouse (PDW), so I had to perform the cleaning of the table using distributed SQL (DSQL).

For simplicity, let’s assume the table has columns key1, key2, key3, key4, key5 and col1, …, col120. The table is distributed by key1. The rows should be unique with respect to the five key columns, but this is not guaranteed, since the key columns do not form a database-enforced primary key.
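To make the setup concrete, a table definition along these lines could look as follows. This is only a sketch for illustration: the data types and the shortened column list are assumptions, not the actual table definition.

  -- Hypothetical table layout, for illustration only (data types are assumed)
  create table my_table
  (
  key1 bigint not null
  , key2 int not null
  , key3 int not null
  , key4 int not null
  , key5 int not null
  , col1 nvarchar(100)
  -- ... col2 through col119 omitted here ...
  , col120 nvarchar(100)
  )
  with (distribution = hash (key1), clustered columnstore index);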

Such tasks are usually easy to solve on a PDW using a CTAS (create-table-as-select) statement. For my example I used the following statement:

 

  create table new_table
  with (distribution = hash (key1), clustered columnstore index)
  as select * from
  (select *
  , row_number() over (partition by key1, key2, key3, key4, key5 order by col1) RowNr
  from my_table
  ) X
  where X.RowNr = 1

One remark regarding the window function (row_number() over): make sure that the distribution key is the first column in the partition list, so the computation of the row number can happen on each compute node separately, without re-distributing the data. And another remark regarding the “select * …”: usually this is not good style for writing SQL code, but in this case we really want to transfer all columns, and the DSQL pre-processor expands the * to the corresponding field list before sending the query to the compute nodes (this can be seen in the query log).
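To illustrate the point about the partition list, here is a small sketch against the same table. The first query keeps the distribution column key1 at the front of the partition list, so every partition lives entirely on one compute node and the row numbers (used here to count the surplus duplicate rows) can be computed locally. The commented-out variant drops key1 from the partition list and would force the appliance to shuffle data between the compute nodes first.

  -- Aligned with the distribution: key1 leads the partition list,
  -- so each compute node can number its rows without data movement.
  select count(*) from
  (select *
  , row_number() over (partition by key1, key2, key3, key4, key5 order by col1) RowNr
  from my_table
  ) X
  where X.RowNr > 1    -- number of surplus duplicate rows

  -- Not aligned: without key1 in the partition list, rows of one group may sit on
  -- different compute nodes, so the data has to be re-distributed before numbering:
  --   row_number() over (partition by key2, key3, key4, key5 order by col1)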

Running this query on the full set of 2.5 billion rows took about 1 hour 15 minutes to complete.

Analyzing the query details showed that the sub-query for finding the duplicate rows was easy for the PDW to solve. Wrapping the sub-select in a count returned after about 1 minute, giving the exact number of unique rows:

  select count(*) from
  (select *
  , row_number() over (
  partition by key1, key2, key3, key4, key5
  order by col1) RowNr
  from my_table
  ) X
  where X.RowNr = 1

So most of the time for this query is spent copying data from one table to another. While it’s usually a great idea to CTAS into a new table, it can be a bad choice in this scenario, where you have a lot of rows with many columns and relatively few duplicates.

Here is the approach I finally used:

  1. Write the keys of the duplicate rows to a new temporary table #DupKeyStore.
  2. CTAS only those rows from the original table whose keys also exist in #DupKeyStore into another temporary table #UniqueRowStore, removing the duplicates with the row_number approach from above.
  3. Delete all rows from the original table that match the keys in the #DupKeyStore table. After this step the original table no longer contains any duplicate rows – the rows that had duplicates are removed entirely.
  4. Insert the rows of the table #UniqueRowStore back into the original table.

Here is the code in more detail.

Step 1

  create table #DupKeyStore
  with (LOCATION = USER_DB, distribution = hash(key1))
  as
  select
  key1, key2, key3, key4, key5
  , convert(nvarchar, key1)
  + '-' + convert(nvarchar, key2)
  + '-' + convert(nvarchar, key3)
  + '-' + convert(nvarchar, key4)
  + '-' + convert(nvarchar, key5) KeyLookup
  from
  (select key1, key2, key3, key4, key5
  from my_table
  group by
  key1, key2, key3, key4, key5
  having count(*) > 1
  ) X

The query took 1 minute to complete on the full data set of 2.5 billion rows. The ‘KeyLookup’ column looks strange but I will explain this later in the process.

Step 2

  create table #UniqueRowStore
  with (LOCATION = USER_DB, distribution = hash(key1))
  as
  select * from
  (select Q.*
  , row_number() over (partition by
  Q.key1, Q.key2, Q.key3, Q.key4, Q.key5
  order by Q.key5) RowNr
  from my_table Q
  inner join #DupKeyStore K
  on
  Q.key1 = K.key1
  and Q.key2 = K.key2
  and Q.key3 = K.key3
  and Q.key4 = K.key4
  and Q.key5 = K.key5
  ) X
  where RowNr = 1

Again, this query took about 1.5 minutes to transfer the duplicate rows into a new table while removing duplicates. The main performance gain here is that the query does not need to transfer all rows, but only the duplicate ones. Another thing to notice here is that I’m using a heap for this intermediate table (not a columnstore index).

Step 3

  SET XACT_ABORT ON

  BEGIN TRANSACTION

  delete from my_table
  where
  convert(nvarchar, key1)
  + '-' + convert(nvarchar, key2)
  + '-' + convert(nvarchar, key3)
  + '-' + convert(nvarchar, key4)
  + '-' + convert(nvarchar, key5)
  IN (select KeyLookup from #DupKeyStore)

OK, this query needs some remarks. First, we use the concatenated KeyLookup column from above. I must admit that this looks somewhat ugly. It would be better to use a join in the delete statement, but this is currently not supported on the PDW. Another option would be to use EXISTS (select …) in the where condition, which would remove the ugly concatenation of the keys. However, during my tests, this approach took a lot more time to complete. So I ended up with the statement from above.
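For reference, a sketch of the EXISTS variant mentioned above could look like the following. It is shown only for comparison, since it was the slower option in my tests.

  -- Alternative: correlate on the key columns instead of the concatenated
  -- KeyLookup string (slower in my tests, shown for comparison only)
  delete from my_table
  where exists
  (select 1 from #DupKeyStore K
  where my_table.key1 = K.key1
  and my_table.key2 = K.key2
  and my_table.key3 = K.key3
  and my_table.key4 = K.key4
  and my_table.key5 = K.key5)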

For my example, the query took about 1 minute to complete.

Step 4

  insert into my_table
  select key1, key2, key3, key4, key5, col1, col2, … col120
  from #UniqueRowStore

  COMMIT;

The insert was quite fast, only taking 30 seconds.

At a total of about 4 minutes, the 2.5 billion rows were cleaned of duplicates. Compared to the 1 hour 15 minutes of the first approach, this is a huge performance boost. And having more compute nodes would result in a much shorter query time for this task.

Before extending this solution to even more rows, consider the following topics:

  1. Check the number of duplicates first (table #DupKeyStore). If there are too many duplicates (I would say more than 5% of the total rows), consider the CTAS operation instead of the delete/insert approach.
  2. The delete/insert operations require tempdb space. For a larger amount of rows, these statements should be split into batches, with one transaction for a bunch of keys.
    Since delete top(nnn) is not supported on PDW, SET ROWCOUNT does not work here either, and you also don’t have a @@ROWCOUNT, a good approach is to add a “cluster column” (e.g. row_number() modulo something) to #DupKeyStore and to use this column for splitting the work into batches, as sketched after this list. Remember to wrap each delete and the corresponding insert into a single transaction. See this post by Stephan Köppen for details.
  3. CTAS operations writing to clustered columnstore index tables perform better when using a higher resource class. This could also be considered when using workarounds like the one from above.
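A minimal sketch of such a batching approach, assuming for illustration 10 batches and a hypothetical column name BatchNr (the column name, the number of batches and the abbreviated column list are illustrative assumptions, not part of the original solution):

  -- Step 1 variant: add a hypothetical BatchNr column (values 0..9) to #DupKeyStore
  create table #DupKeyStore
  with (LOCATION = USER_DB, distribution = hash(key1))
  as
  select
  key1, key2, key3, key4, key5
  , convert(nvarchar, key1)
  + '-' + convert(nvarchar, key2)
  + '-' + convert(nvarchar, key3)
  + '-' + convert(nvarchar, key4)
  + '-' + convert(nvarchar, key5) KeyLookup
  , row_number() over (order by key1) % 10 BatchNr
  from
  (select key1, key2, key3, key4, key5
  from my_table
  group by key1, key2, key3, key4, key5
  having count(*) > 1
  ) X

  -- Per batch (shown for BatchNr = 0; repeat for the remaining batch numbers),
  -- delete and re-insert inside one transaction:
  BEGIN TRANSACTION

  delete from my_table
  where
  convert(nvarchar, key1)
  + '-' + convert(nvarchar, key2)
  + '-' + convert(nvarchar, key3)
  + '-' + convert(nvarchar, key4)
  + '-' + convert(nvarchar, key5)
  IN (select KeyLookup from #DupKeyStore where BatchNr = 0)

  insert into my_table
  select U.key1, U.key2, U.key3, U.key4, U.key5, U.col1, U.col2, … U.col120
  from #UniqueRowStore U
  inner join #DupKeyStore K
  on U.key1 = K.key1 and U.key2 = K.key2 and U.key3 = K.key3
  and U.key4 = K.key4 and U.key5 = K.key5
  where K.BatchNr = 0

  COMMIT;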

So, that was it for today’s post. Ah, no, I’ve forgotten something:

  drop table #DupKeyStore;
  drop table #UniqueRowStore;

 
