21.04.2016

SQL Performance Tuning by avoiding not-in-clauses

Technical Value

I have already written about left outer joins. Now I want to show you an example how to use left outer joins in order to write efficient SQL.

At one of our customers we have the following query that runs several hours:

  1. SELECT DISTINCT [INSERTTIMESTAMP] FROM DATATABLE WHERE [INSERTTIMESTAMP] NOT IN (SELECT [INSERTTIMESTAMP] FROM CTRLTABLE WHERE TARGET_TABLENAME = 'source_01' )

The query gets the distinct insert timestamps of DATATABLE that has not already been used in CTRLTABLE. The query is used for determining in a process flow which data has not already been processed. After processing the data the processed data would be inserted in the CTRLTABLE. The DATATABLE has about 40 million rows. The CTRLTABLE has about 400 rows. The execution plan of this query is shown below: SQL Performance Tuning In my opinion it is not advisable to use a not-in-construct in a where clause. If you can avoid it, you should avoid it. In this case you can rewrite the query by using a left outer join:

  1. SELECT DATATABLE.[INSERTTIMESTAMP] FROM DATATABLE LEFT OUTER JOIN CTRLTABLE ON DATATABLE.[INSERTTIMESTAMP] = CTRLTABLE.[INSERTTIMESTAMP] AND CTRLTABLE.TARGET_TABLENAME = 'source_01' WHERE CTRLTABLE.[INSERTTIMESTAMP] IS NULL GROUP BY DATATABLE.[INSERTTIMESTAMP]

Please note that the filter condition for target_tablename must be used in the on-clause and not in the where-clause. Otherwise we would always get an empty answer set because target_tablename cannot be null and „source_01“ at the same time.

The group by clause is used in order to get distinct values.

The execution plan for the rewritten query looks like this: SQL Performance Tuning The new execution plan looks simpler and the query runs only in about 10 seconds instead of several hours.

Kommentare

Andrej Kuklin
Mo, 25.04.2016 - 14:14

There is like a bazillion articles on the internet about NOT IN vs NOT EXISTS vs LEFT OUTER JOIN vs OUTER APPLY and so on.
This one from Aaron Bertrand is pretty good http://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join

NOT IN is indeed a suboptimal choice but more because of NULL semantics which is different than in other cases.
I bet your problem was actually a statistics misestimate on DATATABLE.INSERTTIMESTAMP column. You have masked it by "forcing" a (hash) join for the whole data first + filtering of not matched records afterwards. This may be fine for your current data constellation but is less optimal if your data distribution changes or if you create indexes for your TIMESTAMP columns (then the nested loops plan may become a better alternative).

That's why the NOT EXISTS solution is the one Aaron is suggesting. It can switch between different join types and also provides a much better understanding what the code intention actually is.

Another hint: the free Plan Explorer produces much better execution plans diagrams than Management Studio. For instance, it shows the estimated or actual row counts between operators. It's also a good idea to state explicitly whether the plan you're showing is an estimated or an actual one (the plans in the blog post should be actual).

Viele Grüße aus Frankfurt

Andrej

Neuen Kommentar schreiben

Der Inhalt dieses Feldes wird nicht öffentlich zugänglich angezeigt.

Klartext

  • Keine HTML-Tags erlaubt.
  • HTML - Zeilenumbrüche und Absätze werden automatisch erzeugt.
  • Web page addresses and email addresses turn into links automatically.
Teilen auf

Newsletter Anmeldung

Abonnieren Sie unseren Newsletter!
Lassen Sie sich regelmäßig über alle Neuigkeiten rundum ORAYLIS und die BI- & Big-Data-Branche informieren.

Jetzt anmelden

Unsere Website benutzt Cookies, um bestmögliche Funktionalitäten bieten zu können. Durch die Nutzung unserer Website, erklären Sie sich mit der Verwendung von Cookies einverstanden. In unserer Datenschutzerklärung finden Sie unter §5 Informationen und Links, wie Sie diesen Maßnahmen jederzeit widersprechen können.