2012-02-20

TSQL DISTINCT qualifier

Using the DISTINCT qualifier in the SELECT clause typically causes a late sort in SQL Server's execution plan.  Since that sort contains duplicated rows, it is inefficient.  There are alternatives that may be more efficient.
1. Using GROUP BY - this strategy uses GROUP BY across all of the items in the SELECT clause.  Duplicates will be aggregated, with only one occurrence in the result set.  The execution plan must be verified; specifically, that the late DISTINCT SORT has now moved to an earlier SORT.
2. Using the EXISTS() function - This works best if the items in the SELECT clause are from one table.  It can sometimes be done across multiple tables, though I don't know of a generalized form for re-writing the multi-table query. In the single-table case, the form is:

SELECT DISTINCT S.c1, S.c2, ..., S.cn
FROM tableName1 S
    JOIN tableName2 T ON S.i1 = T.i1
    ...
WHERE (... other constraints ...)

<<  which  becomes  >>

SELECT S.c1, S.c2, ..., S.cn
FROM tableName1 S
WHERE EXISTS(SELECT 1
  FROM tableName1 S2 
     JOIN tableName2 T ON S2.i1 = T.i1
     ...  
  WHERE S.PK = S2.PK
     AND (... other constraints ...) )


The efficiency of this re-write is best demonstrated by the actual execution time of the query.  The execution plan doesn't have a way to show that the query within the EXISTS function is actually a series of short-circuit ORs.  Once any one is made, it does not need to further evaluate the condition.  The PK field in the above query represents the primary key field of table S.

-- Steve

No comments:

Post a Comment