Monday, 22 March 2010

SQL Server: Deleting duplicates using a Common Table Expression (CTE)

In a recent project I had to figure out how to delete a series of duplicates from a tablet to prevent a report becoming bloated with them.

There's more than one way to skin a cat, but I think I have found a pretty efficient way of deleting dupes. Of course it would be preferable to avoid creating the duplicate rows in the first place, but sometimes you just have to delete them after they have been created.

A recent development in the SQL Server world is that of Common Table Expressions (CTEs). These can be used to help you delete duplicates using a single command, rather than using a temporary table or too many joins.

The CTE is formed by selecting the table into a temporary result set which you can then instantly relate back to the original table to find which rows to delete. You could use syntax similar to the following:

WITH Duplicates_CTE
(FirstName, LastName, Email, Phone, id)
AS (
SELECT FirstName, LastName, Email, Phone, MIN(PersonId) id
FROM tblPerson GROUP BY FirstName, LastName, Email, Phone
SELECT id FROM Duplicates_CTE

The first select statement gets a distinct result set (without duplicate rows) from the problem table by using the MIN function and a GROUP BY clause. Then you can join back to the original table from the CTE result set and delete every row that does not correlate to the CTE - which means that you're left without any duplicate rows.

Et voila, simples! No duplicates and it's all done in one command.