Published on

September 9, 2007

De-Dupifying Record Sets in SQL Server 2005

Have you ever encountered a situation where you needed to remove duplicate rows from a record set in SQL Server? It can be a challenging task, especially when dealing with multiple duplicates or ragged numbers. In this article, we will explore a solution using the ROW_NUMBER() function and common table expressions in SQL Server 2005.

Let’s start by understanding the problem. Imagine you have a table called EMPLOYEE with columns EMPID, FNAME, LNAME, and REFDATE. This table contains duplicate rows, and you want to reduce them to a single row while keeping the earliest or latest duplicate record.

To demonstrate the solution, we will create an EMPLOYEE table in the TEMPDB database. Here is the SQL code to create and populate the table:

USE [TEMPDB]
GO

IF EXISTS (SELECT * FROM SYS.OBJECTS WHERE OBJECT_ID = OBJECT_ID(N'[DBO].[EMPLOYEE]') AND TYPE IN (N'U'))
    DROP TABLE [DBO].[EMPLOYEE]
GO

CREATE TABLE EMPLOYEE (
    EMPID INT,
    FNAME VARCHAR(50),
    LNAME VARCHAR(50),
    REFDATE DATETIME
)

INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME, REFDATE)
VALUES (2021110, 'MICHAEL', 'POLAND', '2006-01-18 09:41:08.297')

-- Insert more rows...

Now that we have our EMPLOYEE table set up, let’s use the ROW_NUMBER() function to assign a unique row number to each row, ordered by EMPID and REFDATE. This will help us identify the duplicates. Here is the SQL code to achieve this:

SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS ROWID, *
FROM EMPLOYEE

The result will be a new column ROWID that represents the row number for each row. We can now use this information to de-dupify the record set. Here is the SQL code to accomplish that:

WITH [EMPLOYEE ORDERED BY ROWID] AS (
    SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS ROWID, *
    FROM EMPLOYEE
    WHERE 1 = 1
)
DELETE FROM [EMPLOYEE ORDERED BY ROWID]
WHERE ROWID NOT IN (
    SELECT MIN(ROWID)
    FROM [EMPLOYEE ORDERED BY ROWID]
    GROUP BY EMPID, FNAME, LNAME
)

SELECT ROW_NUMBER() OVER (ORDER BY EMPID ASC, REFDATE ASC) AS ROWID, *
FROM EMPLOYEE

In the above code, we first create a common table expression called [EMPLOYEE ORDERED BY ROWID] that contains the row number and all columns from the EMPLOYEE table. We then delete the rows from this common table expression that have duplicate row numbers, keeping only the earliest record for each set of duplicates.

After the deletion, we can select the final record set to see the changes. The result will be a de-dupified record set with only the unique rows remaining.

It’s important to note that the use of semi-colons in the SQL code is crucial, especially when working with common table expressions. SQL Server 2005 requires the use of semi-colons to separate certain queries. Failure to include them may result in syntax errors.

In conclusion, the ROW_NUMBER() function and common table expressions in SQL Server 2005 provide a powerful solution for de-dupifying record sets. By leveraging these features, you can efficiently remove duplicate rows and retain the desired records based on your criteria. Give it a try and see how it simplifies your data manipulation tasks!

Click to rate this post!
[Total: 0 Average: 0]

Let's work together

Send us a message or book free introductory meeting with us using button below.