Published on

July 8, 2010

SQL Server: Deduplicating Data Using Ranking Functions

SQL Server provides powerful functions that allow you to generate listings with generated numbers based on sort orders, providing keys such as row numbers and ranks. These functions can be used to deduplicate data in a simple and efficient way.

Let’s consider a simple example of a list of characters:

A
A
B
B
C
D
D
E

We can put this list into a table in SQL Server:

CREATE TABLE AlphaList (
    AlphaKey CHAR
);

INSERT INTO AlphaList (AlphaKey) VALUES ('A');
INSERT INTO AlphaList (AlphaKey) VALUES ('A');
INSERT INTO AlphaList (AlphaKey) VALUES ('B');
INSERT INTO AlphaList (AlphaKey) VALUES ('B');
INSERT INTO AlphaList (AlphaKey) VALUES ('C');
INSERT INTO AlphaList (AlphaKey) VALUES ('D');
INSERT INTO AlphaList (AlphaKey) VALUES ('D');
INSERT INTO AlphaList (AlphaKey) VALUES ('E');

If we use the ROW_NUMBER function, we can get a listing of the AlphaList table with a sequential list of row numbers:

SELECT ROW_NUMBER() OVER (ORDER BY AlphaKey) AS RowNumber, AlphaKey
FROM AlphaList;

This will give us the following result:

RowNumber   AlphaKey
1           A
2           A
3           B
4           B
5           C
6           D
7           D
8           E

Now, if we add the RANK function to this query, we can get an additional column ranking the data. The values of ‘A’ will be assigned a rank of 1 as they are equal, the values of ‘B’ will be assigned the next rank, and so on:

SELECT RANK() OVER (ORDER BY AlphaKey) AS Rank, ROW_NUMBER() OVER (ORDER BY AlphaKey) AS RowNumber, AlphaKey
FROM AlphaList;

This will give us the following result:

Rank    RowNumber   AlphaKey
1       1           A
1       2           A
3       3           B
3       4           B
5       5           C
6       6           D
6       7           D
8       8           E

From this, we can see that the rows containing the AlphaKey column values of ‘A’, ‘B’, and ‘D’ have the same values generated by the RANK and ROW_NUMBER functions, giving us a simple way of deduplicating the data.

This process can be very useful in practical scenarios. For example, it can be used to extract missing data from tables and backfill dimension tables. Let’s say we have an Item dimension table and we want to extract all the missing Item numbers from the SalesInvoices table:

SELECT ItemNumber, UnitCost
INTO MissingItems
FROM SalesInvoices
WHERE ItemNumber NOT IN (SELECT ItemNumber FROM dimItem);

This will produce a table containing all the missing item numbers together with their unit cost. The extracted data could look something like this:

ItemNumber   UnitCost
777          10.10
777          11.11
777          12.12
888          13.13
888          14.14
888          15.15
999          16.16
999          17.17
999          18.18

We can then use the RANK and ROW_NUMBER functions to deduplicate the data and obtain the lowest unit price for each item number:

WITH GetMissingItems AS (
    SELECT RANK() OVER (ORDER BY ItemNumber, UnitCost) AS Rank, ROW_NUMBER() OVER (ORDER BY ItemNumber, UnitCost) AS RowNumber, ItemNumber, UnitCost
    FROM MissingItems
)
SELECT ItemNumber, UnitCost
FROM GetMissingItems
WHERE Rank = RowNumber;

This will give us the following result:

ItemNumber   UnitCost
777          10.10
888          13.13
999          16.16

In conclusion, using ranking functions in SQL Server is a simple and efficient way to deduplicate data. It allows you to order and classify data simultaneously, making it easy to identify and remove duplicate values. This process can be applied to various scenarios, such as backfilling dimension tables or extracting missing data. By leveraging the power of SQL Server’s ranking functions, you can streamline your data deduplication tasks and improve the performance of your queries.

Click to rate this post!
[Total: 0 Average: 0]

Let's work together

Send us a message or book free introductory meeting with us using button below.