SQL Server provides powerful functions that allow you to generate listings with generated numbers based on sort orders, providing keys such as row numbers and ranks. These functions can be used to deduplicate data in a simple and efficient way.
Let’s consider a simple example of a list of characters:
A A B B C D D E
We can put this list into a table in SQL Server:
CREATE TABLE AlphaList (
AlphaKey CHAR
);
INSERT INTO AlphaList (AlphaKey) VALUES ('A');
INSERT INTO AlphaList (AlphaKey) VALUES ('A');
INSERT INTO AlphaList (AlphaKey) VALUES ('B');
INSERT INTO AlphaList (AlphaKey) VALUES ('B');
INSERT INTO AlphaList (AlphaKey) VALUES ('C');
INSERT INTO AlphaList (AlphaKey) VALUES ('D');
INSERT INTO AlphaList (AlphaKey) VALUES ('D');
INSERT INTO AlphaList (AlphaKey) VALUES ('E');
If we use the ROW_NUMBER function, we can get a listing of the AlphaList table with a sequential list of row numbers:
SELECT ROW_NUMBER() OVER (ORDER BY AlphaKey) AS RowNumber, AlphaKey FROM AlphaList;
This will give us the following result:
RowNumber AlphaKey 1 A 2 A 3 B 4 B 5 C 6 D 7 D 8 E
Now, if we add the RANK function to this query, we can get an additional column ranking the data. The values of ‘A’ will be assigned a rank of 1 as they are equal, the values of ‘B’ will be assigned the next rank, and so on:
SELECT RANK() OVER (ORDER BY AlphaKey) AS Rank, ROW_NUMBER() OVER (ORDER BY AlphaKey) AS RowNumber, AlphaKey FROM AlphaList;
This will give us the following result:
Rank RowNumber AlphaKey 1 1 A 1 2 A 3 3 B 3 4 B 5 5 C 6 6 D 6 7 D 8 8 E
From this, we can see that the rows containing the AlphaKey column values of ‘A’, ‘B’, and ‘D’ have the same values generated by the RANK and ROW_NUMBER functions, giving us a simple way of deduplicating the data.
This process can be very useful in practical scenarios. For example, it can be used to extract missing data from tables and backfill dimension tables. Let’s say we have an Item dimension table and we want to extract all the missing Item numbers from the SalesInvoices table:
SELECT ItemNumber, UnitCost INTO MissingItems FROM SalesInvoices WHERE ItemNumber NOT IN (SELECT ItemNumber FROM dimItem);
This will produce a table containing all the missing item numbers together with their unit cost. The extracted data could look something like this:
ItemNumber UnitCost 777 10.10 777 11.11 777 12.12 888 13.13 888 14.14 888 15.15 999 16.16 999 17.17 999 18.18
We can then use the RANK and ROW_NUMBER functions to deduplicate the data and obtain the lowest unit price for each item number:
WITH GetMissingItems AS (
SELECT RANK() OVER (ORDER BY ItemNumber, UnitCost) AS Rank, ROW_NUMBER() OVER (ORDER BY ItemNumber, UnitCost) AS RowNumber, ItemNumber, UnitCost
FROM MissingItems
)
SELECT ItemNumber, UnitCost
FROM GetMissingItems
WHERE Rank = RowNumber;
This will give us the following result:
ItemNumber UnitCost 777 10.10 888 13.13 999 16.16
In conclusion, using ranking functions in SQL Server is a simple and efficient way to deduplicate data. It allows you to order and classify data simultaneously, making it easy to identify and remove duplicate values. This process can be applied to various scenarios, such as backfilling dimension tables or extracting missing data. By leveraging the power of SQL Server’s ranking functions, you can streamline your data deduplication tasks and improve the performance of your queries.