Published on

December 29, 2009

Exploring SQL Server Concepts: Ranking and Windowing Functions

SQL Server 2005 offers a range of ranking and windowing functions that can be used to evaluate data and return specific rows. These functions are particularly useful when analyzing the history of product releases or any other scenario where you need to identify the latest or maximum values for specific attributes.

Let’s consider a scenario where we have a table called Production.ProductVersion that stores information about different versions of a product. Each version is identified by a combination of attributes: Version, MinorVersion, and ReleaseVersion. We want to retrieve the latest version of each product along with its associated attributes.

One approach to solving this problem is to use an ORDER BY clause in a query. However, this approach would return all the versions of a product, not just the latest one. To overcome this limitation, we can utilize the ROW_NUMBER() windowing function.

Let’s take a look at an example using the AdventureWorks database. We’ll use the Production.ProductCostHistory table to create a mockup of product versions and costs. We’ll then populate the Production.ProductVersion table with this data.

CREATE TABLE Production.ProductVersion
(
	ProductID int NOT NULL,
	Version int NOT NULL,
	MinorVersion int NOT NULL,
	ReleaseVersion int NOT NULL,
	StandardCost numeric(30, 4) NOT NULL,
	CONSTRAINT PK_ProductVersion PRIMARY KEY CLUSTERED 
	(
		ProductID ASC,
		Version ASC,
		MinorVersion ASC,
		ReleaseVersion ASC
	)
);

-- Populate the table with randomized data
WITH ProductVersion AS
(
	SELECT ProductID,
		1 AS Version,
		CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
		CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
		CAST(StandardCost AS NUMERIC(30,4)) AS StandardCost
	FROM Production.ProductCostHistory WITH (NOLOCK)
	UNION ALL
	SELECT
		ProductID,
		ABS(CHECKSUM(NEWID())) % 3 AS Version,
		CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
		CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
		CAST((CAST(StandardCost AS NUMERIC(30,4)) * 1.10) AS NUMERIC(30,4)) AS StandardCost
	FROM Production.ProductCostHistory WITH (NOLOCK)
	UNION ALL
	SELECT
		ProductID,
		ABS(CHECKSUM(NEWID())) % 5 AS Version,
		CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
		CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
		CAST((CAST(StandardCost AS NUMERIC(30,4)) * 2.10) AS NUMERIC(30,4)) AS StandardCost
	FROM Production.ProductCostHistory WITH (NOLOCK)
)
INSERT INTO Production.ProductVersion
SELECT ProductID,
	Version,
	MinorVersion,
	ReleaseVersion,
	MAX(StandardCost) AS StandardCost
FROM ProductVersion
GROUP BY ProductID,
	Version,
	MinorVersion,
	ReleaseVersion;

Now that we have our data, let’s compare two different approaches to retrieve the latest version of each product.

Approach 1: Subquery Implementation

One way to achieve our goal is by using nested subqueries. This approach involves multiple subqueries to retrieve the maximum values for each attribute.

SELECT ProductID,
	(
		SELECT MAX(Version)
		FROM Production.ProductVersion pv2 WITH (NOLOCK)
		WHERE pv2.ProductID = pv.ProductID
	) AS Version,
	(
		SELECT MAX(MinorVersion)
		FROM Production.ProductVersion pv3 WITH (NOLOCK)
		WHERE pv3.ProductID = pv.ProductID
			AND pv3.Version = (
				SELECT MAX(Version)
				FROM Production.ProductVersion pv2 WITH (NOLOCK)
				WHERE pv2.ProductID = pv.ProductID
			)
	) AS MinorVersion,
	(
		SELECT MAX(ReleaseVersion)
		FROM Production.ProductVersion pv4 WITH (NOLOCK)
		WHERE pv4.ProductID = pv.ProductID
			AND pv4.Version = (
				SELECT MAX(Version)
				FROM Production.ProductVersion pv2 WITH (NOLOCK)
				WHERE pv2.ProductID = pv.ProductID
			)
			AND pv4.MinorVersion = (
				SELECT MAX(MinorVersion)
				FROM Production.ProductVersion pv3 WITH (NOLOCK) 
				WHERE pv3.ProductID = pv.ProductID
					AND pv3.Version = (
						SELECT MAX(Version)
						FROM Production.ProductVersion pv2 WITH (NOLOCK)
						WHERE pv2.ProductID = pv.ProductID
					)
			)
	) AS ReleaseVersion,
	(
		SELECT StandardCost
		FROM Production.ProductVersion pv5 WITH (NOLOCK)
		WHERE pv5.ProductID = pv.ProductID
			AND pv5.Version = (
				SELECT MAX(Version)
				FROM Production.ProductVersion pv2 WITH (NOLOCK)
				WHERE pv2.ProductID = pv.ProductID
			)
			AND pv5.MinorVersion = (
				SELECT MAX(MinorVersion)
				FROM Production.ProductVersion pv3 WITH (NOLOCK)
				WHERE pv3.ProductID = pv.ProductID
					AND pv3.Version = (
						SELECT MAX(Version)
						FROM Production.ProductVersion pv2 WITH (NOLOCK)
						WHERE pv2.ProductID = pv.ProductID
					)
			)
			AND pv5.ReleaseVersion = (
				SELECT MAX(ReleaseVersion)
				FROM Production.ProductVersion pv4 WITH (NOLOCK)
				WHERE pv4.ProductID = pv.ProductID
					AND pv4.Version = (
						SELECT MAX(Version)
						FROM Production.ProductVersion pv2 WITH (NOLOCK)
						WHERE pv2.ProductID = pv.ProductID
					)
					AND pv4.MinorVersion = (
						SELECT MAX(MinorVersion)
						FROM Production.ProductVersion pv3 WITH (NOLOCK)
						WHERE pv3.ProductID = pv.ProductID
							AND pv3.Version = (
								SELECT MAX(Version)
								FROM Production.ProductVersion pv2 WITH (NOLOCK)
								WHERE pv2.ProductID = pv.ProductID
							)
					)
			)
	) AS StandardCost
FROM Production.ProductVersion pv WITH (NOLOCK)
GROUP BY ProductID;

Approach 2: ROW_NUMBER() Implementation

An alternative and more efficient approach is to use the ROW_NUMBER() function. This function assigns a unique row number to each row based on the specified ordering. By partitioning the rows by the ProductID and ordering them by the attributes in descending order, we can easily retrieve the latest version for each product.

WITH RowExample1 AS
(
	SELECT ROW_NUMBER() OVER(PARTITION BY ProductID
		ORDER BY ProductID,
			Version DESC,
			MinorVersion DESC,
			ReleaseVersion DESC
	) AS MaxVersion,
	ProductID,
	Version,
	MinorVersion,
	ReleaseVersion,
	StandardCost
	FROM Production.ProductVersion pv WITH (NOLOCK)
)
SELECT ProductID,
	Version,
	MinorVersion,
	ReleaseVersion,
	StandardCost
FROM RowExample1
WHERE MaxVersion = 1
ORDER BY ProductID;

Comparing the two approaches, the ROW_NUMBER() implementation is more readable, manageable, and efficient. It eliminates the need for complex nested subqueries and ensures the integrity of the data at the row level. Additionally, it remains relatively independent of indexing, making it a better choice for large amounts of data.

Impact of Indexing

Indexing plays a crucial role in the performance of the subquery approach. The ROW_NUMBER() implementation, on the other hand, is less affected by indexing. Let’s compare the estimated query costs for different row sizes.

Production.ProductVersion:

Row SizeSubquery Implementation CostROW_NUMBER() Implementation Cost
1,0000.06524620.0355736
10,0000.2385730.673282
100,0002.22585.97198
1,000,00014.388183.7228

Production.ProductVersion2:

Row SizeSubquery Implementation CostROW_NUMBER() Implementation Cost
1,0000.03557360.225896
10,0001.63970.673282
100,00044.13325.97202
1,000,000448.4783.7229

As shown in the tables, the ROW_NUMBER() implementation remains relatively consistent in terms of query cost, regardless of the row size. On the other hand, the subquery implementation’s query cost increases significantly with larger row sizes.

Change in Requirements

Now, let’s consider a change in requirements. Suppose we need to retrieve the maximum MinorVersion for each combination of ProductID and Version. Modifying the subquery implementation to accommodate this change would involve significant overhead and complexity. However, the ROW_NUMBER() implementation only requires a small modification.

WITH RowExample2 AS
(
	SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,
		Version
		ORDER BY ProductID,
			Version,
			MinorVersion DESC,
			ReleaseVersion DESC
	) AS MaxVersion,
	ProductID,
	Version,
	MinorVersion,
	ReleaseVersion,
	StandardCost
	FROM Production.ProductVersion pv WITH (NOLOCK)
)
SELECT ProductID,
	Version,
	MinorVersion,
	ReleaseVersion,
	StandardCost
FROM RowExample2
WHERE MaxVersion = 1
ORDER BY ProductID;

In this modified implementation, we partition the rows by both ProductID and Version. The WHERE clause remains valid because the maximum MinorVersion for each combination is guaranteed to be the first row. The estimated query cost does not change significantly for the modified code.

Conclusion

The ROW_NUMBER() function provides a more readable, manageable, and efficient solution for retrieving the latest or maximum values in SQL Server. It eliminates the need for complex nested subqueries and ensures the integrity of data at the row level. The function’s implementation is relatively independent of indexing, making it a better choice for large datasets. When faced with scenarios that require sequencing or identifying the latest values, the ROW_NUMBER() function should be considered as a powerful tool in your SQL Server toolkit.

Click to rate this post!
[Total: 0 Average: 0]

Let's work together

Send us a message or book free introductory meeting with us using button below.