Continuous learning is essential to stay up-to-date with SQL Server product features. During a recent demo on SQL Server Statistics, I discovered a powerful feature called Filter Statistics, introduced in SQL Server 2008. This feature addresses situations where the optimizer struggles to estimate query results accurately due to a lack of understanding of data correlation.
To demonstrate the power of Filter Statistics, I created a small demo using two tables: Employee and Sales. Here is the script:
USE MASTER
GO
IF DB_ID('DropMeAfterDemo') IS NOT NULL
BEGIN
ALTER DATABASE DropMeAfterDemo SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE DropMeAfterDemo
END
GO
CREATE DATABASE DropMeAfterDemo
GO
SET NOCOUNT ON
GO
USE DropMeAfterDemo
GO
IF OBJECT_ID('Employee') IS NOT NULL
DROP TABLE Employee
GO
IF OBJECT_ID('Sales') IS NOT NULL
DROP TABLE Sales
GO
CREATE TABLE Employee (
EmpID INT,
name NVARCHAR(100)
)
GO
CREATE TABLE Sales (
EmpID INT,
SalesAmount INT,
DATETIME DATETIME
)
GO
CREATE CLUSTERED INDEX cidx_EmpID ON Employee (EmpID)
GO
CREATE INDEX ix_Employee_name ON Employee (name)
GO
CREATE STATISTICS ix_Employee_EmpID_name ON Employee (EmpID, name)
GO
CREATE CLUSTERED INDEX ix_Sales_id_amount ON Sales (EmpID, SalesAmount)
GO
-- Inserting data with skewness
INSERT Employee VALUES (1, 'Pinal')
INSERT Employee VALUES (2, 'Nupur')
GO
SET NOCOUNT ON
-- Pinal did poorly with just one sale
INSERT Sales VALUES (1, 100, GETDATE())
-- Nupur did exceptionally well with 5000 sales
DECLARE @loop INT
SET @loop = 1
WHILE @loop <= 5000
BEGIN
INSERT Sales VALUES (2, @loop, GETDATE() - RAND() * 1000)
SET @loop = @loop + 1
END
GO
-- Updating statistics with full scan
UPDATE STATISTICS Employee WITH FULLSCAN
GO
UPDATE STATISTICS Sales WITH FULLSCAN
GO
-- Running the query and examining the execution plan
-- Enable the actual execution plan in SQL Server Management Studio
DBCC FREEPROCCACHE
GO
SELECT SalesAmount
FROM Employee, Sales
WHERE Employee.EmpID = Sales.EmpID
AND name = 'Pinal'
GO
DBCC FREEPROCCACHE
GO
SELECT SalesAmount
FROM Employee, Sales
WHERE Employee.EmpID = Sales.EmpID
AND name = 'Nupur'
GO
-- Creating filtered statistics based on the WHERE clause
CREATE STATISTICS Employee_stats_EmpID ON Employee (EmpID) WHERE name = 'Pinal'
GO
CREATE STATISTICS Employee_stats_EmpID2 ON Employee (EmpID) WHERE name = 'Nupur'
GO
-- Running the same queries with filtered statistics
SELECT SalesAmount
FROM Employee, Sales
WHERE Employee.EmpID = Sales.EmpID
AND name = 'Pinal'
GO
SELECT SalesAmount
FROM Employee, Sales
WHERE Employee.EmpID = Sales.EmpID
AND name = 'Nupur'
GO
-- Viewing filtered statistics
SELECT *
FROM sys.stats
WHERE filter_definition IS NOT NULL
In the demo, I created two tables: Employee and Sales. These tables represent a company with two employees who have different sales amounts, resulting in skewed data. To ensure accurate statistics, I updated them with a full scan.
Next, I ran a query to retrieve the sales amount for a specific employee. However, the estimated number of rows in the execution plan was incorrect because SQL Server couldn’t determine the matching rows until it found the Employee ID from the Employee table.
To address this issue, I created filtered statistics based on the WHERE clause in the query. By creating statistics specific to each employee, the optimizer could accurately estimate the number of rows and generate a more efficient query plan.
Running the same queries with the filtered statistics resulted in accurate estimations and different query plans for each employee.
The main usage of Filter Statistics is in situations where data is skewed, such as when certain values occur more frequently than others. By creating filtered statistics, SQL Server can make better decisions during query optimization.
Have you ever encountered a situation where Filter Statistics could have improved query performance in your environment? Let me know in the comments!