In the SQL Server community, there is often a debate about whether to use COUNT(*) or COUNT(columnname) when counting rows in a table. Many people believe that COUNT(columnname) is faster because it only needs to read the specified column, while COUNT(*) would have to read all columns of each row. However, this belief is not entirely accurate and can lead to incorrect advice.
First, let’s understand the difference between the two ways of using COUNT(). The complete syntax for COUNT() is:
COUNT ( { [ ALL | DISTINCT ] expression ] | * } )
The word “expression” refers to any expression except for uniqueidentifier, text, ntext, or image data. In most cases, “expression” is simply a column in the table. The default option is ALL, which means that writing COUNT(expression) is equivalent to writing COUNT(ALL expression). COUNT(*) returns the total number of rows in the table, while COUNT(expression) returns the number of rows where the result of the expression is not NULL. COUNT(DISTINCT expression) counts only unique values.
Now, let’s debunk some myths and clarify the facts. The belief that COUNT(columnname) is faster than COUNT(*) is based on the assumption that SQL Server can read just the contents of a single column without reading the entire row. However, SQL Server stores rows on 8 KB data pages on disk, and to check the value of a single column, an entire page has to be read from disk and placed in memory. Even if the pages are already cached in memory, SQL Server still needs to read an entire page just to check a single column of a row.
To optimize the counting process, SQL Server uses indexes. Indexes are stored on 8 KB index pages and can usually fit more rows per page than data pages. This means that SQL Server doesn’t have to read as many pages to count the rows in an index. Both COUNT(columnname_with_an_index_defined_on_it) and COUNT(*) can utilize indexes to count the rows efficiently.
There may be cases where an index doesn’t cover the specified column in a COUNT(columnname) query, but there is an index defined on another column of the table. In this scenario, COUNT(*) would use the other index to count the rows, while COUNT(columnname_without_an_index) would have to read the data pages to check the column for NULL values and count the rows.
To illustrate this, you can run the following script in SQL Query Analyzer:
USE Northwind GO SET STATISTICS IO ON SELECT COUNT(*) FROM Orders SELECT COUNT(CustomerId) FROM Orders SELECT * FROM Orders SET STATISTICS IO OFF
The script enables SQL Server to output statistics showing the amount of I/O required to execute the queries. By comparing the number of logical and physical page reads, you can determine the amount of resources used by different queries. In my tests, COUNT(*) resulted in fewer logical page reads compared to COUNT(CustomerId), indicating better performance.
So, which one should you use? In most cases, there is no reason not to use COUNT(*). It performs well and provides accurate results. However, there are situations where you should use COUNT(expression). For example, if you are only interested in the number of rows where the column value is not NULL, COUNT(expression) is the appropriate choice. This is particularly useful when combining COUNT() with other aggregated functions.
Consider the following scenario: you have a table with measure data, some of which have NULL values. You want to calculate the average of these values. While the preferred approach is to use AVG(), let’s assume you are not allowed to use it. In this case, using COUNT(column) instead of COUNT(*) can lead to different average results because COUNT(column) counts only non-NULL values. This can cause unexpected discrepancies in your calculations.
To summarize, understanding the differences between COUNT(*) and COUNT(columnname) is crucial for making informed decisions in SQL Server. While COUNT(*) is generally the preferred choice, there are specific scenarios where COUNT(expression) is necessary. By selecting the appropriate counting method, you can ensure accurate results and optimize performance in your SQL Server queries.