Introduction
SQL Server's Ranking functions can be very useful and handy if we use them in a right way and thoughtfully; otherwise, they become evil very fast! In this tip I will talk about my experience with NTILE
function but the same story can be expanded to all other ranking functions.
Background
After spending several hours on why a script was generating different results against same sets of data, I discovered the evil. It was a set of unwise NTILE
function calls. For those who may not be familiar with this function, please check this MSDN page. All rank functions contain an over clause which tells the function which column(s) to be used as reference.
The Problem
Here is an example for NTILE
and how it becomes evil -
Imagine the following table of data (named MyTable
):
I picked simle names A
to F
in order to make it easier to follow the scenario.
Now assume we have the following script:
SELECT [Name], [Cost], NTILE(6) OVER(ORDER BY [Cost]) AS [Rank] FROM [MyTable]
Running the script agains MyTable
you will have the following result set:
Name | Cost | Rank |
C | 1 | 1 |
B | 1 | 2 |
A | 1 | 3 |
E | 2 | 4 |
F | 3 | 5 |
D | 4 | 6 |
As you can see, NTILE
does exactly what it is asked to do. Let's run the same script against the same table with slightly different data order:
As you can see, the tables contents are the same but only the order of data has changed. After running the very same script, you will get the following result set:
Name | Cost | Rank |
B | 1 | 1 |
A | 1 | 2 |
C | 1 | 3 |
E | 2 | 4 |
F | 3 | 5 |
D | 4 | 6 |
Now, let's compare the rankings in two result sets side by side:
| Result 1 | Result 2 |
Name | Rank | Rank |
A | 3 | 2 |
B | 2 | 1 |
C | 1 | 3 |
D | 4 | 4 |
E | 5 | 5 |
F | 6 | 6 |
Can you see the evil? Running the same script against the same data sets with different data order results in totally different outputs. NTILE
by itself is not an evil but the way we assumed it would work was wrong. In script we are asking it to order the Cost
column and then group them into 6 groups. The only order it considers is Cost
because that's what we asked for. Who to blame?!
How to fix it?
When we use rank functions we need to be very careful. We need to consider many things and check what we are asking them to do wisely. Rank functions can receive several orders. What we need to do is that we need to make sure all required orderings are in place. That way we can make sure the data order does not affect our results. In this case in particular, one way to fix the discrepancy is to add a new order to our NTILE
function:
SELECT [Name], [Cost], NTILE(6) OVER(ORDER BY [Cost], [Name]) AS [Rank] FROM [MyTable]
We are adding Name column to the order clause. It does not matter if it as ASC
or DESC
; what matters is that we will get the same results if we run it against any data set with the same data but different order.
Name | Cost | Rank |
A | 1 | 1 |
B | 1 | 2 |
C | 1 | 3 |
E | 2 | 4 |
F | 3 | 5 |
D | 4 | 6 |