Introduction
A while ago I wrote an article on the select statement and in the 'what's new in 2012' section I mentioned the four new
analytical functions, and said I had no real clue how they worked. Having gone on to learn, I'm going to set out to
explain them today. I read a lot of explanations that were flat out confusing, so hopefully I can contribute something in
terms of making this clear, because it's actually not complicated
cume_dist and percent_rank
There are four new functions which break off nicely in to pairs. In order to play with them, we're going to create a
sample database with 10 entries, each with an id, a name and a birthdate. Note that there are repeated values in
here.
create table birthdays
(
id int identity,
name varchar(50),
birthday date
)
insert into birthdays values
('Chris', '19690217'),
('Sarah', '19710514'),
('Bill', '19710514'),
('Hannah', '19960620'),
('Chris', '19230110'),
('Bob', '19710512'),
('Manuel', '19230114'),
('Chris', '20000601'),
('Chris', '19960620'),
('Chris', '19710514')
Now if we do a select and run these two functions, we get this:
select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank
from birthdays
gives:
id | name | birthday | cumeDist | percentRank |
1 | Chris | 1969-02-17 | 0.1 | 0 |
2 | Sarah | 1971-05-14 | 0.2 | 0.111111111111111 |
3 | Bill | 1971-05-14 | 0.3 | 0.222222222222222 |
4 | Hannah | 1996-06-20 | 0.4 | 0.333333333333333 |
5 | Chris | 1923-01-10 | 0.5 | 0.444444444444444 |
6 | Bob | 1971-05-12 | 0.6 | 0.555555555555556 |
7 | Manuel | 1923-01-14 | 0.7 | 0.666666666666667 |
8 | Chris | 2000-06-01 | 0.8 | 0.777777777777778 |
9 | Chris | 1996-06-20 | 0.9 | 0.888888888888889 |
10 | Chris | 1971-05-14 | 1 | 1 |
So, what has this done ? The syntax for cume_dist and percent_rank is the same as for any window function, you need an order by, and you can specify a partition by as well, to create windows inside the data. We have exactly 10 values for a reason. Look at the data for the cumeDist column. The values are all id/10. CumeDist calculates the CUMulativE DISTribution of your data. In plain English, it asks, how many rows have the key value of this row, or less, as a percentage. So, 10% of the rows have an id of 1 or less, 20% have an id of 2 or less, etc. PercentRank is the PERCENTile RANKing, whch is to say, it ask 'how many rows have an id less than mine ?'. If we had 11 rows in our data, then it would be from 0-1, as you can see by running the following:
;with rankData as
(
select id, name, birthday from birthdays
union
select 11 as id, 'fred' as name, '20000101' as birthday
)
select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank
from rankData;
gives:
id | name | birthday | cumeDist | percentRank |
1 | Chris | 1969-02-17 | 0.0909090909090909 | 0 |
2 | Sarah | 1971-05-14 | 0.181818181818182 | 0.1 |
3 | Bill | 1971-05-14 | 0.272727272727273 | 0.2 |
4 | Hannah | 1996-06-20 | 0.363636363636364 | 0.3 |
5 | Chris | 1923-01-10 | 0.454545454545455 | 0.4 |
6 | Bob | 1971-05-12 | 0.545454545454545 | 0.5 |
7 | Manuel | 1923-01-14 | 0.636363636363636 | 0.6 |
8 | Chris | 2000-06-01 | 0.727272727272727 | 0.7 |
9 | Chris | 1996-06-20 | 0.818181818181818 | 0.8 |
10 | Chris | 1971-05-14 | 0.909090909090909 | 0.9 |
11 | fred | 2000-01-01 | 1 | 1 |
By using a UNION to add an extra row, we forced cume_dist to move at increments less than 1, and percent_rank to move
in steps of 1.
PERCENTILE_DISC and PERCENTILE_CONT
So, what do these other two functions do ? Basically, they are the opposite question. They allow you to ask the question 'which is the first row for which cume_dist and percent_rank would return this value ? As such, the syntax is slightly different:
select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank,
PERCENTILE_DISC(.5) within group (order by id) over() as [percentileDisc .5],
PERCENTILE_DISC(.55) within group (order by id) over() as [percentileDisc .55],
PERCENTILE_CONT(.5) within group (order by id) over() as [percentileCont .5],
PERCENTILE_CONT(.55) within group (order by id) over() as [percentileCont .55]
from birthdays
Instead of specifying the column that we run over, and an optional window, we pass in a parameter ( the value to match ) and then use the 'within group' syntax to define the window that we look in for the value in question. The order by clause tells the function what value to actually return ( it will order by the id, and return the id that matches the value passed in ) and the over clause is required, even if it's empty, but can be used to define windows inside the data for the functions to run inside.
There is another wrinkle. If you run this SQL, you get the following:
id | name | birthday | cumeDist | percentRank | percentileDisc .5 | percentileDisc .55 | percentileCont .5 | percentileCont .55 |
1 | Chris | 1969-02-17 | 0.1 | 0 | 5 | 6 | 5.5 | 5.95 |
2 | Sarah | 1971-05-14 | 0.2 | 0.111111111111111 | 5 | 6 | 5.5 | 5.95 |
3 | Bill | 1971-05-14 | 0.3 | 0.222222222222222 | 5 | 6 | 5.5 | 5.95 |
4 | Hannah | 1996-06-20 | 0.4 | 0.333333333333333 | 5 | 6 | 5.5 | 5.95 |
5 | Chris | 1923-01-10 | 0.5 | 0.444444444444444 | 5 | 6 | 5.5 | 5.95 |
6 | Bob | 1971-05-12 | 0.6 | 0.555555555555556 | 5 | 6 | 5.5 | 5.95 |
7 | Manuel | 1923-01-14 | 0.7 | 0.666666666666667 | 5 | 6 | 5.5 | 5.95 |
8 | Chris | 2000-06-01 | 0.8 | 0.777777777777778 | 5 | 6 | 5.5 | 5.95 |
9 | Chris | 1996-06-20 | 0.9 | 0.888888888888889 | 5 | 6 | 5.5 | 5.95 |
10 | Chris | 1971-05-14 | 1 | 1 | 5 | 6 | 5.5 | 5.95 |
Now, given the source data, you might think that percentile_disc gives rounder numbers just because cume_dist does, but if you run this:
;with rankData as
(
select id, name, birthday from birthdays
union
select 11 as id, 'fred' as name, '20000101' as birthday
)
select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank,
PERCENTILE_DISC(.5) within group (order by id) over() as [percentileDisc .5],
PERCENTILE_DISC(.55) within group (order by id) over() as [percentileDisc .55],
PERCENTILE_CONT(.5) within group (order by id) over() as [percentileCont .5],
PERCENTILE_CONT(.55) within group (order by id) over() as [percentileCont .55]
from rankData;
You get this:
id | name | birthday | cumeDist | percentRank | percentileDisc .5 | percentileDisc .55 | percentileCont .5 | percentileCont .55 |
1 | Chris | 1969-02-17 | 0.090909 | 0 | 6 | 7 | 6 | 6.5 |
2 | Sarah | 1971-05-14 | 0.181818 | 0.1 | 6 | 7 | 6 | 6.5 |
3 | Bill | 1971-05-14 | 0.272727 | 0.2 | 6 | 7 | 6 | 6.5 |
4 | Hannah | 1996-06-20 | 0.363636 | 0.3 | 6 | 7 | 6 | 6.5 |
5 | Chris | 1923-01-10 | 0.454545 | 0.4 | 6 | 7 | 6 | 6.5 |
6 | Bob | 1971-05-12 | 0.545454 | 0.5 | 6 | 7 | 6 | 6.5 |
7 | Manuel | 1923-01-14 | 0.636363 | 0.6 | 6 | 7 | 6 | 6.5 |
8 | Chris | 2000-06-01 | 0.727272 | 0.7 | 6 | 7 | 6 | 6.5 |
9 | Chris | 1996-06-20 | 0.818181 | 0.8 | 6 | 7 | 6 | 6.5 |
10 | Chris | 1971-05-14 | 0.909090 | 0.9 | 6 | 7 | 6 | 6.5 |
11 | fred | 2000-01-01 | 1 | 1 | 6 | 7 | 6 | 6.5 |
This does clean up the percentile_cont values considerable, by changing the result, but it's still true that percentile_disc always gives actual values from the source data, and percentil_cont calculates values that are *between* the values in the source data.
Windowing
Here is an example that creates windows inside the data. It also shows percentile_disc over names ( I used .1 so it's obvious that the answer is the first alphabetical name ).
select id, name, birthday,
cume_dist() over(partition by name order by id) as cumeDist,
PERCENT_RANK() over(partition by name order by id) as percentRank,
PERCENTILE_DISC(.5) within group (order by id) over(partition by name) as [percentileDisc .5],
PERCENTILE_CONT(.5) within group (order by id) over(partition by name) as [percentileCont .5],
PERCENTILE_DISC(.1) within group (order by name) over() as [percentileDisc Name .1]
from birthdays
order by id
and the result:
id | name | birthday | cumeDist | percentRank | percentileDisc .5 | percentileCont .5 | percentileDisc Name .1 |
1 | Chris | 1969-02-17 | 0.2 | 0 | 8 | 8 | Bill |
2 | Sarah | 1971-05-14 | 1 | 0 | 2 | 2 | Bill |
3 | Bill | 1971-05-14 | 1 | 0 | 3 | 3 | Bill |
4 | Hannah | 1996-06-20 | 1 | 0 | 4 | 4 | Bill |
5 | Chris | 1923-01-10 | 0.4 | 0.25 | 8 | 8 | Bill |
6 | Bob | 1971-05-12 | 1 | 0 | 6 | 6 | Bill |
7 | Manuel | 1923-01-14 | 1 | 0 | 7 | 7 | Bill |
8 | Chris | 2000-06-01 | 0.6 | 0.5 | 8 | 8 | Bill |
9 | Chris | 1996-06-20 | 0.8 | 0.75 | 8 | 8 | Bill |
10 | Chris | 1971-05-14 | 1 | 1 | 8 | 8 | Bill |
One final example, this time I am grouping by birth month. Hopefully that's enough to give you some ideas, and the data set is varied enough ( or you could use your own data ) to inspire you to play further as you come to grips with how these functions can be used.
select id, name, birthday, month(birthday) as [month],
cume_dist() over(order by month(birthday)) as cumeDist,
PERCENT_RANK() over(order by month(birthday)) as percentRank,
PERCENTILE_DISC(.5) within group (order by month(birthday)) over(partition by month(birthday)) as [percentileDisc .5],
PERCENTILE_CONT(.5) within group (order by month(birthday)) over(partition by month(birthday)) as [percentileCont .5]
from birthdays
Which returns:
id | name | birthday | month | cumeDist | percentRank | percentileDisc .5 | percentileCont .5 |
5 | Chris | 1923-01-10 | 1 | 0.2 | 0 | 1 | 1 |
7 | Manuel | 1923-01-14 | 1 | 0.2 | 0 | 1 | 1 |
1 | Chris | 1969-02-17 | 2 | 0.3 | 0.222222222222222 | 2 | 2 |
2 | Sarah | 1971-05-14 | 5 | 0.7 | 0.333333333333333 | 5 | 5 |
3 | Bill | 1971-05-14 | 5 | 0.7 | 0.333333333333333 | 5 | 5 |
6 | Bob | 1971-05-12 | 5 | 0.7 | 0.333333333333333 | 5 | 5 |
10 | Chris | 1971-05-14 | 5 | 0.7 | 0.333333333333333 | 5 | 5 |
8 | Chris | 2000-06-01 | 6 | 1 | 0.777777777777778 | 6 | 6 |
9 | Chris | 1996-06-20 | 6 | 1 | 0.777777777777778 | 6 | 6 |
4 | Hannah | 1996-06-20 | 6 | 1 | 0.777777777777778 | 6 | 6 |
Conclusion
That basically wraps it up. Hopefully I've explained it clearly, as I said, I was confused at first because of articles I read that failed to explain clearly what these functions are used for. If in doubt, ask me, I'm happy to hear about edits I can make to make these articles clearer, or any other way I can help with SQL.