Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / database / SQL-Server

Thinking in SQL - Distribution Functions in SS 2012

3.67/5 (3 votes)
24 Mar 2014CPOL7 min read 17K   90  
Covering how to use the new Distribution functions in SQL Server 2012

Introduction

A while ago I wrote an article on the select statement and in the 'what's new in 2012' section I mentioned the four new analytical functions, and said I had no real clue how they worked. Having gone on to learn, I'm going to set out to explain them today. I read a lot of explanations that were flat out confusing, so hopefully I can contribute something in terms of making this clear, because it's actually not complicated

cume_dist and percent_rank

There are four new functions which break off nicely in to pairs. In order to play with them, we're going to create a sample database with 10 entries, each with an id, a name and a birthdate. Note that there are repeated values in here.

SQL
create table birthdays
(
   id int identity,
   name varchar(50),
   birthday date
)

insert into birthdays values 
('Chris', '19690217'),
('Sarah', '19710514'),
('Bill', '19710514'),
('Hannah', '19960620'),
('Chris', '19230110'),
('Bob', '19710512'),
('Manuel', '19230114'),
('Chris', '20000601'),
('Chris', '19960620'),
('Chris', '19710514')

Now if we do a select and run these two functions, we get this:

SQL
select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank
from birthdays

gives:

id name birthday cumeDist percentRank
1 Chris 1969-02-17 0.1 0
2 Sarah 1971-05-14 0.2 0.111111111111111
3 Bill 1971-05-14 0.3 0.222222222222222
4 Hannah 1996-06-20 0.4 0.333333333333333
5 Chris 1923-01-10 0.5 0.444444444444444
6 Bob 1971-05-12 0.6 0.555555555555556
7 Manuel 1923-01-14 0.7 0.666666666666667
8 Chris 2000-06-01 0.8 0.777777777777778
9 Chris 1996-06-20 0.9 0.888888888888889
10 Chris 1971-05-14 1 1

So, what has this done ? The syntax for cume_dist and percent_rank is the same as for any window function, you need an order by, and you can specify a partition by as well, to create windows inside the data. We have exactly 10 values for a reason. Look at the data for the cumeDist column. The values are all id/10. CumeDist calculates the CUMulativE DISTribution of your data. In plain English, it asks, how many rows have the key value of this row, or less, as a percentage. So, 10% of the rows have an id of 1 or less, 20% have an id of 2 or less, etc. PercentRank is the PERCENTile RANKing, whch is to say, it ask 'how many rows have an id less than mine ?'. If we had 11 rows in our data, then it would be from 0-1, as you can see by running the following:

SQL
;with rankData as
(
    select id, name, birthday from birthdays
    union 
    select 11 as id, 'fred' as name, '20000101' as birthday
)

select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank
from rankData;

gives:

id name birthday cumeDist percentRank
1 Chris 1969-02-17 0.0909090909090909 0
2 Sarah 1971-05-14 0.181818181818182 0.1
3 Bill 1971-05-14 0.272727272727273 0.2
4 Hannah 1996-06-20 0.363636363636364 0.3
5 Chris 1923-01-10 0.454545454545455 0.4
6 Bob 1971-05-12 0.545454545454545 0.5
7 Manuel 1923-01-14 0.636363636363636 0.6
8 Chris 2000-06-01 0.727272727272727 0.7
9 Chris 1996-06-20 0.818181818181818 0.8
10 Chris 1971-05-14 0.909090909090909 0.9
11 fred 2000-01-01 1 1

By using a UNION to add an extra row, we forced cume_dist to move at increments less than 1, and percent_rank to move in steps of 1.

PERCENTILE_DISC and PERCENTILE_CONT

So, what do these other two functions do ? Basically, they are the opposite question. They allow you to ask the question 'which is the first row for which cume_dist and percent_rank would return this value ? As such, the syntax is slightly different:

SQL
select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank,
PERCENTILE_DISC(.5) within group (order by id) over() as [percentileDisc .5], 
PERCENTILE_DISC(.55) within group (order by id) over() as [percentileDisc .55], 
PERCENTILE_CONT(.5) within group (order by id) over() as [percentileCont .5], 
PERCENTILE_CONT(.55) within group (order by id) over() as [percentileCont .55] 
from birthdays

Instead of specifying the column that we run over, and an optional window, we pass in a parameter ( the value to match ) and then use the 'within group' syntax to define the window that we look in for the value in question. The order by clause tells the function what value to actually return ( it will order by the id, and return the id that matches the value passed in ) and the over clause is required, even if it's empty, but can be used to define windows inside the data for the functions to run inside.

There is another wrinkle. If you run this SQL, you get the following:

id name birthday cumeDist percentRank percentileDisc .5 percentileDisc .55 percentileCont .5 percentileCont .55
1 Chris 1969-02-17 0.1 0 5 6 5.5 5.95
2 Sarah 1971-05-14 0.2 0.111111111111111 5 6 5.5 5.95
3 Bill 1971-05-14 0.3 0.222222222222222 5 6 5.5 5.95
4 Hannah 1996-06-20 0.4 0.333333333333333 5 6 5.5 5.95
5 Chris 1923-01-10 0.5 0.444444444444444 5 6 5.5 5.95
6 Bob 1971-05-12 0.6 0.555555555555556 5 6 5.5 5.95
7 Manuel 1923-01-14 0.7 0.666666666666667 5 6 5.5 5.95
8 Chris 2000-06-01 0.8 0.777777777777778 5 6 5.5 5.95
9 Chris 1996-06-20 0.9 0.888888888888889 5 6 5.5 5.95
10 Chris 1971-05-14 1 1 5 6 5.5 5.95

Now, given the source data, you might think that percentile_disc gives rounder numbers just because cume_dist does, but if you run this:

SQL
;with rankData as
(
    select id, name, birthday from birthdays
    union 
    select 11 as id, 'fred' as name, '20000101' as birthday
)

select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank,
PERCENTILE_DISC(.5) within group (order by id) over() as [percentileDisc .5], 
PERCENTILE_DISC(.55) within group (order by id) over() as [percentileDisc .55], 
PERCENTILE_CONT(.5) within group (order by id) over() as [percentileCont .5], 
PERCENTILE_CONT(.55) within group (order by id) over() as [percentileCont .55] 
from rankData;

You get this:

id name birthday cumeDist percentRank percentileDisc .5 percentileDisc .55 percentileCont .5 percentileCont .55
1 Chris 1969-02-17 0.090909 0 6 7 6 6.5
2 Sarah 1971-05-14 0.181818 0.1 6 7 6 6.5
3 Bill 1971-05-14 0.272727 0.2 6 7 6 6.5
4 Hannah 1996-06-20 0.363636 0.3 6 7 6 6.5
5 Chris 1923-01-10 0.454545 0.4 6 7 6 6.5
6 Bob 1971-05-12 0.545454 0.5 6 7 6 6.5
7 Manuel 1923-01-14 0.636363 0.6 6 7 6 6.5
8 Chris 2000-06-01 0.727272 0.7 6 7 6 6.5
9 Chris 1996-06-20 0.818181 0.8 6 7 6 6.5
10 Chris 1971-05-14 0.909090 0.9 6 7 6 6.5
11 fred 2000-01-01 1 1 6 7 6 6.5

This does clean up the percentile_cont values considerable, by changing the result, but it's still true that percentile_disc always gives actual values from the source data, and percentil_cont calculates values that are *between* the values in the source data.

Windowing

Here is an example that creates windows inside the data. It also shows percentile_disc over names ( I used .1 so it's obvious that the answer is the first alphabetical name ).

SQL
select id, name, birthday,
cume_dist() over(partition by name order by id) as cumeDist,
PERCENT_RANK() over(partition by name order by id) as percentRank,
PERCENTILE_DISC(.5) within group (order by id) over(partition by name) as [percentileDisc .5], 
PERCENTILE_CONT(.5) within group (order by id) over(partition by name) as [percentileCont .5], 
PERCENTILE_DISC(.1) within group (order by name) over() as [percentileDisc Name .1]
from birthdays
order by id

and the result:

id name birthday cumeDist percentRank percentileDisc .5 percentileCont .5 percentileDisc Name .1
1 Chris 1969-02-17 0.2 0 8 8 Bill
2 Sarah 1971-05-14 1 0 2 2 Bill
3 Bill 1971-05-14 1 0 3 3 Bill
4 Hannah 1996-06-20 1 0 4 4 Bill
5 Chris 1923-01-10 0.4 0.25 8 8 Bill
6 Bob 1971-05-12 1 0 6 6 Bill
7 Manuel 1923-01-14 1 0 7 7 Bill
8 Chris 2000-06-01 0.6 0.5 8 8 Bill
9 Chris 1996-06-20 0.8 0.75 8 8 Bill
10 Chris 1971-05-14 1 1 8 8 Bill

One final example, this time I am grouping by birth month. Hopefully that's enough to give you some ideas, and the data set is varied enough ( or you could use your own data ) to inspire you to play further as you come to grips with how these functions can be used.

SQL
select id, name, birthday, month(birthday) as [month],
cume_dist() over(order by month(birthday)) as cumeDist,
PERCENT_RANK() over(order by month(birthday)) as percentRank,
PERCENTILE_DISC(.5) within group (order by month(birthday)) over(partition by month(birthday)) as [percentileDisc .5], 
PERCENTILE_CONT(.5) within group (order by month(birthday)) over(partition by month(birthday)) as [percentileCont .5] 
from birthdays

Which returns:

id name birthday month cumeDist percentRank percentileDisc .5 percentileCont .5
5 Chris 1923-01-10 1 0.2 0 1 1
7 Manuel 1923-01-14 1 0.2 0 1 1
1 Chris 1969-02-17 2 0.3 0.222222222222222 2 2
2 Sarah 1971-05-14 5 0.7 0.333333333333333 5 5
3 Bill 1971-05-14 5 0.7 0.333333333333333 5 5
6 Bob 1971-05-12 5 0.7 0.333333333333333 5 5
10 Chris 1971-05-14 5 0.7 0.333333333333333 5 5
8 Chris 2000-06-01 6 1 0.777777777777778 6 6
9 Chris 1996-06-20 6 1 0.777777777777778 6 6
4 Hannah 1996-06-20 6 1 0.777777777777778 6 6

Conclusion

That basically wraps it up. Hopefully I've explained it clearly, as I said, I was confused at first because of articles I read that failed to explain clearly what these functions are used for. If in doubt, ask me, I'm happy to hear about edits I can make to make these articles clearer, or any other way I can help with SQL.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)