Thinking in SQL - Distribution Functions in SS 2012

Christian Graus

3.67/5 (3 votes)

24 Mar 2014CPOL7 min read

17K

Covering how to use the new Distribution functions in SQL Server 2012

Download Thinking_in_SQL_-_Distribution_Functions.zip - 692 B

Introduction

A while ago I wrote an article on the select statement and in the 'what's new in 2012' section I mentioned the four new analytical functions, and said I had no real clue how they worked. Having gone on to learn, I'm going to set out to explain them today. I read a lot of explanations that were flat out confusing, so hopefully I can contribute something in terms of making this clear, because it's actually not complicated

cume_dist and percent_rank

There are four new functions which break off nicely in to pairs. In order to play with them, we're going to create a sample database with 10 entries, each with an id, a name and a birthdate. Note that there are repeated values in here.

SQL

create table birthdays
(
   id int identity,
   name varchar(50),
   birthday date
)

insert into birthdays values 
('Chris', '19690217'),
('Sarah', '19710514'),
('Bill', '19710514'),
('Hannah', '19960620'),
('Chris', '19230110'),
('Bob', '19710512'),
('Manuel', '19230114'),
('Chris', '20000601'),
('Chris', '19960620'),
('Chris', '19710514')

Now if we do a select and run these two functions, we get this:

SQL

select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank
from birthdays

gives:

id	name	birthday	cumeDist	percentRank
1	Chris	1969-02-17	0.1	0
2	Sarah	1971-05-14	0.2	0.111111111111111
3	Bill	1971-05-14	0.3	0.222222222222222
4	Hannah	1996-06-20	0.4	0.333333333333333
5	Chris	1923-01-10	0.5	0.444444444444444
6	Bob	1971-05-12	0.6	0.555555555555556
7	Manuel	1923-01-14	0.7	0.666666666666667
8	Chris	2000-06-01	0.8	0.777777777777778
9	Chris	1996-06-20	0.9	0.888888888888889
10	Chris	1971-05-14	1	1

So, what has this done ? The syntax for cume_dist and percent_rank is the same as for any window function, you need an order by, and you can specify a partition by as well, to create windows inside the data. We have exactly 10 values for a reason. Look at the data for the cumeDist column. The values are all id/10. CumeDist calculates the CUMulativE DISTribution of your data. In plain English, it asks, how many rows have the key value of this row, or less, as a percentage. So, 10% of the rows have an id of 1 or less, 20% have an id of 2 or less, etc. PercentRank is the PERCENTile RANKing, whch is to say, it ask 'how many rows have an id less than mine ?'. If we had 11 rows in our data, then it would be from 0-1, as you can see by running the following:

SQL

;with rankData as
(
    select id, name, birthday from birthdays
    union 
    select 11 as id, 'fred' as name, '20000101' as birthday
)

select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank
from rankData;

gives:

id	name	birthday	cumeDist	percentRank
1	Chris	1969-02-17	0.0909090909090909	0
2	Sarah	1971-05-14	0.181818181818182	0.1
3	Bill	1971-05-14	0.272727272727273	0.2
4	Hannah	1996-06-20	0.363636363636364	0.3
5	Chris	1923-01-10	0.454545454545455	0.4
6	Bob	1971-05-12	0.545454545454545	0.5
7	Manuel	1923-01-14	0.636363636363636	0.6
8	Chris	2000-06-01	0.727272727272727	0.7
9	Chris	1996-06-20	0.818181818181818	0.8
10	Chris	1971-05-14	0.909090909090909	0.9
11	fred	2000-01-01	1	1

By using a UNION to add an extra row, we forced cume_dist to move at increments less than 1, and percent_rank to move in steps of 1.

PERCENTILE_DISC and PERCENTILE_CONT

So, what do these other two functions do ? Basically, they are the opposite question. They allow you to ask the question 'which is the first row for which cume_dist and percent_rank would return this value ? As such, the syntax is slightly different:

SQL

select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank,
PERCENTILE_DISC(.5) within group (order by id) over() as [percentileDisc .5], 
PERCENTILE_DISC(.55) within group (order by id) over() as [percentileDisc .55], 
PERCENTILE_CONT(.5) within group (order by id) over() as [percentileCont .5], 
PERCENTILE_CONT(.55) within group (order by id) over() as [percentileCont .55] 
from birthdays

Instead of specifying the column that we run over, and an optional window, we pass in a parameter ( the value to match ) and then use the 'within group' syntax to define the window that we look in for the value in question. The order by clause tells the function what value to actually return ( it will order by the id, and return the id that matches the value passed in ) and the over clause is required, even if it's empty, but can be used to define windows inside the data for the functions to run inside.

There is another wrinkle. If you run this SQL, you get the following:

id	name	birthday	cumeDist	percentRank	percentileDisc .5	percentileDisc .55	percentileCont .5	percentileCont .55
1	Chris	1969-02-17	0.1	0	5	6	5.5	5.95
2	Sarah	1971-05-14	0.2	0.111111111111111	5	6	5.5	5.95
3	Bill	1971-05-14	0.3	0.222222222222222	5	6	5.5	5.95
4	Hannah	1996-06-20	0.4	0.333333333333333	5	6	5.5	5.95
5	Chris	1923-01-10	0.5	0.444444444444444	5	6	5.5	5.95
6	Bob	1971-05-12	0.6	0.555555555555556	5	6	5.5	5.95
7	Manuel	1923-01-14	0.7	0.666666666666667	5	6	5.5	5.95
8	Chris	2000-06-01	0.8	0.777777777777778	5	6	5.5	5.95
9	Chris	1996-06-20	0.9	0.888888888888889	5	6	5.5	5.95
10	Chris	1971-05-14	1	1	5	6	5.5	5.95

Now, given the source data, you might think that percentile_disc gives rounder numbers just because cume_dist does, but if you run this:

SQL

;with rankData as
(
    select id, name, birthday from birthdays
    union 
    select 11 as id, 'fred' as name, '20000101' as birthday
)

select id, name, birthday,
cume_dist() over(order by id) as cumeDist,
PERCENT_RANK() over(order by id) as percentRank,
PERCENTILE_DISC(.5) within group (order by id) over() as [percentileDisc .5], 
PERCENTILE_DISC(.55) within group (order by id) over() as [percentileDisc .55], 
PERCENTILE_CONT(.5) within group (order by id) over() as [percentileCont .5], 
PERCENTILE_CONT(.55) within group (order by id) over() as [percentileCont .55] 
from rankData;

You get this:

id	name	birthday	cumeDist	percentRank	percentileDisc .5	percentileDisc .55	percentileCont .5	percentileCont .55
1	Chris	1969-02-17	0.090909	0	6	7	6	6.5
2	Sarah	1971-05-14	0.181818	0.1	6	7	6	6.5
3	Bill	1971-05-14	0.272727	0.2	6	7	6	6.5
4	Hannah	1996-06-20	0.363636	0.3	6	7	6	6.5
5	Chris	1923-01-10	0.454545	0.4	6	7	6	6.5
6	Bob	1971-05-12	0.545454	0.5	6	7	6	6.5
7	Manuel	1923-01-14	0.636363	0.6	6	7	6	6.5
8	Chris	2000-06-01	0.727272	0.7	6	7	6	6.5
9	Chris	1996-06-20	0.818181	0.8	6	7	6	6.5
10	Chris	1971-05-14	0.909090	0.9	6	7	6	6.5
11	fred	2000-01-01	1	1	6	7	6	6.5

This does clean up the percentile_cont values considerable, by changing the result, but it's still true that percentile_disc always gives actual values from the source data, and percentil_cont calculates values that are *between* the values in the source data.

Windowing

Here is an example that creates windows inside the data. It also shows percentile_disc over names ( I used .1 so it's obvious that the answer is the first alphabetical name ).

SQL

select id, name, birthday,
cume_dist() over(partition by name order by id) as cumeDist,
PERCENT_RANK() over(partition by name order by id) as percentRank,
PERCENTILE_DISC(.5) within group (order by id) over(partition by name) as [percentileDisc .5], 
PERCENTILE_CONT(.5) within group (order by id) over(partition by name) as [percentileCont .5], 
PERCENTILE_DISC(.1) within group (order by name) over() as [percentileDisc Name .1]
from birthdays
order by id

and the result:

id	name	birthday	cumeDist	percentRank	percentileDisc .5	percentileCont .5	percentileDisc Name .1
1	Chris	1969-02-17	0.2	0	8	8	Bill
2	Sarah	1971-05-14	1	0	2	2	Bill
3	Bill	1971-05-14	1	0	3	3	Bill
4	Hannah	1996-06-20	1	0	4	4	Bill
5	Chris	1923-01-10	0.4	0.25	8	8	Bill
6	Bob	1971-05-12	1	0	6	6	Bill
7	Manuel	1923-01-14	1	0	7	7	Bill
8	Chris	2000-06-01	0.6	0.5	8	8	Bill
9	Chris	1996-06-20	0.8	0.75	8	8	Bill
10	Chris	1971-05-14	1	1	8	8	Bill

One final example, this time I am grouping by birth month. Hopefully that's enough to give you some ideas, and the data set is varied enough ( or you could use your own data ) to inspire you to play further as you come to grips with how these functions can be used.

SQL

select id, name, birthday, month(birthday) as [month],
cume_dist() over(order by month(birthday)) as cumeDist,
PERCENT_RANK() over(order by month(birthday)) as percentRank,
PERCENTILE_DISC(.5) within group (order by month(birthday)) over(partition by month(birthday)) as [percentileDisc .5], 
PERCENTILE_CONT(.5) within group (order by month(birthday)) over(partition by month(birthday)) as [percentileCont .5] 
from birthdays

Which returns:

id	name	birthday	month	cumeDist	percentRank	percentileDisc .5	percentileCont .5
5	Chris	1923-01-10	1	0.2	0	1	1
7	Manuel	1923-01-14	1	0.2	0	1	1
1	Chris	1969-02-17	2	0.3	0.222222222222222	2	2
2	Sarah	1971-05-14	5	0.7	0.333333333333333	5	5
3	Bill	1971-05-14	5	0.7	0.333333333333333	5	5
6	Bob	1971-05-12	5	0.7	0.333333333333333	5	5
10	Chris	1971-05-14	5	0.7	0.333333333333333	5	5
8	Chris	2000-06-01	6	1	0.777777777777778	6	6
9	Chris	1996-06-20	6	1	0.777777777777778	6	6
4	Hannah	1996-06-20	6	1	0.777777777777778	6	6

Conclusion

That basically wraps it up. Hopefully I've explained it clearly, as I said, I was confused at first because of articles I read that failed to explain clearly what these functions are used for. If in doubt, ask me, I'm happy to hear about edits I can make to make these articles clearer, or any other way I can help with SQL.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)