Introduction
Today I want to introduce you to windowing functions. You're probably already familiar with some, so let's start with those. SUM, AVG and COUNT all provide the ability to get an aggregate value based on the contents of a column. For example, in AdventureWorks, the following will show you the sum, average and count of orders from the sales table:
select sum(unitprice * orderqty) as orderSum, avg(unitprice * orderqty) as orderAvg, count(unitprice * orderqty) as orderCount from sales.salesOrderDetail
If you try to add the productid, you'll get an error, because you're mixing aggregate columns with a plain, non-aggregated column:
select productId, sum(unitprice * orderqty) as orderSum, avg(unitprice * orderqty) as orderAvg, count(unitprice * orderqty) as orderCount from sales.salesOrderDetail
The most common way to fix this is with 'group by':
select productId, sum(unitprice * orderqty) as orderSum, avg(unitprice * orderqty) as orderAvg, count(unitprice * orderqty) as orderCount from sales.salesOrderDetail group by productid
As you're probably aware, 'group by' tells the aggregate functions what constitutes a group of data to be considered for a single result. We can use the aggregates without specifying a single-value column (like the product id), but of course the data is then hard to relate to anything, and once we add the product id, we do need to group by it for the SQL to work.
There is another way to do this, as follows:
select distinct productId, sum(unitprice * orderqty) over (partition by productid) as orderSum, avg(unitprice * orderqty) over (partition by productid) as orderAvg, count(unitprice * orderqty) over (partition by productid) as orderCount from sales.salesOrderDetail
I am using distinct instead of 'group by' to illustrate that 'group by' is no longer needed in this case. The 'over' clause allows us to tell SQL Server how to partition the data. This is wordier than our first example, and of no obvious benefit on its own. But what if we wanted different columns to use different groupings?
select distinct salesorderid, productId, avg(unitprice * orderqty) over (partition by salesOrderId) as avgByOrder, avg(unitprice * orderqty) over (partition by productid) as avgByProduct from sales.salesOrderDetail order by salesorderid, productid
This gives us the average order price, and the average product price, for the items in question. I am not sure this is a very USEFUL example, but it DOES show we were able to define two different groupings, without using a CTE, and get the data out accordingly.
Other windowing functions
There are a number of other windowing functions available that do things far more powerful than counting or aggregating values. They are all called windowing functions because they operate on a 'window' of data. Even if a simple statement that calculates a sum works on the WHOLE dataset, that is still a window, and the window can be narrowed down in the two ways I have shown.
The first ones I will show you have been around since SQL Server 2005. Towards the end, I'll show you some things that are new in SQL Server 2012.
For some of these examples, we are going to create a table designed to log access to a network. It contains a user id, a login date time, and a logout date time, which can be null.
create table trackAccess
(
userId int not null,
login datetime not null,
logout datetime
)
We do jump back and forth a bit this week, so the easiest thing is probably to create this inside AdventureWorks and drop it at the end.
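If you want some data to play with straight away, here is a minimal stand-in: the login values below match the results shown later in the FIRST_VALUE and LAST_VALUE section, while the logout values are simply made up for illustration.
-- sample data: logins match the results shown later in this article, logouts are invented
insert into trackAccess (userId, login, logout) values
(1, '2012-01-01 10:00:00', '2012-01-01 17:30:00'),
(1, '2012-01-02 09:32:12', '2012-01-02 18:05:00'),
(1, '2012-01-03 10:04:53', null),
(2, '2012-01-01 09:08:32', '2012-01-01 16:45:00'),
(2, '2012-01-02 09:12:32', '2012-01-02 17:20:00'),
(2, '2012-01-03 10:12:41', '2012-01-03 12:00:00'),
(2, '2012-01-03 14:27:21', null),
(3, '2012-01-01 08:40:27', '2012-01-01 17:00:00'),
(3, '2012-01-02 11:12:13', '2012-01-02 19:15:00'),
(3, '2012-01-04 15:27:43', null)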
ROW_NUMBER()
The first function is simple. It just creates a column with a row number in it.
First run this:
select userId, row_number() over (partition by userid order by login) as login from trackAccess
You can see that it creates a numeric sequence for each user id, assigning a number to each row. Now wrap it in a CTE:
with logins as
(
select userId, row_number() over (partition by userid order by login) as login from trackAccess
)
select userId, max(login) as logins from logins group by userId
This will grab the largest number, which is then the number of times someone has logged in.
Of course, you don't need row_number to do this, you could do:
select userid, count(login) as logins from trackAccess group by userId
or even using the syntax I showed you above:
select distinct userId, count(login) over (partition by userId) as loginCount from trackAccess
Note again, I am using distinct here to make clear that group by is not required with this syntax, but in the real world, I'd probably group by.
However, if I want to find out the first ever login date and time, then this is one way to do it:
;with logins as
(
select userId, login, row_number() over (partition by userid order by login) as loginOrder from trackAccess
)
select userId, login from logins where loginOrder = 1;
Here's another, more useful thing to do with row_number: assigning numbers across the whole table, in a desired order, so that I can grab a page of data at a time:
declare @pageSize int = 20
declare @page int = 5
;with data
as
(
select row_number() over (order by modifieddate) as rowNo, salesOrderId, UnitPrice, OrderQty from Sales.salesorderdetail
)
select * from data where rowNo > ((@page - 1) * @pageSize) and rowNo <= (@page * @pageSize);
This approach can be used to paginate any data where you can specify an order, but don't have a sequence of numbers.
Note that even if you have a numeric primary key id, that does not mean there are no gaps in it (caused by deletes), so this is probably a good way to paginate in nearly all cases.
Deleting duplicate records
Row_number() can be used to delete duplicate records. First, we'll create a table with lots of duplicates:
create table duplicates
(
id int
)
insert into duplicates values (1),(1),(1),(1),(1),(1),(2),(2),(2),(2),(2),(2),(2),(3),(3),(3),(3),(3),(3),(3)
Now we are going to build a query that identifies them. Note that if you had more than one column, you'd have to partition by all the columns that together define a duplicate; that's what partition by does, it creates a sequence based on the group you identify (there's a sketch of the multi-column case at the end of this section).
with dups
as
(
select id, row_number() over (partition by id order by id) as row from duplicates
)
select * from dups
This shows us the data we'll use to do the delete. All we do now is change the last statement:
with dups
as
(
select id, row_number() over (partition by id order by id) as row from duplicates
)
delete from dups where row > 1
and now this:
select * from duplicates
will return only three rows, one for each value for id.
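Here is the multi-column sketch promised above. The contacts table and its columns are made up for illustration; the point is simply that partition by lists every column that, together, defines a duplicate:
;with dups
as
(
-- hypothetical contacts table: rows are duplicates when firstName, lastName and email all match
select firstName, lastName, email, row_number() over (partition by firstName, lastName, email order by email) as row from contacts
)
delete from dups where row > 1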
Finding islands in a sequence
One place where row_number excels is finding islands in your data sequences; that is, if you have a sequence like 1, 2, 3, 4, 5..., finding the places where there are gaps in it. Create the islands table using the SQL provided.
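If you don't have the downloadable SQL to hand, something along these lines will do; the exact values don't matter, as long as the sequence has gaps in it:
create table islands
(
id int not null
)
-- any integer sequence with gaps will demonstrate the technique
insert into islands values (1),(2),(3),(4),(5),(8),(9),(10),(11),(15),(16),(17),(25),(26),(27)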
Then run this:
select id, row_number() over(order by id) as sort from islands
As you'd expect, the generated sequence goes out of sync with the id, because the row numbers themselves have no gaps in them.
Now, here's the trick:
select id, id - row_number() over(order by id) as sort from islands
As you can see, with id - row_number, both values climb together, so the difference stays the same while the ids are consecutive. Once there's an island, there's a one-off jump in the id that causes the calculated number to change, and then it stays the same again. We can use this value as a group by to create a simple query that shows us our islands:
;with island as
(
select id, id - row_number() over(order by id) as sort from islands
)
select min(id) as start, max(id) as [end] from island group by sort;
The main reason for the CTE is so that sort has a name I can group by. This is very elegant; I've had to find islands in data before I knew this trick, and the code was a lot longer and a lot less readable.
RANK() and DENSE_RANK()
The rank function does something similar to row_number, but it gives the same value to rows that have the same value in the order by column.
So this:
SELECT i.ProductID, p.Name, i.LocationID, i.Quantity
,RANK() OVER
(PARTITION BY i.LocationID ORDER BY i.Quantity DESC) AS Rank
FROM Production.ProductInventory AS i
INNER JOIN Production.Product AS p
ON i.ProductID = p.ProductID
WHERE i.LocationID BETWEEN 3 AND 4
ORDER BY i.LocationID;
creates a sequence like this:
ProductID Name LocationID Quantity Rank
494 Paint - Silver 3 49 1
495 Paint - Blue 3 49 1
493 Paint - Red 3 41 3
496 Paint - Yellow 3 30 4
492 Paint - Black 3 17 5
495 Paint - Blue 4 35 1
496 Paint - Yellow 4 25 2
493 Paint - Red 4 24 3
492 Paint - Black 4 14 4
494 Paint - Silver 4 12 5
Note that it restarts at 1 when locationid (our partition by value) changes, and it jumps from 1 to 3, because there are two rows ranked '1', as they both have the same quantity.
DENSE_RANK does the same thing, but even if two values are the same, it does not skip a number; for the data above, location 3 would be ranked 1, 1, 2, 3, 4 instead of 1, 1, 3, 4, 5.
SELECT i.ProductID, p.Name, i.LocationID, i.Quantity
,DENSE_RANK() OVER
(PARTITION BY i.LocationID ORDER BY i.Quantity DESC) AS Rank
FROM Production.ProductInventory AS i
INNER JOIN Production.Product AS p
ON i.ProductID = p.ProductID
WHERE i.LocationID BETWEEN 3 AND 4
ORDER BY i.LocationID;
NTILE
NTILE is used to split the data into n groups, as close to equal in size as possible.
select userId, login, ntile(5) over (order by login) as loginOrder from trackAccess
will order all our logins into 5 groups of (as near as possible) equal size, without regard for the userIds.
select userId, login, ntile(5) over (partition by userid order by login) as loginOrder from trackAccess
will try to create 5 groups within each user id, but it cannot, because no user has that many logins, so it ends up behaving like row_number() split by user id.
select userId, login, ntile(5) over (order by userid) as loginOrder from trackAccess
This orders by user id, but you can see very clearly here that the group number is not matching the id; it's simply using that order to create 5 equal-sized groups.
From here on in, if you don't have SQL Server 2012, these examples will not work, but I still recommend reading on, to learn about these new functions.
'Order by' for aggregate functions
I already showed you how you can use 'OVER' to partition SUM, AVG, or COUNT. However, you can also do this:
select distinct userId, count(login) over (partition by userId order by login) as loginCount from trackAccess
It doesn't look very interesting, because what it does is not clear. Try this one:
select top 100 unitprice from Sales.SalesOrderDetail order by salesOrderId, productId
Now try with the partition:
select top 100 SalesOrderId, sum(UnitPrice) over (partition by SalesOrderId) from Sales.SalesOrderDetail order by SalesOrderId
This is now showing us total prices per order, which is why the same value is repeated, once per line item.
What happens when we add the 'order by'?
select top 100 SalesOrderId, sum(UnitPrice) over (partition by SalesOrderId order by productId) from Sales.SalesOrderDetail order by SalesOrderId
Order by does something very interesting, as well as (initially) counter-intuitive: it gives us a running total. The numbers keep climbing until they reach the totals we saw before, because each row is aggregated with all the values before it.
LEAD and LAG
Lead and lag allow us to tell SQL Server to look ahead or behind by X rows from the row we're in. For example, here is code using lag to show, for each login, when the prior login was:
select userId, login, lag(login, 1, null) over (partition by userid order by login) as lastLogin from trackAccess
The next query uses lead to look for the next login, but then subtracts the current logout from it, so we get the amount of time the user spent logged out between sessions:
select userid, cast( (lead(login, 1, null) over (partition by userid order by login) - logout) as time) as timeNotLoggedIn from trackAccess
Note that 'null' in this case effectively means 'indefinitely': no further logins have been recorded for this user.
FIRST_VALUE and LAST_VALUE
Instead of leading and lagging by a specific number of rows, you can jump directly to the first or last value in the window. Here's a first try:
select userId, first_value(login) over (partition by userId order by login) as firstLogin, LAST_VALUE(login) over (partition by userId order by login) as lastLogin from trackAccess
This returns the following:
userId firstLogin lastLogin
1 2012-01-01 10:00:00.000 2012-01-01 10:00:00.000
1 2012-01-01 10:00:00.000 2012-01-02 09:32:12.000
1 2012-01-01 10:00:00.000 2012-01-03 10:04:53.000
2 2012-01-01 09:08:32.000 2012-01-01 09:08:32.000
2 2012-01-01 09:08:32.000 2012-01-02 09:12:32.000
2 2012-01-01 09:08:32.000 2012-01-03 10:12:41.000
2 2012-01-01 09:08:32.000 2012-01-03 14:27:21.000
3 2012-01-01 08:40:27.000 2012-01-01 08:40:27.000
3 2012-01-01 08:40:27.000 2012-01-02 11:12:13.000
3 2012-01-01 08:40:27.000 2012-01-04 15:27:43.000
It looks like the first value is always found correctly, but the last value is different on every row; in fact, it is just the current row's login.
What's going on?
Remember how we saw above that a window function on SUM or AVG can now have an order by? The syntax is the same here, and we get the same kind of result. You may recall that using order by causes SQL Server to return a cumulative total. The reason for this is that you can specify the range a window operates on, in terms of either ROWS or RANGE. The difference is that RANGE specifies a logical association and ROWS specifies a physical location in the data set. I admit that I'm not sure exactly what that means in practice; I always use ROWS.
Here are some sample values:
ROWS x PRECEDING - look back x rows from the current one ( not allowed for RANGE )
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW - all previous rows and the current row
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING - All rows
ROWS BETWEEN CURRENT ROW AND x FOLLOWING - the current row and the x rows after it, if they exist ( not allowed for RANGE )
The default for this is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is why using an order by gives us a cumulative total: by default, it will sum or avg the values from the start of the data window up to the current row. It is also why LAST_VALUE gave us a different value on every row: no rows past the current one are examined, so the 'last' value is always the current row's.
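To make the default concrete, here is the earlier running total with the frame spelled out explicitly. This is only a sketch: I've written ROWS rather than the RANGE default, which gives the same figures as long as productId isn't repeated within an order.
-- the explicit frame behaves like the implicit default used in the running total earlier
select top 100 SalesOrderId, sum(UnitPrice) over (partition by SalesOrderId order by productId rows between unbounded preceding and current row) as runningTotal from Sales.SalesOrderDetail order by SalesOrderId
Knowing that, the fix for LAST_VALUE is simply to extend the frame to the end of the window.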
Now run this:
select distinct userId, first_value(login) over (partition by userId order by login) as firstLogin, LAST_VALUE(login) over (partition by userId order by login ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as lastLogin from trackAccess
We still need the DISTINCT, because otherwise it will return one row for every row in the table, but now that we look for the last value in a range that runs to the end of the window, we get the required results and nothing else.
OFFSET and FETCH NEXT
In SQL Server 2012 there is new syntax that makes paging even easier than using row_number(). It only works if your select has an order by (because otherwise the order is indeterminate and there can't be a specific, consistent position for a row). You can specify OFFSET to skip a number of rows, and FETCH NEXT to specify how many rows to select. You cannot do FETCH NEXT without an OFFSET, but you can do an OFFSET without a FETCH statement. Here is an example:
declare @pageSize int = 20
declare @page int = 5
SELECT
TransactionID
, ProductID
, TransactionDate
, Quantity
, ActualCost
FROM
Production.TransactionHistory
ORDER BY TransactionDate DESC
OFFSET (@page - 1) * @pagesize ROWS
FETCH NEXT @pagesize ROWS ONLY;
FETCH can use FIRST or NEXT, but they both mean the same thing; I presume the variations in FETCH syntax for use in cursors are the reason for this. You can also use ROW or ROWS.
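For example, this is just a different spelling of the query pattern above; swapping the keywords does not change the behaviour:
-- FIRST/NEXT and ROW/ROWS are interchangeable
SELECT TransactionID, ProductID, TransactionDate
FROM Production.TransactionHistory
ORDER BY TransactionDate DESC
OFFSET 0 ROWS
FETCH FIRST 1 ROW ONLY;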
Obviously, TOP cannot be used with OFFSET and FETCH.
Statistical functions
There are four new statistical functions.
CUME_DIST
Calculates the cumulative distribution of a value in a group of values in SQL Server 2012. That is, CUME_DIST computes the relative position of a specified value in a group of values. For a row r, assuming ascending ordering, the CUME_DIST of r is the number of rows with values lower than or equal to the value of r, divided by the number of rows evaluated in the partition or query result set. CUME_DIST is similar to the PERCENT_RANK function.
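This isn't one of the examples from the download, but as a quick sketch against the sales detail table we've been using, the following shows, for each order, how far up the order each line's quantity sits:
-- CUME_DIST: the fraction of rows in the partition with a value less than or equal to this row's value
select SalesOrderId, OrderQty,
cume_dist() over (partition by SalesOrderId order by OrderQty) as qtyCumeDist
from Sales.SalesOrderDetail
order by SalesOrderId, OrderQty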
PERCENTILE_CONT
Calculates a percentile based on a continuous distribution of the column value in SQL Server 2012. The result is interpolated and might not be equal to any of the specific values in the column.
PERCENTILE_DISC
Computes a specific percentile for sorted values in an entire rowset or within distinct partitions of a rowset in SQL Server 2012. For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P. For example, PERCENTILE_DISC (0.5) will compute the 50th percentile (that is, the median) of an expression.
PERCENTILE_DISC calculates the percentile based on a discrete distribution of the column values; the result is equal to a specific value in the column. PERCENTILE_CONT interpolates the appropriate value, whether or not it exists in the data set, while PERCENTILE_DISC always returns an actual value from the set, and for this reason, the values between the two functions may differ.
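Again as a sketch rather than one of the download examples, this puts the two side by side; the median unit price per product will sometimes differ between them, because the CONT value is interpolated while the DISC value must be an actual unit price from the data:
-- median unit price per product: interpolated (CONT) vs taken from the data (DISC)
select distinct ProductID,
percentile_cont(0.5) within group (order by UnitPrice) over (partition by ProductID) as medianCont,
percentile_disc(0.5) within group (order by UnitPrice) over (partition by ProductID) as medianDisc
from Sales.SalesOrderDetail
order by ProductID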
PERCENT_RANK
Calculates the relative rank of a row within a group of rows in SQL Server 2012. Use PERCENT_RANK to evaluate the relative standing of a value within a query result set or partition. PERCENT_RANK is similar to the CUME_DIST function.
Percent rank uses percentages (that is, a range of 0 to 1) to apply a rank. The formula for percent_rank is:
percent_rank = (rank - 1) / (totalRows - 1)
So, for example, in a window of 10 rows, the row ranked 4 gets a percent_rank of (4 - 1) / (10 - 1) = 0.333.
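One last sketch (again, not from the download) puts PERCENT_RANK next to CUME_DIST so you can see how similar they are; both return values between 0 and 1, with PERCENT_RANK always starting at 0 for the lowest value in each partition:
-- PERCENT_RANK = (rank - 1) / (rows in partition - 1); CUME_DIST alongside for comparison
select SalesOrderId, OrderQty,
percent_rank() over (partition by SalesOrderId order by OrderQty) as qtyPercentRank,
cume_dist() over (partition by SalesOrderId order by OrderQty) as qtyCumeDist
from Sales.SalesOrderDetail
order by SalesOrderId, OrderQty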
I admit that the definitions above are copied from MSDN; I am not deeply familiar with the use of any of these functions, but I've included AdventureWorks examples of using all of them in the downloadable SQL file, as a starting point if you want to dig in further. My samples also came from the web (mostly MSDN), because, as I said, I don't have a deep understanding of the use of these functions.
Conclusion
Hopefully this article has opened your eyes to some of the things that can be done with windowing functions, which go well beyond creating a simple aggregation of the values in a column. They are a very powerful feature of SQL Server, and one you'd do well to become comfortable using and understanding.