SQL Wizardry Part Three - Common Table Expressions (CTEs)

Christian Graus

4.99/5 (19 votes)

12 Jan 2014CPOL9 min read

28.3K

276

In this third installment, we look at how Common Table Expressions can simplify your SQL and help you perform complex tasks

Download SQL_Wizardry_03.zip - 1.3 KB

Introduction

Welcome to my article on Common Table Expressions, or CTEs. A common table expression is a great way of segmenting out otherwise complex SQL to increase readability, but it’s a lot more than that. Let’s get started.

Background

If you didn’t read my earlier articles, you may not realise that I use the AdventureWorks database for some of my examples. I will use other examples today, but it would still help if you had AdventureWorks installed, you can get it here.

Example 1 – calculating two averages.

From my earlier articles, you should know what this does:

SQL

select sum(unitprice) as totalOrders, orderdate from sales.salesorderdetail sod 
inner join  sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
group by orderdate
order by orderdate desc

This will give you order totals for each day, sorted by day. But what if you want to know averages per month ?

SQL

select avg(unitprice) as averageOrders, month(orderdate), year(orderdate) from sales.salesorderdetail sod 
inner join  sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
group by year(orderdate), month(orderdate)
order by year(orderdate), month(orderdate)

Easy, right ? Now, what if you want to show the monthly averages next to a yearly average, for comparison ? That means your group by won’t work, you need to group by year for a yearly average, and month for a monthly average. So, what do you do ? CTEs to the rescue !!

Let’s first illustrate the CTE concept by using a CTE to reproduce the query above:

SQL

with averages
as
(
select avg(unitprice) as averageOrders, month(orderdate) as month, year(orderdate) as year from sales.salesorderdetail sod 
inner join  sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
group by year(orderdate), month(orderdate)
)

select * from averages order by year, month

Really, a CTE has not added anything here. But, there are some important points we can glean from this code: 1 – I had to give aliases to all my columns in my SQL in order for the CTE to work 2 – I could not do an order by within the CTE, I had to do it from my final select

In fact, this also works:

SQL

with averages(averageOrders, month, year)
as
(
select avg(unitprice), month(orderdate), year(orderdate) from sales.salesorderdetail sod 
inner join  sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
group by year(orderdate), month(orderdate)
)

select * from averages order by year, month

And this brings up an important point. The names of the columns INSIDE the CTE are irrelevant, if you specify the names at the top of the CTE. In other words, if I did this:

SQL

with averages(averageOrders, year, month)
as
(
select avg(unitprice), month(orderdate) as month, year(orderdate) as year from sales.salesorderdetail sod 
inner join  sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
group by year(orderdate), month(orderdate)
)

select * from averages order by year, month

You will find that the order is messed up, and the columns are named wrong. Once you specify names at the top of a CTE, those are the names given, the only rule is that you need to give the same number of aliases as there are columns, then they are applied left to right, without any concern for the column names returned by the SQL inside the CTE. If you ever work on large CTEs and add columns, this can trap you. I recommend not giving aliases at the top of the CTE if you can avoid it, because it’s a great place for bugs to creep in unawares.

So, this is useless, right ? What if we also want to show averages for the year ? What we do when we want to find several totals that are incompatible with one another, is we use a CTE for each, then join them at the end. Like so:

SQL

with averagesByMonth  
as
(
select avg(unitprice) as averageOrders, month(orderdate) as month, year(orderdate) as year from sales.salesorderdetail sod 
inner join  sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
group by year(orderdate), month(orderdate)
),
averagesByYear as
(
select avg(unitprice) as averageOrders, year(orderdate) as year from sales.salesorderdetail sod 
inner join  sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
group by year(orderdate)

)

select m.AverageOrders as AverageByMonth, y.AverageOrders as AverageByYear, m.Month, m.year  from averagesByMonth m
inner join averagesByYear y on m.year = y.year
 order by m.year, m.month

The other thing that you can get with CTEs, is to simplify the SQL for calculations. For example, if you wanted to show averages by sales rep, you would only have to return the sales rep id from the CTE, in order to bind to your select, which would also select things like the rep’s name and location ( which requires further joins inside AdventureWorks ). This both stops errors caused by joins excluding values silently, and makes your code more readable. I think of it as a type of object orientation ( just because that’s how my mind works ). It allows me to define an object whose responsibility is to calculate one thing, and I can then reuse that object in my other code.

Another example:

I’ve had a lot of requests for full SQL to create tables and so on inside my articles, so I am going to do that this time. This example came out of a question asked on our forums. What if you have a table that looks like this:

SQL

create TABLE students(studentid INT IDENTITY(1,1), professorid INT, lectureid INT)

This marries a student to a lecture and a professor. The table design is broken, I think. A lecture should seperately store which professor(s) will give it. However, let’s run with this example for now.

So, what this person wanted, was a list of all students, ordered by the average number of students that each professor saw across all lectures. In other words, to order by the average of the count of students divided by the count of lectures. In order to do this, you need to group your query by professors, which then gives you groups, not individual records. Again, the solution is a CTE. All of this code is in the download, but here is the code to insert values to the table:

SQL

INSERT INTO students (professorid, lectureid)
VALUES(101, 401), (101, 402),(101, 402),(101, 402),
(101, 401), (101, 402),
(102, 403), (102, 404),(102, 404),(102, 404),(102, 404),(102, 404),(102, 404),(102, 404),
(103, 405), (104, 406),
(101, 406),(101, 401),(101, 406),(101, 406),(101, 406),(101, 406),(101, 406),
(103, 406),(103, 406),(103, 406),(103, 406),(103, 408),(103, 408),(103, 408),(103, 408),(103, 408),(103, 408),(103, 408),(103, 408),
 (102, 408),(102, 408),(102, 408), (103, 405), (103, 405), (103, 405), (103, 405)

This creates 42 rows. The SQL to get a list of averages is as follows: select professorId, count(studentid)/count( distinct lectureid) as average from students group by professorId

It returns this:

SQL

professorId	average
101	4
102	3
103	5
104	1

So there’s four professors across 42 students, with average student counts across all lectures of 1,3,4 and 5. Now we wrap this in a CTE and use the CTE to sort the list from the main table, like so:

SQL

with avgStudents as
(
 select professorId, count(studentid)/count( distinct lectureid) as average from students group by professorId
)
 
select s.studentid, av.professorid, s.lectureid, av.average from students s
inner join avgStudents av on av.professorId = s.professorId
order by av.average desc

The fundamental concept is, use a CTE to calculate the values you need, and return them with a primary key you can use to join the values back to the main tables you’re going to do your main select from.

RECURSION

Everything we’ve seen so far is convenient and helpful to writing readable queries. I always like to say, writing code the compiler understands is easy. The trick is to write code humans can understand. However, can CTEs do anything that is otherwise impossible ? The answer is yes. CTEs allow you to write recursive queries ( that is, queries that call themselves over and over until they reach a terminating point ).

I’ve seen a lot of examples on the web showing how this works. One common one is to show an org chart in a company. I’ve decided on another common example, that of a forum post tree. If you look at that bottom of this article, or in the forums on this site, you’ll see that you can reply to any post you like, and they are shown as a tree. Although I’m sure the actual SQL is more intelligent, this is often shown in examples with a structure like this:

SQL

create table postHistory
(
	id int,
	parentid int
)

The idea is simple – each post knows who it’s parent is. If the parent is null, it’s a top level post. Of course, a real post would also have a body, a title, and the id of who posted it ( and possibly would link to another table to store who voted on that post, and what they voted ). The question becomes, how do I get a list of posts that form a single post tree ? You could write SQL like this:

SQL

declare @id int = 1

Select * from posthistory p1
Inner join posthistory p2 on p2.parentid = p1.id 
Inner join posthistory p3 on p3.parentid = p2.id 
where p1.id = @id

but, you need to add self joins for every level, and this is also expensive, because each self join creates a new expensive join, the cost is exponential. This is not maintainable. Enter the CTE. The concept is simple. We create a CTE and within it, we create a UNION ALL which refers back to the CTE by name, causing it to call itself. Here is the SQL:

SQL

declare @id int = 1;

with postlist as
(
 select 1 as level, id, parentid from postHistory where id = @id 
  union all
 select pl.level + 1 as level, ph.id, ph.parentid from postHistory ph
 inner join postlist pl on ph.parentid = pl.id 
)

select * from postlist order by level, id

So, from the top, we’re going to search for a specific post, so we define the id of that post. Then we define our CTE, called postlist.

The first level looks like this:

SQL

select 1 as level, id, parentid from postHistory where id = @id

select 1 as level is an anchor. It defines the top level of the heirarchy level field we want to return. This field will tell us what level each post is, and allows us to order by level.

This will return the top post only. This is the top record in our tree, and everything we want to return is based on walking down a tree from this level. The rest of the CTE looks like this:

SQL

 union all
select pl.level + 1 as level, ph.id, ph.parentid from postHistory ph
inner join postlist pl on ph.parentid = pl.id

A recursive CTE requires that we use UNION ALL and not just UNION. The first select increments the level. It’s perhaps the clearest example of how our recursion is working. We then join a new instance of the postHistory table to postlist, which is the name of our CTE. This has the effect of causing the full CTE to be the second part of the join, which then calls itself. Because we just want to walk to the end, there’s no WHERE statement in our second select. If you execute this, you will see a result like this:

SQL

level	id	parentid
1	1	NULL
2	3	1
2	4	1
3	5	3
3	6	4
3	7	3
3	8	3
3	9	4
4	10	5
4	11	5
4	13	5
4	14	6
5	12	10
5	15	14
5	16	14

As you can see, this has called itself all the way down the tree. As always, we could join the CTE result to other tables to pull out other information if we needed to ( for example, total post votes if they existed ). You can change the value of @id to 2 to see how it selects the other tree.

A more canonical example would be to select ALL hierarchies, although in this case, it’s the same as a select *.

SQL

with postlist as
(
 select 1 as level, id, parentid from postHistory where parentid is null
  union all
 select pl.level + 1 as level, ph.id, ph.parentid from postHistory ph
 inner join postlist pl on ph.parentid = pl.id
 where ph.parentid is not null
)

select * from postlist order by parentid

A more sensible reason to do this, is to list each post in order, as follows:

SQL

with postlist as
(
 select id as root, id, parentid from postHistory where parentid is null
  union all
 select pl.root as root, ph.id, ph.parentid from postHistory ph
 inner join postlist pl on ph.parentid = pl.id
 where ph.parentid is not null
)

select * from postlist order by root, id

Here the value that’s passed through is not incremented, meaning for each group of recursion steps, the top level value is maintained and used to order by each post tree in turn. It’s assumed that each post id is sequential, in the real world, a date column would probably exist to ensure this was the case.

Final Notes

If you use a CTE query with other SQL, the last statement before your CTE needs to end in a ;, and so does your CTE SQL, if you have anything after. For this reason, a lot of CTE examples start with a ;, because it either does no harm, or makes the CTE work with preceding SQL.
You can’t use CTEs more than once, you can have as many CTEs as you like, but they need to all be used in one query.
CTEs can be more efficient than temporary tables, but temporary tables can have an index, and CTEs cannot.
A CTE that is used more than once in your SQL, can confuse the optimiser and cause the SQL to slow down. If you use the same CTE name twice in your SQL and it’s slow, try declaring two identical CTEs and using each once.
A CTE can make SQL look neater than is true. If you write a subquery using a CTE, it’s going to slow things down, the same as if you wrote it inline. Often though, you can use a CTE to avoid subqueries, by instead joining to a CTE that contains the result of what could have been a subquery, and your primary key for joining.

Hopefully this has opened up a whole world for you in SQL. Learning about CTEs certainly did that for me, it makes things that see incredibly complex to do in one query, super simple, and it makes your code far more readable, and therefore maintainable, also.

For next weeks entry, I am intending on writing about passing arbitrary lists of values to SQL Server, using CSVs, XML, or user defined table types.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)