Correlated Subqueries in SQL

essentialSQL

7 Mar 2022 | CPOL | 6 min read

Comparison of a correlated subquery to a join

Correlated subqueries are used to tie an inner query’s criteria to values within the outer query. They are a powerful technique to avoid “hard coding” values. In this article, we look at a couple of examples and compare a correlated subquery to a join.

All the examples for this lesson are based on Microsoft SQL Server Management Studio and the AdventureWorks2012 database.

Correlated Queries

There are ways to incorporate the outer query’s values into the subquery’s clauses. These types of queries are called correlated subqueries, since the results from the subquery are connected, in some form, to values in the outer query. They are sometimes called synchronized queries.

If you’re having trouble knowing what correlate means, check out this definition from Google:

Correlate: “have a mutual relationship or connection, in which one thing affects or depends on another.” (Google)

A typical use for a correlated subquery is to reference one of the outer query’s columns in the inner query’s WHERE clause. This makes sense because, in many cases, you want to restrict the inner query to a subset of data.

Example

We’ll provide a correlated subquery example by reporting back each SalesOrderDetail LineTotal along with the average LineTotal for its sales order.

This request differs significantly from our earlier examples since the average we’re calculating varies for each sales order.

This is where correlated subqueries come into play. We can use a value from the outer query and incorporate it into the filter criteria of the subquery.

Let’s take a look at how we calculate the average line total. To do this, I’ve put together an illustration that shows the SELECT statement with subquery.

To elaborate on the diagram: the SELECT statement consists of two portions, the outer query and the subquery. The outer query is used to retrieve all SalesOrderDetail lines. The subquery is used to find and summarize the sales order detail lines for a specific SalesOrderID.

If I were to verbalize the steps we are going to take, I would summarize them as:

1. Get the SalesOrderID.
2. Return the average LineTotal from all SalesOrderDetail items where the SalesOrderID matches.
3. Continue on to the next SalesOrderID in the outer query and repeat steps 1 and 2.

The query you can run in the AdventureWorks2012 database is:

SQL

SELECT SalesOrderID,
       SalesOrderDetailID,
       LineTotal,
       (SELECT AVG(LineTotal)
          FROM   Sales.SalesOrderDetail
         WHERE  SalesOrderID = SOD.SalesOrderID)
                AS AverageLineTotal
FROM   Sales.SalesOrderDetail SOD

Here are the results of the query:

There are a couple of items to point out.

You can see I used column aliases to help make the query results easier to read.
I also used a table alias, SOD, for the outer query. This makes it possible to use the outer query’s values in the subquery. Otherwise, the query isn’t correlated!
Using the table alias makes it unambiguous which columns come from which table.
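To see why the alias matters, here is a sketch of the same query with the alias left out (an illustration, not part of the original example). Inside the subquery, an unqualified SalesOrderID resolves to the subquery’s own table, so the comparison is true for every row and the subquery averages the entire table:

```sql
-- Pitfall sketch: no outer alias, so the subquery is NOT correlated.
-- Both sides of the comparison refer to the inner Sales.SalesOrderDetail,
-- which makes the predicate true for every row.
SELECT SalesOrderID,
       SalesOrderDetailID,
       LineTotal,
       (SELECT AVG(LineTotal)
          FROM Sales.SalesOrderDetail
         WHERE SalesOrderID = SalesOrderID) AS AverageLineTotal
FROM   Sales.SalesOrderDetail
```

Every row comes back with the same AverageLineTotal, which is a quick way to spot a subquery that was meant to be correlated but isn’t.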

Breaking Down Correlated Nested Queries

Let’s now try to break this down using SQL.

To start, let’s assume we’re going to just get our example for SalesOrderDetailID 20. The corresponding SalesOrderID is 43661.

Getting the average LineTotal for this sales order is easy:

SQL

SELECT AVG(LineTotal)
FROM   Sales.SalesOrderDetail
WHERE  SalesOrderID = 43661

This returns the value 2181.765240.

Now that we have the average, we can plug it into our query:

SQL

SELECT SalesOrderID,
       SalesOrderDetailID,
       LineTotal,
       2181.765240 AS AverageLineTotal
FROM   Sales.SalesOrderDetail
WHERE  SalesOrderDetailID = 20

Using subqueries, this becomes:

SQL

SELECT SalesOrderID,
       SalesOrderDetailID,
       LineTotal,
       (SELECT AVG(LineTotal)
          FROM Sales.SalesOrderDetail
         WHERE SalesOrderID = 43661) AS AverageLineTotal
FROM   Sales.SalesOrderDetail
WHERE  SalesOrderDetailID = 20

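Finally, replacing the hard-coded 43661 with the outer query’s SOD.SalesOrderID, and dropping the filter on SalesOrderDetailID, gives the final query: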

SQL

SELECT SalesOrderID,
       SalesOrderDetailID,
       LineTotal,
       (SELECT AVG(LineTotal)
          FROM Sales.SalesOrderDetail
         WHERE SalesOrderID = SOD.SalesOrderID) AS AverageLineTotal
FROM   Sales.SalesOrderDetail AS SOD
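
For comparison, the same result can also be sketched with a join instead of a correlated subquery. This version is an alternative, not the approach used above: it computes the average once per SalesOrderID in a derived table (aliased A here, a name chosen only for illustration) and joins it back to the detail lines:

```sql
-- Alternative sketch: join each detail line to a per-order average.
-- The derived table A holds one row per SalesOrderID.
SELECT SOD.SalesOrderID,
       SOD.SalesOrderDetailID,
       SOD.LineTotal,
       A.AverageLineTotal
FROM   Sales.SalesOrderDetail AS SOD
       INNER JOIN (SELECT   SalesOrderID,
                            AVG(LineTotal) AS AverageLineTotal
                   FROM     Sales.SalesOrderDetail
                   GROUP BY SalesOrderID) AS A
       ON A.SalesOrderID = SOD.SalesOrderID
```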

Using a Correlated Subquery with a Different Table

A correlated subquery, or for that matter, any subquery, can use a different table than the outer query. This can come in handy when you’re working with a “parent” table, such as SalesOrderHeader, and you want to include in the result a summary of child rows, such as those from SalesOrderDetail.

Let’s return the OrderDate, TotalDue, and number of sales order detail lines for each order. We can use the following diagram to gain our bearings:

We’ll include a correlated subquery in our SELECT statement to return the COUNT of SalesOrderDetail lines. We’ll ensure we are counting the correct SalesOrderDetail items by filtering on the outer query’s SalesOrderID.

Here is the final SELECT statement:

SQL

SELECT SalesOrderID,
       OrderDate,
       TotalDue,
       (SELECT COUNT(SalesOrderDetailID)
          FROM Sales.SalesOrderDetail
         WHERE SalesOrderID = SO.SalesOrderID) as LineCount
FROM   Sales.SalesOrderHeader SO

The results are:

Some things to notice with this example are:

The subquery is selecting data from a different table than the outer query.
I used table and column aliases to make it easier to read the SQL and results.
Be sure to double-check your WHERE clause! If you forget to include the table name or alias in the subquery’s WHERE clause, the query won’t be correlated.

Correlated Subqueries versus Inner Joins

It is important to understand that you can get the same results using either a subquery or a join. Though both return the same results, there are advantages and disadvantages to each method!

Consider the last example, where we count line items for SalesOrderHeader rows.

SQL

SELECT SalesOrderID,
       OrderDate,
       TotalDue,
       (SELECT COUNT(SalesOrderDetailID)
          FROM Sales.SalesOrderDetail
         WHERE SalesOrderID = SO.SalesOrderID) as LineCount
FROM   Sales.SalesOrderHeader SO

This same query can be written using an INNER JOIN along with GROUP BY, provided every order has at least one detail line:

SQL

SELECT   SO.SalesOrderID,
         OrderDate,
         TotalDue,
         COUNT(SOD.SalesOrderDetailID) as LineCount
FROM     Sales.SalesOrderHeader SO
         INNER JOIN Sales.SalesOrderDetail SOD
         ON SOD.SalesOrderID = SO.SalesOrderID
GROUP BY SO.SalesOrderID, OrderDate, TotalDue
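
One caveat worth checking on your own data: the two forms only match when every header row has at least one detail row. The sketch below uses two small table variables (@Header and @Detail are made-up stand-ins, not AdventureWorks tables) to show where the forms differ, and how a LEFT JOIN brings the join version back in line with the subquery:

```sql
-- Hypothetical sample data: order 2 has no detail lines.
DECLARE @Header TABLE (SalesOrderID int PRIMARY KEY);
DECLARE @Detail TABLE (SalesOrderDetailID int PRIMARY KEY, SalesOrderID int);

INSERT INTO @Header VALUES (1), (2);
INSERT INTO @Detail VALUES (10, 1), (11, 1);

-- Correlated subquery: returns both orders; order 2 gets LineCount = 0.
SELECT H.SalesOrderID,
       (SELECT COUNT(D.SalesOrderDetailID)
          FROM @Detail AS D
         WHERE D.SalesOrderID = H.SalesOrderID) AS LineCount
FROM   @Header AS H;

-- INNER JOIN: order 2 has no matching detail row, so it is dropped.
SELECT   H.SalesOrderID,
         COUNT(D.SalesOrderDetailID) AS LineCount
FROM     @Header AS H
         INNER JOIN @Detail AS D
         ON D.SalesOrderID = H.SalesOrderID
GROUP BY H.SalesOrderID;

-- LEFT JOIN: keeps order 2; COUNT ignores the NULL detail ID, giving 0.
SELECT   H.SalesOrderID,
         COUNT(D.SalesOrderDetailID) AS LineCount
FROM     @Header AS H
         LEFT JOIN @Detail AS D
         ON D.SalesOrderID = H.SalesOrderID
GROUP BY H.SalesOrderID;
```

Run the whole script as a single batch, since table variables only live for the batch that declares them.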

Which One Is Faster?

You’ll find that many folks will say to avoid subqueries as they are slower. They’ll argue that the correlated subquery has to “execute” once for each row returned in the outer query, whereas the INNER JOIN only has to make one pass through the data.

Myself? I say check out the query plan. I followed my own advice for both of the examples above and found the plans to be the same!

That isn’t to say the plans wouldn’t change if there were more data, but my point is that you shouldn’t just make assumptions. Most SQL DBMS optimizers are really good at figuring out the best way to execute your query. They take the syntax you write, whether a subquery or an INNER JOIN, and use it to create an actual execution plan.
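
One way to follow this advice in SQL Server, sketched here as a suggestion and not necessarily how the comparison above was made, is to turn on I/O and timing statistics, run both forms in the same batch, and compare what appears on the Messages tab. In Management Studio, you can also enable Include Actual Execution Plan before running the batch to see the two plans side by side.

```sql
-- Report logical reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Form 1: correlated subquery
SELECT SalesOrderID,
       OrderDate,
       TotalDue,
       (SELECT COUNT(SalesOrderDetailID)
          FROM Sales.SalesOrderDetail
         WHERE SalesOrderID = SO.SalesOrderID) AS LineCount
FROM   Sales.SalesOrderHeader SO;

-- Form 2: INNER JOIN with GROUP BY
SELECT   SO.SalesOrderID,
         OrderDate,
         TotalDue,
         COUNT(SOD.SalesOrderDetailID) AS LineCount
FROM     Sales.SalesOrderHeader SO
         INNER JOIN Sales.SalesOrderDetail SOD
         ON SOD.SalesOrderID = SO.SalesOrderID
GROUP BY SO.SalesOrderID, OrderDate, TotalDue;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The numbers will vary with your hardware, indexes, and data volume, so treat them as a comparison on your system, not a general verdict.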

Which One Is Easier to Read?

Depending upon what you’re comfortable with, you may find the INNER JOIN example easier to read than the correlated query. Personally, in this example, I like the correlated subquery as it seems more direct. It is easier for me to see what is being counted.

In my mind, the INNER JOIN is less direct. First, you have to see that all the sales detail rows are being returned and then summarized. You don’t really get this until you read the entire statement.

Which One Is Better?

Let me know what you think. I would like to hear whether you would prefer to use the correlated subquery or INNER JOIN example.

Correlated Subqueries in HAVING Clause

As with any other subquery, subqueries in the HAVING clause can be correlated with fields from the outer query.

Suppose we further group the job titles by marital status and only want to keep those combinations of job title and marital status whose average vacation hours are greater than the average for their corresponding overall marital status.

In other words, we want to answer a question similar to “do married accountants have, on average, more remaining vacation than married employees in general?”

One way to find out is to use the following query:

SQL

SELECT   JobTitle,
         MaritalStatus,
         AVG(VacationHours)
FROM     HumanResources.Employee AS E
GROUP BY JobTitle, MaritalStatus
HAVING   AVG(VacationHours) > 
            (SELECT AVG(VacationHours)
             FROM   HumanResources.Employee
             WHERE  HumanResources.Employee.MaritalStatus =
                    E.MaritalStatus)

There are a couple of things to point out. First, notice that I aliased the Employee as “E” in the outer query. This allows me to reference the outer table within the inner query.

Also, with the correlated query, only outer-query fields that appear in the GROUP BY can be referenced in the inner query. For instance, for kicks and grins, I tried replacing MaritalStatus with Gender and got an error.

SQL

SELECT   JobTitle,
         MaritalStatus,
         AVG(VacationHours)
FROM     HumanResources.Employee AS E
GROUP BY JobTitle, MaritalStatus
HAVING   AVG(VacationHours) > 
            (SELECT AVG(VacationHours)
             FROM   HumanResources.Employee
             WHERE  HumanResources.Employee.Gender =
                    E.Gender)

This is a broken query. If you try to run it, you’ll get the following error:

Column ‘HumanResources.Employee.Gender’ is invalid in the HAVING clause 
because it is not contained in either an aggregate function or the GROUP BY clause.
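
If a comparison by Gender is what you actually want, one possible fix (a sketch, assuming you are happy to group by Gender instead of MaritalStatus) is to add the column to the GROUP BY so that the outer reference becomes valid. The E2 alias for the inner table and the AverageVacationHours column alias are introduced here for readability:

```sql
-- Sketch of a working variant: Gender is now a grouping column,
-- so E.Gender may be referenced from the HAVING subquery.
SELECT   JobTitle,
         Gender,
         AVG(VacationHours) AS AverageVacationHours
FROM     HumanResources.Employee AS E
GROUP BY JobTitle, Gender
HAVING   AVG(VacationHours) >
            (SELECT AVG(VacationHours)
             FROM   HumanResources.Employee AS E2
             WHERE  E2.Gender = E.Gender)
```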

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)