Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / SQL

Joins Versus Subqueries SQL Puzzle

5.00/5 (2 votes)
30 Dec 2017MIT5 min read 5.8K  
Joins versus subqueries in SQL

In this puzzle, we’re going to learn about joins versus subqueries. In many queries, you can substitute joins and subqueries. Yet, since each has its own strengths, it isn’t wise to do so. Once you have worked through this puzzle, you see some of the benefits and disadvantages of joins versus subqueries.

Solving puzzles is a great way to learn SQL. Nothing beats practicing what you’ve learned. Once you have figured out the puzzle, post your answer in the comments so we can all learn from one another.

SQL Puzzle Question – Joins Versus Subqueries

The system admin would like a report of active users. Can you write a report that returns the name, phone number and most recent date for any user that has logged in the 30 days since March 14th, 2017?

(You can tell a user has logged in if the action field in UserHistory is set to “Logged On”).

User

  • userID
  • name
  • phoneNumber

UserHistory

  • userID
  • actionDate
  • action

Every time a user logs in, a new row is inserted into the UserHistory table with userID, current date and action (where action = “Logged On”).

Question #1 – Write a SQL query without using Subqueries.

Question #2 – Write the SQL using subqueries.

As you answer the questions, think about the differences of joins versus subqueries. Which technique is better suited to solve the problem? Can you see advantages to either case?

To get you started, download this script which contains the table definitions and some test data.

Answer to Question #1 – INNER JOIN

Like most queries I write, I took this one step by step. Let me show you what I mean by walking you through my thought process.

Regardless of how this query turned out, I knew I would have to figure out how to get the logins 30 days since 2017-03-14. To do this, I wrote a simple query that used the DATEDIFF function to calculate the number of days from the actionDate to the 14th. If the days were less than or equal to 30, I knew the entry was within the window.

SQL
SELECT *,
       DATEDIFF(DAY, actionDate, '2017-03-14')
FROM   @UserHistory
WHERE  action = 'Logged On'
       AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30;

You’ll see that I included the result of the DATEDIFF in query result. I did this because I always get the parameter mixed up! By displaying the result, I could verify that I had the formula correct; therefore, being used in my WHERE clause as expected.

Here is what I saw:

Joins versus Subqueries - DATEDIFF result

Now that I knew I could get actionDate within the window, I proceeded to join the UserHistory with User so I could include the PhoneNumber in my results. I joined the tables by userID.

SQL
SELECT U.Name,
       U.PhoneNumber,
       UH.actionDate AS RecentLogonDate
FROM   @User AS U
       INNER JOIN @UserHistory AS UH
       ON U.userID = UH.userID
WHERE  action = 'Logged On'
       AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30;

Given there are several UserHistory entries per user, I would expect to see duplicates. I’ve highlighted some of those below:

Joins versus Subqueries - Duplicate Entries Due to Join

The last task which remains is to only show the most recent login date for each user. For Bob, this would be 2017-03-02.

Notice, that for each user, the most recent date is the maxim date for that user. This is a big hint, as it points us towards using aggregate functions such as MAX.

Joins versus Subqueries - GROUBYBY hint

So, at this point, we can use GROUP BY to calculate the maximum date for each user.

SQL
SELECT   U.Name,
         U.PhoneNumber,
         MAX(UH.actionDate) AS RecentLogonDate
FROM     @User AS U
         INNER JOIN @UserHistory AS UH
         ON U.userID = UH.userID
WHERE    action = 'Logged On'
         AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30
GROUP BY U.Name, U.phoneNumber;

Whose result is:

Joins versus Subqueries - Final INNER JOIN Result

There are a couple of takeaways from this:

  • First, here is no shame is doing a query to get intermediate results. When using formulas, such as DATEDIFF, make sure you’re getting the calculation you expect.
  • Second, it can take more than one technique to solve a problem. In our case, we had first join the items, and once we had the result, use GROUP BY to get the most recent date.
  • Last, keep it simple! At first, I thought I would have to do another join so I could compare dates, but I realized that was going to get complicated. Also, I figured I could partition the data and use a window function, but that seemed like overkill. Once I realized I simply had to calculate the MAX, I realized GROUP BY would work.

I’m sure there are other ways to solve this puzzle. So, what answer did you get? Please share in the comments.

Answer to Question #2 – Subquery

Let’s look at the second piece of the puzzle when it comes to joins versus subqueries; the subquery.

This query is elegant in that IN operator naturally remove duplicates, but gets ugly as another sub query is needed to pull in the most recent date.

What you’ll see is that subqueries are really great when you need to return rows from one table based on the existence of one or more conditions in another, but not so swift at combining data from one table with another.

To show you what I mean, check out this query which shows the user and phone number for an actionDate within the windows:

SQL
SELECT U.Name,
       U.PhoneNumber
FROM   @User AS U
WHERE  U.UserID IN (SELECT UserID
                    FROM   @UserHistory
                    WHERE  action = 'Logged On'
                    AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30);

In this case, we naturally get a unique list of name and phone numbers as we are listing users.

The uniqueness is guaranteed by the fact the table naturally contains unique value (yes, I know I didn’t define a Primary Key on the table, but hopefully you see it could be one).

The subquery in the WHERE clause compiles a list of userIDs whose actionDate falls within the 30-day window.

Joins versus Subqueries - Subquery but no ActionDate

Of course, our query is only partially written, as it doesn’t display the latest actionDate.

If I could dream a bit (the dreamy bits are in blue), I would love to be able to write a query like:

SQL
SELECT U.Name,
       U.PhoneNumber,
       (SELECT Max(UH.actionDate)
        FROM   @UserHistory AS UH
        WHERE  action = 'Logged On'
        AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30
AND    U.userID = UH.userID) AS RecentLogonDate
FROM   @User AS U
WHERE  DATEDIFF(DAY, RecentLogonDate, '2017-03-14') <= 30
       AND RecentLogonDate IS NOT NULL

But this isn’t a properly written query, and SQL gets mad!

SQL
Msg 207, Level 16, State 1, Line 62
Invalid column name 'RecentLogonDate'.
Msg 207, Level 16, State 1, Line 63
Invalid column name 'RecentLogonDate'.

What makes our subquery ugly is being forced to use two subqueries to retrieve the most recent actionDate. To do this, you must essentially repeat the query, but now as a correlated sub query. The correlation is show in red. It is just the match of the User table UserID to that in UserHistory.

SQL
SELECT U.Name,
       U.PhoneNumber,
       (SELECT Max(UH.actionDate)
        FROM   @UserHistory AS UH
        WHERE  action = 'Logged On'
               AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30
               AND U.userID = UH.userID) AS RecentLogonDate
FROM   @User AS U
WHERE  U.UserID IN (SELECT UserID
                    FROM   @UserHistory
                    WHERE  action = 'Logged On'
                    AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30);

Conclusion

Now that you’ve seen the differences of joins versus subqueries, I’m hoping you have a better understanding on when to use one versus the other. To summarize, joins excel at combining data from two tables, subqueries are best when testing for the existence of a value from one table found in another.

What are your experiences with joins versus subqueries? Which do you prefer to use when? Please share in the comments!

License

This article, along with any associated source code and files, is licensed under The MIT License