Solving puzzles is a great way to learn SQL. In this puzzle, we’re going to learn about joins vs subqueries.
In many queries, you can substitute joins and subqueries. Yet, since each has its strengths, it isn’t wise to do so. Once you have worked through this puzzle, you see some of the benefits and disadvantages of joins vs subqueries.
Solving puzzles is a great way to learn SQL. Nothing beats practicing what you’ve learned. Once you have figured out the puzzle, post your answer in the comments so we all can learn from one another.
SQL Puzzle Question – Joins vs Subqueries
The system admin would like a report of active users. Can you write a report that returns the name, phone number and most recent date for any user that has logged in the 30 days since March14th, 2017?
(You can tell a user has logged in if the action
field in UserHistory
is set to “Logged On
”).
User
UserHistory
Every time a user logs in, a new row is inserted into the UserHistory
table with userID
, current date and action (where action = “Logged On”
).
Question #1 – Write a SQL query without using Subqueries.
Question #2 – Write the SQL using subqueries.
As you answer the questions, think about the differences of joins vs subqueries. Which technique is better suited to solve the problem? Can you see advantages to either case?
To get you started, download this script which contains the table definitions and some test data.
Answer to Question #1 – INNER JOIN
Like most queries I write, I took this one step by step. Let me show you what I mean by walking you through my thought process.
Regardless of how this query turned out, I knew I would have to figure out how to get the logins 30 days since 2017-03-14. To do this, I wrote a simple query that used the DATEDIFF
function to calculate the number of days from the actionDate
to the 14th. If the days were less than or equal to 30
, I knew the entry was within the window.
SELECT *,
DATEDIFF(DAY, actionDate, '2017-03-14')
FROM @UserHistory
WHERE action = 'Logged On'
AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30;
DATEDIFF Coding Hack
You’ll see that I included the result of the DATEDIFF
in query result. I did this because I always get the parameter mixed up! By displaying the result, I could verify that I had the formula correct; therefore, being used in my WHERE
clause as expected.
Here is what I saw:
Now that I knew I could get actionDate
within the window, I proceeded to join the UserHistory
with User
so I could include the PhoneNumber
in my results. I joined the tables by userID
.
SELECT U.Name,
U.PhoneNumber,
UH.actionDate AS RecentLogonDate
FROM @User AS U
INNER JOIN @UserHistory AS UH
ON U.userID = UH.userID
WHERE action = 'Logged On'
AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30;
Expected Duplicates
Given there are several UserHistory
entries per user, I would expect to see duplicates. I’ve highlighted some of those below:
The last task which remains is to only show the most recent login date for each user. For Bob
, this would be 2017-03-02
.
Notice, that for each user, the most recent date is the maxim date for that user. This is a big hint, as it points us towards using aggregate functions such as MAX
.
So, at this point, we can use GROUP BY
to calculate the maximum date for each user.
SELECT U.Name,
U.PhoneNumber,
MAX(UH.actionDate) AS RecentLogonDate
FROM @User AS U
INNER JOIN @UserHistory AS UH
ON U.userID = UH.userID
WHERE action = 'Logged On'
AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30
GROUP BY U.Name, U.phoneNumber;
Whose result is:
Key Takeaways
There are a couple of takeaways from this:
- First, here is no shame is doing a query to get intermediate results. When using formulas, such as
DATEDIFF
, make sure you’re getting the calculation you expect. - Second, it can take more than one technique to solve a problem. In our case, we first join the items, and once we have the result, use
GROUP BY
to get the most recent date. - Last, keep it simple! At first, I thought I would have to do another join so I could compare dates, but I realized that was going to get complicated. Also, I figured I could partition the data and use a window function, but that seemed like overkill. Once I realized I simply had to calculate the
MAX
, I realized GROUP BY
would work.
I’m sure there are other ways to solve this puzzle. So, what answer did you get? Please share in the comments.
Answer to Question #2 – Subquery
Let’s look at the second piece of the puzzle when it comes to joins vs subqueries; the subquery.
This query is elegant in that IN
operator naturally remove duplicates, but gets ugly as another sub query is needed to pull in the most recent date.
What you’ll see is that subqueries are really great when you need to return rows from one table based on the existence of one or more conditions in another, but not so swift at combining data from one table with another.
To show you what I mean, check out this query which shows the user and phone number for an actionDate
within the windows:
SELECT U.Name,
U.PhoneNumber
FROM @User AS U
WHERE U.UserID IN (SELECT UserID
FROM @UserHistory
WHERE action = 'Logged On'
AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30);
Natural Uniqueness via Subquery
In this case, we naturally get a unique list of name and phone numbers as we are listing users.
The uniqueness is guaranteed by the fact the table naturally contains unique value (yes, I know I didn’t define a Primary Key on the table, but hopefully you see it could be one).
The subquery in the WHERE
clause compiles a list of userID
s whose actionDate
falls within the 30-day window.
Of course, our query is only partially written, as it doesn’t display the latest actionDate
.
If I could dream a bit (the dreamy bits are in bold), I would love to be able to write a query like:
SELECT U.Name,
U.PhoneNumber,
(SELECT Max(UH.actionDate)
FROM @UserHistory AS UH
WHERE action = 'Logged On'
AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30
AND U.userID = UH.userID) AS RecentLogonDate
FROM @User AS U
WHERE DATEDIFF(DAY, RecentLogonDate, '2017-03-14') <= 30
AND RecentLogonDate IS NOT NULL
But this isn’t a properly written query, and SQL gets mad!
Msg 207, Level 16, State 1, Line 62
Invalid column name 'RecentLogonDate'.
Msg 207, Level 16, State 1, Line 63
Invalid column name 'RecentLogonDate'.
Forced To Use Two Subqueries!
What makes our subquery ugly is being forced to use two subqueries to retrieve the most recent actionDate
. To do this, you must essentially repeat the query, but now as a correlated sub query. It just a match of the User
table UserID
to that in UserHistory
.
SELECT U.Name,
U.PhoneNumber,
(SELECT Max(UH.actionDate)
FROM @UserHistory AS UH
WHERE action = 'Logged On'
AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30
AND U.userID = UH.userID) AS RecentLogonDate
FROM @User AS U
WHERE U.UserID IN (SELECT UserID
FROM @UserHistory
WHERE action = 'Logged On'
AND DATEDIFF(DAY, actionDate, '2017-03-14') <= 30);
Conclusion
Now that you’ve seen the differences of joins vs subqueries, I’m hoping you have a better understanding on when to use one versus the other. To summarize, joins excel at combining data from two tables, subqueries are best when testing for the existence of a value from one table found in another.
What are your experiences with joins versus subqueries? Which do you prefer to use when? Please share in the comments!