A discussion of how joins significantly outperform subqueries and how this is more evident when OR conditions are involved.
SQL queries can be written in a number of ways and be functionally equivalent but significantly different speed-wise. A great example of this is the use of cross joins instead of a while loop for populating a numbers
table. The cross join, being set-based, was far more efficient. The insert
statements used later in this post depend on the existence of a table ‘numbers
’, which is used in place of a Student
table. Here’s the code for this – pulled from the cross joins post [this only takes a few seconds instead of over a minute to run like the while loop - cross joins do have a place in SQL development]:
CREATE TABLE numbers (num INT)
DECLARE @numbers TABLE
(
num INT
)
DECLARE @counter INT
SET @counter = 0
While @counter < 10
BEGIN
INSERT INTO @numbers VALUES(@counter)
SET @counter = @counter + 1
END
INSERT INTO numbers (num)
SELECT
(
n100000place.num*100000 +
n10000place.num*10000 +
n1000place.num*1000 +
n100place.num*100 +
n10place.num*10 +
n1place.num
) AS 'number'
FROM @numbers n1place
CROSS JOIN @numbers n10place
CROSS JOIN @numbers n100place
CROSS JOIN @numbers n1000place
CROSS JOIN @numbers n10000place
CROSS JOIN @numbers n100000place
WHERE n100000place.num < 8
ORDER BY NUMBER
Today, I’d like to talk about subqueries and OR conditions in a SQL statement. Before discussing the particular queries, let’s set up a test table with some dummy data. For this, I’m inserting a significant number of rows [4 million] into a table so any performance issues become more evident than they would with a trivial amount of data.
CREATE TABLE StudentTestData (StudentNumber VARCHAR(8), _
testid VARCHAR(8), TestScore VARCHAR(3))
INSERT INTO StudentTestData
SELECT numbers.num, 'AAABBBCC', numbers.num % 55 FROM numbers
INSERT INTO StudentTestData
SELECT numbers.num, 'AAABBBDD', numbers.num % 44 FROM numbers
INSERT INTO StudentTestData
SELECT numbers.num, 'FFFBBBDD', numbers.num % 33 FROM numbers
INSERT INTO StudentTestData
SELECT numbers.num, 'FFFRRRDD', numbers.num % 66 FROM numbers
INSERT INTO StudentTestData
SELECT numbers.num, 'CCCMMMDD', numbers.num % 55 FROM numbers
This table is similar in structure and size to a table we have at Boston Public Schools for storing student test results. Using the StudentTestData
table, let’s say we needed to write a query to get a list of students who met the following conditions:
Got over a 10 on test FFFBBBDD AND over a 20 on test FFFRRRDD
AND
Got over a 16 on test CCCMMMDD AND over a 30 on test AAABBBCC
AND
Got over a 25 on test AAABBBDD
In this case, all 5 test score requirements must be met.
Here are two queries that return the results, the first based on joins and the second based on subqueries. I’ve used left joins because of a slight modification I’m going to make to the requirements later in the post – for the following query, inner joins would work fine.
SELECT DISTINCT Numbers.num FROM Numbers
LEFT JOIN StudentTestData TestAAABBBCC
ON TestAAABBBCC.StudentNumber = Numbers.num AND TestAAABBBCC.testid = 'AAABBBCC'
LEFT JOIN StudentTestData TestFFFBBBDD
ON TestFFFBBBDD.StudentNumber = Numbers.num AND TestFFFBBBDD.testid = 'FFFBBBDD'
LEFT JOIN StudentTestData TestFFFRRRDD
ON TestFFFRRRDD.StudentNumber = Numbers.num AND TestFFFRRRDD.testid = 'FFFRRRDD'
LEFT JOIN StudentTestData TestCCCMMMDD
ON TestCCCMMMDD.StudentNumber = Numbers.num AND TestCCCMMMDD.testid = 'CCCMMMDD'
LEFT JOIN StudentTestData TestAAABBBDD
ON TestAAABBBDD.StudentNumber = Numbers.num AND TestAAABBBDD.testid = 'AAABBBDD'
WHERE
(
(TestFFFBBBDD.TestScore > 10 AND TestFFFRRRDD.TestScore > 20)
AND
(TestCCCMMMDD.TestScore > 16 AND TestAAABBBCC.TestScore > 30)
)
AND
(
TestAAABBBDD.TestScore > 25
)
SELECT DISTINCT Numbers.num FROM Numbers
WHERE
(
(
num IN (SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 10 AND testid = 'FFFBBBDD')
AND num IN(SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 20 AND testid = 'FFFRRRDD')
)
AND
(
num IN (SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 16 AND testid = 'CCCMMMDD')
AND num IN(SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 30 AND testid = 'AAABBBCC')
)
)
AND
(
num IN (SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 25 AND testid = 'AAABBBDD')
)
As I hinted would be the case in my previous post, the results clearly are in favor of the join-based query. While the speed differential wasn’t too significant for the first run of the query [5 seconds compared to 3 seconds], the differential became much more important for subsequent query runs. SQL Server appears to be much better at saving the execution plan for join-based queries than subquery-based queries.
Let’s say we throw a few OR conditions into the requirements. OR conditions will really illustrate why the join
-based version is superior. Here are the new requirements:
Provide a list of student numbers for students who met one of the following two conditions:
Condition 1:
Got over a 10 on test FFFBBBDD OR over a 20 on test FFFRRRDD
AND
Got over a 16 on test CCCMMMDD OR over a 30 on test AAABBBCC
Condition 2:
Got over a 25 on test AAABBBDD
From a SQL query perspective, the queries are almost identical to the ones above, except that we needed to change AND
to OR
in a few cases and add parentheses when needed so precedence rules match the requirements. Also, the left joins are used in case some students didn’t take a test [using inner joins would remove students from the result if they didn't take all 5 tests].
SELECT DISTINCT Numbers.num FROM Numbers
LEFT JOIN StudentTestData TestAAABBBCC
ON TestAAABBBCC.StudentNumber = Numbers.num AND TestAAABBBCC.testid = 'AAABBBCC'
LEFT JOIN StudentTestData TestFFFBBBDD
ON TestFFFBBBDD.StudentNumber = Numbers.num AND TestFFFBBBDD.testid = 'FFFBBBDD'
LEFT JOIN StudentTestData TestFFFRRRDD
ON TestFFFRRRDD.StudentNumber = Numbers.num AND TestFFFRRRDD.testid = 'FFFRRRDD'
LEFT JOIN StudentTestData TestCCCMMMDD
ON TestCCCMMMDD.StudentNumber = Numbers.num AND TestCCCMMMDD.testid = 'CCCMMMDD'
LEFT JOIN StudentTestData TestAAABBBDD
ON TestAAABBBDD.StudentNumber = Numbers.num AND TestAAABBBDD.testid = 'AAABBBDD'
WHERE
(
(TestFFFBBBDD.TestScore > 10 OR TestFFFRRRDD.TestScore > 20)
AND
(TestCCCMMMDD.TestScore > 16 OR TestAAABBBCC.TestScore > 30)
)
OR
(
TestAAABBBDD.TestScore > 25
)
SELECT DISTINCT Numbers.num FROM Numbers
WHERE
(
(
num IN (SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 10 AND testid = 'FFFBBBDD')
OR num IN(SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 20 AND testid = 'FFFRRRDD')
)
AND
(
num IN (SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 16 AND testid = 'CCCMMMDD')
OR num IN(SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 30 AND testid = 'AAABBBCC')
)
)
OR
(
num IN (SELECT studentNumber FROM StudentTestData _
WHERE TestScore > 25 AND testid = 'AAABBBDD')
)
Just by adding the OR
conditions to the requirements, the time differential between the join-based query and the subquery version skyrocketed. While the join-based query was slightly slower, the subquery version didn’t finish. One of these nights, I’ll let the subquery version run overnight to see how long it takes.