Introduction
This is a simple case study comparing the efficiency of a Correlated Sub-Query against a CASE statement for performing data aggregation.
Background
Here is a common scenario: I have a news website that displays stories grouped by the usual categories:
- World news
- Local news
- Sport
- Lifestyle
- Etc.
When a user leaves the site, the number of page views for each news item is stored in the database:
INSERT PageHits( News_ID, HitDate, NoHits )
VALUES(@News_ID, @Date, @Hits)
Now, the powers that control the paycheck have asked for some reporting to be done on this information. Namely, they want to know the number of page hits by news category for each date in a range:
Date | Local News | World News | Local Sport | World Sport | Business | Lifestyle | Crime | Weather | Total Hits |
2010-02-25 | 1 | 2 | 2 | 4 | 0 | 0 | 0 | 0 | 9 |
2010-02-26 | 1 | 36 | 2 | 0 | 0 | 5 | 0 | 0 | 44 |
2010-02-27 | 35 | 10 | 0 | 0 | 0 | 2 | 0 | 0 | 47 |
To demonstrate the steps required, let's build a simple sample database named NewsSite:
USE Master
GO
CREATE DATABASE NewsSite
GO
USE NewsSite
GO
CREATE TABLE Category(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Name VARCHAR(30) UNIQUE NOT NULL
)
GO
CREATE TABLE NewsItem(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Cat_ID INT NOT NULL FOREIGN KEY REFERENCES Category(ID),
ItemDate DATETIME NOT NULL DEFAULT GETDATE(),
ItemTitle VARCHAR(100) NOT NULL
)
GO
CREATE TABLE NewsHits(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
[Date] DATETIME NOT NULL DEFAULT GETDATE(),
News_ID INT NOT NULL FOREIGN KEY REFERENCES NewsItem(ID),
Hits INT NOT NULL
)
GO
Now we add some dummy data (I will skip some of the boring stuff, but you will get the picture):
USE NewsSite
GO
INSERT Category(Name)
VALUES('Local News'),
      ('World News'),
      ('Local Sport'),
      ('World Sport'),
      ('Business'),
      ('Lifestyle'),
      ('Crime'),
      ('Weather')
INSERT NewsItem(Cat_ID, ItemDate, ItemTitle)
VALUES(1, '2010-02-27', 'Man Bites Dog'),
      (1, '2010-02-27', 'Rat Bites Cat'),
      (3, '2010-02-27', 'Rugby League Player Not in Trouble'),
      (4, '2010-02-27', 'NFL Player Goes to Jail'),
      (5, '2010-02-27', 'USD Plummets'),
      (7, '2010-02-27', 'Murder in East LA'),
      (8, '2010-02-27', 'Chile Tsunami Warning')
INSERT NewsHits([Date], News_ID, Hits)
VALUES('2010-02-25', 1, 2),
      ('2010-02-25', 3, 1),
      ('2010-02-25', 7, 4),
      ('2010-02-27', 2, 2),
      ('2010-02-26', 4, 7)
GO
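As a quick sanity check (this query is my addition, not part of the original tests), you can confirm that the sample rows landed where expected:

```sql
USE NewsSite
GO
-- Total hits recorded per category, across all dates
SELECT c.Name, SUM(nh.Hits) AS TotalHits
FROM NewsHits nh
JOIN NewsItem ni ON nh.News_ID = ni.ID
JOIN Category c  ON ni.Cat_ID = c.ID
GROUP BY c.Name
ORDER BY c.Name
GO
```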
The Correlated Sub-Query
Here it is in all its glory - avert your eyes if they are easily damaged:
USE NewsSite
GO
SELECT nh.Date, CASE ISNULL(ln.LocalNews, 0)
WHEN 0 THEN 0
ELSE ln.LocalNews
END AS LocalNews,
CASE ISNULL(wn.WorldNews, 0)
WHEN 0 THEN 0
ELSE wn.WorldNews
END AS WorldNews,
CASE ISNULL(ls.LocalSport, 0)
WHEN 0 THEN 0
ELSE ls.LocalSport
END AS LocalSport,
CASE ISNULL(ws.WorldSport, 0)
WHEN 0 THEN 0
ELSE ws.WorldSport
END AS WorldSport,
CASE ISNULL(b.Business, 0)
WHEN 0 THEN 0
ELSE b.Business
END AS Business,
CASE ISNULL(l.Lifestyle, 0)
WHEN 0 THEN 0
ELSE l.Lifestyle
END AS LifeStyle,
CASE ISNULL(c.Crime, 0)
WHEN 0 THEN 0
ELSE c.Crime
END AS Crime,
CASE ISNULL(w.Weather, 0)
WHEN 0 THEN 0
ELSE w.Weather
END AS Weather,
SUM(nh.Hits) As TotalHits
FROM NewsHits nh
LEFT JOIN (SELECT nhh.Date, SUM(nhh.Hits) As LocalNews
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Local News'
GROUP BY nhh.Date) ln
ON nh.Date = ln.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As WorldNews
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'World News'
GROUP BY nhh.Date) wn
ON nh.Date = wn.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As LocalSport
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Local Sport'
GROUP BY nhh.Date) ls
ON nh.Date = ls.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As WorldSport
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'World Sport'
GROUP BY nhh.Date) ws
ON nh.Date = ws.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As Business
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Business'
GROUP BY nhh.Date) b
ON nh.Date = b.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As Crime
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Crime'
GROUP BY nhh.Date) c
ON nh.Date = c.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As Weather
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Weather'
GROUP BY nhh.Date) w
ON nh.Date = w.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As LifeStyle
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Lifestyle'
GROUP BY nhh.Date) l
ON nh.Date = l.Date
WHERE nh.Date BETWEEN '2010-02-24' AND '2010-02-28'
GROUP BY nh.Date, ln.LocalNews, wn.WorldNews, ls.LocalSport,
ws.WorldSport, b.Business, c.Crime, w.Weather,
l.LifeStyle
GO
Ugly, isn't it? And according to my more learned friends, not a particularly brutal version of the species!
It is pretty simple to follow: the outer query returns the results of the sub-queries for each date in the range. Run the sample script and you will see output similar to the table above. I was quite happy to use this (or something similar but worse looking) in a production environment until it was kindly pointed out that, in this case at least, I might be better served by a simpler CASE statement construct.
The CASE Statement
USE NewsSite
GO
SELECT nh.[Date],
SUM(CASE WHEN c.Name = 'Local News'
THEN nh.Hits ELSE 0 END) AS 'Local News',
SUM(CASE WHEN c.Name = 'World News'
THEN nh.Hits ELSE 0 END) AS 'World News',
SUM(CASE WHEN c.Name = 'Local Sport'
THEN nh.Hits ELSE 0 END) AS 'Local Sport',
SUM(CASE WHEN c.Name = 'World Sport'
THEN nh.Hits ELSE 0 END) AS 'World Sport',
SUM(CASE WHEN c.Name = 'Business'
THEN nh.Hits ELSE 0 END) AS 'Business',
SUM(CASE WHEN c.Name = 'Lifestyle'
THEN nh.Hits ELSE 0 END) AS 'Lifestyle',
SUM(CASE WHEN c.Name = 'Crime'
THEN nh.Hits ELSE 0 END) AS 'Crime',
SUM(CASE WHEN c.Name = 'Weather'
THEN nh.Hits ELSE 0 END) AS 'Weather',
SUM(nh.Hits) As [Total Hits]
FROM NewsHits nh
JOIN NewsItem ni
ON ni.ID = nh.News_ID
JOIN Category c
ON ni.Cat_ID = c.ID
WHERE nh.[Date] BETWEEN '2010-02-24' AND '2010-02-28'
GROUP BY nh.[Date]
GO
That's a lot better, easier on the eye, and easier to understand at a glance.
The clincher comes when examining the execution times of the queries:
SP Results | CSQ | CASE |
Number of INSERT , DELETE and UPDATE statements | 0 | 0 |
Rows affected by INSERT , DELETE and UPDATE statements | 0 | 0 |
Number of SELECT statements | 1 | 1 |
Rows returned by SELECT statements | 3 | 3 |
Number of Transactions | 0 | 0 |
Network Statistics | | |
Number of server roundtrips | 1 | 1 |
TDS packets sent from client | 3 | 1 |
TDS packets received from server | 1 | 1 |
Bytes sent from client | 8748 | 1990 |
Bytes received from server | 435 | 445 |
Time Statistics | | |
Client processing time (ms) | 1.2 | 4.3 |
Total execution time (ms) | 9.6 | 5.2 |
Wait time on server replies (ms) | 8.4 | 0.9 |
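The figures above come from the client statistics in the query tool; if you want to gather comparable server-side timings yourself, one option (a sketch, not the method used for the table above) is SET STATISTICS TIME:

```sql
USE NewsSite
GO
SET STATISTICS TIME ON
GO
-- Run either of the two report queries here; the Messages tab
-- will then show parse/compile and execution CPU and elapsed times.
SET STATISTICS TIME OFF
GO
```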
The CASE statement runs nearly twice as fast as the Correlated Sub-Query example (5.2 ms vs 9.6 ms total execution time).
The CSQ spent 1.2 ms processing on the client and then 8.4 ms waiting on the server, whereas the CASE example spends 4.3 ms processing on the client and then a minuscule 0.9 ms waiting for the server.
This may not be significant in this small example, but in a situation where there may be thousands or more records, the CSQ starts to fade into the distance in this race.
Using a similar structure on a large dataset, the CASE example was more than 100 times faster than the CSQ. Granted, after adding an index or two, I was able to reduce this to about 16:1.
Don't Jump Too Fast...
While I was able to substitute a different method in this case, it may not always be possible because of joined table structures and the like, and you may be 'forced' to use the Correlated Sub-Query instead. If you do go down this path, remember to create appropriate indexes to ensure your query runs as efficiently as possible.
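As a sketch only (the column choices here are my assumption; the original tests do not specify which indexes were added), indexes along these lines would support the queries above:

```sql
USE NewsSite
GO
-- Cover the per-date aggregation over NewsHits
CREATE INDEX IX_NewsHits_Date ON NewsHits([Date]) INCLUDE (News_ID, Hits)
GO
-- Support the join from NewsItem to Category
CREATE INDEX IX_NewsItem_CatID ON NewsItem(Cat_ID)
GO
```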
Points of Interest
This article came about because of a short discussion in the General Database Forum, and shows how the first "good" idea you have for a solution may not always be the best one.
Oh, and CodeProject is a great technical resource, frequented by some talented and generous people - thanks to Mycroft Holmes and i.jrussell for pointing me down this path.
History