Introduction
This is a simple case study comparing the efficiency of a Correlated Sub-Query against a CASE statement for performing data aggregation.
Background
Here is a common scenario: I have a news website that displays stories grouped by the usual categories:
- World news
- Local news
- Sport
- Lifestyle
- Etc.
When a user leaves the site, the number of page views for each news item is stored in the database:
INSERT PageHits( News_ID, HitDate, NoHits )
VALUES(@News_ID, @Date, @Hits)
Now, the powers that control the paycheck have asked for some reporting to be done on this information. Namely, they want to know the number of page hits by news category for each date in a range:
Date | Local News | World News | Local Sport | World Sport | Business | Lifestyle | Crime | Weather | Total Hits |
2010-02-25 | 1 | 2 | 2 | 4 | 0 | 0 | 0 | 0 | 9 |
2010-02-26 | 1 | 36 | 2 | 0 | 0 | 5 | 0 | 0 | 44 |
2010-02-27 | 35 | 10 | 0 | 0 | 0 | 2 | 0 | 0 | 47 |
To demonstrate the steps required, let's build a simple sample database named NewsSite:
USE Master
GO
CREATE DATABASE NewsSite
GO
USE NewsSite
GO
CREATE TABLE Category(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Name VARCHAR(30) UNIQUE NOT NULL
)
GO
CREATE TABLE NewsItem(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Cat_ID INT NOT NULL FOREIGN KEY REFERENCES Category(ID),
ItemDate DATETIME NOT NULL DEFAULT GETDATE(),
ItemTitle VARCHAR(100) NOT NULL
)
GO
CREATE TABLE NewsHits(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
[Date] DATETIME NOT NULL DEFAULT GETDATE(),
News_ID INT NOT NULL FOREIGN KEY REFERENCES NewsItem(ID),
Hits INT NOT NULL
)
GO
Now we add some dummy data (I will skip some of the boring stuff, but you will get the picture):
USE NewsSite
GO
INSERT Category(Name)
VALUES('Local News'),
      ('World News'),
      ('Local Sport'),
      ('World Sport'),
      ('Business'),
      ('Lifestyle'),
      ('Crime'),
      ('Weather')
INSERT NewsItem(Cat_ID, ItemDate, ItemTitle)
VALUES(1, '2010-02-27', 'Man Bites Dog'),
      (1, '2010-02-27', 'Rat Bites Cat'),
      (3, '2010-02-27', 'Rugby League Player Not in Trouble'),
      (4, '2010-02-27', 'NFL Player Goes to Jail'),
      (5, '2010-02-27', 'USD Plummets'),
      (7, '2010-02-27', 'Murder in East LA'),
      (8, '2010-02-27', 'Chile Tsunami Warning')
INSERT NewsHits([Date], News_ID, Hits)
VALUES('2010-02-25', 1, 2),
      ('2010-02-25', 3, 1),
      ('2010-02-25', 7, 4),
      ('2010-02-27', 2, 2),
      ('2010-02-26', 4, 7)
GO
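As a quick sanity check (this query is my addition, not part of the original tests), you can confirm that the sample rows landed where expected:

```sql
USE NewsSite
GO
-- Total hits recorded per category, across all dates
SELECT c.Name, SUM(nh.Hits) AS TotalHits
FROM NewsHits nh
JOIN NewsItem ni ON nh.News_ID = ni.ID
JOIN Category c  ON ni.Cat_ID = c.ID
GROUP BY c.Name
ORDER BY c.Name
GO
```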
The Correlated Sub-Query
Here it is in all its glory - avert your eyes if they are easily damaged:
USE NewsSite
GO
SELECT nh.Date, CASE ISNULL(ln.LocalNews, 0)
WHEN 0 THEN 0
ELSE ln.LocalNews
END AS LocalNews,
CASE ISNULL(wn.WorldNews, 0)
WHEN 0 THEN 0
ELSE wn.WorldNews
END AS WorldNews,
CASE ISNULL(ls.LocalSport, 0)
WHEN 0 THEN 0
ELSE ls.LocalSport
END AS LocalSport,
CASE ISNULL(ws.WorldSport, 0)
WHEN 0 THEN 0
ELSE ws.WorldSport
END AS WorldSport,
CASE ISNULL(b.Business, 0)
WHEN 0 THEN 0
ELSE b.Business
END AS Business,
CASE ISNULL(l.Lifestyle, 0)
WHEN 0 THEN 0
ELSE l.Lifestyle
END AS LifeStyle,
CASE ISNULL(c.Crime, 0)
WHEN 0 THEN 0
ELSE c.Crime
END AS Crime,
CASE ISNULL(w.Weather, 0)
WHEN 0 THEN 0
ELSE w.Weather
END AS Weather,
SUM(nh.Hits) As TotalHits
FROM NewsHits nh
LEFT JOIN (SELECT nhh.Date, SUM(nhh.Hits) As LocalNews
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Local News'
GROUP BY nhh.Date) ln
ON nh.Date = ln.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As WorldNews
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'World News'
GROUP BY nhh.Date) wn
ON nh.Date = wn.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As LocalSport
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Local Sport'
GROUP BY nhh.Date) ls
ON nh.Date = ls.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As WorldSport
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'World Sport'
GROUP BY nhh.Date) ws
ON nh.Date = ws.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As Business
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Business'
GROUP BY nhh.Date) b
ON nh.Date = b.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As Crime
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Crime'
GROUP BY nhh.Date) c
ON nh.Date = c.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As Weather
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Weather'
GROUP BY nhh.Date) w
ON nh.Date = w.Date
LEFT JOIN ( SELECT nhh.Date, SUM(nhh.Hits) As LifeStyle
FROM NewsHits nhh
JOIN NewsItem nii
ON nhh.News_ID = nii.ID
JOIN Category cc
ON nii.Cat_ID = cc.ID
WHERE cc.Name = 'Lifestyle'
GROUP BY nhh.Date) l
ON nh.Date = l.Date
WHERE nh.Date BETWEEN '2010-02-24' AND '2010-02-28'
GROUP BY nh.Date, ln.LocalNews, wn.WorldNews, ls.LocalSport,
ws.WorldSport, b.Business, c.Crime, w.Weather,
l.LifeStyle
GO
Ugly, isn't it? And according to my more learned friends, not a particularly brutal version of the species!
It is pretty simple to follow: the outer query returns the results of the sub-queries for each date in the range. Run the sample script and you will see output similar to the table above. I was quite happy to use this (or something similar but worse looking) in a production environment until it was kindly pointed out that, in this case at least, I might be better served by a simpler CASE statement construct.
The CASE Statement
USE NewsSite
GO
SELECT nh.[Date],
SUM(CASE WHEN c.Name = 'Local News'
THEN nh.Hits ELSE 0 END) AS 'Local News',
SUM(CASE WHEN c.Name = 'World News'
THEN nh.Hits ELSE 0 END) AS 'World News',
SUM(CASE WHEN c.Name = 'Local Sport'
THEN nh.Hits ELSE 0 END) AS 'Local Sport',
SUM(CASE WHEN c.Name = 'World Sport'
THEN nh.Hits ELSE 0 END) AS 'World Sport',
SUM(CASE WHEN c.Name = 'Business'
THEN nh.Hits ELSE 0 END) AS 'Business',
SUM(CASE WHEN c.Name = 'Lifestyle'
THEN nh.Hits ELSE 0 END) AS 'Lifestyle',
SUM(CASE WHEN c.Name = 'Crime'
THEN nh.Hits ELSE 0 END) AS 'Crime',
SUM(CASE WHEN c.Name = 'Weather'
THEN nh.Hits ELSE 0 END) AS 'Weather',
SUM(nh.Hits) As [Total Hits]
FROM NewsHits nh
JOIN NewsItem ni
ON ni.ID = nh.News_ID
JOIN Category c
ON ni.Cat_ID = c.ID
WHERE nh.[Date] BETWEEN '2010-02-24' AND '2010-02-28'
GROUP BY nh.[Date]
GO
That's a lot better, easier on the eye, and easier to understand at a glance.
The clincher comes when examining the execution times of the queries:
SP Results | CSQ | CASE |
Number of INSERT , DELETE and UPDATE statements | 0 | 0 |
Rows affected by INSERT , DELETE and UPDATE statements | 0 | 0 |
Number of SELECT statements | 1 | 1 |
Rows returned by SELECT statements | 3 | 3 |
Number of Transactions | 0 | 0 |
Network Statistics | | |
Number of server roundtrips | 1 | 1 |
TDS packets sent from client | 3 | 1 |
TDS packets received from server | 1 | 1 |
Bytes sent from client | 8748 | 1990 |
Bytes received from server | 435 | 445 |
Time Statistics | | |
Client processing time (ms) | 1.2 | 4.3 |
Total execution time (ms) | 9.6 | 5.2 |
Wait time on server replies (ms) | 8.4 | 0.9 |
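The figures above come from the client statistics in the query tool; if you want to gather comparable server-side timings yourself, one option (a sketch, not the method used for the table above) is SET STATISTICS TIME:

```sql
USE NewsSite
GO
SET STATISTICS TIME ON
GO
-- Run either of the two report queries here; the Messages tab
-- will then show parse/compile and execution CPU and elapsed times.
SET STATISTICS TIME OFF
GO
```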
The CASE statement runs nearly twice as fast as the Correlated Sub-Query example (5.2 ms vs 9.6 ms total execution time).
The CSQ spent 1.2 ms processing on the client and then 8.4 ms waiting on the server, whereas the CASE example spends 4.3 ms processing on the client and then a minuscule 0.9 ms waiting for the server.
This may not be significant in this small example, but in a situation where there may be thousands or more records, the CSQ starts to fade into the distance in this race.
Using a similar structure on a large dataset, the CASE example was more than 100 times faster than the CSQ. Granted, after adding an index or two, I was able to reduce this to about 16:1.
Don't Jump Too Fast...
While I was able to substitute a different method in this case, it may not always be possible because of joined table structures and the like, and you may be 'forced' to use the Correlated Sub-Query instead. If you do go down this path, remember to create appropriate indexes to ensure your query runs as efficiently as possible.
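As a sketch only (the column choices here are my assumption; the original tests do not specify which indexes were added), indexes along these lines would support the queries above:

```sql
USE NewsSite
GO
-- Cover the per-date aggregation over NewsHits
CREATE INDEX IX_NewsHits_Date ON NewsHits([Date]) INCLUDE (News_ID, Hits)
GO
-- Support the join from NewsItem to Category
CREATE INDEX IX_NewsItem_CatID ON NewsItem(Cat_ID)
GO
```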
Points of Interest
This article came about because of a short discussion in the General Database Forum, and shows how the first "good" idea you have for a solution may not always be the best one.
Oh, and CodeProject is a great technical resource, frequented by some talented and generous people - thanks to Mycroft Holmes and i.jrussell for pointing me down this path.
History