Introduction
In my last article, on Basic Data Retrieval, I noted that software developers are often called upon to retrieve data from SQL databases. These articles are designed as an aid to those developers. This particular article focuses on crunching the data. A million data points do a company no good if nobody knows what they mean. This guide will help you process large amounts of data while ensuring that you aren't getting bad data. The worst mistake you can make when querying large data sets is to be off by just a little; you probably won't notice until it is too late. This article will help you verify that a query will return the right information before you even run it.
Notes
I have developed this article while using the AdventureWorks
database provided by Microsoft. All of the following queries will work against that database. I have provided a tool that will run each query for you. Both the executable and the source code for the T-SQL Test Platform are provided to you. You can run my example queries directly or you can use this tool to test your own queries. I have tested everything against a Microsoft SQL Server 2008 R2, but I believe all of the included commands should work against even Microsoft SQL Server 2005 unless stated otherwise.
As part of my usual disclaimer, I want to state that it is outside the scope of this article to discuss the merits and drawbacks of calling SQL queries from code versus executing stored procedures on the server. It is my opinion that if you have the ability to do so, writing stored procedures is the way to go but that is not always an option. Either way, this article is not the place for that discussion.
Now that we have all of the administrative stuff out of the way, on to the good stuff.
Processing Data with SQL Functions
There are many functions that can be performed on a set of data. The well-known ones include SUM, COUNT, AVG, etc. While these functions can be simple to understand, there are a few things that need to be understood about how they work so that you do not make a costly assumption that turns out to be faulty.
The COUNT operator is used quite often to figure out how many rows (or items) are in a particular set of data. For example, if we wanted to know how many records were in the Person.Contact table, we could run the following query:
SELECT COUNT(*) AS RecordCount
FROM Person.Contact
RecordCount
19972
(1 row(s) affected)
Perfect. With one simple query, I know how many rows are in my table. Before we dive deeper into what COUNT can do, I want to step back and look at this particular query. This method of getting the number of records in a table is a very common one. However, there is an alternative that is much faster. If you need to find the number of records in a particular table, use the following query:
SELECT rows AS RecordCount
FROM sysindexes
WHERE id = OBJECT_ID('Person.Contact') AND indid < 2
RecordCount
19972
(1 row(s) affected)
The speed difference between these two queries is incredible. On my machine, the COUNT(*) method takes 25 times as long as this one. That is a big difference, especially if you execute the statement many times a day. The sysindexes table stores information about every index on each table. An indid of 0 (a heap) or 1 (a clustered index) covers every row in the table, so the row count that entry maintains tells us how many records the table holds. Be aware that this count is maintained by the engine and can occasionally drift out of date, so use this trick when an approximate count is acceptable.
Getting back to the COUNT statement, we can also use COUNT on a specific column. This would seem like a better way to count records since it only touches one column instead of the dreaded star select. We all know we shouldn't do SELECT *, so it would make sense that we don't use COUNT(*) either, right? Actually, no. Using COUNT against one column instead of star has a different purpose: COUNT with a column name counts the non-NULL entries in that column. If you use COUNT against a column that does not allow NULLs, you will not notice a difference in the resulting number. However, observe the following example:
SELECT COUNT(MiddleName) AS RecordCount
FROM Person.Contact
RecordCount
11473
Warning: Null value is eliminated by an aggregate or other SET operation.
(1 row(s) affected)
Notice that since I ran this query in the SQL query window, it actually warned me of this little gem. If you run this command from your code, you won't get a nice warning message to prompt you to consider what you are doing. The end result of this query is that we are "missing" about 8,500 records. That is a big deal. What can be even worse, however, is missing only a few. You would probably notice if 40% of your records were not present, but would you notice if 0.1% of them were missing? This is a prime example of why you need to know what every command does before using it.
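The same COUNT semantics apply in any SQL engine, so here is a minimal sketch using Python's built-in sqlite3 module. The table and names are made up for illustration; they merely stand in for Person.Contact:

```python
import sqlite3

# A tiny stand-in for Person.Contact; names and NULL middle names are invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Contact (FirstName TEXT, MiddleName TEXT)")
con.executemany(
    "INSERT INTO Contact VALUES (?, ?)",
    [("Gustavo", "R."), ("Catherine", None), ("Kim", None), ("Humberto", "J.")],
)

# COUNT(*) counts rows; COUNT(MiddleName) silently skips the NULL entries.
total = con.execute("SELECT COUNT(*) FROM Contact").fetchone()[0]
non_null = con.execute("SELECT COUNT(MiddleName) FROM Contact").fetchone()[0]
print(total, non_null)  # 4 2
```

No warning is raised here at all, which mirrors the point above: from application code, the two missing rows disappear without a trace.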
I'm sure you are now wondering about the other SQL aggregate functions such as SUM and AVG. The fact is that they work the same way. When you look at just the COUNT function, your first instinct is probably that this is a bug. However, when you look at it in light of SUM (and the other functions), it makes a lot of sense. NULL is not a number. In fact, NULL is the absence of anything. It is a vacuum. It does not mean the same thing as zero or an empty string. It means "the absence of a value". Therefore, it should not be included in an average; instead, those rows are skipped. If that isn't to your liking, the next section will tell you how to convert NULL values into default values.
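This skipping behavior is easy to see with AVG. A hedged sketch, again using sqlite3 with invented grades, shows that AVG divides by the count of non-NULL values, not by the row count:

```python
import sqlite3

# Hypothetical grade list; the NULL row represents a missing value.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Scores (Grade REAL)")
con.executemany("INSERT INTO Scores VALUES (?)", [(90,), (80,), (None,)])

avg = con.execute("SELECT AVG(Grade) FROM Scores").fetchone()[0]
total = con.execute("SELECT SUM(Grade) FROM Scores").fetchone()[0]
rows = con.execute("SELECT COUNT(*) FROM Scores").fetchone()[0]

# AVG = SUM(Grade) / COUNT(Grade), not SUM(Grade) / COUNT(*).
print(avg)           # 85.0 -- the NULL row is skipped entirely
print(total / rows)  # 56.66... -- what you'd get if NULL counted as zero
```

Whether 85 or 56.7 is the "right" average depends entirely on what the NULL means in your data, which is exactly why you need to decide deliberately.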
Translating NULL Values with COALESCE
There will be times when you want to treat NULL values as specific values. For example, if a student is required to turn in three assignments for a grade, you might want to treat any missing assignments as zeros after the due date passes. You may not want to put a zero in the record itself, since that would indicate a completed assignment that received no credit (which is different from a missing assignment). When querying the data, you can use the COALESCE function to convert these entries to a default value. For example, in our statement above using the COUNT(MiddleName) method to count our rows, we found that a bunch of middle names had NULL values. For the sake of demonstration (since this is just to show how to use the COALESCE keyword), we could change the query like so:
SELECT COUNT(COALESCE(MiddleName,'')) AS RecordCount
FROM Person.Contact
RecordCount
19972
(1 row(s) affected)
Notice that we now have the total record count again. That is because we used the COALESCE keyword to convert all of the NULL entries in the MiddleName column to empty strings instead. If we were converting grades, we would have used a zero instead of an empty string. COALESCE can also be a very useful tool when the system you are sending the results to does not play well with NULL values. You could use COALESCE on every column that allows NULL values so that the resulting recordset is NULL-free.
The COALESCE operator can be used for even more interesting applications than just removing NULL values. As we will see below, we can use it to put new values into a column based upon what the NULL value means. We could even use COALESCE to combine two (or more) columns into one if we knew that only one column would have a value (or we only wanted the first value we found).
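The column-combining trick can be sketched in a few lines. This is a hypothetical table (invented here purely for illustration) where each contact tends to have only one phone number filled in:

```python
import sqlite3

# Hypothetical contact table where at most one phone column is populated.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Phones (CellPhone TEXT, HomePhone TEXT)")
con.executemany("INSERT INTO Phones VALUES (?, ?)", [
    ("555-0100", None),   # cell only
    (None, "555-0199"),   # home only
    (None, None),         # neither
])

# COALESCE walks its arguments left to right and returns the first non-NULL one.
rows = con.execute(
    "SELECT COALESCE(CellPhone, HomePhone, 'no phone') FROM Phones"
).fetchall()
print(rows)  # [('555-0100',), ('555-0199',), ('no phone',)]
```

The trailing constant acts as a catch-all default when every column is NULL.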
Collapsing Data with GROUP BY
One of the most common ways to get meaningful information out of a dataset is to collapse it. For example, we don't look at the results of every free throw a basketball player attempts; instead, we look at the percentage the player makes. When looking at multiple players in one table, you would use a GROUP BY statement to group each player's records together, count how many shots were made, and divide that by the total shots attempted. This gives us the percentage for each player.
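The free-throw idea can be sketched directly. This uses sqlite3 with an invented shot log, where Made is 1 for a made shot and 0 for a miss (the table and column names are mine, not from any real schema):

```python
import sqlite3

# Invented free-throw log: Made is 1 for a made shot, 0 for a miss.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE FreeThrows (Player TEXT, Made INTEGER)")
con.executemany("INSERT INTO FreeThrows VALUES (?, ?)", [
    ("Ann", 1), ("Ann", 0), ("Ann", 1), ("Ann", 1),
    ("Bo", 1), ("Bo", 0),
])

# One output row per player: made shots divided by total attempts.
rows = con.execute("""
    SELECT Player, SUM(Made) * 100.0 / COUNT(*) AS Pct
    FROM FreeThrows
    GROUP BY Player
    ORDER BY Player
""").fetchall()
print(rows)  # [('Ann', 75.0), ('Bo', 50.0)]
```

Six shot records collapse to two summary rows, one per player, which is the essence of GROUP BY.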
A GROUP BY statement groups the records by each column specified in the GROUP BY statement. It can then perform aggregate functions against those grouped records. For example, here I have grouped the Contact records by EmailPromotion and then counted how many records are in each group:
SELECT EmailPromotion, COUNT(*) AS RecordCount
FROM Person.Contact
GROUP BY EmailPromotion
EmailPromotion RecordCount
0 11158
1 5044
2 3770
(3 row(s) affected)
This query would be useful for finding out how many people are in each email promotion. While this is a simple use of GROUP BY, it highlights the basic structure we need to form this command properly. When using GROUP BY, you need to include every non-aggregate column in the GROUP BY statement. Each column that you put in this statement becomes part of what is grouped on. Therefore, if you put just the last name in this section, it will group on each last name; there might be 15 Smiths in your database. If you put both first name and last name in the GROUP BY section, it groups by the combination of the two, so there might be 12 entries with a last name of Smith. It is rare that you would put a large number of columns in the GROUP BY section, since that makes each group match very few rows. The one exception is when you are looking for duplicates. For example, if you wanted to create user names by combining the first and last name, you might first run a query to see how many duplicates you would have. You would do this by putting both the first and last name columns in the GROUP BY and doing a count of the records. You could then sort by the count descending to find the maximum number of duplicates for any given name. Here is an example of this type of query:
SELECT FirstName,LastName, COUNT(*) AS Duplicates
FROM Person.Contact
GROUP BY FirstName,LastName
ORDER BY Duplicates DESC
FirstName LastName Duplicates
Laura Norman 5
Kim Ralls 4
Jean Trenary 4
Sheela Word 4
…
(19516 row(s) affected)
Notice that we used ORDER BY to sort descending so that our duplicates appear on top. This query returned a lot of records, most of which were not duplicates. We could limit the query with the TOP command, but that wouldn't really give us what we want. Say, for example, we only wanted the records that had four or more duplicates. Logic says we would use a WHERE statement. The problem is that WHERE operates on each record, not on the aggregate data. I've seen some people solve this by nesting the query inside another one that has the WHERE statement. It works, but it is unnecessary. Instead, we should use the HAVING keyword, which performs the filtering on the resultant rows, not on the original rows. Here is an example of how we could use this in our query:
SELECT FirstName,LastName, COUNT(*) AS Duplicates
FROM Person.Contact
WHERE LastName = 'Miller' OR LastName = 'Martin'
GROUP BY FirstName,LastName
HAVING COUNT(*) > 3
ORDER BY Duplicates DESC
FirstName LastName Duplicates
Benjamin Martin 4
Mindy Martin 4
Dylan Miller 4
Frank Miller 4
(4 row(s) affected)
Notice that the HAVING statement does not recognize the alias from the SELECT statement, so we need to put the expression in the HAVING statement as well. In fact, we wouldn't even need the expression in the SELECT statement if we didn't want it. For example, we might want only those who have four or more duplicates without caring exactly how many records they have. In that case, we would omit the Duplicates column but keep the expression in the HAVING section. Also note that I included a WHERE statement just to show it can be done. Remember, WHERE operates against each individual record, while HAVING operates against the resultant row data.
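The WHERE-before-grouping, HAVING-after-grouping pipeline can be demonstrated in miniature. The names below are invented; only the mechanics matter:

```python
import sqlite3

# Invented names; the two Dylan Millers create one duplicate pair.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Contact (FirstName TEXT, LastName TEXT)")
con.executemany("INSERT INTO Contact VALUES (?, ?)", [
    ("Dylan", "Miller"), ("Dylan", "Miller"), ("Frank", "Miller"),
    ("Mindy", "Martin"), ("Ben", "Young"),
])

rows = con.execute("""
    SELECT FirstName, LastName, COUNT(*) AS Duplicates
    FROM Contact
    WHERE LastName <> 'Young'   -- filters individual rows, before grouping
    GROUP BY FirstName, LastName
    HAVING COUNT(*) > 1         -- filters the grouped results
""").fetchall()
print(rows)  # [('Dylan', 'Miller', 2)]
```

Ben Young is removed before any grouping happens; Frank Miller and Mindy Martin survive the WHERE but are dropped by HAVING because their groups contain only one row each.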
Getting Sub-totals using the WITH ROLLUP Command
Using the GROUP BY statement starts you down the path of summarizing your data. Once you have a firm understanding of GROUP BY, this next command makes life even better. The WITH ROLLUP command allows you to get sub-totals inside your query. For example, suppose you had a bunch of purchase orders, each with multiple items on it. Your boss asks for a report of the total per purchase order, but you also need to provide a summary of how much was spent on each product on each purchase order (assuming multiple lines on a purchase order could reference the same product). One way to do this would be a GROUP BY on the purchase order number and the product number, with a SUM on the product cost column. That would give you the total per item per purchase order, but it wouldn't give you the total for each purchase order. This is where the WITH ROLLUP command comes in. It gives you a sub-total per section and an overall total as well. Let's see what this would look like:
SELECT PurchaseOrderID ,
ProductID ,
SUM(LineTotal) AS Total
FROM Purchasing.PurchaseOrderDetail
WHERE PurchaseOrderID < 5
GROUP BY PurchaseOrderID,ProductID WITH ROLLUP
PurchaseOrderID ProductID Total
1 1 201.04
1 NULL 201.04
2 359 135.36
2 360 136.7415
2 NULL 272.1015
3 530 8847.30
3 NULL 8847.30
4 4 171.0765
4 NULL 171.0765
NULL NULL 9491.518
(10 row(s) affected)
If we ran this statement without the WITH ROLLUP command, it would have produced a similar list, but the rows that contain NULL values would not be included. This is a simplistic example, but it shows how the WITH ROLLUP command works. Each row that contains a NULL value is a section summary. Look at the fifth row (PurchaseOrderID 2, ProductID NULL). This is a summary of the entire purchase order; thus, the Total column is the SUM of the previous two sub-totals. At the bottom, the row with NULL in both of the first two columns is the grand total row, and its Total value is the SUM of the entire query. As you can see, this can be a valuable command to keep in mind when you are looking for aggregate information. You can use it with any of the aggregate functions such as AVG, SUM, COUNT, etc. Please note that column order is important in the GROUP BY statement. I put my PurchaseOrderID column first so that I could get totals for each purchase order. If I had put ProductID first, I would have gotten totals for each product, which probably isn't what I want.
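WITH ROLLUP is a T-SQL feature, but a useful mental model is that it is equivalent to unioning the detail aggregation with its subtotals. The sketch below builds exactly that union in sqlite3 (which has no ROLLUP) over invented purchase-order lines, not the AdventureWorks data:

```python
import sqlite3

# Invented purchase-order lines (not the AdventureWorks data).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE PODetail (PurchaseOrderID INTEGER, ProductID INTEGER, LineTotal REAL)")
con.executemany("INSERT INTO PODetail VALUES (?, ?, ?)", [
    (1, 1, 200.0),
    (2, 359, 135.0), (2, 360, 137.0),
])

# ROLLUP on (PurchaseOrderID, ProductID) = three groupings stacked together:
# the detail rows, a subtotal per PurchaseOrderID, and a grand total.
rows = con.execute("""
    SELECT PurchaseOrderID, ProductID, SUM(LineTotal) FROM PODetail
        GROUP BY PurchaseOrderID, ProductID
    UNION ALL
    SELECT PurchaseOrderID, NULL, SUM(LineTotal) FROM PODetail
        GROUP BY PurchaseOrderID
    UNION ALL
    SELECT NULL, NULL, SUM(LineTotal) FROM PODetail
""").fetchall()
for row in rows:
    print(row)  # NULLs (None) mark the summary rows, as in the output above
```

This also explains why the summary markers come out as NULL: the rolled-up column simply is not part of that grouping.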
Advanced Sub-totals using the WITH CUBE Command
Once you understand the WITH ROLLUP command, you can go even further with the WITH CUBE command, which gives you a summary along each dimension. For instance, in our example above, we wanted the total per purchase order and the sum of each item under each purchase order. But what if we also wanted to see how much we purchased of each product? If you already have your query set up with the WITH ROLLUP command, just change it to WITH CUBE instead. Here is an example:
SELECT PurchaseOrderID ,
ProductID ,
SUM(LineTotal) AS Total
FROM Purchasing.PurchaseOrderDetail
WHERE PurchaseOrderID < 5
GROUP BY PurchaseOrderID,ProductID WITH CUBE
PurchaseOrderID ProductID Total
1 1 201.04
NULL 1 201.04
4 4 171.0765
NULL 4 171.0765
2 359 135.36
NULL 359 135.36
2 360 136.7415
NULL 360 136.7415
3 530 8847.30
NULL 530 8847.30
NULL NULL 9491.518
1 NULL 201.04
2 NULL 272.1015
3 NULL 8847.30
4 NULL 171.0765
(15 row(s) affected)
If you compare this to the previous query results, you will see five extra rows: we now have the sub-total for each ProductID in addition to what we already had. The order is a bit different, with the totals per ProductID intermixed in our list and the totals per purchase order at the bottom. If order matters to you, you can change the column order in the GROUP BY section; with WITH CUBE it won't affect the data returned, just the order it is displayed in (unlike WITH ROLLUP, where column order is important). The WITH CUBE command allows us to get summary data for each column. Use it with care, however, since it can create a lot of overhead if you put too many columns in the GROUP BY section.
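In the same union-of-groupings mental model, CUBE over two columns means every grouping combination: (a, b), (a), (b), and the grand total. A sketch with invented data (sqlite3 has no WITH CUBE, so the union is spelled out by hand):

```python
import sqlite3

# Invented data; CUBE adds a per-ProductID subtotal on top of what ROLLUP gives.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE PODetail (PurchaseOrderID INTEGER, ProductID INTEGER, LineTotal REAL)")
con.executemany("INSERT INTO PODetail VALUES (?, ?, ?)", [
    (1, 100, 10.0), (1, 200, 20.0),
    (2, 100, 30.0),
])

# CUBE over two columns = all four groupings: (PO, Product), (PO), (Product), ().
rows = con.execute("""
    SELECT PurchaseOrderID, ProductID, SUM(LineTotal) FROM PODetail
        GROUP BY PurchaseOrderID, ProductID
    UNION ALL
    SELECT PurchaseOrderID, NULL, SUM(LineTotal) FROM PODetail
        GROUP BY PurchaseOrderID
    UNION ALL
    SELECT NULL, ProductID, SUM(LineTotal) FROM PODetail
        GROUP BY ProductID
    UNION ALL
    SELECT NULL, NULL, SUM(LineTotal) FROM PODetail
""").fetchall()
print((None, 100, 40.0) in rows)  # True -- total purchased of product 100
```

The overhead warning above follows directly from this picture: each additional CUBE column doubles the number of groupings that have to be computed.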
Identifying Summary Rows with the GROUPING Keyword
Once you start to become comfortable with WITH ROLLUP and WITH CUBE, you might wonder if there is a way to identify the summary rows other than by their NULL values. SQL comes equipped with the GROUPING keyword, which identifies each summary row that the WITH ROLLUP or WITH CUBE command generates. Here is an example of how to use it:
SELECT PurchaseOrderID ,
ProductID ,
SUM(LineTotal) AS Total ,
GROUPING(PurchaseOrderID) AS PurchaseOrderGrouping,
GROUPING(ProductID) AS ProductGrouping
FROM Purchasing.PurchaseOrderDetail
WHERE PurchaseOrderID < 5
GROUP BY PurchaseOrderID,ProductID WITH CUBE
PurchaseOrderID ProductID Total PurchaseOrderGrouping ProductGrouping
1 1 201.04 0 0
NULL 1 201.04 1 0
4 4 171.0765 0 0
NULL 4 171.0765 1 0
2 359 135.36 0 0
NULL 359 135.36 1 0
2 360 136.7415 0 0
NULL 360 136.7415 1 0
3 530 8847.30 0 0
NULL 530 8847.30 1 0
NULL NULL 9491.518 1 1
1 NULL 201.04 0 1
2 NULL 272.1015 0 1
3 NULL 8847.30 0 1
4 NULL 171.0765 0 1
(15 row(s) affected)
As you can see, the GROUPING function puts a 1 in its column when that column has been rolled away into a sub-total or total row, and a 0 when the row is an ordinary group created by the GROUP BY statement. I used aliases to name these two new columns for easier reference. Note that you can now filter this query on these values with a HAVING statement.
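Why does GROUPING matter when the NULLs already look obvious? Because a column that allows NULLs can produce a genuinely NULL group that is indistinguishable from a subtotal. The sketch below (invented data, and the union-of-groupings emulation since sqlite3 lacks GROUPING) shows the ambiguity and the flag that resolves it:

```python
import sqlite3

# Invented data with a genuinely NULL ProductID, to show why checking for
# NULL alone cannot identify summary rows.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE PODetail (PurchaseOrderID INTEGER, ProductID INTEGER, LineTotal REAL)")
con.executemany("INSERT INTO PODetail VALUES (?, ?, ?)", [
    (1, 100, 10.0),
    (1, None, 5.0),   # a real NULL product, not a subtotal
])

# Emulating GROUPING(ProductID): each branch knows which column it rolled
# away, so it emits the flag as a literal.
rows = con.execute("""
    SELECT PurchaseOrderID, ProductID, SUM(LineTotal), 0 AS ProductGrouping
        FROM PODetail GROUP BY PurchaseOrderID, ProductID
    UNION ALL
    SELECT PurchaseOrderID, NULL, SUM(LineTotal), 1
        FROM PODetail GROUP BY PurchaseOrderID
""").fetchall()
for row in rows:
    print(row)
# (1, None, 5.0, 0) is the real NULL product; (1, None, 15.0, 1) is the subtotal
```

Without the flag, the two `(1, None, ...)` rows would be impossible to tell apart.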
Identifying Summary Rows with the COALESCE Keyword
We have already seen how the COALESCE command works. As we know, COALESCE returns the first value in its list that is not NULL. In the earlier examples, we used COALESCE simply to scrub out a NULL value and replace it with a constant as a way of providing a default. However, we can put much more advanced statements inside our COALESCE. Let me show you what I mean via an example, and then I will explain what I did.
SELECT COALESCE(CAST(PurchaseOrderID AS NVARCHAR(10)),
'Product Total: ' + CAST(ProductID AS NVARCHAR(10)), 'Grand Total') AS PurchaseOrder,
COALESCE(CAST(ProductID AS NVARCHAR(10)),
'PO Total: ' + CAST(PurchaseOrderID AS NVARCHAR(10)), '') AS Product ,
SUM(LineTotal) AS Total
FROM Purchasing.PurchaseOrderDetail
WHERE PurchaseOrderID < 5
GROUP BY PurchaseOrderID,ProductID WITH CUBE
PurchaseOrder Product Total
1 1 201.04
Product Total: 1 1 201.04
4 4 171.0765
Product Total: 4 4 171.0765
2 359 135.36
Product Total: 359 359 135.36
2 360 136.7415
Product Total: 360 360 136.7415
3 530 8847.30
Product Total: 530 530 8847.30
Grand Total 9491.518
1 PO Total: 1 201.04
2 PO Total: 2 272.1015
3 PO Total: 3 8847.30
4 PO Total: 4 171.0765
(15 row(s) affected)
The first thing you will note is that this is a bit complicated. Because the columns are of type int, I had to CAST each value to nvarchar or the statement would have thrown an error. This makes for an ugly query, but the performance is no different from the statement where we used the GROUPING keyword (I checked it multiple times because I didn't believe it).
Getting beyond the fact that it looks scary, there isn't much to this query. Basically, I use a COALESCE statement with three options. The first option is the field itself. If that is NULL, we know we are looking at a total for the other column, so I combine a static string with the other column's value to produce a total line. If the other column is NULL too, we are on the grand total line. In this case, I alternated what I did based upon which column I was in: in the first column, I marked it as the grand total line; in the second, I just put an empty string since the line was already designated as a grand total. The only really tricky thing to remember is that a NULL value in one column means the other column is the one being totaled (not the current column).
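The three-branch logic can be sketched compactly. This uses sqlite3, so string concatenation is `||` rather than T-SQL's `+` and the cast target is TEXT rather than NVARCHAR; the three rows are hand-made to look like CUBE output:

```python
import sqlite3

# Rows shaped like WITH CUBE output: NULLs mark the summary rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (PurchaseOrderID INTEGER, ProductID INTEGER)")
con.executemany("INSERT INTO R VALUES (?, ?)", [
    (1, 1),        # detail row
    (None, 1),     # product sub-total
    (None, None),  # grand total
])

# Branch 1: the column itself.  Branch 2: label built from the other column
# (concatenating with NULL yields NULL, so it falls through).  Branch 3: the
# grand-total label.
labels = con.execute("""
    SELECT COALESCE(CAST(PurchaseOrderID AS TEXT),
                    'Product Total: ' || CAST(ProductID AS TEXT),
                    'Grand Total')
    FROM R
""").fetchall()
print(labels)  # [('1',), ('Product Total: 1',), ('Grand Total',)]
```

Note the quiet trick in branch two: because concatenating anything with NULL yields NULL, that branch automatically disqualifies itself on the grand-total row, letting COALESCE fall through to the final constant.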
Conclusion
In this article, we have learned how to compact and aggregate data into meaningful information. We covered a lot of ground centered around the GROUP BY statement and how it can be used effectively beyond the simple methods we normally see. There is a lot more that I have not yet covered about SQL. My plan is to keep addressing the needs of the software developer when it comes to writing T-SQL. I hope you have found something of interest in this article. I appreciate your constructive feedback and look forward to your thoughts on what can be done to improve it.
History
- January 22, 2011: Initial version