This post is a guide to the Group By clause in SQL. It can be used to consolidate like values into a single row.
Use the SQL GROUP BY Clause is to consolidate like values into a single row. The group is a set of columns. The group by
returns a single row from one or more within the query having the same column values. Its main purpose is this work alongside functions, such as SUM
or COUNT
, and provides a means to summarize values.
What is a SQL Group?
In SQL, groups are unique column value combinations. When a query has a GROUP BY
, rather than returning every row that meets the filter condition, values are first grouped together. The rows returned are the unique combinations within the columns.
The overall syntax for GROUP BY
is:
SELECT colum1, column2, column3, ..
FROM table1
GROUP BY column1, column2, column3,…
For the lesson, we’ll focus on the HumanResources.Employee
table.
In this example, I’ve ordered the table by JobTitle
and Gender
.
Here, you can see the beginnings of the group. The final step is to add a SQL GROUP BY
clause to the statement to create the grouping.
SELECT JobTitle, MaritalStatus, COUNT(*) NumberEmployees
FROM HumanResources.Employee
GROUP BY JobTitle, MaritalStatus
ORDER BY JobTitle, MaritalStatus
Whose result you see here:
I want you to notice several things:
I removed columns that were part of the group or summarization, such as COUNT(*)
. Only columns listed in the GROUP BY
are allowed in the select
. In addition to these, you are able to also use aggregation functions, such as SUM
, COUNT
, MIN
, MAX
and AVG
.
I kept the ORDER BY
in the statement since GROUP BY
doesn’t guarantee the groups are sorted.
Now that you have a good grasp over how it works, let’s look at some more items related to GROUP BY
.
SQL GROUP BY with Sum
Let’s do some summarizations! For a change, we’ll summarize SalesOrderDetail
records. For each SalesOrder
, let’s calculate the OrderTotal
, the Average SalesOrderDetail
Line Total for a SalesOrder
, and the number of SalesOrderDetail
lines within an order.
We’ll use the group by
with SUM
, AVG
, and COUNT
functions to help.
SELECT SalesOrderID
,SUM(LineTotal) AS OrderTotal
,AVG(LineTotal) AS AverageLineTotal
,COUNT(LineTotal) AS NumberOfLines
FROM Sales.SalesOrderDetail
GROUP BY SalesOrderIDORDER BY SalesOrderID;
Since there are one or more lines per Sales Order
, we’ll group by
the SalesOrderID
. Check out the results:
Here are some tips to think about when using these functions. You can:
- Use more than one function in your SQL.
- Include the functions within Expressions.
- Alias the expression.
You can also create new expressions using aggregate functions. To demonstrate this, let’s calculate the Average Line Total knowing the Average = the Sum of Line Totals / the Count.
Here is the query to make it so:
SELECT SalesOrderID
,SUM(LineTotal) AS OrderTotal
,COUNT(LineTotal) AS NumberOfLines
,SUM(LineTotal) / COUNT(LineTotal) AS AverageLineTotal
,AVG(LineTotal) AverageFunctionLineTotal
FROM Sales.SalesOrderDetail
GROUP BY SalesOrderID
ORDER BY SalesOrderID;
And the results. Notice how the “calculated” AverageLineTotal
equal the Function’s value.
Ordering Grouping Results
Groups aren’t returned in sorted order. In order to do this, you need to include the ORDER BY
clause. Add the ORDER BY
clause to the end of the SQL statement.
Modify this statement so it sorts the groups by SalesOrderID
and CarrierTrackingNumber
.
SELECT SalesOrderID, CarrierTrackingNumber, COUNT(1) as NumberofLines
FROM Sales.SalesOrderDetail
GROUP BY SalesOrderID, CarrierTrackingNumber
ORDER BY SalesOrderID, CarrierTrackingNumber
HAVING and GROUP BY
The HAVING
clause filters groups created using GROUP BY
. The HAVING
clause determines whether a group is included within the result.
Unlike the WHERE
clause which filters individual records, the HAVING
clause filter groups; however, keep in mind that SQL containing both a WHERE
and HAVING
processes both!
In this case, the WHERE
clause is first processed, and those records filtered. The HAVING
is then applied to groups as specified in the GROUP BY
.
Typically HAVING
is used with GROUP BY
, but if a GROUP BY
clause isn’t included, then HAVING
operates on the entire result set as one implicit group.
Consider our previous example. Here, we calculate the OrderTotals
.
SELECT SalesOrderID
,SUM(LineTotal) AS OrderTotal
,AVG(LineTotal) AS AverageLineTotal
,COUNT(LineTotal) AS NumberOfLines
FROM Sales.SalesOrderDetail
GROUP BY SalesOrderID
ORDER BY SalesOrderID;
Suppose we need to show large orders, those totalling more than $100,000.00. How can we do so?
Use the HAVING
clause to filter on the SUM(LineTotal)
.
Here is the query to show large orders:
SELECT SalesOrderID
,SUM(LineTotal) AS OrderTotal
,AVG(LineTotal) AS AverageLineTotal
,COUNT(LineTotal) AS NumberOfLines
FROM Sales.SalesOrderDetail
GROUP BY SalesOrderID
HAVING SUM(LineTotal) > 100000.00
ORDER BY SalesOrderID;
What’s nice about HAVING
HAVING
clause’s key advantage is its ability to filter GROUPS
using aggregate functions. This is something you cannot do within a SELECT
statement.
Hint: If you are having a hard time understand HAVING
and WHERE
’s role within SELECT
, I recommend understanding the SELECT
statement order of execution first.
GROUP BY with Multiple Tables
GROUP BY
can also include columns from multiple tables. In the following query, let’s calculate the Sales by ProductName
. To do so, we'll join SalesOrderDetail
to Product
. We’ll join
on product
.
SELECT p.Name ProductName,
SUM(s.OrderQTY) TotalQuantitySold,
SUM(s.LineTotal) TotalSales,
SUM(s.OrderQTY * p.StandardCost) TotalStandardCost,
SUM(s.LineTotal) - SUM(s.OrderQTY * p.StandardCost) GrossMargin
FROM Sales.SalesOrderDetail s
INNER JOIN Production.Product p ON s.ProductID = p.ProductID
GROUP BY p.Name
ORDER BY p.Name
Check out how columns from the two tables participate within the TotalStandardCost
calculation.
Sort by SUM()
Using the same example as above, notice I can sort by one of the totals. In this case, we’ll sort by TotalQuantitySold
.
SELECT p.Name ProductName,
SUM(s.OrderQTY) TotalQuantitySold,
SUM(s.LineTotal) TotalSales,
SUM(s.OrderQTY * p.StandardCost) TotalStandardCost,
SUM(s.LineTotal) - SUM(s.OrderQTY * p.StandardCost) GrossMargin
FROM Sales.SalesOrderDetail s
INNER JOIN Production.Product p ON s.ProductID = p.ProductID
GROUP BY p.Name
ORDER BY SUM(s.OrderQTY)
Since this column is aliased, let’s make the query easier to read, by sorting by the alias name.
GROUP BY with an Expression
You can group on an expression. Grouping is allowed as long as the same expression is within the SELECT
list.
SELECT Year(OrderDate) OrderYear, Sum(TotalDue) YearlySales
FROM Sales.SalesOrderHeader
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear
In this example, let’s group and summarize SalesOrder
totals by year
And there is nothing stopping us from grouping one or more expressions! Let’s further group sales by Month
.
SELECT Year(OrderDate) OrderYear,
Month(OrderDate) OrderMonth,
SUM (TotalDue) MonthlySales
FROM Sales.SalesOrderHeader
GROUP BY YEAR(OrderDate), MONTH(OrderDate)
ORDER BY OrderYear, OrderMonth
Before I forget, check out how I sorted the results. I used the column alias to order rather than repeating the Expression. Both are valid, but it sure is nice to keep our expression to a minimum! I think it’s easier to read. In case you’re wondering, this is all possible due to SQL Order of Operations.
Here are the results. I color coded the columns to better show you the multiple expressions. Like the previous example, each expression used within the GROUP BY
is also listed as a column.
SQL GROUP BY versus DISTINCT
In some ways, GROUP BY
is similar to using the DISTINCT
Clause.
SELECT DISTINCT JobTitle, MaritalStatus
FROM HumanResources.Employee
ORDER BY JobTitle, MaritalStatus
Returns the same results as this SQL using SQL GROUP BY
:
SELECT JobTitle, MaritalStatus
FROM HumanResources.Employee
GROUP BY JobTitle, MaritalStatus
ORDER BY JobTitle, MaritalStatus
The key difference between DISTINCT
and GROUP BY
is that when using GROUP BY
, you can also summarize values.
Here, we included the COUNT()
function to determine how many rows make up each group:
SELECT SalesOrderID, CarrierTrackingNumber, COUNT(1) as NumberofLines
FROM Sales.SalesOrderDetail
GROUP BY SalesOrderID, CarrierTrackingNumber;
Conclusion
When using SQL GROUP BY
, keep these points in mind:
- The SQL
GROUP BY
does not guarantee the groups are returned in sorted order. - You can use aggregate functions, such as
SUM
and COUNT
to summarize data. - The non summarized columns in your
SELECT
list must match those listed in the GROUP BY
.