Before we start: if you missed part 1 ( about joins ), you can find it here.
Introduction
Well, I said I’d do an article on CTEs this Sunday, and I still intend to, but answering some questions in the forums made me realise the need for an article on select statements in this series after all. So, this article is going to cover most of the basics of selects. Things like window functions and XML will be covered separately, later on.
To get the most out of SQL articles, you really need to be sitting in front of SQL Server, where you can run the SQL, change it, and see how the results change. To facilitate this, I’d like you to download and install the AdventureWorks database, because it gives you a good set of sample data to work with. You can download it here. Go ahead, I’ll wait.
Simple Select
OK, so when you open the SQL file I provided, you’ll want to uncomment one select at a time and run it, to work through my examples. The first one is easy:
select * from Sales.Currency
This just returns a whole table. Although select * will return every column, you should avoid using it, because your code is not specifying which columns you want. If someone writes code against your SQL that reads columns by ordinal instead of by name, a new column could break that code; if a column is removed, your SQL would still run and the downstream code would break instead. A good principle is: if something breaks, have it break as close to the cause as possible, so you can easily work out what has gone wrong.
Note that while the currency codes are probably alphabetical ( and that’s presumably the order they were inserted in ), the order is NOT guaranteed. SQL is a set-based language, and a set has no guaranteed order. If you care about the order of your data, you should always use an ORDER BY clause.
Select with Order By
select CurrencyCode, Name, ModifiedDate from Sales.Currency order by name
This returns all currencies, ordered by name ( which is probably what you’d want if you were going to show them in a list ). Note if the first column you order by has duplicate values, you can specify additional columns to order by, comma separated.
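Currency names happen to be unique, so this is purely to show the syntax, but a sketch of ordering by name and then, within any ties, by most recently modified would look like this:
select CurrencyCode, Name, ModifiedDate from Sales.Currency order by name, ModifiedDate desc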
TOP - selecting X Rows
You can also select a certain number of rows, like this:
select top 10 CurrencyCode, Name, ModifiedDate from Sales.Currency order by name
You can also sort in descending order by adding desc after the column name. Here’s a trick I learned for getting a page of data out of the DB:
declare @page int = 5
declare @pageSize int = 10

select CurrencyCode, Name, ModifiedDate
from
(
    select top (@pageSize) CurrencyCode, Name, ModifiedDate
    from
    (
        select top (@pageSize * (@page + 1)) CurrencyCode, Name, ModifiedDate
        from Sales.Currency
        order by name
    ) p
    order by name desc
) t
order by name
This pulls out the data up to the end of the page you want ( note that @page is zero-based here, so page 5 covers rows 51 to 60 ), in name order, then takes the last @pageSize rows by flipping the sort, and finally flips the order back again.
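Incidentally, on SQL Server 2012 and later you can get the same page more directly with OFFSET / FETCH; here is a minimal sketch, assuming the same zero-based @page:
declare @page int = 5
declare @pageSize int = 10

select CurrencyCode, Name, ModifiedDate
from Sales.Currency
order by name
offset (@page * @pageSize) rows
fetch next @pageSize rows only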
Selecting random rows
If you want a quick sample of your data, there’s a keyword ‘tablesample’. It does not work on the Sales.Currency table, and I’m not sure why, although I know the rules for what it won’t work with:
TABLESAMPLE cannot be applied to:
- derived tables
- tables from linked servers
- tables derived from table-valued functions
- rowset functions
- OPENXML
It also can’t be used in views or functions.
It works like this:
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (10 PERCENT) ;
This does not work:
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (50 ROWS) ;
But this does:
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (500 ROWS) ;
So it seems to get confused when you ask for too small a sample. TABLESAMPLE works at the level of 8 KB data pages rather than individual rows: a ROWS value is converted to a percentage, and each page is either included whole or skipped entirely, so a tiny request can round down to nothing at all. That also means it’s not truly random, but you can make it repeat the same set like so:
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (10 PERCENT)
REPEATABLE (205)
REPEATABLE acts as a fixed seed for the sampling, and can be any positive integer; the same seed returns the same sample as long as the data hasn’t changed.
You will notice the data tends to be selected in blocks. If you want truly random data, do something like this:
SELECT TOP 500 FirstName, LastName
FROM Person.Person order by newid()
NewId() returns a UNIQUEIDENTIFIER, which is T-SQL’s type for a GUID. This is slower, because every row gets a new random value to sort by, but it gives a better result if you genuinely need randomness.
Another option is this:
SELECT TOP 500 FirstName, LastName
FROM Person.Person order by (SELECT NULL)
NULL is a special value, and NULL does not equal NULL. However, ordering by (SELECT NULL) really means ordering by a constant, so SQL Server is free to return the rows in whatever order is cheapest. The result is arbitrary rather than truly random ( it can easily come back in the same order every time ), so treat it as ‘I don’t care about the order’ rather than ‘give me random rows’; if you need randomness, stick with NEWID().
WHERE and DISTINCT
Of course, you usually don’t want ALL your data, you want a filtered view. This is where the WHERE statement comes in.
select firstname, lastname from person.person where persontype = 'EM'
will return all people who are Employees. If you wanted to know what all the person types are, you would do this:
select distinct persontype from person.person
Note: DISTINCT is good for when you know there’s a range of values and you want the unique ones. When you join tables together, you’ll sometimes find you have duplicate results. This usually means that one of the tables you are joining to has a duplicate value in the key you specified, doubling the result across the other tables, or it just means your SQL is not well thought out. Don’t use DISTINCT to hide your bugs: if you’re getting duplicate rows, work out why.
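Here’s a tiny sketch of how a duplicated key doubles rows, using throwaway literal data rather than the AdventureWorks tables:
select o.OrderId, c.Name
from ( values (1, 'A'), (2, 'B') ) o (OrderId, CustomerCode)
inner join ( values ('A', 'Alice'), ('A', 'Alice again'), ('B', 'Bob') ) c (Code, Name) on c.Code = o.CustomerCode
-- OrderId 1 comes back twice, because the code 'A' appears twice in the right-hand table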
GROUP BY
Now we get to the construct I really wanted to write something about, because I see so many forum questions that are based on people not knowing it. I suspect one of the first things people want to do with SQL that moves beyond just being ‘given the data’, is to get a total of something, or perhaps some other calculation. If you do this:
select FirstName, LastName, LineTotal from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
order by lastName, firstName
You can see the individual prices users have paid. But if you do this:
select FirstName, LastName, sum(LineTotal) from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
order by lastName, firstName
You get an error:
Column 'person.person.FirstName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
What does the error mean ? Well, SUM is an aggregate function. That means, it does not give you the value in one row, but a value that is aggregated (derived from a group of rows). SQL is smart enough to realise you probably don’t want to see the entire total over and over. If you want to know the total for all items ordered, do this:
select sum(LineTotal) from sales.salesorderdetail
Notice I got rid of all the joins in my SQL, because joins can have the effect of filtering your values ( note that in the earlier query I used a right join because I noticed that PersonID in Sales.Customer is sometimes null ). Never join if you don’t need to: it can have the unintended consequence of filtering out rows that exist in the tables you care about, but not in a table you joined to.
However, we want to know the total per customer. This is how we do that:
select LastName, FirstName, sum(LineTotal) as Total from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
group by firstname, lastname
order by lastname, firstname
Notice that some of our Total values are NULL. These rows are people who exist in Person.Person but have no sales at all; they only appear because of the right join, so their LineTotal is NULL. The behaviour I want to point out is that NULL is not zero: NULL means ‘no known value’, so adding any value to NULL gives NULL. Aggregate functions actually skip NULLs, but a group where every value is NULL still ends up with a NULL total, which is what happens here. So the correct SQL is:
select LastName, FirstName, sum(coalesce(LineTotal, 0)) as Total from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
group by firstname, lastname
order by lastname, firstname
This will replace all NULLs with 0 and give meaningful answers across the board ( in the real world we can assume that NULL here means no payment was made, which for our purposes is equivalent to zero, even though NULL itself never equals zero ).
Of course, if you were to join to tables that give product details, you could group by product, or anything else you like. This is a very powerful technique for finding values in SQL, and you should spend some time playing with it, if you’ve not used it before.
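For example, here’s a sketch that groups by product instead, assuming the standard AdventureWorks Production.Product table ( ProductID and Name come from that sample schema ):
select p.Name as Product, sum(sod.LineTotal) as Total from sales.salesorderdetail sod
inner join production.product p on p.ProductID = sod.ProductID
group by p.Name
order by Total desc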
There are a variety of other functions you could use, for example
select LastName, FirstName, max(coalesce(LineTotal, 0)) as Total from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
group by firstname, lastname
order by total desc
will give you client names in order of who paid the most for a single order line ( not for a single item, as LineTotal could be 8 million units of one cheap product ).
Other functions include:
- AVG: calculates the average
- MIN: gets the smallest value
- MAX: gets the biggest value
- SUM: adds all the values
- COUNT: tells you how many rows were present
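For instance, here’s a sketch combining COUNT and AVG with the same joins as before, to see how many order lines each customer has and their average line value:
select LastName, FirstName, count(sod.SalesOrderDetailID) as Lines, avg(coalesce(LineTotal, 0)) as AverageLine from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
group by firstname, lastname
order by Lines desc
Counting the SalesOrderDetailID column ( rather than count(*) ) means customers with no sales at all show 0 lines instead of 1.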
HAVING
Try to run this:
select LastName, FirstName, max(coalesce(LineTotal, 0)) as Total from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
where Total > 500
group by firstname, lastname
order by total desc
You will get this error:
Invalid column name 'Total'.
Because of the order in which SQL Server logically processes your request ( FROM, then WHERE, then GROUP BY, then HAVING, then SELECT, then ORDER BY ), the column alias Total does not exist yet when the WHERE clause is evaluated. We need to repeat the expression that creates the value, unless we want to use a subquery.
select LastName, FirstName, max(coalesce(LineTotal, 0)) as Total from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
where max(coalesce(LineTotal, 0)) > 500
group by firstname, lastname
order by total desc
This gives a new error:
An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.
How do we fix this ? The error message gives us a clue. What is a ‘having’ clause ?
select LastName, FirstName, max(coalesce(LineTotal, 0)) as Total from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
group by firstname, lastname
having max(coalesce(LineTotal, 0)) > 500
order by total desc
Note the change in order: a WHERE clause has to come before the GROUP BY, and HAVING has to come after it. HAVING is used to specify requirements for a group. If you filter on a column you grouped by, HAVING works much the same as WHERE, but you must use HAVING to filter on an aggregate. This is because HAVING runs after the GROUP BY, once the groups exist, so it can filter on the result of aggregate functions ( it still runs before the SELECT list is evaluated, which is why we repeat the aggregate expression instead of writing Total ); WHERE runs before the grouping and only sees individual rows.
You don’t need to make the statement identical, you can do this:
select LastName, FirstName, max(coalesce(LineTotal, 0)) as Total from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
group by firstname, lastname
having max(LineTotal) > 500
order by total desc
Or even this:
select LastName, FirstName, max(coalesce(LineTotal, 0)) as Total from sales.salesorderdetail sod
inner join sales.salesorderheader soh on soh.salesorderid = sod.salesorderid
inner join sales.customer sc on sc.customerid = soh.customerid
right join person.person p on sc.personID = p.BusinessEntityID
group by firstname, lastname
having sum(LineTotal) > 20000
order by total desc
The HAVING clause just needs to contain an aggregate ( or a column you grouped by ); it does not have to match the expression in your select list.
ISNULL and COALESCE
It’s probably a bit late, given that I’ve been using them throughout, but both of these can be used to replace NULL with a default value. However, they are not the same; there are a number of differences. The most important is that COALESCE can take any number of arguments. This is because COALESCE is syntactic sugar for a CASE expression, which could go on forever.
Select coalesce(val1, val2, val3, 0)
Is the same as
Select case
When val1 is not null then val1
When val2 is not null then val2
When val3 is not null then val3
Else 0
End
ISNULL is a separate built-in function in SQL Server, and it only takes two arguments, which probably explains why COALESCE tends to be slightly slower than ISNULL.
If you are using Entity Framework, this matters. SQL Server considers the result of COALESCE to be nullable even when you supply a non-null default, so a column computed with COALESCE will surface as a nullable EF property ( an int? rather than an int ). ISNULL with a non-nullable replacement is treated as NOT NULL, so using ISNULL, or explicitly casting the return value, avoids this. The other issue with EF is that if you return 1 or 0, EF will decide it's an INT, not a BIT; you need to cast it to BIT to get a bool property in EF.
ISNULL uses the first parameter to determine the type of the return value, and COALESCE uses the same rules as CASE statements ( because it is one ).
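A small sketch of the difference, using a made-up variable and strings purely for illustration:
declare @short varchar(2) = null
select isnull(@short, 'longer string')   -- returns 'lo': the result takes @short's type, varchar(2), so the default is truncated
select coalesce(@short, 'longer string') -- returns 'longer string': COALESCE uses data type precedence, like CASE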
COALESCE is standard SQL, and ISNULL is proprietary T-SQL.
There is a third construct, NULLIF. It returns NULL if two values are the same. It returns the first expression, if they are not equal. Obviously, it has the type of the first expression as a result.
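A common use of NULLIF ( just a sketch ) is avoiding divide-by-zero errors, because dividing by NULL quietly returns NULL instead of raising an error:
select OrderQty, LineTotal, LineTotal / nullif(OrderQty, 0) as PricePerUnit from sales.salesorderdetail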
CASE statements
I’ve used these already, too, but CASE allows you to inject procedural-style logic into your set-based database code.
For example, here’s SQL to show quarterly sales totals for the entire AdventureWorks database:
select sum(linetotal),
case
when month(modifieddate) < 4 then 1
when month(modifieddate) < 7 then 2
when month(modifieddate) < 10 then 3
else 4
end as quarter,
year(modifieddate) as year
from sales.salesorderdetail
group by
year(modifieddate),
case
when month(modifieddate) < 4 then 1
when month(modifieddate) < 7 then 2
when month(modifieddate) < 10 then 3
else 4
end
order by year, quarter
Note that I am relying on the CASE being evaluated in the order given. The documented behaviour is that the WHEN conditions are checked in order and the first one that is true wins, so months 1 to 3 are caught by the first branch and never fall through to the second. You can always make each branch fully explicit if this makes you nervous; I probably would, in production. SQL is different to procedural languages: you’re telling the database WHAT you want, not HOW you want it done, and I find it best to always be as explicit as possible.
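If I did make it explicit, each branch would check both ends of its range, something like this sketch:
select sum(linetotal) as total,
case
when month(modifieddate) between 1 and 3 then 1
when month(modifieddate) between 4 and 6 then 2
when month(modifieddate) between 7 and 9 then 3
else 4
end as quarter,
year(modifieddate) as year
from sales.salesorderdetail
group by
year(modifieddate),
case
when month(modifieddate) between 1 and 3 then 1
when month(modifieddate) between 4 and 6 then 2
when month(modifieddate) between 7 and 9 then 3
else 4
end
order by year, quarter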
Optimising your queries
One thing that I think often gets lost by people using SQL is that while you can do a LOT of things, such as searching for substrings with LIKE, or using functions like YEAR to pull the year out of a date, wrapping a column in a function inside your WHERE clause stops SQL Server using an index on that column, which will slow your query down. So instead of
select * from tbl where YEAR(birthday) = 1969
a query which leaves the column untouched will be faster, such as:
select * from tbl where birthday between '19690101' and '19691231'
This specific form only works if 'birthday' is a date and not a datetime, read my article on working with dates for more detail on working well with date and time columns.
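A form that works for both date and datetime columns ( again using the hypothetical tbl and birthday from above ) is the half-open range:
select * from tbl where birthday >= '19690101' and birthday < '19700101'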
Conclusion
I assume that anyone reading this has written a select statement before, but hopefully I’ve opened a few extra doors for some people at least. SQL is incredibly powerful and expressive, once you get your head around how it works. The best way to learn is to take my statements, and alter them and see what results you get. That’s why AdventureWorks is great, there’s a ton of data there for you to attempt other tasks on and see if you can work out how to do them.
As I said, I’ll start soon on an article on CTEs, which includes the ability to create recursive queries in SQL Server. It’s very cool, and I doubt it will go live on Sunday as I expect it will take quite a while for me to cover all my bases, but it will be live as soon as I can get it done. I hope you’re having fun, I sure am.