Introduction
I want to take a step back from the sorts of things I’ve been covering to talk about a common issue. A lot of the things I decide to write about, come out of questions I see in the forums. One thing I commonly see people doing, is throwing in the ‘DISTINCT’ keyword to try to solve their issues.
Using DISTINCT
For those who don’t know, here’s how it works. Run this SQL to create a pretty simple table:
create table testDistinct
(
id int
)
insert into testdistinct values (1), (2), (1), (3), (4), (1), (3), (2), (5)
Now run this:
select * from testDistinct
and then run this:
select distinct id from testdistinct
As you can see, it removes duplicate values. If I had a lot of columns, ALL the columns would need to be identical, in order for the row to be culled by DISTINCT.
So, often people find they are getting too many results, and use DISTINCT to cull them down.
A more complex sample
Now run this SQL to create some test data. The data supposes that we are running a mailing server for auto repairers. We have tables for a repairer, a client, a client’s car, and also for mail settings for each repairer. Here is the SQL:
create table autoRepair
(
id int identity,
name varchar(50)
)
create table client
(
id int identity,
repairerId int,
name varchar(50)
)
create table car
(
id int identity,
clientId int,
type varchar(50)
)
create table repairMailSettings
(
repairId int,
replyAddress varchar(100)
)
insert into autoRepair values('CO-OP Toyota'), ('Motors Holden')
insert into client values (1, 'Fred Smith'), (1, 'Joe Kaputnuk'), (2, 'Bill Brown')
insert into car values (1, 'Fred''s car'), (1, 'Fred''s other car'), (2, 'The Joe Mobile'),(3, 'Bill''s Brown Bomb')
insert into repairMailSettings values (1, 'info@co-op.com'), (2, 'help@motors.com'), (1, 'info@co-op.com')
Now run this to get a list of all car owners, and their cars and repairer:
select ar.name as RepairerName, c.Name as ClientName, cr.Type as CarName from autoRepair ar
inner join client c on c.repairerId = ar.id
inner join car cr on cr.clientId = c.id
This returns:
RepairerName ClientName CarName
CO-OP Toyota Fred Smith Fred's car
CO-OP Toyota Fred Smith Fred's other car
CO-OP Toyota Joe Kaputnuk The Joe Mobile
Motors Holden Bill Brown Bill's Brown Bomb
Of course, if you wanted to send an email out to all clients, you’d need the mail settings too, so let’s include those:
select ar.name as RepairerName, c.Name as ClientName, cr.Type as CarName, ms.replyAddress from autoRepair ar
inner join client c on c.repairerId = ar.id
inner join car cr on cr.clientId = c.id
inner join repairMailSettings ms on ms.repairId = ar.id
This returns the following:
RepairerName ClientName CarName replyAddress
CO-OP Toyota Fred Smith Fred's car info@co-op.com
CO-OP Toyota Fred Smith Fred's car info@co-op.com
CO-OP Toyota Fred Smith Fred's other car info@co-op.com
CO-OP Toyota Fred Smith Fred's other car info@co-op.com
CO-OP Toyota Joe Kaputnuk The Joe Mobile info@co-op.com
CO-OP Toyota Joe Kaputnuk The Joe Mobile info@co-op.com
Motors Holden Bill Brown Bill's Brown Bomb help@motors.com
All the CO-OP records have been duplicated !!! So, the easy fix is to add DISTINCT:
select distinct ar.name as RepairerName, c.Name as ClientName, cr.Type as CarName, ms.replyAddress from autoRepair ar
inner join client c on c.repairerId = ar.id
inner join car cr on cr.clientId = c.id
inner join repairMailSettings ms on ms.repairId = ar.id
And we get:
RepairerName ClientName CarName replyAddress
CO-OP Toyota Fred Smith Fred's car info@co-op.com
CO-OP Toyota Fred Smith Fred's other car info@co-op.com
CO-OP Toyota Joe Kaputnuk The Joe Mobile info@co-op.com
Motors Holden Bill Brown Bill's Brown Bomb help@motors.com
Looks great, right ? But have we solved an issue, or just hidden it ?
How joins work
Although tables are sometimes used to form 1 to 1 relationships ( as our repairMailSettings table is, in theory, the idea is to stop one table from becoming huge and full of disparate information ), most of the time, they define one to many relationships. For example, Fred is one person, but he owns two cars. His name appears in the client table once, but when we join clients to cars, because he has two cars, two rows are created, both with the same name, but with different car names.
Know what you’re asking
Sometimes people use DISTINCT because they have failed to ask the right question. For example, someone asked the other day why these two queries gave different results
Select count(a.id) from a inner join b on a.id = b.id
And
Select count(a.id) from a where a.id not in (select id from b)
The reason was, it was perfectly valid for b to have the same id more than once, but this created duplicate rows in the join, and increased the count. A join is always better than a subquery, but, sometimes a join is asking the wrong question.
This is a case where DISTINCT is perfectly valid. We know that the ids in a are unique but appear more than once in b. But we want the count of items in a that also appears in b. This is then how to do that:
Select count(distinct a.id) from a inner join b on a.id = b.id
Understand the result, don’t hide it
So, DISTINCT seemed to solve our car dealer issue, but all it did was fix the one query. A close look ( for those who missed it ) will reveal that there’s a duplicate row in repairMailSettings. This is the issue. Instead of using DISTINCT, we need to add a UNIQUE foreign key to our table, so the same repairerId can’t appear twice in our table. Had we gone with the DISTINCT issue, and had the user changed their settings, it’s possible we’d have ended up having two different sets of settings in our database. Even if they had never done that, every bit of SQL that used this table, would have needed DISTINCT to continue hiding the issue.
Conclusion
So, the moral of the story is, if you find you’re getting duplicate rows, understand why that is, don’t just hide it. DISTINCT is perfectly valid at times, but if you use it, you should first understand why you need it, and be sure there’s no other solution. Sometimes, what you’re doing is hiding an issue that is only going to resurface at a later date.