Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / SQL

Finding Similar Rows In SQL

0.00/5 (No votes)
21 Jun 2013CPOL3 min read 27.1K  
Finding similar rows in SQL

Introduction

In some ways, this is what SQL should excel at doing, but for some reason it just isn’t supported. There’s always things like Full Text Search, but the overhead of maintaining a full text catalogue, and learning how to use the technology is just a bit too much sometimes and isn’t worth it for what you’re trying to achieve.

Imagine you are running an eCommerce website which sells guitar amplifiers and you need a ‘similar amplifiers’ feature that will suggest amplifiers similar to the amp the user is currently looking at based on a few key features of the amp.

Soundex

Just as a note, I’ve used the DIFFERENCE function in the snippets below. This is just a small library built into SQL Server that allows you to compare how words sound, for example “beech” and “beach” will have a match index of 4 (Best match), while “Road” and “Germany” will return 0 (Worst match). It’s worth looking up.

Simple

If you don’t care about having any knowledge of how ‘good’ of a match the suggestion is, then it’s quite easy. Be careful though, this will be unreliable and spammy but is good for a quick prototype.

SQL
-- First we nee to insert the selected record to match against into a temp
-- table (This is much tidier than using variables for each field!)

INSERT INTO #tmpSelectedAmplifier
SELECT ModelName
	, ManufacturerName
	, Colour
	, AmplifierType
	, Wattage
	, HasOnboardEffects
FROM Amplifier
WHERE AmplifierId = @ampId

SELECT *
FROM Amplifier amps

-- We need to perform an inner join on the record in the temp table
-- with all the fields to compare as conditions in the predicate

INNER JOIN #tmpSelectedAmplifier sAmp
	ON (
		DIFFERENCE(sAmp.ModelName, amps.ModelName) >= 3
		
		OR sAmp.Colour = amps.Colour
		OR sAmp.Wattage = amps.Wattage
		OR sAmp.AmplifierType = amps.AmplifierType
		OR sAmp.HasOnboardEffects = amps.HasOnboardEffects
	)

This query is pretty nice and simple, it will match on any of the conditions specified in the INNER JOIN. More often than not, this really isn’t the way you’d want to do this.

Note that I used a temp table and an inner join instead of using a WHERE clause. The only reason to do this is because it’s cleaner (At least in my opinion) because to use a WHERE you’d have to start using variables for each field in the source record (in this case, the amplifier you’re viewing) which can just get messy and cumbersome pretty quickly.

Getting the Number of Matched Fields

The more common case is that you’ll want to have at least two or more fields to match, and will want to order by the amount of fields that do match.

This is still nice and simple, but it does definitely bloat the SQL up a bit. The general idea is just to create a temp table to hold an amplifier ID. This temp table will be inserted to with (for example) all the amps with the colour we are looking for, and then all the amps with the wattage we need and so on.. And then, once we count how many instances of each AmplifierId appear in the temp table, we will have a rough idea of how good of a match each record is to the one the user is currently viewing.

SQL
-- Create a temp table with NO unique constraint on the ID

DECLARE @ampMatches TABLE (AmplifierId UNIQUEIDENTIFIER)

-- We then add records from the Amplifier table into the temp
-- table based on each field to match against, one at a time
-- this means that if a record matches on two columns it will appear
-- twice in the temp table and we can do a COUNT based on the AmplifierID

INSERT INTO @ampMatches
SELECT AmplifierId
FROM Amplifier
WHERE DIFFERENCE(sAmp.ModelName, amps.ModelName) >= 3

INSERT INTO @ampMatches
SELECT AmplifierId
FROM Amplifier
WHERE sAmp.Colour = amps.Colour

INSERT INTO @ampMatches
SELECT AmplifierId
FROM Amplifier
WHERE sAmp.Wattage = amps.Wattage

INSERT INTO @ampMatches
SELECT AmplifierId
FROM Amplifier
WHERE sAmp.AmplifierType = amps.AmplifierType

INSERT INTO @ampMatches
SELECT AmplifierId
FROM Amplifier
WHERE sAmp.HasOnboardEffects = amps.HasOnboardEffects

;WITH ampsMatchCountCTE AS
(
	SELECT AmplifierId
		, COUNT(AmplifierId) AS MatchCount
	FROM @ampMatches
	GROUP BY AmplifierId
)

SELECT *
FROM Amplifier

-- We can safely do an inner join here because if there is no matching
-- record in the CTE then we know that no matches were made

INNER JOIN ampsMatchCountCTE ampsMatch
	ON ampsMatch.AmplifierId = Amplifier.AmplifierId
	
-- Filtering based on MatchCount of at least two just means we don't just end
-- up giving the user completely unrelated matches

WHERE ampsMatch.MatchCount >= 2
ORDER BY ampsMatch.MatchCount DESC

Just as a note, this query can be written using HAVING instead of the CTE, but you’d end up having to list all your non-aggregated fields in a GROUP BY, which would be a bit of a nightmare..

Order of Precedence of Match Conditions

Just as a quick extension to the previous example, we can add weightings to certain columns so that we provide some extra criteria to either filter or order based on.

SQL
DECLARE @ampMatches TABLE (AmplifierId UNIQUEIDENTIFIER
				, ColumnWeight TINYINT)

INSERT INTO @ampMatches
SELECT AmplifierId
	, 10 -- The column weighting
FROM Amplifier
WHERE DIFFERENCE(sAmp.ModelName, amps.ModelName) >= 3

INSERT INTO @ampMatches
SELECT AmplifierId
	, 5 -- The column weighting
FROM Amplifier
WHERE sAmp.Colour = amps.Colour

-- ... etc ... --

;WITH ampsMatchCountCTE AS
(
	SELECT Amplifier
		, COUNT(AmplifierId) AS MatchCount
		
		-- The weighting is determined by the sum of the weightings of
		-- all matched columns
		
		, SUM(ColumnWeight) AS RowWeighting
	FROM @ampMatches
	GROUP BY AmplifierId
)

SELECT *
FROM Amplifier
INNER JOIN ampsMatchCountCTE ampsMatch
	ON ampsMatch.AmplifierId = Amplifier.AmplifierId
WHERE ampsMatch.MatchCount >= 2
ORDER BY ampsMatch.RowWeighting DESC

Of course, in this example, the weightings are used purely for ordering results, and not for filtering the results set.

Links

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)