Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / SQL

Forename-Title Profiling

1.00/5 (1 vote)
9 Mar 2010CPOL1 min read 7.2K  
Introduction...

Introduction


This script shows how data in the table can be cleansed/enhanced using other records in the same table. It is essential that you have enough records in the database to get correct results.
In this example, we will see how to fix values in Titles field of table that contains contact details basing on popular Forename-Title matches.

Using the Code


With CREATE TABLE statement, we find popular Forename-Gender combinations in our table and store them in Forename_Gender table.
SQL
-------------------------------------------------------------
-- Create table with all forename-gender combinations      --
-------------------------------------------------------------
DROP TABLE Forename_Gender;
CREATE TABLE Forename_Gender
AS
SELECT 
   Forenames, 
   Decode(Title,'Mr', 'M', 'Sir', 'M', 'Master', 'M', 'Miss', 'F',
           'Mrs', 'F', 'Ms', 'F','Mistress','F', 'U') AS Gender, 
   COUNT(*) AS Amount
FROM Contacts
WHERE 
   NOT Forenames LIKE '% %'
   AND LENGTH(Forenames)>2
GROUP BY 
   Forenames,
   Decode(Title,'Mr', 'M', 'Sir', 'M', 'Master', 'M', 'Miss', 'F',
          'Mrs', 'F', 'Ms', 'F', 'Mistress', 'F', 'U')
/
-- Delete matches where gender is Unknown
DELETE FROM Forename_Gender WHERE Gender='U'
/

Then we need to remove records where the same forename can be given to both boy and girl. I have used 1% threshold, i.e. if both male and female genders have more than 1% of records (amount) in the forename group, then this Forename-Gender combination cannot be used and will be deleted from Forename_Gender table.
SQL
-------------------------------------------------------------
--	Delete records where forename can be both 
--		male and female
-------------------------------------------------------------
DELETE FROM forename_gender f
WHERE EXISTS
(SELECT * FROM forename_gender WHERE
	Forenames=f.Forenames 
	AND Gender<>f.Gender
	AND AMOUNT*100>f.amount -- there is more than 1%
	-- of people with different gender and same forename.
	-- NOTE: you must change 100 to other number to use 
	-- different threshold. I.e. 50 will mean that 2% of records
	-- can have other Gender without being taken into account.
)
/

Then I delete unpopular forenames - less than 5 records in the table.
SQL
-------------------------------------------------------------
--	Need more than 5 records having same Gender
-------------------------------------------------------------
DELETE FROM forename_gender
WHERE Amount<5
/

Check what will be the result of update:
SQL
-- This statement returns results that show which records will be 
-- affected by the update.
SELECT 
   C.Forenames, -- Forename of the contact 
   Count(C.Forenames) As Number_To_Update, -- Records will be updated
   DECODE(Gender,'M','Mr','F','Ms','Mr/Ms') as New_Title, -- New Title
   AVG(FG.Amount) AS Number_In_DB -- Number of records in DB that
                                  -- support this match
FROM Contacts C Inner Join
    Forename_Gender FG ON C.Forenames=FG.Forenames
WHERE
    UPPER(NVL(C.Title,'MR/MS'))='MR/MS'
GROUP BY C.Forenames,DECODE(Gender, 'M', 'Mr', 'F', 'Ms','Mr/Ms')
ORDER BY AVG(FG.Amount) DESC
/

And finally we can update Title field with values from Forename_Gender.
SQL
-------------------------------------------------------------
--	UPDATE Contacts.Title fields where it is not set
--	or set to Mr/Ms
-------------------------------------------------------------
-- !!!  This code is going to change data in your table.
-- !!!  Test this code untill you are confident in results.
UPDATE Contacts C
SET 
	Title = (SELECT DECODE(Gender, 'M', 'Mr', 'F', 'Ms','Mr/Ms')
		 FROM Forename_Gender 
		 WHERE C.Forenames=Forenames)
-- SELECT COUNT(*) FROM Contacts C
WHERE 
	UPPER(NVL(C.Title,'MR/MS'))='MR/MS'
	AND EXISTS (SELECT * 
			FROM Forename_Gender 
			WHERE C.Forenames=Forenames)
/


Do More


I have done a very similar exercise on addresses table to fix Towns and Counties (UK customer base) using popular Postcode area - town combinations.

Also this is a way of profiling your records, so you can use match tables (i.e. Forname_Gender, Postcode_Town) to find erroneous records in your tables.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)