Introduction
This script shows how data in the table can be cleansed/enhanced using other records in the same table. It is essential that you have enough records in the database to get correct results.
In this example, we will see how to fix values in
Titles
field of table that contains contact details basing on popular
Forename-Title
matches.
Using the Code
With
CREATE TABLE
statement, we find popular Forename-Gender combinations in our table and store them in
Forename_Gender
table.
DROP TABLE Forename_Gender;
CREATE TABLE Forename_Gender
AS
SELECT
Forenames,
Decode(Title,'Mr', 'M', 'Sir', 'M', 'Master', 'M', 'Miss', 'F',
'Mrs', 'F', 'Ms', 'F','Mistress','F', 'U') AS Gender,
COUNT(*) AS Amount
FROM Contacts
WHERE
NOT Forenames LIKE '% %'
AND LENGTH(Forenames)>2
GROUP BY
Forenames,
Decode(Title,'Mr', 'M', 'Sir', 'M', 'Master', 'M', 'Miss', 'F',
'Mrs', 'F', 'Ms', 'F', 'Mistress', 'F', 'U')
/
DELETE FROM Forename_Gender WHERE Gender='U'
/
Then we need to remove records where the same forename can be given to both boy and girl. I have used 1% threshold, i.e. if both male and female genders have more than 1% of records (amount) in the forename group, then this Forename-Gender combination cannot be used and will be deleted from
Forename_Gender
table.
DELETE FROM forename_gender f
WHERE EXISTS
(SELECT * FROM forename_gender WHERE
Forenames=f.Forenames
AND Gender<>f.Gender
AND AMOUNT*100>f.amount
)
/
Then I delete unpopular forenames - less than 5 records in the table.
DELETE FROM forename_gender
WHERE Amount<5
/
Check what will be the result of update:
SELECT
C.Forenames,
Count(C.Forenames) As Number_To_Update,
DECODE(Gender,'M','Mr','F','Ms','Mr/Ms') as New_Title,
AVG(FG.Amount) AS Number_In_DB
FROM Contacts C Inner Join
Forename_Gender FG ON C.Forenames=FG.Forenames
WHERE
UPPER(NVL(C.Title,'MR/MS'))='MR/MS'
GROUP BY C.Forenames,DECODE(Gender, 'M', 'Mr', 'F', 'Ms','Mr/Ms')
ORDER BY AVG(FG.Amount) DESC
/
And finally we can update
Title
field with values from
Forename_Gender
.
UPDATE Contacts C
SET
Title = (SELECT DECODE(Gender, 'M', 'Mr', 'F', 'Ms','Mr/Ms')
FROM Forename_Gender
WHERE C.Forenames=Forenames)
WHERE
UPPER(NVL(C.Title,'MR/MS'))='MR/MS'
AND EXISTS (SELECT *
FROM Forename_Gender
WHERE C.Forenames=Forenames)
/
Do More
I have done a very similar exercise on addresses table to fix Towns and Counties (UK customer base) using popular Postcode area - town combinations.
Also this is a way of profiling your records, so you can use match tables (i.e.
Forname_Gender
,
Postcode_Town
) to find erroneous records in your tables.