Introduction
During development of ETL processes, I encountered the following situation more than once: records are lost in JOIN
s where they shouldn't. Let's say we join 2 tables, both have records with the same text value (for example, product serial number), but JOIN
returns nothing. It usually takes a lot of time and effort to investigate and find out that values only look similar, but in fact they are different: Cyrillic symbols are used instead of Latin, because someone didn't bother to switch keyboard layout when manually populating the data.
In the following example, 2 string
s are not equal: the first has Latin letters, and the second has Cyrillic.
declare @SerialNumber1 nvarchar(100), @SerialNumber2 nvarchar(100);
select @SerialNumber1 = N'ABC123', @SerialNumber2 = N'???123';
select
case when @SerialNumber1 = @SerialNumber2
then 1 else 0 end as SimpleComparison;
This issue can be resolved using Master Data Services, but this tool is not always available. And besides, in this case, it's an overkill. There is a much simpler solution to this problem.
Background
There are a lot of Cyrillic symbols which looks more or less similar to Latin symbols: A, B, C, E, O etc.. Please look into the list of Unicode characters for full reference.
Cyrillic 2 Latin Auto Convertor
In order to automatically convert Cyrillic letters to corresponding Latin during the data load in ETL, I created an inline T-SQL table-valued function. I decided to use inline function because it usually shows better performance than regular table-valued or scalar-valued functions.
The function creates an in-memory table of similarly looking letters (there are only 95 rows in it, so it won't use much memory). Then it uses recursive CTE to apply a REPLACE
operation to the target string
, for each pair of similarly looking letters.
A full source code of this procedure (with all 95 rows of similarly looking letters) is saved as an attachment for this article.
create function dbo.Cyrillic2Latin(@targetString nvarchar(4000))
RETURNS table
AS
RETURN (
WITH VisuallySimilarLetters as
(select * from (VALUES
(0x0400,'Cyrillic Capital Letter Ie with grave',0x0045, 'Latin Capital letter E'),
(0x0401,'Cyrillic Capital Letter Io',0x0045, 'Latin Capital letter E'),
...
(0x04FC,'Cyrillic Capital Letter Ha with hook',0x0058, 'Latin Capital letter X'),
(0x04FD,'Cyrillic Small Letter Ha with hook',0x0078, 'Latin Small Letter X')
) AS VisuallySimilarLetters (
CyrillicCode,
CyrillicName,
LatinCode,
LatinName)),
VisuallySimilarLettersNumbered as
(select *, row_number() over (order by CyrillicCode) as n
from VisuallySimilarLetters),
baseCte as
(select replace(@targetString, NCHAR(CyrillicCode), NCHAR(LatinCode)) as TargetString, _
1 as RowProcessed
from VisuallySimilarLettersNumbered
where n = 1
union all
select replace(bc.TargetString, NCHAR(CyrillicCode), NCHAR(LatinCode)) as TargetString, _
bc.RowProcessed + 1
from baseCte bc
inner join VisuallySimilarLettersNumbered vsln
on bc.RowProcessed + 1 = vsln.n)
SELECT TargetString
FROM baseCte
WHERE
RowProcessed = (SELECT max(RowProcessed) FROM baseCte)
)
Let's come back to the example above, and compare serial numbers once again using this function.
declare @SerialNumber1 nvarchar(100), @SerialNumber2 nvarchar(100);
select @SerialNumber1 = N'ABC123', @SerialNumber2 = N'???123';
SELECT
CASE WHEN cl1.TargetString = cl2.TargetString
THEN 1 ELSE 0 END AS SmartComparison
FROM
dbo.Cyrillic2Latin(@SerialNumber1) cl1
CROSS APPLY dbo.Cyrillic2Latin(@SerialNumber2) cl2
Using the Auto Convertor
I use this function during the ETL process, when the data is already loaded into landing table, but before it is joined with something else.
The code to update the landing table looks like this:
UPDATE lt
SET
SerialNumber = cl.TargetString
FROM
LANDING.MerchandiseTable lt
cross apply dbo.Cyrillic2Latin(lt.SerialNumber) cl
In my case, landing tables are relatively small (less than a million records), and the update operation executes almost instantly.
It allowed to significantly reduce the quantity of errors in ETL, we were able to find a lot of records that were previously lost because of mix of different languages.
Conclusion
Mix of Cyrillic and Latin letters is a common ETL problem in regions where Cyrillic alphabets are broadly used. This article describes how to automatically resolve this problem, or at least substantially reduce its magnitude.