Replace Cyrillic Letters with Visually Similar Latin Letters Using T-SQL

Pavel Sinkevich

5.00/5 (3 votes)

28 Oct 2016CPOL2 min read

23.5K

129

If data source is populated manually, there is a typical problem: a mix of visually similar Cyrillic and Latin letters. This tip describes how to fix it.

Download source code - 1.5 KB

Introduction

During development of ETL processes, I encountered the following situation more than once: records are lost in JOINs where they shouldn't. Let's say we join 2 tables, both have records with the same text value (for example, product serial number), but JOIN returns nothing. It usually takes a lot of time and effort to investigate and find out that values only look similar, but in fact they are different: Cyrillic symbols are used instead of Latin, because someone didn't bother to switch keyboard layout when manually populating the data.

In the following example, 2 strings are not equal: the first has Latin letters, and the second has Cyrillic.

SQL

declare @SerialNumber1 nvarchar(100), @SerialNumber2 nvarchar(100);
select @SerialNumber1 = N'ABC123', @SerialNumber2 = N'???123';

select
 case when @SerialNumber1 = @SerialNumber2
 then 1 else 0 end as SimpleComparison;

SimpleComparison
0

This issue can be resolved using Master Data Services, but this tool is not always available. And besides, in this case, it's an overkill. There is a much simpler solution to this problem.

Background

There are a lot of Cyrillic symbols which looks more or less similar to Latin symbols: A, B, C, E, O etc.. Please look into the list of Unicode characters for full reference.

Cyrillic 2 Latin Auto Convertor

In order to automatically convert Cyrillic letters to corresponding Latin during the data load in ETL, I created an inline T-SQL table-valued function. I decided to use inline function because it usually shows better performance than regular table-valued or scalar-valued functions.

The function creates an in-memory table of similarly looking letters (there are only 95 rows in it, so it won't use much memory). Then it uses recursive CTE to apply a REPLACE operation to the target string, for each pair of similarly looking letters.

A full source code of this procedure (with all 95 rows of similarly looking letters) is saved as an attachment for this article.

SQL

create function dbo.Cyrillic2Latin(@targetString nvarchar(4000))
RETURNS table
AS
RETURN (
    WITH VisuallySimilarLetters as
        (select * from (VALUES
        (0x0400,'Cyrillic Capital Letter Ie with grave',0x0045, 'Latin Capital letter E'),
        (0x0401,'Cyrillic Capital Letter Io',0x0045, 'Latin Capital letter E'),
...
        (0x04FC,'Cyrillic Capital Letter Ha with hook',0x0058, 'Latin Capital letter X'),
        (0x04FD,'Cyrillic Small Letter Ha with hook',0x0078, 'Latin Small Letter X')
        ) AS VisuallySimilarLetters (
            CyrillicCode, 
            CyrillicName, 
            LatinCode, 
            LatinName)),
    VisuallySimilarLettersNumbered as
        (select *, row_number() over (order by CyrillicCode) as n
        from VisuallySimilarLetters),
    baseCte as
        (select replace(@targetString, NCHAR(CyrillicCode), NCHAR(LatinCode)) as TargetString, _
         1 as RowProcessed
        from VisuallySimilarLettersNumbered
        where n = 1
        union all
        select replace(bc.TargetString, NCHAR(CyrillicCode), NCHAR(LatinCode)) as TargetString, _
        bc.RowProcessed + 1
        from baseCte bc
        inner join VisuallySimilarLettersNumbered vsln
        on bc.RowProcessed + 1 = vsln.n)

    SELECT TargetString 
    FROM baseCte
    WHERE 
        RowProcessed = (SELECT max(RowProcessed) FROM baseCte)
)

Let's come back to the example above, and compare serial numbers once again using this function.

SQL

declare @SerialNumber1 nvarchar(100), @SerialNumber2 nvarchar(100);
select @SerialNumber1 = N'ABC123', @SerialNumber2 = N'???123';

SELECT
     CASE WHEN cl1.TargetString = cl2.TargetString
     THEN 1 ELSE 0 END AS SmartComparison
FROM
    dbo.Cyrillic2Latin(@SerialNumber1) cl1
    CROSS APPLY dbo.Cyrillic2Latin(@SerialNumber2) cl2

SmartComparison
1

Using the Auto Convertor

I use this function during the ETL process, when the data is already loaded into landing table, but before it is joined with something else.

The code to update the landing table looks like this:

SQL

UPDATE lt
SET 
    SerialNumber = cl.TargetString
FROM 
    LANDING.MerchandiseTable lt
    cross apply dbo.Cyrillic2Latin(lt.SerialNumber) cl

In my case, landing tables are relatively small (less than a million records), and the update operation executes almost instantly.

It allowed to significantly reduce the quantity of errors in ETL, we were able to find a lot of records that were previously lost because of mix of different languages.

Conclusion

Mix of Cyrillic and Latin letters is a common ETL problem in regions where Cyrillic alphabets are broadly used. This article describes how to automatically resolve this problem, or at least substantially reduce its magnitude.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)