Fast Typo-Insensitive String Matching in a Program and a Database

David Wincelberg

4.73/5 (11 votes)

12 Jul 2021CPOL7 min read

10.2K

292

Strings that differ by typos can be matched.

Typos can occur in search terms, databases, or other searched text. The functions presented below can handle matching words that differ by typos that are limited so that matched words look like one may have resulted from an attempt to type the other one. These functions are written in C++ and SQL. They are then modified to allow for searching for a word in a larger portion of text. The time complexities of these functions are addressed.

Introduction

When I was in school, I mistyped a search term when using a library search engine and a record was listed. I then noticed the typo and typed the search term properly. This record was not listed. I told a librarian and the record was fixed. This started me thinking about typos.

When I visited a relative in a hospital, I heard a beep coming from a computer. I glanced at the screen and noticed that my last name was mistyped. I told the nurse the correct spelling. These incidents show that a typo could be in a database or in a search term.

A typo or typographical error results from four types of errors: insertion, deletion, substitution or transposition. Since any word can be changed to any other word by a series of typos, it is necessary to limit the changes so that two words that differ by one or more typos look like one word may have resulted from an attempt to type the other one. In the software presented below, this is handled by enforcing a minimum separation between typos.

Matching in a Program

The Visual Studio solution TypoMatching.sln contains three projects. This is done so that unit testing works in C++. The demo program is in TypoMatching, the typo-matching logic is in TypoMatcherLib, and the unit testing is in TypoMatchingUnitTests.

The next thing to consider is that there are three string types that are used with C++. They are C++ strings, C strings, and MFC strings. C++ strings can be handled by a function that accepts std::string references. Both C strings and MFC strings can be handled by a function that accepts constant string pointers.

Another issue is whether Unicode is selected. This is handled by the primary matching function accepting constant string pointers that adapt based on whether UNICODE is defined and by defining a type for C++ strings that adapts based on this setting. Once that is taken care of, the result is a secondary function that accepts C++ string references and calls the primary matching function.

In the listing below, after some preliminaries, the primary matching function steps through a pair of strings. Each pair of characters is checked for equality by converting to uppercase before comparing. Converting to lowercase or taking into account cultural or locale differences would also work. If a pair of characters does not match in this manner, the first step is to determine if there is a previous typo that is too close to the current one. If not, typos are looked for in a certain order.

This order is transposition, insertion, deletion, and substitution. Suppose a typo is a transposition error. It could also look like an insertion followed by a deletion, a deletion followed by an insertion, or two substitution typos. Since a minimum typo separation is enforced to ensure that words that differ by typos look similar, it is important to select the type of typo that best matches. The above order handles that.

If the ends of the strings are not reached at the same time, there is a check for a final insertion or deletion typo. The parameters to the matching function allow for retrieving the number of typos and changing the default minimum separation to a value other than two.

C++

bool TypoMatcher(LPCTSTR pszWord1, LPCTSTR pszWord2, int* pnTypos /* = nullptr */,
	vector<pair<ETypos, int>>* vTypos /* = nullptr */, int nMinSeparation /* = 2 */)
{
	if (pnTypos != nullptr) {
		*pnTypos = 0;
	}
	
	if (pszWord1 == nullptr || pszWord2 == nullptr) {
		return false;
	}

	int nTypos = 0;
	LPCTSTR pszWord1Start = pszWord1, pLastGoodPos1 = nullptr;
	TCHAR c1Upper = 0, c2Upper = 0, c1NextUpper = 0, c2NextUpper = 0;

	for (;  *pszWord1 != _T('\0') && *pszWord2 != _T('\0');  ++pszWord1, ++pszWord2) {
		c1Upper = ::_totupper(*pszWord1);  // generic for toupper
		c2Upper = ::_totupper(*pszWord2);

		if (c1Upper != c2Upper) {
			if (pLastGoodPos1 != nullptr && pszWord1 - pLastGoodPos1 < nMinSeparation) {
				return false;
			}

			c1NextUpper = ::_totupper(*(pszWord1 + 1));
			c2NextUpper = ::_totupper(*(pszWord2 + 1));

			if (c1Upper == c2NextUpper && c1NextUpper == c2Upper) {		// transposition
				AddDetails(vTypos, ETypos::Transposition, pszWord1 - pszWord1Start);
				pLastGoodPos1 = pszWord1 + 2;
				++pszWord1;  ++pszWord2;
			}
			else if (c1Upper == c2NextUpper) {							// insertion
				AddDetails(vTypos, ETypos::Insertion, pszWord1 - pszWord1Start);
				pLastGoodPos1 = pszWord1;
				++pszWord2;
			}
			else if (c1NextUpper == c2Upper) {							// deletion
				AddDetails(vTypos, ETypos::Deletion, pszWord1 - pszWord1Start);
				pLastGoodPos1 = pszWord1 + 1;
				++pszWord1;
			}
			else {														// substitution
				AddDetails(vTypos, ETypos::Substitution, pszWord1 - pszWord1Start);
				pLastGoodPos1 = pszWord1 + 1;
			}

			++nTypos;
		}
	}

	// Handles an insertion or a deletion typo at the end of a string.
	if (*pszWord1 != _T('\0') || *pszWord2 != _T('\0')) {
		if (pLastGoodPos1 != nullptr && pszWord1 - pLastGoodPos1 < nMinSeparation) {
			return false;
		}

		if (*pszWord1 == _T('\0') && *(pszWord2 + 1) == _T('\0')) {			// insertion
			AddDetails(vTypos, ETypos::Insertion, pszWord1 - pszWord1Start);
			++nTypos;
		}
		else if (*(pszWord1 + 1) == _T('\0') && *pszWord2 == _T('\0')) {	// deletion
			AddDetails(vTypos, ETypos::Deletion, pszWord1 - pszWord1Start);
			++nTypos;
		}
		else {
			return false;
		}
	}

	if (pnTypos != nullptr) {
		*pnTypos = nTypos;
	}

	return true;
}	// TypoMatcher

The demo program Typo Matching allows for testing typo matching with user input. A user may specify the minimum separation between typos. The displayed output is whether the strings match, how many typos if they match, and the types and locations of these typos. Note that positions start at one for this display. This program also includes automated testing.

Typo Matching

Matching in a Database

In order for typo-insensitive matching to work in a database, it is necessary to write a stored function. A stored function differs from a stored procedure in that it must return a value and is called by using a SELECT statement instead of a CALL statement. The C++ function TypoMatcher() returns a boolean result and allows for retrieving the number of typos when the return value is true. In a stored function, there is only one output. That works here since a return value of -1 is used to signal that the words don’t match.

The stored function TypoMatcher() in the listing below is written in the MySQL version of SQL and was tested using MySQL Workbench. The loop for stepping through the words differs from the C++ version in that character positions are used instead of pointers. Note that character positions start at one instead of zero in SQL. In:

SQL

DECLARE nWord1Length INT DEFAULT CHAR_LENGTH(TypoWord1);

CHAR_LENGTH() is used instead of LENGTH() in computing word lengths since it handles multibyte characters. For accessing characters. The C++ line:

C++

c1Upper = ::_totupper(*pszWord1);

becomes:

SQL

SET c1Upper = UPPER(SUBSTR(TypoWord1, nWord1Pos, 1));

SUBSTR() is also multibyte safe. Otherwise, the conversion is straightforward.

The listing below contains TypoMatcher() and SQL statements for testing it. The words for testing are the same as those used for automated testing of the C++ version of TypoMatcher(). The testing is done through the following SQL statement.

SQL

SELECT FindWord, TypoWord, TypoMatcher(FindWord, TypoWord, 2) AS Typos
FROM FindWords, TypoWords HAVING Typos > -1
ORDER BY FindWord, Typos, TypoWord;

The third parameter of TypoMatcher() is the required minimum separation between typos.

SQL

DROP DATABASE IF EXISTS TypoMatching;
CREATE DATABASE TypoMatching;
USE TypoMatching;

CREATE TABLE TypoWords (
	TypoWord VARCHAR(20) NOT NULL
);

INSERT INTO TypoWords VALUES
('matches'),
('Matches'),
('red'),
('care'),
('raed'),
('rads'),
('trenfer'),
('transfrs'),
('spearetion'),
('seepatation'),
('spatation'),
('sdpatation'),
('catch'),
('ingoratjon'),
('lue'),
('xlue'),
('spills'),
('flachljght'),
('flachliht');

CREATE TABLE FindWords (
	FindWord VARCHAR(20) NOT NULL
);

INSERT INTO FindWords VALUES
('matches'),
('read'),
('core'),
('transfer'),
('separation'),
('cat'),
('information'),
('blue'),
('spill'),
('flashlight');

DELIMITER $$
CREATE FUNCTION TypoMatcher(TypoWord1 VARCHAR(20), TypoWord2 VARCHAR(20), nMinSeparation INT)
RETURNS INT DETERMINISTIC
BEGIN

	DECLARE nTypos INT DEFAULT 0;
	DECLARE nLastGoodPos1 INT DEFAULT -10;
    DECLARE nWord1Pos INT DEFAULT 1;
    DECLARE nWord2Pos INT DEFAULT 1;
	DECLARE nWord1Length INT DEFAULT CHAR_LENGTH(TypoWord1);
	DECLARE nWord2Length INT DEFAULT CHAR_LENGTH(TypoWord2);
    DECLARE c1Upper CHAR;
	DECLARE c2Upper CHAR;
	DECLARE c1NextUpper CHAR;
	DECLARE c2NextUpper CHAR;

	WHILE nWord1Pos <= nWord1Length AND nWord2Pos <= nWord2Length
    DO
		SET c1Upper = UPPER(SUBSTR(TypoWord1, nWord1Pos, 1));
		SET c2Upper = UPPER(SUBSTR(TypoWord2, nWord2Pos, 1));
    
		IF c1Upper != c2Upper THEN
			IF nWord1Pos - nLastGoodPos1 < nMinSeparation THEN
				RETURN -1;
			END IF;
            
            SET c1NextUpper = UPPER(SUBSTR(TypoWord1, nWord1Pos + 1, 1));
			SET c2NextUpper = UPPER(SUBSTR(TypoWord2, nWord2Pos + 1, 1));
            
            IF c1Upper = c2NextUpper AND c1NextUpper = c2Upper THEN		-- transposition
				SET nLastGoodPos1 = nWord1Pos + 2;
				SET nWord1Pos = nWord1Pos + 1;
				SET nWord2Pos = nWord2Pos + 1;
			ELSEIF c1Upper = c2NextUpper THEN							-- insertion
				SET nLastGoodPos1 = nWord1Pos;
				SET nWord2Pos = nWord2Pos + 1;
			ELSEIF c1NextUpper = c2Upper THEN							-- deletion
				SET nLastGoodPos1 = nWord1Pos + 1;
				SET nWord1Pos = nWord1Pos + 1;
			ELSE														-- substitution
                SET nLastGoodPos1 = nWord1Pos + 1;
            END IF;
            
			SET nTypos = nTypos + 1;
        END IF;

		SET nWord1Pos = nWord1Pos + 1;
        SET nWord2Pos = nWord2Pos + 1;
	END WHILE;
    
	-- Handles an insertion or a deletion typo at the end of a string.
	IF nWord1Pos <= nWord1Length OR nWord2Pos <= nWord2Length THEN
		IF nWord1Pos - nLastGoodPos1 < nMinSeparation THEN
			RETURN -1;
		END IF;

		if nWord1Pos = nWord1Length AND nWord2Pos = nWord2Length + 1 THEN		-- insertion
			SET nTypos = nTypos + 1;
		ELSEIF nWord1Pos = nWord1Length + 1 AND nWord2Pos = nWord2Length THEN	-- deletion
			SET nTypos = nTypos + 1;
		ELSE
			RETURN -1;
		END IF;
	END IF;

	RETURN nTypos;

END $$
DELIMITER ;

SELECT FindWord, TypoWord, TypoMatcher(FindWord, TypoWord, 2) AS Typos
FROM FindWords, TypoWords HAVING Typos > -1
ORDER BY FindWord, Typos, TypoWord;

The results are in the following table:

SQL table for TypoMatcher

Note that for the FindWord of "read", the number of typos is either one or two and that the results are sorted by number of typos for each FindWord. When used in a database search, the results should be sorted by number of typos so that exact matches precede those with one typo, and those precede matches with more typos.

Approximate String Matching

Approximate string matching differs from typo-insensitive string matching in that the former searches for a match in a portion of the text to be searched while the latter matches the entire text. From the examples in the introduction, the latter is more useful for matching search terms to words in a database.

A Wikipedia article on approximate string matching provides an overview of various algorithms [1]. Most of these algorithms only consider insertion, deletion, and substitution typos. Some also handle transposition typos. This article notes that available algorithms are too slow to be used with a database.

In reference 2, there is a detailed treatment of an algorithm for approximate string matching. It involves calculating a difference array.

TypoMatcher() can be modified to match a portion of a larger string. When used in this way, “cat” would match part of “catch”. When words are compared, these words would not match. To do so, the test after the loop would be modified so that it isn’t necessary for the second string to end. The changes are as follows:

C++

if (*pszWord1 != _T('\0')) {
    if (pLastGoodPos1 != nullptr && pszWord1 - pLastGoodPos1 < nMinSeparation) {
        return false;
    }

    if (*(pszWord1 + 1) == _T('\0')) {  // deletion
        AddDetails(vTypos, ETypos::Deletion, pszWord1 - pszWord1Start);
        ++nTypos;
    }
    else {
        return false;
    }
}

This function would be called in a loop that would advance the second string one character at time when it is found that there isn’t a match and report the location if there is a match. For the SQL version, this loop must be part of the stored function. It would return the starting point for the match but would not return the number of typos. See the following listing:

SQL

DELIMITER $$
CREATE FUNCTION TypoMatcherString(TypoWord1 VARCHAR(20), _
                TypoString2 VARCHAR(255), nMinSeparation INT)
RETURNS INT DETERMINISTIC
BEGIN

	DECLARE nTypos INT;
	DECLARE nLastGoodPos1 INT DEFAULT -10;
	DECLARE nWord1Pos INT;
	DECLARE nString2Start INT DEFAULT 1;
	DECLARE nString2Pos INT;
	DECLARE nWord1Length INT DEFAULT CHAR_LENGTH(TypoWord1);
	DECLARE nString2Length INT DEFAULT CHAR_LENGTH(TypoString2);
	DECLARE c1Upper CHAR;
	DECLARE c2Upper CHAR;
	DECLARE c1NextUpper CHAR;
	DECLARE c2NextUpper CHAR;

	outer_while: WHILE nString2Start <= nString2Length DO
		SET nWord1Pos = 1;
		SET nString2Pos = nString2Start;
		SET nTypos = 0;

		inner_while: WHILE nWord1Pos <= nWord1Length AND nString2Pos <= nString2Length DO
			SET c1Upper = UPPER(SUBSTR(TypoWord1, nWord1Pos, 1));
			SET c2Upper = UPPER(SUBSTR(TypoString2, nString2Pos, 1));
		
			IF c1Upper != c2Upper THEN
				IF nWord1Pos - nLastGoodPos1 < nMinSeparation THEN
					SET nTypos = -1;
					LEAVE inner_while;
				END IF;
				
				SET c1NextUpper = UPPER(SUBSTR(TypoWord1, nWord1Pos + 1, 1));
				SET c2NextUpper = UPPER(SUBSTR(TypoString2, nString2Pos + 1, 1));
				
				IF c1Upper = c2NextUpper AND c1NextUpper = c2Upper THEN		-- transposition
					SET nLastGoodPos1 = nWord1Pos + 2;
					SET nWord1Pos = nWord1Pos + 1;
					SET nString2Pos = nString2Pos + 1;
				ELSEIF c1Upper = c2NextUpper THEN							-- insertion
					SET nLastGoodPos1 = nWord1Pos;
					SET nString2Pos = nString2Pos + 1;
				ELSEIF c1NextUpper = c2Upper THEN							-- deletion
					SET nLastGoodPos1 = nWord1Pos + 1;
					SET nWord1Pos = nWord1Pos + 1;
				ELSE														-- substitution
					SET nLastGoodPos1 = nWord1Pos + 1;
				END IF;
				
				SET nTypos = nTypos + 1;
			END IF;

			SET nWord1Pos = nWord1Pos + 1;
			SET nString2Pos = nString2Pos + 1;
		END WHILE;
		
		-- Handles a deletion typo at the end of a string.
		IF nWord1Pos <= nWord1Length THEN
			IF nWord1Pos - nLastGoodPos1 < nMinSeparation THEN
				SET nTypos = -1;
			ELSEIF nWord1Pos = nWord1Length + 1 THEN		-- deletion
				SET nTypos = nTypos + 1;
			ELSE
				SET nTypos = -1;
			END IF;
		END IF;

		IF nTypos = -1 THEN
			SET nString2Start = nString2Start + 1;
		ELSE
			LEAVE outer_while;
		END IF;
	END WHILE;
    
    IF  nTypos = -1 THEN
		RETURN -1;
	ELSE
		RETURN nString2Start;
	END IF;

END $$
DELIMITER ;

The results are in the following table:

SQL table for TypoMatcherString

Note that "cat" now matches several items while it previously did not match any ones. It matches "do catch" with the word start at the fourth position. Also note that "transfer" matches "transfrs" when previously it did not. The reason is that, when matching as words, two typos are too close together while, when matching as a word in a string, extra characters are ignored.

Time Complexities

If a pattern string has m characters and a text string has n characters, approximate string matching has time complexity of O(m * n) [1]. This applies in all circumstances since an array is calculated to perform the matching.

The typo-insensitive string matching algorithm presented above has a time complexity of O(min(m, n)) when comparing words since the loop ends when the end of one string is reached or two typos are found to be too close together. This should be fast enough to allow for database searching.

When used to find a portion of a larger string, the time complexity has a worst case of O(m * n). However, since it would stop once two typos are found to be too close together, the average time complexity is lower.

References

[1] “Approximate string matching” (2021, Jan. 18). Wikipedia
{2] Binstock, A., & Rex, J. (1995). Practical Algorithms For Programmers, Addison-Wesley Publishing Company. pp. 148-156. ISBN 0-201-63208-X

History

7^th July, 2021: Initial version
9^th July, 2021: Uploaded demo program

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)