Introduction
There is no wonder everybody likes Google search engine and its search box suggestions. An intelligent and friendly search-match suggestion makes it a wonderful user experience. It is very rare that we can see a website without a search box. It would be more valuable to discuss something which can improve the search experience.
Pattern matching is a technique which has its own place in the computer era. Pattern matching is our brain's own way and hopefully it helps our websites communicate better to us. Here we discuss a simple pattern matching technique which can be used with search suggest boxes.
Background
If we look into bioinformatics, one can be familiar with a word "sequence alignment". It is nothing but align two genes and compares its similarity. This is true for searching a word which is in turn formed of sequence of letters. In fact, probably bioinformatics adapted the pattern matching from computer algorithms.
Once we align two words in such a way, it can be scored for the matches and penalty for the un-matching places. In that way, we would be able to sort the matching list of words according to the matching score.
For example, let us write 'sugest' which may be match with 'suggest' or 'sugging'. Let us align these words.
sug_est
suggest
The above words aligned is such a way for most matching and it got one gap in the search key. When we align the other pair we have four gaps as follows.
sug____est
sugging
So we could suggest the first pair is more matched. This kind of pattern matching could not be achieved by a simple SQL 'LIKE %sugest%' or a predefined regEx pattern as many sites uses it. It happens often misspelt a letter or missing a letter when typing in the search especially when typing names like movies, products, person etc.,
So let us try to have a improved pattern match search query, discussed further below.
Using the Code
This code is a mySql stored procedure. But the concept is not limited to mySql, but can be extended to any database like SQL Server, Oracle, etc. At the start, the code uses the exact word to search for the matched ones.
SET searchWord=CONCAT('%',' ', searchWord,'%');
This search pattern searches the word that contains the exact chars the user types in, allowing any other words on either side of it. If the search resulted in enough number of records, it won't search further. Next step, it iterates to the length of char
s user type in and inserts % in between the char
s at insert position and queries the database at each iteration. This allows number of char
s at insert position and increases the chance of finding a match if our initial search failed. For example, the search word at each iteration would look like winnin%g, winni%ng, winn%ing.......w%inning. Two things here accounted for that first the insertion starts from the last char
to the first. This is in view of keeping integrity of initial typed char
s. For example, if user types 'wining' then it can be 'win from the begining' (resulted by 'win%ing) but not 'war is begining' (resulted by 'w%ining') . The next thing it won't keep the % inserted in the previous iteration because to much of % results in not related matches. For example, 'h%a%p%p%y' can be anything but may not be 'happy'. At each iteration, it check for enough results obtained and stops further processing.
loop1:WHILE (i<(str_len-3)) DO
IF totalResultCount>resultLimit THEN
LEAVE loop1;
END IF;
SET i=i+1;
SET formattedString=INSERT(searchWord,str_len-i,0,'%');
CALL getResult(formattedString,totalResultCount,resultLimit);
END WHILE loop1;
Also the code scores the result at each iteration and if there are too many char
s inserted in between the match can be rejected using a threshold penaltylevel. This prevents unmatched matches got in the results.
IF score>lowScoreThreshold THEN
INSERT INTO RESULT (word,score) VALUES (tempWord,score);
END IF;
Next the code gets into a second set of iteration where everything happens the same as the first one but it replaces a char
with %
instead of just insert in between. This iteration accounts for a mispelling of a single char
like 'scoreing' instead of 'scoring'. Why this task is performed in a separate iteration because we assume mispelling might not happen and need not give much importance to it. So this iteration happens only when all the above search does not result in enough matches.
SET i=0;
loop2:WHILE (i<(str_len-3)) DO
IF totalResultCount>resultLimit THEN
LEAVE loop2;
END IF;
SET i=i+1;
SET formattedString=INSERT(searchWord,str_len-i,1,'%');
CALL getResult(formattedString,totalResultCount,resultLimit);
END WHILE loop2;
There is a procedure named 'getResult
' called on these iterations where it queries the database.
DECLARE word_list CURSOR FOR SELECT word, scoreWords(sanitizeSearchKeyWord(searchWord),_
word) AS score FROM words WHERE word LIKE searchWord OR word LIKE CONCAT_
(sanitizeSearchKeyWord(searchWord) ,'%');
This query includes other stored functions like scoreWord
which scores the resulted matches. This query has two parts on the where
clause one part has a beginning space and the other one doesn't have. This is to get the match for both 'Are you Happy' (% Happ%) and 'Happy X-Mas' (%Happ%) when the search key is 'Happ'. But there is a problem where 'sin' results in 'basin'. Anyway which is unavoidable while querying, the scoring procedure accounts for it and gives less score. The resulted word and the corresponding score are stored in a result table.
OPEN word_list ;
loop1:LOOP
SET i=i+1;
FETCH word_list INTO tempWord,score;
IF i>resultLimit THEN
LEAVE loop1;
END IF;
IF done=1 THEN
LEAVE loop1;
END IF;
IF score>lowScoreThreshold THEN
INSERT INTO RESULT (word,score) VALUES (tempWord,score);
END IF;
IF duplicate_key<>1 THEN
SET totalCount=totalCount+1;
END IF;
END LOOP loop1;
CLOSE word_list;
Let us discuss about the scoring function. What the user types may be a word, but the match result may be a sentence. It is meaningless to include the whole sentence for computing the matching score. So it needs to extract the matching word to the search key. The difficulty is that the result does not contain the exact search key. So it calls another function 'locateSearchWord
'. Once the word is located, it removes all the '%' from the search key and gets the length of it. Then comparing the length of the search key with the resulted match. The difference in the length is equal to the new char
s present in the resulted match which are accounted as penalty and calculate the score accordingly.
SET resultPart= locateSearchWord(searchWord,resultWord);
SET count1=LENGTH(REPLACE(searchWord,'%',''));
SET count2=LENGTH(resultPart);
SET diffCount=count2-count1;
IF diffCount<0 THEN
SET diffCount=diffCount*-1;
END IF;
IF LEFT(REPLACE(searchWord,'%',''),1)<>LEFT(resultPart,1) THEN
SET penalty==1*count1;
ELSE
SET penalty=diffCount * gapPenalty;
END IF
SET score=count1+penalty;
SET score=(score/count1)*100;
RETURN(score);
The resulted matches with score above the threshold will be sorted in a table and queried for suggesting at the Ajax suggest boxes. This temporary table may be having a session id column, so that the results of multiple users can be handled. This way this little utility can be used with Ajax suggest boxes efficiently.
Proof Of Concept
Attached is a SQL dump with the sample tables and stored procedures. Once setup, the mySql database you may test it. If your table has the word 'winning', then you run call suggest('wining'), still it suggests 'winning' or if your table has the word 'happy' and user types 'happty' by error, still it suggest 'happy'. Try a few more at the mySql prompt:
call suggest('Happi')
call suggest('Happty')
call suggest('Hapy') etc.,
The stored procedures have a definer declaration at the top like 'CREATE DEFINER=`root`@`localhost` PROCEDURE `suggest`'
. Change it according to your user name and host.
Further Ideas
When we look at Google, it has lot of intelligent features in the suggest box. So the searching is not limited to a few logics. It is open to new ideas. Let us look at a few.
A neural network can be formed relating search keys and user chosen suggestion. For example, user types 'New Year' and user chosen 'New Year Greetings' from the suggestion, then we can weigh the search key against the result and when this weight crosses the threshold on the neural node, the search key is stored as a valid one against the match. As this data grows, it can be used for suggestions.
When a user searches for something, that search key can be related to their region and weighted. If more people than the defined threshold use that key from that region, then that information can be used to suggest locally important matches as Google and YouTube does.
New ideas can be added here and let the website get what our brain wants.
Caution
This little piece of stored procedure is not aimed to handle millions of data like Google database. This article is written in view with a normal website.
Complete Code Dump
CREATE DEFINER=`root`@`localhost` PROCEDURE `suggest`(searchWord VARCHAR(250))
BEGIN
DECLARE formattedString VARCHAR(250)DEFAULT '';
DECLARE resultString VARCHAR(250) DEFAULT '';
DECLARE str_len INT DEFAULT 0;
DECLARE str_part VARCHAR(250) DEFAULT '';
DECLARE str_part1 VARCHAR(250) DEFAULT '';
DECLARE str_part2 VARCHAR(250) DEFAULT '';
DECLARE i INT DEFAULT 1;
DECLARE totalResultCount INT DEFAULT 0;
DECLARE resultLimit INT DEFAULT 10;
DELETE FROM RESULT;
SET searchWord=CONCAT('%',' ', searchWord,'%');
CALL getResult(searchWord,totalResultCount,resultLimit);
SET str_len=LENGTH(searchWord);
loop1:WHILE (i<(str_len-3)) DO
IF totalResultCount>resultLimit THEN
LEAVE loop1;
END IF;
SET i=i+1;
SET formattedString=INSERT(searchWord,str_len-i,0,'%');
CALL getResult(formattedString,totalResultCount,resultLimit);
END WHILE loop1;
SET i=0;
loop2:WHILE (i<(str_len-3)) DO
IF totalResultCount>resultLimit THEN
LEAVE loop2;
END IF;
SET i=i+1;
SET formattedString=INSERT(searchWord,str_len-i,1,'%');
CALL getResult(formattedString,totalResultCount,resultLimit)
END WHILE loop2;
END IF;
SELECT word,score FROM RESULT ORDER BY score DESC LIMIT 10;
END
CREATE DEFINER=`root`@`localhost` PROCEDURE `getResult`_
(searchWord VARCHAR(250),INOUT totalCount INT,resultLimit INT)
MODIFIES SQL DATA
BEGIN
DECLARE formatted_string VARCHAR(250);
DECLARE result_word VARCHAR(250) DEFAULT '';
DECLARE tempWord VARCHAR(250);
DECLARE done INT DEFAULT 0;
DECLARE score FLOAT DEFAULT 0;
DECLARE lowScoreThreshold INT DEFAULT 30;
DECLARE i INT DEFAULT 0;
DECLARE duplicate_key INT DEFAULT 0;
DECLARE word_list CURSOR FOR SELECT word,scoreWords_
(sanitizeSearchKeyWord(searchWord),word) AS score FROM words _
WHERE word LIKE searchWord OR word LIKE CONCAT_
(sanitizeSearchKeyWord(searchWord) ,'%');
DECLARE CONTINUE HANDLER FOR 1062
SET duplicate_key=1;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
OPEN word_list ;
loop1:LOOP
SET i=i+1;
FETCH word_list INTO tempWord,score;
IF i>resultLimit THEN
LEAVE loop1;
END IF;
IF done=1 THEN
LEAVE loop1;
END IF;
IF score>lowScoreThreshold THEN
INSERT INTO RESULT (word,score) VALUES (tempWord,score);
END IF;
IF duplicate_key<>1 THEN
SET totalCount=totalCount+1;
END IF;
END LOOP loop1;
CLOSE word_list;
END
CREATE DEFINER=`root`@`localhost` FUNCTION `scoreWords`_
(searchWord VARCHAR(250),resultWord VARCHAR(250)) RETURNS float
BEGIN
DECLARE gapPenalty FLOAT DEFAULT -0.5;
DECLARE count1 INT DEFAULT 0;
DECLARE count2 INT DEFAULT 0;
DECLARE diffCount INT DEFAULT 0;
DECLARE penalty FLOAT DEFAULT 0;
DECLARE score FLOAT DEFAULT 0;
DECLARE resultPart VARCHAR(250);
SET resultPart= locateSearchWord(searchWord,resultWord);
SET count1=LENGTH(REPLACE(searchWord,'%',''));
SET count2=LENGTH(resultPart);
SET diffCount=count2-count1;
IF diffCount<0 THEN
SET diffCount=diffCount*-1;
END IF;
IF LEFT(REPLACE(searchWord,'%',''),1)<>LEFT(resultPart,1) THEN
SET penalty==1*count1;
ELSE
SET penalty=diffCount * gapPenalty;
END IF
SET score=count1+penalty;
SET score=(score/count1)*100;
RETURN(score);
END
CREATE DEFINER=`root`@`localhost` FUNCTION `locateSearchWord`_
(searchWord VARCHAR(250),resultWord VARCHAR(250)) RETURNS varchar(250) CHARSET latin1
BEGIN
DECLARE strPart1 VARCHAR(250) DEFAULT '';
DECLARE strPart2 VARCHAR(250) DEFAULT '';
DECLARE tmpWord VARCHAR(250) DEFAULT '';
DECLARE spaceCount INT DEFAULT 0;
DECLARE wordPosition INT DEFAULT 1;
DECLARE resultPart VARCHAR(250) DEFAULT '';
SET strPart1=SUBSTRING_INDEX(searchWord,'%',1);
SET strPart2=SUBSTRING_INDEX(searchWord,'%',-1);
SET tmpWord=REPLACE(searchWord,' ','');
SET spaceCount =LENGTH(searchWord)-LENGTH(tmpWord);
IF LENGTH(strPart1)>LENGTH(strPart2) THEN
SET wordPosition=LOCATE(REVERSE(strPart1),REVERSE(resultWord));
IF LOCATE(' ',REVERSE(resultWord),_
wordPosition+LENGTH(strPart1))=0 THEN
SET wordPosition=LENGTH(resultWord);
ELSE
SET wordPosition= LOCATE(' ',_
REVERSE(resultWord),wordPosition+LENGTH(strPart1))-1;
END IF;
SET tmpWord=LEFT(REVERSE(resultWord),wordPosition);
SET resultPart=SUBSTRING_INDEX(tmpWord,' ',-1*(spaceCount+1));
IF resultPart='' THEN
SET resultPart=tmpWord;
END IF;
SET resultPart=REVERSE(resultPart);
ELSEIF LENGTH(strPart1)<LENGTH(strPart2) _
OR(LENGTH(strPart1)=LENGTH(strPart2) AND LENGTH(strPart1)<>LENGTH(searchWord) _
AND LENGTH(strPart2)<>0) THEN
SET wordPosition=LOCATE(strPart2,resultWord);
IF LOCATE(' ',resultWord,wordPosition+LENGTH(strPart2))=0 THEN
SET wordPosition=LENGTH(resultWord);
ELSE
SET wordPosition= LOCATE(' ',_
resultWord,wordPosition+LENGTH(strPart2))-1;
END IF;
SET tmpWord=LEFT(resultWord,wordPosition);
SET resultPart=SUBSTRING_INDEX(tmpWord,' ',-1*(spaceCount+1));
IF resultPart='' THEN
SET resultPart=tmpWord;
END IF;
ELSE
SET wordPosition=LOCATE(searchWord,resultWord);
SET tmpWord=LEFT(resultWord,(wordPosition-1)+LENGTH(searchWord));
SET resultPart=SUBSTRING_INDEX(tmpWord,' ',-1*(spaceCount+1));
IF resultPart='' THEN
SET resultPart=tmpWord;
END IF;
END IF;
RETURN(resultPart);
END
CREATE DEFINER=`root`@`localhost` _
FUNCTION `sanitizeSearchKeyWord`(searchWord VARCHAR(250)) _
RETURNS varchar(250) CHARSET latin1
BEGIN
SET searchWord=MID(searchWord,3,LENGTH(searchWord)-3);
RETURN searchWord;
END
Points of Interest
I was very much inspired with Google search. There is nothing fancy other than a single searchbox and it does a lot. This article is my little thinking and innovated by Google.
History
- 20th December, 2010: Initial post