Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / database / MySQL

Simple Pattern Matching Technique to Improve the Search Boxes

4.33/5 (3 votes)
20 Dec 2010CPOL7 min read 28.8K   223  
Simple Pattern Matching Technique for Search Suggest Boxes

Introduction

There is no wonder everybody likes Google search engine and its search box suggestions. An intelligent and friendly search-match suggestion makes it a wonderful user experience. It is very rare that we can see a website without a search box. It would be more valuable to discuss something which can improve the search experience.

Pattern matching is a technique which has its own place in the computer era. Pattern matching is our brain's own way and hopefully it helps our websites communicate better to us. Here we discuss a simple pattern matching technique which can be used with search suggest boxes.

Background

If we look into bioinformatics, one can be familiar with a word "sequence alignment". It is nothing but align two genes and compares its similarity. This is true for searching a word which is in turn formed of sequence of letters. In fact, probably bioinformatics adapted the pattern matching from computer algorithms.

Once we align two words in such a way, it can be scored for the matches and penalty for the un-matching places. In that way, we would be able to sort the matching list of words according to the matching score.

For example, let us write 'sugest' which may be match with 'suggest' or 'sugging'. Let us align these words.
sug_est
suggest
The above words aligned is such a way for most matching and it got one gap in the search key. When we align the other pair we have four gaps as follows.
sug____est
sugging
So we could suggest the first pair is more matched. This kind of pattern matching could not be achieved by a simple SQL 'LIKE %sugest%' or a predefined regEx pattern as many sites uses it. It happens often misspelt a letter or missing a letter when typing in the search especially when typing  names like  movies, products, person etc.,

So let us try to have a improved pattern match search query, discussed further below.

Using the Code

This code is a mySql stored procedure. But the concept is not limited to mySql, but can be extended to any database like SQL Server, Oracle, etc. At the start, the code uses the exact word to search for the matched ones.

SQL
SET searchWord=CONCAT('%',' ', searchWord,'%');

This search pattern searches the word that contains the exact chars the user types in, allowing any other words on either side of it. If the search resulted in enough number of records, it won't search further. Next step, it iterates to the length of chars user type in and inserts % in between the chars at insert position and queries the database at each iteration. This allows number of chars at insert position and increases the chance of finding a match if our initial search failed. For example, the search word at each iteration would look like winnin%g, winni%ng, winn%ing.......w%inning. Two things here accounted for that first the insertion starts from the last char to the first. This is in view of keeping integrity of initial typed chars. For example, if user types 'wining' then it can be 'win from the begining' (resulted by 'win%ing) but not 'war is begining' (resulted by 'w%ining') . The next thing it won't keep the % inserted in the previous iteration because to much of % results in not related matches. For example, 'h%a%p%p%y' can be anything but may not be 'happy'. At each iteration, it check for enough results obtained and stops further processing.

SQL
loop1:WHILE (i<(str_len-3)) DO
    /* Limit check to avoid unnecessary loops */
    IF totalResultCount>resultLimit THEN
        LEAVE loop1;
    END IF;
    SET i=i+1;
    /*Addition of % in the search key */
    SET formattedString=INSERT(searchWord,str_len-i,0,'%');
    /*Get the matched ones. Watch for the result limit
    Result has a score for each matched selection*/
    CALL getResult(formattedString,totalResultCount,resultLimit);
END WHILE loop1;

Also the code scores the result at each iteration and if there are too many chars inserted in between the match can be rejected using a threshold penaltylevel. This prevents unmatched matches got in the results.

SQL
/*at the procedure getResult */
IF score>lowScoreThreshold THEN
                                INSERT INTO RESULT (word,score) VALUES (tempWord,score);
 END IF;

Next the code gets into a second set of iteration where everything happens the same as the first one but it replaces a char with % instead of just insert in between. This iteration accounts for a mispelling of a single char like 'scoreing' instead of 'scoring'. Why this task is performed in a separate iteration because we assume mispelling might not happen and need not give much importance to it. So this iteration happens only when all the above search does not result in enough matches.

SQL
SET i=0;
    loop2:WHILE (i<(str_len-3)) DO
          IF totalResultCount>resultLimit THEN
                  LEAVE loop2;
          END IF;
          SET i=i+1;
          SET formattedString=INSERT(searchWord,str_len-i,1,'%');
          CALL getResult(formattedString,totalResultCount,resultLimit);
    END WHILE loop2;

There is a procedure named 'getResult' called on these iterations where it queries the database.

SQL
DECLARE word_list CURSOR FOR SELECT word, scoreWords(sanitizeSearchKeyWord(searchWord),_
word) AS score FROM words WHERE word LIKE searchWord OR word LIKE CONCAT_
(sanitizeSearchKeyWord(searchWord) ,'%'); 

This query includes other stored functions like scoreWord which scores the resulted matches. This query has two parts on the where clause one part has a beginning space and the other one doesn't have. This is to get the match for both 'Are you Happy' (% Happ%) and 'Happy X-Mas' (%Happ%) when the search key is 'Happ'. But there is a problem where 'sin' results in 'basin'. Anyway which is unavoidable while querying, the scoring procedure accounts for it and gives less score. The resulted word and the corresponding score are stored in a result table.

SQL
OPEN word_list ;
                loop1:LOOP
                        SET i=i+1;
                        FETCH word_list INTO tempWord,score;
                        IF i>resultLimit THEN
                                LEAVE loop1;
                        END IF;
                        IF done=1 THEN
                                LEAVE loop1;
                        END IF;
                        IF score>lowScoreThreshold THEN
                                INSERT INTO RESULT (word,score) VALUES (tempWord,score);
                        END IF;
                        IF duplicate_key<>1 THEN
                                SET totalCount=totalCount+1;
                        END IF;
                END LOOP loop1;
        CLOSE word_list; 

Let us discuss about the scoring function. What the user types may be a word, but the match result may be a sentence. It is meaningless to include the whole sentence for computing the matching score. So it needs to extract the matching word to the search key. The difficulty is that the result does not contain the exact search key. So it calls another function 'locateSearchWord'. Once the word is located, it removes all the '%' from the search key and gets the length of it. Then comparing the length of the search key with the resulted match. The difference in the length is equal to the new chars present in the resulted match which are accounted as penalty and calculate the score accordingly.

SQL
SET resultPart= locateSearchWord(searchWord,resultWord);
      /*Find the length of the search key and the matched word resulted */
      SET count1=LENGTH(REPLACE(searchWord,'%',''));
      SET count2=LENGTH(resultPart);
      SET diffCount=count2-count1;
      IF diffCount<0 THEN
              SET diffCount=diffCount*-1;
      END IF;
      /*The difference between the lengths is the indication of unmatched
        chars which are subject to penalty */
      IF LEFT(REPLACE(searchWord,'%',''),1)<>LEFT(resultPart,1) THEN
          SET penalty==1*count1;
      ELSE
          SET penalty=diffCount * gapPenalty;
      END IF
      SET score=count1+penalty;
      /*if the search key and the matched result are same then the score
        will be equal to the length of the search key */
      /*score is converted to percentage */
      SET score=(score/count1)*100;
      RETURN(score);

The resulted matches with score above the threshold will be sorted in a table and queried for suggesting at the Ajax suggest boxes. This temporary table may be having a session id column, so that the results of multiple users can be handled. This way this little utility can be used with Ajax suggest boxes efficiently.

Proof Of Concept

Attached is a SQL dump with the sample tables and stored procedures. Once setup, the mySql database you may test it. If your table has the word 'winning', then you run call suggest('wining'), still it suggests 'winning' or if your table has the word 'happy' and user types 'happty' by error, still it suggest 'happy'.  Try a few more at the mySql prompt:

call suggest('Happi')
call suggest('Happty')
call suggest('Hapy') etc.,

The stored procedures have a definer declaration at the top like 'CREATE DEFINER=`root`@`localhost` PROCEDURE `suggest`' . Change it according to your user name and host.

Further Ideas

When we look at Google, it has lot of intelligent features in the suggest box. So the searching  is not limited to a few logics. It is open to new ideas. Let us look at a few.

A neural network can be formed relating search keys and user chosen suggestion. For example, user types 'New Year' and user chosen 'New Year Greetings' from the suggestion, then we can weigh the search key against the result and when this weight crosses the threshold on the neural node, the search key is stored as a valid one against the match. As this data grows, it can be used for suggestions.

When a user searches for something, that search key can be related to their region and weighted. If more people than the defined threshold use that key from that region, then that information can be used to suggest locally important matches as Google and YouTube does.

New ideas can be added here and let the website get what our brain wants.

Caution

This little piece of stored procedure is not aimed to handle millions of data like Google database. This article is written in view with a normal website.

Complete Code Dump

SQL
/*PROCEDURE suggest
Create search keywords and pass to query */
CREATE DEFINER=`root`@`localhost` PROCEDURE `suggest`(searchWord VARCHAR(250))
BEGIN
     DECLARE formattedString VARCHAR(250)DEFAULT '';
     DECLARE resultString VARCHAR(250) DEFAULT '';
     DECLARE str_len INT DEFAULT 0;
     DECLARE str_part VARCHAR(250) DEFAULT '';
     DECLARE str_part1 VARCHAR(250) DEFAULT '';
     DECLARE str_part2 VARCHAR(250) DEFAULT '';
     DECLARE i INT DEFAULT 1;
     DECLARE totalResultCount INT DEFAULT 0;
     DECLARE resultLimit INT DEFAULT 10;

     /* Result Table is to store the search results temporarily.
        Its field named 'word'is unique field.
        So it may throw duplicate entry error. 
        To prevent that a error handler is declared */
	 
	 /*Result Table may have an id column which can facilitate 
            different sessions can use the same table */

      /*delete the previous results */
      DELETE FROM RESULT;

     /* Initial search pattern */
     SET searchWord=CONCAT('%',' ', searchWord,'%');

     /*Get the matched ones. Watch for the result limit  */
     CALL getResult(searchWord,totalResultCount,resultLimit);

     SET str_len=LENGTH(searchWord);
     loop1:WHILE (i<(str_len-3)) DO
           /* Limit check to avoid unnecessary loops */
           IF totalResultCount>resultLimit THEN
                   LEAVE loop1;
           END IF;
           SET i=i+1;
			   
	   /*Addition of % in the search key */
           SET formattedString=INSERT(searchWord,str_len-i,0,'%');

	   /*Get the matched ones. Watch for the result limit
	   Result has a score for each matched selection*/
	   CALL getResult(formattedString,totalResultCount,resultLimit);
			   
    END WHILE loop1;

    /*This loop is similar to the above but replace the char at position with % */
    /*For examble %hello viewers%, %hello viewer%%, %hello viewe%s%.... %h%llo viewers% */

     SET i=0;
     loop2:WHILE (i<(str_len-3)) DO
				
         IF totalResultCount>resultLimit THEN
		LEAVE loop2;   
         END IF;
         SET i=i+1;
			   
         SET formattedString=INSERT(searchWord,str_len-i,1,'%');
        CALL getResult(formattedString,totalResultCount,resultLimit)

      END WHILE loop2;

     END IF;

     SELECT word,score FROM RESULT ORDER BY score DESC LIMIT 10;
     
END
SQL
/* PROCEDURE getResult
Execute the search query and store the results temporarily */
CREATE DEFINER=`root`@`localhost` PROCEDURE `getResult`_
	(searchWord VARCHAR(250),INOUT totalCount INT,resultLimit INT)
    MODIFIES SQL DATA
BEGIN
        /*This routine form various search key pattern and store the results in a table */
        DECLARE formatted_string VARCHAR(250);
        DECLARE result_word VARCHAR(250) DEFAULT '';
        DECLARE tempWord VARCHAR(250);
        DECLARE done INT DEFAULT 0;
        DECLARE score FLOAT DEFAULT 0;
        DECLARE lowScoreThreshold INT DEFAULT 30;
        DECLARE i INT DEFAULT 0;
        DECLARE duplicate_key INT DEFAULT 0;
		
        /* Here is the main search query with the search key passed. It is more 
           expensive query as it may results in many records and scoring each record */
        DECLARE word_list CURSOR FOR SELECT word,scoreWords_
	(sanitizeSearchKeyWord(searchWord),word) AS score FROM words _
	WHERE word LIKE searchWord OR word LIKE CONCAT_
	(sanitizeSearchKeyWord(searchWord) ,'%');

        DECLARE CONTINUE HANDLER FOR 1062 /* Duplicate key*/
        SET duplicate_key=1;
        DECLARE CONTINUE HANDLER  FOR NOT FOUND SET done=1;
		
        /*Loops through the selected records and validate with a threshold. 
          This loop only iterate up to the result limit. So not very costly
          This records the total records found*/
		
        OPEN word_list ;
                loop1:LOOP
                        SET i=i+1;
						
                        FETCH word_list INTO tempWord,score;
						
                        IF i>resultLimit THEN
                                LEAVE loop1;
                        END IF;

                        IF done=1 THEN
                                LEAVE loop1;
                        END IF;
						
                        IF score>lowScoreThreshold THEN
                                INSERT INTO RESULT (word,score) VALUES (tempWord,score);
                        END IF;
						
                        IF duplicate_key<>1 THEN
                                SET totalCount=totalCount+1;
                        END IF;
						
                END LOOP loop1;
        CLOSE word_list;
END
SQL
/*FUNCTION scoreWords
Calculate the score comparing the searchkey and the resulted match */
CREATE DEFINER=`root`@`localhost` FUNCTION `scoreWords`_
	(searchWord VARCHAR(250),resultWord VARCHAR(250)) RETURNS float
BEGIN
        /*This routine score the match results */
        DECLARE gapPenalty FLOAT DEFAULT -0.5;
        DECLARE count1 INT DEFAULT 0;
        DECLARE count2 INT DEFAULT 0;
        DECLARE diffCount INT DEFAULT 0;
        DECLARE penalty FLOAT DEFAULT 0;
        DECLARE score FLOAT DEFAULT 0;
        DECLARE resultPart VARCHAR(250);

        /*Identify the position of matched search key in the resulted sentence */
        SET resultPart= locateSearchWord(searchWord,resultWord);

        /*Find the length of the seach key and the matched word resulted */
        SET count1=LENGTH(REPLACE(searchWord,'%',''));
        SET count2=LENGTH(resultPart);
        SET diffCount=count2-count1;
        IF diffCount<0 THEN
                SET diffCount=diffCount*-1;
        END IF;
        /*The difference between the lengths is the indication of unmatched chars 
          which are subject to penalty */
        IF LEFT(REPLACE(searchWord,'%',''),1)<>LEFT(resultPart,1) THEN
		SET penalty==1*count1;
	ELSE
		SET penalty=diffCount * gapPenalty;
	END IF
        SET score=count1+penalty;
        /*if the search key and the matched result are same then the score 
          will be equal to the length of the search key */
        /*score is converted to percentage */
        SET score=(score/count1)*100;

        RETURN(score);

END
SQL
/*Function locateSearchWord
Locate the matching word from the result to calculate the score */
CREATE DEFINER=`root`@`localhost` FUNCTION `locateSearchWord`_
(searchWord VARCHAR(250),resultWord VARCHAR(250)) RETURNS varchar(250) CHARSET latin1
BEGIN
        /* This routine locates the matched searchkey 
           like part in the resulted sentence */
        /*It pass the matched part to score it against the search key */
        DECLARE strPart1 VARCHAR(250) DEFAULT '';
        DECLARE strPart2 VARCHAR(250) DEFAULT '';
        DECLARE tmpWord VARCHAR(250) DEFAULT '';
        DECLARE spaceCount INT DEFAULT 0;
        DECLARE wordPosition INT DEFAULT 1;
        DECLARE resultPart VARCHAR(250) DEFAULT '';

        /*The place where we include the % may have any number of characters. 
          Those may not match with the searchkey
        So we take the left and right part of the % from the search key */
        SET strPart1=SUBSTRING_INDEX(searchWord,'%',1);
        SET strPart2=SUBSTRING_INDEX(searchWord,'%',-1);

        /*Finding the spaces i.e. equivalent number of words  in the seach key */
        SET tmpWord=REPLACE(searchWord,' ','');
        SET spaceCount =LENGTH(searchWord)-LENGTH(tmpWord);

        /*To find the matched part of the search key in the result 
          it takes the part which is more lengthy */
        IF LENGTH(strPart1)>LENGTH(strPart2) THEN
                SET wordPosition=LOCATE(REVERSE(strPart1),REVERSE(resultWord));

                IF   LOCATE(' ',REVERSE(resultWord),_
		wordPosition+LENGTH(strPart1))=0 THEN
                        SET wordPosition=LENGTH(resultWord);
                ELSE
                        SET wordPosition= LOCATE(' ',_
			REVERSE(resultWord),wordPosition+LENGTH(strPart1))-1;
                END IF;

                SET tmpWord=LEFT(REVERSE(resultWord),wordPosition);
                
                SET resultPart=SUBSTRING_INDEX(tmpWord,' ',-1*(spaceCount+1));
                IF resultPart='' THEN
                        SET resultPart=tmpWord;
                END IF;
                SET resultPart=REVERSE(resultPart);

        ELSEIF LENGTH(strPart1)<LENGTH(strPart2) _
	OR(LENGTH(strPart1)=LENGTH(strPart2) AND LENGTH(strPart1)<>LENGTH(searchWord) _
	AND LENGTH(strPart2)<>0) THEN

SET wordPosition=LOCATE(strPart2,resultWord);
                IF   LOCATE(' ',resultWord,wordPosition+LENGTH(strPart2))=0 THEN
                        SET wordPosition=LENGTH(resultWord);
                ELSE
                        SET wordPosition= LOCATE(' ',_
			resultWord,wordPosition+LENGTH(strPart2))-1;
                END IF;

                SET tmpWord=LEFT(resultWord,wordPosition);
                
                SET resultPart=SUBSTRING_INDEX(tmpWord,' ',-1*(spaceCount+1));
                IF resultPart='' THEN
                        SET resultPart=tmpWord;
                END IF;

        ELSE
                SET wordPosition=LOCATE(searchWord,resultWord);
                SET tmpWord=LEFT(resultWord,(wordPosition-1)+LENGTH(searchWord));
                
                SET resultPart=SUBSTRING_INDEX(tmpWord,' ',-1*(spaceCount+1));
                IF resultPart='' THEN
                        SET resultPart=tmpWord;
                END IF;

        END IF;

        RETURN(resultPart);

END
SQL
/*FUNCTION sanitizeSeachKeyWord
Just remove the start and trailing % */
CREATE DEFINER=`root`@`localhost` _
	FUNCTION `sanitizeSearchKeyWord`(searchWord VARCHAR(250)) _
	RETURNS varchar(250) CHARSET latin1
BEGIN
  SET searchWord=MID(searchWord,3,LENGTH(searchWord)-3);
  RETURN searchWord;
END

Points of Interest

I was very much inspired with Google search. There is nothing fancy other than a single searchbox and it does a lot. This article is my little thinking and innovated by Google.

History

  • 20th December, 2010: Initial post 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)