Introduction
This report details the proof of concept for using prime numbers for indexing text with the intention of searching for words within a data set. This technique is shown to significantly outperform regular string
comparison.
The targeted use case for this algorithm is in word frequency analysis for sentiment analysis.
Background
This technique utilizes the knowledge described by the unique-prime-factorization theorem (also known as the Fundamental theorem of arithmetic) where it is understood that every integer greater than 1 has a unique list of prime factors.
Algorithm Theory
The process is broken down into indexing and searching:
Indexing
- Defining the dictionary of the data set
- Index the dictionary by storing a unique prime number for each word in the dictionary
- Index the text records by storing the product of the prime numbers that make up each text record
Searching
- Fetch the unique prime number of the word to search for from the indexed dictionary
- Find the modulo of each indexed text record against the value of the indexed word being searched for
- If the modulo calculation equals
0
, then the search word exists in that record
Example
Take the data set of text:
"black cat on mat",
"black hat for you",
"cat sat on you"
The dictionary of this data set is:
"black",
"cat",
"on",
"mat",
"hat",
"for",
"you",
"sat"
Now assign prime number to each word in dictionary:
"black": 2,
"cat" : 3,
"on" : 5,
"mat" : 7,
"hat" : 11,
"for" : 13,
"you" : 17,
"sat" : 19
Now turn each text record into the product of its prime numbers:
"black cat on mat" = "black(2) cat(3) on(5) mat(7)" = 2 x 3 x 5 x 7 = 210
"black hat for you" = black(2) hat(11) for(13) you(17) = 2 x 11 x 13 x 17 = 4862
"cat sat on you" = "cat(3) sat(19) on(5) you(17)" = 3 x 19 x 5 x 17 = 4845
Now the indexed data set is:
Searching for a Single Word
To search for the word "sat"
(prime equivalent 19
) in the data set, we calculate the modulo of its equivalent prime across the data set and where the result is '0
', the word exists within that text record:
210 % 19 = 1
(not a factor) 4862 % 19 = 17
(not a factor) 4845 % 19 = 0
(is a factor "cat sat on you
")
Another example, search for "you"
(prime equivalent 17
):
210 % 17 = 6
(not a factor) 4862 % 17 = 0
(is a factor "black hat for you
") 4845 % 17 = 0
(is a factor "cat sat on you
")
Searching for Multiple Words
Because of the associative property of factors, to search for multiple words, you do the same as before except calculate the modulo using the product of the words that you are searching for:
Search for "you
" and "cat
"
"you"(17) x "cat"(3) = 51
210 % 51 = 6
(not a factor) 4862 % 51 = 17
(not a factor) 4845 % 51 = 0
(is a factor "cat sat on you
")
Search for "cat
" and "on
"
"cat"(3) x "on"(5) = 15
210 % 15 = 0
(is a factor "black cat on mat
") 4862 % 15 = 2
(not a factor) 4845 % 15 = 0
(is a factor "cat sat on you
")
Search for "black
", "hat
" and "you
"
"black"(2) x "hat"(11) x you(17) = 374
210 % 374 = 210
(not a factor) 4862 % 374 = 0
(is a factor "black hat for you
") 4845 % 374 = 231
(not a factor)
Performance vs 'numpy string find'
Normal string
comparison entails performing compound character comparison for each character within a string
which has a complexity of O(n) where n
is the length of the string
. After indexing: the prime factor approach only requires a single modulo operation per string
record which has a complexity of O(1) which in turn results in faster searching across records as illustrated in the plot below of the search speed against the average length of documents in the data set.
The slight gradient in the prime factor plot can be attributed the Pythons implementation which uses arbitrary precision arithmetic to store the indexed document value.
See the full python implementation and performance comparison technique and results here.
Points of Interest
Word Order
Due to the association of multiplication, when the documents are indexed into one product, the information regarding the order of the words is lost.
Appending Documents
When new documents need to be added, they can be indexed individually and any new words can be appended to the dictionary without affecting the currently index documents.
Large Numbers
The size of the stored product value can become very large with long documents so special care should be taken in ensuring the data type used to store this value is able to store the potentially large values..
Conclusion
Prime factors can be used for text searching provided that searching for word order isn't a requirement. This technique can therefore prove especially useful in word frequency analysis problems supplementing techniques like TF-IDF and Bag of Words.
Appendix