Introduction
According to Wikipedia, an n-gram “is a contiguous sequence of n items from a given sequence of text or speech” [1]. In simple terms, an n-gram is a collection of tokens where “n” equals the number of tokens in the collection. A token in this context can be any portion of data divided into smaller logical pieces by the n-gram creator. One of the simplest examples to consider is a text document. A token within a text document might represent each individual word, delimited by spaces with all punctuation characters removed, although many alternative tokenization strategies could be devised. Continuing with the text document example, a 3-gram would contain 3 word tokens and a 4-gram would contain 4 word tokens. Once the parameter “n” and a tokenization method have been decided upon, the first n-gram is produced from the first “n” tokens. Each subsequent n-gram is produced by removing the first token from the current n-gram and appending the next token from the document to the end. This process continues until the last “n” tokens within the document have been consumed and the final n-gram has been created.
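To make the sliding-window process concrete, here is a minimal C# sketch that assumes the text has already been tokenized into words; the class and method names (SlidingWindowDemo, NGrams) are illustrative rather than part of any standard API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SlidingWindowDemo
{
    // Slides a window of n tokens across an already-tokenized sequence,
    // one token at a time, producing each n-gram in order.
    static IEnumerable<string> NGrams(IReadOnlyList<string> tokens, int n)
    {
        for (int i = 0; i + n <= tokens.Count; i++)
            yield return string.Join(" ", tokens.Skip(i).Take(n));
    }

    static void Main()
    {
        var tokens = new[] { "SMU", "is", "the", "best", "college", "in", "Texas" };
        foreach (var gram in NGrams(tokens, 3))
            Console.WriteLine(gram);   // SMU is the / is the best / ...
    }
}
```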
N-grams are used in many types of applications including computational linguistics, DNA sequencing, protein sequencing, and data compression, just to name a few [1]. In order to better understand the benefits of n-grams, we can once again consider the text document example where tokens are generated using only individual words. Using a standard “Bag of Words” approach to compare two documents for similarity, the Jaccard Index [2] would be calculated as the number of unique words shared by the two documents (the intersection) divided by the number of unique words appearing in either document (the union). While this approach does produce a somewhat reasonable estimate of document similarity, several challenges can arise. For instance, a document containing the sentence “SMU is the best college in Texas.” would be scored as very similar to a document containing the sentence “In Texas, the best college is TCU.” While the two documents in this example are similar, they are not as semantically similar as the score would suggest.
Similarity Examples Using Jaccard
- Document 1: “SMU is the best college in Texas.”
- Document 2: “In Texas, the best college is TCU.”
Jaccard Similarity Using Single Words
- Intersection = {the, best, college, in, texas, is} = 6
- Union = {the, best, college, in, texas, is, smu, tcu} = 8
- Jaccard Similarity = 6/8 = .75 or 75% using single words
Jaccard Similarity Using 3-Grams
Document 1 3-Grams
“SMU is the best college in Texas.”
- SMU is the
- is the best
- the best college
- best college in
- college in Texas
Document 2 3-Grams
“In Texas, the best college is TCU.”
- in Texas the
- Texas the best
- the best college
- best college is
- college is TCU
- Intersection = {the best college} = 1
- Union = 9 total unique 3-grams
- Jaccard Similarity = 1/9 ≈ .11 or 11% using 3-grams
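Both scores above can be reproduced with a few lines of C#. The sketch below is illustrative (the names JaccardDemo and Jaccard are my own); the same function applies whether the items being compared are single words or 3-grams:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class JaccardDemo
{
    // Jaccard similarity: |A ∩ B| / |A ∪ B| for two sets of items,
    // whether the items are single words or n-grams.
    static double Jaccard(IEnumerable<string> a, IEnumerable<string> b)
    {
        var setA = new HashSet<string>(a, StringComparer.OrdinalIgnoreCase);
        var setB = new HashSet<string>(b, StringComparer.OrdinalIgnoreCase);
        int intersection = setA.Count(setB.Contains);
        int union = setA.Count + setB.Count - intersection;
        return union == 0 ? 0.0 : (double)intersection / union;
    }

    static void Main()
    {
        var words1 = "SMU is the best college in Texas".Split(' ');
        var words2 = "In Texas the best college is TCU".Split(' ');
        Console.WriteLine(Jaccard(words1, words2));   // 0.75
    }
}
```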
Which is Best?
As you can see from the examples above, using single words or n-grams to calculate a Jaccard Index for document similarity can produce very different results. However, it may still be unclear which method is the better choice, and the answer depends on multiple factors. Since n-grams typically contain multiple overlapping tokens, a document similarity calculated using n-grams better reflects and preserves the actual order of words within a document. This means that a collection of documents sorted by a Jaccard Similarity calculated using n-grams would typically be ranked in a more semantically accurate order than one using only single words. However, there is also a performance trade-off to consider, as producing n-grams is more complex than producing only the unique words contained within a document.
N-gram Production Using C#
The following program written in C# demonstrates a simple method of producing n-grams from a string of text. The program “yield returns” each n-gram as part of an IEnumerable<string> collection. The yield return statement suspends processing and hands each n-gram to the caller as it is produced, in a streaming (lazy) fashion, which allows downstream processing of n-grams to begin before all n-grams have been produced. A FIFO (first-in, first-out) queue is used to keep track of each token’s length within the current n-gram. Tokens are continually identified and added to the current n-gram buffer. When the current n-gram buffer’s token count exceeds the specified n-gram size, a new n-gram is created and the first token within the current n-gram buffer is removed using the correct length provided by the token length queue.
Many heuristics could be developed for generating tokens from text. In the example below, I have devised a simple set of rules for determining when individual characters belong to a specific “word” token. Space characters delimit each word, and duplicate spaces are simply skipped. Letters and digits always belong to the current word token, and other characters belong to the current word token only when they are surrounded by letters or digits. This does not cover every valid word token that might occur within various types of documents, but additional heuristics can be added using the program below as a starting point. I have found the n-grams produced by this program to be very suitable for my own purposes.
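The listing below is a minimal sketch of such a producer, written to match the description above; the class and method names (NGramProducer, ProduceNGrams) and the exact handling of edge cases are my own choices rather than a definitive implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class NGramProducer
{
    public static IEnumerable<string> ProduceNGrams(string text, int n)
    {
        var buffer = new StringBuilder();      // current n-gram tokens, space-separated
        var tokenLengths = new Queue<int>();   // FIFO queue of token lengths in the buffer
        var token = new StringBuilder();       // the word token currently being read

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];

            // Letters and digits always belong to the current word token; other
            // non-space characters are kept only when surrounded by letters or
            // digits (e.g. the apostrophe in "don't").
            bool isWordChar = char.IsLetterOrDigit(c)
                || (c != ' ' && i > 0 && i + 1 < text.Length
                    && char.IsLetterOrDigit(text[i - 1])
                    && char.IsLetterOrDigit(text[i + 1]));

            if (isWordChar)
            {
                token.Append(c);
                continue;
            }

            if (token.Length == 0)
                continue;   // duplicate spaces and stray punctuation are skipped

            // A token just ended: append it to the n-gram buffer.
            if (tokenLengths.Count > 0)
                buffer.Append(' ');
            buffer.Append(token.ToString());
            tokenLengths.Enqueue(token.Length);
            token.Clear();

            if (tokenLengths.Count == n)
            {
                yield return buffer.ToString();
                // Slide the window: drop the first token (and its trailing
                // space, when one exists) using the length from the queue.
                int firstLength = tokenLengths.Dequeue();
                buffer.Remove(0, Math.Min(buffer.Length, firstLength + 1));
            }
        }

        // Handle a token that runs to the end of the text.
        if (token.Length > 0)
        {
            if (tokenLengths.Count > 0)
                buffer.Append(' ');
            buffer.Append(token.ToString());
            tokenLengths.Enqueue(token.Length);
            if (tokenLengths.Count == n)
                yield return buffer.ToString();
        }
    }
}
```

Used against the first example sentence, this sketch streams out the same five 3-grams listed earlier:

```csharp
// Prints the five 3-grams from Document 1 in the example above.
foreach (var gram in NGramProducer.ProduceNGrams("SMU is the best college in Texas.", 3))
    Console.WriteLine(gram);
```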
Performance
I found that using a simple for() loop when producing n-grams in C# provided two primary benefits. First, individual characters within a C# string can be accessed by index, as if the string were a character array. This allowed me to look both backwards and forwards within the text for the previous and next characters using a simple and very efficient statement such as “char before = text[i-1];”. Second, in my experience, other methods of separating words, such as the Split() function or a regular expression, proved to be much slower while providing less flexibility for heuristics than the for() loop.

The for() loop could also be written to execute in parallel with minor modifications, but that is a topic for another blog!
References
- N-gram. (2013, March 31). In Wikipedia, The Free Encyclopedia. Retrieved 18:52, April 22, 2013, from http://en.wikipedia.org/w/index.php?title=N-gram&oldid=548024834
- Jaccard index. (2013, April 9). In Wikipedia, The Free Encyclopedia. Retrieved 19:14, April 22, 2013, from http://en.wikipedia.org/w/index.php?title=Jaccard_index&oldid=549575590