Introduction
Bag-of-words is a way of representing a document as a list of the words that occur in it, along with a count of how often each word occurs. Most machine learning algorithms expect vectors as input, and the bag-of-words representation makes concepts like Euclidean distance or cosine distance apply to text. That is helpful if you want to classify documents by nearest neighbors, or index them in a metric tree.
Linux has powerful standard utilities for handling text. It is possible to do this conversion in a single command pipeline.
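For a quick feel for the representation, here is a stripped-down version of the pipeline built below, keeping only letters (the sample sentence is made up, and the counts are printed the way GNU uniq -c prints them):

echo "To be, or not to be" | tr '[:lower:]' '[:upper:]' | tr -cs 'A-Z' '[\n*]' | sort | uniq -c

which produces one line per distinct word:

      2 BE
      1 NOT
      1 OR
      2 TO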
The Code
Given: A directory full of .txt files
Needed: A corresponding .bow file for each .txt
for i in *.txt; do
    cat "$i" | tr '[:lower:]' '[:upper:]' | tr -cs 'A-Z0-9\047\-' '[\n*]' |
    sed "s/^[-']*//" | sed "s/[-']*$//" |
    sort | uniq -c | tail -n +2 > "`basename "$i" .txt`.bow"
done
for i in *.txt; do
    begins a loop that executes once for each text file in the current directory.

cat "$i"
    reads the file and sends it to standard output.

tr '[:lower:]' '[:upper:]'
    translates lower case to upper case.

tr -cs 'A-Z0-9\047\-' '[\n*]'
    The -s option means squeeze repeats, and the -c option means complement, so every run of characters that are not A-Z, 0-9, \047 (octal ASCII for the apostrophe), or the hyphen is translated to a single newline. The document is now split into one word per line.

sed "s/^[-']*//"
    selects leading hyphens and apostrophes, replacing them with nothing.

sed "s/[-']*$//"
    selects trailing hyphens and apostrophes, replacing them with nothing.

sort
    orders the words alphabetically, so that repeats of the same word land on adjacent lines, which uniq needs.

uniq -c
    takes the sorted list of words and returns a list of unique words with a count of how often each occurs.

tail -n +2
    removes the lone empty entry that sorts to the beginning of the list (left behind by tokens that were nothing but punctuation).

> "`basename "$i" .txt`.bow"
    The backticks run the basename command, which strips the .txt from the original filename, and insert its result into the command line. The > redirects output to the filename made by appending .bow to the result of basename.

done
    closes the loop begun with for.
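To check the steps above on a concrete input, create a small made-up sample (sample.txt is just a placeholder name; the output shown assumes the GNU versions of these tools):

printf "Don't stop -- it's a multi-channel mix.\n" > sample.txt

After the loop runs, sample.bow contains:

      1 A
      1 DON'T
      1 IT'S
      1 MIX
      1 MULTI-CHANNEL
      1 STOP

The apostrophes and the in-word hyphen survive, the period is eaten by tr, and the stand-alone -- is reduced to the empty entry that tail -n +2 drops.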
Empty Files
Some of the texts may have consisted of nothing but punctuation marks, which leaves behind a zero-length .bow file. Empty files are bound to cause problems later, so delete them now with:
for i in *.bow; do if test ! -s "$i"; then rm "$i"; fi; done
test ! -s "$i"
    returns true when the file size is zero (-s tests for a size greater than zero, and ! negates the test).

fi;
    closes the if block.
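With a clean set of .bow files, the distance calculations mentioned in the introduction become possible. What follows is only a rough sketch of cosine similarity between two documents, not part of the pipeline above: doc1.bow and doc2.bow are placeholder names, and it assumes bash plus the count-then-word format that uniq -c produces (if sort and join disagree about ordering, run everything under LC_ALL=C):

# Dot product of the two count vectors: join the files on the word field,
# then multiply the matching counts.
dot=$(join -1 2 -2 2 <(sort -b -k2,2 doc1.bow) <(sort -b -k2,2 doc2.bow) |
      awk '{s += $2 * $3} END {print s + 0}')

# Euclidean length of each count vector.
na=$(awk '{s += $1 * $1} END {print sqrt(s)}' doc1.bow)
nb=$(awk '{s += $1 * $1} END {print sqrt(s)}' doc2.bow)

# Cosine similarity; cosine distance is 1 minus this number.
awk -v d="$dot" -v x="$na" -v y="$nb" 'BEGIN {print d / (x * y)}'

The result is 1.0 for documents with identical word distributions and drops toward 0 as they share fewer words.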
Points of Interest
Some limitations:
It's not quick. If you have a lot of data to run, there will be plenty of time for snacks.
Numbers are not handled correctly if they contain punctuation marks. This includes periods and commas, so decimal numbers will be chopped up, as will anything with thousands separators. Phone numbers with parentheses will also be wrecked, and negative numbers will be counted as positive (a short demonstration follows the list).
Hyphens inside words are allowed, so MULTI-CHANNEL is treated as one word rather than being split into MULTI and CHANNEL. Hyphenation that splits words across lines isn't accounted for, so...
bla bla bla bla bla bla bla mis-
take
...will be counted as:
1 MIS
1 TAKE
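To see the number mangling for yourself, run just the word-splitting stages of the pipeline over a made-up line of input (the output shown is from the GNU tools):

echo '3.14 (555) 867-5309 -40' | tr '[:lower:]' '[:upper:]' |
tr -cs 'A-Z0-9\047\-' '[\n*]' | sed "s/^[-']*//" | sed "s/[-']*$//" |
sort | uniq -c

prints:

      1 14
      1 3
      1 40
      1 555
      1 867-5309

The decimal point splits 3.14 in two, the parentheses chop up the phone number, the minus sign is stripped from -40, and 867-5309 only stays whole because its hyphen is internal.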
History