Introduction
Bag-of-words is a way of representing a document as a list of the words that occur in it, along with a count of how often each word occurs. Most machine learning algorithms expect vectors as input, and the bag-of-words representation makes concepts like Euclidean distance or cosine distance apply to text. That is helpful if you want to classify documents by nearest neighbors, or index them in a metric tree.
Linux has powerful standard utilities for handling text. It is possible to do this conversion in a single command pipeline.
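For a quick feel for the representation, here is a stripped-down version of the pipeline built below, keeping only letters (the sample sentence is made up, and the counts are printed the way GNU uniq -c prints them):

echo "To be, or not to be" | tr '[:lower:]' '[:upper:]' | tr -cs 'A-Z' '[\n*]' | sort | uniq -c

which produces one line per distinct word:

      2 BE
      1 NOT
      1 OR
      2 TO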
The Code
Given: A directory full of .txt files
Needed: A corresponding .bow file for each .txt
for i in *.txt; do
    cat "$i" | tr '[:lower:]' '[:upper:]' | tr -cs 'A-Z0-9\047\-' '[\n*]' |
    sed "s/^[-']*//" | sed "s/[-']*$//" |
    sort | uniq -c | tail -n +2 > "`basename "$i" .txt`.bow"
done
for i in *.txt; do
    begins a loop that executes once for each text file in the current directory.

cat "$i"
    reads the file and sends it to standard output.

tr '[:lower:]' '[:upper:]'
    translates lower case to upper case.

tr -cs 'A-Z0-9\047\-' '[\n*]'
    The -s option means squeeze repeats, and the -c option means complement, so every run of characters that are not A-Z, 0-9, \047 (octal ASCII for the apostrophe), or the hyphen is translated to a single newline. The document is now split into one word per line.

sed "s/^[-']*//"
    selects leading hyphens and apostrophes, replacing them with nothing.

sed "s/[-']*$//"
    selects trailing hyphens and apostrophes, replacing them with nothing.

sort
    orders the words alphabetically, so that repeats of the same word land on adjacent lines, which uniq needs.

uniq -c
    takes the sorted list of words and returns a list of unique words with a count of how often each occurs.

tail -n +2
    removes the lone empty entry that sorts to the beginning of the list (left behind by tokens that were nothing but punctuation).

> "`basename "$i" .txt`.bow"
    The backticks run the basename command, which strips the .txt from the original filename, and insert its result into the command line. The > redirects output to the filename made by appending .bow to the result of basename.

done
    closes the loop begun with for.
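To check the steps above on a concrete input, create a small made-up sample (sample.txt is just a placeholder name; the output shown assumes the GNU versions of these tools):

printf "Don't stop -- it's a multi-channel mix.\n" > sample.txt

After the loop runs, sample.bow contains:

      1 A
      1 DON'T
      1 IT'S
      1 MIX
      1 MULTI-CHANNEL
      1 STOP

The apostrophes and the in-word hyphen survive, the period is eaten by tr, and the stand-alone -- is reduced to the empty entry that tail -n +2 drops.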
Empty Files
Some of the texts may have consisted of nothing but punctuation marks, which leaves behind a zero-length .bow file. Empty files are bound to cause problems later, so delete them now with:
for i in *.bow; do if test ! -s "$i"; then rm "$i"; fi; done
test ! -s "$i"
    returns true when the file size is zero (-s tests for a size greater than zero, and ! negates the test).

fi;
    closes the if block.
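With a clean set of .bow files, the distance calculations mentioned in the introduction become possible. What follows is only a rough sketch of cosine similarity between two documents, not part of the pipeline above: doc1.bow and doc2.bow are placeholder names, and it assumes bash plus the count-then-word format that uniq -c produces (if sort and join disagree about ordering, run everything under LC_ALL=C):

# Dot product of the two count vectors: join the files on the word field,
# then multiply the matching counts.
dot=$(join -1 2 -2 2 <(sort -b -k2,2 doc1.bow) <(sort -b -k2,2 doc2.bow) |
      awk '{s += $2 * $3} END {print s + 0}')

# Euclidean length of each count vector.
na=$(awk '{s += $1 * $1} END {print sqrt(s)}' doc1.bow)
nb=$(awk '{s += $1 * $1} END {print sqrt(s)}' doc2.bow)

# Cosine similarity; cosine distance is 1 minus this number.
awk -v d="$dot" -v x="$na" -v y="$nb" 'BEGIN {print d / (x * y)}'

The result is 1.0 for documents with identical word distributions and drops toward 0 as they share fewer words.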
Points of Interest
Some limitations:
It's not quick. If you have a lot of data to run, there will be plenty of time for snacks.
Numbers are not handled correctly if they contain punctuation marks. This includes periods and commas, so decimal numbers will be chopped up, as will anything with thousands separators. Phone numbers with parentheses will also be wrecked, and negative numbers will be counted as positive (a short demonstration follows the list).
Hyphens inside words are allowed, so MULTI-CHANNEL is treated as one word rather than being split into MULTI and CHANNEL. Hyphenation that splits words across lines isn't accounted for, so...
bla bla bla bla bla bla bla mis-
take
...will be counted as:
1 MIS
1 TAKE
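To see the number mangling for yourself, run just the word-splitting stages of the pipeline over a made-up line of input (the output shown is from the GNU tools):

echo '3.14 (555) 867-5309 -40' | tr '[:lower:]' '[:upper:]' |
tr -cs 'A-Z0-9\047\-' '[\n*]' | sed "s/^[-']*//" | sed "s/[-']*$//" |
sort | uniq -c

prints:

      1 14
      1 3
      1 40
      1 555
      1 867-5309

The decimal point splits 3.14 in two, the parentheses chop up the phone number, the minus sign is stripped from -40, and 867-5309 only stays whole because its hyphen is internal.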
History