Introduction
Machine learning gives us the ability to use mathematics and statistical probabilities, derived from data, to determine the outcomes of our code. This allows us to create code that "evolves" over time, since it is driven by changes in the data rather than by hard-coded values or specific values stored somewhere.
For example, a customer's credit card usage changes and evolves over time based on their purchasing habits, and the card company needs to continue to be able to identify fraudulent transactions. If a "threshold" is set in code or in a database, that value will periodically have to be updated, and determining what that value should be would be prohibitively expensive/difficult for a large number of customers. Periodically training a machine learning model to identify fraudulent activity based on actual data is far more maintainable.
For this article, we will be using "supervised learning" to determine whether a message is "spam" or "ham" (a spam email message or a non-spam email message). Supervised learning means that we have a set of data containing messages which have already been identified as "spam" or "ham", and we will use this data to train a machine learning model to identify new messages as spam or ham. This determination is based on a new message's statistical similarity to the messages we trained the model with.
Background
If you have a decent level of familiarity with programming and an interest in machine learning, you should be able to follow along with this tutorial. The data provided by CodeProject looks like this:
# Spam training data
Spam,<p>But could then once pomp to nor that glee glorious of deigned. The vexed times
childe none native. To he vast now in to sore nor flow and most fabled.
The few tis to loved vexed and all yet yea childe. Fulness consecrate of it before his
a a a that.</p><p>Mirthful and and pangs wrong. Objects isle with partings ancient
made was are. Childe and gild of all had to and ofttimes made soon from to long youth
way condole sore.</p>
Spam,<p>His honeyed and land vile are so and native from ah to ah it like flash in not.
That gild by in basked they lemans passed way who talethis forgot deigned nor friends
his before strange. Found long little the. Talethis have soon of hellas had
An initial value indicating "Spam" or "Ham", followed by a comma and a <p> tag, then the contents of the message. Additionally, the file is split into training and testing sections (more on this below).
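To make the layout concrete, here is a quick illustrative sketch (not part of the article's code) showing how one raw line splits into a label and a message body:
line = 'Spam,<p>But could then once pomp to nor that glee glorious of deigned.</p>'
# split only on the first comma, since the message body may contain commas
label, body = line.split(',', 1)
print(label)  # Spam
print(body)   # <p>But could then once pomp to nor that glee glorious of deigned.</p>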
Importing Libraries
Here, as with many languages, we import the various libraries that we will need for our code. We'll go into the details of what we are using below:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
from sklearn.metrics import precision_score, classification_report, accuracy_score
import time
Loading and Parsing the Data
Yes, while being a Data Scientist may be the sexiest job of the 21st century, it requires a lot of time performing the not-so-sexy process of parsing/cleaning/understanding the data that you are looking at. For this project, probably 85% of my time was spent doing just that.
def get_data():
    file_name = './SpamDetectionData.txt'
    # read all lines and close the file when done
    with open(file_name, 'r') as rawdata:
        lines = rawdata.readlines()
    lines = lines[1:]
    spam_train = lines[0:1000]
    ham_train = lines[1002:2002]
    test_mix = lines[2004:]
    return (spam_train, ham_train, test_mix)
In the get_data() function, we get the training and test data out of the file provided by CodeProject. We read the raw data from the file and store it in an array. In the lines like:
spam_train = lines[0:1000]
ham_train = lines[1002:2002]
test_mix = lines[2004:]
We are simply splitting out parts of the array into separate arrays with meaningful names. More details on how and why we use the test and training data below.
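As a quick sanity check (a small sketch, assuming the file layout described above), you can confirm that the slices line up with the expected counts:
spam_train, ham_train, test_mix = get_data()
print(len(spam_train))  # expect 1000 spam training lines
print(len(ham_train))   # expect 1000 ham training lines
print(len(test_mix))    # the remaining mixed test lines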
Creating a Pandas DataFrame
For the vast majority of the data/feature engineering that you will be doing in machine learning, you will be using Pandas, as it provides a very powerful (though sometimes confusing) array of tools for working with data. Here, we are creating a DataFrame, which is easiest to think of as an in-memory "table" that contains rows and columns, where one column holds the contents of the spam/ham message and the other column holds a binary flag (or category, in data science lingo) indicating whether the message is "spam" (1) or "ham" (0).
def create_dataframe(input_array):
    spam_indicator = 'Spam,<p>'
    message_class = np.array([1 if spam_indicator in item else 0 for item in input_array])
    data = pd.DataFrame()
    data['class'] = message_class
    data['message'] = input_array
    return data
Here is what our data frame looks like before we clean the data in the next step, including the "class" column we just added.
DataFrame before the data is cleaned:
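Since the screenshot is not reproduced here, a quick way to inspect the frame yourself (a sketch using the functions above) is:
df_train = create_dataframe(spam_train + ham_train)
print(df_train.head())  # shows the 'class' flag alongside the raw 'message' text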
Removing Words and Shuffling the Data
Here, based on what I was seeing in the data, we want to remove any extraneous text that doesn't mean anything, such as <p>, as well as the obvious markers in the provided data that indicate what type the message is, such as "Ham,<p>", which won't be in the real messages that we will be trying to classify. When we find these, we will simply replace them with an empty string.
words_to_remove = ['Ham,<p>', 'Spam,<p>', '<p>', '</p>', '\n']
def remove_words(input_line, key_words=words_to_remove):
    temp = input_line
    for word in key_words:
        temp = temp.replace(word, '')
    return temp
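For example (a quick illustration, not from the original article), the filter strips the label marker, the tags, and the newline in a single pass:
sample = 'Spam,<p>Mirthful and and pangs wrong.</p>\n'
print(remove_words(sample))  # Mirthful and and pangs wrong.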
Here, we apply the filtering above to our data frame and then shuffle the data. Shuffling is not strictly necessary for the data provided by CodeProject, since it is already separated into training and test sets, but if you were performing this process on another data set, you would always want to shuffle the data before splitting it into training and test sets to ensure that each set contains a roughly equal mix of examples (in this case, spam and ham). If the sets are unbalanced, they could easily introduce bias into your training/testing process.
def remove_words_and_shuffle(input_dataframe, input_random_state=7):
    input_dataframe['message'] = input_dataframe['message'].apply(remove_words)
    messages, classes = shuffle(input_dataframe['message'], input_dataframe['class'],
                                random_state=input_random_state)
    df_return = pd.DataFrame()
    df_return['class'] = classes
    df_return['message'] = messages
    return df_return
Here is our data frame after we clean it up:
Training and Testing Our Models
This is what it is all about - using the training data to train our machine learning model, then using the test data to determine the accuracy of the model and how well it performed.
def test_models(X_train_input_raw, y_train_input, X_test_input_raw, y_test_input, models_dict):
    return_trained_models = {}
    return_vectorizer = FeatureUnion([('tfidf_vect', TfidfVectorizer())])
    X_train = return_vectorizer.fit_transform(X_train_input_raw)
    X_test = return_vectorizer.transform(X_test_input_raw)
    for key in models_dict:
        model_name = key
        model = models_dict[key]
        t1 = time.time()
        model.fit(X_train, y_train_input)
        t2 = time.time()
        predicted_y = model.predict(X_test)
        t3 = time.time()
        output_accuracy(y_test_input, predicted_y, model_name, t2 - t1, t3 - t2)
        return_trained_models[model_name] = model
    return (return_trained_models, return_vectorizer)
There is a lot going on in this code so we will go through it line by line. First, let's look at the parameters:
- X_train_input_raw - the "raw" spam/ham messages that we will use to train the model
- y_train_input - the 0 or 1 indicating ham or spam for the X_train_input_raw parameter
- X_test_input_raw - the "raw" spam/ham messages that we will use to test the accuracy of the trained model
- y_test_input - the 0 or 1 indicating ham or spam for the X_test_input_raw parameter
return_trained_models = {} is a dictionary that will hold the models we have trained so they can be used later.
return_vectorizer = FeatureUnion([('tfidf_vect', TfidfVectorizer())]) sets up a TfidfVectorizer to be applied to the messages passed in. Essentially, we are turning a string of words (the message) into a vector (an array) of the counts of occurrences of those words.
Additionally, TF-IDF (Term Frequency-Inverse Document Frequency) applies weights based on the frequency with which a term is present in the source document:
The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes; 83% of text-based recommender systems in the domain of digital libraries use tf-idf. (source)
This means that terms which show up less frequently overall, but more frequently in a particular type of document, will carry more weight. For example, words like "free", "viagra", etc. don't show up very frequently in messages overall (all spam and ham messages combined) but do show up very frequently in spam messages alone, so these words will be weighted more heavily, indicating that a document containing them is likely spam.
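To see this weighting in action, here is a minimal sketch (the toy documents are invented for illustration) that runs TfidfVectorizer directly:
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ['free viagra offer free',
            'meeting notes attached',
            'free meeting today']
vect = TfidfVectorizer()
tfidf = vect.fit_transform(toy_docs)
print(vect.vocabulary_)          # word -> column index mapping
print(tfidf.toarray().round(2))  # one weighted row vector per document
Words that appear across many documents (like "free" here, present in two of the three) are discounted relative to words concentrated in one document (like "viagra").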
There is a very large set of parameters that can be set and tuned to improve the accuracy of a model - you can find details about these here.
Next, now that we have our vectorizer created, we will "train" it on the training messages and use it to transform our set of test messages into vectors:
X_train = return_vectorizer.fit_transform(X_train_input_raw)
X_test = return_vectorizer.transform(X_test_input_raw)
The final step is to loop through the dictionary of models that was passed in to train each model, use the model to predict the test data, and output the accuracy of each model.
Outputting the Results of Training Our Models
When we train our models, we will want to see the name of the model, how long it took to train the model, and how accurate the model is. This function helps with doing that:
def output_accuracy(actual_y, predicted_y, model_name, train_time, predict_time):
    print('Model Name: ' + model_name)
    print('Train time: ', round(train_time, 2))
    print('Predict time: ', round(predict_time, 2))
    print('Model Accuracy: {:.4f}'.format(accuracy_score(actual_y, predicted_y)))
    print('Model Precision: {:.4f}'.format(precision_score(actual_y, predicted_y)))
    print('')
    print(classification_report(actual_y, predicted_y, digits=4))
    print("=========================================================================")
Creating Our Dictionary of Models to Test
Here, we create the dictionary of models that we want to train and test the accuracy of. This is where you would add more models to test, remove poorly performing models, or change the parameters of models to determine which model will best fit your needs.
def create_models():
    models = {}
    models['LinearSVC'] = LinearSVC()
    models['LogisticRegression'] = LogisticRegression()
    models['RandomForestClassifier'] = RandomForestClassifier()
    models['DecisionTreeClassifier'] = DecisionTreeClassifier()
    return models
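For example (a hypothetical variation, not from the original article), you could register tuned alternatives alongside the defaults and compare them all in the same run:
models = create_models()
models['LinearSVC_C0.1'] = LinearSVC(C=0.1)  # stronger regularization than the default C=1.0
models['RandomForest_100'] = RandomForestClassifier(n_estimators=100)  # a larger forest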
Bringing It All Together
Here is where we bring all the steps together:
- Get the data and create the data frames.
- Clean and shuffle the data.
- Separate the data to the input (X) and output (y) for both the training and test sets.
- Create the models.
- Pass the models and data into the test_models() function to see their performance.
spam_train, ham_train, test_mix = get_data()

# build labeled DataFrames for the training and test sets
df_train = create_dataframe(spam_train + ham_train)
df_test = create_dataframe(test_mix)

df_train_cleaned = remove_words_and_shuffle(df_train)
df_test_cleaned = remove_words_and_shuffle(df_test)

X_train_raw = df_train_cleaned['message']
y_train = df_train_cleaned['class']
X_test_raw = df_test_cleaned['message']
y_test = df_test_cleaned['class']

models = create_models()

trained_models, fitted_vectorizer = test_models(X_train_raw, y_train, X_test_raw, y_test, models)
The Output
When we run this, here is the output:
Model Name: LinearSVC
Train time: 0.01
Predict time: 0.0
Model Accuracy: 1.0000
Model Precision: 1.0000
             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000        57
          1     1.0000    1.0000    1.0000        43

avg / total     1.0000    1.0000    1.0000       100
======================================================
Model Name: LogisticRegression
Train time: 0.01
Predict time: 0.0
Model Accuracy: 0.4300
Model Precision: 0.4300
             precision    recall  f1-score   support

          0     0.0000    0.0000    0.0000        57
          1     0.4300    1.0000    0.6014        43

avg / total     0.1849    0.4300    0.2586       100
======================================================
Model Name: DecisionTreeClassifier
Train time: 0.02
Predict time: 0.0
Model Accuracy: 0.9800
Model Precision: 0.9556
             precision    recall  f1-score   support

          0     1.0000    0.9649    0.9821        57
          1     0.9556    1.0000    0.9773        43

avg / total     0.9809    0.9800    0.9800       100
======================================================
Model Name: RandomForestClassifier
Train time: 0.02
Predict time: 0.0
Model Accuracy: 0.9800
Model Precision: 0.9556
             precision    recall  f1-score   support

          0     1.0000    0.9649    0.9821        57
          1     0.9556    1.0000    0.9773        43

avg / total     0.9809    0.9800    0.9800       100
======================================================
We can see how long it took to train the model, how long it took to predict the test data, as well as the accuracy, precision, and recall of each model. Some of these terms warrant further explanation:
- Accuracy - the ratio of correctly predicted observations to the total number of observations (for us, what percentage of spam/ham messages were correctly detected)
- Precision - the ratio of correctly predicted positive observations to the total predicted positive observations (of the messages that we identified as spam, how many were correctly identified as spam)
- Recall - the ratio of correctly predicted positive observations to all observations of the actual class (of all of the messages that are actually spam, how many did we correctly identify)
- F1 Score - the weighted average of precision and recall
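To make these definitions concrete, here is a small worked example (the ten labels are invented for illustration) using the same scikit-learn metric functions:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 4 actual spam, 6 actual ham
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # one spam missed, one ham wrongly flagged

print(accuracy_score(actual, predicted))   # 0.8  -> 8 of 10 messages labeled correctly
print(precision_score(actual, predicted))  # 0.75 -> 3 of the 4 predicted spam were truly spam
print(recall_score(actual, predicted))     # 0.75 -> 3 of the 4 actual spam were caught
print(f1_score(actual, predicted))         # 0.75 -> harmonic mean of precision and recall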
Yes, these are some tricky concepts to wrap your head around, and they are even trickier to explain. Those explanations were borrowed from http://blog.exsilio.com/, with my tying their relevance to our project. Please refer to that page for a more in-depth discussion of these topics.
Try a Model on Your Own Message
Finally, let's try our own messages to see if they are correctly identified as spam or ham.
ham = 'door beguiling cushions did. Evermore from raven from is beak shall name'
spam = 'The vexed times childe none native'
test_messages = [spam, ham]
transformed_test_messages = fitted_vectorizer.transform(test_messages)
trained_models['DecisionTreeClassifier'].predict(transformed_test_messages)
And the output is:
array([1, 0])
Which correctly identifies the spam and ham messages.
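If you want friendlier output (a small hypothetical addition), you can map the numeric classes back to their labels:
predictions = trained_models['DecisionTreeClassifier'].predict(transformed_test_messages)
print(['spam' if p == 1 else 'ham' for p in predictions])  # ['spam', 'ham']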
Conclusion
Machine learning, deep learning, and artificial intelligence are the future, and as software engineers, we need to understand and embrace the power these technologies offer, because we can leverage them to more effectively solve the problems that the companies and clients we work for bring to us.
I have a blog that is dedicated to helping software engineers understand and develop their skills in the areas of machine learning, deep learning, and artificial intelligence. If you felt you learned something from this article, feel free to stop by my blog at CognitiveCoder.com.
Thanks for reading all the way to the end.
History
- 3rd March, 2018 - Initial release
- 3rd March, 2018 - Fixed broken image links