
Create Your First Machine Learning Model to Filter Spam

Create a Spam Filter Using Machine Learning

Introduction

Machine learning provides us with the ability to use mathematics and statistical probabilities based on data to determine the outcome of our code. This allows us to create code that "evolves" over time, as its behavior is driven by changes in the data rather than by hard-coded values or values stored somewhere.

For example, a customer's usage of their credit card changes and evolves over time based on their purchasing habits, and the card company needs to continue to be able to identify fraudulent transactions. If a "threshold" is set in code or in a database, that value will periodically have to be updated, and determining what it should be would be prohibitively expensive/difficult for a large number of customers. Periodically training a machine learning model to identify fraudulent activity based on actual data is far more maintainable.

For this article, we will be using "supervised learning" to determine if a message is "spam" or "ham" (a spam email message or a non-spam email message). Supervised learning means that we have a set of data that contains messages which have already been identified as "spam" or "ham" and we will use this data to train a machine learning model to be able to identify new messages as spam or ham. This determination is based on the new messages' statistical similarity to the messages that we trained the model with.

Background

If you have a decent level of familiarity with programming and an interest in machine learning, you should be able to follow along with this tutorial. The data provided by CodeProject looks like this:

# Spam training data
Spam,<p>But could then once pomp to nor that glee glorious of deigned. The vexed times 
childe none native. To he vast now in to sore nor flow and most fabled. 
The few tis to loved vexed and all yet yea childe. Fulness consecrate of it before his 
a a a that.</p><p>Mirthful and and pangs wrong. Objects isle with partings ancient 
made was are. Childe and gild of all had to and ofttimes made soon from to long youth 
way condole sore.</p>
Spam,<p>His honeyed and land vile are so and native from ah to ah it like flash in not. 
That gild by in basked they lemans passed way who talethis forgot deigned nor friends 
his before strange. Found long little the. Talethis have soon of hellas had

Each line begins with a value indicating "Spam" or "Ham", followed by a <p> tag and then the contents of the message. Additionally, the file is split into training and testing sections (more on this below).
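To make the format concrete, here is a minimal sketch of pulling the label off one raw line; note that the article's own code below instead checks for the 'Spam,<p>' marker directly:

Python
# One hedged way to split a raw line into its label and body
line = 'Spam,<p>The vexed times childe none native.</p>'
label, body = line.split(',', 1)   # split on the first comma only
print(label)   # -> 'Spam'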

Importing Libraries

Here, as with many languages, we import the various libraries that we will need for our code. We'll go into the details of what we are using below:

Python
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.utils import shuffle
from sklearn.metrics import precision_score, classification_report, accuracy_score

import time

Loading and Parsing the Data

Yes, while being a Data Scientist may be the sexiest job of the 21st century, it requires a lot of time performing the not-so-sexy process of parsing/cleaning/understanding the data that you are looking at. For this project, probably 85% of my time was spent doing just that.

Python
def get_data():
    file_name = './SpamDetectionData.txt'
    with open(file_name, 'r') as rawdata:  # close the file automatically
        lines = rawdata.readlines()
    lines = lines[1:]                # get rid of "header"
    spam_train = lines[0:1000]       # first 1,000 lines are spam examples
    ham_train = lines[1002:2002]     # next 1,000 lines (after a separator) are ham
    test_mix = lines[2004:]          # the remainder is the mixed test set
    return (spam_train, ham_train, test_mix)

In the get_data() function, we get the training and test data out of the file provided by CodeProject. We read the raw data from the file and store it in an array. In the lines like:

Python
spam_train = lines[0:1000]
ham_train = lines[1002:2002]
test_mix = lines[2004:]

We are simply splitting parts of the array into separate arrays with meaningful names. More details on how and why we use the training and test data appear below.

Creating a Pandas DataFrame

For the vast majority of data/feature engineering that you will be doing in machine learning, you will be using Pandas as it provides a very powerful (though sometimes confusing) array of tools for working with data. Here, we are creating a DataFrame, which is easiest to think of as an in-memory "table" that contains rows and columns, where one column holds the contents of the spam/ham message and the other column holds a binary flag (or category in data science lingo) indicating if the message is "spam" (1) or "ham" (0).

Python
def create_dataframe(input_array):
    spam_indicator = 'Spam,<p>'
    # 1 if the line is marked as spam, 0 otherwise
    message_class = np.array([1 if spam_indicator in item else 0 for item in input_array])
    data = pd.DataFrame()
    data['class'] = message_class
    data['message'] = input_array
    return data

Here is what our data frame looks like before we clean the data in the next step, including the "class" column we just added.

[Image: DataFrame before the data is cleaned]
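If you want to reproduce this view yourself, here is a quick sketch, assuming the get_data() and create_dataframe() functions from above have already been defined:

Python
# A quick look at the raw training DataFrame
spam_train, ham_train, test_mix = get_data()
df_train = create_dataframe(spam_train + ham_train)
print(df_train.head())   # the 'class' column of 0/1 plus the raw message text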

Removing Words and Shuffling the Data

Here, based on what I was seeing in the data, we want to remove any extraneous text that either doesn't mean anything, such as <p>, or that is obviously metadata indicating the message type, such as "Ham,<p>", which won't be in the real messages we will be trying to classify. When we find these, we simply replace them with an empty string.

Python
words_to_remove = ['Ham,<p>', 'Spam,<p>', '<p>', '</p>', '\n']

def remove_words(input_line, key_words=words_to_remove):
    temp = input_line
    for word in key_words:
        temp = temp.replace(word, '')
    return temp
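A quick sanity check of remove_words() on a line built from the sample data shows the markers being stripped:

Python
# The label, tags, and trailing newline are all removed
print(remove_words('Spam,<p>The vexed times childe none native.</p>\n'))
# -> 'The vexed times childe none native.'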

Here, we apply the filtering above to our data frame and then shuffle the data. Shuffling is not strictly necessary for the data provided by CodeProject, as it is already separated into training and test sets, but if you were performing this process on another data set, you would always want to shuffle the data before splitting it into training and test sets to ensure a roughly equal mix of examples (in this case, spam and ham) in each set. If the sets are unbalanced, they can easily bias your training/testing process.

Python
def remove_words_and_shuffle(input_dataframe, input_random_state=7):
    input_dataframe['message'] = input_dataframe['message'].apply(remove_words)
    messages, classes = shuffle(input_dataframe['message'], input_dataframe['class'],
                                random_state=input_random_state)
    df_return = pd.DataFrame()
    df_return['class'] = classes
    df_return['message'] = messages
    return df_return

Here is our data frame after we clean it up:

[Image: DataFrame after the data is cleaned]

Training and Testing Our Models

This is what it is all about - using the training data to train our machine learning model, then using the test data to determine the accuracy of the model and how well it performed.

Python
def test_models(X_train_input_raw, y_train_input, X_test_input_raw, y_test_input, models_dict):

    return_trained_models = {}

    return_vectorizer = FeatureUnion([('tfidf_vect', TfidfVectorizer())])

    X_train = return_vectorizer.fit_transform(X_train_input_raw)
    X_test = return_vectorizer.transform(X_test_input_raw)

    for key in models_dict:
        model_name = key
        model = models_dict[key]
        t1 = time.time()
        model.fit(X_train, y_train_input)
        t2 = time.time()
        predicted_y = model.predict(X_test)
        t3 = time.time()

        output_accuracy(y_test_input, predicted_y, model_name, t2 - t1, t3 - t2)
        return_trained_models[model_name] = model

    return (return_trained_models, return_vectorizer)

There is a lot going on in this code so we will go through it line by line. First, let's look at the parameters:

  • X_train_input_raw - the "raw" spam/ham messages that we will use to train the model
  • y_train_input - the 0 or 1 indicating ham or spam for the X_train_input_raw parameter
  • X_test_input_raw - the "raw" spam/ham messages that we will use to test the accuracy of the trained model
  • y_test_input - the 0 or 1 indicating ham or spam for the X_test_input_raw parameter

return_trained_models = {} is a dictionary that will hold the models we have trained so they can be used later.

return_vectorizer = FeatureUnion([('tfidf_vect', TfidfVectorizer())]) sets up a TfidfVectorizer to be applied to the passed-in messages. Essentially, what we are doing is turning each string of words (the message) into a numeric vector (an array) based on how often those words occur.

Additionally, TF-IDF (Term Frequency-Inverse Document Frequency) applies weights based on the frequency with which a term is present in the source document.

 
The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes; 83% of text-based recommender systems in the domain of digital libraries use tf-idf. (source)

This means that terms which show up less frequently overall, but more frequently in a particular type of document, will carry more weight. For example, words like "free" and "viagra" don't show up very frequently in messages overall (all spam and ham messages combined) but do show up very frequently in spam messages alone, so these words will be weighted more heavily as indicators that a message is spam.
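To see what the vectorizer actually produces, here is a tiny, self-contained sketch on made-up sentences (not the article's data):

Python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['free money free offer',      # "spam-like" text
        'meeting notes for monday',   # "ham-like" text
        'free tickets to the show']
vect = TfidfVectorizer()
tfidf = vect.fit_transform(docs)      # learn the vocabulary and the idf weights
print(sorted(vect.vocabulary_))       # the learned vocabulary
print(tfidf.toarray().round(2))       # one weighted row per document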

There is a very large set of parameters that can be set and tuned to improve the accuracy of a model - you can find details about these here.

Next, now that we have our vectorizer created, we will "train" it on the training messages and use it to transform our set of test messages into vectors:

Python
X_train = return_vectorizer.fit_transform(X_train_input_raw)    

X_test = return_vectorizer.transform(X_test_input_raw)

The final step is to loop through the dictionary of models that was passed in to train each model, use the model to predict the test data, and output the accuracy of each model.

Outputting the Results of Training Our Models

When we train our models, we will want to see the name of the model, how long it took to train the model, and how accurate the model is. This function helps with doing that:

Python
def output_accuracy(actual_y, predicted_y, model_name, train_time, predict_time):
    print('Model Name: ' + model_name)
    print('Train time: ', round(train_time, 2))
    print('Predict time: ', round(predict_time, 2))
    print('Model Accuracy: {:.4f}'.format(accuracy_score(actual_y, predicted_y)))
    print('Model Precision: {:.4f}'.format(precision_score(actual_y, predicted_y)))
    print('')
    print(classification_report(actual_y, predicted_y, digits=4))
    print("=========================================================================")

Creating Our Dictionary of Models to Test

Here, we create the dictionary of models that we want to train and test the accuracy of. This is where you would add more models to test, remove poorly performing models, or change the parameters of models to determine which model will best fit your needs.

Python
def create_models():
    models = {}
    models['LinearSVC'] = LinearSVC()
    models['LogisticRegression'] = LogisticRegression()
    models['RandomForestClassifier'] = RandomForestClassifier()
    models['DecisionTreeClassifier'] = DecisionTreeClassifier()
    return models
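For example, here is a hedged sketch of how you might register variants of the same estimator under different keys to compare hyperparameter settings (the key names here are just labels I chose):

Python
models = create_models()
# smaller C = stronger regularization for LinearSVC
models['LinearSVC_C0.1'] = LinearSVC(C=0.1)
# more trees usually gives a more stable random forest
models['RandomForestClassifier_100'] = RandomForestClassifier(n_estimators=100)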

Bringing It All Together

Here is where we bring all the steps together:

  1. Get the data and create the data frames.
  2. Clean and shuffle the data.
  3. Separate the data into the input (X) and output (y) for both the training and test sets.
  4. Create the models.
  5. Pass the models and data into the test_models() function to see their performance.
Python
spam_train, ham_train, test_mix = get_data()

# create the data frames from the raw lines (spam and ham combined for training)
df_train = create_dataframe(spam_train + ham_train)
df_test = create_dataframe(test_mix)

df_train_cleaned = remove_words_and_shuffle(df_train)
df_test_cleaned = remove_words_and_shuffle(df_test)

X_train_raw = df_train_cleaned['message']
y_train = df_train_cleaned['class']

X_test_raw = df_test_cleaned['message']
y_test = df_test_cleaned['class']

models = create_models()

trained_models, fitted_vectorizer = test_models(
    X_train_raw, y_train, X_test_raw, y_test, models)

The Output

When we run this, here is the output:

Model Name: LinearSVC
Train time:  0.01
Predict time:  0.0
Model Accuracy: 1.0000
Model Precision: 1.0000

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000        57
          1     1.0000    1.0000    1.0000        43

avg / total     1.0000    1.0000    1.0000       100

======================================================
Model Name: LogisticRegression
Train time:  0.01
Predict time:  0.0
Model Accuracy: 0.4300
Model Precision: 0.4300

             precision    recall  f1-score   support

          0     0.0000    0.0000    0.0000        57
          1     0.4300    1.0000    0.6014        43

avg / total     0.1849    0.4300    0.2586       100

======================================================
Model Name: DecisionTreeClassifier
Train time:  0.02
Predict time:  0.0
Model Accuracy: 0.9800
Model Precision: 0.9556

             precision    recall  f1-score   support

          0     1.0000    0.9649    0.9821        57
          1     0.9556    1.0000    0.9773        43

avg / total     0.9809    0.9800    0.9800       100

======================================================
Model Name: RandomForestClassifier
Train time:  0.02
Predict time:  0.0
Model Accuracy: 0.9800
Model Precision: 0.9556

             precision    recall  f1-score   support

          0     1.0000    0.9649    0.9821        57
          1     0.9556    1.0000    0.9773        43

avg / total     0.9809    0.9800    0.9800       100

======================================================

We can see how long it took to train the model, how long it took to predict the test data, as well as the accuracy, precision, and recall of each model. Some of these terms warrant further explanation:

  • Accuracy - the ratio of correctly predicted observations to the total number of observations (for us, what percentage of spam/ham messages were correctly detected)
  • Precision - the ratio of correctly predicted positive observations to the total predicted positive observations (of the messages that we identified as spam, how many were correctly identified as spam)
  • Recall - the ratio of correctly predicted positive observations to all observations of the actual class (of all of the messages that are actually spam, how many did we correctly identify)
  • F1 Score - the harmonic mean of precision and recall (recomputed by hand in the sketch after this list)
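To make these numbers concrete, here is a quick sketch that recomputes the DecisionTreeClassifier's spam-class metrics from the report above (reading off the report: all 43 spam messages were caught, and 2 ham messages were wrongly flagged as spam):

Python
tp, fp, fn = 43, 2, 0         # true positives, false positives, false negatives
precision = tp / (tp + fp)    # 43 / 45 = 0.9556
recall = tp / (tp + fn)       # 43 / 43 = 1.0000
f1 = 2 * precision * recall / (precision + recall)  # 0.9773
print(round(precision, 4), round(recall, 4), round(f1, 4))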

Yes, these are some tricky concepts to wrap your head around, and they are even trickier to explain. Those explanations were borrowed from http://blog.exsilio.com/, along with my tying their relevance to our project. Please refer to that page as it provides a more in-depth discussion of these topics.

Try a Model on Your Own Message

Finally, let's try our own messages to see if they are correctly identified as spam or ham.

Python
#from the sample ham and spam
ham = 'door beguiling cushions did. Evermore from raven from is beak shall name'
spam = 'The vexed times childe none native'
test_messages = [spam, ham]
transformed_test_messages = fitted_vectorizer.transform(test_messages)
trained_models['DecisionTreeClassifier'].predict(transformed_test_messages) 

And the output is:

array([1, 0])

This correctly identifies the first message as spam (1) and the second as ham (0).
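If you plan to classify individual messages often, a small convenience wrapper is handy. This is a hedged sketch (classify_message is my own name, not from the article):

Python
def classify_message(message, model, vectorizer):
    # vectorize a single message and map the 0/1 prediction to a readable label
    label = model.predict(vectorizer.transform([message]))[0]
    return 'spam' if label == 1 else 'ham'

print(classify_message(spam, trained_models['DecisionTreeClassifier'], fitted_vectorizer))
# -> 'spam'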

Conclusion

Machine learning, deep learning, and artificial intelligence are the future, and we as software engineers need to understand and embrace the power these technologies offer, as we can leverage them to more effectively solve the problems that the companies and clients we work for present to us.

I have a blog that is dedicated to helping software engineers understand and develop their skills in the areas of machine learning, deep learning, and artificial intelligence. If you felt you learned something from this article, feel free to stop by my blog at CognitiveCoder.com.

Thanks for reading all the way to the end.

History

  • 3rd March, 2018 - Initial release
  • 3rd March, 2018 - Fixed broken image links

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)