
Use Machine Learning to Determine the Programming Language of Text

Use machine learning to determine the programming language of text

Introduction

A lot of the code used here to determine the programming language of a string of text has already been presented and discussed quite thoroughly in my previous article: Create Your First Machine Learning Model to Filter Spam. In that article, we built a number of functions that are highly reusable in your machine learning pipeline.

In this article, we will focus mainly on those items that are unique to this problem.

Importing the Necessary Libraries

The normal Python import statements:

Python
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.utils import shuffle
from sklearn.metrics import precision_score, classification_report, accuracy_score

from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder

import re
import time

Retrieving and Parsing Data

Most of my time on this challenge was spent figuring out how to parse the data effectively: extracting the name of the language from each text segment and then removing that information from the text so that it doesn't taint our training and test datasets.

Here are two examples of the text segments (each spans multiple lines and contains line breaks):

HTML
<pre lang="Swift">
@objc func handleTap(sender: UITapGestureRecognizer) {
    if let tappedSceneView = sender.view as? ARSCNView {
        let tapLocationInView = sender.location(in: tappedSceneView)
        let planeHitTest = tappedSceneView.hitTest(tapLocationInView,
            types: .existingPlaneUsingExtent)
        if !planeHitTest.isEmpty {
            addFurniture(hitTest: planeHitTest)
        }
    }
}</pre>

<pre lang="JavaScript">
var my_dataset = [
   {
       id: "1",
       text: "Chairman & CEO",
       title: "Henry Bennett"
   },
   {
       id: "2",
       text: "Manager",
       title: "Mildred Kim"
   },
   {
       id: "3",
       text: "Technical Director",
       title: "Jerry Wagner"
   },
   { id: "1-2", from: "1", to: "2", type: "line" },
   { id: "1-3", from: "1", to: "3", type: "line" }
];</pre>

The tricky part was getting the regular expression to return the data within the <pre lang="...">...</pre> tags, and then creating another regular expression to return just the "lang" portion of the "pre" tag.

It's not pretty and I am sure that it can be optimized, but it works:

Python
def get_data():
    file_name = './LanguageSamples.txt'
    with open(file_name, 'r') as rawdata:
        lines = rawdata.readlines()
    return lines

def clean_data(input_lines):
    #find matches for all data within the pre tags
    all_found = re.findall(r'<pre[\s\S]*?<\/pre>', input_lines, re.MULTILINE)
    
    #clean the string of escaped angle brackets, closing pre tags and newlines
    clean_string = lambda x: (x.replace('&lt;', '<').replace('&gt;', '>')
                               .replace('</pre>', '').replace('\n', ''))
    all_found = [clean_string(item) for item in all_found]
    
    #get the language for all of the pre tags
    get_language = lambda x: re.findall(r'<pre lang="(.*?)">', x, re.MULTILINE)[0]
    lang_items = [get_language(item) for item in all_found]
    
    #remove all of the pre tags that contain the language
    remove_lang = lambda x: re.sub(r'<pre lang="(.*?)">', "", x)
    all_found = [remove_lang(item) for item in all_found]
    
    #return the text between the pre tags and their corresponding language
    return (all_found, lang_items)
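
As a quick, hypothetical illustration (not part of the original listing), here is what clean_data returns for a tiny made-up sample string in the same <pre lang="..."> format as the segments shown above; the real LanguageSamples.txt is much larger, but the structure is the same:

Python
# A small made-up sample in the same format as LanguageSamples.txt
sample = '<pre lang="Python">print("hello")</pre>\n<pre lang="C#">Console.WriteLine("hi");</pre>'

texts, langs = clean_data(sample)
print(langs)   # ['Python', 'C#']
print(texts)   # ['print("hello")', 'Console.WriteLine("hi");']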

Create the Pandas DataFrame

Here we get the data, create a DataFrame, and populate it with the cleaned text and the corresponding languages.

Python
all_samples = ''.join(get_data())
cleaned_data, languages = clean_data(all_samples)

df = pd.DataFrame()
df['lang_text'] = languages
df['data'] = cleaned_data

Here is what our DataFrame looks like:

Initial DataFrame
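
Since the screenshot isn't reproduced here, you can get a quick look at the same information by printing the first few rows (a trivial check, not part of the original listing):

Python
# Inspect the first few rows: one row per code sample,
# with the raw language label and the cleaned text.
print(df.head())
print(df.shape)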

Creating a Categorical Column

The next thing we need to do is turn our "lang_text" column into a numeric column, since that is what many machine learning models expect for the "y" value (the output they are trying to predict). To do this, we will use a LabelEncoder to transform the "lang_text" column into an encoded categorical column.

Python
lb_enc = LabelEncoder()
df['language'] = lb_enc.fit_transform(df['lang_text'])  

Now our DataFrame looks like this:

DataFrame with new column

We can see how the column was encoded by running this:

Python
lb_enc.classes_

Which displays this (the position in the array matches the integer value in the new "language" categorical column):

Python
array(['ASM', 'ASP.NET', 'Angular', 'C#', 'C++', 'CSS', 'Delphi', 'HTML',
       'Java', 'JavaScript', 'Javascript', 'ObjectiveC', 'PERL', 'PHP',
       'Pascal', 'PowerShell', 'Powershell', 'Python', 'Razor', 'React',
       'Ruby', 'SQL', 'Scala', 'Swift', 'TypeScript', 'VB.NET', 'XML'], dtype=object)
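
The fitted encoder can also map encoded integers back to language names, which is handy when reading predictions later. Here is a small illustration of the LabelEncoder API (the expected output is taken from the classes_ array above):

Python
# Map encoded values back to their language names
# (per the classes_ array above: 3 -> 'C#', 9 -> 'JavaScript', 17 -> 'Python').
print(lb_enc.inverse_transform([3, 9, 17]))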

Boilerplate Code

As mentioned in the introduction, a lot of the code for this project was already discussed in my previous article Create Your First Machine Learning Model to Filter Spam, which I recommend reading if you want more detail on the code below.

In summary, here are the next steps:

  1. Declare the function for outputting the results of the training
  2. Declare the function for training and testing the models
  3. Declare the function for creating the models to test
  4. Shuffle the data
  5. Split the training and test data
  6. Pass the data and the models into the training and testing function and see the results:
Python
def output_accuracy(actual_y, predicted_y, model_name, train_time, predict_time):
    print('Model Name: ' + model_name)
    print('Train time: ', round(train_time, 2))
    print('Predict time: ', round(predict_time, 2))
    print('Model Accuracy: {:.4f}'.format(accuracy_score(actual_y, predicted_y)))
    print('')
    print(classification_report(actual_y, predicted_y, digits=4))
    print("=======================================================")

def test_models(X_train_input_raw, y_train_input, X_test_input_raw, y_test_input, models_dict):

    return_trained_models = {}
    
    return_vectorizer = FeatureUnion([('tfidf_vect', TfidfVectorizer())])
    
    X_train = return_vectorizer.fit_transform(X_train_input_raw)
    X_test = return_vectorizer.transform(X_test_input_raw)
    
    for key in models_dict:
        model_name = key
        model = models_dict[key]
        t1 = time.time()
        model.fit(X_train, y_train_input)
        t2 = time.time()
        predicted_y = model.predict(X_test)
        t3 = time.time()
        
        output_accuracy(y_test_input, predicted_y, model_name, t2 - t1, t3 - t2)        
        return_trained_models[model_name] = model
        
    return (return_trained_models, return_vectorizer)

def create_models():
    models = {}
    models['LinearSVC'] = LinearSVC()
    models['LogisticRegression'] = LogisticRegression()
    models['RandomForestClassifier'] = RandomForestClassifier()
    models['DecisionTreeClassifier'] = DecisionTreeClassifier()
    models['MultinomialNB'] = MultinomialNB()
    return models

X_input, y_input = shuffle(df['data'], df['language'], random_state=7)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_input, y_input, test_size=0.7)

models = create_models()
trained_models, fitted_vectorizer = test_models(X_train_raw, y_train, X_test_raw, y_test, models) 

And this is the result:

Model Name: LinearSVC
Train time:  0.99
Predict time:  0.0
Model Accuracy: 0.9262

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     1.0000    1.0000    1.0000         2
          2     1.0000    1.0000    1.0000         1
          3     0.8968    1.0000    0.9456       339
          4     0.9695    0.8527    0.9074       224
          5     0.9032    1.0000    0.9492        28
          6     0.7000    1.0000    0.8235         7
          7     0.9032    0.7568    0.8235        74
          8     0.7778    0.5833    0.6667        36
          9     0.9613    0.9255    0.9430       161
         10     1.0000    0.5000    0.6667         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     1.0000    1.0000    1.0000         2
         14     1.0000    0.4545    0.6250        11
         15     1.0000    1.0000    1.0000         6
         16     1.0000    0.4000    0.5714         5
         17     0.9589    0.9589    0.9589        73
         18     1.0000    1.0000    1.0000         8
         19     0.7600    0.9268    0.8352        41
         20     0.1818    1.0000    0.3077         2
         21     1.0000    1.0000    1.0000       137
         22     1.0000    0.8750    0.9333        24
         23     1.0000    1.0000    1.0000         7
         24     1.0000    1.0000    1.0000        25
         25     0.9571    0.9571    0.9571        70
         26     0.9211    0.9722    0.9459       108

avg / total     0.9339    0.9262    0.9255      1422

=========================================================================
Model Name: DecisionTreeClassifier
Train time:  0.13
Predict time:  0.0
Model Accuracy: 0.9388

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     1.0000    1.0000    1.0000         2
          2     1.0000    1.0000    1.0000         1
          3     0.9123    0.9204    0.9163       339
          4     0.8408    0.9196    0.8785       224
          5     1.0000    0.8929    0.9434        28
          6     1.0000    1.0000    1.0000         7
          7     1.0000    0.9595    0.9793        74
          8     0.9091    0.8333    0.8696        36
          9     0.9817    1.0000    0.9908       161
         10     1.0000    0.5000    0.6667         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     1.0000    1.0000    1.0000         2
         14     1.0000    0.4545    0.6250        11
         15     1.0000    0.5000    0.6667         6
         16     1.0000    0.4000    0.5714         5
         17     1.0000    1.0000    1.0000        73
         18     1.0000    1.0000    1.0000         8
         19     0.9268    0.9268    0.9268        41
         20     1.0000    1.0000    1.0000         2
         21     1.0000    1.0000    1.0000       137
         22     1.0000    0.7500    0.8571        24
         23     1.0000    1.0000    1.0000         7
         24     0.6786    0.7600    0.7170        25
         25     1.0000    1.0000    1.0000        70
         26     1.0000    1.0000    1.0000       108

avg / total     0.9419    0.9388    0.9376      1422

=========================================================================
Model Name: LogisticRegression
Train time:  0.71
Predict time:  0.01
Model Accuracy: 0.9304

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     1.0000    1.0000    1.0000         2
          2     1.0000    1.0000    1.0000         1
          3     0.9040    1.0000    0.9496       339
          4     0.9569    0.8929    0.9238       224
          5     0.9032    1.0000    0.9492        28
          6     0.7000    1.0000    0.8235         7
          7     0.8929    0.6757    0.7692        74
          8     0.8750    0.5833    0.7000        36
          9     0.9281    0.9627    0.9451       161
         10     1.0000    0.5000    0.6667         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     1.0000    1.0000    1.0000         2
         14     1.0000    0.4545    0.6250        11
         15     1.0000    1.0000    1.0000         6
         16     1.0000    0.4000    0.5714         5
         17     0.9589    0.9589    0.9589        73
         18     1.0000    1.0000    1.0000         8
         19     0.7600    0.9268    0.8352        41
         20     1.0000    1.0000    1.0000         2
         21     1.0000    0.9781    0.9889       137
         22     1.0000    0.8750    0.9333        24
         23     1.0000    1.0000    1.0000         7
         24     1.0000    1.0000    1.0000        25
         25     0.9571    0.9571    0.9571        70
         26     0.9211    0.9722    0.9459       108

avg / total     0.9329    0.9304    0.9272      1422

=========================================================================
Model Name: RandomForestClassifier
Train time:  0.04
Predict time:  0.01
Model Accuracy: 0.9374

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     1.0000    1.0000    1.0000         2
          2     1.0000    1.0000    1.0000         1
          3     0.8760    1.0000    0.9339       339
          4     0.9452    0.9241    0.9345       224
          5     0.9032    1.0000    0.9492        28
          6     0.7000    1.0000    0.8235         7
          7     1.0000    0.8378    0.9118        74
          8     1.0000    0.5278    0.6909        36
          9     0.9527    1.0000    0.9758       161
         10     1.0000    0.1667    0.2857         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     1.0000    1.0000    1.0000         2
         14     1.0000    0.4545    0.6250        11
         15     1.0000    0.5000    0.6667         6
         16     1.0000    0.4000    0.5714         5
         17     1.0000    1.0000    1.0000        73
         18     1.0000    0.6250    0.7692         8
         19     0.9268    0.9268    0.9268        41
         20     0.0000    0.0000    0.0000         2
         21     1.0000    1.0000    1.0000       137
         22     1.0000    1.0000    1.0000        24
         23     1.0000    0.5714    0.7273         7
         24     1.0000    1.0000    1.0000        25
         25     1.0000    0.9571    0.9781        70
         26     0.8889    0.8889    0.8889       108

avg / total     0.9411    0.9374    0.9324      1422

=========================================================================
Model Name: MultinomialNB
Train time:  0.01
Predict time:  0.0
Model Accuracy: 0.8776

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         6
          1     0.0000    0.0000    0.0000         2
          2     0.0000    0.0000    0.0000         1
          3     0.8380    0.9764    0.9019       339
          4     1.0000    0.8750    0.9333       224
          5     1.0000    1.0000    1.0000        28
          6     1.0000    1.0000    1.0000         7
          7     0.6628    0.7703    0.7125        74
          8     1.0000    0.5833    0.7368        36
          9     0.8952    0.6894    0.7789       161
         10     1.0000    0.3333    0.5000         6
         11     1.0000    1.0000    1.0000        14
         12     1.0000    1.0000    1.0000         5
         13     0.0000    0.0000    0.0000         2
         14     1.0000    0.7273    0.8421        11
         15     1.0000    1.0000    1.0000         6
         16     1.0000    0.4000    0.5714         5
         17     1.0000    0.9178    0.9571        73
         18     0.8000    1.0000    0.8889         8
         19     0.4607    1.0000    0.6308        41
         20     0.0000    0.0000    0.0000         2
         21     1.0000    1.0000    1.0000       137
         22     1.0000    1.0000    1.0000        24
         23     1.0000    1.0000    1.0000         7
         24     0.8462    0.8800    0.8627        25
         25     0.8642    1.0000    0.9272        70
         26     0.9630    0.7222    0.8254       108

avg / total     0.8982    0.8776    0.8770      1422

=========================================================================

Again, for details on this code and the meaning of "accuracy", "precision", "recall", and "f1-score", please see my previous article - Create Your First Machine Learning Model to Filter Spam.
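
Although the article stops at evaluating the models, the objects returned by test_models can be used to classify new text. Here is a minimal sketch, assuming you want to score a previously unseen snippet with the trained LinearSVC model; the snippet itself is made up:

Python
# Classify a new, unseen snippet with one of the trained models.
new_snippet = ['def add(a, b):\n    return a + b']

# Reuse the TF-IDF vectorizer that was fitted on the training data.
new_features = fitted_vectorizer.transform(new_snippet)

# Predict the encoded language and map it back to its name.
predicted = trained_models['LinearSVC'].predict(new_features)
print(lb_enc.inverse_transform(predicted))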

Conclusion

Machine learning, deep learning, and artificial intelligence are the future, and as software engineers we need to understand and embrace the power these technologies offer, so that we can more effectively solve the problems that the companies and clients we work for bring to us.

I have a blog that is dedicated to helping software engineers understand and develop their skills in the areas of machine learning, deep learning, and artificial intelligence. If you felt you learned something from this article, feel free to stop by my blog at CognitiveCoder.com.

Thanks for reading all the way to the bottom.

History

  • 3/3/2018 - Initial version
  • 3/3/2018 - Fixed broken image links

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)