Introduction
This article is my entry for the language detector part of CodeProject's Machine Learning and Artificial Intelligence Challenge[^]. The goal of the challenge was to train a model to recognize programming languages, based on a provided training dataset with 677 code samples.
I've used C#. The solution has a LanguageRecognition.Core project, which is a library with the machine learning code, and a LanguageRecognition project, which is a console application that tests the code. The project has SharpLearning[^] (more specifically, SharpLearning.Neural) as a dependency.
The Algorithm
I decided to go with a neural network[^] to train my model. A neural network takes a vector[^] of floating-point numbers as input: the features of the object we're trying to classify. This vector is also known as the input layer. A unit of a layer is also known as a neuron. A neural network also has an output layer. In the case of a network that's trained for classification (such as the one for this project), the output layer will have as many elements as there are classification categories, where the values indicate where the network would classify a given input. (For example, the result of a classification with 3 categories could be [0.15 0.87 0.23], indicating that the network preferred the second category.) Between the input layer and the output layer, you can also have one or more hidden layers, with a number of units that you can choose.
How do you get from one layer to the next? A matrix multiplication[^] is performed with the first layer and a weight matrix, that result goes through an activation function[^], and then we have the values of the next layer. (For the network in this article, the rectifier[^] is used, because that's what SharpLearning uses.) That layer is then used to calculate the values for the next layer, and so on. For the last layer, we also use the softmax function[^] and not just an activation function (one important difference is that an activation function works independently on each unit of a layer, whereas the softmax function has to be applied to the whole array of values). 'Between' every two layers, there is a different weight matrix. It's really the values of these weight matrices that decide what the output of the network is going to look like. So when a neural network is trained, the values of these weight matrices are adjusted so the actual outputs match the expected outputs better. SharpLearning uses gradient descent[^] for this (more specifically, mini-batch gradient descent[^]).
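To make that forward pass concrete, here is a minimal sketch of how one layer's values could be computed from the previous layer's values, and how softmax turns the final layer into per-category scores. This is plain C# for illustration only (bias terms omitted), not SharpLearning's actual implementation:
using System;
using System.Linq;

static class ForwardPassSketch
{
    // values of the next layer = rectifier(weights * previous layer)
    internal static double[] NextLayer(double[,] weights, double[] previous)
    {
        double[] next = new double[weights.GetLength(0)];
        for (int i = 0; i < next.Length; i++)
        {
            double sum = 0;
            for (int j = 0; j < previous.Length; j++)
            {
                sum += weights[i, j] * previous[j];
            }
            next[i] = Math.Max(0, sum); // the rectifier: max(0, x)
        }
        return next;
    }

    // Softmax works on the whole layer at once: the results sum to 1,
    // so they can be read as per-category scores.
    internal static double[] Softmax(double[] layer)
    {
        double[] exp = layer.Select(Math.Exp).ToArray();
        double total = exp.Sum();
        return exp.Select(x => x / total).ToArray();
    }
}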
I am not going to go in depth about the details and mathematics of neural networks, because SharpLearning takes care of that. I'm going to focus on how to apply it to recognizing programming languages. If you're interested in learning more, there is plenty of material available; the links in the previous paragraph can be used as a start.
I mentioned that a neural network takes a floating-point vector, the features, as input. What are those going to be here? For this challenge, the number of features (and the features themselves) cannot be pre-defined, because that would require assumptions about how many languages, and which languages, we have to be able to classify. We don't want to make those assumptions; instead, we're going to derive the features from the code samples that we use for training. Deriving the features is the first step of our training process.
First, the features that appear to be significant for each language are derived separately. I decided to derive three types of features: the most common symbols used in the code, the most common words, and the most common combinations of two words. Those features seemed most important to me. For example, in HTML, the < and > symbols are significant, and so are keywords such as body, table and div. The keyword import would be significant for both Java and Python, and that's where the combinations help: combinations like import java would be significant for Java, and combinations like import os would be significant for Python.
After having derived those features per language, we combine them: we want to tell our neural network about the presence (or absence) of all features that could point to a specific language. The total number of input neurons will be the sum of all symbols, keywords and combinations selected for each language (duplicates are filtered out, of course; we don't need multiple input neurons for the presence of the keyword import, for example). The number of output neurons will be the number of languages that are present in our training dataset.
Let me clarify that with an example. Imagine there were only 3 languages in the training set: C#, Python and JavaScript. For each of these languages, the 10 most common symbols, 20 most common words, and 30 most common combinations are selected. That's 60 features per language, or 180 for the three languages combined. However, most of the symbols and some of the keywords/combinations will be duplicates. For the sake of this example, let's say that there are 11 unique symbols overall, 54 unique words and 87 unique combinations; then our neural network will take 11 + 54 + 87 = 152 values as input. Each input value corresponds to one symbol/word/combination, and the value will be the number of occurrences of that symbol/word/combination in an arbitrary piece of code.
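The de-duplication itself is just a set union. As a toy illustration (with made-up word lists; this mirrors what the TrainingSet class will do later with LINQ's Distinct):
string[] pythonWords = { "import", "def", "print" };
string[] javaWords = { "import", "public", "class" };
// Shared features such as "import" get a single input neuron.
string[] features = pythonWords.Concat(javaWords).Distinct().ToArray();
// features: import, def, print, public, class (5 entries, not 6)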
What about the hidden layers, the ones between the input and the output layer? I went with 4 hidden layers: if S is the sum of all symbols, keywords, combinations, and possible output languages, then the hidden layers have S / 2, S / 3, S / 4 and S / 5 units respectively. Why those numbers? Because those gave me some of the best results when testing my model - there isn't much more to it. Using S units in all four of those layers gave comparable results (perhaps even slightly better on average), but the training was much slower that way.
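Plugging in the numbers from the earlier example (152 input features and 3 languages), the sizing works out as follows:
int S = 152 + 3; // S = 155
int[] hiddenLayerSizes = { S / 2, S / 3, S / 4, S / 5 }; // 77, 51, 38, 31 units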
After having selected the features to use, it's time for the actual training. For the neural network maths, I'm using the SharpLearning library. For each code sample, the previously selected symbols/words/combinations are counted and used as neural network inputs. All to-be-recognized languages get an index, and those indexes will be passed to SharpLearning as the outputs for the training data.
When the training is done, we have a model that can recognize languages for code samples. To predict a language, the inputted code sample is transformed into an input vector exactly like the pre-processing for the training samples (i.e., counting of certain symbols, words, and combinations), and SharpLearning takes care of the maths to return the index of the predicted programming language.
The Implementation
CharExtensions: Defining 'symbol'
In the previous section, I said we'd select the most common symbols for all given languages as a part of the neural network features. Let's first define what a "symbol" means in this context. The following seems like a sensible definition to me: a char is a symbol if it's not a letter, not a digit, not whitespace, and not an underscore (because underscores are perfectly valid in variable names). Translating that into code:
static class CharExtensions
{
    // A char counts as a programming symbol when it's not a letter,
    // a digit, whitespace, or an underscore.
    internal static bool IsProgrammingSymbol(char x)
    {
        return !char.IsLetterOrDigit(x) && !char.IsWhiteSpace(x) && x != '_';
    }
}
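A few illustrative calls (from within the assembly, since the method is internal):
CharExtensions.IsProgrammingSymbol('{'); // true
CharExtensions.IsProgrammingSymbol(';'); // true
CharExtensions.IsProgrammingSymbol('a'); // false: letter
CharExtensions.IsProgrammingSymbol('7'); // false: digit
CharExtensions.IsProgrammingSymbol('_'); // false: valid in identifiers
CharExtensions.IsProgrammingSymbol(' '); // false: whitespace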
LanguageTrainingSet: Deriving Features Per Language
Next, we'll work on the classes that will derive the features from the given code samples. As I previously said, we'll first do this per language, then combine the features. The LanguageTrainingSet class takes care of the former and also holds all training samples for one language. This class has the following fields and properties to keep track of the samples and the symbol/keyword/combination counts:
List<string> samples = new List<string>();
public List<string> Samples { get => samples; }
Dictionary<char, int> symbolCounters = new Dictionary<char, int>();
Dictionary<string, int> keywordCounters = new Dictionary<string, int>();
Dictionary<string, int> wordCombinationCounters = new Dictionary<string, int>();
These collections will be filled when a new training sample is added to the training set. That's what the AddSample method is for:
public void AddSample(string code)
{
    code = code.ToLowerInvariant();
    samples.Add(code);

    // Count all programming symbols in the sample.
    var symbols = code.Where(CharExtensions.IsProgrammingSymbol);
    foreach (char symbol in symbols)
    {
        if (!symbolCounters.ContainsKey(symbol))
        {
            symbolCounters.Add(symbol, 0);
        }
        symbolCounters[symbol]++;
    }

    // Split on non-word characters to get the words, then count them.
    string[] words = Regex.Split(code, @"\W")
        .Where(x => !string.IsNullOrWhiteSpace(x)).ToArray();
    foreach (string word in words)
    {
        if (!keywordCounters.ContainsKey(word))
        {
            keywordCounters.Add(word, 0);
        }
        keywordCounters[word]++;
    }

    // Count every pair of consecutive words.
    for (int i = 0; i < words.Length - 1; i++)
    {
        string combination = words[i] + " " + words[i + 1];
        if (!wordCombinationCounters.ContainsKey(combination))
        {
            wordCombinationCounters.Add(combination, 0);
        }
        wordCombinationCounters[combination]++;
    }
}
Let's walk through this step by step:
- The code is converted to lowercase. For case-insensitive languages, not doing this would most likely harm the recognition results, and for case-sensitive languages, it won't matter too much.
- The training sample is added to the samples list.
- All symbols are extracted from the code sample using LINQ's Where method and the IsProgrammingSymbol method that we created before.
- We iterate over all found symbols and, for each symbol, increment the value associated with it in the symbolCounters dictionary.
- The code is split on "non-word characters" (that is, all characters that are not A-Z, a-z, 0-9 or underscores) to extract all words.
- Exactly like the symbols, the counters in the keywordCounters dictionary are increased.
- We do the same for all combinations of two consecutive words; a small example of the resulting counts follows below.
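For instance, after adding the (hypothetical) two-line Python sample below, the counters would look like this:
var set = new LanguageTrainingSet();
set.AddSample("import os\nimport sys");
// symbolCounters:          empty - this sample contains no programming symbols
// keywordCounters:         { "import": 2, "os": 1, "sys": 1 }
// wordCombinationCounters: { "import os": 1, "os import": 1, "import sys": 1 }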
As more samples are added with this method, the counters gradually increase and we get a ranking that indicates which keywords appear most often and which appear less often. Eventually, we want to know which keywords, symbols, and combinations appear most, and use those as features for our neural network. To select those, the class has a ChooseSymbolsAndKeywords method. It's internal because we want to be able to call it from other classes in the LanguageRecognition.Core assembly, but not from outside the assembly.
const int SYMBOLS_NUMBER = 10;
const int KEYWORDS_NUMBER = 20;
const int COMBINATIONS_NUMBER = 30;

internal IEnumerable<char> Symbols { get; private set; }
internal IEnumerable<string> Keywords { get; private set; }
internal IEnumerable<string> Combinations { get; private set; }

internal void ChooseSymbolsAndKeywords()
{
    Symbols = symbolCounters.OrderByDescending(x => x.Value)
        .Select(x => x.Key).Take(SYMBOLS_NUMBER);
    Keywords = keywordCounters.OrderByDescending(x => x.Value)
        .Select(x => x.Key).Where(x => !int.TryParse(x, out int _)).Take(KEYWORDS_NUMBER);
    Combinations = wordCombinationCounters.OrderByDescending(x => x.Value)
        .Select(x => x.Key).Take(COMBINATIONS_NUMBER);
}
The point of the .Where call in the keyword selection is to exclude 'keywords' that are only numbers; those wouldn't be useful at all. Numbers in combination with letters are not excluded (and they shouldn't be; for example, 1px is still useful).
TrainingSet: Bringing Together LanguageTrainingSets
The TrainingSet class manages all LanguageTrainingSets, so you don't need to worry about them when you use the LanguageRecognition.Core library. And when the LanguageRecognizer class (which we'll talk about later) wants to perform the neural network training, the TrainingSet class combines the .Symbols, .Keywords and .Combinations that are picked by each LanguageTrainingSet's ChooseSymbolsAndKeywords, so we also have TrainingSet.Symbols, TrainingSet.Keywords and TrainingSet.Combinations -- the features that will be used in our neural network.
public class TrainingSet
{
    Dictionary<string, LanguageTrainingSet> languageSets =
        new Dictionary<string, LanguageTrainingSet>();
    internal Dictionary<string, LanguageTrainingSet> LanguageSets { get => languageSets; }

    internal char[] Symbols { get; private set; }
    internal string[] Keywords { get; private set; }
    internal string[] Combinations { get; private set; }
    internal string[] Languages { get; private set; }

    public void AddSample(string language, string code)
    {
        language = language.ToLowerInvariant();
        if (!languageSets.ContainsKey(language))
        {
            languageSets.Add(language, new LanguageTrainingSet());
        }
        languageSets[language].AddSample(code);
    }

    internal void PrepareTraining()
    {
        List<char> symbols = new List<char>();
        List<string> keywords = new List<string>();
        List<string> combinations = new List<string>();
        foreach (KeyValuePair<string, LanguageTrainingSet> kvp in languageSets)
        {
            LanguageTrainingSet lts = kvp.Value;
            lts.ChooseSymbolsAndKeywords();
            symbols.AddRange(lts.Symbols);
            keywords.AddRange(lts.Keywords);
            combinations.AddRange(lts.Combinations);
        }
        Symbols = symbols.Distinct().ToArray();
        Keywords = keywords.Distinct().ToArray();
        Combinations = combinations.Distinct().ToArray();
        Languages = languageSets.Select(x => x.Key).ToArray();
    }
}
The PrepareTraining method will be called by the LanguageRecognizer class when it needs to know all the features for the network input, and the possible languages for the output.
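As a small illustration (hypothetical samples; note that PrepareTraining is internal, so this would only compile inside the LanguageRecognition.Core assembly):
TrainingSet ts = new TrainingSet();
ts.AddSample("Python", "import os\nprint(os.getcwd())");
ts.AddSample("Java", "import java.util.List;");
ts.PrepareTraining();
// ts.Languages -> { "python", "java" } (language names are lower-cased)
// ts.Symbols, ts.Keywords, ts.Combinations -> the de-duplicated unions of
// the features picked by each LanguageTrainingSet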
LanguageRecognizer: Training and Prediction
The LanguageRecognizer class is where the actual work happens: the neural network is trained, and we get a model that we can use to predict the language of a code sample. Let's first take a look at the fields of this class:
[Serializable]
public class LanguageRecognizer
{
    NeuralNet network;
    char[] symbols;
    string[] keywords;
    string[] combinations;
    string[] languages;
    ClassificationNeuralNetModel model = null;
First, let's note that the class is Serializable: if you've trained the model and want to reuse it later, you shouldn't have to retrain it; you can just serialize it and restore it later. The symbols, keywords and combinations fields are the features for the neural network input, and languages holds the possible outputs - they will all be taken from a TrainingSet. NeuralNet is a class from SharpLearning and so is ClassificationNeuralNetModel, where the latter is the trained model and the former is used for the training.
Next, we have a static CreateFromTraining method, taking a TrainingSet and returning an instance of LanguageRecognizer. I decided to go with a static method and not a constructor, because the constructor guidelines[^] say to do minimal work in a constructor, and training the model is not quite "minimal work".
The LanguageRecognizer.CreateFromTraining method constructs the neural network and its layers in the way I described previously in this article. It goes through all training samples and transforms the code into input vectors. These input vectors are combined into one input matrix, and this matrix is passed to SharpLearning, along with the expected outputs.
public static LanguageRecognizer CreateFromTraining(TrainingSet trainingSet)
{
    LanguageRecognizer recognizer = new LanguageRecognizer();
    trainingSet.PrepareTraining();
    recognizer.symbols = trainingSet.Symbols;
    recognizer.keywords = trainingSet.Keywords;
    recognizer.combinations = trainingSet.Combinations;
    recognizer.languages = trainingSet.Languages;

    // One input neuron per selected symbol/keyword/combination.
    recognizer.network = new NeuralNet();
    recognizer.network.Add(new InputLayer(recognizer.symbols.Length +
        recognizer.keywords.Length + recognizer.combinations.Length));

    // Four hidden layers with S/2, S/3, S/4 and S/5 units.
    int sum = recognizer.symbols.Length + recognizer.keywords.Length +
        recognizer.combinations.Length + recognizer.languages.Length;
    recognizer.network.Add(new DenseLayer(sum / 2));
    recognizer.network.Add(new DenseLayer(sum / 3));
    recognizer.network.Add(new DenseLayer(sum / 4));
    recognizer.network.Add(new DenseLayer(sum / 5));

    // One output neuron per language.
    recognizer.network.Add(new SoftMaxLayer(recognizer.languages.Length));

    ClassificationNeuralNetLearner learner =
        new ClassificationNeuralNetLearner(recognizer.network, loss: new AccuracyLoss());

    List<double[]> inputs = new List<double[]>();
    List<double> outputs = new List<double>();
    foreach (KeyValuePair<string, LanguageTrainingSet> languageSet in trainingSet.LanguageSets)
    {
        string language = languageSet.Key;
        LanguageTrainingSet set = languageSet.Value;
        foreach (string sample in set.Samples)
        {
            inputs.Add(recognizer.PrepareInput(sample));
            outputs.Add(recognizer.PrepareOutput(language));
        }
    }
    F64Matrix inp = inputs.ToF64Matrix();
    double[] outp = outputs.ToArray();
    recognizer.model = learner.Learn(inp, outp);
    return recognizer;
}
This method refers to PrepareInput and PrepareOutput. PrepareOutput is very simple: for a given language, it returns the index of that language in the list of known languages. PrepareInput constructs a double[] with the features to feed to the neural network: the counts of the symbols we care about, the keywords we care about, and the keyword combinations we care about.
double[] PrepareInput(string code)
{
    code = code.ToLowerInvariant();
    double[] prepared = new double[symbols.Length + keywords.Length + combinations.Length];

    // Count the occurrences of each selected symbol.
    for (int i = 0; i < symbols.Length; i++)
    {
        prepared[i] = code.Count(x => x == symbols[i]);
    }

    // Split into words once, filtering out empty entries
    // (the same way AddSample does).
    string[] words = Regex.Split(code, @"\W")
        .Where(x => !string.IsNullOrWhiteSpace(x)).ToArray();

    // Count the occurrences of each selected keyword.
    int offset = symbols.Length;
    for (int i = 0; i < keywords.Length; i++)
    {
        prepared[offset + i] = words.Count(x => x == keywords[i]);
    }

    // Count the occurrences of each pair of consecutive words...
    Dictionary<string, int> cs = new Dictionary<string, int>();
    for (int i = 0; i < words.Length - 1; i++)
    {
        string combination = words[i] + " " + words[i + 1];
        if (!cs.ContainsKey(combination))
        {
            cs.Add(combination, 0);
        }
        cs[combination]++;
    }

    // ...and copy the counts of the selected combinations into the vector.
    offset = symbols.Length + keywords.Length;
    for (int i = 0; i < combinations.Length; i++)
    {
        prepared[offset + i] = cs.ContainsKey(combinations[i]) ? cs[combinations[i]] : 0;
    }
    return prepared;
}

double PrepareOutput(string language)
{
    return Array.IndexOf(languages, language);
}
Lastly, after having created and trained the recognizer, we obviously want to use it to actually recognize languages. That's a very simple piece of code: the input just needs to be turned into an input vector with PrepareInput and passed into SharpLearning's trained model, which gives an index as output.
public string Recognize(string code)
{
    return languages[(int)model.Predict(PrepareInput(code))];
}
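Putting it all together, a (hypothetical) round trip could look like the sketch below, assuming samples holds (language, code) pairs; in practice, you'd want many samples per language, as in the test application of the next section:
TrainingSet ts = new TrainingSet();
foreach (Tuple<string, string> sample in samples)
{
    ts.AddSample(sample.Item1, sample.Item2);
}
LanguageRecognizer recognizer = LanguageRecognizer.CreateFromTraining(ts);
string language = recognizer.Recognize("def main():\n    import os");
// language would hopefully be "python" (language names are lower-cased)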
The Testing
The downloadable LanguageRecognition solution has two projects: LanguageRecognition.Core as a library with all learning-related code, and LanguageRecognition as a console application that trains the recognizer based on the dataset provided by CodeProject. The dataset contains 677 samples. 577 of these are used for training, the remaining 100 for testing how good the model turned out to be.
The test code extracts the code samples, shuffles them, takes the first 577, performs the training with those, then tests serialization and de-serialization of the model, and eventually it performs the prediction testing.
static void Main(string[] args)
{
    // Each sample in the dataset is wrapped in a <pre lang="..."> ... </pre> block.
    string sampleFileContents = File.ReadAllText("LanguageSamples.txt").Trim();
    string[] samples = sampleFileContents.Split(new string[] { "</pre>" },
        StringSplitOptions.RemoveEmptyEntries);
    List<Tuple<string, string>> taggedSamples = new List<Tuple<string, string>>();
    foreach (string sample in samples)
    {
        string s = sample.Trim();
        string pre = s.Split(new char[] { '>' }, 2)[0];
        string language = pre.Split('"')[1]; // the value of the lang attribute
        s = WebUtility.HtmlDecode(s.Replace(pre + ">", ""));
        taggedSamples.Add(new Tuple<string, string>(language, s));
    }
    // Shuffle the samples (once, after they've all been collected).
    taggedSamples = taggedSamples.OrderBy(x => Guid.NewGuid()).ToList();

    TrainingSet ts = new TrainingSet();
    foreach (Tuple<string, string> sample in taggedSamples.Take(577))
    {
        ts.AddSample(sample.Item1, sample.Item2);
    }
    LanguageRecognizer recognizer = LanguageRecognizer.CreateFromTraining(ts);

    // Round-trip the recognizer through serialization to verify that
    // a stored model can be restored and used.
    BinaryFormatter binaryFormatter = new BinaryFormatter();
    LanguageRecognizer restored;
    using (MemoryStream stream = new MemoryStream())
    {
        binaryFormatter.Serialize(stream, recognizer);
        stream.Seek(0, SeekOrigin.Begin);
        restored = (LanguageRecognizer)binaryFormatter.Deserialize(stream);
    }

    int correct = 0;
    int total = 0;
    foreach (Tuple<string, string> sample in taggedSamples.Skip(577))
    {
        if (restored.Recognize(sample.Item2) == sample.Item1.ToLowerInvariant())
        {
            correct++;
        }
        total++;
    }
    Console.WriteLine($"{correct}/{total}");
}
The Results
On average, the accuracy on unseen samples appears to be approximately 85%. The accuracy differs every time you run the test application though, because the code samples are shuffled (so the selected features will be somewhat different) and the neural network is initialized with different random weights every time. Sometimes, the accuracy is just below 80%; sometimes, it's just above 90%. I wanted to test with bigger training sets as well, but I did not have the time to gather them. I believe a bigger training set would increase the accuracy though, because it means a better selection of features and a better-trained neural network.
History
- 3rd March, 2018: Initial version