Machine Learning with ML.Net and C#/VB.Net

Dirk Bahle

28 Jun 2018, CPOL license, 18 min read

Solving the Classification problem with ML.Net Version 0.2.

Index

Introduction
Background
Overview
- Supervised Machine Learning
Binary Classification
- Sentiment Analysis Wikipedia
  - Training Stage
  - Prediction Stage
- You Got Spam
Multiclass Classification
- Language Detection
- Iris Flower Classification
  - Version 1
  - Version 2
Conclusions
References
History

Introduction

This article introduces machine learning in .Net without touching the mathematical side of things. It focuses on the essential work-flows and on how the data handling is structured in .Net, to make it easier to experiment with what is available in the open source project ML.Net version 0.2.

The ML.Net project version 0.2 is available for .Net Core 2.0 and .Net Standard 2.0 with support for the x64 architecture only (Any CPU will not compile right now). It should, thus, be usable in any framework that supports .Net Standard 2.0 (e.g. .Net Framework 4.6.1). The project is still in preview, so APIs may change in the future.

Background

Learning the basics of machine learning has not been easy if you want to use an object-oriented language like C# or VB.Net. Most of the time you have to learn Python before anything else, and then you have to find tutorials with sample data that can teach you more. Even looking at projects like [1] Accord.Net, TensorFlow, or CNTK is not easy, because each of them comes with its own API, its own way of implementing the same things, and so on. I was thrilled by the presentations at Build 2018 [2] because they indicated that we can use a generic work-flow approach that allows us to evaluate the subject with local data, local .Net programs, local models, and results, without having to use a service or another programming language like Python.

Overview

Machine learning is a subset of Artificial Intelligence (AI) and it can answer 5 types of questions [3]:

Supervised

Classification (Binary and Multiclass)
Question: What class does it belong to?
Regression
Question: How much or how many?

Unsupervised

Ranking
Question: What should I do next?
Clustering
Question: How is this organized?
Anomaly Detection
Question: Is this weird?

Each type of question has many applications and in order to use the correct machine learning approach we must first try to determine if we want to answer any of the given questions, and if so, whether we have the data to support it.

Supervised Machine Learning

This article discusses working .Net examples (source code including sample data) for binary and multiclass classifications. This type of machine learning algorithm assumes that we can tag an item to determine whether it belongs to:

One of two groups (binary classification) or
One of many groups (multiclass classification)

A binary classification can be applied when you want to answer a question with a true or false answer. You usually find yourself sorting an item (an image or text) into one of 2 classes. Consider, for instance, the question of whether a customer's feedback to your recent survey expresses a good mood (positive) or not (negative).

Answering this question with machine learning requires us to tag sample items (eg: images or text) as belonging to either group. The normal work-flow requires two independent sets of tagged data:

A Training Data Set (to train the machine learning algorithm) and
An Evaluation Data Set (to measure the efficiency of the ML algorithm).

A tagged line of text may look like this:

1 Grow up you biased child.
0 I hope this helps.

where "1" in the first column denotes a negative sentiment and "0" in the first column denotes a positive sentiment. The rule of thumb is that the ML algorithm works better the more training data we have. We should also make sure that the training data, and the data used later on, is clean and of high quality to support an effective algorithm.
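To make the tagged format concrete, here is a minimal sketch of how one such line could be split into its label and its text by hand. The `TaggedLine` helper is hypothetical and not part of ML.Net, and it assumes that the two columns are separated by a single tab character.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of ML.Net): splits one tagged line such as
// "1<TAB>Grow up you biased child." into its label and its text.
public static class TaggedLine
{
    public static (float Label, string Text) Parse(string line)
    {
        // Assumption: label and text are separated by a single tab character.
        int tab = line.IndexOf('\t');
        if (tab < 0)
            throw new FormatException("Expected '<label><TAB><text>'.");

        // Parse the label culture-independently so that it works on any machine.
        float label = float.Parse(line.Substring(0, tab), CultureInfo.InvariantCulture);

        return (label, line.Substring(tab + 1));
    }
}
```

Calling `TaggedLine.Parse("0\tI hope this helps.")` yields the label `0` and the text `I hope this helps.`. In the demo projects this work is done by the ML.Net TextLoader, so the helper is only meant to illustrate the file layout.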

The overall work-flow to determine an effective algorithm using KPIs is denoted by the diagram on the left side below, where we (ideally) find a model that reflects our classification problem best. The model is not explained in more detail here. It is, in the case of ML.Net, a zip file containing the persisted facts learned from the tagged training data.

The second, independent data set for evaluation is used to determine KPIs for the efficiency of the learned classification. This step estimates how well our algorithm will classify items in the future by comparing the result from the machine learning algorithm with the available tag (without using the tag in the algorithm). A KPI to measure efficiency is, for example, the percentage of correctly classified items among all evaluated items. We can always go back to the training step and adjust parameters, or swap one algorithm for another, if we find that our KPIs do not meet our expectations and we need ways to optimize the model.
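The accuracy KPI mentioned above can be sketched in a few lines of plain C#. The `AccuracyKpi` class is a hypothetical illustration and not an ML.Net type; it assumes that the tags and the predictions are available as two lists of equal length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of ML.Net).
public static class AccuracyKpi
{
    // Fraction of items whose predicted class equals the tagged class.
    public static double Accuracy(IReadOnlyList<bool> tagged, IReadOnlyList<bool> predicted)
    {
        if (tagged.Count != predicted.Count || tagged.Count == 0)
            throw new ArgumentException("Both lists must be non-empty and of equal length.");

        // Pair each tag with its prediction and count the matching pairs.
        int correct = tagged.Zip(predicted, (t, p) => t == p).Count(isCorrect => isCorrect);

        return (double)correct / tagged.Count;
    }
}
```

For four evaluation items of which two are predicted correctly, `Accuracy` returns 0.5, which would be displayed as 50%.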

The Training stage hopefully ends with an effective model which can be applied in the second Prediction stage to classify each item that we see in the future. This stage requires the model from the previous stage and the item to classify, which is used to output a prediction of a classification (eg.: positive or negative sentiment).

This is a brief overview of the work-flow for supervised machine learning. We need to understand it to work with the code samples discussed further below. So, let's look at each sample in turn.

Binary Classification

Sentiment Analysis Wikipedia

The sample discussed in this section is based on A Sentiment Analysis Binary Classification Scenario from the ML.Net tutorial.

Wikipedia_SentimentAnalysis.zip

Training Stage

The work-flow discussed in the previous section is implemented to some degree in the demo solutions attached to this article. Each demo solution contains two executable projects:

Training and
Prediction

We get the following output if we compile and start the Training project:

Training Data Set
-----------------
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Processed 250 instances
Binning and forming Feature objects
Reserved memory for tree learner: 1943796 bytes
Starting to train ...
Not training a calibrator because it is not needed.

Evaluating Training Results
---------------------------

PredictionModel quality metrics evaluation
------------------------------------------
Accuracy: 61,11%
     Auc: 96,30%
 F1Score: 72,00%

We see here how the program first trains a model and evaluates the result in the second step.

The Training and the Prediction modules share a reference to the previously mentioned Model.zip file (which must be copied manually, see details below), a reference to the ML.Net library, and a common model of the data input and the classification output defined in the Models project:

public class ClassificationData
{
    [Column(ordinal: "0", name: "Label")]
    public float Sentiment;

    [Column(ordinal: "1")]
    public string Text;
}

public class ClassPrediction
{
    [ColumnName("PredictedLabel")]
    public bool Class;
}

Public Class ClassificationData
    <Column("0", "Label")>
    Public Sentiment As Single

    <Column("1")>
    Public Text As String
End Class

Public Class ClassPrediction
    <ColumnName("PredictedLabel")>
    Public [Class] As Boolean
End Class

The fields defined in ClassificationData map each column of the text input file to an input. The Label column holds the class that we want to train against for each line of text. The Text field itself cannot be labeled as a "Feature" because it consists of more than one "column" (in the text file). This is why we need to add the new TextFeaturizer("Features", "Text") line in the pipeline below to read the text into the input data structure.

The ClassificationData is a rough description of our input and how it should be mapped into either a Label or a Feature. Try removing the Label column definition, compile and execute, to verify that the system will throw an exception, if a column named Label cannot be found in the input text.

The ClassPrediction states only one binary output result, which is expected to be a Boolean value that maps the input to either binary class. This part is relevant to:

Verify whether learning was successful (with known input in the test phase) and
Determine the actual classification of the machine learning algorithm when using its model in production.

In summary: The ClassificationData is used to describe how we wish to process the input (always consisting of Label and Features), and the ClassPrediction maps this input to a learned result.

The Training pipeline that consumes the text input via the ClassificationData definition looks like this:

internal static async Task<PredictionModel<ClassificationData, ClassPrediction>>
      TrainAsync(string trainingDataFile, string modelPath)
  {
      var pipeline = new LearningPipeline();

      pipeline.Add(new TextLoader(trainingDataFile).CreateFrom<ClassificationData>());

      pipeline.Add(new TextFeaturizer("Features", "Text"));

      pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

      PredictionModel<ClassificationData, ClassPrediction> model =
                          pipeline.Train<ClassificationData, ClassPrediction>();

      // Saves the model we trained to a zip file.
      await model.WriteAsync(modelPath);

      // Returns the model we trained to use for evaluation.
      return model;
  }

Async Function TrainAsync(
                      ByVal trainingDataFile As String,
                      ByVal modelPath As String)
                      As Task(Of PredictionModel(Of ClassificationData, ClassPrediction))

    Dim pipeline = New LearningPipeline()

    pipeline.Add(New TextLoader(trainingDataFile).CreateFrom(Of ClassificationData)())

    pipeline.Add(New TextFeaturizer("Features", "Text"))

    pipeline.Add(New StochasticDualCoordinateAscentBinaryClassifier())

    Dim model As PredictionModel(Of ClassificationData, ClassPrediction) =
                                 pipeline.Train(Of ClassificationData, ClassPrediction)()

    ' Saves the model we trained to a zip file.
    Await model.WriteAsync(modelPath)

    ' Returns the model we trained to use for evaluation.
    Return model
End Function

The ML.Net framework comes with an extensible pipeline concept in which the different processing steps can be plugged in as shown above. The TextLoader step loads the data from the text file and the TextFeaturizer step converts the given input text into a feature vector, which is a numerical representation of the given text. This numerical representation is then fed into something that the ML community calls a learner. The learner is in this case a FastTreeBinaryClassifier in the C# listing, while the VB.Net listing uses a StochasticDualCoordinateAscentBinaryClassifier.
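To give an impression of what "a numerical representation of the given text" can mean, the following sketch builds a very simple bag-of-words vector. This is explicitly not the algorithm used by the ML.Net TextFeaturizer; the `BagOfWords` class, its tokenization rules, and its method names are assumptions made only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration (not the ML.Net TextFeaturizer).
public static class BagOfWords
{
    // Builds a vocabulary: each distinct lower-cased word gets a fixed index.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> texts)
    {
        var vocabulary = new Dictionary<string, int>();

        foreach (string word in texts.SelectMany(text => Tokenize(text)))
        {
            if (!vocabulary.ContainsKey(word))
                vocabulary[word] = vocabulary.Count; // next free index
        }

        return vocabulary;
    }

    // Converts one text into a vector of word counts; unknown words are ignored.
    public static float[] ToVector(string text, Dictionary<string, int> vocabulary)
    {
        var vector = new float[vocabulary.Count];

        foreach (string word in Tokenize(text))
        {
            if (vocabulary.TryGetValue(word, out int index))
                vector[index] += 1f;
        }

        return vector;
    }

    // Assumption: words are separated by blanks, tabs, or basic punctuation.
    private static IEnumerable<string> Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '.', ',', '!', '?' }, StringSplitOptions.RemoveEmptyEntries);
}
```

A vocabulary is built once from the training texts and then reused for every text that should be classified, so that each position in the vector always refers to the same word. A learner only ever sees such vectors, never the original text.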

A learner or trainer is the component that converts the numerical feature vectors into a model that can later be used to classify new input. The documentation for these learners is currently under construction and not all learners are fully implemented and tested, yet. For binary classifications, there are a few alternative learners that could be used instead (just edit the constructor as shown below):

Classification Method	Accuracy	Auc	F1Score
`new AveragedPerceptronBinaryClassifier()`	61.11%	81.48%	72.00%
`new FastForestBinaryClassifier() { NumThreads=2, NumLeaves = 25, NumTrees = 25, MinDocumentsInLeafs = 2 }`	72.22%	97.53%	78.26%
`new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 }`	61.11%	96.30%	72.00%
`new GeneralizedAdditiveModelBinaryClassifier()`	50.00%	83.95%	66.67%
`new LinearSvmBinaryClassifier()`	72.22%	90.12%	76.19%
`new LogisticRegressionBinaryClassifier()`	50.00%	86.42%	66.67%
`new StochasticDualCoordinateAscentBinaryClassifier()`	83.33%	98.77%	85.71%
`new StochasticGradientDescentBinaryClassifier()`	55.56%	90.12%	69.23%

Testing all of the above learners shows that the StochasticDualCoordinateAscentBinaryClassifier works best based on the measured KPIs. These KPIs are measured by an instance of BinaryClassificationMetrics, which also offers other KPIs, such as Precision and Recall. Note that you can analyse further KPIs, such as memory consumption and processing time, which are not measured here. The test presented is rather small and brief. We can also use different settings for individual learners, which may still reveal significant improvements. Being able to play with these different scenarios looks like an interesting exercise when we face the problem of an automated classification of a large number of items (text, images, etc.).
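For readers who wonder how KPIs such as Precision, Recall, and the F1 score relate to each other, here is a minimal sketch that computes them from tagged and predicted classes using their standard definitions. The `BinaryKpis` class is hypothetical; the demo projects read these values from the ML.Net metrics object instead of computing them by hand.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of ML.Net); "true" denotes the positive class.
public static class BinaryKpis
{
    public static (double Precision, double Recall, double F1) Compute(
        IReadOnlyList<bool> tagged, IReadOnlyList<bool> predicted)
    {
        if (tagged.Count != predicted.Count)
            throw new ArgumentException("Both lists must have the same length.");

        int truePositives = 0, falsePositives = 0, falseNegatives = 0;

        for (int i = 0; i < tagged.Count; i++)
        {
            if (predicted[i] && tagged[i]) truePositives++;
            else if (predicted[i] && !tagged[i]) falsePositives++;
            else if (!predicted[i] && tagged[i]) falseNegatives++;
        }

        // Precision: how many of the items predicted as positive really are positive?
        double precision = truePositives + falsePositives == 0
            ? 0.0 : (double)truePositives / (truePositives + falsePositives);

        // Recall: how many of the really positive items were found?
        double recall = truePositives + falseNegatives == 0
            ? 0.0 : (double)truePositives / (truePositives + falseNegatives);

        // F1 score: harmonic mean of precision and recall.
        double f1 = precision + recall == 0.0
            ? 0.0 : 2.0 * precision * recall / (precision + recall);

        return (precision, recall, f1);
    }
}
```

The F1 score is useful when the two classes are not equally frequent, because a learner that always predicts the majority class can reach a high accuracy while its recall for the other class stays at zero.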

So, this is in a nutshell how machine learning can work. The machine consumes data (text), converts it into numerical vectors, and integrates the vectorized data into a model. The model is the main output of the first stage. Let's have a look at the Prediction stage to understand the complete work-flow.

Prediction Stage

The Prediction stage is the module that represents the code that runs in production and classifies data as new items arrive in the system. This part is implemented in the PredictAsync method of the Prediction project in the Wikipedia_SentimentAnalysis solution. The code for this method looks like this:

async Task<PredictionModel<ClassificationData, ClassPrediction>> PredictAsync(
    string modelPath,
    string[] classNames,
    IEnumerable<ClassificationData> predicts = null,
    PredictionModel<ClassificationData, ClassPrediction> model = null)
{
    if (model == null)
      model = await PredictionModel.ReadAsync<ClassificationData, ClassPrediction>(modelPath);

    if (predicts == null) // do we have input to predict a result?
        return model;

    IEnumerable<ClassPrediction> predictions = model.Predict(predicts);

    Console.WriteLine("Classification Predictions");

    IEnumerable<(ClassificationData sentiment, ClassPrediction prediction)> sentimentsAndPredictions =
        predicts.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));

    foreach (var item in sentimentsAndPredictions)
    {
        string textDisplay = item.sentiment.Text;

        if (textDisplay.Length > 80)
            textDisplay = textDisplay.Substring(0, 75) + "...";

        Console.WriteLine("Prediction: {0} | Text: '{1}'",
                          (item.prediction.Class ? classNames[0] : classNames[1]), textDisplay);
    }
    Console.WriteLine();

    return model;
}

Async Function PredictAsync(ByVal modelPath As String,
                            ByVal Optional classNames As String() = Nothing,
                            ByVal Optional predicts As IEnumerable(Of ClassificationData) = Nothing,
                            ByVal Optional model As PredictionModel(Of ClassificationData, ClassPrediction) = Nothing)
                            As Task(Of PredictionModel(Of ClassificationData, ClassPrediction))
    If model Is Nothing Then
        model = Await PredictionModel.ReadAsync(Of ClassificationData, ClassPrediction)(modelPath)
    End If

    If predicts Is Nothing Then Return model

    Console.WriteLine()
    Console.WriteLine("Classification Predictions")
    Console.WriteLine("--------------------------")

    For Each Item In predicts
        Dim predictedResult = model.Predict(Item)

        Dim textDisplay As String = Item.Text

        If textDisplay.Length > 80 Then
            textDisplay = textDisplay.Substring(0, 75) & "..."
        End If

        Dim resultClass = classNames(0)
        If predictedResult.[Class] = False Then resultClass = classNames(1)

        Console.WriteLine("Prediction: {0} | Text: '{1}'", resultClass, textDisplay)
    Next
    Console.WriteLine()

    Return model
End Function

The PredictionModel.ReadAsync line in the method loads the model from the file system into an in-memory PredictionModel:

PredictionModel<ClassificationData, ClassPrediction> model = await
                           PredictionModel.ReadAsync<ClassificationData, ClassPrediction>(modelPath);

model = Await PredictionModel.ReadAsync(Of ClassificationData, ClassPrediction)(modelPath)

The loaded model is stored in the project's Learned folder. This Model.zip file has to be copied from the Training module's output whenever we find a significant improvement and want to take advantage of it in the Prediction module.

Everything below the model loading code line evaluates input against the loaded model and outputs a predicted classification in the last part of the method. You can use the interactive input prompt to test sample texts of your own and test on a small scale what was learned and what was not. Remember that the learned data is usually cleaned (not the same as the original input) and that you can only test on a small scale like this. A better and more reasonable test is probably to feed in the last n text lines from a real data source, get their classification, and see if an independent reviewer has a closely matching result or not.

We have seen in this section how binary classification can work for sentiment analysis in a very "simple" scenario. But the real strength of ML is that each type of question (here: Is this A or B?) can be applied in a wide variety of applications. Let's look at one more sample in the next section to review another binary classification use case.

You Got Spam

The data for the sample discussed in this section is based on the CodeProject article You've Got Spam. The aim of this binary classification project is to determine whether a given text should be classified as Spam or not.

The source code attached to this article for the YouGotSpam_Analysis solution is almost identical to the code explained in the last section. Even the executable projects are in fact almost identical. The only difference here is the data source for training and test evaluation, which is in this case the test data from the above CodeProject article (see Data folder in Training project). The Training project produces this output:

Training Data Set
-----------------
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Processed 2000 instances
Binning and forming Feature objects
Reserved memory for tree learner: 24082752 bytes
Starting to train ...
Not training a calibrator because it is not needed.

Evaluating Training Results
---------------------------

PredictionModel quality metrics evaluation
------------------------------------------
Accuracy: 100,00%
     Auc: 100,00%
 F1Score: 100,00%

...which indicates that we can reach the same KPIs as indicated in the original article based on Python.

You can again use the Prediction project to load a model from the file system and test it with further input.

The projects discussed so far have shown that ML.Net can be helpful to determine a binary classification in an automated fashion. But what if I want to classify more than 2 classes (eg: negative, neutral, and positive sentiment)? The next section examines classifying data for this use case.

Multiclass Classification

Language Detection

The data for the sample discussed in this section was downloaded from http://wortschatz.uni-leipzig.de and pre-processed (removed quote character ") for improved parsing experience.

The multiclass classification use case discussed here is the detection of a language based on a given text. Just imagine you have teams of social media agents and you are trying to relay online customer feedback (e.g. chats), in different languages, to the correct team that speaks that language.

The LanguageDetection solution attached to this section follows the structure of the previously discussed binary classification samples. We have a Training project, a Prediction project, and a Models class library that is shared between the executables. The Training project can be used to create a model with a particular learner. A successful model can then be copied from the Training project to the Prediction project for consumption and multiclass classification of future input.

The LanguageDetection solution differs in how we define the LanguageClass field in the ClassificationData class and the Class field in the ClassPrediction class. Both must be of the data type float to support multiple classes:

public class ClassificationData
{
    [Column(ordinal: "0", name: "Label")]
    public float LanguageClass;

    [Column(ordinal: "1")]
    public string Text;
}

public class ClassPrediction
{
    [ColumnName("PredictedLabel")]
    public float Class;
}

Public Class ClassificationData
    <Column("0", "Label")>
    Public LanguageClass As Single

    <Column("1")>
    Public Text As String
End Class

Public Class ClassPrediction
    <ColumnName("PredictedLabel")>
    Public [Class] As Single
End Class

The input mapping in ClassificationData is the same as the one in the binary classification problem. The only difference is now that we have more than two values in the Label column of the text file that is being fed in.

The output mapping in ClassPrediction is different because we now have to map to a float value in order to classify towards more than one class.

The required training pipeline looks like this:

async Task<PredictionModel<ClassificationData, ClassPrediction>>
TrainAsync(string trainingDataFile, string modelPath)
{
    var pipeline = new LearningPipeline();

    pipeline.Add(new TextLoader(trainingDataFile).CreateFrom<ClassificationData>());

    pipeline.Add(new Dictionarizer("Label"));
    pipeline.Add(new TextFeaturizer("Features", "Text"));

    pipeline.Add(new StochasticDualCoordinateAscentClassifier());
    //pipeline.Add(new LogisticRegressionClassifier());
    //pipeline.Add(new NaiveBayesClassifier());

    pipeline.Add(new PredictedLabelColumnOriginalValueConverter() { PredictedLabelColumn = "PredictedLabel" });

    // Train the pipeline based on the dataset that has been loaded, transformed.
    PredictionModel<ClassificationData, ClassPrediction> model =
                        pipeline.Train<ClassificationData, ClassPrediction>();

    await model.WriteAsync(modelPath); // Saves the model we trained to a zip file.

    return model;
}

Async Function TrainAsync(ByVal trainingDataFile As String,
                          ByVal modelPath As String)
                          As Task(Of PredictionModel(Of ClassificationData, ClassPrediction))

    Dim pipeline = New LearningPipeline()

    pipeline.Add(New TextLoader(trainingDataFile).CreateFrom(Of ClassificationData)())

    pipeline.Add(New Dictionarizer("Label"))

    pipeline.Add(New TextFeaturizer("Features", "Text"))

    pipeline.Add(New StochasticDualCoordinateAscentClassifier())
    'pipeline.Add(New LogisticRegressionClassifier())
    'pipeline.Add(New NaiveBayesClassifier())

    pipeline.Add(New PredictedLabelColumnOriginalValueConverter() With
    {
        .PredictedLabelColumn = "PredictedLabel"
    })

    ' Train the pipeline based on the dataset that has been loaded, transformed.
    Dim model As PredictionModel(Of ClassificationData, ClassPrediction) =
                        pipeline.Train(Of ClassificationData, ClassPrediction)()

    ' Saves the model we trained to a zip file.
    Await model.WriteAsync(modelPath)

    ' Returns the model we trained to use for evaluation.
    Return model
End Function

The Dictionarizer("Label") step maps each line with a labeled input value (0-5) into a bucket. The PredictedLabelColumnOriginalValueConverter maps the predicted value (a vector) back to the data type of the original value (a float).

Compiling and running the Training module gets us this output:
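The idea of mapping label values into buckets and back can be sketched with a small two-way dictionary. The `LabelDictionary` class below is a hypothetical, conceptual illustration only; it does not claim to reproduce how the ML.Net Dictionarizer or the PredictedLabelColumnOriginalValueConverter are implemented internally.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration (not an ML.Net type): assigns each distinct label
// value a consecutive key and can translate a key back to the original value.
public sealed class LabelDictionary<T>
{
    private readonly Dictionary<T, int> _keyOf = new Dictionary<T, int>();
    private readonly List<T> _valueOf = new List<T>();

    // Training direction: original label value -> key (bucket index).
    public int GetOrAddKey(T label)
    {
        if (!_keyOf.TryGetValue(label, out int key))
        {
            key = _valueOf.Count;   // the next free key
            _keyOf.Add(label, key);
            _valueOf.Add(label);
        }

        return key;
    }

    // Prediction direction: key (bucket index) -> original label value.
    public T GetOriginalValue(int key) => _valueOf[key];
}
```

With such a mapping the learner only has to deal with the keys 0, 1, 2, and so on, regardless of whether the original labels were floats (as in this section) or strings (as in the Iris sample further below).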

Training Data Set
-----------------
Not adding a normalizer.
Using 4 threads to train.
Automatically choosing a check frequency of 4.
Auto-tuning parameters: maxIterations = 48.
Auto-tuning parameters: L2 = 2.778334E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 1.
Using best model from iteration 8.
Not training a calibrator because it is not needed.

Evaluating Training Results
---------------------------

PredictionModel quality metrics evaluation
------------------------------------------
  Accuracy Macro: 98.66%
  Accuracy Micro: 98.66%
   Top KAccuracy: 0.00%
         LogLoss: 7.50%

 PerClassLogLoss:
       Class: 0 - 11.18%
       Class: 1 - 4.08%
       Class: 2 - 5.95%
       Class: 3 - 10.43%
       Class: 4 - 7.86%
       Class: 5 - 5.52%

There are three multiclass classification learners in ML.Net version 0.2, and their KPIs compare as indicated below:

Classification Method	Output
`new StochasticDualCoordinateAscentClassifier()`	Accuracy Macro: 98.66% Accuracy Micro: 98.66% Top KAccuracy: 0.00% LogLoss: 7.50% PerClassLogLoss: Class: 0 - 11.18% Class: 1 - 4.08% Class: 2 - 5.95% Class: 3 - 10.43% Class: 4 - 7.86% Class: 5 - 5.52%
`new LogisticRegressionClassifier()`	Accuracy Macro: 98.52% Accuracy Micro: 98.52% Top KAccuracy: 0.00% LogLoss: 8.63% PerClassLogLoss: Class: 0 - 13.32% Class: 1 - 4.67% Class: 2 - 7.09% Class: 3 - 11.50% Class: 4 - 8.98% Class: 5 - 6.19%
`new NaiveBayesClassifier()`	Accuracy Macro: 96.58% Accuracy Micro: 96.58% Top KAccuracy: 0.00% LogLoss: 3,453.88% PerClassLogLoss: Class: 0 - 3,453.88% Class: 1 - 3,453.88% Class: 2 - 3,453.88% Class: 3 - 3,453.88% Class: 4 - 3,453.88% Class: 5 - 3,453.88%
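The multiclass KPIs listed above can be approximated by hand to get a feeling for what they measure. The sketch below uses the common textbook definitions: micro accuracy is the fraction of all items classified correctly, macro accuracy is the average of the per-class accuracies, and log loss is the average negative natural logarithm of the probability that was assigned to the true class. The `MulticlassKpis` class is hypothetical, and it is an assumption that ML.Net computes its values in exactly this way.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of ML.Net); classes are given as integer indices.
public static class MulticlassKpis
{
    // Micro accuracy: every item counts equally.
    public static double MicroAccuracy(IReadOnlyList<int> tagged, IReadOnlyList<int> predicted)
    {
        int correct = Enumerable.Range(0, tagged.Count).Count(i => tagged[i] == predicted[i]);
        return (double)correct / tagged.Count;
    }

    // Macro accuracy: every class counts equally, however many items it has.
    public static double MacroAccuracy(IReadOnlyList<int> tagged, IReadOnlyList<int> predicted)
    {
        return tagged
            .Select((label, i) => new { Label = label, Correct = predicted[i] == label })
            .GroupBy(item => item.Label)
            .Average(group => group.Count(item => item.Correct) / (double)group.Count());
    }

    // Log loss: penalizes low probabilities assigned to the true class.
    public static double LogLoss(IReadOnlyList<double> probabilityOfTrueClass)
    {
        return probabilityOfTrueClass.Average(p => -Math.Log(p));
    }
}
```

Micro and macro accuracy are identical when all classes have the same number of evaluation items, which would be consistent with the equal values in the language detection output. They drift apart when a rarely occurring class is classified badly.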

So, this is how we can multiclass classify text based on one Feature input column. The same machine learning approach (binary or multiclass) is also available for more than one feature input column, as we will see next.

Iris Flower Classification

Version 1

The multiclass classification problem discussed in this section is a well known reference test in the pattern recognition community [4]. The original database was created by Ronald Fisher in 1936, and the ML.Net sample reviewed here comes from the Get Started section of the ML.Net tutorial. The problem statement is to create an algorithm that accepts an input vector of multiple float values (representing properties of the flower) and outputs the most likely name of the flower.

Doing this in ML.Net requires us to create an input mapping with more than one column:

public class ClassificationData
{
    [Column("0")]
    public float SepalLength;

    [Column("1")]
    public float SepalWidth;

    [Column("2")]
    public float PetalLength;

    [Column("3")]
    public float PetalWidth;

    [Column("4")]
    [ColumnName("Label")]
    public string Label;
}

Public Class ClassificationData
    <Column("0")>
    Public SepalLength As Single

    <Column("1")>
    Public SepalWidth As Single

    <Column("2")>
    Public PetalLength As Single

    <Column("3")>
    Public PetalWidth As Single

    <Column("4")>
    <ColumnName("Label")>
    Public Label As String
End Class

We are inputting a set of feature columns (namely SepalLength, SepalWidth, PetalLength, PetalWidth) that is later combined into one Features vector. The Label is in this case a string that is given as the last column to identify each data row during the training and test stage of the algorithm.

The result of the predicted class should (not surprisingly) be a string:

public class ClassPrediction
{
    [ColumnName("PredictedLabel")]
    public string Class;
}

Public Class ClassPrediction
    <ColumnName("PredictedLabel")>
    Public [Class] As String
End Class

The training code for this case is very similar to the previous section:

async Task<PredictionModel<ClassificationData, ClassPrediction>>
    TrainAsync(string trainingDataFile, string modelPath)
{
    var pipeline = new LearningPipeline();

    pipeline.Add(new TextLoader(trainingDataFile).CreateFrom<ClassificationData>(separator: ','));

    pipeline.Add(new Dictionarizer("Label"));

    pipeline.Add(new ColumnConcatenator("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"));

    pipeline.Add(new StochasticDualCoordinateAscentClassifier());

    pipeline.Add(new PredictedLabelColumnOriginalValueConverter() { PredictedLabelColumn = "PredictedLabel" });

    // Train the pipeline based on the dataset that has been loaded, transformed.
    PredictionModel<ClassificationData, ClassPrediction> model =
                        pipeline.Train<ClassificationData, ClassPrediction>();

    await model.WriteAsync(modelPath);

    return model;
}

Async Function TrainAsync(ByVal trainingDataFile As String,
                          ByVal modelPath As String)
                          As Task(Of PredictionModel(Of ClassificationData, ClassPrediction))

    Dim pipeline = New LearningPipeline()

    pipeline.Add(New TextLoader(trainingDataFile).CreateFrom(Of ClassificationData)(separator:=","c))

    pipeline.Add(New Dictionarizer("Label"))

    pipeline.Add(New ColumnConcatenator("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"))

    pipeline.Add(New StochasticDualCoordinateAscentClassifier())

    pipeline.Add(New PredictedLabelColumnOriginalValueConverter() With {
        .PredictedLabelColumn = "PredictedLabel"
    })

    ' Train the pipeline based on the dataset that has been loaded, transformed.
    Dim model As PredictionModel(Of ClassificationData, ClassPrediction) =
                         pipeline.Train(Of ClassificationData, ClassPrediction)()

    ' Saves the model we trained to a zip file.
    Await model.WriteAsync(modelPath)

    ' Returns the model we trained to use for evaluation.
    Return model
End Function

There are only two new things here. The raw input data is in this case a comma separated list; therefore, we have to use a separator: ',' parameter when loading the data from the text file in the pipeline. And we use the ColumnConcatenator to convert the set of feature columns into one column consisting of a vector named Features.
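What these two steps do to a single line of input can be sketched in plain C#. The `IrisLine` helper is hypothetical and not part of ML.Net; it assumes four numeric columns followed by the label, with a dot as decimal separator, and the sample line in the usage note is likewise only an assumed example of that layout.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper (not part of ML.Net): parses one comma separated line and
// concatenates the four numeric columns into a single feature vector.
public static class IrisLine
{
    public static (float[] Features, string Label) Parse(string line)
    {
        string[] columns = line.Split(',');

        if (columns.Length != 5)
            throw new FormatException("Expected 4 feature columns and 1 label column.");

        // Parse culture-independently: on a machine with, for example, German
        // regional settings "5.1" would otherwise not be read as 5.1.
        float[] features = columns
            .Take(4)
            .Select(column => float.Parse(column, CultureInfo.InvariantCulture))
            .ToArray();

        return (features, columns[4]);
    }
}
```

Calling `IrisLine.Parse("5.1,3.5,1.4,0.2,Iris-setosa")` returns a vector with the four measurements and the label string. The explicit InvariantCulture is a design choice worth noting, since some of the program output in this article was produced with a comma as decimal separator.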

The output is similar to what we've seen before (and we can again experiment with the other two learners as shown in the last section):

Training Data Set
-----------------
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Using 4 threads to train.
Automatically choosing a check frequency of 4.
Auto-tuning parameters: maxIterations = 45452.
Auto-tuning parameters: L2 = 2.667051E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 0.
Using best model from iteration 1956.
Not training a calibrator because it is not needed.

Evaluating Training Results
---------------------------

PredictionModel quality metrics evaluation
------------------------------------------
  Accuracy Macro: 95.73%
  Accuracy Micro: 95.76%
   Top KAccuracy: 0.00%
         LogLoss: 8.19%

 PerClassLogLoss:
       Class: 0 - 0.72%
       Class: 1 - 10.62%
       Class: 2 - 13.43%

Again, we can use the Training module of the IrisClassification solution to train different learners and settings and use the Prediction module to predict new classifications with the previously determined model.

We have seen in this section how 4 input columns (SepalLength, SepalWidth, PetalLength, PetalWidth) are converted into one vectorized Features column using the ColumnConcatenator converter. An equivalent approach that does not require us to use a ColumnConcatenator in the pipeline code is to use the following input class definition:

public class ClassificationData
{
    public float SepalLength
    {
      get { return Features[0]; }
      set { Features[0] = value; }
    }
    
    public float SepalWidth 
    {
      get { return Features[1]; }
      set { Features[1] = value; }
    }
    
    public float PetalLength
    {
      get { return Features[2]; }
      set { Features[2] = value; }
    }
    
    public float PetalWidth 
    {
      get { return Features[3]; }
      set { Features[3] = value; }
    }

    [Column("0-3")]
    [ColumnName("Features")]
    [VectorType(4)] public float[] Features = new float[4];

    [Column("4")]
    [ColumnName("Label")]
    public string Label;
}

Public Class ClassificationData
    Public Property SepalLength As Single
        Get
            Return Features(0)
        End Get
        Set(ByVal value As Single)
            Features(0) = value
        End Set
    End Property

    Public Property SepalWidth As Single
        Get
            Return Features(1)
        End Get
        Set(ByVal value As Single)
            Features(1) = value
        End Set
    End Property

    Public Property PetalLength As Single
        Get
            Return Features(2)
        End Get
        Set(ByVal value As Single)
            Features(2) = value
        End Set
    End Property

    Public Property PetalWidth As Single
        Get
            Return Features(3)
        End Get
        Set(ByVal value As Single)
            Features(3) = value
        End Set
    End Property

    <Column("0-3","Features")>
    <VectorType(4)>
    Public Features As Single() = New Single(3) {}
    <Column("4", "Label")>
    Public Label As String
End Class

But it is bad practice to define the actual feature set through the ClassificationData definition as shown above. We should, therefore, remove the [ColumnName("Features")] line and add new ColumnConcatenator("Features", nameof(ClassificationData.Features)) in the pipeline code instead. This design gives us more flexibility when trying to evaluate different feature configurations.

Version 2

Let us suppose for a moment that we do not want the machine learning algorithm to handle strings (since we really want to localize that part of the application). It would be better practice to go back to handling integer values and treat each integer as an index that indicates the classification (type of flower). But how exactly can this be done? We can change the definition of the input and predicted output like so:

public class ClassificationData
{
    public float SepalLength
    {
      get { return Features[0]; }
      set { Features[0] = value; }
    }
    
    public float SepalWidth
    {
      get { return Features[1]; }
      set { Features[1] = value; }
    }
    
    public float PetalLength
    {
      get { return Features[2]; }
      set { Features[2] = value; }
    }
    
    public float PetalWidth
    {
      get { return Features[3]; }
      set { Features[3] = value; }
    }

    [Column("0-3")]
    [ColumnName("Features")]
    [VectorType(4)] public float[] Features = new float[4];

    [Column("4")]
    [ColumnName("Label")]
    public float Label;
}

public class ClassPrediction
{
    [ColumnName("PredictedLabel")]
    public uint Class;
}

Public Class ClassificationData
    Public Property SepalLength As Single
        Get
            Return Features(0)
        End Get
        Set(ByVal value As Single)
            Features(0) = value
        End Set
    End Property

    Public Property SepalWidth As Single
        Get
            Return Features(1)
        End Get
        Set(ByVal value As Single)
            Features(1) = value
        End Set
    End Property

    Public Property PetalLength As Single
        Get
            Return Features(2)
        End Get
        Set(ByVal value As Single)
            Features(2) = value
        End Set
    End Property

    Public Property PetalWidth As Single
        Get
            Return Features(3)
        End Get
        Set(ByVal value As Single)
            Features(3) = value
        End Set
    End Property

    <Column("0-3")>
    <ColumnName("Features")>
    <VectorType(4)>
    Public Features As Single() = New Single(3) {}

    <Column("4")>
    <ColumnName("Label")>
    Public Label As Single
End Class

Public Class ClassPrediction
    <ColumnName("PredictedLabel")>
    Public [Class] As UInteger
End Class

Next, we have to remove the PredictedLabelColumnOriginalValueConverter from the pipeline of the previous solution. That is all it takes to adjust for this scenario (assuming we adjusted the data as well, so that the label column holds numbers instead of strings). This approach can also be verified in the attached IrisClassification_uint solution.
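
Once the prediction is a plain number, turning it into display text becomes a job for the application. The following is a minimal sketch of that idea, not code from the attached solution. The FlowerNames class, its lookup table, and the language keys are hypothetical, and the sketch assumes that the value in ClassPrediction.Class is a zero-based index in the order setosa, versicolor, virginica. Whether that holds depends on how the labels were numbered in the training data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps a predicted class index to a display name.
public static class FlowerNames
{
    // Assumption: index 0 = setosa, 1 = versicolor, 2 = virginica.
    // The language keys and name lists are illustrative only.
    private static readonly Dictionary<string, string[]> Names =
        new Dictionary<string, string[]>
        {
            { "en", new[] { "Bristle-pointed iris", "Blue flag", "Virginia iris" } },
            { "la", new[] { "Iris setosa", "Iris versicolor", "Iris virginica" } },
        };

    // Returns the name for the given class, falling back to English
    // when the requested language is not in the table.
    public static string Describe(uint predictedClass, string language)
    {
        if (!Names.TryGetValue(language, out var names))
        {
            names = Names["en"];
        }

        return predictedClass < names.Length
            ? names[predictedClass]
            : "Unknown class " + predictedClass;
    }
}
```

A call such as FlowerNames.Describe(prediction.Class, "la") would then yield the text to show, where prediction stands for a ClassPrediction instance returned by the model. In a real application, the table would more likely live in resource files than in code.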

Conclusions

The reviewed sample applications have shown that ML.Net offers interesting value (even at version 0.2) when it comes to bringing machine learning to the .Net framework. We have seen that binary and multiclass classification can be based on different types of input and output. This input and output always require:

a Label and a Features column as input and
a PredictedLabel column as output.

The data types of the inputs and outputs are flexible because converters can be used to convert values into numbers and vectors when feeding the input into the engine, and a similar conversion is possible when we have to interpret the result of a classification.

I hope this article was useful and helps you get started with the subject. Give me your feedback in the form of stars, or let me know if you see essential things to add or change, since this could help us all to develop these ML.Net-based apps even further.

References

[1] Machine Learning Frameworks
[2] ML.Net at Build 2018
[3] Machine Learning for the Absolute Beginner
[4] Pattern Recognition with the Iris Data Set

History

2018-Jun-18 Added VB.Net samples and minor bugfix in Wikipedia sample (changed default learner to best default learner instead of using worst learner by default).

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)