This article introduces machine learning in .Net without touching the mathematical side of things. It focuses on the essential work-flows and data-handling structures in .Net, to make it easier to experiment with what is available in the open source project ML.Net version 0.2.
ML.Net version 0.2 is available for .Net Core 2.0 and .Net Standard 2.0 and supports the x64 architecture only (Any CPU will not compile right now). It should therefore be usable in any framework that supports .Net Standard 2.0 (eg.: .Net Framework 4.6.1). The project is still in preview, so its APIs may change in the future.
Learning the basics of machine learning has not been easy if you want to use an object oriented language like C# or VB.Net, because most of the time you have to learn Python before anything else, and then you have to find tutorials with sample data that can teach you more. Even looking at projects like Accord.Net, TensorFlow, or CNTK [1] is not easy because each of them comes with its own API, its own way of implementing the same things, and so on. I was thrilled by the presentations at Build 2018 [2] because they indicated that we can use a generic work-flow approach that allows us to evaluate the subject with local data, local .Net programs, local models, and results, without having to use a service or another programming language like Python.
Machine learning is a subset of Artificial Intelligence (AI) and it can answer 5 types of questions [3]:
Supervised
- Classification (Binary and Multiclass)
Question: What class does it belong to?
- Regression
Question: How much or how many?
Unsupervised
- Ranking
Question: What should I do next?
- Clustering
Question: How is this organized?
- Anomaly Detection
Question: Is this weird?
Each type of question has many applications and in order to use the correct machine learning approach we must first try to determine if we want to answer any of the given questions, and if so, whether we have the data to support it.
This article discusses working .Net examples (source code including sample data) for binary and multiclass classifications. This type of machine learning algorithm assumes that we can tag an item to determine whether it belongs to:
- One of two groups (binary classification) or
- One of many groups (multiclass classification)
A binary classification can be applied when you want to answer a question with a true or false answer. You usually find yourself sorting an item (an image or text) into one of 2 classes. Consider, for instance, the question of whether a customer's feedback to your recent survey expresses a good mood (positive) or not (negative).
Answering this question with machine learning requires us to tag sample items (eg: images or text) as belonging to either group. The normal work-flow requires two independent sets of tagged data:
- A Training Data Set (to train the machine learning algorithm) and
- An Evaluation Data Set (to measure the efficiency of the ml algorithm).
A tagged line of text may look like this:
- 1 Grow up you biased child.
- 0 I hope this helps.
where "1" in the first column denotes a negative sentiment and "0" in the first column denotes a positive sentiment. As a rule of thumb, the ML algorithm works better the more training data we have. The training data, and the data classified later on, should also be clean and of high quality to support an effective algorithm.
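As a rough illustration of the tagged format above, the following sketch splits one line into its tag and its text. It assumes that the two columns are separated by a tab character, which is not stated here, and the names TaggedLine and TaggedLineParser are hypothetical helpers that are not part of ML.Net.

```csharp
using System;
using System.Globalization;

// Hypothetical container for one tagged line; not an ML.Net type.
public class TaggedLine
{
    public float Label;
    public string Text;
}

public static class TaggedLineParser
{
    // Splits "<label><separator><text>" at the first separator only,
    // so that separators inside the text are preserved.
    // Assumption: the separator is a tab character.
    public static TaggedLine Parse(string line, char separator = '\t')
    {
        int pos = line.IndexOf(separator);
        if (pos < 0)
            throw new FormatException("No separator found in line: " + line);

        return new TaggedLine
        {
            // InvariantCulture keeps the parsing independent of the local number format.
            Label = float.Parse(line.Substring(0, pos), CultureInfo.InvariantCulture),
            Text = line.Substring(pos + 1)
        };
    }
}

public static class Program
{
    public static void Main()
    {
        TaggedLine item = TaggedLineParser.Parse("1\tGrow up you biased child.");
        Console.WriteLine("Label: {0} | Text: '{1}'", item.Label, item.Text);
    }
}
```

In the samples below this parsing is done by the TextLoader step of the pipeline, so a helper like this is only useful for inspecting or cleaning a data file by hand.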
The overall work-flow to determine an effective algorithm using KPIs is denoted by the diagram on the left side below, where we (ideally) find a model that reflects our classification problem best. The model is not explained in more detail here. It is, in the case of ML.Net, a zip file containing the persisted facts learned from the tagged training data.
The second, independent data set is used for evaluation, that is, to determine KPIs for the efficiency of the learned classification. This step estimates how well our algorithm will classify items in the future by comparing the result from the machine learning algorithm with the available tag (without using the tag in the algorithm). One KPI to measure efficiency is, for example, the accuracy: the percentage of correctly classified items among all evaluated items. We can always go back to the training step and adjust parameters, or swap one algorithm for another, if we find that our KPIs do not meet our expectations and we need to optimize the model.
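The comparison of predicted and known tags can be sketched in a few lines. This is only a minimal illustration of the accuracy KPI with made-up tags; the class name AccuracyDemo is hypothetical, and ML.Net computes this value for us in the samples below.

```csharp
using System;

public static class AccuracyDemo
{
    // Fraction of items whose predicted tag equals the known tag.
    public static double Accuracy(bool[] actual, bool[] predicted)
    {
        if (actual.Length != predicted.Length)
            throw new ArgumentException("Both arrays must have the same length.");
        if (actual.Length == 0)
            throw new ArgumentException("At least one item is required.");

        int correct = 0;
        for (int i = 0; i < actual.Length; i++)
        {
            if (actual[i] == predicted[i])
                correct++;
        }
        return (double)correct / actual.Length;
    }

    public static void Main()
    {
        // Made-up tags for five evaluation items (not taken from the sample data).
        bool[] actual    = { true, false, true, true,  false };
        bool[] predicted = { true, true,  true, false, false };

        // Three of five items match, so the accuracy is 0.6.
        // The "P2" format prints a percentage using the local number format.
        Console.WriteLine("Accuracy: {0:P2}", Accuracy(actual, predicted));
    }
}
```

The percentage format depends on the culture of the machine, which is why the same KPI can show up with a decimal comma in one console output and with a decimal point in another.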
The Training stage hopefully ends with an effective model which can be applied in the second Prediction stage to classify each item that we see in the future. This stage requires the model from the previous stage and the item to classify, which is used to output a prediction of a classification (eg.: positive or negative sentiment).
This was a brief overview of the work-flow for supervised machine learning. We need to understand it to work with the code samples discussed further below. So, let's look at each sample in turn.
The sample discussed in this section is based on A Sentiment Analysis Binary Classification Scenario from the ML.Net tutorial.
The work-flow discussed in the previous section is implemented to some degree in the demo projects attached to this article. The demo solution contains two executable projects, Training and Prediction.
We get the following output if we compile and start the Training project:
Training Data Set
-----------------
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Processed 250 instances
Binning and forming Feature objects
Reserved memory for tree learner: 1943796 bytes
Starting to train ...
Not training a calibrator because it is not needed.
Evaluating Training Results
---------------------------
PredictionModel quality metrics evaluation
------------------------------------------
Accuracy: 61,11%
Auc: 96,30%
F1Score: 72,00%
We see here how the program first trains a model and evaluates the result in the second step.
The Training and the Prediction module share a reference to the previously mentioned Model.zip file (which must be copied manually, see details below), a reference to the ML.Net library, and a common model of the data input and the classification output defined in the Models project:
public class ClassificationData
{
[Column(ordinal: "0", name: "Label")]
public float Sentiment;
[Column(ordinal: "1")]
public string Text;
}
public class ClassPrediction
{
[ColumnName("PredictedLabel")]
public bool Class;
}
Public Class ClassificationData
<Column("0", "Label")>
Public Sentiment As Single
<Column("1")>
Public Text As String
End Class
Public Class ClassPrediction
<ColumnName("PredictedLabel")>
Public [Class] As Boolean
End Class
The members defined in ClassificationData map each column of the text input file to an input. The Label column defines the item that contains the class definition that we want to train against for each line of text. The Text member itself cannot be labeled as a "Feature" because it consists of more than one "column" (in the text file). This is why we need to add the new TextFeaturizer("Features", "Text") line in the pipeline below to read the text into the input data structure.
ClassificationData is a rough description of our input and how it should be mapped into either a Label or a Feature. Try removing the Label column definition, then compile and execute, to verify that the system throws an exception if a column named Label cannot be found in the input.
ClassPrediction states only one binary output result, which is expected to be a Boolean value that maps the input to one of the two classes. This part is relevant to:
- Verify whether learning was successful (with known input in the test phase) and
- Determine the actual classification of the machine learning algorithm when using its model in production.
In summary: ClassificationData describes how we wish to process the input (always consisting of Label and Features), and ClassPrediction maps this input to a learned result.
The Training pipeline that consumes the text input via the ClassificationData definition looks like this:
internal static async Task<PredictionModel<ClassificationData, ClassPrediction>>
TrainAsync(string trainingDataFile, string modelPath)
{
var pipeline = new LearningPipeline();
pipeline.Add(new TextLoader(trainingDataFile).CreateFrom<ClassificationData>());
pipeline.Add(new TextFeaturizer("Features", "Text"));
pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });
PredictionModel<ClassificationData, ClassPrediction> model =
pipeline.Train<ClassificationData, ClassPrediction>();
await model.WriteAsync(modelPath);
return model;
}
Async Function TrainAsync(
ByVal trainingDataFile As String,
ByVal modelPath As String) As Task(Of PredictionModel(Of ClassificationData, ClassPrediction))
Dim pipeline = New LearningPipeline()
pipeline.Add(New TextLoader(trainingDataFile).CreateFrom(Of ClassificationData)())
pipeline.Add(New TextFeaturizer("Features", "Text"))
pipeline.Add(New StochasticDualCoordinateAscentBinaryClassifier())
Dim model As PredictionModel(Of ClassificationData, ClassPrediction) =
pipeline.Train(Of ClassificationData, ClassPrediction)()
Await model.WriteAsync(modelPath)
Return model
End Function
The ML.Net framework comes with an extensible pipeline concept in which the different processing steps can be plugged in as shown above. The TextLoader step loads the data from the text file and the TextFeaturizer step converts the given input text into a feature vector, which is a numerical representation of the given text. This numerical representation is then fed into something that the ML community calls a learner. The learner in the C# sample is a FastTreeBinaryClassifier, while the VB.Net sample uses a StochasticDualCoordinateAscentBinaryClassifier.
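To get a feeling for what "a numerical representation of the given text" can mean, here is a deliberately simple bag-of-words sketch that counts words. It is not how the TextFeaturizer works internally (the details of that step are not covered here); the class BagOfWords and its methods are hypothetical names used for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BagOfWords
{
    static readonly char[] Separators = { ' ', '.', ',', '!', '?', ';', ':' };

    // Lower-cases the text and splits it into words.
    public static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant()
                   .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Assigns every distinct word of the corpus a fixed position in the vector.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> corpus)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (string word in corpus.SelectMany(line => Tokenize(line)))
        {
            if (!vocabulary.ContainsKey(word))
                vocabulary.Add(word, vocabulary.Count);
        }
        return vocabulary;
    }

    // Counts how often each vocabulary word occurs in the text; unknown words are ignored.
    public static float[] Featurize(string text, Dictionary<string, int> vocabulary)
    {
        var features = new float[vocabulary.Count];
        foreach (string word in Tokenize(text))
        {
            int index;
            if (vocabulary.TryGetValue(word, out index))
                features[index] += 1f;
        }
        return features;
    }

    public static void Main()
    {
        string[] corpus = { "I hope this helps.", "Grow up you biased child." };
        Dictionary<string, int> vocabulary = BuildVocabulary(corpus);

        float[] vector = Featurize("I hope you grow up", vocabulary);
        Console.WriteLine(string.Join(" ", vector)); // prints: 1 1 0 0 1 1 1 0 0
    }
}
```

The point of the sketch is only that every text, regardless of its length, ends up as a vector of the same size that a learner can work with.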
A learner or trainer is the component that converts the numerical feature vectors into a model that can later be used to classify input. The documentation for these learners is currently under construction and not all learners are fully implemented and tested yet. For binary classification, a few alternative learners can be swapped in (just edit the constructor call in the pipeline); the following table compares them:
Classification Method | Accuracy | Auc | F1Score |
new AveragedPerceptronBinaryClassifier() | 61.11% | 81.48% | 72.00% |
new FastForestBinaryClassifier() { NumThreads = 2, NumLeaves = 25, NumTrees = 25, MinDocumentsInLeafs = 2 } | 72.22% | 97.53% | 78.26% |
new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 } | 61.11% | 96.30% | 72.00% |
new GeneralizedAdditiveModelBinaryClassifier() | 50.00% | 83.95% | 66.67% |
new LinearSvmBinaryClassifier() | 72.22% | 90.12% | 76.19% |
new LogisticRegressionBinaryClassifier() | 50.00% | 86.42% | 66.67% |
new StochasticDualCoordinateAscentBinaryClassifier() | 83.33% | 98.77% | 85.71% |
new StochasticGradientDescentBinaryClassifier() | 55.56% | 90.12% | 69.23% |
Testing all of the above learners shows that the StochasticDualCoordinateAscentBinaryClassifier works best based on the measured KPIs. These KPIs are measured by an instance of BinaryClassificationMetrics, which also offers other KPIs, such as Precision and Recall. Note that you can analyse further KPIs, such as memory consumption and processing time, which are not measured here. The test presented is rather small and brief. Different settings of the individual learners may still reveal significant improvements. Being able to play with these different scenarios looks like an interesting exercise when we face the problem of automatically classifying a large number of items (text, images, etc.).
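For readers who wonder how Accuracy, Precision, Recall, and F1Score relate to each other, the following sketch derives them from the four counts of a binary evaluation (true/false positives and negatives) using the usual textbook definitions. The class BinaryKpis is a hypothetical helper and the counts are made up: they describe 18 items that are all predicted as positive, which happens to reproduce the 50.00% / 66.67% pair seen in two rows of the table above. The real counts behind the table are not given.

```csharp
using System;

public static class BinaryKpis
{
    // Share of all items that were classified correctly.
    public static double Accuracy(int tp, int tn, int fp, int fn)
    {
        return (double)(tp + tn) / (tp + tn + fp + fn);
    }

    // Share of the items predicted as positive that really are positive.
    public static double Precision(int tp, int fp)
    {
        return (double)tp / (tp + fp);
    }

    // Share of the really positive items that were found.
    public static double Recall(int tp, int fn)
    {
        return (double)tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    public static double F1Score(int tp, int fp, int fn)
    {
        double p = Precision(tp, fp);
        double r = Recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void Main()
    {
        // Hypothetical counts: 9 positive and 9 negative items, all predicted as positive.
        int tp = 9, tn = 0, fp = 9, fn = 0;

        Console.WriteLine("Accuracy:  {0:F4}", Accuracy(tp, tn, fp, fn));
        Console.WriteLine("Precision: {0:F4}", Precision(tp, fp));
        Console.WriteLine("Recall:    {0:F4}", Recall(tp, fn));
        Console.WriteLine("F1Score:   {0:F4}", F1Score(tp, fp, fn));
    }
}
```

The sketch omits guards against empty classes (a division by zero yields NaN here), which is acceptable for an illustration but not for production code.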
So, this is in a nutshell how machine learning can work. The machine consumes data (text), converts it into numerical vectors, and integrates the vectorized data into a model. The model is the main output of the first stage. Let's have a look at the Prediction stage to understand the complete work-flow.
The Prediction stage is the module that represents the code that runs in production and classifies data as new items arrive in the system. This part is implemented in the PredictAsync method of the Prediction project in the Wikipedia_SentimentAnalysis solution. The code for this method looks like this:
async Task<PredictionModel<ClassificationData, ClassPrediction>> PredictAsync(
string modelPath,
string[] classNames,
IEnumerable<ClassificationData> predicts = null,
PredictionModel<ClassificationData, ClassPrediction> model = null)
{
if (model == null)
model = await PredictionModel.ReadAsync<ClassificationData, ClassPrediction>(modelPath);
if (predicts == null)
return model;
IEnumerable<ClassPrediction> predictions = model.Predict(predicts);
Console.WriteLine("Classification Predictions");
IEnumerable<(ClassificationData sentiment, ClassPrediction prediction)> sentimentsAndPredictions =
predicts.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
foreach (var item in sentimentsAndPredictions)
{
string textDisplay = item.sentiment.Text;
if (textDisplay.Length > 80)
textDisplay = textDisplay.Substring(0, 75) + "...";
Console.WriteLine("Prediction: {0} | Text: '{1}'",
(item.prediction.Class ? classNames[0] : classNames[1]), textDisplay);
}
Console.WriteLine();
return model;
}
Async Function PredictAsync(ByVal modelPath As String,
ByVal Optional classNames As String() = Nothing,
ByVal Optional predicts As IEnumerable(Of ClassificationData) = Nothing,
ByVal Optional model As PredictionModel(Of ClassificationData, ClassPrediction) = Nothing) As Task(Of PredictionModel(Of ClassificationData, ClassPrediction))
If model Is Nothing Then
model = Await PredictionModel.ReadAsync(Of ClassificationData, ClassPrediction)(modelPath)
End If
If predicts Is Nothing Then Return model
Console.WriteLine()
Console.WriteLine("Classification Predictions")
Console.WriteLine("--------------------------")
For Each Item In predicts
Dim predictedResult = model.Predict(Item)
Dim textDisplay As String = Item.Text
If textDisplay.Length > 80 Then
textDisplay = textDisplay.Substring(0, 75) & "..."
End If
Dim resultClass = classNames(0)
If predictedResult.[Class] = False Then resultClass = classNames(1)
Console.WriteLine("Prediction: {0} | Text: '{1}'", resultClass, textDisplay)
Next
Console.WriteLine()
Return model
End Function
The PredictionModel.ReadAsync line in the method loads the model from the file system into an in-memory PredictionModel:
PredictionModel<ClassificationData, ClassPrediction> model = await
PredictionModel.ReadAsync<ClassificationData, ClassPrediction>(modelPath);
model = Await PredictionModel.ReadAsync(Of ClassificationData, ClassPrediction)(modelPath)
The model loaded is stored in the project's Learned folder. This Model.zip file has to be copied from the Training module's output whenever we find a significant improvement and want to take advantage of it in the Prediction module.
Everything below the model loading code line evaluates input against the loaded model and outputs a predicted classification in the last part of the method. You can use the interactive input prompt to test sample texts of your own and check on a small scale what was learned and what was not. Remember that the learned data is usually cleaned (not the same as the original input) and that you can only test on a small scale like this. A better and more reasonable test is probably to feed in the last n text lines from a real data source, get their classification, and see whether an independent reviewer arrives at a closely matching result.
We have seen in this section how binary classification can work for sentiment analysis in a very "simple" scenario. But the real strength of ML is that each type of question (here: Is this A or B?) can be applied in a wide variety of applications. Let's look at one more sample in the next section to review another binary classification use case.
The data for the sample discussed in this section is based on the CodeProject article You've Got Spam. The aim of this binary classification project is to determine whether a given text should be classified as Spam or not.
The source code attached to this article for the YouGotSpam_Analysis solution is almost identical to the code explained in the last section. Even the executable projects are in fact almost identical. The only difference here is the data source for training and test evaluation, which is in this case the test data from the above CodeProject article (see Data folder in Training project). The Training project produces this output:
Training Data Set
-----------------
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Processed 2000 instances
Binning and forming Feature objects
Reserved memory for tree learner: 24082752 bytes
Starting to train ...
Not training a calibrator because it is not needed.
Evaluating Training Results
---------------------------
PredictionModel quality metrics evaluation
------------------------------------------
Accuracy: 100,00%
Auc: 100,00%
F1Score: 100,00%
...which indicates that we can reach the same KPIs as indicated in the original article based on Python.
You can again use the Prediction project to load a model from the file system and test it with further input.
The projects discussed so far have shown that ML.Net can be helpful to determine a binary classification in an automated fashion. But what if I want to classify more than 2 classes (eg: negative, neutral, and positive sentiment)? The next section examines classifying data for this use case.
The data for the sample discussed in this section was downloaded from http://wortschatz.uni-leipzig.de and pre-processed (removed quote character ") for improved parsing experience.
The multiclass classification use case discussed here is the detection of a language based on a given text. Just imagine you have teams of social media agents and you are trying to relay online customer feedback (eg.: chats) in different languages to the team that speaks that language.
The LanguageDetection solution attached to this article follows the structure of the previously discussed binary classification samples. We have a Training project, a Prediction project, and a Models class library that is shared between the executables. The Training project can be used to create a model with a particular learner. A successful model can then be copied from the Training project to the Prediction project, where it is consumed for the multiclass classification of future input.
The Prediction project of the LanguageDetection solution differs in how we define the LanguageClass property in the ClassificationData class and the Class property in the ClassPrediction class. Both must be of the data type float to support multiple classes:
public class ClassificationData
{
[Column(ordinal: "0", name: "Label")]
public float LanguageClass;
[Column(ordinal: "1")]
public string Text;
}
public class ClassPrediction
{
[ColumnName("PredictedLabel")]
public float Class;
}
Public Class ClassificationData
<Column("0", "Label")>
Public LanguageClass As Single
<Column("1")>
Public Text As String
End Class
Public Class ClassPrediction
<ColumnName("PredictedLabel")>
Public [Class] As Single
End Class
The input mapping in ClassificationData is the same as the one in the binary classification problem. The only difference is now that we have more than two values in the Label column of the text file that is being fed in.
The output mapping in ClassPrediction is different because we now have to map to a float value in order to classify towards more than one class.
The required training pipeline looks like this:
async Task<PredictionModel<ClassificationData, ClassPrediction>>
TrainAsync(string trainingDataFile, string modelPath)
{
var pipeline = new LearningPipeline();
pipeline.Add(new TextLoader(trainingDataFile).CreateFrom<ClassificationData>());
pipeline.Add(new Dictionarizer("Label"));
pipeline.Add(new TextFeaturizer("Features", "Text"));
pipeline.Add(new StochasticDualCoordinateAscentClassifier());
pipeline.Add(new PredictedLabelColumnOriginalValueConverter() { PredictedLabelColumn = "PredictedLabel" });
PredictionModel<ClassificationData, ClassPrediction> model =
pipeline.Train<ClassificationData, ClassPrediction>();
await model.WriteAsync(modelPath);
return model;
}
Async Function TrainAsync(ByVal trainingDataFile As String,
ByVal modelPath As String) As Task(Of PredictionModel(Of ClassificationData, ClassPrediction))
Dim pipeline = New LearningPipeline()
pipeline.Add(New TextLoader(trainingDataFile).CreateFrom(Of ClassificationData)())
pipeline.Add(New Dictionarizer("Label"))
pipeline.Add(New TextFeaturizer("Features", "Text"))
pipeline.Add(New StochasticDualCoordinateAscentClassifier())
pipeline.Add(New PredictedLabelColumnOriginalValueConverter() With {
.PredictedLabelColumn = "PredictedLabel"
})
Dim model As PredictionModel(Of ClassificationData, ClassPrediction) =
pipeline.Train(Of ClassificationData, ClassPrediction)()
Await model.WriteAsync(modelPath)
Return model
End Function
The Dictionarizer("Label") step maps each line with a labeled input value (0-5) into a bucket. The PredictedLabelColumnOriginalValueConverter maps the predicted value (a vector) back to the original value's data type (a float).
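Conceptually, the two steps form a round trip: label values are mapped to bucket indices before training, and a predicted index is mapped back to the original value afterwards. The sketch below illustrates that idea only. It is not the ML.Net implementation, the class LabelDictionary is a hypothetical name, and the label values are made up.

```csharp
using System;
using System.Collections.Generic;

// Conceptual stand-in for a label dictionary; not an ML.Net type.
public class LabelDictionary<T>
{
    private readonly Dictionary<T, int> _indexByValue = new Dictionary<T, int>();
    private readonly List<T> _valueByIndex = new List<T>();

    // Returns the index of a label value and registers the value when it is seen first.
    public int GetOrAdd(T value)
    {
        int index;
        if (!_indexByValue.TryGetValue(value, out index))
        {
            index = _valueByIndex.Count;
            _indexByValue.Add(value, index);
            _valueByIndex.Add(value);
        }
        return index;
    }

    // Maps a predicted index back to the original label value.
    public T GetValue(int index)
    {
        return _valueByIndex[index];
    }

    public int Count
    {
        get { return _valueByIndex.Count; }
    }
}

public static class Program
{
    public static void Main()
    {
        var labels = new LabelDictionary<float>();

        // Made-up label column as it might appear in a training file.
        float[] labelColumn = { 3f, 0f, 5f, 3f, 1f };
        foreach (float label in labelColumn)
            Console.WriteLine("{0} -> index {1}", label, labels.GetOrAdd(label));

        // A learner works with indices; the last step converts an index back.
        Console.WriteLine("index 2 -> {0}", labels.GetValue(2));
    }
}
```

With such a round trip in place, the data type of the label (float here, string in a later sample) no longer matters to the learner itself.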
Compiling and running the Training module gets us this output:
Training Data Set
-----------------
Not adding a normalizer.
Using 4 threads to train.
Automatically choosing a check frequency of 4.
Auto-tuning parameters: maxIterations = 48.
Auto-tuning parameters: L2 = 2.778334E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 1.
Using best model from iteration 8.
Not training a calibrator because it is not needed.
Evaluating Training Results
---------------------------
PredictionModel quality metrics evaluation
------------------------------------------
Accuracy Macro: 98.66%
Accuracy Micro: 98.66%
Top KAccuracy: 0.00%
LogLoss: 7.50%
PerClassLogLoss:
Class: 0 - 11.18%
Class: 1 - 4.08%
Class: 2 - 5.95%
Class: 3 - 10.43%
Class: 4 - 7.86%
Class: 5 - 5.52%
There are three multiclass classification learners in ML.Net version 0.2, and their KPIs compare as indicated below:
Classification Method | Accuracy Macro | Accuracy Micro | Top KAccuracy | LogLoss | PerClassLogLoss (Class 0 / 1 / 2 / 3 / 4 / 5) |
new StochasticDualCoordinateAscentClassifier() | 98.66% | 98.66% | 0.00% | 7.50% | 11.18% / 4.08% / 5.95% / 10.43% / 7.86% / 5.52% |
new LogisticRegressionClassifier() | 98.52% | 98.52% | 0.00% | 8.63% | 13.32% / 4.67% / 7.09% / 11.50% / 8.98% / 6.19% |
new NaiveBayesClassifier() | 96.58% | 96.58% | 0.00% | 3,453.88% | 3,453.88% / 3,453.88% / 3,453.88% / 3,453.88% / 3,453.88% / 3,453.88% |
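The multiclass KPIs in the table can be approximated with the commonly used definitions sketched below: micro accuracy counts correct items over all items, macro accuracy averages the per-class accuracy, and log-loss averages the negative natural logarithm of the probability assigned to the true class. The class MulticlassKpis and all data in it are made up for illustration; whether ML.Net computes its values in exactly this way is an assumption.

```csharp
using System;
using System.Linq;

public static class MulticlassKpis
{
    // Micro accuracy: fraction of all items that were classified correctly.
    public static double MicroAccuracy(int[] actual, int[] predicted)
    {
        int correct = actual.Where((label, i) => label == predicted[i]).Count();
        return (double)correct / actual.Length;
    }

    // Macro accuracy: accuracy computed per class, then averaged over the classes.
    public static double MacroAccuracy(int[] actual, int[] predicted, int classCount)
    {
        double sum = 0;
        int classesSeen = 0;
        for (int c = 0; c < classCount; c++)
        {
            int total = 0, correct = 0;
            for (int i = 0; i < actual.Length; i++)
            {
                if (actual[i] != c) continue;
                total++;
                if (predicted[i] == c) correct++;
            }
            if (total == 0) continue; // class does not occur in the evaluation data
            sum += (double)correct / total;
            classesSeen++;
        }
        return sum / classesSeen;
    }

    // Log-loss: mean negative natural logarithm of the probability of the true class.
    public static double LogLoss(double[] probabilityOfTrueClass)
    {
        return probabilityOfTrueClass.Average(p => -Math.Log(p));
    }

    public static void Main()
    {
        // Made-up evaluation data: true class and predicted class per item.
        int[] actual    = { 0, 0, 0, 0, 1, 2 };
        int[] predicted = { 0, 0, 0, 1, 1, 2 };
        // Made-up probabilities that a model assigned to the true class of each item.
        double[] probabilities = { 0.9, 0.8, 0.95, 0.3, 0.7, 0.99 };

        Console.WriteLine("Accuracy Micro: {0:F4}", MicroAccuracy(actual, predicted));
        Console.WriteLine("Accuracy Macro: {0:F4}", MacroAccuracy(actual, predicted, 3));
        Console.WriteLine("LogLoss:        {0:F4}", LogLoss(probabilities));
        // A probability of 1e-15 for the true class gives a loss of about 34.5388.
        Console.WriteLine("-ln(1e-15):     {0:F4}", -Math.Log(1e-15));
    }
}
```

Note that the console output above prints the log-loss with a percent format, so 7.50% corresponds to a loss of 0.075. The value 3,453.88% of the NaiveBayesClassifier equals 34.5388, which is -ln(1e-15); this may hint at probabilities being clamped to a tiny minimum value, but that is a guess and not something the output confirms.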
So, this is how we can classify text into multiple classes based on one Feature input column. The same machine learning approach (binary or multiclass) is also available for more than one feature input column, as we will see next.
The multiclass classification problem discussed in this section is a well known reference test in the pattern recognition community [4]. The original database was created by Ronald Fisher in 1936, and the ML.Net sample reviewed here comes from the Get Started section of the ML.Net tutorial. The problem statement is to create an algorithm that accepts an input vector of multiple float values (representing properties of the flower) and outputs the most likely name of the flower.
Doing this in ML.Net requires us to create an input mapping with more than one column:
public class ClassificationData
{
[Column("0")]
public float SepalLength;
[Column("1")]
public float SepalWidth;
[Column("2")]
public float PetalLength;
[Column("3")]
public float PetalWidth;
[Column("4")]
[ColumnName("Label")]
public string Label;
}
Public Class ClassificationData
<Column("0")>
Public SepalLength As Single
<Column("1")>
Public SepalWidth As Single
<Column("2")>
Public PetalLength As Single
<Column("3")>
Public PetalWidth As Single
<Column("4")>
<ColumnName("Label")>
Public Label As String
End Class
We are inputting a set of feature columns (namely SepalLength, SepalWidth, PetalLength, PetalWidth) that is later combined into one Features vector. The Label is in this case a string that is given as the last column to identify each data row during the training and test stages of the algorithm.
The predicted class should (not surprisingly) be a string:
public class ClassPrediction
{
[ColumnName("PredictedLabel")]
public string Class;
}
Public Class ClassPrediction
<ColumnName("PredictedLabel")>
Public [Class] As String
End Class
The training code for this case is very similar to the previous section:
async Task<PredictionModel<ClassificationData, ClassPrediction>>
TrainAsync(string trainingDataFile, string modelPath)
{
var pipeline = new LearningPipeline();
pipeline.Add(new TextLoader(trainingDataFile).CreateFrom<ClassificationData>(separator: ','));
pipeline.Add(new Dictionarizer("Label"));
pipeline.Add(new ColumnConcatenator("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"));
pipeline.Add(new StochasticDualCoordinateAscentClassifier());
pipeline.Add(new PredictedLabelColumnOriginalValueConverter() { PredictedLabelColumn = "PredictedLabel" });
PredictionModel<ClassificationData, ClassPrediction> model =
pipeline.Train<ClassificationData, ClassPrediction>();
await model.WriteAsync(modelPath);
return model;
}
Async Function TrainAsync(ByVal trainingDataFile As String,
ByVal modelPath As String) As Task(Of PredictionModel(Of ClassificationData, ClassPrediction))
Dim pipeline = New LearningPipeline()
pipeline.Add(New TextLoader(trainingDataFile).CreateFrom(Of ClassificationData)(separator:=","c))
pipeline.Add(New Dictionarizer("Label"))
pipeline.Add(New ColumnConcatenator("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"))
pipeline.Add(New StochasticDualCoordinateAscentClassifier())
pipeline.Add(New PredictedLabelColumnOriginalValueConverter() With {
.PredictedLabelColumn = "PredictedLabel"
})
Dim model As PredictionModel(Of ClassificationData, ClassPrediction) =
pipeline.Train(Of ClassificationData, ClassPrediction)()
Await model.WriteAsync(modelPath)
Return model
End Function
There are only two new things here. The raw input data is in this case a comma separated list; therefore, we have to use the separator: ',' parameter when loading the data from the text file in the pipeline. And we use the ColumnConcatenator to convert the set of feature columns into one column consisting of a vector named Features.
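The two new steps can be pictured with plain C#: splitting a comma separated line into columns and then combining the four numeric columns into one vector. The sketch below is a conceptual illustration only. IrisRow and IrisSketch are hypothetical names, and the sample line follows the commonly distributed layout of the Iris data, which may differ from the data file attached to this article.

```csharp
using System;
using System.Globalization;

// Hypothetical row type for this sketch; not the ClassificationData class shown above.
public class IrisRow
{
    public float SepalLength, SepalWidth, PetalLength, PetalWidth;
    public string Label;
}

public static class IrisSketch
{
    // Parses a number with a decimal point regardless of the local number format.
    static float ParseFloat(string text)
    {
        return float.Parse(text, CultureInfo.InvariantCulture);
    }

    // Parses one comma separated line: four numbers followed by the label.
    public static IrisRow ParseLine(string line)
    {
        string[] parts = line.Split(',');
        if (parts.Length != 5)
            throw new FormatException("Expected 5 columns but found " + parts.Length);

        return new IrisRow
        {
            SepalLength = ParseFloat(parts[0]),
            SepalWidth  = ParseFloat(parts[1]),
            PetalLength = ParseFloat(parts[2]),
            PetalWidth  = ParseFloat(parts[3]),
            Label       = parts[4].Trim()
        };
    }

    // Combines the four named columns into one feature vector.
    public static float[] ConcatenateFeatures(IrisRow row)
    {
        return new[] { row.SepalLength, row.SepalWidth, row.PetalLength, row.PetalWidth };
    }

    public static void Main()
    {
        // Assumed sample line in the usual Iris layout.
        IrisRow row = ParseLine("5.1,3.5,1.4,0.2,Iris-setosa");
        float[] features = ConcatenateFeatures(row);

        Console.WriteLine("Label: {0} | Features: {1} values", row.Label, features.Length);
    }
}
```

The order of the columns in the vector matters: it has to be the same during training and prediction, which is why the pipeline names the columns explicitly.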
The output is similar to what we've seen before (and we can again experiment with the other two learners as shown in the last section):
Training Data Set
-----------------
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Using 4 threads to train.
Automatically choosing a check frequency of 4.
Auto-tuning parameters: maxIterations = 45452.
Auto-tuning parameters: L2 = 2.667051E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 0.
Using best model from iteration 1956.
Not training a calibrator because it is not needed.
Evaluating Training Results
---------------------------
PredictionModel quality metrics evaluation
------------------------------------------
Accuracy Macro: 95.73%
Accuracy Micro: 95.76%
Top KAccuracy: 0.00%
LogLoss: 8.19%
PerClassLogLoss:
Class: 0 - 0.72%
Class: 1 - 10.62%
Class: 2 - 13.43%
Again, we can use the Training module of the IrisClassification solution to train different learners and settings and use the Prediction module to predict new classifications with the previously determined model.
We have seen in this section how 4 input columns (SepalLength, SepalWidth, PetalLength, PetalWidth) are converted into one vectorized Features column using the ColumnConcatenator converter. An equivalent approach that does not require a ColumnConcatenator in the pipeline code is to use the following input class definition:
public class ClassificationData
{
public float SepalLength
{
get { return Features[0]; }
set { Features[0] = value; }
}
public float SepalWidth
{
get { return Features[1]; }
set { Features[1] = value; }
}
public float PetalLength
{
get { return Features[2]; }
set { Features[2] = value; }
}
public float PetalWidth
{
get { return Features[3]; }
set { Features[3] = value; }
}
[Column("0-3")]
[ColumnName("Features")]
[VectorType(4)] public float[] Features = new float[4];
[Column("4")]
[ColumnName("Label")]
public string Label;
}
Public Class ClassificationData
Public Property SepalLength As Single
Get
Return Features(0)
End Get
Set(ByVal value As Single)
Features(0) = value
End Set
End Property
Public Property SepalWidth As Single
Get
Return Features(1)
End Get
Set(ByVal value As Single)
Features(1) = value
End Set
End Property
Public Property PetalLength As Single
Get
Return Features(2)
End Get
Set(ByVal value As Single)
Features(2) = value
End Set
End Property
Public Property PetalWidth As Single
Get
Return Features(3)
End Get
Set(ByVal value As Single)
Features(3) = value
End Set
End Property
<Column("0-3","Features")>
<VectorType(4)>
Public Features As Single() = New Single(3) {}
<Column("4", "Label")>
Public Label As String
End Class
But it is bad practice to define the actual feature set through the ClassificationData definition as shown above. We should, therefore, remove the [ColumnName("Features")] line and add new ColumnConcatenator("Features", nameof(ClassificationData.Features)) to the pipeline code instead. This design gives us more flexibility when trying to evaluate different feature configurations.
Let us suppose for a moment that we do not want the machine learning algorithm to handle strings (since we really want to localize that part of the application). It would be better practice to go back to handling integer values and treat each integer as an index that indicates the classification (type of flower). But how exactly can this be done? We can change the definition of the input and predicted output like so:
public class ClassificationData
{
public float SepalLength
{
get { return Features[0]; }
set { Features[0] = value; }
}
public float SepalWidth
{
get { return Features[1]; }
set { Features[1] = value; }
}
public float PetalLength
{
get { return Features[2]; }
set { Features[2] = value; }
}
public float PetalWidth
{
get { return Features[3]; }
set { Features[3] = value; }
}
[Column("0-3")]
[ColumnName("Features")]
[VectorType(4)] public float[] Features = new float[4];
[Column("4")]
[ColumnName("Label")]
public float Label;
}
public class ClassPrediction
{
[ColumnName("PredictedLabel")]
public uint Class;
}
Public Class ClassificationData
Public Property SepalLength As Single
Get
Return Features(0)
End Get
Set(ByVal value As Single)
Features(0) = value
End Set
End Property
Public Property SepalWidth As Single
Get
Return Features(1)
End Get
Set(ByVal value As Single)
Features(1) = value
End Set
End Property
Public Property PetalLength As Single
Get
Return Features(2)
End Get
Set(ByVal value As Single)
Features(2) = value
End Set
End Property
Public Property PetalWidth As Single
Get
Return Features(3)
End Get
Set(ByVal value As Single)
Features(3) = value
End Set
End Property
<Column("0-3")>
<ColumnName("Features")>
<VectorType(4)>
Public Features As Single() = New Single(3) {}
<Column("4")>
<ColumnName("Label")>
Public Label As Single
End Class
Public Class ClassPrediction
<ColumnName("PredictedLabel")>
Public [Class] As UInteger
End Class
Next, we have to remove the PredictedLabelColumnOriginalValueConverter from the pipeline of the previous solution; this is all it takes to adjust for this scenario (assuming we adjusted the data as well). This approach can also be verified in the attached IrisClassification_uint solution.
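Once the prediction is an integer, turning it into a display name is a simple lookup that can live in the (localizable) user interface layer. The following sketch assumes that the class indices 0, 1, and 2 stand for the three species in this order; that mapping is an assumption and must match the integers used in your data file, and you should verify which integer the model actually emits for each flower. FlowerNames is a hypothetical helper.

```csharp
using System;

public static class FlowerNames
{
    // Returns the display name for a predicted class index,
    // or a fallback text for indices that are not in the table.
    public static string Describe(uint classIndex, string[] displayNames)
    {
        return classIndex < displayNames.Length
            ? displayNames[classIndex]
            : "Unknown class " + classIndex;
    }

    public static void Main()
    {
        // Assumed mapping from class index to name; in a localized application
        // these strings would come from a resource file.
        string[] names = { "Iris setosa", "Iris versicolor", "Iris virginica" };

        uint predicted = 2; // stands in for the Class value of a ClassPrediction
        Console.WriteLine(Describe(predicted, names)); // prints: Iris virginica
        Console.WriteLine(Describe(7, names));         // prints: Unknown class 7
    }
}
```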
The reviewed sample applications have shown that ML.Net has an interesting value (even at version 0.2) when it comes to delivering machine learning into the .Net framework. We have seen that binary and multiclass classification can be based on different types of input and output. This input and output always requires:
- a Label and a Features column as input and
- a PredictedLabel column as output.
The data types of the inputs and outputs are flexible because converters can be used to convert values into numbers and vectors when feeding the input into the engine, and the same kind of conversion is possible when we have to interpret the result of a classification.
I hope this article was useful and helps you get started with the subject. Give me your feedback in the form of stars, or let me know if you see essential things to add or change, since this could help us all to develop ML.Net based apps even further.
- [1] Machine Learning Frameworks
- [2] ML.Net at Build 2018
- [3] Machine Learning for the Absolute Beginner
- [4] Pattern Recognition with the Iris Data Set
- 2018-Jun-18 Added VB.Net samples and minor bugfix in Wikipedia sample (changed default learner to best default learner instead of using worst learner by default).