The objective of this article is to share what I learned about embedding a machine learning algorithm such as extreme gradient boosting in a C# application.
Table of Contents
- Introduction
- Background
- Overview of Gradient Boost Classification Algorithm
- XGBoost Library (C#)
- Simple Linear Classification Problem
- Implementing XOR Logic
- Persisting a Model to File
- Iris Dataset
- Using the Code
- History
Introduction
Image source: Wikipedia
In this article, I demonstrate how to use the C# wrapper of the popular XGBoost unmanaged library. XGBoost stands for "Extreme Gradient Boosting". I use the famous Iris dataset to train and test a model. My objective is to share what I learned about embedding a machine learning algorithm like extreme gradient boosting in your C# application. Before moving forward, I must extend my gratitude to the developers of the XGBoost unmanaged library and to the developers of the .NET wrapper library.
Background
This article assumes the reader is comfortable with the following topics at an intermediate level:
- Decision tree algorithm
- Gradient boosting algorithm
- Data normalization
- C#
This article and the accompanying code refrain from providing an in-depth tutorial on decision trees and gradient boosting algorithms. Instead, I have provided links to YouTube training videos which, in my opinion, are of immense educational value.
Overview of Gradient Boost Classification Algorithm
- Intro to Decision Trees (StatQuest)
- Understanding Gini Index while Constructing a Decision Tree
- Intro to AdaBoost
- Intro to Gradient Boost
XGBoost Library (C#)
Managed Wrapper
The C/C++ source code for the original XGBoost library is available on GitHub, along with build instructions for Windows. Thanks to the efforts of PicNet, we can skip the step of compiling the unmanaged sources and jump directly to the managed wrapper.
Simple Linear Classification Problem
We will carry out a simple exercise: train a model to classify two clusters of points that are cleanly linearly separable.
[TestMethod]
public void LinearClassification1()
{
    var xgb = new XGBoost.XGBClassifier();

    // Five points clustered around (0.5, 0.5) and five around (-0.5, -0.5)
    float[][] vectorsTrain = new float[][]
    {
        new[] { 0.5f, 0.5f },
        new[] { 0.6f, 0.6f },
        new[] { 0.6f, 0.4f },
        new[] { 0.4f, 0.6f },
        new[] { 0.4f, 0.4f },
        new[] { -0.5f, -0.5f },
        new[] { -0.6f, -0.6f },
        new[] { -0.6f, -0.4f },
        new[] { -0.4f, -0.6f },
        new[] { -0.4f, -0.4f },
    };

    // Class 1 for the positive cluster, class 0 for the negative cluster
    var labelsTrain = new[]
    {
        1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
        0.0f, 0.0f, 0.0f, 0.0f, 0.0f,
    };

    Assert.AreEqual(vectorsTrain.Length, labelsTrain.Length);
    xgb.Fit(vectorsTrain, labelsTrain);

    // Test points lie well inside the two clusters
    float[][] vectorsTest = new float[][]
    {
        new[] { 0.55f, 0.55f },
        new[] { 0.55f, 0.45f },
        new[] { 0.45f, 0.55f },
        new[] { 0.45f, 0.45f },
        new[] { -0.55f, -0.55f },
        new[] { -0.55f, -0.45f },
        new[] { -0.45f, -0.55f },
        new[] { -0.45f, -0.45f },
    };

    var labelsTestExpected = new[]
    {
        1.0f, 1.0f, 1.0f, 1.0f,
        0.0f, 0.0f, 0.0f, 0.0f,
    };

    float[] labelsTestPredicted = xgb.Predict(vectorsTest);

    // CollectionAssert.AreEqual expects the expected collection first
    CollectionAssert.AreEqual(labelsTestExpected, labelsTestPredicted);
}
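A note on construction: the parameterless constructor above relies on the wrapper's defaults. The constructor also accepts named parameters; the multiclass Iris example later in this article passes objective and numClass. For a binary problem you could make the objective explicit, as in this sketch (assuming the wrapper forwards XGBoost's standard "binary:logistic" objective string; only the objective parameter itself is confirmed by the later example):

// Sketch: explicit binary objective. "binary:logistic" is XGBoost's
// standard objective for binary classification; the `objective` named
// parameter is the same one used in the multiclass Iris example below.
var xgbBinary = new XGBoost.XGBClassifier(objective: "binary:logistic");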
Implementing XOR Logic
XOR logic is more complex than linear classification: the data points are not linearly separable, so no single straight line can split the two classes. To simulate XOR with continuous features, the test below samples random 2D points from the four quadrants of the unit square and labels each point with the XOR of its quadrant's coordinates.
XOR Truth Table
X | Y | OUTPUT
--+---+-------
1 | 0 |   1
0 | 1 |   1
0 | 0 |   0
1 | 1 |   0
Sample Code
[TestMethod]
public void XorClassification()
{
    var xgb = new XGBoost.XGBClassifier();

    int countTrainingPoints = 50;

    // Generate random training points in each quadrant of the unit square.
    // Signature: GenerateRandom2dPoints(count, xMin, xMax, yMin, yMax, label)
    entity.XGBArray trainClass_0_1 = Util.GenerateRandom2dPoints(
        countTrainingPoints / 2, 0.0, 0.5, 0.5, 1.0, 1.0); // x low,  y high -> 1
    entity.XGBArray trainClass_1_0 = Util.GenerateRandom2dPoints(
        countTrainingPoints / 2, 0.5, 1.0, 0.0, 0.5, 1.0); // x high, y low  -> 1
    entity.XGBArray trainClass_0_0 = Util.GenerateRandom2dPoints(
        countTrainingPoints / 2, 0.0, 0.5, 0.0, 0.5, 0.0); // x low,  y low  -> 0
    entity.XGBArray trainClass_1_1 = Util.GenerateRandom2dPoints(
        countTrainingPoints / 2, 0.5, 1.0, 0.5, 1.0, 0.0); // x high, y high -> 0

    entity.XGBArray allVectorsTraining = Util.UnionOfXGBArrays(
        trainClass_0_1, trainClass_1_0, trainClass_0_0, trainClass_1_1);
    xgb.Fit(allVectorsTraining.Vectors, allVectorsTraining.Labels);

    // Test points are sampled away from the quadrant boundaries (0.1..0.4, 0.6..0.9)
    int countTestingPoints = 10;
    entity.XGBArray testClass_0_1 = Util.GenerateRandom2dPoints(
        countTestingPoints, 0.1, 0.4, 0.6, 0.9, 1.0);
    entity.XGBArray testClass_1_0 = Util.GenerateRandom2dPoints(
        countTestingPoints, 0.6, 0.9, 0.1, 0.4, 1.0);
    entity.XGBArray testClass_0_0 = Util.GenerateRandom2dPoints(
        countTestingPoints, 0.1, 0.4, 0.1, 0.4, 0.0);
    entity.XGBArray testClass_1_1 = Util.GenerateRandom2dPoints(
        countTestingPoints, 0.6, 0.9, 0.6, 0.9, 0.0);

    entity.XGBArray allVectorsTest = Util.UnionOfXGBArrays(
        testClass_0_1, testClass_1_0, testClass_0_0, testClass_1_1);

    var resultsActual = xgb.Predict(allVectorsTest.Vectors);

    // Expected collection first, actual second
    CollectionAssert.AreEqual(allVectorsTest.Labels, resultsActual);
}
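The helpers XGBArray, GenerateRandom2dPoints and UnionOfXGBArrays are not shown in the listing above. Here is a hypothetical reconstruction, inferred purely from their call sites; the actual implementations live in the accompanying source on GitHub and may differ:

using System;
using System.Linq;

// Sketch: a pair of parallel arrays, one feature vector per label.
public class XGBArray
{
    public float[][] Vectors { get; set; }
    public float[] Labels { get; set; }
}

public static class Util
{
    static readonly Random _rng = new Random();

    // Generates `count` random 2D points inside the given rectangle,
    // all tagged with the same class label.
    public static XGBArray GenerateRandom2dPoints(
        int count, double xMin, double xMax, double yMin, double yMax, double label)
    {
        var arr = new XGBArray
        {
            Vectors = new float[count][],
            Labels = new float[count]
        };
        for (int i = 0; i < count; i++)
        {
            float x = (float)(xMin + _rng.NextDouble() * (xMax - xMin));
            float y = (float)(yMin + _rng.NextDouble() * (yMax - yMin));
            arr.Vectors[i] = new[] { x, y };
            arr.Labels[i] = (float)label;
        }
        return arr;
    }

    // Concatenates several XGBArrays into one, preserving vector/label pairing.
    public static XGBArray UnionOfXGBArrays(params XGBArray[] arrays)
    {
        return new XGBArray
        {
            Vectors = arrays.SelectMany(a => a.Vectors).ToArray(),
            Labels = arrays.SelectMany(a => a.Labels).ToArray()
        };
    }
}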
Persisting a Model to File
Once a model has been trained and found to produce satisfactory results, you will want to use it in production. The method SaveModelToFile persists the trained model to a binary file, and the static method LoadClassifierFromFile rehydrates the saved model.
var xgbTrainer = new XGBoost.XGBClassifier();
// ... train with xgbTrainer.Fit(vectors, labels) before saving ...
xgbTrainer.SaveModelToFile("SimpleLinearClassifier.dat");

// Later, e.g. in production:
var xgbProduction = XGBoost.XGBClassifier.LoadClassifierFromFile("SimpleLinearClassifier.dat");
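A quick way to gain confidence in the round trip is to verify that the rehydrated model reproduces the original model's predictions. A minimal sketch, using a tiny linearly separable dataset in the spirit of the first test (the test name and data here are illustrative, not from the accompanying source):

[TestMethod]
public void SaveAndReloadProducesSamePredictions()
{
    // Two points per cluster, as in the linear classification example
    float[][] vectors =
    {
        new[] { 0.5f, 0.5f }, new[] { 0.6f, 0.6f },
        new[] { -0.5f, -0.5f }, new[] { -0.6f, -0.6f }
    };
    float[] labels = { 1.0f, 1.0f, 0.0f, 0.0f };

    var trainer = new XGBoost.XGBClassifier();
    trainer.Fit(vectors, labels);
    float[] before = trainer.Predict(vectors);

    trainer.SaveModelToFile("SimpleLinearClassifier.dat");
    var reloaded = XGBoost.XGBClassifier.LoadClassifierFromFile("SimpleLinearClassifier.dat");
    float[] after = reloaded.Predict(vectors);

    // The persisted model should behave identically to the in-memory one
    CollectionAssert.AreEqual(before, after);
}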
Iris Dataset
Overview
Source: Wikipedia
The data set contains 50 records from each of three species of the Iris flower, 150 records in total, and is a classic test case for many statistical classification techniques. Each record has four numeric feature columns (sepal length, sepal width, petal length and petal width, measured in centimetres) and a label column naming one of three species:
- Iris-setosa
- Iris-versicolor
- Iris-virginica
Data Structure
Source: Wikipedia
Parsing IRIS Records from CSV
public class Iris
{
    public float Col1 { get; set; }   // sepal length (cm)
    public float Col2 { get; set; }   // sepal width (cm)
    public float Col3 { get; set; }   // petal length (cm)
    public float Col4 { get; set; }   // petal width (cm)
    public string Petal { get; set; } // species name, e.g. "Iris-setosa"
}

internal static Iris[] LoadIris(string filename)
{
    string pathFull = System.IO.Path.Combine(Util.GetProjectDir2(), filename);
    List<Iris> records = new List<Iris>();

    // TextFieldParser comes from the Microsoft.VisualBasic.FileIO namespace
    using (var parser = new TextFieldParser(pathFull))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        while (!parser.EndOfData)
        {
            var fields = parser.ReadFields();
            Iris oRecord = new Iris();
            oRecord.Col1 = float.Parse(fields[0]);
            oRecord.Col2 = float.Parse(fields[1]);
            oRecord.Col3 = float.Parse(fields[2]);
            oRecord.Col4 = float.Parse(fields[3]);
            oRecord.Petal = fields[4];
            records.Add(oRecord);
        }
    }
    return records.ToArray();
}
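For reference, the parser above expects rows in the standard UCI Iris layout: four comma-separated measurements followed by the species name. Sample rows:

5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica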
Creating a Feature Vector From CSV
internal static XGVector<Iris>[] ConvertFromIrisToFeatureVectors(Iris[] records)
{
    List<XGVector<Iris>> vectors = new List<XGVector<Iris>>();
    foreach (var rec in records)
    {
        XGVector<Iris> newVector = new XGVector<Iris>();
        newVector.Original = rec; // keep the source record for later inspection
        newVector.Features = new float[]
        {
            rec.Col1, rec.Col2, rec.Col3, rec.Col4
        };
        newVector.Label = ConvertLabelFromStringToNumeric(rec.Petal);
        vectors.Add(newVector);
    }
    return vectors.ToArray();
}

internal static float ConvertLabelFromStringToNumeric(string petal)
{
    // XGBoost expects numeric class labels, so map each species to 0, 1 or 2
    if (petal.Contains("setosa"))
    {
        return 0.0f;
    }
    else if (petal.Contains("versicolor"))
    {
        return 1.0f;
    }
    else if (petal.Contains("virginica"))
    {
        return 2.0f;
    }
    else
    {
        throw new ArgumentException($"Unrecognized species: {petal}", nameof(petal));
    }
}
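The training test later in this article also calls ConvertLabelFromNumericToString, the inverse of the mapping above. It is not shown in the listings; a minimal sketch would be:

// Hypothetical inverse of ConvertLabelFromStringToNumeric; the actual
// implementation is in the accompanying source and may differ in detail.
internal static string ConvertLabelFromNumericToString(float label)
{
    switch ((int)label)
    {
        case 0: return "Iris-setosa";
        case 1: return "Iris-versicolor";
        case 2: return "Iris-virginica";
        default: throw new ArgumentException($"Unrecognized label: {label}", nameof(label));
    }
}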
Loading IRIS: Putting It All Together
[TestMethod]
public void BasicLoadData()
{
    string filename = "Iris\\Iris.train.data";
    iris.Iris[] records = IrisUtils.LoadIris(filename);
    entity.XGVector<iris.Iris>[] vectors =
        IrisUtils.ConvertFromIrisToFeatureVectors(records);

    Assert.IsTrue(records.Length >= 140);
    Assert.AreEqual(records.Length, vectors.Length); // one vector per record
}
Training and Testing IRIS
[TestMethod]
public void TrainAndTestIris()
{
    // Load and vectorize the training set
    string filenameTrain = "Iris\\Iris.train.data";
    iris.Iris[] recordsTrain = IrisUtils.LoadIris(filenameTrain);
    entity.XGVector<iris.Iris>[] vectorsTrain =
        IrisUtils.ConvertFromIrisToFeatureVectors(recordsTrain);

    // Load and vectorize the test set
    string filenameTest = "Iris\\Iris.test.data";
    iris.Iris[] recordsTest = IrisUtils.LoadIris(filenameTest);
    entity.XGVector<iris.Iris>[] vectorsTest =
        IrisUtils.ConvertFromIrisToFeatureVectors(recordsTest);

    // "multi:softprob" produces one probability per class for every sample
    int noOfClasses = 3;
    var xgbc = new XGBoost.XGBClassifier(
        objective: "multi:softprob", numClass: noOfClasses);

    entity.XGBArray arrTrain = Util.ConvertToXGBArray(vectorsTrain);
    entity.XGBArray arrTest = Util.ConvertToXGBArray(vectorsTest);
    xgbc.Fit(arrTrain.Vectors, arrTrain.Labels);

    // The output is a flat array: sample 0's three class probabilities,
    // then sample 1's, and so on
    var outcomeTest = xgbc.Predict(arrTest.Vectors);
    for (int index = 0; index < arrTest.Vectors.Length; index++)
    {
        string sExpected = IrisUtils.ConvertLabelFromNumericToString(
            arrTest.Labels[index]);
        float[] arrResults = new float[]
        {
            outcomeTest[index * noOfClasses + 0],
            outcomeTest[index * noOfClasses + 1],
            outcomeTest[index * noOfClasses + 2]
        };

        // The predicted class is the one with the highest probability
        int indexWithMaxValue = Util.GetIndexWithMaxValue(arrResults);
        string sActualClass = IrisUtils.ConvertLabelFromNumericToString(
            (float)indexWithMaxValue);

        Trace.WriteLine($"{index} Expected={sExpected} Actual={sActualClass}");
        Assert.AreEqual(sExpected, sActualClass);
    }

    // Persist the trained multiclass model for later use
    string pathFull = System.IO.Path.Combine(Util.GetProjectDir2(), _fileModelIris);
    xgbc.SaveModelToFile(pathFull);
}
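Two small helpers used above, Util.ConvertToXGBArray and Util.GetIndexWithMaxValue, are not shown in the listings. Here are hypothetical reconstructions, inferred from their call sites (the real implementations are in the accompanying source):

// Flattens an array of XGVector<T> into the parallel Vectors/Labels
// arrays that XGBClassifier.Fit and Predict consume.
internal static XGBArray ConvertToXGBArray<T>(XGVector<T>[] vectors)
{
    return new XGBArray
    {
        Vectors = vectors.Select(v => v.Features).ToArray(),
        Labels = vectors.Select(v => v.Label).ToArray()
    };
}

// Returns the index of the largest element, i.e. the most probable class.
internal static int GetIndexWithMaxValue(float[] values)
{
    int indexOfMax = 0;
    for (int i = 1; i < values.Length; i++)
    {
        if (values[i] > values[indexOfMax])
        {
            indexOfMax = i;
        }
    }
    return indexOfMax;
}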
Using the Code
The complete source code is available on GitHub.
Solution Structure
|
|-----XGBoost
|
|-----XGBoostTests
      |
      |---iris
      |   |
      |   |--Iris.data
      |   |--Iris.test.data
      |   |--Iris.train.data
      |   |--Iris.cs
      |
      |---IrisUtils.cs
      |---IrisUnitTest.cs
      |---SimpleLinearClassifierTests.cs
      |---XORClassifierTests.cs
History
- 4th September, 2019: Initial version