Here we will preprocess a coin dataset for later training of a supervised learning model. Preprocessing a dataset in machine learning usually involves tasks such as the following:
- Clean the data - Filling in the gaps left by missing or corrupted data, for example by averaging the surrounding values or applying some other strategy.
- Normalize the data - Scaling values into a standard range, usually 0 to 1. Data with a wide range of values can destabilize training, so we bring everything into a common range.
- One-hot encode labels - Encoding the labels, or classes, of the objects in the dataset as binary N-dimensional vectors, where N is the total number of classes. All elements are set to 0 except the one that corresponds to the object's class, which is set to 1, so each vector contains exactly one 1.
- Divide the dataset into a training set and a validation set - The training set is used to train the model; the validation set is used to check how accurate the training was by evaluating the resulting model against data it was not trained on.
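As a quick illustration of the one-hot encoding step, here is a minimal Python/NumPy sketch (shown in Python for clarity; the C# code later delegates this to a library call):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as binary N-dimensional vectors."""
    encoded = np.zeros((len(labels), num_classes))
    # Set exactly one element per row to 1: the one at the label's index.
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

# Three samples belonging to classes 2, 0 and 1, out of 3 classes:
print(one_hot([2, 0, 1], 3))
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```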
In this example we will use Numpy.NET, which is essentially the .NET version of the popular NumPy library in Python. NumPy is a library focused on working with multidimensional arrays and matrices.
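For readers new to NumPy, here is a small Python taste of the two array operations we will lean on later, element-wise division and reshaping; Numpy.NET mirrors this API from .NET:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# Element-wise division: applied to every element at once,
# the same idea we will use to normalize pixel values.
print(a / 2)

# reshape(-1) flattens the array into one dimension,
# as done to the label array before one-hot encoding.
print(a.reshape(-1))
```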
To implement our dataset processor, we create the Utils and DataSet classes in a PreProcessing folder. The Utils class incorporates a static Normalize method, shown here:
public class Utils
{
    public static NDarray Normalize(string path)
    {
        var colorMode = Settings.Channels == 3 ? "rgb" : "grayscale";
        var img = ImageUtil.LoadImg(path, color_mode: colorMode, target_size: (Settings.ImgWidth, Settings.ImgHeight));

        // Pixel values lie in [0, 255]; dividing by 255 rescales them to [0, 1].
        return ImageUtil.ImageToArray(img) / 255;
    }
}
In this method, we load an image with a given color mode (RGB or grayscale) and resize it to a given width and height. We then return the matrix containing the image, with each element divided by 255. Since the value of any pixel in an image lies between 0 and 255, dividing every element by 255 normalizes the data to the range 0 to 1, inclusive.
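The normalization performed by this method can be illustrated in plain Python/NumPy with a tiny hypothetical image (real code would load the pixel values from disk):

```python
import numpy as np

# A hypothetical 2x2 grayscale image; pixel values are in [0, 255].
img = np.array([[0, 51],
                [204, 255]], dtype=np.float64)

# Dividing by 255 rescales every pixel into the range [0, 1].
normalized = img / 255
print(normalized.min(), normalized.max())  # 0.0 1.0
```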
We also use a Settings class in the code. This class contains constants for the possible values of many parameters used across the application. The other class, DataSet, represents the dataset we are going to use to train the machine learning model. Here we have the following fields:
- _pathToFolder - The path to the folder containing the images.
- _extList - The list of file extensions to consider.
- _labels - The labels, or classes, of the images in _pathToFolder.
- _objs - The images themselves, each represented as a Numpy.NDarray.
- _validationSplit - The fraction of the total number of images set aside as the validation set; the rest becomes the training set.
- NumberClasses - The total number of unique classes in the dataset.
- TrainX - The training data, represented as a Numpy.NDarray.
- TrainY - The training labels, represented as a Numpy.NDarray.
- ValidationX - The validation data, represented as a Numpy.NDarray.
- ValidationY - The validation labels, represented as a Numpy.NDarray.
Here’s the DataSet class:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Keras.Utils;
using Numpy;

public class DataSet
{
    private string _pathToFolder;
    private string[] _extList;
    private List<int> _labels;
    private List<NDarray> _objs;
    private double _validationSplit;

    public int NumberClasses { get; set; }
    public NDarray TrainX { get; set; }
    public NDarray ValidationX { get; set; }
    public NDarray TrainY { get; set; }
    public NDarray ValidationY { get; set; }

    public DataSet(string pathToFolder, string[] extList, int numberClasses, double validationSplit)
    {
        _pathToFolder = pathToFolder;
        _extList = extList;
        NumberClasses = numberClasses;
        _labels = new List<int>();
        _objs = new List<NDarray>();
        _validationSplit = validationSplit;
    }

    public void LoadDataSet()
    {
        string[] fileEntries = Directory.GetFiles(_pathToFolder);
        foreach (string fileName in fileEntries)
            if (IsRequiredExtFile(fileName))
                ProcessFile(fileName);

        MapToClassRange();
        GetTrainValidationData();
    }

    private bool IsRequiredExtFile(string fileName)
    {
        foreach (var ext in _extList)
        {
            if (fileName.Contains("." + ext))
            {
                return true;
            }
        }
        return false;
    }

    // Maps the raw labels onto the contiguous range 0..NumberClasses - 1.
    private void MapToClassRange()
    {
        HashSet<int> uniqueLabels = _labels.ToHashSet();
        var uniqueLabelList = uniqueLabels.ToList();
        uniqueLabelList.Sort();
        _labels = _labels.Select(x => uniqueLabelList.IndexOf(x)).ToList();
    }

    private NDarray OneHotEncoding(List<int> labels)
    {
        var npLabels = np.array(labels.ToArray()).reshape(-1);
        return Util.ToCategorical(npLabels, num_classes: NumberClasses);
    }

    private void ProcessFile(string path)
    {
        _objs.Add(Utils.Normalize(path));
        ProcessLabel(Path.GetFileName(path));
    }

    private void ProcessLabel(string filename)
    {
        _labels.Add(int.Parse(ExtractClassFromFileName(filename)));
    }

    // The class id is the leading token before '_', with the "class" prefix removed.
    private string ExtractClassFromFileName(string filename)
    {
        return filename.Split('_')[0].Replace("class", "");
    }

    private void GetTrainValidationData()
    {
        var listIndices = Enumerable.Range(0, _labels.Count).ToList();
        var toValidate = _objs.Count * _validationSplit;
        var random = new Random();

        var xValResult = new List<NDarray>();
        var yValResult = new List<int>();
        var xTrainResult = new List<NDarray>();
        var yTrainResult = new List<int>();

        // Draw random samples, without replacement, for the validation set.
        for (var i = 0; i < toValidate; i++)
        {
            var randomIndex = random.Next(0, listIndices.Count);
            var indexVal = listIndices[randomIndex];
            xValResult.Add(_objs[indexVal]);
            yValResult.Add(_labels[indexVal]);
            listIndices.RemoveAt(randomIndex);
        }

        // Whatever remains goes into the training set.
        listIndices.ForEach(indexVal =>
        {
            xTrainResult.Add(_objs[indexVal]);
            yTrainResult.Add(_labels[indexVal]);
        });

        TrainY = OneHotEncoding(yTrainResult);
        ValidationY = OneHotEncoding(yValResult);
        TrainX = np.array(xTrainResult);
        ValidationX = np.array(xValResult);
    }
}
Here is an explanation of each method:
- LoadDataSet() - The main method of the class, called to load the dataset in _pathToFolder. It calls the other methods listed below to do this.
- IsRequiredExtFile(fileName) - Checks whether a given file has one of the extensions (listed in _extList) that should be processed for this dataset.
- MapToClassRange() - Maps the raw labels onto the contiguous range 0 to NumberClasses - 1, based on the sorted list of unique labels in the dataset.
- ProcessFile(path) - Uses the Utils.Normalize method to normalize an image and calls the ProcessLabel method.
- ProcessLabel(filename) - Adds as a label the result of the ExtractClassFromFileName method.
- ExtractClassFromFileName(filename) - Extracts the class from the filename of the image.
- GetTrainValidationData() - Divides the dataset into training and validation sub-datasets.
In this series we will be using the coin image dataset at https://cvl.tuwien.ac.at/research/cvl-databases/coin-image-dataset/.
To load the dataset we can include the following in the main class of our console application:
var numberClasses = 60;
var fileExt = new string[] { "png" };  // IsRequiredExtFile prepends the dot
var dataSetFilePath = @"C:/Users/arnal/Downloads/coin_dataset";

var dataSet = new PreProcessing.DataSet(dataSetFilePath, fileExt, numberClasses, 0.2);
dataSet.LoadDataSet();
Our data can now be fed into a machine learning model. The next article will cover the basics of supervised machine learning and what the training and validation phases consist of. It is intended for readers with little to no experience in AI.