This article is part of a series about my ongoing journey into the dark forest of Kaggle competitions as a .NET developer.
I will be focusing on (almost) pure neural networks in this and the following articles. That means most of the boring parts of dataset preparation, such as filling in missing values, feature selection, and outlier analysis, will be intentionally skipped.
The tech stack will be C# + TensorFlow tf.keras API. As of today, it will also require Windows. Larger models in the future articles may need a suitable GPU for their training time to remain sane.
Let's Predict Real Estate Prices!
House Prices is a great competition for novices to start with. Its dataset is small, there are no special rules, the public leaderboard has many participants, and you can submit up to 4 entries a day.
Register on Kaggle if you have not done so yet, join this competition, and download the data. The goal is to predict the sale price (the SalePrice column) for the entries in test.csv. The archive contains train.csv, which has about 1500 entries with a known sale price to train on. We'll begin by loading that dataset and exploring it a little before getting into neural networks.
Analyze Training Data
Did I say we will skip the dataset preparation? I lied! You have to take a look at least once.
To my surprise, I did not find an easy way to load a .csv file in the .NET standard class library, so I installed a NuGet package called CsvHelper. To simplify data manipulation, I also got my new favorite LINQ extension package, MoreLinq.
static DataTable LoadData(string csvFilePath) {
    var result = new DataTable();
    using (var reader = new CsvDataReader(new CsvReader(new StreamReader(csvFilePath)))) {
        result.Load(reader);
    }
    return result;
}
Using DataTable for training data manipulation is actually a bad idea.
ML.NET is supposed to support .csv loading and many of the data preparation and exploration operations. However, it was not yet ready for that particular purpose when I first entered the House Prices competition.
The data looks like this (only a few rows and columns):
Id | MSSubClass | MSZoning | LotFrontage | LotArea |
1 | 60 | RL | 65 | 8450 |
2 | 20 | RL | 80 | 9600 |
3 | 60 | RL | 68 | 11250 |
4 | 70 | RL | 60 | 9550 |
After loading the data, we need to remove the Id column, as it is unrelated to the house prices:
var trainData = LoadData("train.csv");
trainData.Columns.Remove("Id");
Analyzing the Column Data Types
DataTable does not automatically infer the data types of the columns, and assumes they are all strings. So the next step is to determine what we actually have. For each column, I computed the following statistics: the number of distinct values, the share of values that parse as integers, and the share that parse as floating-point numbers (the source code with all helper methods is linked at the end of the article):
var values = rows.Select(row => (string)row[column]);
double floats = values.Percentage(v => double.TryParse(v, out _));
double ints = values.Percentage(v => int.TryParse(v, out _));
int distincts = values.Distinct().Count();
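The Percentage helper used above is not shown in this article. A minimal sketch of what it might look like, assuming it returns the fraction (0 to 1) of items that match a predicate; the class name ColumnStats and the IsFloat/IsInt wrappers are hypothetical. The wrappers also pin parsing to the invariant culture, since the plain TryParse overloads depend on the regional settings of the machine:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class ColumnStats {
    // Hypothetical stand-in for the article's Percentage helper:
    // the fraction (0..1) of items that satisfy the predicate.
    public static double Percentage<T>(this IEnumerable<T> items, Func<T, bool> predicate) {
        int total = 0, matching = 0;
        foreach (T item in items) {
            total++;
            if (predicate(item)) matching++;
        }
        return total == 0 ? 0 : (double)matching / total;
    }

    // Invariant culture keeps "8450.5" parseable regardless of OS regional settings.
    public static bool IsFloat(string s) =>
        double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out _);

    public static bool IsInt(string s) =>
        int.TryParse(s, NumberStyles.Integer, CultureInfo.InvariantCulture, out _);
}
```

With such a helper, a column like LotFrontage that mixes numbers and "NA" entries would report a float share below 1, which is how the non-parsable values can be spotted.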
Numeric Columns
It turns out that most columns are actually ints, but since neural networks mostly work on floating-point numbers, we will convert them to doubles anyway.
Categorical Columns
Other columns describe categories that the property on sale belongs to. None of them has too many distinct values, which is good. To use them as input for our future neural network, they have to be converted to doubles too.
Initially, I simply assigned numbers from 0 to distinctValueCount - 1 to them, but that does not make much sense, as there is no progression from "Facade: Blue" through "Facade: Green" to "Facade: White". So early on, I changed that to what's called one-hot encoding, where each unique value gets a separate input column. E.g., "Facade: Blue" becomes [1,0,0], and "Facade: White" becomes [0,0,1].
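A minimal sketch of one-hot encoding for the facade example. The class and method names are hypothetical, and mapping an unknown value to an all-zero vector is an assumption of this sketch, not necessarily what the full source does:

```csharp
using System;

static class OneHotExample {
    // Returns a vector with a single 1 at the position of `value` in `domain`.
    // Values missing from `domain` produce an all-zero vector (an assumption of this sketch).
    public static double[] Encode(string[] domain, string value) {
        var result = new double[domain.Length];
        int index = Array.IndexOf(domain, value);
        if (index >= 0) result[index] = 1;
        return result;
    }
}
```

For the domain { "Blue", "Green", "White" }, Encode returns [1,0,0] for "Blue" and [0,0,1] for "White", so no artificial ordering between the colors leaks into the network's input.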
Getting Them All Together
CentralAir: 2 values, ints: 0.00%, floats: 0.00%
Street: 2 values, ints: 0.00%, floats: 0.00%
Utilities: 2 values, ints: 0.00%, floats: 0.00%
....
LotArea: 1073 values, ints: 100.00%, floats: 100.00%
Many value columns:
Exterior1st: AsbShng, AsphShn, BrkComm, BrkFace, CBlock, CemntBd, HdBoard,
ImStucc, MetalSd, Plywood, Stone, Stucco, VinylSd, Wd Sdng, WdShing
Exterior2nd: AsbShng, AsphShn, Brk Cmn, BrkFace, CBlock, CmentBd, HdBoard,
ImStucc, MetalSd, Other, Plywood, Stone, Stucco, VinylSd, Wd Sdng, Wd Shng
Neighborhood: Blmngtn, Blueste, BrDale, BrkSide, ClearCr, CollgCr, Crawfor,
Edwards, Gilbert, IDOTRR, MeadowV, Mitchel, NAmes, NoRidge, NPkVill,
NridgHt, NWAmes, OldTown, Sawyer, SawyerW, Somerst,
StoneBr, SWISU, Timber, Veenker
non-parsable floats
GarageYrBlt: NA
LotFrontage: NA
MasVnrArea: NA
float ranges:
BsmtHalfBath: 0...2
HalfBath: 0...2
...
GrLivArea: 334...5642
LotArea: 1300...215245
With that in mind, I built the following ValueNormalizer, which takes some information about the values inside a column and returns a function that transforms a value (a string) into a numeric feature vector for the neural network (a double[]):
static Func<string, double[]> ValueNormalizer(double floats, IEnumerable<string> values) {
    if (floats > 0.01) {
        // Numeric column: scale by the largest value; unparsable entries (e.g. "NA") become -1.
        double max = values.AsDouble().Max().Value;
        return s => new[] { double.TryParse(s, out double v) ? v / max : -1 };
    } else {
        // Categorical column: one-hot encode; slot 0 is used for values not present in the domain.
        string[] domain = values.Distinct().OrderBy(v => v).ToArray();
        return s => new double[domain.Length+1]
            .Set(Array.IndexOf(domain, s)+1, 1);
    }
}
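ValueNormalizer relies on two helpers, AsDouble and Set, that are not shown in this article. A minimal sketch of how they might be written, inferred only from how they are called above (the class name NormalizerHelpers and the exact signatures are assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class NormalizerHelpers {
    // Parses each string; unparsable entries such as "NA" become null,
    // which Max() on a sequence of double? then ignores.
    public static IEnumerable<double?> AsDouble(this IEnumerable<string> values) =>
        values.Select(v => double.TryParse(v, out double parsed) ? parsed : (double?)null);

    // Stores `value` at `index` and returns the same array, so the call can be chained.
    public static T[] Set<T>(this T[] array, int index, T value) {
        array[index] = value;
        return array;
    }
}
```

With these definitions, a categorical column with the domain { "Blue", "Green", "White" } encodes "Green" as [0,0,1,0], while a value that is not in the domain ends up as [1,0,0,0], because Array.IndexOf returns -1 for it.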
Now we've got the data converted into a format suitable for a neural network. It is time to build one.
Build a Neural Network
If you already have Python 3.6 and TensorFlow 1.10.x installed, all you need is:
<PackageReference Include="Gradient" Version="0.1.10-tech-preview4" />
in your modern .csproj file. Otherwise, refer to the Gradient manual to do the initial setup.
Once the package is up and running, we can create our first shallow deep network.
using tensorflow;
using tensorflow.keras;
using tensorflow.keras.layers;
using tensorflow.train;
...
var model = new Sequential(new Layer[] {
    new Dense(units: 16, activation: tf.nn.relu_fn),
    new Dropout(rate: 0.1),
    new Dense(units: 10, activation: tf.nn.relu_fn),
    new Dense(units: 1, activation: tf.nn.relu_fn),
});
model.compile(optimizer: new AdamOptimizer(), loss: "mean_squared_error");
This will create an untrained neural network with 3 neuron layers and a dropout layer that helps to prevent overfitting.
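To get a feel for the size of this network, its trainable parameters can be counted by hand. A small sketch, assuming the usual fully connected layout where every Dense unit has one weight per input plus one bias; the class name ParameterCount is hypothetical, and the input width depends on how many columns the one-hot encoding produces:

```csharp
static class ParameterCount {
    // A Dense layer has one weight per (input, unit) pair plus one bias per unit.
    public static int Dense(int inputs, int units) => inputs * units + units;

    // Total for the 16 -> 10 -> 1 stack above; Dropout adds no parameters.
    public static int Total(int inputFeatures) =>
        Dense(inputFeatures, 16) + Dense(16, 10) + Dense(10, 1);
}
```

For an arbitrary example width of 300 input features (not the article's actual number), Total(300) is 4997. Almost all of that comes from the first layer, so the parameter count grows with every category that one-hot encoding turns into a column.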
tf.nn.relu_fn is the activation function for our neurons. ReLU is known to work well in deep networks because it mitigates the vanishing gradient problem: the derivatives of earlier saturating activation functions, such as sigmoid and tanh, tended to become very small as the error propagated back from the output layer in deep networks. That meant that the layers closer to the input would only adjust very slightly, which slowed the training of deep networks significantly.
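A rough numeric illustration of that effect. This is a deliberately simplified sketch: it ignores the weights entirely and assumes the same pre-activation value at every layer, so it only shows how the activation derivatives alone multiply along the backward pass. All names are hypothetical:

```csharp
using System;

static class GradientDecay {
    // Derivative of the sigmoid function; it never exceeds 0.25.
    public static double SigmoidDerivative(double x) {
        double s = 1.0 / (1.0 + Math.Exp(-x));
        return s * (1.0 - s);
    }

    // Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    public static double ReluDerivative(double x) => x > 0 ? 1.0 : 0.0;

    // Multiplies the activation derivative once per layer, as the chain rule
    // does during backpropagation (weights are left out of this sketch).
    public static double ChainedDerivative(Func<double, double> derivative, double x, int layers) {
        double product = 1.0;
        for (int i = 0; i < layers; i++) product *= derivative(x);
        return product;
    }
}
```

Through 10 sigmoid layers, the best case (x = 0, derivative 0.25) already shrinks the signal to 0.25^10, which is below one millionth, while 10 active ReLU units pass it through with a factor of exactly 1.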
Dropout is a special-function layer in neural networks that does not contain neurons as such. Instead, it takes each individual input and randomly replaces it with 0 on its output (otherwise, it passes the value along). It is only active during training. By doing so, it helps to prevent overfitting to less relevant features in a small dataset. For example, if we did not remove the Id column, the network could potentially have memorized the <Id> -> <SalePrice> mapping exactly, which would give us 100% accuracy on the training set, but completely unrelated numbers on any other data.
Why do we need dropout? Our training data only has ~1500 examples, and this tiny neural network we've built has > 1800 tunable weights. If it were a simple polynomial, it could exactly match the price function we are trying to approximate on those examples. But then, it would have enormous values on any inputs outside of the original training set.
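What a dropout layer does to its inputs during training can be sketched in a few lines. This is an illustration, not the TensorFlow implementation; the class name DropoutSketch is hypothetical. It follows the common "inverted dropout" convention that Keras also uses, where the surviving values are scaled by 1 / (1 - rate) so the expected sum of the outputs stays the same:

```csharp
using System;

static class DropoutSketch {
    // Zeroes each input with probability `rate` and scales the survivors
    // by 1 / (1 - rate). Assumes 0 <= rate < 1.
    public static double[] Apply(double[] inputs, double rate, Random random) {
        var output = new double[inputs.Length];
        double scale = 1.0 / (1.0 - rate);
        for (int i = 0; i < inputs.Length; i++) {
            bool dropped = random.NextDouble() < rate;
            output[i] = dropped ? 0.0 : inputs[i] * scale;
        }
        return output;
    }
}
```

With rate: 0.1, as in the model above, roughly one in ten outputs of the first Dense layer is zeroed on every training step, so the next layer cannot rely on any single one of them.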
Feed the Data
TensorFlow expects its data either in NumPy arrays or in existing tensors. I am converting DataRows into NumPy arrays:
using numpy;
...
const string predict = "SalePrice";
ndarray GetInputs(IEnumerable<DataRow> rowSeq) {
    return np.array(rowSeq.Select(row => np.array(
            columnTypes
                .Where(c => c.column.ColumnName != predict)
                .SelectMany(column => column.normalizer(
                    row.Table.Columns.Contains(column.column.ColumnName)
                        ? (string)row[column.column.ColumnName]
                        : "-1"))
                .ToArray()))
        .ToArray()
    );
}
var predictColumn = columnTypes.Single(c => c.column.ColumnName == predict);
ndarray trainOutputs = np.array(predictColumn.trainValues
    .AsDouble()
    .Select(v => v ?? -1)
    .ToArray());
ndarray trainInputs = GetInputs(trainRows);
In the code above, we convert each DataRow into an ndarray by taking every cell in it and applying the ValueNormalizer corresponding to its column. Then, we put all rows into another ndarray, getting an array of arrays.
No such transform is needed for the outputs, where we just convert the train values to another ndarray.
Time to Get Down the Gradient
With this setup, all we need to do to train our network is to call the model's fit function:
model.fit(trainInputs, trainOutputs,
    epochs: 2000,
    validation_split: 0.075,
    verbose: 2);
This call will actually set aside the last 7.5% of the training set for validation, then repeat the following 2000 times:
- Split the rest of trainInputs into batches
- Feed these batches one by one into the neural network
- Compute error using the loss function we defined above
- Backpropagate the error through the gradients of individual neuron connections, adjusting weights
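The steps above can be sketched on the smallest possible "network": a single weight w in the model y = w * x. This is plain mini-batch gradient descent rather than the Adam optimizer used in this article, and every name in it is hypothetical; it is only meant to show the split, feed, compute error, and adjust cycle:

```csharp
using System;

static class MiniBatchSketch {
    // Runs one epoch of mini-batch gradient descent for the model y = w * x
    // with mean squared error loss, and returns the updated weight.
    public static double TrainEpoch(double[] xs, double[] ys, double w,
                                    int batchSize, double learningRate) {
        for (int start = 0; start < xs.Length; start += batchSize) {
            int end = Math.Min(start + batchSize, xs.Length);
            double gradient = 0;
            for (int i = start; i < end; i++) {
                double error = w * xs[i] - ys[i];   // prediction minus target
                gradient += 2 * error * xs[i];      // d(error^2)/dw
            }
            gradient /= end - start;                // average over the batch
            w -= learningRate * gradient;           // adjust the weight
        }
        return w;
    }
}
```

Calling TrainEpoch repeatedly on data generated by y = 2x moves w toward 2. A real network does the same thing for all of its weights at once, with backpropagation supplying the gradient of each one.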
While training, it will output the network's error on the data it set aside for validation as val_loss, and the error on the training data itself as just loss. Generally, if val_loss becomes much greater than loss, it means the network has started overfitting. I will address that in more detail in the following articles.
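If you did everything correctly, the square root of either of your losses should be on the order of 20,000. Since the loss is a mean squared error, its square root is expressed in the same units as SalePrice.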
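For reference, this is what the "mean_squared_error" loss measures. A minimal sketch with made-up prices; the class name LossSketch is hypothetical:

```csharp
using System;

static class LossSketch {
    // Average of the squared differences between predictions and targets.
    public static double MeanSquaredError(double[] predicted, double[] actual) {
        if (predicted.Length != actual.Length)
            throw new ArgumentException("Arrays must have the same length.");
        double sum = 0;
        for (int i = 0; i < predicted.Length; i++) {
            double diff = predicted[i] - actual[i];
            sum += diff * diff;
        }
        return sum / predicted.Length;
    }
}
```

If the network predicts 200000 and 150000 for houses that sold for 180000 and 170000, the loss is 400000000, and Math.Sqrt of that is 20000: a typical error of about 20,000 per house.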
Submission
I won't talk much about generating the file to submit here. The code to compute outputs is simple:
const string SubmissionInputFile = "test.csv";
DataTable submissionData = LoadData(SubmissionInputFile);
var submissionRows = submissionData.Rows.Cast<DataRow>();
ndarray submissionInputs = GetInputs(submissionRows);
ndarray submissionOutputs = model.predict(submissionInputs);
which mostly uses functions that were defined earlier.
Then you need to write them into a .csv file, which is simply a list of Id, predicted_value pairs.
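A minimal sketch of writing such a file. It assumes the ids and the predicted prices have already been extracted into plain .NET sequences (how to read values out of the ndarray is not covered here), and the SubmissionWriter class is hypothetical. The "Id,SalePrice" header matches the sample submission file that comes with the competition data:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

static class SubmissionWriter {
    // Builds the lines of the submission file: a header, then one "id,price" pair per row.
    public static List<string> FormatLines(IEnumerable<string> ids, IEnumerable<double> predictions) {
        var lines = new List<string> { "Id,SalePrice" };
        lines.AddRange(ids.Zip(predictions, (id, price) =>
            // Invariant culture guarantees a '.' decimal separator.
            id + "," + price.ToString(CultureInfo.InvariantCulture)));
        return lines;
    }

    public static void Write(string path, IEnumerable<string> ids, IEnumerable<double> predictions) {
        File.WriteAllLines(path, FormatLines(ids, predictions));
    }
}
```

The ids can be read from the Id column of the DataTable loaded from test.csv, which is still present there because only the training table had it removed.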
When you submit your result, you should get a score on the order of 0.17, which would be somewhere in the last quarter of the public leaderboard table. But hey, if it were as simple as a 3-layer network with 27 neurons, those pesky data scientists would not be getting $300k+/y total compensation from the major US companies.
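To put that score into perspective: as far as I understand the competition's evaluation rules, the leaderboard score is the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price, so it measures relative rather than absolute error. A sketch of that metric, with a hypothetical class name:

```csharp
using System;

static class LogRmse {
    // Root mean squared error between log(predicted) and log(actual).
    // All prices must be positive, otherwise the logarithm is not finite.
    public static double Score(double[] predicted, double[] actual) {
        if (predicted.Length != actual.Length)
            throw new ArgumentException("Arrays must have the same length.");
        double sum = 0;
        for (int i = 0; i < predicted.Length; i++) {
            double diff = Math.Log(predicted[i]) - Math.Log(actual[i]);
            sum += diff * diff;
        }
        return Math.Sqrt(sum / predicted.Length);
    }
}
```

Under this metric, a score of 0.17 roughly corresponds to predictions that are typically off by a factor of e^0.17, or about 18-19%, regardless of whether the house is cheap or expensive.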
Wrapping Up
The full source code for this entry (with all of the helpers, and some of the commented-out parts of my earlier exploration and experiments) is about 200 lines on PasteBin.
In the next article, you will see my shenanigans trying to get into the top 50% of that public leaderboard. It's going to be an amateur journeyman's adventure, a fight with The Windmill of Overfitting with the only tool the wanderer has: a bigger model (i.e., a deep NN; remember, no manual feature engineering!). It will be less of a coding tutorial, and more of a thought quest with really crooked math and a weird conclusion.
Stay tuned!
History
- 23rd February, 2019: Initial version