Introduction
As the title of this article suggests, we are going to implement the K-Means Clustering algorithm is C#. It’s been a while since I started to study this algorithm for one of my projects. It’s a nice, straightforward, yet efficient algorithm to classify some data into different groups. A couple of days ago, I decided to implement the algorithm in C# and here I want to take you all through the steps of doing so. But, first things first. Let’s take a quick look at the K-Means Clustering algorithm itself.
The K-Means Clustering Algorithm
There are a lot of pages and websites which explain the K-Means Clustering algorithm just to make you even more confused. Let’s be honest, there are also very useful and straightforward explanations out there. I try my best to show you this algorithm in the simplest way possible. As I said earlier, this algorithm is used to classify some pieces of data into different groups or, as we will call it from now on, clusters. It might have happened to you as well that at one point you needed to group similar items in a way that members of the same group have a lot in common while members of different groups vary greatly. This is what this algorithm tries to do.
To make things clearer, let’s take a look at the following pictures which I got from Wikipedia.
I will refer to these four pictures from left to right. So, let’s take a look at the left-hand side picture. In this picture, you see a group of gray squares which we want to cluster. You can also see three red, green and blue circles. These are called centroids. Simply put, these little guys will help us to group the gray squares. Simple enough, huh? Ok. Just stay with me.
Now, take a look at the second picture. Here, the gray squares are no longer gray. In fact, they have got the color of the circles which indicates that they are a member of that cluster. The whole area is also segmented into three smaller colored regions to show the boundaries of the three clusters. So, now we have got three clusters, all of which contain some of the squares.
The next picture shows some computation. This is the part of the algorithm in which the squares might move from a cluster to another one. But how do we decide that? I’m going to skip the answer to this question for now and get back to it later on.
Finally, the last picture shows the final result of the algorithm in which all of the squares are grouped into the three clusters. The final step of the algorithm clusters the squares in such a way that members of the same cluster have a lot in common while members of different groups are substantially different.
Now, the question that you might ask yourself is why we call it "K" and why "Means"? Well, simply put, "K" is the number of centroids that you decide to have and "Mean" is the criterion that decides which cluster a piece of data should be in. I will elaborate more on this later on. Also, please visit this page for further information on the K-Means Clustering algorithm.
The K-Means Clustering Algorithm in C#
The Data Point Data Model
Now that we know a little bit about the overall goal of the algorithm, let’s try to implement it in C#. The first thing that I have done is to create a data model to store the data I want to cluster. The data you wish to cluster could be about anything. In my case, I need to cluster some objects of a type called "Data Point". This class has two properties: Height
and Weight
which may correspond to the two physical attributes of a person. There is also another property called "Cluster". This property stores the cluster index which a Data Point is a member of.
public class DataPoint
{
public double Height { get; set; }
public double Weight { get; set; }
public int Cluster { get; set; }
public DataPoint(double height, double weight)
{
Height = height;
Weight = weight;
Cluster = 0;
}
public DataPoint()
{
}
public override string ToString()
{
return string.Format("{{{0},{1}}}", Height.ToString("f" + 1), Weight.ToString("f" + 1));
}
}
There is also the ToString
method used to insert data points into a listbox
on the main form. Here is the event handler to do this:
private void txtWeight_KeyDown(object sender, KeyEventArgs e)
{
if (e.KeyCode==Keys.Enter)
{
if (string.IsNullOrEmpty(txtHeight.Text) || string.IsNullOrEmpty(txtWeight.Text))
{
MessageBox.Show("Please enter the values for both Height and Weight.");
return;
}
DataPoint dp = new DataPoint();
dp.Height = double.Parse(txtHeight.Text);
dp.Weight = double.Parse(txtWeight.Text);
lstRawData.Items.Add(dp.ToString());
}
}
The Class Level Objects
I have also declared some objects at the class level for the main form of the application:
List<DataPoint> _rawDataToCluster = new List<DataPoint>();
List<DataPoint> _normalizedDataToCluster = new List<DataPoint>();
List<DataPoint> _clusters = new List<DataPoint>();
private int _numberOfClusters = 0;
I think the code above is clear enough, but let’s just say what the objects will be responsible for:
· _rawDataToCluster
: stores the data entered by the user.· _normalizedDataToCluster
: stores the normalized version of the data stored in the _rawDataToCluster
. What is normalization? I will talk about it later on. · _clusters
: stores the means of each cluster. Means? What is that? Don’t worry. Again, I will get back to this question. But, for the time being, I should say that the data stored in this collection decides whether a data point should move between two clusters or stay with the cluster where it already is. · _numberOfClusters
: Number of clusters which we want to group the data in. In the pictures above, the number of clusters is three.
The InitilizeRawData Method
The first thing to do is to populate the _rawDataToCluster
collection using the data entered by the user. I should mention that the user is supposed to enter two pieces of data into two textboxes for the Height
and Weight
properties. Upon pressing the Enter key, the data is added to a listbox
using the following format: {32.0,230.0}.
Now, it is time to populate the _rawDataToCluster
collection using the data in the listbox
. Of course, there are a lot of ways to do that. I tried to stick to the most simple one:
private void InitilizeRawData()
{
if (__rawDataToCluster .Count == 0)
{
string lstRawDataItem = string.Empty;
for (int i = 0; i < lstRawData.Items.Count; i++)
{
DataPoint dp = new DataPoint();
lstRawDataItem = lstRawData.Items[i].ToString();
lstRawDataItem = lstRawDataItem.Replace("{", "");
lstRawDataItem = lstRawDataItem.Replace("}", "");
string[] data = lstRawDataItem.Split(',');
dp.Height = double.Parse(data[0]);
dp.Weight = double.Parse(data[1]);
__rawDataToCluster .Add(dp);
}
}
}
The method above simply enumerates through the items in the listbox
, extracts the two pieces of data in each row using the Split
method of type String
, instantiates a new object of type Data Point, sets the properties and adds the object to the _rawDataToCluster
.
The NormalizeData Method
The data obtained so far cannot possibly be used yet. First, we need to normalize it. Why? It’s so simple. You see, we are dealing with two different types of data: Height
and Weight
. They are somehow about different measurements. To give another example, if we are dealing with some data about number of days, age, number of times a customer has purchased something, etc. we need to normalize the data. We do the normalization to make sure that one piece of data (Height
, for example) doesn’t overwhelm the other (Weight
). Again, we may also skip the normalization step, but best practice is to do it. So, here is the code:
private void NormalizeData()
{
double heightSum = 0.0;
double weightSum = 0.0;
foreach (DataPoint dataPoint in _rawDataToCluster)
{
heightSum += dataPoint.Height;
weightSum += dataPoint.Weight;
}
double heightMean = heightSum / _rawDataToCluster.Count;
double weightMean = weightSum / _rawDataToCluster.Count;
double sumHeight = 0.0;
double sumWeight = 0.0;
foreach (DataPoint dataPoint in _rawDataToCluster)
{
sumHeight += Math.Pow(dataPoint.Height - heightMean, 2);
sumWeight += Math.Pow(dataPoint.Weight - weightMean, 2);
}
double heightSD = sumHeight / _rawDataToCluster.Count;
double weightSD = sumWeight / _rawDataToCluster.Count;
foreach (DataPoint dataPoint in _rawDataToCluster)
{
_normalizedDataToCluster.Add(new DataPoint()
{
Height = (dataPoint.Height - heightMean)/heightSD,
Weight = (dataPoint.Weight - weightMean) / weightSD
}
);
}
}
There are different methods to normalize the data we have got so far. The method used here is the Gaussian (also called normal) normalization. How does it work? Very simple. The first loop and the next two lines compute the mean of both Height
and Weight
for the data points. How? Just sum the values of Height
and Weight
up and divide by the number of items in the collection. The next step is to do something for all of the items in the collection which is subtracting both Height
and Weight
by the means calculated in the previous step (heightMean
, weightMean
) and then to the power two. The next step is to divide these two values by the number of items in the collection. Finally, the last loop populates the _normalizedDataToCluster
collection by adding new objects of type DataPoint
to it. The values used for the two properties are the normalized versions of the values in the _rawDataToCluster
collection. Now, most normalized values will be between -3
and +3
. Please, feel free to visit this page for more information about this method of normalization.
The ShowData Method
Now that we have got the normalized data, how about showing it to the user? Well, the following method just does that:
private void ShowData(List<DataPoint> data, int decimals, TextBox outPut)
{
foreach (DataPoint dp in data)
{
outPut.Text += dp.ToString() + Environment.NewLine;
}
}
This method accepts a generic list of type DataPoint
, the number of floating point numbers and also a reference to a textbox
where the output is to be printed. There are actually three textbox
es on the form. One for showing the normalized data, one for showing the clustering history and yet another one for showing the last result of the algorithm. Again, the code is simple enough: just enumerate through all the members in the list, call the ToString
method and that’s it. We get all of the items in the collection in a format like this: {Height,Weight
}. Remember the ToString
method?
The InitializeCentroids Method
Well, here starts the core of the K-Means Clustering algorithm. Do you remember the centroids? Those little colored circles which helped us group the gray squares. One of things about the K-Means Clustering algorithm is to set the centroids for each data item initially. How do we do this? The most basic way is to set the centroids randomly. There are also other methods which we skip for now. This is what we are going to do. So, for every DataPoint
that we have got so far, we need to set the Cluster
property using a random mechanism:
private void InitializeCentroids()
{
Random random = new Random(_numberOfClusters);
for (int i = 0; i < _numberOfClusters; ++i)
{
_normalizedDataToCluster[i].Cluster = _rawDataToCluster[i].Cluster = i;
}
for (int i = _numberOfClusters; i < _normalizedDataToCluster.Count; i++)
{
_normalizedDataToCluster[i].Cluster = _rawDataToCluster[i].Cluster = random.Next(0, _numberOfClusters);
}
}
One thing to bear in mind about this clustering algorithm is that a cluster must always have at least one member (here DataPoint
). That’s what the first loop is about. The first loop arbitrarily adds one data point to each cluster by setting the Cluster
property in both the items in _normalizedDataToCluster
and _rawDataToCluster
collections. Then, the second loop sets the Cluster
property for the remaining data points to random numbers between zero and the number of clusters entered by the user. Why do we set the Cluster
property for both the items in the _normalizedDataToCluster
and _rawDataToCluster
collections? That’s because at the end, we use the _rawDataToCluster
to show the result of the algorithm. In other words, we use the Cluster
property in the _normalizedDataToCluster
to actually do the clustering and the Cluster
property in the _rawDataToCluster
to show the result to the user. It’s something I have decided to do. You, if you wish, can omit this part and use the _rawDataToCluster
to print out the final results.
The UpdateDataPointMeans Method
Do you remember I told you that the "Means" word in the name of the algorithm is responsible to decide which cluster should hold a piece of data? If you don’t, please look back and read again. Now, the process is a little bit complex but I try to explain it in the simplest way possible. You see, we have already set the centroids of the data points. That is, we have already put the data points in some arbitrary clusters. Now, it is time to actually find the most suitable cluster for each data point. How do we decide that? Again, very simple. First, we need to calculate the mean of all of the items in a cluster (for both Height
and Weight
). For example, if a cluster contains three data points such as {32,65}, {16,87} and {17,60}, the mean of this cluster is 32+16+17/3 and 65+87+60/3. We store these two values in the _cluster
collection. Just to remind you, the _cluster
collection is generic list of type DataPoint
declared at the class level. However, the Height
and the Weight
properties of the items in this collection no longer represent the actual values for a DataPoint
, but rather the means for the corresponding properties for all of the data points in a cluster. The Cluster
property for the items in this collection is a simple integer number to show the index of a cluster. However, this property for DataPoint
objects in the _normalizedDataToCluster
and _rawDataToCluster
represent the index of the cluster that holds that DataPoint
. Let’s talk about this example a little more to make things simpler. Suppose there is a DataPoint
object in the _cluster
collection with these properties:
DataPoint{21.6,70.6,1}
There are also three items in the _rawDataToCluster
or the _normalizedDataToCluster
collections with these properties:
DataPoint{32,65,1}, DataPoint{16,87,1}, DataPoint{17,60,1}
All this means that the three data points in the _rawDataToCluster
collection are members of a cluster with index 1
that also has the means of the two properties of its members in its Height
and Weight
properties: 21.6
and 70.6
.
Now, enough with these explanations, let’s look at the code:
private bool UpdateDataPointMeans()
{
if (EmptyCluster(_normalizedDataToCluster)) return false;
var groupToComputeMeans = _normalizedDataToCluster.GroupBy(s => s.Cluster).OrderBy(s => s.Key);
int clusterIndex = 0;
double height = 0.0;
double weight = 0.0;
foreach (var item in groupToComputeMeans)
{
foreach (var value in item)
{
height += value.Height;
weight += value.Weight;
}
_clusters[clusterIndex].Height = height / item.Count();
_clusters[clusterIndex].Weight = weight / item.Count();
clusterIndex++;
height = 0.0;
weight = 0.0;
}
return true;
}
This method makes use of LINQ to group the _normalizedDataToCluster
based on the Cluster
property. Then, it iterates through each item in each group, calculates the mean and updates the clusters collection. But, first, it makes sure that none of the clusters is empty. If even one of the clusters is empty, it returns false
and doesn’t continue. This is exactly what I talked about before: At any point of time, each cluster must have at least one member.
The EmptyCluster Method
This is the method which tests the clusters to make sure each one of them has at least one member:
private bool EmptyCluster(List<DataPoint> data)
{
var emptyCluster =
data.GroupBy(s => s.Cluster).OrderBy(s => s.Key).Select(g => new {Cluster = g.Key, Count = g.Count()});
foreach (var item in emptyCluster)
{
if (item.Count == 0)
{
return true;
}
}
return false;
}
The ElucidanDistance Method
Now for the most important part of our algorithm. The idea behind the K-Means Clustering algorithm is to put similar items together, right? Correct. But what does this similarity mean? Well, it is the difference between an item and the mean of the current members of a cluster. How do you compute this difference? Well, there are a lot of ways to do this. The simplest one is the Elucidan distance method which I have used here:
private double ElucidanDistance(DataPoint dataPoint, DataPoint mean)
{
double _diffs = 0.0;
_diffs = Math.Pow(dataPoint.Height - mean.Height, 2);
_diffs += Math.Pow(dataPoint.Weight - mean.Weight, 2);
return Math.Sqrt(_diffs);
}
This method is pretty straightforward. It accepts two arguments of type DataPoint
: one being the DataPoint
which we want to move to a more suitable cluster and the other being the DataPoint
storing the mean of the items in the target cluster. The idea behind the Elucidan method is also simple. In this case, we just need to subtract the Height
and Weight
of the data point in question from the means of these two properties in the target cluster, then increase value to the power two, add them up, and then return the root square of this value which is the Elucidan difference between these two data points. For more information about this method, you can refer to this page. Now that we have got the distance of a data point from the mean of the target cluster, we can decide whether to move that data point to the new cluster or not. How do we decide that? Continue reading.
The UpdataClusterMembership Method
To move a data point from a cluster to a new one, we just need to update the value stored in its Cluster
property. But, the question is that how do we decide this. Well, we need to compare the Height
and Weight
properties of a data point with the means of these two properties in all of the clusters. Then, we need to choose the cluster with the minimum difference and move the data point to that cluster. Why? Because that cluster is the most suitable one as the difference between its mean and the data point is at the lowest level. Remember, the final goal of the algorithm is to group similar items together; items which have the least amount of difference between their values. The following code is responsible to do all this:
private bool UpdateClusterMembership()
{
bool changed = false;
double[] distances = new double[_numberOfClusters];
StringBuilder sb = new StringBuilder();
for (int i = 0; i < _normalizedDataToCluster.Count; ++i)
{
for (int k = 0; k < _numberOfClusters; ++k)
distances[k] = ElucidanDistance(_normalizedDataToCluster[i], _clusters[k]);
int newClusterId = MinIndex(distances);
if (newClusterId != _normalizedDataToCluster[i].Cluster)
{
changed = true;
_normalizedDataToCluster[i].Cluster = _rawDataToCluster[i].Cluster = newClusterId;
sb.AppendLine("Data Point >> Height: " + _rawDataToCluster[i].Height + ", Weight: " +
_rawDataToCluster[i].Weight + " moved to Cluster # " + newClusterId);
}
else
{
sb.AppendLine("No change.");
}
sb.AppendLine("------------------------------");
txtIterations.Text += sb.ToString();
}
if (changed == false)
return false;
if (EmptyCluster(_normalizedDataToCluster)) return false;
return true;
}
Let’s talk a little about this method. The outer-most loop iterates trough the items of the _normalizedDataToCluster
collection. The next loop within this one calculates the Elucidan distance between each item in the _normalizedDataToCluster
collection and the means of clusters stored in the _clusters
collection and then stores the result of each of these comparisons in an array called distances
. After that, the code calls a method named MinIndex
which returns the index of the minimum value in an array. It’s very simple and I will put the code shortly. Now that we have the index of the cluster with minimum difference, we have to check whether this data point is already in that cluster. This is what the conditional part of the code does. If the data point is not already in the cluster with the minimum distance, we should move it there. How? By simply updating the value in the Cluster
property of that data point. We set the changed flag to true
in order to show that a change occurred in the cluster membership. Then, we also put a little bit of string
in a StringBuilder
object to notify the user that a data point changed cluster. If the data point is already in the cluster with the minimum amount of distance, we just print out "No Change". Simple enough? At the end of this method, we check that the changed flag is not false
, we also check that none of the clusters is empty. If any of these conditions fail to meet, the method returns false
. Why? Very simple. The idea behind the K-Means Clustering algorithm is that we should try to move items to more suitable clusters until no change in cluster membership happens or, as said previously, at least one of the clusters is empty.
The MinIndex Method
This method which was used in the UpdataClusterMembership
method simply accepts an array and returns the index of the minimum value stored in it. It is used to find the cluster with the minimum difference for a data point to move to:
private int MinIndex(double[] distances)
{
int _indexOfMin = 0;
double _smallDist = distances[0];
for (int k = 0; k < distances.Length; ++k)
{
if (distances[k] < _smallDist)
{
_smallDist = distances[k];
_indexOfMin = k;
}
}
return _indexOfMin;
}
The Cluster Method
Here, we put the last piece of our code to cluster the data. It is a very simple method that uses most of the code we already talked about:
public void Cluster(List<DataPoint> data, int numClusters)
{
bool _changed = true;
bool _success = true;
InitializeCentroids();
int maxIteration = data.Count * 10;
int _threshold = 0;
while (_success == true && _changed == true && _threshold < maxIteration)
{
++_threshold;
_success = UpdateDataPointMeans();
_changed = UpdateClusterMembership();
}
}
This method accepts a collection of type DataPoint
and a desired number of clusters. It calls the InitializeCentroids
method to set some random centroids for the data points. Finally, it calls the UpdateDataPointMeans
and UpdateClusterMembership
methods in a loop. I have used a while
loop. The three conditions to stop the loop are also important. First, the _success
flag must be true
; this is set by the UpdateDataPointMeans
and is always true
unless one of the clusters is empty. The next condition is that the _changed
flag must be true
; this is also set by the UpdateClusterMembership
method and is always true
except when one of the clusters is empty or no change happens in the cluster membership for the data points. A threshold is also set for the number of times the clustering operation is performed. This is set to ten times larger than the number of data points that need to be clustered.
Clustering and Printing the Result
The last thing to do is to use all of this and cluster some data. This is the event handler for a button which calls the Cluster
method:
private void btnCluster_Click(object sender, EventArgs e)
{
InitilizeRawData();
_numberOfClusters = int.Parse(txtNumberOfClusters.Text);
for (int i = 0; i < _numberOfClusters; i++)
{
_clusters.Add(new DataPoint(){Cluster = i});
}
Cluster(_normalizedDataToCluster, _numberOfClusters);
StringBuilder sb = new StringBuilder();
var group = _rawDataToCluster.GroupBy(s => s.Cluster).OrderBy(s => s.Key);
foreach (var g in group)
{
sb.AppendLine("Cluster # " + g.Key + ":");
foreach (var value in g)
{
sb.Append(value.ToString());
sb.AppendLine();
}
sb.AppendLine("------------------------------");
}
txtOutput.Text = sb.ToString();
}
This method is very straightforward. It reads the number of clusters specified by the user and adds the right number of data points to the _clusters
collection. It then calls the Cluster
method passing the _normalizedDataToCluster
collection and the number of clusters. After the clustering operation is done, it groups the items in the _ rawDataToCluster
collection according to their Cluster
property and then it prints out the members in each cluster.
Conclusion
There you have it! The K-Means Clustering algorithm in C#. This is a very beautiful and efficient clustering algorithm. I tried to explain the procedure in the simplest words possible. Hope you enjoyed it.