(untagged)

Implementing The K-Means Clustering Algorithm in C#.NET

Morteza Giti

0.00/5 (No votes)

26 Apr 2015

How to implement the K-means clustering algorithm in C#.NET

Download project - 37.2 KB

Introduction

As the title of this article suggests, we are going to implement the K-Means Clustering algorithm is C#. It’s been a while since I started to study this algorithm for one of my projects. It’s a nice, straightforward, yet efficient algorithm to classify some data into different groups. A couple of days ago, I decided to implement the algorithm in C# and here I want to take you all through the steps of doing so. But, first things first. Let’s take a quick look at the K-Means Clustering algorithm itself.

The K-Means Clustering Algorithm

There are a lot of pages and websites which explain the K-Means Clustering algorithm just to make you even more confused. Let’s be honest, there are also very useful and straightforward explanations out there. I try my best to show you this algorithm in the simplest way possible. As I said earlier, this algorithm is used to classify some pieces of data into different groups or, as we will call it from now on, clusters. It might have happened to you as well that at one point you needed to group similar items in a way that members of the same group have a lot in common while members of different groups vary greatly. This is what this algorithm tries to do.

To make things clearer, let’s take a look at the following pictures which I got from Wikipedia.

I will refer to these four pictures from left to right. So, let’s take a look at the left-hand side picture. In this picture, you see a group of gray squares which we want to cluster. You can also see three red, green and blue circles. These are called centroids. Simply put, these little guys will help us to group the gray squares. Simple enough, huh? Ok. Just stay with me.

Now, take a look at the second picture. Here, the gray squares are no longer gray. In fact, they have got the color of the circles which indicates that they are a member of that cluster. The whole area is also segmented into three smaller colored regions to show the boundaries of the three clusters. So, now we have got three clusters, all of which contain some of the squares.

The next picture shows some computation. This is the part of the algorithm in which the squares might move from a cluster to another one. But how do we decide that? I’m going to skip the answer to this question for now and get back to it later on.

Finally, the last picture shows the final result of the algorithm in which all of the squares are grouped into the three clusters. The final step of the algorithm clusters the squares in such a way that members of the same cluster have a lot in common while members of different groups are substantially different.

Now, the question that you might ask yourself is why we call it "K" and why "Means"? Well, simply put, "K" is the number of centroids that you decide to have and "Mean" is the criterion that decides which cluster a piece of data should be in. I will elaborate more on this later on. Also, please visit this page for further information on the K-Means Clustering algorithm.

The K-Means Clustering Algorithm in C#

The Data Point Data Model

Now that we know a little bit about the overall goal of the algorithm, let’s try to implement it in C#. The first thing that I have done is to create a data model to store the data I want to cluster. The data you wish to cluster could be about anything. In my case, I need to cluster some objects of a type called "Data Point". This class has two properties: Height and Weight which may correspond to the two physical attributes of a person. There is also another property called "Cluster". This property stores the cluster index which a Data Point is a member of.

public class DataPoint
{
public double Height { get; set; }
public double Weight { get; set; }
public int Cluster { get; set; }
public DataPoint(double height, double weight)
{
Height = height;
Weight = weight;
Cluster = 0;
}
 
public DataPoint()
{
 
}
 
public override string ToString()
{
return string.Format("{{{0},{1}}}", Height.ToString("f" + 1), Weight.ToString("f" + 1));
}
}

There is also the ToString method used to insert data points into a listbox on the main form. Here is the event handler to do this:

private void txtWeight_KeyDown(object sender, KeyEventArgs e)
{ 
if (e.KeyCode==Keys.Enter)
{
if (string.IsNullOrEmpty(txtHeight.Text) || string.IsNullOrEmpty(txtWeight.Text))
{
MessageBox.Show("Please enter the values for both Height and Weight.");
return;
}
DataPoint dp = new DataPoint();
dp.Height = double.Parse(txtHeight.Text);
dp.Weight = double.Parse(txtWeight.Text);
lstRawData.Items.Add(dp.ToString());
}
}

The Class Level Objects

I have also declared some objects at the class level for the main form of the application:

List<DataPoint> _rawDataToCluster = new List<DataPoint>();
List<DataPoint> _normalizedDataToCluster = new List<DataPoint>();
List<DataPoint> _clusters = new List<DataPoint>();
private int _numberOfClusters = 0;

I think the code above is clear enough, but let’s just say what the objects will be responsible for:

· _rawDataToCluster: stores the data entered by the user.· _normalizedDataToCluster: stores the normalized version of the data stored in the _rawDataToCluster. What is normalization? I will talk about it later on. · _clusters: stores the means of each cluster. Means? What is that? Don’t worry. Again, I will get back to this question. But, for the time being, I should say that the data stored in this collection decides whether a data point should move between two clusters or stay with the cluster where it already is. · _numberOfClusters: Number of clusters which we want to group the data in. In the pictures above, the number of clusters is three.

The InitilizeRawData Method

The first thing to do is to populate the _rawDataToCluster collection using the data entered by the user. I should mention that the user is supposed to enter two pieces of data into two textboxes for the Height and Weight properties. Upon pressing the Enter key, the data is added to a listbox using the following format: {32.0,230.0}.

Now, it is time to populate the _rawDataToCluster collection using the data in the listbox. Of course, there are a lot of ways to do that. I tried to stick to the most simple one:

private void InitilizeRawData()
{
if (__rawDataToCluster .Count == 0)
{
string lstRawDataItem = string.Empty;
for (int i = 0; i < lstRawData.Items.Count; i++)
{
DataPoint dp = new DataPoint();
lstRawDataItem = lstRawData.Items[i].ToString();
lstRawDataItem = lstRawDataItem.Replace("{", "");
lstRawDataItem = lstRawDataItem.Replace("}", "");
string[] data = lstRawDataItem.Split(',');
dp.Height = double.Parse(data[0]);
dp.Weight = double.Parse(data[1]);
__rawDataToCluster .Add(dp);
}
}
}

The method above simply enumerates through the items in the listbox, extracts the two pieces of data in each row using the Split method of type String, instantiates a new object of type Data Point, sets the properties and adds the object to the _rawDataToCluster.

The NormalizeData Method

The data obtained so far cannot possibly be used yet. First, we need to normalize it. Why? It’s so simple. You see, we are dealing with two different types of data: Height and Weight. They are somehow about different measurements. To give another example, if we are dealing with some data about number of days, age, number of times a customer has purchased something, etc. we need to normalize the data. We do the normalization to make sure that one piece of data (Height, for example) doesn’t overwhelm the other (Weight). Again, we may also skip the normalization step, but best practice is to do it. So, here is the code:

private void NormalizeData()
{ 
double heightSum = 0.0;
double weightSum = 0.0;
foreach (DataPoint dataPoint in _rawDataToCluster)
{
heightSum += dataPoint.Height;
weightSum += dataPoint.Weight;
}
double heightMean = heightSum / _rawDataToCluster.Count;
double weightMean = weightSum / _rawDataToCluster.Count;
double sumHeight = 0.0;
double sumWeight = 0.0;
foreach (DataPoint dataPoint in _rawDataToCluster)
{
sumHeight += Math.Pow(dataPoint.Height - heightMean, 2);
sumWeight += Math.Pow(dataPoint.Weight - weightMean, 2);
}
double heightSD = sumHeight / _rawDataToCluster.Count;
double weightSD = sumWeight / _rawDataToCluster.Count;
foreach (DataPoint dataPoint in _rawDataToCluster)
{
_normalizedDataToCluster.Add(new DataPoint()
{
Height = (dataPoint.Height - heightMean)/heightSD,
Weight = (dataPoint.Weight - weightMean) / weightSD
}
);
} 
}

There are different methods to normalize the data we have got so far. The method used here is the Gaussian (also called normal) normalization. How does it work? Very simple. The first loop and the next two lines compute the mean of both Height and Weight for the data points. How? Just sum the values of Height and Weight up and divide by the number of items in the collection. The next step is to do something for all of the items in the collection which is subtracting both Height and Weight by the means calculated in the previous step (heightMean, weightMean) and then to the power two. The next step is to divide these two values by the number of items in the collection. Finally, the last loop populates the _normalizedDataToCluster collection by adding new objects of type DataPoint to it. The values used for the two properties are the normalized versions of the values in the _rawDataToCluster collection. Now, most normalized values will be between -3 and +3. Please, feel free to visit this page for more information about this method of normalization.

The ShowData Method

Now that we have got the normalized data, how about showing it to the user? Well, the following method just does that:

private void ShowData(List<DataPoint> data, int decimals, TextBox outPut)
{ 
foreach (DataPoint dp in data)
{
outPut.Text += dp.ToString() + Environment.NewLine;
} 
}

This method accepts a generic list of type DataPoint, the number of floating point numbers and also a reference to a textbox where the output is to be printed. There are actually three textboxes on the form. One for showing the normalized data, one for showing the clustering history and yet another one for showing the last result of the algorithm. Again, the code is simple enough: just enumerate through all the members in the list, call the ToString method and that’s it. We get all of the items in the collection in a format like this: {Height,Weight}. Remember the ToString method?

The InitializeCentroids Method

Well, here starts the core of the K-Means Clustering algorithm. Do you remember the centroids? Those little colored circles which helped us group the gray squares. One of things about the K-Means Clustering algorithm is to set the centroids for each data item initially. How do we do this? The most basic way is to set the centroids randomly. There are also other methods which we skip for now. This is what we are going to do. So, for every DataPoint that we have got so far, we need to set the Cluster property using a random mechanism:

private void InitializeCentroids()
{
Random random = new Random(_numberOfClusters);
for (int i = 0; i < _numberOfClusters; ++i)
{
_normalizedDataToCluster[i].Cluster = _rawDataToCluster[i].Cluster = i;
}
for (int i = _numberOfClusters; i < _normalizedDataToCluster.Count; i++)
{
_normalizedDataToCluster[i].Cluster = _rawDataToCluster[i].Cluster = random.Next(0, _numberOfClusters);
} 
}

One thing to bear in mind about this clustering algorithm is that a cluster must always have at least one member (here DataPoint). That’s what the first loop is about. The first loop arbitrarily adds one data point to each cluster by setting the Cluster property in both the items in _normalizedDataToCluster and _rawDataToCluster collections. Then, the second loop sets the Cluster property for the remaining data points to random numbers between zero and the number of clusters entered by the user. Why do we set the Cluster property for both the items in the _normalizedDataToCluster and _rawDataToCluster collections? That’s because at the end, we use the _rawDataToCluster to show the result of the algorithm. In other words, we use the Cluster property in the _normalizedDataToCluster to actually do the clustering and the Cluster property in the _rawDataToCluster to show the result to the user. It’s something I have decided to do. You, if you wish, can omit this part and use the _rawDataToCluster to print out the final results.

The UpdateDataPointMeans Method

Do you remember I told you that the "Means" word in the name of the algorithm is responsible to decide which cluster should hold a piece of data? If you don’t, please look back and read again. Now, the process is a little bit complex but I try to explain it in the simplest way possible. You see, we have already set the centroids of the data points. That is, we have already put the data points in some arbitrary clusters. Now, it is time to actually find the most suitable cluster for each data point. How do we decide that? Again, very simple. First, we need to calculate the mean of all of the items in a cluster (for both Height and Weight). For example, if a cluster contains three data points such as {32,65}, {16,87} and {17,60}, the mean of this cluster is 32+16+17/3 and 65+87+60/3. We store these two values in the _cluster collection. Just to remind you, the _cluster collection is generic list of type DataPoint declared at the class level. However, the Height and the Weight properties of the items in this collection no longer represent the actual values for a DataPoint, but rather the means for the corresponding properties for all of the data points in a cluster. The Cluster property for the items in this collection is a simple integer number to show the index of a cluster. However, this property for DataPoint objects in the _normalizedDataToCluster and _rawDataToCluster represent the index of the cluster that holds that DataPoint. Let’s talk about this example a little more to make things simpler. Suppose there is a DataPoint object in the _cluster collection with these properties:

DataPoint{21.6,70.6,1}

There are also three items in the _rawDataToCluster or the _normalizedDataToCluster collections with these properties:

DataPoint{32,65,1}, DataPoint{16,87,1}, DataPoint{17,60,1}

All this means that the three data points in the _rawDataToCluster collection are members of a cluster with index 1 that also has the means of the two properties of its members in its Height and Weight properties: 21.6 and 70.6.

Now, enough with these explanations, let’s look at the code:

private bool UpdateDataPointMeans()
{
if (EmptyCluster(_normalizedDataToCluster)) return false;
 
var groupToComputeMeans = _normalizedDataToCluster.GroupBy(s => s.Cluster).OrderBy(s => s.Key);
int clusterIndex = 0;
double height = 0.0;
double weight = 0.0;
foreach (var item in groupToComputeMeans)
{
foreach (var value in item)
{
height += value.Height;
weight += value.Weight;
}
_clusters[clusterIndex].Height = height / item.Count();
_clusters[clusterIndex].Weight = weight / item.Count();
clusterIndex++;
height = 0.0;
weight = 0.0;
}
return true; 
}

This method makes use of LINQ to group the _normalizedDataToCluster based on the Cluster property. Then, it iterates through each item in each group, calculates the mean and updates the clusters collection. But, first, it makes sure that none of the clusters is empty. If even one of the clusters is empty, it returns false and doesn’t continue. This is exactly what I talked about before: At any point of time, each cluster must have at least one member.

The EmptyCluster Method

This is the method which tests the clusters to make sure each one of them has at least one member:

private bool EmptyCluster(List<DataPoint> data)
{
var emptyCluster =
data.GroupBy(s => s.Cluster).OrderBy(s => s.Key).Select(g => new {Cluster = g.Key, Count = g.Count()});
 
foreach (var item in emptyCluster)
{
if (item.Count == 0)
{
return true;
}
}
return false;
}

The ElucidanDistance Method

Now for the most important part of our algorithm. The idea behind the K-Means Clustering algorithm is to put similar items together, right? Correct. But what does this similarity mean? Well, it is the difference between an item and the mean of the current members of a cluster. How do you compute this difference? Well, there are a lot of ways to do this. The simplest one is the Elucidan distance method which I have used here:

private double ElucidanDistance(DataPoint dataPoint, DataPoint mean)
{ 
double _diffs = 0.0;
_diffs = Math.Pow(dataPoint.Height - mean.Height, 2);
_diffs += Math.Pow(dataPoint.Weight - mean.Weight, 2);
return Math.Sqrt(_diffs);
}

This method is pretty straightforward. It accepts two arguments of type DataPoint: one being the DataPoint which we want to move to a more suitable cluster and the other being the DataPoint storing the mean of the items in the target cluster. The idea behind the Elucidan method is also simple. In this case, we just need to subtract the Height and Weight of the data point in question from the means of these two properties in the target cluster, then increase value to the power two, add them up, and then return the root square of this value which is the Elucidan difference between these two data points. For more information about this method, you can refer to this page. Now that we have got the distance of a data point from the mean of the target cluster, we can decide whether to move that data point to the new cluster or not. How do we decide that? Continue reading.

The UpdataClusterMembership Method

To move a data point from a cluster to a new one, we just need to update the value stored in its Cluster property. But, the question is that how do we decide this. Well, we need to compare the Height and Weight properties of a data point with the means of these two properties in all of the clusters. Then, we need to choose the cluster with the minimum difference and move the data point to that cluster. Why? Because that cluster is the most suitable one as the difference between its mean and the data point is at the lowest level. Remember, the final goal of the algorithm is to group similar items together; items which have the least amount of difference between their values. The following code is responsible to do all this:

private bool UpdateClusterMembership()
{ 
bool changed = false; 
 
double[] distances = new double[_numberOfClusters];
 
StringBuilder sb = new StringBuilder();
for (int i = 0; i < _normalizedDataToCluster.Count; ++i)
{
 
for (int k = 0; k < _numberOfClusters; ++k)
distances[k] = ElucidanDistance(_normalizedDataToCluster[i], _clusters[k]); 
 
int newClusterId = MinIndex(distances);
if (newClusterId != _normalizedDataToCluster[i].Cluster)
{
changed = true;
_normalizedDataToCluster[i].Cluster = _rawDataToCluster[i].Cluster = newClusterId;
sb.AppendLine("Data Point >> Height: " + _rawDataToCluster[i].Height + ", Weight: " +
_rawDataToCluster[i].Weight + " moved to Cluster # " + newClusterId);
}
else
{
sb.AppendLine("No change.");
}
sb.AppendLine("------------------------------");
txtIterations.Text += sb.ToString();
}
if (changed == false)
return false;
if (EmptyCluster(_normalizedDataToCluster)) return false;
return true; 
}

Let’s talk a little about this method. The outer-most loop iterates trough the items of the _normalizedDataToCluster collection. The next loop within this one calculates the Elucidan distance between each item in the _normalizedDataToCluster collection and the means of clusters stored in the _clusters collection and then stores the result of each of these comparisons in an array called distances. After that, the code calls a method named MinIndex which returns the index of the minimum value in an array. It’s very simple and I will put the code shortly. Now that we have the index of the cluster with minimum difference, we have to check whether this data point is already in that cluster. This is what the conditional part of the code does. If the data point is not already in the cluster with the minimum distance, we should move it there. How? By simply updating the value in the Cluster property of that data point. We set the changed flag to true in order to show that a change occurred in the cluster membership. Then, we also put a little bit of string in a StringBuilder object to notify the user that a data point changed cluster. If the data point is already in the cluster with the minimum amount of distance, we just print out "No Change". Simple enough? At the end of this method, we check that the changed flag is not false, we also check that none of the clusters is empty. If any of these conditions fail to meet, the method returns false. Why? Very simple. The idea behind the K-Means Clustering algorithm is that we should try to move items to more suitable clusters until no change in cluster membership happens or, as said previously, at least one of the clusters is empty.

The MinIndex Method

This method which was used in the UpdataClusterMembership method simply accepts an array and returns the index of the minimum value stored in it. It is used to find the cluster with the minimum difference for a data point to move to:

private int MinIndex(double[] distances)
{ 
int _indexOfMin = 0;
double _smallDist = distances[0];
for (int k = 0; k < distances.Length; ++k)
{
if (distances[k] < _smallDist)
{
_smallDist = distances[k];
_indexOfMin = k;
}
}
return _indexOfMin;
}

The Cluster Method

Here, we put the last piece of our code to cluster the data. It is a very simple method that uses most of the code we already talked about:

public void Cluster(List<DataPoint> data, int numClusters)
{
bool _changed = true;
bool _success = true;
InitializeCentroids(); 
 
int maxIteration = data.Count * 10; 
int _threshold = 0;
while (_success == true && _changed == true && _threshold < maxIteration)
{
++_threshold;
_success = UpdateDataPointMeans();
_changed = UpdateClusterMembership(); 
} 
}

This method accepts a collection of type DataPoint and a desired number of clusters. It calls the InitializeCentroids method to set some random centroids for the data points. Finally, it calls the UpdateDataPointMeans and UpdateClusterMembership methods in a loop. I have used a while loop. The three conditions to stop the loop are also important. First, the _success flag must be true; this is set by the UpdateDataPointMeans and is always true unless one of the clusters is empty. The next condition is that the _changed flag must be true; this is also set by the UpdateClusterMembership method and is always true except when one of the clusters is empty or no change happens in the cluster membership for the data points. A threshold is also set for the number of times the clustering operation is performed. This is set to ten times larger than the number of data points that need to be clustered.

Clustering and Printing the Result

The last thing to do is to use all of this and cluster some data. This is the event handler for a button which calls the Cluster method:

private void btnCluster_Click(object sender, EventArgs e)
{
InitilizeRawData();
_numberOfClusters = int.Parse(txtNumberOfClusters.Text);
 
for (int i = 0; i < _numberOfClusters; i++)
{
_clusters.Add(new DataPoint(){Cluster = i});
}
 
Cluster(_normalizedDataToCluster, _numberOfClusters);
StringBuilder sb = new StringBuilder();
var group = _rawDataToCluster.GroupBy(s => s.Cluster).OrderBy(s => s.Key);
foreach (var g in group)
{ 
sb.AppendLine("Cluster # " + g.Key + ":"); 
foreach (var value in g)
{
sb.Append(value.ToString());
sb.AppendLine();
}
sb.AppendLine("------------------------------"); 
}
txtOutput.Text = sb.ToString();
}

This method is very straightforward. It reads the number of clusters specified by the user and adds the right number of data points to the _clusters collection. It then calls the Cluster method passing the _normalizedDataToCluster collection and the number of clusters. After the clustering operation is done, it groups the items in the _ rawDataToCluster collection according to their Cluster property and then it prints out the members in each cluster.

Conclusion

There you have it! The K-Means Clustering algorithm in C#. This is a very beautiful and efficient clustering algorithm. I tried to explain the procedure in the simplest words possible. Hope you enjoyed it.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here