In this article, you will see how to cluster 2D data using Python, with a simulation in PyGame.
Introduction
In this article, I will explain an implementation of the K-Means algorithm, which is used in Machine Learning. In the figures above, the left shows unclustered data and the right shows the same data grouped into 10 clusters. For this, I have created two files:
- pyDataCluster.py
- clusterSimulation.py
File 1 contains a class that implements K-Means, and File 2 is a simulation written with PyGame (a game library for Python). The pyDataCluster class returns the clustered data, so the data can be viewed in the console too.
Background
Machine Learning is an advanced step in AI. Instead of creating a complex algorithm, simple algorithms are used with a large amount of previous data to get optimized results. This process is the basis of a learning algorithm. Clustering is a process where data is grouped into classes. To group the data, different parameters can be employed depending upon the situation. In the K-Means algorithm, we group the data by using the mean value of each cluster, which is computed from the raw data and then recomputed repeatedly until the means become stable.
Basic Workflow
The basic workflow is as follows:
1. Get the data.
2. Set the number of clusters you want.
3. Create an empty 2D array to store the clustered data.
4. For each cluster, pick a random point to serve as its initial mean.
5. For each point, calculate the distance to every mean.
6. Put the point in the cluster whose mean is nearest.
7. Recalculate the mean of every cluster and update the means.
8. Go back to step 5 with the updated means, and repeat until the means from two consecutive iterations are equal.
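To make the workflow concrete before looking at the class, here is a minimal standalone sketch of the same loop. It is not the pyDataCluster class itself: the function name kmeans_sketch, the seed parameter and the max_iterations cap are my own assumptions for illustration. It uses the same |dx|+|dy| distance and floored means as the class shown later.

```python
import math
import random


def kmeans_sketch(data, k, seed=0, max_iterations=100):
    """Cluster 2D points by following the workflow listed above."""
    rng = random.Random(seed)
    # Pick k data points at random to serve as the initial means.
    means = [list(p) for p in rng.sample(data, k)]
    clusters = []
    for _ in range(max_iterations):
        # Start every pass with k empty clusters.
        clusters = [[] for _ in range(k)]
        # Put each point in the cluster whose mean is nearest.
        for point in data:
            distances = [abs(point[0] - m[0]) + abs(point[1] - m[1]) for m in means]
            clusters[distances.index(min(distances))].append(point)
        # Recompute each mean; an empty cluster keeps its old mean.
        new_means = []
        for old_mean, group in zip(means, clusters):
            if group:
                new_means.append([
                    math.floor(sum(p[0] for p in group) / len(group)),
                    math.floor(sum(p[1] for p in group) / len(group)),
                ])
            else:
                new_means.append(old_mean)
        # Stop when two consecutive passes give the same means.
        if new_means == means:
            break
        means = new_means
    return clusters, means


points = [[1, 1], [2, 1], [1, 2], [100, 100], [101, 100], [100, 101]]
clusters, means = kmeans_sketch(points, k=2)
print([len(c) for c in clusters], means)
```

The cap on iterations is only a safety net for the sketch; the class below relies on the equality test alone.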
Using the Code
Let's look at the code, starting with the clustering class.
To use this class in your code, do this:
import random
from pyDataCluster import *

data=[]
groups=10
# 5000 random points inside a 500x500 area
for i in range(5000):
    data.append([random.randint(1,500),random.randint(1,500)])
cluster = pyDataCluster(groups,data)
This will randomly initialize the data and create an object named cluster with 10 groups and the data array.
finalCluster = cluster.finalCluster()   # repeat until the means stop changing
clus = cluster.createCluster()          # run a single clustering pass
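Since the clustered data comes back as a plain list of lists, it is easy to inspect in the console. The helper below is a hypothetical addition of mine, not part of pyDataCluster; it only assumes that each cluster is a list of [x, y] points, as in the code above.

```python
def summarize_clusters(clusters):
    """Return one text line per cluster: index, size, and floored mean."""
    lines = []
    for index, group in enumerate(clusters):
        if group:
            mean_x = sum(p[0] for p in group) // len(group)
            mean_y = sum(p[1] for p in group) // len(group)
            lines.append(f"cluster {index}: {len(group)} points, mean ({mean_x}, {mean_y})")
        else:
            lines.append(f"cluster {index}: empty")
    return lines


# Stand-in for the list of lists returned by finalCluster().
example = [[[1, 2], [3, 4]], [], [[10, 10]]]
for line in summarize_clusters(example):
    print(line)
```

With the real class you would pass finalCluster (the variable from the snippet above) instead of example.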
Initialization
The class constructor initializes the instance variables.
def __init__(self,numberOfCluster,Data,initialPoints=None):
    '''
    Constructor
    '''
    self.Kgroups=numberOfCluster
    self.Data=Data
    self.Cluster=[]
    # Copy the initial points so instances never share one default list.
    self.Kmeans=list(initialPoints) if initialPoints else []
    self.initialMeanPositions()
    self.terminat=True
Either pass the initial points or leave them out; initialMeanPositions() will initialize the means for you.
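The body of initialMeanPositions() is not listed in this article, so the following is only a sketch of what such a step could look like, based on the workflow ("get a random point value which will serve as initial means"). The function name pick_initial_means and the seed parameter are assumptions of mine, and choosing the means from the data points themselves is one possible reading of that step.

```python
import random


def pick_initial_means(data, k, initial_points=None, seed=None):
    """Hypothetical helper: return k starting means for K-Means."""
    if initial_points:
        # Caller supplied starting means; copy so the caller's list is untouched.
        return [list(p) for p in initial_points]
    rng = random.Random(seed)
    # Sample without replacement so no data point is picked twice.
    return [list(p) for p in rng.sample(data, k)]


data = [[5, 5], [20, 40], [300, 120], [450, 10]]
print(pick_initial_means(data, 2, seed=1))
print(pick_initial_means(data, 2, initial_points=[[0, 0], [500, 500]]))
```

Passing a seed makes a run repeatable, which helps when comparing the effect of different group counts on the same data.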
Create Cluster
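def createCluster(self):
    # Start from empty clusters on every pass.
    self.clusterSpace()
    for i in self.Data:
        point=[i[0],i[1]]
        # Index of the cluster whose mean is nearest to this point.
        group=self.getClusterGroup(point)
        self.Cluster[group].append(i)
    # Recalculate the means once every point has been assigned.
    self.setMeans()
    return(self.Cluster)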
This function is the workhorse of the class. It assigns every point to the cluster of its nearest current mean and then updates the means. Calling it repeatedly on the same data refines the clusters.
Final Cluster
To get the final cluster, this will do the job:
def finalCluster(self):
    while self.terminat:
        clus=self.createCluster()
    return(clus)
This function simply loops until the setMeans function gives the termination signal.
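As a precaution against a run that takes very long or never produces two identical consecutive sets of means, the loop can be given an upper bound. This is a sketch of that idea, not code from the class: run_until_stable and max_iterations are names I made up, and FakeCluster is only a stand-in exposing the same terminat flag and createCluster() method so the example runs on its own.

```python
def run_until_stable(cluster, max_iterations=100):
    """Call createCluster() until terminat is cleared or the cap is hit."""
    clusters = []
    iterations = 0
    while cluster.terminat and iterations < max_iterations:
        clusters = cluster.createCluster()
        iterations += 1
    return clusters, iterations


class FakeCluster:
    """Stand-in with the same two members, used only to demonstrate the loop."""

    def __init__(self, passes_needed):
        self.terminat = True
        self.passes_left = passes_needed

    def createCluster(self):
        self.passes_left -= 1
        if self.passes_left <= 0:
            self.terminat = False
        return [[[1, 1]], [[9, 9]]]


print(run_until_stable(FakeCluster(3))[1])                       # stops after 3 passes
print(run_until_stable(FakeCluster(10**6), max_iterations=5)[1])  # stops at the cap, 5
```

The returned iteration count also tells you how quickly a given data set settled.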
setMeans
This function recalculates the means, as described in the basic workflow:
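def setMeans(self):
    means=[]
    for index,group in enumerate(self.Cluster):
        if not group:
            # An empty cluster would divide by zero; keep its previous mean.
            means.append(self.Kmeans[index])
            continue
        x=0
        y=0
        for j in group:
            x=x+j[0]
            y=y+j[1]
        # Means are floored to integer coordinates.
        means.append([math.floor(x/len(group)),math.floor(y/len(group))])
    # Signal termination once the means no longer change between passes.
    if(self.Kmeans==means):
        self.terminat=False
    self.Kmeans=means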
Assigning the Cluster Group
This function will return the group index where a given point belongs:
def getClusterGroup(self,point):
    dist=[]
    for i in self.Kmeans:
        # Manhattan distance: |dx| + |dy|
        dist.append(math.fabs(point[0]-i[0])+math.fabs(point[1]-i[1]))
    # Index of the nearest mean (the first one wins on a tie).
    minIndex = dist.index(min(dist))
    return minIndex
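Note that getClusterGroup measures distance as |dx| + |dy| (Manhattan distance), while K-Means is usually described with Euclidean distance. The two can disagree about which mean is nearest. The sketch below shows both side by side; the function names are mine and are not part of the class.

```python
def nearest_by_manhattan(point, means):
    """Index of the nearest mean using |dx| + |dy|, as in getClusterGroup."""
    distances = [abs(point[0] - m[0]) + abs(point[1] - m[1]) for m in means]
    return distances.index(min(distances))


def nearest_by_euclidean(point, means):
    """Index of the nearest mean using squared Euclidean distance."""
    # The square root is skipped because it does not change which value is smallest.
    distances = [(point[0] - m[0]) ** 2 + (point[1] - m[1]) ** 2 for m in means]
    return distances.index(min(distances))


means = [[3, 3], [0, 5]]
print(nearest_by_manhattan([0, 0], means))  # 1: distances are 6 and 5
print(nearest_by_euclidean([0, 0], means))  # 0: squared distances are 18 and 25
```

For evenly scattered random data like in this article, the visual difference is small, but it is worth knowing which distance you are using.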
Empty Cluster
Every run needs an empty cluster list. This function flushes the old values, if any, and creates an empty one:
def clusterSpace(self):
    self.Cluster=[]
    for i in range(self.Kgroups):
        self.Cluster.append([])
That completes the clustering class. Now for the simulation part.
clusterSimulation
This requires the PyGame library, which can be downloaded from their site.
import pygame, sys, time
import random
from pygame.locals import *
from pyDataCluster import *

data=[]
groups=10
for i in range(5000):
    data.append([random.randint(1,500),random.randint(1,500)])
cluster = pyDataCluster(groups,data)

# One unique random color per group
Color=[]
for i in range(groups):
    while True:
        cl=((random.randint(0,255)),(random.randint(0,255)),(random.randint(0,255)))
        if cl not in Color:
            Color.append(cl)
            break

pygame.init()
WINDOWWIDTH = 500
WINDOWHEIGHT = 500
BASICFONT = pygame.font.Font('freesansbold.ttf',50)
windowSurface = pygame.display.set_mode((WINDOWWIDTH, WINDOWHEIGHT), 0, 32)
pygame.display.set_caption('Cluster Simulation')
BLACK = (0, 0, 0)
RED = (255, 0, 0)
GREEN = (0, 255, 0)
BLUE = (0, 0, 255)
WHITE=(255,255,255)

# Redraw all points after every clustering pass until the means are stable.
while cluster.terminat:
    points=[]
    clus=cluster.createCluster()
    a=0
    for i in clus:
        for j in i:
            points.append({'rect':pygame.Rect(j[0],j[1],4,4),'color':Color[a]})
        a=a+1
    for p in points:
        pygame.draw.rect(windowSurface, p['color'], p['rect'])
    pygame.display.update()

# Keep the window open until the user closes it.
while True:
    for event in pygame.event.get():
        if event.type == QUIT:
            pygame.quit()
            sys.exit()
Try changing the data amount and groups to see the effects.
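When you raise the number of groups, purely random colors can end up looking alike. One possible alternative, sketched here with the standard colorsys module, is to spread the hues evenly around the color wheel. The function name distinct_colors is my own; the result is a list of RGB tuples that could be used in place of the random Color list in the simulation.

```python
import colorsys


def distinct_colors(count):
    """Return count RGB tuples (0-255) with hues spread evenly around the wheel."""
    colors = []
    for index in range(count):
        # Full saturation and brightness; only the hue changes per group.
        r, g, b = colorsys.hsv_to_rgb(index / count, 1.0, 1.0)
        colors.append((round(r * 255), round(g * 255), round(b * 255)))
    return colors


Color = distinct_colors(10)
print(Color[0])  # (255, 0, 0), pure red for hue 0
```

This trades the randomness of the original loop for colors that are easier to tell apart at a glance.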