Introduction
KMeans is a simple clustering algorithm which calculates the distances between samples and centroids to assign samples to clusters. K is the number of clusters given by the user. Initially, the K centroids are chosen randomly, and they are adjusted at each iteration to approach the optimal result. Each centroid is the mean of the samples in its cluster, which is why the algorithm is called K-"Means".
KMeans Model
KMeans
The normal KMeans algorithm is quite simple. Denote the K clusters as $C_1, C_2, \dots, C_K$, the number of samples in each cluster as $N_1, N_2, \dots, N_K$, and the centroid of cluster $C_k$ as $\mu_k$. The loss function of KMeans is:

$$L(\mu_1, \dots, \mu_K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

Calculate the derivative of the loss function with respect to $\mu_k$:

$$\frac{\partial L}{\partial \mu_k} = -2 \sum_{x_i \in C_k} (x_i - \mu_k)$$

Then, setting the derivative equal to 0, we obtain:

$$\mu_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i$$

namely, the centroid is the mean of the samples in its cluster. The code of KMeans is shown below:
def kmeans(self, train_data, k):
    sample_num = len(train_data)
    # distances[:, 0] stores the cluster label of each sample,
    # distances[:, 1] stores its distance to the assigned centroid
    distances = np.zeros([sample_num, 2])
    # pass k through so callers (e.g., biKmeans) can request a k-way split
    centers = self.createCenter(train_data, k)
    centers, distances = self.adjustCluster(centers, distances, train_data, k)
    return centers, distances
where adjustCluster() adjusts the centroids after the initial centroids have been determined, aiming to minimize the loss function. The code of adjustCluster() is:
def adjustCluster(self, centers, distances, train_data, k):
    sample_num = len(train_data)
    flag = True
    while flag:
        flag = False
        # distance from every sample to every centroid
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # assign each sample to its nearest centroid
        old_label = distances[:, 0].copy()
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # keep iterating as long as at least one label changed
        if np.any(old_label != distances[:, 0]):
            flag = True
            # update each centroid as the mean of the samples in its cluster
            for j in range(k):
                current_cluster = train_data[distances[:, 0] == j]
                if len(current_cluster) != 0:
                    centers[j, :] = np.mean(current_cluster, axis=0)
    return centers, distances
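The snippets above also rely on two helper methods that are not listed in this article, calculateDistance() and createCenter(). A minimal sketch of what they might look like is given below, assuming Euclidean distance and initial centroids drawn uniformly at random within the range of the data, with createCenter() taking the number of clusters as a second argument so that kmeans() can request k centroids; the actual implementations in the accompanying repository may differ:

import numpy as np

def calculateDistance(self, data, center):
    # hypothetical helper: Euclidean distance from every sample (row) to one centroid
    return np.sqrt(np.sum(np.power(data - center, 2), axis=1))

def createCenter(self, train_data, k):
    # hypothetical helper: draw k initial centroids uniformly within the
    # minimum/maximum range of each feature
    feature_num = train_data.shape[1]
    centers = np.zeros([k, feature_num])
    for i in range(feature_num):
        low = np.min(train_data[:, i])
        high = np.max(train_data[:, i])
        centers[:, i] = low + (high - low) * np.random.rand(k)
    return centers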
Bisecting KMeans
Because KMeans may converge to a local optimum, we introduce another KMeans algorithm called bisecting KMeans to mitigate the problem. In bisecting KMeans, all the samples are initially regarded as one cluster, and that cluster is divided into two parts. Then, one of the existing clusters is chosen and bisected again, and so on, until the number of clusters is K. We choose the cluster to bisect according to the minimum Sum of Squared Error (SSE). Denote the current n clusters as:

$$C = \{C_1, C_2, \dots, C_n\}, \quad n < K$$

and the SSE of a cluster $C_j$ with centroid $\mu_j$ as $SSE(C_j) = \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$. We choose a cluster $C_i$ in $C$ and bisect it into two parts $C_{i1}$ and $C_{i2}$ using normal KMeans. The total SSE after this split is:

$$SSE_i = SSE(C_{i1}) + SSE(C_{i2}) + \sum_{j \neq i} SSE(C_j)$$

We choose the $C_i$ which yields the minimum $SSE_i$ as the cluster to be bisected, namely:

$$i^{*} = \arg\min_i SSE_i$$

Repeat the above process until the number of clusters is K. See the illustration and code below.
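As a hypothetical numerical illustration (the numbers are invented, not from the article's dataset): suppose there are currently two clusters with $SSE(C_1) = 10$ and $SSE(C_2) = 4$. Bisecting $C_1$ produces two sub-clusters with a combined SSE of $6$, so the total error would be $6 + 4 = 10$; bisecting $C_2$ produces sub-clusters with a combined SSE of $2$, so the total would be $10 + 2 = 12$. Since $10 < 12$, $C_1$ is chosen as the cluster to bisect.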
def biKmeans(self, train_data):
    sample_num = len(train_data)
    # distances[:, 0]: cluster label, distances[:, 1]: distance to the assigned centroid
    distances = np.zeros([sample_num, 2])
    # start from a single cluster whose centroid is the global mean
    initial_center = np.mean(train_data, axis=0)
    centers = [initial_center]
    distances[:, 1] = self.calculateDistance(train_data, initial_center)
    while len(centers) < self.k:
        min_SSE = np.inf
        best_index = None
        best_centers = None
        best_distances = None
        for j in range(len(centers)):
            # tentatively bisect cluster j with normal KMeans (k = 2)
            centerj_data = train_data[distances[:, 0] == j]
            split_centers, split_distances = self.kmeans(centerj_data, 2)
            split_SSE = np.sum(np.power(split_distances[:, 1], 2))
            other_distances = distances[distances[:, 0] != j]
            other_SSE = np.sum(np.power(other_distances[:, 1], 2))
            # keep the bisection that yields the minimum total SSE
            if (split_SSE + other_SSE) < min_SSE:
                best_index = j
                best_centers = split_centers
                best_distances = split_distances
                min_SSE = split_SSE + other_SSE
        # relabel the two halves: one keeps the old index, the other gets a new index
        best_distances[best_distances[:, 0] == 1, 0] = len(centers)
        best_distances[best_distances[:, 0] == 0, 0] = best_index
        centers[best_index] = best_centers[0, :]
        centers.append(best_centers[1, :])
        distances[distances[:, 0] == best_index, :] = best_distances
    centers = np.array(centers)
    return centers, distances
KMeans++
Because the initial centroids have a great effect on the performance of KMeans, we introduce another KMeans algorithm called KMeans++ to address the problem. Denote the n centroids that have already been chosen as:

$$\{\mu_1, \mu_2, \dots, \mu_n\}, \quad n < K$$

When we choose the (n+1)-th centroid, the farther a sample is from the existing centroids, the more likely it is to be chosen as the new centroid. First, we calculate the minimum distance between each sample $x_i$ and the existing centroids:

$$D(x_i) = \min_{1 \le j \le n} \lVert x_i - \mu_j \rVert$$

Then, we calculate the probability of each sample being chosen as the next centroid, where N is the total number of samples:

$$P(x_i) = \frac{D(x_i)^2}{\sum_{m=1}^{N} D(x_m)^2}$$

Finally, we use roulette wheel selection to pick the next centroid. After the K initial centroids are determined, run adjustCluster() to refine the result.
def kmeansplusplus(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])
    # pick the first centroid uniformly at random among all samples
    initial_center = train_data[np.random.randint(0, sample_num)]
    centers = [initial_center]
    while len(centers) < self.k:
        # minimum distance from each sample to the already chosen centroids
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            d[:, i] = self.calculateDistance(train_data, centers[i])
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # samples far from all existing centroids get a higher selection probability
        prob = np.power(distances[:, 1], 2) / np.sum(np.power(distances[:, 1], 2))
        index = self.rouletteWheelSelection(prob, sample_num)
        new_center = train_data[index, :]
        centers.append(new_center)
    centers = np.array(centers)
    # refine the clusters starting from the K chosen centroids
    centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
    return centers, distances
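The method rouletteWheelSelection() is also not listed in this article. A minimal sketch of one possible implementation is shown below, assuming prob is a normalized probability vector and the returned value is the index of the selected sample (the signature is inferred from the call above; the repository version may differ):

import numpy as np

def rouletteWheelSelection(self, prob, sample_num):
    # hypothetical helper: pick the first index whose cumulative probability
    # exceeds a uniform random number in [0, 1)
    cumulative = np.cumsum(prob)
    r = np.random.rand()
    for index in range(sample_num):
        if r <= cumulative[index]:
            return index
    # guard against floating point round-off
    return sample_num - 1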
Conclusion and Analysis
Indeed, it is necessary to adjust the clusters after the initial centroids are determined. As for how to choose the parameter 'K', there exist dedicated algorithms, which are not covered here. Finally, let's compare the performances of the three clustering algorithms.
It can be found that KMeans++ has the best performance.
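As a rough sketch of how such a comparison could be reproduced (this is not the author's benchmark; it assumes the methods above are collected in a hypothetical wrapper class named KMeans that stores k and the helper methods sketched earlier):

import numpy as np

if __name__ == "__main__":
    np.random.seed(0)
    # three synthetic Gaussian blobs as toy data
    data = np.vstack([np.random.randn(100, 2) + [0, 0],
                      np.random.randn(100, 2) + [5, 5],
                      np.random.randn(100, 2) + [0, 5]])

    model = KMeans(k=3)  # hypothetical wrapper class holding the methods above
    results = [("KMeans", model.kmeans(data, model.k)),
               ("bisecting KMeans", model.biKmeans(data)),
               ("KMeans++", model.kmeansplusplus(data))]
    for name, (centers, distances) in results:
        # distances[:, 1] holds the distance of each sample to its centroid
        sse = np.sum(np.power(distances[:, 1], 2))
        print("%s: SSE = %.2f" % (name, sse))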
The related code and dataset for this article can be found in MachineLearning.
History
- 3rd June, 2019: Initial version