Introduction
KMeans is a simple clustering algorithm which calculates the distances between samples and centroids to assign samples to clusters. K is the number of clusters given by the user. Initially, the K centroids are chosen randomly, and they are adjusted at each iteration to approach the optimal result. Each centroid is the mean of the samples in its cluster, which is why the algorithm is called K-"Means".
KMeans Model
KMeans
The normal KMeans algorithm is quite simple. Denote the K clusters as $C_1, C_2, \dots, C_K$, the number of samples in each cluster as $N_1, N_2, \dots, N_K$, and the centroid of cluster $C_k$ as $\mu_k$. The loss function of KMeans is:

$$L(\mu_1, \dots, \mu_K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

Calculate the derivative of the loss function with respect to $\mu_k$:

$$\frac{\partial L}{\partial \mu_k} = -2 \sum_{x_i \in C_k} (x_i - \mu_k)$$

Then, setting the derivative equal to 0, we obtain:

$$\mu_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i$$

namely, the centroid is the mean of the samples in its cluster. The code of KMeans is shown below:
def kmeans(self, train_data, k):
    sample_num = len(train_data)
    # distances[:, 0] stores the cluster label of each sample,
    # distances[:, 1] stores its distance to the assigned centroid
    distances = np.zeros([sample_num, 2])
    # pass k through so callers (e.g., biKmeans) can request a k-way split
    centers = self.createCenter(train_data, k)
    centers, distances = self.adjustCluster(centers, distances, train_data, k)
    return centers, distances
where adjustCluster() adjusts the centroids after the initial centroids have been determined, aiming to minimize the loss function. The code of adjustCluster() is:
def adjustCluster(self, centers, distances, train_data, k):
    sample_num = len(train_data)
    flag = True
    while flag:
        flag = False
        # distance from every sample to every centroid
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # assign each sample to its nearest centroid
        old_label = distances[:, 0].copy()
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # keep iterating as long as at least one label changed
        if np.any(old_label != distances[:, 0]):
            flag = True
            # update each centroid as the mean of the samples in its cluster
            for j in range(k):
                current_cluster = train_data[distances[:, 0] == j]
                if len(current_cluster) != 0:
                    centers[j, :] = np.mean(current_cluster, axis=0)
    return centers, distances
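The snippets above also rely on two helper methods that are not listed in this article, calculateDistance() and createCenter(). A minimal sketch of what they might look like is given below, assuming Euclidean distance and initial centroids drawn uniformly at random within the range of the data, with createCenter() taking the number of clusters as a second argument so that kmeans() can request k centroids; the actual implementations in the accompanying repository may differ:

import numpy as np

def calculateDistance(self, data, center):
    # hypothetical helper: Euclidean distance from every sample (row) to one centroid
    return np.sqrt(np.sum(np.power(data - center, 2), axis=1))

def createCenter(self, train_data, k):
    # hypothetical helper: draw k initial centroids uniformly within the
    # minimum/maximum range of each feature
    feature_num = train_data.shape[1]
    centers = np.zeros([k, feature_num])
    for i in range(feature_num):
        low = np.min(train_data[:, i])
        high = np.max(train_data[:, i])
        centers[:, i] = low + (high - low) * np.random.rand(k)
    return centers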
Bisecting KMeans
Because KMeans may converge to a local optimum, we introduce another KMeans algorithm called bisecting KMeans to mitigate the problem. In bisecting KMeans, all the samples are initially regarded as one cluster, and that cluster is divided into two parts. Then, one of the existing clusters is chosen and bisected again, and so on, until the number of clusters is K. We choose the cluster to bisect according to the minimum Sum of Squared Error (SSE). Denote the current n clusters as:

$$C = \{C_1, C_2, \dots, C_n\}, \quad n < K$$

and the SSE of a cluster $C_j$ with centroid $\mu_j$ as $SSE(C_j) = \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$. We choose a cluster $C_i$ in $C$ and bisect it into two parts $C_{i1}$ and $C_{i2}$ using normal KMeans. The total SSE after this split is:

$$SSE_i = SSE(C_{i1}) + SSE(C_{i2}) + \sum_{j \neq i} SSE(C_j)$$

We choose the $C_i$ which yields the minimum $SSE_i$ as the cluster to be bisected, namely:

$$i^{*} = \arg\min_i SSE_i$$

Repeat the above process until the number of clusters is K. See the illustration and code below.
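As a hypothetical numerical illustration (the numbers are invented, not from the article's dataset): suppose there are currently two clusters with $SSE(C_1) = 10$ and $SSE(C_2) = 4$. Bisecting $C_1$ produces two sub-clusters with a combined SSE of $6$, so the total error would be $6 + 4 = 10$; bisecting $C_2$ produces sub-clusters with a combined SSE of $2$, so the total would be $10 + 2 = 12$. Since $10 < 12$, $C_1$ is chosen as the cluster to bisect.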
def biKmeans(self, train_data):
    sample_num = len(train_data)
    # distances[:, 0]: cluster label, distances[:, 1]: distance to the assigned centroid
    distances = np.zeros([sample_num, 2])
    # start from a single cluster whose centroid is the global mean
    initial_center = np.mean(train_data, axis=0)
    centers = [initial_center]
    distances[:, 1] = self.calculateDistance(train_data, initial_center)
    while len(centers) < self.k:
        min_SSE = np.inf
        best_index = None
        best_centers = None
        best_distances = None
        for j in range(len(centers)):
            # tentatively bisect cluster j with normal KMeans (k = 2)
            centerj_data = train_data[distances[:, 0] == j]
            split_centers, split_distances = self.kmeans(centerj_data, 2)
            split_SSE = np.sum(np.power(split_distances[:, 1], 2))
            other_distances = distances[distances[:, 0] != j]
            other_SSE = np.sum(np.power(other_distances[:, 1], 2))
            # keep the bisection that yields the minimum total SSE
            if (split_SSE + other_SSE) < min_SSE:
                best_index = j
                best_centers = split_centers
                best_distances = split_distances
                min_SSE = split_SSE + other_SSE
        # relabel the two halves: one keeps the old index, the other gets a new index
        best_distances[best_distances[:, 0] == 1, 0] = len(centers)
        best_distances[best_distances[:, 0] == 0, 0] = best_index
        centers[best_index] = best_centers[0, :]
        centers.append(best_centers[1, :])
        distances[distances[:, 0] == best_index, :] = best_distances
    centers = np.array(centers)
    return centers, distances
KMeans++
Because the initial centroids have a great effect on the performance of KMeans, we introduce another KMeans algorithm called KMeans++ to address the problem. Denote the n centroids that have already been chosen as:

$$\{\mu_1, \mu_2, \dots, \mu_n\}, \quad n < K$$

When we choose the (n+1)-th centroid, the farther a sample is from the existing centroids, the more likely it is to be chosen as the new centroid. First, we calculate the minimum distance between each sample $x_i$ and the existing centroids:

$$D(x_i) = \min_{1 \le j \le n} \lVert x_i - \mu_j \rVert$$

Then, we calculate the probability of each sample being chosen as the next centroid, where N is the total number of samples:

$$P(x_i) = \frac{D(x_i)^2}{\sum_{m=1}^{N} D(x_m)^2}$$

Finally, we use roulette wheel selection to pick the next centroid. After the K initial centroids are determined, run adjustCluster() to refine the result.
def kmeansplusplus(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])
    # pick the first centroid uniformly at random among all samples
    initial_center = train_data[np.random.randint(0, sample_num)]
    centers = [initial_center]
    while len(centers) < self.k:
        # minimum distance from each sample to the already chosen centroids
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            d[:, i] = self.calculateDistance(train_data, centers[i])
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # samples far from all existing centroids get a higher selection probability
        prob = np.power(distances[:, 1], 2) / np.sum(np.power(distances[:, 1], 2))
        index = self.rouletteWheelSelection(prob, sample_num)
        new_center = train_data[index, :]
        centers.append(new_center)
    centers = np.array(centers)
    # refine the clusters starting from the K chosen centroids
    centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
    return centers, distances
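The method rouletteWheelSelection() is also not listed in this article. A minimal sketch of one possible implementation is shown below, assuming prob is a normalized probability vector and the returned value is the index of the selected sample (the signature is inferred from the call above; the repository version may differ):

import numpy as np

def rouletteWheelSelection(self, prob, sample_num):
    # hypothetical helper: pick the first index whose cumulative probability
    # exceeds a uniform random number in [0, 1)
    cumulative = np.cumsum(prob)
    r = np.random.rand()
    for index in range(sample_num):
        if r <= cumulative[index]:
            return index
    # guard against floating point round-off
    return sample_num - 1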
Conclusion and Analysis
Indeed, it is necessary to adjust the clusters after the initial centroids are determined. As for how to choose the parameter 'K', there exist dedicated algorithms, which are not covered here. Finally, let's compare the performances of the three clustering algorithms.
It can be found that KMeans++ has the best performance.
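As a rough sketch of how such a comparison could be reproduced (this is not the author's benchmark; it assumes the methods above are collected in a hypothetical wrapper class named KMeans that stores k and the helper methods sketched earlier):

import numpy as np

if __name__ == "__main__":
    np.random.seed(0)
    # three synthetic Gaussian blobs as toy data
    data = np.vstack([np.random.randn(100, 2) + [0, 0],
                      np.random.randn(100, 2) + [5, 5],
                      np.random.randn(100, 2) + [0, 5]])

    model = KMeans(k=3)  # hypothetical wrapper class holding the methods above
    results = [("KMeans", model.kmeans(data, model.k)),
               ("bisecting KMeans", model.biKmeans(data)),
               ("KMeans++", model.kmeansplusplus(data))]
    for name, (centers, distances) in results:
        # distances[:, 1] holds the distance of each sample to its centroid
        sse = np.sum(np.power(distances[:, 1], 2))
        print("%s: SSE = %.2f" % (name, sse))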
The related code and dataset for this article can be found in MachineLearning.
History
- 3rd June, 2019: Initial version