Introduction
K-nearest neighbors (KNN) is a simple machine learning algorithm. Its principle is to calculate the distance between the test object and every object in the training set, and then add training objects to a K-NN set in order of increasing distance until the set contains a predefined number K of nearest neighbors. The class of the test object is then decided by a majority vote over this set, which can be expressed as:

y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)

where N_k(x) is the KNN set and the indicator function I is given by:

I(y_i = c_j) = \begin{cases} 1, & y_i = c_j \\ 0, & \text{otherwise} \end{cases}
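For example (a made-up case for illustration), suppose K = 5 and the labels of the five nearest neighbors are A, A, B, A, B. Then

\sum_{x_i \in N_5(x)} I(y_i = A) = 3 > 2 = \sum_{x_i \in N_5(x)} I(y_i = B)

so the test object is assigned the label A.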
KNN Model
The KNN model consists of three steps: distance calculation, selection of K, and classification.
Distance Calculation
Distance is used to measure the similarity between two objects in the feature space. There are many methods to calculate the distance between two objects, such as the Euclidean distance, cosine distance, edit distance, and Manhattan distance. The simplest is the Euclidean distance, which is calculated as:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

where n is the feature dimension.
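As a quick example of the formula, consider the two-dimensional points x = (1, 2) and y = (4, 6):

d(x, y) = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5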
Because features have different scales, features with larger values have a greater effect on the distance. Thus, all features need to be normalized. Specifically, there are two normalization methods. One is min-max normalization, which is given by:

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
# assumes: import numpy as np
def Normalization(self, data):
    # min-max normalization: scale every feature to [0, 1]
    minValue = data.min(axis=0)    # per-feature minimum
    maxValue = data.max(axis=0)    # per-feature maximum
    diff = maxValue - minValue     # per-feature range
    mindata = np.tile(minValue, (data.shape[0], 1))
    normdata = (data - mindata) / np.tile(diff, (data.shape[0], 1))
    return normdata
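As a quick check of the min-max formula, here is a standalone NumPy sketch (independent of the class method above; the toy matrix is made up for illustration):

import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
# broadcasting gives the same result as the np.tile version above
normdata = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(normdata)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]

NumPy broadcasting makes the np.tile calls optional; both versions produce the same result.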
The other is z-score normalization, which is given by:

x' = \frac{x - \mu}{\sigma}

where \mu is the mean of x and \sigma is the standard deviation of x.
def Standardization(self, data):
    # z-score normalization: zero mean and unit standard deviation per feature
    meanValue = data.mean(axis=0)   # per-feature mean
    varValue = data.std(axis=0)     # per-feature standard deviation
    standarddata = (data - np.tile(meanValue, (data.shape[0], 1))) \
                   / np.tile(varValue, (data.shape[0], 1))
    return standarddata
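Likewise, a standalone NumPy sketch (with the same made-up matrix) can verify that the standardized features have zero mean and unit standard deviation:

import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
standarddata = (data - data.mean(axis=0)) / data.std(axis=0)
print(standarddata.mean(axis=0))   # ~[0. 0.]
print(standarddata.std(axis=0))    # [1. 1.]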
After normalization, the code for computing the Euclidean distances between the test object and all training objects is shown below:
# input_sample is the feature vector of the test object
train_num = train_data.shape[0]
distances = np.tile(input_sample, (train_num, 1)) - train_data   # element-wise differences
distances = distances**2             # squared differences
distances = distances.sum(axis=1)    # sum over features
distances = distances**0.5           # square root -> Euclidean distance
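The step-by-step computation above is equivalent to a single call to np.linalg.norm; here is a minimal sketch (the arrays below are made up for illustration):

import numpy as np

train_data = np.array([[0.0, 0.0],
                       [1.0, 1.0],
                       [0.5, 0.0]])
input_sample = np.array([1.0, 0.0])

# Euclidean distance from the test object to every training object
distances = np.linalg.norm(train_data - input_sample, axis=1)
print(distances)   # [1.  1.  0.5]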
Select K
If we select a small K, the model is learned from a small neighborhood, which leads to a small approximation error but a large estimation error. In short, a small K makes the model more complex and prone to overfitting. On the contrary, if we select a large K, the model is learned from a large neighborhood, which leads to a large approximation error but a small estimation error. In short, a large K makes the model simpler and prone to underfitting, and it also increases the computation of the vote.
Classify
After determining K, we can classify by majority vote, which means the minority is subordinate to the majority. The code is shown below:
# assumes: import operator as op
disIndex = distances.argsort()    # indices sorted by ascending distance
labelCount = {}
for i in range(k):                # count the labels of the k nearest neighbors
    label = train_label[disIndex[i]]
    labelCount[label] = labelCount.get(label, 0) + 1
prediction = sorted(labelCount.items(), key=op.itemgetter(1), reverse=True)
label = prediction[0][0]          # the label with the most votes
In the above code, we first use argsort() to get the indices ordered by distance, then count each kind of label among the first K samples; finally, labelCount is sorted to obtain the label with the most votes, which is the prediction for the test object. The whole prediction function is shown below:
# assumes: import numpy as np, import operator as op
def calcuateDistance(self, input_sample, train_data, train_label, k):
    # compute the Euclidean distance from the test object to every training object
    train_num = train_data.shape[0]
    distances = np.tile(input_sample, (train_num, 1)) - train_data
    distances = distances**2
    distances = distances.sum(axis=1)
    distances = distances**0.5
    # vote among the k nearest neighbors
    disIndex = distances.argsort()
    labelCount = {}
    for i in range(k):
        label = train_label[disIndex[i]]
        labelCount[label] = labelCount.get(label, 0) + 1
    prediction = sorted(labelCount.items(), key=op.itemgetter(1), reverse=True)
    label = prediction[0][0]
    return label
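A minimal, self-contained usage sketch is shown below. The wrapper class name KNN and the toy data are assumptions for illustration (normalization is skipped here because the made-up features are already on comparable scales; in practice the test sample must be scaled with the training set's statistics):

import numpy as np
import operator as op

class KNN:   # assumed wrapper holding the article's prediction method
    def calcuateDistance(self, input_sample, train_data, train_label, k):
        train_num = train_data.shape[0]
        distances = np.tile(input_sample, (train_num, 1)) - train_data
        distances = distances**2
        distances = distances.sum(axis=1)
        distances = distances**0.5
        disIndex = distances.argsort()
        labelCount = {}
        for i in range(k):
            label = train_label[disIndex[i]]
            labelCount[label] = labelCount.get(label, 0) + 1
        prediction = sorted(labelCount.items(), key=op.itemgetter(1), reverse=True)
        return prediction[0][0]

knn = KNN()
train_data = np.array([[1.0, 1.1],
                       [1.0, 1.0],
                       [0.0, 0.0],
                       [0.0, 0.1]])
train_label = ['A', 'A', 'B', 'B']
test_sample = np.array([0.1, 0.0])
print(knn.calcuateDistance(test_sample, train_data, train_label, k=3))   # prints 'B'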
Conclusion and Analysis
In this article, KNN is implemented by a linear scan over the training set. However, there exist more efficient methods for KNN, such as the kd-tree. Moreover, cross-validation can be applied to find a more suitable K. Finally, let's compare our KNN with the KNN in sklearn; the detection performance is displayed below.
From the figure, we can see that the KNN in this article is better than sklearn's KNN in terms of accuracy, while the runtime is nearly the same.
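For readers who want to reproduce such a comparison, or to tune K by cross-validation as suggested above, here is a minimal scikit-learn sketch (the dataset and the range of K values are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # any labeled dataset works; iris is only an example

# pick K by 5-fold cross-validation
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])

# sklearn's KNN also supports a kd-tree index instead of a linear scan
clf = KNeighborsClassifier(n_neighbors=best_k, algorithm='kd_tree').fit(X, y)
print(clf.score(X, y))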
The related code and dataset for this article can be found in MachineLearning.