Introduction
Support Vector Machine (SVM) is a classifier based on the maximum margin in feature space. The learning strategy of SVM is to maximize the margin, which can be formulated as a convex quadratic programming problem. Before walking through the SVM algorithm, let me introduce some terms.
Functional margin is defined as follows: for a given training set T and hyper-plane (w, b), the functional margin between the hyper-plane (w, b) and a sample (xi, yi) is:

γ̂i = yi(w·xi + b)

The functional margin between the hyper-plane (w, b) and the training set T is the minimum over all samples:

γ̂ = min(γ̂1, γ̂2, ..., γ̂N)
The functional margin indicates the confidence of the classification result. However, if the parameters (w, b) are scaled proportionally, for example from (w, b) to (2w, 2b), the hyper-plane does not change, but the functional margin doubles. Thus, we apply a constraint on w to make the hyper-plane unique, such as the normalization ||w|| = 1.
The resulting margin is called the geometric margin: for a given training set T and hyper-plane (w, b), the geometric margin between the hyper-plane (w, b) and a sample (xi, yi) is:

γi = yi(w·xi + b) / ||w||

Similarly, the geometric margin between the hyper-plane (w, b) and the training set T is the minimum over all samples:

γ = min(γ1, γ2, ..., γN)
Now we can state the relationship between the functional margin and the geometric margin:

γi = γ̂i / ||w||, and γ = γ̂ / ||w||
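To make these definitions concrete, here is a minimal NumPy sketch that computes the functional and geometric margins of a toy training set; the data X, labels y and the candidate hyper-plane (w, b) are made up for the example.

import numpy as np

# Toy data: each row of X is a sample xi, y holds the labels yi in {-1, +1}.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

# A candidate hyper-plane (w, b).
w = np.array([0.5, 0.5])
b = -2.0

functional_margins = y * (X.dot(w) + b)                     # yi * (w·xi + b)
geometric_margins = functional_margins / np.linalg.norm(w)

print(functional_margins.min())   # functional margin of the training set
print(geometric_margins.min())    # geometric margin of the training set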
SVM Model
The SVM model consists of three parts: the optimization problem, the optimization algorithm, and classification.
Optimization Problem
When the dataset is linearly separable, the support vectors are the samples nearest to the hyper-plane, as shown below.
The samples on H1 and H2 are the support vectors. The distance between H1 and H2 is called the hard margin. The optimization problem of SVM is then:

min over (w, b): (1/2)||w||²
s.t. yi(w·xi + b) ≥ 1, i = 1, 2, ..., N
When the dataset is not linearly separable, some samples in the training set do not satisfy the condition that the functional margin is larger than or equal to 1. To solve this, we add a slack variable ξi ≥ 0 for each sample (xi, yi). The constraint then becomes:

yi(w·xi + b) ≥ 1 − ξi
Meanwhile, we add a cost for each slack variable. The target function becomes:

(1/2)||w||² + C Σi ξi

where C is the penalty coefficient. When C is large, the punishment for misclassification is increased; when C is small, the punishment for misclassification is reduced. The optimization problem of SVM is then:

min over (w, b, ξ): (1/2)||w||² + C Σi ξi
s.t. yi(w·xi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, 2, ..., N
In this case, the support vectors lie on the margin border or between the border and the hyper-plane, as shown below, and the margin is called a soft margin.
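To see how C trades the margin width against the slack penalties, here is a minimal NumPy sketch of the soft-margin objective written in its equivalent hinge-loss form, where ξi = max(0, 1 − yi(w·xi + b)); the data and parameters are made up for the example.

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # Slack of each sample: how far it falls short of functional margin 1.
    slack = np.maximum(0.0, 1.0 - y * (X.dot(w) + b))
    # (1/2)||w||^2 + C * sum of the slacks
    return 0.5 * np.dot(w, w) + C * slack.sum()

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [2.5, 2.5]])
y = np.array([1, 1, -1, -1])            # the last sample is on the wrong side
w, b = np.array([0.5, 0.5]), -2.0

print(soft_margin_objective(w, b, X, y, C=1.0))     # mild penalty on slack
print(soft_margin_objective(w, b, X, y, C=100.0))   # heavy penalty on slack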
When the dataset is not linearly separable in the original space, we apply the kernel trick to transform the data from the original space into a feature space. The function used by the kernel trick is called the kernel function, defined as:

K(x, z) = φ(x)·φ(z)

where φ(x) is a mapping from the input space to the feature space, namely φ: X → H.
The code for the kernel trick is shown below:
def kernelTransformation(self, data, sample, kernel):
    # Compute the kernel values K(xi, sample) between one sample and
    # every row of data, for the chosen kernel type.
    sample_num, feature_dim = np.shape(data)
    K = np.zeros([sample_num])
    if kernel == "linear":                  # linear kernel: x·z
        K = np.dot(data, sample.T)
    elif kernel == "poly":                  # polynomial kernel: (x·z + c)^n
        K = (np.dot(data, sample.T) + self.c) ** self.n
    elif kernel == "sigmoid":               # sigmoid kernel: tanh(g*x·z + c)
        K = np.tanh(self.g * np.dot(data, sample.T) + self.c)
    elif kernel == "rbf":                   # RBF kernel: exp(-g*||x - z||^2)
        for i in range(sample_num):
            delta = data[i, :] - sample
            K[i] = np.dot(delta, delta.T)
        K = np.exp(-self.g * K)
    else:
        raise NameError('Unrecognized kernel function')
    return K
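For intuition about what kernelTransformation returns, the standalone sketch below reproduces the "rbf" branch for a single sample: it yields the kernel values K(xi, sample) between that sample and every row of data, i.e. one column of the Gram matrix. The function name, data and gamma value are made up for the example.

import numpy as np

def rbf_kernel_column(data, sample, gamma):
    # Squared Euclidean distance from every row of data to `sample`,
    # then K(xi, sample) = exp(-gamma * ||xi - sample||^2).
    sq_dist = np.sum((data - sample) ** 2, axis=1)
    return np.exp(-gamma * sq_dist)

data = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
sample = np.array([1.0, 0.0])
print(rbf_kernel_column(data, sample, gamma=0.5))   # one column of the Gram matrix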
After applying the kernel trick, the (dual) optimization problem of SVM is:

min over α: (1/2) Σi Σj αi αj yi yj K(xi, xj) − Σi αi
s.t. Σi αi yi = 0, 0 ≤ αi ≤ C, i = 1, 2, ..., N

where αi is the Lagrange multiplier of sample (xi, yi).
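The dual objective is easy to evaluate with NumPy once the Gram matrix is available. The sketch below only computes the objective value for a given vector of multipliers (it does not solve the problem); the function name and all inputs are made up for the example.

import numpy as np

def dual_objective(alphas, y, K):
    # (1/2) * sum_ij alpha_i * alpha_j * yi * yj * K(xi, xj) - sum_i alpha_i
    Q = (y[:, None] * y[None, :]) * K
    return 0.5 * alphas.dot(Q).dot(alphas) - alphas.sum()

# Made-up example with a linear kernel.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
K = X.dot(X.T)                        # Gram matrix, K[i, j] = xi·xj
alphas = np.array([0.25, 0.0, 0.25])  # satisfies sum_i alpha_i * yi = 0
print(dual_objective(alphas, y, K))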
Optimization Algorithm
It is feasible to solve the SVM optimization problem with traditional convex quadratic programming algorithms. However, when the training set is large, the algorithms will take a long time. To solve the problem, Platt proposed an efficient algorithm called Sequential Minimal Optimization (SMO).
SMO is a heuristic algorithm. When all the variables satisfy the KKT conditions, the optimization problem is solved. Otherwise, choose two variables, fix all the other variables, and construct a convex quadratic programming sub-problem in the chosen two. One of the two variables is an alpha that violates the KKT conditions; the other is determined by the equality constraint. Suppose α1 and α2 are the chosen variables and α3, α4, ..., αN are fixed. Then α1 is determined by:

α1 = y1(ς − y2 α2), where ς = −Σ(i=3..N) yi αi

If α2 is determined, α1 is determined accordingly. In each iteration of SMO, two variables are updated, until the whole problem is solved. The two-variable sub-problem of SVM is:

min over (α1, α2): W(α1, α2) = (1/2)K11 α1² + (1/2)K22 α2² + y1 y2 K12 α1 α2 − (α1 + α2) + y1 α1 Σ(i=3..N) yi αi Ki1 + y2 α2 Σ(i=3..N) yi αi Ki2
s.t. α1 y1 + α2 y2 = −Σ(i=3..N) yi αi = ς
     0 ≤ αi ≤ C, i = 1, 2
With only two variables, this is a simple quadratic programming problem, as shown below:
Because of the equality constraint, α1 and α2 lie on a diagonal line segment inside the box [0, C] × [0, C]:

when y1 ≠ y2, α1 − α2 = k (a constant);
when y1 = y2, α1 + α2 = k, because y1 y2 = 1 in that case.
The clipped optimized value of α2 follows:

α2_new = H,                  if α2_new,unclipped > H
α2_new = α2_new,unclipped,   if L ≤ α2_new,unclipped ≤ H
α2_new = L,                  if α2_new,unclipped < L

where L and H are the lower bound and upper bound of the diagonal line segment, calculated by:

if y1 ≠ y2: L = max(0, α2_old − α1_old), H = min(C, C + α2_old − α1_old)
if y1 = y2: L = max(0, α2_old + α1_old − C), H = min(C, α2_old + α1_old)
The unclipped optimized value follows:

α2_new,unclipped = α2_old + y2(E1 − E2) / η, where η = K11 + K22 − 2K12

where E1 and E2 are the differences between the prediction value g(x) and the real value, Ei = g(xi) − yi. g(x) is defined as:

g(x) = Σ(j=1..N) αj yj K(xj, x) + b
So far, we have feasible solutions for α1 and α2: the clipped α2_new above, together with

α1_new = α1_old + y1 y2 (α2_old − α2_new)
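Putting the pieces together, the whole two-variable update fits in a few lines. The helper below is a hypothetical standalone version of the update (the class in this article uses self.updateAlpha2 and a cached kernel matrix instead), so the function name and arguments are assumptions.

def smo_pair_update(alpha1, alpha2, y1, y2, E1, E2, K11, K12, K22, C):
    # Bounds L and H of the feasible segment for the new alpha2.
    if y1 != y2:
        L, H = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    else:
        L, H = max(0.0, alpha2 + alpha1 - C), min(C, alpha2 + alpha1)

    eta = K11 + K22 - 2.0 * K12          # eta = K11 + K22 - 2*K12
    if eta <= 0 or L == H:
        return alpha1, alpha2            # skip this pair

    alpha2_new = alpha2 + y2 * (E1 - E2) / eta    # unclipped solution
    alpha2_new = min(max(alpha2_new, L), H)       # clip to [L, H]
    alpha1_new = alpha1 + y1 * y2 * (alpha2 - alpha2_new)
    return alpha1_new, alpha2_new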
There are two loops in SMO, namely the outer loop and the inner loop.
In the outer loop, choose an alpha that violates the KKT conditions, namely an αi that does not satisfy:

αi = 0 ⇔ yi g(xi) ≥ 1
0 < αi < C ⇔ yi g(xi) = 1
αi = C ⇔ yi g(xi) ≤ 1

First, search the samples with 0 < αi < C (the samples on the margin border). If all of them satisfy the conditions, search the whole dataset.
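A hypothetical standalone version of this KKT check could look like the following (the class in this article wraps the same test in self.checkKKT); the function name and the tolerance tol are assumptions.

def violates_KKT(alpha, y, g_x, C, tol=1e-3):
    # r = yi * g(xi) - 1, compared against the three KKT cases:
    #   alpha = 0      ->  yi * g(xi) >= 1
    #   0 < alpha < C  ->  yi * g(xi) == 1
    #   alpha = C      ->  yi * g(xi) <= 1
    r = y * g_x - 1.0
    if alpha < C and r < -tol:     # margin should be >= 1 but is smaller
        return True
    if alpha > 0 and r > tol:      # margin should be <= 1 (or == 1) but is larger
        return True
    return False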
In the inner loop, first choose the α2 that makes |E1 − E2| maximum. If the chosen α2 does not make the objective decrease enough, search the samples on the margin border; if one of them decreases the objective enough, stop searching; otherwise, search the whole dataset. If no feasible α2 can be found even after searching the whole dataset, give up the current α1 and choose a new one.
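A hypothetical standalone version of this selection heuristic is sketched below (the class in this article wraps it in self.selectAplha2); it assumes an error cache E_cache with one entry per sample, so the names are assumptions.

import numpy as np

def select_alpha2(i, E_cache, alphas, C):
    # Among the non-bound samples (0 < alpha < C), pick the j with the
    # largest |Ei - Ej|, which gives the biggest step for alpha2.
    Ei = E_cache[i]
    candidates = np.nonzero((alphas > 0) & (alphas < C))[0]
    best_j, best_gap = -1, -1.0
    for j in candidates:
        if j == i:
            continue
        gap = abs(Ei - E_cache[j])
        if gap > best_gap:
            best_j, best_gap = j, gap
    return best_j   # -1 means the caller should fall back to a wider search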
In each iteration, we update b by:

b1_new = −E1 − y1 K11 (α1_new − α1_old) − y2 K21 (α2_new − α2_old) + b_old
b2_new = −E2 − y1 K12 (α1_new − α1_old) − y2 K22 (α2_new − α2_old) + b_old

If 0 < α1_new < C, take b_new = b1_new; else if 0 < α2_new < C, take b_new = b2_new; otherwise take b_new = (b1_new + b2_new) / 2.

For convenience, we also store the value of Ei, which is calculated by:

Ei = g(xi) − yi = Σ(j) αj yj K(xj, xi) + b − yi
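Written out with NumPy, g(xi) and Ei are a single dot product over the training set. This is a minimal sketch assuming the full Gram matrix K, the label array y, the multipliers alphas and the bias b are available; the function name is made up.

import numpy as np

def prediction_error(i, alphas, y, K, b):
    # g(xi) = sum_j alpha_j * yj * K(xj, xi) + b
    g_i = np.dot(alphas * y, K[:, i]) + b
    # Ei = g(xi) - yi
    return g_i - y[i]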
The search and update code is shown below:
def innerLoop(self, i):
    Ei = self.calculateErrors(i)
    if self.checkKKT(i):                   # optimize a pair only when the KKT check flags sample i
        j, Ej = self.selectAplha2(i, Ei)   # choose alpha2 by the |Ei - Ej| heuristic
        old_alpha1 = self.alphas[i]
        old_alpha2 = self.alphas[j]
        # Bounds L and H of the feasible segment for the new alpha2.
        if self.train_label[i] != self.train_label[j]:
            L = max(0, old_alpha2 - old_alpha1)
            H = min(self.C, self.C + old_alpha2 - old_alpha1)
        else:
            L = max(0, old_alpha2 + old_alpha1 - self.C)
            H = min(self.C, old_alpha2 + old_alpha1)
        if L == H:
            return 0
        K11 = self.K[i, i]
        K12 = self.K[i, j]
        K21 = self.K[j, i]
        K22 = self.K[j, j]
        eta = K11 + K22 - 2 * K12
        if eta <= 0:
            return 0
        # Unclipped solution for alpha2, then clip it to [L, H].
        self.alphas[j] = old_alpha2 + self.train_label[j] * (Ei - Ej) / eta
        self.alphas[j] = self.updateAlpha2(self.alphas[j], L, H)
        new_alpha2 = self.alphas[j]
        self.upadateError(j)
        # Update alpha1 from the equality constraint.
        new_alpha1 = old_alpha1 + self.train_label[i] * self.train_label[j] * (old_alpha2 - new_alpha2)
        self.alphas[i] = new_alpha1
        self.upadateError(i)
        # Update the bias b.
        b1 = (- Ei - self.train_label[i] * K11 * (self.alphas[i] - old_alpha1)
              - self.train_label[j] * K21 * (self.alphas[j] - old_alpha2) + self.b)
        b2 = (- Ej - self.train_label[i] * K12 * (self.alphas[i] - old_alpha1)
              - self.train_label[j] * K22 * (self.alphas[j] - old_alpha2) + self.b)
        if (self.alphas[i] > 0) and (self.alphas[i] < self.C):
            self.b = b1
        elif (self.alphas[j] > 0) and (self.alphas[j] < self.C):
            self.b = b2
        else:
            self.b = (b1 + b2) / 2.0
        return 1
    else:
        return 0
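The outer loop is not shown in the snippet above. A hypothetical sketch of it, matching the heuristic described earlier, is given below; it assumes numpy is imported as np and that the class exposes self.alphas, self.C, self.sample_num and innerLoop, so those attribute names are assumptions.

def outerLoop(self, max_iter):
    iter_num = 0
    entire_set = True            # first pass scans the whole training set
    pairs_changed = 0
    while iter_num < max_iter and (pairs_changed > 0 or entire_set):
        pairs_changed = 0
        if entire_set:
            candidates = range(self.sample_num)      # every sample
        else:
            # only the non-bound samples, i.e. those on the margin border
            candidates = np.nonzero((self.alphas > 0) & (self.alphas < self.C))[0]
        for i in candidates:
            pairs_changed += self.innerLoop(i)
        iter_num += 1
        if entire_set:
            entire_set = False       # next pass: only non-bound samples
        elif pairs_changed == 0:
            entire_set = True        # nothing changed, rescan everything
    return self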
Classification
We can make a prediction using the optimized parameters; the decision function is:

f(x) = sign( Σ(i over support vectors) αi yi K(xi, x) + b )
for i in range(test_num):
    # Kernel values between the support vectors and the i-th test sample.
    kernel_data = self.kernelTransformation(support_vectors, test_data[i, :], self.kernel)
    # Decision value: sum_i alpha_i * yi * K(xi, x) + b
    probability[i] = np.dot(kernel_data.T, np.multiply(support_vectors_label, support_vectors_alphas)) + self.b
    if probability[i] > 0:
        prediction[i] = 1
    else:
        prediction[i] = -1
Conclusion and Analysis
SVM is a more complex algorithm than the previous ones. In this article, we simplified the search process to make it a bit easier to understand. Finally, let's compare our SVM with the SVM in sklearn; the detection performance is displayed below:
The detection performance is a little worse than sklearn's, because the SMO in our SVM is simpler than sklearn's.
The related code and dataset in this article can be found in MachineLearning.