Introduction
Naive Bayes is a classification method based on Bayesian decision theory and the assumption of conditional independence among features. It estimates the probability distribution from the training set under this independence assumption and uses it as the detection model. For a given test sample, the prediction is the label with the maximum posterior probability, and maximizing the posterior probability is equivalent to minimizing the expected risk. Then another question is: why call it "Naive" Bayes? Because Naive Bayes relies on a naive hypothesis: all the features used for classification are conditionally independent once the label is fixed, which is given by:

P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, \ldots, X^{(n)} = x^{(n)} \mid Y = c_k) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)
where x^{(j)} is the j-th feature and c_k is the k-th label. The Bayes classifier can then be defined as:

y = f(x) = \arg\max_{c_k} \frac{P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_{k} P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)}

Since the denominator is the same for every c_k, this simplifies to:

y = \arg\max_{c_k} P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)
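To make the decision rule concrete, here is a minimal sketch with hand-picked toy priors and likelihoods (hypothetical numbers, not taken from any dataset in this article):

# Toy illustration of the Naive Bayes decision rule (hypothetical numbers).
priors = {"spam": 0.4, "ham": 0.6}          # P(Y = c_k)
likelihoods = {
    "spam": [0.7, 0.2],                     # P(x1 | spam), P(x2 | spam)
    "ham":  [0.1, 0.5],                     # P(x1 | ham),  P(x2 | ham)
}

scores = {}
for label, prior in priors.items():
    score = prior
    for p in likelihoods[label]:            # product over conditionally independent features
        score *= p
    scores[label] = score

prediction = max(scores, key=scores.get)    # arg max over labels
# spam: 0.4 * 0.7 * 0.2 = 0.056, ham: 0.6 * 0.1 * 0.5 = 0.03 -> predicts "spam"
print(scores, prediction)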
So why does maximizing the posterior probability mean minimizing the expected risk? Let the loss be the 0-1 loss function:

L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}

where f(X) is the decision function. Then the expected risk is

R_{\exp}(f) = E[L(Y, f(X))]

which is taken with respect to the joint distribution P(X, Y). Rewriting it as a conditional expectation gives:

R_{\exp}(f) = E_X \sum_{k=1}^{K} L(c_k, f(X)) P(c_k \mid X)

To minimize the expected risk, it suffices to minimize it pointwise for each X = x, namely:

f(x) = \arg\min_{y} \sum_{k=1}^{K} L(c_k, y) P(c_k \mid X = x)
     = \arg\min_{y} \sum_{k=1}^{K} P(y \neq c_k \mid X = x)
     = \arg\min_{y} \left( 1 - P(y = c_k \mid X = x) \right)
     = \arg\max_{y} P(y = c_k \mid X = x)

so minimizing the expected risk is exactly maximizing the posterior probability.
Naive Bayes Model
The Naive Bayes model consists of two parts: parameter estimation and classification.
Parameter Estimation
In the training process, learning means estimating the prior probability and the conditional probability. Maximum likelihood estimation (MLE) is a general method for obtaining these parameters. The MLE of the prior probability is given by:

P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \ldots, K
Denote the set of possible values of the j-th feature as \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}, where S_j is the number of distinct values of that feature. Then, the MLE of the conditional probability is given by:

P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl}, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}
In the Naive Bayes training process, the prior probability and the conditional probability are calculated. However, if a feature value never occurs in the training set, its estimated probability is zero, which distorts the posterior probability. To solve this problem, we introduce Laplace smoothing: add a positive constant \lambda to the frequency of each random variable.
Then, the Bayesian estimation of the prior probability is:

P_\lambda(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K\lambda}
where N is the number of samples and K is the number of unique labels. The code for the prior probability is shown below:
prior_probability = {}
for key in range(len(label_value)):
    # label_value[key] is a (label, count) tuple; apply Laplace smoothing
    prior_probability[label_value[key][0]] = (
        (label_value[key][1] + self.laplace) / (N + K * self.laplace))
self.prior_probability = prior_probability
where label_value is a list of (label, count) tuples.
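As a standalone sanity check (independent of the class above), a minimal sketch of the smoothed prior computed from a hypothetical toy label array could look like this:

import numpy as np

# Hypothetical toy labels; laplace = 1 gives classic Laplace smoothing.
train_label = np.array([0, 0, 0, 1, 1, 2])
laplace = 1

labels, counts = np.unique(train_label, return_counts=True)
N, K = len(train_label), len(labels)      # N samples, K unique labels
prior_probability = {int(c): (int(n) + laplace) / (N + K * laplace)
                     for c, n in zip(labels, counts)}
print(prior_probability)                  # {0: 0.444..., 1: 0.333..., 2: 0.222...}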
Similarly, the Bayesian estimation of the conditional probability is:

P_\lambda(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl}, y_i = c_k) + \lambda}{\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda}
The code for the conditional probability is shown below. A matrix is used to store the conditional probabilities of each feature, and S[j] is the number of unique values of the j-th feature.
prob = []
for j in range(feature_dim):
    # count[k][t]: number of samples whose j-th feature takes its k-th value
    # and whose label is the t-th label
    count = np.zeros([S[j], len(label_count)])
    feature_temp = train_data[:, j]
    feature_value_temp = feature_value[j]
    for i in range(len(feature_temp)):
        for k in range(len(feature_value_temp)):
            for t in range(len(label_count)):
                if (feature_temp[i] == feature_value_temp[k]
                        and train_label[i] == label_value[t][0]):
                    count[k][t] += 1
    # turn counts into Laplace-smoothed conditional probabilities
    for m in range(len(label_value)):
        count[:, m] = ((count[:, m] + self.laplace) /
                       (label_value[m][1] + self.laplace * S[j]))
    prob.append(count)
self.conditional_probability = prob
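The triple nested loop above scans every sample for every (feature value, label) pair, which is slow. A possible vectorized variant, sketched under the same assumptions about train_data, train_label, feature_value, label_value, S and feature_dim as the snippet above (and using len(label_value) in place of len(label_count), assuming they are equal), counts each pair with boolean masks instead:

prob = []
for j in range(feature_dim):
    count = np.zeros([S[j], len(label_value)])
    feature_col = train_data[:, j]
    for k, value in enumerate(feature_value[j]):
        for t, (label, _) in enumerate(label_value):
            # boolean masks replace the per-sample loop
            count[k, t] = np.sum((feature_col == value) & (train_label == label))
    for m, (_, label_num) in enumerate(label_value):
        count[:, m] = (count[:, m] + self.laplace) / (label_num + self.laplace * S[j])
    prob.append(count)
self.conditional_probability = prob

This removes the innermost loop over samples; the remaining loops only run over feature values and labels.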
Classification
After calculating the prior probability and the conditional probability, the Bayesian classification model is:

y = \arg\max_{c_k} P_\lambda(Y = c_k) \prod_{j=1}^{n} P_\lambda(X^{(j)} = x^{(j)} \mid Y = c_k)
The classification code is shown below. predict is a dictionary holding the (unnormalized) posterior probability of each label. We then just need to sort predict, and the prediction is the first element of the sorted dictionary.
def classify(self, sample):
    predict = {}
    for m in range(len(self.label_value)):
        # start from the smoothed prior of the m-th label
        temp = self.prior_probability[self.label_value[m][0]]
        for n in range(len(sample)):
            if sample[n] in self.feature_value[n]:
                index = np.where(self.feature_value[n] == sample[n])[0][0]
                temp = temp * self.conditional_probability[n][index][m]
            else:
                # unseen feature value: multiply by the smoothed fallback
                # factor laplace / (S[n] * laplace) = 1 / S[n]
                temp = temp * self.laplace / (self.S[n] * self.laplace)
        predict[self.label_value[m][0]] = temp
    return predict
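For completeness, here is a small usage sketch (model and sample are hypothetical names for a trained instance of the class above and one test sample) showing how the returned dictionary is turned into a prediction:

predict = model.classify(sample)

# sort labels by probability in descending order; the first entry is the prediction
prediction = sorted(predict.items(), key=lambda item: item[1], reverse=True)[0][0]
# equivalently: prediction = max(predict, key=predict.get)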
Conclusion and Analysis
The Bayesian model in this article is a Bernoulli Bayesian model. Besides that, there are other Bayesian models such as the Gaussian Bayesian model and the multinomial Bayesian model. Finally, let's compare our Bayesian model with the Bayes model in Sklearn; the detection performance is displayed below:
It is found that both methods achieve poor detection results. Moreover, our Bayesian model takes longer to run, which may be because the conditional probability computation contains too many nested loops.
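For reference, a comparison against scikit-learn might look like the following minimal sketch, assuming the data has already been encoded as arrays X_train, y_train, X_test, y_test (hypothetical names, not from the article's code):

from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# BernoulliNB expects binary (or binarized) features;
# alpha plays the same role as `laplace` in our implementation.
clf = BernoulliNB(alpha=1.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("sklearn BernoulliNB accuracy:", accuracy_score(y_test, y_pred))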
The related code and dataset for this article can be found in MachineLearning.