Introduction
In this article we will look at the training and testing of a multi-class logistic classifier.
- Logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix \(W\) and a bias vector \(b\). Classification is done by projecting data points onto a set of hyperplanes, the distance to which is used to determine a class membership probability.
- Mathematically this can be expressed as
\begin{eqnarray*} & P(Y=i|x, W,b) =\frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \\ \end{eqnarray*}
- Corresponding to each class \(y_i\) the logistic classifier is parameterized by a set of parameters \(W_i, b_i\).
- These parameters are used to compute the class probability.
- Given an unknown vector \(x\), the prediction is performed as
\begin{eqnarray*} & y_{pred} = argmax_i P(Y=i|x,W,b) \\ & y_{pred} = argmax_i \frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \end{eqnarray*}
- Given a set of labelled training data \(\{X_i,Y_i\}\) where \(i \in \{1,\ldots,N\}\), we need to estimate these parameters.
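- A minimal numerical sketch of the two formulas above (not part of the OpenVision code; the weights, bias and input below are made-up values purely for illustration):

import numpy as np

def predict(x, W, b):
    """ return class probabilities P(Y=i|x,W,b) and the argmax prediction """
    scores = W.dot(x) + b                      # W_i x + b_i for every class i
    scores = scores - scores.max()             # stabilize the exponentials
    p = np.exp(scores) / np.exp(scores).sum()  # softmax probabilities
    return p, p.argmax()

W = np.array([[0.5, -0.2], [-0.3, 0.4]])       # one row of weights per class
b = np.array([0.1, -0.1])
p, y_pred = predict(np.array([1.0, 2.0]), W, b)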
Loss Function
- Ideally we would like to compute the parameters so that the \(0-1\) loss is minimized
\begin{eqnarray*} & \ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}} \\ & f(x)= argmax_k P(Y=y_k |x,\theta) \end{eqnarray*}
- \(P(Y=y_k |x,\theta)\) is modelled using the logistic function defined above.
- The \(0-1\) loss function is not differentiable, hence optimizing it for large models is computationally infeasible.
- Instead we maximize the log-likelihood of the classifier given the training data \(\mathcal{D}\).
- Maximum likelihood estimation is used to perform this operation: we estimate the parameters so that the likelihood of the training data \(\mathcal{D}\) is maximized under the model.
- It is assumed that the data samples are independent, so the probability of the set is the product of probabilities of individual examples.
\begin{eqnarray*} & \mathcal{L}(\theta=\{W,b\},\mathcal{D}) = \prod_{i=1}^N P(Y=y_i | X=x_i,W,b) \\ & \log \mathcal{L}(\theta,\mathcal{D}) = \sum_{i=1}^N \log P(Y=y_i | X=x_i,W,b) \\ & \theta^{*} = argmax_{\theta} \log \mathcal{L}(\theta,\mathcal{D}) = argmin_{\theta} \left(- \sum_{i=1}^N \log P(Y=y_i | X=x_i,W,b)\right) \\ \end{eqnarray*}
- It should be noted that the likelihood of the correct class is not the same as the number of right predictions.
- The log-likelihood function can be considered a differentiable surrogate for the \(0-1\) loss function.
- In the present application the negative log-likelihood is used as the loss function.
- Optimal parameters are learned by minimizing the loss function.
- In the present application gradient-based methods are used for minimization.
- Specifically, stochastic gradient descent and conjugate gradient descent are used to minimize the loss function.
- The cost function is expressed as
\begin{eqnarray*} & L(\theta,\mathcal{D}) = - \sum_{i=1}^N \log P(Y=y_i | X=x_i,W,b) \\ & L(\theta,\mathcal{D}) = - \sum_{i=1}^N \log \frac {e^{W_{y_i} x_i + b_{y_i}}}{\sum_j e^{W_j x_i + b_j}} \\ & L(\theta,\mathcal{D}) = - \sum_{i=1}^N \left[ \left(W_{y_i} x_i + b_{y_i}\right) - \log {\sum_j e^{W_j x_i + b_j}} \right] \\ \end{eqnarray*}
- For each sample the first term is affine in the parameters and the second is a log of a sum of exponentials, which is convex; thus the loss function is convex.
- Thus we can compute the parameters corresponding to the global minimum of the loss function using gradient descent methods.
- To do so we compute the derivatives of the loss function \(L(\theta,\mathcal{D})\) with respect to the parameters \(\theta\), namely \(\partial{\ell}/\partial{W}\) and \(\partial{\ell}/\partial{b}\).
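- A small numerical sketch of this cost, evaluated with numpy on a made-up two-class training set (the parameters and data below are illustrative only, not from the MNIST example):

import numpy as np

W = np.array([[0.5, -0.2], [-0.3, 0.4]])   # one row of weights per class
b = np.array([0.1, -0.1])
X = np.array([[1.0, 2.0], [0.5, -1.0]])    # training vectors x_i
Y = np.array([0, 1])                       # their correct labels y_i

loss = 0.0
for x_i, y_i in zip(X, Y):
    scores = W.dot(x_i) + b                             # W_j x_i + b_j for each class j
    log_probs = scores - np.log(np.exp(scores).sum())   # log of the softmax
    loss -= log_probs[y_i]                              # -log P(Y=y_i|x_i,W,b)
print(loss)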
Theano
- Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
- It is an expression compiler that can evaluate symbolic expressions when executed. Programs that would typically be implemented in C/C++ can be written concisely and efficiently in Theano.
- Computing the gradients in most programming languages (C/C++, Matlab, Python) involves manually deriving the expressions for the gradient of the loss with respect to the parameters, \(\partial{\ell}/\partial{W}\) and \(\partial{\ell}/\partial{b}\).
- This approach not only involves manual coding, but the derivatives can get difficult to compute for complex models.
- With Theano this work is greatly simplified, as it performs automatic differentiation, as sketched below.
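- As a rough sketch of how this looks in Theano (the variable names and shapes below are illustrative and are not the ones used in the OpenVision classes), the negative log-likelihood is written as a symbolic expression, T.grad derives its gradients automatically, and a compiled function performs one stochastic gradient descent update:

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')    # mini-batch of input vectors, one row per sample
y = T.ivector('y')   # correct class label of each sample
W = theano.shared(np.zeros((28*28, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)                 # P(Y=i|x,W,b) per row
cost = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])   # mean negative log-likelihood

g_W, g_b = T.grad(cost, [W, b])    # automatic differentiation of the cost

sgd_step = theano.function(
    inputs=[x, y], outputs=cost,
    updates=[(W, W - 0.13 * g_W), (b, b - 0.13 * g_b)])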
Example
Theano Code
- The Python code for training and testing can be found in the ImgML/LogisticRegression.py file of the git repository https://github.com/pi19404/OpenVision.
- The ImgML/load_datasets.py file contains methods to load datasets from pickle files or SVM-format files.
""" symbolic expressions defining input and output vectors"""
x=T.matrix('x');
y=T.ivector('y');
""" The mnist dataset in pickel format"""
model_name1="/media/LENOVO_/repo/mnist.pkl.gz"
""" creating object of class Logistic regression"""
""" input is 28*28 dimension feature vector ,and
output lables are digits from 0-9 """
classifier = LogisticRegression(x,y,28*28,10);
""" loading the datasets"""
[train,test,validate]=load_datasets.load_pickle_data(model_name1);
""" setting the dataset"""
classifier.set_datasets(train,test,validate);
""" Training the classifiers"""
classifier.train_classifier(0.13,1000,30);
""" Saving the model """
classifier.save('1');
#x=classifier.train[0].get_value(borrow=True)[0];
#classifier.predict(x);
""" Loading the model"""
classifier.load('1')
x=train[0].get_value(borrow=True);
y=train[1].eval();
print 'True class:'+`y`
xx,yy=classifier.predict(x);
print 'Predicted class:' + `yy`
classifier.testing();
- C/C++ code has also been written using Eigen/OpenCV and incorporated into the OpenVision library. It can be found in the ImgML/LogisticRegression.cpp and ImgML/LogisticRegression.hpp files of the repository https://github.com/pi19404/OpenVision.