Introduction
Relationships universally exist among variables. These relationships can be divided into two categories, namely, deterministic relations and nondeterministic relations. A deterministic relation can be expressed with a function. A nondeterministic relation is also called correlation, which can be studied with regression analysis.
Generally, the linear regression model is:

f(x) = w^T x

where w is the weight vector (the bias term can be absorbed into w by appending a constant feature to x).
The optimal w can be determined by minimizing the squared-error loss function:

J(w) = Σ_i (y_i - w^T x_i)^2 = (y - Xw)^T (y - Xw)
Regression Model
Linear regression models include standard linear regression, locally weighted linear regression, ridge regression, Lasso regression, and stepwise linear regression.
Linear Regression
The parameters for linear regression can be calculated by the gradient descent method or by the normal equation. Because the gradient descent method has been introduced in Step-by-Step Guide to Implement Machine Learning IV - Logistic Regression, we introduce the normal equation solution in this article (a small gradient descent sketch is also given after the code below for comparison).
First, calculate the derivative of the loss function:

∂J(w)/∂w = 2 X^T (Xw - y)

Then, setting the derivative equal to 0, we obtain:

X^T X w = X^T y

Finally, the optimal w is:

w = (X^T X)^{-1} X^T y
where X is the matrix of training samples and y is the vector of corresponding labels. The code of linear regression is shown below:
def standardLinearRegression(self, x, y):
    # normalize the input features
    if self.norm_type == "Standardization":
        x = preProcess.Standardization(x)
    else:
        x = preProcess.Normalization(x)

    xTx = np.dot(x.T, x)
    if np.linalg.det(xTx) == 0:          # X^T X must be invertible
        print("Error: Singular Matrix !")
        return
    # normal equation: w = (X^T X)^{-1} X^T y
    w = np.dot(np.linalg.inv(xTx), np.dot(x.T, y))
    return w
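For comparison with the gradient descent approach mentioned above, a minimal sketch is given here. It is not part of the class shown in this article; the function name, learning rate, and iteration count are illustrative:

import numpy as np

def gradientDescentLR(x, y, learning_rate=0.01, iterations=1000):
    # x: (sample_num, feature_dim), y: (sample_num, 1)
    sample_num, feature_dim = np.shape(x)
    w = np.zeros([feature_dim, 1])
    for _ in range(iterations):
        # gradient of the squared-error loss, averaged over the samples: X^T (Xw - y)
        gradient = 2 * np.dot(x.T, np.dot(x, w) - y) / sample_num
        w = w - learning_rate * gradient
    return w

Both routes should converge to roughly the same weights on a well-conditioned problem; the normal equation is exact, while gradient descent trades accuracy for scalability when X^T X is large.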
Local Weighted Linear Regression
Linear regression tends to underfit because it uses the unbiased estimate with minimum mean squared error (MMSE). To reduce the bias, locally weighted linear regression assigns a weight to each training sample according to its distance from the point to be predicted, and then performs ordinary regression on the weighted samples. The loss function for locally weighted linear regression is:

J(w) = Σ_i θ_i (y_i - w^T x_i)^2 = (y - Xw)^T W (y - Xw)

where W is a diagonal matrix holding the sample weights θ_i.
Like linear regression, we calculate the derivative of the loss function and set it equal to 0. The optimal w is:

w = (X^T W X)^{-1} X^T W y
The weights in locally weighted linear regression act like the kernel function in SVM; the commonly used Gaussian kernel is:

W(i, i) = exp(-||x_i - x||^2 / (2 k^2))

where x is the point to be predicted and k controls how quickly the weight decays with distance.
The code of local weighted linear regression is shown below:
def LWLinearRegression(self, x, y, sample):
    # normalize the input features
    if self.norm_type == "Standardization":
        x = preProcess.Standardization(x)
    else:
        x = preProcess.Normalization(x)

    sample_num = len(x)
    weights = np.eye(sample_num)
    for i in range(sample_num):
        # Gaussian kernel: weight decays with the distance to the query sample
        diff = sample - x[i, :]
        weights[i, i] = np.exp(np.dot(diff, diff.T) / (-2 * self.k ** 2))
    xTx = np.dot(x.T, np.dot(weights, x))
    if np.linalg.det(xTx) == 0:
        print("Error: Singular Matrix !")
        return
    # w = (X^T W X)^{-1} X^T W y, then predict for the query sample
    result = np.dot(np.linalg.inv(xTx), np.dot(x.T, np.dot(weights, y)))
    return np.dot(sample.T, result)
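Since locally weighted regression solves a separate weighted problem for every query point, prediction over a test set loops over the samples. A usage sketch (the object name lwlr and the dataset variables are illustrative; it assumes the bandwidth self.k and norm_type are already set):

import numpy as np

# x_train, y_train, x_test are assumed to be numpy arrays
predictions = []
for i in range(len(x_test)):
    # each test sample gets its own locally weighted fit
    predictions.append(lwlr.LWLinearRegression(x_train, y_train, x_test[i, :]))
predictions = np.array(predictions)

Note that this makes prediction much more expensive than standard linear regression, since a full linear system is solved per query point.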
Ridge Regression
If the feature dimension is larger than the number of samples, the matrix X^T X is not full rank and its inverse does not exist. To solve the problem, ridge regression adds λI to X^T X to make the matrix nonsingular. This is equivalent to adding L2 regularization to the loss function, namely:

J(w) = Σ_i (y_i - w^T x_i)^2 + λ ||w||_2^2
Like linear regression, we calculate the derivative of the loss function and set it equal to 0. The optimal w is:

w = (X^T X + λI)^{-1} X^T y
The code of ridge regression is shown below:
def ridgeRegression(self, x, y):
    # normalize the input features
    if self.norm_type == "Standardization":
        x = preProcess.Standardization(x)
    else:
        x = preProcess.Normalization(x)

    feature_dim = len(x[0])
    xTx = np.dot(x.T, x)
    # add lambda * I to make the matrix nonsingular
    matrix = xTx + np.eye(feature_dim) * self.lamda
    if np.linalg.det(matrix) == 0:
        print("Error: Singular Matrix !")
        return
    # w = (X^T X + lambda * I)^{-1} X^T y
    w = np.dot(np.linalg.inv(matrix), np.dot(x.T, y))
    return w
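The regularization strength λ (self.lamda in the code) controls how strongly the weights are shrunk and is usually chosen by trying several values on held-out data. A sketch of such a search (the model object and the train/validation variables are illustrative, not part of the code above):

import numpy as np

best_lamda, best_error = None, np.inf
for lamda in [0.01, 0.1, 1.0, 10.0]:
    model.lamda = lamda                                   # set the regularization strength
    w = model.ridgeRegression(x_train, y_train)
    error = ((y_val - np.dot(x_val, w).flatten()) ** 2).sum()   # validation squared error
    if error < best_error:
        best_error, best_lamda = error, lamda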
Lasso Regression
Like ridge regression, Lasso regression adds regularization to the loss function, but it uses the L1 norm, namely:

J(w) = Σ_i (y_i - w^T x_i)^2 + λ ||w||_1
Because the L1 regularization contains an absolute value, the loss function is not differentiable everywhere. Thus, we apply the coordinate descent (CD) method. At each iteration, CD minimizes the loss along a single coordinate while keeping the other coordinates fixed, namely:

w_j^(k) = argmin_{w_j} J(w_1^(k), ..., w_{j-1}^(k), w_j, w_{j+1}^(k-1), ..., w_n^(k-1))
For the Lasso loss, each coordinate update has a closed-form solution, the soft-thresholding operator:

w_j = (ρ_j + λ/2) / z_j    if ρ_j < -λ/2
w_j = 0                    if -λ/2 ≤ ρ_j ≤ λ/2
w_j = (ρ_j - λ/2) / z_j    if ρ_j > λ/2

where:

ρ_j = Σ_i x_ij (y_i - Σ_{k≠j} x_ik w_k),    z_j = Σ_i x_ij^2
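The code below relies on a softThreshold helper that is not shown in this article. A minimal sketch of what it could look like, assuming the division by z_j is folded into the choice of λ because the features are normalized:

def softThreshold(self, rho):
    # soft-thresholding operator used by the coordinate descent update
    if rho < -self.lamda / 2:
        return rho + self.lamda / 2
    elif rho > self.lamda / 2:
        return rho - self.lamda / 2
    else:
        return 0.0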
The code of Lasso regression is shown below:
def lassoRegression(self, x, y):
    # normalize the input features
    if self.norm_type == "Standardization":
        x = preProcess.Standardization(x)
    else:
        x = preProcess.Normalization(x)

    y = np.expand_dims(y, axis=1)
    sample_num, feature_dim = np.shape(x)
    w = np.ones([feature_dim, 1])
    for i in range(self.iterations):
        for j in range(feature_dim):
            # prediction without the contribution of feature j
            h = np.dot(x[:, 0:j], w[0:j]) + np.dot(x[:, j+1:], w[j+1:])
            # rho_j = x_j^T (y - h)
            w[j] = np.dot(x[:, j], (y - h))
            if j == 0:
                w[j] = 0                       # the first coordinate is left at 0
            else:
                w[j] = self.softThreshold(w[j])
    return w
Stepwise Linear Regression
Stepwise linear regression is similar to Lasso, but instead of coordinate descent it applies a greedy algorithm at each iteration: every weight is increased or decreased by a small step, and the change that most reduces the squared error is kept. The code of stepwise linear regression is shown below:
def forwardstepRegression(self, x, y):
    # normalize the input features
    if self.norm_type == "Standardization":
        x = preProcess.Standardization(x)
    else:
        x = preProcess.Normalization(x)

    y = np.expand_dims(y, axis=1)                  # column vector so y - y_hat broadcasts correctly
    sample_num, feature_dim = np.shape(x)
    w = np.zeros([self.iterations, feature_dim])   # weight history, one row per iteration
    best_w = np.zeros([feature_dim, 1])
    for i in range(self.iterations):
        min_error = np.inf
        best_w_iter = best_w.copy()
        for j in range(feature_dim):
            for sign in [-1, 1]:
                temp_w = best_w.copy()             # copy, otherwise best_w would be modified in place
                temp_w[j] += sign * self.learning_rate
                y_hat = np.dot(x, temp_w)
                error = ((y - y_hat) ** 2).sum()
                if error < min_error:              # keep the step with the smallest squared error
                    min_error = error
                    best_w_iter = temp_w
        best_w = best_w_iter
        w[i, :] = best_w.T                         # record the weights of this iteration
    return best_w
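A usage sketch (the model object and dataset variables are illustrative): the step size and the number of iterations together control how far the weights can move.

model.learning_rate = 0.01
model.iterations = 500
w = model.forwardstepRegression(x_train, y_train)
y_hat = np.dot(x_test, w)    # predictions with the final weights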
Conclusion and Analysis
There are many ways to obtain the optimal parameters for linear regression; in this article, we only introduce some basic algorithms. Finally, let's compare our linear regression with the linear regression in Sklearn; the prediction performance is displayed below:
Sklearn linear regression performance:
Our linear regression performance:
The performances look similar.
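The comparison itself could be set up roughly as follows. This is only a sketch: our_model stands for an instance of the regression class shown above, the dataset variables are placeholders, and the preprocessing applied inside our methods would need to match what is applied to the test data:

import numpy as np
from sklearn.linear_model import LinearRegression

# sklearn baseline
sk_model = LinearRegression()
sk_model.fit(x_train, y_train)
sk_error = ((y_test - sk_model.predict(x_test)) ** 2).mean()

# our normal-equation solution
w = our_model.standardLinearRegression(x_train, y_train)
our_error = ((y_test - np.dot(x_test, w).flatten()) ** 2).mean()

print("sklearn MSE:", sk_error, " ours:", our_error)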
The related code and dataset in this article can be found in MachineLearning.
History
- 28th May, 2019: Initial version