Multi-Linear Regression in Java

13 Mar 2017 | CPOL | 5 min read
Multi-linear regression/classification with simple examples and Java code
This article introduces multi-linear regression/classification with simple examples and provides the code in Java.

Introduction

In this article, I introduce a very popular subject in statistical modelling: multi-linear (or multi-variate) regression (MLR) and classification. I will show how MLR is used through simple examples. MLR has been used extensively in science (biological, pharmaceutical, financial, medical and more).

Background

A few months ago, I wrote an article about matrix operations in Java. I suggest reading that article first since the code in this article is heavily dependent on the matrix operations.

To understand multi-linear regression (MLR), have a look at the following table:

Diet score   Male   Age>20   BMI
4            0      1        27
7            1      1        29
6            1      0        23
2            0      0        20
3            0      1        21

The body mass index (BMI) of five people has been measured. For each person, the diet score, whether they are male or female, and whether they are older than 20 have also been recorded in three columns. Do not ask me what a diet score is or how it is measured, because I do not know; this is just a toy example. The question is: what is the relationship between BMI and diet score, gender and age? If we have the diet score, gender and age of a new person, can we predict his/her body mass index? MLR is here to answer these questions. We expect the relationship between BMI and the three variables to be something like this:

BMI = β0 + β1 × (diet score) + β2 × (male) + β3 × (age>20)

Based on this equation, in order to predict the BMI of a person with a known diet score, gender and age, you need to know the values of all the beta coefficients. MLR finds the values of these unknown coefficients. We call β0 the bias. In most real-life applications, a large bias suggests that the predictors (i.e., the three variables) do not have enough predictive power, while a small bias is a good sign for the predictive model. A large bias could also mean that there are other descriptors, not yet discovered, that can explain the observations.

Let's represent the BMI column in the above table as a column matrix named Y, the values of all independent variables as a 5 x 3 matrix named X (five observations, three variables), and the beta values to be found as a column matrix b. The unknown matrix b can be found as:

b = (X'X)^{-1} X'Y

where X' is the transpose of matrix X and the superscript -1 denotes the matrix inverse.

If you want to have bias, you need to add a new column to matrix X. This new column should be the first one and its value for all rows must be 1.
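As a minimal sketch of this step, the helper below prepends a column of ones to a plain `double[][]` array. The class and method names (`BiasColumn`, `withBiasColumn`) are made up for illustration and are not part of the article's Matrix classes.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the article's Matrix library.
final class BiasColumn {
    /** Returns a copy of x with a leading column of ones (the bias column). */
    static double[][] withBiasColumn(double[][] x) {
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            out[i] = new double[x[i].length + 1];
            out[i][0] = 1.0; // bias entry, identical for every row
            System.arraycopy(x[i], 0, out[i], 1, x[i].length);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] x = {{4, 0, 1}, {7, 1, 1}, {6, 1, 0}, {2, 0, 0}, {3, 0, 1}};
        for (double[] row : withBiasColumn(x)) {
            System.out.println(Arrays.toString(row));
        }
        // first line printed: [1.0, 4.0, 0.0, 1.0]
    }
}
```

With the bias column in place, the first entry of b plays the role of beta0.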

Limitation of MLR: MLR works only when the number of columns in the X matrix is less than or equal to the number of rows. In other words, the number of descriptors cannot be more than the number of observations. Another limitation concerns the inverse operation in the above equation. Not all matrices have an inverse, and when X'X cannot be inverted, the calculation of the b matrix fails and so does MLR. There are other methods, such as Partial Least Squares or Support Vector Machines, that can still work when MLR fails.
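To make both limitations concrete, here is a rough sketch that builds X'X from a plain array and reports whether elimination hits a (near) zero pivot, which means X'X has no inverse. The class name, the tolerance `EPS` and the "female" column in the test data are assumptions for illustration only; the article's own code relies on its Matrix classes instead.

```java
// Hypothetical check, written with plain arrays rather than the article's Matrix classes.
final class SingularCheck {
    static final double EPS = 1e-9; // assumed tolerance for "zero" pivots

    /** Computes X'X for a row-major matrix x. */
    static double[][] gram(double[][] x) {
        int n = x.length, p = x[0].length;
        double[][] g = new double[p][p];
        for (int i = 0; i < p; i++)
            for (int j = 0; j < p; j++)
                for (int k = 0; k < n; k++)
                    g[i][j] += x[k][i] * x[k][j];
        return g;
    }

    /** Returns true when X'X is singular (to within EPS), i.e. MLR would fail. */
    static boolean isSingular(double[][] x) {
        double[][] a = gram(x);
        int p = a.length;
        for (int col = 0; col < p; col++) {
            // Partial pivoting: pick the largest remaining entry in this column.
            int pivot = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            if (Math.abs(a[pivot][col]) < EPS) return true; // no usable pivot
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = col + 1; r < p; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < p; c++) a[r][c] -= f * a[col][c];
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Columns: bias, male, female. bias = male + female, so the columns are dependent.
        double[][] collinear = {{1, 1, 0}, {1, 0, 1}, {1, 1, 0}, {1, 0, 1}};
        // Two observations but three columns: more descriptors than observations.
        double[][] tooWide = {{1, 4, 0}, {1, 7, 1}};
        System.out.println(isSingular(collinear)); // true
        System.out.println(isSingular(tooWide));   // true
    }
}
```

Both cases fail for the same reason: the columns of X are linearly dependent, so X'X cannot be inverted.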

Using the Code

On top of the matrix operation methods described in the earlier article, we only need to implement a single method to create the model and find the values of the b matrix.

Java
public Matrix calculate() throws NoSquareException {
	if (bias)
		this.X = X.insertColumnWithValue1();
	checkDiemnsion();
	Matrix Xtr = MatrixMathematics.transpose(X); //X'
	Matrix XXtr = MatrixMathematics.multiply(Xtr,X); //X'X
	Matrix inverse_of_XXtr = MatrixMathematics.inverse(XXtr); //(X'X)^-1
	if (inverse_of_XXtr == null) {
		System.out.println("Matrix X'X does not have any inverse. "
				+ "So MLR failed to create the model for these data.");
		return null;
	}
	Matrix XtrY = MatrixMathematics.multiply(Xtr,Y); //X'Y
	return MatrixMathematics.multiply(inverse_of_XXtr,XtrY); //(X'X)^-1 X'Y
}

The code above takes the following steps to get the b matrix:

  1. If you want to have bias (i.e., beta0), add a new column to the X matrix
  2. Check that the input matrices are valid
  3. Find the transpose of X (i.e., X')
  4. Multiply X' by X
  5. Find the inverse of the matrix from step 4; i.e., (X'X)^{-1}
  6. Multiply X' by Y
  7. Finally, multiply the matrix from step 5 by the matrix from step 6
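The same steps can be sketched without the matrix library, using plain arrays. Note that this is a swapped-in technique, not the article's implementation: instead of forming the inverse of X'X explicitly, the sketch solves the equivalent system (X'X) b = X'Y by Gauss-Jordan elimination, which gives the same b whenever the inverse exists. The class name `NormalEquations`, the method `fit` and the pivot tolerance are assumptions, and X is expected to already contain the bias column.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; the article's code uses Matrix and MatrixMathematics.
final class NormalEquations {
    /** Solves (X'X) b = X'Y; returns null when X'X is singular. */
    static double[] fit(double[][] x, double[] y) {
        int n = x.length, p = x[0].length;
        // Build the augmented matrix [X'X | X'Y].
        double[][] a = new double[p][p + 1];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++)
                for (int k = 0; k < n; k++) a[i][j] += x[k][i] * x[k][j];
            for (int k = 0; k < n; k++) a[i][p] += x[k][i] * y[k];
        }
        // Gauss-Jordan elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            if (Math.abs(a[pivot][col]) < 1e-12) return null; // assumed tolerance
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = 0; r < p; r++) {
                if (r == col) continue;
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) a[r][c] -= f * a[col][c];
            }
        }
        // The left block is now diagonal; read off the coefficients.
        double[] b = new double[p];
        for (int i = 0; i < p; i++) b[i] = a[i][p] / a[i][i];
        return b;
    }

    public static void main(String[] args) {
        // Toy data from the table, with a leading column of ones for the bias.
        double[][] x = {{1, 4, 0, 1}, {1, 7, 1, 1}, {1, 6, 1, 0}, {1, 2, 0, 0}, {1, 3, 0, 1}};
        double[] y = {27, 29, 23, 20, 21};
        System.out.println(Arrays.toString(fit(x, y)));
    }
}
```

For the toy data, b comes out at approximately (9.25, 4.75, -13.5, -1.25), in the order bias, diet score, male, age>20.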

Now let's test the method on the above example:

Java
Matrix X = new Matrix(new double[][]{{4,0,1},{7,1,1},{6,1,0},{2,0,0},{3,0,1}});
Matrix Y = new Matrix(new double[][]{{27},{29},{23},{20},{21}});
MultiLinear ml = new MultiLinear(X, Y);
Matrix beta = ml.calculate();

When we use the constructor with two arguments, bias defaults to true. Here are the results:

BMI = 9.25 + 4.75 × (diet score) - 13.5 × (male) - 1.25 × (age>20)

This is a model to predict the BMI from the values of all independent variables (i.e., diet score, gender and age). The magnitude and sign of each beta value indicate the importance and direction of the corresponding variable. In this illustrative example, diet score and gender contribute more to BMI than age does, and the effects of gender and diet score are opposite; i.e., people with a higher diet score have a higher BMI, and males have a significantly lower BMI than females. It is interesting to see the insight that MLR gives into the BMI observations.

One final question: Is this a good model? The minimum that we can do is to use the model (the above equation) to predict the BMI values and then compare them with the observed values:

Observed BMI   Predicted BMI
27             27
29             27.75
23             24.25
20             18.75
21             22.25
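The predicted column can be reproduced with a few lines of code. The sketch below assumes the coefficients 9.25, 4.75, -13.5 and -1.25 (bias, diet score, male, age>20), which are the least-squares values consistent with the table above; the class and method names are hypothetical.

```java
// Hypothetical helper for applying a fitted linear model to one observation.
final class Predictor {
    /** Returns beta[0] + sum of beta[i + 1] * features[i]; beta[0] is the bias. */
    static double predict(double[] beta, double[] features) {
        double y = beta[0];
        for (int i = 0; i < features.length; i++) {
            y += beta[i + 1] * features[i];
        }
        return y;
    }

    public static void main(String[] args) {
        double[] beta = {9.25, 4.75, -13.5, -1.25}; // assumed coefficients, see lead-in
        double[][] x = {{4, 0, 1}, {7, 1, 1}, {6, 1, 0}, {2, 0, 0}, {3, 0, 1}};
        for (double[] row : x) {
            System.out.println(predict(beta, row));
        }
        // prints 27.0, 27.75, 24.25, 18.75, 22.25 (one per line)
    }
}
```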

As you can see, the predicted values are not that far from the observed ones. You can find the error (i.e., predicted - observed) for each case and calculate the mean squared error (MSE), which indicates how accurate the model is. The lower the MSE, the better the model. There are plenty of more sophisticated statistical tests for examining the suitability of a model, which I will not cover in this article. You can find a couple more tests in the code. One of the test examples is a classification analysis using MLR.
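As a small illustration of the MSE calculation, the sketch below averages the squared errors over the five rows of the table. The class and method names are made up for this example.

```java
// Hypothetical helper for the mean squared error between two equally long arrays.
final class MeanSquaredError {
    static double mse(double[] observed, double[] predicted) {
        if (observed.length == 0 || observed.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double error = predicted[i] - observed[i]; // predicted - observed
            sum += error * error;
        }
        return sum / observed.length;
    }

    public static void main(String[] args) {
        double[] observed = {27, 29, 23, 20, 21};
        double[] predicted = {27, 27.75, 24.25, 18.75, 22.25};
        System.out.println(mse(observed, predicted)); // prints 1.25
    }
}
```

Four of the five errors are ±1.25 and one is 0, so the MSE is 4 × 1.5625 / 5 = 1.25.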

Points of Interest

With a few lines of code, I tried to illustrate one of the most important statistical modelling algorithms (MLR). I have not tested the code on large matrices, and because some of the matrix operations are recursive, you may need to increase the thread's stack size (i.e., the -Xss flag). Please let me know if you have some interesting data on which we can test the code.

History

  • 23rd March, 2013: First version (v1.0.1)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)