1. Introduction to Linear Regression
In the early 1990s, Orley Ashenfelter, an Economics Professor at Princeton University, claimed to have found a method to predict the quality of Bordeaux wine, and hence its price, without tasting a single drop. He used linear regression, one of the standard methods of supervised learning today. The data used by Ashenfelter is available online. In this article, I use this data to give a gentle introduction to Linear Regression, using the R language.
Linear Regression is a method used to predict the value of an outcome, or dependent, variable from a set of independent variables. As the name implies, the prediction is linear: a linear combination of the independent variables is used to predict the value of the outcome variable. This is shown in the following equation for multi-variable linear regression:
yi = β0 + β1 xi1 + β2 xi2 + ... + βk xik + εi
where
- xij = the jth independent variable for the ith observation,
- yi = the dependent variable for the ith observation,
- εi = error term for the ith observation, also called a residual; this residual is the actual value of the dependent variable minus the value predicted by the linear regression model,
- β0 = the intercept coefficient,
- βj = the regression coefficient for the jth independent variable.
There is a set of data called the training set, for which the values of the independent and dependent variables are known; this data is used for training, that is, for learning the parameters of the linear regression, β0 and βj, which will then be used for prediction. There is another set of data, called the test set, for which the independent and dependent variables are also known, but which is used to test the accuracy and other properties of our linear regression model. About two years ago, I attended an online course called The Analytics Edge, offered by MIT on the edX platform. It introduced Linear Regression in the R language using a different form of wine-related CSV files containing the pricing data. We use that learning here to introduce linear regression.
Before embarking on our journey, I take a look at the contents of the wine-related CSV files, which are available for download at the top of this page. As introduced in a previous article of mine, I use the R commands read.csv() and str() to read in the CSV file and to see its structure, respectively. The six columns of the data file are:
- Year: Year in which the wine was produced. This is a unique identifier for each observation.
- LogPrice: Logarithm of the price of the wine. This is what we try to predict for the test data. In other words, this is the dependent variable.
- WinterRain: Winter rain in the Bordeaux region, in ml, during the months October to March. Independent variable.
- Temperature: Average temperature in the region, during the months April to September, in degrees Celsius. Independent variable.
- HarvestRain: Harvest rain in the region, in ml, during the months August and September. Independent variable.
- TimeYears: Time since vintage, in years. Independent variable.
The structure of this data is shown in the following screenshot, as seen in the R console, where the wine training data are read into a data frame named wineTrain:
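For reference, the corresponding commands would look like the following (a small sketch, assuming the training file is named wine_train.csv, the name referenced later in this article):
> wineTrain = read.csv("wine_train.csv")
> str(wineTrain)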
We use this training data set to learn the linear regression parameters, and then use these to predict the price values (well, the logarithm of the price) for the test set. We notice that this data frame has 24 observations of the 6 variables listed above.
2. Linear Regression in R
In this section, we pose a set of questions and get introduced to linear regression in R by addressing these questions.
- What is the summary of the wine data?
This is seen using the summary() command, which gives the following output:
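The call itself is presumably just the following one-liner:
> summary(wineTrain)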
Looking at this output, we find that the data is for the years 1952 to 1977. Since we are working with the logarithm of the price, it is possible for this value to be negative. The values of the logarithm of price are indeed negative, and the maximum value is 0.0. The summary also gives the statistical parameters of Winter Rain, Temperature, Harvest Rain, and Time in years.
- How does wine price vary with temperature?
This is seen using the plot() command:
> plot(wineTrain$Temperature, wineTrain$LogPrice)
The result is a plot as follows:
Looking at this plot, we see that the logarithm of price generally increases with temperature, and that a straight line can be fitted through the data. We next attempt to create a simple linear regression model, using only the temperature, to predict the wine price.
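As a quick visual check, the fitted straight line can be overlaid on this scatter plot with abline(), using the lm() call introduced in the next question (a sketch):
> plot(wineTrain$Temperature, wineTrain$LogPrice)
> abline(lm(LogPrice ~ Temperature, data=wineTrain))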
- How to do a linear regression computation in R, with one independent variable?
For this, we consider one independent variable, Temperature, the average temperature in the region during the months April to September. The command to make R do the linear regression, using Temperature as the independent variable and LogPrice as the dependent variable, is shown below. Following this command, a summary of the model is obtained using the summary() command.
> modelReg1 = lm(LogPrice ~ Temperature, data=wineTrain)
> summary(modelReg1)
Here, the variable name before the ~ symbol is the dependent variable, and the variable name after the ~ symbol is the independent variable. Upon running these commands, the R console output is shown below:
There is a lot of information provided in this summary. Let us get introduced to this summary by posing a further set of questions as below.
- What is the Sum of Square Errors, SSE?
The Sum of Square Errors is the sum of the squares of the individual residual values over all the points. If there are N observations, and the individual error terms are ε1, ε2, ..., εN, then the SSE is given by:
SSE = ε1² + ε2² + ... + εN²
Given two linear regression models, the one with the smaller value of SSE is a better model. Ideally, we would like to have a linear regression model which has the minimum SSE value.
- What is the Root Mean Square Error, RMSE?
The Sum of Square Errors SSE can be hard to interpret, both because its magnitude depends on N, the number of observations, and because its units are difficult to understand, since the SSE is in squared units of the dependent variable. To overcome this, the Root Mean Square Error, RMSE, is used. The RMSE is defined as:
RMSE = √(SSE / N)
This RMSE value is normalized by N, and is in the same units as the dependent variable.
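In R, the RMSE of the model fitted above can be computed directly from its residuals (a one-line sketch; residual extraction is discussed further below):
> RMSE = sqrt(sum(modelReg1$residuals^2) / nrow(wineTrain))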
- What is the R2 error measure and what is its significance?
The R2 error measure compares our regression model to a baseline model. The baseline model does not use any independent variables, and predicts the same outcome (dependent variable) irrespective of the values of the independent variable(s). As an example, the baseline model may predict the average value of the dependent variable over all the observations.
The sum of square errors for the baseline model is also referred to as the Total Sum of Squares, written as SST. The formula for R2 is:
R2 = 1 - SSE / SST
R2 is a nice error measure as it captures the value added by using a linear regression model, as against using an average outcome for each data point. If the value of R2 is zero, there is no improvement over the baseline model. On the other hand, if the value of R2 is equal to 1, we have a perfect predictive model, one that perfectly predicts all the values of the dependent variable for the training set. An R2 close to 1 implies a near-perfect predictive model. R2 is also unitless, which adds to its appeal as an error measure.
Both the SSE and SST values have to be greater than or equal to zero, since they are sums of squares of error terms. Additionally, on the training set, the SSE has to be less than or equal to the SST. In other words, the linear regression model will never be worse than the baseline model on the training data.
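For concreteness, these quantities can be computed in R for the single-variable model fitted above; a sketch, using the training-set mean as the baseline (this mirrors the test-set calculation done later in this article):
> SSE = sum(modelReg1$residuals^2)
> SST = sum((wineTrain$LogPrice - mean(wineTrain$LogPrice))^2)
> 1 - SSE / SST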
Armed with this knowledge of SSE, RMSE, and R2, we are now in a position to interpret the Linear Regression summary shown above.
- How do we interpret the Linear Regression summary shown above?
Looking at the output of the summary(modelReg1) command shown above, we see the following parts in the output:
- Call part: where the command used has been output, for reference. This is a description of the function used to build the model.
- Residuals part: This is a summary of the residuals or error terms. The minimum value, first and third quartiles, the median, and maximum value of residuals are shown.
- Coefficients part: where the first row corresponds to the Intercept term, and the second row corresponds to the independent variable Temperature. The Estimate column gives the estimates for the β values of our model. If a coefficient is zero or close to zero in the Estimate column, then that variable does not change the value of the dependent variable significantly. The remaining columns help us determine whether a variable should be included in the model or not. The Std. Error column gives a measure of how much the coefficient is likely to vary from its estimated value. The t value is simply the estimate divided by the standard error; if its absolute value is large, the coefficient is more likely to be significant. The last column, Pr(>|t|), gives a measure of how plausible it is that the coefficient is actually zero; the smaller this probability, the less plausible it is that the coefficient is zero. Adjacent to the Pr(>|t|) column are some symbols: stars and dots. Three stars indicate that the coefficient is highly significant, with fewer stars indicating lesser significance. A period indicates that the coefficient is almost significant. The legend for this coding scheme is shown at the bottom of the Coefficients table.
- Last part: where we can read off the Multiple R-squared value, which is the same as the R2 value discussed above. Adjacent to this is the Adjusted R-squared, which adjusts the R2 for the number of independent variables used relative to the number of data points. Multiple R-squared generally increases upon adding more independent variables, whereas Adjusted R-squared will decrease if an added variable does not improve the model. These values can also be extracted programmatically, as sketched below this list.
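A small sketch of extracting the coefficient estimates and the R2 value directly from the model object and its summary:
> coef(modelReg1)
> summary(modelReg1)$r.squared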
- How to compute the residuals and SSE for our model?
The residuals are stored in the vector modelReg1$residuals. The SSE can be computed by the R command:
> SSE = sum(modelReg1$residuals^2)
This comes to 5.3942 for our model.
- How can our initial Linear Regression Model be improved?
We can add another independent variable to the model, and let us add the variable HarvestRain. The corresponding linear regression call is:
> modelReg2 = lm(LogPrice ~ Temperature + HarvestRain, data=wineTrain)
> summary(modelReg2)
The result of the summary() command is as in the following figure:
We see that by adding another independent variable to the linear regression computation, the R2 value increased to 0.7079. So, modelReg2 is perhaps a better model than modelReg1. If we compute the SSE for modelReg2 as above (a one-line sketch follows), we find that this value is 2.9607, which is better than the SSE value for modelReg1; so this model is indeed a better model than the previous one. We notice that both Temperature and HarvestRain are significant variables, as indicated by the three stars in the corresponding rows.
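The SSE computation for modelReg2 mirrors the earlier one:
> SSE2 = sum(modelReg2$residuals^2)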
- Can we further improve the model?
We add all the independent variables to the model. The corresponding linear regression call is:
> modelReg3 = lm(LogPrice ~ Temperature + HarvestRain + WinterRain + TimeYears, data=wineTrain)
> summary(modelReg3)
The result of the summary() command is as in the following figure:
We notice that the Coefficients table has five rows, one for the intercept and one each for the four independent variables. We also see that by adding all the independent variables to the linear regression computation, the R2 value increases to 0.8383, which is nearly double the R2 value for our first linear regression model. The SSE calculation shows that it is 1.6387, which is less than the values for the earlier two linear regression models. We notice that both Temperature and HarvestRain are significant variables, as indicated by the three stars in the corresponding rows, while TimeYears is less significant, as indicated by two stars. The variable WinterRain is only marginally significant, as indicated by the period at the end of its row.
- Does our model suffer from multicollinearity issues?
Correlation is a measure of the linear relationship between variables. If two independent variables in our model are highly correlated, then the model may suffer from multicollinearity issues. To check whether our model has such issues, we compute the correlation between all variables of the wineTrain data frame by using the cor() command in R.
> cor(wineTrain)
The result of the cor() command is as in the following figure:
We notice that the correlation coefficients output in this table are not close to either +1 or -1 for any pair of independent variables. So, our model does not suffer from multicollinearity issues.
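If needed, the correlation between a single pair of variables can also be checked directly (a sketch):
> cor(wineTrain$Temperature, wineTrain$HarvestRain)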
- How does the model perform on new data?
The data used to build the model is the training data, and any data which lies outside this data set, that is, new data, is test data. The question arises as to how this model will perform on test data. Such test data is also available for download above, in the file called wine_test.csv.
> wineTest = read.csv("wine_test.csv")
> str(wineTest)
This shows that there are 3 observations of the same 6 variables present in the file wine_train.csv. We now examine how the third model, modelReg3, performs on the test data.
- How is prediction done on the test data?
The R command for prediction is as follows:
> predictTest = predict(modelReg3, newdata=wineTest)
> predictTest
This shows the prices predicted by the model modelReg3. If we look at the results of str(wineTest), our model has predicted -1.677749 for the year 1978, whereas the actual value is -1.31. A similar comparison can be done for the other two years, 1979 and 1980 (a small sketch of such a side-by-side comparison follows). These predictions are not too far off, but they are not very close either, so we try to quantify them by taking a look at the R2 value. This is what we do next.
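A side-by-side view of actual and predicted values can be obtained, for example, as follows (a sketch, relying on the test file having the same column names as the training file):
> data.frame(Year = wineTest$Year, Actual = wineTest$LogPrice, Predicted = predictTest)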
- How to quantify the results of our prediction?
The standard way is to compute the SSE and SST values, and use these to compute R2.
> SSE = sum((wineTest$LogPrice - predictTest)^2)
> SST = sum((wineTest$LogPrice - mean(wineTrain$LogPrice))^2)
> 1 - SSE / SST
This gives a somewhat low R2 value of 0.37. We can attribute this low value to the small number of data points in both the training and test sets. If we had more data points to train on, and more data points to test on, it is likely that our R2 value on the test data would be better.
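For completeness, a test-set RMSE can be computed in the same spirit (a sketch, reusing the SSE computed just above; the result is in the same units as LogPrice):
> RMSE_test = sqrt(SSE / nrow(wineTest))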
3. Closure
This concludes our short article on wine pricing using linear regression. In this article, we got introduced to linear regression in R by looking at the wine pricing problem. We saw how to load the data and run a linear regression with a single variable, and we understood the output that R gives. We then expanded the linear regression model to include another variable, and then extended this further to include more variables. We saw the meanings of the SSE and R2 values which the linear regression computation gives as output. We then used the output from linear regression to predict price values in a test set, and thus saw the accuracy of the model.
We urge you to load your own training and test CSV files, try out linear regression using the commands listed above, and let us know your feedback.
History
- Version 1.0: 22 Feb 2017.