What is Linear Regression? Part 2: Multiple Linear Regression

Welcome back! This is Part 2 of our linear regression series. It follows Chapter 3 of the free Introduction to Machine Learning course I am teaching on IQmates. You can find Part 1: Simple Linear Regression here; we will be continuing from where we left off, so some concepts covered in that post will not be re-explained here.

 

As the title of this blog post suggests, we are going to discuss Multiple Linear Regression. The name is self-explanatory, right? To some extent it is. In the previous post (Part 1), we used Simple Linear Regression to predict the median price of houses in the Boston area using the lstat variable only. Multiple Linear Regression takes it a step further: instead of using only one variable as we did, why not add more variables in order to (hopefully) increase our prediction accuracy (something we will discuss in later chapters)? As we saw previously, using only lstat was not sufficient. We agreed that variables other than lstat affect the median price of houses in the Boston area (or anywhere else, really), such as the crime rate (crim) or the number of rooms. With Simple Linear Regression, we cannot see the full effect of all these variables together: we have to look at them one at a time. In comes Multiple Linear Regression, which allows us to do just that.

 

Luckily, we will still use the lm() function to build our linear model. The only difference is we will now regress the medv target onto multiple predictors and not just a single one (hence the name). Let’s say we think two factors, age of the house and lstat, affect the median value of the house. We can create a multiple linear regression model using this code:

 

lm.fit2 = lm(medv ~ lstat + age, data = boston_data)

 

This code says: regress medv onto the lstat and age variables from the boston_data dataset. The “and” part is written as the “+” sign in the formula. Previously, for simple linear regression, we had just medv ~ lstat; now we write medv ~ lstat + age since we believe the median house price depends on both lstat and age.
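As a quick aside (a minimal sketch, assuming boston_data is already loaded in your session and the lstat and age values below are made up), we can ask this fitted model for a prediction at chosen predictor values using R's predict() function:

predict(lm.fit2, newdata = data.frame(lstat = 10, age = 50))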

 

Let’s see what the summary is for this model:

 

summary(lm.fit2)

Call:
lm(formula = medv ~ lstat + age, data = boston_data)

Residuals:
Min 1Q Median 3Q Max 
-15.981 -3.978 -1.283 1.968 23.158

Coefficients:
Estimate Std. Error t value Pr(>|t|) 
(Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
lstat -1.03207 0.04819 -21.416 < 2e-16 ***
age 0.03454 0.01223 2.826 0.00491 ** 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495 
F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16

We have modeled using lstat and age, but what about crime rate (crim)? In the previous post, we agreed that crime rate affects house prices. Are we now stuck predicting with only two variables? Nope! We can easily extend our Multiple Linear Regression model to include many more variables, as many as we want actually. We can keep adding variables with the “+” sign, just like we added age, but depending on how many variables we have, it might be cumbersome to write them all out. Luckily, there is a shorthand we can use. Let's see it at work here, where we predict medv using all 13 predictors:

 

lm.fit2 = lm(medv ~ ., data = boston_data)

 

That’s it! The dot (“.”) I put there is telling R that we are regressing medv onto every other variable in the dataset. Let’s look at the summary of the results of this updated model:

 

summary(lm.fit2)

Call:
lm(formula = medv ~ ., data = boston_data)

Residuals:
Min 1Q Median 3Q Max 
-15.595 -2.730 -0.518 1.777 26.199

Coefficients:
Estimate Std. Error t value Pr(>|t|) 
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 ** 
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288 
chas 2.687e+00 8.616e-01 3.118 0.001925 ** 
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229 
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 ** 
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338 
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16

 

Just as in the previous post, we can check which names to use to drill down into a specific part of the summary:

 

names(summary(lm.fit2))

[1] "call" "terms" "residuals" "coefficients" 
[5] "aliased" "sigma" "df" "r.squared" 
[9] "adj.r.squared" "fstatistic" "cov.unscaled"

 

So summary(lm.fit2)$r.squared will give us the R² and summary(lm.fit2)$sigma will give us the RSE (residual standard error).
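For example, here is a small sketch (run after fitting the full model above) that stores those two quantities in their own variables:

r2 = summary(lm.fit2)$r.squared  # around 0.7406 for the full model above
rse = summary(lm.fit2)$sigma     # around 4.745 for the full model above
c(r2, rse)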

 

As discussed in the course, when you take a look at the summary, you can see that a variable like age has a high p-value. Also notice that it has no asterisks next to it. The asterisks highlight variables the model thinks are important for the prediction, so the ones without asterisks are considered not that important once they are put together with the others. In this case, age is considered essentially useless (no asterisks at all) when it sits alongside the other 12 variables. The more stars a variable has (out of three), the more important the model thinks it is. We will trust its judgement! What this means is that we can model this situation without the age variable. Now one might wonder how to do that: the dot (“.”) regresses medv on all the variables, so if we want to leave out age, do we have to type out every other variable by hand? Nope again! We just exclude age by doing this:

 

lm.fit2 = lm(medv ~ . - age, data = boston_data)

 

Simple as that! The formula says: regress medv on all other variables (“.”), except (“-”) age, from the boston_data dataset.
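If you want to double-check that age really is out of the model, one quick way (a small sketch) is to list the coefficient names of the refitted model and confirm that age no longer appears:

names(coef(lm.fit2))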

 

That’s the basics of Multiple Linear Regression! We moved from using one variable to predict our target to using all of them. After a simple analysis, we found that age was not very useful in the presence of the 12 other variables, so we dropped it and refit our model. If you do not want to write the code the way I did, you can simply update your model by running lm.fit2 = update(lm.fit2, ~ . - age), which tells R to update the lm.fit2 we already have by changing the regression to be onto all other variables except age. If you want to remove multiple variables at once, say age and indus (which also had a high p-value in our lm.fit2), you can run:

 

lm.fit2 = lm(medv ~ . - age - indus, data = boston_data)
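If you prefer the update() shorthand, the equivalent call (a small sketch, assuming lm.fit2 is the full model fitted earlier) would be:

lm.fit2 = update(lm.fit2, ~ . - age - indus)
summary(lm.fit2)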

 

Now that we are done with the basics of Multiple Linear Regression, the next post will quickly cover interaction terms. The theory has already been discussed in the videos of the Introduction to Machine Learning course that is freely available on IQmates. These linear regression tutorials follow Chapter 3 of the course.