What is Linear Regression? Part 3: Interaction Terms

In the first part, we covered the basics of Simple Linear Regression, and in the second we discussed Multiple Linear Regression. In this post, we will take a quick look at interaction terms. As emphasised in the previous posts, the code here follows the free Introduction to Machine Learning course I am teaching on IQmates. If the code does not make sense on its own, that is because the explanations are in Chapter 3 of that course (you might want to check it out!).


So what are interaction terms? An interaction term is used when the effect of one predictor on the response depends on the value of another predictor. It is built as the product of the two predictors. The example in this post is the interaction between the lstat variable and the age variable in our Boston dataset, with medv as the response.
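Written out as an equation, the model fitted in this post can be sketched as follows. The symbols (the betas and the error term) are my own notation for illustration and do not come from the course material:

```latex
\[
\text{medv} = \beta_0 + \beta_1\,\text{lstat} + \beta_2\,\text{age}
            + \beta_3\,(\text{lstat} \times \text{age}) + \varepsilon
\]
% Grouping the lstat terms shows why this is called an interaction:
\[
\text{medv} = \beta_0 + (\beta_1 + \beta_3\,\text{age})\,\text{lstat}
            + \beta_2\,\text{age} + \varepsilon
\]
```

In the second form, the slope for lstat is no longer a single number. It changes with age, and that is exactly what "interaction" means here.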


Including interaction terms is easy. We just add them to the formula in the lm() function, using an asterisk to create the interaction term, and then call summary() on the fit to see the results:


lm.fit3 <- lm(medv ~ lstat + age + lstat * age, data = boston_data)
summary(lm.fit3)

Call:
lm(formula = medv ~ lstat + age + lstat * age, data = boston_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16
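
One way to read the coefficient table: with an interaction in the model, the slope of medv with respect to lstat is the lstat estimate plus the lstat:age estimate multiplied by age. The small sketch below just does that arithmetic with the numbers printed above. The helper name lstat_slope and the three age values are my own choices for illustration, not something from the course:

```r
# Estimates copied from the coefficient table above.
b_lstat     <- -1.3921168   # row "lstat"
b_lstat_age <-  0.0041560   # row "lstat:age"

# Hypothetical helper: slope of medv with respect to lstat at a given age.
lstat_slope <- function(age) b_lstat + b_lstat_age * age

# Arbitrary illustration points for age.
ages   <- c(10, 50, 90)
slopes <- lstat_slope(ages)

print(data.frame(age = ages, lstat_slope = slopes))
```

Because the lstat:age estimate is positive, the (negative) lstat slope moves closer to zero as age increases.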


This summary shows that lstat and the interaction term (shown as lstat:age in the output) are both significant at the 5% level. In the presence of these two terms, age on its own is not significant (p = 0.97). Despite this, age stays in the model: in an R formula, lstat * age is shorthand for lstat + age + lstat:age, so the asterisk brings in both individual variables along with their interaction. This means the code above can be written more compactly. Because R adds the individual variables automatically, we can rewrite the linear model code like this (leaving out the individual terms):


lm.fit3 <- lm(medv ~ lstat * age, data = boston_data)


If you run summary(lm.fit3) again, you will get the same results as above (only the formula shown in the Call line changes).
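
If you would rather check this than take my word for it, here is a minimal sketch. It assumes boston_data is simply the Boston data frame from the MASS package (how it was loaded is covered in the earlier posts, so treat that line as an assumption), and the names fit_long and fit_short are only for illustration:

```r
# Assumption: boston_data is the Boston data frame shipped with MASS.
boston_data <- MASS::Boston

# Term labels R derives from each formula (no model fitting needed).
long_terms  <- attr(terms(medv ~ lstat + age + lstat * age), "term.labels")
short_terms <- attr(terms(medv ~ lstat * age), "term.labels")
print(long_terms)
print(short_terms)

# Fit both versions and compare the estimated coefficients.
fit_long  <- lm(medv ~ lstat + age + lstat * age, data = boston_data)
fit_short <- lm(medv ~ lstat * age, data = boston_data)
same_coefs <- isTRUE(all.equal(coef(fit_long), coef(fit_short)))
print(same_coefs)

# With a colon instead of an asterisk, only the product term is added.
colon_terms <- attr(terms(medv ~ lstat:age), "term.labels")
print(colon_terms)
```

The last two lines show the difference between the two operators: lstat:age adds only the interaction, while lstat * age adds the interaction plus both individual variables.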


That’s how you include interaction terms in R. What about when you want to include powers of the variables, for example age^2? Let’s see that in the next post.