So far in this series on the basics of Linear Regression, we have dealt only with quantitative variables. You will often come across situations where you have to model with qualitative variables too, for example when you are predicting when someone will get into debt and one of your predictors is their level of education (a qualitative variable). For this example, we are going to load the Carseats dataset, which you will find in the ISLR package:
library(ISLR)    # provides the Carseats dataset
names(Carseats)  # list the variable names

 "Sales" "CompPrice" "Income" "Advertising" "Population" "Price" "ShelveLoc" "Age" "Education" "Urban" "US"
You can run ?Carseats to get an explanation of these variable names. The one of interest here is the qualitative variable “ShelveLoc”, an indicator of the quality of the shelving location, i.e. the space within a store in which the car seat is displayed, at each location. It takes on three values: Bad, Medium and Good.
From the videos on the course I am teaching, we saw that qualitative variables need to be coded before they can be used in machine learning algorithms. In the video, we mentioned that there are different coding schemes one can use. For example, if the predictor is gender (Male and Female), one person can code it as 0 = Male and 1 = Female, while another might code it as -1 = Male and 1 = Female. Fortunately, R generates dummy variables for us automatically, so we do not have to stress about coding variables with multiple levels (or choices), for example major cities in South Africa. Let’s see how we can use this power of R to our advantage when dealing with ShelveLoc, which has three levels:
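Before fitting the Carseats model, here is a minimal sketch of the two gender coding schemes mentioned above. The data are made up purely for illustration (the `gender` and `y` vectors are hypothetical and are not part of Carseats), and only base R is assumed.

```r
# Toy data (hypothetical, not from Carseats): a two-level predictor and a response
gender <- factor(c("Male", "Female", "Female", "Male", "Female", "Male"),
                 levels = c("Male", "Female"))
y <- c(10, 14, 15, 9, 13, 11)

# Scheme 1: 0 = Male, 1 = Female
x01 <- ifelse(gender == "Female", 1, 0)
# Scheme 2: -1 = Male, 1 = Female
x11 <- ifelse(gender == "Female", 1, -1)

fit01 <- lm(y ~ x01)
fit11 <- lm(y ~ x11)

# Passing the factor directly lets R create the dummy variable itself
fit_auto <- lm(y ~ gender)

coef(fit01)     # slope = difference between the two group means
coef(fit11)     # slope = half of that difference
coef(fit_auto)  # same numbers as Scheme 1, with the dummy named genderFemale

# The coefficients depend on the coding scheme, but the fitted values do not
all.equal(fitted(fit01), fitted(fit11))
```

The coding scheme changes how the coefficients are interpreted, not the predictions. With that in mind, the Carseats model below lets R do the coding for ShelveLoc: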
lm.fit5 <- lm(Sales ~ . + Income * Advertising + Price * Age, data = Carseats)
summary(lm.fit5)
Call:
lm(formula = Sales ~ . + Income * Advertising + Price * Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9208 -0.7503  0.0177  0.6754  3.3413

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
Income              0.0108940  0.0026044   4.183 3.57e-05 ***
Advertising         0.0702462  0.0226091   3.107 0.002030 **
Population          0.0001592  0.0003679   0.433 0.665330
Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
Age                -0.0579466  0.0159506  -3.633 0.000318 ***
Education          -0.0208525  0.0196131  -1.063 0.288361
UrbanYes            0.1401597  0.1124019   1.247 0.213171
USYes              -0.1575571  0.1489234  -1.058 0.290729
Income:Advertising  0.0007510  0.0002784   2.698 0.007290 **
Price:Age           0.0001068  0.0001333   0.801 0.423812
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.011 on 386 degrees of freedom
Multiple R-squared:  0.8761,  Adjusted R-squared:  0.8719
F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16
When you look at the summary of the model, you see that R has coded the ShelveLoc variable into two dummy variables: ShelveLocGood and ShelveLocMedium. To see the coding that R used, you can run the contrasts() function on the variable, i.e. contrasts(Carseats$ShelveLoc):
       Good Medium
Bad       0      0
Good      1      0
Medium    0      1
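R created the ShelveLocGood dummy variable, which takes on the value 1 if the shelving location is good and 0 otherwise. It also created a ShelveLocMedium dummy variable, which equals 1 if the shelving location is medium and 0 otherwise. A bad shelving location corresponds to a 0 for each of the two dummy variables, so Bad acts as the baseline. The two coefficients in the summary are therefore read relative to a bad location: the ShelveLocGood estimate (about 4.85) is larger than the ShelveLocMedium estimate (about 1.95), and both are positive, so good and medium locations are associated with higher sales than a bad one, with good locations highest.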
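The dummy coding described above can be reproduced on a small made-up factor. This is only a sketch: the `shelf` vector is hypothetical and merely mimics the three ShelveLoc levels, and it assumes R's default treatment contrasts for unordered factors. It also shows how relevel() can be used if you would rather have a different level as the baseline.

```r
# Hypothetical three-level factor mimicking ShelveLoc (not the real data)
shelf <- factor(c("Bad", "Good", "Medium", "Medium", "Bad", "Good"))

# Levels are sorted alphabetically, so "Bad" comes first and becomes the baseline
levels(shelf)
contrasts(shelf)

# The dummy columns that lm() would build from this factor
dummies <- model.matrix(~ shelf)
dummies

# Make "Medium" the baseline instead; the dummies now represent Bad and Good
shelf2 <- relevel(shelf, ref = "Medium")
contrasts(shelf2)
```

Changing the baseline changes which comparisons the coefficients express, but the fitted model is the same.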
Great! We are done with the basics of Linear Regression. As a recap, we looked at:
- Part 1: Simple Linear Regression
- Part 2: Multiple Linear Regression
- Part 3: Interaction Terms
- Part 4: Non-Linear Transformations of the Predictors
- Part 5: Dealing with Qualitative Predictors
To emphasize, these posts follow the Introduction to Machine Learning course on IQmates, which you can access for FREE from anywhere, at any time. In the videos, I explain the concepts behind what we have just coded here, which will help you understand the code.