For a recent assignment in Sanjay’s SEM class, we had to plot interactions between two continuous variables – the model was predicting students’ grades (GRADE) from how often they attend class (ATTEND) and how many of the assigned books they read (BOOKS), and their interaction. I did all the plotting in ggplot2. It was my first time trying to add lines for different categories to the same plot, and I really wanted labels for each line to show up in the plot legend, which was trickier than I would have thought. I got it to work, though!
Here’s a link to an HTML version of my homework document, which includes Sanjay’s instructions, etc.
For this assignment, you should download the dataset grade_read_attend.sav (linked above). The dataset is in SPSS format. (Credit for the dataset goes to Jeremy Miles.)
This (made-up) dataset contains data from 40 students in a literature class. The dataset contains 3 variables:
-
GRADE is the student's final grade (out of 100)
-
BOOKS is the number of assigned books that the student actually read (out of 4)
-
ATTEND is the number of class meetings the student attended (out of 20)
In your favorite statistical package (SPSS, SAS, R, whatever) do the following:
Run a simple regression in which you regress GRADE on ATTEND. (In regressionspeak, you say “regress Y on X,” where Y is the dependent/response variable and X is the independent/input variable. Thus, I am telling you to treat GRADE as the dependent variable and ATTEND as the independent variable.) Using the output of your analysis, do the following:
-
(1a) Write out the algebraic equation representing this analysis, using the unstandardized coefficient estimates from your output (that is, write out the best-fitting linear model predicting GRADE from ATTEND).
-
(1b) Create a graph in which the Y-axis is GRADE and the X-axis is ATTEND. Draw a line representing the slope of ATTEND for a realistic range of values. (You can do this by hand, or using whatever software you'd like.)
library(foreign)
data <- read.spss("/Users/TARDIS/Downloads/grade_read_attend.sav", to.data.frame = TRUE)
# Run a simple regression in which you regress GRADE on ATTEND
model1 <- lm(GRADE ~ ATTEND, data = data)
summary(model1)
##
## Call:
## lm(formula = GRADE ~ ATTEND, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.78 -10.90 2.02 12.43 31.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.998 8.169 4.53 5.7e-05 ***
## ATTEND 1.883 0.555 3.39 0.0016 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.8 on 38 degrees of freedom
## Multiple R-squared: 0.233, Adjusted R-squared: 0.212
## F-statistic: 11.5 on 1 and 38 DF, p-value: 0.00163
# Write out the algebraic equation representing this analysis
print(paste("GRADE = ", round(model1$coeff[1], digits = 2), " + ", round(model1$coeff[2],
digits = 2), "*ATTEND + e", sep = ""))
## [1] "GRADE = 37 + 1.88*ATTEND + e"
# Create a graph in which the Y-axis is GRADE and the X-axis is ATTEND. Draw
# a line representing the slope of ATTEND for a realistic range of values.
library(ggplot2)
ggplot(data, aes(x = ATTEND, y = GRADE)) + geom_point(shape = 1) + geom_smooth(method = lm,
se = FALSE)
Run a multiple regression in which you regress GRADE on BOOKS and ATTEND. Using the output of your analysis:
-
(2a) Write the algebraic equation for this analysis as above.
-
(2b) Create a graph with the same axes as 1b above. Graph the regression lines of ATTEND for students who have read 0, 2, and 4 books (so you should draw 3 lines).
# Run a multiple regression in which you regress GRADE on BOOKS and ATTEND
model2 <- lm(GRADE ~ ATTEND + BOOKS, data = data)
summary(model2)
##
## Call:
## lm(formula = GRADE ~ ATTEND + BOOKS, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.80 -13.37 0.06 9.17 32.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.379 7.745 4.83 2.4e-05 ***
## ATTEND 1.283 0.587 2.19 0.035 *
## BOOKS 4.037 1.753 2.30 0.027 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.1 on 37 degrees of freedom
## Multiple R-squared: 0.329, Adjusted R-squared: 0.292
## F-statistic: 9.06 on 2 and 37 DF, p-value: 0.000628
# Write out the algebraic equation representing this analysis
print(paste("GRADE = ", round(model2$coeff[1], digits = 2), " + ", round(model2$coeff[2],
digits = 2), "*ATTEND", " + ", round(model2$coeff[3], digits = 2), "*BOOKS + e",
sep = ""))
## [1] "GRADE = 37.38 + 1.28*ATTEND + 4.04*BOOKS + e"
# Create a graph with the same axes as 1b above. Graph the regression lines
# of ATTEND for students who have read 0, 2, and 4 books (so you should draw
# 3 lines).
ggplot(data, aes(x = ATTEND, y = GRADE)) + geom_point(shape = 1) + geom_abline(aes(color = "0 books"),
intercept = model2$coeff[1], slope = model2$coeff[2], show_guide = TRUE) +
geom_abline(aes(color = "2 books"), intercept = model2$coeff[1] + 2 * model2$coeff[3],
slope = model2$coeff[2], show_guide = TRUE) + geom_abline(aes(color = "4 books"),
intercept = model2$coeff[1] + 4 * model2$coeff[3], slope = model2$coeff[2],
show_guide = TRUE) + guides(color = guide_legend(title = "Number of books read"))
Run a multiple regression in which you regress GRADE on BOOKS, ATTEND, and the interaction of BOOKS with ATTEND. Do not center any of the variables. Using the output of your analysis:
-
(3a) Write the algebraic equation representing the results of this analysis three times. The first time, write it in standard form. The second time, rearrange it so that you can easily see the conditional slope of BOOKS. The third time, rearrange again so you can easily see the conditional slope of ATTEND.
-
(3b) Draw a graph with GRADE on the Y-axis and ATTEND on the X-axis. Draw 3 lines depicting the regressions of ATTEND for students who have read 0, 2, and 4 books.
-
(3c) Draw a graph with GRADE on the Y-axis and BOOKS on the X-axis. Draw 3 lines depicting the regressions of BOOKS for students who have attended 10, 15, and 20 times.
# Run a multiple regression in which you regress GRADE on BOOKS, ATTEND, and
# the interaction of BOOKS with ATTEND
model3 <- lm(GRADE ~ ATTEND * BOOKS, data = data)
summary(model3)
##
## Call:
## lm(formula = GRADE ~ ATTEND * BOOKS, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.84 -11.85 1.47 8.16 34.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.221 11.262 4.90 2e-05 ***
## ATTEND -0.137 0.878 -0.16 0.877
## BOOKS -6.208 5.151 -1.21 0.236
## ATTEND:BOOKS 0.735 0.349 2.10 0.042 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 36 degrees of freedom
## Multiple R-squared: 0.402, Adjusted R-squared: 0.352
## F-statistic: 8.07 on 3 and 36 DF, p-value: 0.000306
b0 <- round(model3$coeff[1], digits = 2)
attend <- round(model3$coeff[2], digits = 2)
books <- round(model3$coeff[3], digits = 2)
interaction <- round(model3$coeff[4], digits = 2)
# Write the algebraic equation representing the results of this analysis
# three times. The first time, write it in standard form.
print(paste("GRADE = ", b0, " + ", attend, "*ATTEND", " + ", books, "*BOOKS + ",
interaction, "*ATTEND*BOOKS + e", sep = ""))
## [1] "GRADE = 55.22 + -0.14*ATTEND + -6.21*BOOKS + 0.73*ATTEND*BOOKS + e"
# The second time, rearrange it so that you can easily see the conditional
# slope of BOOKS.
print(paste("GRADE = ", b0, " + ", attend, "*ATTEND", " + (", books, " + ",
interaction, "*ATTEND)*BOOKS + e", sep = ""))
## [1] "GRADE = 55.22 + -0.14*ATTEND + (-6.21 + 0.73*ATTEND)*BOOKS + e"
# The third time, rearrange again so you can easily see the conditional
# slope of ATTEND.
print(paste("GRADE = ", b0, " + (", attend, " + ", interaction, "*BOOKS)*ATTEND + ",
books, "*BOOKS + e", sep = ""))
## [1] "GRADE = 55.22 + (-0.14 + 0.73*BOOKS)*ATTEND + -6.21*BOOKS + e"
# Draw a graph with GRADE on the Y-axis and ATTEND on the X-axis. Draw 3
# lines depicting the regressions of ATTEND for students who have read 0, 2,
# and 4 books.
ggplot(data, aes(x = ATTEND, y = GRADE)) + geom_point(shape = 1) + geom_abline(aes(color = "0 books"),
intercept = b0, slope = attend, show_guide = TRUE) + geom_abline(aes(color = "2 books"),
intercept = b0 + 2 * books, slope = attend + 2 * interaction, show_guide = TRUE) +
geom_abline(aes(color = "4 books"), intercept = b0 + 4 * books, slope = attend +
4 * interaction, show_guide = TRUE) + guides(color = guide_legend(title = "Number of books read"))
# Draw a graph with GRADE on the Y-axis and BOOKS on the X-axis. Draw 3
# lines depicting the regressions of BOOKS for students who have attended
# 10, 15, and 20 times.
ggplot(data, aes(x = BOOKS, y = GRADE)) + geom_point(shape = 1) + geom_abline(aes(color = "Attended 10 times"),
intercept = b0 + 10 * attend, slope = books + 10 * interaction, show_guide = TRUE) +
geom_abline(aes(color = "Attended 15 times"), intercept = b0 + 15 * attend,
slope = books + 15 * interaction, show_guide = TRUE) + geom_abline(aes(color = "Attended 20 times"),
intercept = b0 + 20 * attend, slope = books + 20 * interaction, show_guide = TRUE) +
guides(color = guide_legend(title = "Number of classes attended"))
Repeat the analysis you ran for #3, only this time you should first center BOOKS and ATTEND around their means, and then regress GRADE on BOOKS(centered), ATTEND(centered), and their interaction.
-
(4a) Write the algebraic equation representing the results of this analysis three times. The first time, write it in standard form. The second time, rearrange it so that you can easily see the conditional slope of BOOKS. The third time, rearrange again so you can easily see the conditional slope of ATTEND.
-
(4b) Draw a graph with GRADE on the Y-axis and ATTEND on the X-axis. Draw 3 lines depicting the regressions of ATTEND for students who have read 0, 2, and 4 books.
-
(4c) Draw a graph with GRADE on the Y-axis and BOOKS on the X-axis. Draw 3 lines depicting the regressions of BOOKS for students who have attended 10, 15, and 20 times.
data$ATTENDc <- data$ATTEND - mean(data$ATTEND)
data$BOOKSc <- data$BOOKS - mean(data$BOOKS)
# Run a multiple regression in which you regress GRADE on BOOKS, ATTEND, and
# the interaction of BOOKS with ATTEND
model4 <- lm(GRADE ~ ATTENDc * BOOKSc, data = data)
summary(model4)
##
## Call:
## lm(formula = GRADE ~ ATTENDc * BOOKSc, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.84 -11.85 1.47 8.16 34.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.602 2.319 26.57 <2e-16 ***
## ATTENDc 1.333 0.562 2.37 0.023 *
## BOOKSc 4.155 1.678 2.48 0.018 *
## ATTENDc:BOOKSc 0.735 0.349 2.10 0.042 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 36 degrees of freedom
## Multiple R-squared: 0.402, Adjusted R-squared: 0.352
## F-statistic: 8.07 on 3 and 36 DF, p-value: 0.000306
b0 <- round(model4$coeff[1], digits = 2)
attend <- round(model4$coeff[2], digits = 2)
books <- round(model4$coeff[3], digits = 2)
interaction <- round(model4$coeff[4], digits = 2)
# Write the algebraic equation representing the results of this analysis
# three times. The first time, write it in standard form.
print(paste("GRADE = ", b0, " + ", attend, "*ATTENDc", " + ", books, "*BOOKSc + ",
interaction, "*ATTENDc*BOOKSc + e", sep = ""))
## [1] "GRADE = 61.6 + 1.33*ATTENDc + 4.15*BOOKSc + 0.73*ATTENDc*BOOKSc + e"
# The second time, rearrange it so that you can easily see the conditional
# slope of BOOKSc.
print(paste("GRADE = ", b0, " + ", attend, "*ATTENDc", " + (", books, " + ",
interaction, "*ATTENDc)*BOOKSc + e", sep = ""))
## [1] "GRADE = 61.6 + 1.33*ATTENDc + (4.15 + 0.73*ATTENDc)*BOOKSc + e"
# The third time, rearrange again so you can easily see the conditional
# slope of ATTENDc.
print(paste("GRADE = ", b0, " + (", attend, " + ", interaction, "*BOOKSc)*ATTENDc + ",
books, "*BOOKSc + e", sep = ""))
## [1] "GRADE = 61.6 + (1.33 + 0.73*BOOKSc)*ATTENDc + 4.15*BOOKSc + e"
# Draw a graph with GRADE on the Y-axis and ATTENDc on the X-axis. Draw 3
# lines depicting the regressions of ATTENDc for students who have read 0,
# 2, and 4 booksc.
books0c <- 0 - mean(data$BOOKS)
books2c <- 2 - mean(data$BOOKS)
books4c <- 4 - mean(data$BOOKS)
ggplot(data, aes(x = ATTENDc, y = GRADE)) + geom_point(shape = 1) + geom_abline(aes(color = "0 books"),
intercept = (b0 + books0c * books), slope = (attend + books0c * interaction),
show_guide = TRUE) + geom_abline(aes(color = "2 books"), intercept = (b0 +
books2c * books), slope = (attend + books2c * interaction), show_guide = TRUE) +
geom_abline(aes(color = "4 books"), intercept = (b0 + books4c * books),
slope = (attend + books4c * interaction), show_guide = TRUE) + guides(color = guide_legend(title = "Number of books read"))
# Draw a graph with GRADE on the Y-axis and BOOKS on the X-axis. Draw 3
# lines depicting the regressions of BOOKS for students who have attended
# 10, 15, and 20 times.
attend10c <- 10 - mean(data$ATTEND)
attend15c <- 15 - mean(data$ATTEND)
attend20c <- 20 - mean(data$ATTEND)
ggplot(data, aes(x = BOOKSc, y = GRADE)) + geom_point(shape = 1) + geom_abline(aes(color = "Attended 10 times"),
intercept = (b0 + attend10c * attend), slope = (books + attend10c * interaction),
show_guide = TRUE) + geom_abline(aes(color = "Attended 15 times"), intercept = (b0 +
attend15c * attend), slope = (books + attend15c * interaction), show_guide = TRUE) +
geom_abline(aes(color = "Attended 20 times"), intercept = (b0 + attend20c *
attend), slope = (books + attend20c * interaction), show_guide = TRUE) +
guides(color = guide_legend(title = "Number of classes attended"))
Repeat #3, only this time you should first z-score all three variables (GRADE, BOOKS, and ATTEND), and then run the regression on the z-scored variables, including the product of z-BOOKS times z-ATTEND.
-
(5a) Write the algebraic equation representing the results of this analysis three times. The first time, write it in standard form. The second time, rearrange it so that you can easily see the conditional slope of BOOKS. The third time, rearrange again so you can easily see the conditional slope of ATTEND.
-
(5b) Draw a graph with GRADE on the Y-axis and ATTEND on the X-axis. Draw 3 lines depicting the regressions of ATTEND: one line for students who have read an average number of books, one line for students whose value on BOOKS is 1 standard deviation below the mean, and one line for students whose value on BOOKS is 1 standard deviation above the mean.
-
(5c) Draw a graph with GRADE on the Y-axis and BOOKS on the X-axis. Draw 3 lines depicting the regressions of BOOKS: one line for students who have attended an average number of times, one line for students whose attendance is 1 standard deviation below the mean, and one line for students whose attendance is 1 standard deviation above the mean.
data$ATTENDz <- (data$ATTEND - mean(data$ATTEND))/sd(data$ATTEND)
data$BOOKSz <- (data$BOOKS - mean(data$BOOKS))/sd(data$BOOKS)
# Run a multiple regression in which you regress GRADE on BOOKSz, ATTENDz,
# and the interaction of BOOKSz with ATTENDz
model5 <- lm(GRADE ~ ATTENDz * BOOKSz, data = data)
summary(model5)
##
## Call:
## lm(formula = GRADE ~ ATTENDz * BOOKSz, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.84 -11.85 1.47 8.16 34.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.60 2.32 26.57 <2e-16 ***
## ATTENDz 5.70 2.40 2.37 0.023 *
## BOOKSz 5.95 2.40 2.48 0.018 *
## ATTENDz:BOOKSz 4.50 2.14 2.10 0.042 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 36 degrees of freedom
## Multiple R-squared: 0.402, Adjusted R-squared: 0.352
## F-statistic: 8.07 on 3 and 36 DF, p-value: 0.000306
b0 <- round(model5$coeff[1], digits = 2)
attend <- round(model5$coeff[2], digits = 2)
books <- round(model5$coeff[3], digits = 2)
interaction <- round(model5$coeff[4], digits = 2)
# Write the algebraic equation representing the results of this analysis
# three times. The first time, write it in standard form.
print(paste("GRADE = ", b0, " + ", attend, "*ATTENDz", " + ", books, "*BOOKSz + ",
interaction, "*ATTENDz*BOOKSz + e", sep = ""))
## [1] "GRADE = 61.6 + 5.7*ATTENDz + 5.95*BOOKSz + 4.5*ATTENDz*BOOKSz + e"
# The second time, rearrange it so that you can easily see the conditional
# slope of BOOKSz.
print(paste("GRADE = ", b0, " + ", attend, "*ATTENDz", " + (", books, " + ",
interaction, "*ATTENDz)*BOOKSz + e", sep = ""))
## [1] "GRADE = 61.6 + 5.7*ATTENDz + (5.95 + 4.5*ATTENDz)*BOOKSz + e"
# The third time, rearrange again so you can easily see the conditional
# slope of ATTENDz.
print(paste("GRADE = ", b0, " + (", attend, " + ", interaction, "*BOOKSz)*ATTENDz + ",
books, "*BOOKSz + e", sep = ""))
## [1] "GRADE = 61.6 + (5.7 + 4.5*BOOKSz)*ATTENDz + 5.95*BOOKSz + e"
# Draw a graph with GRADE on the Y-axis and ATTEND on the X-axis. Draw 3
# lines depicting the regressions of ATTEND: one line for students who have
# read an average number of books, one line for students whose value on
# BOOKS is 1 standard deviation below the mean, and one line for students
# whose value on BOOKS is 1 standard deviation above the mean.
ggplot(data, aes(x = ATTENDz, y = GRADE)) + geom_point(shape = 1) + geom_abline(aes(color = "low books (-1SD)"),
intercept = b0 + -1 * books, slope = attend + -1 * interaction, show_guide = TRUE) +
geom_abline(aes(color = "ave. books (mean)"), intercept = b0, slope = attend,
show_guide = TRUE) + geom_abline(aes(color = "high books (+1SD)"), intercept = b0 +
1 * books, slope = attend + 1 * interaction, show_guide = TRUE) + guides(color = guide_legend(title = "Number of books read"))
# Draw a graph with GRADE on the Y-axis and BOOKS on the X-axis. Draw 3
# lines depicting the regressions of BOOKS: one line for students who have
# attended an average number of times, one line for students whose
# attendance is 1 standard deviation below the mean, and one line for
# students whose attendance is 1 standard deviation above the mean.
ggplot(data, aes(x = BOOKSz, y = GRADE)) + geom_point(shape = 1) + geom_abline(aes(colour = "low attendance (-1SD)"),
intercept = b0 + -1 * attend, slope = books + -1 * interaction, size = 1,
fullrange = T, show_guide = TRUE) + geom_abline(aes(colour = "ave. attendance (mean)"),
intercept = b0, slope = books, size = 1, fullrange = T, show_guide = TRUE) +
geom_abline(aes(colour = "high attendance (+1SD)"), intercept = b0 + 1 *
attend, slope = books + 1 * interaction, size = 1, fullrange = T, show_guide = TRUE) +
guides(colour = guide_legend(title = "Attendance"))
Dataset source: http://www.jeremymiles.co.uk/regressionbook/data/
And John has another great way to do simple slopes in ggplot2!
I wanted to share this way of doing the simple slopes using the 'predict' function. This also demonstrates how to produce data on the fly -- good for reproducible examples!
#Replace this with your data.
# For now, making up new stuff.
summary(books<-round(runif(100,from=0,to=4),0)) #Get number of books from a uniform distribution from 0-4
summary(attend<-rnorm(100,mean=14,sd=4.3)) #Get the number of days attended from a normal distribution
summary(grade<-55-.137*attend-6.2*books+.74*attend*books+ #The grade is related to books and attend...
rnorm(100,0,20)) #...plus unmeasured things!
head(theData<-data.frame(books=books,
attend=attend,
grade=grade))
#Update below is cool, but not necessary. It's just
# an easy way to make nested models. Maybe you've
# come across something like this or better?
summary(mod1<-lm(grade~attend,data=theData))
summary(mod3<-update(mod1,.~.+books+attend:books))
#Uses a model to get predicted values for each row –
# you can use the original data or new data
(theData$Predicted_Grade<-predict(mod3,theData,type='response'))
require(plyr) #for the '.' function
require(ggplot2)
(plot3a<-ggplot(theData, aes(y=Predicted_Grade,x=attend)) +
#I subset the data for geom_line or else we get a line for every value of books
geom_line(subset= .(books %in% c(0,2,4)), aes(colour = as.factor(books)), size = 1)+
geom_point(aes(y=grade,x=attend))+scale_color_discrete(name='Books'))
#One nice thing about this method is that the lines don't extend past the
# data. So honest!