Tagged: ggplot2

Dealing with “missing”/out of bounds values in heatmaps

I was tinkering around in R to see if I could plot better-looking heatmaps, when I encountered an issue regarding how specific values are represented in plots with user-specified restricted ranges. When I’m plotting heatmaps, I usually want to restrict the range of the data being plotted in a specific way. In this case, I was plotting classifier accuracy across frequency and time for Fourier-transformed EEG data, and I wanted to restrict the lowest value of the heatmap to be chance level (in this case, 50%).

I used the following code to generate a simple heatmap:

library(R.matlab)  # for readMat()
library(reshape2)  # for melt()
library(ggplot2)
library(viridis)   # for scale_fill_viridis()

decoding <- readMat("/Users/Pablo/Desktop/testDecoding.mat")$plotData
plotDecoding <- melt(decoding)

quartz(width=5, height=4)  # macOS graphics device; use windows() or x11() on other platforms
ggplot(plotDecoding, aes(x = Var2, y = Var1)) +
  geom_tile(aes(fill = value)) +
  scale_fill_viridis(name = "", limits = c(0.5, 0.75))

And this is what I got:

[Figure: heatmap with gray tiles where values fall outside the specified limits]

See the gray areas? What’s happening here is that any data points falling outside my specified range (see the limits argument in scale_fill_viridis) are converted to NAs, and NA values are plotted as gray by default. These values obviously aren’t missing in the actual data, so how can we make sure they’re plotted? The solution comes from the scales package, which is installed along with ggplot2. Whenever you create a plot with specified limits, include the argument oob = squish (oob stands for “out of bounds”) in the same call where you set the limits, and make sure the scales package is loaded. For example:

scale_fill_viridis(name = "", limits = c(0.5, 0.75), oob = squish)

What squish does is represent values that fall below the specified range with the minimum color, and values that fall above the range with the maximum color (which seems to be what Matlab defaults to). The resulting heatmap looks like this:

[Figure: heatmap with out-of-bounds values squished to the ends of the color scale]
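
If you want to try this without my EEG data, here’s a minimal self-contained sketch using simulated values (anything below 0.5 or above 0.75 gets squished to the ends of the color scale):

library(ggplot2)
library(reshape2)
library(viridis)
library(scales)

# simulate a 10 x 20 "accuracy" matrix with values on both sides of the limits
set.seed(1)
fake <- matrix(runif(200, min = 0.40, max = 0.85), nrow = 10)
plotFake <- melt(fake)

# squish() itself just clamps a vector to a range:
squish(c(0.40, 0.60, 0.80), range = c(0.5, 0.75))
## [1] 0.50 0.60 0.75

ggplot(plotFake, aes(x = Var2, y = Var1)) +
  geom_tile(aes(fill = value)) +
  scale_fill_viridis(name = "", limits = c(0.5, 0.75), oob = squish)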

A simple solution to an annoying little quirk in R!

Looking at interactions of continuous variables

So we had an unexpected interaction effect between two continuous predictors in our dataset. I thought it was some kind of ceiling effect, so I made some plots to see what was causing it. I still think it’s a ceiling effect, but I also found it interesting that the effect of predictor 1 is only really present at average levels of predictor 2. Anyway, I’ve attached my code; if you have an interaction that you can’t quite understand, these plots may help: http://rpubs.com/abcharlton/interaction-plots. A rough sketch of the general approach follows the figures below.

[Figures: interaction plots from the linked RPubs page]
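
The RPubs page has the full code, but here’s a rough sketch of the general idea with simulated data and hypothetical variable names (p1 and p2, not our actual dataset): fit the model, predict across one predictor at representative levels of the other, and draw one line per level.

library(ggplot2)

# simulate two continuous predictors and an outcome with an interaction
set.seed(123)
d <- data.frame(p1 = rnorm(200), p2 = rnorm(200))
d$y <- d$p1 + d$p2 + d$p1 * d$p2 + rnorm(200)
fit <- lm(y ~ p1 * p2, data = d)

# predict across the range of p1 at low / mean / high values of p2
grid <- expand.grid(p1 = seq(min(d$p1), max(d$p1), length.out = 100),
                    p2 = mean(d$p2) + c(-1, 0, 1) * sd(d$p2))
grid$yhat <- predict(fit, newdata = grid)
grid$p2_level <- factor(grid$p2, labels = c("low (-1SD)", "mean", "high (+1SD)"))

# non-parallel lines are the interaction
ggplot(grid, aes(x = p1, y = yhat, color = p2_level)) +
  geom_line(lwd = 1)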

Conditional Density Plots in ggplot2

My contribution to ggplot2 day, from a distance 🙂 http://rpubs.com/rosemm/cond_density_plots

The example is with some real data of mine, which I can’t share with you all just yet, sadly. You can apply it pretty quickly to other data, though, for example:
ggplot(diamonds, aes(carat, ..count.., fill = cut)) +
  geom_density(position = "fill")

This post (http://stackoverflow.com/questions/14570293/special-variables-in-ggplot-count-density-etc) has some information about the weird ..count.. thing going on in the aes() mapping, for those who are curious. I also encourage you to play around with the adjust argument in geom_density() to see how that changes your plot. Have fun!
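
For instance, here’s the same plot with the bandwidth halved; smaller values of adjust make the densities follow the data more closely, while larger values smooth them out:

library(ggplot2)

ggplot(diamonds, aes(carat, ..count.., fill = cut)) +
  geom_density(position = "fill", adjust = 1/2)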

Plotting logistic regressions, Part 3

If you haven’t already, check out plotting logistic regression part 1 (continuous by categorical interactions) and part 2 (continuous by continuous interactions).

All of this code is available on Rose’s github: https://github.com/rosemm/rexamples/blob/master/logistic_regression_plotting_part3.Rmd

Plotting the results of your logistic regression Part 3: 3-way interactions

If you can interpret a 3-way interaction without plotting it, go find a mirror and give yourself a big sexy wink. That’s impressive.

For the rest of us, looking at plots will make understanding the model and results so much easier. And even if you are one of those lucky analysts with the working memory capacity of a super computer, you may want this code so you can use plots to help communicate a 3-way interaction to your readers.

Use the model from the Part 1 code.

Here’s that model:

summary(model)
## 
## Call:
## glm(formula = DV ~ (X1 + X2 + group)^2, family = "binomial", 
##     data = data, na.action = "na.exclude")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44094  -0.45991   0.04136   0.52301   2.74705  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.5873     0.3438  -1.708 0.087631 .  
## X1            2.6508     0.5592   4.740 2.13e-06 ***
## X2           -2.2599     0.4977  -4.540 5.61e-06 ***
## groupb        2.2111     0.5949   3.717 0.000202 ***
## groupc        0.6650     0.4131   1.610 0.107456    
## X1:X2         0.1201     0.2660   0.452 0.651534    
## X1:groupb     2.7323     1.2977   2.105 0.035253 *  
## X1:groupc    -0.6816     0.7078  -0.963 0.335531    
## X2:groupb     0.8477     0.7320   1.158 0.246882    
## X2:groupc     0.4683     0.6558   0.714 0.475165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 205.46  on 290  degrees of freedom
## AIC: 225.46
## 
## Number of Fisher Scoring iterations: 7

Let’s add a 3-way interaction. Instead of re-running the whole model, we can use the nifty update() function. This will make the change to the model (adding the 3-way interaction), and automatically refit the whole thing. (It is also fine to just re-run the model — you’ll get the exact same results. I just wanted to show off the update() function.)

new.model <- update(model, ~ . + X1:X2:group) # the . stands in for the whole formula we had before

# if you wanted to specify the whole model from scratch instead of using update():
new.model <- glm(DV ~ X1*X2*group, 
             data=data, na.action="na.exclude",  family="binomial")  

summary(new.model)
## 
## Call:
## glm(formula = DV ~ X1 * X2 * group, family = "binomial", data = data, 
##     na.action = "na.exclude")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.36818  -0.39475   0.02394   0.45860   2.82512  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.8959     0.4037  -2.219 0.026491 *  
## X1             2.7511     0.5914   4.652 3.29e-06 ***
## X2            -2.3070     0.5076  -4.545 5.49e-06 ***
## groupb         2.5457     0.6776   3.757 0.000172 ***
## groupc         1.3403     0.5201   2.577 0.009958 ** 
## X1:X2          0.6779     0.4314   1.572 0.116057    
## X1:groupb      3.8321     1.6589   2.310 0.020882 *  
## X1:groupc     -0.6709     0.7412  -0.905 0.365349    
## X2:groupb      1.1732     0.7623   1.539 0.123806    
## X2:groupc      0.1871     0.6975   0.268 0.788483    
## X1:X2:groupb   1.1108     0.8198   1.355 0.175458    
## X1:X2:groupc  -1.4068     0.5899  -2.385 0.017082 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 194.51  on 288  degrees of freedom
## AIC: 218.51
## 
## Number of Fisher Scoring iterations: 7

Calculate probabilities for the plot

Again, we’ll put X1 on the x-axis. That’s the only variable we’ll enter as a whole range.

X1_range <- seq(from=min(data$X1), to=max(data$X1), by=.01)

Next, compute the equations for each line in logit terms.

Pick some representative values for the other continuous variable

Just like last time, we’ll plug in some representative values for X2, so we can have separate lines for each representative level of X2.

X2_l <- mean(data$X2) - sd(data$X2) 
X2_m <- mean(data$X2)
X2_h <- mean(data$X2) + sd(data$X2)
# check that your representative values actually fall within the observed range for that variable
summary(data$X2)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.52900 -0.70950 -0.02922 -0.04995  0.67260  2.38000
c(X2_l, X2_m, X2_h)
## [1] -1.0621963 -0.0499475  0.9623013

Now we can go ahead and plug those values into the rest of the equation to get the expected logits across the range of X1 for each of our “groups” (hypothetical low X2 people, hypothetical average X2 people, hypothetical high X2 people). We’ll also plug in the dummy codes for each of the three groups (a, b, and c). And we’ll calculate the predicted probabilities of the DV for each combination of X2 level and group.

But instead of literally writing out all of the equations (9 of them!!), we’ll just use the fun-and-easy predict() function.

If you ran your model in SPSS, so you only have the coefficients and not the whole model as an R object, you can still make the plots — you just need to spend some quality time writing out those equations. For examples of how to do this (for just 3 equations, but you get the idea) see Part 1 and Part 2 in this series.

To use predict(), you make a new data frame with the predictor values you want to use (i.e. the whole range for X1, group a, and the representative values we picked for X2), and then when you run predict() on it, for each row in the data frame it will generate the predicted value for your DV from the model you saved. The expand.grid() function is a quick and easy way to make a data frame out of all possible combinations of the variables provided. Perfect for this situation!

#make a new data frame with the X values you want to predict 
generated_data <- as.data.frame(expand.grid(X1=X1_range, X2=c(X2_l, X2_m, X2_h), group=c("a", "b", "c") ))
head(generated_data)
##          X1        X2 group
## 1 -2.770265 -1.062196     a
## 2 -2.760265 -1.062196     a
## 3 -2.750265 -1.062196     a
## 4 -2.740265 -1.062196     a
## 5 -2.730265 -1.062196     a
## 6 -2.720265 -1.062196     a
#use `predict` to get the probability using type='response' rather than 'link' 
generated_data$prob <- predict(new.model, newdata=generated_data, type = 'response')
head(generated_data) 
##          X1        X2 group       prob
## 1 -2.770265 -1.062196     a 0.01675787
## 2 -2.760265 -1.062196     a 0.01709584
## 3 -2.750265 -1.062196     a 0.01744050
## 4 -2.740265 -1.062196     a 0.01779198
## 5 -2.730265 -1.062196     a 0.01815042
## 6 -2.720265 -1.062196     a 0.01851595
# let's make a factor version of X2, so we can do gorgeous plotting stuff with it later :)
generated_data$X2_level <- factor(generated_data$X2, labels=c("low (-1SD)", "mean", "high (+1SD)"), ordered=T)
summary(generated_data)
##        X1                 X2           group         prob        
##  Min.   :-2.77027   Min.   :-1.06220   a:1683   Min.   :0.00000  
##  1st Qu.:-1.37027   1st Qu.:-1.06220   b:1683   1st Qu.:0.02416  
##  Median : 0.02974   Median :-0.04995   c:1683   Median :0.56304  
##  Mean   : 0.02974   Mean   :-0.04995            Mean   :0.51727  
##  3rd Qu.: 1.42973   3rd Qu.: 0.96230            3rd Qu.:0.99162  
##  Max.   : 2.82973   Max.   : 0.96230            Max.   :1.00000  
##         X2_level   
##  low (-1SD) :1683  
##  mean       :1683  
##  high (+1SD):1683  
##                    
##                    
## 

Plot time!

This kind of situation is exactly when ggplot2 really shines. We want multiple plots, with multiple lines on each plot. Of course, this is totally possible in base R (see Part 1 and Part 2 for examples), but it is so much easier in ggplot2. To do this in base R, you would need to generate a plot with one line (e.g. group a, low X2), then add the additional lines one at a time (group a, mean X2; group a, high X2), then generate a new plot (group b, low X2), then add two more lines, then generate a new plot, then add two more lines. Sigh.

Not to go down too much of a rabbit hole, but this illustrates what is (in my opinion) the main difference between base R graphics and ggplot2: base graphics are built for drawing, whereas ggplot is built for visualizing data. It’s the difference between specifying each line and drawing them on your plot vs. giving a whole data frame to the plotting function and telling it which variables to use and how. Depending on your needs and preferences, base graphics or ggplot may be a better choice for you. For plotting complex model output, like a 3-way interaction, I think you’ll generally find that ggplot2 saves the day.

library(ggplot2)

plot.data <- generated_data

# check out your plotting data
head(plot.data)
##          X1        X2 group       prob   X2_level
## 1 -2.770265 -1.062196     a 0.01675787 low (-1SD)
## 2 -2.760265 -1.062196     a 0.01709584 low (-1SD)
## 3 -2.750265 -1.062196     a 0.01744050 low (-1SD)
## 4 -2.740265 -1.062196     a 0.01779198 low (-1SD)
## 5 -2.730265 -1.062196     a 0.01815042 low (-1SD)
## 6 -2.720265 -1.062196     a 0.01851595 low (-1SD)
ggplot(plot.data, aes(x=X1, y=prob, color=X2_level)) + 
  geom_line(lwd=2) + 
  labs(x="X1", y="P(outcome)", title="Probability of super important outcome") +
  facet_wrap(~group) # i love facet_wrap()! it's so great. you should fall in love, too, and use it all the time.

# let's try flipping it, so the facets are by X2 level and the lines are by group
ggplot(plot.data, aes(x=X1, y=prob, color=group)) + 
  geom_line(lwd=2) + 
  labs(x="X1", y="P(outcome)", title="Probability of super important outcome") +
  facet_wrap(~X2_level) 

# want it all on one plot? you can set the color to the interaction of group and X2_level:
ggplot(plot.data, aes(x=X1, y=prob, color=group:X2_level)) + 
  geom_line(lwd=2) + 
  labs(x="X1", y="P(outcome)", title="Probability of super important outcome")

For something like this, you may want to manually specify the colors, to make the plot easier to read. For more details about manually setting colors in ggplot2, see this R Club post.

library(RColorBrewer)
# pick some nice colors. I want all shades of red for a, green for b, and blue for c.
a_colors <- brewer.pal(9,"Reds")[c(4,6,9)] # I'm getting all 9 shades and just picking the 3 I want to use 
b_colors <- brewer.pal(9,"Greens")[c(4,6,9)]
c_colors <- brewer.pal(9,"Blues")[c(4,6,9)] 
colors <- c(a_colors, b_colors, c_colors)
colors # this is how R saves color values
## [1] "#FC9272" "#EF3B2C" "#67000D" "#A1D99B" "#41AB5D" "#00441B" "#9ECAE1"
## [8] "#4292C6" "#08306B"
ggplot(plot.data, aes(x=X1, y=prob, color=group:X2_level)) + 
  geom_line(lwd=2) + 
  labs(x="X1", y="P(outcome)", title="Probability of super important outcome") +
  scale_color_manual(values=colors) 

# you can also change the line type based on a factor
ggplot(plot.data, aes(x=X1, y=prob, color=group:X2_level)) + 
  geom_line(aes(linetype=X2_level), lwd=2) + # because linetype is inside aes(), it can vary according to the data (i.e. by X2_level)
  labs(x="X1", y="P(outcome)", title="Probability of super important outcome") +
  scale_color_manual(values=colors) 



Plotting logistic regression models, part 2

If you haven’t already, check out plotting logistic regression part 1 (continuous by categorical interactions).

All of this code is available on Rose’s github: https://github.com/rosemm/rexamples/blob/master/logistic_regression_plotting_part2.Rmd

Plotting the results of your logistic regression Part 2: Continuous by continuous interaction

Last time, we ran a nice, complicated logistic regression and made a plot of the continuous by categorical interaction. This time, we’ll use the same model, but plot the interaction between the two continuous predictors instead, which is a little weirder (hence part 2).

Use the model from the Part 1 code.

Here’s that model:

summary(model)
## 
## Call:
## glm(formula = DV ~ (X1 + X2 + group)^2, family = "binomial", 
##     data = data, na.action = "na.exclude")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44094  -0.45991   0.04136   0.52301   2.74705  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.5873     0.3438  -1.708 0.087631 .  
## X1            2.6508     0.5592   4.740 2.13e-06 ***
## X2           -2.2599     0.4977  -4.540 5.61e-06 ***
## groupb        2.2111     0.5949   3.717 0.000202 ***
## groupc        0.6650     0.4131   1.610 0.107456    
## X1:X2         0.1201     0.2660   0.452 0.651534    
## X1:groupb     2.7323     1.2977   2.105 0.035253 *  
## X1:groupc    -0.6816     0.7078  -0.963 0.335531    
## X2:groupb     0.8477     0.7320   1.158 0.246882    
## X2:groupc     0.4683     0.6558   0.714 0.475165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 205.46  on 290  degrees of freedom
## AIC: 225.46
## 
## Number of Fisher Scoring iterations: 7

And we already saved the coefficients individually for use in the equations last time.

Calculate probabilities for the plot

Again, we’ll put X1 on the x-axis. That’s the only variable we’ll enter as a whole range.

X1_range <- seq(from=min(data$X1), to=max(data$X1), by=.01)

Next, compute the equations for each line in logit terms.

Pick some representative values for the other continuous variable

Just like last time, we’ll need to plug in values for all but one variable (X1, which is going on the x-axis of the plot), but this time we’ll pick some representative values for the other continuous predictor, X2, and plug those in to get a separate line for each representative value of X2. Typical choices are high (1SD above the mean), medium (the mean), and low (1SD below the mean) X2. Another great choice is max, median, and min. You can also do 4 or 5 lines instead of just 3, if you want. It’s your party.

Whatever you decide, I recommend checking to make sure the “representative” values you’re plugging in actually make sense given your data. For example, you may not actually have any cases with X2 value 1SD above the mean, in which case maybe you just want to put in max(X2) for the high case instead. It’s kinda weird to plot your models at values that don’t actually exist in your data (cue Twilight Zone music).

summary(data$X2)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.52900 -0.70950 -0.02922 -0.04995  0.67260  2.38000
(X2_l <- mean(data$X2) - sd(data$X2) )
## [1] -1.062196
(X2_m <- mean(data$X2) )
## [1] -0.0499475
(X2_h <- mean(data$X2) + sd(data$X2) )
## [1] 0.9623013

Now we can go ahead and plug those values into the rest of the equation to get the expected logits across the range of X1 for each of our “groups” (hypothetical low X2 people, hypothetical average X2 people, hypothetical high X2 people).

If you ran your model in SPSS and you just have the coefficients…

You’ll need to actually calculate the predicted probabilities yourself. Write out the equation for your model and plug in values for everything except the variable that will go on the x-axis.

Remember, these equations need to include every coefficient for the model you ran, whether or not you actually care about plotting them.

In this case, lots of it will just drop out because I’ll be plugging in 0’s for all of the dummy codes (we’re only looking at group a), but I encourage you to keep the terms in your code, so you don’t forget that all of those predictors are still in your model; you’re just holding them constant while you plot. We’re not interested in plotting the categorical predictor right now, but it’s still there in the model, so we need to just pick a group from it and enter the dummy codes for it. The plot will show us the interaction between X1 and X2 for the reference group (for 3-way interactions, you’ll have to wait for part 3!).

X2_l_logits <- b0 + 
  X1*X1_range + 
  X2*X2_l + 
  groupb*0 + 
  groupc*0 + 
  X1.X2*X1_range*X2_l + 
  X1.groupb*X1_range*0 + 
  X1.groupc*X1_range*0 + 
  X2.groupb*X2_l*0 + 
  X2.groupc*X2_l*0 

X2_m_logits <- b0 + 
  X1*X1_range + 
  X2*X2_m + 
  groupb*0 + 
  groupc*0 + 
  X1.X2*X1_range*X2_m + 
  X1.groupb*X1_range*0 + 
  X1.groupc*X1_range*0 + 
  X2.groupb*X2_m*0 + 
  X2.groupc*X2_m*0 

X2_h_logits <- b0 + 
  X1*X1_range + 
  X2*X2_h + 
  groupb*0 + 
  groupc*0 + 
  X1.X2*X1_range*X2_h + 
  X1.groupb*X1_range*0 + 
  X1.groupc*X1_range*0 + 
  X2.groupb*X2_h*0 + 
  X2.groupc*X2_h*0 

# Compute the probabilities (this is what will actually get plotted):
X2_l_probs <- exp(X2_l_logits)/(1 + exp(X2_l_logits))
X2_m_probs <- exp(X2_m_logits)/(1 + exp(X2_m_logits))
X2_h_probs <- exp(X2_h_logits)/(1 + exp(X2_h_logits))
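
By the way, base R’s plogis() computes exactly this inverse-logit transformation, so the three conversions above can also be written more compactly:

# plogis(x) is base R's inverse logit, identical to exp(x)/(1 + exp(x))
X2_l_probs <- plogis(X2_l_logits)
X2_m_probs <- plogis(X2_m_logits)
X2_h_probs <- plogis(X2_h_logits)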

If you ran your model in R, then you can just use predict()

Easy peasy! And, most importantly, less typing — which means fewer errors. Thanks to John for reminding me of this handy function! You make a new data frame with the predictor values you want to use (i.e. the whole range for X1, group a, and the representative values we picked for X2), and then when you run predict() on it, for each row in the data frame it will generate the predicted value for your DV from the model you saved. The expand.grid() function is a quick and easy way to make a data frame out of all possible combinations of the variables provided. Perfect for this situation!

#make a new data frame with the X values you want to predict 
generated_data <- as.data.frame(expand.grid(X1=X1_range, X2=c(X2_l, X2_m, X2_h), group="a") )
head(generated_data)
##          X1        X2 group
## 1 -2.770265 -1.062196     a
## 2 -2.760265 -1.062196     a
## 3 -2.750265 -1.062196     a
## 4 -2.740265 -1.062196     a
## 5 -2.730265 -1.062196     a
## 6 -2.720265 -1.062196     a
summary(generated_data)
##        X1                 X2           group   
##  Min.   :-2.77027   Min.   :-1.06220   a:1683  
##  1st Qu.:-1.37027   1st Qu.:-1.06220           
##  Median : 0.02974   Median :-0.04995           
##  Mean   : 0.02974   Mean   :-0.04995           
##  3rd Qu.: 1.42973   3rd Qu.: 0.96230           
##  Max.   : 2.82973   Max.   : 0.96230
#use `predict` to get the probability using type='response' rather than 'link' 
generated_data$prob <- predict(model, newdata=generated_data, type = 'response')
head(generated_data) 
##          X1        X2 group        prob
## 1 -2.770265 -1.062196     a 0.005614200
## 2 -2.760265 -1.062196     a 0.005756836
## 3 -2.750265 -1.062196     a 0.005903074
## 4 -2.740265 -1.062196     a 0.006053005
## 5 -2.730265 -1.062196     a 0.006206719
## 6 -2.720265 -1.062196     a 0.006364312
# let's make a factor version of X2, so we can do gorgeous plotting stuff with it later :)
generated_data$X2_level <- factor(generated_data$X2, labels=c("low (-1SD)", "mean", "high (+1SD)"), ordered=T)
head(generated_data) 
##          X1        X2 group        prob   X2_level
## 1 -2.770265 -1.062196     a 0.005614200 low (-1SD)
## 2 -2.760265 -1.062196     a 0.005756836 low (-1SD)
## 3 -2.750265 -1.062196     a 0.005903074 low (-1SD)
## 4 -2.740265 -1.062196     a 0.006053005 low (-1SD)
## 5 -2.730265 -1.062196     a 0.006206719 low (-1SD)
## 6 -2.720265 -1.062196     a 0.006364312 low (-1SD)

Plot time!

In base R…

# We'll start by plotting the low X2 group:
plot(X1_range, X2_l_probs, 
     ylim=c(0,1),
     type="l", 
     lwd=3, 
     lty=2, 
     col="red", 
     xlab="X1", ylab="P(outcome)", main="Probability of super important outcome")


# Add the line for mean X2
lines(X1_range, X2_m_probs, 
      type="l", 
      lwd=3, 
      lty=3, 
      col="green")

# Add the line for high X2
lines(X1_range, X2_h_probs, 
      type="l", 
      lwd=3, 
      lty=4, 
      col="blue")

# add a horizontal line at p=.5
abline(h=.5, lty=2)

Or, you can do it in ggplot2!

library(ggplot2); library(tidyr)
# first you have to get the information into a long dataframe, which is what ggplot likes :)

# if you calculated the predicted probabilities by writing out the equations, you can combine that all into a dataframe now, and use gather() to make it long
plot.data <- data.frame(low=X2_l_probs, mean=X2_m_probs, high=X2_h_probs, X1=X1_range)
plot.data <- gather(plot.data, key=X2_level, value=prob, -X1) # this means gather all of the columns except X1

# if you used predict(), then everything is already in a nice dataframe for you
plot.data <- generated_data

# check out your plotting data
head(plot.data)
##          X1        X2 group        prob   X2_level
## 1 -2.770265 -1.062196     a 0.005614200 low (-1SD)
## 2 -2.760265 -1.062196     a 0.005756836 low (-1SD)
## 3 -2.750265 -1.062196     a 0.005903074 low (-1SD)
## 4 -2.740265 -1.062196     a 0.006053005 low (-1SD)
## 5 -2.730265 -1.062196     a 0.006206719 low (-1SD)
## 6 -2.720265 -1.062196     a 0.006364312 low (-1SD)
ggplot(plot.data, aes(x=X1, y=prob, color=X2_level)) + 
  geom_line(lwd=2) + 
  labs(x="X1", y="P(outcome)", title="Probability of super important outcome") 



Plotting your logistic regression models

This code is all available on Rose’s github: https://github.com/rosemm/rexamples/blob/master/logistic_regression_plotting_part1.Rmd

Plotting the results of your logistic regression Part 1: Continuous by categorical interaction

We’ll run a nice, complicated logistic regression and then make a plot that highlights a continuous by categorical interaction.

set.seed(24601) # setting this so the random results will be repeatable 

library(MASS)
covmat <- matrix(c(1.0,   0.2,   0.6, 
                   0.2,   1.0,  -0.5, 
                   0.6,  -0.5,   1.0), nrow=3) # the true cov matrix for my data
data <- mvrnorm(300, mu=c(0,0,0), Sigma=covmat) # generate random data that match that cov matrix
colnames(data) <- c("X1", "X2", "DV")
data <- as.data.frame(data)
data$group <- gl(n=3, k=ceiling(nrow(data)/3), labels=c("a", "b", "c")) # three groups of 100; the number of labels must match n
# add some group differences and interaction stuff...
data$DV <- with(data, ifelse(group=="c" & X1 > 0, DV+rnorm(n=1, mean=1), 
                             ifelse(group=="b" & X1 > 0, DV+rnorm(n=1, mean=2) , DV)))
# make DV binary
data$DV <- ifelse(data$DV > 0, 1, 0)
head(data)
##            X1         X2 DV group
## 1  0.75191377  0.2978933  1     a
## 2 -1.45574399 -0.6868934  0     a
## 3 -0.07297941  0.2450438  0     a
## 4  0.91090328 -1.6934376  1     a
## 5  1.19905258  0.4765307  0     a
## 6  0.42466903 -0.9937526  1     a

Get the coefficients from your logistic regression model

First, whenever you’re using a categorical predictor in a model in R (or anywhere else, for that matter), make sure you know how it’s being coded!! For this example, we want it dummy coded (so we can easily plug in 0’s and 1’s to get equations for the different groups). In R, this coding scheme is called contr.treatment.

contrasts(data$group)
##   b c
## a 0 0
## b 1 0
## c 0 1
# Great, that's what we wanted. And we can see that a is the reference group.
# If you want to change the contrasts, you can add an argument to the glm() call. For example, contrasts=list(group="contr.treatment") will force traditional dummy coding if it isn't already the default.

Now we can run that model.

# note: the ^2 on the right-hand side of the formula expands to all main effects plus all 2-way interactions
model <- glm(DV ~ (X1 + X2 + group)^2, 
             data=data, na.action="na.exclude",  family="binomial") 
             
summary(model)
## 
## Call:
## glm(formula = DV ~ (X1 + X2 + group)^2, family = "binomial", 
##     data = data, na.action = "na.exclude")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44094  -0.45991   0.04136   0.52301   2.74705  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.5873     0.3438  -1.708 0.087631 .  
## X1            2.6508     0.5592   4.740 2.13e-06 ***
## X2           -2.2599     0.4977  -4.540 5.61e-06 ***
## groupb        2.2111     0.5949   3.717 0.000202 ***
## groupc        0.6650     0.4131   1.610 0.107456    
## X1:X2         0.1201     0.2660   0.452 0.651534    
## X1:groupb     2.7323     1.2977   2.105 0.035253 *  
## X1:groupc    -0.6816     0.7078  -0.963 0.335531    
## X2:groupb     0.8477     0.7320   1.158 0.246882    
## X2:groupc     0.4683     0.6558   0.714 0.475165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 205.46  on 290  degrees of freedom
## AIC: 225.46
## 
## Number of Fisher Scoring iterations: 7
model$coef
## (Intercept)          X1          X2      groupb      groupc       X1:X2 
##  -0.5872841   2.6508212  -2.2599250   2.2110951   0.6649971   0.1201166 
##   X1:groupb   X1:groupc   X2:groupb   X2:groupc 
##   2.7323113  -0.6816327   0.8476695   0.4682893
# save the coefficient values so we can use them in the equations
b0 <- model$coef[1] # intercept
X1 <- model$coef[2]
X2 <- model$coef[3]
groupb <- model$coef[4]
groupc <- model$coef[5]
X1.X2 <- model$coef[6]
X1.groupb <- model$coef[7]
X1.groupc <- model$coef[8]
X2.groupb <- model$coef[9]
X2.groupc <- model$coef[10]

Note: If you were working in SPSS (or for some other reason you have run a model but can’t generate a plot for it), you can enter in your coefficients here, like this:

b0 <- -0.5872841 # intercept
X1 <- 2.6508212
X2 <- -2.2599250
groupb <- 2.2110951
groupc <- 0.6649971
X1.X2 <- 0.1201166
X1.groupb <- 2.7323113
X1.groupc <- -0.6816327
X2.groupb <- 0.8476695
X2.groupc <- 0.4682893

Calculate probabilities for the plot

First, decide what variable you want on your x-axis. That’s the only variable we’ll enter as a whole range. (The range we set here will determine the range on the x-axis of the final plot, by the way.)

X1_range <- seq(from=min(data$X1), to=max(data$X1), by=.01)

Next, compute the equations for each group in logit terms. These equations need to include every coefficient for the model you ran. You’ll need to plug in values for all but one variable – whichever variable you decided will be displayed on the x-axis of your plot. You make a separate equation for each group by plugging in different values for the group dummy codes.

X2_val <- mean(data$X2) # by plugging in the mean as the value for X2, I'll be generating plots that show the relationship between X1 and the outcome "for someone with an average X2".

a_logits <- b0 + 
  X1*X1_range + 
  X2*X2_val + 
  groupb*0 + 
  groupc*0 + 
  X1.X2*X1_range*X2_val + 
  X1.groupb*X1_range*0 + 
  X1.groupc*X1_range*0 + 
  X2.groupb*X2_val*0 + 
  X2.groupc*X2_val*0 # the reference group

b_logits <- b0 + 
  X1*X1_range + 
  X2*X2_val + 
  groupb*1 + 
  groupc*0 + 
  X1.X2*X1_range*X2_val + 
  X1.groupb*X1_range*1 + 
  X1.groupc*X1_range*0 + 
  X2.groupb*X2_val*1 + 
  X2.groupc*X2_val*0

c_logits <- b0 + 
  X1*X1_range + 
  X2*X2_val + 
  groupb*0 + 
  groupc*1 + 
  X1.X2*X1_range*X2_val + 
  X1.groupb*X1_range*0 + 
  X1.groupc*X1_range*1 + 
  X2.groupb*X2_val*0 + 
  X2.groupc*X2_val*1

# Compute the probabilities (this is what will actually get plotted):
a_probs <- exp(a_logits)/(1 + exp(a_logits))
b_probs <- exp(b_logits)/(1 + exp(b_logits))
c_probs <- exp(c_logits)/(1 + exp(c_logits))

Plot time!

# We'll start by plotting the ref group:
plot(X1_range, a_probs, 
     ylim=c(0,1),
     type="l", 
     lwd=3, 
     lty=2, 
     col="gold", 
     xlab="X1", ylab="P(outcome)", main="Probability of super important outcome")


# Add the line for people who are in the b group
lines(X1_range, b_probs, 
      type="l", 
      lwd=3, 
      lty=3, 
      col="turquoise2")

# Add the line for people who are in the c group
lines(X1_range, c_probs, 
      type="l", 
      lwd=3, 
      lty=4, 
      col="orangered")

# add a horizontal line at p=.5
abline(h=.5, lty=2)

Or, you can do it in ggplot2!

library(ggplot2); library(tidyr)
# first you have to get the information into a long dataframe, which is what ggplot likes :)
plot.data <- data.frame(a=a_probs, b=b_probs, c=c_probs, X1=X1_range)
plot.data <- gather(plot.data, key=group, value=prob, a:c)
head(plot.data)
##          X1 group         prob
## 1 -2.770265     a 0.0003264137
## 2 -2.760265     a 0.0003351590
## 3 -2.750265     a 0.0003441385
## 4 -2.740265     a 0.0003533586
## 5 -2.730265     a 0.0003628255
## 6 -2.720265     a 0.0003725460
ggplot(plot.data, aes(x=X1, y=prob, color=group)) + # asking it to set the color by the variable "group" is what makes it draw three different lines
  geom_line(lwd=2) + 
  labs(x="X1", y="P(outcome)", title="Probability of super important outcome")