Category: nuggets

Dealing with “missing”/out of bounds values in heatmaps

I was tinkering around in R to see if I could plot better looking heatmaps, when I encountered an issue regarding how specific values are represented in plots with user-specified restricted ranges. When I’m plotting heatmaps, I usually want to restrict the range of the data being plotted in a specific way. In this case, I was plotting classifier accuracy across frequencies and time of Fourier-transformed EEG data, and I want to restrict the lowest value of the heatmap to be chance level (in this case, 50%).

I used the following code to generate a simple heatmap:

decoding <- readMat("/Users/Pablo/Desktop/testDecoding.mat")$plotData
plotDecoding <- melt(decoding)

ggplot(plotDecoding, aes(x = Var2, y = Var1)) +
geom_tile(aes(fill = value)) +
scale_fill_viridis(name = "",limits = c(0.5,0.75))

And this is what I got:


See the gray areas? What’s happening here is that any data points that fall outside of my specified data range (see the limits argument in scale_fill_viridis) are being converted to NAs. Apparently, NA values are plotted as gray by default. Obviously these values are not missing in the actual data; how can we actually ensure that they're plotted? The solution comes from the scales package that comes with ggplot. Whenever you create a plot with specified limits, include the argument oob = squish (oob = out of bounds) in the same line where you set the limits (make sure that the scales package is loaded). Ex:

scale_fill_viridis(name = "",limits = c(0.5,0.75),oob=squish)

What this does is that values that fall below the specified range are represented as the minimum value (color, in this case), and values that fall above the range are represented as the maximum value (which seems to be what Matlab defaults to). The resulting heatmap looks like this:


A simple solution to an annoying little quirk in R!

Looking at interactions of continuous variables

So we had an unexpected interaction effect between two continuous predictors in our dataset. I thought it was some kind of ceiling effect so I did some plots to see what was causing it. I still think it’s a ceiling effect, but I also found it interesting that the effect of predictor 1 is only really present at average levels of predictor 2. Anyway, I’ve attached my code. If you have an interaction that you can’t quite understand, these plots may help:



Bringing in Qualtrics (and other data)

While a lot of us have grown comfortable using Excel to clean and manipulate our data, there is a growing trend toward transparency and reproduciblity that make it difficult to keep going down that road. It’s also just too easy to make a mistake with Excel. For example, someone I was helping out with an email campaign recently sent me an Excel file that was sorted only by one column. I just trusted them, rather than properly checking the file, and sent the email out and over 400 people got an email with the wrong name in the greeting. That was awkward and frustrating. For me, I love being able to look at my code, see where I brought in the raw data, and see all of the manipulations I did to clean it up. Excel doesn’t do that for me. Thanks to Hadley Wickham’s “dplyr” package it is surprisingly easy to manipulate data in R. I recommend printing out RStudio’s “Data Wrangling Cheat Sheet” and hanging it up somewhere visible if you do regularly manipulate data in R. Here is an example of some data manipulation that I recently did in R.

Step 1. Setup

I’ve set my working directory so that R knows what folder to retrieve the raw data files from. Alternatively, you can give R the whole file name including folders when you read in the CSV file and not bother setting a working directory.


I’ve loaded the “dplyr” package, which I installed earlier using the command install.packages(“dplyr”). One problem with dplyr is that it uses some function names that mean something different in base R or in other packages. I’ve run into a lot of errors and found that the best workaround is to simply tell R that when I say “select”, what I mean is use select from the dplyr package.

filter <- dplyr::filter
select <- dplyr::select

Step 2. Bring in Qualtrics data

Here are a couple of rules of thumb that I use:

  1. Only create one object per data file. It is really confusing to come back 6 months later and see that you have 15 objects that are different versions of the same dataset. I like to see only one version. Dplyr makes it easy to have only one object.
  2. I almost never refer to rows or columns by number. Column numbers and row numbers change every time you tweak the dataset. For best transparency, use some name or rule-based method to remove data or tweak it.

I’m going to read in the data as a CSV file. I recommend against trying to read it in as an Excel file. There are several packages that supposedly read Excel, but they don’t seem to have consistent performance, and there is no guarantee that your code will work later if you do read it in as Excel.

qualtrics <- read.csv("PPAuthenticity2SPAPR15.csv", stringsAsFactors = FALSE) %>%

Notice that I used the “stringsAsFactors = FALSE” argument. By default, R will try to turn everything into factors, which is generally not what I want at all. I used the pipe operator “%>%” to let R know that I’m not done. Now I’m going to make some changes to this data. The pipe operator comes included with dplyr.

In our lab, we have all of our grad students run through the surveys before the real participants do to test it. I want to get rid of the grad student responses, so I filter out all observations that don’t have a student ID that starts with “95”. The grad students are supposed to put “Test” in this field, though John for some reason puts “666”. Filtering out all student ID’s that don’t start with “95” takes care of all of these test observations. Again, it ends with a pipe so that R knows there is more. Because I’m using pipes, I don’t even have to tell it what data I want to execute this command on. It already knows to look at the previous pipe row.

filter(grepl(“^95”, ID)) %>%

An English translation of this would be “In the row above, filter out all results where ‘ID’ doesn’t start with ’95’.” Grepl matches the standard expression I want, and filter removes those rows. In Qualtrics there is a second header row with really long, unwieldy descriptions. This will remove that row too. If all you wanted to do was remove that row of labels, you could simply remove it by position when you bring it in. Normally I don’t like to refer to rows by number, but I don’t think it does any harm to only remove the first row:

read.csv(“yourfile.csv”)[-1, ] # this is alternative code that I’m not using for my analysis

I try to make good, proper names for my variables in Qualtrics, but they always seem to get messed up. I inevitably end up renaming some of them:

rename(adskep_2 = adskep_10,
adskep_3 = adskep_11,
adskep_4 = adskep_12,
adskep_5 = adskep_13,
adskep_6 = Q28_14,
adskep_7 = Q28_15,
adskep_8 = Q28_16,
adskep_9 = Q28_17,
Out_not_authentic = Symbol_5) %>%

Note that the name to the left of the equal sign is the new name. The name to the right is the messed up name that Qualtrics gave me.

Now, I’m telling R only to keep variables that have the stems I want:

select(matches(“cont|symbol|cred|integ|out|adskep|id|qpq”)) %>%

In plain English, this would say “keep only the columns that have ‘cont’ or ‘symbol’ or ‘cred’ or ‘integ’ or ‘out’ or ‘adskep’ or ‘id’ or ‘qpq’ as part of their name.”

All of my variables were read in as character strings, so I will need to transform relevant columns to numeric format:

mutate_each(funs(as.numeric), -ID) %>%

Using the “mutate_each” command from the dplyr package, I’ve transformed every column except for “ID” to numeric format.

I need a composite variable that is the mean of all variables from the four dimensions of my scale. You can use “mutate” to create a new variable.

mutate(authenticity = rowMeans(select(.,matches(“cont|Symbol|cred|Integ”)))) %>%

In Qualtrics, I ran two conditions. I need a factor variable that tells me which condition the person was in. Right now I have two variables representing the two conditions. Each is a string of 1’s and NA’s. I only need one of these variables to make my new variable since condition “qpq” and “bonus” are mutually exclusive.

mutate(condition = factor(.$qpq, labels=c(“qpq”,”bonus”), exclude=NULL)) %>%

I created a new variable called “condition”, which is a factored version of “qpq”. When you create a factor in R, you can use the “exclude=NULL” argument to tell it that you want “NA” to be a factor level, rather than just representing missing data. Next, I used “select” to drop the “qpq” variable that has now become obsolete. Since I didn’t include a pipe operator at the end of my last command, all the code will now run and return my cleaned up data.

Step 3. Bring in a second dataset and merge it with the first

In our lab, we have students answer all of the demographic questions separately. We end up having to merge the data. This is ridiculously easy to do in R:

demos <- read.csv("Spring 2015 Demos UPPER 3-8.csv", stringsAsFactors = FALSE) %>%
alldata <- left_join(qualtrics,demos, by="ID")

I’ve brought in the second dataset. People often end up filling out the demos multiple times for whatever reason. I don’t want duplicates because I will end up with duplicate data after I do the join. I have used the “distinct” function to get rid of redundant student ID’s. Then I used “left_join” to keep everything in my dataset on the left “qualtrics,” and to tack on data from my other dataset, “demos” wherever there is a match by the “ID” column, which both datasets have in this case. Again, it’s pretty easy to join two datasets.

The output of this process is three objects:

  1. qualtrics
  2. demos
  3. alldata

There is no data1, data2, data3, etc. Very clean, very transparent, and very easy to look back and see exactly what I did.

Here is all of the code in one place:

Setting up a lovely new project

I’ve been doing two things recently that make me want to improve my workflow: starting new projects, and trying to revive old ones.

I remember back in the day when I first started using R instead of point-and-click stuff in SPSS or excel, and I was SO PROUD that my work was all in code, super reproducible. I also remember when I read my first book all by myself, and I was so proud then, too. How standards change, eh? My ability to read Amelia Bedelia cover to cover no longer seems as impressive, and – not unlike the work of that bumbling parlor maid – my old code now looks more like a series of misunderstandings and cryptic jokes than clean, reproducible analyses.

Looking back through my old projects and (gasp!) needing to share old code with new collaborators has brought to light some common problems:

1) Where even is that R script?

2) This code totally worked the last time I used it, and it simply doesn’t now.

3) Is this the code I used for that conference poster, or was it another version? Wasn’t there some code in here for sequence plots? Where did that go?

In an attempt to solve these problems* for Future Rosie, I’m setting up my new projects much better. We’ve talked about tools for workflow and reproducible code on this blog before, and of course there’s lots of tutorials for this online. This is just my version.

* Note that the “where even is that r script” problem is not really solved by nifty reproducible code tools. You just need to pick a naming convention and file organization system and then effing stick to it (I’m looking at you, Past Rosie!).

Setting up a pretty project

I’m using R studio and git. These are the important pieces for my own workflow (solving problems 2 and 3 above). I also use github to host my code, so it’s easy to share, etc. It’s also a nice way to keep track of my to-do list for each project (via “issues”). You don’t need to set up a github account to take advantage of git version control, though, and if you start a git project in Rstudio without associating it with a github account, you can always add it later. So definitely set up git, and connect it to github if you want.

Install stuff

If you haven’t already, you’ll need to install Rstudio ( and git (

Set up your project

Open Rstudio and click File>New Project… Then select “start new directory” and then “empty project.” Type in the name of the project (this will be the folder name). Select “create a git repository” to use version control, and select “use packrat* with this project” to save all of the useful information about which versions of every package you need.

Screenshot 2015-09-21 12.37.16

*”What is packrat,” you ask? Excellent question. It keeps track of which versions of all of your R packages are used for a project, so you can run old code with old package versions if you need to. Read about it and see if you want to use it.

Using git

In your new, empty project, you’ll notice a couple things: There’s a new tab by Environment and History

Screenshot 2015-09-21 12.40.02

and there are some automatically generated files hanging out in your directory.

Screenshot 2015-09-21 12.39.53

When you use git for version control, you save “commits”, which are like snapshots of your work. I recommend you commit every time you’re done adding something or making some change to your code. You’ll save a short message with each commit describing what you’ve done, so I recommend you commit every time you feel like you could describe your work with a pithy little message. For example, “added code for histograms of age and PPVT” is good, as is “fixed typos in intro paragraph”, “deleted sequence plots”, or “switched dataframe transformations to tidyr functions”. I like to wait to commit until I have working code, not before (so something like “trying to get LDA to work”, “still trying to get LDA to work” etc. are probably not good commits). Your commits will be the history of your project, and you’ll be able to click on each commit to see what your code looked like then. Committing is not the same as saving – I recommend you save compulsively, like every second, whether you’ve done anything worthwhile or not. Saving will update the file(s) on your computer (for example, the .r file in your working directory). Committing is a way to preserve snapshots of your project at useful or interesting points, so you (or someone else) can go back and look at it later.

Your first commit is usually just to start the project. I generally just include the message “init” for my first commit. Committing in Rstudio is a breeze. 🙂 Just click the little boxes next to each of the files you want to commit, and then click “commit”

Screenshot 2015-09-21 13.53.47

This opens the git window in Rstudio, where you type your commit message. The bottom shows you any changes, line by line, in your files from the previous commit (green means lines added – it’s all green now since these are new files being added). This is great for helping you remember what you want to write in your message. Neat, huh?

Screenshot 2015-09-21 13.55.07

Then hit “commit”! Bam. 🙂 You can also use this window to browse through your previous commits, which is nice when you’re looking for an older version of something or you want to see how or when your code changed.

Note: Sometimes I get the following error message when I try to commit:

Screenshot 2015-09-21 13.58.28

I’m not sure why. But it’s easy to fix: Close this popup, then hit “refresh” on the git window, then hit “commit” again. That usually works for me.

When you’re done checking out your git stuff, just close the git window and go back to regular Rstudio land.

Which files to commit?

Note that I committed several files above:

  1. the .Rprofile (you can use this to save options and preferences and stuff),
  2. .gitignore (more on this below)
  3. .Rproj file (roughly, this one saves my current state in Rstudio, including history, any variables I’ve made, packages loaded, dataframes open, etc. You may or may not want to version control this guy, depending on what you’re doing)
  4. all the packrat stuff (you can read about packrat here)
  5. If I had text files I was working on for this project (for me, they’re usually .r, .md, or .rmd files) I would have committed those, too.

The .gitignore file is important. It’s a list of stuff you DON’T want version controlled. For example, if you have some other stuff in this directory (slides from a conference presentation of this project, stimuli from the study, data, etc.) you can tell git to not bother version-controlling that stuff. Version control works best on plain text files (.txt, .r, .md, .rmd, .tex, .html, .csv, etc.), so I don’t recommend you try to version control other stuff (any mircosoft office files, images, movies, sound files, etc.).

I have an example pdf in my directory, so I’m going to add that to my .gitignore list. Click on it, then click the settings wheel, and click “ignore”. This will add it to the .gitignore list. It’s also possible to edit the .gitignore list yourself (it’s just a text file, so you can type right in it).

Screenshot 2015-09-21 14.08.02

Screenshot 2015-09-21 14.08.12

You’ll see that there are a couple things automatically added to the .gitignore list for you when you start a project in Rstudio this way, and now my .pdf file is added to that list! Joy.

When you look at your git window, you’ll notice the .gitignore file is there:

Screenshot 2015-09-21 14.10.37

That’s because it’s been changed, and the change has not yet been committed.  You can click the box next to it, then hit “commit” to commit that change now (your message might be something like “added old .pdf from previous workshop to .gitignore”).

Note that if you use a Mac, you might also see a .DS_Store file in your directory. This is just your computer keeping track of folder preferences, etc. You most likely don’t need to version control this, so add it to your .gitignore list as well.

Want to combine this awesomeness with a github repo?

Damn straight you do. You can use SSH to connect your RStudio project git stuff to your remote github repository. For example, here’s a repo of mine that connects to an RStudio project I have stored on my laptop: This means that I have all of the lovely version control stuff going on in RStudio when I work on this code AND I can make all of that publicly available so cool, smart people like John can contribute to my code! It’s also a handy way to share code with others – I can just send them a link to my github repo, and they have everything there at their fingertips (and if I update it, they’ll always have access to the most recent version, as well as all of the old commits in case they want an old version).

If you’re new to github, check out the extremely excellent materials available in Jenny Bryan‘s course at UBC: And here is a much quicker and less comprehensive resource:

Plotting logistic regressions, Part 3

If you haven’t already, check out plotting logistic regression part 1 (continuous by categorical interactions) and part 2 (continuous by continuous interactions).

All of this code is available on Rose’s github:

Plotting the results of your logistic regression Part 3: 3-way interactions

If you can interpret a 3-way interaction without plotting it, go find a mirror and give yourself a big sexy wink. wink That’s impressive.

For the rest of us, looking at plots will make understanding the model and results so much easier. And even if you are one of those lucky analysts with the working memory capacity of a super computer, you may want this code so you can use plots to help communicate a 3-way interaction to your readers.

Use the model from the Part 1 code.

Here’s that model:

## Call:
## glm(formula = DV ~ (X1 + X2 + group)^2, family = "binomial", 
##     data = data, na.action = "na.exclude")
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44094  -0.45991   0.04136   0.52301   2.74705  
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.5873     0.3438  -1.708 0.087631 .  
## X1            2.6508     0.5592   4.740 2.13e-06 ***
## X2           -2.2599     0.4977  -4.540 5.61e-06 ***
## groupb        2.2111     0.5949   3.717 0.000202 ***
## groupc        0.6650     0.4131   1.610 0.107456    
## X1:X2         0.1201     0.2660   0.452 0.651534    
## X1:groupb     2.7323     1.2977   2.105 0.035253 *  
## X1:groupc    -0.6816     0.7078  -0.963 0.335531    
## X2:groupb     0.8477     0.7320   1.158 0.246882    
## X2:groupc     0.4683     0.6558   0.714 0.475165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 205.46  on 290  degrees of freedom
## AIC: 225.46
## Number of Fisher Scoring iterations: 7

Let’s add a 3-way interaction. Instead of re-running the whole model, we can use the nifty update() function. This will make the change to the model (adding the 3-way interaction), and automatically refit the whole thing. (It is also fine to just re-run the model — you’ll get the exact same results. I just wanted to show off the update() function.)

new.model <- update(model, ~ . + X1:X2:group) # the . stands in for the whole formula we had before

# if you wanted to specify the whole model from scratch instead of using update():
new.model <- glm(DV ~ X1*X2*group, 
             data=data, na.action="na.exclude",  family="binomial")  

## Call:
## glm(formula = DV ~ X1 * X2 * group, family = "binomial", data = data, 
##     na.action = "na.exclude")
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.36818  -0.39475   0.02394   0.45860   2.82512  
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.8959     0.4037  -2.219 0.026491 *  
## X1             2.7511     0.5914   4.652 3.29e-06 ***
## X2            -2.3070     0.5076  -4.545 5.49e-06 ***
## groupb         2.5457     0.6776   3.757 0.000172 ***
## groupc         1.3403     0.5201   2.577 0.009958 ** 
## X1:X2          0.6779     0.4314   1.572 0.116057    
## X1:groupb      3.8321     1.6589   2.310 0.020882 *  
## X1:groupc     -0.6709     0.7412  -0.905 0.365349    
## X2:groupb      1.1732     0.7623   1.539 0.123806    
## X2:groupc      0.1871     0.6975   0.268 0.788483    
## X1:X2:groupb   1.1108     0.8198   1.355 0.175458    
## X1:X2:groupc  -1.4068     0.5899  -2.385 0.017082 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 412.88  on 299  degrees of freedom
## Residual deviance: 194.51  on 288  degrees of freedom
## AIC: 218.51
## Number of Fisher Scoring iterations: 7

Calculate probabilities for the plot

Again, we’ll put X1 on the x-axis. That’s the only variable we’ll enter as a whole range.

X1_range <- seq(from=min(data$X1), to=max(data$X1), by=.01)

Next, compute the equations for each line in logit terms.

Pick some representative values for the other continuous variable

Just like last time, we’ll plug in some representative values for X2, so we can have separate lines for each representative level of X2.

X2_l <- mean(data$X2) - sd(data$X2) 
X2_m <- mean(data$X2)
X2_h <- mean(data$X2) + sd(data$X2)
# check that your representative values actually fall within the observed range for that variable
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.52900 -0.70950 -0.02922 -0.04995  0.67260  2.38000
c(X2_l, X2_m, X2_h)
## [1] -1.0621963 -0.0499475  0.9623013

Now we can go ahead and plug those values into the rest of the equation to get the expected logits across the range of X1 for each of our “groups” (hypothetical low X2 people, hypothetical average X2 people, hypothetical high X2 people). We’ll also plug in the dummy codes for each of the three groups (a, b, and c). And we’ll calculate the predicted probabilities of the DV for each combination of X2 level and group.

But instead of literally writing out all of the equations (9 of them!!), we’ll just use the fun-and-easy predict() function.

If you ran your model in SPSS, so you only have the coefficients and not the whole model as an R object, you can still make the plots — you just need to spend some quality time writing out those equations. For examples of how to do this (for just 3 equations, but you get the idea) see Part 1 and Part 2 in this series.

To use predict(), you make a new data frame with the predictor values you want to use (i.e. the whole range for X1, group a, and the representative values we picked for X2), and then when you run predict() on it, for each row in the data frame it will generate the predicted value for your DV from the model you saved. The expand.grid() function is a quick and easy way to make a data frame out of all possible combinations of the variables provided. Perfect for this situation!

#make a new data frame with the X values you want to predict 
generated_data <-, X2=c(X2_l, X2_m, X2_h), group=c("a", "b", "c") ))
##          X1        X2 group
## 1 -2.770265 -1.062196     a
## 2 -2.760265 -1.062196     a
## 3 -2.750265 -1.062196     a
## 4 -2.740265 -1.062196     a
## 5 -2.730265 -1.062196     a
## 6 -2.720265 -1.062196     a
#use `predict` to get the probability using type='response' rather than 'link' 
generated_data$prob <- predict(new.model, newdata=generated_data, type = 'response')
##          X1        X2 group       prob
## 1 -2.770265 -1.062196     a 0.01675787
## 2 -2.760265 -1.062196     a 0.01709584
## 3 -2.750265 -1.062196     a 0.01744050
## 4 -2.740265 -1.062196     a 0.01779198
## 5 -2.730265 -1.062196     a 0.01815042
## 6 -2.720265 -1.062196     a 0.01851595
# let's make a factor version of X2, so we can do gorgeous plotting stuff with it later :)
generated_data$X2_level <- factor(generated_data$X2, labels=c("low (-1SD)", "mean", "high (+1SD)"), ordered=T)
##        X1                 X2           group         prob        
##  Min.   :-2.77027   Min.   :-1.06220   a:1683   Min.   :0.00000  
##  1st Qu.:-1.37027   1st Qu.:-1.06220   b:1683   1st Qu.:0.02416  
##  Median : 0.02974   Median :-0.04995   c:1683   Median :0.56304  
##  Mean   : 0.02974   Mean   :-0.04995            Mean   :0.51727  
##  3rd Qu.: 1.42973   3rd Qu.: 0.96230            3rd Qu.:0.99162  
##  Max.   : 2.82973   Max.   : 0.96230            Max.   :1.00000  
##         X2_level   
##  low (-1SD) :1683  
##  mean       :1683  
##  high (+1SD):1683  

Plot time!

This kind of situation is exactly when ggplot2 really shines. We want multiple plots, with multiple lines on each plot. Of course, this is totally possible in base R (see Part 1 and Part 2 for examples), but it is so much easier in ggplot2. To do this in base R, you would need to generate a plot with one line (e.g. group a, low X2), then add the additional lines one at a time (group a, mean X2; group a, high X2), then generate a new plot (group b, low X2), then add two more lines, then generate a new plot, then add two more lines. Sigh.

Not to go down too much of a rabbit hole, but this illustrates what is (in my opinion) the main difference between base R graphics and ggplot2: base graphics are built for drawing, whereas ggplot is built for visualizing data. It’s the difference between specifying each line and drawing them on your plot vs. giving a whole data frame to the plotting function and telling it which variables to use and how. Depending on your needs and preferences, base graphics or ggplot may be a better choice for you. For plotting complex model output, like a 3-way interaction, I think you’ll generally find that ggplot2 saves the day.

library(ggplot2) <- generated_data

# check out your plotting data
##          X1        X2 group       prob   X2_level
## 1 -2.770265 -1.062196     a 0.01675787 low (-1SD)
## 2 -2.760265 -1.062196     a 0.01709584 low (-1SD)
## 3 -2.750265 -1.062196     a 0.01744050 low (-1SD)
## 4 -2.740265 -1.062196     a 0.01779198 low (-1SD)
## 5 -2.730265 -1.062196     a 0.01815042 low (-1SD)
## 6 -2.720265 -1.062196     a 0.01851595 low (-1SD)
ggplot(, aes(x=X1, y=prob, color=X2_level)) + 
  geom_line(lwd=2) + 
  labs(x="X1", y="P(outcome)", title="Probability of super important outcome") +
  facet_wrap(~group) # i love facet_wrap()! it's so great. you should fall in love, too, and use it all the time.

# let's try flipping it, so the facets are by X2 level and the lines are by group
ggplot(, aes(x=X1, y=prob, color=group)) + 
  geom_line(lwd=2) + 
  labs(x="X1", y="P(outcome)", title="Probability of super important outcome") +