Conditional Density Plots in ggplot2

My distance contribution to ggplot2 day 🙂 http://rpubs.com/rosemm/cond_density_plots

The example is with some real data of mine, which I can’t share with you all just yet, sadly. You can apply it pretty quickly to other data, though, for example:
ggplot(diamonds, aes(carat, ..count.., fill = cut)) +
geom_density(position = "fill")

This post (http://stackoverflow.com/questions/14570293/special-variables-in-ggplot-count-density-etc) has some information about the weird ..count.. thing going on in the aes() mapping, for those who are curious. I also encourage you to play around with the adjust argument in geom_density() to see how that changes your plot. Have fun!

Fall 2016 Tentative Schedule

R Club meets every Thursday from 2-3:30 PM in 006 Straub.

Topic R Packages
Week 2 Data visualization ggplot2
Week 3 Data wrangling dplyr, tidyr
Week 4 Reproducible, dynamic reports rmarkdown
Week 5 Programming, simulation and elegant coding
Week 6 Simulation
Week 7 Regression, principal components analysis (PCA)
Week 8 Structural equation modeling (SEM) lavaan
Week 9 Thanksgiving
Week 10 Neuroimaging in R or meta-analysis metafor?

#SlackR

I just set up a slack team for R Club. The focus right now is on doing a little planning for the coming quarter, but it will also serve as a place for our UO R community to ask questions and maybe get help from fellow students on the fly. All you need to join is a UO email address.

uorclub.slack.com

Consultation: Translating SAS Multi-Level Models into R

Today at R club, we’ll help Jessica with a fun problem: she has a bunch of multi-level models written in SAS that she want’s to be able to play with in R. The motivation is that she can find R on pretty much any computer she sits down at these days, but only has access to one computer that runs SAS. For convenience, and because she is learning R, she wants to be able to work with these models using lme4 or nlme.

The data sounds really cool:

This project is looking at diurnal cortisol (4 times measurements per day per participant over the course of 2 days) in a cross sectional sample, ages 4-16 years old, specifically examining how two groups differ (previously institutionalized versus biologically reared). The second component of the project is examining how a 3rd variable (sleep) may moderate these effects.

Her data is already in long format, so we can get right into the modeling.

Code review meetings?

Anyone interested in setting up meetings specifically for code review (rather than working through new problems or presenting techniques, which is more typical R Club)?

I’m thinking about this because I’ve been writing A LOT of code for my dissertation, and it’s all publicly available on my github repo — this is not the same as it actually being useful/interpretable to other researchers, though, which is my hope. I think I do a pretty good job of being consistent within my own code, but I’m thinking that’s really not enough to actually make sure other people will be able to understand it. Code reviews might help me, and others in the same boat, end up with code that is clearer and cleaner, and has gotten over the initial test of another person reading it and being like, “wait, what?”

So what I’m thinking about is a group of people who are all actively writing code for real projects (or have existing code from a completed project that they would like to clean up), and we meet to go through each others code and provide feedback about what’s clear and what’s not, and where to improve the code if it can be done better. We might want to have an initial meeting where we agree upon some style guides (a couple ideas of places to start: variable names in general, r code in particular, and as always, there’s an r package for that).

Interested? Leave a comment, or email me directly: rosem@uoregon.edu In a week or so, I’ll send around an email to interested folks so we can coordinate.

Bringing in Qualtrics (and other data)

While a lot of us have grown comfortable using Excel to clean and manipulate our data, there is a growing trend toward transparency and reproduciblity that make it difficult to keep going down that road. It’s also just too easy to make a mistake with Excel. For example, someone I was helping out with an email campaign recently sent me an Excel file that was sorted only by one column. I just trusted them, rather than properly checking the file, and sent the email out and over 400 people got an email with the wrong name in the greeting. That was awkward and frustrating. For me, I love being able to look at my code, see where I brought in the raw data, and see all of the manipulations I did to clean it up. Excel doesn’t do that for me. Thanks to Hadley Wickham’s “dplyr” package it is surprisingly easy to manipulate data in R. I recommend printing out RStudio’s “Data Wrangling Cheat Sheet” and hanging it up somewhere visible if you do regularly manipulate data in R. Here is an example of some data manipulation that I recently did in R.

Step 1. Setup

I’ve set my working directory so that R knows what folder to retrieve the raw data files from. Alternatively, you can give R the whole file name including folders when you read in the CSV file and not bother setting a working directory.

setwd(“~/Research/PPAuthenticity/studies/Study1”)

I’ve loaded the “dplyr” package, which I installed earlier using the command install.packages(“dplyr”). One problem with dplyr is that it uses some function names that mean something different in base R or in other packages. I’ve run into a lot of errors and found that the best workaround is to simply tell R that when I say “select”, what I mean is use select from the dplyr package.

library(dplyr)
filter <- dplyr::filter
select <- dplyr::select

Step 2. Bring in Qualtrics data

Here are a couple of rules of thumb that I use:

  1. Only create one object per data file. It is really confusing to come back 6 months later and see that you have 15 objects that are different versions of the same dataset. I like to see only one version. Dplyr makes it easy to have only one object.
  2. I almost never refer to rows or columns by number. Column numbers and row numbers change every time you tweak the dataset. For best transparency, use some name or rule-based method to remove data or tweak it.

I’m going to read in the data as a CSV file. I recommend against trying to read it in as an Excel file. There are several packages that supposedly read Excel, but they don’t seem to have consistent performance, and there is no guarantee that your code will work later if you do read it in as Excel.

qualtrics <- read.csv("PPAuthenticity2SPAPR15.csv", stringsAsFactors = FALSE) %>%

Notice that I used the “stringsAsFactors = FALSE” argument. By default, R will try to turn everything into factors, which is generally not what I want at all. I used the pipe operator “%>%” to let R know that I’m not done. Now I’m going to make some changes to this data. The pipe operator comes included with dplyr.

In our lab, we have all of our grad students run through the surveys before the real participants do to test it. I want to get rid of the grad student responses, so I filter out all observations that don’t have a student ID that starts with “95”. The grad students are supposed to put “Test” in this field, though John for some reason puts “666”. Filtering out all student ID’s that don’t start with “95” takes care of all of these test observations. Again, it ends with a pipe so that R knows there is more. Because I’m using pipes, I don’t even have to tell it what data I want to execute this command on. It already knows to look at the previous pipe row.

filter(grepl(“^95”, ID)) %>%

An English translation of this would be “In the row above, filter out all results where ‘ID’ doesn’t start with ’95’.” Grepl matches the standard expression I want, and filter removes those rows. In Qualtrics there is a second header row with really long, unwieldy descriptions. This will remove that row too. If all you wanted to do was remove that row of labels, you could simply remove it by position when you bring it in. Normally I don’t like to refer to rows by number, but I don’t think it does any harm to only remove the first row:

read.csv(“yourfile.csv”)[-1, ] # this is alternative code that I’m not using for my analysis

I try to make good, proper names for my variables in Qualtrics, but they always seem to get messed up. I inevitably end up renaming some of them:

rename(adskep_2 = adskep_10,
adskep_3 = adskep_11,
adskep_4 = adskep_12,
adskep_5 = adskep_13,
adskep_6 = Q28_14,
adskep_7 = Q28_15,
adskep_8 = Q28_16,
adskep_9 = Q28_17,
Out_not_authentic = Symbol_5) %>%

Note that the name to the left of the equal sign is the new name. The name to the right is the messed up name that Qualtrics gave me.

Now, I’m telling R only to keep variables that have the stems I want:

select(matches(“cont|symbol|cred|integ|out|adskep|id|qpq”)) %>%

In plain English, this would say “keep only the columns that have ‘cont’ or ‘symbol’ or ‘cred’ or ‘integ’ or ‘out’ or ‘adskep’ or ‘id’ or ‘qpq’ as part of their name.”

All of my variables were read in as character strings, so I will need to transform relevant columns to numeric format:

mutate_each(funs(as.numeric), -ID) %>%

Using the “mutate_each” command from the dplyr package, I’ve transformed every column except for “ID” to numeric format.

I need a composite variable that is the mean of all variables from the four dimensions of my scale. You can use “mutate” to create a new variable.

mutate(authenticity = rowMeans(select(.,matches(“cont|Symbol|cred|Integ”)))) %>%

In Qualtrics, I ran two conditions. I need a factor variable that tells me which condition the person was in. Right now I have two variables representing the two conditions. Each is a string of 1’s and NA’s. I only need one of these variables to make my new variable since condition “qpq” and “bonus” are mutually exclusive.

mutate(condition = factor(.$qpq, labels=c(“qpq”,”bonus”), exclude=NULL)) %>%
  select(-qpq)

I created a new variable called “condition”, which is a factored version of “qpq”. When you create a factor in R, you can use the “exclude=NULL” argument to tell it that you want “NA” to be a factor level, rather than just representing missing data. Next, I used “select” to drop the “qpq” variable that has now become obsolete. Since I didn’t include a pipe operator at the end of my last command, all the code will now run and return my cleaned up data.

Step 3. Bring in a second dataset and merge it with the first

In our lab, we have students answer all of the demographic questions separately. We end up having to merge the data. This is ridiculously easy to do in R:

demos <- read.csv("Spring 2015 Demos UPPER 3-8.csv", stringsAsFactors = FALSE) %>%
   distinct(ID)
alldata <- left_join(qualtrics,demos, by="ID")

I’ve brought in the second dataset. People often end up filling out the demos multiple times for whatever reason. I don’t want duplicates because I will end up with duplicate data after I do the join. I have used the “distinct” function to get rid of redundant student ID’s. Then I used “left_join” to keep everything in my dataset on the left “qualtrics,” and to tack on data from my other dataset, “demos” wherever there is a match by the “ID” column, which both datasets have in this case. Again, it’s pretty easy to join two datasets.

The output of this process is three objects:

  1. qualtrics
  2. demos
  3. alldata

There is no data1, data2, data3, etc. Very clean, very transparent, and very easy to look back and see exactly what I did.

Here is all of the code in one place:

Running some jobs on ACISS

If you want to catch up, check out this general post on ACISS, and the first post on how we’re getting set up to run 10,000 bootstrap iterations on 6 different SEM models.

Code walkthrough

I’ll walk through each bit of this script and how it helps us run jobs across multiple nodes on ACISS. Note that once we have a piece of code like this, we will run it on ACISS directly from an R console using source('this_R_script.r').
In this specific example, we need to do something (bootstrap confidence intervals) for each of 6 different models. Our strategy will be to write a function that takes a single argument that is a character string with a lavaan model specification and that will return the bootstrapped confidence intervals. The way the BatchJobs package works is that it takes a function like the one I just described and a vector of the arguments to pass that function. It then requests a node (which is like a separate computer — see the first post) for each argument in the list, and runs that function with that one argument on that node.

So to start with, we need to specify each of the six models:

library(BatchJobs) #load batchjobs first...

uneq.struc.model.ext <-"
ex_T1 =~ c(1, 1, 1, 1)*extr_T1_P1 + c(L2, L2, L2, L2)*extr_T1_P2 + c(L3, L3, L3, L3)*extr_T1_P3 
ex_T2 =~ c(1, 1, 1, 1)*extr_T2_P1 + c(L2, L2, L2, L2)*extr_T2_P2 + c(L3, L3, L3, L3)*extr_T2_P3
ex_T3 =~ c(1, 1, 1, 1)*extr_T3_P1 + c(L2, L2, L2, L2)*extr_T3_P2 + c(L3, L3, L3, L3)*extr_T3_P3
ex_T4 =~ c(1, 1, 1, 1)*extr_T4_P1 + c(L2, L2, L2, L2)*extr_T4_P2 + c(L3, L3, L3, L3)*extr_T4_P3

extr_T1_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T1_P1
extr_T2_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T2_P1
extr_T3_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T3_P1
extr_T4_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T4_P1
... # truncated for readability
"

uneq.struc.model.agree <-"..." 
uneq.struc.model.consc <-"..." 
uneq.struc.model.neur <-"..." 
uneq.struc.model.open <-"..." 
uneq.struc.model.HP <-"..." 

I’ve removed most of the text from the model specification since that’s not the focus here. We now want to put these models into a vector.

starts<-c(
  uneq.struc.model.ext, 
  uneq.struc.model.agree,
  uneq.struc.model.consc,
  uneq.struc.model.neur,
  uneq.struc.model.open,
  uneq.struc.model.HP)

Now we write the function that will take, one at a time, each of these six elements of the vector starts.

CIfunc <-function(x) {
  # Read in the data
  bfibfas_LNT_nat <- read.csv("bfibfas_LNT_nat.csv")
  # Load the SEM package called lavaan
  library(lavaan)

  #Fit the model specified by `x`
  fit.uneq.struc.model <- sem(x, data=bfibfas_LNT_nat, group = "agegroup", missing="ml")
  
  #Use the model fit above to bootstrap parameters 10,000 times.
  out <- bootstrapLavaan(fit.uneq.struc.model, R = 10000, FUN = function(y) {
    standardizedSolution(y)$est }, verbose = TRUE, parallel="snow", ncpus="12")
  
  # Transform the bootstrap output into nice 95% CIs for each parameter
  table.ci <- lapply(as.data.frame(out), quantile, probs=c(0.025, 0.975))

  # Return a list of both the fit model, and CIs
  list(model = fit.uneq.struc.model, cis=table.ci)
}

We now have all the pieces we need to create and run jobs using BatchJobs. Note well the comments in the above code. And also note that I’ve bolded part of the call to bootstrapLavaan. Using the arguments parallel="snow" and ncpus="12" tells this function to spread the 10,000 iterations across all 12 of the nodes processors using the snow backend. Any function you write for the BatchJobs should, ideally, be parallelized in some way. This is easy to do with lavaan, as well as the basic boot package, thankfully.

Note also that we load the data within the function. Basically, anything you need during the function has to be specified there. So we’ve saved our data file in the directory where we make our registry, and the function loads it each time its run.

To understand what happens next, it is important to know that BatchJobs uses what it calls a registry to keep track of all the info you need to give each job, and all the output. Basically, you ask BatchJobs to create a new directory structure for this project (using id = "..." below), and tell is what packages it needs to ensure are accessible to the function you’ve written (lavaan in our case).

After it creates the registry, you ask it to map the elements of the starts vector to the function. This sets up the list of jobs it will submit to nodes on ACISS.

# Create the registry. Note, the id should not contain spaces or weird characters!
reg <- makeRegistry(id = "CIs_Stability_Models", packages = "lavaan")

#In the new registry, map the function `CIfunc` to the elements fo `starts`
ids <- batchMap(reg, CIfunc, starts)

#Now that we created them, we can submit all the jobs in `reg` 
finished <- submitJobs(reg, resources = list(nodes=1, ppn=12))

When we submitJobs, we can specify that each job will get 1 node, and that each node will have 12 processors (resources = list(nodes=1, ppn=12)).

If we want to check the status of the jobs we can use showStatus(reg). And if, as will likely happen, we get kicked out of our session, we can log back in and run the following R commands:

library(BatchJobs)
reg <- loadRegistry("CIs_Stability_Models-files")
showStatus(reg)

If your registry was saved in a different location, you’d substitute out “CIs_Stability_Models-files” with “Your_Reg_Name-files”, or whatever the directory is actually called.

You can also check on your jobs from the command line using qstat with the -u flag and your user name. If you don’t see any output, it means you’ve entered the wrong user name, or your jobs are done. Below is an example of what Cory’s jobs look like when they’re running:


[ccostell@hn1 ~]$ qstat -u ccostell

hn1: 
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
1058081.hn1             ccostell    generic  CIs_Stability_Mo  22781     1      1    --   24:00:00 R  01:55:34
1058082.hn1             ccostell    generic  CIs_Stability_Mo  22792     1      1    --   24:00:00 R  01:55:34
1058083.hn1             ccostell    generic  CIs_Stability_Mo    301     1      1    --   24:00:00 R  01:55:34
1058084.hn1             ccostell    generic  CIs_Stability_Mo    317     1      1    --   24:00:00 R  01:55:33
1058085.hn1             ccostell    generic  CIs_Stability_Mo  12613     1      1    --   24:00:00 R  01:55:33
1058086.hn1             ccostell    generic  CIs_Stability_Mo  12624     1      1    --   24:00:00 R  01:55:32

Once your jobs have finished (use showStatus(reg) to ensure they finished without error or premature termination), you can collect the results and save them to an RData file that you’ll want to transfer to your own computer and play with there:

reg <- loadRegistry("CIs_Stability_Models-files")
showStatus(reg)

reduced_CI_results<-reduceResultsList(reg, fun = function(job, res) res, progressbar = T)

save(reduced_CI_results,file='reduced_CI_results.RData')

A minor problem encountered by us

When we first ran this code, we came back the next day to find that none of the jobs had finished (but they hadn’t thrown errors either). We had accidentally requested nodes of the ‘short’ type, which only allow you to process on them for 4 hours before they just kill whatever you’re doing. We had to edit our simple.tmpl to include the line #PBS -q generic so that we were requesting a generic node that lets things run for up to 24 hours.

Back in the deep end with ACISS

Cory wanted to do an unassailable number of bootstrap replications (10,000) to get confidence intervals for six different models he ran using lavaan. Using his own computer to do 500 replications took, well, too long. So last week we started to set up a script for doing this all on ACISS following along a blog post I wrote some time ago. We discovered a few areas of detail left out of that original post, so I’ll document our process in detail below:

Getting started with ACISS

First, make sure you apply for a new account.

Once you’ve logged on, if you need to do any work at all (running a shell script, installing an R package), you need to request an interactive job from the sun grid engine — the node you log in to has basically no processing capacity at all.

As an example, let’s install the packages we’ll need later on. Below, I request an interactive (-I) job on a generic node, and then load whatever the default version of R is (see module avail for a list of available software and versions).

[flournoy@hn1 ~]$ qsub -q generic -I
qsub: waiting for job 1048662.hn1 to start
qsub: job 1048662.hn1 ready

[flournoy@cn91 ~]$ module load R

Note the hostname change from hn1 to cn91. Now I can run R and install packages. Which brings up the second obstacle we came across: ACISS doesn’t like making secure HTTP requests (e.g., to https://cloud.r-project.org/). So, when you want to install.package, you have to make sure to either choose an http mirror when asked, or better yet, just a specific mirror in the command itself. We’ll use the CRAN repository up at OSU.

[flournoy@cn91 ~]$  R

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
...

> install.packages(c('lavaan', 'BatchJobs'), repo='http://ftp.osuosl.org/pub/cran/', Ncpus=12)
Installing packages into ‘/ibrix/users/home5/flournoy/R/x86_64-unknown-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
...

In the above install.packages command, I ask for both lavaan and BatchJobs to be installed, specify the mirror to use in the repo option, and since we’re getting used to parallelization, I also tell it to use up to 12 cpus (Ncpus) if it needs to compile anything.

Cory’s Bootstrap Strategy

For simplicity, we decided to distribute all the work for bootstrapping using one node per model. That means one node will do all 10,000 bootstraps for that model. Luckily, bootstrapping in R itself is easy to make parallel, so the job will run on all the node’s 12 processors. So we’ll get a speedup factor of 6*12=72, which is pretty nice, and we gain the added benefit of running this on a remote server.

If we wanted to get more speedup, we could potentially split bootstraps within model type across nodes, and recombine them later. That is, we could do 5,000 bootstrap iterations for model 1 on node 1, then 5,000 more for model 1 on node 2, and then recombine them later. For now, we’ll focus on the simpler version.

Next steps

This week we’re going to finalize the script that uses BatchJobs to schedule these things on ACISS’s Sun Grid Engine, and collect the results. We’ll also play around with using R Studio with a remote R console (fingers crossed).