
Bringing in Qualtrics (and other data)

While a lot of us have grown comfortable using Excel to clean and manipulate our data, the growing push toward transparency and reproducibility makes it difficult to keep going down that road. It’s also just too easy to make a mistake in Excel. For example, someone I was helping with an email campaign recently sent me an Excel file in which only one column had been sorted, so the names no longer lined up with the email addresses. I just trusted the file rather than properly checking it, sent the email out, and over 400 people got a greeting with the wrong name in it. That was awkward and frustrating. In R, I love being able to look at my code, see where I brought in the raw data, and see every manipulation I did to clean it up. Excel doesn’t give me that. Thanks to Hadley Wickham’s “dplyr” package, it is surprisingly easy to manipulate data in R. I recommend printing out RStudio’s “Data Wrangling Cheat Sheet” and hanging it up somewhere visible if you regularly manipulate data in R. Here is an example of some data manipulation that I recently did in R.

Step 1. Setup

I’ve set my working directory so that R knows which folder to retrieve the raw data files from. Alternatively, you can give R the full file path, folders and all, when you read in the CSV file and not bother setting a working directory.

setwd("~/Research/PPAuthenticity/studies/Study1")

I’ve loaded the “dplyr” package, which I installed earlier with install.packages("dplyr"). One problem with dplyr is that it uses some function names that already mean something different in base R or in other packages (filter(), for example, collides with stats::filter()). I’ve run into a lot of errors and found that the best workaround is simply to tell R that when I say “select”, I mean select from the dplyr package:

library(dplyr)
filter <- dplyr::filter
select <- dplyr::select

Step 2. Bring in Qualtrics data

Here are a couple of rules of thumb that I use:

  1. Only create one object per data file. It is really confusing to come back six months later and find 15 objects that are all different versions of the same dataset. I like to see only one version, and dplyr makes that easy.
  2. I almost never refer to rows or columns by number. Column numbers and row numbers change every time you tweak the dataset. For transparency, use a name- or rule-based method to remove or change data.

I’m going to read the data in as a CSV file. I recommend against trying to read it in as an Excel file. There are several packages that supposedly read Excel files, but their performance doesn’t seem consistent, and there is no guarantee your code will still work later if you rely on them.

qualtrics <- read.csv("PPAuthenticity2SPAPR15.csv", stringsAsFactors = FALSE) %>%

Notice that I used the stringsAsFactors = FALSE argument. By default, R converts character strings to factors when reading a CSV, which is generally not what I want at all. I ended the line with the pipe operator %>% to let R know that I’m not done: I’m going to make more changes to this data. The pipe operator comes from the magrittr package and is loaded along with dplyr.
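If the pipe is new to you, here’s a quick illustration with a throwaway vector: x %>% f(y) is just another way of writing f(x, y).

# these two lines are equivalent; the pipe feeds the left-hand side
# in as the first argument of the function on the right
round(mean(c(1, 2, 3.7)), 1)
c(1, 2, 3.7) %>% mean() %>% round(1)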

In our lab, all of our grad students run through the surveys to test them before the real participants do. I want to get rid of those grad student responses, so I keep only observations whose student ID starts with “95”. The grad students are supposed to put “Test” in this field, though John for some reason puts “666”. Filtering out all IDs that don’t start with “95” takes care of all of these test observations. Again, the line ends with a pipe so that R knows there is more to come. Because I’m using pipes, I don’t even have to tell filter() what data to operate on: it uses the output of the previous step in the pipeline.

filter(grepl("^95", ID)) %>%

An English translation of this would be “keep only the rows where ‘ID’ starts with ‘95’.” grepl() returns TRUE for the IDs that match the regular expression, and filter() keeps those rows and drops the rest. In Qualtrics exports there is also a second header row with really long, unwieldy descriptions; because those labels don’t start with “95” either, this step removes that row too. If all you wanted to do was remove that row of labels, you could simply drop it by position when you read the file in. Normally I don’t like to refer to rows by number, but I don’t think it does any harm to remove only the first row:

read.csv("yourfile.csv")[-1, ] # this is alternative code that I’m not using for my analysis
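To see what grepl() is doing in the filter step, here’s a quick check with some made-up IDs (not the real data):

grepl("^95", c("951234", "Test", "666", "957777"))
# [1]  TRUE FALSE FALSE  TRUE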

I try to make good, proper names for my variables in Qualtrics, but they always seem to get messed up. I inevitably end up renaming some of them:

rename(adskep_2 = adskep_10,
adskep_3 = adskep_11,
adskep_4 = adskep_12,
adskep_5 = adskep_13,
adskep_6 = Q28_14,
adskep_7 = Q28_15,
adskep_8 = Q28_16,
adskep_9 = Q28_17,
Out_not_authentic = Symbol_5) %>%

Note that the name to the left of the equal sign is the new name. The name to the right is the messed up name that Qualtrics gave me.

Now, I’m telling R only to keep variables that have the stems I want:

select(matches("cont|symbol|cred|integ|out|adskep|id|qpq")) %>%

In plain English, this would say “keep only the columns that have ‘cont’ or ‘symbol’ or ‘cred’ or ‘integ’ or ‘out’ or ‘adskep’ or ‘id’ or ‘qpq’ as part of their name.”

All of my variables were read in as character strings, so I will need to transform relevant columns to numeric format:

mutate_each(funs(as.numeric), -ID) %>%

Using the “mutate_each” command from the dplyr package, I’ve transformed every column except for “ID” to numeric format.

I need a composite variable that is the mean of all variables from the four dimensions of my scale. You can use “mutate” to create a new variable.

mutate(authenticity = rowMeans(select(., matches("cont|Symbol|cred|Integ")))) %>%

In Qualtrics, I ran two conditions. I need a factor variable that tells me which condition each person was in. Right now I have two variables representing the two conditions; each is a column of 1s and NAs. I only need one of them to make my new variable, since the “qpq” and “bonus” conditions are mutually exclusive.

mutate(condition = factor(.$qpq, labels = c("qpq", "bonus"), exclude = NULL)) %>%
  select(-qpq)

I created a new variable called “condition”, which is a factored version of “qpq”. When you create a factor in R, the exclude = NULL argument tells it that you want NA to be a factor level rather than just missing data. Next, I used select() to drop the “qpq” variable, which has now become obsolete. Since I didn’t end the last command with a pipe operator, the whole chain now runs and returns my cleaned-up data.
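To see what exclude = NULL buys you, here’s a tiny illustration with a made-up stand-in for the qpq column:

x <- c(1, NA, 1, NA)                  # made-up stand-in for the qpq column
factor(x)                             # NA is treated as missing; the only level is "1"
factor(x, exclude = NULL,
       labels = c("qpq", "bonus"))    # NA becomes its own level, labelled "bonus"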

Step 3. Bring in a second dataset and merge it with the first

In our lab, we have students answer all of the demographic questions separately. We end up having to merge the data. This is ridiculously easy to do in R:

demos <- read.csv("Spring 2015 Demos UPPER 3-8.csv", stringsAsFactors = FALSE) %>%
   distinct(ID)
alldata <- left_join(qualtrics,demos, by="ID")

I’ve brought in the second dataset. People often end up filling out the demos multiple times for whatever reason, and I don’t want duplicates because they would give me duplicate rows after the join. I used the distinct() function to get rid of redundant student IDs. Then I used left_join() to keep everything in the dataset on the left, qualtrics, and tack on data from the other dataset, demos, wherever there is a match on the “ID” column, which both datasets have. Again, it’s pretty easy to join two datasets.
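If joins are new to you, here’s the same idea on a couple of made-up data frames:

left  <- data.frame(ID = c("951", "952", "953"), score = c(4, 5, 3),
                    stringsAsFactors = FALSE)
right <- data.frame(ID = c("951", "953"), age = c(19, 22),
                    stringsAsFactors = FALSE)
left_join(left, right, by = "ID")   # every row of `left` is kept; "952" gets NA for age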

The output of this process is three objects:

  1. qualtrics
  2. demos
  3. alldata

There is no data1, data2, data3, etc. Very clean, very transparent, and very easy to look back and see exactly what I did.

Here is all of the code in one place:
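(This is just the snippets from above strung together into a single pipeline; nothing new.)

library(dplyr)
filter <- dplyr::filter
select <- dplyr::select

setwd("~/Research/PPAuthenticity/studies/Study1")

qualtrics <- read.csv("PPAuthenticity2SPAPR15.csv", stringsAsFactors = FALSE) %>%
  filter(grepl("^95", ID)) %>%
  rename(adskep_2 = adskep_10,
         adskep_3 = adskep_11,
         adskep_4 = adskep_12,
         adskep_5 = adskep_13,
         adskep_6 = Q28_14,
         adskep_7 = Q28_15,
         adskep_8 = Q28_16,
         adskep_9 = Q28_17,
         Out_not_authentic = Symbol_5) %>%
  select(matches("cont|symbol|cred|integ|out|adskep|id|qpq")) %>%
  mutate_each(funs(as.numeric), -ID) %>%
  mutate(authenticity = rowMeans(select(., matches("cont|Symbol|cred|Integ")))) %>%
  mutate(condition = factor(.$qpq, labels = c("qpq", "bonus"), exclude = NULL)) %>%
  select(-qpq)

demos <- read.csv("Spring 2015 Demos UPPER 3-8.csv", stringsAsFactors = FALSE) %>%
  distinct(ID)

alldata <- left_join(qualtrics, demos, by = "ID")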

Running some jobs on ACISS

If you want to catch up, check out this general post on ACISS, and the first post on how we’re getting set up to run 10,000 bootstrap iterations on 6 different SEM models.

Code walkthrough

I’ll walk through each bit of this script and how it helps us run jobs across multiple nodes on ACISS. Note that once we have a piece of code like this, we will run it on ACISS directly from an R console using source('this_R_script.r').
In this specific example, we need to do something (bootstrap confidence intervals) for each of 6 different models. Our strategy will be to write a function that takes a single argument that is a character string with a lavaan model specification and that will return the bootstrapped confidence intervals. The way the BatchJobs package works is that it takes a function like the one I just described and a vector of the arguments to pass that function. It then requests a node (which is like a separate computer — see the first post) for each argument in the list, and runs that function with that one argument on that node.

So to start with, we need to specify each of the six models:

library(BatchJobs) #load batchjobs first...

uneq.struc.model.ext <-"
ex_T1 =~ c(1, 1, 1, 1)*extr_T1_P1 + c(L2, L2, L2, L2)*extr_T1_P2 + c(L3, L3, L3, L3)*extr_T1_P3 
ex_T2 =~ c(1, 1, 1, 1)*extr_T2_P1 + c(L2, L2, L2, L2)*extr_T2_P2 + c(L3, L3, L3, L3)*extr_T2_P3
ex_T3 =~ c(1, 1, 1, 1)*extr_T3_P1 + c(L2, L2, L2, L2)*extr_T3_P2 + c(L3, L3, L3, L3)*extr_T3_P3
ex_T4 =~ c(1, 1, 1, 1)*extr_T4_P1 + c(L2, L2, L2, L2)*extr_T4_P2 + c(L3, L3, L3, L3)*extr_T4_P3

extr_T1_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T1_P1
extr_T2_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T2_P1
extr_T3_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T3_P1
extr_T4_P1 ~~c(v_P1, v_P1, v_P1, v_P1)*extr_T4_P1
... # truncated for readability
"

uneq.struc.model.agree <-"..." 
uneq.struc.model.consc <-"..." 
uneq.struc.model.neur <-"..." 
uneq.struc.model.open <-"..." 
uneq.struc.model.HP <-"..." 

I’ve removed most of the text from the model specification since that’s not the focus here. We now want to put these models into a vector.

starts<-c(
  uneq.struc.model.ext, 
  uneq.struc.model.agree,
  uneq.struc.model.consc,
  uneq.struc.model.neur,
  uneq.struc.model.open,
  uneq.struc.model.HP)

Now we write the function that will take, one at a time, each of these six elements of the vector starts.

CIfunc <-function(x) {
  # Read in the data
  bfibfas_LNT_nat <- read.csv("bfibfas_LNT_nat.csv")
  # Load the SEM package called lavaan
  library(lavaan)

  #Fit the model specified by `x`
  fit.uneq.struc.model <- sem(x, data=bfibfas_LNT_nat, group = "agegroup", missing="ml")
  
  #Use the model fit above to bootstrap parameters 10,000 times.
  out <- bootstrapLavaan(fit.uneq.struc.model, R = 10000, FUN = function(y) {
    standardizedSolution(y)$est }, verbose = TRUE, parallel = "snow", ncpus = 12)
  
  # Transform the bootstrap output into nice 95% CIs for each parameter
  table.ci <- lapply(as.data.frame(out), quantile, probs=c(0.025, 0.975))

  # Return a list of both the fit model, and CIs
  list(model = fit.uneq.struc.model, cis=table.ci)
}

We now have all the pieces we need to create and run jobs using BatchJobs. Note well the comments in the above code, and in particular the parallel = "snow" and ncpus = 12 arguments in the call to bootstrapLavaan: they tell the function to spread the 10,000 iterations across all 12 of the node's processors using the snow backend. Any function you write for BatchJobs should, ideally, be parallelized in some way. Thankfully, that is easy to do with lavaan, as well as with the basic boot package.

Note also that we load the data within the function. Basically, anything you need inside the function has to be specified there. So we’ve saved our data file in the directory where we make our registry, and the function loads it each time it’s run.

To understand what happens next, it is important to know that BatchJobs uses what it calls a registry to keep track of all the info you need to give each job, plus all the output. Basically, you ask BatchJobs to create a new directory structure for this project (using id = "..." below), and tell it which packages it needs to make accessible to the function you’ve written (lavaan in our case).

After it creates the registry, you ask it to map the elements of the starts vector to the function. This sets up the list of jobs it will submit to nodes on ACISS.

# Create the registry. Note, the id should not contain spaces or weird characters!
reg <- makeRegistry(id = "CIs_Stability_Models", packages = "lavaan")

#In the new registry, map the function `CIfunc` to the elements of `starts`
ids <- batchMap(reg, CIfunc, starts)

#Now that we created them, we can submit all the jobs in `reg` 
finished <- submitJobs(reg, resources = list(nodes=1, ppn=12))

When we submitJobs, we can specify that each job will get 1 node, and that each node will have 12 processors (resources = list(nodes=1, ppn=12)).

If we want to check the status of the jobs we can use showStatus(reg). And if, as will likely happen, we get kicked out of our session, we can log back in and run the following R commands:

library(BatchJobs)
reg <- loadRegistry("CIs_Stability_Models-files")
showStatus(reg)

If your registry was saved in a different location, you’d substitute out “CIs_Stability_Models-files” with “Your_Reg_Name-files”, or whatever the directory is actually called.

You can also check on your jobs from the command line using qstat with the -u flag and your user name. If you don’t see any output, it means you’ve entered the wrong user name, or your jobs are done. Below is an example of what Cory’s jobs look like when they’re running:


[ccostell@hn1 ~]$ qstat -u ccostell

hn1: 
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
1058081.hn1             ccostell    generic  CIs_Stability_Mo  22781     1      1    --   24:00:00 R  01:55:34
1058082.hn1             ccostell    generic  CIs_Stability_Mo  22792     1      1    --   24:00:00 R  01:55:34
1058083.hn1             ccostell    generic  CIs_Stability_Mo    301     1      1    --   24:00:00 R  01:55:34
1058084.hn1             ccostell    generic  CIs_Stability_Mo    317     1      1    --   24:00:00 R  01:55:33
1058085.hn1             ccostell    generic  CIs_Stability_Mo  12613     1      1    --   24:00:00 R  01:55:33
1058086.hn1             ccostell    generic  CIs_Stability_Mo  12624     1      1    --   24:00:00 R  01:55:32

Once your jobs have finished (use showStatus(reg) to ensure they finished without error or premature termination), you can collect the results and save them to an RData file that you’ll want to transfer to your own computer and play with there:

reg <- loadRegistry("CIs_Stability_Models-files")
showStatus(reg)

reduced_CI_results<-reduceResultsList(reg, fun = function(job, res) res, progressbar = T)

save(reduced_CI_results,file='reduced_CI_results.RData')

A minor problem we encountered

When we first ran this code, we came back the next day to find that none of the jobs had finished (but they hadn’t thrown errors either). We had accidentally requested nodes of the ‘short’ type, which only let you run on them for 4 hours before they kill whatever you’re doing. We had to edit our simple.tmpl to include the line #PBS -q generic so that we were requesting a generic node, which lets things run for up to 24 hours.
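For reference, here is a rough sketch of what a Torque template like simple.tmpl can look like with that line added. The brew-style placeholders (job.name, log.file, rscript, and the resources list) follow the example templates that ship with BatchJobs, so treat this as an illustration rather than a copy of our actual file:

#!/bin/bash
## Sketch of a BatchJobs Torque template -- placeholder names follow the package's examples
#PBS -N <%= job.name %>
#PBS -o <%= log.file %>
#PBS -j oe
#PBS -l nodes=<%= resources$nodes %>:ppn=<%= resources$ppn %>
## Request the generic queue (24-hour limit) instead of the default short nodes
#PBS -q generic
## Run the R script that BatchJobs generates for this job
R CMD BATCH --no-save --no-restore "<%= rscript %>" /dev/stdout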

Back in the deep end with ACISS

Cory wanted to do an unassailable number of bootstrap replications (10,000) to get confidence intervals for six different models he ran using lavaan. Doing even 500 replications on his own computer took, well, too long. So last week we started to set up a script to do all of this on ACISS, following along with a blog post I wrote some time ago. We discovered a few details that were left out of that original post, so I’ll document our process in detail below:

Getting started with ACISS

First, make sure you apply for a new account.

Once you’ve logged on, if you need to do any work at all (running a shell script, installing an R package), you need to request an interactive job from the sun grid engine — the node you log in to has basically no processing capacity at all.

As an example, let’s install the packages we’ll need later on. Below, I request an interactive (-I) job on a generic node, and then load whatever the default version of R is (see module avail for a list of available software and versions).

[flournoy@hn1 ~]$ qsub -q generic -I
qsub: waiting for job 1048662.hn1 to start
qsub: job 1048662.hn1 ready

[flournoy@cn91 ~]$ module load R

Note the hostname change from hn1 to cn91. Now I can run R and install packages. Which brings up the second obstacle we came across: ACISS doesn’t like making secure HTTPS requests (e.g., to https://cloud.r-project.org/). So when you want to run install.packages, you have to either choose an http mirror when asked or, better yet, specify the mirror in the command itself. We’ll use the CRAN mirror up at OSU.

[flournoy@cn91 ~]$  R

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
...

> install.packages(c('lavaan', 'BatchJobs'), repo='http://ftp.osuosl.org/pub/cran/', Ncpus=12)
Installing packages into ‘/ibrix/users/home5/flournoy/R/x86_64-unknown-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
...

In the above install.packages command, I ask for both lavaan and BatchJobs to be installed, specify the mirror to use in the repo option, and since we’re getting used to parallelization, I also tell it to use up to 12 cpus (Ncpus) if it needs to compile anything.

Cory’s Bootstrap Strategy

For simplicity, we decided to distribute all the work for bootstrapping using one node per model. That means one node will do all 10,000 bootstraps for that model. Luckily, bootstrapping in R itself is easy to make parallel, so the job will run on all the node’s 12 processors. So we’ll get a speedup factor of 6*12=72, which is pretty nice, and we gain the added benefit of running this on a remote server.

If we wanted to get more speedup, we could potentially split bootstraps within model type across nodes, and recombine them later. That is, we could do 5,000 bootstrap iterations for model 1 on node 1, then 5,000 more for model 1 on node 2, and then recombine them later. For now, we’ll focus on the simpler version.

Next steps

This week we’re going to finalize the script that uses BatchJobs to schedule these things on ACISS’s Sun Grid Engine, and collect the results. We’ll also play around with using R Studio with a remote R console (fingers crossed).

R Markdown Resources

Today we talked about the what, why, and how of R markdown. There are a lot of good tutorials out there (here and here), but here are a few quick links for your reference, and a brief intro.

R markdown is a way of integrating the text of a document with R code that is run each time you compile the document. Markdown is the format of the document text and, like HTML (but way simpler), lets you mark some text as headings, regular paragraphs, bullet lists, and more. In R markdown you can also add pieces of R code (called ‘chunks’) that are actually run, with any output (plots, tables, or raw R output) written inline with the text. By default, the code itself is also echoed into the document. You can compile R markdown documents to HTML, PDF, and Word documents. It will change your life.

An example of the raw R markdown document and what it looks like when compiled:
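As a stand-in, here is a minimal sketch of the raw source (the title and the built-in cars dataset are just placeholders):

---
title: "My first R markdown document"
output: html_document
---

# A heading

Some regular paragraph text, plus a bullet list:

* one point
* another point

```{r}
# this chunk runs when the document is compiled; the code is echoed
# and its output (the summary table and the plot) appears right below it
summary(cars)
plot(cars)
```

Hitting Knit in RStudio (or calling rmarkdown::render on the file) turns this into an HTML page with the heading, the text, the echoed code, and the output all in one document.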

Intro to R special topics: Data!

If you started to walk through Jenny Bryan’s course last week, you should continue working through that material. Once you’ve completed cm001–cm003, move on to cm004 (check out the link to “Basic care and feeding of data in R”, but don’t bother with the homework link unless you’re getting into github).

Once you’ve finished cm004, we can dig into ggplot2 and R markdown.

If you need help with absolutely any aspect of this stuff at all, just holler, put your hand up, dance about, or otherwise make yourself known—our goal is to eliminate all barriers separating you from the magical feeling of doing your stats in R.

Intro to R special topics: Super basic intro

We’re going to be making use of the fantastic course materials from Jenny Bryan’s UBC stats course. On Tuesday, 12 April, we’ll be walking through day 1, possibly day 2, and almost maybe day 3. This is going to be very self-paced, with lots of support from more experienced R users, so come prepared to work.

We’re thinking it’s going to be like doing homework during class time — there won’t be any lecture, but click through the lecture slides for each class beforehand. Whatever we don’t get to this week, we’ll continue to help you step through. Keep in mind we’re skipping the stuff related to git, but if you’re interested, some of us would be happy to help you get set up with it.

If you want to get help setting this stuff up on your computer, bring it. Otherwise, R and R Studio are installed on lab computers, and if you use those, just make sure you have a way to save your files someplace you can access them later (like google drive).

Here’s a tentative outline of the first three modules:

After we get through this stuff, you’ll be able to open R Studio, write some stuff in it, execute it, and compile it into a nice looking html or pdf document. We’ll then move on to data!

Psych classes that already use R

What would happen if the psych department only taught stats in R? What would change? Well, here’s a list of methods classes that are already taught in R:

  • SEM – R with the lavaan package
  • MLM – R with lme4 and nlme
  • Grad Stats – both R and SPSS
  • Meta-analysis – R with metafor
  • Research methods – R demos
  • Undergrad research methods (303) – TAs can choose to use R

Final Winter R Club Meeting

Tomorrow will be the last meeting of the term.

Remember that all those enrolled for credit will need to write up a final “What I Learned in R Club” statement.

This will also be a good opportunity for deciding on the direction we take next term. I have some ideas I’d like to propose and discuss. They all revolve around the basic idea that there is a high demand for instruction in some basic tasks that are easy in R and hard (or worse) otherwise.

How can we make R Club the best resource it can be? I think we can start by identifying core skills that people aren’t learning other places, and that are doable easily and efficiently in R; and by identifying barriers to using R.

Here’s a list I’ve been compiling to get us started:

  • R Syntax, or “Wtf is the dollar sign for?”
  • Programming basics
  • Cleaning, Joining, Reshaping
    • e.g., dwell time ratings with epochs
  • Overcoming the spreadsheet
    • Qualtrics data checking – how to solve mis-coding
    • Finding mistakes in factors etc (using dplyr)
  • Writing reproducible analyses
  • GGlorious Plotting
    • EDA and visual data checking
  • Document Rendering (pdf, html, docx on Windows, OS X, Linux)
  • Tables (correlation matrices with stat tests)

Fun methods:

  • MLM
  • Topic modeling
  • Social network analysis
  • Big data techniques (e.g., scraping with rvest)
  • CFA

Some good coursework resources:

  • http://stat545-ubc.github.io/
  • http://simplystatistics.org/2016/02/12/not-so-standard-deviations-episode-9-spreadsheet-drama/

No R Club today, but feel free to meet anyway!

Sorry for the last minute notice, but there will be no official R Club Today (3/1/16). However, if folks would still like to meet the space will be open!

See you next week!


Big Data Repository

For those interested, some researchers at Purdue are creating a repository for publicly available data. It sounds like they’re pretty early in the process, but there might be opportunities for collaboration (or just to access the data on your own).

The full announcement (which I got via the SPSP listserv) is below:

****************

We are a research team consisting of two psychologists and two computer scientists at Purdue University and University of Pennsylvania, working on a “big data research infrastructure” project called HUBALIVE.

Our vision is to help researchers interested in human behavior observations to collect and analyze their data in a way that is 1) bigger, 2) easier, 3) faster, and 4) less expensive. The HUBALIVE project aims to allow researchers in social and organizational sciences to access a large volume of video, image, and text data that are publicly available around the world (e.g., 70,000 live video streams with associated metadata; for more info, visit: www.hubalive.org). To this end, our online data repository and accompanying analytic tools called “HUBALIVE” will be made available free-of-charge for research purposes.

At this juncture, we are interested in gauging the initial level of interest in (free) access to such data in the field of social and organizational sciences. We are also interested in identifying scholars who might be interested in collaborating with us for various research opportunities. If you are one of them, please visit the following link to take a 1-minute survey and register for more information and updates: purdue.qualtrics.com/jfe/form/SV_43lnH4OGa5BQTRz

Please let us know if you have any questions or thoughts to share: hubalive@purdue.edu.

We would very much appreciate you forwarding this to those who may be interested.

Thank you,

Sang Eun Woo
Assistant Professor
Department of Psychological Sciences
Purdue University
www.purdue.edu/hhs/psy/about/directory/faculty/…

Yung-Hsiang Lu
Associate Professor
Electrical and Computer Engineering
Purdue University
engineering.purdue.edu/HELPS

Lyle H. Ungar
Professor
Department of Computer and Information Science
University of Pennsylvania
www.cis.upenn.edu/~ungar

Louis Tay
Assistant Professor
Department of Psychological Sciences
Purdue University
www.purdue.edu/hhs/psy/about/directory/faculty/…