Tagged: dplyr

May 18, 2016

Bringing in Qualtrics (and other data)

While a lot of us have grown comfortable using Excel to clean and manipulate our data, there is a growing trend toward transparency and reproduciblity that make it difficult to keep going down that road. It’s also just too easy to make a mistake with Excel. For example, someone I was helping out with an email campaign recently sent me an Excel file that was sorted only by one column. I just trusted them, rather than properly checking the file, and sent the email out and over 400 people got an email with the wrong name in the greeting. That was awkward and frustrating. For me, I love being able to look at my code, see where I brought in the raw data, and see all of the manipulations I did to clean it up. Excel doesn’t do that for me. Thanks to Hadley Wickham’s “dplyr” package it is surprisingly easy to manipulate data in R. I recommend printing out RStudio’s “Data Wrangling Cheat Sheet” and hanging it up somewhere visible if you do regularly manipulate data in R. Here is an example of some data manipulation that I recently did in R.

Step 1. Setup

I’ve set my working directory so that R knows what folder to retrieve the raw data files from. Alternatively, you can give R the whole file name including folders when you read in the CSV file and not bother setting a working directory.

setwd(“~/Research/PPAuthenticity/studies/Study1”)

I’ve loaded the “dplyr” package, which I installed earlier using the command install.packages(“dplyr”). One problem with dplyr is that it uses some function names that mean something different in base R or in other packages. I’ve run into a lot of errors and found that the best workaround is to simply tell R that when I say “select”, what I mean is use select from the dplyr package.

library(dplyr)
filter <- dplyr::filter
select <- dplyr::select

Step 2. Bring in Qualtrics data

Here are a couple of rules of thumb that I use:

Only create one object per data file. It is really confusing to come back 6 months later and see that you have 15 objects that are different versions of the same dataset. I like to see only one version. Dplyr makes it easy to have only one object.
I almost never refer to rows or columns by number. Column numbers and row numbers change every time you tweak the dataset. For best transparency, use some name or rule-based method to remove data or tweak it.

I’m going to read in the data as a CSV file. I recommend against trying to read it in as an Excel file. There are several packages that supposedly read Excel, but they don’t seem to have consistent performance, and there is no guarantee that your code will work later if you do read it in as Excel.

qualtrics <- read.csv("PPAuthenticity2SPAPR15.csv", stringsAsFactors = FALSE) %>%

Notice that I used the “stringsAsFactors = FALSE” argument. By default, R will try to turn everything into factors, which is generally not what I want at all. I used the pipe operator “%>%” to let R know that I’m not done. Now I’m going to make some changes to this data. The pipe operator comes included with dplyr.

In our lab, we have all of our grad students run through the surveys before the real participants do to test it. I want to get rid of the grad student responses, so I filter out all observations that don’t have a student ID that starts with “95”. The grad students are supposed to put “Test” in this field, though John for some reason puts “666”. Filtering out all student ID’s that don’t start with “95” takes care of all of these test observations. Again, it ends with a pipe so that R knows there is more. Because I’m using pipes, I don’t even have to tell it what data I want to execute this command on. It already knows to look at the previous pipe row.

filter(grepl(“^95”, ID)) %>%

An English translation of this would be “In the row above, filter out all results where ‘ID’ doesn’t start with ’95’.” Grepl matches the standard expression I want, and filter removes those rows. In Qualtrics there is a second header row with really long, unwieldy descriptions. This will remove that row too. If all you wanted to do was remove that row of labels, you could simply remove it by position when you bring it in. Normally I don’t like to refer to rows by number, but I don’t think it does any harm to only remove the first row:

read.csv(“yourfile.csv”)[-1, ] # this is alternative code that I’m not using for my analysis

I try to make good, proper names for my variables in Qualtrics, but they always seem to get messed up. I inevitably end up renaming some of them:

rename(adskep_2 = adskep_10,
adskep_3 = adskep_11,
adskep_4 = adskep_12,
adskep_5 = adskep_13,
adskep_6 = Q28_14,
adskep_7 = Q28_15,
adskep_8 = Q28_16,
adskep_9 = Q28_17,
Out_not_authentic = Symbol_5) %>%

Note that the name to the left of the equal sign is the new name. The name to the right is the messed up name that Qualtrics gave me.

Now, I’m telling R only to keep variables that have the stems I want:

In plain English, this would say “keep only the columns that have ‘cont’ or ‘symbol’ or ‘cred’ or ‘integ’ or ‘out’ or ‘adskep’ or ‘id’ or ‘qpq’ as part of their name.”

All of my variables were read in as character strings, so I will need to transform relevant columns to numeric format:

mutate_each(funs(as.numeric), -ID) %>%

Using the “mutate_each” command from the dplyr package, I’ve transformed every column except for “ID” to numeric format.

I need a composite variable that is the mean of all variables from the four dimensions of my scale. You can use “mutate” to create a new variable.

mutate(authenticity = rowMeans(select(.,matches(“cont|Symbol|cred|Integ”)))) %>%

In Qualtrics, I ran two conditions. I need a factor variable that tells me which condition the person was in. Right now I have two variables representing the two conditions. Each is a string of 1’s and NA’s. I only need one of these variables to make my new variable since condition “qpq” and “bonus” are mutually exclusive.

mutate(condition = factor(.$qpq, labels=c(“qpq”,”bonus”), exclude=NULL)) %>%
select(-qpq)

I created a new variable called “condition”, which is a factored version of “qpq”. When you create a factor in R, you can use the “exclude=NULL” argument to tell it that you want “NA” to be a factor level, rather than just representing missing data. Next, I used “select” to drop the “qpq” variable that has now become obsolete. Since I didn’t include a pipe operator at the end of my last command, all the code will now run and return my cleaned up data.

Step 3. Bring in a second dataset and merge it with the first

In our lab, we have students answer all of the demographic questions separately. We end up having to merge the data. This is ridiculously easy to do in R:

demos <- read.csv("Spring 2015 Demos UPPER 3-8.csv", stringsAsFactors = FALSE) %>%
distinct(ID)
alldata <- left_join(qualtrics,demos, by="ID")

I’ve brought in the second dataset. People often end up filling out the demos multiple times for whatever reason. I don’t want duplicates because I will end up with duplicate data after I do the join. I have used the “distinct” function to get rid of redundant student ID’s. Then I used “left_join” to keep everything in my dataset on the left “qualtrics,” and to tack on data from my other dataset, “demos” wherever there is a match by the “ID” column, which both datasets have in this case. Again, it’s pretty easy to join two datasets.

The output of this process is three objects:

qualtrics
demos
alldata

There is no data1, data2, data3, etc. Very clean, very transparent, and very easy to look back and see exactly what I did.

Here is all of the code in one place:

April 12, 2016

reformatting data to wide

Here’s some thing we were working on in R Club today.

This example shows how to reformat data from (sort of) long to wide. We’ve got pre and post measurements for each subject, some subject-level variables (called dyad vars in the example), and two variables that vary by pre and post. These are the outcome of interest (DV) and a covariate (the length of the video). The data is “sort of” long in that pre and post are each on their own line, but the type of measure (DV or video length) is still each in its own column. To make it fully wide, where there’s only one line for each subject, first we need to make it fully long, and then spread it all out. I’m using dplyr and tidyr here since they’re such lovely tools, but of course there’s more than one way to go about this.

Enjoy!

Here’s the whole thing:

data <- data.frame(subj=sort(c(1:10, 1:10)), cond=rep(c("pre", "post"), 10), DV=rnorm(20), video=rnorm(20), dyadvar1=sort(c(1:10, 1:10)), dyadvar2=sort(c(1:10, 1:10)))

data

##    subj cond          DV      video dyadvar1 dyadvar2
## 1     1  pre -0.09424282 -0.5553683        1        1
## 2     1 post  1.50787647  0.1086909        1        1
## 3     2  pre  1.63471884 -0.5022278        2        2
## 4     2 post  0.06910946 -0.1067023        2        2
## 5     3  pre  0.60311557 -0.3074021        3        3
## 6     3 post  1.34918285  1.0210010        3        3
## 7     4  pre -0.52288040  0.3636248        4        4
## 8     4 post -1.14408452  0.9635112        4        4
## 9     5  pre -1.67596739 -0.5665085        5        5
## 10    5 post  0.45287833  0.1465894        5        5
## 11    6  pre  0.96739333 -0.2250043        6        6
## 12    6 post  0.20794296  1.0133354        6        6
## 13    7  pre  1.25381345 -1.0679324        7        7
## 14    7 post  0.73957481 -1.3426321        7        7
## 15    8  pre -1.26477564 -0.4130490        8        8
## 16    8 post -1.80489529 -0.5602744        8        8
## 17    9  pre -0.85829518  0.8735288        9        9
## 18    9 post  0.52721178  0.1642551        9        9
## 19   10  pre  0.08717665 -0.3085623       10       10
## 20   10 post -1.28951524 -1.7575259       10       10

library(tidyr); library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

wide.data <- data %>%
  gather(key="measure", value="value", DV:video) %>%
  unite(col=key, cond, measure)  %>%
  spread(key=key, value=value)

Here’s each step:

library(knitr)
data %>%
  gather(key="measure", value="value", DV:video)

##    subj cond dyadvar1 dyadvar2 measure       value
## 1     1  pre        1        1      DV -0.09424282
## 2     1 post        1        1      DV  1.50787647
## 3     2  pre        2        2      DV  1.63471884
## 4     2 post        2        2      DV  0.06910946
## 5     3  pre        3        3      DV  0.60311557
## 6     3 post        3        3      DV  1.34918285
## 7     4  pre        4        4      DV -0.52288040
## 8     4 post        4        4      DV -1.14408452
## 9     5  pre        5        5      DV -1.67596739
## 10    5 post        5        5      DV  0.45287833
## 11    6  pre        6        6      DV  0.96739333
## 12    6 post        6        6      DV  0.20794296
## 13    7  pre        7        7      DV  1.25381345
## 14    7 post        7        7      DV  0.73957481
## 15    8  pre        8        8      DV -1.26477564
## 16    8 post        8        8      DV -1.80489529
## 17    9  pre        9        9      DV -0.85829518
## 18    9 post        9        9      DV  0.52721178
## 19   10  pre       10       10      DV  0.08717665
## 20   10 post       10       10      DV -1.28951524
## 21    1  pre        1        1   video -0.55536830
## 22    1 post        1        1   video  0.10869087
## 23    2  pre        2        2   video -0.50222782
## 24    2 post        2        2   video -0.10670226
## 25    3  pre        3        3   video -0.30740210
## 26    3 post        3        3   video  1.02100096
## 27    4  pre        4        4   video  0.36362482
## 28    4 post        4        4   video  0.96351116
## 29    5  pre        5        5   video -0.56650848
## 30    5 post        5        5   video  0.14658942
## 31    6  pre        6        6   video -0.22500427
## 32    6 post        6        6   video  1.01333537
## 33    7  pre        7        7   video -1.06793244
## 34    7 post        7        7   video -1.34263213
## 35    8  pre        8        8   video -0.41304900
## 36    8 post        8        8   video -0.56027441
## 37    9  pre        9        9   video  0.87352882
## 38    9 post        9        9   video  0.16425512
## 39   10  pre       10       10   video -0.30856231
## 40   10 post       10       10   video -1.75752594

data %>%
  gather(key="measure", value="value", DV:video) %>%
  unite(col=key, cond, measure)

##    subj        key dyadvar1 dyadvar2       value
## 1     1     pre_DV        1        1 -0.09424282
## 2     1    post_DV        1        1  1.50787647
## 3     2     pre_DV        2        2  1.63471884
## 4     2    post_DV        2        2  0.06910946
## 5     3     pre_DV        3        3  0.60311557
## 6     3    post_DV        3        3  1.34918285
## 7     4     pre_DV        4        4 -0.52288040
## 8     4    post_DV        4        4 -1.14408452
## 9     5     pre_DV        5        5 -1.67596739
## 10    5    post_DV        5        5  0.45287833
## 11    6     pre_DV        6        6  0.96739333
## 12    6    post_DV        6        6  0.20794296
## 13    7     pre_DV        7        7  1.25381345
## 14    7    post_DV        7        7  0.73957481
## 15    8     pre_DV        8        8 -1.26477564
## 16    8    post_DV        8        8 -1.80489529
## 17    9     pre_DV        9        9 -0.85829518
## 18    9    post_DV        9        9  0.52721178
## 19   10     pre_DV       10       10  0.08717665
## 20   10    post_DV       10       10 -1.28951524
## 21    1  pre_video        1        1 -0.55536830
## 22    1 post_video        1        1  0.10869087
## 23    2  pre_video        2        2 -0.50222782
## 24    2 post_video        2        2 -0.10670226
## 25    3  pre_video        3        3 -0.30740210
## 26    3 post_video        3        3  1.02100096
## 27    4  pre_video        4        4  0.36362482
## 28    4 post_video        4        4  0.96351116
## 29    5  pre_video        5        5 -0.56650848
## 30    5 post_video        5        5  0.14658942
## 31    6  pre_video        6        6 -0.22500427
## 32    6 post_video        6        6  1.01333537
## 33    7  pre_video        7        7 -1.06793244
## 34    7 post_video        7        7 -1.34263213
## 35    8  pre_video        8        8 -0.41304900
## 36    8 post_video        8        8 -0.56027441
## 37    9  pre_video        9        9  0.87352882
## 38    9 post_video        9        9  0.16425512
## 39   10  pre_video       10       10 -0.30856231
## 40   10 post_video       10       10 -1.75752594

data %>%
  gather(key="measure", value="value", DV:video) %>%
  unite(col=key, cond, measure)  %>%
  spread(key=key, value=value)

##    subj dyadvar1 dyadvar2     post_DV post_video      pre_DV  pre_video
## 1     1        1        1  1.50787647  0.1086909 -0.09424282 -0.5553683
## 2     2        2        2  0.06910946 -0.1067023  1.63471884 -0.5022278
## 3     3        3        3  1.34918285  1.0210010  0.60311557 -0.3074021
## 4     4        4        4 -1.14408452  0.9635112 -0.52288040  0.3636248
## 5     5        5        5  0.45287833  0.1465894 -1.67596739 -0.5665085
## 6     6        6        6  0.20794296  1.0133354  0.96739333 -0.2250043
## 7     7        7        7  0.73957481 -1.3426321  1.25381345 -1.0679324
## 8     8        8        8 -1.80489529 -0.5602744 -1.26477564 -0.4130490
## 9     9        9        9  0.52721178  0.1642551 -0.85829518  0.8735288
## 10   10       10       10 -1.28951524 -1.7575259  0.08717665 -0.3085623

November 3, 2015

Using dplyr to batch analyses

Load our packages
Simulate Data
Use do for batching analysis

Get the source code!

Load our packages

If they aren’t installed, just use the install.packages command provided below.

library(MASS)
#install.packages('MASS')
library(dplyr)
#install.packages('dplyr')
library(tidyr)
#install.packages('tidyr')
library(lme4)
#install.packages('lme4')
library(ggplot2)
#install.packages('ggplot2')
library(cowplot)
#install.packages('cowplot')

Simulate Data

#set seed so this is the same every time
set.seed(1337)

I’m making up data here that is supposed to reflect Big Five personality change over the lifespan. Possible scores range from 0-100.

First, I set up population mean intercepts (at age 35) and slopes.

popLineParamMeans <- data.frame(
                var=c('E','A','C','N','O'),
                int=c(60, 65, 55, 45, 50),
                slope=c(1, 2, 3, -1, 0))

head(popLineParamMeans)

##   var int slope
## 1   E  60     1
## 2   A  65     2
## 3   C  55     3
## 4   N  45    -1
## 5   O  50     0

Next, I set a variable to a covariance matrix for the slopes and intercepts. Here, I give the intercepts a standard deviation of $\sqrt{10}$, slopes a SD of 1, and a negative covariance between them of -.5 (correlation of about -.16).

lineSigmas <- matrix(c(10, -.5, -.5, 1), nrow=2)

print(lineSigmas)

##      [,1] [,2]
## [1,] 10.0 -0.5
## [2,] -0.5  1.0

print(cov2cor(lineSigmas))

##            [,1]       [,2]
## [1,]  1.0000000 -0.1581139
## [2,] -0.1581139  1.0000000

Now I want to draw each person’s individual slope and intercept from a multivariate normal distribution based on the parameters above (for each BFI variable).

#set a variable for the number of people
# so that it's easy to change later if I want.
N <- 100

indivSlopes <- popLineParamMeans %>% # use the data frame from above
    group_by(var) %>% # do this for each variable separately
    do({
        someDat <- data.frame(
                #N draws
                #for 2 parameters with means defined here
                #and covariance matrix defined here
                      mvrnorm(n=N, 
                      mu=c(.$int, .$slope),
                      Sigma=lineSigmas))
        #give each row an ID
        someDat$id <- 1:N
        #return the data frame, 'cause that's what `do` expects
        someDat
    })

#give it some descriptive names
names(indivSlopes) <- c('bfi', 'intercept', 'slope', 'id')

head(indivSlopes)

## Source: local data frame [6 x 4]
## 
##      bfi intercept     slope    id
##   (fctr)     (dbl)     (dbl) (int)
## 1      A  64.37993 1.8269765     1
## 2      A  69.59519 2.1257374     2
## 3      A  65.94703 0.5928178     3
## 4      A  59.86194 2.1275653     4
## 5      A  67.18973 2.0810316     5
## 6      A  58.58967 3.1965359     6

So now we have a data frame that has a row for every person, for every BFI variable, and a column for that person’s intercept and slope on that variable.

Now we can generate their observed values at every time point. I’ll do 4 waves for now. We will generate them using a basic linear equation and throw in some error. Also, in this step we’ll randomly draw their starting age, and then center everything at age 35 (just because this mirrors the data I’m familiar with :]).

The equation is just $y= \beta_0 + \beta_1 * age + \epsilon$, where B0 is the intercept, B1 is the slope, age is age, and epsilon is ther normally distributed error term.

timeSeriesData <- indivSlopes %>% #use the slope-per-person-per-var df from above
    group_by(id, bfi) %>% #do something for every person, for every bfi variable
    do({
        age <- round(runif(1, 18, 50)) #randomly draw starting age
        ages <- age + 0:3 #generate the ages for the all 4 waves
        #generate 4 scores for 4 ages with random, uncorrelated error
        score <- .$intercept + .$slope * (ages-35) + rnorm(4, 0, 4) 
        #Return a data frame, 'cause that's what `do` expects.
        #Note that it always retains the grouping variables like `id` and
        #`bfi` even if we don't put them in this data frame.
        data.frame(age=ages-35,
               agew1=age-35,
               value=score)
    })


head(timeSeriesData)

## Source: local data frame [6 x 5]
## Groups: id, bfi [2]
## 
##      id    bfi   age agew1    value
##   (int) (fctr) (dbl) (dbl)    (dbl)
## 1     1      A     4     4 74.27345
## 2     1      A     5     4 59.96832
## 3     1      A     6     4 77.33757
## 4     1      A     7     4 85.42808
## 5     1      C     1     1 49.13083
## 6     1      C     2     1 63.09312

timeSeriesData %>%
    filter(bfi=='A') %>%
    ggplot(aes(x=age, y=value)) +
        geom_line(aes(group=id)) +
        geom_smooth()

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Now we have some data to work with, and you’ve learned about mvrnorm, and have a more complex example for how do can work.

Use `do` for batching analysis

Let’s pretend we don’t know the population parameters that generated our data, and want to estimate the proportion of variance due to individuals, compared to total variance.

We’ll use the multilevel modeling package lme4 to generate these statistics.

First, let’s figure out how to get the values out of lme4 we need for an ICC – that is, the variance in intercepts due to our grouping factor and total variance.

testMod <- lmer(value ~ 1 + (1|id), data = filter(timeSeriesData, bfi=='E'))
summary(testMod)

## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ 1 + (1 | id)
##    Data: filter(timeSeriesData, bfi == "E")
## 
## REML criterion at convergence: 2644.9
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.69346 -0.54994 -0.00728  0.58528  2.44704 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept) 173.25   13.162  
##  Residual              17.38    4.169  
## Number of obs: 400, groups:  id, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   59.360      1.333   44.54

VarCorr(testMod) #we can get SD for grouping factor from this

##  Groups   Name        Std.Dev.
##  id       (Intercept) 13.1624 
##  Residual              4.1686

sigma(testMod) #and residual SD from this

## [1] 4.16858

str(VarCorr(testMod))

## List of 1
##  $ id: num [1, 1] 173
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr "(Intercept)"
##   .. ..$ : chr "(Intercept)"
##   ..- attr(*, "stddev")= Named num 13.2
##   .. ..- attr(*, "names")= chr "(Intercept)"
##   ..- attr(*, "correlation")= num [1, 1] 1
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr "(Intercept)"
##   .. .. ..$ : chr "(Intercept)"
##  - attr(*, "sc")= num 4.17
##  - attr(*, "useSc")= logi TRUE
##  - attr(*, "class")= chr "VarCorr.merMod"

VarCorr(testMod)$id['(Intercept)', '(Intercept)']

## [1] 173.2495

VarCorr(testMod)$id['(Intercept)', '(Intercept)']^.5 # Looks like this gives us the variance

## [1] 13.16243

idVariance <- VarCorr(testMod)[['id']]['(Intercept)', '(Intercept)']
residVar <- sigma(testMod)^2 #square the SD to get variance

anICC <- idVariance/(idVariance+residVar)
anICC

## [1] 0.9088424

Now all we have to do is use group_by and do to do this to each subset of the data frame. Remember that, within do, the subset of your data is called ..

iccDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        amod <- lmer(value ~ 1 + (1|id), data=.)
        idVariance <- VarCorr(amod)[['id']]['(Intercept)', '(Intercept)']
        residVar <- sigma(amod)^2 #square the SD to get variance
        anICC <- idVariance/(idVariance+residVar)
        data.frame(icc=anICC)
    })

iccDF

## Source: local data frame [5 x 2]
## Groups: bfi [5]
## 
##      bfi       icc
##   (fctr)     (dbl)
## 1      A 0.9411433
## 2      C 0.9694278
## 3      E 0.9088424
## 4      N 0.9010527
## 5      O 0.8576107

Write an ICC function

We could also write a function that takes the names of the outcome variable column, grouping variable column, and the data, and returns an ICC.

There’s a little magic in here where we construct the formula for lmer using the text. Quick explanation: paste0 just sticks together strings, so paste0('This', 'Works') gives us ThisWorks. We can use variables in it too, so if we set thing <- 'nonesense' and do paste0(thing, 'Works'), we get nonesenseWorks.

theICC <- function(yName, groupName, theData){
    if(!require(lme4)){
            stop('missing lme4 package')
    }

    theFormula <- formula(paste0(yName, '~1+(1|', groupName, ')'))
    amod <- lmer(theFormula, theData)
    idVariance <- VarCorr(amod)[[groupName]]['(Intercept)', '(Intercept)']
    residVar <- sigma(amod)^2 #square the SD to get variance

    anICC <- idVariance/(idVariance+residVar)
    anICC
}

#Test the function!
theICC('value', 'id', filter(timeSeriesData, bfi=='E'))

## [1] 0.9088424

iccDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        anICC <- theICC(yName='value',
                groupName='id',
                theData=.)
        data.frame(icc=anICC)
    })

iccDF

## Source: local data frame [5 x 2]
## Groups: bfi [5]
## 
##      bfi       icc
##   (fctr)     (dbl)
## 1      A 0.9411433
## 2      C 0.9694278
## 3      E 0.9088424
## 4      N 0.9010527
## 5      O 0.8576107

Other batched analyses

You can use do in this way for anything, so long as you return a data frame. Let’s go ahead and estimate the growth parameters for each BFI variable.

Again, start by getting a handle on the object you get back from your statistic function:

testMod <- lmer(value ~ 1 + age + (1|age), data=filter(timeSeriesData, bfi=='E'))
summary(testMod)

## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ 1 + age + (1 | age)
##    Data: filter(timeSeriesData, bfi == "E")
## 
## REML criterion at convergence: 3051.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2027 -0.5829  0.0264  0.5890  4.1250 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  age      (Intercept)   0       0.00   
##  Residual             120      10.96   
## Number of obs: 400, groups:  age, 36
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 58.88174    0.54870  107.31
## age          0.85460    0.05619   15.21
## 
## Correlation of Fixed Effects:
##     (Intr)
## age -0.057

fixef(testMod) #looks like we might be able to just coerce this into a DF

## (Intercept)         age 
##  58.8817372   0.8546026

as.data.frame(fixef(testMod))

##             fixef(testMod)
## (Intercept)     58.8817372
## age              0.8546026

Looks like, so long as we keep track of which parameter is on which row, we’ll be able to just return the output of fixef coerced into a data.frame.

growthParamsDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        amod <- lmer(value ~ 1 + age + (1|age), data=.)
        fx <- as.data.frame(fixef(amod))
    })
growthParamsDF

## Source: local data frame [10 x 2]
## Groups: bfi [5]
## 
##       bfi fixef(amod)
##    (fctr)       (dbl)
## 1       A 65.48942693
## 2       A  1.93237366
## 3       C 56.15908452
## 4       C  3.02224221
## 5       E 58.88173717
## 6       E  0.85460255
## 7       N 44.86838252
## 8       N -0.81417670
## 9       O 49.45933488
## 10      O -0.08210474

Oops, gotta add the rownames.

growthParamsDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        amod <- lmer(value ~ 1 + age + (1|age), data=.)
        fx <- fixef(amod)
        fxDF <- as.data.frame(fx)
        fxDF$param <- gsub('\\(|\\)', '', rownames(fxDF)) # and remove '(' or ')'
        fxDF
    })
growthParamsDF

## Source: local data frame [10 x 3]
## Groups: bfi [5]
## 
##       bfi          fx     param
##    (fctr)       (dbl)     (chr)
## 1       A 65.48942693 Intercept
## 2       A  1.93237366       age
## 3       C 56.15908452 Intercept
## 4       C  3.02224221       age
## 5       E 58.88173717 Intercept
## 6       E  0.85460255       age
## 7       N 44.86838252 Intercept
## 8       N -0.81417670       age
## 9       O 49.45933488 Intercept
## 10      O -0.08210474       age

Presentation of your results

If we want, we can spread out the values and output a table.

growthParamsDF_wide <- growthParamsDF %>%
    spread(param, fx)
growthParamsDF_wide

## Source: local data frame [5 x 3]
## 
##      bfi         age Intercept
##   (fctr)       (dbl)     (dbl)
## 1      A  1.93237366  65.48943
## 2      C  3.02224221  56.15908
## 3      E  0.85460255  58.88174
## 4      N -0.81417670  44.86838
## 5      O -0.08210474  49.45933

library(knitr)
#Just a preview of how to make nice tables:

kable(growthParamsDF_wide)

bfi	age	Intercept
A	1.9323737	65.48943
C	3.0222422	56.15908
E	0.8546026	58.88174
N	-0.8141767	44.86838
O	-0.0821047	49.45933

January 20, 2015

611 Highlights in R (Part II)

The raw .R file is here and the Rmd file is here.

Brief ‘Hello’ to R Markdown

Let me first introduce you to R markdown and kitr. This format allows you to integrate your analyses with text that can be automatically translated to html, word, and pdf formats. I just produced a supplementary materials document using this and it probably saved me a ridiculous number of hours. It’s easy to get started:

Save your file with extension .Rmd.

Place something like the following in the very top of your file:

---
author: John Flournoy
title: 611 Highlights in R (Part II)
output: html_document
---

Write as you normally would, using Markdown syntax for formatting. For example:

# This is a big heading

1. this is
2. a numbered list

- this is a bulleted list
- With **bold** and *italic*  text

When you want to throw in some R code, place it inside an R chunk by wrapping it in ```{r} at the beginning and ``` at the end. When you do, it comes out something like this:

# From Rose's presentation:
# a toy data set for us to work with
n <- 100
gender <- as.factor(c(rep("M", n/2), rep("F", n/2)))
IQ <- rnorm(n, mean=100, sd=15)
degree <- rep(c("HS", "BA", "MS", "PhD"), n/4)
height <- as.numeric(gender)-2 + rnorm(n, mean=5.5, sd=1)
RT1 <- rchisq(n, 4)
RT2 <- rchisq(n, 4)
DV <- 50 - 5*(as.numeric(gender)-1) + .1*IQ + rnorm(n)

df <- data.frame(awesome=DV, gender=gender, IQ=IQ, degree=degree, height=height, RT1=RT1, RT2=RT2)
head(df)

##    awesome gender        IQ degree   height      RT1      RT2
## 1 52.89486      M  79.40263     HS 6.292706 1.761990 2.228117
## 2 57.04782      M  96.31289     BA 4.759042 2.236027 7.363253
## 3 55.59594      M 104.95331     MS 5.560292 5.159114 7.126281
## 4 55.44397      M  84.23347    PhD 5.354595 7.652900 2.836956
## 5 55.45191      M 107.38754     HS 4.523997 7.768904 4.504664
## 6 54.70468      M 110.07096     BA 6.139368 7.786462 7.508488

Notice that in 4, the we see the output of the command head(df). This is useful for outputting tables and plots, as we’ll see.

Here are more resources on

Streamlined Data Workflow with dplyr

Download the data here:

Make sure you have the package (library(dplyr)), and if you don’t, install it (install.packages('dplyr')). Here’s an intro to dplyr straight from the horse’s (Hadley’s) mouth.

By way of data manipulation that I’m actually using for a real project, this tutorial will introduce you to

%>%
- a function chaining operator
mutate and transmute
- adds new variables and preserves existing (transmute drops existing variables)
mutate_each
- adds new variables, applying functions in funs(...) to each column
group_by
- groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
summarise
- summarizes multiple values to a single value
left_join
- joins two data frames, returning all rows from x, and all columns from x and y
select
- keeps the variables/columns that you mention (see ?select for special functions that help you select variables)
filter
- returns rows with matching conditions (similar to selecting columns like: aDataFrame[variable == value,])
group_by
- groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
summarise
- summarizes multiple values to a single value
do
- use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list (e.g., run the same lm or t.test separately for each gender, country)

Continue reading →