November 3, 2015

Using dplyr to batch analyses

Load our packages
Simulate Data
Use do for batching analysis

Load our packages

If they aren’t installed, just use the install.packages command provided below.

library(MASS)
#install.packages('MASS')
library(dplyr)
#install.packages('dplyr')
library(tidyr)
#install.packages('tidyr')
library(lme4)
#install.packages('lme4')
library(ggplot2)
#install.packages('ggplot2')
library(cowplot)
#install.packages('cowplot')

Simulate Data

#set seed so this is the same every time
set.seed(1337)

I’m making up data here that is supposed to reflect Big Five personality change over the lifespan. Possible scores range from 0-100.

First, I set up population mean intercepts (at age 35) and slopes.

popLineParamMeans <- data.frame(
                var=c('E','A','C','N','O'),
                int=c(60, 65, 55, 45, 50),
                slope=c(1, 2, 3, -1, 0))

head(popLineParamMeans)

##   var int slope
## 1   E  60     1
## 2   A  65     2
## 3   C  55     3
## 4   N  45    -1
## 5   O  50     0

Next, I set a variable to a covariance matrix for the slopes and intercepts. Here, I give the intercepts a standard deviation of \(\sqrt{10}\), slopes a SD of 1, and a negative covariance between them of -.5 (correlation of about -.16).

lineSigmas <- matrix(c(10, -.5, -.5, 1), nrow=2)

print(lineSigmas)

##      [,1] [,2]
## [1,] 10.0 -0.5
## [2,] -0.5  1.0

print(cov2cor(lineSigmas))

##            [,1]       [,2]
## [1,]  1.0000000 -0.1581139
## [2,] -0.1581139  1.0000000

Now I want to draw each person’s individual slope and intercept from a multivariate normal distribution based on the parameters above (for each BFI variable).

#set a variable for the number of people
# so that it's easy to change later if I want.
N <- 100

indivSlopes <- popLineParamMeans %>% # use the data frame from above
    group_by(var) %>% # do this for each variable separately
    do({
        someDat <- data.frame(
                #N draws
                #for 2 parameters with means defined here
                #and covariance matrix defined here
                      mvrnorm(n=N, 
                      mu=c(.$int, .$slope),
                      Sigma=lineSigmas))
        #give each row an ID
        someDat$id <- 1:N
        #return the data frame, 'cause that's what `do` expects
        someDat
    })

#give it some descriptive names
names(indivSlopes) <- c('bfi', 'intercept', 'slope', 'id')

head(indivSlopes)

## Source: local data frame [6 x 4]
## 
##      bfi intercept     slope    id
##   (fctr)     (dbl)     (dbl) (int)
## 1      A  64.37993 1.8269765     1
## 2      A  69.59519 2.1257374     2
## 3      A  65.94703 0.5928178     3
## 4      A  59.86194 2.1275653     4
## 5      A  67.18973 2.0810316     5
## 6      A  58.58967 3.1965359     6

So now we have a data frame that has a row for every person, for every BFI variable, and a column for that person’s intercept and slope on that variable.

Now we can generate their observed values at every time point. I’ll do 4 waves for now. We will generate them using a basic linear equation and throw in some error. Also, in this step we’ll randomly draw their starting age, and then center everything at age 35 (just because this mirrors the data I’m familiar with :]).

The equation is just \(y= \beta_0 + \beta_1 * age + \epsilon\), where B0 is the intercept, B1 is the slope, age is age, and epsilon is ther normally distributed error term.

timeSeriesData <- indivSlopes %>% #use the slope-per-person-per-var df from above
    group_by(id, bfi) %>% #do something for every person, for every bfi variable
    do({
        age <- round(runif(1, 18, 50)) #randomly draw starting age
        ages <- age + 0:3 #generate the ages for the all 4 waves
        #generate 4 scores for 4 ages with random, uncorrelated error
        score <- .$intercept + .$slope * (ages-35) + rnorm(4, 0, 4) 
        #Return a data frame, 'cause that's what `do` expects.
        #Note that it always retains the grouping variables like `id` and
        #`bfi` even if we don't put them in this data frame.
        data.frame(age=ages-35,
               agew1=age-35,
               value=score)
    })


head(timeSeriesData)

## Source: local data frame [6 x 5]
## Groups: id, bfi [2]
## 
##      id    bfi   age agew1    value
##   (int) (fctr) (dbl) (dbl)    (dbl)
## 1     1      A     4     4 74.27345
## 2     1      A     5     4 59.96832
## 3     1      A     6     4 77.33757
## 4     1      A     7     4 85.42808
## 5     1      C     1     1 49.13083
## 6     1      C     2     1 63.09312

timeSeriesData %>%
    filter(bfi=='A') %>%
    ggplot(aes(x=age, y=value)) +
        geom_line(aes(group=id)) +
        geom_smooth()

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Now we have some data to work with, and you’ve learned about mvrnorm, and have a more complex example for how do can work.

Use `do` for batching analysis

Let’s pretend we don’t know the population parameters that generated our data, and want to estimate the proportion of variance due to individuals, compared to total variance.

We’ll use the multilevel modeling package lme4 to generate these statistics.

First, let’s figure out how to get the values out of lme4 we need for an ICC – that is, the variance in intercepts due to our grouping factor and total variance.

testMod <- lmer(value ~ 1 + (1|id), data = filter(timeSeriesData, bfi=='E'))
summary(testMod)

## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ 1 + (1 | id)
##    Data: filter(timeSeriesData, bfi == "E")
## 
## REML criterion at convergence: 2644.9
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.69346 -0.54994 -0.00728  0.58528  2.44704 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept) 173.25   13.162  
##  Residual              17.38    4.169  
## Number of obs: 400, groups:  id, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   59.360      1.333   44.54

VarCorr(testMod) #we can get SD for grouping factor from this

##  Groups   Name        Std.Dev.
##  id       (Intercept) 13.1624 
##  Residual              4.1686

sigma(testMod) #and residual SD from this

## [1] 4.16858

str(VarCorr(testMod))

## List of 1
##  $ id: num [1, 1] 173
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr "(Intercept)"
##   .. ..$ : chr "(Intercept)"
##   ..- attr(*, "stddev")= Named num 13.2
##   .. ..- attr(*, "names")= chr "(Intercept)"
##   ..- attr(*, "correlation")= num [1, 1] 1
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr "(Intercept)"
##   .. .. ..$ : chr "(Intercept)"
##  - attr(*, "sc")= num 4.17
##  - attr(*, "useSc")= logi TRUE
##  - attr(*, "class")= chr "VarCorr.merMod"

VarCorr(testMod)$id['(Intercept)', '(Intercept)']

## [1] 173.2495

VarCorr(testMod)$id['(Intercept)', '(Intercept)']^.5 # Looks like this gives us the variance

## [1] 13.16243

idVariance <- VarCorr(testMod)[['id']]['(Intercept)', '(Intercept)']
residVar <- sigma(testMod)^2 #square the SD to get variance

anICC <- idVariance/(idVariance+residVar)
anICC

## [1] 0.9088424

Now all we have to do is use group_by and do to do this to each subset of the data frame. Remember that, within do, the subset of your data is called ..

iccDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        amod <- lmer(value ~ 1 + (1|id), data=.)
        idVariance <- VarCorr(amod)[['id']]['(Intercept)', '(Intercept)']
        residVar <- sigma(amod)^2 #square the SD to get variance
        anICC <- idVariance/(idVariance+residVar)
        data.frame(icc=anICC)
    })

iccDF

## Source: local data frame [5 x 2]
## Groups: bfi [5]
## 
##      bfi       icc
##   (fctr)     (dbl)
## 1      A 0.9411433
## 2      C 0.9694278
## 3      E 0.9088424
## 4      N 0.9010527
## 5      O 0.8576107

Write an ICC function

We could also write a function that takes the names of the outcome variable column, grouping variable column, and the data, and returns an ICC.

There’s a little magic in here where we construct the formula for lmer using the text. Quick explanation: paste0 just sticks together strings, so paste0('This', 'Works') gives us ThisWorks. We can use variables in it too, so if we set thing <- 'nonesense' and do paste0(thing, 'Works'), we get nonesenseWorks.

theICC <- function(yName, groupName, theData){
    if(!require(lme4)){
            stop('missing lme4 package')
    }

    theFormula <- formula(paste0(yName, '~1+(1|', groupName, ')'))
    amod <- lmer(theFormula, theData)
    idVariance <- VarCorr(amod)[[groupName]]['(Intercept)', '(Intercept)']
    residVar <- sigma(amod)^2 #square the SD to get variance

    anICC <- idVariance/(idVariance+residVar)
    anICC
}

#Test the function!
theICC('value', 'id', filter(timeSeriesData, bfi=='E'))

## [1] 0.9088424

iccDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        anICC <- theICC(yName='value',
                groupName='id',
                theData=.)
        data.frame(icc=anICC)
    })

iccDF

## Source: local data frame [5 x 2]
## Groups: bfi [5]
## 
##      bfi       icc
##   (fctr)     (dbl)
## 1      A 0.9411433
## 2      C 0.9694278
## 3      E 0.9088424
## 4      N 0.9010527
## 5      O 0.8576107

Other batched analyses

You can use do in this way for anything, so long as you return a data frame. Let’s go ahead and estimate the growth parameters for each BFI variable.

Again, start by getting a handle on the object you get back from your statistic function:

testMod <- lmer(value ~ 1 + age + (1|age), data=filter(timeSeriesData, bfi=='E'))
summary(testMod)

## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ 1 + age + (1 | age)
##    Data: filter(timeSeriesData, bfi == "E")
## 
## REML criterion at convergence: 3051.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2027 -0.5829  0.0264  0.5890  4.1250 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  age      (Intercept)   0       0.00   
##  Residual             120      10.96   
## Number of obs: 400, groups:  age, 36
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 58.88174    0.54870  107.31
## age          0.85460    0.05619   15.21
## 
## Correlation of Fixed Effects:
##     (Intr)
## age -0.057

fixef(testMod) #looks like we might be able to just coerce this into a DF

## (Intercept)         age 
##  58.8817372   0.8546026

as.data.frame(fixef(testMod))

##             fixef(testMod)
## (Intercept)     58.8817372
## age              0.8546026

Looks like, so long as we keep track of which parameter is on which row, we’ll be able to just return the output of fixef coerced into a data.frame.

growthParamsDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        amod <- lmer(value ~ 1 + age + (1|age), data=.)
        fx <- as.data.frame(fixef(amod))
    })
growthParamsDF

## Source: local data frame [10 x 2]
## Groups: bfi [5]
## 
##       bfi fixef(amod)
##    (fctr)       (dbl)
## 1       A 65.48942693
## 2       A  1.93237366
## 3       C 56.15908452
## 4       C  3.02224221
## 5       E 58.88173717
## 6       E  0.85460255
## 7       N 44.86838252
## 8       N -0.81417670
## 9       O 49.45933488
## 10      O -0.08210474

Oops, gotta add the rownames.

growthParamsDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        amod <- lmer(value ~ 1 + age + (1|age), data=.)
        fx <- fixef(amod)
        fxDF <- as.data.frame(fx)
        fxDF$param <- gsub('\\(|\\)', '', rownames(fxDF)) # and remove '(' or ')'
        fxDF
    })
growthParamsDF

## Source: local data frame [10 x 3]
## Groups: bfi [5]
## 
##       bfi          fx     param
##    (fctr)       (dbl)     (chr)
## 1       A 65.48942693 Intercept
## 2       A  1.93237366       age
## 3       C 56.15908452 Intercept
## 4       C  3.02224221       age
## 5       E 58.88173717 Intercept
## 6       E  0.85460255       age
## 7       N 44.86838252 Intercept
## 8       N -0.81417670       age
## 9       O 49.45933488 Intercept
## 10      O -0.08210474       age

Presentation of your results

If we want, we can spread out the values and output a table.

growthParamsDF_wide <- growthParamsDF %>%
    spread(param, fx)
growthParamsDF_wide

## Source: local data frame [5 x 3]
## 
##      bfi         age Intercept
##   (fctr)       (dbl)     (dbl)
## 1      A  1.93237366  65.48943
## 2      C  3.02224221  56.15908
## 3      E  0.85460255  58.88174
## 4      N -0.81417670  44.86838
## 5      O -0.08210474  49.45933

library(knitr)
#Just a preview of how to make nice tables:

kable(growthParamsDF_wide)

bfi	age	Intercept
A	1.9323737	65.48943
C	3.0222422	56.15908
E	0.8546026	58.88174
N	-0.8141767	44.86838
O	-0.0821047	49.45933