Tagged: examples

May 19, 2015

Time series modeling with MLM

library(dplyr)
library(ggplot2)
library(tidyr)
library(nlme)
library(knitr)

# Make sure you set your working dir to the location of the csv
# setwd('C:\\Users\\flournoy\\Downloads') 
aDF<-read.csv('timeseries_DanceRandom_RAW_forRClub_28Apr2015.csv')

First, we’ll lengthen the data set so we get a row for every dwell time observations using tidyr::gather and some dplyr functions

We also want to know which slide number each dwell time observation is from as well as which within-element position it represents.

aDF_l<-aDF %>%
  dplyr::select(-TDT,-Order) %>%
  gather(slide,dt,-SubjID) %>%
  mutate(slide_num=as.numeric(sub('X([0-9]+)','\\1',slide))) %>%
  arrange(SubjID,slide_num) %>%
  mutate(
        element_position=rep(1:4,n()/4),
        slide_num=slide_num-min(slide_num))

kable(head(aDF_l))

SubjID	slide	dt	slide_num	element_position
10	X18	0.186	0	1
10	X19	0.796	1	2
10	X20	0.216	2	3
10	X21	0.211	3	4
10	X22	0.453	4	1
10	X23	0.352	5	2

We can plot each participant’s curve dwell time curve:

ggplot(aDF_l,aes(x=slide_num,y=dt))+
  geom_line(aes(group=SubjID),alpha=.05)+
  coord_trans(ytrans='log10')+
  geom_smooth(aes(group=NULL),method=loess,se=F)+
  theme(panel.background=element_rect(fill='white'))

To test autocorrelation at different lags we’re going to make an expected correlation matrix structure for the residuals. For example, for a lag-1 autocorrelation structure we’re imposing a structure such that there will be a correlation between the size of the residual for slide 0 and slide 1 that is the same as the correlation between the residual for slide 1 and slide 2.

We’re really interested in the lag 4 correlation – that is, the correlation between the residual for slide i and slide i+4. In our case, the residual will be whatever is left over after we account for the mean (i.e., intercept) and the linear trend across the entire experiment of 720 slides.

First, set our autoregressive moving average options:

lag_4_pos_by_subj<-corARMA(
    value=c(0,0,0,.2), # Initial values for lag 1-4
    p=4,q=0, # We want 4 lags, and 0 moving averages estimated
    form=~slide_num|SubjID) # slide_num is our time variable, grouped by SubjID

This uses our data to initialize a correlation matrix, which we only do for illustration purposes. nlme will do this automatically later on. The figure is what our expected residual correlation matrix looks like.

lag_4_pos_by_subj_initd<-Initialize(lag_4_pos_by_subj,data=aDF_l)
aMat<-corMatrix(lag_4_pos_by_subj_initd)
single_mat<-aMat$`10`
single_mat_l<-single_mat %>% as.data.frame %>%
    mutate(y=1:n()) %>%
    gather(x_col,value,-y,convert=T) %>% 
    mutate(
        x=as.numeric(sub('V([0-9]+)','\\1',x_col)),
        group=rep(1:10,each=51840))

single_mat_l %>% 
    filter(value > .01) %>%
    ggplot(aes(x=x,y=y))+
        geom_tile(aes(fill=(value)))+
        # facet_wrap(~group,scales='free',shrink=T) +
        theme(
            line=element_line(color='white'),
            panel.background=element_rect(fill='white'))

Now we can build and compare our models. We’ll test models with lag-1, -2, -3, and -4 AR structures.

Create our different residual correlation matrices:

lag_1_pos_by_subj<-corARMA(
    value=c(0), 
    p=1,q=0, 
    form=~slide_num|SubjID)
lag_2_pos_by_subj<-corARMA(
    value=c(0,0), 
    p=2,q=0, 
    form=~slide_num|SubjID)
lag_3_pos_by_subj<-corARMA(
    value=c(0,0,0), 
    p=3,q=0, 
    form=~slide_num|SubjID)
lag_4_pos_by_subj<-corARMA(
    value=c(0,0,0,0), 
    p=4,q=0, 
    form=~slide_num|SubjID)

Build our null model (which includes a linear effect of time to account for that long linear trend we saw in the above plot).

nullModel<-lme(
    dt~1+slide_num,
    aDF_l,
    random=~1|SubjID)
summary(nullModel)

## Linear mixed-effects model fit by REML
##  Data: aDF_l 
##        AIC      BIC    logLik
##   31533.89 31566.97 -15762.95
## 
## Random effects:
##  Formula: ~1 | SubjID
##         (Intercept)  Residual
## StdDev:   0.4327605 0.4161845
## 
## Fixed effects: dt ~ 1 + slide_num 
##                  Value  Std.Error    DF   t-value p-value
## (Intercept)  0.7002197 0.06860064 28759  10.20719       0
## slide_num   -0.0005525 0.00001180 28759 -46.82514       0
##  Correlation: 
##           (Intr)
## slide_num -0.062
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.83833764 -0.30453025 -0.06975014  0.16771644 88.41714587 
## 
## Number of Observations: 28800
## Number of Groups: 40

lag1Model<-lme(
    dt~1+slide_num,
    aDF_l,
    random=~1|SubjID,
    correlation=lag_1_pos_by_subj)
summary(lag1Model)

## Linear mixed-effects model fit by REML
##  Data: aDF_l 
##        AIC      BIC    logLik
##   29359.82 29401.16 -14674.91
## 
## Random effects:
##  Formula: ~1 | SubjID
##         (Intercept) Residual
## StdDev:   0.4325128 0.416731
## 
## Correlation Structure: AR(1)
##  Formula: ~slide_num | SubjID 
##  Parameter estimate(s):
##       Phi 
## 0.2730642 
## Fixed effects: dt ~ 1 + slide_num 
##                  Value  Std.Error    DF   t-value p-value
## (Intercept)  0.7026000 0.06869298 28759  10.22812       0
## slide_num   -0.0005573 0.00001561 28759 -35.69883       0
##  Correlation: 
##           (Intr)
## slide_num -0.082
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.83058935 -0.30722993 -0.07147524  0.16737678 88.29390180 
## 
## Number of Observations: 28800
## Number of Groups: 40

anova(nullModel,lag1Model)

##           Model df      AIC      BIC    logLik   Test  L.Ratio p-value
## nullModel     1  4 31533.90 31566.97 -15762.95                        
## lag1Model     2  5 29359.82 29401.16 -14674.91 1 vs 2 2176.075  <.0001

Computing the next three models takes a really long time for each (90 minutes for lag-4), so I’ll parallelize it.

ar_lags<-list(
    lag1=lag_1_pos_by_subj,
    lag2=lag_2_pos_by_subj,
    lag3=lag_3_pos_by_subj,
    lag4=lag_4_pos_by_subj)

library(doParallel)

registerDoParallel(cores=4)
lag_models<-foreach(
    corstruct=ar_lags,
    .packages=c('nlme','dplyr'),
    .export=c('nullModel'),
    .combine=bind_rows,
    .multicombine=T
    ) %dopar% {
        aModel<-update(
            nullModel,
            correlation=corstruct)
        aSummary<-summary(aModel)
        as_data_frame(list(
            model=list(aModel),
            summary=list(aSummary)))
}
save(lag_models,file='lag_models.RData',compress=T)
print(lag_models)

## Source: local data frame [4 x 2]
## 
##      model               summary
## 1 <S3:lme> <S3:summary.lme, lme>
## 2 <S3:lme> <S3:summary.lme, lme>
## 3 <S3:lme> <S3:summary.lme, lme>
## 4 <S3:lme> <S3:summary.lme, lme>

Now we can compare them…

lag_models %>%
    mutate(Total_Lags=1:4) %>%
    group_by(Total_Lags) %>%
    do({
        phis<-coef(.$model[[1]]$modelStruct$corStruct,unconstrained=F)
        nphis<-paste('lag',1:length(phis))
        data.frame(
            lag=nphis,
            phi=phis
            )
    }) %>%
    spread(lag,phi) %>% 
    kable(digits=2)

## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character

Total_Lags	lag 1	lag 2	lag 3	lag 4
1	0.27	NA	NA	NA
2	0.23	0.18	NA	NA
3	0.20	0.14	0.15	NA
4	0.18	0.12	0.11	0.19

kable(
    cbind(
        list(Lags=c(1,2)),
        anova(lag_models$model[[1]],lag_models$model[[2]])[,c(-1,-2)]),
    row.names=F)

Lags	df	AIC	BIC	logLik	Test	L.Ratio	p-value
1	5	29359.82	29401.16	-14674.91		NA	NA
2	6	28486.57	28536.17	-14237.28	1 vs 2	875.2539	0

kable(
    cbind(
        list(Lags=c(2,3)),
        anova(lag_models$model[[2]],lag_models$model[[3]])[,c(-1,-2)]),
    row.names=F)

Lags	df	AIC	BIC	logLik	Test	L.Ratio	p-value
2	6	28486.57	28536.17	-14237.28		NA	NA
3	7	27892.35	27950.22	-13939.17	1 vs 2	596.2176	0

kable(
    cbind(
        list(Lags=c(3,4)),
        anova(lag_models$model[[3]],lag_models$model[[4]])[,c(-1,-2)]),
    row.names=F)

Lags	df	AIC	BIC	logLik	Test	L.Ratio	p-value
3	7	27892.35	27950.22	-13939.17		NA	NA
4	8	26898.26	26964.40	-13441.13	1 vs 2	996.0902	0

We can look at the predicted correlation matrix:

predicted_ARMA<-corARMA(value=c(.18,.12,.11,.19),p=4,form=~slide_num)

predicted_ARMA_init<-Initialize(predicted_ARMA,data=predict_df)
corMatARMA<-corMatrix(predicted_ARMA_init)

corMatARMA %>%
as.data.frame %>%
mutate(x=1:n()) %>%
gather(y,value,-x,convert=T) %>%
mutate(y=as.numeric(sub('V(.*)','\\1',y))) %>%
filter(value > .01) %>%
ggplot(aes(x=x,y=y))+
        geom_tile(aes(fill=(value)))+
        theme(
            line=element_line(color='white'),
            panel.background=element_rect(fill='white'))

And we can produce simulated data with our fixed effects estimates for the linear trend as our mu and our correlation matrix as our Sigma for mvrnorm.

predict_df<-data.frame(slide_num=1:720)
predict_df$dt<-predict(
    lag_models$model[[4]],
    predict_df,
    level=0)

library(MASS)

predict_df$dt_err<-mvrnorm(1,mu=scale(predict_df$dt),Sigma=corMatARMA)

Our predictions look like this:

predict_df %>% 
    ggplot(aes(x=slide_num))+
    geom_line(aes(y=dt_err))+
    theme(panel.background=element_rect(fill='white'))

Our data look like this:

aDF_l %>%
    ggplot(aes(x=slide_num,y=dt))+
    geom_line()+
    facet_wrap(~SubjID,scale='free_y')+
    theme(panel.background=element_rect(fill='white'))

Not bad, visually speaking. I suspect we could do better if we modeled the the underlying rate parameter for a Poisson process, but that’s a whole other can of worms.

Here’s what happens if we just dummy-variable each element position:


summary(update(nullModel,.~.+as.factor(element_position)))


## Linear mixed-effects model fit by REML
##  Data: aDF_l 
##        AIC      BIC    logLik
##   31554.64 31612.51 -15770.32
## 
## Random effects:
##  Formula: ~1 | SubjID
##         (Intercept)  Residual
## StdDev:   0.4327605 0.4161319
## 
## Fixed effects: dt ~ slide_num + as.factor(element_position) 
##                                   Value  Std.Error    DF   t-value p-value
## (Intercept)                   0.7135225 0.06873087 28756  10.38140  0.0000
## slide_num                    -0.0005524 0.00001180 28756 -46.81881  0.0000
## as.factor(element_position)2 -0.0202103 0.00693554 28756  -2.91402  0.0036
## as.factor(element_position)3 -0.0161110 0.00693557 28756  -2.32295  0.0202
## as.factor(element_position)4 -0.0170860 0.00693562 28756  -2.46351  0.0138
##  Correlation: 
##                              (Intr) sld_nm a.(_)2 a.(_)3
## slide_num                    -0.061                     
## as.factor(element_position)2 -0.050 -0.002              
## as.factor(element_position)3 -0.050 -0.003  0.500       
## as.factor(element_position)4 -0.050 -0.005  0.500  0.500
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.82219320 -0.30539798 -0.07029029  0.16628406 88.39624069 
## 
## Number of Observations: 28800
## Number of Groups: 40

February 17, 2015

Interactive Embedded Plots with Plotly and ggplot2

Largley lifted from this r-bloggers post

install.packages("devtools")  # so we can install from GitHub
devtools::install_github("ropensci/plotly")  # plotly is part of rOpenSci

library(plotly)

py <- plotly(username="jflournoy", key="mg34ox914h")  # open plotly connection
# I'll change my key after this, but you can still use: plotly(username="r_user_guide", key="mw5isa4yqp")
# Or just sign up for your own account!

 
gg <- ggplot(iris) +
    geom_point(aes(Sepal.Length, Sepal.Width,color=Species,size=Petal.Length))
gg

#This looks a little object-oriented like python  
py$ggplotly(gg)

You can embed code like this (which you get from the plotly ‘share’ dialogue):

<div>
<a href="https://plot.ly/~jflournoy/16/" target="_blank" title="Sepal.Width vs Sepal.Length" style="display: block; text-align: center;"><img src="https://plot.ly/~jflournoy/16.png" alt="Sepal.Width vs Sepal.Length" style="max-width: 100%;width: 797px;"  width="797" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="jflournoy:16" src="https://plot.ly/embed.js" async></script>
</div>

You can also directly embed a plotly plot using a code chunk if you set plotly=TRUE for the chunk, and include session="knitr" in the call.

#Set `plotly=TRUE`
py$ggplotly(gg, session="knitr")

<br />

There’s a wide world of plotly fun just waiting out there.

January 20, 2015

611 Highlights in R (Part II)

The raw .R file is here and the Rmd file is here.

Brief ‘Hello’ to R Markdown

Let me first introduce you to R markdown and kitr. This format allows you to integrate your analyses with text that can be automatically translated to html, word, and pdf formats. I just produced a supplementary materials document using this and it probably saved me a ridiculous number of hours. It’s easy to get started:

Save your file with extension .Rmd.

Place something like the following in the very top of your file:

---
author: John Flournoy
title: 611 Highlights in R (Part II)
output: html_document
---

Write as you normally would, using Markdown syntax for formatting. For example:

# This is a big heading

1. this is
2. a numbered list

- this is a bulleted list
- With **bold** and *italic*  text

When you want to throw in some R code, place it inside an R chunk by wrapping it in ```{r} at the beginning and ``` at the end. When you do, it comes out something like this:

# From Rose's presentation:
# a toy data set for us to work with
n <- 100
gender <- as.factor(c(rep("M", n/2), rep("F", n/2)))
IQ <- rnorm(n, mean=100, sd=15)
degree <- rep(c("HS", "BA", "MS", "PhD"), n/4)
height <- as.numeric(gender)-2 + rnorm(n, mean=5.5, sd=1)
RT1 <- rchisq(n, 4)
RT2 <- rchisq(n, 4)
DV <- 50 - 5*(as.numeric(gender)-1) + .1*IQ + rnorm(n)

df <- data.frame(awesome=DV, gender=gender, IQ=IQ, degree=degree, height=height, RT1=RT1, RT2=RT2)
head(df)

##    awesome gender        IQ degree   height      RT1      RT2
## 1 52.89486      M  79.40263     HS 6.292706 1.761990 2.228117
## 2 57.04782      M  96.31289     BA 4.759042 2.236027 7.363253
## 3 55.59594      M 104.95331     MS 5.560292 5.159114 7.126281
## 4 55.44397      M  84.23347    PhD 5.354595 7.652900 2.836956
## 5 55.45191      M 107.38754     HS 4.523997 7.768904 4.504664
## 6 54.70468      M 110.07096     BA 6.139368 7.786462 7.508488

Notice that in 4, the we see the output of the command head(df). This is useful for outputting tables and plots, as we’ll see.

Here are more resources on

Streamlined Data Workflow with dplyr

Download the data here:

Make sure you have the package (library(dplyr)), and if you don’t, install it (install.packages('dplyr')). Here’s an intro to dplyr straight from the horse’s (Hadley’s) mouth.

By way of data manipulation that I’m actually using for a real project, this tutorial will introduce you to

%>%
- a function chaining operator
mutate and transmute
- adds new variables and preserves existing (transmute drops existing variables)
mutate_each
- adds new variables, applying functions in funs(...) to each column
group_by
- groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
summarise
- summarizes multiple values to a single value
left_join
- joins two data frames, returning all rows from x, and all columns from x and y
select
- keeps the variables/columns that you mention (see ?select for special functions that help you select variables)
filter
- returns rows with matching conditions (similar to selecting columns like: aDataFrame[variable == value,])
group_by
- groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
summarise
- summarizes multiple values to a single value
do
- use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list (e.g., run the same lm or t.test separately for each gender, country)

Continue reading →

January 20, 2015

Biased and unbiased estimates

(This was really little more than an excuse to get me to learn a little about Markdown in R. I have been toying with the idea of trying to move the 302 labs away from SPSS and into R. One such advantage could be the relative ease of employing simulations in R to to illustrate some of the concepts that students often report as challenging. The distinction between biased and unbiased estimates was something that students questioned me on last week, so it’s what I’ve tried to walk through here.)

In 302, we teach students that sample means provide an unbiased estimate of population means. Of course, this doesn’t mean that sample means are PERFECT estimates of population means. We should expect that the mean of any sample, no matter how representative, will differ a bit from the mean of the population from which it was drawn. This is sampling error. We say the sample mean is an unbiased estimate because it doesn’t differ systemmatically from the population mean–samples with means greater than the population mean are as likely as samples with means smaller than the population mean.

Let’s simulate this.

Creating a population

First, we need to create a population of scores. Let’s use IQ scores as an example. Here we’ll draw a million scores from a normal distribution with a mean of 100 and a standard deviation of 15. We’ll plot it as a histogram to show that everything is nice and normal.

# Define a few settings for our work today
popMean = 100; # The mean of our population
popSD = 15; # The standard deviation of our population
sampSize = 30; # The size of the samples we'll draw from the population

population <- rnorm(1000000, mean = popMean, sd = popSD)
hist(population)

Just to double check and make sure that R is doing its thing like it should, we can check some descriptive statistics for this population. If things have worked, these values should be pretty darn close to μ = 100 and σ = 15.

mean(population)

## [1] 100.0175

sd(population)

## [1] 14.99739

Yep.

Sample means are unbiased estimates of population means

Now, we need to create a sampling distribution. We will draw a sample from this population and find its mean. Then, we do that same thing over and over again a whole mess ’a times. This distribution of sample means is a sampling distribution.

Let’s give it a whirl. We’ll start with a little snippet that just gives us a single sample of 30 participants, using random sampling with replacement. When we take a peek at its mean, we should find that it is pretty darn close to 100, but with a little sampling error.

sampChamp <- sample(population, sampSize, replace=TRUE)
mean(sampChamp)

## [1] 96.36024

A quick aside: Just how close should it be to the population mean? Well, the expected deviation between any sample mean and the population mean is estimated by the standard error: σ²_M = σ / √(n). So, in this case, we’d have a σ²_M = 15 / √30 = 2.7386128

Let’s return to our simulation. We’ll now draw a whole bunch of samples and enter their means into a sampling distribution.

sampDist <- replicate(10000,mean(sample(population, sampSize, replace=TRUE)))
hist(sampDist)

Let’s get a few descriptive statistics here, too.

mean(sampDist)

## [1] 99.96665

sd(sampDist)

## [1] 2.743161

Note that the mean of the sampling distribution is a pretty darn good match for the original population mean. So, looky there, the sample mean is an unbaised estimator! Replications of the sampling procedure yield means that are just as likely to be above the population mean as below (in a symmetrical distribution like this, the mean and median are pretty much the same. If you’re interested, the actual value for the median is 99.950235. Also: The standard deviation of the sampling distribution is the standard error, so we see a pretty good match between the SD here (2.7431614) and the standard error calculated earlier (2.7386128).

Uncorrected sample standard deviations are biased estimates of population standard deviations

Okay, let’s put together a different sampling distribution. Rather than collecting means from each sample we’ll collect uncorrected sample standard deviations. This is the square root of the sample variance, where the sample variance is the sum of the squared deviations from the mean divided by the sample size (SS/n).

R uses the corrected sample standard devaition (and variance), by default. So I use some cheap tricks to get it to give me the uncorrected standard deviation.

sdDist <- replicate(10000,(((sqrt(var(sample(population, sampSize, replace=TRUE)))*(sampSize-1))/(sampSize))))
hist(sdDist)

Let’s grab a mean of this sampling distribution of standard deviations.

mean(sdDist)

## [1] 14.35129

Notice that it is an underestimate of the population variance. The deviation between this estimate (14.3512925) and the true population standard deviation (15) is 0.6487075. Uncorrected sample standard deviations are systemmatically smaller than the population standard deviations that we intend them to estimate.

Now, let’s try it again with the corrected sample standard deviation.

sdDist_2 <- replicate(10000,sd(sample(population, sampSize, replace=TRUE)))
hist(sdDist_2)

And its mean.

mean(sdDist_2)

## [1] 14.88884

Here, the deviation between the estimate from the sample(14.8888407) and the true population standard deviation (15) is reduced to 0.1111593. Sample standard deviations are less systemmatically underrepresenting the population standard deviation.

December 4, 2014

More R Graph Examples

Yet another compendium of clean, useful graphs in R (these by noted statistician E. J. Wagenmaker).

If you want more, still, check out this slightly older post too!

October 28, 2014

R Graph Catalog

This is Shahar’s distance contribution to R Club. (Thanks, Shahar!)

This catalog allows you to choose which graph you like best and gives you the code for that graph. You can also filter by things like Good vs. Bad graphs, type of graph, and different features that you might like to include (like subscripts and multiple plots).

http://shinyapps.stat.ubc.ca/r-graph-catalog/