Category: Consultation

Smoothing the map






Other ways to plot spatial data…

Here, I walk through some ggplot methods, and finally introduce the spatstat package, which allows one to plot smoothed heatmaps based on your variable of interest.

Basic Data Wrangling

library(zipcode)
library(dplyr)
library(ggplot2)
data(zipcode)
pzip<-read.csv('StudyFiveSampleOneWaveOne_forRclub.csv')
str(pzip)
## 'data.frame':    1192 obs. of  2 variables:
##  $ ZipCode: int  23231 85712 85281 23917 NA 84337 89104 60629 80503 43449 ...
##  $ PScore : num  3.37 4.3 2.97 3.2 2.9 ...
pzip_clean<-pzip %>%
    mutate(zip=clean.zipcodes(ZipCode))
str(pzip_clean)
## 'data.frame':    1192 obs. of  3 variables:
##  $ ZipCode: int  23231 85712 85281 23917 NA 84337 89104 60629 80503 43449 ...
##  $ PScore : num  3.37 4.3 2.97 3.2 2.9 ...
##  $ zip    : chr  "23231" "85712" "85281" "23917" ...
pzip_latlong <- left_join(pzip_clean,zipcode) %>%
    filter(longitude > -140) %>%
    na.omit
## Joining by: "zip"
str(pzip_latlong)
## 'data.frame':    1080 obs. of  7 variables:
##  $ ZipCode  : int  23231 85712 85281 23917 84337 89104 60629 80503 43449 93030 ...
##  $ PScore   : num  3.37 4.3 2.97 3.2 3.17 ...
##  $ zip      : chr  "23231" "85712" "85281" "23917" ...
##  $ city     : chr  "Richmond" "Tucson" "Tempe" "Boydton" ...
##  $ state    : chr  "VA" "AZ" "AZ" "VA" ...
##  $ latitude : num  37.5 32.2 33.4 36.6 41.7 ...
##  $ longitude: num  -77.4 -110.9 -111.9 -78.3 -112.2 ...

Points and Density

Close your eyes and infer the shape of the nation….

pzip_latlong %>% filter(longitude > -140) %>%
    ggplot(aes(y=latitude,x=longitude)) + 
    geom_point(aes(color=PScore))

pzip_latlong %>% filter(longitude > -140) %>%
    ggplot(aes(y=latitude,x=longitude)) + 
    stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE)

Spatially smoothed data on an actual map

Thanks, https://stat.ethz.ch/pipermail/r-sig-geo/2011-March/011284.html

library(spatstat)
library(maps)
library(maptools)

#map('usa')
#points(pzip_latlong$longitude, pzip_latlong$latitude)
usmap <- map('usa', fill=TRUE, col="transparent", plot=FALSE)
uspoly <- map2SpatialPolygons(usmap, IDs=usmap$names, 
    proj4string=CRS("+proj=longlat +datum=WGS84"))
spatstat.options(checkpolygons=FALSE)
## Warning: spatstat.options('checkpolygons') will be ignored in future
## versions of spatstat
usowin <- as.owin.SpatialPolygons(uspoly)
spatstat.options(checkpolygons=TRUE)
## Warning: spatstat.options('checkpolygons') will be ignored in future
## versions of spatstat
pzip_noDupes <- pzip_latlong %>% 
    group_by(latitude, longitude) %>%
    summarise(pscore=mean(PScore))

pzipPoints <- ppp(x=pzip_noDupes$longitude,
          y=pzip_noDupes$latitude,
          marks=pzip_noDupes$pscore, 
          window=usowin)
## Warning: point-in-polygon test had difficulty with 1 point (total score not
## 0 or 1)
## Warning in ppp(x = pzip_noDupes$longitude, y = pzip_noDupes$latitude,
## marks = pzip_noDupes$pscore, : 18 points were rejected as lying outside the
## specified window
# density() maps the spatial intensity of the points themselves (i.e., where respondents are)
plot(density(pzipPoints))

# Smooth() maps a spatially smoothed surface of the marks (here, PScore)
plot(Smooth(pzipPoints))



Map maker, map maker, make me a map






[Editor’s note: The following code takes a dataframe with zipcodes and a mystery variable of interest and makes it into a map. Thanks for the code, Rose, and thanks for the data, Rita!]

if(!require(zipcode)) install.packages("zipcode"); library(zipcode)
df.raw <- read.csv("StudyFiveSampleOneWaveOne_forRclub.csv")
str(df.raw)
## 'data.frame':    1192 obs. of  2 variables:
##  $ ZipCode: int  23231 85712 85281 23917 NA 84337 89104 60629 80503 43449 ...
##  $ PScore : num  3.37 4.3 2.97 3.2 2.9 ...
if(!identical(x=clean.zipcodes(df.raw$ZipCode), y=as.character(df.raw$ZipCode)) ) message("Uh-oh") #check that zip codes in df are clean
## Uh-oh
df <- df.raw
df$zip <- clean.zipcodes(df$ZipCode)
df <- na.omit(df)

data(zipcode) # reads in a dataframe with the state for each zipcode
df <- merge(df, zipcode[,c(1,3)], by="zip", all.x=T) # add a column of state abbreviations to the datafile

library(dplyr)
df.state <- df %>%
  group_by(state) %>%
  summarize(PScore=mean(PScore))

if(!require(choroplethr)) install.packages("choroplethr"); library(choroplethr)
if(!require(choroplethrMaps)) install.packages("choroplethrMaps"); library(choroplethrMaps)

data(state.regions) # reads in a dataframe with the state name and state abbreviations

df.state <- merge(df.state, state.regions[,c(1,2)], by.x="state", by.y="abb", all.x=T)

# state_choropleth() expects a data frame with columns named "region" (full, lower-case state names) and "value"
map.data = data.frame(region=df.state$region, value=df.state$PScore)
state_choropleth(map.data)
## Warning in super$initialize(map.df, user.df): Your data.frame contains the
## following regions which are not mappable: NA
## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector



Time series modeling with MLM

library(dplyr)
library(ggplot2)
library(tidyr)
library(nlme)
library(knitr)

# Make sure you set your working dir to the location of the csv
# setwd('C:\\Users\\flournoy\\Downloads') 
aDF<-read.csv('timeseries_DanceRandom_RAW_forRClub_28Apr2015.csv')

First, we’ll lengthen the data set so we get a row for every dwell time observation, using tidyr::gather and some dplyr functions.

We also want to know which slide number each dwell time observation is from as well as which within-element position it represents.

aDF_l<-aDF %>%
  dplyr::select(-TDT,-Order) %>%
  gather(slide,dt,-SubjID) %>%
  mutate(slide_num=as.numeric(sub('X([0-9]+)','\\1',slide))) %>%
  arrange(SubjID,slide_num) %>%
  mutate(
        element_position=rep(1:4,n()/4),
        slide_num=slide_num-min(slide_num))

kable(head(aDF_l))
SubjID  slide     dt  slide_num  element_position
    10  X18    0.186          0                 1
    10  X19    0.796          1                 2
    10  X20    0.216          2                 3
    10  X21    0.211          3                 4
    10  X22    0.453          4                 1
    10  X23    0.352          5                 2

We can plot each participant’s dwell time curve:

ggplot(aDF_l,aes(x=slide_num,y=dt))+
  geom_line(aes(group=SubjID),alpha=.05)+
  coord_trans(ytrans='log10')+
  geom_smooth(aes(group=NULL),method=loess,se=F)+
  theme(panel.background=element_rect(fill='white'))

[Plot: dwell time by slide number, one line per participant, with a group-level loess smooth]

To test autocorrelation at different lags we’re going to make an expected correlation matrix structure for the residuals. For example, for a lag-1 autocorrelation structure we’re imposing a structure such that there will be a correlation between the size of the residual for slide 0 and slide 1 that is the same as the correlation between the residual for slide 1 and slide 2.
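
To make that concrete, here is a small numeric illustration (toy values, not estimated from these data): under an AR(1) structure with parameter phi, the implied residual correlation between slide i and slide j is phi^|i - j|.

phi <- 0.3 # an arbitrary illustrative value
round(phi^abs(outer(0:4, 0:4, "-")), 3) # implied correlations among 5 consecutive slides
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] 1.000 0.300 0.090 0.027 0.008
## [2,] 0.300 1.000 0.300 0.090 0.027
## [3,] 0.090 0.300 1.000 0.300 0.090
## [4,] 0.027 0.090 0.300 1.000 0.300
## [5,] 0.008 0.027 0.090 0.300 1.000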

We’re really interested in the lag 4 correlation – that is, the correlation between the residual for slide i and slide i+4. In our case, the residual will be whatever is left over after we account for the mean (i.e., intercept) and the linear trend across the entire experiment of 720 slides.

First, set our autoregressive moving average options:

lag_4_pos_by_subj<-corARMA(
    value=c(0,0,0,.2), # Initial values for lag 1-4
    p=4,q=0, # We want 4 lags, and 0 moving averages estimated
    form=~slide_num|SubjID) # slide_num is our time variable, grouped by SubjID

This uses our data to initialize a correlation matrix, which we only do for illustration purposes. nlme will do this automatically later on. The figure is what our expected residual correlation matrix looks like.

lag_4_pos_by_subj_initd<-Initialize(lag_4_pos_by_subj,data=aDF_l)
aMat<-corMatrix(lag_4_pos_by_subj_initd)
single_mat<-aMat$`10`
single_mat_l<-single_mat %>% as.data.frame %>%
    mutate(y=1:n()) %>%
    gather(x_col,value,-y,convert=T) %>% 
    mutate(
        x=as.numeric(sub('V([0-9]+)','\\1',x_col)),
        group=rep(1:10,each=51840))
single_mat_l %>% 
    filter(value > .01) %>%
    ggplot(aes(x=x,y=y))+
        geom_tile(aes(fill=(value)))+
        # facet_wrap(~group,scales='free',shrink=T) +
        theme(
            line=element_line(color='white'),
            panel.background=element_rect(fill='white'))

[Plot: the expected residual correlation matrix under the lag-4 structure]

Now we can build and compare our models. We’ll test models with lag-1, -2, -3, and -4 AR structures.

Create our different residual correlation matrices:

lag_1_pos_by_subj<-corARMA(
    value=c(0), 
    p=1,q=0, 
    form=~slide_num|SubjID)
lag_2_pos_by_subj<-corARMA(
    value=c(0,0), 
    p=2,q=0, 
    form=~slide_num|SubjID)
lag_3_pos_by_subj<-corARMA(
    value=c(0,0,0), 
    p=3,q=0, 
    form=~slide_num|SubjID)
lag_4_pos_by_subj<-corARMA(
    value=c(0,0,0,0), 
    p=4,q=0, 
    form=~slide_num|SubjID)

Build our null model (which includes a linear effect of time to account for that long linear trend we saw in the above plot).

nullModel<-lme(
    dt~1+slide_num,
    aDF_l,
    random=~1|SubjID)
summary(nullModel)
## Linear mixed-effects model fit by REML
##  Data: aDF_l 
##        AIC      BIC    logLik
##   31533.89 31566.97 -15762.95
## 
## Random effects:
##  Formula: ~1 | SubjID
##         (Intercept)  Residual
## StdDev:   0.4327605 0.4161845
## 
## Fixed effects: dt ~ 1 + slide_num 
##                  Value  Std.Error    DF   t-value p-value
## (Intercept)  0.7002197 0.06860064 28759  10.20719       0
## slide_num   -0.0005525 0.00001180 28759 -46.82514       0
##  Correlation: 
##           (Intr)
## slide_num -0.062
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.83833764 -0.30453025 -0.06975014  0.16771644 88.41714587 
## 
## Number of Observations: 28800
## Number of Groups: 40
lag1Model<-lme(
    dt~1+slide_num,
    aDF_l,
    random=~1|SubjID,
    correlation=lag_1_pos_by_subj)
summary(lag1Model)
## Linear mixed-effects model fit by REML
##  Data: aDF_l 
##        AIC      BIC    logLik
##   29359.82 29401.16 -14674.91
## 
## Random effects:
##  Formula: ~1 | SubjID
##         (Intercept) Residual
## StdDev:   0.4325128 0.416731
## 
## Correlation Structure: AR(1)
##  Formula: ~slide_num | SubjID 
##  Parameter estimate(s):
##       Phi 
## 0.2730642 
## Fixed effects: dt ~ 1 + slide_num 
##                  Value  Std.Error    DF   t-value p-value
## (Intercept)  0.7026000 0.06869298 28759  10.22812       0
## slide_num   -0.0005573 0.00001561 28759 -35.69883       0
##  Correlation: 
##           (Intr)
## slide_num -0.082
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.83058935 -0.30722993 -0.07147524  0.16737678 88.29390180 
## 
## Number of Observations: 28800
## Number of Groups: 40
anova(nullModel,lag1Model)
##           Model df      AIC      BIC    logLik   Test  L.Ratio p-value
## nullModel     1  4 31533.90 31566.97 -15762.95                        
## lag1Model     2  5 29359.82 29401.16 -14674.91 1 vs 2 2176.075  <.0001

Computing the next three models takes a really long time for each (90 minutes for lag-4), so I’ll parallelize it.

ar_lags<-list(
    lag1=lag_1_pos_by_subj,
    lag2=lag_2_pos_by_subj,
    lag3=lag_3_pos_by_subj,
    lag4=lag_4_pos_by_subj)

library(doParallel)
registerDoParallel(cores=4)
lag_models<-foreach(
    corstruct=ar_lags,
    .packages=c('nlme','dplyr'),
    .export=c('nullModel'),
    .combine=bind_rows,
    .multicombine=T
    ) %dopar% {
        aModel<-update(
            nullModel,
            correlation=corstruct)
        aSummary<-summary(aModel)
        as_data_frame(list(
            model=list(aModel),
            summary=list(aSummary)))
}
save(lag_models,file='lag_models.RData',compress=T)
print(lag_models)
## Source: local data frame [4 x 2]
## 
##      model               summary
## 1 <S3:lme> <S3:summary.lme, lme>
## 2 <S3:lme> <S3:summary.lme, lme>
## 3 <S3:lme> <S3:summary.lme, lme>
## 4 <S3:lme> <S3:summary.lme, lme>

Now we can compare them…

lag_models %>%
    mutate(Total_Lags=1:4) %>%
    group_by(Total_Lags) %>%
    do({
        phis<-coef(.$model[[1]]$modelStruct$corStruct,unconstrained=F)
        nphis<-paste('lag',1:length(phis))
        data.frame(
            lag=nphis,
            phi=phis
            )
    }) %>%
    spread(lag,phi) %>% 
    kable(digits=2)
## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character
Total_Lags  lag 1  lag 2  lag 3  lag 4
         1   0.27     NA     NA     NA
         2   0.23   0.18     NA     NA
         3   0.20   0.14   0.15     NA
         4   0.18   0.12   0.11   0.19
kable(
    cbind(
        list(Lags=c(1,2)),
        anova(lag_models$model[[1]],lag_models$model[[2]])[,c(-1,-2)]),
    row.names=F)
Lags  df       AIC       BIC     logLik    Test   L.Ratio  p-value
   1   5  29359.82  29401.16  -14674.91               NA       NA
   2   6  28486.57  28536.17  -14237.28  1 vs 2  875.2539        0
kable(
    cbind(
        list(Lags=c(2,3)),
        anova(lag_models$model[[2]],lag_models$model[[3]])[,c(-1,-2)]),
    row.names=F)
Lags  df       AIC       BIC     logLik    Test   L.Ratio  p-value
   2   6  28486.57  28536.17  -14237.28               NA       NA
   3   7  27892.35  27950.22  -13939.17  1 vs 2  596.2176        0
kable(
    cbind(
        list(Lags=c(3,4)),
        anova(lag_models$model[[3]],lag_models$model[[4]])[,c(-1,-2)]),
    row.names=F)
Lags  df       AIC       BIC     logLik    Test   L.Ratio  p-value
   3   7  27892.35  27950.22  -13939.17               NA       NA
   4   8  26898.26  26964.40  -13441.13  1 vs 2  996.0902        0

We can look at the predicted correlation matrix:

predicted_ARMA<-corARMA(value=c(.18,.12,.11,.19),p=4,form=~slide_num)

# Note: predict_df (a data frame with a slide_num column, created in the
# prediction chunk below) needs to exist before this Initialize() call is run.
predicted_ARMA_init<-Initialize(predicted_ARMA,data=predict_df)
corMatARMA<-corMatrix(predicted_ARMA_init)

corMatARMA %>%
as.data.frame %>%
mutate(x=1:n()) %>%
gather(y,value,-x,convert=T) %>%
mutate(y=as.numeric(sub('V(.*)','\\1',y))) %>%
filter(value > .01) %>%
ggplot(aes(x=x,y=y))+
        geom_tile(aes(fill=(value)))+
        theme(
            line=element_line(color='white'),
            panel.background=element_rect(fill='white'))

[Plot: the predicted residual correlation matrix]

And we can produce simulated data, using our fixed-effect estimates for the linear trend as the mu and our predicted correlation matrix as the Sigma for MASS::mvrnorm.

predict_df<-data.frame(slide_num=1:720)
predict_df$dt<-predict(
    lag_models$model[[4]],
    predict_df,
    level=0)

library(MASS)

predict_df$dt_err<-mvrnorm(1,mu=scale(predict_df$dt),Sigma=corMatARMA)

Our predictions look like this:

predict_df %>% 
    ggplot(aes(x=slide_num))+
    geom_line(aes(y=dt_err))+
    theme(panel.background=element_rect(fill='white'))

[Plot: simulated dwell time across slides]

Our data look like this:

aDF_l %>%
    ggplot(aes(x=slide_num,y=dt))+
    geom_line()+
    facet_wrap(~SubjID,scale='free_y')+
    theme(panel.background=element_rect(fill='white'))

[Plot: observed dwell time across slides, one panel per participant]

Not bad, visually speaking. I suspect we could do better if we modeled the underlying rate parameter for a Poisson process, but that’s a whole other can of worms.
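
Just to sketch what that might look like (this is not fit here, and the model family is only a guess at how one might approximate the rate idea): if each dwell time were treated as a waiting time of a Poisson process, the dwell times would be roughly exponentially distributed, and one could model the log of the mean dwell time (the inverse of the rate) with a generalized linear mixed model, since the exponential is a Gamma with shape 1.

# A sketch only, not run:
# library(lme4)
# rateModel <- glmer(
#     dt ~ 1 + slide_num + (1 | SubjID),
#     data = aDF_l,
#     family = Gamma(link = "log"))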

Here’s what happens if we just dummy-variable each element position:


summary(update(nullModel,.~.+as.factor(element_position)))

## Linear mixed-effects model fit by REML
##  Data: aDF_l 
##        AIC      BIC    logLik
##   31554.64 31612.51 -15770.32
## 
## Random effects:
##  Formula: ~1 | SubjID
##         (Intercept)  Residual
## StdDev:   0.4327605 0.4161319
## 
## Fixed effects: dt ~ slide_num + as.factor(element_position) 
##                                   Value  Std.Error    DF   t-value p-value
## (Intercept)                   0.7135225 0.06873087 28756  10.38140  0.0000
## slide_num                    -0.0005524 0.00001180 28756 -46.81881  0.0000
## as.factor(element_position)2 -0.0202103 0.00693554 28756  -2.91402  0.0036
## as.factor(element_position)3 -0.0161110 0.00693557 28756  -2.32295  0.0202
## as.factor(element_position)4 -0.0170860 0.00693562 28756  -2.46351  0.0138
##  Correlation: 
##                              (Intr) sld_nm a.(_)2 a.(_)3
## slide_num                    -0.061                     
## as.factor(element_position)2 -0.050 -0.002              
## as.factor(element_position)3 -0.050 -0.003  0.500       
## as.factor(element_position)4 -0.050 -0.005  0.500  0.500
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.82219320 -0.30539798 -0.07029029  0.16628406 88.39624069 
## 
## Number of Observations: 28800
## Number of Groups: 40 

Latent Growth Curves in R

This is based in part on Nicole’s code found in this post.



sem_in_r.r




First, reading in the data:

library(foreign)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lavaan)
library(semPlot)
library(knitr)

setwd('/home/jflournoy/code/sem_in_r/')

pdr2<-read.spss("PDR Wave 2.sav", to.data.frame=T)
# pdr4<-read.spss("PDR Wave 4.sav", to.data.frame=T)

Generate a time variable

This indexes each call for each family.

pdr2_time <- pdr2 %>%
group_by(FAMILY) %>% #do the count by family
  arrange(YEAR,MONTH,DAY) %>% #sort by date
  mutate(callindex=1:n()) #create call index that's 1:end for each family

head(pdr2_time[,c('FAMILY','callindex')])
## Source: local data frame [6 x 2]
## Groups: FAMILY
## 
##   FAMILY callindex
## 1  TP001         1
## 2  TP001         2
## 3  TP001         3
## 4  TP001         4
## 5  TP001         5
## 6  TP001         6

Create composite score of kid bex

pdr2_time_bxtrans<-pdr2_time %>% ungroup %>% 
  mutate_each(
    funs(as.numeric(!(.=='DID NOT OCCUR'))),
    P31201:P31240)

To break down the above statement:

pdr2_time gets sent to ungroup, which removes the grouping by FAMILY we did above, and that gets sent to mutate_each. This takes a range of columns that we define in the second argument as P31201:P31240 which reads ‘from P31201 to P31240’.

The meat of mutate_each is what goes in the first argument, within the function funs(). You can list a bunch of functions here if you wanted to mutate all the columns in a number of different ways. The period character, ., represents the column that will be passed to that function. In this case, we just check if each element of the column is ‘DID NOT OCCUR’, and if so, we negate it (giving us FALSE) and then as.numeric it giving us ‘0’. If the response is any other option, we get a ‘1’, which is what we want. Importantly, this will return NA if the data is NA.

If you want to learn more, see ?mutate_each. Moving on now…
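
As a toy illustration (a made-up two-row data frame, not the PDR data), the same pattern looks like this:

toy <- data.frame(item1 = c('DID NOT OCCUR', 'OCCURRED'),
                  item2 = c('OCCURRED, STRESSED', NA),
                  stringsAsFactors = FALSE)
toy %>% mutate_each(funs(as.numeric(!(. == 'DID NOT OCCUR'))), item1:item2)
##   item1 item2
## 1     0     1
## 2     1    NA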

head(pdr2_time_bxtrans)
## Source: local data frame [6 x 63]
## 
##   FAMILY RESP MONTH DAY YEAR INT   WEEKDAY P31201 P31202 P31203 P31204
## 1  TP001    3     7  22 2004  4H WEDNESDAY      0      1      1      0
## 2  TP001    3     7  23 2004  4H   THURSAY      0      1      1      0
## 3  TP001    3     7  27 2004  4H    MONDAY      0      1      1      0
## 4  TP001    3     7  28 2004  4H   TUESDAY      0      1      1      0
## 5  TP001    3     8   3 2004  4H    MONDAY      0      1      1      0
## 6  TP001    3     8   4 2004  4H   TUESDAY      0      1      1      0
## Variables not shown: P31205 (dbl), P31206 (dbl), P31207 (dbl), P31208
##   (dbl), P31209 (dbl), P31210 (dbl), P31211 (dbl), P31212 (dbl), P31213
##   (dbl), P31214 (dbl), P31215 (dbl), P31216 (dbl), P31217 (dbl), P31218
##   (dbl), P31219 (dbl), P31220 (dbl), P31221 (dbl), P31222 (dbl), P31223
##   (dbl), P31224 (dbl), P31225 (dbl), P31226 (dbl), P31227 (dbl), P31228
##   (dbl), P31229 (dbl), P31230 (dbl), P31231 (dbl), P31232 (dbl), P31233
##   (dbl), P31234 (dbl), P31235 (dbl), P31236 (dbl), P31237 (dbl), P31238
##   (dbl), P31239 (dbl), P31240 (dbl), P31241 (fctr), P31242 (fctr), P31242A
##   (fctr), P31242B (fctr), P31242C (fctr), P31242D (fctr), P31243 (fctr),
##   P31243A (fctr), P31243B (fctr), P31243C (fctr), P31243D (fctr), P31244
##   (dbl), P31245 (dbl), WAVE (dbl), PILOT1 (fctr), callindex (int)

Now we can create the composite variable using a sum. There is missing data, so that should be dealt with, but we’ll ignore that for now.

pdr2_time_bxtrans$bextot<-pdr2_time_bxtrans %>%
  select(P31201:P31240) %>% rowSums(na.rm=T)

summary(pdr2_time_bxtrans$bextot)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   4.000   5.307   8.000  26.000
hist(pdr2_time_bxtrans$bextot)
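
One simple way the missingness could be dealt with (just a sketch; the cutoff of 10 items is arbitrary and this isn’t used in what follows) is to count the missing items per call and drop totals that are based on too few items:

n_missing <- pdr2_time_bxtrans %>% select(P31201:P31240) %>% is.na %>% rowSums
table(n_missing) # how many calls have how many missing items
# e.g., one could then do: pdr2_time_bxtrans$bextot[n_missing > 10] <- NA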

We have a call-index variable, but if you want a time index that is really the number of days since they were first contacted, here’s how to do that.

pdr2_time_bxtrans_t2<-
  pdr2_time_bxtrans %>% 
  mutate(formd_date=as.Date(paste(MONTH,DAY,YEAR,sep='/'),"%m/%d/%Y")) %>% 
  group_by(FAMILY) %>%
  arrange(YEAR,MONTH,DAY) %>%
  mutate(days_since_1st=formd_date-min(formd_date))

Some descriptive plots

Let’s see what we’re working with.

First, the raw data – a line for every family:

pdr2_time_bxtrans_t2 %>% filter(callindex < 33) %>%
  ggplot(aes(x=callindex,y=bextot))+
    geom_line(aes(group=FAMILY),alpha=.1)+
    theme(panel.background=element_rect(fill='white'))

Next, a smoothed loess curve for every family:

pdr2_time_bxtrans_t2 %>% filter(callindex < 33) %>%
  ggplot(aes(x=callindex,y=bextot))+
    geom_smooth(aes(group=FAMILY),method=loess,se=F)+
    theme(panel.background=element_rect(fill='white'))

Finally, the first plot again, but with a group-level smoothed curve – a line, and a second and third degree polynomial just to see what that looks like:

pdr2_time_bxtrans_t2 %>% filter(callindex < 33) %>%
  ggplot(aes(x=callindex,y=bextot))+
    geom_line(aes(group=FAMILY),alpha=.05)+
    geom_smooth(method=lm,formula=y~poly(x,1),color='dark orange',se=F)+
    geom_smooth(method=lm,formula=y~poly(x,2),color='red',se=F)+
    geom_smooth(method=lm,formula=y~poly(x,3),color='black',se=F)+
    theme(panel.background=element_rect(fill='white'))

From the plot it looks like either a linear or quadratic would capture this pretty well.

Widen the data by call index

Now we can make a wide data file to examine the trend of bextot.

pdr2_bex_w_by_index<-
  pdr2_time_bxtrans_t2 %>% select(FAMILY,bextot,callindex) %>% 
  filter(callindex<33) %>% # just the first 32 to minimize missingness
  mutate(callindex=paste('call',formatC(callindex, width = 2, format = "d", flag = "0"),sep='_')) %>%
  spread(callindex,bextot)

Make R make the lavaan model code

Let’s make the lavaan code. We don’t want to write out each ‘call_N’ variable by hand, so we can do it programmatically.

(allthecallnames<-grep('call',names(pdr2_bex_w_by_index),value=T))
##  [1] "call_01" "call_02" "call_03" "call_04" "call_05" "call_06" "call_07"
##  [8] "call_08" "call_09" "call_10" "call_11" "call_12" "call_13" "call_14"
## [15] "call_15" "call_16" "call_17" "call_18" "call_19" "call_20" "call_21"
## [22] "call_22" "call_23" "call_24" "call_25" "call_26" "call_27" "call_28"
## [29] "call_29" "call_30" "call_31" "call_32"
(callnameswith_i_weights<-paste('1*',allthecallnames,sep=''))
##  [1] "1*call_01" "1*call_02" "1*call_03" "1*call_04" "1*call_05"
##  [6] "1*call_06" "1*call_07" "1*call_08" "1*call_09" "1*call_10"
## [11] "1*call_11" "1*call_12" "1*call_13" "1*call_14" "1*call_15"
## [16] "1*call_16" "1*call_17" "1*call_18" "1*call_19" "1*call_20"
## [21] "1*call_21" "1*call_22" "1*call_23" "1*call_24" "1*call_25"
## [26] "1*call_26" "1*call_27" "1*call_28" "1*call_29" "1*call_30"
## [31] "1*call_31" "1*call_32"

Centering is really important for interpreting the linear effect in the presence of a quadratic. The intercept and linear slope term here are interpreted only at the first timepoint, though one can easily calculate the expected value of a point at any timepoint.
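
To make the “expected value at any timepoint” point concrete: with latent means for i, s, and q, the model-implied expectation at call index t (where t = 0 is the first call) is mean(i) + mean(s)*t + mean(q)*t^2. For example, plugging in the estimates that summary(fit) reports further below:

t <- c(0, 15, 31)
7.402 + (-0.261)*t + 0.005*t^2
## [1] 7.402 4.612 4.116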

(callnameswith_s_weights<-paste(0:31,'*',allthecallnames,sep=''))
##  [1] "0*call_01"  "1*call_02"  "2*call_03"  "3*call_04"  "4*call_05" 
##  [6] "5*call_06"  "6*call_07"  "7*call_08"  "8*call_09"  "9*call_10" 
## [11] "10*call_11" "11*call_12" "12*call_13" "13*call_14" "14*call_15"
## [16] "15*call_16" "16*call_17" "17*call_18" "18*call_19" "19*call_20"
## [21] "20*call_21" "21*call_22" "22*call_23" "23*call_24" "24*call_25"
## [26] "25*call_26" "26*call_27" "27*call_28" "28*call_29" "29*call_30"
## [31] "30*call_31" "31*call_32"
(callnameswith_q_weights<-paste((0:31)^2,'*',allthecallnames,sep='')) 
##  [1] "0*call_01"   "1*call_02"   "4*call_03"   "9*call_04"   "16*call_05" 
##  [6] "25*call_06"  "36*call_07"  "49*call_08"  "64*call_09"  "81*call_10" 
## [11] "100*call_11" "121*call_12" "144*call_13" "169*call_14" "196*call_15"
## [16] "225*call_16" "256*call_17" "289*call_18" "324*call_19" "361*call_20"
## [21] "400*call_21" "441*call_22" "484*call_23" "529*call_24" "576*call_25"
## [26] "625*call_26" "676*call_27" "729*call_28" "784*call_29" "841*call_30"
## [31] "900*call_31" "961*call_32"
(i_weights_collapsed<-paste(callnameswith_i_weights,collapse=' + '))
## [1] "1*call_01 + 1*call_02 + 1*call_03 + 1*call_04 + 1*call_05 + 1*call_06 + 1*call_07 + 1*call_08 + 1*call_09 + 1*call_10 + 1*call_11 + 1*call_12 + 1*call_13 + 1*call_14 + 1*call_15 + 1*call_16 + 1*call_17 + 1*call_18 + 1*call_19 + 1*call_20 + 1*call_21 + 1*call_22 + 1*call_23 + 1*call_24 + 1*call_25 + 1*call_26 + 1*call_27 + 1*call_28 + 1*call_29 + 1*call_30 + 1*call_31 + 1*call_32"
(s_weights_collapsed<-paste(callnameswith_s_weights,collapse=' + '))
## [1] "0*call_01 + 1*call_02 + 2*call_03 + 3*call_04 + 4*call_05 + 5*call_06 + 6*call_07 + 7*call_08 + 8*call_09 + 9*call_10 + 10*call_11 + 11*call_12 + 12*call_13 + 13*call_14 + 14*call_15 + 15*call_16 + 16*call_17 + 17*call_18 + 18*call_19 + 19*call_20 + 20*call_21 + 21*call_22 + 22*call_23 + 23*call_24 + 24*call_25 + 25*call_26 + 26*call_27 + 27*call_28 + 28*call_29 + 29*call_30 + 30*call_31 + 31*call_32"
(q_weights_collapsed<-paste(callnameswith_q_weights,collapse=' + '))
## [1] "0*call_01 + 1*call_02 + 4*call_03 + 9*call_04 + 16*call_05 + 25*call_06 + 36*call_07 + 49*call_08 + 64*call_09 + 81*call_10 + 100*call_11 + 121*call_12 + 144*call_13 + 169*call_14 + 196*call_15 + 225*call_16 + 256*call_17 + 289*call_18 + 324*call_19 + 361*call_20 + 400*call_21 + 441*call_22 + 484*call_23 + 529*call_24 + 576*call_25 + 625*call_26 + 676*call_27 + 729*call_28 + 784*call_29 + 841*call_30 + 900*call_31 + 961*call_32"
model <- paste(
  ' i =~ ',i_weights_collapsed,'\n\n',
  ' s =~ ',s_weights_collapsed,'\n\n',
  ' q =~ ',q_weights_collapsed,sep='')
cat(model)
##  i =~ 1*call_01 + 1*call_02 + 1*call_03 + 1*call_04 + 1*call_05 + 1*call_06 + 1*call_07 + 1*call_08 + 1*call_09 + 1*call_10 + 1*call_11 + 1*call_12 + 1*call_13 + 1*call_14 + 1*call_15 + 1*call_16 + 1*call_17 + 1*call_18 + 1*call_19 + 1*call_20 + 1*call_21 + 1*call_22 + 1*call_23 + 1*call_24 + 1*call_25 + 1*call_26 + 1*call_27 + 1*call_28 + 1*call_29 + 1*call_30 + 1*call_31 + 1*call_32
## 
##  s =~ 0*call_01 + 1*call_02 + 2*call_03 + 3*call_04 + 4*call_05 + 5*call_06 + 6*call_07 + 7*call_08 + 8*call_09 + 9*call_10 + 10*call_11 + 11*call_12 + 12*call_13 + 13*call_14 + 14*call_15 + 15*call_16 + 16*call_17 + 17*call_18 + 18*call_19 + 19*call_20 + 20*call_21 + 21*call_22 + 22*call_23 + 23*call_24 + 24*call_25 + 25*call_26 + 26*call_27 + 27*call_28 + 28*call_29 + 29*call_30 + 30*call_31 + 31*call_32
## 
##  q =~ 0*call_01 + 1*call_02 + 4*call_03 + 9*call_04 + 16*call_05 + 25*call_06 + 36*call_07 + 49*call_08 + 64*call_09 + 81*call_10 + 100*call_11 + 121*call_12 + 144*call_13 + 169*call_14 + 196*call_15 + 225*call_16 + 256*call_17 + 289*call_18 + 324*call_19 + 361*call_20 + 400*call_21 + 441*call_22 + 484*call_23 + 529*call_24 + 576*call_25 + 625*call_26 + 676*call_27 + 729*call_28 + 784*call_29 + 841*call_30 + 900*call_31 + 961*call_32

Fit the model!

fit <- growth(model, data=pdr2_bex_w_by_index)

Here’s the model we fit:

semPaths(fit)

And here’s the summary:

summary(fit)
## lavaan (0.5-18) converged normally after 169 iterations
## 
##                                                   Used       Total
##   Number of observations                            72         100
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic              838.225
##   Degrees of freedom                               519
##   P-value (Chi-square)                           0.000
## 
## Parameter estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
##                    Estimate  Std.err  Z-value  P(>|z|)
## Latent variables:
##   i =~
##     call_01           1.000
##     call_02           1.000
##     call_03           1.000
##     call_04           1.000
##     call_05           1.000
##     call_06           1.000
##     call_07           1.000
##     call_08           1.000
##     call_09           1.000
##     call_10           1.000
##     call_11           1.000
##     call_12           1.000
##     call_13           1.000
##     call_14           1.000
##     call_15           1.000
##     call_16           1.000
##     call_17           1.000
##     call_18           1.000
##     call_19           1.000
##     call_20           1.000
##     call_21           1.000
##     call_22           1.000
##     call_23           1.000
##     call_24           1.000
##     call_25           1.000
##     call_26           1.000
##     call_27           1.000
##     call_28           1.000
##     call_29           1.000
##     call_30           1.000
##     call_31           1.000
##     call_32           1.000
##   s =~
##     call_01           0.000
##     call_02           1.000
##     call_03           2.000
##     call_04           3.000
##     call_05           4.000
##     call_06           5.000
##     call_07           6.000
##     call_08           7.000
##     call_09           8.000
##     call_10           9.000
##     call_11          10.000
##     call_12          11.000
##     call_13          12.000
##     call_14          13.000
##     call_15          14.000
##     call_16          15.000
##     call_17          16.000
##     call_18          17.000
##     call_19          18.000
##     call_20          19.000
##     call_21          20.000
##     call_22          21.000
##     call_23          22.000
##     call_24          23.000
##     call_25          24.000
##     call_26          25.000
##     call_27          26.000
##     call_28          27.000
##     call_29          28.000
##     call_30          29.000
##     call_31          30.000
##     call_32          31.000
##   q =~
##     call_01           0.000
##     call_02           1.000
##     call_03           4.000
##     call_04           9.000
##     call_05          16.000
##     call_06          25.000
##     call_07          36.000
##     call_08          49.000
##     call_09          64.000
##     call_10          81.000
##     call_11         100.000
##     call_12         121.000
##     call_13         144.000
##     call_14         169.000
##     call_15         196.000
##     call_16         225.000
##     call_17         256.000
##     call_18         289.000
##     call_19         324.000
##     call_20         361.000
##     call_21         400.000
##     call_22         441.000
##     call_23         484.000
##     call_24         529.000
##     call_25         576.000
##     call_26         625.000
##     call_27         676.000
##     call_28         729.000
##     call_29         784.000
##     call_30         841.000
##     call_31         900.000
##     call_32         961.000
## 
## Covariances:
##   i ~~
##     s                -0.266    0.154   -1.721    0.085
##     q                 0.004    0.004    1.008    0.313
##   s ~~
##     q                -0.001    0.000   -2.018    0.044
## 
## Intercepts:
##     call_01           0.000
##     call_02           0.000
##     call_03           0.000
##     call_04           0.000
##     call_05           0.000
##     call_06           0.000
##     call_07           0.000
##     call_08           0.000
##     call_09           0.000
##     call_10           0.000
##     call_11           0.000
##     call_12           0.000
##     call_13           0.000
##     call_14           0.000
##     call_15           0.000
##     call_16           0.000
##     call_17           0.000
##     call_18           0.000
##     call_19           0.000
##     call_20           0.000
##     call_21           0.000
##     call_22           0.000
##     call_23           0.000
##     call_24           0.000
##     call_25           0.000
##     call_26           0.000
##     call_27           0.000
##     call_28           0.000
##     call_29           0.000
##     call_30           0.000
##     call_31           0.000
##     call_32           0.000
##     i                 7.402    0.417   17.760    0.000
##     s                -0.261    0.038   -6.913    0.000
##     q                 0.005    0.001    5.015    0.000
## 
## Variances:
##     call_01          16.731    3.107
##     call_02           9.807    1.915
##     call_03          11.202    2.087
##     call_04           9.089    1.700
##     call_05          17.265    3.023
##     call_06          15.249    2.667
##     call_07          18.412    3.178
##     call_08          15.511    2.684
##     call_09          16.776    2.886
##     call_10          10.061    1.764
##     call_11          12.751    2.208
##     call_12          14.664    2.524
##     call_13          12.104    2.097
##     call_14          13.257    2.289
##     call_15          11.621    2.017
##     call_16           7.045    1.257
##     call_17          10.488    1.828
##     call_18          11.085    1.927
##     call_19           9.291    1.628
##     call_20           8.794    1.545
##     call_21          11.203    1.945
##     call_22          11.655    2.020
##     call_23           7.416    1.316
##     call_24           7.003    1.249
##     call_25          10.300    1.800
##     call_26          13.740    2.377
##     call_27          13.083    2.275
##     call_28           8.501    1.526
##     call_29           8.287    1.507
##     call_30           8.291    1.533
##     call_31           7.677    1.469
##     call_32          10.034    1.899
##     i                 9.498    2.095
##     s                 0.042    0.017
##     q                 0.000    0.000

Starting with this growth model, we could now add in predictors and outcomes for the latent i, s, and q variables.
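
For example (a minimal sketch; famstress and distal_outcome are hypothetical variable names that are not in this data set), the lavaan syntax would just add regressions involving the growth factors:

model_extended <- paste(model, '
  # regress the growth factors on a (hypothetical) family-level predictor
  i + s + q ~ famstress
  # and use the growth factors to predict a (hypothetical) distal outcome
  distal_outcome ~ i + s + q
', sep = '\n')
# fit_extended <- growth(model_extended, data = pdr2_bex_w_by_index)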



SEM in R: Adding a time index and recoding the Behavioral Scores

#' First, reading in the data:
#' 
library(foreign)
setwd('C:/Users/flournoy/Downloads')

pdr2<-read.spss("PDR Wave 2.sav", to.data.frame=T)
pdr4<-read.spss("PDR Wave 4.sav", to.data.frame=T)
head(pdr2)
summary(pdr2)

#' ## Generate a time variable
#' 
#' This indexes each call for each family. This time variable
#' will be important later for widening the data for SEM.
#' 
library(dplyr)

pdr2_time <- pdr2 %>%
group_by(FAMILY) %>% #do the count by family
 arrange(YEAR,MONTH,DAY) %>% #sort by date
 mutate(callindex=1:n()) #create call index that’s 1:end for each family

head(pdr2_time)

#' ## Create composite score of kid bex
#' 
#' Scores Found in (cols 8:47).

library(car)
#?recode
## recode all of the kid behavior items into numeric variables in one go
# ignore difference between “occurred, stressed” and “occurred, not stressed”
# (if it occurred at all, it gets a 1; if it didn’t occur, it’s a 0)
# the defaults for as.XX.result are both TRUE; if you leave it that way,
# it will return character variables.

pdr2_time[,8:47]<-sapply(pdr2_time[,8:47],
 function(x)
 {x<-recode(x,"'DID NOT OCCUR'='0'; else = '1'",
 as.factor.result=F, as.numeric.result=T)})

head(pdr2_time)
str(pdr2_time)

#' Now we can create the composite (sum) variable
#' 
pdr2_time$bextot<-rowSums(pdr2_time[,8:47],na.rm=F)

summary(pdr2_time$bextot)

Some Data Manipulation in R with SPSS Variable Names and Labels

Following our conversation today, here are a few steps to take to use SPSS variable “labels” in place of variable names — we were working with a dataset that had variable names that were less informative than the variable labels, but that still did need to be referenced from time to time.


setwd("C:/Users/jleverni/Desktop")

# Import the SPSS file:
library('foreign')
data.wave2 <- read.spss("PDR Wave 2.sav", to.data.frame = TRUE, use.value.labels = TRUE)

# Note that the variable labels from SPSS have been imported as an "attribute" called "variable.labels":
str(data.wave2)

# We're about to overwrite the existing variable names with those variable labels.
# Thus, we'll save a backup copy of the existing variable names as an attribute first:
attr(data.wave2, "variable.originalName") <- colnames(data.wave2)

# Check our work:
str(data.wave2)

# Take all columns that have an attribute label (i.e., where the attribute label is not ""),
# and replace that column's name with that attribute label:
colnames(data.wave2)[attr(data.wave2, "variable.labels") != ""] <-
  attr(data.wave2, "variable.labels")[attr(data.wave2, "variable.labels") != ""]

# Check our work:
str(data.wave2)

# Make it easy to look up the original variable name, if we want to at any point:
lookupName <- function(columnName, dataFrame = data.wave2, attribute = "variable.originalName") {
  print(attr(dataFrame, attribute)[which(colnames(dataFrame) == columnName)])
}

# Look up the original variable name of the column that is now titled "Lying", for example:
lookupName("Lying")

Week 3: SEM!

We have another consultation lined up for our third R Club this term: Jocelyn has a few models she wants to run for her dissertation, using SEM. She started the analysis in SPSS Amos, but she’s not that far into it yet and she’s wondering if R would be a better choice. She’ll tell us about her models, and we’ll try to write out some code for them! It will be a great opportunity to play around with SEM in R, for those of you who have been looking for an excuse to do that, and I’m guessing it will also be an instructive conversation about how to articulate theories and ideas as statistical models.

Tue (tomorrow) 3:00-4:20pm, in Straub 008.

See you there!

wide to long: what we learned

As the French say, there are many methods to pluck a duck, and this duck is particularly feathery, so this is by no means the only reasonable approach. That said, the solution we ended up going with uses tidyr, and I think it’s a pretty nice one (note that you do need a recent version of R to use it, though).

Note that we used the dummy dataset Melissa provided (see the original post), so the column names are all correct, but the data are just random numbers (that’s why, for example, there are negative subject IDs).

A couple nifty little tidbits:

  1. The sep argument in the reshape() function from the stats package does not work the way you might expect it to! Here is an rpubs post demonstrating the issue.
  2. Also, reshape() chokes on the missing columns in wide format, as I mentioned in the original post, but tidyr handles them just fine, actually (it just assumes you want NAs filled in there). Nice little package. 🙂

Here’s the code!

setwd("~/Downloads")
data <- read.csv("ML Data for R.csv", row.names=1)


# standardize naming convention for variables
colnames(data) <- sub(pattern=".T", replacement="_T", x=colnames(data), fixed=TRUE)

# note that reshape() does not work as expected!! The sep argument in reshape doesn't do what it appears to do.

library(tidyr); library(dplyr) 
# I mostly stole this code from John Flournoy :)
data.long <- data %>% 
  # We gather all the columns except for ADS.id into a single column 
  # of variable names and another column of values.
  gather(variable,value,-ADS.id) %>%
  # We now can separate the variable column by splitting each variable name
  # at the '_T' -- now we have two columns, one with the variable name and
  # one with the timepoint
  separate(variable,c('variable','timepoint'),sep='_T') %>%
  # This step spreads it out so that each variable gets a column with the
  # associated values
  spread(variable,value)

# snapshots of the process...
head(gather(data, variable,value,-ADS.id)) 
##      ADS.id   variable      value
## 1 1.0018516 age.mri_T1 -0.3092724
## 2 1.1181216 age.mri_T1 -0.6904551
## 3 1.1211195 age.mri_T1  0.7601895
## 4 1.0241864 age.mri_T1 -1.0652325
## 5 0.2439787 age.mri_T1  2.1723328
## 6 1.5826904 age.mri_T1  1.0067890
head(separate(gather(data, variable,value,-ADS.id), variable,c('variable','timepoint'),sep='_T'))
##      ADS.id variable timepoint      value
## 1 1.0018516  age.mri         1 -0.3092724
## 2 1.1181216  age.mri         1 -0.6904551
## 3 1.1211195  age.mri         1  0.7601895
## 4 1.0241864  age.mri         1 -1.0652325
## 5 0.2439787  age.mri         1  2.1723328
## 6 1.5826904  age.mri         1  1.0067890
head(data.long) # the result!
##      ADS.id timepoint        age    age.mri   ccesdtot    ccesdtot2
## 1 -3.188239         1         NA  0.2349363 -0.3167452 -0.502415557
## 2 -3.188239         2         NA         NA -1.5850726 -0.004856296
## 3 -3.188239         3 -0.6822444         NA -0.8233850  0.630903711
## 4 -3.188239         4         NA         NA  0.8308063  0.047788788
## 5 -2.384302         1         NA -0.4050702 -0.1712919  1.599556240
## 6 -2.384302         2         NA         NA  0.5175698 -0.888349761
##   D.15.age.months errors.c.cong errors.c.cong.PERC errors.c.inc
## 1              NA     1.1203254         -0.2384952   -1.9943953
## 2              NA            NA                 NA           NA
## 3       -1.294505     0.8450592         -0.2829815    1.1129147
## 4              NA     0.5954789         -0.8289942    0.2315604
## 5              NA    -0.3532590          0.9763195    0.7725308
## 6              NA            NA                 NA           NA
##   errors.c.inc.PERC Errors.congb Errors.congt errors.i.cong
## 1         0.2549887    1.5140582    0.9740986     0.1182785
## 2                NA           NA           NA            NA
## 3         0.3035785    0.6272806   -0.4459871    -0.4603392
## 4         0.4983972   -0.2078286    0.9844005     1.4174055
## 5         0.1856251   -1.7249345    0.0948151    -0.3314606
## 6                NA           NA           NA            NA
##   errors.i.cong.PERC errors.i.inc errors.i.inc.PERC Errors.incb
## 1          0.1808000    1.6272805        -1.5131672  -0.8726752
## 2                 NA           NA                NA          NA
## 3          1.6620035   -0.5556201        -0.3994536   0.5846239
## 4          0.2401470   -0.1398306         1.4187102  -0.9474623
## 5         -0.1865299   -0.7414438         0.2517365   0.4851024
## 6                 NA           NA                NA          NA
##   Errors.inct errors.prac errors.slow.c.cong errors.slow.c.inc
## 1   -1.062035   1.2304198          0.2057605          0.453230
## 2          NA          NA                 NA                NA
## 3   -2.079247   0.2014654         -1.1325324          1.384324
## 4   -0.114631  -0.3909164          0.8461512          2.109982
## 5   -1.908316  -1.2991850          0.2152569          1.251143
## 6          NA          NA                 NA                NA
##   errors.slow.i.cong errors.slow.i.inc errors.slow.prac Errors.total
## 1         -0.5650468      -0.260194217      -1.05898000    1.1322255
## 2                 NA                NA               NA           NA
## 3         -1.5816222       0.005833924       1.05445791    1.3550896
## 4         -1.0825809       0.107324351       0.08879325    0.2439465
## 5          1.1278321      -0.937676510      -0.43545535   -0.3518373
## 6                 NA                NA               NA           NA
##   errors.total.c.cong errors.total.c.inc errors.total.i.cong
## 1           0.9362488         -0.2737423           0.1255926
## 2                  NA                 NA                  NA
## 3          -1.6188499         -0.6009969           0.8908975
## 4          -0.1001239         -0.4400096          -1.5393167
## 5          -0.7427545          1.2649423          -1.2623598
## 6                  NA                 NA                  NA
##   errors.total.i.inc errors.total.PERC errors.total.prac Int.RT.All
## 1         -0.2323253         0.5850435         1.3739468 -1.6736488
## 2                 NA                NA                NA         NA
## 3          0.2039495         1.5558254        -1.3495371 -1.2342580
## 4          1.8545902        -0.8207177        -0.1254748 -0.5403726
## 5          0.9093587        -1.8192946         0.9457083 -1.1837205
## 6                 NA                NA                NA         NA
##   Int.RT.ProB Int.RT.ReaB     RT.avg  rt.c.cong   rt.c.inc    RT.conb
## 1  -0.4787541   0.4969283 -2.3207763  0.1943951 -1.4280047  0.1893392
## 2          NA          NA         NA         NA         NA         NA
## 3  -1.6069571   0.2826211 -0.8354791 -0.4977420 -1.3454648         NA
## 4   0.9181784   0.5912347 -0.7084950  0.4040288  1.9530110         NA
## 5  -2.0270125   0.3067939 -0.1157335 -0.9217092  0.7747549 -1.1509858
## 6          NA          NA         NA         NA         NA         NA
##    RT.congb   RT.congt  rt.i.cong   rt.i.inc    RT.incb    RT.inct
## 1        NA  0.5066862 -0.8200358  0.1091896 -0.1578207 -0.2115042
## 2        NA         NA         NA         NA         NA         NA
## 3 -1.295408  0.2670604 -0.6976511 -0.8435179 -0.9275700  0.8909898
## 4 -1.026102 -0.7030719  1.0139764  1.1513498 -0.9378490 -0.5800233
## 5        NA -0.1882044  0.2780732  0.3299927  1.1015153 -0.5091075
## 6        NA         NA         NA         NA         NA         NA
##        rt.prac
## 1 -1.177762051
## 2           NA
## 3 -0.698241844
## 4 -0.354985516
## 5  0.009850542
## 6           NA
# tidying, checking...
data.long$timepoint <- as.factor(data.long$timepoint)
data.long$ADS.id <- as.factor(data.long$ADS.id)
summary(data.long)
##                ADS.id    timepoint      age             age.mri       
##  -3.18823887209027:  4   1:246     Min.   :-2.8179   Min.   :-2.2391  
##  -2.38430177492141:  4   2:246     1st Qu.:-0.8310   1st Qu.:-0.6778  
##  -2.19765389063435:  4   3:246     Median :-0.0078   Median : 0.0045  
##  -1.93658469426135:  4   4:246     Mean   :-0.0353   Mean   : 0.0473  
##  -1.88871398596907:  4             3rd Qu.: 0.6482   3rd Qu.: 0.7528  
##  -1.87775103867895:  4             Max.   : 2.8244   Max.   : 3.9297  
##  (Other)          :960             NA's   :738       NA's   :738      
##     ccesdtot          ccesdtot2         D.15.age.months  
##  Min.   :-2.73145   Min.   :-3.165353   Min.   :-2.7352  
##  1st Qu.:-0.73198   1st Qu.:-0.627016   1st Qu.:-0.6356  
##  Median :-0.06160   Median :-0.007192   Median :-0.0218  
##  Mean   :-0.04439   Mean   : 0.023682   Mean   : 0.0226  
##  3rd Qu.: 0.60229   3rd Qu.: 0.692493   3rd Qu.: 0.7302  
##  Max.   : 2.87977   Max.   : 2.831624   Max.   : 2.5312  
##                                         NA's   :738      
##  errors.c.cong      errors.c.cong.PERC  errors.c.inc     
##  Min.   :-3.02694   Min.   :-3.02622   Min.   :-3.07194  
##  1st Qu.:-0.65846   1st Qu.:-0.69937   1st Qu.:-0.65187  
##  Median : 0.03656   Median :-0.09637   Median : 0.01856  
##  Mean   :-0.00862   Mean   :-0.01181   Mean   : 0.00975  
##  3rd Qu.: 0.64811   3rd Qu.: 0.63642   3rd Qu.: 0.64352  
##  Max.   : 3.48170   Max.   : 3.13567   Max.   : 3.27508  
##  NA's   :246        NA's   :246        NA's   :246       
##  errors.c.inc.PERC   Errors.congb       Errors.congt     
##  Min.   :-3.83647   Min.   :-3.61005   Min.   :-4.00876  
##  1st Qu.:-0.65952   1st Qu.:-0.67655   1st Qu.:-0.70759  
##  Median :-0.00111   Median :-0.05863   Median :-0.01951  
##  Mean   :-0.00783   Mean   :-0.02327   Mean   :-0.00887  
##  3rd Qu.: 0.59038   3rd Qu.: 0.63483   3rd Qu.: 0.66770  
##  Max.   : 2.95515   Max.   : 3.22904   Max.   : 3.20207  
##  NA's   :246        NA's   :246        NA's   :246       
##  errors.i.cong      errors.i.cong.PERC  errors.i.inc     
##  Min.   :-3.21592   Min.   :-3.28548   Min.   :-3.16182  
##  1st Qu.:-0.71059   1st Qu.:-0.65964   1st Qu.:-0.71632  
##  Median :-0.06600   Median :-0.06822   Median : 0.00092  
##  Mean   :-0.04063   Mean   : 0.00061   Mean   :-0.03424  
##  3rd Qu.: 0.59837   3rd Qu.: 0.64682   3rd Qu.: 0.64873  
##  Max.   : 4.63775   Max.   : 3.26292   Max.   : 2.61837  
##  NA's   :246        NA's   :246        NA's   :246       
##  errors.i.inc.PERC   Errors.incb        Errors.inct      
##  Min.   :-3.18711   Min.   :-2.85501   Min.   :-2.85304  
##  1st Qu.:-0.71387   1st Qu.:-0.60294   1st Qu.:-0.66899  
##  Median :-0.00953   Median : 0.00232   Median :-0.02129  
##  Mean   : 0.00542   Mean   : 0.02473   Mean   : 0.00316  
##  3rd Qu.: 0.71589   3rd Qu.: 0.66226   3rd Qu.: 0.71004  
##  Max.   : 3.72285   Max.   : 3.05961   Max.   : 3.36236  
##  NA's   :246        NA's   :246        NA's   :246       
##   errors.prac       errors.slow.c.cong errors.slow.c.inc 
##  Min.   :-2.65139   Min.   :-2.87464   Min.   :-2.71655  
##  1st Qu.:-0.67979   1st Qu.:-0.68237   1st Qu.:-0.68067  
##  Median : 0.00760   Median : 0.00764   Median : 0.02927  
##  Mean   :-0.02728   Mean   :-0.00233   Mean   : 0.03072  
##  3rd Qu.: 0.66139   3rd Qu.: 0.67304   3rd Qu.: 0.73078  
##  Max.   : 2.64629   Max.   : 3.34258   Max.   : 2.96471  
##  NA's   :246        NA's   :246        NA's   :246       
##  errors.slow.i.cong errors.slow.i.inc  errors.slow.prac  
##  Min.   :-2.97736   Min.   :-3.48673   Min.   :-3.38586  
##  1st Qu.:-0.67957   1st Qu.:-0.64765   1st Qu.:-0.77620  
##  Median : 0.09023   Median :-0.00268   Median :-0.07298  
##  Mean   : 0.05189   Mean   :-0.02303   Mean   :-0.04885  
##  3rd Qu.: 0.70666   3rd Qu.: 0.62909   3rd Qu.: 0.66078  
##  Max.   : 2.64540   Max.   : 2.88500   Max.   : 2.63078  
##  NA's   :246        NA's   :246        NA's   :246       
##   Errors.total      errors.total.c.cong errors.total.c.inc
##  Min.   :-3.02983   Min.   :-3.94584    Min.   :-3.04545  
##  1st Qu.:-0.66766   1st Qu.:-0.71108    1st Qu.:-0.68723  
##  Median : 0.03033   Median :-0.01434    Median :-0.04592  
##  Mean   : 0.00020   Mean   :-0.03496    Mean   :-0.00674  
##  3rd Qu.: 0.67346   3rd Qu.: 0.68086    3rd Qu.: 0.68570  
##  Max.   : 2.83657   Max.   : 2.96111    Max.   : 3.49591  
##  NA's   :246        NA's   :246         NA's   :246       
##  errors.total.i.cong errors.total.i.inc errors.total.PERC 
##  Min.   :-3.35570    Min.   :-3.06428   Min.   :-3.52159  
##  1st Qu.:-0.69939    1st Qu.:-0.75856   1st Qu.:-0.68296  
##  Median :-0.01178    Median :-0.04273   Median :-0.06017  
##  Mean   :-0.02606    Mean   :-0.06233   Mean   :-0.06873  
##  3rd Qu.: 0.68439    3rd Qu.: 0.60996   3rd Qu.: 0.54828  
##  Max.   : 3.05471    Max.   : 2.70558   Max.   : 3.38433  
##  NA's   :246         NA's   :246        NA's   :246       
##  errors.total.prac    Int.RT.All        Int.RT.ProB      
##  Min.   :-2.78635   Min.   :-2.70803   Min.   :-2.93328  
##  1st Qu.:-0.75211   1st Qu.:-0.64684   1st Qu.:-0.64654  
##  Median :-0.11411   Median :-0.01643   Median : 0.06557  
##  Mean   :-0.05913   Mean   :-0.00633   Mean   : 0.02768  
##  3rd Qu.: 0.57326   3rd Qu.: 0.63270   3rd Qu.: 0.68377  
##  Max.   : 3.28847   Max.   : 3.13618   Max.   : 2.98533  
##  NA's   :246        NA's   :246        NA's   :246       
##   Int.RT.ReaB           RT.avg           rt.c.cong       
##  Min.   :-3.11339   Min.   :-2.61333   Min.   :-2.90894  
##  1st Qu.:-0.65444   1st Qu.:-0.64842   1st Qu.:-0.69766  
##  Median : 0.01216   Median :-0.00995   Median :-0.04453  
##  Mean   : 0.03416   Mean   :-0.00645   Mean   :-0.03129  
##  3rd Qu.: 0.72491   3rd Qu.: 0.60780   3rd Qu.: 0.56930  
##  Max.   : 3.30060   Max.   : 3.69951   Max.   : 3.52680  
##  NA's   :246        NA's   :246        NA's   :246       
##     rt.c.inc           RT.conb           RT.congb          RT.congt       
##  Min.   :-3.41278   Min.   :-3.1444   Min.   :-3.7782   Min.   :-3.24379  
##  1st Qu.:-0.67249   1st Qu.:-0.7331   1st Qu.:-0.7437   1st Qu.:-0.64294  
##  Median : 0.04412   Median :-0.0374   Median :-0.0559   Median : 0.02415  
##  Mean   :-0.00284   Mean   :-0.0629   Mean   :-0.0678   Mean   : 0.03364  
##  3rd Qu.: 0.66931   3rd Qu.: 0.5235   3rd Qu.: 0.5887   3rd Qu.: 0.75205  
##  Max.   : 2.85384   Max.   : 2.8513   Max.   : 2.8365   Max.   : 3.49464  
##  NA's   :246        NA's   :738       NA's   :492       NA's   :246       
##    rt.i.cong          rt.i.inc           RT.incb        
##  Min.   :-3.1709   Min.   :-3.28774   Min.   :-3.14474  
##  1st Qu.:-0.6826   1st Qu.:-0.67278   1st Qu.:-0.73181  
##  Median : 0.0261   Median :-0.00448   Median :-0.01798  
##  Mean   : 0.0232   Mean   :-0.01430   Mean   :-0.02712  
##  3rd Qu.: 0.7156   3rd Qu.: 0.62549   3rd Qu.: 0.62710  
##  Max.   : 3.7440   Max.   : 3.14034   Max.   : 3.30890  
##  NA's   :246       NA's   :246        NA's   :246       
##     RT.inct            rt.prac        
##  Min.   :-3.82650   Min.   :-3.29600  
##  1st Qu.:-0.69212   1st Qu.:-0.72985  
##  Median :-0.01255   Median :-0.04557  
##  Mean   :-0.01289   Mean   :-0.04722  
##  3rd Qu.: 0.63413   3rd Qu.: 0.64025  
##  Max.   : 3.81651   Max.   : 2.45001  
##  NA's   :246        NA's   :246

Wide to long!

Melissa has a lovely, complex longitudinal dataset, and she needs it reformatted from wide to long. She made a dummy version of the dataset for us to play with, with all of the correct column names but randomly generated data. Side note: If you ever need to, you can do this with your own data, too, with a couple quick commands:

dummy.data = matrix(data=rnorm(n=dim(my.data)[1] * dim(my.data)[2]), ncol=dim(my.data)[2])
dummy.data = as.data.frame(dummy.data)
colnames(dummy.data) = colnames(my.data)

Here is her fake dataset. The data are currently in wide format, with everything for each participant in one row. Melissa would like it with four rows for each participant, for each of the four time points, and then all of the measures as columns. Be careful, though – several of the measures were not assessed at every time point!  So, for example, in wide format there might be four columns for Measure A (MeasureA_T1,  MeasureA_T2, MeasureA_T3, MeasureA_T4) but only two columns for Measure B (MeasureB_T1, MeasureB_T4). Like this:

Before…
Subj        A_T1        A_T2        A_T3        A_T4        B_T1        B_T4
   1  -0.9718834   0.6276666  -0.5228272  -0.6400311  -0.5308298   1.6060458
   2  -1.9756228   0.5156263  -0.4056288   0.4194956   0.3696320  -1.8104448
   3  -0.1338431  -1.5766120  -0.1066071  -0.3906287   0.6693724  -0.2181601
   4  -0.8970927   0.1203673  -0.3418302   0.0911664   1.6115600  -0.8444743
   5   1.6684360   1.1707871  -0.4292292  -0.5215027   0.5450576  -0.4567965

 

In the end, there should be just one column for each measure (e.g. MeasureA, MeasureB), and four rows for each participant, with NAs entered where the measure was not assessed. Like this:

After!
Subj  time           A           B
   1     1   1.0107164   1.0408222
   1     2  -0.5727259          NA
   1     3  -0.9831144          NA
   1     4  -1.1895119  -1.9333893
   2     1  -0.8442759   0.9567628
   2     2  -0.5484172          NA
   2     3   0.0375418          NA
   2     4  -1.4851271  -1.1012295
   3     1  -0.0313766  -1.5844977
   3     2  -0.8810030          NA
   3     3  -0.4638228          NA
   3     4   0.3639906  -1.8326602
   4     1   0.6995141   0.4447129
   4     2   0.1064798          NA
   4     3   0.1506644          NA
   4     4  -1.1496762  -0.5327433
   5     1  -0.8448360   0.3588507
   5     2   2.1525641          NA
   5     3  -1.9898227          NA
   5     4  -0.7260369   1.7552476

 

Except there are like a hundred variables (mwah-ha-ha-ha!). Have fun!

Spring has sprung, and R Club is back!

Remember, R Club is meeting every week this term instead of biweekly. Our first meeting is 3:00-4:20 tomorrow, Tues 3/31, in Straub 008.

We’re hitting the ground running this term with our very first R Club consultation in week 1: Jenny Mendoza is about to launch a new study, and she wants help doing a power analysis for it to determine an appropriate sample size. Flash the bat signal!

[Image: bat signal]

Here are the details:

The study is called DwellTone. She’s already got a draft of a script for the power analysis, using the pwr package (manual, and a QuickR post), but she’s not sure if it’s working the way she intends it to, or if this is the best approach for her design. Jenny will be there to tell us more about the design and her intended analyses, and we’ll all work on writing code to do her power analyses during R Club. Given the nature of R (i.e. sprawly, organic, redundant, lovely), we’ll probably come up with a couple different solutions, and hopefully we’ll all learn and have fun in the process! And Jenny leaves with some beautiful new code. Smiles all ’round.
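
For reference (a generic illustration of the pwr package, not Jenny's actual script or design), a call that returns the per-group sample size needed for a two-sample t-test looks like this:

library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample")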

See you there!