wide to long: what we learned
As the French say, there are many methods to pluck a duck, and this duck is particularly feathery, so this is by no means the only reasonable approach. That said, the solution we ended up going with uses tidyr, and I think it’s a pretty nice one (note that you do need a recent version of R to use it, though).
Note that we used the dummy dataset Melissa provided (see the original post), so the column names are all correct, but the data are just random numbers (that’s why, for example, there are negative subject IDs).
A couple nifty little tidbits:
- The sep argument in the reshape() function from the stats package does not work the way you might expect it to! Here is an rpubs post demonstrating the issue.
- Also, reshape() chokes on the missing columns in wide format, as I mentioned in the original post, but tidyr handles them just fine, actually (it just assumes you want NAs filled in there). Nice little package. 🙂
Here’s the code!
setwd("~/Downloads")
data <- read.csv("ML Data for R.csv", row.names=1)
# standardize naming convention for variables
colnames(data) <- sub(pattern=".T", replacement="_T", x=colnames(data), fixed=TRUE)
# note that reshape() does not work as expected!! The sep argument in reshape doesn't do what it appears to do.
library(tidyr); library(dplyr)
# I mostly stole this code from John Flournoy :)
data.long <- data %>%
# We gather all the columns except for ADS.id into a single column
# of variable names and another column of values.
gather(variable,value,-ADS.id) %>%
# We now can separate the variable column by splitting each variable name
# at the '_T' -- now we have two columns, one with the variable name and
# one with the timepoint
separate(variable,c('variable','timepoint'),sep='_T') %>%
# This step spreads it out so that each variable gets a column with the
# associated values
spread(variable,value)
# snapshots of the process...
head(gather(data, variable,value,-ADS.id))
## ADS.id variable value
## 1 1.0018516 age.mri_T1 -0.3092724
## 2 1.1181216 age.mri_T1 -0.6904551
## 3 1.1211195 age.mri_T1 0.7601895
## 4 1.0241864 age.mri_T1 -1.0652325
## 5 0.2439787 age.mri_T1 2.1723328
## 6 1.5826904 age.mri_T1 1.0067890
head(separate(gather(data, variable,value,-ADS.id), variable,c('variable','timepoint'),sep='_T'))
## ADS.id variable timepoint value
## 1 1.0018516 age.mri 1 -0.3092724
## 2 1.1181216 age.mri 1 -0.6904551
## 3 1.1211195 age.mri 1 0.7601895
## 4 1.0241864 age.mri 1 -1.0652325
## 5 0.2439787 age.mri 1 2.1723328
## 6 1.5826904 age.mri 1 1.0067890
head(data.long) # the result!
## ADS.id timepoint age age.mri ccesdtot ccesdtot2
## 1 -3.188239 1 NA 0.2349363 -0.3167452 -0.502415557
## 2 -3.188239 2 NA NA -1.5850726 -0.004856296
## 3 -3.188239 3 -0.6822444 NA -0.8233850 0.630903711
## 4 -3.188239 4 NA NA 0.8308063 0.047788788
## 5 -2.384302 1 NA -0.4050702 -0.1712919 1.599556240
## 6 -2.384302 2 NA NA 0.5175698 -0.888349761
## D.15.age.months errors.c.cong errors.c.cong.PERC errors.c.inc
## 1 NA 1.1203254 -0.2384952 -1.9943953
## 2 NA NA NA NA
## 3 -1.294505 0.8450592 -0.2829815 1.1129147
## 4 NA 0.5954789 -0.8289942 0.2315604
## 5 NA -0.3532590 0.9763195 0.7725308
## 6 NA NA NA NA
## errors.c.inc.PERC Errors.congb Errors.congt errors.i.cong
## 1 0.2549887 1.5140582 0.9740986 0.1182785
## 2 NA NA NA NA
## 3 0.3035785 0.6272806 -0.4459871 -0.4603392
## 4 0.4983972 -0.2078286 0.9844005 1.4174055
## 5 0.1856251 -1.7249345 0.0948151 -0.3314606
## 6 NA NA NA NA
## errors.i.cong.PERC errors.i.inc errors.i.inc.PERC Errors.incb
## 1 0.1808000 1.6272805 -1.5131672 -0.8726752
## 2 NA NA NA NA
## 3 1.6620035 -0.5556201 -0.3994536 0.5846239
## 4 0.2401470 -0.1398306 1.4187102 -0.9474623
## 5 -0.1865299 -0.7414438 0.2517365 0.4851024
## 6 NA NA NA NA
## Errors.inct errors.prac errors.slow.c.cong errors.slow.c.inc
## 1 -1.062035 1.2304198 0.2057605 0.453230
## 2 NA NA NA NA
## 3 -2.079247 0.2014654 -1.1325324 1.384324
## 4 -0.114631 -0.3909164 0.8461512 2.109982
## 5 -1.908316 -1.2991850 0.2152569 1.251143
## 6 NA NA NA NA
## errors.slow.i.cong errors.slow.i.inc errors.slow.prac Errors.total
## 1 -0.5650468 -0.260194217 -1.05898000 1.1322255
## 2 NA NA NA NA
## 3 -1.5816222 0.005833924 1.05445791 1.3550896
## 4 -1.0825809 0.107324351 0.08879325 0.2439465
## 5 1.1278321 -0.937676510 -0.43545535 -0.3518373
## 6 NA NA NA NA
## errors.total.c.cong errors.total.c.inc errors.total.i.cong
## 1 0.9362488 -0.2737423 0.1255926
## 2 NA NA NA
## 3 -1.6188499 -0.6009969 0.8908975
## 4 -0.1001239 -0.4400096 -1.5393167
## 5 -0.7427545 1.2649423 -1.2623598
## 6 NA NA NA
## errors.total.i.inc errors.total.PERC errors.total.prac Int.RT.All
## 1 -0.2323253 0.5850435 1.3739468 -1.6736488
## 2 NA NA NA NA
## 3 0.2039495 1.5558254 -1.3495371 -1.2342580
## 4 1.8545902 -0.8207177 -0.1254748 -0.5403726
## 5 0.9093587 -1.8192946 0.9457083 -1.1837205
## 6 NA NA NA NA
## Int.RT.ProB Int.RT.ReaB RT.avg rt.c.cong rt.c.inc RT.conb
## 1 -0.4787541 0.4969283 -2.3207763 0.1943951 -1.4280047 0.1893392
## 2 NA NA NA NA NA NA
## 3 -1.6069571 0.2826211 -0.8354791 -0.4977420 -1.3454648 NA
## 4 0.9181784 0.5912347 -0.7084950 0.4040288 1.9530110 NA
## 5 -2.0270125 0.3067939 -0.1157335 -0.9217092 0.7747549 -1.1509858
## 6 NA NA NA NA NA NA
## RT.congb RT.congt rt.i.cong rt.i.inc RT.incb RT.inct
## 1 NA 0.5066862 -0.8200358 0.1091896 -0.1578207 -0.2115042
## 2 NA NA NA NA NA NA
## 3 -1.295408 0.2670604 -0.6976511 -0.8435179 -0.9275700 0.8909898
## 4 -1.026102 -0.7030719 1.0139764 1.1513498 -0.9378490 -0.5800233
## 5 NA -0.1882044 0.2780732 0.3299927 1.1015153 -0.5091075
## 6 NA NA NA NA NA NA
## rt.prac
## 1 -1.177762051
## 2 NA
## 3 -0.698241844
## 4 -0.354985516
## 5 0.009850542
## 6 NA
# tidying, checking...
data.long$timepoint <- as.factor(data.long$timepoint)
data.long$ADS.id <- as.factor(data.long$ADS.id)
summary(data.long)
## ADS.id timepoint age age.mri
## -3.18823887209027: 4 1:246 Min. :-2.8179 Min. :-2.2391
## -2.38430177492141: 4 2:246 1st Qu.:-0.8310 1st Qu.:-0.6778
## -2.19765389063435: 4 3:246 Median :-0.0078 Median : 0.0045
## -1.93658469426135: 4 4:246 Mean :-0.0353 Mean : 0.0473
## -1.88871398596907: 4 3rd Qu.: 0.6482 3rd Qu.: 0.7528
## -1.87775103867895: 4 Max. : 2.8244 Max. : 3.9297
## (Other) :960 NA's :738 NA's :738
## ccesdtot ccesdtot2 D.15.age.months
## Min. :-2.73145 Min. :-3.165353 Min. :-2.7352
## 1st Qu.:-0.73198 1st Qu.:-0.627016 1st Qu.:-0.6356
## Median :-0.06160 Median :-0.007192 Median :-0.0218
## Mean :-0.04439 Mean : 0.023682 Mean : 0.0226
## 3rd Qu.: 0.60229 3rd Qu.: 0.692493 3rd Qu.: 0.7302
## Max. : 2.87977 Max. : 2.831624 Max. : 2.5312
## NA's :738
## errors.c.cong errors.c.cong.PERC errors.c.inc
## Min. :-3.02694 Min. :-3.02622 Min. :-3.07194
## 1st Qu.:-0.65846 1st Qu.:-0.69937 1st Qu.:-0.65187
## Median : 0.03656 Median :-0.09637 Median : 0.01856
## Mean :-0.00862 Mean :-0.01181 Mean : 0.00975
## 3rd Qu.: 0.64811 3rd Qu.: 0.63642 3rd Qu.: 0.64352
## Max. : 3.48170 Max. : 3.13567 Max. : 3.27508
## NA's :246 NA's :246 NA's :246
## errors.c.inc.PERC Errors.congb Errors.congt
## Min. :-3.83647 Min. :-3.61005 Min. :-4.00876
## 1st Qu.:-0.65952 1st Qu.:-0.67655 1st Qu.:-0.70759
## Median :-0.00111 Median :-0.05863 Median :-0.01951
## Mean :-0.00783 Mean :-0.02327 Mean :-0.00887
## 3rd Qu.: 0.59038 3rd Qu.: 0.63483 3rd Qu.: 0.66770
## Max. : 2.95515 Max. : 3.22904 Max. : 3.20207
## NA's :246 NA's :246 NA's :246
## errors.i.cong errors.i.cong.PERC errors.i.inc
## Min. :-3.21592 Min. :-3.28548 Min. :-3.16182
## 1st Qu.:-0.71059 1st Qu.:-0.65964 1st Qu.:-0.71632
## Median :-0.06600 Median :-0.06822 Median : 0.00092
## Mean :-0.04063 Mean : 0.00061 Mean :-0.03424
## 3rd Qu.: 0.59837 3rd Qu.: 0.64682 3rd Qu.: 0.64873
## Max. : 4.63775 Max. : 3.26292 Max. : 2.61837
## NA's :246 NA's :246 NA's :246
## errors.i.inc.PERC Errors.incb Errors.inct
## Min. :-3.18711 Min. :-2.85501 Min. :-2.85304
## 1st Qu.:-0.71387 1st Qu.:-0.60294 1st Qu.:-0.66899
## Median :-0.00953 Median : 0.00232 Median :-0.02129
## Mean : 0.00542 Mean : 0.02473 Mean : 0.00316
## 3rd Qu.: 0.71589 3rd Qu.: 0.66226 3rd Qu.: 0.71004
## Max. : 3.72285 Max. : 3.05961 Max. : 3.36236
## NA's :246 NA's :246 NA's :246
## errors.prac errors.slow.c.cong errors.slow.c.inc
## Min. :-2.65139 Min. :-2.87464 Min. :-2.71655
## 1st Qu.:-0.67979 1st Qu.:-0.68237 1st Qu.:-0.68067
## Median : 0.00760 Median : 0.00764 Median : 0.02927
## Mean :-0.02728 Mean :-0.00233 Mean : 0.03072
## 3rd Qu.: 0.66139 3rd Qu.: 0.67304 3rd Qu.: 0.73078
## Max. : 2.64629 Max. : 3.34258 Max. : 2.96471
## NA's :246 NA's :246 NA's :246
## errors.slow.i.cong errors.slow.i.inc errors.slow.prac
## Min. :-2.97736 Min. :-3.48673 Min. :-3.38586
## 1st Qu.:-0.67957 1st Qu.:-0.64765 1st Qu.:-0.77620
## Median : 0.09023 Median :-0.00268 Median :-0.07298
## Mean : 0.05189 Mean :-0.02303 Mean :-0.04885
## 3rd Qu.: 0.70666 3rd Qu.: 0.62909 3rd Qu.: 0.66078
## Max. : 2.64540 Max. : 2.88500 Max. : 2.63078
## NA's :246 NA's :246 NA's :246
## Errors.total errors.total.c.cong errors.total.c.inc
## Min. :-3.02983 Min. :-3.94584 Min. :-3.04545
## 1st Qu.:-0.66766 1st Qu.:-0.71108 1st Qu.:-0.68723
## Median : 0.03033 Median :-0.01434 Median :-0.04592
## Mean : 0.00020 Mean :-0.03496 Mean :-0.00674
## 3rd Qu.: 0.67346 3rd Qu.: 0.68086 3rd Qu.: 0.68570
## Max. : 2.83657 Max. : 2.96111 Max. : 3.49591
## NA's :246 NA's :246 NA's :246
## errors.total.i.cong errors.total.i.inc errors.total.PERC
## Min. :-3.35570 Min. :-3.06428 Min. :-3.52159
## 1st Qu.:-0.69939 1st Qu.:-0.75856 1st Qu.:-0.68296
## Median :-0.01178 Median :-0.04273 Median :-0.06017
## Mean :-0.02606 Mean :-0.06233 Mean :-0.06873
## 3rd Qu.: 0.68439 3rd Qu.: 0.60996 3rd Qu.: 0.54828
## Max. : 3.05471 Max. : 2.70558 Max. : 3.38433
## NA's :246 NA's :246 NA's :246
## errors.total.prac Int.RT.All Int.RT.ProB
## Min. :-2.78635 Min. :-2.70803 Min. :-2.93328
## 1st Qu.:-0.75211 1st Qu.:-0.64684 1st Qu.:-0.64654
## Median :-0.11411 Median :-0.01643 Median : 0.06557
## Mean :-0.05913 Mean :-0.00633 Mean : 0.02768
## 3rd Qu.: 0.57326 3rd Qu.: 0.63270 3rd Qu.: 0.68377
## Max. : 3.28847 Max. : 3.13618 Max. : 2.98533
## NA's :246 NA's :246 NA's :246
## Int.RT.ReaB RT.avg rt.c.cong
## Min. :-3.11339 Min. :-2.61333 Min. :-2.90894
## 1st Qu.:-0.65444 1st Qu.:-0.64842 1st Qu.:-0.69766
## Median : 0.01216 Median :-0.00995 Median :-0.04453
## Mean : 0.03416 Mean :-0.00645 Mean :-0.03129
## 3rd Qu.: 0.72491 3rd Qu.: 0.60780 3rd Qu.: 0.56930
## Max. : 3.30060 Max. : 3.69951 Max. : 3.52680
## NA's :246 NA's :246 NA's :246
## rt.c.inc RT.conb RT.congb RT.congt
## Min. :-3.41278 Min. :-3.1444 Min. :-3.7782 Min. :-3.24379
## 1st Qu.:-0.67249 1st Qu.:-0.7331 1st Qu.:-0.7437 1st Qu.:-0.64294
## Median : 0.04412 Median :-0.0374 Median :-0.0559 Median : 0.02415
## Mean :-0.00284 Mean :-0.0629 Mean :-0.0678 Mean : 0.03364
## 3rd Qu.: 0.66931 3rd Qu.: 0.5235 3rd Qu.: 0.5887 3rd Qu.: 0.75205
## Max. : 2.85384 Max. : 2.8513 Max. : 2.8365 Max. : 3.49464
## NA's :246 NA's :738 NA's :492 NA's :246
## rt.i.cong rt.i.inc RT.incb
## Min. :-3.1709 Min. :-3.28774 Min. :-3.14474
## 1st Qu.:-0.6826 1st Qu.:-0.67278 1st Qu.:-0.73181
## Median : 0.0261 Median :-0.00448 Median :-0.01798
## Mean : 0.0232 Mean :-0.01430 Mean :-0.02712
## 3rd Qu.: 0.7156 3rd Qu.: 0.62549 3rd Qu.: 0.62710
## Max. : 3.7440 Max. : 3.14034 Max. : 3.30890
## NA's :246 NA's :246 NA's :246
## RT.inct rt.prac
## Min. :-3.82650 Min. :-3.29600
## 1st Qu.:-0.69212 1st Qu.:-0.72985
## Median :-0.01255 Median :-0.04557
## Mean :-0.01289 Mean :-0.04722
## 3rd Qu.: 0.63413 3rd Qu.: 0.64025
## Max. : 3.81651 Max. : 2.45001
## NA's :246 NA's :246