Tagged: data manipulation

reformatting data to wide

Here’s some thing we were working on in R Club today.

This example shows how to reformat data from (sort of) long to wide. We’ve got pre and post measurements for each subject, some subject-level variables (called dyad vars in the example), and two variables that vary by pre and post. These are the outcome of interest (DV) and a covariate (the length of the video). The data is “sort of” long in that pre and post are each on their own line, but the type of measure (DV or video length) is still each in its own column. To make it fully wide, where there’s only one line for each subject, first we need to make it fully long, and then spread it all out. I’m using dplyr and tidyr here since they’re such lovely tools, but of course there’s more than one way to go about this.

Enjoy!









Here’s the whole thing:

data <- data.frame(subj=sort(c(1:10, 1:10)), cond=rep(c("pre", "post"), 10), DV=rnorm(20), video=rnorm(20), dyadvar1=sort(c(1:10, 1:10)), dyadvar2=sort(c(1:10, 1:10)))

data
##    subj cond          DV      video dyadvar1 dyadvar2
## 1     1  pre -0.09424282 -0.5553683        1        1
## 2     1 post  1.50787647  0.1086909        1        1
## 3     2  pre  1.63471884 -0.5022278        2        2
## 4     2 post  0.06910946 -0.1067023        2        2
## 5     3  pre  0.60311557 -0.3074021        3        3
## 6     3 post  1.34918285  1.0210010        3        3
## 7     4  pre -0.52288040  0.3636248        4        4
## 8     4 post -1.14408452  0.9635112        4        4
## 9     5  pre -1.67596739 -0.5665085        5        5
## 10    5 post  0.45287833  0.1465894        5        5
## 11    6  pre  0.96739333 -0.2250043        6        6
## 12    6 post  0.20794296  1.0133354        6        6
## 13    7  pre  1.25381345 -1.0679324        7        7
## 14    7 post  0.73957481 -1.3426321        7        7
## 15    8  pre -1.26477564 -0.4130490        8        8
## 16    8 post -1.80489529 -0.5602744        8        8
## 17    9  pre -0.85829518  0.8735288        9        9
## 18    9 post  0.52721178  0.1642551        9        9
## 19   10  pre  0.08717665 -0.3085623       10       10
## 20   10 post -1.28951524 -1.7575259       10       10
library(tidyr); library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
wide.data <- data %>%
  gather(key="measure", value="value", DV:video) %>%
  unite(col=key, cond, measure)  %>%
  spread(key=key, value=value)

Here’s each step:

library(knitr)
data %>%
  gather(key="measure", value="value", DV:video)
##    subj cond dyadvar1 dyadvar2 measure       value
## 1     1  pre        1        1      DV -0.09424282
## 2     1 post        1        1      DV  1.50787647
## 3     2  pre        2        2      DV  1.63471884
## 4     2 post        2        2      DV  0.06910946
## 5     3  pre        3        3      DV  0.60311557
## 6     3 post        3        3      DV  1.34918285
## 7     4  pre        4        4      DV -0.52288040
## 8     4 post        4        4      DV -1.14408452
## 9     5  pre        5        5      DV -1.67596739
## 10    5 post        5        5      DV  0.45287833
## 11    6  pre        6        6      DV  0.96739333
## 12    6 post        6        6      DV  0.20794296
## 13    7  pre        7        7      DV  1.25381345
## 14    7 post        7        7      DV  0.73957481
## 15    8  pre        8        8      DV -1.26477564
## 16    8 post        8        8      DV -1.80489529
## 17    9  pre        9        9      DV -0.85829518
## 18    9 post        9        9      DV  0.52721178
## 19   10  pre       10       10      DV  0.08717665
## 20   10 post       10       10      DV -1.28951524
## 21    1  pre        1        1   video -0.55536830
## 22    1 post        1        1   video  0.10869087
## 23    2  pre        2        2   video -0.50222782
## 24    2 post        2        2   video -0.10670226
## 25    3  pre        3        3   video -0.30740210
## 26    3 post        3        3   video  1.02100096
## 27    4  pre        4        4   video  0.36362482
## 28    4 post        4        4   video  0.96351116
## 29    5  pre        5        5   video -0.56650848
## 30    5 post        5        5   video  0.14658942
## 31    6  pre        6        6   video -0.22500427
## 32    6 post        6        6   video  1.01333537
## 33    7  pre        7        7   video -1.06793244
## 34    7 post        7        7   video -1.34263213
## 35    8  pre        8        8   video -0.41304900
## 36    8 post        8        8   video -0.56027441
## 37    9  pre        9        9   video  0.87352882
## 38    9 post        9        9   video  0.16425512
## 39   10  pre       10       10   video -0.30856231
## 40   10 post       10       10   video -1.75752594
data %>%
  gather(key="measure", value="value", DV:video) %>%
  unite(col=key, cond, measure)
##    subj        key dyadvar1 dyadvar2       value
## 1     1     pre_DV        1        1 -0.09424282
## 2     1    post_DV        1        1  1.50787647
## 3     2     pre_DV        2        2  1.63471884
## 4     2    post_DV        2        2  0.06910946
## 5     3     pre_DV        3        3  0.60311557
## 6     3    post_DV        3        3  1.34918285
## 7     4     pre_DV        4        4 -0.52288040
## 8     4    post_DV        4        4 -1.14408452
## 9     5     pre_DV        5        5 -1.67596739
## 10    5    post_DV        5        5  0.45287833
## 11    6     pre_DV        6        6  0.96739333
## 12    6    post_DV        6        6  0.20794296
## 13    7     pre_DV        7        7  1.25381345
## 14    7    post_DV        7        7  0.73957481
## 15    8     pre_DV        8        8 -1.26477564
## 16    8    post_DV        8        8 -1.80489529
## 17    9     pre_DV        9        9 -0.85829518
## 18    9    post_DV        9        9  0.52721178
## 19   10     pre_DV       10       10  0.08717665
## 20   10    post_DV       10       10 -1.28951524
## 21    1  pre_video        1        1 -0.55536830
## 22    1 post_video        1        1  0.10869087
## 23    2  pre_video        2        2 -0.50222782
## 24    2 post_video        2        2 -0.10670226
## 25    3  pre_video        3        3 -0.30740210
## 26    3 post_video        3        3  1.02100096
## 27    4  pre_video        4        4  0.36362482
## 28    4 post_video        4        4  0.96351116
## 29    5  pre_video        5        5 -0.56650848
## 30    5 post_video        5        5  0.14658942
## 31    6  pre_video        6        6 -0.22500427
## 32    6 post_video        6        6  1.01333537
## 33    7  pre_video        7        7 -1.06793244
## 34    7 post_video        7        7 -1.34263213
## 35    8  pre_video        8        8 -0.41304900
## 36    8 post_video        8        8 -0.56027441
## 37    9  pre_video        9        9  0.87352882
## 38    9 post_video        9        9  0.16425512
## 39   10  pre_video       10       10 -0.30856231
## 40   10 post_video       10       10 -1.75752594
data %>%
  gather(key="measure", value="value", DV:video) %>%
  unite(col=key, cond, measure)  %>%
  spread(key=key, value=value)
##    subj dyadvar1 dyadvar2     post_DV post_video      pre_DV  pre_video
## 1     1        1        1  1.50787647  0.1086909 -0.09424282 -0.5553683
## 2     2        2        2  0.06910946 -0.1067023  1.63471884 -0.5022278
## 3     3        3        3  1.34918285  1.0210010  0.60311557 -0.3074021
## 4     4        4        4 -1.14408452  0.9635112 -0.52288040  0.3636248
## 5     5        5        5  0.45287833  0.1465894 -1.67596739 -0.5665085
## 6     6        6        6  0.20794296  1.0133354  0.96739333 -0.2250043
## 7     7        7        7  0.73957481 -1.3426321  1.25381345 -1.0679324
## 8     8        8        8 -1.80489529 -0.5602744 -1.26477564 -0.4130490
## 9     9        9        9  0.52721178  0.1642551 -0.85829518  0.8735288
## 10   10       10       10 -1.28951524 -1.7575259  0.08717665 -0.3085623



611 Highlights in R (Part II)



611 Highlights in R (Part II)


The raw .R file is here and the Rmd file is here.

Brief ‘Hello’ to R Markdown

Let me first introduce you to R markdown and kitr. This format allows you to integrate your analyses with text that can be automatically translated to html, word, and pdf formats. I just produced a supplementary materials document using this and it probably saved me a ridiculous number of hours. It’s easy to get started:

  1. Save your file with extension .Rmd.
  2. Place something like the following in the very top of your file:

    ---
    author: John Flournoy
    title: 611 Highlights in R (Part II)
    output: html_document
    ---
  3. Write as you normally would, using Markdown syntax for formatting. For example:

    # This is a big heading
    
    1. this is
    2. a numbered list
    
    - this is a bulleted list
    - With **bold** and *italic*  text
  4. When you want to throw in some R code, place it inside an R chunk by wrapping it in ```{r} at the beginning and ``` at the end. When you do, it comes out something like this:

    # From Rose's presentation:
    # a toy data set for us to work with
    n <- 100
    gender <- as.factor(c(rep("M", n/2), rep("F", n/2)))
    IQ <- rnorm(n, mean=100, sd=15)
    degree <- rep(c("HS", "BA", "MS", "PhD"), n/4)
    height <- as.numeric(gender)-2 + rnorm(n, mean=5.5, sd=1)
    RT1 <- rchisq(n, 4)
    RT2 <- rchisq(n, 4)
    DV <- 50 - 5*(as.numeric(gender)-1) + .1*IQ + rnorm(n)
    
    df <- data.frame(awesome=DV, gender=gender, IQ=IQ, degree=degree, height=height, RT1=RT1, RT2=RT2)
    head(df)
    ##    awesome gender        IQ degree   height      RT1      RT2
    ## 1 52.89486      M  79.40263     HS 6.292706 1.761990 2.228117
    ## 2 57.04782      M  96.31289     BA 4.759042 2.236027 7.363253
    ## 3 55.59594      M 104.95331     MS 5.560292 5.159114 7.126281
    ## 4 55.44397      M  84.23347    PhD 5.354595 7.652900 2.836956
    ## 5 55.45191      M 107.38754     HS 4.523997 7.768904 4.504664
    ## 6 54.70468      M 110.07096     BA 6.139368 7.786462 7.508488

Notice that in 4, the we see the output of the command head(df). This is useful for outputting tables and plots, as we’ll see.

Here are more resources on

Streamlined Data Workflow with dplyr

Download the data here:

Make sure you have the package (library(dplyr)), and if you don’t, install it (install.packages('dplyr')). Here’s an intro to dplyr straight from the horse’s (Hadley’s) mouth.

By way of data manipulation that I’m actually using for a real project, this tutorial will introduce you to

  • %>%
    • a function chaining operator
  • mutate and transmute
    • adds new variables and preserves existing (transmute drops existing variables)
  • mutate_each
    • adds new variables, applying functions in funs(...) to each column
  • group_by
    • groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
  • summarise
    • summarizes multiple values to a single value
  • left_join
    • joins two data frames, returning all rows from x, and all columns from x and y
  • select
    • keeps the variables/columns that you mention (see ?select for special functions that help you select variables)
  • filter
    • returns rows with matching conditions (similar to selecting columns like: aDataFrame[variable == value,])
  • group_by
    • groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
  • summarise
    • summarizes multiple values to a single value
  • do
    • use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list (e.g., run the same lm or t.test separately for each gender, country)

Continue reading

dplyr

So useful!

http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Hadley explains (at the above link):

The dplyr package makes each of these [data processing] steps as fast and easy as possible by:

  • Elucidating the most common data manipulation operations, so that your options are helpfully constrained when thinking about how to tackle a problem.
  • Providing simple functions that correspond to the most common data manipulation verbs, so that you can easily translate your thoughts into code.
  • Using efficient data storage backends, so that you spend as little time waiting for the computer as possible.

The dplyr debut blog post may also be of interest.