Dealing with messy data in R (Dplyr, Tidyr, Jsonlite, and Facebook API)




For Tuesday’s R Club meeting, I’d like to go over two related things if there’s time to get through both:

  1. Messy data: Rearranging messy datasets with Dplyr and Tidyr
  2. Really, really, really messy data: Using the Facebook API

My Facebook code is not completely done, but it is functional and I think it would be good for anyone who is not familiar with API’s to see how you get and work with a real dataset. It comes in Json format (like XML; tableless), so it’s pretty exciting. I will post the Facebook code once I’ve finished tuning it up in the next week or two.

An overview of data wrangling in R

RMD file. Based on a recent presentation by Hadley Wickham. You’ll definitely want to get the Data Wrangling Cheat Sheet. Here’s the video:

library(RCurl, warn = FALSE)
## Loading required package: bitops
# get tb data from Hadley Wickham's github
myData <- getURL("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/tb.csv", ssl.verifypeer = FALSE) 
tbdata <- read.csv(textConnection(myData)) 
head(tbdata, n=7)
##   iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04 f514
## 1   AD 1989  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 2   AD 1990  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 3   AD 1991  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 4   AD 1992  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 5   AD 1993  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 6   AD 1994  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 7   AD 1996  NA   NA    0     0     0     4     1     0   0 NA  NA   NA
##   f014 f1524 f2534 f3544 f4554 f5564 f65 fu
## 1   NA    NA    NA    NA    NA    NA  NA NA
## 2   NA    NA    NA    NA    NA    NA  NA NA
## 3   NA    NA    NA    NA    NA    NA  NA NA
## 4   NA    NA    NA    NA    NA    NA  NA NA
## 5   NA    NA    NA    NA    NA    NA  NA NA
## 6   NA    NA    NA    NA    NA    NA  NA NA
## 7    0     1     1     0     0     1   0 NA

tidyr

Messy -> Tidy Data

library(tidyr)
library(dplyr, warn = FALSE)

# gather and separate the data to make it "tidy"
tb2 <- tbdata %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE) %>% 
  separate(demo, c("sex","age"), 1)
head(tb2, n=7)
##   iso2 year sex age n
## 1   AD 2005   m  04 0
## 2   AD 2006   m  04 0
## 3   AD 2008   m  04 0
## 4   AE 2006   m  04 0
## 5   AE 2007   m  04 0
## 6   AE 2008   m  04 0
## 7   AG 2007   m  04 0

dplyr

Manipulate data

# rename variables and sort observations (arrange)
tb3 <- tb2 %>%
  rename(country = iso2) %>%
  arrange(country, year, sex, age)
head(tb3, n=7)
##   country year sex  age n
## 1      AD 1996   f  014 0
## 2      AD 1996   f 1524 1
## 3      AD 1996   f 2534 1
## 4      AD 1996   f 3544 0
## 5      AD 1996   f 4554 0
## 6      AD 1996   f 5564 1
## 7      AD 1996   f   65 0

more examples

demo(package = "tidyr") # produces a list of all of the demos in package 'tidyr'
demo('so-15668870', package = "tidyr", ask = FALSE)
## 
## 
##  demo(so-15668870)
##  ---- ~~~~~~~~~~~
## 
## > # http://stackoverflow.com/questions/15668870/
## > library(tidyr)
## 
## > library(dplyr)
## 
## > grades <- tbl_df(read.table(header = TRUE, text = "
## +    ID   Test Year   Fall Spring Winter
## +     1   1   2008    15      16      19
## +     1   1   2009    12      13      27
## +     1   2   2008    22      22      24
## +     1   2   2009    10      14      20
## +     2   1   2008    12      13      25
## +     2   1   2009    16      14      21
## +     2   2   2008    13      11      29
## +     2   2   2009    23      20      26
## +     3   1   2008    11      12      22
## +     3   1   2009    13      11      27
## +     3   2   2008    17      12      23
## +     3   2   2009    14      9       31
## + "))
## 
## > grades %>%
## +   gather(Semester, Score, Fall:Winter) %>%
## +   mutate(Test = paste0("Test", Test)) %>%
## +   spread(Test, Score) %>%
## +   arrange(ID, Year, Semester)
## Source: local data frame [18 x 5]
## 
##    ID Year Semester Test1 Test2
## 1   1 2008     Fall    15    22
## 2   1 2008   Spring    16    22
## 3   1 2008   Winter    19    24
## 4   1 2009     Fall    12    10
## 5   1 2009   Spring    13    14
## 6   1 2009   Winter    27    20
## 7   2 2008     Fall    12    13
## 8   2 2008   Spring    13    11
## 9   2 2008   Winter    25    29
## 10  2 2009     Fall    16    23
## 11  2 2009   Spring    14    20
## 12  2 2009   Winter    21    26
## 13  3 2008     Fall    11    17
## 14  3 2008   Spring    12    12
## 15  3 2008   Winter    22    23
## 16  3 2009     Fall    13    14
## 17  3 2009   Spring    11     9
## 18  3 2009   Winter    27    31
demo('dadmom', package = "tidyr", ask = FALSE)
## 
## 
##  demo(dadmom)
##  ---- ~~~~~~
## 
## > library(tidyr)
## 
## > library(dplyr)
## 
## > dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
## 
## > dadmom %>%
## +   gather(key, value, named:incm) %>%
## +   separate(key, c("variable", "type"), -2) %>%
## +   spread(variable, value, convert = TRUE)
##   famid type   inc name
## 1     1    d 30000 Bill
## 2     1    m 15000 Bess
## 3     2    d 22000  Art
## 4     2    m 18000  Amy
## 5     3    d 25000 Paul
## 6     3    m 50000  Pat

more resources

vignette("tidy-data")



Comments are closed.