Dealing with messy data in R (Dplyr, Tidyr, Jsonlite, and Facebook API)
For Tuesday’s R Club meeting, I’d like to go over two related things if there’s time to get through both:
- Messy data: Rearranging messy datasets with Dplyr and Tidyr
- Really, really, really messy data: Using the Facebook API
My Facebook code is not completely done, but it is functional and I think it would be good for anyone who is not familiar with API’s to see how you get and work with a real dataset. It comes in Json format (like XML; tableless), so it’s pretty exciting. I will post the Facebook code once I’ve finished tuning it up in the next week or two.
An overview of data wrangling in R
RMD file. Based on a recent presentation by Hadley Wickham. You’ll definitely want to get the Data Wrangling Cheat Sheet. Here’s the video:
library(RCurl, warn = FALSE)
## Loading required package: bitops
# get tb data from Hadley Wickham's github
myData <- getURL("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/tb.csv", ssl.verifypeer = FALSE)
tbdata <- read.csv(textConnection(myData))
head(tbdata, n=7)
## iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04 f514
## 1 AD 1989 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 AD 1990 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 AD 1991 NA NA NA NA NA NA NA NA NA NA NA NA
## 4 AD 1992 NA NA NA NA NA NA NA NA NA NA NA NA
## 5 AD 1993 NA NA NA NA NA NA NA NA NA NA NA NA
## 6 AD 1994 NA NA NA NA NA NA NA NA NA NA NA NA
## 7 AD 1996 NA NA 0 0 0 4 1 0 0 NA NA NA
## f014 f1524 f2534 f3544 f4554 f5564 f65 fu
## 1 NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA
## 7 0 1 1 0 0 1 0 NA
tidyr
Messy -> Tidy Data
library(tidyr)
library(dplyr, warn = FALSE)
# gather and separate the data to make it "tidy"
tb2 <- tbdata %>%
gather(demo, n, -iso2, -year, na.rm = TRUE) %>%
separate(demo, c("sex","age"), 1)
head(tb2, n=7)
## iso2 year sex age n
## 1 AD 2005 m 04 0
## 2 AD 2006 m 04 0
## 3 AD 2008 m 04 0
## 4 AE 2006 m 04 0
## 5 AE 2007 m 04 0
## 6 AE 2008 m 04 0
## 7 AG 2007 m 04 0
dplyr
Manipulate data
# rename variables and sort observations (arrange)
tb3 <- tb2 %>%
rename(country = iso2) %>%
arrange(country, year, sex, age)
head(tb3, n=7)
## country year sex age n
## 1 AD 1996 f 014 0
## 2 AD 1996 f 1524 1
## 3 AD 1996 f 2534 1
## 4 AD 1996 f 3544 0
## 5 AD 1996 f 4554 0
## 6 AD 1996 f 5564 1
## 7 AD 1996 f 65 0
more examples
demo(package = "tidyr") # produces a list of all of the demos in package 'tidyr'
demo('so-15668870', package = "tidyr", ask = FALSE)
##
##
## demo(so-15668870)
## ---- ~~~~~~~~~~~
##
## > # http://stackoverflow.com/questions/15668870/
## > library(tidyr)
##
## > library(dplyr)
##
## > grades <- tbl_df(read.table(header = TRUE, text = "
## + ID Test Year Fall Spring Winter
## + 1 1 2008 15 16 19
## + 1 1 2009 12 13 27
## + 1 2 2008 22 22 24
## + 1 2 2009 10 14 20
## + 2 1 2008 12 13 25
## + 2 1 2009 16 14 21
## + 2 2 2008 13 11 29
## + 2 2 2009 23 20 26
## + 3 1 2008 11 12 22
## + 3 1 2009 13 11 27
## + 3 2 2008 17 12 23
## + 3 2 2009 14 9 31
## + "))
##
## > grades %>%
## + gather(Semester, Score, Fall:Winter) %>%
## + mutate(Test = paste0("Test", Test)) %>%
## + spread(Test, Score) %>%
## + arrange(ID, Year, Semester)
## Source: local data frame [18 x 5]
##
## ID Year Semester Test1 Test2
## 1 1 2008 Fall 15 22
## 2 1 2008 Spring 16 22
## 3 1 2008 Winter 19 24
## 4 1 2009 Fall 12 10
## 5 1 2009 Spring 13 14
## 6 1 2009 Winter 27 20
## 7 2 2008 Fall 12 13
## 8 2 2008 Spring 13 11
## 9 2 2008 Winter 25 29
## 10 2 2009 Fall 16 23
## 11 2 2009 Spring 14 20
## 12 2 2009 Winter 21 26
## 13 3 2008 Fall 11 17
## 14 3 2008 Spring 12 12
## 15 3 2008 Winter 22 23
## 16 3 2009 Fall 13 14
## 17 3 2009 Spring 11 9
## 18 3 2009 Winter 27 31
demo('dadmom', package = "tidyr", ask = FALSE)
##
##
## demo(dadmom)
## ---- ~~~~~~
##
## > library(tidyr)
##
## > library(dplyr)
##
## > dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
##
## > dadmom %>%
## + gather(key, value, named:incm) %>%
## + separate(key, c("variable", "type"), -2) %>%
## + spread(variable, value, convert = TRUE)
## famid type inc name
## 1 1 d 30000 Bill
## 2 1 m 15000 Bess
## 3 2 d 22000 Art
## 4 2 m 18000 Amy
## 5 3 d 25000 Paul
## 6 3 m 50000 Pat
more resources
vignette("tidy-data")
// add bootstrap table styles to pandoc tables $(document).ready(function () { $('tr.header').parent('thead').parent('table').addClass('table table-condensed'); });