Hadley Wickham, being awesome

So I was JUST wishing I understood how to use R well with version control, so I can version control not only my documents and my R scripts, but also the data itself in .RData format, plus all the other cool tools I don’t even know I wish I knew. Then I opened Twitter, and found this:

http://pages.rstudio.net/Webinar—February-2015_Registration.html

I’m pretty sure Hadley Wickham is my fairy godmother. I’m going to try wishing for a ball gown and a big, shiny pumpkin car now, to see what happens.

Dealing with messy data in R (dplyr, tidyr, jsonlite, and the Facebook API)

For Tuesday’s R Club meeting, I’d like to go over two related things if there’s time to get through both:

  1. Messy data: Rearranging messy datasets with dplyr and tidyr
  2. Really, really, really messy data: Using the Facebook API

My Facebook code is not completely done, but it is functional, and I think it would be good for anyone who is not familiar with APIs to see how you get and work with a real dataset. The data comes back in JSON format (a nested, tableless format, like XML), so it’s pretty exciting. I will post the Facebook code once I’ve finished tuning it up in the next week or two.
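
In the meantime, if you’ve never seen JSON before, here’s a minimal sketch of how the jsonlite package turns it into a data frame. The snippet below is made-up stand-in data, not actual Facebook API output:

library(jsonlite)

# a tiny JSON array of objects, standing in for an API response
json_txt <- '[
  {"name": "Alice", "likes": 42},
  {"name": "Bob",   "likes": 17}
]'

# fromJSON() converts a JSON array of objects into a data frame
df <- fromJSON(json_txt)
df
##    name likes
## 1 Alice    42
## 2   Bob    17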

An overview of data wrangling in R

Here’s the RMD file for this overview, which is based on a recent presentation by Hadley Wickham. You’ll definitely want to get the Data Wrangling Cheat Sheet.

library(RCurl, warn.conflicts = FALSE)
## Loading required package: bitops
# get tb data from Hadley Wickham's github
myData <- getURL("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/tb.csv", ssl.verifypeer = FALSE) 
tbdata <- read.csv(textConnection(myData)) 
head(tbdata, n=7)
##   iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04 f514
## 1   AD 1989  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 2   AD 1990  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 3   AD 1991  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 4   AD 1992  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 5   AD 1993  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 6   AD 1994  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 7   AD 1996  NA   NA    0     0     0     4     1     0   0 NA  NA   NA
##   f014 f1524 f2534 f3544 f4554 f5564 f65 fu
## 1   NA    NA    NA    NA    NA    NA  NA NA
## 2   NA    NA    NA    NA    NA    NA  NA NA
## 3   NA    NA    NA    NA    NA    NA  NA NA
## 4   NA    NA    NA    NA    NA    NA  NA NA
## 5   NA    NA    NA    NA    NA    NA  NA NA
## 6   NA    NA    NA    NA    NA    NA  NA NA
## 7    0     1     1     0     0     1   0 NA

tidyr

Messy -> Tidy Data

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

# gather and separate the data to make it "tidy"
tb2 <- tbdata %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE) %>% 
  separate(demo, c("sex","age"), 1)
head(tb2, n=7)
##   iso2 year sex age n
## 1   AD 2005   m  04 0
## 2   AD 2006   m  04 0
## 3   AD 2008   m  04 0
## 4   AE 2006   m  04 0
## 5   AE 2007   m  04 0
## 6   AE 2008   m  04 0
## 7   AG 2007   m  04 0

dplyr

Manipulate data

# rename variables and sort observations (arrange)
tb3 <- tb2 %>%
  rename(country = iso2) %>%
  arrange(country, year, sex, age)
head(tb3, n=7)
##   country year sex  age n
## 1      AD 1996   f  014 0
## 2      AD 1996   f 1524 1
## 3      AD 1996   f 2534 1
## 4      AD 1996   f 3544 0
## 5      AD 1996   f 4554 0
## 6      AD 1996   f 5564 1
## 7      AD 1996   f   65 0

more examples

demo(package = "tidyr") # produces a list of all of the demos in package 'tidyr'
demo('so-15668870', package = "tidyr", ask = FALSE)
## 
## 
##  demo(so-15668870)
##  ---- ~~~~~~~~~~~
## 
## > # http://stackoverflow.com/questions/15668870/
## > library(tidyr)
## 
## > library(dplyr)
## 
## > grades <- tbl_df(read.table(header = TRUE, text = "
## +    ID   Test Year   Fall Spring Winter
## +     1   1   2008    15      16      19
## +     1   1   2009    12      13      27
## +     1   2   2008    22      22      24
## +     1   2   2009    10      14      20
## +     2   1   2008    12      13      25
## +     2   1   2009    16      14      21
## +     2   2   2008    13      11      29
## +     2   2   2009    23      20      26
## +     3   1   2008    11      12      22
## +     3   1   2009    13      11      27
## +     3   2   2008    17      12      23
## +     3   2   2009    14      9       31
## + "))
## 
## > grades %>%
## +   gather(Semester, Score, Fall:Winter) %>%
## +   mutate(Test = paste0("Test", Test)) %>%
## +   spread(Test, Score) %>%
## +   arrange(ID, Year, Semester)
## Source: local data frame [18 x 5]
## 
##    ID Year Semester Test1 Test2
## 1   1 2008     Fall    15    22
## 2   1 2008   Spring    16    22
## 3   1 2008   Winter    19    24
## 4   1 2009     Fall    12    10
## 5   1 2009   Spring    13    14
## 6   1 2009   Winter    27    20
## 7   2 2008     Fall    12    13
## 8   2 2008   Spring    13    11
## 9   2 2008   Winter    25    29
## 10  2 2009     Fall    16    23
## 11  2 2009   Spring    14    20
## 12  2 2009   Winter    21    26
## 13  3 2008     Fall    11    17
## 14  3 2008   Spring    12    12
## 15  3 2008   Winter    22    23
## 16  3 2009     Fall    13    14
## 17  3 2009   Spring    11     9
## 18  3 2009   Winter    27    31
demo('dadmom', package = "tidyr", ask = FALSE)
## 
## 
##  demo(dadmom)
##  ---- ~~~~~~
## 
## > library(tidyr)
## 
## > library(dplyr)
## 
## > dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
## 
## > dadmom %>%
## +   gather(key, value, named:incm) %>%
## +   separate(key, c("variable", "type"), -2) %>%
## +   spread(variable, value, convert = TRUE)
##   famid type   inc name
## 1     1    d 30000 Bill
## 2     1    m 15000 Bess
## 3     2    d 22000  Art
## 4     2    m 18000  Amy
## 5     3    d 25000 Paul
## 6     3    m 50000  Pat

more resources

vignette("tidy-data")



Welcome to a Winter Wonderland… of R!

Welcome to winter term! R Club meets on odd weeks this term, Tuesdays 3:00pm-4:20pm in the Straub computer lab (room 008). You don’t need to register for R Club to attend; if you’re still deciding whether or not to register for R Club (CRN25270), see this post.

We’ll spend our first meeting (Tuesday 1/6) getting set up for the term: installing R (for anyone who doesn’t have it already), setting up our schedule of super fun, low-pressure, extremely useful presentations (a.k.a. nuggets; see last term’s schedule for examples), and then enjoying a couple of intro nuggets by yours truly and possibly also John truly, and maybe a nugget by Jacob as well!

Nugget 1 (Intro): swirl! The most funnest way ever to learn R, f’real. (If you want a head start, see the quick-start snippet after this list.)

Nugget 2 (Intro): 611 Highlights in R: An overview of some of the handy functions that we used to manipulate data and run analyses in PSY611.

Nugget 3: Nifty trick: Network graphs from plaintext notes using the qgraph package.
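
If you do want that head start on swirl before we meet, getting it going takes just a few lines. This is the standard install-and-launch pattern; swirl will walk you through picking a course once it starts:

install.packages("swirl")  # one-time install from CRAN
library(swirl)
swirl()  # launches the interactive lessons right in your R console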

End of term reminder

Just a friendly reminder, since we’re approaching the end of the term. If you’re taking R Club for credit, you need to:

1) attend R Club (we’ve been keeping track of attendance, so you’re covered there)

2) send a one-page-ish write-up to Sanjay (sanjay@uoregon.edu) by the end of the term, describing what you’ve done and/or what you’ve learned in R Club this term.

If you are signed up for credit and you think you’ll have trouble meeting one or both of these requirements, email Sanjay.

Neat non-academic job post

I thought you might be interested in this awesome-sounding position at Etsy: https://www.etsy.com/careers/job/oWRzZfw5

Note that they require R (or Python). 🙂 Note that they also require SQL! I’ve heard from lots of sources that the transition from data that fits in spreadsheets (e.g., CSV, SPSS, or Excel files) to database systems is one of the stumbling blocks for academically trained analysts interested in industry positions.
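
If you want to dip a toe into SQL without ever leaving R, here’s a minimal sketch using an in-memory SQLite database. It assumes you have the DBI and RSQLite packages installed, and the query itself is just an illustration:

library(DBI)

# an in-memory SQLite database: nothing to set up beyond installing the packages
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# copy a familiar data frame into the database as a table
dbWriteTable(con, "mtcars", mtcars)

# run a SQL query and get an ordinary data frame back
dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")

dbDisconnect(con)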

slight change to the schedule for 11/25

Hi all,

As you may have heard, psychology has a cool speaker coming rather last minute next week: Antonella Pavese, a leader on the Google User Experience team. She sounds pretty baller (http://www.antonellapavese.com/about/). She did a post-doc here at the UO with Mike Posner and will be talking about her work. If you want to hear her speak, here’s where you need to be:

2:00 on Tuesday Nov 25 in Franklin, room 271B

We’re not exactly sure how long her presentation (or questions afterward) will run, so I’m afraid R Club may start late. If you’re planning to come to the talk, great, we can all just trot next door to R Club when it’s over. If not, plan to arrive at 3:30 next week instead of 3:00 (unless you’re cool with hanging out more or less alone for a little while, in which case, 3:00 will work just fine).

In other news, it sounds like we have the opportunity to get some $ponsor$hip for R Club (see what I did there?) from Revolution Analytics. We can chat about it during our meeting next week. For some background, check out this blog post written by a student at UC-Davis who runs a group like ours (they’re sponsored): http://software-carpentry.org/blog/2014/11/users-groups-for-ongoing-learning.html
Here are the details about sponsorship: http://www.revolutionanalytics.com/r-user-group-sponsorship-program

Dealing with missing data via multiple imputation

Here’s a little teaser for one of tomorrow’s nuggets.

I’ll be talking about using multiple imputation as a remedy for missing data, using the Amelia package. To whet your appetite, check out this pithy post about how R handles missing values in general:

http://www.ats.ucla.edu/stat/r/faq/missing.htm
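
As a tiny taste of what that post covers, here’s how R’s default NA handling behaves (a generic example, not taken from the post):

x <- c(1, 2, NA, 4)
mean(x)                # NA: most functions propagate missing values by default
mean(x, na.rm = TRUE)  # 2.33: explicitly drop the NA first
is.na(x)               # FALSE FALSE TRUE FALSE: flags the missing entry
na.omit(x)             # 1 2 4: removes it entirely (this is what listwise deletion does)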

Multiple Imputation!

Here’s a link to a webinar on missing data (you need to register with your email address to get access to the videos): http://www.theanalysisfactor.com/webinars/recordings/downloads/#v5

Here’s a link to an Rpubs handout: http://rpubs.com/rosemm/33543

And here’s all the relevant code:

install.packages("Amelia")
library(Amelia)

data(package = "Amelia") # list the datasets that come with Amelia
data(africa) # let's pull in the africa dataset
str(africa)
?africa
View(africa)
summary(africa)
summary(lm(civlib ~ trade, data = africa)) # lm() drops incomplete rows by default (listwise deletion)

?amelia
m <- 5 # the number of datasets to create (5 is typical)
a.out <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# note that we're using all the variables, even though we won't use them all in the regression
summary(a.out)
plot(a.out)
par(mfrow = c(1, 1))
missmap(a.out)

# run our regression on each dataset
b.out <- NULL
se.out <- NULL
for(i in 1:m) {
  ols.out <- lm(civlib ~ trade, data = a.out$imputations[[i]])
  b.out <- rbind(b.out, ols.out$coef)
  se.out <- rbind(se.out, coef(summary(ols.out))[,2])
}

# combine the results from all of the different regressions
combined.results <- mi.meld(q = b.out, se = se.out)

?AmeliaView # Sounds fun, but it didn't work for me. Meh.
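
Once that runs, the Amelia documentation says mi.meld() hands back the pooled results, combined across the m imputations using Rubin’s rules:

combined.results$q.mi   # pooled regression coefficients
combined.results$se.mi  # pooled standard errors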

R you excited?

R Club starts tomorrow! We’ll be talking about how we want to run R Club this year, we’ll try to figure out a schedule of mini-presentations (a.k.a. “nuggets”), and, for those of you getting into R for the first time, we’ll set you up with installation help, practice problems, and a super fun intro-to-R package called swirl. We’ll also have an awesome nugget on bootstrapping/simulations to kick things off!
When: Tuesdays on odd weeks (beginning this week), 3:00pm-4:20pm
Where: Franklin room 271A
See you there!
P.S. If you’re still deciding whether to officially register for R Club or just come casually, you might want to review our recent post on registration.

Code Reviews?

Here’s an idea for an activity we could add to R Club: http://cacm.acm.org/magazines/2014/9/177929-refining-students-coding-and-reviewing-skills/fulltext

The author writes about it with more traditional programming examples rather than data analysis stuff per se, but I think it would work really well for R Club. A lot of the learning curve for R is the programming-y stuff (getting used to using variables, functions, arguments, etc.), and I think code reviews would be a great way to learn that stuff from each other, in the same way the author describes in his post. I also think code reviews would be especially awesome for stats and R because both are so redundant: there’s always more than one way to skin a cat, as the disturbing proverb goes, and it’s great to learn a variety of tools and approaches for statistical analysis in general, and stats in R in particular. Code review would certainly help us pick up alternate ways of doing things from each other.

Thoughts?