So I was JUST wishing I understood how to use version control with R really well, so I can version control not only my documents and my R scripts, but also the data itself in .RData format, plus all the other cool tools I don’t even know I wish I knew. Then I opened Twitter, and found this:
For Tuesday’s R Club meeting, I’d like to go over two related things if there’s time to get through both:
Messy data: Rearranging messy datasets with dplyr and tidyr
Really, really, really messy data: Using the Facebook API
My Facebook code is not completely done, but it is functional, and I think it would be good for anyone who is not familiar with APIs to see how you get and work with a real dataset. It comes in JSON format (like XML, it isn’t tabular), so it’s pretty exciting. I’ll post the Facebook code once I’ve finished tuning it up in the next week or two.
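If you haven’t worked with JSON before, here’s a tiny sketch of what parsing it in R looks like, using the jsonlite package (my choice for illustration; the Facebook code above may use a different package, and the JSON string here is made up):

```r
library(jsonlite)

# A made-up JSON string shaped like a typical API response
json_txt <- '{"data":[{"name":"Ana","likes":10},{"name":"Ben","likes":3}]}'

parsed <- fromJSON(json_txt)  # jsonlite simplifies the "data" array to a data frame
parsed$data                   # a 2-row data frame with columns name and likes
```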
An overview of data wrangling in R
RMD file. Based on a recent presentation by Hadley Wickham. You’ll definitely want to get the Data Wrangling Cheat Sheet. Here’s the video:
library(RCurl, warn.conflicts = FALSE)
## Loading required package: bitops
# get tb data from Hadley Wickham's github
myData <- getURL("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/tb.csv", ssl.verifypeer = FALSE)
tbdata <- read.csv(textConnection(myData))
head(tbdata, n=7)
## iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04 f514
## 1 AD 1989 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 AD 1990 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 AD 1991 NA NA NA NA NA NA NA NA NA NA NA NA
## 4 AD 1992 NA NA NA NA NA NA NA NA NA NA NA NA
## 5 AD 1993 NA NA NA NA NA NA NA NA NA NA NA NA
## 6 AD 1994 NA NA NA NA NA NA NA NA NA NA NA NA
## 7 AD 1996 NA NA 0 0 0 4 1 0 0 NA NA NA
## f014 f1524 f2534 f3544 f4554 f5564 f65 fu
## 1 NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA
## 7 0 1 1 0 0 1 0 NA
tidyr
Messy -> Tidy Data
library(tidyr)
library(dplyr, warn.conflicts = FALSE)
# gather and separate the data to make it "tidy"
tb2 <- tbdata %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE) %>%
  separate(demo, c("sex", "age"), sep = 1)
head(tb2, n=7)
## iso2 year sex age n
## 1 AD 2005 m 04 0
## 2 AD 2006 m 04 0
## 3 AD 2008 m 04 0
## 4 AE 2006 m 04 0
## 5 AE 2007 m 04 0
## 6 AE 2008 m 04 0
## 7 AG 2007 m 04 0
(The second table below shows the same data with iso2 renamed to country and the rows sorted by country and year; the code for that step isn’t shown in the post.)
## country year sex age n
## 1 AD 1996 f 014 0
## 2 AD 1996 f 1524 1
## 3 AD 1996 f 2534 1
## 4 AD 1996 f 3544 0
## 5 AD 1996 f 4554 0
## 6 AD 1996 f 5564 1
## 7 AD 1996 f 65 0
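The post doesn’t show the step that produced the second table, but here’s one hedged guess at what it might look like (the toy tb2 below stands in for the real one so this chunk runs on its own):

```r
library(dplyr)

# Toy stand-in for the tb2 object built above (same columns, two rows)
tb2 <- data.frame(iso2 = c("AE", "AD"), year = c(2006, 1996),
                  sex = "f", age = "014", n = c(0, 0))

tb3 <- tb2 %>%
  rename(country = iso2) %>%   # iso2 -> country
  arrange(country, year)       # sort by country, then year
tb3
```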
More examples
demo(package = "tidyr") # produces a list of all of the demos in package 'tidyr'
demo('so-15668870', package = "tidyr", ask = FALSE)
Welcome to winter term! R Club meets on odd weeks this term, Tuesdays 3:00pm-4:20pm in the Straub computer lab (room 008). You don’t need to register for R Club to attend; if you’re still deciding whether or not to register for R Club (CRN25270), see this post.
We’ll spend our first meeting (Tuesday 1/6) getting set up for the term, which includes installing R (for anyone who doesn’t have it already), setting up our schedule of super fun, low-pressure, extremely useful presentations (a.k.a. nuggets; see last term’s schedule for examples), and then enjoying a couple of intro nuggets by yours truly and possibly also John truly, and maybe a nugget by Jacob as well!
Nugget 1 (Intro): Swirl! The most funnest way ever to learn R, f’real.
Nugget 2 (Intro): 611 Highlights in R: An overview of some of the handy functions that we used to manipulate data and run analyses in PSY611.
Nugget 3: Nifty trick: Network graphs from plaintext notes using the qgraph package.
Might be of interest! If you end up doing it, I encourage you to put together a nugget or two for R Club on the coolest tips. 🙂 Info here: http://theanalysisinstitute.com/r-graphics/
Just a friendly reminder, since we’re approaching the end of the term. If you’re taking R Club for credit, you need to:
1) attend R Club (we’ve been keeping track of attendance, so you’re covered there)
2) send a one-page-ish write-up to Sanjay (sanjay@uoregon.edu) by the end of the term, describing what you’ve done and/or what you’ve learned in R Club this term.
If you are signed up for credit and you think you’ll have trouble meeting one or both of these requirements, email Sanjay.
Note that they require R (or Python). 🙂 Note that they also require SQL! I’ve heard from lots of sources that the transition from data that fits in spreadsheets (e.g. CSV, SPSS, or Excel files) to database systems is one of the stumbling blocks for academically trained analysts interested in industry positions.
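If you’re curious what the R side of that spreadsheet-to-database transition looks like, here’s a minimal sketch using the DBI and RSQLite packages (those are real packages; the table and query are just an example of mine):

```r
library(DBI)
library(RSQLite)

# An in-memory SQLite database: a low-stakes way to practice SQL from R
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Push a familiar data frame into the database as a table...
dbWriteTable(con, "mtcars", mtcars)

# ...then query it with plain SQL instead of R subsetting
res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
res  # one row per cylinder count

dbDisconnect(con)
```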
As you may have heard, psychology has a cool speaker coming rather last minute next week: Antonella Pavese, leader on the Google User Experience team. She sounds pretty baller (http://www.antonellapavese.com/about/). She did a post-doc here at the UO with Mike Posner, and will be talking about her work. If you want to hear her speak, here’s where you need to be:
2:00 on Tuesday Nov 25 in Franklin, room 271B
We’re not exactly sure how long her presentation (or questions afterward) will run, so I’m afraid R Club may start late. If you’re planning to come to the talk, great, we can all just trot next door to R Club when it’s over. If not, plan to arrive at 3:30 next week instead of 3:00 (unless you’re cool with hanging out more or less alone for a little while, in which case, 3:00 will work just fine).
Here’s a little teaser for one of tomorrow’s nuggets.
I’ll be talking about using multiple imputation as a remedy for missing data, using the Amelia package. To whet your appetite, check out this pithy post about how R handles missing values in general:
And here’s all the relevant code:
install.packages("Amelia")
library(Amelia)
data() # Amelia comes with some datasets
data(africa) # let's pull in the africa dataset
str(africa)
?africa
View(africa)
summary(africa)
summary(lm(civlib ~ trade, data = africa)) # listwise deletion: lm silently drops rows with missing values
?amelia
m <- 5 # the number of datasets to create (5 is typical)
a.out <- amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc") # note that we're using all the variables, even though we won't use them all in the regression
summary(a.out)
plot(a.out)
par(mfrow=c(1,1))
missmap(a.out)
# run our regression on each dataset
b.out <- NULL
se.out <- NULL
for (i in 1:m) {
  ols.out <- lm(civlib ~ trade, data = a.out$imputations[[i]])
  b.out <- rbind(b.out, ols.out$coef)
  se.out <- rbind(se.out, coef(summary(ols.out))[, 2])
}
# combine the results from all of the different regressions
combined.results <- mi.meld(q = b.out, se = se.out)
?AmeliaView # Sounds fun, but it didn't work for me. Meh.
R Club starts tomorrow! We’ll be talking about how we want to run R Club this year, we’ll try to figure out a schedule of mini-presentations (a.k.a. “nuggets”), and, for those of you getting into R for the first time, we’ll set you up with installation help, practice problems, and a super fun intro-to-R package called swirl. We’ll also have an awesome nugget on bootstrapping/simulations to kick things off!
When: Tuesdays on odd weeks (beginning this week), 3:00pm-4:20pm
Where: Franklin room 271A
See you there!
P.S. If you’re still deciding whether to officially register for R Club or just come casually, you might want to review our recent post on registration.
The author writes about it with more traditional programming examples rather than data-analysis stuff per se, but I think it would work really well for R Club. A lot of the learning curve for R is the programming-y stuff (getting used to variables, functions, arguments, etc.), and code reviews would be a great way to learn that from each other, much as the author describes in the post. I also think code reviews would be especially awesome for stats and R because of the highly redundant nature of both – there’s always more than one way to skin a cat, as the disturbing proverb goes – and it’s great to learn a variety of tools and approaches for statistical analysis in general, and stats in R in particular. Code review would certainly help us learn alternate ways of doing things from each other.
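To make the more-than-one-way point concrete, here’s the same summary (mean mpg by cylinder count in the built-in mtcars data) computed two ways. This is my own toy example, but it’s exactly the kind of redundancy a code review would surface:

```r
library(dplyr)

# Base R: aggregate with a formula
base_way <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr: group_by + summarize
dplyr_way <- mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg)) %>%
  as.data.frame()

# Same numbers either way
all.equal(base_way$mpg, dplyr_way$mpg)
```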