Statistical Learning (in R) at Stanford: MOOC!

Stanford is offering a MOOC (massive open online course) on statistical learning, based on the book An Introduction to Statistical Learning, with Applications in R by James, Witten, Hastie and Tibshirani (Springer, 2013), which will be freely available to students as a PDF. The course is taught by the third and fourth authors of the book. It’s free, it’s in R, and it sounds like fascinating material. Check it out:

https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about

If you decide to register and you want to coordinate study times, let me know (rosem@uoregon.edu). I’m all over this.

R-Related Resources for Documenting and Publishing Work

I’ve come across a few exciting resources in the past few days:

R + Markdown + Knitr

tl;dr: There’s an R package that lets you write up manuscripts in a version-controllable way and regenerate graphs dynamically, so that you don’t have to redo them by hand every time you change some little thing.

This part isn’t as much about learning R as about documenting work in R:

Carl Boettiger is a researcher at UC Santa Cruz who’s published what I think is a remarkable Open Lab Notebook. The concept of Open Lab Notebooks is interesting in itself — the notes that he takes on everything that he does in his lab, on a daily basis, are version-controlled and posted to his public website. He apparently is even working on a plugin that would embed his recent Twitter conversations and Mendeley readings at the top of every one of his notebook entries, making it clear what he was reading about and discussing at the time. Open Notebooks like this are a big part of the Open Science movement.

What’s especially cool about Boettiger’s notebook, though, is how he makes it work. He’s posted detailed write-ups about this here and here. As I understand it, he currently writes up his daily notes in Markdown, which can be learned in about 15 minutes (it lets the writer use a plain-text editor while still allowing headings, lists, bolding and italicizing, etc.). R code from his analyses can be embedded directly into the Markdown. Then, when he’s finished writing, he sends the file to an R package called Knitr. Knitr parses the Markdown, runs the R code, and produces all of the output (including graphs) dynamically. He can then use a program called pandoc to create ready-formatted manuscripts as PDFs, HTML, Word documents, etc.
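To make that workflow a bit more concrete, here is a minimal sketch of handing a Markdown document with one embedded R chunk to Knitr. It assumes the knitr package is installed. The document text and the variable names (fence, doc, out) are invented for illustration and are not taken from Boettiger’s notebook.

```r
# A tiny Markdown document with one embedded R chunk, built as a character vector.
# NOTE: the content is a made-up example; it assumes knitr is installed.
fence <- strrep("`", 3)  # three backticks, which delimit a code chunk in Markdown

doc <- c(
  "# Daily notes",
  "",
  "The mean of the measurements:",
  "",
  paste0(fence, "{r}"),   # opening line of the R chunk
  "mean(c(2, 4, 6))",     # R code that Knitr will run
  fence                   # closing line of the R chunk
)

if (requireNamespace("knitr", quietly = TRUE)) {
  # knit() runs the chunk and returns Markdown with the output inserted
  out <- knitr::knit(text = doc, quiet = TRUE)
  cat(out, sep = "\n")
} else {
  message("Install knitr first: install.packages(\"knitr\")")
}
```

In everyday use the Markdown would live in its own file rather than in a string, and the knitted result would be the file that gets passed along to pandoc.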

That sounds extremely exciting to me: I can imagine writing up a manuscript in a plain-text format, making it able to be properly version-controlled and not subject to proprietary file format changes over time, and having the analysis code embedded such that, if I decide to clean the data a different way, or to take out some outlier, no additional work would be necessary to re-generate all of the graphs and output and embed them in the manuscript. How cool is that?

The author of Knitr also has a post on integrating this workflow directly with WordPress. That post is here.

Free Book on Advanced R Programming

This week’s Hacker Newsletter contained a link to a new book called Advanced R development, which is itself currently under development. The author has posted the in-progress content of the book for free; it looks worth checking out!

Industry Jobs

In this blog post, Paul Litvak, a Quantitative Researcher at Google (with a PhD in Behavioral Decision Research), talks about life in industry.

Here’s one of his suggestions for launching a career outside of academia:

Learn some programming. R, then SQL, then Python, or some other scripting language. The more programming you learn the higher up the food chain you can go. If you know a lot of programming, you aren’t limited by what data exists, but only by what data you can create. This is hugely empowering, and increases your impact considerably. However, if all you learn is R, that is still incredibly useful, and will still get you into a variety of jobs. (emphasis added)

Mercurial Cheat Sheets

If you’re interested in using Mercurial / Hg from the command line (i.e., not through TortoiseHg), you may be interested in any of a number of cheat sheets that are available. I keep one like this on my desk:

http://www.cheatography.com/codeshane/cheat-sheets/mercurial-hg/

To start with Hg on the command line, open a terminal, type cd /path/to/your/repository/folder (cd for Change Directory), and then use any of the commands given on the sheet.

hg summary, for instance, will give you a summary of the Commit to which your working directory is Updated. hg commit -m "Commit message goes here" makes a commit with a given commit message. hg pull https://url_from_bitbucket Pulls changes from a remote repository (like from BitBucket).

On Mac and Linux machines (and Windows machines with Cygwin installed), you can use man hg (for ‘Manual for hg’) from the command line to see the full Mercurial documentation, with all of the command line flags (like -m above) explained.

Tools for R Awesomeness: Data Science API and RStudio Shiny

I’ve been scraping the interwebs for tools that show off how R can be awesome. Part of that awesomeness stems from a vibrant community that, out of sheer joy, builds tools for you, for free. If you know of any other tools that you think might be generally useful, post in the comments.

First: Maybe you’re interested in analyzing a corpus of text, or regional distribution of personality. The Data Science Toolkit is here for you. There is an astounding wealth of tools for taking bunches of data that are often just lying around and doing something useful with them. Seriously, click on the link and check it out. Crucially, there’s an R package for accessing these tools.

Second, in the spirit of open science, and communicating to a broad audience, I’d like to introduce you to RStudio Shiny. Shiny makes it easy to write user-friendly web applications you can wrap your data in. If you think scatter-plots are a step toward data transparency, this is a giant leap. Let the curious reader use drop-downs, radio buttons, and sliders to page through the patterns in your data. This project is still in its early stages, so you’ll either need to be running your own Linux web server (aren’t we all?) or you can register for a free beta trial of their server. In the future, I hope that universities might begin to support such tools for us to supplement our publications.
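For a rough idea of what such an app looks like, here is a minimal sketch with a single drop-down that filters a scatter-plot. It assumes the shiny package is installed and uses R’s built-in mtcars data. The input and output names (cyl, scatter) and the choice of plot are made up for this example, and the Shiny interface may differ between versions.

```r
# NOTE: a made-up example app; assumes the shiny package is installed.
if (requireNamespace("shiny", quietly = TRUE)) {
  library(shiny)

  # User interface: one drop-down in a sidebar, one plot in the main panel
  ui <- fluidPage(
    titlePanel("Fuel economy by weight"),
    sidebarLayout(
      sidebarPanel(
        selectInput("cyl", "Cylinders:", choices = sort(unique(mtcars$cyl)))
      ),
      mainPanel(plotOutput("scatter"))
    )
  )

  # Server logic: redraw the plot whenever the drop-down changes
  server <- function(input, output) {
    output$scatter <- renderPlot({
      d <- mtcars[mtcars$cyl == as.numeric(input$cyl), ]
      plot(d$wt, d$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
    })
  }

  app <- shinyApp(ui = ui, server = server)
  # runApp(app)  # uncomment to launch the app in a browser
}
```

Nothing is launched until runApp() is called, so the script is safe to source while you tinker with the layout.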

The 611/612/613 R Revolution!

This post describes the effort to translate the Psychology Department’s graduate level statistics courses into R. The goal is to provide R code for all of the lab handouts for all three courses making up the data analysis sequence: PSY611, PSY612, and PSY613.

See the bitbucket repository to get the handouts for 611, 612, and 613: https://bitbucket.org/rosemm/611-612-613-r-code

To work on this project, please download the handouts, look through them for any computation done in SPSS, and figure out how to do it in R instead. Also keep an eye out for simulations and demonstrations (most of which are currently done with Java apps online), and see if you can translate those to R as well – they will probably take the form of R scripts students could run, and then change certain parameters and re-run to see, for example, how changing sample size affects the variance of the sampling distribution of the mean. This code should all be written for an audience with no assumed R experience, so please make sure to include lots of clarifying comments/annotations. For example, it would be super helpful to include one or more comments about the output that the code generates, drawing students’ attention to the relevant pieces and explaining how it differs from the SPSS output.
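As one illustration of the kind of simulation script described above, here is a minimal sketch in base R of how sample size affects the variance of the sampling distribution of the mean. The function name simulate_means and the specific values (population mean 100, SD 15, 5000 replications) are made up for this example and do not come from the course handouts.

```r
# Simulate the sampling distribution of the mean for a given sample size.
# NOTE: simulate_means() and all parameter values are illustrative assumptions.
simulate_means <- function(n, reps = 5000, mu = 100, sigma = 15) {
  # Draw `reps` samples of size n from a normal population
  # and return the mean of each sample.
  replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
}

set.seed(611)  # make the simulation reproducible

sample_sizes <- c(5, 25, 100)  # try changing these and re-running

# Observed variance of the sample means at each sample size
observed <- sapply(sample_sizes, function(n) var(simulate_means(n)))

# Theory: the variance of the sampling distribution of the mean is sigma^2 / n
expected <- 15^2 / sample_sizes

# Compare the simulated and theoretical variances side by side
results <- data.frame(n = sample_sizes, observed = observed, expected = expected)
print(results)
```

The observed column should shrink as n grows and stay close to the expected column. Students can change sample_sizes, reps, or sigma and re-run the script to see how the pattern shifts.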

If you’re working on this project, consider commenting on this post to let others know where you’re directing your efforts (e.g. “I’m working on 611 Lab 1: EDA”). It’s fine for multiple people to work on the same handouts – we might uncover some cool alternate solutions that way! – but we’ll get the most done if we can allocate our efforts efficiently. Please save your code as an R file with the name of the course and handout it belongs to (e.g. “611_Lab1.r”) and post to BitBucket.

If you come across cool resources while you’re working on this or if you have questions/comments/advice/whatever relevant to this project, please comment on this post.

The Plan: Individual Projects

We’ve pretty much gotten through the material in the Intro to R course from UCLA, and covered some neat supplementary content as well (for a one-stop shopping experience of the topics we’ve covered so far, see the Week 6 Review). For the last few weeks of Fall 13 R Club, we’d like to focus on individual projects. The goal is for everyone to come up with an R project for him/herself, and then spend the rest of term trying to do it. Ideally, this should be a task that’s actually useful to you. For example, perhaps you have some analyses you’ve been doing in another environment (SPSS? HLM? MPlus?), but you’d like to try to translate the work to R to check whether the results come out the same. Or maybe there’s a data cleaning procedure in your lab that currently involves work by hand in Excel or something, and you’d like to automate it. Alternatively, if you don’t have any project of your own that you’d like to work on, you can join the effort to translate PSY611/612/613 into R – we’d like to be able to provide clearly annotated R scripts to do all the tasks currently presented in SPSS for the whole sequence. Vive la revolution!

The aim of these individual projects is twofold: 1) to give you a productive, collaborative space so you can work through a project that’s useful to you and (this is key) have the opportunity to ask questions when you get stuck/frustrated instead of just smashing your computer with a coffee mug alone in your apartment at 3am; and 2) to develop a pool of resources that we all share, so everyone can benefit from our collective problem-solving efforts. To facilitate both question-answering and resource-sharing, please post your code online (preferably to BitBucket) and share it with your fellow rclubers. BitBucket provides an excellent environment for people to help you with your code when you get stuck, and if everyone has access to each other’s code, we can all learn from the solutions everyone comes up with. It would also be great if you could write up a brief summary of what your code does, include any useful resources you found while working on it, and add it to the blog as a new post with a link to your code on BitBucket. I’ll add a post category called “projects”, so please tag your post with that label – that way we’ll be able to view all the project posts in one place easily. You might find it useful to create the post early, before your code is done, to give your R Club collaborators some sense of what you’re trying to do so it’s easier for them to help you. I think you should all have “author” access, but if you have trouble figuring out how to post to the blog, send me an email.

If you have trouble getting set up with BitBucket or TortoiseHg (which is not unlikely if you’ve never used programs like this before), the weekly R Club meeting is a great place to troubleshoot with friends. Ask for help.

Week 6 Materials: Catch Up and Review

We’re pretty much through the UCLA slides for Intro to R, so we wanted to take a day to tie up loose ends. This is an opportunity to go back and review what we’ve covered already and post your code to BitBucket if you want to share it with other members of the group. For your convenience, here are the practice sets (with relevant data files) from weeks 1-5:

Week 1: Intro and Dataframe Practice (data1) (data2)
Week 2: Exploring Data (data)
Week 3: ggplot2
Week 4: More Practice Manipulating Datasets
Week 5: Inferential Statistics

Also think about how you want to use the last few weeks of Fall 13 R Club. Is there a dataset you’d like to analyze? Do you have a set of data cleaning procedures you do often, and you’d like to make it into an R script that you can easily run over and over? Perhaps you have some analyses you’ve already conducted, but you’d like to translate the work into R code so you can post it when you publish the results, like all the cool kids are doing?