
DIY functions


You can write and save your own functions in R, which is very handy for automating a series of commands you do often. It can also make your code much more transparent, which is great news for anyone trying to understand your scripts (including Future You). Here’s how it works:

  1. Give your new function a name.
  2. Define the arguments.
  3. Spell out the code you want R to run each time you call your function.
  4. Tell it what output you want from it.

Here’s the rough structure to follow:

myfunction <- function(arg1, arg2, ...){
  # the code you want R to run goes here
  return(object) # the output you want back
}

Example time!

Write a function that can take a vector of numbers as input, and return the mean of the numbers as output.

GetMean <- function(vector){
  result <- mean(vector, na.rm=TRUE) # ignore missing values when averaging
  return(result)
}

What happens if you run that code? Not much, on the surface. R saves that new function for you, though, so later you can call it and provide the necessary argument(s). If you’re using RStudio, you’ll notice your brand new function shows up in the environment window.

Run the code above, and then try this:

## [1] 5.5
## [1] 6
## [1] 2.167
GetMean(iris$Petal.Length) # note that iris is one of the datasets that's built into R.
## [1] 3.758
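
The calls behind the first three results above are not shown, so here is a small sketch with made-up inputs (these vectors are illustrative assumptions, not the original ones). It repeats the GetMean definition so the chunk runs on its own, and it shows what na.rm=TRUE buys you.

```r
# Same definition as above, repeated so this chunk runs on its own
GetMean <- function(vector){
  result <- mean(vector, na.rm=TRUE)
  return(result)
}

# Made-up inputs, purely for illustration
GetMean(1:10)             # mean of the integers 1 through 10
## [1] 5.5
GetMean(c(4, 6, 8, NA))   # the NA is dropped before averaging
## [1] 6
mean(c(4, 6, 8, NA))      # plain mean() without na.rm gives NA instead
## [1] NA
```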

How might you want to use this?

If you find yourself writing the same set of commands over and over, consider putting it into a function. For example, maybe you are doing a series of transformations and you want to generate a histogram after each step, but you’re a data artiste and you refuse to compromise on aesthetics – you can use a function to save all the relevant plotting code in one place and then just call it every time you want to use it.


# Use the iris data as an example (built into R)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Define the function to generate a histogram with all of the settings I like
library(ggplot2)
# note: the name of the second argument was lost from the original post; var.name is a stand-in
PlotHist <- function(var, var.name, fig.num, transformation){
  data <- data.frame(var=var) # convert variable vector to data frame for ggplot
  p <- ggplot(data, aes(x=var)) +
    geom_histogram(aes(y=..density..), colour="black", fill="white") +
    geom_density(alpha=.2, fill="red") +  # Overlay with transparent density plot
    xlab(var.name) +
    ylab(NULL) +
    ggtitle(paste("Figure ", fig.num, ": ", var.name, " (", transformation, ")", sep=""))
  print(p) # draw the plot
}

# Plot raw Petal.Length data
PlotHist(var=iris$Petal.Length,"Petal Length", fig.num=1, transformation="raw")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.


# Run transformations on Petal.Length and get plot after each one
iris$PL.sqrt <- sqrt(iris$Petal.Length)
PlotHist(iris$PL.sqrt, "Petal Length", 2, "square root")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.


iris$PL.negrec <- -1/iris$Petal.Length
PlotHist(iris$PL.negrec, "Petal Length", 3, "negative reciprocal")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.


# How cool?? So cool.
# Also, how much easier is this to read than if I had copy/pasted my ggplot code three times? So much easier.

centering and standardizing with scale()

Welcome to some handy functions! These are quick ways to get some common tasks done: centering, standardizing, and getting stats (e.g. the mean) for each level of a factor.

# get some data to play with
# Ooo! Chickens. Let's use the ChickWeight dataset.
df <- ChickWeight
str(df)
summary(df)
head(df)

# ------------------- #
#      centering      #
# ------------------- #

?scale
df$weight.c <- scale(df$weight, center=TRUE, scale=FALSE)
hist(df$weight.c)

# ---------------------------- #
#      scaling (z scores)      #
# ---------------------------- #

df$weight.z <- scale(df$weight, center=TRUE, scale=TRUE)
hist(df$weight.z)

# ----------------------------------- #
#      within levels of a factor      #
# ----------------------------------- #

# lots of great ways to do this, here are two (there are so many more!)

# strategy number 1
?ave
df$ave.weight <- ave(df$weight, df$Chick)
head(df, n=15)

# you don't have to stick with the mean. you can put in any function you like.
df$max.weight <- ave(df$weight, df$Chick, FUN=max)

# you can standardize (z-score) within levels of a factor!
df$weight.z.within <- ave(df$weight, df$Chick, FUN=scale)
head(df, n=15)

# strategy number 2
?by
hist(by(df$weight, df$Chick, FUN=mean), main = "How heavy are those chickens??")

# note that this one produces only one mean for each chick:
length(unique(df$Chick))
length(by(df$weight, df$Chick, FUN=mean))
nrow(df)

# you can put in any function you like
hist(by(df$weight, df$Chick, FUN=max), main = "What's the fattest those chickens get??")
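
If you want to convince yourself of what scale() is doing, here's a sketch that recomputes the centered and standardized weights by hand and compares them. The object names by.hand.c, by.hand.z and chick.means are made up for this example, and tapply() is thrown in as a third way to get one value per chick.

```r
df <- ChickWeight

# centering by hand: subtract the mean
by.hand.c <- df$weight - mean(df$weight)
# z scores by hand: subtract the mean, then divide by the standard deviation
by.hand.z <- (df$weight - mean(df$weight)) / sd(df$weight)

# scale() returns a one-column matrix, so convert to a plain vector before comparing
all.equal(as.numeric(scale(df$weight, center=TRUE, scale=FALSE)), by.hand.c)
all.equal(as.numeric(scale(df$weight, center=TRUE, scale=TRUE)), by.hand.z)

# a third way to get one value per chick: tapply()
chick.means <- tapply(df$weight, df$Chick, mean)
length(chick.means) # one mean per chick, same as length(unique(df$Chick))
```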

Welcome to the wonderful world of R!

Getting started

Slides on R basics (including installation): (you may want to toggle the format using the button in the bottom right)

The way to learn R is by using it, so jump into some practice problems. You can use the “practice with datasets” code copied below, or download swirl and play around with that. Or both!

Learning with Swirl

To install swirl, first install R if you haven’t already (see instructions in the slides above), and open it. In the command line, type
install.packages("swirl")
and hit Enter. You need a working internet connection. Once R has installed the package, you also need to load it. Type
library("swirl")
and hit Enter. Then type swirl() and hit Enter. Once you do that, swirl will take over and start giving you instructions (and peppy feedback!) to take you through the basics of R. Have fun!

Practice with datasets

This is some code for you to work through to practice using datasets in R.

This coordinates (roughly) with the Intro to R slides from UCLA.

Download the relevant datasets from dropbox:

Make a note of where these data files get saved when you download them (your downloads folder, maybe? Or the desktop?). You’ll need to know where they are to be able to access them from R.

Note: this is meant to provide practice moving data in and out of R. Suggested functions are provided, but you need to fill in the arguments, etc. For help on how to use a function, enter ? followed by its name in the command window (e.g. ?read.table).

To save your work, I recommend copying your code into an R script and working from there. In RStudio, go to File > New File > R Script. In R, go to File > New Document. In either case, this will open up a blank text file. You can copy-paste this code there, and then fill in your answers as you work through the problems. Any line of text that begins with # is a comment.

First, check out the file: open the file rclub1_data1.txt in a text editor or Excel. You’ll see it’s tab-delimited. Now we want to bring it into R so we can use it. Before we tell R which file to open, we need to know what folder R is currently set to look in (the working directory, or wd). If you’re trying unsuccessfully to open a file, the first thing you should check is that your wd is correct.

## [1] "/Users/TARDIS/Dropbox/RClub/rclub_code"
# if necessary, change wd so it matches the folder where you've got your data files
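
Here's one way the working directory check can go, sketched with a temporary folder standing in for wherever your data files live. tempdir() is only a placeholder (an assumption for this example); swap in your own folder path.

```r
old.wd <- getwd()          # where is R looking right now?
data.folder <- tempdir()   # placeholder: swap in the folder that holds your data files
setwd(data.folder)         # point R at that folder
list.files()               # see which files R can find there
file.exists("rclub1_data1.txt") # TRUE only if the data file is in this folder
setwd(old.wd)              # go back to where you started
```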

# we want to read in the data file, so first you need to learn about the appropriate function (read.table) so you know how to structure the command. You'll want to do this for each new function you use.

# now we can tell R to open the file, and save it as an object called df (short for dataframe). you could name it whatever you want, though. the name of the file ("rclub1_data1.txt") must be in quotes, so R knows to look for a string matching that rather than to look for an existing object with that name.
# stringsAsFactors = TRUE keeps text columns as factors in R 4.0 and later, which the levels() steps below rely on.
df <- read.table("rclub1_data1.txt", header = TRUE, sep = "\t", stringsAsFactors = TRUE)

# look at the first part of your dataframe, to eyeball the data.
##   gender HowSmartTheyAre HowManyPointsTheyGot DidTheyEatBreakfast
## 1      M              95                    3                   Y
## 2      F              85                    3                   Y
## 3      F             114                    1                   Y
## 4      M             108                    2                   Y
## 5      M              98                    3                   Y
## 6      F              86                    3                   Y
# check the last chunk of the dataset, too, just because you're curious.
##    gender HowSmartTheyAre HowManyPointsTheyGot DidTheyEatBreakfast
## 45      F              87                    5                   N
## 46      M             108                    2                   N
## 47      F              91                    3                   N
## 48      M              95                    4                   N
## 49      F             112                    1                   N
## 50      F             104                    1                   N
# actually, let's look at it in a new window. 

# get basic information about your dataframe (dimensions, variable types, etc.)
## [1] 50  4
## 'data.frame':    50 obs. of  4 variables:
##  $ gender              : Factor w/ 2 levels "F","M": 2 1 1 2 2 1 2 1 2 1 ...
##  $ HowSmartTheyAre     : int  95 85 114 108 98 86 85 107 100 97 ...
##  $ HowManyPointsTheyGot: int  3 3 1 2 3 3 3 5 2 3 ...
##  $ DidTheyEatBreakfast : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 1 ...
# what are the variable names (column names)?
## [1] "gender"               "HowSmartTheyAre"      "HowManyPointsTheyGot"
## [4] "DidTheyEatBreakfast"
# you don't like these column names. rename several of the variables.
colnames(df) <- c("gender", "IQ", "Score", "Breakfast?")

# rename just one variable: change "gender" to "male".
colnames(df)[1] <- "male"

# change the coding scheme for "male" from M/F to Y/N. Remember that "male" is currently a factor variable. Use str() and View() to check how the coding scheme is applied when you change the levels of the variable.
levels(df$male) # check what they are originally first
## [1] "F" "M"
levels(df$male) <- c("N","Y") # change them to what you want

# you notice a typo (the first subject should actually be female, not male). damn RAs and their sloppy data entry! edit just that cell.
df$male[1] <- "N"
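
A small sketch of what's going on when you change levels and edit a cell, using a made-up factor (toy is not part of the R Club data). New level names are matched to the old ones by position, and a cell can only take a value that is already one of the levels.

```r
toy <- factor(c("M", "F", "F", "M"))
levels(toy)                 # levels are sorted alphabetically: "F" "M"
levels(toy) <- c("N", "Y")  # "F" becomes "N", "M" becomes "Y" (matched by position)
as.character(toy)           # "Y" "N" "N" "Y"
toy[1] <- "N"               # editing one cell works because "N" is an existing level
as.character(toy)           # "N" "N" "N" "Y"
toy[2] <- "maybe"           # not an existing level: R gives a warning and stores NA
is.na(toy[2])               # TRUE
```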

# save this dataframe in csv format, so you can open it in other software:
# write.csv()

# open that csv file back in R, because you're fickle. (note that you can also just use read.table to read csv files - they're actually the same function, just with different defaults)
# read.csv()

# and now save it again, but this time as tab-delimited. or try using other delimiters, if you like (e.g. spaces).
# write.table()
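
If you get stuck, here's a sketch of the whole save-and-reopen round trip. It uses a tiny made-up data frame and temporary files so nothing on your computer gets overwritten; the names toy, csv.file and tab.file are assumptions for this example only.

```r
toy <- data.frame(male=c("N", "Y", "N"), IQ=c(95, 85, 114))

# save as csv and read it back
csv.file <- tempfile(fileext=".csv")
write.csv(toy, csv.file, row.names=FALSE) # row.names=FALSE keeps R from adding a column of row numbers
toy.csv <- read.csv(csv.file)

# save as tab-delimited and read it back
tab.file <- tempfile(fileext=".txt")
write.table(toy, tab.file, sep="\t", row.names=FALSE, quote=FALSE)
toy.tab <- read.table(tab.file, header=TRUE, sep="\t")

all.equal(toy.csv, toy.tab) # both routes give back the same data
```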

# open the file rclub1_data2.sav (an SPSS file). first, learn about read.spss:
## No documentation for 'read.spss' in specified packages and libraries:
## you could try '??read.spss'
# if R can't find the function, use two ?'s to conduct a broader search:

#read.spss is in the package "foreign". do you have the foreign package attached already?
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## loaded via a namespace (and not attached):
## [1] digest_0.6.4     evaluate_0.5.5   formatR_0.10     htmltools_0.2.4 
## [5] knitr_1.6        rmarkdown_0.3.10 stringr_0.6.2    tools_3.0.2
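# if not, load the foreign library.
library(foreign)

# now try ?read.spss again, then open the file rclub1_data2.sav (to.data.frame = TRUE makes read.spss return a dataframe instead of a list)
df <- read.spss("rclub1_data2.sav", to.data.frame = TRUE)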
## Warning: rclub1_data2.sav: Unrecognized record type 7, subtype 18
## encountered in system file
## re-encoding from latin1
# check out the dataframe to see what variables you have, how many cases you have, etc. You can use the same functions you used before.

# add a new column on the end with average test score by taking the mean of each person's scores for the three tests
df$TestAve <- (df$Test1 + df$Test2 + df$Test3)/3
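
One thing to watch for: if a test score is missing, adding the columns and dividing by 3 gives NA for that person. rowMeans() is one alternative, sketched here on made-up scores (the toy data frame is an assumption; only the Test1 to Test3 column names come from the code above).

```r
toy <- data.frame(Test1=c(80, 90, 70), Test2=c(85, NA, 75), Test3=c(90, 100, 80))

# same approach as above: any NA makes the whole average NA
(toy$Test1 + toy$Test2 + toy$Test3)/3
## [1] 85 NA 75

# rowMeans() over the three test columns, skipping missing scores
toy$TestAve <- rowMeans(toy[, c("Test1", "Test2", "Test3")], na.rm=TRUE)
toy$TestAve
## [1] 85 95 75
```

Whether averaging over only the non-missing tests is appropriate depends on your data, so treat na.rm=TRUE as a choice rather than a default.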


Week 6 Materials: Catch Up and Review

We’re pretty much through the UCLA slides for Intro to R, so we wanted to take a day to tie up loose ends. This is an opportunity to go back and review what we’ve covered already and post your code to BitBucket if you want to share it with other members of the group. For your convenience, here are the practice sets (with relevant data files) from weeks 1-5:

Week 1: Intro and Dataframe Practice (data1) (data2)
Week 2: Exploring Data (data)
Week 3: ggplot2
Week 4: More Practice Manipulating Datasets
Week 5: Inferential Statistics

Also think about how you want to use the last few weeks of Fall13 R Club. Is there a dataset you’d like to analyze? Do you have a set of data cleaning procedures you do often, and you’d like to make it into an R script that you can easily run over and over? Perhaps you have some analyses you’ve already conducted, but you’d like to translate the work into R code so you can post it when you publish the results, like all the cool kids are doing?

Week 3 Materials: ggplot2

This text from an RStudio tutorial will help you get an idea of what ggplot is all about:

ggplot2 has a rich underlying theory: the Grammar of Graphics, proposed by Leland Wilkinson. The grammar is based on the composition of building blocks according to certain rules. Statistical graphics are viewed as layers, each consisting of 4 elements:

  • Data
  • Mapping between variables and aesthetics (e.g. color, shape, scale)
  • Geometric Objects (e.g. points, lines, polygons)
  • Statistical Transformation (e.g. smoothing, binning in a histogram)

The user can explicitly specify these layers, and put them together according to the rules of the grammar. Layers can be saved or shared between plots, as they have a high-level representation in the code.
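
To make those four elements concrete, here's a minimal sketch using the built-in iris data. The choice of variables and of a linear fit is just for illustration, not something taken from the tutorial.

```r
library(ggplot2)

p <- ggplot(iris,                               # data
            aes(x=Petal.Length, y=Petal.Width,  # mapping: variables to aesthetics
                colour=Species)) +
  geom_point() +                                # geometric object: points
  stat_smooth(method="lm")                      # statistical transformation: a linear fit per species

p # printing the object draws the plot
```

Because the plot is saved as an object (p), you can keep adding layers to it later, e.g. p + ggtitle("Petals").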

The lecture and exercises: rclub-ggplot2.r

Full ggplot documentation can be helpful.

As we’ve seen in a previous post, this package is pretty powerful for a variety of data viz needs (and desires).

week 1 materials

Slides on R basics (including installation): (you may want to toggle the format using the button in the bottom right)

Some practice problems:

The relevant datasets:

R Club survey (how you sign up for the blog):