R programming and data structures
Class miscellanea
For those who are enrolled:
- Attendance taken every class
- Email us and we will give you access to the attendance Google Doc.
- Final, brief summary due to Sanjay at course end
If you need help outside of class, first go through the swirl tutorials and then contact us (or contact us to help you get started on swirl).
install.packages("swirl")
library(swirl)
swirl() # have fun! :)
Recap of what we’ve learned
- R is a giant calculator
- More properly, a Turing Machine
- Let’s thank Alan and Ada
- R keeps data in vectors, matrices, and (new to you) data frames
- Use special syntax to index, or reference, parts of each of these structures (see the quick recap code after this list).
- R uses functions to do things
- plotting
- random number generation
- stats
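For a quick refresher, here's what that indexing and function syntax looks like (a minimal sketch using made-up numbers):
# a small vector, a matrix, and some indexing
x <- c(10, 20, 30, 40)
x[2]              # second element: 20
m <- matrix(1:6, nrow=2)
m[1, 3]           # row 1, column 3: 5
# functions do things: plotting, random number generation, stats
hist(rnorm(100))  # 100 random normal draws, plotted as a histogram
mean(x)           # a basic stat: 25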
R is programming
R stores data, and functions that do things with that data, in your Environment.
You can use functions by typing commands directly into the console, or you can write commands in a script, like this one, and run commands from there (the preferred method).
Let’s compartmentalize things a bit:
- Raw data stored as text in a comma separated value file on your computer
- SPSS files work too *
- Your R script saves all of your procedures for
- Loading your raw data
- Cleaning your raw data
- saving your cleaned data for faster loading
- Analyzing your cleaned data
- Saving your results to text, html, pdf, and so forth.
- The R environment holds in working memory (RAM) your
- loaded data (like, so loaded man)
- base functions
- loaded package functions (after using, e.g., `library(AwesomePackage)`)
Here’s an example, as a flowchart, of what this looks like put together:
If you don’t fully grok this right now, don’t worry, it will come. For now, just be thinking about how R keeps separate the data on your hard drive, the copy of that data you’re working on, and all of the functions you’re applying to that data, whether to fix it up, analyze it, or present your analyses.
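For instance, a bare-bones analysis script might be organized like this (just a sketch; the file and object names here are made up):
# 1. Load raw data from disk
rawData <- read.csv('myRawData.csv')

# 2. Clean it (e.g., drop rows with missing values)
cleanData <- rawData[complete.cases(rawData), ]

# 3. Save the cleaned copy for faster loading next time
write.csv(cleanData, file='myCleanData.csv', row.names=FALSE)

# 4. Analyze and save the results as text
results <- summary(cleanData)
capture.output(results, file='myResults.txt')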
Functions: The Basics
Functions take input (data, other functions, option flags) and often give you output.
You can save this output, as we saw, using `<-`.
You can write your own functions and save them like this:
reverseScore <- function(aVector, minValue, maxValue){
reversed <- maxValue - aVector + minValue
return(reversed)
}
#Let's test:
reverseScore(c(1,2,3,4,5), 1, 5)
## [1] 5 4 3 2 1
Functions are stored using variable names just like data. Every function you’ll be using is written in this way. If you want to look inside a function, just type the function name without `()`.
reverseScore
## function(aVector, minValue, maxValue){
## reversed <- maxValue - aVector + minValue
## return(reversed)
## }
The structure of Data Frames
Quick aside into the land of prepackaged data
To get started, let’s load the `psych` package by the preeminent personality psychologist, William Revelle, at Northwestern:
library(psych)
Aside from providing a host of useful functions, we will also get access to a ton of neat data.
# Run this to see all available data in `psych`: data(package='psych')
We’re going to load the data about vegetable preferences (note that `data(vegetables)` loads a data frame named `veg`):
data(vegetables)
str(veg)
## 'data.frame': 9 obs. of 9 variables:
## $ Turn : num 0.5 0.182 0.23 0.189 0.122 0.108 0.101 0.108 0.074
## $ Cab : num 0.818 0.5 0.399 0.277 0.257 0.264 0.189 0.155 0.142
## $ Beet : num 0.77 0.601 0.5 0.439 0.264 0.324 0.155 0.203 0.182
## $ Asp : num 0.811 0.723 0.561 0.5 0.439 0.412 0.324 0.399 0.27
## $ Car : num 0.878 0.743 0.736 0.561 0.5 0.507 0.426 0.291 0.236
## $ Spin : num 0.892 0.736 0.676 0.588 0.493 0.5 0.372 0.318 0.372
## $ S.Beans: num 0.899 0.811 0.845 0.676 0.574 0.628 0.5 0.473 0.358
## $ Peas : num 0.892 0.845 0.797 0.601 0.709 0.682 0.527 0.5 0.372
## $ Corn : num 0.926 0.858 0.818 0.73 0.764 0.628 0.642 0.628 0.5
How to data.frame
Unlike vectors and matrices, data frames can hold data of all different types: a column of numbers, another column of words (i.e., character strings), and maybe another column of a special type called factors.
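As a tiny illustration (nothing to do with the `veg` data yet, just made-up values), a single data frame can mix all three of those column types:
# one numeric column, one character column, one factor column
mixed <- data.frame(score = c(3.2, 4.1, 2.8),
                    name  = c('ann', 'bob', 'cam'),
                    group = factor(c('ctrl', 'tx', 'ctrl')),
                    stringsAsFactors = FALSE)
str(mixed)  # score is num, name is chr, group is a Factor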
To demonstrate this, I’m going to quickly add a column of random grouping to the `veg` data frame:
veg$group <- sample(c('Awesome', 'NotSoMuch'), size=dim(veg)[1], replace=T)
str(veg)
## 'data.frame': 9 obs. of 10 variables:
## $ Turn : num 0.5 0.182 0.23 0.189 0.122 0.108 0.101 0.108 0.074
## $ Cab : num 0.818 0.5 0.399 0.277 0.257 0.264 0.189 0.155 0.142
## $ Beet : num 0.77 0.601 0.5 0.439 0.264 0.324 0.155 0.203 0.182
## $ Asp : num 0.811 0.723 0.561 0.5 0.439 0.412 0.324 0.399 0.27
## $ Car : num 0.878 0.743 0.736 0.561 0.5 0.507 0.426 0.291 0.236
## $ Spin : num 0.892 0.736 0.676 0.588 0.493 0.5 0.372 0.318 0.372
## $ S.Beans: num 0.899 0.811 0.845 0.676 0.574 0.628 0.5 0.473 0.358
## $ Peas : num 0.892 0.845 0.797 0.601 0.709 0.682 0.527 0.5 0.372
## $ Corn : num 0.926 0.858 0.818 0.73 0.764 0.628 0.642 0.628 0.5
## $ group : chr "Awesome" "Awesome" "NotSoMuch" "NotSoMuch" ...
Just like matrices, you can reference rows and columns numerically
# first row, all columns:
veg[1, ]
## Turn Cab Beet Asp Car Spin S.Beans Peas Corn group
## Turn 0.5 0.818 0.77 0.811 0.878 0.892 0.899 0.892 0.926 Awesome
# first column, all rows:
veg[, 1] # Notice we just get back a vector (no 'orientation')
## [1] 0.500 0.182 0.230 0.189 0.122 0.108 0.101 0.108 0.074
# the cell at row 8, column 9:
veg[8, 9]
## [1] 0.628
Have you noticed the columns are named?
names(veg)
## [1] "Turn" "Cab" "Beet" "Asp" "Car" "Spin" "S.Beans"
## [8] "Peas" "Corn" "group"
You can also reference columns by these names:
veg[, 'Asp']
## [1] 0.811 0.723 0.561 0.500 0.439 0.412 0.324 0.399 0.270
veg[1, 'group']
## [1] "Awesome"
Here’s another way, using the `$` operator:
veg$Beet
## [1] 0.770 0.601 0.500 0.439 0.264 0.324 0.155 0.203 0.182
veg$Peas[4]
## [1] 0.601
You can also refer to ranges:
veg[1:3, 'Beet']
## [1] 0.770 0.601 0.500
veg[1:3, 1:4]
## Turn Cab Beet Asp
## Turn 0.500 0.818 0.770 0.811
## Cab 0.182 0.500 0.601 0.723
## Beet 0.230 0.399 0.500 0.561
veg$Car[5:6]
## [1] 0.500 0.507
There are row names too because this is actually a matrix of the proportion of times one vegetable was preferred over another – ignore that for now.
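If you’re curious, you can peek at those row names just like the column names (they match the vegetable abbreviations you see down the left side of the printouts above):
rownames(veg)  # "Turn" "Cab" "Beet" "Asp" "Car" "Spin" "S.Beans" "Peas" "Corn"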
You can also index the ranges you don’t want:
#who cares about Turnips?
veg[-1, -1]
## Cab Beet Asp Car Spin S.Beans Peas Corn group
## Cab 0.500 0.601 0.723 0.743 0.736 0.811 0.845 0.858 Awesome
## Beet 0.399 0.500 0.561 0.736 0.676 0.845 0.797 0.818 NotSoMuch
## Asp 0.277 0.439 0.500 0.561 0.588 0.676 0.601 0.730 NotSoMuch
## Car 0.257 0.264 0.439 0.500 0.493 0.574 0.709 0.764 Awesome
## Spin 0.264 0.324 0.412 0.507 0.500 0.628 0.682 0.628 NotSoMuch
## S.Beans 0.189 0.155 0.324 0.426 0.372 0.500 0.527 0.642 Awesome
## Peas 0.155 0.203 0.399 0.291 0.318 0.473 0.500 0.628 Awesome
## Corn 0.142 0.182 0.270 0.236 0.372 0.358 0.372 0.500 NotSoMuch
Some classic R subsetting
You’ll see this stuff a lot; it’s convenient but ugly shorthand. At least it’s not MATLAB.
In addition to putting in the number(s) indexing the rows and columns you want, you can also index using a vector of `TRUE` or `FALSE` values:
TFVector <- rep(c(TRUE, FALSE), length=dim(veg)[1])
TFVector
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
veg[TFVector, ]
## Turn Cab Beet Asp Car Spin S.Beans Peas Corn group
## Turn 0.500 0.818 0.770 0.811 0.878 0.892 0.899 0.892 0.926 Awesome
## Beet 0.230 0.399 0.500 0.561 0.736 0.676 0.845 0.797 0.818 NotSoMuch
## Car 0.122 0.257 0.264 0.439 0.500 0.493 0.574 0.709 0.764 Awesome
## S.Beans 0.101 0.189 0.155 0.324 0.426 0.372 0.500 0.527 0.642 Awesome
## Corn 0.074 0.142 0.182 0.270 0.236 0.372 0.358 0.372 0.500 NotSoMuch
This seems dumb, but you can use it to subset your data by whether or not each row has a certain value:
awesomeVector <- veg$group == 'Awesome'
awesomeVector
## [1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
veg[awesomeVector, ]
## Turn Cab Beet Asp Car Spin S.Beans Peas Corn group
## Turn 0.500 0.818 0.770 0.811 0.878 0.892 0.899 0.892 0.926 Awesome
## Cab 0.182 0.500 0.601 0.723 0.743 0.736 0.811 0.845 0.858 Awesome
## Car 0.122 0.257 0.264 0.439 0.500 0.493 0.574 0.709 0.764 Awesome
## S.Beans 0.101 0.189 0.155 0.324 0.426 0.372 0.500 0.527 0.642 Awesome
## Peas 0.108 0.155 0.203 0.399 0.291 0.318 0.473 0.500 0.628 Awesome
# Or just combine it:
veg[veg$group == 'Awesome', ]
## Turn Cab Beet Asp Car Spin S.Beans Peas Corn group
## Turn 0.500 0.818 0.770 0.811 0.878 0.892 0.899 0.892 0.926 Awesome
## Cab 0.182 0.500 0.601 0.723 0.743 0.736 0.811 0.845 0.858 Awesome
## Car 0.122 0.257 0.264 0.439 0.500 0.493 0.574 0.709 0.764 Awesome
## S.Beans 0.101 0.189 0.155 0.324 0.426 0.372 0.500 0.527 0.642 Awesome
## Peas 0.108 0.155 0.203 0.399 0.291 0.318 0.473 0.500 0.628 Awesome
There are better ways, a lot of which we’ll see next week.
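As one small taste, base R’s `subset()` function does the same filtering with slightly friendlier syntax (a quick sketch using the same `veg` data):
# keep only the rows whose group is 'Awesome'
subset(veg, group == 'Awesome')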
R Ain’t Loopy
`for` loops are a necessary and useful programming topic to master. However, R wants you to perform your operations on whole collections of numbers, those vectors. That is, R works faster if you don’t visit each cell of each data frame using a `for` loop, but instead ask it to do something to each column all at once. Even basic arithmetic works this way:
aVec <- 1:10
aVec
## [1] 1 2 3 4 5 6 7 8 9 10
aVec + 2.5
## [1] 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5
aVec*100
## [1] 100 200 300 400 500 600 700 800 900 1000
aVec + c(-.5, .25)
## [1] 0.50 2.25 2.50 4.25 4.50 6.25 6.50 8.25 8.50 10.25
In the last example, the shorter vector is recycled to match the length of `aVec`. In other words, most R functions already know how to do something to each element in a vector.
Combine that with the fact that each column in a data.frame is just a vector, and that R has built in functions for doing something to each column in a data frame, and you can write more R-like code that runs faster, and is easier to read.
These functions are called the `apply` functions. We’ll show you recent innovations on these functions next week. I’m first just going to show you how to take the mean of each column in `veg`, except for the group column:
# take the mean of every column except the last (the group column)
vegMeans <- sapply(veg[, -dim(veg)[2]], mean)
vegMeans
## Turn Cab Beet Asp Car Spin S.Beans
## 0.1793333 0.3334444 0.3820000 0.4932222 0.5420000 0.5496667 0.6404444
## Peas Corn
## 0.6583333 0.7215556
If you wanted to transform each cell in the data, say by log transforming it, you could do this:
logVeg <- sapply(veg[, -dim(veg)[2]], log)
logVeg
## Turn Cab Beet Asp Car Spin
## [1,] -0.6931472 -0.2008929 -0.2613648 -0.2094872 -0.1301087 -0.1142891
## [2,] -1.7037486 -0.6931472 -0.5091603 -0.3243461 -0.2970592 -0.3065252
## [3,] -1.4696760 -0.9187939 -0.6931472 -0.5780344 -0.3065252 -0.3915622
## [4,] -1.6660083 -1.2837378 -0.8232559 -0.6931472 -0.5780344 -0.5310283
## [5,] -2.1037342 -1.3586792 -1.3318062 -0.8232559 -0.6931472 -0.7072461
## [6,] -2.2256241 -1.3318062 -1.1270118 -0.8867319 -0.6792443 -0.6931472
## [7,] -2.2926348 -1.6660083 -1.8643302 -1.1270118 -0.8533159 -0.9888614
## [8,] -2.2256241 -1.8643302 -1.5945493 -0.9187939 -1.2344320 -1.1457039
## [9,] -2.6036902 -1.9519282 -1.7037486 -1.3093333 -1.4439235 -0.9888614
## S.Beans Peas Corn
## [1,] -0.1064722 -0.1142891 -0.07688104
## [2,] -0.2094872 -0.1684187 -0.15315118
## [3,] -0.1684187 -0.2269006 -0.20089294
## [4,] -0.3915622 -0.5091603 -0.31471074
## [5,] -0.5551259 -0.3438998 -0.26918749
## [6,] -0.4652151 -0.3827256 -0.46521511
## [7,] -0.6931472 -0.6405547 -0.44316698
## [8,] -0.7486599 -0.6931472 -0.46521511
## [9,] -1.0272223 -0.9888614 -0.69314718
This is where writing your own functions comes in handy. I want to get back both the mean and SD of each column:
meanAndSD <- function(aVec){
aMean <- mean(aVec)
aSD <- sd(aVec)
return(c(mean=aMean, sd=aSD))
}
vegMeanAndSD <- sapply(veg[, -dim(veg)[2]], meanAndSD)
vegMeanAndSD
## Turn Cab Beet Asp Car Spin S.Beans
## mean 0.1793333 0.3334444 0.382000 0.4932222 0.5420000 0.5496667 0.6404444
## sd 0.1304751 0.2150704 0.211109 0.1786405 0.2134174 0.1909908 0.1841040
## Peas Corn
## mean 0.6583333 0.7215556
## sd 0.1729855 0.1344016
Save that clean data
Let’s pretend that the log transformed data is our ‘cleaned’ data, and we want to save it for later (’cause log transforming takes SOOO LOOOONG):
write.csv(logVeg, file='logTransVegetables.csv')
#to read it later:
read.csv('logTransVegetables.csv')
## X Turn Cab Beet Asp Car Spin
## 1 1 -0.6931472 -0.2008929 -0.2613648 -0.2094872 -0.1301087 -0.1142891
## 2 2 -1.7037486 -0.6931472 -0.5091603 -0.3243461 -0.2970592 -0.3065252
## 3 3 -1.4696760 -0.9187939 -0.6931472 -0.5780344 -0.3065252 -0.3915622
## 4 4 -1.6660083 -1.2837378 -0.8232559 -0.6931472 -0.5780344 -0.5310283
## 5 5 -2.1037342 -1.3586792 -1.3318062 -0.8232559 -0.6931472 -0.7072461
## 6 6 -2.2256241 -1.3318062 -1.1270118 -0.8867319 -0.6792443 -0.6931472
## 7 7 -2.2926348 -1.6660083 -1.8643302 -1.1270118 -0.8533159 -0.9888614
## 8 8 -2.2256241 -1.8643302 -1.5945493 -0.9187939 -1.2344320 -1.1457039
## 9 9 -2.6036902 -1.9519282 -1.7037486 -1.3093333 -1.4439235 -0.9888614
## S.Beans Peas Corn
## 1 -0.1064722 -0.1142891 -0.07688104
## 2 -0.2094872 -0.1684187 -0.15315118
## 3 -0.1684187 -0.2269006 -0.20089294
## 4 -0.3915622 -0.5091603 -0.31471074
## 5 -0.5551259 -0.3438998 -0.26918749
## 6 -0.4652151 -0.3827256 -0.46521511
## 7 -0.6931472 -0.6405547 -0.44316698
## 8 -0.7486599 -0.6931472 -0.46521511
## 9 -1.0272223 -0.9888614 -0.69314718
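Notice the extra `X` column when the file is read back in: by default, `write.csv()` wrote the row names (here just 1 through 9) as a first column. If you don’t want that, a small tweak avoids it (a sketch using the same hypothetical file name):
# don't write row names, so no extra column shows up on re-import
write.csv(logVeg, file='logTransVegetables.csv', row.names=FALSE)
read.csv('logTransVegetables.csv')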
Some notes on EDA
What you want to do in exploratory data analysis (EDA) is look at all of your data, however you can make that happen.
We could use `sapply` to get a histogram for every column of the `veg` data:
sapply(veg[, -dim(veg)[2]], hist)
## Turn Cab Beet Asp Car Spin
## breaks Numeric,6 Numeric,9 Numeric,8 Numeric,8 Numeric,8 Numeric,7
## counts Integer,5 Integer,8 Integer,7 Integer,7 Integer,7 Integer,6
## density Numeric,5 Numeric,8 Numeric,7 Numeric,7 Numeric,7 Numeric,6
## mids Numeric,5 Numeric,8 Numeric,7 Numeric,7 Numeric,7 Numeric,6
## xname "X[[1L]]" "X[[2L]]" "X[[3L]]" "X[[4L]]" "X[[5L]]" "X[[6L]]"
## equidist TRUE TRUE TRUE TRUE TRUE TRUE
## S.Beans Peas Corn
## breaks Numeric,7 Numeric,7 Numeric,6
## counts Integer,6 Integer,6 Integer,5
## density Numeric,6 Numeric,6 Numeric,5
## mids Numeric,6 Numeric,6 Numeric,5
## xname "X[[7L]]" "X[[8L]]" "X[[9L]]"
## equidist TRUE TRUE TRUE
A preview of things to come: `ggplot2` is awesome, and someone made an EDA package using it that does this:
library(GGally) #install.packages('GGally')
data(iris)
ggpairs(iris)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
## Warning: position_stack requires constant width: output may be incorrect
## Warning: position_stack requires constant width: output may be incorrect
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Using some of the functions we’ll talk about next week, you can automate a lot of the creation of custom plots and analyses (like, if you want to look at ICCs for all of your variables).
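As a tiny preview of that kind of automation, here’s one way you could roll your own plot-every-column helper with the tools from today (a sketch; `plotCol` is just a made-up name):
# draw a histogram for one named column, with the column name as the title
plotCol <- function(colName, df){
  hist(df[[colName]], main = colName, xlab = colName)
}
# apply it to every column except the last (the group column)
sapply(names(veg)[-dim(veg)[2]], plotCol, df = veg)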
On deck
- Next week: `dplyr` and `tidyr` for manipulating your data (’cause sometimes you gotta).
- Future times: You try!, massively awesome plotting, ????