R programming and data structures
Class miscellanea
For those who are enrolled:
- Attendance taken every class
- Email us and we will give you access to the attendance Google Doc.
- Final, brief summary due to Sanjay at course end
If you need help outside of class, first go through the swirl tutorials and then contact us (or contact us to help you get started on swirl).
install.packages("swirl")
library(swirl)
swirl() # have fun! :)
Recap of what we’ve learned
- R is a giant calculator
- More properly, a Turing Machine
- Let’s thank Alan and Ada
- R keeps data in vectors, matrices, and (new to you) data frames
- Use special syntax to index, or reference, parts of each of these structures (see the quick recap code after this list).
- R uses functions to do things
- plotting
- random number generation
- stats
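For a quick refresher, here's what that indexing and function syntax looks like (a minimal sketch using made-up numbers):
# a small vector, a matrix, and some indexing
x <- c(10, 20, 30, 40)
x[2]              # second element: 20
m <- matrix(1:6, nrow=2)
m[1, 3]           # row 1, column 3: 5
# functions do things: plotting, random number generation, stats
hist(rnorm(100))  # 100 random normal draws, plotted as a histogram
mean(x)           # a basic stat: 25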
R is programming
R stores data, and functions that do things with that data, in your Environment.
You can use functions by typing commands directly into the console, or you can write commands in a script, like this one, and run commands from there (the preferred method).
Let’s compartmentalize things a bit:
- Raw data stored as text in a comma separated value file on your computer
- SPSS files work too *
- Your R script saves all of your procedures for
- Loading your raw data
- Cleaning your raw data
- saving your cleaned data for faster loading
- Analyzing your cleaned data
- Saving your results to text, html, pdf, and so forth.
- The R environment holds in working memory (RAM) your
- loaded data (like, so loaded man)
- base functions
- loaded package functions (after using, e.g., `library(AwesomePackage)`)
Here’s an example, as a flowchart, of what this looks like put together:
If you don’t fully grok this right now, don’t worry, it will come. For now, just be thinking about how R keeps separate the data on your hard drive, the copy of that data you’re working on, and all of the functions you’re applying to that data, whether to fix it up, analyze it, or present your analyses.
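For instance, a bare-bones analysis script might be organized like this (just a sketch; the file and object names here are made up):
# 1. Load raw data from disk
rawData <- read.csv('myRawData.csv')

# 2. Clean it (e.g., drop rows with missing values)
cleanData <- rawData[complete.cases(rawData), ]

# 3. Save the cleaned copy for faster loading next time
write.csv(cleanData, file='myCleanData.csv', row.names=FALSE)

# 4. Analyze and save the results as text
results <- summary(cleanData)
capture.output(results, file='myResults.txt')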
Functions: The Basics
Functions take input (data, other functions, option flags) and often give you output.
You can save this output, as we saw, using `<-`.
You can write your own functions and save them like this:
reverseScore <- function(aVector, minValue, maxValue){
reversed <- maxValue - aVector + minValue
return(reversed)
}
#Let's test:
reverseScore(c(1,2,3,4,5), 1, 5)
## [1] 5 4 3 2 1
Functions are stored using variable names just like data. Every function you’ll be using is written in this way. If you want to look inside a function, just type the function name without `()`.
reverseScore
## function(aVector, minValue, maxValue){
## reversed <- maxValue - aVector + minValue
## return(reversed)
## }
The structure of Data Frames
Quick aside into the land of prepackaged data
To get started, let’s load the `psych` package by the preeminent personality psychologist, William Revelle, at Northwestern:
library(psych)
Aside from providing a host of useful functions, we will also get access to a ton of neat data.
# Run this to see all available data in `psych`: data(package='psych')
We’re going to load the data about vegetable preferences (note that `data(vegetables)` loads a data frame named `veg`):
data(vegetables)
str(veg)
## 'data.frame': 9 obs. of 9 variables:
## $ Turn : num 0.5 0.182 0.23 0.189 0.122 0.108 0.101 0.108 0.074
## $ Cab : num 0.818 0.5 0.399 0.277 0.257 0.264 0.189 0.155 0.142
## $ Beet : num 0.77 0.601 0.5 0.439 0.264 0.324 0.155 0.203 0.182
## $ Asp : num 0.811 0.723 0.561 0.5 0.439 0.412 0.324 0.399 0.27
## $ Car : num 0.878 0.743 0.736 0.561 0.5 0.507 0.426 0.291 0.236
## $ Spin : num 0.892 0.736 0.676 0.588 0.493 0.5 0.372 0.318 0.372
## $ S.Beans: num 0.899 0.811 0.845 0.676 0.574 0.628 0.5 0.473 0.358
## $ Peas : num 0.892 0.845 0.797 0.601 0.709 0.682 0.527 0.5 0.372
## $ Corn : num 0.926 0.858 0.818 0.73 0.764 0.628 0.642 0.628 0.5
How to data.frame
Unlike vectors and matrices, data frames can hold data of all different types: a column of numbers, another column of words (i.e., character strings), and maybe another column of a special type called factors.
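As a tiny illustration (nothing to do with the `veg` data yet, just made-up values), a single data frame can mix all three of those column types:
# one numeric column, one character column, one factor column
mixed <- data.frame(score = c(3.2, 4.1, 2.8),
                    name  = c('ann', 'bob', 'cam'),
                    group = factor(c('ctrl', 'tx', 'ctrl')),
                    stringsAsFactors = FALSE)
str(mixed)  # score is num, name is chr, group is a Factor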
To demonstrate this, I’m going to quickly add a column of random grouping to the `veg` data frame:
veg$group <- sample(c('Awesome', 'NotSoMuch'), size=dim(veg)[1], replace=T)
str(veg)
## 'data.frame': 9 obs. of 10 variables:
## $ Turn : num 0.5 0.182 0.23 0.189 0.122 0.108 0.101 0.108 0.074
## $ Cab : num 0.818 0.5 0.399 0.277 0.257 0.264 0.189 0.155 0.142
## $ Beet : num 0.77 0.601 0.5 0.439 0.264 0.324 0.155 0.203 0.182
## $ Asp : num 0.811 0.723 0.561 0.5 0.439 0.412 0.324 0.399 0.27
## $ Car : num 0.878 0.743 0.736 0.561 0.5 0.507 0.426 0.291 0.236
## $ Spin : num 0.892 0.736 0.676 0.588 0.493 0.5 0.372 0.318 0.372
## $ S.Beans: num 0.899 0.811 0.845 0.676 0.574 0.628 0.5 0.473 0.358
## $ Peas : num 0.892 0.845 0.797 0.601 0.709 0.682 0.527 0.5 0.372
## $ Corn : num 0.926 0.858 0.818 0.73 0.764 0.628 0.642 0.628 0.5
## $ group : chr "Awesome" "Awesome" "NotSoMuch" "NotSoMuch" ...
Just like matrices, you can reference rows and columns numerically
# first row, all columns:
veg[1, ]
## Turn Cab Beet Asp Car Spin S.Beans Peas Corn group
## Turn 0.5 0.818 0.77 0.811 0.878 0.892 0.899 0.892 0.926 Awesome
# first column, all rows:
veg[, 1] # Notice we just get back a vector (no 'orientation')
## [1] 0.500 0.182 0.230 0.189 0.122 0.108 0.101 0.108 0.074
# the cell at row 8, column 9:
veg[8, 9]
## [1] 0.628
Have you noticed the columns are named?
names(veg)
## [1] "Turn" "Cab" "Beet" "Asp" "Car" "Spin" "S.Beans"
## [8] "Peas" "Corn" "group"
You can also reference columns by these names:
veg[, 'Asp']
## [1] 0.811 0.723 0.561 0.500 0.439 0.412 0.324 0.399 0.270
veg[1, 'group']
## [1] "Awesome"
Here’s another way, using the `$` operator:
veg$Beet
## [1] 0.770 0.601 0.500 0.439 0.264 0.324 0.155 0.203 0.182
veg$Peas[4]
## [1] 0.601
You can also refer to ranges:
veg[1:3, 'Beet']
## [1] 0.770 0.601 0.500
veg[1:3, 1:4]
## Turn Cab Beet Asp
## Turn 0.500 0.818 0.770 0.811
## Cab 0.182 0.500 0.601 0.723
## Beet 0.230 0.399 0.500 0.561
veg$Car[5:6]
## [1] 0.500 0.507
There are row names too because this is actually a matrix of the proportion of times one vegetable was preferred over another – ignore that for now.
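If you’re curious, you can peek at those row names just like the column names (they match the vegetable abbreviations you see down the left side of the printouts above):
rownames(veg)  # "Turn" "Cab" "Beet" "Asp" "Car" "Spin" "S.Beans" "Peas" "Corn"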
You can also index the ranges you don’t want:
#who cares about Turnips?
veg[-1, -1]
## Cab Beet Asp Car Spin S.Beans Peas Corn group
## Cab 0.500 0.601 0.723 0.743 0.736 0.811 0.845 0.858 Awesome
## Beet 0.399 0.500 0.561 0.736 0.676 0.845 0.797 0.818 NotSoMuch
## Asp 0.277 0.439 0.500 0.561 0.588 0.676 0.601 0.730 NotSoMuch
## Car 0.257 0.264 0.439 0.500 0.493 0.574 0.709 0.764 Awesome
## Spin 0.264 0.324 0.412 0.507 0.500 0.628 0.682 0.628 NotSoMuch
## S.Beans 0.189 0.155 0.324 0.426 0.372 0.500 0.527 0.642 Awesome
## Peas 0.155 0.203 0.399 0.291 0.318 0.473 0.500 0.628 Awesome
## Corn 0.142 0.182 0.270 0.236 0.372 0.358 0.372 0.500 NotSoMuch
Some classic R subsetting
You’ll see this stuff a lot; it’s convenient but ugly shorthand. At least it’s not MATLAB.
In addition to putting in the number(s) indexing the rows and columns you want, you can also index using a vector of `TRUE` or `FALSE` values:
TFVector <- rep(c(TRUE, FALSE), length=dim(veg)[1])
TFVector
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
veg[TFVector, ]
## Turn Cab Beet Asp Car Spin S.Beans Peas Corn group
## Turn 0.500 0.818 0.770 0.811 0.878 0.892 0.899 0.892 0.926 Awesome
## Beet 0.230 0.399 0.500 0.561 0.736 0.676 0.845 0.797 0.818 NotSoMuch
## Car 0.122 0.257 0.264 0.439 0.500 0.493 0.574 0.709 0.764 Awesome
## S.Beans 0.101 0.189 0.155 0.324 0.426 0.372 0.500 0.527 0.642 Awesome
## Corn 0.074 0.142 0.182 0.270 0.236 0.372 0.358 0.372 0.500 NotSoMuch
This seems dumb, but you can use it to subset your data by whether or not each row has a certain value:
awesomeVector <- veg$group == 'Awesome'
awesomeVector
## [1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
veg[awesomeVector, ]
## Turn Cab Beet Asp Car Spin S.Beans Peas Corn group
## Turn 0.500 0.818 0.770 0.811 0.878 0.892 0.899 0.892 0.926 Awesome
## Cab 0.182 0.500 0.601 0.723 0.743 0.736 0.811 0.845 0.858 Awesome
## Car 0.122 0.257 0.264 0.439 0.500 0.493 0.574 0.709 0.764 Awesome
## S.Beans 0.101 0.189 0.155 0.324 0.426 0.372 0.500 0.527 0.642 Awesome
## Peas 0.108 0.155 0.203 0.399 0.291 0.318 0.473 0.500 0.628 Awesome
# Or just combine it:
veg[veg$group == 'Awesome', ]
## Turn Cab Beet Asp Car Spin S.Beans Peas Corn group
## Turn 0.500 0.818 0.770 0.811 0.878 0.892 0.899 0.892 0.926 Awesome
## Cab 0.182 0.500 0.601 0.723 0.743 0.736 0.811 0.845 0.858 Awesome
## Car 0.122 0.257 0.264 0.439 0.500 0.493 0.574 0.709 0.764 Awesome
## S.Beans 0.101 0.189 0.155 0.324 0.426 0.372 0.500 0.527 0.642 Awesome
## Peas 0.108 0.155 0.203 0.399 0.291 0.318 0.473 0.500 0.628 Awesome
There are better ways, a lot of which we’ll see next week.
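As one small taste, base R’s `subset()` function does the same filtering with slightly friendlier syntax (a quick sketch using the same `veg` data):
# keep only the rows whose group is 'Awesome'
subset(veg, group == 'Awesome')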
R Ain’t Loopy
`for` loops are a necessary and useful programming topic to master. However, R wants you to perform your operations on whole collections of numbers, those vectors. That is, R works faster if you don’t visit each cell of each data frame using a `for` loop, but instead ask it to do something to each column all at once. Even basic arithmetic works this way:
aVec <- 1:10
aVec
## [1] 1 2 3 4 5 6 7 8 9 10
aVec + 2.5
## [1] 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5
aVec*100
## [1] 100 200 300 400 500 600 700 800 900 1000
aVec + c(-.5, .25)
## [1] 0.50 2.25 2.50 4.25 4.50 6.25 6.50 8.25 8.50 10.25
In the last example, the shorter vector is recycled to match the length of `aVec`. In other words, most R functions already know how to do something to each element in a vector.
Combine that with the fact that each column in a data.frame is just a vector, and that R has built in functions for doing something to each column in a data frame, and you can write more R-like code that runs faster, and is easier to read.
These functions are called the `apply` functions. We’ll show you recent innovations on these functions next week. I’m first just going to show you how to take the mean of each column in `veg`, except for the group column:
# take the mean of every column except the last (the group column)
vegMeans <- sapply(veg[, -dim(veg)[2]], mean)
vegMeans
## Turn Cab Beet Asp Car Spin S.Beans
## 0.1793333 0.3334444 0.3820000 0.4932222 0.5420000 0.5496667 0.6404444
## Peas Corn
## 0.6583333 0.7215556
If you wanted to transform each cell in the data, say by log transforming it, you could do this:
logVeg <- sapply(veg[, -dim(veg)[2]], log)
logVeg
## Turn Cab Beet Asp Car Spin
## [1,] -0.6931472 -0.2008929 -0.2613648 -0.2094872 -0.1301087 -0.1142891
## [2,] -1.7037486 -0.6931472 -0.5091603 -0.3243461 -0.2970592 -0.3065252
## [3,] -1.4696760 -0.9187939 -0.6931472 -0.5780344 -0.3065252 -0.3915622
## [4,] -1.6660083 -1.2837378 -0.8232559 -0.6931472 -0.5780344 -0.5310283
## [5,] -2.1037342 -1.3586792 -1.3318062 -0.8232559 -0.6931472 -0.7072461
## [6,] -2.2256241 -1.3318062 -1.1270118 -0.8867319 -0.6792443 -0.6931472
## [7,] -2.2926348 -1.6660083 -1.8643302 -1.1270118 -0.8533159 -0.9888614
## [8,] -2.2256241 -1.8643302 -1.5945493 -0.9187939 -1.2344320 -1.1457039
## [9,] -2.6036902 -1.9519282 -1.7037486 -1.3093333 -1.4439235 -0.9888614
## S.Beans Peas Corn
## [1,] -0.1064722 -0.1142891 -0.07688104
## [2,] -0.2094872 -0.1684187 -0.15315118
## [3,] -0.1684187 -0.2269006 -0.20089294
## [4,] -0.3915622 -0.5091603 -0.31471074
## [5,] -0.5551259 -0.3438998 -0.26918749
## [6,] -0.4652151 -0.3827256 -0.46521511
## [7,] -0.6931472 -0.6405547 -0.44316698
## [8,] -0.7486599 -0.6931472 -0.46521511
## [9,] -1.0272223 -0.9888614 -0.69314718
This is where writing your own functions comes in handy. I want to get back both the mean and SD of each column:
meanAndSD <- function(aVec){
aMean <- mean(aVec)
aSD <- sd(aVec)
return(c(mean=aMean, sd=aSD))
}
vegMeanAndSD <- sapply(veg[, -dim(veg)[2]], meanAndSD)
vegMeanAndSD
## Turn Cab Beet Asp Car Spin S.Beans
## mean 0.1793333 0.3334444 0.382000 0.4932222 0.5420000 0.5496667 0.6404444
## sd 0.1304751 0.2150704 0.211109 0.1786405 0.2134174 0.1909908 0.1841040
## Peas Corn
## mean 0.6583333 0.7215556
## sd 0.1729855 0.1344016
Save that clean data
Let’s pretend that the log transformed data is our ‘cleaned’ data, and we want to save it for later (’cause log transforming takes SOOO LOOOONG):
write.csv(logVeg, file='logTransVegetables.csv')
#to read it later:
read.csv('logTransVegetables.csv')
## X Turn Cab Beet Asp Car Spin
## 1 1 -0.6931472 -0.2008929 -0.2613648 -0.2094872 -0.1301087 -0.1142891
## 2 2 -1.7037486 -0.6931472 -0.5091603 -0.3243461 -0.2970592 -0.3065252
## 3 3 -1.4696760 -0.9187939 -0.6931472 -0.5780344 -0.3065252 -0.3915622
## 4 4 -1.6660083 -1.2837378 -0.8232559 -0.6931472 -0.5780344 -0.5310283
## 5 5 -2.1037342 -1.3586792 -1.3318062 -0.8232559 -0.6931472 -0.7072461
## 6 6 -2.2256241 -1.3318062 -1.1270118 -0.8867319 -0.6792443 -0.6931472
## 7 7 -2.2926348 -1.6660083 -1.8643302 -1.1270118 -0.8533159 -0.9888614
## 8 8 -2.2256241 -1.8643302 -1.5945493 -0.9187939 -1.2344320 -1.1457039
## 9 9 -2.6036902 -1.9519282 -1.7037486 -1.3093333 -1.4439235 -0.9888614
## S.Beans Peas Corn
## 1 -0.1064722 -0.1142891 -0.07688104
## 2 -0.2094872 -0.1684187 -0.15315118
## 3 -0.1684187 -0.2269006 -0.20089294
## 4 -0.3915622 -0.5091603 -0.31471074
## 5 -0.5551259 -0.3438998 -0.26918749
## 6 -0.4652151 -0.3827256 -0.46521511
## 7 -0.6931472 -0.6405547 -0.44316698
## 8 -0.7486599 -0.6931472 -0.46521511
## 9 -1.0272223 -0.9888614 -0.69314718
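Notice the extra `X` column when the file is read back in: by default, `write.csv()` wrote the row names (here just 1 through 9) as a first column. If you don’t want that, a small tweak avoids it (a sketch using the same hypothetical file name):
# don't write row names, so no extra column shows up on re-import
write.csv(logVeg, file='logTransVegetables.csv', row.names=FALSE)
read.csv('logTransVegetables.csv')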
Some notes on EDA
What you want to do in exploratory data analysis (EDA) is look at all of your data, however you can make that happen.
We could use `sapply` to get a histogram for every column of the `veg` data:
sapply(veg[, -dim(veg)[2]], hist)
## Turn Cab Beet Asp Car Spin
## breaks Numeric,6 Numeric,9 Numeric,8 Numeric,8 Numeric,8 Numeric,7
## counts Integer,5 Integer,8 Integer,7 Integer,7 Integer,7 Integer,6
## density Numeric,5 Numeric,8 Numeric,7 Numeric,7 Numeric,7 Numeric,6
## mids Numeric,5 Numeric,8 Numeric,7 Numeric,7 Numeric,7 Numeric,6
## xname "X[[1L]]" "X[[2L]]" "X[[3L]]" "X[[4L]]" "X[[5L]]" "X[[6L]]"
## equidist TRUE TRUE TRUE TRUE TRUE TRUE
## S.Beans Peas Corn
## breaks Numeric,7 Numeric,7 Numeric,6
## counts Integer,6 Integer,6 Integer,5
## density Numeric,6 Numeric,6 Numeric,5
## mids Numeric,6 Numeric,6 Numeric,5
## xname "X[[7L]]" "X[[8L]]" "X[[9L]]"
## equidist TRUE TRUE TRUE
A preview of things to come: `ggplot2` is awesome, and someone made an EDA package using it that does this:
library(GGally) #install.packages('GGally')
data(iris)
ggpairs(iris)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
## Warning: position_stack requires constant width: output may be incorrect
## Warning: position_stack requires constant width: output may be incorrect
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Using some of the functions we’ll talk about next week, you can automate a lot of the creation of custom plots and analyses (like, if you want to look at ICCs for all of your variables).
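As a tiny preview of that kind of automation, here’s one way you could roll your own plot-every-column helper with the tools from today (a sketch; `plotCol` is just a made-up name):
# draw a histogram for one named column, with the column name as the title
plotCol <- function(colName, df){
  hist(df[[colName]], main = colName, xlab = colName)
}
# apply it to every column except the last (the group column)
sapply(names(veg)[-dim(veg)[2]], plotCol, df = veg)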
On deck
- Next week: `dplyr` and `tidyr` for manipulating your data (’cause sometimes you gotta).
- Future times: You try!, massively awesome plotting, ????