Category: Beginners

Resources and tutorials for those just starting.

Dealing with “missing”/out of bounds values in heatmaps

I was tinkering around in R to see if I could plot better-looking heatmaps, when I encountered an issue regarding how specific values are represented in plots with user-specified restricted ranges. When I’m plotting heatmaps, I usually want to restrict the range of the data being plotted in a specific way. In this case, I was plotting classifier accuracy across frequencies and time points of Fourier-transformed EEG data, and I wanted to restrict the lowest value of the heatmap to chance level (in this case, 50%).

I used the following code to generate a simple heatmap:

library(R.matlab)  # for readMat
library(reshape2)  # for melt
library(ggplot2)
library(viridis)   # for scale_fill_viridis

decoding <- readMat("/Users/Pablo/Desktop/testDecoding.mat")$plotData
plotDecoding <- melt(decoding)

quartz(width = 5, height = 4) # opens a new plot window (macOS only)
ggplot(plotDecoding, aes(x = Var2, y = Var1)) +
  geom_tile(aes(fill = value)) +
  scale_fill_viridis(name = "", limits = c(0.5, 0.75))

And this is what I got:

[Figure: heatmap1, with gray tiles where values fall outside the specified limits]

See the gray areas? What’s happening here is that any data points that fall outside my specified range (see the limits argument in scale_fill_viridis) are converted to NAs, and NA values are plotted as gray by default. Obviously these values are not missing in the actual data, so how can we make sure they’re plotted? The solution comes from the scales package, which is installed along with ggplot2 (though you still need to load it). Whenever you create a plot with specified limits, include the argument oob = squish ("oob" stands for out of bounds) in the same call where you set the limits. For example:

scale_fill_viridis(name = "", limits = c(0.5, 0.75), oob = squish)

With squish, values that fall below the specified range are drawn in the minimum color, and values that fall above the range are drawn in the maximum color (which appears to be what Matlab does by default). The resulting heatmap looks like this:

[Figure: heatmap2, with out-of-range values squished to the ends of the color scale]

A simple solution to an annoying little quirk in R!
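
If you want to try this out without the EEG file, here’s a minimal self-contained sketch (the matrix below is simulated, so the exact picture will differ):

library(ggplot2)
library(reshape2)
library(viridis)
library(scales)

fake <- matrix(runif(100, 0.3, 0.9), nrow = 10) # some values fall below 0.5
plotFake <- melt(fake)
ggplot(plotFake, aes(x = Var2, y = Var1)) +
  geom_tile(aes(fill = value)) +
  scale_fill_viridis(name = "", limits = c(0.5, 0.75), oob = squish)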

ANOVA + Contrasts in R

The linked Dropbox file has code and data files for doing contrasts and ANOVA in R.

https://www.dropbox.com/sh/132z6stjuaapn4c/AAB8TZoNIck5FH395vRpDYPPa?dl=0



R Club 11.3.15 ANOVA Contrasts




# set the working directory
setwd("/Users/jessicakosie/Dropbox/RClub_11.3.15")

# clear the environment
rm(list=ls())

Planned Contrasts in R

# read in Data Set #1

data <- read.csv("RClub_DataSet1_11.3.15.csv", header = TRUE)

# get descriptives
library(psych)
## Warning: package 'psych' was built under R version 3.2.2
describeBy(data$score, data$animal, mat=TRUE)
##    item   group1 vars  n mean        sd median trimmed    mad min max
## 11    1   Cougar    1 10  3.0 1.1547005    3.0   3.000 1.4826   1   5
## 12    2      Dog    1 10  9.1 0.9944289    9.0   9.250 1.4826   7  10
## 13    3 HouseCat    1 10  1.6 0.6992059    1.5   1.500 0.7413   1   3
## 14    4     Wolf    1 10  6.9 1.1972190    7.0   6.875 1.4826   5   9
##    range       skew   kurtosis        se
## 11     4  0.0000000 -0.9750000 0.3651484
## 12     3 -0.7809801 -0.5931107 0.3144660
## 13     2  0.5616761 -1.0787603 0.2211083
## 14     4  0.1678308 -1.1806760 0.3785939
# run an omnibus ANOVA

model <- aov(score ~ animal, data = data)
summary(model)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## animal       3  358.9  119.63   112.7 <2e-16 ***
## Residuals   36   38.2    1.06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# look at the levels of our factor

levels(data$animal)
## [1] "Cougar"   "Dog"      "HouseCat" "Wolf"
# tell R which groups to compare
c1 <- c(.5, -.5, .5, -.5) # canines vs. felines
c2 <- c(1, 0, -1, 0) # cougars vs. housecats
c3 <- c(0, 1, 0, -1) # dogs vs. wolves

# combine the above 3 lines into a matrix
mat <- cbind(c1,c2,c3)

# tell R that the matrix gives the contrasts you want
contrasts(data$animal) <- mat
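
# A quick sanity check (an addition, not in the original): orthogonal
# contrasts have pairwise cross-products of zero, so crossprod(mat)
# should be a diagonal matrix.
crossprod(mat)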

# these lines give you your results
model1 <- aov(score ~ animal, data = data)

# can just look at the omnibus ANOVA (same as above)
summary(model1)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## animal       3  358.9  119.63   112.7 <2e-16 ***
## Residuals   36   38.2    1.06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Make sure to use summary.aov here or 'split' might not work
summary.aov(model1, split=list(animal=list("Canines vs. Felines"=1, "Cougars vs House Cats" = 2, "Wolves vs Dogs"=3))) 
##                                 Df Sum Sq Mean Sq F value   Pr(>F)    
## animal                           3  358.9   119.6 112.743  < 2e-16 ***
##   animal: Canines vs. Felines    1  324.9   324.9 306.188  < 2e-16 ***
##   animal: Cougars vs House Cats  1    9.8     9.8   9.236   0.0044 ** 
##   animal: Wolves vs Dogs         1   24.2    24.2  22.806 2.98e-05 ***
## Residuals                       36   38.2     1.1                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Polynomial Contrasts

Plot the means

Before we run an ANOVA, let’s obtain a plot to get an idea of how the sample means differ.
Check out this useful tutorial: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/

# clear the environment

rm(list=ls())

# read in Data Set #2

data <- read.csv("RClub_DataSet2_11.3.15.csv", header = TRUE)

# Let's take a look at the data - do we see a linear trend?

# Since we're plotting means rather than the raw data, we first need to calculate them so we can feed them into the plot.
# To do that, we'll use a few functions from the dplyr package. Check out this tutorial: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

library(dplyr) 
## Warning: package 'dplyr' was built under R version 3.2.2
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
groups <- group_by(data, dose) # this just prepares it for us to calculate everything within each condition
plot.data <- summarise(groups,
  mean = mean(score, na.rm=TRUE),
  sd = sd(score, na.rm=TRUE),
  n = n(),
  se=sd/sqrt(n),
  ci = qt(0.975, df=n-1)*se)
plot.data # take a peek
## Source: local data frame [3 x 6]
## 
##     dose  mean        sd     n        se        ci
##   (fctr) (dbl)     (dbl) (int)     (dbl)     (dbl)
## 1   15ml  3.55 0.4972145    10 0.1572330 0.3556858
## 2   25ml  4.05 0.3689324    10 0.1166667 0.2639183
## 3   35ml  4.35 0.5296750    10 0.1674979 0.3789066
# ready to plot!
# line graph
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:psych':
## 
##     %+%
ggplot(plot.data, aes(x=dose, y=mean, group = factor(1))) +
  geom_line() +
  geom_point() + 
  geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci), width=.1) +
  ggtitle("Figure 1. Mean Score by Dose")

Polynomial Contrasts

For additional info on setting up contrasts (beyond the scope of this lab), check out the ever-useful UCLA stats walkthrough.

# run an ANOVA 

results <- aov(score ~ dose, data=data)
summary.aov(results)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## dose         2  3.267  1.6333   7.381 0.00277 **
## Residuals   27  5.975  0.2213                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Run a Polynomial Contrast

# Here's one way - tell R which groups to compare. These are the orthogonal polynomial contrasts. 
c1 <- c(-1, 0, 1)
c2 <- c(0.5, -1, 0.5)

# combine the above 2 lines into a matrix
mat <- cbind(c1,c2)

# tell R that the matrix gives the contrasts you want
contrasts(data$dose) <- mat

# these lines give you your results
model1 <- aov(score ~ dose, data = data)
summary.aov(model1, split=list(dose=list("Linear"=1, "Quadratic" = 2)))
##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## dose               2  3.267   1.633   7.381 0.002773 ** 
##   dose: Linear     1  3.200   3.200  14.460 0.000744 ***
##   dose: Quadratic  1  0.067   0.067   0.301 0.587607    
## Residuals         27  5.975   0.221                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Try it using built-in contr.poly as well:
# we tell R to get a polynomial contrast matrix for 3 conditions
contrasts(data$dose) <- contr.poly(3)

# these lines give you your results
model2 <- aov(score ~ dose, data = data)
summary.aov(model2, split=list(dose=list("Linear"=1, "Quadratic" = 2)))
##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## dose               2  3.267   1.633   7.381 0.002773 ** 
##   dose: Linear     1  3.200   3.200  14.460 0.000744 ***
##   dose: Quadratic  1  0.067   0.067   0.301 0.587607    
## Residuals         27  5.975   0.221                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
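# For reference (an addition, not part of the original output): printing the
# built-in matrix shows the orthonormal linear (.L) and quadratic (.Q)
# weights -- roughly c(-0.71, 0, 0.71) and c(0.41, -0.82, 0.41), which are
# rescaled versions of the c1 and c2 vectors above.
contr.poly(3)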
# Run a TukeyHSD Test

TukeyTest <- TukeyHSD(results)
TukeyTest
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = score ~ dose, data = data)
## 
## $dose
##           diff         lwr      upr     p adj
## 25ml-15ml  0.5 -0.02161703 1.021617 0.0621938
## 35ml-15ml  0.8  0.27838297 1.321617 0.0020820
## 35ml-25ml  0.3 -0.22161703 0.821617 0.3420861

Factorial ANOVA

# clear the environment

rm(list=ls())

# read in Data Set #3

data <- read.csv("RClub_DataSet3_11.3.15.csv", header = TRUE)

First, get a plot of your means, so you can see what’s going on.

# I'm choosing to plot with instruction type across the x-axis and grouped by age. It would also be fine to do it the other way around.

library(dplyr) 
groups <- group_by(data, instructions, age) # this just prepares it for us to calculate everything within each combination of instructions and age
plot.data <- summarise(groups,
  mean = mean(score, na.rm=TRUE),
  sd = sd(score, na.rm=TRUE),
  n = n(),
  se=sd/sqrt(n),
  ci = qt(0.975,df=n-1)*se)

plot.data # take a peek
## Source: local data frame [6 x 7]
## Groups: instructions [?]
## 
##   instructions    age     mean       sd     n        se       ci
##         (fctr) (fctr)    (dbl)    (dbl) (int)     (dbl)    (dbl)
## 1         both    old 21.16667 4.262237     6 1.7400511 4.472944
## 2         both  young 26.16667 4.167333     6 1.7013067 4.373348
## 3       verbal    old 23.50000 3.209361     6 1.3102163 3.368018
## 4       verbal  young 23.00000 2.366432     6 0.9660918 2.483418
## 5      written    old 13.66667 3.502380     6 1.4298407 3.675523
## 6      written  young 22.00000 4.774935     6 1.9493589 5.010986
# ready to plot!
# bar graph
# check this out: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/
library(ggplot2)
ggplot(plot.data, aes(x=instructions, y=mean, fill = age )) +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci), width=.2, position=position_dodge(.9)) +
  ggtitle("Figure 1. Mean score \nby instruction type and age")

Run a 2 x 3 ANOVA on the data.

# The best way to run this is actually with the lm() command, not aov(). It stands for "linear model". ANOVAs, regressions, t-tests, etc. are all examples of the general linear model, so you can use this one command to do pretty much any of them in R.

# aov() works, and it will generate exactly the same source table for you (the math is all identical), but lm() gives you more useful output.

model <- lm(score ~ instructions*age , data=data) 

# When you specify an interaction with *, R automatically assumes you want the main effects as well. 
# So "instructions*age" is shorthand for "instructions + age + instructions:age".

anova(model) # to get a source table
## Analysis of Variance Table
## 
## Response: score
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## instructions      2 254.17 127.083  8.8150 0.0009741 ***
## age               1 164.69 164.694 11.4239 0.0020278 ** 
## instructions:age  2 119.39  59.694  4.1407 0.0258236 *  
## Residuals        30 432.50  14.417                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# We can discuss calculating simple effects in a later R Club
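
# One more check (an addition, not in the original): the expanded formula
# fits an identical model, confirming the shorthand above.
model_expanded <- lm(score ~ instructions + age + instructions:age, data = data)
all.equal(coef(model), coef(model_expanded)) # TRUE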



Using dplyr to batch analyses






Get the source code!

Load our packages

If they aren’t installed, just use the install.packages command provided below.

library(MASS)
#install.packages('MASS')
library(dplyr)
#install.packages('dplyr')
library(tidyr)
#install.packages('tidyr')
library(lme4)
#install.packages('lme4')
library(ggplot2)
#install.packages('ggplot2')
library(cowplot)
#install.packages('cowplot')

Simulate Data

#set seed so this is the same every time
set.seed(1337)

I’m making up data here that is supposed to reflect Big Five personality change over the lifespan. Possible scores range from 0-100.

First, I set up population mean intercepts (at age 35) and slopes.

popLineParamMeans <- data.frame(
                var=c('E','A','C','N','O'),
                int=c(60, 65, 55, 45, 50),
                slope=c(1, 2, 3, -1, 0))

head(popLineParamMeans)
##   var int slope
## 1   E  60     1
## 2   A  65     2
## 3   C  55     3
## 4   N  45    -1
## 5   O  50     0

Next, I set up a covariance matrix for the slopes and intercepts. Here, I give the intercepts a standard deviation of \(\sqrt{10}\), slopes an SD of 1, and a negative covariance between them of -.5 (a correlation of about -.16).

lineSigmas <- matrix(c(10, -.5, -.5, 1), nrow=2)

print(lineSigmas)
##      [,1] [,2]
## [1,] 10.0 -0.5
## [2,] -0.5  1.0
print(cov2cor(lineSigmas))
##            [,1]       [,2]
## [1,]  1.0000000 -0.1581139
## [2,] -0.1581139  1.0000000
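
As a quick sanity check (an addition, not in the original), a large mvrnorm draw (mvrnorm comes from MASS, loaded above) should approximately recover the parameters we just set:

checkDraws <- mvrnorm(n = 1e5, mu = c(60, 1), Sigma = lineSigmas)
colMeans(checkDraws) # roughly 60 and 1 (E's intercept and slope)
cov(checkDraws)      # roughly lineSigmas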

Now I want to draw each person’s individual slope and intercept from a multivariate normal distribution based on the parameters above (for each BFI variable).

#set a variable for the number of people
# so that it's easy to change later if I want.
N <- 100

indivSlopes <- popLineParamMeans %>% # use the data frame from above
    group_by(var) %>% # do this for each variable separately
    do({
        someDat <- data.frame(
                #N draws
                #for 2 parameters with means defined here
                #and covariance matrix defined here
                      mvrnorm(n=N, 
                      mu=c(.$int, .$slope),
                      Sigma=lineSigmas))
        #give each row an ID
        someDat$id <- 1:N
        #return the data frame, 'cause that's what `do` expects
        someDat
    })

#give it some descriptive names
names(indivSlopes) <- c('bfi', 'intercept', 'slope', 'id')

head(indivSlopes)
## Source: local data frame [6 x 4]
## 
##      bfi intercept     slope    id
##   (fctr)     (dbl)     (dbl) (int)
## 1      A  64.37993 1.8269765     1
## 2      A  69.59519 2.1257374     2
## 3      A  65.94703 0.5928178     3
## 4      A  59.86194 2.1275653     4
## 5      A  67.18973 2.0810316     5
## 6      A  58.58967 3.1965359     6

So now we have a data frame that has a row for every person, for every BFI variable, and a column for that person’s intercept and slope on that variable.

Now we can generate their observed values at every time point. I’ll do 4 waves for now. We will generate them using a basic linear equation and throw in some error. Also, in this step we’ll randomly draw their starting age, and then center everything at age 35 (just because this mirrors the data I’m familiar with :]).

The equation is just \(y = \beta_0 + \beta_1 \cdot \mathrm{age} + \epsilon\), where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, age is age, and \(\epsilon\) is the normally distributed error term.

timeSeriesData <- indivSlopes %>% #use the slope-per-person-per-var df from above
    group_by(id, bfi) %>% #do something for every person, for every bfi variable
    do({
        age <- round(runif(1, 18, 50)) #randomly draw starting age
        ages <- age + 0:3 #generate the ages for all 4 waves
        #generate 4 scores for 4 ages with random, uncorrelated error
        score <- .$intercept + .$slope * (ages-35) + rnorm(4, 0, 4) 
        #Return a data frame, 'cause that's what `do` expects.
        #Note that it always retains the grouping variables like `id` and
        #`bfi` even if we don't put them in this data frame.
        data.frame(age=ages-35,
               agew1=age-35,
               value=score)
    })


head(timeSeriesData)
## Source: local data frame [6 x 5]
## Groups: id, bfi [2]
## 
##      id    bfi   age agew1    value
##   (int) (fctr) (dbl) (dbl)    (dbl)
## 1     1      A     4     4 74.27345
## 2     1      A     5     4 59.96832
## 3     1      A     6     4 77.33757
## 4     1      A     7     4 85.42808
## 5     1      C     1     1 49.13083
## 6     1      C     2     1 63.09312
timeSeriesData %>%
    filter(bfi=='A') %>%
    ggplot(aes(x=age, y=value)) +
        geom_line(aes(group=id)) +
        geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Now we have some data to work with, you’ve learned about mvrnorm, and you’ve seen a more complex example of how do can work.

Use do for batching analysis

Let’s pretend we don’t know the population parameters that generated our data, and want to estimate the proportion of variance due to individuals, compared to total variance.

We’ll use the multilevel modeling package lme4 to generate these statistics.

First, let’s figure out how to get the values we need for an ICC out of lme4 – that is, the variance in intercepts due to our grouping factor, and the total variance.
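
In formula terms, that’s \(\mathrm{ICC} = \sigma^2_{id} / (\sigma^2_{id} + \sigma^2_{resid})\): the intercept variance for id divided by the sum of the intercept and residual variances.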

testMod <- lmer(value ~ 1 + (1|id), data = filter(timeSeriesData, bfi=='E'))
summary(testMod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ 1 + (1 | id)
##    Data: filter(timeSeriesData, bfi == "E")
## 
## REML criterion at convergence: 2644.9
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.69346 -0.54994 -0.00728  0.58528  2.44704 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept) 173.25   13.162  
##  Residual              17.38    4.169  
## Number of obs: 400, groups:  id, 100
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   59.360      1.333   44.54
VarCorr(testMod) #we can get SD for grouping factor from this
##  Groups   Name        Std.Dev.
##  id       (Intercept) 13.1624 
##  Residual              4.1686
sigma(testMod) #and residual SD from this
## [1] 4.16858
str(VarCorr(testMod))
## List of 1
##  $ id: num [1, 1] 173
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr "(Intercept)"
##   .. ..$ : chr "(Intercept)"
##   ..- attr(*, "stddev")= Named num 13.2
##   .. ..- attr(*, "names")= chr "(Intercept)"
##   ..- attr(*, "correlation")= num [1, 1] 1
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr "(Intercept)"
##   .. .. ..$ : chr "(Intercept)"
##  - attr(*, "sc")= num 4.17
##  - attr(*, "useSc")= logi TRUE
##  - attr(*, "class")= chr "VarCorr.merMod"
VarCorr(testMod)$id['(Intercept)', '(Intercept)']
## [1] 173.2495
VarCorr(testMod)$id['(Intercept)', '(Intercept)']^.5 # the square root recovers the SD, so the unrooted value above is the variance
## [1] 13.16243
idVariance <- VarCorr(testMod)[['id']]['(Intercept)', '(Intercept)']
residVar <- sigma(testMod)^2 #square the SD to get variance

anICC <- idVariance/(idVariance+residVar)
anICC
## [1] 0.9088424

Now all we have to do is use group_by and do to apply this to each subset of the data frame. Remember that, within do, the subset of your data is referred to as . (a single dot).

iccDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        amod <- lmer(value ~ 1 + (1|id), data=.)
        idVariance <- VarCorr(amod)[['id']]['(Intercept)', '(Intercept)']
        residVar <- sigma(amod)^2 #square the SD to get variance
        anICC <- idVariance/(idVariance+residVar)
        data.frame(icc=anICC)
    })

iccDF
## Source: local data frame [5 x 2]
## Groups: bfi [5]
## 
##      bfi       icc
##   (fctr)     (dbl)
## 1      A 0.9411433
## 2      C 0.9694278
## 3      E 0.9088424
## 4      N 0.9010527
## 5      O 0.8576107

Write an ICC function

We could also write a function that takes the names of the outcome variable column, grouping variable column, and the data, and returns an ICC.

There’s a little magic in here where we construct the formula for lmer from text. Quick explanation: paste0 just sticks together strings, so paste0('This', 'Works') gives us ThisWorks. We can use variables in it too, so if we set thing <- 'nonsense' and do paste0(thing, 'Works'), we get nonsenseWorks.
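
For instance, here’s the exact string we’re about to build (a tiny illustration; the inputs are hard-coded):

yName <- 'value'
groupName <- 'id'
paste0(yName, '~1+(1|', groupName, ')')
## [1] "value~1+(1|id)"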

theICC <- function(yName, groupName, theData){
    if(!require(lme4)){
            stop('missing lme4 package')
    }

    theFormula <- formula(paste0(yName, '~1+(1|', groupName, ')'))
    amod <- lmer(theFormula, theData)
    idVariance <- VarCorr(amod)[[groupName]]['(Intercept)', '(Intercept)']
    residVar <- sigma(amod)^2 #square the SD to get variance

    anICC <- idVariance/(idVariance+residVar)
    anICC
}

#Test the function!
theICC('value', 'id', filter(timeSeriesData, bfi=='E'))
## [1] 0.9088424
iccDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        anICC <- theICC(yName='value',
                groupName='id',
                theData=.)
        data.frame(icc=anICC)
    })

iccDF
## Source: local data frame [5 x 2]
## Groups: bfi [5]
## 
##      bfi       icc
##   (fctr)     (dbl)
## 1      A 0.9411433
## 2      C 0.9694278
## 3      E 0.9088424
## 4      N 0.9010527
## 5      O 0.8576107

Other batched analyses

You can use do in this way for anything, so long as you return a data frame. Let’s go ahead and estimate the growth parameters for each BFI variable.

Again, start by getting a handle on the object you get back from your statistic function:

testMod <- lmer(value ~ 1 + age + (1|age), data=filter(timeSeriesData, bfi=='E'))
summary(testMod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ 1 + age + (1 | age)
##    Data: filter(timeSeriesData, bfi == "E")
## 
## REML criterion at convergence: 3051.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2027 -0.5829  0.0264  0.5890  4.1250 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  age      (Intercept)   0       0.00   
##  Residual             120      10.96   
## Number of obs: 400, groups:  age, 36
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 58.88174    0.54870  107.31
## age          0.85460    0.05619   15.21
## 
## Correlation of Fixed Effects:
##     (Intr)
## age -0.057
fixef(testMod) #looks like we might be able to just coerce this into a DF
## (Intercept)         age 
##  58.8817372   0.8546026
as.data.frame(fixef(testMod))
##             fixef(testMod)
## (Intercept)     58.8817372
## age              0.8546026

Looks like, so long as we keep track of which parameter is on which row, we’ll be able to just return the output of fixef coerced into a data.frame.

growthParamsDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        amod <- lmer(value ~ 1 + age + (1|age), data=.)
        fx <- as.data.frame(fixef(amod))
    })
growthParamsDF
## Source: local data frame [10 x 2]
## Groups: bfi [5]
## 
##       bfi fixef(amod)
##    (fctr)       (dbl)
## 1       A 65.48942693
## 2       A  1.93237366
## 3       C 56.15908452
## 4       C  3.02224221
## 5       E 58.88173717
## 6       E  0.85460255
## 7       N 44.86838252
## 8       N -0.81417670
## 9       O 49.45933488
## 10      O -0.08210474

Oops, gotta add the rownames.

growthParamsDF <- timeSeriesData %>%
    group_by(bfi) %>%
    do({
        amod <- lmer(value ~ 1 + age + (1|age), data=.)
        fx <- fixef(amod)
        fxDF <- as.data.frame(fx)
        fxDF$param <- gsub('\\(|\\)', '', rownames(fxDF)) # and remove '(' or ')'
        fxDF
    })
growthParamsDF
## Source: local data frame [10 x 3]
## Groups: bfi [5]
## 
##       bfi          fx     param
##    (fctr)       (dbl)     (chr)
## 1       A 65.48942693 Intercept
## 2       A  1.93237366       age
## 3       C 56.15908452 Intercept
## 4       C  3.02224221       age
## 5       E 58.88173717 Intercept
## 6       E  0.85460255       age
## 7       N 44.86838252 Intercept
## 8       N -0.81417670       age
## 9       O 49.45933488 Intercept
## 10      O -0.08210474       age

Presentation of your results

If we want, we can spread out the values and output a table.

growthParamsDF_wide <- growthParamsDF %>%
    spread(param, fx)
growthParamsDF_wide
## Source: local data frame [5 x 3]
## 
##      bfi         age Intercept
##   (fctr)       (dbl)     (dbl)
## 1      A  1.93237366  65.48943
## 2      C  3.02224221  56.15908
## 3      E  0.85460255  58.88174
## 4      N -0.81417670  44.86838
## 5      O -0.08210474  49.45933
library(knitr)
#Just a preview of how to make nice tables:

kable(growthParamsDF_wide)
bfi          age   Intercept
A      1.9323737    65.48943
C      3.0222422    56.15908
E      0.8546026    58.88174
N     -0.8141767    44.86838
O     -0.0821047    49.45933



Wrangle yourself some data!

[Image: a scrambled Rubik’s Cube]

Your data is a scrambled Rubik’s Cube, and tidyr & dplyr are your hands.

Today’s R Club will start with tutorials on some of Hadley Wickham’s amazing packages (listed below), with some time at the end to workshop your own data. So: if you’ve got data to wrangle, bring it!

tidyr, “designed specifically for data tidying”

dplyr, “A fast, consistent tool for working with data frame like objects”

magrittr, “Ceci n’est pas un pipe.”

  • See the GitHub Readme for more info on the syntax augmentations for the above packages.

Putting it all together

R programming and data structures



Download the source: week2.r

Class miscellanea

For those who are enrolled:

  1. Attendance taken every class
    • If you’re enrolled, email us and we will give you access to the attendance Google Doc.
  2. Final, brief summary due to Sanjay at course end

If you need help outside of class, first go through the swirl tutorials and then contact us (or contact us for help getting started with swirl).

install.packages("swirl") 
library(swirl)
swirl()  # have fun! :)

Recap of what we’ve learned

  1. R is a giant calculator
    • More properly, a Turing Machine
    • Let’s thank Alan and Ada


  2. R keeps data in vectors, matrices, and (new to you) data frames
    • Use special syntax to index, or reference parts of each of these structures.
  3. R uses functions to do things
    • plotting
    • random number generation
    • stats

R is programming

R stores data, and functions that do things with that data, in your Environment.

You can use functions by typing commands directly into the console, or you can write commands in a script, like this one, and run commands from there (the preferred method).
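
For instance (a quick sketch, not from the original), anything you create lands in the Environment, and ls() lists what’s there:

x <- 1:5
ls() # "x" will appear among whatever else you've created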

Let’s compartmentalize things a bit:

  1. Raw data stored as text in a comma separated value file on your computer
    • SPSS files work too
  2. Your R script saves all of your procedures for
    1. Loading your raw data
    2. Cleaning your raw data
      • saving your cleaned data for faster loading
    3. Analyzing your cleaned data
    4. Saving your results to text, html, pdf, and so forth.
  3. The R environment holds in working memory (RAM) your
    • loaded data (like, so loaded man)
    • base functions
    • loaded package functions (after using, e.g., library(AwesomePackage))

Here’s an example, in flowchart, of what this looks like put together:

[Flowchart: R workflow]

If this doesn’t fully grok right now, don’t worry, it will. For now just be thinking about how R sort of keeps separate the data on your hard drive, the copy of that data you’re working on, and all of the functions you’re applying to that data, either to fix it up, analyze it, or present your analyses.

Functions: The Basics

Functions take input (data, other functions, option flags) and often give you output.

You can save this output, as we saw, using <-.
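
For example (a quick sketch):

theMean <- mean(c(1, 2, 3))
theMean
## [1] 2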

You can write your own functions and save them like this:

reverseScore  <- function(aVector, minValue, maxValue){
    reversed <- maxValue - aVector + minValue
    return(reversed)
}

#Let's test:
reverseScore(c(1,2,3,4,5), 1, 5)
## [1] 5 4 3 2 1

Functions are stored using variable names just like data. Every function you’ll be using is written in this way. If you want to look inside a function, just type the function name without ().

reverseScore
## function(aVector, minValue, maxValue){
##  reversed <- maxValue - aVector + minValue
##  return(reversed)
## }

The structure of Data Frames

Quick aside into the land of prepackaged data

To get started, let’s load the ‘psych’ package by the preeminent personality psychologist Bill Revelle, at Northwestern

library(psych)

Aside from providing a host of useful functions, the package also gives us access to a ton of neat data.

# Run this to see all available data in `psych`: data(package='psych')

We’re going to load the data about vegetable preferences…

data(vegetables) # this loads a data frame called veg

str(veg)
## 'data.frame':    9 obs. of  9 variables:
##  $ Turn   : num  0.5 0.182 0.23 0.189 0.122 0.108 0.101 0.108 0.074
##  $ Cab    : num  0.818 0.5 0.399 0.277 0.257 0.264 0.189 0.155 0.142
##  $ Beet   : num  0.77 0.601 0.5 0.439 0.264 0.324 0.155 0.203 0.182
##  $ Asp    : num  0.811 0.723 0.561 0.5 0.439 0.412 0.324 0.399 0.27
##  $ Car    : num  0.878 0.743 0.736 0.561 0.5 0.507 0.426 0.291 0.236
##  $ Spin   : num  0.892 0.736 0.676 0.588 0.493 0.5 0.372 0.318 0.372
##  $ S.Beans: num  0.899 0.811 0.845 0.676 0.574 0.628 0.5 0.473 0.358
##  $ Peas   : num  0.892 0.845 0.797 0.601 0.709 0.682 0.527 0.5 0.372
##  $ Corn   : num  0.926 0.858 0.818 0.73 0.764 0.628 0.642 0.628 0.5

How to data.frame

Unlike vectors and matrices, data frames can hold data of all different types: a column of numbers, another column of words (i.e., character strings), and maybe another column of a special type called factors.
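
Here’s a tiny made-up example of such a mix, before we get to real data (the names here are hypothetical):

tinyDF <- data.frame(score = c(1.5, 2.5),
                     word = c('a', 'b'),
                     grp = factor(c('low', 'high')),
                     stringsAsFactors = FALSE) # keep word as plain character
str(tinyDF) # one numeric, one character, one factor column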

To demonstrate this, I’m going to quickly add a column of random grouping to the veg data frame:

veg$group <- sample(c('Awesome', 'NotSoMuch'), size=dim(veg)[1], replace=T)

str(veg)
## 'data.frame':    9 obs. of  10 variables:
##  $ Turn   : num  0.5 0.182 0.23 0.189 0.122 0.108 0.101 0.108 0.074
##  $ Cab    : num  0.818 0.5 0.399 0.277 0.257 0.264 0.189 0.155 0.142
##  $ Beet   : num  0.77 0.601 0.5 0.439 0.264 0.324 0.155 0.203 0.182
##  $ Asp    : num  0.811 0.723 0.561 0.5 0.439 0.412 0.324 0.399 0.27
##  $ Car    : num  0.878 0.743 0.736 0.561 0.5 0.507 0.426 0.291 0.236
##  $ Spin   : num  0.892 0.736 0.676 0.588 0.493 0.5 0.372 0.318 0.372
##  $ S.Beans: num  0.899 0.811 0.845 0.676 0.574 0.628 0.5 0.473 0.358
##  $ Peas   : num  0.892 0.845 0.797 0.601 0.709 0.682 0.527 0.5 0.372
##  $ Corn   : num  0.926 0.858 0.818 0.73 0.764 0.628 0.642 0.628 0.5
##  $ group  : chr  "Awesome" "Awesome" "NotSoMuch" "NotSoMuch" ...

Just like matrices, you can reference rows and columns numerically

# first row, all columns:
veg[1, ]
##      Turn   Cab Beet   Asp   Car  Spin S.Beans  Peas  Corn   group
## Turn  0.5 0.818 0.77 0.811 0.878 0.892   0.899 0.892 0.926 Awesome
# first column, all rows:
veg[, 1] # Notice we just get back a vector (no 'orientation')
## [1] 0.500 0.182 0.230 0.189 0.122 0.108 0.101 0.108 0.074
# the cell at row 8, column 9:
veg[8, 9]
## [1] 0.628

Have you noticed the columns are named?

names(veg)
##  [1] "Turn"    "Cab"     "Beet"    "Asp"     "Car"     "Spin"    "S.Beans"
##  [8] "Peas"    "Corn"    "group"

You can also reference columns by these names:

veg[, 'Asp']
## [1] 0.811 0.723 0.561 0.500 0.439 0.412 0.324 0.399 0.270
veg[1, 'group']
## [1] "Awesome"

Here’s another way, for some reason:

veg$Beet
## [1] 0.770 0.601 0.500 0.439 0.264 0.324 0.155 0.203 0.182
veg$Peas[4]
## [1] 0.601

You can also refer to ranges:

veg[1:3, 'Beet']
## [1] 0.770 0.601 0.500
veg[1:3, 1:4]
##       Turn   Cab  Beet   Asp
## Turn 0.500 0.818 0.770 0.811
## Cab  0.182 0.500 0.601 0.723
## Beet 0.230 0.399 0.500 0.561
veg$Car[5:6]
## [1] 0.500 0.507

There are row names too because this is actually a matrix of the proportion of times one vegetable was preferred over another – ignore that for now.

You can also exclude ranges you don’t want, using negative indices:

#who cares about Turnips?
veg[-1, -1]
##           Cab  Beet   Asp   Car  Spin S.Beans  Peas  Corn     group
## Cab     0.500 0.601 0.723 0.743 0.736   0.811 0.845 0.858   Awesome
## Beet    0.399 0.500 0.561 0.736 0.676   0.845 0.797 0.818 NotSoMuch
## Asp     0.277 0.439 0.500 0.561 0.588   0.676 0.601 0.730 NotSoMuch
## Car     0.257 0.264 0.439 0.500 0.493   0.574 0.709 0.764   Awesome
## Spin    0.264 0.324 0.412 0.507 0.500   0.628 0.682 0.628 NotSoMuch
## S.Beans 0.189 0.155 0.324 0.426 0.372   0.500 0.527 0.642   Awesome
## Peas    0.155 0.203 0.399 0.291 0.318   0.473 0.500 0.628   Awesome
## Corn    0.142 0.182 0.270 0.236 0.372   0.358 0.372 0.500 NotSoMuch

Some classic R subsetting

You’ll see this stuff a lot, and it’s convenient, but ugly shorthand. At least it’s not MATLAB.

In addition to putting the number(s) indexing the rows and columns you want, you can also index using a vector of TRUE or FALSE values:

TFVector <- rep(c(TRUE, FALSE), length=dim(veg)[1])
TFVector
## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
veg[TFVector, ]
##          Turn   Cab  Beet   Asp   Car  Spin S.Beans  Peas  Corn     group
## Turn    0.500 0.818 0.770 0.811 0.878 0.892   0.899 0.892 0.926   Awesome
## Beet    0.230 0.399 0.500 0.561 0.736 0.676   0.845 0.797 0.818 NotSoMuch
## Car     0.122 0.257 0.264 0.439 0.500 0.493   0.574 0.709 0.764   Awesome
## S.Beans 0.101 0.189 0.155 0.324 0.426 0.372   0.500 0.527 0.642   Awesome
## Corn    0.074 0.142 0.182 0.270 0.236 0.372   0.358 0.372 0.500 NotSoMuch

This seems dumb, but you can use it to subset your data by whether or not some row has a certain value:

awesomeVector <- veg$group == 'Awesome'
awesomeVector
## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
veg[awesomeVector, ]
##          Turn   Cab  Beet   Asp   Car  Spin S.Beans  Peas  Corn   group
## Turn    0.500 0.818 0.770 0.811 0.878 0.892   0.899 0.892 0.926 Awesome
## Cab     0.182 0.500 0.601 0.723 0.743 0.736   0.811 0.845 0.858 Awesome
## Car     0.122 0.257 0.264 0.439 0.500 0.493   0.574 0.709 0.764 Awesome
## S.Beans 0.101 0.189 0.155 0.324 0.426 0.372   0.500 0.527 0.642 Awesome
## Peas    0.108 0.155 0.203 0.399 0.291 0.318   0.473 0.500 0.628 Awesome
# Or just combine it:

veg[veg$group == 'Awesome', ]
##          Turn   Cab  Beet   Asp   Car  Spin S.Beans  Peas  Corn   group
## Turn    0.500 0.818 0.770 0.811 0.878 0.892   0.899 0.892 0.926 Awesome
## Cab     0.182 0.500 0.601 0.723 0.743 0.736   0.811 0.845 0.858 Awesome
## Car     0.122 0.257 0.264 0.439 0.500 0.493   0.574 0.709 0.764 Awesome
## S.Beans 0.101 0.189 0.155 0.324 0.426 0.372   0.500 0.527 0.642 Awesome
## Peas    0.108 0.155 0.203 0.399 0.291 0.318   0.473 0.500 0.628 Awesome

There are better ways, a lot of which we’ll see next week.

R Ain’t Loopy

for loops are an incredibly necessary and useful programming topic to master. However, R wants you to perform all your operations on those collections of numbers called vectors. That is, R works faster if you don’t visit each cell of each data frame using a for loop, but instead ask it to do something to each column. Even basic arithmetic works this way:

aVec <- 1:10
aVec
##  [1]  1  2  3  4  5  6  7  8  9 10
aVec + 2.5
##  [1]  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5
aVec*100
##  [1]  100  200  300  400  500  600  700  800  900 1000
aVec + c(-.5, .25)
##  [1]  0.50  2.25  2.50  4.25  4.50  6.25  6.50  8.25  8.50 10.25
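
To see the contrast (a sketch; this loop isn’t in the original), here’s the for-loop version of aVec + 2.5:

result <- numeric(length(aVec))
for (i in seq_along(aVec)) {
  result[i] <- aVec[i] + 2.5
}
all.equal(result, aVec + 2.5) # TRUE -- same answer, more code, slower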

In other words, most R functions already know how to do something to each element in a vector.

Combine that with the fact that each column in a data.frame is just a vector, and that R has built in functions for doing something to each column in a data frame, and you can write more R-like code that runs faster, and is easier to read.

These functions are called the apply functions. We’ll show you recent innovations on these functions next week.

I’m first just going to show you how to take the mean of each column in veg, except for the group column:

vegMeans <- sapply(veg[, -dim(veg)[2]], mean)
vegMeans
##      Turn       Cab      Beet       Asp       Car      Spin   S.Beans 
## 0.1793333 0.3334444 0.3820000 0.4932222 0.5420000 0.5496667 0.6404444 
##      Peas      Corn 
## 0.6583333 0.7215556

If you wanted to transform each cell in the data, say by log transforming it, you could do this:

logVeg <- sapply(veg[, -dim(veg)[2]], log)
logVeg
##             Turn        Cab       Beet        Asp        Car       Spin
##  [1,] -0.6931472 -0.2008929 -0.2613648 -0.2094872 -0.1301087 -0.1142891
##  [2,] -1.7037486 -0.6931472 -0.5091603 -0.3243461 -0.2970592 -0.3065252
##  [3,] -1.4696760 -0.9187939 -0.6931472 -0.5780344 -0.3065252 -0.3915622
##  [4,] -1.6660083 -1.2837378 -0.8232559 -0.6931472 -0.5780344 -0.5310283
##  [5,] -2.1037342 -1.3586792 -1.3318062 -0.8232559 -0.6931472 -0.7072461
##  [6,] -2.2256241 -1.3318062 -1.1270118 -0.8867319 -0.6792443 -0.6931472
##  [7,] -2.2926348 -1.6660083 -1.8643302 -1.1270118 -0.8533159 -0.9888614
##  [8,] -2.2256241 -1.8643302 -1.5945493 -0.9187939 -1.2344320 -1.1457039
##  [9,] -2.6036902 -1.9519282 -1.7037486 -1.3093333 -1.4439235 -0.9888614
##          S.Beans       Peas        Corn
##  [1,] -0.1064722 -0.1142891 -0.07688104
##  [2,] -0.2094872 -0.1684187 -0.15315118
##  [3,] -0.1684187 -0.2269006 -0.20089294
##  [4,] -0.3915622 -0.5091603 -0.31471074
##  [5,] -0.5551259 -0.3438998 -0.26918749
##  [6,] -0.4652151 -0.3827256 -0.46521511
##  [7,] -0.6931472 -0.6405547 -0.44316698
##  [8,] -0.7486599 -0.6931472 -0.46521511
##  [9,] -1.0272223 -0.9888614 -0.69314718

This is where writing your own functions comes in handy. I want to get back both the mean and SD of each column:

meanAndSD <- function(aVec){
    aMean <- mean(aVec)
    aSD <- sd(aVec)
    return(c(mean=aMean, sd=aSD))
}

vegMeanAndSD <- sapply(veg[, -dim(veg)[2]], meanAndSD)
vegMeanAndSD
##           Turn       Cab     Beet       Asp       Car      Spin   S.Beans
## mean 0.1793333 0.3334444 0.382000 0.4932222 0.5420000 0.5496667 0.6404444
## sd   0.1304751 0.2150704 0.211109 0.1786405 0.2134174 0.1909908 0.1841040
##           Peas      Corn
## mean 0.6583333 0.7215556
## sd   0.1729855 0.1344016

Save that clean data

Let’s pretend that the log transformed data is our ‘cleaned’ data, and we want to save it for later (’cause log transforming takes SOOO LOOOONG):

write.csv(logVeg, file='logTransVegetables.csv')
#to read it later:
read.csv('logTransVegetables.csv')
##   X       Turn        Cab       Beet        Asp        Car       Spin
## 1 1 -0.6931472 -0.2008929 -0.2613648 -0.2094872 -0.1301087 -0.1142891
## 2 2 -1.7037486 -0.6931472 -0.5091603 -0.3243461 -0.2970592 -0.3065252
## 3 3 -1.4696760 -0.9187939 -0.6931472 -0.5780344 -0.3065252 -0.3915622
## 4 4 -1.6660083 -1.2837378 -0.8232559 -0.6931472 -0.5780344 -0.5310283
## 5 5 -2.1037342 -1.3586792 -1.3318062 -0.8232559 -0.6931472 -0.7072461
## 6 6 -2.2256241 -1.3318062 -1.1270118 -0.8867319 -0.6792443 -0.6931472
## 7 7 -2.2926348 -1.6660083 -1.8643302 -1.1270118 -0.8533159 -0.9888614
## 8 8 -2.2256241 -1.8643302 -1.5945493 -0.9187939 -1.2344320 -1.1457039
## 9 9 -2.6036902 -1.9519282 -1.7037486 -1.3093333 -1.4439235 -0.9888614
##      S.Beans       Peas        Corn
## 1 -0.1064722 -0.1142891 -0.07688104
## 2 -0.2094872 -0.1684187 -0.15315118
## 3 -0.1684187 -0.2269006 -0.20089294
## 4 -0.3915622 -0.5091603 -0.31471074
## 5 -0.5551259 -0.3438998 -0.26918749
## 6 -0.4652151 -0.3827256 -0.46521511
## 7 -0.6931472 -0.6405547 -0.44316698
## 8 -0.7486599 -0.6931472 -0.46521511
## 9 -1.0272223 -0.9888614 -0.69314718

Some notes on EDA

What you want to do is look at all your data, however you can make that happen.

We could use sapply to get a histogram for every column of the veg data:

sapply(veg[, -dim(veg)[2]], hist)

##          Turn      Cab       Beet      Asp       Car       Spin     
## breaks   Numeric,6 Numeric,9 Numeric,8 Numeric,8 Numeric,8 Numeric,7
## counts   Integer,5 Integer,8 Integer,7 Integer,7 Integer,7 Integer,6
## density  Numeric,5 Numeric,8 Numeric,7 Numeric,7 Numeric,7 Numeric,6
## mids     Numeric,5 Numeric,8 Numeric,7 Numeric,7 Numeric,7 Numeric,6
## xname    "X[[1L]]" "X[[2L]]" "X[[3L]]" "X[[4L]]" "X[[5L]]" "X[[6L]]"
## equidist TRUE      TRUE      TRUE      TRUE      TRUE      TRUE     
##          S.Beans   Peas      Corn     
## breaks   Numeric,7 Numeric,7 Numeric,6
## counts   Integer,6 Integer,6 Integer,5
## density  Numeric,6 Numeric,6 Numeric,5
## mids     Numeric,6 Numeric,6 Numeric,5
## xname    "X[[7L]]" "X[[8L]]" "X[[9L]]"
## equidist TRUE      TRUE      TRUE

A preview of things to come: ggplot2 is awesome, and someone made an EDA package using it that does this:

library(GGally) #install.packages('GGally')
data(iris)
ggpairs(iris)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
## Warning: position_stack requires constant width: output may be incorrect
## Warning: position_stack requires constant width: output may be incorrect
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Using some of the functions we’ll talk about next week, you can automate a lot of the creation of custom plots and analyses (like, if you want to look at ICCs for all of your variables).

On deck

  • Next week: dplyr, tidyr for manipulating your data (’cause sometimes you gotta).
  • Future times: You try!, Massively awesome plotting, ????



R Club: An Introduction to R

Tomorrow’s R Club meeting will start out with a presentation from Rose Maier. She’ll be showing us a demonstration of Bayes’ Theorem that she programmed in R and some plotting in ggplot2 (which is a nice alternative to the basic plotting we demonstrated last week).

After Rose’s presentation John will go over some basic (but more advanced than last week) R programming. Topics will include any/all of the following: functions and arguments, subsetting data, understanding the structure of data files, reading and writing .csv files, different ways of doing exploratory data analysis, more about plotting and ggplot2, and we’ll take a look at for loops again. We’re always happy to spend more or less time on any of these topics (or add others) depending on the preferences/needs of the group!

For R Club veterans or more advanced users, we’d love your input. Would you like to show us how to do a certain type of analysis in R? Is there something that you think would have been especially helpful for you when you were starting out with R? Are there other more advanced topics that you’d like to discuss? Please let us know!

Hope to see you all tomorrow!

Jessica (I wrote this announcement – typos/awkwardness/mistakes are all my fault)

& John

 

Why should you learn R?

To all you folks who just joined us for the first time: Welcome, and thanks for making such a strong showing. Just in case we did not evangelize hard enough, here’s some info to encourage you to come back for week 2 and beyond.

Why should you learn R? A quick Google will reveal that many people have answered this question. My aim here is just to say quickly that it will help you conceive of, execute, and communicate your analyses. And it will greatly facilitate your ability to engage in reproducible science. Plus, the graphics are AWESOME.

To be further convinced, check out Why Use R over at inside-R.

Welcome to the wonderful world of R!

Getting started

Slides on R basics (including installation):
www.ats.ucla.edu/stat/r/seminars/intro.htm (you may want to toggle the format using the button in the bottom right)

The way to learn R is by using it, so jump into some practice problems. You can use the “practice with datasets” code copied below, or download swirl and play around with that. Or both!

Learning with Swirl

To install swirl, first install R if you haven’t already (see instructions in the slides above), and open it. In the command line, type
install.packages("swirl")
and hit Enter. You need a working internet connection. Once R has installed the package, you also need to load it. Type
library(swirl)
and hit Enter. Once you do that, swirl will take over and start giving you instructions (and peppy feedback!) to take you though the basics of R. Have fun!

Practice with datasets

This is some code for you to work through to practice using datasets in R.

This coordinates (roughly) with the Intro to R slides from UCLA: http://www.ats.ucla.edu/stat/r/seminars/intro.htm

Download the relevant datasets from dropbox:
https://www.dropbox.com/s/cw5hrw62exn3lvx/rclub1_data1.txt
https://www.dropbox.com/s/zvw58hhel8bu3gh/rclub1_data2.sav

Make a note of where these data files get saved when you download them (your downloads folder, maybe? Or the desktop?). You’ll need to know where they are to be able to access them from R.

Note: this is meant to provide practice moving data in and out of R. Suggested functions are provided, but you need to fill in the arguments, etc. For help on how to use the functions, enter ?function.name in the command window (e.g. ?read.table).

To save your work, I recommend copying your code into an R script and working from there. In R Studio, go to File > New File > R Script. In R, go to File > New Document. In either case, this will open up a blank text file. You can copy-paste this code there, and then fill in your answers as you work through the problems. Any line of text that begins with # is a comment.

First, check out the file: open rclub1_data1.txt in a text editor or Excel. You’ll see it’s tab-delimited. Now we want to bring it into R so we can use it. Before we tell R which file to open, we need to know what folder R is currently set to look in (the working directory). If you’re trying unsuccessfully to open a file, the first thing you should check is that your wd is correct.

getwd()
## [1] "/Users/TARDIS/Dropbox/RClub/rclub_code"
# if necessary, change wd so it matches the folder where you've got your data files
setwd("/Users/TARDIS/Dropbox/RClub")

# we want to read in the data file, so first you need to learn about the appropriate function (read.table) so you know how to structure the command. You'll want to do this for each new function you use.
?read.table

# now we can tell R to open the file and save it as an object called df (short for data frame). You could name it whatever you want, though. The name of the file ("rclub1_data1.txt") must be in quotes, so R knows to look for a string matching that rather than for an existing object with that name.
df <- read.table("rclub1_data1.txt", header = T, sep = "\t")

# look at the first part of your dataframe, to eyeball the data.
head(df)
##   gender HowSmartTheyAre HowManyPointsTheyGot DidTheyEatBreakfast
## 1      M              95                    3                   Y
## 2      F              85                    3                   Y
## 3      F             114                    1                   Y
## 4      M             108                    2                   Y
## 5      M              98                    3                   Y
## 6      F              86                    3                   Y
# check the last chunk of the dataset, too, just because you're curious.
tail(df)
##    gender HowSmartTheyAre HowManyPointsTheyGot DidTheyEatBreakfast
## 45      F              87                    5                   N
## 46      M             108                    2                   N
## 47      F              91                    3                   N
## 48      M              95                    4                   N
## 49      F             112                    1                   N
## 50      F             104                    1                   N
# actually, let's look at it in a new window. 
View(df)

# get basic information about your dataframe (dimensions, variable types, etc.)
dim(df)
## [1] 50  4
str(df)
## 'data.frame':    50 obs. of  4 variables:
##  $ gender              : Factor w/ 2 levels "F","M": 2 1 1 2 2 1 2 1 2 1 ...
##  $ HowSmartTheyAre     : int  95 85 114 108 98 86 85 107 100 97 ...
##  $ HowManyPointsTheyGot: int  3 3 1 2 3 3 3 5 2 3 ...
##  $ DidTheyEatBreakfast : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 1 ...
# what are the variable names (column names)?
colnames(df)
## [1] "gender"               "HowSmartTheyAre"      "HowManyPointsTheyGot"
## [4] "DidTheyEatBreakfast"
# you don't like these column names. rename several of the variables.
colnames(df) <- c("gender", "IQ", "Score", "Breakfast?")

# rename just one variable: change "gender" to "male".
colnames(df)[1] <- "male"


# change the coding scheme for "male" from M/F to Y/N. Remember that "male" is currently a factor variable. Use str() and View() to check how the coding scheme is applied when you change the levels of the variable.
levels(df$male) # check what they are originally first
## [1] "F" "M"
levels(df$male) <- c("N","Y") # change them to what you want

# you notice a typo (the first subject should actually be female, not male). damn RAs and their sloppy data entry! edit just that cell.
df$male[1] <- "N"

# save this dataframe in csv format, so you can open it in other software:
# write.csv()

# open that csv file back in R, because you're fickle. (note that you can also just use read.table to read csv files - they're actually the same function, just with different defaults)
# read.csv()

# and now save it again, but this time as tab-delimited. Or try using other delimiters, if you like (e.g. spaces).
# write.table()

# open the file rclub1_data2.sav (an SPSS file). first, learn about read.spss:
?read.spss
## No documentation for 'read.spss' in specified packages and libraries:
## you could try '??read.spss'
# if R can't find the function, use two ?'s to conduct a broader search:
??read.spss

#read.spss is in the package "foreign". do you have the foreign package attached already?
sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.4     evaluate_0.5.5   formatR_0.10     htmltools_0.2.4 
## [5] knitr_1.6        rmarkdown_0.3.10 stringr_0.6.2    tools_3.0.2
# if not, load the foreign library.
library(foreign)

# now try ?read.spss again, then open the file rclub1_data2.sav
df <- read.spss("rclub1_data2.sav")
## Warning: rclub1_data2.sav: Unrecognized record type 7, subtype 18
## encountered in system file
## re-encoding from latin1
# check out the dataframe to see what variables you have, how many cases you have, etc. You can use the same functions you used before.

# add a new column on the end with average test score by taking the mean of each person's scores for the three tests
df$TestAve <- (df$Test1 + df$Test2 + df$Test3)/3
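
An equivalent sketch using rowMeans (an addition, not from the original). Note that read.spss() returns a list by default, so pass to.data.frame = TRUE to get a data frame first:

df <- read.spss("rclub1_data2.sav", to.data.frame = TRUE)
df$TestAve <- rowMeans(df[, c("Test1", "Test2", "Test3")])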