Category: Beginners

Resources and tutorials for those just starting.

Dealing with “missing”/out of bounds values in heatmaps

I was tinkering around in R to see if I could plot better looking heatmaps, when I encountered an issue regarding how specific values are represented in plots with user-specified restricted ranges. When I’m plotting heatmaps, I usually want to restrict the range of the data being plotted in a specific way. In this case, I was plotting classifier accuracy across frequencies and time of Fourier-transformed EEG data, and I want to restrict the lowest value of the heatmap to be chance level (in this case, 50%).

I used the following code to generate a simple heatmap:

decoding <- readMat("/Users/Pablo/Desktop/testDecoding.mat")$plotData
plotDecoding <- melt(decoding)

quartz(width=5,height=4)
ggplot(plotDecoding, aes(x = Var2, y = Var1)) +
geom_tile(aes(fill = value)) +
scale_fill_viridis(name = "",limits = c(0.5,0.75))

And this is what I got:

heatmap1

See the gray areas? What’s happening here is that any data points that fall outside of my specified data range (see the limits argument in scale_fill_viridis) are being converted to NAs. Apparently, NA values are plotted as gray by default. Obviously these values are not missing in the actual data; how can we actually ensure that they're plotted? The solution comes from the scales package that comes with ggplot. Whenever you create a plot with specified limits, include the argument oob = squish (oob = out of bounds) in the same line where you set the limits (make sure that the scales package is loaded). Ex:

scale_fill_viridis(name = "",limits = c(0.5,0.75),oob=squish)

What this does is that values that fall below the specified range are represented as the minimum value (color, in this case), and values that fall above the range are represented as the maximum value (which seems to be what Matlab defaults to). The resulting heatmap looks like this:

heatmap2

A simple solution to an annoying little quirk in R!

ANOVA + Contrasts in R

The linked Dropbox file has code and data files for doing contrasts and ANOVA in R.

https://www.dropbox.com/sh/132z6stjuaapn4c/AAB8TZoNIck5FH395vRpDYPPa?dl=0



R Club 11.3.15 ANOVA Contrasts




# set the working directory
setwd("/Users/jessicakosie/Dropbox/RClub_11.3.15")

# clear the environment
rm(list=ls())

Planned Contrasts in R

# read in Data Set #1

data <- read.csv("RClub_DataSet1_11.3.15.csv", header = TRUE)

# get descriptives
library(psych)
## Warning: package 'psych' was built under R version 3.2.2
describeBy(data$score, data$animal, mat=TRUE)
##    item   group1 vars  n mean        sd median trimmed    mad min max
## 11    1   Cougar    1 10  3.0 1.1547005    3.0   3.000 1.4826   1   5
## 12    2      Dog    1 10  9.1 0.9944289    9.0   9.250 1.4826   7  10
## 13    3 HouseCat    1 10  1.6 0.6992059    1.5   1.500 0.7413   1   3
## 14    4     Wolf    1 10  6.9 1.1972190    7.0   6.875 1.4826   5   9
##    range       skew   kurtosis        se
## 11     4  0.0000000 -0.9750000 0.3651484
## 12     3 -0.7809801 -0.5931107 0.3144660
## 13     2  0.5616761 -1.0787603 0.2211083
## 14     4  0.1678308 -1.1806760 0.3785939
# run an omnibus ANOVA

model <- aov(score ~ animal, data = data)
summary(model)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## animal       3  358.9  119.63   112.7 <2e-16 ***
## Residuals   36   38.2    1.06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# look at the levels of our factor

levels(data$animal)
## [1] "Cougar"   "Dog"      "HouseCat" "Wolf"
# tell R which groups to compare
c1 <- c(.5, -.5, .5, -.5) # canines vs. felines
c2 <- c(1, 0, -1, 0) # cougars vs. housecats
c3 <- c(0, 1, 0, -1) # dogs vs. wolves

# combined the above 3 lines into a matrix
mat <- cbind(c1,c2,c3)

# tell R that the matrix gives the contrasts you want
contrasts(data$animal) <- mat

# these lines give you your results
model1 <- aov(score ~ animal, data = data)

# can just look at the omnibus ANOVA (same as above)
summary(model1)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## animal       3  358.9  119.63   112.7 <2e-16 ***
## Residuals   36   38.2    1.06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Make sure to use summary.aov here or 'split' might not work
summary.aov(model1, split=list(animal=list("Canines vs. Felines"=1, "Cougars vs House Cats" = 2, "Wolves vs Dogs"=3))) 
##                                 Df Sum Sq Mean Sq F value   Pr(>F)    
## animal                           3  358.9   119.6 112.743  < 2e-16 ***
##   animal: Canines vs. Felines    1  324.9   324.9 306.188  < 2e-16 ***
##   animal: Cougars vs House Cats  1    9.8     9.8   9.236   0.0044 ** 
##   animal: Wolves vs Dogs         1   24.2    24.2  22.806 2.98e-05 ***
## Residuals                       36   38.2     1.1                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Polynomial Contrasts

Plot the means

Before we run an ANOVA, let’s obtain a plot to get an idea of how the sample means differ.
Check out this useful tutorial: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/

# clear the environment

rm(list=ls())

# read in Data Set #2

data <- read.csv("RClub_DataSet2_11.3.15.csv", header = TRUE)

# Lets take a look at the data - do we see a linear trend?

# Since we're not plotting the data themselves but rather means, first we need to calculate that, so we can feed it into the plot.
# To do that, we'll use a few functions from the dplyr package. Check out this tutorial: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

library(dplyr) 
## Warning: package 'dplyr' was built under R version 3.2.2
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
groups <- group_by(data, dose) # this just prepares it for us to calculate eveyrthing within each condition
plot.data <- summarise(groups,
  mean = mean(score, na.rm=TRUE),
  sd = sd(score, na.rm=TRUE),
  n = n(),
  se=sd/sqrt(n),
  ci = qt(0.975, df=n-1)*se)
plot.data # take a peek
## Source: local data frame [3 x 6]
## 
##     dose  mean        sd     n        se        ci
##   (fctr) (dbl)     (dbl) (int)     (dbl)     (dbl)
## 1   15ml  3.55 0.4972145    10 0.1572330 0.3556858
## 2   25ml  4.05 0.3689324    10 0.1166667 0.2639183
## 3   35ml  4.35 0.5296750    10 0.1674979 0.3789066
# ready to plot!
# line graph
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:psych':
## 
##     %+%
ggplot(plot.data, aes(x=dose, y=mean, group = factor(1))) +
  geom_line() +
  geom_point() + 
  geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci), width=.1) +
  ggtitle("Figure 1. Mean Score by Dose")

Polynomial Contrasts

For additional info on setting up contrasts (beyond the scope of this lab), check out the ever useful UCLA stats walkthrough

# run an ANOVA 

results <- aov(score ~ dose, data=data)
summary.aov(results)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## dose         2  3.267  1.6333   7.381 0.00277 **
## Residuals   27  5.975  0.2213                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Run a Polynomial Contrast

# Here's one way - tell R which groups to compare. These are the orthogonal polynomial contrasts. 
c1 <- c(-1, 0, 1)
c2 <- c(0.5, -1, 0.5)

# combined the above 2 lines into a matrix
mat <- cbind(c1,c2)

# tell R that the matrix gives the contrasts you want
contrasts(data$dose) <- mat

# these lines give you your results
model1 <- aov(score ~ dose, data = data)
summary.aov(model1, split=list(dose=list("Linear"=1, "Quadratic" = 2)))
##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## dose               2  3.267   1.633   7.381 0.002773 ** 
##   dose: Linear     1  3.200   3.200  14.460 0.000744 ***
##   dose: Quadratic  1  0.067   0.067   0.301 0.587607    
## Residuals         27  5.975   0.221                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Try it using built-in contr.poly as well:
# we tell R to get a polynomial contrast matrix for 3 conditions
contrasts(data$dose) <- contr.poly(3)

# these lines give you your results
model2 <- aov(score ~ dose, data = data)
summary.aov(model2, split=list(dose=list("Linear"=1, "Quadratic" = 2)))
##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## dose               2  3.267   1.633   7.381 0.002773 ** 
##   dose: Linear     1  3.200   3.200  14.460 0.000744 ***
##   dose: Quadratic  1  0.067   0.067   0.301 0.587607    
## Residuals         27  5.975   0.221                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Run a TukeyHSD Test

TukeyTest <- TukeyHSD(results)
TukeyTest
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = score ~ dose, data = data)
## 
## $dose
##           diff         lwr      upr     p adj
## 25ml-15ml  0.5 -0.02161703 1.021617 0.0621938
## 35ml-15ml  0.8  0.27838297 1.321617 0.0020820
## 35ml-25ml  0.3 -0.22161703 0.821617 0.3420861

Factorial ANOVA

# clear the environment

rm(list=ls())

# read in Data Set #3

data <- read.csv("RClub_DataSet3_11.3.15.csv", header = TRUE)

First, get a plot of your means, so you can see what’s going on.

# I'm choosing to plot with instruction type across the x-axis and grouped by age. It would also be fine to do it the other way around.

library(dplyr) 
groups <- group_by(data, instructions, age) # this just prepares it for us to calculate everything within each combination of instructions  and age
plot.data <- summarise(groups,
  mean = mean(score, na.rm=TRUE),
  sd = sd(score, na.rm=TRUE),
  n = n(),
  se=sd/sqrt(n),
  ci = qt(0.975,df=n-1)*se)

plot.data # take a peek
## Source: local data frame [6 x 7]
## Groups: instructions [?]
## 
##   instructions    age     mean       sd     n        se       ci
##         (fctr) (fctr)    (dbl)    (dbl) (int)     (dbl)    (dbl)
## 1         both    old 21.16667 4.262237     6 1.7400511 4.472944
## 2         both  young 26.16667 4.167333     6 1.7013067 4.373348
## 3       verbal    old 23.50000 3.209361     6 1.3102163 3.368018
## 4       verbal  young 23.00000 2.366432     6 0.9660918 2.483418
## 5      written    old 13.66667 3.502380     6 1.4298407 3.675523
## 6      written  young 22.00000 4.774935     6 1.9493589 5.010986
# ready to plot!
# bar graph
# check this out: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/
library(ggplot2)
ggplot(plot.data, aes(x=instructions, y=mean, fill = age )) +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci), width=.2, position=position_dodge(.9)) +
  ggtitle("Figure 1. Mean score \nby instruction type and age")

### Run a 2 X 3 ANOVA on the data.

# The best way to run this is actually with the lm() command, not aov(). It stands for "linear model". ANOVAs, regressions, t-tests, etc. are all examples of the general linear model, so you can use this one command to do pretty much any of them in R.

# aov() works, and it will generate exactly the same source table for you (the math is all identical), but lm() gives you more useful output.

model <- lm(score ~ instructions*age , data=data) 

# When you specify an interaction with *, R automatically assumes you want the main effects as well. 
# So "instructions*age" is shorthand for "instructions + age + instructions*age".

anova(model) # to get a source table
## Analysis of Variance Table
## 
## Response: score
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## instructions      2 254.17 127.083  8.8150 0.0009741 ***
## age               1 164.69 164.694 11.4239 0.0020278 ** 
## instructions:age  2 119.39  59.694  4.1407 0.0258236 *  
## Residuals        30 432.50  14.417                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# We can discuss calculating simple effects in a later R Club