Category: nuggets

February 3, 2015

Generate Random Correlated Data

library(MASS)
covmat <- matrix(c(1,.5,.6, .5,1,.5, .6,.5,1), ncol=3) # the true cov matrix for my data
data <- mvrnorm(100, mu=c(0,0,0), Sigma=covmat) # generate random data that match that cov matrix
colnames(data) <- c("X1", "X2", "X3")

library(knitr)
kable(round(cor(data), 3),format = "pandoc", caption = "Table 1: Correlation Matrix")

Table 1: Correlation Matrix
	X1	X2	X3
X1	1.000	0.560	0.555
X2	0.560	1.000	0.518
X3	0.555	0.518	1.000

February 2, 2015

Dealing with messy data in R (Dplyr, Tidyr, Jsonlite, and Facebook API)

For Tuesday’s R Club meeting, I’d like to go over two related things if there’s time to get through both:

Messy data: Rearranging messy datasets with Dplyr and Tidyr
Really, really, really messy data: Using the Facebook API

My Facebook code is not completely done, but it is functional and I think it would be good for anyone who is not familiar with API’s to see how you get and work with a real dataset. It comes in Json format (like XML; tableless), so it’s pretty exciting. I will post the Facebook code once I’ve finished tuning it up in the next week or two.

An overview of data wrangling in R

RMD file. Based on a recent presentation by Hadley Wickham. You’ll definitely want to get the Data Wrangling Cheat Sheet. Here’s the video:

library(RCurl, warn = FALSE)

## Loading required package: bitops

# get tb data from Hadley Wickham's github
myData <- getURL("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/tb.csv", ssl.verifypeer = FALSE) 
tbdata <- read.csv(textConnection(myData)) 
head(tbdata, n=7)

##   iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04 f514
## 1   AD 1989  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 2   AD 1990  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 3   AD 1991  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 4   AD 1992  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 5   AD 1993  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 6   AD 1994  NA   NA   NA    NA    NA    NA    NA    NA  NA NA  NA   NA
## 7   AD 1996  NA   NA    0     0     0     4     1     0   0 NA  NA   NA
##   f014 f1524 f2534 f3544 f4554 f5564 f65 fu
## 1   NA    NA    NA    NA    NA    NA  NA NA
## 2   NA    NA    NA    NA    NA    NA  NA NA
## 3   NA    NA    NA    NA    NA    NA  NA NA
## 4   NA    NA    NA    NA    NA    NA  NA NA
## 5   NA    NA    NA    NA    NA    NA  NA NA
## 6   NA    NA    NA    NA    NA    NA  NA NA
## 7    0     1     1     0     0     1   0 NA

tidyr

Messy -> Tidy Data

library(tidyr)
library(dplyr, warn = FALSE)

# gather and separate the data to make it "tidy"
tb2 <- tbdata %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE) %>% 
  separate(demo, c("sex","age"), 1)
head(tb2, n=7)

##   iso2 year sex age n
## 1   AD 2005   m  04 0
## 2   AD 2006   m  04 0
## 3   AD 2008   m  04 0
## 4   AE 2006   m  04 0
## 5   AE 2007   m  04 0
## 6   AE 2008   m  04 0
## 7   AG 2007   m  04 0

dplyr

Manipulate data

# rename variables and sort observations (arrange)
tb3 <- tb2 %>%
  rename(country = iso2) %>%
  arrange(country, year, sex, age)
head(tb3, n=7)

##   country year sex  age n
## 1      AD 1996   f  014 0
## 2      AD 1996   f 1524 1
## 3      AD 1996   f 2534 1
## 4      AD 1996   f 3544 0
## 5      AD 1996   f 4554 0
## 6      AD 1996   f 5564 1
## 7      AD 1996   f   65 0

more examples

demo(package = "tidyr") # produces a list of all of the demos in package 'tidyr'
demo('so-15668870', package = "tidyr", ask = FALSE)

## 
## 
##  demo(so-15668870)
##  ---- ~~~~~~~~~~~
## 
## > # http://stackoverflow.com/questions/15668870/
## > library(tidyr)
## 
## > library(dplyr)
## 
## > grades <- tbl_df(read.table(header = TRUE, text = "
## +    ID   Test Year   Fall Spring Winter
## +     1   1   2008    15      16      19
## +     1   1   2009    12      13      27
## +     1   2   2008    22      22      24
## +     1   2   2009    10      14      20
## +     2   1   2008    12      13      25
## +     2   1   2009    16      14      21
## +     2   2   2008    13      11      29
## +     2   2   2009    23      20      26
## +     3   1   2008    11      12      22
## +     3   1   2009    13      11      27
## +     3   2   2008    17      12      23
## +     3   2   2009    14      9       31
## + "))
## 
## > grades %>%
## +   gather(Semester, Score, Fall:Winter) %>%
## +   mutate(Test = paste0("Test", Test)) %>%
## +   spread(Test, Score) %>%
## +   arrange(ID, Year, Semester)
## Source: local data frame [18 x 5]
## 
##    ID Year Semester Test1 Test2
## 1   1 2008     Fall    15    22
## 2   1 2008   Spring    16    22
## 3   1 2008   Winter    19    24
## 4   1 2009     Fall    12    10
## 5   1 2009   Spring    13    14
## 6   1 2009   Winter    27    20
## 7   2 2008     Fall    12    13
## 8   2 2008   Spring    13    11
## 9   2 2008   Winter    25    29
## 10  2 2009     Fall    16    23
## 11  2 2009   Spring    14    20
## 12  2 2009   Winter    21    26
## 13  3 2008     Fall    11    17
## 14  3 2008   Spring    12    12
## 15  3 2008   Winter    22    23
## 16  3 2009     Fall    13    14
## 17  3 2009   Spring    11     9
## 18  3 2009   Winter    27    31

demo('dadmom', package = "tidyr", ask = FALSE)

## 
## 
##  demo(dadmom)
##  ---- ~~~~~~
## 
## > library(tidyr)
## 
## > library(dplyr)
## 
## > dadmom <- foreign::read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
## 
## > dadmom %>%
## +   gather(key, value, named:incm) %>%
## +   separate(key, c("variable", "type"), -2) %>%
## +   spread(variable, value, convert = TRUE)
##   famid type   inc name
## 1     1    d 30000 Bill
## 2     1    m 15000 Bess
## 3     2    d 22000  Art
## 4     2    m 18000  Amy
## 5     3    d 25000 Paul
## 6     3    m 50000  Pat

more resources

vignette("tidy-data")

January 20, 2015

611 Highlights in R (Part II)

The raw .R file is here and the Rmd file is here.

Brief ‘Hello’ to R Markdown

Let me first introduce you to R markdown and kitr. This format allows you to integrate your analyses with text that can be automatically translated to html, word, and pdf formats. I just produced a supplementary materials document using this and it probably saved me a ridiculous number of hours. It’s easy to get started:

Save your file with extension .Rmd.

Place something like the following in the very top of your file:

---
author: John Flournoy
title: 611 Highlights in R (Part II)
output: html_document
---

Write as you normally would, using Markdown syntax for formatting. For example:

# This is a big heading

1. this is
2. a numbered list

- this is a bulleted list
- With **bold** and *italic*  text

When you want to throw in some R code, place it inside an R chunk by wrapping it in ```{r} at the beginning and ``` at the end. When you do, it comes out something like this:

# From Rose's presentation:
# a toy data set for us to work with
n <- 100
gender <- as.factor(c(rep("M", n/2), rep("F", n/2)))
IQ <- rnorm(n, mean=100, sd=15)
degree <- rep(c("HS", "BA", "MS", "PhD"), n/4)
height <- as.numeric(gender)-2 + rnorm(n, mean=5.5, sd=1)
RT1 <- rchisq(n, 4)
RT2 <- rchisq(n, 4)
DV <- 50 - 5*(as.numeric(gender)-1) + .1*IQ + rnorm(n)

df <- data.frame(awesome=DV, gender=gender, IQ=IQ, degree=degree, height=height, RT1=RT1, RT2=RT2)
head(df)

##    awesome gender        IQ degree   height      RT1      RT2
## 1 52.89486      M  79.40263     HS 6.292706 1.761990 2.228117
## 2 57.04782      M  96.31289     BA 4.759042 2.236027 7.363253
## 3 55.59594      M 104.95331     MS 5.560292 5.159114 7.126281
## 4 55.44397      M  84.23347    PhD 5.354595 7.652900 2.836956
## 5 55.45191      M 107.38754     HS 4.523997 7.768904 4.504664
## 6 54.70468      M 110.07096     BA 6.139368 7.786462 7.508488

Notice that in 4, the we see the output of the command head(df). This is useful for outputting tables and plots, as we’ll see.

Here are more resources on

Streamlined Data Workflow with dplyr

Download the data here:

Make sure you have the package (library(dplyr)), and if you don’t, install it (install.packages('dplyr')). Here’s an intro to dplyr straight from the horse’s (Hadley’s) mouth.

By way of data manipulation that I’m actually using for a real project, this tutorial will introduce you to

%>%
- a function chaining operator
mutate and transmute
- adds new variables and preserves existing (transmute drops existing variables)
mutate_each
- adds new variables, applying functions in funs(...) to each column
group_by
- groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
summarise
- summarizes multiple values to a single value
left_join
- joins two data frames, returning all rows from x, and all columns from x and y
select
- keeps the variables/columns that you mention (see ?select for special functions that help you select variables)
filter
- returns rows with matching conditions (similar to selecting columns like: aDataFrame[variable == value,])
group_by
- groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
summarise
- summarizes multiple values to a single value
do
- use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list (e.g., run the same lm or t.test separately for each gender, country)

Continue reading →

January 20, 2015

Biased and unbiased estimates

(This was really little more than an excuse to get me to learn a little about Markdown in R. I have been toying with the idea of trying to move the 302 labs away from SPSS and into R. One such advantage could be the relative ease of employing simulations in R to to illustrate some of the concepts that students often report as challenging. The distinction between biased and unbiased estimates was something that students questioned me on last week, so it’s what I’ve tried to walk through here.)

In 302, we teach students that sample means provide an unbiased estimate of population means. Of course, this doesn’t mean that sample means are PERFECT estimates of population means. We should expect that the mean of any sample, no matter how representative, will differ a bit from the mean of the population from which it was drawn. This is sampling error. We say the sample mean is an unbiased estimate because it doesn’t differ systemmatically from the population mean–samples with means greater than the population mean are as likely as samples with means smaller than the population mean.

Let’s simulate this.

Creating a population

First, we need to create a population of scores. Let’s use IQ scores as an example. Here we’ll draw a million scores from a normal distribution with a mean of 100 and a standard deviation of 15. We’ll plot it as a histogram to show that everything is nice and normal.

# Define a few settings for our work today
popMean = 100; # The mean of our population
popSD = 15; # The standard deviation of our population
sampSize = 30; # The size of the samples we'll draw from the population

population <- rnorm(1000000, mean = popMean, sd = popSD)
hist(population)

Just to double check and make sure that R is doing its thing like it should, we can check some descriptive statistics for this population. If things have worked, these values should be pretty darn close to μ = 100 and σ = 15.

mean(population)

## [1] 100.0175

sd(population)

## [1] 14.99739

Yep.

Sample means are unbiased estimates of population means

Now, we need to create a sampling distribution. We will draw a sample from this population and find its mean. Then, we do that same thing over and over again a whole mess ’a times. This distribution of sample means is a sampling distribution.

Let’s give it a whirl. We’ll start with a little snippet that just gives us a single sample of 30 participants, using random sampling with replacement. When we take a peek at its mean, we should find that it is pretty darn close to 100, but with a little sampling error.

sampChamp <- sample(population, sampSize, replace=TRUE)
mean(sampChamp)

## [1] 96.36024

A quick aside: Just how close should it be to the population mean? Well, the expected deviation between any sample mean and the population mean is estimated by the standard error: σ²_M = σ / √(n). So, in this case, we’d have a σ²_M = 15 / √30 = 2.7386128

Let’s return to our simulation. We’ll now draw a whole bunch of samples and enter their means into a sampling distribution.

sampDist <- replicate(10000,mean(sample(population, sampSize, replace=TRUE)))
hist(sampDist)

Let’s get a few descriptive statistics here, too.

mean(sampDist)

## [1] 99.96665

sd(sampDist)

## [1] 2.743161

Note that the mean of the sampling distribution is a pretty darn good match for the original population mean. So, looky there, the sample mean is an unbaised estimator! Replications of the sampling procedure yield means that are just as likely to be above the population mean as below (in a symmetrical distribution like this, the mean and median are pretty much the same. If you’re interested, the actual value for the median is 99.950235. Also: The standard deviation of the sampling distribution is the standard error, so we see a pretty good match between the SD here (2.7431614) and the standard error calculated earlier (2.7386128).

Uncorrected sample standard deviations are biased estimates of population standard deviations

Okay, let’s put together a different sampling distribution. Rather than collecting means from each sample we’ll collect uncorrected sample standard deviations. This is the square root of the sample variance, where the sample variance is the sum of the squared deviations from the mean divided by the sample size (SS/n).

R uses the corrected sample standard devaition (and variance), by default. So I use some cheap tricks to get it to give me the uncorrected standard deviation.

sdDist <- replicate(10000,(((sqrt(var(sample(population, sampSize, replace=TRUE)))*(sampSize-1))/(sampSize))))
hist(sdDist)

Let’s grab a mean of this sampling distribution of standard deviations.

mean(sdDist)

## [1] 14.35129

Notice that it is an underestimate of the population variance. The deviation between this estimate (14.3512925) and the true population standard deviation (15) is 0.6487075. Uncorrected sample standard deviations are systemmatically smaller than the population standard deviations that we intend them to estimate.

Now, let’s try it again with the corrected sample standard deviation.

sdDist_2 <- replicate(10000,sd(sample(population, sampSize, replace=TRUE)))
hist(sdDist_2)

And its mean.

mean(sdDist_2)

## [1] 14.88884

Here, the deviation between the estimate from the sample(14.8888407) and the true population standard deviation (15) is reduced to 0.1111593. Sample standard deviations are less systemmatically underrepresenting the population standard deviation.

January 20, 2015

DIY functions

See also here.
You can write and save your own functions in R, which is very handy for automating a series of commands you do often. It can also make your code much more transparent, which is great news for anyone trying to understand your scripts (including Future You). Here’s how it works:

Give your new function a name.
Define the arguments.
Spell out the code you want R to run each time you call your function.
Tell it what output you want from it.

Here’s the rough structure to follow:

myfunction = function(arg1, arg2, ... ){
doing...
cool...
stuff...
return(output)
}

Example time!

Write a function that can take a vector of numbers as input, and return the mean of the numbers as output.

GetMean <- function(vector){
result <- mean(vector, na.rm=TRUE)
return(result)
}

What happens if you run that code? Not much, on the surface. R saves that new function for you, though, so later you can call it and provide the necessary argument(s). If you’re using RStudio, you’ll notice your brand new function shows up in the environment window.

Run the code above, and then try this:

GetMean(vector=1:10)

## [1] 5.5

GetMean(vector=rep(6,30))

## [1] 6

GetMean(c(1,3,2.5,NA))

## [1] 2.167

GetMean(iris$Petal.Length) # note that iris is one of the datasets that's built into R.

## [1] 3.758

How might you want to use this?

If you find yourself writing the same set of commands over and over, consider putting it into a function. For example, maybe you are doing a series of transformations and you want to generate a histogram after each step, but you’re a data artiste and you refuse to compromise on aethestics – you can use a function to save all the relevant plotting code in one place and then just call it every time you want to use it.

library(ggplot2)

# Use the iris data as an example (built into R)
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# Define the function to generate a histogram with all of the settings I like
PlotHist <- function(var, var.name, fig.num, transformation){
  
  data <- data.frame(var.name=var) # convert variable vector to data frame for ggplot
  
  p <- ggplot(data, aes(x=var.name)) +
    geom_histogram(aes(y=..density..), colour="black", fill="white") +
    geom_density(alpha=.2, fill="red") +  # Overlay with transparent density plot
    xlab(var.name) +
    ylab(NULL) +
    ggtitle(paste("Figure ", fig.num, ": ", var.name, " (", transformation, ")", sep=""))  
return(p)
}

# Plot raw Petal.Length data
PlotHist(var=iris$Petal.Length, var.name="Petal Length", fig.num=1, transformation="raw")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-4

# Run transformations on Petal.Length and get plot after each one
iris$PL.sqrt <- sqrt(iris$Petal.Length)
PlotHist(iris$PL.sqrt, "Petal Length", 2, "square root")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-4

iris$PL.negrec <- -1/iris$Petal.Length
PlotHist(iris$PL.negrec, "Petal Length", 3, "negative reciprocal")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-4

# How cool?? So cool.
# Also, how much easier is this to read than if I had copy/pasted my ggplot code three times? So much easier.

January 6, 2015

611 Highlights in R (part 1)

Check out an updated version of this post on RPubs: http://rpubs.com/rosemm/64056

611 Syllabus in R

Here’s the 611 schedule of topics from this year, and some code highlights from each week.

# a toy data set for us to work with
n <- 100
gender <- as.factor(c(rep("M", n/2), rep("F", n/2)))
IQ <- rnorm(n, mean=100, sd=15)
degree <- rep(c("HS", "BA", "MS", "PhD"), n/4)
height <- as.numeric(gender)-2 + rnorm(n, mean=5.5, sd=1)
RT1 <- rchisq(n, 4)
RT2 <- rchisq(n, 4)
DV <- 50 - 5*(as.numeric(gender)-1) + .1*IQ + rnorm(n)

df <- data.frame(awesome=DV, gender=gender, IQ=IQ, degree=degree, height=height, RT1=RT1, RT2=RT2)

Exploratory Data Analysis

head(df)

##   awesome gender     IQ degree height   RT1    RT2
## 1   56.22      M 114.60     HS  6.260 1.965 0.4609
## 2   56.30      M 118.47     BA  5.010 1.516 4.6082
## 3   59.22      M 119.18     MS  4.455 5.396 2.2837
## 4   54.42      M  92.82    PhD  6.796 3.476 7.3616
## 5   53.95      M  92.41     HS  4.052 7.327 4.2431
## 6   58.60      M 115.49     BA  5.155 6.912 6.4011

View(df)

summary(df)

##     awesome     gender       IQ        degree       height    
##  Min.   :51.3   F:50   Min.   : 58.4   BA :25   Min.   :2.19  
##  1st Qu.:54.8   M:50   1st Qu.: 91.8   HS :25   1st Qu.:4.25  
##  Median :57.5          Median : 98.1   MS :25   Median :4.86  
##  Mean   :57.4          Mean   : 99.2   PhD:25   Mean   :4.89  
##  3rd Qu.:59.4          3rd Qu.:109.3            3rd Qu.:5.60  
##  Max.   :63.2          Max.   :131.1            Max.   :7.99  
##       RT1              RT2        
##  Min.   : 0.467   Min.   : 0.371  
##  1st Qu.: 2.006   1st Qu.: 2.008  
##  Median : 4.140   Median : 3.163  
##  Mean   : 4.513   Mean   : 4.121  
##  3rd Qu.: 6.457   3rd Qu.: 5.301  
##  Max.   :13.278   Max.   :20.295

library(psych)
describe(df)

##         vars   n  mean    sd median trimmed   mad   min    max range  skew
## awesome    1 100 57.39  2.87  57.46   57.31  3.44 51.34  63.16 11.82  0.11
## gender*    2 100  1.50  0.50   1.50    1.50  0.74  1.00   2.00  1.00  0.00
## IQ         3 100 99.21 14.87  98.09   99.78 12.01 58.41 131.05 72.64 -0.34
## degree*    4 100  2.50  1.12   2.50    2.50  1.48  1.00   4.00  3.00  0.00
## height     5 100  4.89  1.16   4.86    4.88  1.08  2.19   7.99  5.80  0.03
## RT1        6 100  4.51  2.94   4.14    4.20  3.23  0.47  13.28 12.81  0.81
## RT2        7 100  4.12  3.13   3.16    3.71  2.18  0.37  20.29 19.92  1.97
##         kurtosis   se
## awesome    -0.94 0.29
## gender*    -2.02 0.05
## IQ          0.28 1.49
## degree*    -1.39 0.11
## height     -0.24 0.12
## RT1         0.03 0.29
## RT2         6.27 0.31

describe(df[,c(1,3)])

##         vars   n  mean    sd median trimmed   mad   min    max range  skew
## awesome    1 100 57.39  2.87  57.46   57.31  3.44 51.34  63.16 11.82  0.11
## IQ         2 100 99.21 14.87  98.09   99.78 12.01 58.41 131.05 72.64 -0.34
##         kurtosis   se
## awesome    -0.94 0.29
## IQ          0.28 1.49

plot(df$awesome ~ df$IQ)

plot(df$awesome ~ df$gender)

plot(df$awesome ~ df$degree)

plot(df$height ~ df$gender)

hist(df$IQ)

hist(df$RT1)

cor(df$awesome, df$IQ)

## [1] 0.4604

round(cor(df[,c(1,3,5:7)]),3)

##         awesome     IQ height    RT1   RT2
## awesome   1.000  0.460 -0.419 -0.042 0.091
## IQ        0.460  1.000 -0.026 -0.003 0.142
## height   -0.419 -0.026  1.000  0.014 0.142
## RT1      -0.042 -0.003  0.014  1.000 0.012
## RT2       0.091  0.142  0.142  0.012 1.000

var(df$awesome, df$IQ)

## [1] 19.62

table(df[,c(2,4)])

##       degree
## gender BA HS MS PhD
##      F 12 12 13  13
##      M 13 13 12  12

# the big guns
library(ggplot2)

## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:psych':
## 
##     %+%

library(GGally)
ggpairs(df)

Distributions

qqnorm(df$IQ)
qqline(df$IQ, col="red")

qqnorm(df$RT1)
qqline(df$RT1, col="red")

df$RT1_sqrt <- sqrt(df$RT1)
qqnorm(df$RT1_sqrt)
qqline(df$RT1_sqrt, col="red")

Hypothesis Testing & Estimation: T-tests

# one-sample
t.test(IQ, mu=100, data=df, conf.level=.95)

## 
##  One Sample t-test
## 
## data:  IQ
## t = -0.5292, df = 99, p-value = 0.5979
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
##   96.26 102.16
## sample estimates:
## mean of x 
##     99.21

# IST
t.test(IQ ~ gender, var.equal=TRUE, data=df)

## 
##  Two Sample t-test
## 
## data:  IQ by gender
## t = -0.9859, df = 98, p-value = 0.3266
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.832  2.969
## sample estimates:
## mean in group F mean in group M 
##           97.75          100.68

t.test(x=df[gender=="F",3], y=df[gender=="M",3], var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  df[gender == "F", 3] and df[gender == "M", 3]
## t = -0.9859, df = 98, p-value = 0.3266
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.832  2.969
## sample estimates:
## mean of x mean of y 
##     97.75    100.68

# DST
t.test(x=RT1, y=RT2, paired=TRUE, data=df)

## 
##  Paired t-test
## 
## data:  RT1 and RT2
## t = 0.9187, df = 99, p-value = 0.3605
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4547  1.2388
## sample estimates:
## mean of the differences 
##                   0.392

One-Way ANOVAs

aov(awesome ~ degree, data=df)

## Call:
##    aov(formula = awesome ~ degree, data = df)
## 
## Terms:
##                 degree Residuals
## Sum of Squares     5.6     807.9
## Deg. of Freedom      3        96
## 
## Residual standard error: 2.901
## Estimated effects may be unbalanced

summary(aov(awesome ~ degree, data=df))

##             Df Sum Sq Mean Sq F value Pr(>F)
## degree       3      6    1.86    0.22   0.88
## Residuals   96    808    8.42

Contrasts

Contrasts in R are unnecessarily confusing. Here is an rpubs doc that will help, hopefully!

# a little messy :)

Two-Way ANOVAs

aov(awesome ~ degree*gender, data=df)

## Call:
##    aov(formula = awesome ~ degree * gender, data = df)
## 
## Terms:
##                 degree gender degree:gender Residuals
## Sum of Squares     5.6  472.2           5.8     329.9
## Deg. of Freedom      3      1             3        92
## 
## Residual standard error: 1.894
## Estimated effects may be unbalanced

summary(aov(awesome ~ degree*gender, data=df))

##               Df Sum Sq Mean Sq F value Pr(>F)    
## degree         3      6       2    0.52   0.67    
## gender         1    472     472  131.71 <2e-16 ***
## degree:gender  3      6       2    0.54   0.66    
## Residuals     92    330       4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# regression coefficients output
summary(lm(awesome ~ degree*gender, data=df))

## 
## Call:
## lm(formula = awesome ~ degree * gender, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.517 -1.256 -0.268  1.328  3.795 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        59.5427     0.5466  108.93  < 2e-16 ***
## degreeHS           -0.4621     0.7730   -0.60     0.55    
## degreeMS            0.4083     0.7580    0.54     0.59    
## degreePhD           0.1155     0.7580    0.15     0.88    
## genderM            -4.5616     0.7580   -6.02  3.6e-08 ***
## degreeHS:genderM    1.0268     1.0720    0.96     0.34    
## degreeMS:genderM    0.0397     1.0720    0.04     0.97    
## degreePhD:genderM  -0.2190     1.0720   -0.20     0.84    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.89 on 92 degrees of freedom
## Multiple R-squared:  0.594,  Adjusted R-squared:  0.564 
## F-statistic: 19.3 on 7 and 92 DF,  p-value: 1.25e-15

Correlation and Regression

cor(df$awesome, df$IQ)

## [1] 0.4604

summary(lm(awesome ~ IQ, data=df))

## 
## Call:
## lm(formula = awesome ~ IQ, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.493 -2.538  0.131  2.285  5.202 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  48.5816     1.7345   28.01  < 2e-16 ***
## IQ            0.0888     0.0173    5.13  1.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.56 on 98 degrees of freedom
## Multiple R-squared:  0.212,  Adjusted R-squared:  0.204 
## F-statistic: 26.4 on 1 and 98 DF,  p-value: 1.44e-06

Analysis of Categorical Data

table(df[,c(2,4)])

##       degree
## gender BA HS MS PhD
##      F 12 12 13  13
##      M 13 13 12  12

# goodness of fit
chisq.test(table(df$degree)) # are degrees equally distributed?

## 
##  Chi-squared test for given probabilities
## 
## data:  table(df$degree)
## X-squared = 0, df = 3, p-value = 1

chisq.test(table(df$degree), p=c(.4, .3, .2, .1)) # is it 40% HS, 30% BS, 20% MS, and 10% PhD?

## 
##  Chi-squared test for given probabilities
## 
## data:  table(df$degree)
## X-squared = 30.21, df = 3, p-value = 1.248e-06

# test of independence 
chisq.test(table(df[,c(2,4)]))

## 
##  Pearson's Chi-squared test
## 
## data:  table(df[, c(2, 4)])
## X-squared = 0.16, df = 3, p-value = 0.9838

# Get measures of association
library(vcd)

## Loading required package: grid

assocstats(table(df[,c(2,4)]))

##                      X^2 df P(> X^2)
## Likelihood Ratio 0.16004  3  0.98377
## Pearson          0.16000  3  0.98377
## 
## Phi-Coefficient   : 0.04 
## Contingency Coeff.: 0.04 
## Cramer's V        : 0.04

Category: nuggets

Generate Random Correlated Data

Dealing with messy data in R (Dplyr, Tidyr, Jsonlite, and Facebook API)

An overview of data wrangling in R

tidyr

dplyr

more examples

more resources

611 Highlights in R (Part II)

Brief ‘Hello’ to R Markdown

Streamlined Data Workflow with dplyr

Biased and unbiased estimates

Jason Wallin

Creating a population

Sample means are unbiased estimates of population means

Uncorrected sample standard deviations are biased estimates of population standard deviations

DIY functions

DIY functions

Example time!

How might you want to use this?

611 Highlights in R (part 1)

611 Syllabus in R

Exploratory Data Analysis

Distributions

Hypothesis Testing & Estimation: T-tests

One-Way ANOVAs

Contrasts

Two-Way ANOVAs

Correlation and Regression

Analysis of Categorical Data

Subscribe By Email