611 Highlights in R (Part II)
The raw .R file is here and the Rmd file is here.
Brief ‘Hello’ to R Markdown
Let me first introduce you to R markdown and kitr
. This format allows you to integrate your analyses with text that can be automatically translated to html, word, and pdf formats. I just produced a supplementary materials document using this and it probably saved me a ridiculous number of hours. It’s easy to get started:
- Save your file with extension .Rmd.
-
Place something like the following in the very top of your file:
--- author: John Flournoy title: 611 Highlights in R (Part II) output: html_document ---
-
Write as you normally would, using Markdown syntax for formatting. For example:
# This is a big heading 1. this is 2. a numbered list - this is a bulleted list - With **bold** and *italic* text
-
When you want to throw in some R code, place it inside an R chunk by wrapping it in
```{r}
at the beginning and```
at the end. When you do, it comes out something like this:# From Rose's presentation: # a toy data set for us to work with n <- 100 gender <- as.factor(c(rep("M", n/2), rep("F", n/2))) IQ <- rnorm(n, mean=100, sd=15) degree <- rep(c("HS", "BA", "MS", "PhD"), n/4) height <- as.numeric(gender)-2 + rnorm(n, mean=5.5, sd=1) RT1 <- rchisq(n, 4) RT2 <- rchisq(n, 4) DV <- 50 - 5*(as.numeric(gender)-1) + .1*IQ + rnorm(n) df <- data.frame(awesome=DV, gender=gender, IQ=IQ, degree=degree, height=height, RT1=RT1, RT2=RT2) head(df)
## awesome gender IQ degree height RT1 RT2 ## 1 52.89486 M 79.40263 HS 6.292706 1.761990 2.228117 ## 2 57.04782 M 96.31289 BA 4.759042 2.236027 7.363253 ## 3 55.59594 M 104.95331 MS 5.560292 5.159114 7.126281 ## 4 55.44397 M 84.23347 PhD 5.354595 7.652900 2.836956 ## 5 55.45191 M 107.38754 HS 4.523997 7.768904 4.504664 ## 6 54.70468 M 110.07096 BA 6.139368 7.786462 7.508488
Notice that in 4, the we see the output of the command head(df)
. This is useful for outputting tables and plots, as we’ll see.
Here are more resources on
Streamlined Data Workflow with dplyr
Download the data here:
Make sure you have the package (library(dplyr)
), and if you don’t, install it (install.packages('dplyr')
). Here’s an intro to dplyr
straight from the horse’s (Hadley’s) mouth.
By way of data manipulation that I’m actually using for a real project, this tutorial will introduce you to
%>%
- a function chaining operator
mutate
andtransmute
- adds new variables and preserves existing (
transmute
drops existing variables)
- adds new variables and preserves existing (
mutate_each
- adds new variables, applying functions in
funs(...)
to each column
- adds new variables, applying functions in
group_by
- groups data set according to some variable (e.g., gender) – other
dplyr
operations will be done on groups
- groups data set according to some variable (e.g., gender) – other
summarise
- summarizes multiple values to a single value
left_join
- joins two data frames, returning all rows from x, and all columns from x and y
select
- keeps the variables/columns that you mention (see
?select
for special functions that help you select variables)
- keeps the variables/columns that you mention (see
filter
- returns rows with matching conditions (similar to selecting columns like:
aDataFrame[variable == value,]
)
- returns rows with matching conditions (similar to selecting columns like:
group_by
- groups data set according to some variable (e.g., gender) – other
dplyr
operations will be done on groups
- groups data set according to some variable (e.g., gender) – other
summarise
- summarizes multiple values to a single value
do
- use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list (e.g., run the same
lm
ort.test
separately for each gender, country)
- use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list (e.g., run the same
Data Subset Prep
I’m going to focus on the World Values Survey because it’s awesome. You can download it in many formats, including an rdata file. I prepared a subset for you all.
It includes
- some basic demographics
- a question about how much time one spends thinking about meaning and purpose of life
- ratings on importance of various things in life (e.g., family, friends, work)
setwd('C:/Users/jackflibb/Documents/UO/rClub/611_dplyr_wvs_etc')
load("WorldValuesSurvey.rdata")
ls()
## [1] "degree"
## [2] "df"
## [3] "DV"
## [4] "gender"
## [5] "height"
## [6] "IQ"
## [7] "n"
## [8] "RT1"
## [9] "RT2"
## [10] "WorldValuesSurvey-Wave6-2010-2014_v2014-11-07"
wvs<-`WorldValuesSurvey-Wave6-2010-2014_v2014-11-07`
dim(wvs)
## [1] 85070 387
countryNames<-cbind(names(attr(wvs,"label.table")[[2]]),attr(wvs,"label.table")[[2]])
library(dplyr)
smallerData<-wvs %>%
transmute(
countryCode=V2,
sex=V240, #1 male 2 female
age=V242,
nationalIncomeLadder=V239,
meaningInLife=V143,
imptcOfFamily=V4,
imptcOfFriends=V5,
imptcOfLeisure=V6,
imptcOfPolitics=V7,
imptcOfWork=V8,
imptcOfReligion=V9)
write.csv(smallerData,'wvs_small.csv',row.names=F)
write.csv(countryNames,'country_codes.csv',row.names=F)
Don’t worry if you don’t understand some of the syntax yet – I just wanted to show you how I prepare the dataset.
Now we’re going to read in the datasets that you downloaded. Change the working directory to wherever you put the downloaded data files.
setwd('C:/Users/jackflibb/Documents/UO/rClub/611_dplyr_wvs_etc') # Change this
wvs<-read.csv('wvs_small.csv')
countryNames<-read.csv('country_codes.csv',stringsAsFactors=F)
head(wvs)
## countryCode sex age nationalIncomeLadder meaningInLife imptcOfFamily
## 1 12 1 21 5 2 1
## 2 12 2 24 6 1 1
## 3 12 2 26 6 1 1
## 4 12 2 28 5 2 1
## 5 12 2 35 7 2 1
## 6 12 1 36 5 2 1
## imptcOfFriends imptcOfLeisure imptcOfPolitics imptcOfWork
## 1 1 1 -2 1
## 2 2 3 4 2
## 3 3 2 4 2
## 4 1 3 4 3
## 5 1 1 2 1
## 6 2 2 2 4
## imptcOfReligion
## 1 1
## 2 2
## 3 1
## 4 1
## 5 1
## 6 1
Using the chaining operator
As you may have noticed, we have a lot of missing data coded unhelpfully as negative values. The first thing I’ll show you is how to easily recode them to NA using the dplyr
function mutate_each
. Before I do that, let me introduce the new operator introduced by dplyr
: %>%
. It’s weird at first, but once you get used to it, it’s great.
Basically, %>%
sends the output data frame of one function to the first argument of the next function. For example:
library(dplyr)
wvs %>% names()
## [1] "countryCode" "sex" "age"
## [4] "nationalIncomeLadder" "meaningInLife" "imptcOfFamily"
## [7] "imptcOfFriends" "imptcOfLeisure" "imptcOfPolitics"
## [10] "imptcOfWork" "imptcOfReligion"
We can easily add another step to the chain:
wvs %>% names() %>% paste('is a variable in the WVS data')
## [1] "countryCode is a variable in the WVS data"
## [2] "sex is a variable in the WVS data"
## [3] "age is a variable in the WVS data"
## [4] "nationalIncomeLadder is a variable in the WVS data"
## [5] "meaningInLife is a variable in the WVS data"
## [6] "imptcOfFamily is a variable in the WVS data"
## [7] "imptcOfFriends is a variable in the WVS data"
## [8] "imptcOfLeisure is a variable in the WVS data"
## [9] "imptcOfPolitics is a variable in the WVS data"
## [10] "imptcOfWork is a variable in the WVS data"
## [11] "imptcOfReligion is a variable in the WVS data"
This isn’t very useful but, it’s a little more interpretable than:
paste(names(wvs),'is a variable in the WVS data')
This will make things a lot more interpretable as we produce complex chains of transformations.
Code missing values using mutate_each
Okay, let’s set the negative values to NA
. We’ll use mutate_each
, which takes a data frame and applies the functions given to it by the function funs
. So you could divide everything in half by doing yourData %>% mutate_each(funs(half = . / 2))
.
wvs_NAs<-wvs %>%
mutate_each(funs(ifelse(.<=-1,NA,.)))
Basic summary stats with summarise, summarise_each, select, and filter
Remember describe
from last week? The function summarise
allows you to easily make your own sort of describe
, tailored exactly to your data set.
Let’s look at the structure of our WVS data to decide what we want to do.
str(wvs_NAs)
## 'data.frame': 85070 obs. of 11 variables:
## $ countryCode : int 12 12 12 12 12 12 12 12 12 12 ...
## $ sex : int 1 2 2 2 2 1 1 2 1 2 ...
## $ age : int 21 24 26 28 35 36 41 44 59 72 ...
## $ nationalIncomeLadder: int 5 6 6 5 7 5 7 2 5 4 ...
## $ meaningInLife : int 2 1 1 2 2 2 2 2 2 2 ...
## $ imptcOfFamily : int 1 1 1 1 1 1 1 1 1 1 ...
## $ imptcOfFriends : int 1 2 3 1 1 2 1 1 1 1 ...
## $ imptcOfLeisure : int 1 3 2 3 1 2 1 1 1 1 ...
## $ imptcOfPolitics : int NA 4 4 4 2 2 1 1 2 2 ...
## $ imptcOfWork : int 1 2 2 3 1 4 1 2 2 1 ...
## $ imptcOfReligion : int 1 2 1 1 1 1 1 2 2 1 ...
Looks like the demographics should be summarized differently than the ratings.
categorical_summary <- wvs_NAs %>%
summarise(
N=n(), #a dplyr helper function
N_countries=length(unique(countryCode)),
Prop_Male=sum(sex==1,na.rm=T)/N # WOAH, you can use variables created within this function call to build other variables in the same funciton call!!
)
categorical_summary
## N N_countries Prop_Male
## 1 85070 59 0.4766428
Let’s look at the numbers for women only…
categorical_summary_F <- wvs_NAs %>%
filter(sex==2) %>%
summarise(
N=n(),
N_countries=length(unique(countryCode)),
Prop_Male=sum(sex==1,na.rm=T)/N
)
categorical_summary_F
## N N_countries Prop_Male
## 1 44434 59 0
We can use summarise_each
to take care of the rest. Like mutate_each
, summarise_each
takes a comma separated list of functions in funs(...)
, and optionally a specification of which variables/columns should be summarized. This is really easy in dplyr
– we’re going to tell it to summarize all variables from age
to imptcOfReligion
just by passing in age:imptcOfReligion
after our list of functions.
continuous_summary <- wvs_NAs %>%
select(age:imptcOfReligion) %>% # you can use the : operator directly on variable names!
summarise_each( # glorious ..._each functions!
funs(
min=min(.,na.rm=T),
max=max(.,na.rm=T),
mean=mean(.,na.rm=T),
median=median(.,na.rm=T),
sd=sd(.,na.rm=T),
Prop_missing=sum(is.na(.))/length(.)
)
)
continuous_summary
## age_min nationalIncomeLadder_min meaningInLife_min imptcOfFamily_min
## 1 16 1 1 1
## imptcOfFriends_min imptcOfLeisure_min imptcOfPolitics_min
## 1 1 1 1
## imptcOfWork_min imptcOfReligion_min age_max nationalIncomeLadder_max
## 1 1 1 99 10
## meaningInLife_max imptcOfFamily_max imptcOfFriends_max
## 1 4 4 4
## imptcOfLeisure_max imptcOfPolitics_max imptcOfWork_max
## 1 4 4 4
## imptcOfReligion_max age_mean nationalIncomeLadder_mean
## 1 4 42.05654 4.908784
## meaningInLife_mean imptcOfFamily_mean imptcOfFriends_mean
## 1 1.844841 1.107936 1.684175
## imptcOfLeisure_mean imptcOfPolitics_mean imptcOfWork_mean
## 1 1.88321 2.612757 1.545598
## imptcOfReligion_mean age_median nationalIncomeLadder_median
## 1 1.914088 40 5
## meaningInLife_median imptcOfFamily_median imptcOfFriends_median
## 1 2 1 2
## imptcOfLeisure_median imptcOfPolitics_median imptcOfWork_median
## 1 2 3 1
## imptcOfReligion_median age_sd nationalIncomeLadder_sd meaningInLife_sd
## 1 2 16.54851 2.104927 0.8582036
## imptcOfFamily_sd imptcOfFriends_sd imptcOfLeisure_sd imptcOfPolitics_sd
## 1 0.3758205 0.7397572 0.8341943 0.9822105
## imptcOfWork_sd imptcOfReligion_sd age_Prop_missing
## 1 0.8116806 1.064065 0.001798519
## nationalIncomeLadder_Prop_missing meaningInLife_Prop_missing
## 1 0.03605266 0.015787
## imptcOfFamily_Prop_missing imptcOfFriends_Prop_missing
## 1 0.003714588 0.005442577
## imptcOfLeisure_Prop_missing imptcOfPolitics_Prop_missing
## 1 0.01120254 0.01702128
## imptcOfWork_Prop_missing imptcOfReligion_Prop_missing
## 1 0.01702128 0.01498766
Dang, that’s kind of ugly. We can reshape these using the reshape2
package and output it to a table using kable
from knitr
.
library(knitr)
library(reshape2)
continuous_summary_melted <- melt(continuous_summary)
## No id variables; using all as measure variables
kable(categorical_summary,
caption='Summary of Categorical Variables')
N | N_countries | Prop_Male |
---|---|---|
85070 | 59 | 0.4766428 |
kable(continuous_summary_melted,
col.names=c('Statistic','Value'),
digits=2,
caption='Summary of Continuous Variables')
Statistic | Value |
---|---|
age_min | 16.00 |
nationalIncomeLadder_min | 1.00 |
meaningInLife_min | 1.00 |
imptcOfFamily_min | 1.00 |
imptcOfFriends_min | 1.00 |
imptcOfLeisure_min | 1.00 |
imptcOfPolitics_min | 1.00 |
imptcOfWork_min | 1.00 |
imptcOfReligion_min | 1.00 |
age_max | 99.00 |
nationalIncomeLadder_max | 10.00 |
meaningInLife_max | 4.00 |
imptcOfFamily_max | 4.00 |
imptcOfFriends_max | 4.00 |
imptcOfLeisure_max | 4.00 |
imptcOfPolitics_max | 4.00 |
imptcOfWork_max | 4.00 |
imptcOfReligion_max | 4.00 |
age_mean | 42.06 |
nationalIncomeLadder_mean | 4.91 |
meaningInLife_mean | 1.84 |
imptcOfFamily_mean | 1.11 |
imptcOfFriends_mean | 1.68 |
imptcOfLeisure_mean | 1.88 |
imptcOfPolitics_mean | 2.61 |
imptcOfWork_mean | 1.55 |
imptcOfReligion_mean | 1.91 |
age_median | 40.00 |
nationalIncomeLadder_median | 5.00 |
meaningInLife_median | 2.00 |
imptcOfFamily_median | 1.00 |
imptcOfFriends_median | 2.00 |
imptcOfLeisure_median | 2.00 |
imptcOfPolitics_median | 3.00 |
imptcOfWork_median | 1.00 |
imptcOfReligion_median | 2.00 |
age_sd | 16.55 |
nationalIncomeLadder_sd | 2.10 |
meaningInLife_sd | 0.86 |
imptcOfFamily_sd | 0.38 |
imptcOfFriends_sd | 0.74 |
imptcOfLeisure_sd | 0.83 |
imptcOfPolitics_sd | 0.98 |
imptcOfWork_sd | 0.81 |
imptcOfReligion_sd | 1.06 |
age_Prop_missing | 0.00 |
nationalIncomeLadder_Prop_missing | 0.04 |
meaningInLife_Prop_missing | 0.02 |
imptcOfFamily_Prop_missing | 0.00 |
imptcOfFriends_Prop_missing | 0.01 |
imptcOfLeisure_Prop_missing | 0.01 |
imptcOfPolitics_Prop_missing | 0.02 |
imptcOfWork_Prop_missing | 0.02 |
imptcOfReligion_Prop_missing | 0.01 |
Clean up that table
We might want to clean up that summary of continuous variables a bit. Let’s do that quickly using transmute
and arrange
.
continuous_summary_melted %>%
transmute(
item=sub('(\\w+?)_(.*)','\\1',variable),
statistic=sub('(\\w+?)_(.*)','\\2',variable),
value=value) %>%
arrange(item,statistic) %>%
kable(
col.names=c('Item','Statistic','Value'),
digits=2,
caption='Summary of Continuous Variables')
Item | Statistic | Value |
---|---|---|
age | Prop_missing | 0.00 |
age | max | 99.00 |
age | mean | 42.06 |
age | median | 40.00 |
age | min | 16.00 |
age | sd | 16.55 |
imptcOfFamily | Prop_missing | 0.00 |
imptcOfFamily | max | 4.00 |
imptcOfFamily | mean | 1.11 |
imptcOfFamily | median | 1.00 |
imptcOfFamily | min | 1.00 |
imptcOfFamily | sd | 0.38 |
imptcOfFriends | Prop_missing | 0.01 |
imptcOfFriends | max | 4.00 |
imptcOfFriends | mean | 1.68 |
imptcOfFriends | median | 2.00 |
imptcOfFriends | min | 1.00 |
imptcOfFriends | sd | 0.74 |
imptcOfLeisure | Prop_missing | 0.01 |
imptcOfLeisure | max | 4.00 |
imptcOfLeisure | mean | 1.88 |
imptcOfLeisure | median | 2.00 |
imptcOfLeisure | min | 1.00 |
imptcOfLeisure | sd | 0.83 |
imptcOfPolitics | Prop_missing | 0.02 |
imptcOfPolitics | max | 4.00 |
imptcOfPolitics | mean | 2.61 |
imptcOfPolitics | median | 3.00 |
imptcOfPolitics | min | 1.00 |
imptcOfPolitics | sd | 0.98 |
imptcOfReligion | Prop_missing | 0.01 |
imptcOfReligion | max | 4.00 |
imptcOfReligion | mean | 1.91 |
imptcOfReligion | median | 2.00 |
imptcOfReligion | min | 1.00 |
imptcOfReligion | sd | 1.06 |
imptcOfWork | Prop_missing | 0.02 |
imptcOfWork | max | 4.00 |
imptcOfWork | mean | 1.55 |
imptcOfWork | median | 1.00 |
imptcOfWork | min | 1.00 |
imptcOfWork | sd | 0.81 |
meaningInLife | Prop_missing | 0.02 |
meaningInLife | max | 4.00 |
meaningInLife | mean | 1.84 |
meaningInLife | median | 2.00 |
meaningInLife | min | 1.00 |
meaningInLife | sd | 0.86 |
nationalIncomeLadder | Prop_missing | 0.04 |
nationalIncomeLadder | max | 10.00 |
nationalIncomeLadder | mean | 4.91 |
nationalIncomeLadder | median | 5.00 |
nationalIncomeLadder | min | 1.00 |
nationalIncomeLadder | sd | 2.10 |
Get country names using left_join
We also have a file with country names to go with our country codes. We’re going to start doing some analysis within-country, so let’s get those names into the data set. This is what some call a one-to-many merge. That is, we have a data frame with one row per country name, and we want to merge each of these rows with every row in the main data set that has a matching country code.
There are 4 join functions (?left_join
for more info), of which left_join
is our best choice. According to the docs it “return[s] all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.”
str(countryNames)
## 'data.frame': 195 obs. of 2 variables:
## $ V1: chr "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
## $ V2: int -5 -4 -3 -2 -1 4 8 12 16 20 ...
names(wvs_NAs)
## [1] "countryCode" "sex" "age"
## [4] "nationalIncomeLadder" "meaningInLife" "imptcOfFamily"
## [7] "imptcOfFriends" "imptcOfLeisure" "imptcOfPolitics"
## [10] "imptcOfWork" "imptcOfReligion"
#We use the same country code variable name, countryCode, as the main data set
names(countryNames)<-c('countryName','countryCode')
#left_join will automatically find matching column names
wvs_NAs_cnames <- left_join(wvs_NAs,countryNames)
## Joining by: "countryCode"
head(wvs_NAs_cnames) #success!
## countryCode sex age nationalIncomeLadder meaningInLife imptcOfFamily
## 1 12 1 21 5 2 1
## 2 12 2 24 6 1 1
## 3 12 2 26 6 1 1
## 4 12 2 28 5 2 1
## 5 12 2 35 7 2 1
## 6 12 1 36 5 2 1
## imptcOfFriends imptcOfLeisure imptcOfPolitics imptcOfWork
## 1 1 1 NA 1
## 2 2 3 4 2
## 3 3 2 4 2
## 4 1 3 4 3
## 5 1 1 2 1
## 6 2 2 2 4
## imptcOfReligion countryName
## 1 1 Algeria
## 2 2 Algeria
## 3 1 Algeria
## 4 1 Algeria
## 5 1 Algeria
## 6 1 Algeria
Summaries by country
We’re going to use the summarise_each
function again, but this time instead of using select
before hand, we can give it the columns we want as the second argument. I’ll use the helper function starts_with
to get every column that starts with ‘imptcOf’ (our “Importance in life of…” variables).
wvs_NAs_cnames %>%
group_by(countryName) %>%
summarise_each(
funs(
mean=mean(.,na.rm=T)),
starts_with('imptcOf')) %>%
kable(caption='Importance of ___ Item Means by Country')
countryName | imptcOfFamily | imptcOfFriends | imptcOfLeisure | imptcOfPolitics | imptcOfWork | imptcOfReligion |
---|---|---|---|---|---|---|
Algeria | 1.100418 | 1.762826 | 2.052453 | 2.719437 | 1.385800 | 1.118189 |
Argentina | 1.123541 | 1.570039 | 1.870117 | 2.880905 | 1.580454 | 2.451581 |
Armenia | 1.032022 | 1.764384 | 1.959032 | 2.912168 | 1.373383 | 1.531392 |
Australia | 1.076923 | 1.475456 | 1.670654 | 2.482562 | 1.965909 | 2.824648 |
Azerbaijan | 1.074850 | 1.791417 | 2.124751 | 2.926148 | 1.496008 | 2.039920 |
Bahrain | 1.674457 | 1.894167 | 2.333333 | 2.174350 | 2.169321 | 1.753756 |
Belarus | 1.130150 | 1.725361 | 1.888227 | 2.817822 | 1.835635 | 2.531003 |
Brazil | 1.134007 | 1.855698 | 1.813514 | 2.751864 | 1.393122 | 1.595687 |
Chile | 1.097000 | 1.871587 | 1.539314 | 3.064777 | 1.541538 | 2.297379 |
China | 1.140907 | 1.596710 | 2.052751 | 2.543602 | 1.802098 | 3.381368 |
Colombia | 1.155423 | 2.011258 | 1.664901 | 3.024585 | 1.258278 | 1.592323 |
Cyprus | 1.073073 | 1.449449 | 1.543543 | 2.728745 | 1.464393 | 1.827655 |
Ecuador | 1.021631 | 1.968386 | 1.593672 | 2.551581 | 1.169717 | 1.477537 |
Egypt | 1.039396 | 1.743926 | 2.356533 | 2.166120 | 1.909389 | 1.065046 |
Estonia | 1.151436 | 1.614781 | 1.796721 | 2.877457 | 1.730211 | 3.021220 |
Germany | 1.251471 | 1.562072 | 1.811765 | 2.587573 | 1.806468 | 2.951399 |
Ghana | 1.052835 | 1.795103 | 1.607603 | 2.611469 | 1.069588 | 1.087629 |
Hong Kong | 1.172345 | 1.631000 | 1.794975 | 2.527694 | 1.929860 | 2.770541 |
India | 1.466413 | 1.928390 | 2.213333 | 2.427936 | 1.750317 | 1.812817 |
Iraq | 1.079167 | 1.665275 | 2.323258 | 2.731707 | 1.453255 | 1.180000 |
Japan | 1.086523 | 1.602763 | 1.658804 | 2.062224 | 1.578105 | 3.115530 |
Jordan | 1.043333 | 1.602500 | 1.956667 | 2.800000 | 1.491109 | 1.068447 |
Kazakhstan | 1.080667 | 1.666667 | 1.811333 | 2.592000 | 1.585333 | 2.345333 |
Kuwait | 1.075136 | 1.531153 | 1.975358 | 2.198367 | 1.290679 | 1.133758 |
Kyrgyzstan | 1.041333 | 1.784000 | 2.146000 | 2.417333 | 1.446000 | 1.795333 |
Lebanon | 1.358079 | 1.651182 | 1.986418 | 2.537511 | 1.556604 | 1.745704 |
Libya | 1.036671 | 1.433540 | 1.836862 | 2.284668 | 1.319885 | 1.037987 |
Malaysia | 1.027692 | 1.636923 | 1.758461 | 2.401078 | 1.220000 | 1.185385 |
Mexico | 1.031500 | 1.879940 | 1.575288 | 2.614307 | 1.192000 | 1.627000 |
Morocco | 1.101836 | 1.901338 | 2.517426 | 3.349817 | 1.225021 | 1.122704 |
Netherlands | 1.174731 | 1.549497 | 1.653744 | 2.687833 | 1.929470 | 3.080558 |
New Zealand | 1.064010 | 1.466340 | 1.578492 | 2.545568 | 1.820707 | 2.746231 |
Nigeria | 1.014213 | 1.426379 | 1.623081 | 2.523891 | 1.314383 | 1.130756 |
Pakistan | 1.058382 | 1.738731 | 2.359797 | 2.802852 | 1.407282 | 1.127410 |
Palestine | 1.048048 | 1.740481 | 2.079879 | 2.508629 | 1.506694 | 1.161161 |
Peru | 1.157025 | 2.228906 | 1.906723 | 2.743633 | 1.339211 | 1.712134 |
Philippines | 1.015000 | 1.844037 | 2.468672 | 2.292540 | 1.109258 | 1.159167 |
Poland | 1.082292 | 1.686397 | 1.771249 | 2.834375 | 1.462274 | 1.816857 |
Qatar | 1.012264 | 1.287736 | 1.777148 | 2.155303 | 1.222641 | 1.013208 |
Romania | 1.081333 | 2.077948 | 1.919679 | 3.089623 | 1.564926 | 1.661549 |
Russia | 1.174578 | 1.887455 | 2.012225 | 2.958983 | 1.863577 | 2.636633 |
Rwanda | 1.096267 | 1.313032 | 2.000655 | 2.290766 | 1.423707 | 1.910282 |
Singapore | 1.077079 | 1.534483 | 1.722617 | 2.454868 | 1.743278 | 1.894416 |
Slovenia | 1.118421 | 1.640449 | 1.717248 | 3.270423 | 1.706439 | 2.816384 |
South Africa | 1.087635 | 1.769318 | 1.983138 | 2.563851 | 1.552554 | 1.623851 |
South Korea | 1.102585 | 1.546366 | 1.809045 | 2.367845 | 1.525609 | 2.361881 |
Spain | 1.093434 | 1.510638 | 1.632740 | 3.083051 | 1.564255 | 2.936279 |
Sweden | 1.124688 | 1.323920 | 1.505034 | 2.281537 | 1.604039 | 3.006700 |
Taiwan | 1.098294 | 1.632052 | 1.775678 | 2.795041 | 1.513934 | 2.366393 |
Thailand | 1.131271 | 1.878028 | 1.986509 | 1.981292 | 1.420655 | 1.569992 |
Trinidad and Tobago | 1.070211 | 1.916499 | 1.646881 | 2.769849 | 1.420363 | 1.314141 |
Tunisia | 1.024147 | 1.768080 | 2.011064 | 2.714286 | 1.199499 | 1.064892 |
Turkey | 1.048005 | 1.437148 | 1.707363 | 2.526415 | 1.720126 | 1.448750 |
Ukraine | 1.092667 | 1.728000 | 1.884000 | 2.926667 | 1.815333 | 2.218000 |
United States | 1.099505 | 1.503389 | 1.691369 | 2.429737 | 1.948058 | 2.023035 |
Uruguay | 1.132132 | 1.695085 | 1.642351 | 2.965726 | 1.495960 | 2.715005 |
Uzbekistan | 1.028000 | 1.577719 | 2.144392 | 2.616521 | 1.490262 | 1.976383 |
Yemen | 1.044266 | 1.598178 | 2.341589 | 2.693499 | 1.492417 | 1.060120 |
Zimbabwe | 1.036667 | 1.824000 | 1.844000 | 2.622667 | 1.150000 | 1.211333 |
Histograms of each variable by country
You can apply this same type of logic for running models for each country. We melt
the data frame so that we get a row for every variable/value pair indexed by country. This makes it easy for ggplot
to give us a a histogram for each variable in the same plot, using facet_wrap
. If we didn’t use facet_wrap
below, we’d get one histogram with the values of all the variables. Check out the first six, last, and a random sample of six lines of the melted data frame:
m_temp <- melt(wvs_NAs_cnames,id.vars='countryName')
kable(head(m_temp))
countryName | variable | value |
---|---|---|
Algeria | countryCode | 12 |
Algeria | countryCode | 12 |
Algeria | countryCode | 12 |
Algeria | countryCode | 12 |
Algeria | countryCode | 12 |
Algeria | countryCode | 12 |
kable(tail(m_temp))
countryName | variable | value | |
---|---|---|---|
935765 | Zimbabwe | imptcOfReligion | 1 |
935766 | Zimbabwe | imptcOfReligion | 1 |
935767 | Zimbabwe | imptcOfReligion | 1 |
935768 | Zimbabwe | imptcOfReligion | 1 |
935769 | Zimbabwe | imptcOfReligion | 1 |
935770 | Zimbabwe | imptcOfReligion | 1 |
kable(sample_n(m_temp,6),row.names=F)
countryName | variable | value |
---|---|---|
India | imptcOfPolitics | 3 |
Egypt | age | 60 |
Yemen | nationalIncomeLadder | 5 |
Pakistan | imptcOfFamily | 2 |
Japan | imptcOfPolitics | 2 |
Yemen | imptcOfFamily | 1 |
library(ggplot2)
#plotprintlist is a variable that is basically trash, but we could use it to print the plots again if we wanted
plotprintlist <- wvs_NAs_cnames %>%
melt(id.vars='countryName') %>%
group_by(countryName) %>%
do(
plot=print(
ggplot(.,aes(x=value))+ # set up the plot with the data frame '.' passed to `do`
geom_histogram()+ # We want a histogram of the values from column 'value'
facet_wrap(~variable,scales='free')+ # We want a facet (i.e., subplot) for every unique value of the column 'variable'
ggtitle(unique(.$countryName)) # give it a title with the country name from the countrName column of '.'
)
)
All the analysis
We can use the same technique to get a coefficients from a linear model for every country
wvs_NAs_cnames_modulus <- wvs_NAs_cnames %>%
mutate(age_ending = age %% 10, age_ending_9 = age_ending %in% 9)
library(biglm)
## Warning: package 'biglm' was built under R version 3.1.2
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.1.2
models <- wvs_NAs_cnames_modulus %>%
group_by(countryName) %>%
do(
mod=biglm(
meaningInLife ~ 1 +
sex +
age +
nationalIncomeLadder +
imptcOfFamily +
imptcOfFriends +
imptcOfLeisure +
imptcOfPolitics +
imptcOfWork +
imptcOfReligion +
age_ending_9,
data=.)
)
modelCoefs <- models %>%
do(
data.frame(
country=.$countryName,
n=.$mod$n,
ci=summary(.$mod)$mat[,'Coef']-summary(.$mod)$mat[,'(95%'],
var=names(coef(.$mod)),
coef=coef(.$mod))
)
## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character
kable(modelCoefs[1:20,])
country | n | ci | var | coef |
---|---|---|---|---|
Algeria | 1039 | 0.3973372 | (Intercept) | 2.2335079 |
Algeria | 1039 | 0.1267847 | sex | -0.0623652 |
Algeria | 1039 | 0.0043268 | age | -0.0039015 |
Algeria | 1039 | 0.0307039 | nationalIncomeLadder | -0.0802476 |
Algeria | 1039 | 0.1543852 | imptcOfFamily | -0.0267650 |
Algeria | 1039 | 0.0729042 | imptcOfFriends | -0.1187290 |
Algeria | 1039 | 0.0687132 | imptcOfLeisure | 0.1551985 |
Algeria | 1039 | 0.0594759 | imptcOfPolitics | 0.0364489 |
Algeria | 1039 | 0.0846615 | imptcOfWork | 0.1338679 |
Algeria | 1039 | 0.1517525 | imptcOfReligion | 0.0241294 |
Algeria | 1039 | 0.2130212 | age_ending_9TRUE | -0.0061899 |
Argentina | 943 | 0.4998702 | (Intercept) | 1.3207945 |
Argentina | 943 | 0.1302140 | sex | -0.0232658 |
Argentina | 943 | 0.0037675 | age | 0.0002063 |
Argentina | 943 | 0.0427286 | nationalIncomeLadder | 0.0281963 |
Argentina | 943 | 0.1869523 | imptcOfFamily | -0.0332910 |
Argentina | 943 | 0.0991168 | imptcOfFriends | 0.0729738 |
Argentina | 943 | 0.0870559 | imptcOfLeisure | -0.0057964 |
Argentina | 943 | 0.0689547 | imptcOfPolitics | 0.1117997 |
Argentina | 943 | 0.0829681 | imptcOfWork | 0.1206761 |
modelCoefs %>%
ggplot(aes(x=reorder(country,n),y=coef))+
geom_point(aes(size=n),alpha=.3,color='blue')+
geom_point(size=1,alpha=1,color='black')+
geom_errorbar(aes(ymin=coef-ci,ymax=coef+ci))+
facet_wrap(~var,scales='free_y',ncol=1)+
geom_hline(xintercept = 0,color='red',size=.5)+
theme(
axis.text=element_text(size=10),
axis.text.x=element_text(angle=315,hjust=0,vjust=1),
panel.background=element_rect(fill='white'),
strip.text=element_text(size=10))+
labs(x='Country, ordered by sample size',y='Coefficient Estimate',title='How often do you contemplate the meaning of life?')
// add bootstrap table styles to pandoc tables $(document).ready(function () { $('tr.header').parent('thead').parent('table').addClass('table table-condensed'); });