January 20, 2015

611 Highlights in R (Part II)

The raw .R file is here and the Rmd file is here.

Brief ‘Hello’ to R Markdown

Let me first introduce you to R markdown and kitr. This format allows you to integrate your analyses with text that can be automatically translated to html, word, and pdf formats. I just produced a supplementary materials document using this and it probably saved me a ridiculous number of hours. It’s easy to get started:

Save your file with extension .Rmd.

Place something like the following in the very top of your file:

---
author: John Flournoy
title: 611 Highlights in R (Part II)
output: html_document
---

Write as you normally would, using Markdown syntax for formatting. For example:

# This is a big heading

1. this is
2. a numbered list

- this is a bulleted list
- With **bold** and *italic*  text

When you want to throw in some R code, place it inside an R chunk by wrapping it in ```{r} at the beginning and ``` at the end. When you do, it comes out something like this:

# From Rose's presentation:
# a toy data set for us to work with
n <- 100
gender <- as.factor(c(rep("M", n/2), rep("F", n/2)))
IQ <- rnorm(n, mean=100, sd=15)
degree <- rep(c("HS", "BA", "MS", "PhD"), n/4)
height <- as.numeric(gender)-2 + rnorm(n, mean=5.5, sd=1)
RT1 <- rchisq(n, 4)
RT2 <- rchisq(n, 4)
DV <- 50 - 5*(as.numeric(gender)-1) + .1*IQ + rnorm(n)

df <- data.frame(awesome=DV, gender=gender, IQ=IQ, degree=degree, height=height, RT1=RT1, RT2=RT2)
head(df)

##    awesome gender        IQ degree   height      RT1      RT2
## 1 52.89486      M  79.40263     HS 6.292706 1.761990 2.228117
## 2 57.04782      M  96.31289     BA 4.759042 2.236027 7.363253
## 3 55.59594      M 104.95331     MS 5.560292 5.159114 7.126281
## 4 55.44397      M  84.23347    PhD 5.354595 7.652900 2.836956
## 5 55.45191      M 107.38754     HS 4.523997 7.768904 4.504664
## 6 54.70468      M 110.07096     BA 6.139368 7.786462 7.508488

Notice that in 4, the we see the output of the command head(df). This is useful for outputting tables and plots, as we’ll see.

Here are more resources on

Streamlined Data Workflow with dplyr

Download the data here:

Make sure you have the package (library(dplyr)), and if you don’t, install it (install.packages('dplyr')). Here’s an intro to dplyr straight from the horse’s (Hadley’s) mouth.

By way of data manipulation that I’m actually using for a real project, this tutorial will introduce you to

%>%
- a function chaining operator
mutate and transmute
- adds new variables and preserves existing (transmute drops existing variables)
mutate_each
- adds new variables, applying functions in funs(...) to each column
group_by
- groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
summarise
- summarizes multiple values to a single value
left_join
- joins two data frames, returning all rows from x, and all columns from x and y
select
- keeps the variables/columns that you mention (see ?select for special functions that help you select variables)
filter
- returns rows with matching conditions (similar to selecting columns like: aDataFrame[variable == value,])
group_by
- groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
summarise
- summarizes multiple values to a single value
do
- use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list (e.g., run the same lm or t.test separately for each gender, country)

Data Subset Prep

I’m going to focus on the World Values Survey because it’s awesome. You can download it in many formats, including an rdata file. I prepared a subset for you all.

It includes

some basic demographics
a question about how much time one spends thinking about meaning and purpose of life
ratings on importance of various things in life (e.g., family, friends, work)

setwd('C:/Users/jackflibb/Documents/UO/rClub/611_dplyr_wvs_etc')
load("WorldValuesSurvey.rdata")
ls()

##  [1] "degree"                                       
##  [2] "df"                                           
##  [3] "DV"                                           
##  [4] "gender"                                       
##  [5] "height"                                       
##  [6] "IQ"                                           
##  [7] "n"                                            
##  [8] "RT1"                                          
##  [9] "RT2"                                          
## [10] "WorldValuesSurvey-Wave6-2010-2014_v2014-11-07"

wvs<-`WorldValuesSurvey-Wave6-2010-2014_v2014-11-07`
dim(wvs)

## [1] 85070   387

countryNames<-cbind(names(attr(wvs,"label.table")[[2]]),attr(wvs,"label.table")[[2]])

library(dplyr)

smallerData<-wvs %>%
transmute(
    countryCode=V2,
    sex=V240, #1 male 2 female
    age=V242,
    nationalIncomeLadder=V239,
    meaningInLife=V143,
    imptcOfFamily=V4,
    imptcOfFriends=V5,
    imptcOfLeisure=V6,
    imptcOfPolitics=V7,
    imptcOfWork=V8,
    imptcOfReligion=V9)

write.csv(smallerData,'wvs_small.csv',row.names=F)
write.csv(countryNames,'country_codes.csv',row.names=F)

Don’t worry if you don’t understand some of the syntax yet – I just wanted to show you how I prepare the dataset.

Now we’re going to read in the datasets that you downloaded. Change the working directory to wherever you put the downloaded data files.

setwd('C:/Users/jackflibb/Documents/UO/rClub/611_dplyr_wvs_etc') # Change this
wvs<-read.csv('wvs_small.csv')
countryNames<-read.csv('country_codes.csv',stringsAsFactors=F)

head(wvs)

##   countryCode sex age nationalIncomeLadder meaningInLife imptcOfFamily
## 1          12   1  21                    5             2             1
## 2          12   2  24                    6             1             1
## 3          12   2  26                    6             1             1
## 4          12   2  28                    5             2             1
## 5          12   2  35                    7             2             1
## 6          12   1  36                    5             2             1
##   imptcOfFriends imptcOfLeisure imptcOfPolitics imptcOfWork
## 1              1              1              -2           1
## 2              2              3               4           2
## 3              3              2               4           2
## 4              1              3               4           3
## 5              1              1               2           1
## 6              2              2               2           4
##   imptcOfReligion
## 1               1
## 2               2
## 3               1
## 4               1
## 5               1
## 6               1

Using the chaining operator

As you may have noticed, we have a lot of missing data coded unhelpfully as negative values. The first thing I’ll show you is how to easily recode them to NA using the dplyr function mutate_each. Before I do that, let me introduce the new operator introduced by dplyr: %>%. It’s weird at first, but once you get used to it, it’s great.

Basically, %>% sends the output data frame of one function to the first argument of the next function. For example:

library(dplyr)

wvs %>% names()

##  [1] "countryCode"          "sex"                  "age"                 
##  [4] "nationalIncomeLadder" "meaningInLife"        "imptcOfFamily"       
##  [7] "imptcOfFriends"       "imptcOfLeisure"       "imptcOfPolitics"     
## [10] "imptcOfWork"          "imptcOfReligion"

We can easily add another step to the chain:

wvs %>% names() %>% paste('is a variable in the WVS data')

##  [1] "countryCode is a variable in the WVS data"         
##  [2] "sex is a variable in the WVS data"                 
##  [3] "age is a variable in the WVS data"                 
##  [4] "nationalIncomeLadder is a variable in the WVS data"
##  [5] "meaningInLife is a variable in the WVS data"       
##  [6] "imptcOfFamily is a variable in the WVS data"       
##  [7] "imptcOfFriends is a variable in the WVS data"      
##  [8] "imptcOfLeisure is a variable in the WVS data"      
##  [9] "imptcOfPolitics is a variable in the WVS data"     
## [10] "imptcOfWork is a variable in the WVS data"         
## [11] "imptcOfReligion is a variable in the WVS data"

This isn’t very useful but, it’s a little more interpretable than:

paste(names(wvs),'is a variable in the WVS data')

This will make things a lot more interpretable as we produce complex chains of transformations.

Code missing values using mutate_each

Okay, let’s set the negative values to NA. We’ll use mutate_each, which takes a data frame and applies the functions given to it by the function funs. So you could divide everything in half by doing yourData %>% mutate_each(funs(half = . / 2)).

wvs_NAs<-wvs %>%
mutate_each(funs(ifelse(.<=-1,NA,.)))

Basic summary stats with summarise, summarise_each, select, and filter

Remember describe from last week? The function summarise allows you to easily make your own sort of describe, tailored exactly to your data set.

Let’s look at the structure of our WVS data to decide what we want to do.

str(wvs_NAs)

## 'data.frame':    85070 obs. of  11 variables:
##  $ countryCode         : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ sex                 : int  1 2 2 2 2 1 1 2 1 2 ...
##  $ age                 : int  21 24 26 28 35 36 41 44 59 72 ...
##  $ nationalIncomeLadder: int  5 6 6 5 7 5 7 2 5 4 ...
##  $ meaningInLife       : int  2 1 1 2 2 2 2 2 2 2 ...
##  $ imptcOfFamily       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ imptcOfFriends      : int  1 2 3 1 1 2 1 1 1 1 ...
##  $ imptcOfLeisure      : int  1 3 2 3 1 2 1 1 1 1 ...
##  $ imptcOfPolitics     : int  NA 4 4 4 2 2 1 1 2 2 ...
##  $ imptcOfWork         : int  1 2 2 3 1 4 1 2 2 1 ...
##  $ imptcOfReligion     : int  1 2 1 1 1 1 1 2 2 1 ...

Looks like the demographics should be summarized differently than the ratings.

categorical_summary <- wvs_NAs %>%
summarise(
    N=n(), #a dplyr helper function
    N_countries=length(unique(countryCode)),
    Prop_Male=sum(sex==1,na.rm=T)/N # WOAH, you can use variables created within this function call to build other variables in the same funciton call!!
    )
categorical_summary

##       N N_countries Prop_Male
## 1 85070          59 0.4766428

Let’s look at the numbers for women only…

categorical_summary_F <- wvs_NAs %>%
filter(sex==2) %>% 
summarise(
    N=n(),
    N_countries=length(unique(countryCode)),
    Prop_Male=sum(sex==1,na.rm=T)/N 
    )
categorical_summary_F

##       N N_countries Prop_Male
## 1 44434          59         0

We can use summarise_each to take care of the rest. Like mutate_each, summarise_each takes a comma separated list of functions in funs(...), and optionally a specification of which variables/columns should be summarized. This is really easy in dplyr – we’re going to tell it to summarize all variables from age to imptcOfReligion just by passing in age:imptcOfReligion after our list of functions.

continuous_summary <- wvs_NAs %>% 
select(age:imptcOfReligion) %>% # you can use the : operator directly on variable names!
summarise_each( # glorious ..._each functions!
    funs(
        min=min(.,na.rm=T),
        max=max(.,na.rm=T),
        mean=mean(.,na.rm=T),
        median=median(.,na.rm=T),
        sd=sd(.,na.rm=T),
        Prop_missing=sum(is.na(.))/length(.)
        )
    )
continuous_summary

##   age_min nationalIncomeLadder_min meaningInLife_min imptcOfFamily_min
## 1      16                        1                 1                 1
##   imptcOfFriends_min imptcOfLeisure_min imptcOfPolitics_min
## 1                  1                  1                   1
##   imptcOfWork_min imptcOfReligion_min age_max nationalIncomeLadder_max
## 1               1                   1      99                       10
##   meaningInLife_max imptcOfFamily_max imptcOfFriends_max
## 1                 4                 4                  4
##   imptcOfLeisure_max imptcOfPolitics_max imptcOfWork_max
## 1                  4                   4               4
##   imptcOfReligion_max age_mean nationalIncomeLadder_mean
## 1                   4 42.05654                  4.908784
##   meaningInLife_mean imptcOfFamily_mean imptcOfFriends_mean
## 1           1.844841           1.107936            1.684175
##   imptcOfLeisure_mean imptcOfPolitics_mean imptcOfWork_mean
## 1             1.88321             2.612757         1.545598
##   imptcOfReligion_mean age_median nationalIncomeLadder_median
## 1             1.914088         40                           5
##   meaningInLife_median imptcOfFamily_median imptcOfFriends_median
## 1                    2                    1                     2
##   imptcOfLeisure_median imptcOfPolitics_median imptcOfWork_median
## 1                     2                      3                  1
##   imptcOfReligion_median   age_sd nationalIncomeLadder_sd meaningInLife_sd
## 1                      2 16.54851                2.104927        0.8582036
##   imptcOfFamily_sd imptcOfFriends_sd imptcOfLeisure_sd imptcOfPolitics_sd
## 1        0.3758205         0.7397572         0.8341943          0.9822105
##   imptcOfWork_sd imptcOfReligion_sd age_Prop_missing
## 1      0.8116806           1.064065      0.001798519
##   nationalIncomeLadder_Prop_missing meaningInLife_Prop_missing
## 1                        0.03605266                   0.015787
##   imptcOfFamily_Prop_missing imptcOfFriends_Prop_missing
## 1                0.003714588                 0.005442577
##   imptcOfLeisure_Prop_missing imptcOfPolitics_Prop_missing
## 1                  0.01120254                   0.01702128
##   imptcOfWork_Prop_missing imptcOfReligion_Prop_missing
## 1               0.01702128                   0.01498766

Dang, that’s kind of ugly. We can reshape these using the reshape2 package and output it to a table using kable from knitr.

library(knitr)
library(reshape2)

continuous_summary_melted <- melt(continuous_summary)

## No id variables; using all as measure variables

kable(categorical_summary,
    caption='Summary of Categorical Variables')

Summary of Categorical Variables
N	N_countries	Prop_Male
85070	59	0.4766428

kable(continuous_summary_melted,
    col.names=c('Statistic','Value'),
    digits=2,
    caption='Summary of Continuous Variables')

Summary of Continuous Variables
Statistic	Value
age_min	16.00
nationalIncomeLadder_min	1.00
meaningInLife_min	1.00
imptcOfFamily_min	1.00
imptcOfFriends_min	1.00
imptcOfLeisure_min	1.00
imptcOfPolitics_min	1.00
imptcOfWork_min	1.00
imptcOfReligion_min	1.00
age_max	99.00
nationalIncomeLadder_max	10.00
meaningInLife_max	4.00
imptcOfFamily_max	4.00
imptcOfFriends_max	4.00
imptcOfLeisure_max	4.00
imptcOfPolitics_max	4.00
imptcOfWork_max	4.00
imptcOfReligion_max	4.00
age_mean	42.06
nationalIncomeLadder_mean	4.91
meaningInLife_mean	1.84
imptcOfFamily_mean	1.11
imptcOfFriends_mean	1.68
imptcOfLeisure_mean	1.88
imptcOfPolitics_mean	2.61
imptcOfWork_mean	1.55
imptcOfReligion_mean	1.91
age_median	40.00
nationalIncomeLadder_median	5.00
meaningInLife_median	2.00
imptcOfFamily_median	1.00
imptcOfFriends_median	2.00
imptcOfLeisure_median	2.00
imptcOfPolitics_median	3.00
imptcOfWork_median	1.00
imptcOfReligion_median	2.00
age_sd	16.55
nationalIncomeLadder_sd	2.10
meaningInLife_sd	0.86
imptcOfFamily_sd	0.38
imptcOfFriends_sd	0.74
imptcOfLeisure_sd	0.83
imptcOfPolitics_sd	0.98
imptcOfWork_sd	0.81
imptcOfReligion_sd	1.06
age_Prop_missing	0.00
nationalIncomeLadder_Prop_missing	0.04
meaningInLife_Prop_missing	0.02
imptcOfFamily_Prop_missing	0.00
imptcOfFriends_Prop_missing	0.01
imptcOfLeisure_Prop_missing	0.01
imptcOfPolitics_Prop_missing	0.02
imptcOfWork_Prop_missing	0.02
imptcOfReligion_Prop_missing	0.01

Clean up that table

We might want to clean up that summary of continuous variables a bit. Let’s do that quickly using transmute and arrange.

continuous_summary_melted %>% 
transmute(
    item=sub('(\\w+?)_(.*)','\\1',variable),
    statistic=sub('(\\w+?)_(.*)','\\2',variable),
    value=value) %>%
arrange(item,statistic) %>%
kable(
    col.names=c('Item','Statistic','Value'),
    digits=2,
    caption='Summary of Continuous Variables')

Summary of Continuous Variables
Item	Statistic	Value
age	Prop_missing	0.00
age	max	99.00
age	mean	42.06
age	median	40.00
age	min	16.00
age	sd	16.55
imptcOfFamily	Prop_missing	0.00
imptcOfFamily	max	4.00
imptcOfFamily	mean	1.11
imptcOfFamily	median	1.00
imptcOfFamily	min	1.00
imptcOfFamily	sd	0.38
imptcOfFriends	Prop_missing	0.01
imptcOfFriends	max	4.00
imptcOfFriends	mean	1.68
imptcOfFriends	median	2.00
imptcOfFriends	min	1.00
imptcOfFriends	sd	0.74
imptcOfLeisure	Prop_missing	0.01
imptcOfLeisure	max	4.00
imptcOfLeisure	mean	1.88
imptcOfLeisure	median	2.00
imptcOfLeisure	min	1.00
imptcOfLeisure	sd	0.83
imptcOfPolitics	Prop_missing	0.02
imptcOfPolitics	max	4.00
imptcOfPolitics	mean	2.61
imptcOfPolitics	median	3.00
imptcOfPolitics	min	1.00
imptcOfPolitics	sd	0.98
imptcOfReligion	Prop_missing	0.01
imptcOfReligion	max	4.00
imptcOfReligion	mean	1.91
imptcOfReligion	median	2.00
imptcOfReligion	min	1.00
imptcOfReligion	sd	1.06
imptcOfWork	Prop_missing	0.02
imptcOfWork	max	4.00
imptcOfWork	mean	1.55
imptcOfWork	median	1.00
imptcOfWork	min	1.00
imptcOfWork	sd	0.81
meaningInLife	Prop_missing	0.02
meaningInLife	max	4.00
meaningInLife	mean	1.84
meaningInLife	median	2.00
meaningInLife	min	1.00
meaningInLife	sd	0.86
nationalIncomeLadder	Prop_missing	0.04
nationalIncomeLadder	max	10.00
nationalIncomeLadder	mean	4.91
nationalIncomeLadder	median	5.00
nationalIncomeLadder	min	1.00
nationalIncomeLadder	sd	2.10

Get country names using left_join

We also have a file with country names to go with our country codes. We’re going to start doing some analysis within-country, so let’s get those names into the data set. This is what some call a one-to-many merge. That is, we have a data frame with one row per country name, and we want to merge each of these rows with every row in the main data set that has a matching country code.

There are 4 join functions (?left_join for more info), of which left_join is our best choice. According to the docs it “return[s] all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.”

str(countryNames)

## 'data.frame':    195 obs. of  2 variables:
##  $ V1: chr  "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
##  $ V2: int  -5 -4 -3 -2 -1 4 8 12 16 20 ...

names(wvs_NAs)

##  [1] "countryCode"          "sex"                  "age"                 
##  [4] "nationalIncomeLadder" "meaningInLife"        "imptcOfFamily"       
##  [7] "imptcOfFriends"       "imptcOfLeisure"       "imptcOfPolitics"     
## [10] "imptcOfWork"          "imptcOfReligion"

#We use the same country code variable name, countryCode, as the main data set
names(countryNames)<-c('countryName','countryCode') 

#left_join will automatically find matching column names
wvs_NAs_cnames <- left_join(wvs_NAs,countryNames)

## Joining by: "countryCode"

head(wvs_NAs_cnames) #success!

##   countryCode sex age nationalIncomeLadder meaningInLife imptcOfFamily
## 1          12   1  21                    5             2             1
## 2          12   2  24                    6             1             1
## 3          12   2  26                    6             1             1
## 4          12   2  28                    5             2             1
## 5          12   2  35                    7             2             1
## 6          12   1  36                    5             2             1
##   imptcOfFriends imptcOfLeisure imptcOfPolitics imptcOfWork
## 1              1              1              NA           1
## 2              2              3               4           2
## 3              3              2               4           2
## 4              1              3               4           3
## 5              1              1               2           1
## 6              2              2               2           4
##   imptcOfReligion countryName
## 1               1     Algeria
## 2               2     Algeria
## 3               1     Algeria
## 4               1     Algeria
## 5               1     Algeria
## 6               1     Algeria

Summaries by country

We’re going to use the summarise_each function again, but this time instead of using select before hand, we can give it the columns we want as the second argument. I’ll use the helper function starts_with to get every column that starts with ‘imptcOf’ (our “Importance in life of…” variables).

wvs_NAs_cnames %>%
group_by(countryName) %>%
summarise_each(
    funs(
        mean=mean(.,na.rm=T)),
    starts_with('imptcOf')) %>%
kable(caption='Importance of ___ Item Means by Country')

Importance of ___ Item Means by Country
countryName	imptcOfFamily	imptcOfFriends	imptcOfLeisure	imptcOfPolitics	imptcOfWork	imptcOfReligion
Algeria	1.100418	1.762826	2.052453	2.719437	1.385800	1.118189
Argentina	1.123541	1.570039	1.870117	2.880905	1.580454	2.451581
Armenia	1.032022	1.764384	1.959032	2.912168	1.373383	1.531392
Australia	1.076923	1.475456	1.670654	2.482562	1.965909	2.824648
Azerbaijan	1.074850	1.791417	2.124751	2.926148	1.496008	2.039920
Bahrain	1.674457	1.894167	2.333333	2.174350	2.169321	1.753756
Belarus	1.130150	1.725361	1.888227	2.817822	1.835635	2.531003
Brazil	1.134007	1.855698	1.813514	2.751864	1.393122	1.595687
Chile	1.097000	1.871587	1.539314	3.064777	1.541538	2.297379
China	1.140907	1.596710	2.052751	2.543602	1.802098	3.381368
Colombia	1.155423	2.011258	1.664901	3.024585	1.258278	1.592323
Cyprus	1.073073	1.449449	1.543543	2.728745	1.464393	1.827655
Ecuador	1.021631	1.968386	1.593672	2.551581	1.169717	1.477537
Egypt	1.039396	1.743926	2.356533	2.166120	1.909389	1.065046
Estonia	1.151436	1.614781	1.796721	2.877457	1.730211	3.021220
Germany	1.251471	1.562072	1.811765	2.587573	1.806468	2.951399
Ghana	1.052835	1.795103	1.607603	2.611469	1.069588	1.087629
Hong Kong	1.172345	1.631000	1.794975	2.527694	1.929860	2.770541
India	1.466413	1.928390	2.213333	2.427936	1.750317	1.812817
Iraq	1.079167	1.665275	2.323258	2.731707	1.453255	1.180000
Japan	1.086523	1.602763	1.658804	2.062224	1.578105	3.115530
Jordan	1.043333	1.602500	1.956667	2.800000	1.491109	1.068447
Kazakhstan	1.080667	1.666667	1.811333	2.592000	1.585333	2.345333
Kuwait	1.075136	1.531153	1.975358	2.198367	1.290679	1.133758
Kyrgyzstan	1.041333	1.784000	2.146000	2.417333	1.446000	1.795333
Lebanon	1.358079	1.651182	1.986418	2.537511	1.556604	1.745704
Libya	1.036671	1.433540	1.836862	2.284668	1.319885	1.037987
Malaysia	1.027692	1.636923	1.758461	2.401078	1.220000	1.185385
Mexico	1.031500	1.879940	1.575288	2.614307	1.192000	1.627000
Morocco	1.101836	1.901338	2.517426	3.349817	1.225021	1.122704
Netherlands	1.174731	1.549497	1.653744	2.687833	1.929470	3.080558
New Zealand	1.064010	1.466340	1.578492	2.545568	1.820707	2.746231
Nigeria	1.014213	1.426379	1.623081	2.523891	1.314383	1.130756
Pakistan	1.058382	1.738731	2.359797	2.802852	1.407282	1.127410
Palestine	1.048048	1.740481	2.079879	2.508629	1.506694	1.161161
Peru	1.157025	2.228906	1.906723	2.743633	1.339211	1.712134
Philippines	1.015000	1.844037	2.468672	2.292540	1.109258	1.159167
Poland	1.082292	1.686397	1.771249	2.834375	1.462274	1.816857
Qatar	1.012264	1.287736	1.777148	2.155303	1.222641	1.013208
Romania	1.081333	2.077948	1.919679	3.089623	1.564926	1.661549
Russia	1.174578	1.887455	2.012225	2.958983	1.863577	2.636633
Rwanda	1.096267	1.313032	2.000655	2.290766	1.423707	1.910282
Singapore	1.077079	1.534483	1.722617	2.454868	1.743278	1.894416
Slovenia	1.118421	1.640449	1.717248	3.270423	1.706439	2.816384
South Africa	1.087635	1.769318	1.983138	2.563851	1.552554	1.623851
South Korea	1.102585	1.546366	1.809045	2.367845	1.525609	2.361881
Spain	1.093434	1.510638	1.632740	3.083051	1.564255	2.936279
Sweden	1.124688	1.323920	1.505034	2.281537	1.604039	3.006700
Taiwan	1.098294	1.632052	1.775678	2.795041	1.513934	2.366393
Thailand	1.131271	1.878028	1.986509	1.981292	1.420655	1.569992
Trinidad and Tobago	1.070211	1.916499	1.646881	2.769849	1.420363	1.314141
Tunisia	1.024147	1.768080	2.011064	2.714286	1.199499	1.064892
Turkey	1.048005	1.437148	1.707363	2.526415	1.720126	1.448750
Ukraine	1.092667	1.728000	1.884000	2.926667	1.815333	2.218000
United States	1.099505	1.503389	1.691369	2.429737	1.948058	2.023035
Uruguay	1.132132	1.695085	1.642351	2.965726	1.495960	2.715005
Uzbekistan	1.028000	1.577719	2.144392	2.616521	1.490262	1.976383
Yemen	1.044266	1.598178	2.341589	2.693499	1.492417	1.060120
Zimbabwe	1.036667	1.824000	1.844000	2.622667	1.150000	1.211333

Histograms of each variable by country

You can apply this same type of logic for running models for each country. We melt the data frame so that we get a row for every variable/value pair indexed by country. This makes it easy for ggplot to give us a a histogram for each variable in the same plot, using facet_wrap. If we didn’t use facet_wrap below, we’d get one histogram with the values of all the variables. Check out the first six, last, and a random sample of six lines of the melted data frame:

m_temp <- melt(wvs_NAs_cnames,id.vars='countryName')
kable(head(m_temp))

countryName	variable	value
Algeria	countryCode	12
Algeria	countryCode	12
Algeria	countryCode	12
Algeria	countryCode	12
Algeria	countryCode	12
Algeria	countryCode	12

kable(tail(m_temp))

	countryName	variable	value
935765	Zimbabwe	imptcOfReligion	1
935766	Zimbabwe	imptcOfReligion	1
935767	Zimbabwe	imptcOfReligion	1
935768	Zimbabwe	imptcOfReligion	1
935769	Zimbabwe	imptcOfReligion	1
935770	Zimbabwe	imptcOfReligion	1

kable(sample_n(m_temp,6),row.names=F)

countryName	variable	value
India	imptcOfPolitics	3
Egypt	age	60
Yemen	nationalIncomeLadder	5
Pakistan	imptcOfFamily	2
Japan	imptcOfPolitics	2
Yemen	imptcOfFamily	1

library(ggplot2)
#plotprintlist is a variable that is basically trash, but we could use it to print the plots again if we wanted
plotprintlist <- wvs_NAs_cnames %>%
melt(id.vars='countryName') %>%
group_by(countryName) %>%
do(
     plot=print(
        ggplot(.,aes(x=value))+ # set up the plot with the data frame '.' passed to `do`
            geom_histogram()+ # We want a histogram of the values from column 'value'
            facet_wrap(~variable,scales='free')+ # We want a facet (i.e., subplot) for every unique value of the column 'variable'
            ggtitle(unique(.$countryName)) # give it a title with the country name from the countrName column of '.'
            )
)

All the analysis

We can use the same technique to get a coefficients from a linear model for every country

wvs_NAs_cnames_modulus <- wvs_NAs_cnames %>%
mutate(age_ending = age %% 10, age_ending_9 = age_ending %in% 9)

library(biglm)

## Warning: package 'biglm' was built under R version 3.1.2

## Loading required package: DBI

## Warning: package 'DBI' was built under R version 3.1.2

models <- wvs_NAs_cnames_modulus %>%
group_by(countryName) %>%
do(
     mod=biglm(
            meaningInLife ~ 1 + 
            sex +
            age +
            nationalIncomeLadder +
            imptcOfFamily +
            imptcOfFriends +
            imptcOfLeisure +
            imptcOfPolitics +
            imptcOfWork +
            imptcOfReligion +
            age_ending_9,
            data=.)
     )

modelCoefs <- models %>%
do(
    data.frame(
        country=.$countryName,
        n=.$mod$n,
        ci=summary(.$mod)$mat[,'Coef']-summary(.$mod)$mat[,'(95%'],
        var=names(coef(.$mod)),
        coef=coef(.$mod))
    )

## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character

kable(modelCoefs[1:20,])

country	n	ci	var	coef
Algeria	1039	0.3973372	(Intercept)	2.2335079
Algeria	1039	0.1267847	sex	-0.0623652
Algeria	1039	0.0043268	age	-0.0039015
Algeria	1039	0.0307039	nationalIncomeLadder	-0.0802476
Algeria	1039	0.1543852	imptcOfFamily	-0.0267650
Algeria	1039	0.0729042	imptcOfFriends	-0.1187290
Algeria	1039	0.0687132	imptcOfLeisure	0.1551985
Algeria	1039	0.0594759	imptcOfPolitics	0.0364489
Algeria	1039	0.0846615	imptcOfWork	0.1338679
Algeria	1039	0.1517525	imptcOfReligion	0.0241294
Algeria	1039	0.2130212	age_ending_9TRUE	-0.0061899
Argentina	943	0.4998702	(Intercept)	1.3207945
Argentina	943	0.1302140	sex	-0.0232658
Argentina	943	0.0037675	age	0.0002063
Argentina	943	0.0427286	nationalIncomeLadder	0.0281963
Argentina	943	0.1869523	imptcOfFamily	-0.0332910
Argentina	943	0.0991168	imptcOfFriends	0.0729738
Argentina	943	0.0870559	imptcOfLeisure	-0.0057964
Argentina	943	0.0689547	imptcOfPolitics	0.1117997
Argentina	943	0.0829681	imptcOfWork	0.1206761

modelCoefs %>%
ggplot(aes(x=reorder(country,n),y=coef))+
geom_point(aes(size=n),alpha=.3,color='blue')+
geom_point(size=1,alpha=1,color='black')+
geom_errorbar(aes(ymin=coef-ci,ymax=coef+ci))+
facet_wrap(~var,scales='free_y',ncol=1)+
geom_hline(xintercept = 0,color='red',size=.5)+
theme(
    axis.text=element_text(size=10),
    axis.text.x=element_text(angle=315,hjust=0,vjust=1),
    panel.background=element_rect(fill='white'),
    strip.text=element_text(size=10))+
labs(x='Country, ordered by sample size',y='Coefficient Estimate',title='How often do you contemplate the meaning of life?')