611 Highlights in R (Part II)



611 Highlights in R (Part II)


The raw .R file is here and the Rmd file is here.

Brief ‘Hello’ to R Markdown

Let me first introduce you to R markdown and kitr. This format allows you to integrate your analyses with text that can be automatically translated to html, word, and pdf formats. I just produced a supplementary materials document using this and it probably saved me a ridiculous number of hours. It’s easy to get started:

  1. Save your file with extension .Rmd.
  2. Place something like the following in the very top of your file:

    ---
    author: John Flournoy
    title: 611 Highlights in R (Part II)
    output: html_document
    ---
  3. Write as you normally would, using Markdown syntax for formatting. For example:

    # This is a big heading
    
    1. this is
    2. a numbered list
    
    - this is a bulleted list
    - With **bold** and *italic*  text
  4. When you want to throw in some R code, place it inside an R chunk by wrapping it in ```{r} at the beginning and ``` at the end. When you do, it comes out something like this:

    # From Rose's presentation:
    # a toy data set for us to work with
    n <- 100
    gender <- as.factor(c(rep("M", n/2), rep("F", n/2)))
    IQ <- rnorm(n, mean=100, sd=15)
    degree <- rep(c("HS", "BA", "MS", "PhD"), n/4)
    height <- as.numeric(gender)-2 + rnorm(n, mean=5.5, sd=1)
    RT1 <- rchisq(n, 4)
    RT2 <- rchisq(n, 4)
    DV <- 50 - 5*(as.numeric(gender)-1) + .1*IQ + rnorm(n)
    
    df <- data.frame(awesome=DV, gender=gender, IQ=IQ, degree=degree, height=height, RT1=RT1, RT2=RT2)
    head(df)
    ##    awesome gender        IQ degree   height      RT1      RT2
    ## 1 52.89486      M  79.40263     HS 6.292706 1.761990 2.228117
    ## 2 57.04782      M  96.31289     BA 4.759042 2.236027 7.363253
    ## 3 55.59594      M 104.95331     MS 5.560292 5.159114 7.126281
    ## 4 55.44397      M  84.23347    PhD 5.354595 7.652900 2.836956
    ## 5 55.45191      M 107.38754     HS 4.523997 7.768904 4.504664
    ## 6 54.70468      M 110.07096     BA 6.139368 7.786462 7.508488

Notice that in 4, the we see the output of the command head(df). This is useful for outputting tables and plots, as we’ll see.

Here are more resources on

Streamlined Data Workflow with dplyr

Download the data here:

Make sure you have the package (library(dplyr)), and if you don’t, install it (install.packages('dplyr')). Here’s an intro to dplyr straight from the horse’s (Hadley’s) mouth.

By way of data manipulation that I’m actually using for a real project, this tutorial will introduce you to

  • %>%
    • a function chaining operator
  • mutate and transmute
    • adds new variables and preserves existing (transmute drops existing variables)
  • mutate_each
    • adds new variables, applying functions in funs(...) to each column
  • group_by
    • groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
  • summarise
    • summarizes multiple values to a single value
  • left_join
    • joins two data frames, returning all rows from x, and all columns from x and y
  • select
    • keeps the variables/columns that you mention (see ?select for special functions that help you select variables)
  • filter
    • returns rows with matching conditions (similar to selecting columns like: aDataFrame[variable == value,])
  • group_by
    • groups data set according to some variable (e.g., gender) – other dplyr operations will be done on groups
  • summarise
    • summarizes multiple values to a single value
  • do
    • use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list (e.g., run the same lm or t.test separately for each gender, country)

Data Subset Prep

I’m going to focus on the World Values Survey because it’s awesome. You can download it in many formats, including an rdata file. I prepared a subset for you all.

It includes

  • some basic demographics
  • a question about how much time one spends thinking about meaning and purpose of life
  • ratings on importance of various things in life (e.g., family, friends, work)
setwd('C:/Users/jackflibb/Documents/UO/rClub/611_dplyr_wvs_etc')
load("WorldValuesSurvey.rdata")
ls()
##  [1] "degree"                                       
##  [2] "df"                                           
##  [3] "DV"                                           
##  [4] "gender"                                       
##  [5] "height"                                       
##  [6] "IQ"                                           
##  [7] "n"                                            
##  [8] "RT1"                                          
##  [9] "RT2"                                          
## [10] "WorldValuesSurvey-Wave6-2010-2014_v2014-11-07"
wvs<-`WorldValuesSurvey-Wave6-2010-2014_v2014-11-07`
dim(wvs)
## [1] 85070   387
countryNames<-cbind(names(attr(wvs,"label.table")[[2]]),attr(wvs,"label.table")[[2]])

library(dplyr)

smallerData<-wvs %>%
transmute(
    countryCode=V2,
    sex=V240, #1 male 2 female
    age=V242,
    nationalIncomeLadder=V239,
    meaningInLife=V143,
    imptcOfFamily=V4,
    imptcOfFriends=V5,
    imptcOfLeisure=V6,
    imptcOfPolitics=V7,
    imptcOfWork=V8,
    imptcOfReligion=V9)

write.csv(smallerData,'wvs_small.csv',row.names=F)
write.csv(countryNames,'country_codes.csv',row.names=F)

Don’t worry if you don’t understand some of the syntax yet – I just wanted to show you how I prepare the dataset.

Now we’re going to read in the datasets that you downloaded. Change the working directory to wherever you put the downloaded data files.

setwd('C:/Users/jackflibb/Documents/UO/rClub/611_dplyr_wvs_etc') # Change this
wvs<-read.csv('wvs_small.csv')
countryNames<-read.csv('country_codes.csv',stringsAsFactors=F)

head(wvs)
##   countryCode sex age nationalIncomeLadder meaningInLife imptcOfFamily
## 1          12   1  21                    5             2             1
## 2          12   2  24                    6             1             1
## 3          12   2  26                    6             1             1
## 4          12   2  28                    5             2             1
## 5          12   2  35                    7             2             1
## 6          12   1  36                    5             2             1
##   imptcOfFriends imptcOfLeisure imptcOfPolitics imptcOfWork
## 1              1              1              -2           1
## 2              2              3               4           2
## 3              3              2               4           2
## 4              1              3               4           3
## 5              1              1               2           1
## 6              2              2               2           4
##   imptcOfReligion
## 1               1
## 2               2
## 3               1
## 4               1
## 5               1
## 6               1

Using the chaining operator

As you may have noticed, we have a lot of missing data coded unhelpfully as negative values. The first thing I’ll show you is how to easily recode them to NA using the dplyr function mutate_each. Before I do that, let me introduce the new operator introduced by dplyr: %>%. It’s weird at first, but once you get used to it, it’s great.

Basically, %>% sends the output data frame of one function to the first argument of the next function. For example:

library(dplyr)

wvs %>% names()
##  [1] "countryCode"          "sex"                  "age"                 
##  [4] "nationalIncomeLadder" "meaningInLife"        "imptcOfFamily"       
##  [7] "imptcOfFriends"       "imptcOfLeisure"       "imptcOfPolitics"     
## [10] "imptcOfWork"          "imptcOfReligion"

We can easily add another step to the chain:

wvs %>% names() %>% paste('is a variable in the WVS data')
##  [1] "countryCode is a variable in the WVS data"         
##  [2] "sex is a variable in the WVS data"                 
##  [3] "age is a variable in the WVS data"                 
##  [4] "nationalIncomeLadder is a variable in the WVS data"
##  [5] "meaningInLife is a variable in the WVS data"       
##  [6] "imptcOfFamily is a variable in the WVS data"       
##  [7] "imptcOfFriends is a variable in the WVS data"      
##  [8] "imptcOfLeisure is a variable in the WVS data"      
##  [9] "imptcOfPolitics is a variable in the WVS data"     
## [10] "imptcOfWork is a variable in the WVS data"         
## [11] "imptcOfReligion is a variable in the WVS data"

This isn’t very useful but, it’s a little more interpretable than:

paste(names(wvs),'is a variable in the WVS data')

This will make things a lot more interpretable as we produce complex chains of transformations.

Code missing values using mutate_each

Okay, let’s set the negative values to NA. We’ll use mutate_each, which takes a data frame and applies the functions given to it by the function funs. So you could divide everything in half by doing yourData %>% mutate_each(funs(half = . / 2)).

wvs_NAs<-wvs %>%
mutate_each(funs(ifelse(.<=-1,NA,.)))

Basic summary stats with summarise, summarise_each, select, and filter

Remember describe from last week? The function summarise allows you to easily make your own sort of describe, tailored exactly to your data set.

Let’s look at the structure of our WVS data to decide what we want to do.

str(wvs_NAs)
## 'data.frame':    85070 obs. of  11 variables:
##  $ countryCode         : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ sex                 : int  1 2 2 2 2 1 1 2 1 2 ...
##  $ age                 : int  21 24 26 28 35 36 41 44 59 72 ...
##  $ nationalIncomeLadder: int  5 6 6 5 7 5 7 2 5 4 ...
##  $ meaningInLife       : int  2 1 1 2 2 2 2 2 2 2 ...
##  $ imptcOfFamily       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ imptcOfFriends      : int  1 2 3 1 1 2 1 1 1 1 ...
##  $ imptcOfLeisure      : int  1 3 2 3 1 2 1 1 1 1 ...
##  $ imptcOfPolitics     : int  NA 4 4 4 2 2 1 1 2 2 ...
##  $ imptcOfWork         : int  1 2 2 3 1 4 1 2 2 1 ...
##  $ imptcOfReligion     : int  1 2 1 1 1 1 1 2 2 1 ...

Looks like the demographics should be summarized differently than the ratings.

categorical_summary <- wvs_NAs %>%
summarise(
    N=n(), #a dplyr helper function
    N_countries=length(unique(countryCode)),
    Prop_Male=sum(sex==1,na.rm=T)/N # WOAH, you can use variables created within this function call to build other variables in the same funciton call!!
    )
categorical_summary
##       N N_countries Prop_Male
## 1 85070          59 0.4766428

Let’s look at the numbers for women only…

categorical_summary_F <- wvs_NAs %>%
filter(sex==2) %>% 
summarise(
    N=n(),
    N_countries=length(unique(countryCode)),
    Prop_Male=sum(sex==1,na.rm=T)/N 
    )
categorical_summary_F
##       N N_countries Prop_Male
## 1 44434          59         0

We can use summarise_each to take care of the rest. Like mutate_each, summarise_each takes a comma separated list of functions in funs(...), and optionally a specification of which variables/columns should be summarized. This is really easy in dplyr – we’re going to tell it to summarize all variables from age to imptcOfReligion just by passing in age:imptcOfReligion after our list of functions.

continuous_summary <- wvs_NAs %>% 
select(age:imptcOfReligion) %>% # you can use the : operator directly on variable names!
summarise_each( # glorious ..._each functions!
    funs(
        min=min(.,na.rm=T),
        max=max(.,na.rm=T),
        mean=mean(.,na.rm=T),
        median=median(.,na.rm=T),
        sd=sd(.,na.rm=T),
        Prop_missing=sum(is.na(.))/length(.)
        )
    )
continuous_summary
##   age_min nationalIncomeLadder_min meaningInLife_min imptcOfFamily_min
## 1      16                        1                 1                 1
##   imptcOfFriends_min imptcOfLeisure_min imptcOfPolitics_min
## 1                  1                  1                   1
##   imptcOfWork_min imptcOfReligion_min age_max nationalIncomeLadder_max
## 1               1                   1      99                       10
##   meaningInLife_max imptcOfFamily_max imptcOfFriends_max
## 1                 4                 4                  4
##   imptcOfLeisure_max imptcOfPolitics_max imptcOfWork_max
## 1                  4                   4               4
##   imptcOfReligion_max age_mean nationalIncomeLadder_mean
## 1                   4 42.05654                  4.908784
##   meaningInLife_mean imptcOfFamily_mean imptcOfFriends_mean
## 1           1.844841           1.107936            1.684175
##   imptcOfLeisure_mean imptcOfPolitics_mean imptcOfWork_mean
## 1             1.88321             2.612757         1.545598
##   imptcOfReligion_mean age_median nationalIncomeLadder_median
## 1             1.914088         40                           5
##   meaningInLife_median imptcOfFamily_median imptcOfFriends_median
## 1                    2                    1                     2
##   imptcOfLeisure_median imptcOfPolitics_median imptcOfWork_median
## 1                     2                      3                  1
##   imptcOfReligion_median   age_sd nationalIncomeLadder_sd meaningInLife_sd
## 1                      2 16.54851                2.104927        0.8582036
##   imptcOfFamily_sd imptcOfFriends_sd imptcOfLeisure_sd imptcOfPolitics_sd
## 1        0.3758205         0.7397572         0.8341943          0.9822105
##   imptcOfWork_sd imptcOfReligion_sd age_Prop_missing
## 1      0.8116806           1.064065      0.001798519
##   nationalIncomeLadder_Prop_missing meaningInLife_Prop_missing
## 1                        0.03605266                   0.015787
##   imptcOfFamily_Prop_missing imptcOfFriends_Prop_missing
## 1                0.003714588                 0.005442577
##   imptcOfLeisure_Prop_missing imptcOfPolitics_Prop_missing
## 1                  0.01120254                   0.01702128
##   imptcOfWork_Prop_missing imptcOfReligion_Prop_missing
## 1               0.01702128                   0.01498766

Dang, that’s kind of ugly. We can reshape these using the reshape2 package and output it to a table using kable from knitr.

library(knitr)
library(reshape2)

continuous_summary_melted <- melt(continuous_summary)
## No id variables; using all as measure variables
kable(categorical_summary,
    caption='Summary of Categorical Variables')
Summary of Categorical Variables
N N_countries Prop_Male
85070 59 0.4766428
kable(continuous_summary_melted,
    col.names=c('Statistic','Value'),
    digits=2,
    caption='Summary of Continuous Variables')
Summary of Continuous Variables
Statistic Value
age_min 16.00
nationalIncomeLadder_min 1.00
meaningInLife_min 1.00
imptcOfFamily_min 1.00
imptcOfFriends_min 1.00
imptcOfLeisure_min 1.00
imptcOfPolitics_min 1.00
imptcOfWork_min 1.00
imptcOfReligion_min 1.00
age_max 99.00
nationalIncomeLadder_max 10.00
meaningInLife_max 4.00
imptcOfFamily_max 4.00
imptcOfFriends_max 4.00
imptcOfLeisure_max 4.00
imptcOfPolitics_max 4.00
imptcOfWork_max 4.00
imptcOfReligion_max 4.00
age_mean 42.06
nationalIncomeLadder_mean 4.91
meaningInLife_mean 1.84
imptcOfFamily_mean 1.11
imptcOfFriends_mean 1.68
imptcOfLeisure_mean 1.88
imptcOfPolitics_mean 2.61
imptcOfWork_mean 1.55
imptcOfReligion_mean 1.91
age_median 40.00
nationalIncomeLadder_median 5.00
meaningInLife_median 2.00
imptcOfFamily_median 1.00
imptcOfFriends_median 2.00
imptcOfLeisure_median 2.00
imptcOfPolitics_median 3.00
imptcOfWork_median 1.00
imptcOfReligion_median 2.00
age_sd 16.55
nationalIncomeLadder_sd 2.10
meaningInLife_sd 0.86
imptcOfFamily_sd 0.38
imptcOfFriends_sd 0.74
imptcOfLeisure_sd 0.83
imptcOfPolitics_sd 0.98
imptcOfWork_sd 0.81
imptcOfReligion_sd 1.06
age_Prop_missing 0.00
nationalIncomeLadder_Prop_missing 0.04
meaningInLife_Prop_missing 0.02
imptcOfFamily_Prop_missing 0.00
imptcOfFriends_Prop_missing 0.01
imptcOfLeisure_Prop_missing 0.01
imptcOfPolitics_Prop_missing 0.02
imptcOfWork_Prop_missing 0.02
imptcOfReligion_Prop_missing 0.01

Clean up that table

We might want to clean up that summary of continuous variables a bit. Let’s do that quickly using transmute and arrange.

continuous_summary_melted %>% 
transmute(
    item=sub('(\\w+?)_(.*)','\\1',variable),
    statistic=sub('(\\w+?)_(.*)','\\2',variable),
    value=value) %>%
arrange(item,statistic) %>%
kable(
    col.names=c('Item','Statistic','Value'),
    digits=2,
    caption='Summary of Continuous Variables')
Summary of Continuous Variables
Item Statistic Value
age Prop_missing 0.00
age max 99.00
age mean 42.06
age median 40.00
age min 16.00
age sd 16.55
imptcOfFamily Prop_missing 0.00
imptcOfFamily max 4.00
imptcOfFamily mean 1.11
imptcOfFamily median 1.00
imptcOfFamily min 1.00
imptcOfFamily sd 0.38
imptcOfFriends Prop_missing 0.01
imptcOfFriends max 4.00
imptcOfFriends mean 1.68
imptcOfFriends median 2.00
imptcOfFriends min 1.00
imptcOfFriends sd 0.74
imptcOfLeisure Prop_missing 0.01
imptcOfLeisure max 4.00
imptcOfLeisure mean 1.88
imptcOfLeisure median 2.00
imptcOfLeisure min 1.00
imptcOfLeisure sd 0.83
imptcOfPolitics Prop_missing 0.02
imptcOfPolitics max 4.00
imptcOfPolitics mean 2.61
imptcOfPolitics median 3.00
imptcOfPolitics min 1.00
imptcOfPolitics sd 0.98
imptcOfReligion Prop_missing 0.01
imptcOfReligion max 4.00
imptcOfReligion mean 1.91
imptcOfReligion median 2.00
imptcOfReligion min 1.00
imptcOfReligion sd 1.06
imptcOfWork Prop_missing 0.02
imptcOfWork max 4.00
imptcOfWork mean 1.55
imptcOfWork median 1.00
imptcOfWork min 1.00
imptcOfWork sd 0.81
meaningInLife Prop_missing 0.02
meaningInLife max 4.00
meaningInLife mean 1.84
meaningInLife median 2.00
meaningInLife min 1.00
meaningInLife sd 0.86
nationalIncomeLadder Prop_missing 0.04
nationalIncomeLadder max 10.00
nationalIncomeLadder mean 4.91
nationalIncomeLadder median 5.00
nationalIncomeLadder min 1.00
nationalIncomeLadder sd 2.10

Get country names using left_join

We also have a file with country names to go with our country codes. We’re going to start doing some analysis within-country, so let’s get those names into the data set. This is what some call a one-to-many merge. That is, we have a data frame with one row per country name, and we want to merge each of these rows with every row in the main data set that has a matching country code.

There are 4 join functions (?left_join for more info), of which left_join is our best choice. According to the docs it “return[s] all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.”

str(countryNames)
## 'data.frame':    195 obs. of  2 variables:
##  $ V1: chr  "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
##  $ V2: int  -5 -4 -3 -2 -1 4 8 12 16 20 ...
names(wvs_NAs)
##  [1] "countryCode"          "sex"                  "age"                 
##  [4] "nationalIncomeLadder" "meaningInLife"        "imptcOfFamily"       
##  [7] "imptcOfFriends"       "imptcOfLeisure"       "imptcOfPolitics"     
## [10] "imptcOfWork"          "imptcOfReligion"
#We use the same country code variable name, countryCode, as the main data set
names(countryNames)<-c('countryName','countryCode') 

#left_join will automatically find matching column names
wvs_NAs_cnames <- left_join(wvs_NAs,countryNames)
## Joining by: "countryCode"
head(wvs_NAs_cnames) #success!
##   countryCode sex age nationalIncomeLadder meaningInLife imptcOfFamily
## 1          12   1  21                    5             2             1
## 2          12   2  24                    6             1             1
## 3          12   2  26                    6             1             1
## 4          12   2  28                    5             2             1
## 5          12   2  35                    7             2             1
## 6          12   1  36                    5             2             1
##   imptcOfFriends imptcOfLeisure imptcOfPolitics imptcOfWork
## 1              1              1              NA           1
## 2              2              3               4           2
## 3              3              2               4           2
## 4              1              3               4           3
## 5              1              1               2           1
## 6              2              2               2           4
##   imptcOfReligion countryName
## 1               1     Algeria
## 2               2     Algeria
## 3               1     Algeria
## 4               1     Algeria
## 5               1     Algeria
## 6               1     Algeria

Summaries by country

We’re going to use the summarise_each function again, but this time instead of using select before hand, we can give it the columns we want as the second argument. I’ll use the helper function starts_with to get every column that starts with ‘imptcOf’ (our “Importance in life of…” variables).

wvs_NAs_cnames %>%
group_by(countryName) %>%
summarise_each(
    funs(
        mean=mean(.,na.rm=T)),
    starts_with('imptcOf')) %>%
kable(caption='Importance of ___ Item Means by Country')
Importance of ___ Item Means by Country
countryName imptcOfFamily imptcOfFriends imptcOfLeisure imptcOfPolitics imptcOfWork imptcOfReligion
Algeria 1.100418 1.762826 2.052453 2.719437 1.385800 1.118189
Argentina 1.123541 1.570039 1.870117 2.880905 1.580454 2.451581
Armenia 1.032022 1.764384 1.959032 2.912168 1.373383 1.531392
Australia 1.076923 1.475456 1.670654 2.482562 1.965909 2.824648
Azerbaijan 1.074850 1.791417 2.124751 2.926148 1.496008 2.039920
Bahrain 1.674457 1.894167 2.333333 2.174350 2.169321 1.753756
Belarus 1.130150 1.725361 1.888227 2.817822 1.835635 2.531003
Brazil 1.134007 1.855698 1.813514 2.751864 1.393122 1.595687
Chile 1.097000 1.871587 1.539314 3.064777 1.541538 2.297379
China 1.140907 1.596710 2.052751 2.543602 1.802098 3.381368
Colombia 1.155423 2.011258 1.664901 3.024585 1.258278 1.592323
Cyprus 1.073073 1.449449 1.543543 2.728745 1.464393 1.827655
Ecuador 1.021631 1.968386 1.593672 2.551581 1.169717 1.477537
Egypt 1.039396 1.743926 2.356533 2.166120 1.909389 1.065046
Estonia 1.151436 1.614781 1.796721 2.877457 1.730211 3.021220
Germany 1.251471 1.562072 1.811765 2.587573 1.806468 2.951399
Ghana 1.052835 1.795103 1.607603 2.611469 1.069588 1.087629
Hong Kong 1.172345 1.631000 1.794975 2.527694 1.929860 2.770541
India 1.466413 1.928390 2.213333 2.427936 1.750317 1.812817
Iraq 1.079167 1.665275 2.323258 2.731707 1.453255 1.180000
Japan 1.086523 1.602763 1.658804 2.062224 1.578105 3.115530
Jordan 1.043333 1.602500 1.956667 2.800000 1.491109 1.068447
Kazakhstan 1.080667 1.666667 1.811333 2.592000 1.585333 2.345333
Kuwait 1.075136 1.531153 1.975358 2.198367 1.290679 1.133758
Kyrgyzstan 1.041333 1.784000 2.146000 2.417333 1.446000 1.795333
Lebanon 1.358079 1.651182 1.986418 2.537511 1.556604 1.745704
Libya 1.036671 1.433540 1.836862 2.284668 1.319885 1.037987
Malaysia 1.027692 1.636923 1.758461 2.401078 1.220000 1.185385
Mexico 1.031500 1.879940 1.575288 2.614307 1.192000 1.627000
Morocco 1.101836 1.901338 2.517426 3.349817 1.225021 1.122704
Netherlands 1.174731 1.549497 1.653744 2.687833 1.929470 3.080558
New Zealand 1.064010 1.466340 1.578492 2.545568 1.820707 2.746231
Nigeria 1.014213 1.426379 1.623081 2.523891 1.314383 1.130756
Pakistan 1.058382 1.738731 2.359797 2.802852 1.407282 1.127410
Palestine 1.048048 1.740481 2.079879 2.508629 1.506694 1.161161
Peru 1.157025 2.228906 1.906723 2.743633 1.339211 1.712134
Philippines 1.015000 1.844037 2.468672 2.292540 1.109258 1.159167
Poland 1.082292 1.686397 1.771249 2.834375 1.462274 1.816857
Qatar 1.012264 1.287736 1.777148 2.155303 1.222641 1.013208
Romania 1.081333 2.077948 1.919679 3.089623 1.564926 1.661549
Russia 1.174578 1.887455 2.012225 2.958983 1.863577 2.636633
Rwanda 1.096267 1.313032 2.000655 2.290766 1.423707 1.910282
Singapore 1.077079 1.534483 1.722617 2.454868 1.743278 1.894416
Slovenia 1.118421 1.640449 1.717248 3.270423 1.706439 2.816384
South Africa 1.087635 1.769318 1.983138 2.563851 1.552554 1.623851
South Korea 1.102585 1.546366 1.809045 2.367845 1.525609 2.361881
Spain 1.093434 1.510638 1.632740 3.083051 1.564255 2.936279
Sweden 1.124688 1.323920 1.505034 2.281537 1.604039 3.006700
Taiwan 1.098294 1.632052 1.775678 2.795041 1.513934 2.366393
Thailand 1.131271 1.878028 1.986509 1.981292 1.420655 1.569992
Trinidad and Tobago 1.070211 1.916499 1.646881 2.769849 1.420363 1.314141
Tunisia 1.024147 1.768080 2.011064 2.714286 1.199499 1.064892
Turkey 1.048005 1.437148 1.707363 2.526415 1.720126 1.448750
Ukraine 1.092667 1.728000 1.884000 2.926667 1.815333 2.218000
United States 1.099505 1.503389 1.691369 2.429737 1.948058 2.023035
Uruguay 1.132132 1.695085 1.642351 2.965726 1.495960 2.715005
Uzbekistan 1.028000 1.577719 2.144392 2.616521 1.490262 1.976383
Yemen 1.044266 1.598178 2.341589 2.693499 1.492417 1.060120
Zimbabwe 1.036667 1.824000 1.844000 2.622667 1.150000 1.211333

Histograms of each variable by country

You can apply this same type of logic for running models for each country. We melt the data frame so that we get a row for every variable/value pair indexed by country. This makes it easy for ggplot to give us a a histogram for each variable in the same plot, using facet_wrap. If we didn’t use facet_wrap below, we’d get one histogram with the values of all the variables. Check out the first six, last, and a random sample of six lines of the melted data frame:

m_temp <- melt(wvs_NAs_cnames,id.vars='countryName')
kable(head(m_temp))
countryName variable value
Algeria countryCode 12
Algeria countryCode 12
Algeria countryCode 12
Algeria countryCode 12
Algeria countryCode 12
Algeria countryCode 12
kable(tail(m_temp))
countryName variable value
935765 Zimbabwe imptcOfReligion 1
935766 Zimbabwe imptcOfReligion 1
935767 Zimbabwe imptcOfReligion 1
935768 Zimbabwe imptcOfReligion 1
935769 Zimbabwe imptcOfReligion 1
935770 Zimbabwe imptcOfReligion 1
kable(sample_n(m_temp,6),row.names=F)
countryName variable value
India imptcOfPolitics 3
Egypt age 60
Yemen nationalIncomeLadder 5
Pakistan imptcOfFamily 2
Japan imptcOfPolitics 2
Yemen imptcOfFamily 1
library(ggplot2)
#plotprintlist is a variable that is basically trash, but we could use it to print the plots again if we wanted
plotprintlist <- wvs_NAs_cnames %>%
melt(id.vars='countryName') %>%
group_by(countryName) %>%
do(
     plot=print(
        ggplot(.,aes(x=value))+ # set up the plot with the data frame '.' passed to `do`
            geom_histogram()+ # We want a histogram of the values from column 'value'
            facet_wrap(~variable,scales='free')+ # We want a facet (i.e., subplot) for every unique value of the column 'variable'
            ggtitle(unique(.$countryName)) # give it a title with the country name from the countrName column of '.'
            )
)

All the analysis

We can use the same technique to get a coefficients from a linear model for every country

wvs_NAs_cnames_modulus <- wvs_NAs_cnames %>%
mutate(age_ending = age %% 10, age_ending_9 = age_ending %in% 9)

library(biglm)
## Warning: package 'biglm' was built under R version 3.1.2
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.1.2
models <- wvs_NAs_cnames_modulus %>%
group_by(countryName) %>%
do(
     mod=biglm(
            meaningInLife ~ 1 + 
            sex +
            age +
            nationalIncomeLadder +
            imptcOfFamily +
            imptcOfFriends +
            imptcOfLeisure +
            imptcOfPolitics +
            imptcOfWork +
            imptcOfReligion +
            age_ending_9,
            data=.)
     )

modelCoefs <- models %>%
do(
    data.frame(
        country=.$countryName,
        n=.$mod$n,
        ci=summary(.$mod)$mat[,'Coef']-summary(.$mod)$mat[,'(95%'],
        var=names(coef(.$mod)),
        coef=coef(.$mod))
    )
## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character
kable(modelCoefs[1:20,])
country n ci var coef
Algeria 1039 0.3973372 (Intercept) 2.2335079
Algeria 1039 0.1267847 sex -0.0623652
Algeria 1039 0.0043268 age -0.0039015
Algeria 1039 0.0307039 nationalIncomeLadder -0.0802476
Algeria 1039 0.1543852 imptcOfFamily -0.0267650
Algeria 1039 0.0729042 imptcOfFriends -0.1187290
Algeria 1039 0.0687132 imptcOfLeisure 0.1551985
Algeria 1039 0.0594759 imptcOfPolitics 0.0364489
Algeria 1039 0.0846615 imptcOfWork 0.1338679
Algeria 1039 0.1517525 imptcOfReligion 0.0241294
Algeria 1039 0.2130212 age_ending_9TRUE -0.0061899
Argentina 943 0.4998702 (Intercept) 1.3207945
Argentina 943 0.1302140 sex -0.0232658
Argentina 943 0.0037675 age 0.0002063
Argentina 943 0.0427286 nationalIncomeLadder 0.0281963
Argentina 943 0.1869523 imptcOfFamily -0.0332910
Argentina 943 0.0991168 imptcOfFriends 0.0729738
Argentina 943 0.0870559 imptcOfLeisure -0.0057964
Argentina 943 0.0689547 imptcOfPolitics 0.1117997
Argentina 943 0.0829681 imptcOfWork 0.1206761
modelCoefs %>%
ggplot(aes(x=reorder(country,n),y=coef))+
geom_point(aes(size=n),alpha=.3,color='blue')+
geom_point(size=1,alpha=1,color='black')+
geom_errorbar(aes(ymin=coef-ci,ymax=coef+ci))+
facet_wrap(~var,scales='free_y',ncol=1)+
geom_hline(xintercept = 0,color='red',size=.5)+
theme(
    axis.text=element_text(size=10),
    axis.text.x=element_text(angle=315,hjust=0,vjust=1),
    panel.background=element_rect(fill='white'),
    strip.text=element_text(size=10))+
labs(x='Country, ordered by sample size',y='Coefficient Estimate',title='How often do you contemplate the meaning of life?')



Comments are closed.