While a lot of us have grown comfortable using Excel to clean and manipulate our data, there is a growing trend toward transparency and reproduciblity that make it difficult to keep going down that road. It’s also just too easy to make a mistake with Excel. For example, someone I was helping out with an email campaign recently sent me an Excel file that was sorted only by one column. I just trusted them, rather than properly checking the file, and sent the email out and over 400 people got an email with the wrong name in the greeting. That was awkward and frustrating. For me, I love being able to look at my code, see where I brought in the raw data, and see all of the manipulations I did to clean it up. Excel doesn’t do that for me. Thanks to Hadley Wickham’s “dplyr” package it is surprisingly easy to manipulate data in R. I recommend printing out RStudio’s “Data Wrangling Cheat Sheet” and hanging it up somewhere visible if you do regularly manipulate data in R. Here is an example of some data manipulation that I recently did in R.
Step 1. Setup
I’ve set my working directory so that R knows what folder to retrieve the raw data files from. Alternatively, you can give R the whole file name including folders when you read in the CSV file and not bother setting a working directory.
I’ve loaded the “dplyr” package, which I installed earlier using the command install.packages(“dplyr”). One problem with dplyr is that it uses some function names that mean something different in base R or in other packages. I’ve run into a lot of errors and found that the best workaround is to simply tell R that when I say “select”, what I mean is use select from the dplyr package.
filter <- dplyr::filter
select <- dplyr::select
Step 2. Bring in Qualtrics data
Here are a couple of rules of thumb that I use:
- Only create one object per data file. It is really confusing to come back 6 months later and see that you have 15 objects that are different versions of the same dataset. I like to see only one version. Dplyr makes it easy to have only one object.
- I almost never refer to rows or columns by number. Column numbers and row numbers change every time you tweak the dataset. For best transparency, use some name or rule-based method to remove data or tweak it.
I’m going to read in the data as a CSV file. I recommend against trying to read it in as an Excel file. There are several packages that supposedly read Excel, but they don’t seem to have consistent performance, and there is no guarantee that your code will work later if you do read it in as Excel.
Notice that I used the “stringsAsFactors = FALSE” argument. By default, R will try to turn everything into factors, which is generally not what I want at all. I used the pipe operator “%>%” to let R know that I’m not done. Now I’m going to make some changes to this data. The pipe operator comes included with dplyr.
In our lab, we have all of our grad students run through the surveys before the real participants do to test it. I want to get rid of the grad student responses, so I filter out all observations that don’t have a student ID that starts with “95”. The grad students are supposed to put “Test” in this field, though John for some reason puts “666”. Filtering out all student ID’s that don’t start with “95” takes care of all of these test observations. Again, it ends with a pipe so that R knows there is more. Because I’m using pipes, I don’t even have to tell it what data I want to execute this command on. It already knows to look at the previous pipe row.
An English translation of this would be “In the row above, filter out all results where ‘ID’ doesn’t start with ’95’.” Grepl matches the standard expression I want, and filter removes those rows. In Qualtrics there is a second header row with really long, unwieldy descriptions. This will remove that row too. If all you wanted to do was remove that row of labels, you could simply remove it by position when you bring it in. Normally I don’t like to refer to rows by number, but I don’t think it does any harm to only remove the first row:
I try to make good, proper names for my variables in Qualtrics, but they always seem to get messed up. I inevitably end up renaming some of them:
adskep_3 = adskep_11,
adskep_4 = adskep_12,
adskep_5 = adskep_13,
adskep_6 = Q28_14,
adskep_7 = Q28_15,
adskep_8 = Q28_16,
adskep_9 = Q28_17,
Out_not_authentic = Symbol_5) %>%
Note that the name to the left of the equal sign is the new name. The name to the right is the messed up name that Qualtrics gave me.
Now, I’m telling R only to keep variables that have the stems I want:
In plain English, this would say “keep only the columns that have ‘cont’ or ‘symbol’ or ‘cred’ or ‘integ’ or ‘out’ or ‘adskep’ or ‘id’ or ‘qpq’ as part of their name.”
All of my variables were read in as character strings, so I will need to transform relevant columns to numeric format:
Using the “mutate_each” command from the dplyr package, I’ve transformed every column except for “ID” to numeric format.
I need a composite variable that is the mean of all variables from the four dimensions of my scale. You can use “mutate” to create a new variable.
In Qualtrics, I ran two conditions. I need a factor variable that tells me which condition the person was in. Right now I have two variables representing the two conditions. Each is a string of 1’s and NA’s. I only need one of these variables to make my new variable since condition “qpq” and “bonus” are mutually exclusive.
I created a new variable called “condition”, which is a factored version of “qpq”. When you create a factor in R, you can use the “exclude=NULL” argument to tell it that you want “NA” to be a factor level, rather than just representing missing data. Next, I used “select” to drop the “qpq” variable that has now become obsolete. Since I didn’t include a pipe operator at the end of my last command, all the code will now run and return my cleaned up data.
Step 3. Bring in a second dataset and merge it with the first
In our lab, we have students answer all of the demographic questions separately. We end up having to merge the data. This is ridiculously easy to do in R:
alldata <- left_join(qualtrics,demos, by="ID")
I’ve brought in the second dataset. People often end up filling out the demos multiple times for whatever reason. I don’t want duplicates because I will end up with duplicate data after I do the join. I have used the “distinct” function to get rid of redundant student ID’s. Then I used “left_join” to keep everything in my dataset on the left “qualtrics,” and to tack on data from my other dataset, “demos” wherever there is a match by the “ID” column, which both datasets have in this case. Again, it’s pretty easy to join two datasets.
The output of this process is three objects:
There is no data1, data2, data3, etc. Very clean, very transparent, and very easy to look back and see exactly what I did.
Here is all of the code in one place: