Preprocessing is all the things you have to do to your data before you can analyse it. In the Absolute Beginners’ Guide to R, the preprocessing was mostly done for you, and you just used read_csv to load in the preprocessed data. However, in most realistic situations, data does not come preprocessed. In this worksheet, we’ll look at preprocessing data from computerized experiments. Preprocessing data from these kinds of experiments typically comes in five parts: loading, tidying, filtering, summarizing, and combining. We’ll cover these in turn below.
In this first part of the worksheet, the commands you need to do each step are given to you. It’s important to take time to read all the instructions, try out the commands in RStudio for yourself, and read the descriptions and explanations of how they work. This is because, in the next section, you’ll be preprocessing another data set. You won’t be given any of the commands. Instead, you’ll adapt what you’ve learned to that new data set.
In order to get the data for this exercise, we’re going to load it from a git repository. The best-known host for git repositories is GitHub, and that’s the one we’ll use here. Git repositories are a common way of sharing code and data via the internet, and RStudio can easily make use of them. Here’s how:
Create a new RStudio project as before, but click on “Version Control” rather than “New Directory”.
Click on “Git”.
Enter the location of the repository into the first line. It’s: https://github.com/ajwills72/rminr-data
Click “Create Project”
That’s it! You should now see that your Files window in RStudio has a number of files showing, including a folder called rawdata. If so, you have successfully downloaded data from a git repository on github and put it in an RStudio project.
Click on the name ‘rawdata’ in the Files window of RStudio. You’ll see there are three CSV files, and one file called README.md. We’ll use the CSV files in a minute.
Each CSV file (e.g. subject-11.csv) contains the data from one participant in a facial prototypes experiment. The experiment was run using OpenSesame. OpenSesame, like R, is a free and open-source program. Install OpenSesame on your machine and run the experiment on yourself. To do this, you will need to download the OpenSesame experiment, which is in the expt-scripts folder of your R project. Click on expt-scripts, tick the box next to facialproto_short.osexp, click ‘More…’ and click ‘Export’. Now open the experiment on your machine using OpenSesame.
Within the rminr-data project, create a new script file and call it preproc.R. Put all your commands for this worksheet into this script, and save regularly. This will make it easier to see what you have done, especially when you come back to it after a break.
In this experiment, people are shown some pictures of male faces. Each is a picture of a real face, but its internal features (eyes, nose, mouth, etc.) have been digitally stretched or compressed either along the x-axis or the y-axis. Each of these manipulations of real faces is shown exactly once. Participants rate each picture for masculinity (1-8 scale, higher numbers = more masculine), just as a way of encouraging them to look at each picture closely.
After participants have been shown 32 pictures, they move to the test phase. They’re shown another 24 pictures, and have to rate their confidence that they’ve seen that exact picture before (0 = definitely not seen, 9 = definitely seen). The pictures they’re shown include the exact distortions they’ve seen before (seen), the undistorted versions of the faces (prototype), which they have not been shown before, and some other distortions of the same faces that they haven’t seen before (unseen).
The expected result is that people are confident they’ve seen the prototype pictures before, even though they haven’t. One interpretation of this result is that we average across the pictures we see. The prototype is the average of the pictures we’ve seen of that face, so it seems familiar even though we’ve not seen it before.
In order to work out whether we get the expected result, we need to know the mean confidence rating each participant gave for each face type (seen, unseen, prototype).
We’ll start by loading the data for one of the participants, subject-11.csv. Recall that we can do this using the read_csv command. In this case, the CSV file we want to load is inside the rawdata folder of our project, so we have to say rawdata/subject-11.csv rather than just subject-11.csv, so RStudio knows where to look for the file.
Add the following comments and commands to preproc.R, and run the script:
# Preprocessing worksheet
# Load tidyverse
library(tidyverse)
# Load data
dat <- read_csv("rawdata/subject-11.csv")
Click on dat in the Environment tab and take a look at the file you’ve loaded. This is a typical output file for OpenSesame, but it seems quite overwhelming at first sight. It has 101 columns of data, many with unclear names. There are 56 rows of data, so 56 × 101 = 5,656 pieces of data in total, and this is just for one participant in one short experiment!
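You can confirm these dimensions at the console; dim reports the number of rows, then the number of columns:
# Check the size of the data frame: rows, then columns
dim(dat)
[1]  56 101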
Our first job is to tidy up this dataset so it’s clear and readable to the average human being.
The first thing you need to know to make sense of this dataset is that each row is one trial of the experiment. In a typical experiment, a trial begins with the presentation of a stimulus (in this case, a face) and ends with the participant making a response (in this case, a rating). Participants rate 32 pictures for masculinity, and then 24 pictures for confidence they’ve seen them before, leading to a total of 56 trials in this experiment, and hence 56 rows in this data frame.
This is what’s sometimes called tidy data, which means there is one row for each observation (each trial, in this case). It is also called long format data, because it has more rows and fewer columns than the main alternative, which is to put all data from a single participant on the same row. This is called wide format, and in this case would result in a dataset with 1 row and 5656 columns.
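To make the distinction concrete, here is a minimal sketch using a made-up mini dataset (the names toy_long, subj, trial and rating are just for illustration); tidyr’s pivot_wider command, which is part of the tidyverse, converts long format to wide:
# A made-up long-format dataset: one row per trial
toy_long <- tibble(subj   = c(1, 1, 2, 2),
                   trial  = c(1, 2, 1, 2),
                   rating = c(5, 7, 3, 8))
# Convert to wide format: one row per participant, one column per trial
toy_wide <- toy_long %>%
  pivot_wider(names_from = trial, values_from = rating)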
Most commands in R assume your data is in long format, so it’s good news our raw data is also in that format, even if it still needs a bunch of tidying up before we can analyse it. The first step is to go through the 101 columns and find those we actually need.
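Rather than scrolling through the spreadsheet view, you can also inspect the columns at the console. Two standard commands for this are colnames and glimpse:
# List all 101 column names
colnames(dat)
# Compact preview: each column's name, type, and first few values
glimpse(dat)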
It’s important that we are able to anonymously identify the participant who generated the data. So, we’re going to need their participant number, which is in the subject_nr column.
Columns to keep: subject_nr
Although this data file contains the participant number, this may not always be the case for other data sets. If you ever need to add participant numbers yourself, take a look at the more on preprocessing worksheet.
One thing we’ll definitely need to know is how the participant responded on each trial of the experiment. If you scroll through the columns of dat, which are organised alphabetically, you’ll find a column called response. You’ll see that each row contains a number from 0 to 9. So, this is the rating the participant made on that trial. We’ll need to know that to analyse this data, so response is one of the columns we need to keep.
Columns to keep: subject_nr, response
We also need to know what kind of response the participant was making – a masculinity rating (first part of the experiment), or a confidence rating (second part). When an experiment has two or more different parts, we call those parts phases. And you’ll find there is a column called phase, which has the entries exposure (the first part of the experiment) and test (the second part). So phase is another column we’ll need.
Columns to keep: subject_nr, phase, response
In the second phase of the experiment, there are three different types of trial - unseen faces, seen faces, and prototypes. We’ll need to know what type of stimulus was presented in order to analyse these data. Scroll to the column called type, and you’ll see that for the second phase (rows 33 onwards), it contains this information. For the first phase it contains NA, meaning this information is not relevant for the first phase. So type is another column we’ll need to keep.
Columns to keep: subject_nr, phase, type, response
There are many trials in each phase of this experiment, so it makes sense to include a column that says which trial within the phase each row refers to. Take a look at the data — click on dat in the Environment panel if you have not already done so, and this will open a spreadsheet-like view of the data in the top left window of RStudio. If you scroll through the columns, you’ll find one called live_row. You’ll notice it counts up from 0 to 31, and then from 0 to 23. So, this column contains trial numbers, first for the masculinity-rating part of the experiment, and then for the confidence-rating part. Counting from zero might seem a bit weird, but it’s quite common in data collected by a computer.
Columns to keep: subject_nr, phase, live_row, type, response
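As an aside, if you ever wanted trial numbers that count from 1 rather than 0, you could shift the column using dplyr’s mutate command — a sketch, not needed for this worksheet:
# Optional: shift 0-based trial numbers to count from 1 (not needed here)
dat <- dat %>% mutate(live_row = live_row + 1)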
You may have noticed that the list of ‘columns to keep’ is not in alphabetical order. Instead, it follows the conventional ordering of participant (subject_nr), position in experiment (phase, live_row), stimulus (type), response. Most experiments report their data in this order. Having conventions like these makes it easier for others to read and understand our analyses.
Having worked out which columns we need, we now tidy things up by selecting just the columns we need and putting them into a new data frame. This is done using R’s select command. Add the following to your script and run it:
# Select columns; place into 'dat_subset'
dat_subset <- dat %>% select(subject_nr, phase, live_row, type, response)
Explanation of command - Our original data frame dat is piped (sent, via %>%) to the select command, which picks just the columns we name. These selected columns are put (<-) into a new dataframe called dat_subset.
Click on dat_subset in the Environment window. You’ll see we have a much easier-to-read data frame, still with 56 rows, but now only 5 columns.
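select is more flexible than this one use suggests: a minus sign drops a column, and helpers such as starts_with match column names by pattern. A quick sketch (illustrations only; not needed for this worksheet):
# Keep everything except the response column
dat %>% select(-response)
# Keep all columns whose names start with "resp"
dat %>% select(starts_with("resp"))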
When working with data, it’s useful to have meaningful column names, because it makes it easier to remember what they contain. The column live_row could be more clearly named trial, as that’s the information it contains. Also, subject_nr is longer than it needs to be; subj is just as clear, and quicker to type. We rename columns in R using the set_names command. Add the following to your script and run it:
# Rename columns; place into 'tidydat'
tidydat <- dat_subset %>%
set_names(c("subj", "phase", "trial", "type", "response"))
Explanation of command - The command c() means concatenate, i.e. put together. So c("subj", "phase", "trial", "type", "response") is a way of putting the five names together so they can be sent somewhere. We use the set_names function to change the column names of the dataframe to this list of five names. The result is assigned (<-) to a new variable, tidydat.
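An alternative worth knowing about is dplyr’s rename command, which changes only the columns you name, using new_name = old_name pairs. A sketch that produces the same result as the set_names command above:
# Equivalent renaming with rename (new_name = old_name pairs)
tidydat <- dat_subset %>%
  rename(subj = subject_nr, trial = live_row)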
We now have a clear, tidy dataframe (tidydat), which we can start to analyse. The expected result is that the participant will show high confidence ratings for having seen the prototype, despite not having seen it before. Does this participant show this pattern of results?
Our predictions are about the test phase, but this data frame also contains data about the exposure phase (the masculinity ratings). So, the first thing we have to do is filter the data so that it only contains the test phase. As covered in the Absolute Beginners’ Guide to R, we use the filter command to do this, telling it which parts of the data we want to keep. Add the following to your script and run it:
# Filter test phase data into 'testdat'
testdat <- tidydat %>% filter(phase == "test")
Explanation of command - The tidydat data is passed (%>%) to the filter command, which keeps only those rows (trials) where phase == "test", i.e. where the column phase contains the word test. This filtered data is then written (<-) to a new data frame called testdat.
Click on testdat in the Environment window. You’ll see that we now have just 24 rows, all from the test phase.
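filter can also combine several conditions at once, separated by commas; a row is kept only if all the conditions are true. A hypothetical example, not needed for this worksheet:
# Keep test-phase trials with ratings above 4 (hypothetical example)
tidydat %>% filter(phase == "test", response > 4)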
We can now use the group_by and summarise commands covered in the Absolute Beginners’ Guide to R to answer our question. Add the following to your script and run it:
# Group data by 'type', display mean of 'response'
testdat %>% group_by(type) %>% summarise(mean(response))
# A tibble: 3 × 2
type `mean(response)`
<chr> <dbl>
1 prototype 4.75
2 seen 3.88
3 unseen 5
Explanation of command - Our test-phase data (testdat) is piped to group_by(type), which groups it into the three parts given by the type column (seen, unseen, prototype). This grouped data is then sent to summarise to work out a summary value for each group. We tell summarise what summary we want; in this case we want a mean. And we tell the mean command that the data we want a mean of is in the response column.
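By default, the summary column is labelled mean(response), as in the output above. If you’d prefer a cleaner name, you can supply one inside summarise — for example (the name conf here is just an illustration):
# Give the summary column an explicit name (optional)
testdat %>% group_by(type) %>% summarise(conf = mean(response))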
As before, you can safely ignore the “ungrouping” message that you receive. Looking at the output, we can see that this participant was more confident they’d seen the prototypes (4.75) than that they’d seen the pictures they’d actually seen before (3.88). This fits our hypothesis. But, somewhat oddly, they were even more confident about other distortions they hadn’t seen (5). An alternative hypothesis (and the correct one in this case) is that this particular participant just randomly pressed the buttons, so any difference in the three scores is down to chance.
Experiments generally involve testing relatively large numbers of people, in order to achieve good statistical power. This means that another important part of preprocessing is to combine the data from different participants into a single data frame, so we can analyse the data from all the participants at the same time.
We’ll start by loading in data from a second participant into a different dataframe. Add the following to your script and run it:
# Load data from second participant
dat2 <- read_csv("rawdata/subject-12.csv")
To combine these two participants into a single dataframe, we use the bind_rows command. Add the following to your script and run it:
# Combine the two data sets; place into 'alldat'
alldat <- bind_rows(dat, dat2)
Explanation of command - The bind_rows command takes the dat2 dataframe and adds it to the end of the dat dataframe. This combined data set is then written (<-) to a new dataframe called alldat.
Look at alldat in the Environment window; you’ll see it now has 112 rows: the 56 rows from subject-11, followed by the 56 rows from subject-12.
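bind_rows isn’t limited to two dataframes — you can pass it as many as you like. For example, you could add the third participant by hand like this:
# Load a third participant and combine all three by hand
dat3 <- read_csv("rawdata/subject-13.csv")
alldat <- bind_rows(dat, dat2, dat3)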
Generally, we have a lot more than two people in an experiment. You could use the same technique to combine the data of tens, hundreds, or thousands of participants. However, that would take a really long time and be quite tedious. Fortunately, R allows us to speed up this process so that, even if we have tens of thousands of participants, we can quickly combine their data without error.
The first thing we need to do to combine many participants is to get a list of the names of their data files (e.g. subject-11.csv). We do this using the list.files command. Add the following to your script and run it:
# Display list of filenames
tibble(filename = list.files("rawdata", "*.csv", full.names = TRUE))
# A tibble: 3 × 1
filename
<chr>
1 rawdata/subject-11.csv
2 rawdata/subject-12.csv
3 rawdata/subject-13.csv
Explanation of command: We say tibble(filename = to tell R to make a new dataframe, with a column containing filenames. The first part of list.files, "rawdata", tells R which folder the data files are in. Generally, it’s a good idea to keep all your data files inside a single folder, so it is easier for commands like this to find them. The second part of list.files, "*.csv", tells R to only give the names of files that end in .csv (the "*" acts as a wildcard, standing for any sequence of characters). This is useful, because sometimes raw data folders contain other files, too. For example, rawdata contains README.md, a file that provides information relating to the data, rather than being data itself. The third part, full.names = TRUE, tells R to give the name of the file including the name of the folder it is in, so rawdata/subject-11.csv rather than just subject-11.csv.
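To see what list.files does on its own, try running it without the tibble wrapper; it returns a plain character vector of matching file names:
# list.files on its own returns a character vector
list.files("rawdata", "*.csv", full.names = TRUE)
[1] "rawdata/subject-11.csv" "rawdata/subject-12.csv" "rawdata/subject-13.csv"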
We know how to make a new dataframe containing a list of filenames to process. We can use this to load each file in turn and combine them into a single large dataset.
To do this, we make use of another tidyverse function called do. The do function performs an operation for each group in a dataframe. All of the results are joined into one large, combined dataframe. All we need to do is:
Make a dataframe containing the list of file names to be used
Remind R to read the files one at a time (explanation below)
Tell it which function to use to read the raw files
The entire command looks like this; enter it into your script and run it:
# Combine data from all participants into 'alldat'
alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names = TRUE)) %>%
group_by(filename) %>%
do(read_csv(.$filename))
Explanation of command: At the start of line 1 we write alldat <-. This means the results will be saved with the name alldat. Then we reuse the code from above to create a new dataframe with one column: filename. We then use group_by to tell R to process each filename individually. If we didn’t do this, R would ask read_csv to open all the files at once, and read_csv would be confused!
When we write do(read_csv(.$filename)) we are telling do to apply the read_csv function to each filename. The .$filename part is shorthand: the period . means “the group in the data I am working with now”, and the $filename part means “use the values from the filename column”. So read_csv(.$filename) means: read the CSV file using the value in the filename column.
If you look at alldat in the Environment window, you’ll see that it has 168 rows. Using do has combined all three participants, each with 56 trials, into one big data frame. Although with three participants we could have done this as quickly in other ways, generally we have much more data than this - typically somewhere between 30 and 300 participants in a single experiment. Using do saves a lot of time in these situations.
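As an aside, there are other ways to achieve the same combination. For example, recent versions of readr (2.0 and later) can read a vector of files in a single call, and purrr’s map_dfr applies a function to each file and row-binds the results. A sketch, not needed for this worksheet:
# Two alternatives to do() for combining files (sketches)
files <- list.files("rawdata", "*.csv", full.names = TRUE)
# readr 2.0+: read all the files at once, recording each file's name
alldat <- read_csv(files, id = "filename")
# purrr: apply read_csv to each file and bind the rows together
alldat <- map_dfr(set_names(files), read_csv, .id = "filename")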
Of course, this combined data file has the same problems as the individual files that make it up – there are over 100 columns, most of which we don’t need. We can fix this with the same commands as before.
Add this comment to your script:
# Select and rename columns; place into 'tidydat'
Next add some commands to select the relevant columns, put them into tidydat, and then rename the columns.
Run those commands. Now click on tidydat; you’ll see a human-readable data frame with 168 rows and 5 columns. This is the full dataset for the experiment, so we can now start analysing it.
Now we have this combined file, we summarise the test phase data for each participant using the same commands as before. We just need to include subj in the group_by command, i.e. group_by(subj, type), so we get a separate summary for each person.
Add the following comment to your script:
# Filter test phase data into 'test'
# Group 'test' by 'type' and 'subj', calculate mean of 'response', place into 'test.sum'
Now, fill in the blanks, i.e. insert and run some commands to filter the data to the test phase, and put that data into a data frame called test. Then use test to group by subject and trial type, summarise using the mean response, and put the resulting summary into test.sum. If you’ve got this right, you’ll end up with the following summary. You can view it either by typing test.sum, or by clicking on test.sum in the Environment window:
# A tibble: 9 × 3
# Groups: filename [3]
filename type `mean(response)`
<chr> <chr> <dbl>
1 rawdata/subject-11.csv prototype 4.75
2 rawdata/subject-11.csv seen 3.88
3 rawdata/subject-11.csv unseen 5
4 rawdata/subject-12.csv prototype 5.5
5 rawdata/subject-12.csv seen 4.75
6 rawdata/subject-12.csv unseen 5.12
7 rawdata/subject-13.csv prototype 4.88
8 rawdata/subject-13.csv seen 4.38
9 rawdata/subject-13.csv unseen 4.38
Looking at the above summary, we can see that all three participants were more confident they’d seen the prototype than the pictures they’d actually seen, which supports our hypothesis. But they’re also all at least as confident about other unseen distortions, too. So, again, an alternative hypothesis is that participants were just pressing keys randomly. This was in fact the case for these data. In the next part of the worksheet, you’ll look at some data from a different experiment where the participants took the task more seriously.
In this section of the worksheet, you’ll preprocess the data from a different experiment. You’ll find the data in the folder called lexdec, and the OpenSesame implementation of this experiment in the expt-scripts folder.
In this experiment, words are shown on the screen one at a time. Some are real words, others are made-up words (non-words). The participant’s job is to decide whether each is a word or non-word, as quickly as possible. The computer records for each decision whether they got it correct, and how long it took them to respond (their response time). The experiment begins with a practice phase, so participants can get used to the task. This is followed by the test phase, which is the part we analyse.
Your task is to write R code that gives the mean reaction time for each participant, for both words and non-words. You should analyse only reaction times for trials in which they made the correct response, and you should not include the practice phase in your analysis.
This can, and should, be done with fewer than 15 lines of R code (plus comments, see below). When your code is working correctly, it will give you the following numbers:
# A tibble: 6 × 3
# Groups: subj [3]
subj type `mean(rt)`
<dbl> <chr> <dbl>
1 11 nonword 1158.
2 11 word 888.
3 12 nonword 1339
4 12 word 1818
5 13 nonword 1458.
6 13 word 1526
Other requirements:
Use short, meaningful names for data frames and column names.
Use comments in your code to make it more human readable.
Comments are any line that begins with #, and they are ignored by R. They are there to make your code easier for humans to understand. For example:
# Create a dataframe containing all the filenames we want to read
raw.data.files <- tibble(filename = list.files("rawdata", "*.csv", full.names = TRUE))
Hints: Your code should first get a list of files, then use do to load all those files in and combine them. Next you’ll have to find the relevant columns among the 100+ columns of the data file. Once you’ve found them, use select to pick out just those columns and use set_names to give the selected columns better names. Use filter to remove the practice phase and keep only the correct responses. Note that accuracy in this data file is a number, not a word, so the correct phrase is something like acc == 1, rather than acc == "1".
Finally, use group_by and summarise to report the mean response time for words and non-words for each participant.
Good luck!
Once you’re getting the right answers, paste your R code into PsycEL.
This material is distributed under a Creative Commons licence (CC-BY-SA 4.0).