Preprocessing is all the things you have to do to your data before you can analyse it. In the Absolute Beginners’ Guide to R, the preprocessing was mostly done for you, and you just used read_csv to load in the preprocessed data. However, in most realistic situations, data does not come preprocessed. In this worksheet, we’ll look at preprocessing data from computerized experiments. Preprocessing data from these kinds of experiments typically comes in five parts: loading, tidying, filtering, summarizing, and combining. We’ll cover these in turn below.
In this first part of the worksheet, the commands you need to do each step are given to you. It’s important to take time to read all the instructions, try out the commands in RStudio for yourself, and read the descriptions and explanations of how they work. This is because, in the next section, you’ll be preprocessing another data set. You won’t be given any of the commands. Instead, you’ll adapt what you’ve learned to that new data set.
In order to get the data for this exercise, we’re going to load it from a git repository. The best-known host for git repositories is GitHub, and that’s the one we’ll use here. Git repositories are a common way of sharing code and data via the internet, and RStudio can easily make use of them. Here’s how:
Create a new RStudio project as before, but click on “Version Control” rather than “New Directory”.
Click on “Git”.
Enter the location of the repository into the first line. It’s: https://github.com/ajwills72/rminr-data
Click “Create Project”
That’s it! You should now see that your Files window in RStudio has a number of files showing, including a folder called rawdata. If so, you have successfully downloaded data from a git repository on github and put it in an RStudio project.
Click on the name ‘rawdata’ in the Files window of RStudio. You’ll see there are three CSV files, and one file called README.md. We’ll use the CSV files in a minute.
Each CSV file (e.g. subject-11.csv) contains the data from one participant in a facial prototypes experiment. The experiment was run using OpenSesame. OpenSesame, like R, is a free and open-source program. Install OpenSesame on your machine and run the experiment on yourself. To do this, you will need to download the OpenSesame experiment, which is in the expt-scripts folder of your R project. Click on expt-scripts, tick the box next to facialproto_short.osexp, click ‘More…’ and click ‘Export’. Now open the experiment on your machine using OpenSesame.
Within the rminr-data project, create a new script file and call it preproc.R. Put all your commands for this worksheet into this script, and save regularly. This will make it easier to see what you have done, especially when you come back to it after a break.
In this experiment, people are shown some pictures of male faces. Each is a picture of a real face, but its internal features (eyes, nose, mouth, etc.) have been digitally stretched or compressed either along the x-axis or the y-axis. Each of these manipulations of real faces is shown exactly once. Participants rate each picture for masculinity (1-8 scale, higher numbers = more masculine), just as a way of encouraging them to look at each picture closely.
After participants have been shown 32 pictures, they move to the test phase. They’re shown another 24 pictures, and have to rate their confidence that they’ve seen that exact picture before (0 = definitely not seen, 9 = definitely seen). The pictures they’re shown include the exact distortions they’ve seen before (seen), the undistorted versions of the faces (prototype), which they have not been shown before, and some other distortions of the same faces that they haven’t seen before (unseen).
The expected result is that people are confident they’ve seen the prototype pictures before, even though they haven’t. One interpretation of this result is that we average across the pictures we see. The prototype is the average of the pictures we’ve seen of that face, so it seems familiar even though we’ve not seen it before.
In order to work out whether we get the expected result, we need to know the mean confidence rating each participant gave for each face type (seen, unseen, prototype).
We’ll start by loading the data for one of the participants, subject-11.csv. Recall that we can do this using the read_csv command. In this case, the CSV file we want to load is inside the rawdata folder of our project, so we have to say rawdata/subject-11.csv rather than just subject-11.csv, so RStudio knows where to look for the file.
Add the following comments and commands to preproc.R, and run the script:
# Preprocessing worksheet
# Load tidyverse
library(tidyverse)
# Load data
dat <- read_csv("rawdata/subject-11.csv")
Click on dat in the Environment tab and take a look at the file you’ve loaded. This is a typical output file for OpenSesame, but it seems quite overwhelming at first sight. It has 101 columns of data, many with unclear names. There are 56 rows of data, so 56 × 101 = 5,656 pieces of data in total, and this is just for one participant in one short experiment!
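You can confirm these dimensions at the console; dim reports the number of rows, then the number of columns:
# Check the size of the data frame: rows, then columns
dim(dat)
[1]  56 101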
Our first job is to tidy up this dataset so it’s clear and readable to the average human being.
The first thing you need to know to make sense of this dataset is that each row is one trial of the experiment. In a typical experiment, a trial begins with the presentation of a stimulus (in this case, a face) and ends with the participant making a response (in this case, a rating). Participants rate 32 pictures for masculinity, and then 24 pictures for confidence they’ve seen them before, leading to a total of 56 trials in this experiment, and hence 56 rows in this data frame.
This is what’s sometimes called tidy data, which means there is one row for each observation (each trial, in this case). It is also called long format data, because it has more rows and fewer columns than the main alternative, which is to put all data from a single participant on the same row. This is called wide format, and in this case would result in a dataset with 1 row and 5656 columns.
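To make the distinction concrete, here is a minimal sketch using a made-up mini dataset (the names toy_long, subj, trial and rating are just for illustration); tidyr’s pivot_wider command, which is part of the tidyverse, converts long format to wide:
# A made-up long-format dataset: one row per trial
toy_long <- tibble(subj   = c(1, 1, 2, 2),
                   trial  = c(1, 2, 1, 2),
                   rating = c(5, 7, 3, 8))
# Convert to wide format: one row per participant, one column per trial
toy_wide <- toy_long %>%
  pivot_wider(names_from = trial, values_from = rating)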
Most commands in R assume your data is in long format, so it’s good news our raw data is also in that format, even if it still needs a bunch of tidying up before we can analyse it. The first step is to go through the 101 columns and find those we actually need.
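Rather than scrolling through the spreadsheet view, you can also inspect the columns at the console. Two standard commands for this are colnames and glimpse:
# List all 101 column names
colnames(dat)
# Compact preview: each column's name, type, and first few values
glimpse(dat)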
It’s important that we are able to anonymously identify the participant who generated the data. So, we’re going to need their participant number, which is in the subject_nr column.
Columns to keep: subject_nr
Although this data file contains the participant number, this may not always be the case for other data sets. If you ever need to add participant numbers yourself, take a look at the more on preprocessing worksheet.
One thing we’ll definitely need to know is how the participant responded on each trial of the experiment. If you scroll through the columns of dat, which are organised alphabetically, you’ll find a column called response. You’ll see that each row contains a number from 0 to 9. So, this is the rating the participant made on that trial. We’ll need to know that to analyse this data, so response is one of the columns we need to keep.
Columns to keep: subject_nr, response
We also need to know what kind of response the participant was making – a masculinity rating (first part of the experiment), or a confidence rating (second part). When an experiment has two or more different parts, we call those parts phases. And you’ll find there is a column called phase, which has the entries exposure (the first part of the experiment) and test (the second part). So phase is another column we’ll need.
Columns to keep: subject_nr, phase, response
In the second phase of the experiment, there are three different types of trial - unseen faces, seen faces, and prototypes. We’ll need to know what type of stimulus was presented in order to analyse these data. Scroll to the column called type, and you’ll see that for the second phase (rows 33 onwards), it contains this information. For the first phase it contains NA, meaning this information is not relevant for the first phase. So type is another column we’ll need to keep.
Columns to keep: subject_nr, phase, type, response
There are many trials in each phase of this experiment, so it makes sense to include a column that says which trial within the phase each row refers to. Take a look at the data — click on dat in the Environment panel if you have not already done so, and this will open a spreadsheet-like view of the data in the top left window of RStudio. If you scroll through the columns, you’ll find one called live_row. You’ll notice it counts up from 0 to 31, and then from 0 to 23. So, this column contains trial numbers, first for the masculinity-rating part of the experiment, and then for the confidence-rating part. Counting from zero might seem a bit weird, but it’s quite common in data collected by a computer.
Columns to keep: subject_nr, phase, live_row, type, response
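As an aside, if you ever wanted trial numbers that count from 1 rather than 0, you could shift the column using dplyr’s mutate command — a sketch, not needed for this worksheet:
# Optional: shift 0-based trial numbers to count from 1 (not needed here)
dat <- dat %>% mutate(live_row = live_row + 1)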
You may have noticed that the list of ‘columns to keep’ is not in alphabetical order. Instead, it follows the conventional ordering of participant (subject_nr), position in experiment (phase, live_row), stimulus (type), response. Most experiments report their data in this order. Having conventions like these makes it easier for others to read and understand our analyses.
Having worked out which columns we need, we now tidy things up by selecting just the columns we need and putting them into a new data frame. This is done using R’s select command. Add the following to your script and run it:
# Select columns; place into 'dat_subset'
dat_subset <- dat %>% select(subject_nr, phase, live_row, type, response)
Explanation of command - Our original data frame dat is piped (sent, via %>%) to the select command, which picks just the columns we name. These selected columns are put (<-) into a new dataframe called dat_subset.
Click on dat_subset in the Environment window. You’ll see we have a much easier-to-read data frame, still with 56 rows, but now only 5 columns.
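select is more flexible than this one use suggests: a minus sign drops a column, and helpers such as starts_with match column names by pattern. A quick sketch (illustrations only; not needed for this worksheet):
# Keep everything except the response column
dat %>% select(-response)
# Keep all columns whose names start with "resp"
dat %>% select(starts_with("resp"))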
When working with data, it’s useful to have meaningful column names, because it makes it easier to remember what they contain. The column live_row could be more clearly named trial, as that’s the information it contains. Also, subject_nr is longer than it needs to be; subj is just as clear, and quicker to type. We rename columns in R using the set_names command. Add the following to your script and run it:
# Rename columns; place into 'tidydat'
tidydat <- dat_subset %>%
set_names(c("subj", "phase", "trial", "type", "response"))
Explanation of command - The command c() means concatenate, i.e. put together. So c("subj", "phase", "trial", "type", "response") is a way of putting the five names together so they can be sent somewhere. We use the set_names function to change the column names of the dataframe to this list of five names. The result is assigned (<-) to a new variable, tidydat.
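An alternative worth knowing about is dplyr’s rename command, which changes only the columns you name, using new_name = old_name pairs. A sketch that produces the same result as the set_names command above:
# Equivalent renaming with rename (new_name = old_name pairs)
tidydat <- dat_subset %>%
  rename(subj = subject_nr, trial = live_row)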
We now have a clear, tidy dataframe (tidydat), which we can start to analyse. The expected result is that the participant will show high confidence ratings for having seen the prototype, despite not having seen it before. Does this participant show this pattern of results?
Our predictions are about the test phase, but this data frame also contains data about the exposure phase (the masculinity ratings). So, the first thing we have to do is filter the data so that it only contains the test phase. As covered in the Absolute Beginners’ Guide to R, we use the filter command to do this, telling it which parts of the data we want to keep. Add the following to your script and run it:
# Filter test phase data into 'testdat'
testdat <- tidydat %>% filter(phase == "test")
Explanation of command - The tidydat data is passed (%>%) to the filter command, which keeps only those rows (trials) where phase == "test", i.e. where the column phase contains the word test. This filtered data is then written (<-) to a new data frame called testdat.
Click on testdat in the Environment window. You’ll see that we now have just 24 rows, all from the test phase.
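filter can also combine several conditions at once, separated by commas; a row is kept only if all the conditions are true. A hypothetical example, not needed for this worksheet:
# Keep test-phase trials with ratings above 4 (hypothetical example)
tidydat %>% filter(phase == "test", response > 4)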
We can now use the group_by and summarise commands covered in the Absolute Beginners’ Guide to R to answer our question. Add the following to your script and run it:
# Group data by 'type', display mean of 'response'
testdat %>% group_by(type) %>% summarise(mean(response))
# A tibble: 3 × 2
type `mean(response)`
<chr> <dbl>
1 prototype 4.75
2 seen 3.88
3 unseen 5
Explanation of command - Our test-phase data (testdat) is piped to group_by(type), which groups it into the three parts given by the type column (seen, unseen, prototype). This grouped data is then sent to summarise to work out a summary value for each group. We tell summarise what summary we want; in this case we want a mean. And we tell the mean command that the data we want a mean of is in the response column.
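By default, the summary column is labelled mean(response), as in the output above. If you’d prefer a cleaner name, you can supply one inside summarise — for example (the name conf here is just an illustration):
# Give the summary column an explicit name (optional)
testdat %>% group_by(type) %>% summarise(conf = mean(response))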
As before, you can safely ignore the “ungrouping” message that you receive. Looking at the output, we can see that this participant was more confident they’d seen the prototypes (4.75) than that they’d seen the pictures they’d actually seen before (3.88). This fits our hypothesis. But, somewhat oddly, they were even more confident about other distortions they hadn’t seen (5). An alternative hypothesis (and the correct one in this case) is that this particular participant just randomly pressed the buttons, so any difference in the three scores is down to chance.
Experiments generally involve testing relatively large numbers of people, in order to achieve good statistical power. This means that another important part of preprocessing is to combine the data from different participants into a single data frame, so we can analyse the data from all the participants at the same time.
We’ll start by loading in data from a second participant into a different dataframe. Add the following to your script and run it:
# Load data from second participant
dat2 <- read_csv("rawdata/subject-12.csv")
To combine these two participants into a single dataframe, we use the bind_rows command. Add the following to your script and run it:
# Combine the two data sets; place into 'alldat'
alldat <- bind_rows(dat, dat2)
Explanation of command - The bind_rows command takes the dat2 dataframe and adds it to the end of the dat dataframe. This combined data set is then written (<-) to a new dataframe called alldat.
Look at alldat in the Environment window; you’ll see it now has 112 rows: the 56 rows from subject-11, followed by the 56 rows from subject-12.
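bind_rows isn’t limited to two dataframes — you can pass it as many as you like. For example, you could add the third participant by hand like this:
# Load a third participant and combine all three by hand
dat3 <- read_csv("rawdata/subject-13.csv")
alldat <- bind_rows(dat, dat2, dat3)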
Generally, we have a lot more than two people in an experiment. You could use the same technique to combine the data of tens, hundreds, or thousands of participants. However, that would take a really long time and be quite tedious. Fortunately, R allows us to speed up this process so that, even if we have tens of thousands of participants, we can quickly combine their data without error.
The first thing we need to do to combine many participants is to get a list of the names of their data files (e.g. subject-11.csv). We do this using the list.files command. Add the following to your script and run it:
# Display list of filenames
tibble(filename = list.files("rawdata", "*.csv", full.names = TRUE))
# A tibble: 3 × 1
filename
<chr>
1 rawdata/subject-11.csv
2 rawdata/subject-12.csv
3 rawdata/subject-13.csv
Explanation of command: We say tibble(filename = to tell R to make a new dataframe, with a column containing filenames. The first part of list.files, "rawdata", tells R which folder the data files are in. Generally, it’s a good idea to keep all your data files inside a single folder, so it is easier for commands like this to find them. The second part of list.files, "*.csv", tells R to only give the names of files that end in .csv (the "*" acts as a wildcard, standing for any sequence of characters). This is useful, because sometimes raw data folders contain other files, too. For example, rawdata contains README.md, a file that provides information relating to the data, rather than being data itself. The third part, full.names = TRUE, tells R to give the name of the file including the name of the folder it is in, so rawdata/subject-11.csv rather than just subject-11.csv.
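To see what list.files does on its own, try running it without the tibble wrapper; it returns a plain character vector of matching file names:
# list.files on its own returns a character vector
list.files("rawdata", "*.csv", full.names = TRUE)
[1] "rawdata/subject-11.csv" "rawdata/subject-12.csv" "rawdata/subject-13.csv"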
We know how to make a new dataframe containing a list of filenames to process. We can use this to load each file in turn and combine them into a single large dataset.
To do this, we make use of another tidyverse function called do. The do function performs an operation for each group in a dataframe. All of the results are joined into one large, combined dataframe. All we need to do is:
Make a dataframe containing the list of file names to be used
Remind R to read the files one at a time (explanation below)
Tell it which function to use to read the raw files
The entire command looks like this; enter it into your script and run it:
# Combine data from all participants into 'alldat'
alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names = TRUE)) %>%
group_by(filename) %>%
do(read_csv(.$filename))
Explanation of command: At the start of line 1 we write alldat <-. This means the results will be saved with the name alldat. Then we reuse the code from above to create a new dataframe with one column: filename. We then use group_by to tell R to process each filename individually. If we didn’t do this, R would ask read_csv to open all the files at once, and read_csv would be confused!
When we write do(read_csv(.$filename)) we are telling do to apply the read_csv function to each filename. The .$filename part is shorthand: the period . means “the group in the data I am working with now”, and the $filename part means “use the values from the filename column”. So read_csv(.$filename) means: read the CSV file using the value in the filename column.
If you look at alldat in the Environment window, you’ll see that it has 168 rows. Using do has combined all three participants, each with 56 trials, into one big data frame. Although with three participants we could have done this as quickly in other ways, generally we have much more data than this - typically somewhere between 30 and 300 participants in a single experiment. Using do saves a lot of time in these situations.
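As an aside, there are other ways to achieve the same combination. For example, recent versions of readr (2.0 and later) can read a vector of files in a single call, and purrr’s map_dfr applies a function to each file and row-binds the results. A sketch, not needed for this worksheet:
# Two alternatives to do() for combining files (sketches)
files <- list.files("rawdata", "*.csv", full.names = TRUE)
# readr 2.0+: read all the files at once, recording each file's name
alldat <- read_csv(files, id = "filename")
# purrr: apply read_csv to each file and bind the rows together
alldat <- map_dfr(set_names(files), read_csv, .id = "filename")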
Of course, this combined data file has the same problems as the individual files that make it up – there are over 100 columns, most of which we don’t need. We can fix this with the same commands as before.
Add this comment to your script:
# Select and rename columns; place into 'tidydat'
Next add some commands to select the relevant columns, put them into tidydat, and then rename the columns.
Run those commands. Now click on tidydat; you’ll see a human-readable data frame with 168 rows and 5 columns. This is the full dataset for the experiment, so we can now start analysing it.
Now we have this combined file, we summarise the test phase data for each participant using the same commands as before. We just need to include subj in the group_by command, i.e. group_by(subj, type), so we get a separate summary for each person.
Add the following comment to your script:
# Filter test phase data into 'test'
# Group 'test' by 'type' and 'subj', calculate mean of 'response', place into 'test.sum'
Now, fill in the blanks, i.e. insert and run some commands to filter the data to the test phase, and put that data into a data frame called test. Then use test to group by subject and trial type, summarise using the mean response, and put the resulting summary into test.sum. If you’ve got this right, you’ll end up with the following summary. You can view it either by typing test.sum, or by clicking on test.sum in the Environment window:
# A tibble: 9 × 3
# Groups: filename [3]
filename type `mean(response)`
<chr> <chr> <dbl>
1 rawdata/subject-11.csv prototype 4.75
2 rawdata/subject-11.csv seen 3.88
3 rawdata/subject-11.csv unseen 5
4 rawdata/subject-12.csv prototype 5.5
5 rawdata/subject-12.csv seen 4.75
6 rawdata/subject-12.csv unseen 5.12
7 rawdata/subject-13.csv prototype 4.88
8 rawdata/subject-13.csv seen 4.38
9 rawdata/subject-13.csv unseen 4.38
Looking at the above summary, we can see that all three participants were more confident they’d seen the prototype than the pictures they’d actually seen, which supports our hypothesis. But they’re also all at least as confident about other unseen distortions, too. So, again, an alternative hypothesis is that participants were just pressing keys randomly. This was in fact the case for these data. In the next part of the worksheet, you’ll look at some data from a different experiment where the participants took the task more seriously.
In this section of the worksheet, you’ll preprocess the data from a different experiment. You’ll find the data in the folder called lexdec, and the OpenSesame implementation of this experiment in the expt-scripts folder.
In this experiment, words are shown on the screen one at a time. Some are real words, others are made-up words (non-words). The participant’s job is to decide whether each is a word or non-word, as quickly as possible. The computer records for each decision whether they got it correct, and how long it took them to respond (their response time). The experiment begins with a practice phase, so participants can get used to the task. This is followed by the test phase, which is the part we analyse.
Your task is to write R code that gives the mean reaction time for each participant, for both words and non-words. You should analyse only reaction times for trials in which they made the correct response, and you should not include the practice phase in your analysis.
This can, and should, be done with fewer than 15 lines of R code (plus comments, see below). When your code is working correctly, it will give you the following numbers:
# A tibble: 6 × 3
# Groups: subj [3]
subj type `mean(rt)`
<dbl> <chr> <dbl>
1 11 nonword 1158.
2 11 word 888.
3 12 nonword 1339
4 12 word 1818
5 13 nonword 1458.
6 13 word 1526
Other requirements:
Use short, meaningful names for data frames and column names.
Use comments in your code to make it more human readable.
Comments are any line that begins with #, and they are ignored by R. They are there to make your code easier for humans to understand. For example:
# Create a dataframe containing all the filenames we want to read
raw.data.files <- tibble(filename = list.files("rawdata", "*.csv", full.names = TRUE))
Hints: Your code should first get a list of files, then use do to load all those files in and combine them. Next you’ll have to find the relevant columns among the 100+ columns of the data file. Once you’ve found them, use select to pick out just those columns and use set_names to give the selected columns better names. Use filter to remove the practice phase and keep only the correct responses. Note that accuracy in this data file is a number, not a word, so the correct phrase is something like acc == 1, rather than acc == "1".
Finally, use group_by and summarise to report the mean response time for words and non-words for each participant.
Good luck!
Once you’re getting the right answers, paste your R code into PsycEL.
This material is distributed under a Creative Commons licence (CC-BY-SA 4.0).