In the Intermediate Guide to R, we covered some basic techniques for preprocessing data from experiments; see the Preprocessing data from experiments and More preprocessing worksheets. In this worksheet, we'll cover some additional techniques for preprocessing experimental data.
To prepare for this worksheet:
- Open the rminr-data project we used previously.
- Open the Files tab. You should see a going-further folder which contains the files perruchet-raw.csv and perruchet-sum.csv.
- If you don't see the folder or files, ask git to "pull" the latest version of the repository. Select the Git tab, which is located in the same row of tabs as the Environment tab. Click the Pull button (the one with a downward-pointing arrow). A window will open showing the files which have been pulled from the repository. Close the Git pull window.
- Create a script named preproc-experiments.R in the rminr-data folder (the folder above case-studies). Put all the commands from this worksheet into this file, and run them from there. Save your script regularly.
We’ll start by loading some data from an undergraduate student dissertation on the Perruchet Effect. Participants were repeatedly shown a picture of a brown cylinder. This picture was sometimes followed by a loud unpleasant noise. Their reaction to the noise was measured using their galvanic skin response. In addition, every time the cylinder was shown, participants rated their expectancy that the noise was about to occur on a 1-5 scale.
Enter these comments and commands into your script, and run them:
# Further preprocessing for experiments
# Clear the environment
rm(list = ls())
# Load tidyverse
library(tidyverse)
# Load data into 'raw'
raw <- read_csv('going-further/perruchet-raw.csv')
# Rename columns of 'raw'
colnames(raw) <- c("subj", "time", "key", "value")
# Display first few rows of 'raw'
raw %>% head()
# A tibble: 6 × 4
subj time key value
<dbl> <dbl> <chr> <dbl>
1 200 50 GS 0
2 200 106 GS 0
3 200 153 GS 0
4 200 201 GS 0
5 200 253 GS 0
6 200 301 GS 0
Explanation of commands:
Command 1 clears the environment. Commands 2 and 3 should be familiar from previous worksheets; we nearly always need the tidyverse package, and we use the read_csv command to load the data. The going-further/ part of the filename going-further/perruchet-raw.csv says that the file perruchet-raw.csv is to be found in the directory (folder) called going-further. Command 4 renames the columns, as covered in the Preprocessing data from experiments worksheet. Command 5 prints the first few rows of raw.
Explanation of output:
Our data is in long format, i.e. one observation per row. It is a large file, containing almost two million observations in total. The four columns are as follows:
- subj: A number indicating which participant this data comes from.
- time: The time this observation was made, measured in milliseconds from the beginning of the experiment for that participant. Observations are made approximately every 50 ms.
- key: The type of observation being made. Mostly, these are GS observations, which are galvanic skin responses. These are measures of level of arousal in the participant. There are also ER observations, which are the participant's ratings of their expectancy of the unpleasant noise (you can check which keys are present for yourself; see the example after this list).
- value: The actual observation. Where key is GS, these are galvanic skin response values, which are measures of conductance (the reciprocal of electrical resistance). Higher values indicate higher arousal. Where key is ER, the scores are expectancy ratings on a scale of 1-5.
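If you want to confirm which types of observation the key column contains, one quick way to do so (not part of the original script) is to count the rows for each key:
# Count how many observations there are of each type of 'key'
raw %>% count(key)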
This experiment was run by three people over many different days. They tried to manually stitch all their data files together, by copying-and-pasting them into one big file in Excel. This type of manual manipulation of data is notoriously error prone, and in the process they accidentally duplicated some data. In this case, the problem is easy to detect and fix in R, because no two rows of this data can legitimately be identical - the subject number or the time will always differ. So, we can remove duplicates by telling R to give us just one copy of each of the different rows in the data set. The command distinct does that for us.
Enter this comment and command into your script, and run it:
# De-duplicate data
raw <- distinct(raw)
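To see what distinct is doing, here is a minimal sketch using a small made-up data frame (the values are purely illustrative, not part of the study data):
# A tiny made-up data frame in which the first row is duplicated
demo <- tibble(subj = c(1, 1, 2), time = c(50, 50, 50))
# distinct() keeps just one copy of each unique row, so the duplicate is dropped
distinct(demo)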
In this dissertation, the student first ran a pilot study, with just a few participants. On the basis of this pilot, they changed their experiment a bit, and then went on to collect data from many more participants. They only wanted to analyze the main study, so they removed the pilot participants. This was easy, because the participants in the pilot study all had participant numbers of 161 or lower, so they could use a simple filter command, such as we have used in many previous worksheets.
Enter this comment and command into your script, and run it:
# Remove pilot study
raw <- raw %>% filter(subj > 161)
As part of this experiment, participants were told to rate their expectancy of hearing an unpleasant noise, on every trial. However, they only had 8 seconds to do this, and the experimenters noticed that sometimes people failed to make a rating. The experimental apparatus recorded this as a rating of zero.
Participants who miss a lot of ratings aren’t really following the instructions, making their data hard to interpret. In such cases, we normally exclude those participants from further analysis. In order to do this, we need to find out how widespread the problem is. One way to do this is to calculate the number of zero ratings each participant makes.
Enter this comment and these commands into your script, and run them:
## Number of missed (zero) expectancy ratings, by participant
inspect <- raw %>% filter(key == "ER") %>% filter(value == 0)
table(inspect$subj)
163 164 166 176 184 188 191 196 199 202 223 233 235 236 239 246
1 19 46 16 2 3 2 1 32 3 3 1 3 9 34 1
The first command filters the raw data down to the rows that contain expectancy ratings (key == "ER"), and then to the cases where that rating is zero (value == 0). The result is stored in the data frame inspect. One could look through inspect manually, but it's more efficient and less error prone to get R to tell us how many times each participant has a zero rating. This is what the second command, table(inspect$subj), does - it counts the number of times each subject number occurs in inspect.
The output shows the number of trials on which each participant failed to give an expectancy rating. If a participant never made a zero rating, their participant number does not appear on the table. We can see that sixteen participants missed at least one rating, and that six participants missed more than 3 (one participant never made a rating, missing all 46 trials of the experiment!).
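If you prefer to stay within the tidyverse, an equivalent way of getting the same counts (shown here just for comparison, not part of the original script) is:
# Same counts as table(inspect$subj), returned as a data frame
inspect %>% count(subj)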
In this case, the students decided to exclude all participants who missed more than three ratings. There would have been a number of ways to do this, but the simplest in this case was just to manually put together a list of the subject numbers they wanted to exclude.
Enter this comment and these commands into your script, and run them:
## Exclude participants
exclude <- c(164, 166, 176, 199, 236, 239)
raw <- raw %>% filter(!(subj %in% exclude))
The first command creates a list of the participant numbers we wish to exclude. With the second command, we remove those participants from the raw data frame. The filter command will be familiar from previous worksheets. The expression subj %in% exclude means 'any participant whose subject number is in our list exclude'. The use of !() in the filter statement means not. So, filter(!(subj %in% exclude)) means keep the participants whose subject number is not in the exclude list.
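If %in% is new to you, here's a minimal sketch of how it behaves on its own (the numbers are made up for illustration):
# %in% returns TRUE where the value on the left appears in the list on the right
c(163, 164, 165) %in% c(164, 166)
# ! flips TRUE and FALSE, so this marks the values that are NOT in the list
!(c(163, 164, 165) %in% c(164, 166))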
There are still a few trials where the participant makes no expectancy rating, and the computer records this as an expectancy of zero. There's a good case to be made for recoding the rating on those trials as either NA (meaning 'missing data') or 3 (meaning 'participant is unsure whether the noise will occur'). For brevity, we don't make these changes here, but once you've gone through the whole worksheet, perhaps try to work out how one might do this.
After they had excluded participants, the students then performed a large number of further preprocessing steps. These steps were specific to the type of large neuroscience data set they had collected, so we won’t go through them in this more general-purpose worksheet. You can read about the steps they took in the case study if you are curious.
Instead, we’ll just load in the results of their preprocessing.
Enter this comment and command into your script, and run it:
# Load data into 'sum.data'
sum.data <- read_csv('going-further/perruchet-sum.csv')
The data frame sum.data now contains one row for each trial of the experiment, for each participant. For our current purposes the key column is cval, which is a measure of the size of the galvanic skin response on that trial. The value column contains the expectancy rating for that trial. The trial and subj columns show the trial number and participant number, respectively. The run column refers to a concept specific to this particular study, which you can read about in the case study if you wish.
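If you'd like to see these columns for yourself, one quick way to do so (not part of the original script) is to use glimpse, which lists each column along with its type and first few values:
# Show each column of 'sum.data', with its type and first few values
sum.data %>% glimpse()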
The central measurement in this study was Galvanic Skin Response (GSR). Studies of GSR data normally log transform it before analysis. Log transforming data means taking the logarithm of each value, in this case the natural logarithm. Two reasonable questions about this preprocessing are: (a) what does that mean, and (b) why would you do it?
To answer the first question, a log transform compresses the range of the data. The following graph should help you visualize this.
Enter this command into your script, and run it:
# Visualizing logarithms
plot(1:50,log(1:50))
On the x-axis we have the numbers 1 to 50. On the y-axis, we have the natural logarithm of those numbers. As you can see, while the x-axis rises from 1 to 50, the y-axis rises from zero to around 4. You should also notice that, for each increase in x, the increase in y gets smaller. So, for example, where x rises from 1 to 2, log(x) rises from 0 to about 0.7; but where x rises from 2 to 3, log(x) rises by only about 0.4.
As a result, a log transform doesn’t change small numbers very much, but reduces large numbers a lot more. This can be useful if, for example, a few participants have unusually large GSRs. These very large GSRs would increase the variance of the data substantially, and this can make it harder to detect real differences between conditions. A log transform can reduce that problem, and hence can make it easier to find real effects in such situations. So, this also answers the second part of the question - we apply a log transform to GSR data because it can make it easier to find real effects. Other common transformations of this type include a square root transformation, and a reciprocal (1/x) transformation.
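To see this effect on variance in miniature, here's a small sketch with made-up numbers in which one value is much larger than the rest (the values are purely illustrative, not real GSR data):
# A few made-up scores, one of which is extreme
gsr <- c(1, 2, 3, 40)
# The variance of the raw scores is dominated by the extreme value
var(gsr)
# After a log transform, the extreme value has much less influence
var(log(gsr))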
The students log transformed their GSR data using the mutate command we have come across in previous worksheets.
Enter this comment and command into your script, and run it:
# Log transformation of GSR
sum.data <- sum.data %>% mutate(lcval = log(cval + 1))
mutate creates a new column called lcval by transforming the existing column cval. This transformation uses the command log, which takes the natural logarithm. We calculate log(cval + 1) rather than log(cval) because log(0) is negative infinity, which is not something we can analyze. By adding 1 we avoid this problem (unless cval is -1 or smaller, which it seldom is).
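You can see the problem, and the fix, for yourself in the console (a quick check, not part of the original script):
# log of zero is negative infinity, which we cannot analyze
log(0)
# adding 1 before taking the log avoids the problem
log(0 + 1)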
This exercise uses data from a study which compared emotion regulation strategies between fans of mainstream (control group), goth, metal and emo music. Before and after each group listened to a clip of their preferred music, measurements were taken using the 20-item Positive and Negative Affect Schedule (PANAS). The data is in wide format, with one row per participant.
We start by loading the data and counting the number of rows in the data frame.
Enter comments and commands into your script, and run them:
# Load data into 'panas'
panas <- read_csv('going-further/music-panas.csv')
# Count number of rows in 'panas'
count(panas)
# A tibble: 1 × 1
n
<int>
1 297
Like the Perruchet Effect experiment, the 297 rows in this data include some duplicates. Add commands (and appropriate comments) to de-duplicate the data, and then count the rows again. If you've done it right, the output should look like this:
# A tibble: 1 × 1
n
<int>
1 294
This version of the PANAS used 10 questions to measure positive affect (pa), and 10 to measure negative affect (na). These measures were taken twice (pre_ and post_). Answers to questions were scored from 1 to 5, meaning that scores could range from 10 to 50. For the second negative affect measurement, the researchers calculated and entered the scores by hand. This is a notoriously error-prone way to do things, and it resulted in some impossible scores - specifically, scores greater than 50.
Use filter() to find any participants with a post_na score greater than 50. Use another filter() command to exclude any participants who meet this criterion. Enter these commands, along with appropriate comments, into your script.
The reciprocal of a number is the result of dividing 1 by that number. So the reciprocal of 10 is 1 / 10, or 0.1. More generally, if x is our number, 1 / x is the reciprocal. Like the log transformation, taking a reciprocal reduces the disproportionate effect that a small number of extreme values have on the variance.
Notice that taking a reciprocal reverses the order of your data: for example, 5 is smaller than 10, but 1 / 5 = 0.2 is larger than 1 / 10 = 0.1. This is easy to correct for by subtracting each value of x from the maximum value before the division. You don't need to do that for this exercise. However, because 1 / 0 gives infinity, which we can't analyze, we'll use the formula 1 / (x + 1) to calculate the reciprocal.
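Here's a quick illustration of those two points in the console (a sketch with made-up numbers; it is not part of the exercise answer):
# Reciprocals reverse the order: 5 is less than 10, but 1/5 is greater than 1/10
1 / 5
1 / 10
# 1 / 0 gives infinity, which we can't analyze, so we add 1 first
1 / 0
1 / (0 + 1)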
Use this formula to transform pre_na and post_na. Then use summarise() to calculate mean pre- and post- negative affect by subculture. Include the relevant commands and comments in your script. The results should look like this.
# A tibble: 4 × 3
subculture `mean(pre_na)` `mean(post_na)`
<chr> <dbl> <dbl>
1 Emo 0.0574 0.0672
2 Goth 0.0685 0.0768
3 Mainstream 0.0705 0.0786
4 Metal 0.0670 0.0784
Copy the R code you used for this exercise, including the comments, into PsycEL.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.