In the Intermediate Guide to R, we covered some basic techniques for preprocessing data from experiments; see the Preprocessing data from experiments and More preprocessing worksheets. In this worksheet, we'll cover some additional techniques for preprocessing experimental data.
To prepare for this worksheet:
- Open the rminr-data project we used previously.
- Open the Files tab. You should see a going-further folder which contains the files perruchet-raw.csv and perruchet-sum.csv.
- If you don't see the folder or files, ask git to "pull" the latest version of the repository. Select the Git tab, which is located in the same row of tabs as the Environment tab. Click the Pull button (the one with a downward-pointing arrow). A window will open showing the files which have been pulled from the repository. Close the Git pull window.
- Create a script named preproc-experiments.R in the rminr-data folder (the folder above case-studies). Put all the commands from this worksheet into this file, and run them from there. Save your script regularly.
We’ll start by loading some data from an undergraduate student dissertation on the Perruchet Effect. Participants were repeatedly shown a picture of a brown cylinder. This picture was sometimes followed by a loud unpleasant noise. Their reaction to the noise was measured using their galvanic skin response. In addition, every time the cylinder was shown, participants rated their expectancy that the noise was about to occur on a 1-5 scale.
Enter these comments and commands into your script, and run them:
# Further preprocessing for experiments
# Clear the environment
rm(list = ls())
# Load tidyverse
library(tidyverse)
# Load data into 'raw'
raw <- read_csv('going-further/perruchet-raw.csv')
# Rename columns of 'raw'
colnames(raw) <- c("subj", "time", "key", "value")
# Display first few rows of 'raw'
raw %>% head()
# A tibble: 6 × 4
subj time key value
<dbl> <dbl> <chr> <dbl>
1 200 50 GS 0
2 200 106 GS 0
3 200 153 GS 0
4 200 201 GS 0
5 200 253 GS 0
6 200 301 GS 0
Explanation of commands:
Command 1 clears the environment. Commands 2 and 3 should be familiar from previous worksheets; we nearly always need the tidyverse package, and we use the read_csv command to load the data. The going-further/ part of the filename going-further/perruchet-raw.csv says that the file perruchet-raw.csv is to be found in the directory (folder) called going-further. Command 4 renames the columns, as covered in the Preprocessing data from experiments worksheet. Command 5 prints the first few rows of raw.
Explanation of output:
Our data is in long format, i.e. one observation per row. It is a large file, containing almost two million observations in total. The four columns are as follows:
- subj: A number indicating which participant this data comes from.
- time: The time this observation was made, measured in milliseconds from the beginning of the experiment for that participant. Observations are made approximately every 50 ms.
- key: The type of observation being made. Mostly, these are GS observations, which are galvanic skin responses. These are measures of level of arousal in the participant. There are also ER observations, which are the participant's ratings of their expectancy of the unpleasant noise (you can check which keys are present for yourself; see the example after this list).
- value: The actual observation. Where key is GS, these are galvanic skin response values, which are measures of conductance (the reciprocal of electrical resistance). Higher values indicate higher arousal. Where key is ER, the scores are expectancy ratings on a scale of 1-5.
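If you want to confirm which types of observation the key column contains, one quick way to do so (not part of the original script) is to count the rows for each key:
# Count how many observations there are of each type of 'key'
raw %>% count(key)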
This experiment was run by three people over many different days. They tried to manually stitch all their data files together, by copying-and-pasting them into one big file in Excel. This type of manual manipulation of data is notoriously error prone, and in the process they accidentally duplicated some data. In this case, the problem is easy to detect and fix in R, because no two rows of this data can legitimately be identical - the subject number or the time will always differ. So, we can remove duplicates by telling R to give us just one copy of each of the different rows in the data set. The command distinct does that for us.
Enter this comment and command into your script, and run it:
# De-duplicate data
raw <- distinct(raw)
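To see what distinct is doing, here is a minimal sketch using a small made-up data frame (the values are purely illustrative, not part of the study data):
# A tiny made-up data frame in which the first row is duplicated
demo <- tibble(subj = c(1, 1, 2), time = c(50, 50, 50))
# distinct() keeps just one copy of each unique row, so the duplicate is dropped
distinct(demo)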
In this dissertation, the student first ran a pilot study, with just a few participants. On the basis of this pilot, they changed their experiment a bit, and then went on to collect data from many more participants. They only wanted to analyze the main study, so they removed the pilot participants. This was easy, because the participants in the pilot study all had participant numbers of 161 or lower, so they could use a simple filter command, such as we have used in many previous worksheets.
Enter this comment and command into your script, and run it:
# Remove pilot study
raw <- raw %>% filter(subj > 161)
As part of this experiment, participants were told to rate their expectancy of hearing an unpleasant noise, on every trial. However, they only had 8 seconds to do this, and the experimenters noticed that sometimes people failed to make a rating. The experimental apparatus recorded this as a rating of zero.
Participants who miss a lot of ratings aren’t really following the instructions, making their data hard to interpret. In such cases, we normally exclude those participants from further analysis. In order to do this, we need to find out how widespread the problem is. One way to do this is to calculate the number of zero ratings each participant makes.
Enter this comment and these commands into your script, and run them:
## Number of missed (zero) expectancy ratings, by participant
inspect <- raw %>% filter(key == "ER") %>% filter(value == 0)
table(inspect$subj)
163 164 166 176 184 188 191 196 199 202 223 233 235 236 239 246
1 19 46 16 2 3 2 1 32 3 3 1 3 9 34 1
The first command filters the raw data down to the rows that contain expectancy ratings (key == "ER"), and then to the cases where that rating is zero (value == 0). The result is stored in the data frame inspect. One could look through inspect manually, but it's more efficient and less error prone to get R to tell us how many times each participant has a zero rating. This is what the second command, table(inspect$subj), does - it counts the number of times each subject number occurs in inspect.
The output shows the number of trials on which each participant failed to give an expectancy rating. If a participant never made a zero rating, their participant number does not appear on the table. We can see that sixteen participants missed at least one rating, and that six participants missed more than 3 (one participant never made a rating, missing all 46 trials of the experiment!).
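If you prefer to stay within the tidyverse, an equivalent way of getting the same counts (shown here just for comparison, not part of the original script) is:
# Same counts as table(inspect$subj), returned as a data frame
inspect %>% count(subj)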
In this case, the students decided to exclude all participants who missed more than three ratings. There would have been a number of ways to do this, but the simplest in this case was just to manually put together a list of the subject numbers they wanted to exclude.
Enter this comment and these commands into your script, and run them:
## Exclude participants
exclude <- c(164, 166, 176, 199, 236, 239)
raw <- raw %>% filter(!(subj %in% exclude))
The first command creates a list of the participant numbers we wish to exclude. With the second command, we remove those participants from the raw data frame. The filter command will be familiar from previous worksheets. The expression subj %in% exclude means 'any participant whose subject number is in our list exclude'. The use of !() in the filter statement means not. So, filter(!(subj %in% exclude)) means keep the participants whose subject number is not in the exclude list.
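If %in% is new to you, here's a minimal sketch of how it behaves on its own (the numbers are made up for illustration):
# %in% returns TRUE where the value on the left appears in the list on the right
c(163, 164, 165) %in% c(164, 166)
# ! flips TRUE and FALSE, so this marks the values that are NOT in the list
!(c(163, 164, 165) %in% c(164, 166))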
There are still a few trials where the participant makes no expectancy rating, and the computer records this as an expectancy of zero. There's a good case to be made for recoding the rating on those trials as either NA (meaning 'missing data') or 3 (meaning 'participant is unsure whether the noise will occur'). For brevity, we don't make these changes here, but once you've gone through the whole worksheet, perhaps try to work out how one might do this.
After they had excluded participants, the students then performed a large number of further preprocessing steps. These steps were specific to the type of large neuroscience data set they had collected, so we won’t go through them in this more general-purpose worksheet. You can read about the steps they took in the case study if you are curious.
Instead, we’ll just load in the results of their preprocessing.
Enter this comment and command into your script, and run it:
# Load data into 'sum.data'
sum.data <- read_csv('going-further/perruchet-sum.csv')
The data frame sum.data now contains one row for each trial of the experiment, for each participant. For our current purposes the key column is cval, which is a measure of the size of the galvanic skin response on that trial. The value column contains the expectancy rating for that trial. The trial and subj columns show the trial number and participant number, respectively. The run column refers to a concept specific to this particular study, which you can read about in the case study if you wish.
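If you'd like to see these columns for yourself, one quick way to do so (not part of the original script) is to use glimpse, which lists each column along with its type and first few values:
# Show each column of 'sum.data', with its type and first few values
sum.data %>% glimpse()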
The central measurement in this study was Galvanic Skin Response (GSR). Studies of GSR data normally log transform it before analysis. Log transforming data means taking the logarithm of each value, in this case the natural logarithm. Two reasonable questions about this preprocessing are: (a) what does that mean, and (b) why would you do it?
To answer the first question, a log transform compresses the range of the data. The following graph should help you visualize this.
Enter this command into your script, and run it:
# Visualizing logarithms
plot(1:50,log(1:50))
On the x-axis we have the numbers 1 to 50. On the y-axis, we have the natural logarithm of those numbers. As you can see, while the x-axis rises from 1 to 50, the y-axis rises from zero to around 4. You should also notice that, for each increase in x, the increase in y gets smaller. So, for example, where x rises from 1 to 2, log(x) rises from 0 to about 0.7; but where x rises from 2 to 3, log(x) rises by only about 0.4.
As a result, a log transform doesn’t change small numbers very much, but reduces large numbers a lot more. This can be useful if, for example, a few participants have unusually large GSRs. These very large GSRs would increase the variance of the data substantially, and this can make it harder to detect real differences between conditions. A log transform can reduce that problem, and hence can make it easier to find real effects in such situations. So, this also answers the second part of the question - we apply a log transform to GSR data because it can make it easier to find real effects. Other common transformations of this type include a square root transformation, and a reciprocal (1/x) transformation.
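To see this effect on variance in miniature, here's a small sketch with made-up numbers in which one value is much larger than the rest (the values are purely illustrative, not real GSR data):
# A few made-up scores, one of which is extreme
gsr <- c(1, 2, 3, 40)
# The variance of the raw scores is dominated by the extreme value
var(gsr)
# After a log transform, the extreme value has much less influence
var(log(gsr))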
The students log transformed their GSR data using the mutate command we have come across in previous worksheets.
Enter this comment and command into your script, and run it:
# Log transformation of GSR
sum.data <- sum.data %>% mutate(lcval = log(cval + 1))
mutate creates a new column called lcval by transforming the existing column cval. This transformation uses the command log, which takes the natural logarithm. We calculate log(cval + 1) rather than log(cval) because log(0) is negative infinity, which is not something we can analyze. By adding 1 we avoid this problem (unless cval is -1 or smaller, which it seldom is).
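You can see the problem, and the fix, for yourself in the console (a quick check, not part of the original script):
# log of zero is negative infinity, which we cannot analyze
log(0)
# adding 1 before taking the log avoids the problem
log(0 + 1)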
This exercise uses data from a study which compared emotion regulation strategies between fans of mainstream (control group), goth, metal and emo music. Before and after each group listened to a clip of their preferred music, measurements were taken using the 20-item Positive and Negative Affect Schedule (PANAS). The data is in wide format, with one row per participant.
We start by loading the data and counting the number of rows in the data frame.
Enter comments and commands into your script, and run them:
# Load data into 'panas'
panas <- read_csv('going-further/music-panas.csv')
# Count number of rows in 'panas'
count(panas)
# A tibble: 1 × 1
n
<int>
1 297
Like the Perruchet Effect experiment, the 297 rows in this data include some duplicates. Add commands (and appropriate comments) to de-duplicate the data, and then count the rows again. If you've done it right, the output should look like this:
# A tibble: 1 × 1
n
<int>
1 294
This version of the PANAS used 10 questions to measure positive affect (pa), and 10 to measure negative affect (na). These measures were taken twice (pre_ and post_). Answers to questions were scored from 1 to 5, meaning that scores could range from 10 to 50. For the second negative affect measurement, the researchers calculated and entered the scores by hand. This is a notoriously error-prone way to do things, and it resulted in some impossible scores - specifically, scores greater than 50.
Use filter() to find any participants with a post_na score greater than 50. Use another filter() command to exclude any participants who meet this criterion. Enter these commands, along with appropriate comments, into your script.
The reciprocal of a number is the result of dividing 1 by that number. So the reciprocal of 10 is 1 / 10, or 0.1. More generally, if x is our number, 1 / x is the reciprocal. Like the log transformation, taking a reciprocal reduces the disproportionate effect that a small number of extreme values have on the variance.
Notice that taking a reciprocal reverses the order of your data: for example, 5 is smaller than 10, but 1 / 5 = 0.2 is larger than 1 / 10 = 0.1. This is easy to correct for by subtracting each value of x from the maximum value before the division. You don't need to do that for this exercise. However, because 1 / 0 gives infinity, which we can't analyze, we'll use the formula 1 / (x + 1) to calculate the reciprocal.
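Here's a quick illustration of those two points in the console (a sketch with made-up numbers; it is not part of the exercise answer):
# Reciprocals reverse the order: 5 is less than 10, but 1/5 is greater than 1/10
1 / 5
1 / 10
# 1 / 0 gives infinity, which we can't analyze, so we add 1 first
1 / 0
1 / (0 + 1)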
Use this formula to transform pre_na and post_na. Then use summarise() to calculate mean pre- and post- negative affect by subculture. Include the relevant commands and comments in your script. The results should look like this.
# A tibble: 4 × 3
subculture `mean(pre_na)` `mean(post_na)`
<chr> <dbl> <dbl>
1 Emo 0.0574 0.0672
2 Goth 0.0685 0.0768
3 Mainstream 0.0705 0.0786
4 Metal 0.0670 0.0784
Copy the R code you used for this exercise, including the comments, into PsycEL.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.