Sometimes a data file does not contain the participant number; it is only given as part of the filename. If you encounter this issue, here’s how to resolve it using the add_column
command. The following assumes you have a project in RStudio associated with the git repository used in the preprocessing worksheet.
library(tidyverse)
subj.11 <- read_csv('rawdata/subject-11.csv') %>%
  # .before = "acc" means: insert the new column before the existing column called "acc"
  add_column(subj = 11, .before = "acc")
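To see what .before does without needing the data file, here is a minimal sketch on a made-up tibble. The rt column and all the values are invented for illustration; only the acc column name comes from the example above.

```r
library(tidyverse)

# Made-up data standing in for one participant's file
toy <- tibble(acc = c(1, 0, 1), rt = c(512, 430, 388))

# Add a constant participant number as a new column placed before "acc"
toy_with_subj <- toy %>%
  add_column(subj = 11, .before = "acc")

names(toy_with_subj)  # "subj" "acc" "rt"
```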
If you need to read several participants’ data files at once, recall that we saw how to use do
with read_csv
in the preproc worksheet:
alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names=TRUE)) %>%
  group_by(filename) %>%
  do(read_csv(.$filename))
Explanation of command: This is the same code we saw in the preproc worksheet. We use list.files
to produce a list of all the files in the rawdata
directory which end in .csv
. This list is used to make a column in a new dataframe, which is piped to the group_by(filename)
function. The grouped data is then piped to the do
function. This works on each group (in this case, each filename) in turn and uses the filename
column as input to the read_csv
command.
Because read_csv
produces a dataframe as output, the results for each file are automatically combined into a single dataframe of all participants. The filename
column remains and provides a record of where the data came from.
When you run this code, you should notice that alldat
has a new column, filename
. This contains the original file name of the raw data.
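If you want to try this without the worksheet's data, the sketch below runs the same pipeline on two tiny made-up files written to a temporary folder. The folder name rawdata_demo, the rt column, and all the values are assumptions made purely for illustration.

```r
library(tidyverse)

# Hypothetical demo folder with two tiny data files (not the worksheet's real data)
demo_dir <- file.path(tempdir(), "rawdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(acc = c(1, 0), rt = c(512, 430)), file.path(demo_dir, "subject-11.csv"))
write_csv(tibble(acc = c(1, 1), rt = c(388, 401)), file.path(demo_dir, "subject-12.csv"))

# Same pattern as above: one row per file, then read each file in turn
demo <- tibble(filename = list.files(demo_dir, "*.csv", full.names = TRUE)) %>%
  group_by(filename) %>%
  do(read_csv(.$filename, show_col_types = FALSE))

nrow(demo)                       # 4 rows: two from each file
unique(basename(demo$filename))  # "subject-11.csv" "subject-12.csv"
```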
That’s OK, but it would be better if we could just have the participant number (e.g. 11
) because it’s more compact and easier to work with. So, we need to extract the participant number 11
from the filename. We can do this using the str_sub
command. Here’s an example of how str_sub
works:
str_sub("investment", 3, 6)
[1] "vest"
Explanation of command: str_sub
is short for “substring”: a string is a sequence of characters (e.g. a word), and a substring is a part of that string. The first number, 3
, is the position where the substring starts, and the second number, 6
, is the position where it ends (both positions are included). So, if we take the third to the sixth character of “investment”, we get “vest”.
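Two further properties of str_sub are handy to know. The short sketch below uses made-up example words and shows that it works on a whole vector of strings at once, and that negative positions count back from the end of the string:

```r
library(tidyverse)

# str_sub is applied to every string in the vector
str_sub(c("investment", "department"), 3, 6)  # "vest" "part"

# Negative positions count from the end: here, the last four characters
str_sub("investment", -4, -1)                 # "ment"
```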
Looking at the filename rawdata/subject-11.csv
, we can see that the participant number starts at the 17th position and ends at the 18th. This will be true for any two-digit participant number (a good reason to start subject numbers at 11 rather than at 1). So, putting this all together, we get:
alldat <- tibble(filename = list.files("rawdata", "*.csv", full.names=TRUE)) %>%
  group_by(filename) %>%
  do(read_csv(.$filename)) %>%
  mutate(subj = str_sub(filename, 17, 18), .before="filename")
These four lines of code load and combine every data file, and extract the participant number for each row.
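Two details of this approach are worth checking by hand. The sketch below (the subject-9 filename is hypothetical) shows that str_sub returns text rather than a number, and that fixed positions pick up the wrong characters when the participant number has only one digit:

```r
library(tidyverse)

two_digit <- str_sub("rawdata/subject-11.csv", 17, 18)
two_digit              # "11" (text, not a number)
as.integer(two_digit)  # 11, if a numeric participant number is needed

# With a one-digit number, position 18 is the "." of ".csv"
str_sub("rawdata/subject-9.csv", 17, 18)  # "9."
```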
If you didn’t always use 2-digit subject numbers in your experiment (e.g. you used 1..9, 10, 11, 12 and so on), or have more than 99 participants, there is another more advanced trick which can be useful.
The str_extract
function uses a special language to define patterns in a string. These can be used to identify and extract regular or repeating patterns in your filenames. These patterns are called regular expressions. To give one example:
str_extract("participant-9999", "\\d+")
[1] "9999"
Explanation of the code: str_extract
is being used to match patterns in the text "participant-9999"
. The pattern used is "\\d+"
. The \\d
part means “match any digit from 0 to 9”. The +
means “match one or more of whatever came before, taking as many as possible”. So \\d+
means “match a run of one or more digits, as long as possible”.
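Like str_sub, str_extract works on a whole vector of strings, so the same pattern copes with participant numbers of any length. The sketch below uses made-up filenames. One caveat: str_extract returns only the first match, so a digit elsewhere in the path (such as in the hypothetical folder name study2) would be picked up instead. One possible way round this, shown as an assumption rather than a recommendation, is a lookbehind that only accepts digits following "subject-".

```r
library(tidyverse)

files <- c("rawdata/subject-9.csv", "rawdata/subject-11.csv", "rawdata/subject-120.csv")
str_extract(files, "\\d+")              # "9" "11" "120"
as.integer(str_extract(files, "\\d+"))  # 9 11 120

# Only the first run of digits is returned
str_extract("study2/subject-11.csv", "\\d+")               # "2"
# Lookbehind: digits must come directly after "subject-"
str_extract("study2/subject-11.csv", "(?<=subject-)\\d+")  # "11"
```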
Adapt the code from above to use str_extract
rather than str_sub
.
Optionally, if you think matching patterns in your text data might be a useful skill, see this guide for lots more detail: https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.