Intelligence, personality, and many other psychological constructs are often measured using scales. This type of data is normally collected using questionnaires (also called surveys). Answers to the questions are given numerical values, most commonly using a Likert scale. Likert scales associate numbers with a set of answers which express some degree of agreement with each question (e.g. 0 = Not at all, 1 = A little, 2 = Somewhat, 3 = A lot, 4 = Extremely).
A formula is applied to the scores for some or all of the questions to
calculate an overall score for the scale. The formula often just
consists of adding up the individual scores (more on this below). This
worksheet assumes that your survey software has recorded Likert
responses as numbers. Refer back to the Cleaning
up questionnaire data worksheet if you need a reminder of how to
convert text responses to numbers.
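For example, if your software had recorded the text of each answer rather than a number, the conversion might look something like the sketch below. This is an illustration only (you don’t need to run it): the responses data frame and its q1 column are invented for this example, and recode() comes from dplyr, which loads with tidyverse.
# Illustration only: convert text Likert responses to the 0-4 coding above
library(tidyverse)
responses <- tibble(q1 = c("Not at all", "Somewhat", "Extremely"))
responses <- responses %>%
  mutate(q1 = recode(q1,
                     "Not at all" = 0, "A little" = 1, "Somewhat" = 2,
                     "A lot" = 3, "Extremely" = 4))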
A psychometric scale is a scale which has undergone some degree of testing to ensure that it is a valid and reliable measure of the underlying construct. For example, a valid intelligence scale would truly measure intelligence, rather than some other construct (e.g. memory). A reliable scale gives consistent results, i.e. a person who completed the scale at different times would produce similar scores, as would two people who are similar in terms of the construct measured by the scale. Most published scales have been tested to ensure they are valid and reliable, so it’s advisable to use an existing scale if one exists, rather than creating your own.
Surveys can be created using JISC, Gorilla Survey, OpenSesame, The Experiment Factory, Qualtrics and many other software packages. Most software will allow you to save your data as a CSV file. The precise structure of the data varies between packages, so you are likely to have to start by preprocessing your data.
In this worksheet, we’ll cover some common techniques you are likely to use to preprocess psychometric scale data. These techniques should be useful regardless of the software you used to administer your survey, although they will need slight modifications depending on the way your raw data is organised.
To prepare for this worksheet:

- Open the rminr-data project we used previously.

- If you don’t see a folder named going-further, it means you created your project before the data required for this worksheet was added to the rminr-data git repository. You can get the latest files by asking git to “pull” the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window.
- Open the Files tab. The going-further folder should contain the files dass21.csv and sses.csv.
- Create a script named scales.R in the rminr-data folder (the folder above going-further). Add the comments and code to this script as you work through each section of the worksheet.
We start with some lines to clear the workspace and load tidyverse.
Enter these comments and commands into your script, and run them:
# Data preprocessing for scales
# Clear the environment
rm(list = ls())
# Load tidyverse
library(tidyverse)
Our first step will be to load the data and remove columns from the raw survey data which aren’t needed for analysis. We’ll demonstrate this using some real data from the Depression Anxiety Stress Scales-21 (DASS-21; Henry & Crawford, 2005), a 21-item scale for measuring depression, anxiety and stress.
Enter these comments and commands into your script, and run them:
# Load data
dass21_raw <- read_csv("going-further/dass21.csv")
# Select relevant columns of data
dass21_raw <- dass21_raw %>% select(partID, Age:DASS21)
Explanation of commands:

We read the DASS-21 CSV file into the data frame dass21_raw. We then select() just the columns in dass21_raw that we want to keep. The first column we select() is the participant ID, which is stored in the partID column. Arguments to select() can also be consecutive ranges of columns in a data frame, consisting of the first and last column name (ordered from left to right), separated by a :. This avoids having to type out long lists of column names. Here we use Age:DASS21 to select all columns from Age to DASS21.
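As an aside, select() also accepts helper functions such as starts_with(). The command below is a sketch of an equivalent way to write the selection; it assumes that no other columns in the raw file begin with “DASS” (illustration only, you don’t need to run it).
# Illustration only: select the demographic columns by name and the DASS items by prefix
dass21_raw <- dass21_raw %>%
  select(partID, Age, Gender, Stage, starts_with("DASS"))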
The table below shows the first few rows from dass21_raw. In this study, the data was recorded in “wide” format (one row for each participant). Notice that our data frame contains only the columns that we selected in the commands above. The DASS-21 scores are in columns DASS1 to DASS21.
partID | Age | Gender | Stage | DASS1 | DASS2 | DASS3 | DASS4 | DASS5 | DASS6 | DASS7 | DASS8 | DASS9 | DASS10 | DASS11 | DASS12 | DASS13 | DASS14 | DASS15 | DASS16 | DASS17 | DASS18 | DASS19 | DASS20 | DASS21 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34 | 18 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
35 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
36 | 37 | 2 | 2 | 2 | 1 | 3 | 1 | 3 | 1 | 0 | 1 | 2 | 1 | 1 | 1 | 2 | 0 | 2 | 3 | 1 | 2 | 1 | 0 | 2 |
37 | 19 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
38 | 20 | 2 | 1 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
39 | 20 | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
If participants don’t complete a survey (or only partially complete it), you may want to exclude their data from your analyses. Here are some rows from dass21_raw.
partID | Age | Gender | Stage | DASS1 | DASS2 | DASS3 | DASS4 | DASS5 | DASS6 | DASS7 | DASS8 | DASS9 | DASS10 | DASS11 | DASS12 | DASS13 | DASS14 | DASS15 | DASS16 | DASS17 | DASS18 | DASS19 | DASS20 | DASS21 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
106 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
107 | 19 | 1 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
108 | 18 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
109 | 20 | 2 | 3 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 2 | 1 | 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
We can see that participants 108 and 109 (rows 3 and 4) have numbers in all columns, indicating that their data is complete. However, participants 106 and 107 have cells containing the value NA, which means these cells in the CSV file were empty. For participant 106, all cells are NA (perhaps they dropped out of the study), and for participant 107, all of the DASS-21 cells are NA (perhaps they skipped this survey).

If you select dass21_raw in the Environment pane and look through the rest of the data, you’ll see that participants 35, 49, 61, and 77 also have no data for this survey. We exclude these participants from the data frame.
Enter these comments and commands into your script, and run them:
# Exclude participants with no data
exclude <- c(35,49,61,77,106,107)
dass21 <- dass21_raw %>% filter(!(partID %in% exclude))
Explanation of commands:

Line 1 creates a vector of the participant numbers we wish to exclude. In line 2, we remove those participants from dass21_raw. The filter command will be familiar from previous worksheets. The command partID %in% exclude means ‘any participant whose subject number is in our exclude vector’. The use of !() in the filter statement means not. So, filter(!(partID %in% exclude)) means keep the participants whose subject number is not in the exclude vector.
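If you would rather not maintain the exclude vector by hand, you could instead drop every participant whose DASS items are all missing. The command below is a sketch of that approach; it assumes a reasonably recent version of dplyr (1.0.4 or later) for if_all(), and for this dataset it should exclude the same six participants (illustration only, you don’t need to run it).
# Illustration only: exclude participants whose DASS items are all NA
dass21 <- dass21_raw %>% filter(!if_all(DASS1:DASS21, is.na))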
If you look at the Value column in the Environment pane, you’ll see that dass21 now has six fewer rows than dass21_raw.
Our next step is to calculate the scores for the constructs measured
by our scale. Many scales consist of groups of questions which measure
multiple, distinct constructs. The DASS-21 is an example of a scale with
subscale scores for depression, anxiety and stress. These are calculated
by adding together responses for specific items, which we can do using
the rowSums()
function.
Enter these comments and commands into your script, and run them:
# Calculate depression subscale score
dass21 <- dass21 %>%
mutate(depression = rowSums(dass21[4 + c(3,5,10,13,16,17,21)]))
# Add relevant columns to 'dass21_total'
dass21_total <- dass21 %>% select(partID, Age, Gender, depression)
Explanation of commands:

We use mutate() to create a depression column which is the sum of items 3, 5, 10, 13, 16, 17 and 21. Item 1 of the DASS-21 data is in column 5 of dass21, so we add 4 to each item number to select the correct columns to add together. The command dass21[4 + c(3,5,10,13,16,17,21)] is an example of “vectorised addition”: the 4 is added to each of the item numbers in the vector to the right of the +, giving the positions of the columns we want. For each row, the values in those columns are added together using rowSums(). We assign the result back to dass21, thereby adding a depression column to the data frame. The second command uses select() to keep just the columns we need for analysis, storing them in a new data frame, dass21_total.

Use similar commands to add scores for anxiety and stress to dass21. The anxiety subscale is the sum of questions 2, 4, 7, 9, 15, 19 and 20. The stress subscale is the sum of questions 1, 6, 8, 11, 12, 14 and 18. You will also need to include these new columns when creating dass21_total. After running your commands, the first few rows of dass21_total should look like this:
partID | Age | Gender | depression | anxiety | stress |
---|---|---|---|---|---|
34 | 18 | 2 | 0 | 1 | 2 |
36 | 37 | 2 | 15 | 7 | 8 |
37 | 19 | 2 | 6 | 4 | 7 |
38 | 20 | 2 | 5 | 1 | 5 |
39 | 20 | 2 | 1 | 0 | 2 |
40 | 18 | 1 | 2 | 1 | 5 |
41 | 20 | 1 | 1 | 0 | 0 |
42 | 20 | 2 | 3 | 5 | 4 |
43 | 19 | 0 | 0 | 0 | 0 |
44 | 19 | 2 | 19 | 17 | 14 |
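If you find the “add 4 to each item number” step error-prone, the subscale scores can also be calculated by naming the item columns directly. The commands below are a sketch of this approach for the depression subscale; they assume a version of dplyr with across() and all_of() (1.0 or later), and the anxiety and stress subscales can be built the same way (illustration only, you don’t need to run it).
# Illustration only: refer to DASS items by name rather than by column position
depression_items <- paste0("DASS", c(3, 5, 10, 13, 16, 17, 21))
dass21 <- dass21 %>%
  mutate(depression = rowSums(across(all_of(depression_items))))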
Copy the R code you used for this exercise, along with appropriate comments, into PsycEL.
Some data benefits from a little more tidying than simply removing columns which aren’t required. We’ll demonstrate this more advanced preprocessing using a different dataset. This data came from an experiment in which self-esteem was measured before and after participants completed one of two mental imagery conditions, or a control condition.
The experiment used the State Self-Esteem Scale (SSES, Heatherton & Polivy, 1991), a 20-item scale used to measure short-lived (state) changes in self-esteem.
Enter this comment and command into your script, and run it:
# Load data into 'sses'
sses <- read_csv('going-further/sses.csv')
partID | Age | Gender | Stage | Pre_SSE_1 | Pre_SSE_2 | Pre_SSE_3 | Pre_SSE_4 | Pre_SSE_5 | Pre_SSE_6 | Pre_SSE_7 | Pre_SSE_8 | Pre_SSE_9 | Pre_SSE_10 | Pre_SSE_11 | Pre_SSE_12 | Pre_SSE_13 | Pre_SSE_14 | Pre_SSE_15 | Pre_SSE_16 | Pre_SSE_17 | Pre_SSE_18 | Pre_SSE_19 | Pre_SSE_20 | Condition | Post_SSE_1 | Post_SSE_2 | Post_SSE_3 | Post_SSE_4 | Post_SSE_5 | Post_SSE_6 | Post_SSE_7 | Post_SSE_8 | Post_SSE_9 | Post_SSE_10 | Post_SSE_11 | Post_SSE_12 | Post_SSE_13 | Post_SSE_14 | Post_SSE_15 | Post_SSE_16 | Post_SSE_17 | Post_SSE_18 | Post_SSE_19 | Post_SSE_20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47 | 20 | male | 2 | 3 | 4 | 2 | 2 | 3 | 3 | 0 | 1 | 3 | 0 | 4 | 3 | 2 | 3 | 0 | 2 | 3 | 2 | 1 | 3 | control | 3 | 2 | 3 | 2 | 1 | 2 | 0 | 1 | 3 | 0 | 4 | 3 | 2 | 2 | 0 | 1 | 2 | 1 | 1 | 1 |
51 | 22 | female | 2 | 2 | 3 | 3 | 2 | 1 | 3 | 0 | 0 | 3 | 2 | 2 | 3 | 3 | 3 | 0 | 0 | 2 | 0 | 2 | 2 | control | 3 | 3 | 3 | 3 | 1 | 3 | 1 | 1 | 4 | 1 | 2 | 2 | 2 | 3 | 0 | 0 | 2 | 0 | 2 | 2 |
57 | 20 | female | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 2 | 0 | 2 | 3 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | control | 3 | 0 | 0 | 0 | 0 | 2 | 4 | 0 | 2 | 0 | 2 | 3 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
The data will be easier to analyse if we rename the columns. It will also be useful to divide the data into two data frames, one for the pre-intervention SSES, the other for the post-intervention SSES. We’ll do this in stages.
Enter this comment and command into your script, and run it:
# Place pre-intervention SSES into 'sses_pre_raw'
sses_pre_raw <- sses %>% select(1, 5:25)
Explanation of command:

sses_pre_raw <- sses %>% select(1, 5:25) - We select() column 1, and columns 5:25, from sses, and store the resulting data frame in sses_pre_raw. Column 1 is the participant ID, columns 5:24 are the SSES scores, and column 25 indicates which of the three experimental conditions the subject was assigned to.

Here are the first three participants of our pre-intervention data:
partID | Pre_SSE_1 | Pre_SSE_2 | Pre_SSE_3 | Pre_SSE_4 | Pre_SSE_5 | Pre_SSE_6 | Pre_SSE_7 | Pre_SSE_8 | Pre_SSE_9 | Pre_SSE_10 | Pre_SSE_11 | Pre_SSE_12 | Pre_SSE_13 | Pre_SSE_14 | Pre_SSE_15 | Pre_SSE_16 | Pre_SSE_17 | Pre_SSE_18 | Pre_SSE_19 | Pre_SSE_20 | Condition |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47 | 3 | 4 | 2 | 2 | 3 | 3 | 0 | 1 | 3 | 0 | 4 | 3 | 2 | 3 | 0 | 2 | 3 | 2 | 1 | 3 | control |
51 | 2 | 3 | 3 | 2 | 1 | 3 | 0 | 0 | 3 | 2 | 2 | 3 | 3 | 3 | 0 | 0 | 2 | 0 | 2 | 2 | control |
57 | 3 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 2 | 0 | 2 | 3 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | control |
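If you prefer not to rely on column positions, the same selection can be written using the column names shown in the table above (illustration only, you don’t need to run it).
# Illustration only: equivalent selection by column name rather than position
sses_pre_raw <- sses %>% select(partID, Pre_SSE_1:Condition)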
Next, we’ll rename the SSES columns based on their question number. This will make them easier to refer to in the rest of our code.
Enter this comment and command into your script, and run them:
# Rename columns
sses_pre_raw <- sses_pre_raw %>%
set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q"))
Explanation of command:

set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q")) - We use the function set_names() to rename our columns. The ~ is a way of telling set_names() to apply a function to each column name. The remainder of the command is a “sub-pipeline” which tidies up the column name. The command str_to_lower(.) converts a string (the . means the current column name) to lower case. This lower case name is piped to str_replace_all("pre_sse_", "q"), which replaces the text pre_sse_ with q wherever it appears in the name. All our columns are now lower case, and the SSES questions are named q1:q20.

partid | q1 | q2 | q3 | q4 | q5 | q6 | q7 | q8 | q9 | q10 | q11 | q12 | q13 | q14 | q15 | q16 | q17 | q18 | q19 | q20 | condition |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47 | 3 | 4 | 2 | 2 | 3 | 3 | 0 | 1 | 3 | 0 | 4 | 3 | 2 | 3 | 0 | 2 | 3 | 2 | 1 | 3 | control |
51 | 2 | 3 | 3 | 2 | 1 | 3 | 0 | 0 | 3 | 2 | 2 | 3 | 3 | 3 | 0 | 0 | 2 | 0 | 2 | 2 | control |
57 | 3 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 2 | 0 | 2 | 3 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | control |
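As an aside, the same renaming could be done with dplyr’s rename_with(), which applies a function to each column name. The sketch below assumes dplyr 1.0 or later (illustration only, you don’t need to run it).
# Illustration only: rename the columns with rename_with() instead of set_names()
sses_pre_raw <- select(sses, 1, 5:25) %>%
  rename_with(~ str_replace_all(str_to_lower(.x), "pre_sse_", "q"))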
Now we’ll convert some columns to factors, add a time column recording that these are the pre-intervention scores, and put the columns in a more convenient order.
Enter this comment and command into your script, and run them:
# Convert columns to factors; add factor column 'time', set to 'pre'; select relevant columns
sses_pre_raw <- sses_pre_raw %>%
mutate(subj = factor(partid), condition = factor(condition),
time = factor('pre')) %>%
select(subj, condition, time, q1:q20)
Explanation of commands:

mutate(subj = factor(partid), condition = factor(condition), time = factor('pre')) - We use mutate to add and modify some columns. The argument subj = factor(partid) creates a new column named subj (which is a bit clearer than partid) by copying the partid column and making it a factor. The argument condition = factor(condition) makes the condition column a factor. The argument time = factor('pre') creates a new factor called time and sets all values to pre.

select(subj, condition, time, q1:q20) just puts our columns in a more logical order.

Our data is now much tidier:
subj | condition | time | q1 | q2 | q3 | q4 | q5 | q6 | q7 | q8 | q9 | q10 | q11 | q12 | q13 | q14 | q15 | q16 | q17 | q18 | q19 | q20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47 | control | pre | 3 | 4 | 2 | 2 | 3 | 3 | 0 | 1 | 3 | 0 | 4 | 3 | 2 | 3 | 0 | 2 | 3 | 2 | 1 | 3 |
51 | control | pre | 2 | 3 | 3 | 2 | 1 | 3 | 0 | 0 | 3 | 2 | 2 | 3 | 3 | 3 | 0 | 0 | 2 | 0 | 2 | 2 |
57 | control | pre | 3 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 2 | 0 | 2 | 3 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
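If you want to reassure yourself that the conversion worked, the commands below show the column types and the number of participants in each condition. They are optional and don’t need to go in your script.
# Optional check: column types, and participants per condition
str(select(sses_pre_raw, subj, condition, time))
sses_pre_raw %>% count(condition)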
Note that we could do all of these steps in a single pipeline (DO NOT enter these commands; you do not need to do the same thing twice, this is just an illustration of how the previous commands could be combined).
sses_pre_raw <- select(sses, 1, 5:25) %>%
set_names(~ str_to_lower(.) %>% str_replace_all("pre_sse_", "q")) %>%
mutate(subj = factor(partid), condition = factor(condition),
time = factor('pre')) %>%
select(subj, condition, time, q1:q20)
Write a similar pipeline (including comments) to create a data frame named sses_post_raw containing the post-intervention SSES data. The condition and post-intervention SSES data are in columns 25:45. The SSES columns have the prefix post_sse_ rather than pre_sse_. Set the value in the time factor to post. After running your commands, the first few rows of sses_post_raw should look like this:
subj | condition | time | q1 | q2 | q3 | q4 | q5 | q6 | q7 | q8 | q9 | q10 | q11 | q12 | q13 | q14 | q15 | q16 | q17 | q18 | q19 | q20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47 | control | post | 3 | 2 | 3 | 2 | 1 | 2 | 0 | 1 | 3 | 0 | 4 | 3 | 2 | 2 | 0 | 1 | 2 | 1 | 1 | 1 |
51 | control | post | 3 | 3 | 3 | 3 | 1 | 3 | 1 | 1 | 4 | 1 | 2 | 2 | 2 | 3 | 0 | 0 | 2 | 0 | 2 | 2 |
57 | control | post | 3 | 0 | 0 | 0 | 0 | 2 | 4 | 0 | 2 | 0 | 2 | 3 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
Copy the R code and comments you used for this exercise into PsycEL.
Heatherton, T. F., & Polivy, J. (1991). Development and validation of a scale for measuring state self-esteem. Journal of Personality and Social Psychology, 60(6), 895.
Henry, J. D., & Crawford, J. R. (2005). The short-form version of the Depression Anxiety Stress Scales (DASS-21): Construct validity and normative data in a large non-clinical sample. British Journal of Clinical Psychology, 44(2), 227–239.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.