Before starting this exercise, you should have completed all the previous Absolute Beginners’, Part 1 workshop exercises. Each section below indicates which of the earlier worksheets are particularly relevant.
“What’s the inter-rater reliability?” is a technical way of asking “How much do people agree?”. If inter-rater reliability is high, they agree a lot. If it’s low, they disagree a lot. If two people independently code some interview data, and their codes largely agree, then that’s evidence that the coding scheme is objective (i.e. is the same whichever person uses it), rather than subjective (i.e. the answer depends on who is coding the data). Generally, we want our data to be objective, so it’s important to establish that inter-rater reliability is high. This worksheet covers two ways of working out inter-rater reliability: percentage agreement, and Cohen’s kappa.
Relevant worksheet: Intro to R Studio
You and your partner must first complete the friendship interview coding exercise. You’ll then get a CSV file that contains both your ratings. If you were unable to complete the coding exercise, you can use this example CSV file instead. You can only gain marks for this exercise if you use your personal CSV file.
Once you have downloaded your CSV file, set up a new project on RStudio Server (Plymouth University students: call your new project psyc414), upload your CSV to your project, and create a new R script called irr.R.
Relevant worksheet: Exploring data
Add comments and commands to your script to load the tidyverse package and your data, then run them (CTRL+ENTER):
## Inter-rater reliability
# Load package
library(tidyverse)
# Load data
friends <- read_csv("irr.csv")
Note: Everyone’s CSV file will have a different name. For example, yours might be called 10435678irr.csv. In the example above, you’ll need to replace irr.csv with the name of your personal CSV file.
Look at the data by clicking on it in the Environment tab in RStudio. Each row is one participant in the interviews you coded. Here’s what each column in the data set contains:
Column | Description | Values |
---|---|---|
subj | Anonymous ID number for the participant you coded | a number |
rater1 | How the first rater coded the participant’s response | One of: “Stage 0”, “Stage 1”, “Stage 2”, “Stage 3”, “Stage 4” |
rater2 | How the second rater coded the participant’s response | as rater1 |
To what extent did you and your workshop partner agree on how each participant’s response should be coded? The simplest way to answer this question is just to count up the number of times you gave the same answer. You’re already looking at the data (you clicked on it in the Environment tab in the previous step, see above), and you only categorized a few participants, so it’s easy to do this by hand.
For example, if you gave the same answer for four out of the five participants, you agreed on 80% of occasions: your percentage agreement was 80%. The number might be higher or lower for your workshop pair.
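If you’d like to double-check your hand count, a one-line sketch like the following would do it (assuming the friends data frame you loaded above):

# Optional check: percentage of rows where the two raters gave the same code
sum(friends$rater1 == friends$rater2) / nrow(friends) * 100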
For realistically-sized data sets, calculating percent agreement by hand would be tedious and error prone. In these cases, it would be better to get R to calculate it for you, so we’ll practice on your current data set. We can do this in a couple of steps:
Relevant worksheet: Group differences.
Your friends data frame contains not only your ratings, but also a list of participant numbers in the subj column. For the next step to work properly, we have to remove this column. We can do this using the select command. In the Group Differences worksheet, you learned how to use the filter command to say which rows of a data frame you wanted to keep. The select command works in a similar way, except that it filters columns.
Add this comment and command to your script and run it:
# Select 'rater1' and 'rater2' columns from 'friends', write to 'ratings'
ratings <- friends %>% select(rater1, rater2)
In the command above, we take the friends data frame, select the rater1 and rater2 columns, and put them in a new data frame called ratings.
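As an aside, select can also be told which columns to drop rather than which to keep. This alternative command (a minor variation, not required for the exercise) produces the same ratings data frame by removing the subj column directly:

# Equivalent: drop the 'subj' column instead of naming the columns to keep
ratings <- friends %>% select(-subj)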
We can now use the agree command to work out percentage agreement. The agree command is part of the package irr (short for Inter-Rater Reliability), so we need to load that package first.
Add these comments and commands to your script and run them:
# Load inter-rater reliability package
library(irr)
# Calculate percentage agreement
agree(ratings)
Percentage agreement (Tolerance=0)
Subjects = 5
Raters = 2
%-agree = 80
NOTE: If you get an error here, type install.packages("irr"), wait for the package to finish installing, and try again.
The key result here is %-agree, which is your percentage agreement. The output also tells you how many subjects you rated, and the number of people who made ratings. The bit that says Tolerance=0 refers to an aspect of percentage agreement not covered in this course. If you’re curious about tolerance in a percentage agreement calculation, type ?agree into the Console and read the help file for this command.
Enter your percentage agreement into your lab book.
One problem with the percentage agreement measure is that people will sometimes agree purely by chance. For example, imagine your coding scheme had only two options (e.g. “Stage 0” or “Stage 1”). With only two options, just by random chance, we’d expect your percentage agreement to be around 50%. For example, imagine that for each participant, each rater flipped a coin, coding the response as “Stage 0” if the coin landed heads, and “Stage 1” if it landed tails. 25% of the time both coins would come up heads, and 25% of the time both coins would come up tails. So, on 50% of occasions, the raters would agree, purely by chance. So, 50% agreement is not particularly impressive when there are two options.
50% agreement is a lot more impressive if there are, say, six options. Imagine in this case that both raters roll a die. One time in six they would get the same number. So, percentage agreement by chance when there are six options is 1/6, or about 17% agreement. If two raters agree 50% of the time when using six options, that level of agreement is much higher than we’d expect by chance.
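If you’d like to convince yourself of these chance levels, here’s a rough simulation sketch (an illustration, not part of the exercise): two imaginary raters pick a category at random for a large number of ‘participants’, and we count how often they happen to match.

# Rough simulation of agreement by chance (illustration only)
set.seed(1)                                   # arbitrary seed, so the result is reproducible
n <- 10000                                    # number of simulated participants
two_options <- LETTERS[1:2]
six_options <- LETTERS[1:6]
mean(sample(two_options, n, replace = TRUE) ==
     sample(two_options, n, replace = TRUE)) * 100   # close to 50
mean(sample(six_options, n, replace = TRUE) ==
     sample(six_options, n, replace = TRUE)) * 100   # close to 17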
Jacob Cohen thought it would be much neater if we could have a measure of agreement where zero always meant the level of agreement expected by chance, and 1 always meant perfect agreement. This can be achieved by the following sum:
(P - C) / (100 - C)
where P is the percentage agreement between the two raters, and C is the percentage agreement we’d expect by chance. For example, say that in your coding exercise, you had a percentage agreement of 80%. You were given five categories to use, so the percentage agreement by chance, if you were both just throwing five-sided dice, is 20%. This gives you an agreement score of:
(80 - 20) / (100 - 20)
[1] 0.75
So, on a scale of zero (chance) to one (perfect), your agreement in this example was about 0.75 – not bad!
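If you’d like to do this sum more than once, you could wrap it in a small function of your own (a hypothetical helper, not part of any package), where P is the observed percentage agreement and C is the percentage agreement expected by chance:

# Chance-corrected agreement: P and C are both percentages
chance_corrected <- function(P, C) {
  (P - C) / (100 - C)
}
chance_corrected(80, 20)   # 0.75, matching the worked example above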
Cohen’s kappa is a measure of agreement that’s calculated in a similar way to the above example. The difference between Cohen’s kappa and what we just did is that Cohen’s kappa also deals with situations where raters use some of the categories more than others. This affects the calculation of how likely it is they will agree by chance. For more information on this, see more on Cohen’s kappa.
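To give a flavour of what that means, here is a sketch of the calculation kappa performs (assuming the ratings data frame created earlier): the chance term is built from how often each rater actually used each category, rather than from the assumption that every category is equally likely.

# Sketch of the kappa calculation (for illustration; kappa2 does all this for you)
p_o  <- mean(ratings$rater1 == ratings$rater2)             # observed proportion of agreement
cats <- union(ratings$rater1, ratings$rater2)              # categories either rater used
p1   <- table(factor(ratings$rater1, levels = cats)) / nrow(ratings)
p2   <- table(factor(ratings$rater2, levels = cats)) / nrow(ratings)
p_e  <- sum(p1 * p2)                                       # agreement expected by chance
(p_o - p_e) / (1 - p_e)                                    # Cohen's kappa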
To calculate Cohen’s kappa in R, we use the command kappa2 from the irr package.
Add this comment and command to your script and run it:
# Calculate Cohen's kappa
kappa2(ratings)
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 5
Raters = 2
Kappa = 0.75
z = 3.54
p-value = 0.000407
The key result here is Kappa, which is your Cohen’s kappa value (in this example, it’s about 0.75 – your value may be higher or lower than this). The output also tells you how many subjects you rated, and the number of people who made ratings. The bit that says Weights: unweighted refers to an aspect of Cohen’s kappa not covered in this course. If you’re curious, type ?kappa2 into the Console and read the help file for this command.
Enter your Cohen’s kappa into your lab book.
Depending on your ratings, you may get a value for Kappa that is zero, or even negative. For further explanation of why this happens, see more on Cohen’s kappa.
There are some words that psychologists sometimes use to describe the level of agreement between raters, based on the value of kappa they get. These words are:
Kappa | Level of agreement |
---|---|
< 0.21 | slight |
0.21 - 0.40 | fair |
0.41 - 0.60 | moderate |
0.61 - 0.80 | substantial |
0.81 - 0.99 | almost perfect |
1 | perfect |
So, in the above example, there is substantial agreement between the two raters.
This choice of words comes from an article by Landis & Koch (1977). Their choice was based on personal opinion.
Relevant worksheet: Evidence.
Let’s take another look at the output we just generated, because there are some bits we haven’t talked about yet:
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 5
Raters = 2
Kappa = 0.75
z = 3.54
p-value = 0.000407
The z and p-value lines relate to a significance test, much like the ones you covered in the Evidence worksheet. As we said back then, psychologists often misinterpret p values, so it’s important to emphasise here that this p value is not, for example, the probability that the raters agree at the level expected by chance. In fact, there is no way to explain this p value that is simple, useful and accurate. A Bayes Factor (see the Evidence workshop) would have been more useful, and easier to interpret, but the irr package does not provide one.
By convention, if the p value is less than 0.05, psychologists will generally believe you when you assert that the two raters agreed more than would be expected by chance. If the p value is greater than 0.05, they will generally be skeptical. If you were writing a report, you could make a statement like:
The agreement between raters was substantial, κ = 0.75, and greater than would be expected by chance, Z = 3.54, p < .05.
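If you want the exact numbers for a statement like this, you don’t have to copy them from the printed output by hand; the result of kappa2 can be stored and its parts extracted. A sketch, assuming the usual structure of results from the irr package:

# Store the result and pull out the values used in the report
k <- kappa2(ratings)
k$value       # kappa
k$statistic   # Z
k$p.value     # p value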
Depending on your ratings, the kappa2 command may give you NaN for the Z score and p value. For further explanation of why this happens, see more on Cohen’s kappa. If you get NaN, it is better to omit the Z and p scores entirely, perhaps with a note that they could not be estimated for your data.
Postscript: Why is it called “Cohen’s kappa”?
The “Cohen” bit comes from its inventor, Jacob Cohen. Kappa (κ) is the Greek letter he decided to use to name his measure (others have used Roman letters, e.g. the ‘t’ in ‘t-test’, but measures of agreement, by convention, use Greek letters). The R command is kappa2 rather than kappa because the command kappa also exists and does something very different, which just happens to use the same letter to represent it. It’d probably have been better to call the command something like cohen.kappa, but they didn’t.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.