Before starting this exercise, you should have completed all the previous Absolute Beginners’, Part 1 workshop exercises. Each section below indicates which of the earlier worksheets are particularly relevant.
“What’s the inter-rater reliability?” is a technical way of asking “How much do people agree?”. If inter-rater reliability is high, they agree a lot. If it’s low, they disagree a lot. If two people independently code some interview data, and their codes largely agree, then that’s evidence that the coding scheme is objective (i.e. is the same whichever person uses it), rather than subjective (i.e. the answer depends on who is coding the data). Generally, we want our data to be objective, so it’s important to establish that inter-rater reliability is high. This worksheet covers two ways of working out inter-rater reliability: percentage agreement, and Cohen’s kappa.
Relevant worksheet: Intro to R Studio
You and your partner must first complete the friendship interview coding exercise. You’ll then get a CSV file that contains both your ratings. If you were unable to complete the coding exercise, you can use this example CSV file instead. You can only gain marks for this exercise if you use your personal CSV file.
Once you have downloaded your CSV file, set up a new project on RStudio Server (Plymouth University students: call your new project psyc414), upload your CSV to your project, and create a new R script called irr.R.
Relevant worksheet: Exploring data
Add comments and commands to your script to load the tidyverse package and your data, then run them (CTRL+ENTER):
## Inter-rater reliability
# Load package
library(tidyverse)
# Load data
friends <- read_csv("irr.csv")
Note: Everyone’s CSV file will have a different name. For example, yours might be called 10435678irr.csv. In the example above, you’ll need to replace irr.csv with the name of your personal CSV file.
Look at the data by clicking on it in the Environment tab in RStudio. Each row is one participant in the interviews you coded. Here’s what each column in the data set contains:
Column | Description | Values |
---|---|---|
subj | Anonymous ID number for the participant you coded | a number |
rater1 | How the first rater coded the participant’s response | One of: “Stage 0”, “Stage 1”, “Stage 2”, “Stage 3”, “Stage 4” |
rater2 | How the second rater coded the participant’s response | as rater1 |
To what extent did you and your workshop partner agree on how each participant’s response should be coded? The simplest way to answer this question is just to count up the number of times you gave the same answer. You’re already looking at the data (you clicked on it in the Environment tab in the previous step, see above), and you only categorized a few participants, so it’s easy to do this by hand.
For example, if you gave the same answer for four out of the five participants, you agreed on 80% of occasions: your percentage agreement was 80%. The number might be higher or lower for your workshop pair.
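If you’d like to double-check your hand count, a one-line sketch like the following would do it (assuming the friends data frame you loaded above):

# Optional check: percentage of rows where the two raters gave the same code
sum(friends$rater1 == friends$rater2) / nrow(friends) * 100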
For realistically-sized data sets, calculating percent agreement by hand would be tedious and error prone. In these cases, it would be better to get R to calculate it for you, so we’ll practice on your current data set. We can do this in a couple of steps:
Relevant worksheet: Group differences.
Your friends data frame contains not only your ratings, but also a list of participant numbers in the subj column. For the next step to work properly, we have to remove this column. We can do this using the select command. In the Group Differences worksheet, you learned how to use the filter command to say which rows of a data frame you wanted to keep. The select command works in a similar way, except that it filters columns.
Add this comment and command to your script and run it:
# Select 'rater1' and 'rater2' columns from 'friends', write to 'ratings'
ratings <- friends %>% select(rater1, rater2)
In the command above, we take the friends data frame, select the rater1 and rater2 columns, and put them in a new data frame called ratings.
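As an aside, select can also be told which columns to drop rather than which to keep. This alternative command (a minor variation, not required for the exercise) produces the same ratings data frame by removing the subj column directly:

# Equivalent: drop the 'subj' column instead of naming the columns to keep
ratings <- friends %>% select(-subj)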
We can now use the agree command to work out percentage agreement. The agree command is part of the package irr (short for Inter-Rater Reliability), so we need to load that package first.
Add these comments and commands to your script and run them:
# Load inter-rater reliability package
library(irr)
# Calculate percentage agreement
agree(ratings)
Percentage agreement (Tolerance=0)
Subjects = 5
Raters = 2
%-agree = 80
NOTE: If you get an error here, type install.packages("irr"), wait for the package to finish installing, and try again.
The key result here is %-agree, which is your percentage agreement. The output also tells you how many subjects you rated, and the number of people who made ratings. The bit that says Tolerance=0 refers to an aspect of percentage agreement not covered in this course. If you’re curious about tolerance in a percentage agreement calculation, type ?agree into the Console and read the help file for this command.
Enter your percentage agreement into your lab book.
One problem with the percentage agreement measure is that people will sometimes agree purely by chance. For example, imagine your coding scheme had only two options (e.g. “Stage 0” or “Stage 1”). With only two options, just by random chance, we’d expect your percentage agreement to be around 50%. For example, imagine that for each participant, each rater flipped a coin, coding the response as “Stage 0” if the coin landed heads, and “Stage 1” if it landed tails. 25% of the time both coins would come up heads, and 25% of the time both coins would come up tails. So, on 50% of occasions, the raters would agree, purely by chance. So, 50% agreement is not particularly impressive when there are two options.
50% agreement is a lot more impressive if there are, say, six options. Imagine in this case that both raters roll a die. One time in six they would get the same number. So, percentage agreement by chance when there are six options is 1/6, or about 17% agreement. If two raters agree 50% of the time when using six options, that level of agreement is much higher than we’d expect by chance.
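If you’d like to convince yourself of these chance levels, here’s a rough simulation sketch (an illustration, not part of the exercise): two imaginary raters pick a category at random for a large number of ‘participants’, and we count how often they happen to match.

# Rough simulation of agreement by chance (illustration only)
set.seed(1)                                   # arbitrary seed, so the result is reproducible
n <- 10000                                    # number of simulated participants
two_options <- LETTERS[1:2]
six_options <- LETTERS[1:6]
mean(sample(two_options, n, replace = TRUE) ==
     sample(two_options, n, replace = TRUE)) * 100   # close to 50
mean(sample(six_options, n, replace = TRUE) ==
     sample(six_options, n, replace = TRUE)) * 100   # close to 17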
Jacob Cohen thought it would be much neater if we could have a measure of agreement where zero always meant the level of agreement expected by chance, and 1 always meant perfect agreement. This can be achieved by the following sum:
(P - C) / (100 - C)
where P is the percentage agreement between the two raters, and C is the percentage agreement we’d expect by chance. For example, say that in your coding exercise, you had a percentage agreement of 80%. You were given five categories to use, so the percentage agreement by chance, if you were both just throwing five-sided dice, is 20%. This gives you an agreement score of:
(80 - 20) / (100 - 20)
[1] 0.75
So, on a scale of zero (chance) to one (perfect), your agreement in this example was about 0.75 – not bad!
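If you’d like to do this sum more than once, you could wrap it in a small function of your own (a hypothetical helper, not part of any package), where P is the observed percentage agreement and C is the percentage agreement expected by chance:

# Chance-corrected agreement: P and C are both percentages
chance_corrected <- function(P, C) {
  (P - C) / (100 - C)
}
chance_corrected(80, 20)   # 0.75, matching the worked example above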
Cohen’s kappa is a measure of agreement that’s calculated in a similar way to the above example. The difference between Cohen’s kappa and what we just did is that Cohen’s kappa also deals with situations where raters use some of the categories more than others. This affects the calculation of how likely it is they will agree by chance. For more information on this, see more on Cohen’s kappa.
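To give a flavour of what that means, here is a sketch of the calculation kappa performs (assuming the ratings data frame created earlier): the chance term is built from how often each rater actually used each category, rather than from the assumption that every category is equally likely.

# Sketch of the kappa calculation (for illustration; kappa2 does all this for you)
p_o  <- mean(ratings$rater1 == ratings$rater2)             # observed proportion of agreement
cats <- union(ratings$rater1, ratings$rater2)              # categories either rater used
p1   <- table(factor(ratings$rater1, levels = cats)) / nrow(ratings)
p2   <- table(factor(ratings$rater2, levels = cats)) / nrow(ratings)
p_e  <- sum(p1 * p2)                                       # agreement expected by chance
(p_o - p_e) / (1 - p_e)                                    # Cohen's kappa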
To calculate Cohen’s kappa in R, we use the command kappa2 from the irr package.
Add this comment and command to your script and run it:
# Calculate Cohen's kappa
kappa2(ratings)
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 5
Raters = 2
Kappa = 0.75
z = 3.54
p-value = 0.000407
The key result here is Kappa, which is your Cohen’s kappa value (in this example, it’s about 0.75 – your value may be higher or lower than this). The output also tells you how many subjects you rated, and the number of people who made ratings. The bit that says Weights: unweighted refers to an aspect of Cohen’s kappa not covered in this course. If you’re curious, type ?kappa2 into the Console and read the help file for this command.
Enter your Cohen’s kappa into your lab book.
Depending on your ratings, you may get a value for Kappa that is zero, or even negative. For further explanation of why this happens, see more on Cohen’s kappa.
There are some words that psychologists sometimes use to describe the level of agreement between raters, based on the value of kappa they get. These words are:
Kappa | Level of agreement |
---|---|
< 0.21 | slight |
0.21 - 0.40 | fair |
0.41 - 0.60 | moderate |
0.61 - 0.80 | substantial |
0.81 - 0.99 | almost perfect |
1 | perfect |
So, in the above example, there is substantial agreement between the two raters.
This choice of words comes from an article by Landis & Koch (1977). Their choice was based on personal opinion.
Relevant worksheet: Evidence.
Let’s take another look at the output we just generated, because there are some bits we haven’t talked about yet:
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 5
Raters = 2
Kappa = 0.75
z = 3.54
p-value = 0.000407
The z and p-value lines relate to a significance test, much like the ones you covered in the Evidence worksheet. As we said back then, psychologists often misinterpret p values, so it’s important to emphasise here that this p value is not, for example, the probability that the raters agree at the level expected by chance. In fact, there is no way to explain this p value that is simple, useful and accurate. A Bayes Factor (see the Evidence workshop) would have been more useful, and easier to interpret, but the irr package does not provide one.
By convention, if the p value is less than 0.05, psychologists will generally believe you when you assert that the two raters agreed more than would be expected by chance. If the p value is greater than 0.05, they will generally be skeptical. If you were writing a report, you could make a statement like:
The agreement between raters was substantial, κ = 0.75, and greater than would be expected by chance, Z = 3.54, p < .05.
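If you want the exact numbers for a statement like this, you don’t have to copy them from the printed output by hand; the result of kappa2 can be stored and its parts extracted. A sketch, assuming the usual structure of results from the irr package:

# Store the result and pull out the values used in the report
k <- kappa2(ratings)
k$value       # kappa
k$statistic   # Z
k$p.value     # p value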
Depending on your ratings, the kappa2 command may give you NaN for the Z score and p value. For further explanation of why this happens, see more on Cohen’s kappa. If you get NaN, it is better to omit the Z and p scores entirely, perhaps with a note that they could not be estimated for your data.
Postscript: Why is it called “Cohen’s kappa”?
The “Cohen” bit comes from its inventor, Jacob Cohen. Kappa (κ) is the Greek letter he decided to use to name his measure (others have used Roman letters, e.g. the ‘t’ in ‘t-test’, but measures of agreement, by convention, use Greek letters). The R command is kappa2 rather than kappa because the command kappa also exists and does something very different, which just happens to use the same letter to represent it. It’d probably have been better to call the command something like cohen.kappa, but they didn’t.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.