Before starting this exercise, you should have had a brief introduction to getting and using RStudio – Introduction to RStudio. You should also have completed the workshop exercises for Exploring Data and Group Differences. If not, take a look at these earlier worksheets before continuing.
First, log in to RStudio server, and make sure you are in your psyc411 project.

Next, create a new script file in your project, called evidence.R.
Now, load tidyverse and the income data, and create the cpslow data frame, by adding the following commands and comments to your R script, and using CTRL+ENTER to run each command in turn:
## EVIDENCE
# Load package
library(tidyverse)
# Load data
cpsdata <- read_csv(url("https://www.andywills.info/cps2.csv"))
# Filter to incomes < $150K
cpslow <- cpsdata %>% filter(income < 150000)
The first thing we’re going to do in this class is calculate a p value. In its most basic form, a p value is just a probability. A probability is a number between 0 and 1 that represents how likely something is to happen. For example, if you flip a fair coin, P(Heads) = 0.5, and P(Tails) = 0.5. If you roll a six-sided die, the probability of rolling a 3 (or any other particular number) is 1/6.
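If you’d like to convince yourself of the coin example, here’s an optional aside (this is our illustration, not one of the worksheet commands, so you don’t need to add it to your script): we can estimate P(Heads) by asking R to simulate a large number of coin flips.

# Optional: estimate P(Heads) by simulating 10,000 coin flips
flips <- sample(c("heads", "tails"), size = 10000, replace = TRUE)
mean(flips == "heads")

The proportion of heads comes out close to 0.5, as expected.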
In psychology, when people talk about p values, they normally mean something a bit more specific. They mean the probability that we would have got the data we did, given a theory that we have. When psychologists talk about p values, that theory is nearly always a theory of no difference (a null hypothesis).
For example, we might have the null hypothesis that men and women in the US have exactly the same mean income, and we collect some data to test this: our cpsdata dataset, which contains the incomes of 10,000 US men and women.
When we started looking at this data (see the previous Group Differences workshop), we observed that the mean salary for women was several thousand dollars lower than the mean salary for men.
Our sample of 10,000 people is less than 1% of the US population, and the effect size in our sample is small (the difference in means is about one-fifth of the standard deviation). Still, on the face of it, it seems unlikely that our data (our sample) would show a gender pay gap of several thousand dollars if, in the US as a whole (i.e. in the US population), male and female mean salaries were exactly the same. So, the probability of our data, given the null hypothesis, seems like it should be quite low.
R allows us to estimate how low this probability is, under the assumption that we have a random sample of the population. One way to do this is to use a between-subjects t-test.
We are aiming to calculate the probability of the income data we have collected, under the theory that men and women have identical mean incomes. One way we can do that is to perform a t-test. A t-test takes two things into account when calculating this probability:
- the effect size (see Group Differences)
- the sample size (i.e. the number of people in our sample, in this case 10,000).
If we hope to find a low probability despite a low effect size, we will need a large sample. Although our effect size is small (about 0.2, see Group Differences), our sample is quite large (10,000 people).
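To get a feel for this, here’s an optional simulation (made-up normal data of our own, not the worksheet dataset): with the same small effect size of 0.2, a small sample usually gives a high p value, while a large sample gives a very low one.

# Optional: simulated data, effect size 0.2, two different sample sizes
set.seed(1)
t.test(rnorm(50, mean = 0), rnorm(50, mean = 0.2))$p.value     # usually above .05
t.test(rnorm(5000, mean = 0), rnorm(5000, mean = 0.2))$p.value # far below .05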
In R, the command for a t-test is very similar to the command for calculating effect size (see Group Differences). In fact, all we need to do is replace cohen.d with t.test.
So, the comment and command to add to your script and run is:
# Perform t-test on income, by sex
t.test(cpsdata$income ~ cpsdata$sex)
Welch Two Sample t-test
data: cpsdata$income by cpsdata$sex
t = -3.6654, df = 9991.3, p-value = 0.0002482
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-14518.10 -4400.64
sample estimates:
mean in group female mean in group male
82677.29 92136.66
The key figure in the output is the p-value. We can see here that the p value is, as we expected, very low: about .0002. So, given a theory that male and female mean incomes are identical in the US population (our null hypothesis), the probability of observing the data we did (our sample) is very low (less than 1 in 1000).
Note: Technically, it’s the probability of observing at least the difference in means that we did (i.e. a difference in means equal or larger than the one we observed), under the assumption the null hypothesis is correct.
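In fact, we can reproduce the p value ourselves from the t and df printed in the output above, because the p value is just the area in the two tails of the t distribution beyond our observed t. This is an optional check, not a required worksheet step:

# Optional check: the p value is the two-tailed area beyond our t value
t_value <- -3.6654
2 * pt(-abs(t_value), df = 9991.3)

This gives about .000248, matching the t.test output.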
Psychologists will generally record this probability in their articles. So, in this case, they would write p = .0002 or, alternatively, p < .05. The latter form, p < .05, reads "the p value is less than .05". More on this later.
Psychologists usually also report at least two further numbers when reporting a t-test in their articles. The first is the t value, t. Our t value in this case is about 3.67 (we can safely ignore the minus sign).

The second is the "degrees of freedom" (df). Degrees of freedom are a way of talking about the size of the dataset used in the t-test. You’ll see it isn’t quite the sample size, because we know our sample size was 10,000, yet the df are a bit lower than that: 9991.3. (The fractional value comes from the Welch correction, which R’s t.test command applies by default, and which adjusts the df when the two groups have unequal variances.)
In summary, t = 3.67, df = 9991.3, and p = .0002. Psychologists have a standard shorthand for this, which is to write: t(9991.3) = 3.67, p = .0002.
You’ll see this sort of thing in most psychology articles. In more recent articles, you’ll also see a report of the effect size, in this case d = .20, as we calculated previously.
The whole sentence would be something like:

Men earned more than women, d = .20, t(9991.3) = 3.67, p < .05.
There’s quite a bit of other output from our t-test that we haven’t discussed. If you’re curious about what the rest of the output means, take a look at more on t-tests. In particular, there is a section on confidence intervals, which may come up in some of your other teaching.
As we covered above, psychologists often report any p value less than .05 as simply p < .05. This comparison to .05 (1 in 20) is an arbitrary convention in our subject. The convention, in practice, works as follows:
- If p < .05, then psychologists will believe your observed difference is real, and allow you to proclaim it as "statistically significant" and to publish it.
- If p >= .05, then psychologists will be sceptical. They probably won’t believe your observed difference is real, and they’ll expect you to describe it as "statistically non-significant".
Sometimes, if p is between .05 and about .15, authors will describe their finding as "marginally significant".
Psychologists will also tell you (quite accurately) that a "non-significant" p value gives you no more information than you had before you collected the data. So, a non-significant p-value does not mean you have evidence for the null. Instead, you have an absence of evidence. This seems crazy – after all, you now have some data, surely that tells you something? This is a big problem with p values, but it is not the only one, as we’ll see in a moment.
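One way to get some intuition for this is an optional simulation (ours, not part of the worksheet): when the null hypothesis really is true, every p value between 0 and 1 is equally likely, so a non-significant p is not a distinctive signature of "no difference".

# Optional: when there is no real difference, p values are uniform
set.seed(1)
p_vals <- replicate(1000, t.test(rnorm(50), rnorm(50))$p.value)
hist(p_vals)  # roughly flat between 0 and 1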
The p value is:

P(data | null) – the probability of our data, assuming the null hypothesis is correct.

The traditional way of interpreting this p value (see above) treats it as if it were the:

P(null | data) – the probability of the null hypothesis, given the data.
These two things are not the same, but psychologists have a long history of not realising this. So, if you can’t see what the problem is just yet, you’re in good company! Below, I’ll try and illustrate the error with an analogy. If you understand the analogy, you understand (at some level) what the problem is. If you don’t get it, don’t worry too much – in the next section, we’ll look at a different sort of test that allows us to avoid the error.
To make this all a bit less abstract, let’s make the data “your car doesn’t start”, and my theory “there’s a banana in your exhaust pipe”. Sticking a banana up your car’s exhaust pipe (not recommended!!) is a reasonably effective way of making it so your car won’t start. So the probability your car won’t start, given there’s a banana in the exhaust pipe, is quite high. Let’s say that P(nostart | banana) = 0.95.
However, the probability your car has a banana in its exhaust pipe, given the fact it doesn’t start, is quite low. There are lots of reasons a car might not start (flat battery, no fuel, …), so the probability that my theory (that there’s a banana in your exhaust) is correct, given the data that your car does not start, is quite low. You’d be better off checking the battery and the fuel gauge first. You can save searching for fruit in your exhaust pipe for later. Let’s say P(banana | nostart ) = .01.
In summary, P(nostart | banana) = 0.95, but P(banana | nostart) = .01. The only thing to take from this example is that these two numbers are not the same. Similarly, P(data|null) will seldom be the same as P(null|data).
It is possible to calculate what we want to know, P(null|data), given what a t-test gives us, P(data|null), but we have to know P(null) – the probability the null hypothesis was correct before we collected our data. We normally don’t know this precisely but we can make an informed guess.
For example, say you have the hypothesis that if people read a few sentences related to old age, they will walk more slowly. Believe it or not, this is a real example of a psychological hypothesis.
Before collecting any data, this hypothesis seems unlikely. If the experiment is well designed, we might conclude that the probability of the null hypothesis was rather high. Let’s say, P(null) = 0.95.
We run the experiment, do a t-test, and get a p value of .049. So, P(data|null) = .049. Traditionally, psychologists would call this a significant result, believe you, and allow you to publish it. This is the wrong response because, in this case, P(null|data) is about 0.5 (not .05, .5! 50:50 !) .
In other words, instead of running the experiment, you’d have done just as well to flip a coin and say your hypothesis was right if the coin came up heads. This is no way to run a research programme, and probably goes some way to explaining why psychology is full of reports of weird findings that aren’t real (a.k.a. the "replication crisis").
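If you’d like to see roughly where that "about 0.5" figure comes from, here’s a sketch using Bayes’ rule. It treats the p value as P(data|null), as the text above does, and makes one further assumption of our own (not necessarily the exact calculation the original figure used): that the data were as likely as they could possibly be under the alternative hypothesis, i.e. P(data|alt) = 1.

# Rough sketch of the "about 0.5" figure, via Bayes' rule
p_null      <- 0.95   # P(null): our informed guess before the experiment
p_data_null <- 0.049  # P(data|null): the p value from the t-test
p_data_alt  <- 1      # P(data|alt): assumed best case for the alternative
p_null * p_data_null / (p_null * p_data_null + (1 - p_null) * p_data_alt)

This gives about 0.48 – roughly 50:50.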
We can escape this whole confusing mess of the traditional t-test by calculating a Bayes Factor rather than a p value. A Bayes Factor is a number that tells us how likely it is that there’s a difference between the groups.
For example, a Bayes Factor of 10 means it’s ten times more likely there is a difference than there isn’t.
A Bayes Factor less than 1 is also informative. For example, a Bayes Factor of 0.1 (1/10) tells you it’s ten times more likely that there isn’t a difference than there is.
A Bayes Factor of exactly 1 tells you that the presence or absence of a difference is equally likely.
All the above assumes that, before you collected the data, you thought the presence or absence of a difference were equally likely outcomes. If that is not the case, see more on Bayes Factors.
A Bayesian t-test is used for the same sorts of things as a traditional t-test (see above), but returns a Bayes Factor rather than a p value.
The easiest way to do a Bayesian t-test in R is to use the BayesFactor package. Recall that you load a package using the library command.
So, the next comment and command to add to your script and run is:
# Load package for Bayes Factors
library(BayesFactor)
R will return a welcome message and various other text in response to this command (we do not show this text here, as it is quite long). You can safely ignore that text. However, if you get an error here, please see common errors.
The command for a Bayesian t-test is similar, but not identical, to the commands for effect size and for a traditional t-test. For our gender pay gap example, we add the following command to our script and run it:
# Perform Bayesian t-test on income, by sex
ttestBF(formula = income ~ sex, data = data.frame(cpsdata))
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 18.25138 ±0%
Against denominator:
Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS
The Bayes Factor is reported on the third line, towards the right. Our Bayes Factor is about 18.25. This means it’s about 18 times more likely that there is a gender pay gap in the US population than that there isn’t.
If you’re curious about what the rest of the output means, see more on Bayes Factors.
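If you’d like the Bayes Factor as a plain number (e.g. for reporting), one approach is to store the result and use the BayesFactor package’s extractBF function. This is optional, and bf here is just a name we’ve chosen:

# Optional: store the result, then extract the Bayes Factor as a number
bf <- ttestBF(formula = income ~ sex, data = data.frame(cpsdata))
extractBF(bf)$bf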
The ttestBF() command (short for "Bayes Factor t-test") has two components, separated by a comma:
- formula = – Here we tell R what we want to analyse (the income column of our data frame), and which group each income belongs to (which is found in the sex column). The tilde, ~, is used in the same way as in the effect size calculation and the standard t-test, i.e. it means "as a function of". So income ~ sex means look at income as a function of biological sex. Unlike cohen.d, we don’t need to say cpsdata$income, because of the second component of the command:
- data = – Here we tell R which data frame to use; in our case, cpsdata. Due to a limitation of the BayesFactor package, we have to specifically tell it to treat our data as a data frame (hence data.frame(cpsdata) rather than just cpsdata).
Psychologists love a “line in the sand”, so a convention has emerged that we believe there is a difference if the Bayes Factor is greater than 3, and believe there isn’t a difference if the Bayes Factor is less than 0.33. We sometimes describe these lines in the sand as “substantial evidence for a difference” (BF > 3) and “substantial evidence for the null” (BF < 0.33). In good, recent papers, you will see a Bayes Factor reported alongside the effect size and results of a traditional t-test. The full sentence might be something like:
Men earned more than women, d = .20, BF = 18.25, t(9991.3) = 3.67, p < .05.
Others, like Kass & Raftery (1995), have provided some additional guidelines on how to interpret Bayes Factors. This is a more detailed guide than a simple "line in the sand", and can prove very useful when comparing multiple studies. The following comes from Andraszewicz et al. (2014):
A rule-of-thumb interpretation of Bayes Factors:

| Bayes Factor | Interpretation |
|---|---|
| > 100 | Extreme evidence for the alternative hypothesis |
| 30 – 100 | Very strong evidence for the alternative hypothesis |
| 10 – 30 | Strong evidence for the alternative hypothesis |
| 3 – 10 | Moderate evidence for the alternative hypothesis |
| 1 – 3 | Anecdotal evidence for the alternative hypothesis |
| 0.33 – 1 | Anecdotal evidence for the null hypothesis |
| 0.10 – 0.33 | Moderate evidence for the null hypothesis |
| 0.03 – 0.10 | Strong evidence for the null hypothesis |
| 0.01 – 0.03 | Very strong evidence for the null hypothesis |
| < 0.01 | Extreme evidence for the null hypothesis |

Note that this is BF10, which means that the larger the Bayes Factor is, the more evidence we have for the alternative hypothesis.
Sometimes, you’ll see Bayes Factor written as BF10, which means the same thing as BF. You’ll also occasionally see BF01, which is the same idea but flipped, so BF01 < 1/3 means substantial evidence for a difference, and BF01 > 3 means substantial evidence for the null.
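Converting between the two conventions is just a matter of taking the reciprocal. For our pay-gap example:

# BF01 is the reciprocal of BF10
1 / 18.25

This gives about 0.055, which is less than 1/3 – substantial evidence for a difference on either scale.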
In order to calculate a Bayes Factor, R has to make some assumptions about how big the difference is likely to be. The ttestBF command does this for you, making some broad assumptions that cover the range of effect sizes typically seen in psychology. Other commands may make different assumptions, including estimating the likely effect size from previous experiments. So, when reporting Bayes Factors, it’s particularly important to report, somewhere in your Results section, the specific method that you used. For the ttestBF function, you’d write something like:
“Bayesian t-tests were calculated using the BayesFactor package (Morey & Rouder, 2022), within the R environment (R Core Team, 2022).”
You can get the references for these citations by typing citation("BayesFactor") and citation().
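Relatedly, if you’re curious how those assumptions can be changed: ttestBF has an rscale argument that sets the width of the assumed effect-size distribution. The r=0.707 shown in the output above is the default ("medium") setting; "wide" and "ultrawide" assume larger effects are more plausible. An optional sketch, not part of the worksheet:

# Optional: a wider prior on effect size changes the Bayes Factor
ttestBF(formula = income ~ sex, data = data.frame(cpsdata),
        rscale = "wide")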
In this exercise, you’ll consolidate what you’ve learned so far. The task is to further examine this sample of US participants (cpsdata), restricted to those earning less than $150k (cpslow).
# EXERCISE
Specifically, the task is to perform a traditional t-test and a Bayesian t-test to address the question of whether people born in the US earn more. So, after adding the comment above, add commands to the end of your script to do this, and run them. Your output should look like the below if you’ve got it right.
As you’ll see below, the Bayesian evidence for a difference is pretty overwhelming in this case – it’s about 3.5 million times more likely there is a difference than there isn’t!
Welch Two Sample t-test
data: cpslow$income by cpslow$native
t = -6.5669, df = 1473.8, p-value = 7.102e-11
alternative hypothesis: true difference in means between group foreign and group native is not equal to 0
95 percent confidence interval:
-9716.752 -5246.980
sample estimates:
mean in group foreign mean in group native
51422.69 58904.56
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 3534729 ±0%
Against denominator:
Null, mu1-mu2 = 0
---
Bayes factor type: BFindepSample, JZS
Note: 7.102e-11 – what does that mean? It’s scientific notation (a.k.a. "standard form"), so it’s read as 7.102 × 10^-11. You would have been taught scientific notation in school, but here’s a reminder if you need it: the BBC Bitesize revision guide on standard form.
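If you’d like R to print the number out in full rather than in scientific notation, you can ask it to (optional):

# Optional: show 7.102e-11 written out in full
format(7.102e-11, scientific = FALSE)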
Now write a single sentence that reports what you’ve found in the standard way. It will be of the form:

Immigrants earn less than those born in the US, BF = , t( ) = , p =

(or you could replace p = with p <). Fill in the blanks!
Question: So, what if your p-value is less than .05, but your Bayes Factor is less than 3?
Answer: Believe the Bayes Factor, and ignore the p value. The p value is there for historical reasons and doesn’t tell you anything you actually want to know. If you’d like to read more about this issue, take a look at more on Bayes Factors.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.