This worksheet covers the Wilcoxon rank-sum test, which is an alternative to the between-subjects t-test, and the Kruskal-Wallis H test, which is an alternative to the between-subjects one-factor ANOVA.
The Wilcoxon rank-sum and Kruskal-Wallis H tests are both non-parametric tests. This means they make fewer assumptions about your data than do standard parametric tests (such as t-tests, and ANOVA). Specifically, where your sample size is small, parametric tests assume that your population is approximately normally distributed. However, it’s important to realise that as your sample size increases, parametric tests make fewer and fewer assumptions about your population distribution, due to the central limit theorem. It’s also important to realise that parametric tests have greater statistical power than non-parametric tests, so we should use parametric tests when their assumptions are met.
Putting this all together, there’s a relatively small set of situations where it makes sense to use a non-parametric test such as the Wilcoxon rank-sum or Kruskal-Wallis H. These are when:
For example, you wouldn’t use an non-parametric test on a small sample of IQ scores, because IQ is known to be normally distributed. The point about effect size follows from the other two - if your sample size is small, you will only be able to detect large effects - see the statistical power worksheet.
Where’s the Bayes Factor? The Wilcoxon rank-sum and Kruskal-Wallis are traditional tests, in the sense that they give us a p-value rather than a Bayes Factor. As we have previously covered, p-values are widely misinterpreted by psychologists, and can never provide evidence for the null hypothesis. For this reason, we have generally advised that you instead use the Bayesian equivalents of these tests - such as Bayesian t-tests, Bayesian ANOVA, and Bayesian chi-square. However, it is not straight forward to calculate Bayesian equivalents of the Mann-Whitney and Kruskal-Wallis tests in R at the moment. So, in this case, we’ll stick to the traditional tests.
To prepare for this worksheet:
Open the rminr-data
project we used previously.
Open the Files
tab. You should see a folder called
going-further
. This folder should contain the files
picture-naming-long.csv
and
music-emotion-preproc.csv
.
If you don’t see the folder or the files, it means you created
your project before the data required for this worksheet was
added to the rminr-data
git repository. You fix this by
asking git to “pull
” the repository. Select the
Git
tab, which is located in the row of tabs which includes
the Environment
tab. Click the Pull
button
with a downward pointing arrow. A window will open showing the files
which have been pulled from the repository. Close the
Git pull
window.
Create a script named non-parametric.R
in the
rminr-data
folder (the folder above
going-further
). Add the comments and commands to this
script as you work through each section of the worksheet.
We’ll demonstrate the Wilcoxon rank-sum test using data from an experiment which evaluated children’s language development using the Words in Game (WinG) test. WinG consists of a set of picture cards that are used in four tests: noun comprehension, noun production, predicate comprehension, and predicate production. The Italian and English versions of the WinG cards use different pictures to depict the associated concepts. The experiment tested whether English-speaking children aged approximately 30 months produce different responses for the two sets of cards.
We start by loading the data. Enter these commands into your script, and run them:
# Non-parametric tests
# Clear environment
rm(list = ls())
# Load tidyverse
library(tidyverse)
# Load data
wing_preproc <- read_csv('going-further/picture-naming-long.csv')
Explanation of commands:
These commands should be familiar from previous worksheets. Command
line 1 clears the environment. Command line 2 loads the
tidyverse
package. Command line 3 reads the data.
The first few lines of wing_preproc
look like this:
subj | gender | cards | task | correct |
---|---|---|---|---|
1 | female | english | nc | 12 |
1 | female | english | np | 4 |
1 | female | english | pc | NA |
1 | female | english | pp | NA |
2 | male | italian | nc | 18 |
2 | male | italian | np | 12 |
2 | male | italian | pc | 17 |
2 | male | italian | pp | 9 |
3 | female | english | nc | 18 |
3 | female | english | np | 13 |
3 | female | english | pc | 17 |
3 | female | english | pp | 9 |
The first three columns are the participant ID number, gender of the participant, and the type of card presented. The fourth column is the test (e.g. “nc” = “Noun comprehension”). The final column is the number of correct responses. Some data is missing - indicated as “NA”.
In the next section, we are going to compare English and Italian on the noun comprehension task. So, we filter the data (which contains all four tasks) to include just that task. We also remove any missing data.
Enter this comment and command into your script, and run it:
# Select 'nc' task; drop NA entries
nc_include <- wing_preproc %>% filter(task == 'nc') %>% drop_na()
Explanation of command: The filter
command should be familiar from many previous worksheets.
drop_na()
removes any row that contains an NA
in it.
When we report non-parametric tests, we normally report the median (rather than the mean) as our descriptive statistic. This makes practical sense, because the mean can be misleading when the distribution is skewed and, if we have chosen to do a non-parametric test, we are (presumably) uncertain whether the distribution is skewed or not.
Enter these comments and commands into your script, and run them:
# Display medians
nc_include %>%
group_by(cards) %>%
summarise(median = median(correct))
# A tibble: 2 × 2
cards median
<chr> <dbl>
1 english 19
2 italian 17
Explanation of commands: These commands should be
familiar from several previous worksheets. We group the data by
cards
, and use summarise
, to calculate the
median()
score for each group.
As we said earlier in this worksheet, the Wilcoxon rank sum test is a non-parametric equivalent of a between-subjects t-test. It works by ranking all of the scores in the two groups, adding the ranks in each group, and comparing these “summed ranks” to determine if they differ.
We’ll run a Wilcoxon rank-sum test to see if there were any
significant differences between scores for the Italian and English cards
on the noun comprehension (nc
) task.
Enter this comment and command into your script, and run it:
# Perform Wilcoxon rank-sum test for noun comprehension
wilcox.test(correct ~ cards, nc_include)
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with ties
Wilcoxon rank sum test with continuity correction
data: correct by cards
W = 58, p-value = 0.125
alternative hypothesis: true location shift is not equal to 0
Explanation of commands:
The command wilcox.test(correct ~ cards, nc_include)
runs the test to compare the correct
scores for the Italian
and English cards
.
Explanation of output:
The phrase with continuity correction
is a technical
detail that can be safely ignored. If you’re curious, you can read more
in the help file, by typing ?wilcoxon.test
into the
console.
The phrase correct by cards
reminds you that you
compared the values in correct
between the levels in the
cards
factor. A p
value of less than .05 is
generally considered by psychologists to be evidence that the two groups
are different.
The phrase
alternative hypothesis: true location shift is not equal to 0
can be safely ignored, as it’s just another (rather obscure) way of
saying that you are testing whether the two groups are different.
The warning cannot compute exact p-value with ties
lets
you know that the method used to calculate p will not be exact,
because some items in the Italian and English scores had identical
rankings. It is possible to calculate the exact p value using the
pwilcox
command, but that’s beyond what we’ll cover in this
worksheet. If your p-value is sufficiently close to .05 that it would
matter if the estimate is a bit off, then a better solution would be to
attempt to replicate your finding in a second study with a larger
sample.
We now have all of the information we need to report the results. For the noun comprehension task, there was no significant difference in accuracy between the Italian (Mdn = 17) and English (Mdn = 19) cards W = 58, p = 0.13.
In some journal articles, you may see a non-parametric test called a “Mann-Whitney U”. This is exactly the same test as computed in the Wilcoxon rank-sum test above, just with a different name (and represented by a U rather than a W).
Calculate summary statistics and the Wilcoxon rank-sum for the noun production task. Your results should look like this:
# A tibble: 2 × 2
cards median
<chr> <dbl>
1 english 13
2 italian 11
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with ties
Wilcoxon rank sum test with continuity correction
data: correct by cards
W = 47.5, p-value = 0.5619
alternative hypothesis: true location shift is not equal to 0
Copy the R code you used for this exercise into PsycEL.
The Kruskal-Wallis H test is a non-parametric equivalent of a one-way between subjects ANOVA. It extends the Mann-Whitney test to situations where there are more than two groups. Like the Mann-Whitney test, the Kruskal-Wallis test works on ranked data.
We’ll demonstrate the Kruskal-Wallis test using data from a study which compared emotion regulation strategies between fans of mainstream (control group), goth, metal and emo music. Participants were measured using the Emotion Regulation Strategies for Artistic Creative Activities Scale (ERS-ACA), an 18 item inventory, with each item scored from 1 (‘strongly disagree’) to 5 (‘strongly agree’). The ERS-ACA gives an overall measure of the strategy people use to regulate their emotions when they engage in artistic, creative activities, and scores on three strategy sub-scales; avoidance, approach and self-development.
We start by loading the data; enter these comments and command into your script, and run it:
# Kruskal-Wallis H test
# Load data
ers_l <- read_csv('going-further/music-emotion-preproc.csv')
Explanation of command:
This data has already undergone some preprocessing and is in long
format. The first few lines of ers_l
look like this:
subj | subculture | ers | score |
---|---|---|---|
17 | Goth | avoidance | 1.286 |
17 | Goth | approach | 4.167 |
17 | Goth | development | 4.2 |
17 | Goth | total | 3.056 |
18 | Metal | avoidance | 4 |
18 | Metal | approach | 3.5 |
18 | Metal | development | 3.4 |
18 | Metal | total | 3.667 |
We’ll start by calculating the medians for the ‘approach’ subscale.
Enter these commands into your script, and run them:
# Select 'approach' conditionl; drop NAs
approach <- ers_l %>% filter(ers == 'approach') %>% drop_na()
# Display medians by 'subculture'
approach %>%
group_by(subculture) %>%
summarise(median = median(score))
# A tibble: 4 × 2
subculture median
<chr> <dbl>
1 Emo 3.83
2 Goth 3.67
3 Mainstream 3.67
4 Metal 3.83
Explanation of commands:
Command line 1 filters the data to only include measurements for the
‘approach’ subscale and removes missing data. Command lines 2-5 are very
similar to the summary statistics we generated for the Mann-Whitney
test. In this case we group by music subculture
.
Explanation of output:
The differences in medians between groups look quite small.
We can now run the Kruskal-Wallis test.
Enter this comment and command into your script, and run it:
# Perform Kruskal-Wallis test
kruskal.test(score ~ subculture, data = approach)
Kruskal-Wallis rank sum test
data: score by subculture
Kruskal-Wallis chi-squared = 5.2313, df = 3, p-value = 0.1556
Explanation of commands:
The command
kruskal.test(score ~ subculture, data = approach)
runs the
test to compare the ERS-ACA score
scores for the four
groups in subculture
.
Explanation of output:
The string score by subculture
reminds you that you
compared the values in score
between the levels in the
subculture
factor. The Kruskal-Wallis H statistic
is 5.2313. R describes it as chi-squared
because it is
possible to estimate the relevant p-value using a chi-square
distribution with the degrees of freedom (df
) for that
distribution set to one less than the number of groups, in this case
df
= 3. The p
value tells us whether there was
a significant difference between the four medians. It does not tell us
which pairs of groups differ significantly from each other (for that,
using a Wilcoxon rank-sum).
The results of this test are as follows:
There was no significant difference in approach style between the mainstream (Mdn = 3.67), goth (Mdn = 3.67), metal (Mdn = 3.83) and emo (Mdn = 3.83) groups, H = 5.23, p = 0.16.
Calculate summary statistics and Kruskal-Wallis H for the self-development emotional response subscale of the ERS-ACA. Your results should look like this:
# A tibble: 4 × 2
subculture median
<chr> <dbl>
1 Emo 3.8
2 Goth 4
3 Mainstream 3.6
4 Metal 3.8
Kruskal-Wallis rank sum test
data: score by subculture
Kruskal-Wallis chi-squared = 8.5011, df = 3, p-value = 0.03671
Copy the R code you used for this exercise into PsycEL.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.