Before starting this exercise, you should have completed all the previous Absolute Beginners’, Part 1 workshop exercises. Each section below indicates which of the earlier worksheets are particularly relevant.
Relevant worksheet: Introduction to RStudio, Exploring data
Download this CSV file, which contains all the data you need for this worksheet. Then, create or open an appropriate project on RStudio Server for this analysis (Plymouth University students: use the project ‘psyc414’ created in the inter-rater reliability worksheet), upload your CSV to your project, and create a new R script called chi.R.
Now, add these comments and commands to your script and run them; they will load the tidyverse package, and load your data.
## Relationships
# Load package
library(tidyverse)
# Load data into 'friends'
friends <- read_csv("chi.csv")
Look at the data by clicking on it in the Environment tab in RStudio. Each row is one participant in an interview about friendships. Here’s what each of the columns in the data set contains:
Column | Description | Values |
---|---|---|
subj | Anonymous ID number of participant | a number |
age | Age of the participant | One of: “7 years”, “9 years”, “12 years”, “15 years” |
gender | Gender of the participant | One of: “male”, “female” |
culture | Culture of participant | One of: “China”, “East Germany”, “Iceland”, or “Russia” |
coded | How their interview response was coded | One of “activity”, “feelings”, “helping”, “length”, “norms”, or “trust” |
This is a large dataset comprising over 700 participants of different ages, genders, and cultures. It is based on, but not identical to, real data on this topic analysed by Michaela (Gummerum et al., 2008). An R script was used to generate these data from Michaela’s more complex data set.
Let’s start by looking at how often each of the coded responses (i.e. activity, feelings, helping, length, norms, and trust) appears in the interviews. We could do this by hand, but it would be slow and error-prone. Instead, we use the table command in R to do it for us.
Add this comment and command to your script and run it (CTRL+ENTER):
# Table 'coded' column of 'friends'
table(friends$coded)
activity feelings  helping   length    norms    trust
     255       96      133      192       53       55
R gives us a table, which reports how often each of the coded responses occurred in the data set. We can see that activity was used the most, norms the least. In fact, activity was used more than feelings, norms, and trust combined.
Here’s a step-by-step explanation of how the above command works. You’ll need this in a moment to calculate some frequency tables for yourself.
table()
- This command counts how many times each thing occurs (in this case, how often each type of coded response occurs).
friends$coded
- We need to tell table() where to find the data we are interested in. In this case, it’s the coded column of the friends dataframe that we loaded earlier. We tell R this by typing friends$coded. Yes, that’s $, the same symbol as we use to indicate US Dollars. However, it doesn’t mean “dollars” in R. It means column. So, friends$coded means the coded column of the friends dataframe.
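As a minimal sketch of how table() and $ work together, the example below uses a small made-up data frame called pets. Both the name and its contents are hypothetical and are not part of the worksheet data, so the example runs without chi.csv.

```r
# Hypothetical mini data frame, for illustration only (not the worksheet data)
pets <- data.frame(animal = c("cat", "dog", "cat", "fish", "cat", "dog"))

# pets$animal picks out the 'animal' column of 'pets'
pets$animal

# table() counts how often each value occurs in that column
pet_counts <- table(pets$animal)
pet_counts
```

The same pattern, dataframe$column inside table(), applies to any column of any data frame.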
Now produce frequency tables for each of the other variables in this dataframe (i.e. age, gender, and culture). You do this by changing the command table(friends$coded) so that it now refers to a different column in the friends dataframe. Re-read the explanation of the command above if you’re stuck.
# EXERCISE 1
# Table 'age' column of 'friends'
# Table 'gender' column of 'friends'
# Table 'culture' column of 'friends'
Enter the above comments into your script, and fill in and run the correct command underneath each comment.
Do children’s ideas about friendship differ across cultures? We can use the table command to look at this, too. We use it to produce a frequency table of the coded responses for each of the different cultures in our sample, like this:
Add the following comments and commands to your script and run them:
# Produce culture x coded contingency table, put into 'cont'
cont <- table(friends$culture, friends$coded)
# Display contingency table
cont
               activity feelings helping length norms trust
  China              47       33      52     28    28     8
  East Germany       75       19      26     58     6    12
  Iceland            78       17      13     62     6    20
  Russia             55       27      42     44    13    15
Here’s an explanation of each part of that command:
cont <-
- Store this table as cont, so we can use it later. The command <- stores the thing on its right in the thing on its left.
table(rows, columns)
- The R command for producing tables. We replace the word rows with the name of the variable we want to appear on the rows of the table, and we replace the word columns with the name of the variable we want to appear in the columns of the table.
friends$culture
- The culture column of the friends data frame. We’ve put this first in our table command, so culture appears as rows.
friends$coded
- The coded column of the friends data frame. This appears second in our table command, so coded appears as columns.
cont
- Lastly, we type cont on its own to display the contingency table in the Console (clicking on cont in the Environment tab in RStudio won’t work in this case).
R gives us a table, showing how many of each response were made in each culture. This is called a contingency table. The name contingency table comes from the word contingent, as in, for example “Getting your degree is contingent on passing your exams”. A contingency table gives the frequencies for one variable (e.g. the interview responses) contingent on another variable (e.g. the culture of the participants).
Close inspection of the contingency table reveals that, for example, the “helping” response is more common in China than in Iceland. The “activity” response is more common in Iceland than in Russia. So, it does look like children’s conceptions of friendship vary between cultures. Of course, not everyone in the same culture responded the same way but, overall, some types of response are more or less likely in some cultures than others.
Some people find it quite hard to notice these kinds of patterns in contingency tables, and the patterns are certainly harder to spot in a table than in a good visualization. The visualization we’re going to use here is called a mosaic plot. The command to do this in R is as follows:
Add the following comment and command to your script and run it:
# Display mosaic plot of 'cont'
mosaicplot(cont)
It’s called a mosaic plot because it’s made up of tiles. In the above example, the width of each tile represents the number of participants from each culture. We collected data from approximately the same number of people from each culture, so all tiles are approximately the same width.
The height of each tile is determined by the frequency of each of the responses (feelings, helping, etc.) within each culture – the more common a response within a particular culture, the taller the tile.
Looking at this mosaic plot, it’s visually obvious that “length” is a less common response in China than in other countries.
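The numbers behind the tile widths and heights can be checked directly. The sketch below rebuilds the contingency table from the counts printed earlier in this worksheet, so it runs without chi.csv. The name cont_counts is an assumption made for this illustration. With the real data you would use cont instead.

```r
# Counts copied from the contingency table printed in this worksheet
cont_counts <- matrix(
  c(47, 33, 52, 28, 28,  8,
    75, 19, 26, 58,  6, 12,
    78, 17, 13, 62,  6, 20,
    55, 27, 42, 44, 13, 15),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    culture = c("China", "East Germany", "Iceland", "Russia"),
    coded = c("activity", "feelings", "helping", "length", "norms", "trust")
  )
)

# Tile width follows the number of participants in each culture (row totals)
culture_totals <- rowSums(cont_counts)
culture_totals

# Tile height follows the proportion of each response within a culture;
# margin = 1 makes the proportions in each row sum to 1
row_props <- prop.table(cont_counts, margin = 1)
round(row_props, 2)
```

In these counts every culture has a row total of 196, which is why the tiles are the same width, and “length” makes up a smaller share of the responses in China than in any of the other three cultures.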
So, it looks like there’s some kind of relationship between culture and conceptions of friendship … but how good is the evidence that this is a real result, and not just some kind of fluke we can put down to chance? As we covered in the Evidence worksheet, the best way to answer this question is to calculate a Bayes Factor (BF). In R, we can calculate the Bayes Factor for a contingency table like this:
Add the following comments and commands to your script and run them:
# Load the BayesFactor package
library(BayesFactor, quietly = TRUE)
# Calculate Bayes Factor for contingency table 'cont'
contingencyTableBF(cont, fixedMargin = "rows", sampleType = "indepMulti")
Bayes factor analysis
--------------
[1] Non-indep. (a=1) : 107633530 ±0%
Against denominator:
Null, independence, a = 1
---
Bayes factor type: BFcontingencyTable, independent multinomial
The Bayes Factor is reported on the third line, towards the right. The Bayes Factor in this example is about 107.6 million. This means it’s more than 100 million times more likely that there is a relationship between culture and friendship concepts, than there isn’t.
Psychologists generally agree to believe the relationship is real if the Bayes Factor exceeds 3, and generally agree to believe the relationship is not real if the Bayes Factor is less than 0.33. So, in this example, we have very strong evidence for the existence of a relationship.
If you’re curious about what the rest of the output means, see more on relationships.
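The conventional thresholds described above (greater than 3, less than 0.33) can be written as a small function. This is only a sketch: interpret_bf is a hypothetical helper, not part of the BayesFactor package, and the “inconclusive” label for values in between is an assumption of this example rather than wording from the worksheet.

```r
# Hypothetical helper (not part of the BayesFactor package): applies the
# conventional thresholds described in this worksheet
interpret_bf <- function(bf) {
  if (bf > 3) {
    "evidence for a relationship"
  } else if (bf < 0.33) {
    "evidence for no relationship"
  } else {
    # Assumed label for values between the two thresholds
    "inconclusive"
  }
}

interpret_bf(107633530)  # the Bayes Factor reported above
interpret_bf(1.5)        # between the two thresholds
interpret_bf(0.1)        # below 0.33
```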
The first line, library(BayesFactor, quietly = TRUE), loads the BayesFactor package, which is a set of extra commands that allows R to calculate Bayes Factors.
contingencyTableBF()
- The command for calculating a BF (Bayes Factor) for a contingency table.
cont
- Our contingency table (we stored it in cont earlier on in this worksheet).
fixedMargin = "rows", sampleType = "indepMulti"
- This tells R that the different groups in your sample (in this case, different cultures) appear as the rows of your contingency table. If you’d put them as the columns (e.g. if you’d used table(friends$coded, friends$culture)), then you would change this to fixedMargin = "cols". For a more detailed explanation, see more on relationships.
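To illustrate the fixedMargin option, the sketch below flips the table so that cultures appear as columns, and changes fixedMargin to match. It rebuilds the table from the counts printed earlier, using the assumed names cont_counts and cont_flipped, and it only calculates the Bayes Factor if the BayesFactor package is installed.

```r
# Counts copied from the contingency table printed in this worksheet
cont_counts <- matrix(
  as.integer(c(47, 33, 52, 28, 28,  8,
               75, 19, 26, 58,  6, 12,
               78, 17, 13, 62,  6, 20,
               55, 27, 42, 44, 13, 15)),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    culture = c("China", "East Germany", "Iceland", "Russia"),
    coded = c("activity", "feelings", "helping", "length", "norms", "trust")
  )
)

# t() swaps rows and columns, so cultures (the groups) now appear as columns
cont_flipped <- t(cont_counts)
dim(cont_flipped)

# Run this part only if the BayesFactor package is installed
if (requireNamespace("BayesFactor", quietly = TRUE)) {
  library(BayesFactor, quietly = TRUE)
  # Groups are now in the columns, so the fixed margin is "cols"
  print(contingencyTableBF(cont_flipped,
                           sampleType = "indepMulti",
                           fixedMargin = "cols"))
}
```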
There’s a long history in psychology of performing a contingency-table chi-square test to examine the level of evidence for a relationship. The results of such tests are widely misinterpreted by psychologists, but some still like to see them anyway. Here’s how to calculate one for these data:
Add the following comment and command to your script and run it:
# Calculate traditional chi-square test on 'cont' contingency table
chisq.test(cont)
Pearson's Chi-squared test
data: cont
X-squared = 89.169, df = 15, p-value = 1.417e-12
The key result here is the p-value. It’s important to emphasize that this p value is not the probability that the observed relationship is due to chance. As we covered in the Evidence worksheet, there is no way to explain this p value that is simple, useful, and accurate.
Nonetheless, the convention is that if the p value is less than 0.05, psychologists will generally believe you when you assert that the relationship is not due to chance. If the p value is greater than 0.05, they will generally be skeptical.
The p value in this example is very small, so has been reported in standard form, and is read as \(1.417 \times 10^{-12}\). You would have been taught standard form in school but, as a reminder, \(1.417 \times 10^{-12}\) = .000000000001417. See this BBC bitesize revision guide on standard form if you need a bit more explanation than that.
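If you would rather let R do the conversion from standard form, something like the following sketch should work. The variable names p_value and p_full are just illustrations.

```r
# 1.417e-12 is how R writes 1.417 x 10^-12
p_value <- 1.417e-12

# Ask R to write the same number out in full, without the exponent
p_full <- format(p_value, scientific = FALSE)
p_full

# However it is written, the number is far below .05
p_value < 0.05
```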
The reported p value is less than .05 in this example, and so psychologists will generally believe your result is real.
In addition to the p value, psychologists will generally report at least two further numbers in their articles. The first is the chi-square value, written as X-squared in the above output, but as \(\chi^2\) in articles.
The second is the degrees of freedom (df in the above output). In this case, degrees of freedom relates to the size of the contingency table, and is the number of rows, minus one, multiplied by the number of columns, minus one (i.e. (rows - 1) x (cols - 1)).
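The degrees of freedom can be checked by hand against the chi-square output. The sketch below rebuilds the contingency table from the counts printed earlier in this worksheet, so it runs without chi.csv. The names cont_counts, df_by_hand and result are assumptions made for this example.

```r
# Counts copied from the contingency table printed in this worksheet
cont_counts <- matrix(
  c(47, 33, 52, 28, 28,  8,
    75, 19, 26, 58,  6, 12,
    78, 17, 13, 62,  6, 20,
    55, 27, 42, 44, 13, 15),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    culture = c("China", "East Germany", "Iceland", "Russia"),
    coded = c("activity", "feelings", "helping", "length", "norms", "trust")
  )
)

# Degrees of freedom by hand: (rows - 1) x (cols - 1)
df_by_hand <- (nrow(cont_counts) - 1) * (ncol(cont_counts) - 1)
df_by_hand

# chisq.test() stores the same numbers it prints, so we can extract them
result <- chisq.test(cont_counts)
result$parameter  # degrees of freedom
result$statistic  # chi-square value
```

With 4 rows and 6 columns, this gives (4 - 1) x (6 - 1) = 15, matching the df in the output.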
If you were writing up this analysis in a report, you would write something like:
The coded friendship concepts occurred with different frequency across cultures, BF = \(1.08 \times 10^8\), \(\chi^2\)(15) = 89.2, p < .001, see Table 1.
“Table 1” would be the contingency table you’d produced with the table command.
As discussed in the Evidence worksheet, it is also important to report the method by which you calculated your Bayes Factor. So, somewhere in your report, you should say something like:
Bayes Factors were calculated using the BayesFactor package (Morey & Rouder, 2022), within the R environment (R Core Team, 2022).
You can get the references for these citations by typing citation("BayesFactor") and citation().
Each step in this exercise can be completed by slightly modifying a command you have already used.
# EXERCISE 2
Add these modified commands, along with the above comment, to your script and run them.
Here are the things you should do:
Modify the command cont <- table(friends$culture, friends$coded) appropriately. If your modified command still uses cont, the commands you used before should now work without having to modify them:
Produce a mosaic plot from this contingency table.
Calculate the Bayes Factor for the relationship. Enter your Bayes Factor into your lab book.
Perform a contingency chi-square test.
When you write up an experiment, you often need to provide some summary information about the sample, including the exact number of participants, and the gender balance. R makes it easy to work these things out, as this worksheet shows: sample characteristics.
For more detailed information on the analyses covered in this worksheet, see more on relationships.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0. It is part of Research Methods in R, by Andy Wills