Before starting this exercise, you should have completed all the previous Absolute Beginners’ workshop exercises. Each section below indicates which of the earlier worksheets are particularly relevant.
Relevant worksheet: Introduction to RStudio
Use this example CSV file for the workshop and assessment.
Create or open an appropriate project on RStudio Server for this analysis (Plymouth University students: use the project ‘psyc414’ created in the inter-rater reliability worksheet), upload your CSV to your project, and create a new R script called corr.R.
Relevant worksheet: Exploring data
Now, add these comments and commands to your script and run them; they will load the tidyverse package and then load your data.
## Correlations
# Load package
library(tidyverse)
# Load data
data <- read_csv("corr.csv")
Note: In the example above, you’ll need to replace corr.csv with the name of the CSV file you just copied into your RStudio project.
Look at the data by clicking on it in the Environment tab in RStudio. Each row is one participant in one group. Here’s what each of the columns in the data set contains:
Column | Description | Values |
---|---|---|
SRN | ID number of participant | a number |
grp | ID number of the group that this participant was in | a number |
ingroup | Participant’s rating of ingroup closeness | 1 (low) - 10 (high) |
outgroup | Participant’s rating of outgroup distance | 1 (low) - 10 (high) |
dominance | Participant’s rating of the dominance of their group leader | 1 (low) - 10 (high) |
This is a large dataset comprising over 200 participants.
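If you’d like to check the number of participants for yourself, the nrow command (part of base R, not covered in the earlier worksheets) reports the number of rows in a data frame, which here is one per participant:
# Count rows in 'data' (one per participant) - an optional check
nrow(data)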
Relevant worksheets: Group differences, face recognition
The data from this study is different from the other data you have looked at so far in this course. In particular, the participants worked as a group, rather than individually. This means that, for example, ratings of ingroup closeness are likely to be more similar within a group than between groups. One group might have got on really well with one another, so all its members gave quite high closeness ratings. Another group might not have ‘gelled’, so all its members gave quite low ratings. Of course, even within groups, some ratings may be higher than others (some members of Group X might feel closer to Group X than others), but it’s likely that ratings within the same group will be more similar than ratings across groups. We call this sort of data hierarchical data.
We won’t cover how to make the most of hierarchical data until a later course. For this introductory course, we’re going to take the simple approach of averaging ratings within each group. So, for example, if a group had two members, one who gave a rating of 5 and the other a rating of 7, we would average these and record the group’s score as 6. As we covered in the group differences worksheet, we can do this using the group_by and summarise commands.
Add the following comment and commands to your script and run them (CTRL+ENTER):
# Group 'data'; take means of 'ingroup', 'outgroup', 'dominance' columns; put results in 'gdata'
gdata <- data %>%
group_by(grp) %>%
summarise(ingroup = mean(ingroup),
outgroup = mean(outgroup),
dominance = mean(dominance))
We’ve put our answers into a new data frame, gdata, so go to the Environment window of RStudio and click on gdata to see your summarized data. You’ll now see one line for each group in your study. As before, you can safely ignore the “ungrouping” message that you receive.
Most of the above command is the same as in the group differences worksheet and the face recognition worksheet; take a look back at those sheets if you need a reminder. The new thing here is that we are calculating the mean for more than one variable. In fact, we’re calculating it for three variables (ingroup, outgroup, dominance). The summarise command can do this, as long as there is a comma (,) separating the things you want a summary of.
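As an optional aside, if you have a recent version of the tidyverse installed (dplyr 1.0 or later), the across command gives a more compact way of writing the same summary. This is just a sketch of an alternative; the version above works fine:
# Optional alternative: take the mean of several columns at once using across()
gdata <- data %>%
  group_by(grp) %>%
  summarise(across(c(ingroup, outgroup, dominance), mean))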
Relevant worksheets: Group differences, facial attractiveness
Did every group give basically the same rating of ingroup closeness, or did closeness vary a lot between groups? One way to take a look at this is to produce a density plot, as we covered in the group differences and facial attractiveness worksheets.
Add the following comment and commands to your script and run them:
# Produce density plot of 'ingroup'
gdata %>% ggplot(aes(ingroup)) + geom_density(aes(y = ..scaled..)) + xlim(1, 10)
In the example above, the most common (modal) rating of ingroup closeness is between 7 and 8. So, on average, people rated the ingroup closeness as quite high. However, there was quite a range of ratings, both above and below this modal rating. Your data may be different.
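One technical note: newer versions of ggplot2 (3.4 and later) prefer the after_stat notation over the older ..scaled.. notation, and may print a deprecation warning for the latter. If you see such a warning, this equivalent command avoids it:
# Equivalent density plot using the newer after_stat() notation
gdata %>% ggplot(aes(ingroup)) + geom_density(aes(y = after_stat(scaled))) + xlim(1, 10)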
We can ask the same question about outgroup distance. Did everyone give basically the same rating, or did outgroup distance vary a lot between groups? Changing ingroup to outgroup in the above command gives us the answer.
Add the following comment and commands to your script and run them:
# Produce density plot of 'outgroup'
gdata %>% ggplot(aes(outgroup)) + geom_density(aes(y = ..scaled..)) + xlim(1, 10)
In the example above, most groups gave close to the lowest possible rating (1), so we see a large peak in the plot at around 1. We also see a series of much smaller peaks, indicating that a few groups gave much higher ratings. It is possible that these mostly low ratings are due to social desirability bias – the phenomenon that people are reluctant to give answers that their social group would view negatively.
As in the last example, your data may look different.
Relevant worksheet: Face recognition
So, ingroup closeness varies between groups, as does outgroup distance (at least to some extent). Are these two sorts of variability related? For example, does high ingroup closeness tend to be associated with high outgroup distance – perhaps feeling close to your ingroup is associated with feeling distant from your outgroup?
Or perhaps high ingroup closeness is associated with low outgroup distance — feeling close to your own group also makes you feel close to other groups? Or, a third option, perhaps the two things are unrelated — whether you have high or low ingroup closeness does not predict your outgroup distance.
One way to look at this question is to produce a scatterplot. On a scatterplot, each point represents one group. That point’s position on the x-axis represents their ingroup closeness, and that point’s position on the y-axis represents their outgroup distance.
The command to produce a scatterplot in R is much like the command for a bar graph, as you used in, for example, the face recognition worksheet. The only difference is that we use the geom_point() command (because the graph is a set of dots, or points) rather than the geom_col() command we used for bar (column) charts.
Add the following comment and commands to your script and run them:
# Produce scatterplot of 'ingroup' against 'outgroup'
gdata %>% ggplot(aes(x = ingroup, y = outgroup)) + geom_point()
In the above example, many of the points are close to the x-axis. This is because, as we saw above, most groups gave a rating close to 1 for outgroup distance. However, once we get to an ingroup closeness above 8, an interesting pattern starts to emerge. As ingroup closeness increases from 8 to 10, outgroup distance rises from around 1 to around 7 or 8.
So it seems that, in this example dataset, ingroup closeness and outgroup distance are related. We call this type of relationship a correlation.
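If you’d like to see that relationship drawn onto your scatterplot, you can optionally add a best-fitting straight line with the geom_smooth command. This wasn’t covered in the earlier worksheets; method = "lm" asks for a straight (linear) line, and the shaded band indicates uncertainty around it:
# Scatterplot with a best-fitting straight line added (optional)
gdata %>% ggplot(aes(x = ingroup, y = outgroup)) + geom_point() + geom_smooth(method = "lm")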
Relevant worksheet: Group differences
Sometimes, it’s useful to have a single number that summarizes how well two variables are correlated. We can calculate this number, called a correlation co-efficient, using the cor command in R.
Add the following comment and command to your script and run it:
# Calculate correlation co-efficient between 'ingroup' and 'outgroup'
cor(gdata$ingroup, gdata$outgroup)
[1] 0.6641777
The command is used in a similar way to the cohen.d command you used to calculate effect size in the group differences worksheet:
cor() - The command to calculate a correlation co-efficient.
gdata$ingroup - One variable is in the ingroup column of the gdata data frame.
, - This comma needs to be here so R knows where one variable ends and the other begins.
gdata$outgroup - The other variable is in the outgroup column of the gdata data frame.
In the above example, the correlation co-efficient was about 0.66. By tradition, we use a lower case r to represent a correlation co-efficient, so here r = 0.66. In order to make sense of this number, you need to know that the biggest r can ever be is 1, and the smallest it can ever be is -1.
Where r = 1: A correlation of 1 means a perfect linear relationship. In other words, there is a straight line you can draw that goes exactly through the centre of each dot on your scatterplot. The line can be shallow or steep. Here are some examples:
Where r = 0: A correlation of zero means there is no relationship between the two variables. Here are some examples:
Where r is between 0 and 1: As the correlation co-efficient gets further from zero, the relationship between the two variables becomes more like a straight line. Here are some more examples:
Where r is less than 0: A negative correlation co-efficient just means that, as one variable gets larger, the other gets smaller:
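If you’d like to see these cases for yourself, here is a small, optional demonstration using made-up data. The rnorm command generates random numbers, and set.seed just makes the example repeatable; none of these numbers come from the study data:
# Optional demonstration of r using made-up data
set.seed(1)                      # any number will do; this just makes the example repeatable
x <- rnorm(100)
cor(x, 2 * x + 3)                # a perfect linear relationship: r is exactly 1
cor(x, rnorm(100))               # two unrelated variables: r is close to 0
cor(x, -x + rnorm(100))          # a negative relationship: r is well below 0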
Relevant worksheet: Group differences
A correlation co-efficient is much like an effect size, which we covered in the group differences worksheet. More specifically, it measures how much the two variables vary together (their covariance), relative to how much each variable varies on its own.
Jacob Cohen suggested the following conventions for describing correlation co-efficients: a correlation of around 0.1 is a weak relationship, around 0.3 is a moderate relationship, and around 0.5 is a strong relationship. Not all psychologists agree with these descriptions.
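If it helps to see those conventions as code, here is a small hypothetical helper function (our own invention, not part of any package) that labels the size of a correlation co-efficient. The ‘negligible’ label for values below 0.1 is our own choice, not one of Cohen’s:
# Hypothetical helper: label the size of r using Cohen's conventions
describe_r <- function(r) {
  size <- abs(r)                 # the sign doesn't affect the strength
  if (size >= 0.5) {
    "strong"
  } else if (size >= 0.3) {
    "moderate"
  } else if (size >= 0.1) {
    "weak"
  } else {
    "negligible"                 # below 0.1: no conventional label; our choice
  }
}
describe_r(0.66)                 # "strong", for the example above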
Relevant worksheet: Evidence
So far, we’ve produced a scatterplot of ingroup closeness versus outgroup distance, and we’ve calculated a correlation co-efficient for that relationship (r = 0.66 in the example above). But is the relationship between these two variables real, or a fluke? Much like the Bayesian t-test we calculated in the evidence worksheet, we can calculate a Bayes Factor for the relationship between two variables.
The first step is to load the BayesFactor package, which we previously used in the evidence worksheet.
Add the following comment and command to your script and run it:
# Load BayesFactor package
library(BayesFactor, quietly = TRUE)
Then, we use the correlationBF command, which has a similar format to the cor command above.
Add the following comment and command to your script and run it:
# Calculate Bayes Factor for correlation between 'ingroup' and 'outgroup'
correlationBF(gdata$ingroup, gdata$outgroup)
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 89.70525 ±0%
Against denominator:
Null, rho = 0
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
The Bayes Factor is reported on the third line, towards the right. In this example, our Bayes Factor is about 89.71. This means it’s about ninety times as likely there is a relationship between these two variables as there isn’t. This is larger than the conventional threshold of 3, so psychologists will generally believe you when you claim that there is a relationship between ingroup closeness and outgroup distance. If the Bayes Factor had been less than 0.33, this would have been evidence that there was no relationship.
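If you want the Bayes Factor as a plain number (for example, to use it later in a script), the BayesFactor package provides the extractBF command, which returns the details of the analysis as a data frame:
# Pull out the Bayes Factor as a plain number
bf <- correlationBF(gdata$ingroup, gdata$outgroup)
extractBF(bf)$bf                 # about 89.71 in the example above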
As we covered in the Evidence worksheet, psychologists have typically reported p values, despite the fact that p values are widely misinterpreted. If you want to calculate a p value for a correlation co-efficient, you can use the following command.
Add the following comment and command to your script and run it:
# Traditional test for correlation between 'ingroup' and 'outgroup'
cor.test(gdata$ingroup, gdata$outgroup)
Pearson's product-moment correlation
data: gdata$ingroup and gdata$outgroup
t = 4.2608, df = 23, p-value = 0.0002939
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3647781 0.8390981
sample estimates:
cor
0.6641777
The p value in this case is about .00029. The p value is not the probability that the null hypothesis is false, nor is it anything else that is both clear and useful (see the Evidence worksheet for more details). However, the value of .00029 is lower than the conventional .05 cutoff. This means psychologists will generally believe you when you claim that there is a relationship between ingroup closeness and outgroup distance.
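The cor.test command also returns more than it prints. If you want to pick out individual numbers, you can store the result and use $; estimate and p.value are standard components of the object that cor.test returns:
# Store the test result and pick out individual values
result <- cor.test(gdata$ingroup, gdata$outgroup)
result$estimate                  # the correlation co-efficient, r
result$p.value                   # the p value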
In this exercise, you’ll apply what you’ve learned to the relationship between ingroup closeness and group-leader dominance. Do each of the following analyses, and include them as part of your report. In order to get graphs from RStudio into your word processor, follow these instructions.
Hint: Most of these steps can be completed by copying the commands you used earlier, and replacing outgroup with dominance; a sketch of the first step appears after the list below.
# EXERCISE
Add the above to your script. Then, add comments and commands to your script to do the following, and run those commands:
1. Make a density plot of the dominance scores.
2. Make a scatterplot with ingroup closeness on the x-axis, and group-leader dominance on the y-axis.
3. Calculate the correlation co-efficient for ingroup versus dominance.
4. Calculate the Bayes Factor for this correlation.
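For example, a sketch of the first step, assuming you keep using the gdata data frame from earlier, would be:
# Produce density plot of 'dominance'
gdata %>% ggplot(aes(dominance)) + geom_density(aes(y = ..scaled..)) + xlim(1, 10)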
For more detailed information on the analyses covered in this worksheet, see more on relationships, part 2.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.