Before starting this exercise, you should have had a brief introduction to using RStudio – Introduction to RStudio. You should also have also completed the workshop exercises for Exploring Data. If not, take a look these earlier worksheets before continuing.
The first few steps are the same as in the Exploring Incomes
workshop. First, log in to
RStudio
server, and make sure you are in your psyc411
project.
Next, create a new script file in your project,
called groupdiff.R
.
Now, you load tidyverse and you load the income data frame, by adding the following comments and commands to your R script, and using CTRL+ENTER to run each command in turn:
# GROUP DIFFERENCES
# Load packages
library(tidyverse)
# Load data
cpsdata <- read_csv("https://www.andywills.info/cps2.csv")
As a reminder, here’s what each of the columns in the income data frame contain:
Column | Description | Values |
---|---|---|
ID | Unique anonymous participant number | 1-10,000 |
sex | Biological sex of participant | male, female |
native | Participant born in the US? | foreign, native |
blind | Participant blind? | yes, no |
hours | Number of hours worked per week | a number |
job | Type of job held by participant: | charity, nopay, private, public |
income | Annual income in dollars | a number |
education | Highest qualification obtained | grade-school, high-school, bachelor, master, doctor |
One of the most widely discussed issues concerning income is the difference between what men and women, on average, get paid. Let’s have a look at that difference in our teaching sample of 10,000 US participants.
We’ll start by calculating median income, irrespective of biological sex. You already did this in the Exploring data workshop, so this is revision. If you need to, take a look back at that last worksheet to remind yourself how this works:
# Calculate median income
cpsdata %>% summarise(median(income))
What we need, though, are two median incomes – one for males and one
for females. In R, the command group_by
allows us to do
this. In this case, we want to group the data by biological sex, so the
command is group_by(sex)
. We pipe
(%>%
) the data in cpsdata
to the
group_by
command in order to group it, and then we
pipe (%>%
) it to summarise
to get
a summary for each group (a median, in this case).
So, the full command to add to your script and run (CTRL+ENTER) is:
# Calculate median income by sex
cpsdata %>% group_by(sex) %>% summarise(median(income))
# A tibble: 2 × 2
sex `median(income)`
<chr> <dbl>
1 female 52558.
2 male 61746.
Looking at your output, you’ll be able to see that median income for males is around $62,000, while for females it’s around $53,000. (If you’re curious what the rest of the output means, including the word “tibble”, see more on tibbles). The message concerning “ungrouping output” just means that after R gives you the means, it forgets that you grouped the data (so if you want to use these groups again, you’ll have to ask for them again, see later).
R also works as a calculator, so we can work out female median income as a percentage of male median income – this is a standard way of expressing the gender pay gap:
# Calculate gender pay gap
52558 * 100 / 61746
[1] 85.11968
In conclusion, women in our made-up sample get paid, on average, around 85% what men get paid. This is about the same percentage as we see in the analyses of the gender pay gap in the US, using real data sets.
Of course, not every male gets $62k a year in the US, and not every female gets $53k. It seems very likely that the range of incomes earned by men and women overlap – meaning that if you picked one man and one woman at random, there’s a reasonable chance that the woman earns more than the man. This variation in pay is the topic of the next exercise.
Standard deviation is a number that basically represents how
far, on average, people are from the mean. If the standard deviation of
income is large, it’s quite likely that a person, picked at random, will
have an income that’s a lot different to the mean. You can see this in
the first few lines of our cpsdata
data set – the
participant with ID 5 has an annual income of over $700k; more than 12
times the median income!
What is the standard deviation in pay for males, and for
females? We can calculate this in R with a minor change to the commands
we’ve previously used. Specifically, we want a by-group summary, not of
median
income, but of the standard deviation,
sd
, of income.
So, the command to add to your script and run is:
# Calculate standard deviation of income by sex
cpsdata %>% group_by(sex) %>% summarise(sd(income))
# A tibble: 2 × 2
sex `sd(income)`
<chr> <dbl>
1 female 130449.
2 male 127601.
The standard deviations for both men and women are very large (around $130k) compared to the size of the average pay gap (around $9k). This means there is very considerable overlap in salaries between these two groups. It’s quite hard to get an intuitive sense of the size of this overlap just from the standard deviations, but a graph can help to make things clearer.
In the Exploring data worksheet, we used R to produce a histogram of incomes. The first thing we’re going to do now is to produce a scaled density plot of incomes. Scaled density plots can be interpreted much the same way as a histogram - the higher the curve is at a particular income, the more people who have that income.
The main difference between a scaled density plot and a histogram is that the highest point on a scaled density plot is always one. This can make it easier to compare two groups, particularly if one group has fewer people in it than the other.
So here’s the command to add to your script, that
will do a scaled density plot for incomes, ignoring biological sex. It
works the same way as the histogram command from last time, except that
we’ve replaced geom_histogram
with
+ geom_density
.
# Produce a scaled density plot for income
cpsdata %>% ggplot(aes(income)) + geom_density(aes(y=..scaled..))
The part of the command aes(y=..scaled..))
says that the
aesthetic (‘aes’) we want for the y-axis of the graph is “scaled” … in
other words we want a scaled density plot.
Next, we’re going to produce two density plots, on the same axes – one for males, and one for females. This is so we can look at how much the two sets of incomes overlap.
We want these two density plots to be different colours, to make it
easy for us to tell them apart. We do this by adding
colour=factor(sex)
to the aesthetics (aes
) of
the plot.
Here’s the command to add to your script and run:
# Produce a scaled density plot for income, by sex
cpsdata %>% ggplot(aes(income, colour=factor(sex))) + geom_density(aes(y=..scaled..))
Oh dear! That was not very revealing! These two lines seem basically on top of each other … but they can’t be because we know the two groups differ in median income by several thousand dollars. We have a problem to solve…
The problem is one of scale. As we discussed in Exploring data, there are a small
number of people who earn very high salaries. In fact, both the
highest-paid man, and the highest-paid woman in our sample earn
considerably more than $1m. We can find their exact salaries using the
max
(short for “maximum”) summary; here’s the
command to add to your script and run:
# Find highest income, by sex
cpsdata %>% group_by(sex) %>% summarise(max(income))
# A tibble: 2 × 2
sex `max(income)`
<chr> <dbl>
1 female 1908742
2 male 1508971
These figures relate to another phenomenon in the US (and many other countries) – income inequality. When we work out income inequality for a company (e.g. Starbucks), we take the salary of the CEO and divide it by the median salary of the workers. The CEO of Starbucks earns about 700 times as much as the average worker at Starbucks. In our sample of 10,000 people, the best paid woman earns about 36 times more than the average (median) woman:
# Calculate income inequality
1908742 / 52558
[1] 36.31687
Somehow, we need to deal with the fact that a few people in our sample are very well paid, which makes the difference between men and women hard to see on our graph, despite the difference being in the range of several thousand dollars a year. One of the easiest ways around this is to exclude these very high salaries from our graph.
Looking at our previous density plots, we can see that the vast majority of people are paid less than $150k a year. So, let’s restrict our plotting to just those people.
We do this using the filter
command. It’s called
filter because it works a bit like the filter paper in a
chemistry lab (or in your coffee machine) – stopping some things, while
letting other things pass through. We can filter our data by telling R
what data we want to keep. Here, we want to keep all people who
earn less than £150k, and filter out the rest. So the filter we need is
filter(income < 150000)
, where <
means
“less than”.
We’ll be using this dataset of people with <$150k incomes a few
times, so we’re going to give it a new name, cpslow
(or any
other name you want, e.g. angelface )
So, what we need to do is pipe (%>%
) our
cpsdata
data to our
filter(income < 150000)
, and use an arrow,
<-
, to send this data to our new data frame,
cpslow
. Recall that <-
sends the thing on
its right to the thing on its left, so the full command to add
to your script and run is:
# Select people with incomes under £150K, put into 'cpslow'
cpslow <- cpsdata %>% filter(income < 150000)
We can take a look at this new data frame by clicking on it in RStudio’s Environment window. By looking at the ID numbers, you can see that some people in our original sample have been taken out, because they earned at least $150k.
Now, we can plot these filtered data in the same way as before, by
changing the name of the dataframe from cpsdata
to
cpslow
.
So start with the command
cpsdata %>% ggplot(aes(income, colour=factor(sex))) + geom_density(aes(y=..scaled..))
,
make that change, add it to your script (after copying the
comment below), and run it..
# Produce a scaled density plot for income, by sex (for incomes < £150k)
If you’ve got it right, your graph will look like this:
At first glance, the two distributions of incomes still look quite similar. For example, the modal income – the point where the graph is highest – is at quite a low income, and that income is quite similar for both men and women. However, on closer inspection, you’ll also see that the red line (females) is above the blue line (men) until about $60k, and below the blue line from then on. This means that more women than men earn less than $60k, and more men than women earn more than $60k.
So, the gender pay gap is visible in this graph. The graph also illustrates that the difference in this sample is small, relative to the range of incomes.
Note: This doesn’t mean that the gender pay gap is less (or more) important than income inequality. These kinds of questions of importance are moral, philosophical, and political. Statistics cannot directly answer these kinds of questions, but they can provide information to inform the debate.
Effect size is a way of talking about the size of the difference between group means, relative to the standard deviations of those groups. We often use the letter d to stand for effect size.
If a difference has an effect size of 1, the difference in means is equal to the standard deviation. In social science, an effect size of 1 is considered “large” – in other words, many of the things we are interested in have an effect size smaller than 1.
At the other end of the scale, an effect size of 0.2 is considered to be “small”. Note that “small” refers to the size of the effect relative to the standard deviation, not its importance to society. If you have an effect size of 0.2, the difference between your groups is one-fifth the size of the standard deviation.
Below are some examples of small, medium, and large effect sizes. In these examples, each group has 100 participants.
In this section, we’re going to calculate the effect size for the
gender pay gap, in our sample of people who earn less than $150k. We
could do this using the mean
and sd
summaries
we used previously to calculate means and standard deviations. However,
there is a quicker way, using the effsize package.
Recall that packages in R are a way to add new commands. So,
the first thing we need to do is load the effsize
package
using the library
command.
So, add this comment and command to your script and run it:
# Load effect size package
library(effsize)
If you get an error here, please see common errors.
Next, we calculate effect size. Note: Take care if
typing this in by hand. $
is the US dollar symbol.
~
is a tilde. On a standard UK keyboard, the tilde
is immediately to the right of :
. On a UK Mac, it’s
immediately left of Z
.
Here’s the comment command to add to your script and run:
# Calculate effect size for income gender gap
cohen.d(cpslow$income ~ cpslow$sex)
Cohen's d
d estimate: -0.2024276 (small)
95 percent confidence interval:
lower upper
-0.2433951 -0.1614601
Here’s a step-by-step explanation of how the above command works. You’ll need this in a moment to calculate effect sizes for yourself.
cohen.d()
- Effect size can be calculated in a number
of different ways. The method we’re using in this class is one of the
most common. It’s called Cohen’s d, and is named after Jacob
Cohen. This is why the command to calculate effect size we used is
called cohen.d()
.cpslow$income
- We need to tell
cohen.d()
where to find the numbers we are interested in.
In this case, its the income
column of the
cpslow
dataframe that we made earlier. We tell R
this by typing cpslow$income
. Yes, that’s $
,
the same symbol as we use to indicate US Dollars. However, it doesn’t
mean “dollars” in R. It means column. So,
cpslow$income
means the income
column of the
cpslow
dataframe.
cpslow$sex
- We also need to tell
cohen.d()
whether each income was generated by a man or a
woman. This information is in the sex
column of the
cpslow
dataframe, so we type
cpslow$sex
.
~
- This symbol, called a tilde, means “as
a function of”. So, cpslow$income ~ cps$sex
means “income
as a function of sex”, and hence
cohen.d(cpslow$income ~ cps$sex)
means “give me the effect
size of biological sex on annual income in the cpslow
dataset.
The effect size is reported on the second line of text. The minus
sign is there because “female” is earlier in the dictionary than “male”.
R takes the mean incomes in the order their names appear in the
dictionary, and so calculates female - male
. Female mean
income is lower than male mean income, so the number is negative.
Generally speaking, we can ignore the minus sign when reporting effect
sizes.
So the effect size is about 0.20. This is typically described as a small effect size — although recall that this just means the standard deviation is about five times larger than the difference between the groups. It does not mean “small” in the sense of “unimportant”. Statistics cannot tell you which group differences are socially or morally important.
The other lines of output report the 95% confidence interval for the effect size. This tell us that while our best estimate of the effect size on the basis of the data we have is about 0.202, it might not be exactly 0.202. However, we can be pretty confident the effect size is somewhere between 0.16 and 0.24. If we had more data, we could be more precise.
In the R output, lower and upper indicate the two ends of the range of likely effect sizes.
In this exercise, you’ll consolidate what you’ve learned so far.
The task is to further examine this sample of participants
who are living in the US, and earning less than $150k
(cpslow
).
Specifically, the question to answer is whether people born in the US earn more. In order to do this, you should calculate the mean income for each group, produce a density plot with one line for each group, and calculate the effect size.
# EXERCISE 1
After adding the above comment to your script, add commands and comments to answer the question to your script, and run them. Below are the answers you are aiming for:
# A tibble: 2 × 2
native `mean(income)`
<chr> <dbl>
1 foreign 51423.
2 native 58905.
Cohen's d
d estimate: -0.1960726 (negligible)
95 percent confidence interval:
lower upper
-0.2593175 -0.1328277
This second exercise is similar to the first, but a bit more challenging.
The task is to calculate mean income, a density plot, and
effect size, for private versus public sector workers in the
cpslow
dataset.
The thing that makes this slightly harder is that the datatset has
more than two job types, so you’re going to use the filter
command to get just the participants in private or
public jobs. Here’s how…
You can list the values you want to keep with filter
,
using |
to seperate them. For example, if you wanted to
keep just the master
and doctor
levels of the
dataset you’d write:
cpslow %>% filter(education == "master" | education == "doctor")
The symbol |
is read “or”, so the command is “give me
the people who have a Master’s degree or a Doctoral degree”. On a
standard UK keyboard you’ll find |
to the immediate left of
Z
. On a UK Mac keyboard, it’s immediately to the right of
"
.
If you get an error, please see common errors.
# EXERCISE 2
After adding the above comment to your script, add commands and comments to answer the question to your script, and run them. Below are the answers you are aiming for:
# A tibble: 2 × 2
job `mean(income)`
<chr> <dbl>
1 private 63453.
2 public 70435.
Cohen's d
d estimate: -0.1856093 (negligible)
95 percent confidence interval:
lower upper
-0.2673074 -0.1039113
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.