Before starting this exercise, you should have had a brief introduction to using RStudio. If not, take a look at the Using RStudio worksheet.
Throughout this worksheet, you’ll see the commands you should type into RStudio inside a grey box, followed by the output you should expect to see in one or more white boxes. Any differences in the colour of the text can be ignored.
Each command in this worksheet is followed by one or more explanation sections - those are there to help you understand how the commands work and how to read the output they produce.
First, we need to load a package called tidyverse. A package is an extension to R that adds new commands. Nearly everything we’ll do in this course uses the tidyverse package, so pretty much every project starts with this instruction.
Type (or copy and paste) the following from the grey box into the Script window of RStudio, starting at line 1. Now, with your cursor on line 3, press CTRL+ENTER (i.e. press the key marked ‘Ctrl’ and the RETURN or ENTER key together).
# Exploring data (briefly)
# Load package
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
When you do this, line 3 is automatically copied to your Console window and run. Then, RStudio will print some text to the Console (shown in the white box, above). This text tells you that the tidyverse package has loaded (“attached”) some other packages (e.g. dplyr). It also tells you that the dplyr package changes the way some commands in R work (“conflicts”). That’s OK.
If you get an output that includes the word ‘error’, please see common errors.
Note: The first two lines are comments. Any line starting with a # is a comment. These are ignored by RStudio, but they make it easier for humans to work out what is going on!
You should notice that the name Untitled1 on the Script window has now gone red. This is to remind you that your script has changed since the last time you saved it. So, click on the “Save” icon (the little floppy disk) and save your R script with some kind of meaningful name, for example vbgr.R (Plymouth University students: Please use this exact name). The .R indicates that it is an R script.
Re-save your script each time you change something in it; that way, you won’t lose any of your work.
Now, we’re going to load some data on the income of 10,000 people in the United States of America. I’ve made up this dataset for teaching purposes, but it’s somewhat similar to large open data sets available on the web, such as the US Current Population Survey. Here’s how you get a copy of this data into RStudio so you can start looking at it:
1. Download a copy of the data, by clicking here and saving it to the Downloads folder of your computer.
2. Go to RStudio in your web browser.
3. Click on the ‘Files’ tab in RStudio (bottom right rectangle).
4. Click the ‘Upload’ button.
5. Click ‘Browse…’.
6. Go to your Downloads folder, and select the file you just saved there.
7. Click ‘OK’.
Copy or type the following comment and command into your RStudio script window, and run it (i.e. press CTRL+ENTER while your cursor is on the line containing the command):
# Load data into 'cpsdata'
cpsdata <- read_csv("cps2.csv")
Rows: 10000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): sex, native, blind, job, education
dbl (3): ID, hours, income
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
There are three parts to the command cpsdata <- read_csv("cps2.csv"):
The first part of the command is cpsdata. This gives a name to the data we are going to load. We’ll use this name to refer to it later, so it’s worth using a name that is both short and meaningful. I’ve called it cpsdata because it’s somewhat similar to data from the US Current Population Survey, but you can give data pretty much any name you choose (e.g. fart).
The bit in the middle, <-, is an arrow and is typed by pressing < and then -, without a space. This arrow means “put the thing on the right of the arrow into the thing on the left of the arrow”.
The last part of the command is read_csv("cps2.csv"). It loads the data file into cpsdata. The part inside the speech marks, cps2.csv, is the name of the file you just uploaded to your RStudio project. This command can also download data directly from the web, for example read_csv("https://andywills.info/cps2.csv"). This would have been a quicker way to do it in this case, but of course not all data is on a web page.
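If you would like to try the arrow and read_csv() without needing a file, here is a minimal sketch. The names answer and tiny, and the three lines of CSV text, are made up for illustration; wrapping text in I() tells read_csv() to treat it as data rather than as a file name.

```r
# Sketch only: 'answer', 'tiny' and the CSV text are made-up examples
library(tidyverse)

# The arrow puts the thing on its right into the name on its left
answer <- 6 * 7
answer

# read_csv() can read literal CSV text wrapped in I();
# here it stands in for a file such as cps2.csv
tiny <- read_csv(
  I("ID,sex,income\n1,female,42000\n2,male,39000"),
  show_col_types = FALSE  # quiets the column specification message
)
tiny
```

Typing the name of an object on a line by itself, as with answer and tiny here, prints its contents to the Console.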
R likes to print things in red sometimes – this does not mean there’s a problem. If there’s a problem, it will actually say ‘error’. The output here tells us that R has loaded the data, which has 10,000 rows and eight columns. It gives us the names of the columns (ID, sex, ...) and tells us what sort of data each column contains: chr (short for character) means the data is words (e.g. ‘female’), dbl (short for double) means the data is a number (e.g. ‘42.78’).
If you get an error here, please see common errors.
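If you want to check column types again later, one option is the glimpse() command, which is loaded along with tidyverse. The sketch below uses a small made-up data frame called people (not part of cpsdata) with one column of words and one column of numbers.

```r
# Sketch only: 'people' is a made-up data frame for illustration
library(tidyverse)

people <- tibble(
  sex    = c("female", "male", "female"),
  income = c(42000.50, 39000, 51250.75)
)

# glimpse() lists each column with its type:
# <chr> for words, <dbl> for numbers
glimpse(people)
```

Running glimpse(cpsdata) in your own project should list the same eight columns reported when the data was loaded.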
Next, we’ll take a peek at these data. You can do this by clicking on the data in the Environment tab of RStudio, see Using RStudio.
We can now see the data set (also known as a data frame). We can see that this data frame has 8 columns and 10000 rows. Each row is one person, and each column provides some information about them. Below is a description of each of the columns. Where you see NA this means this piece of data is missing for this person – quite common in some real datasets.
Here’s what each of the columns in the data set contains:
Column | Description | Values |
---|---|---|
ID | Unique anonymous participant number | 1-10,000 |
sex | Biological sex of participant | male, female |
native | Participant born in the US? | foreign, native |
blind | Participant blind? | yes, no |
hours | Number of hours worked per week | a number |
job | Type of job held by participant | charity, nopay, private, public |
income | Annual income in dollars | a number |
education | Highest qualification obtained | grade-school, high-school, bachelor, master, doctor |
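To see how the NA entries mentioned above can be counted, here is a minimal sketch using a made-up five-row data frame called mini. It uses the summarise command and the %>% symbol, both of which are explained later in this worksheet, so feel free to come back to it.

```r
# Sketch only: 'mini' is a made-up miniature of cpsdata
library(tidyverse)

mini <- tibble(
  ID    = 1:5,
  hours = c(40, NA, 35, NA, 20)  # NA marks a missing answer
)

# is.na() is TRUE wherever a value is missing; sum() counts the TRUEs
mini %>% summarise(missing_hours = sum(is.na(hours)))
```

In this made-up example two of the five people did not report their hours, so missing_hours is 2.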
Now we have these data, one question we can ask is “what is the average income of people in the U.S.?” (or, at least, in this sample). In this first example, we’re going to calculate the mean income.
I’m sure you learned about means in school but, as a reminder, you calculate a mean by adding up all the incomes and dividing by the number of incomes. Our sample has 10,000 participants, so this would be a long and tedious calculation – and we’d probably make an error.
It would also be a little bit tedious and error-prone in a spreadsheet application (e.g. Excel, LibreOffice Calc). There are some very famous cases of these kinds of “Excel errors” in research, e.g. genetics, economics.
In R, we can calculate the mean instantly, and it’s harder to make the sorts of errors that are common in Excel-based analysis.
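As a reminder of the arithmetic, here is a small worked sketch with five made-up incomes (they are not taken from cpsdata). It adds up the incomes and divides by how many there are, then checks the answer against R’s built-in mean command.

```r
# Sketch only: five made-up incomes, not taken from cpsdata
incomes <- c(20000, 35000, 50000, 65000, 80000)

# Add up all the incomes, then divide by the number of incomes
by_hand <- sum(incomes) / length(incomes)
by_hand

# R's built-in mean() gives the same answer
mean(incomes)
```

Both lines give 50000, because the five incomes add up to 250000 and 250000 divided by 5 is 50000.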
To calculate mean income in R, we add the following comment and command to our script, and press CTRL+ENTER:
# Display mean income
cpsdata %>% summarise(mean(income))
# A tibble: 1 × 1
`mean(income)`
<dbl>
1 87293.
Your output will tell you the mean income in this sample – it’s the last number on the bottom right, and it’s approximately $87,000.
If you’re happy with the output you’ve got, move on to the next section. If you would like a more detailed explanation of this output, see more on tibbles.
If you get an error here, please see common errors.
This command has three components:
The bit on the left, cpsdata, is our data frame, which we loaded and named earlier.
The bit in the middle, %>%, is called a pipe. Its job is to send data from one part of your command to another. It is typed by pressing % then > then %, without spaces. So cpsdata %>% sends our data frame to the next part of our command.
The bit on the right, summarise(mean(income)), is itself made up of parts. The command summarise does as the name might suggest: it summarises a set of data (cpsdata in this case) into a single number, e.g. a mean. The mean command indicates that the type of summary we want is a mean (there are also a number of other types of summary, as you’ll see in other courses). Finally, income is the name of the column of cpsdata we want to take the mean of – in this case, the income of each individual.
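To make the job of the pipe a little more concrete, here is a minimal sketch using a made-up three-row data frame called mini. It shows that piping the data into summarise gives the same result as handing the data to summarise directly. The last command also gives the summary a name of our choosing (mean_income, a made-up name), which is optional.

```r
# Sketch only: 'mini' and 'mean_income' are made-up names
library(tidyverse)

mini <- tibble(income = c(20000, 35000, 50000))

# These two commands do the same thing: the pipe sends 'mini'
# to summarise() as the data to work on
piped  <- mini %>% summarise(mean(income))
direct <- summarise(mini, mean(income))

# Optionally, give the summary column a shorter name
named <- mini %>% summarise(mean_income = mean(income))
named
```

The pipe becomes more useful once a command has several steps, because the steps can then be read from left to right.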
To calculate the mean number of hours worked per week, we have to deal with the fact that there is some missing data - we don’t know for all 10,000 people how many hours they work in a week, because they didn’t all tell us. To get a mean of those who did tell us, we tell R to ignore the missing data, like this:
# Calculate mean hours per week
cpsdata %>% summarise(mean(hours, na.rm = TRUE))
# A tibble: 1 × 1
`mean(hours, na.rm = TRUE)`
<dbl>
1 38.9
rm is short for ‘remove’, but ‘ignore’ would be a more accurate description, as this command doesn’t delete the NA entries in cpsdata, it just ignores them. So na.rm = TRUE means “ignore the missing data”.
If you get an error here, please see common errors.
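To see what na.rm = TRUE changes, here is a minimal sketch using a made-up data frame called mini with two missing values. It is an illustration only, not part of the cpsdata analysis.

```r
# Sketch only: 'mini' is a made-up data frame with two missing values
library(tidyverse)

mini <- tibble(hours = c(40, NA, 35, NA, 30))

# Without na.rm = TRUE, a single NA makes the whole mean NA
mini %>% summarise(mean(hours))

# With na.rm = TRUE, the mean is taken over the three known values
mini %>% summarise(mean(hours, na.rm = TRUE))

# The NA entries are still in the data afterwards
mini
```

The second command gives 35, the mean of 40, 35 and 30, and printing mini afterwards shows that both NA entries are still there.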
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.