Exploring data

Before you start…

Before starting this exercise, you should have had a brief introduction to getting and using RStudio. If not, take a look at the Introduction to RStudio.

How to use these worksheets
Loading a package
Loading data
Inspecting data
Calculating a mean
Calculating a median
Dealing with missing data
Introduction to graphs
Customising graphs
Lab book writing exercise

How to use these worksheets

Throughout this worksheet, you’ll see the commands you should type into RStudio inside a grey box, followed by the output you should expect to see in one or more white boxes. Any differences in the colour of the text can be ignored.

Each command in this worksheet is followed by one or more explanation sections - those are there to help you understand how the commands work and how to read the output they produce.

Loading a package

First, we need to load a package called tidyverse. A package is an extension to R that adds new commands. Nearly everything we’ll do in this course uses the tidyverse package, so pretty much every project starts with this instruction.

Here’s how you do this:

Type (or copy and paste) the comments and command in the grey box below into the script you have already created on RStudio (top left window, exploring.R). Use line 2 of the script. (Line 1 contains the comment you entered in the last worksheet).
Save your script again (click the Save icon), so you don’t lose anything. Do this each time you add something important to your script.
Now ask RStudio to run the command library(tidyverse). You do this by putting your cursor on the line of the script window containing the command, and pressing CTRL+ENTER (i.e. press the key marked ‘Ctrl’ and the RETURN or ENTER key together). The line is automatically copied to your Console window and run.

# EXPLORING INCOMES
# Load packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

When you do this, RStudio will print some text to the Console (shown in the white box, above). This text tells you that the tidyverse package has loaded (“attached”) some other packages (e.g. dplyr). It also tells you that the dplyr package changes the way some commands in R work (“conflicts”). That’s OK.

If you get an output that includes the word ‘error’, please see common errors.

Loading data

Now, we’re going to load some data on the income of 10,000 people in the United States of America. I’ve made up this dataset for teaching purposes, but it’s somewhat similar to large open data sets available on the web, such as the US Current Population Survey.

Copy the comment and command in the grey box to your script in RStudio, save your script, and then press CTRL+ENTER to run the command:

# Load data
cpsdata <- read_csv("https://www.andywills.info/cps2.csv")

Rows: 10000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): sex, native, blind, job, education
dbl (3): ID, hours, income

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explanation of command

There are three parts to the command cpsdata <- read_csv("https://www.andywills.info/cps2.csv"):

The first part of the command is cpsdata. This gives a name to the data we are going to load. We’ll use this name to refer to it later, so it’s worth using a name that is both short and meaningful. I’ve called it cpsdata because it’s somewhat similar to data from the US Current Population Survey, but you can give data pretty much any name you choose (e.g. fart).
The bit in the middle, <-, is an arrow and is typed by pressing < and then -, without a space. This arrow means “put the thing on the right of the arrow into the thing on the left of the arrow”.
The last part of the command is read_csv("https://www.andywills.info/cps2.csv"). The read_csv command reads a file of comma-separated values (CSV); because we have given it a web address, it downloads the data from the Internet. The part inside the speech marks, https://www.andywills.info/cps2.csv, is a web address, such as you’d use to access any other web page (e.g. http://www.twitter.com/ajwills72).
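To try out the arrow and read_csv without downloading anything, here is a minimal sketch. The names answer and toy, and the three made-up rows, are assumptions for illustration only; they are not part of cpsdata. Wrapping the text in I() tells read_csv to treat it as the contents of a file rather than as a file name (this assumes readr version 2.0 or later).

```r
library(tidyverse)

# The arrow puts the thing on the right into the name on the left
answer <- 6 * 7
answer

# read_csv() can also read CSV text directly; I() marks it as literal data.
# 'toy' and its contents are made up for this illustration.
toy <- read_csv(
  I("ID,sex,income\n1,female,30000\n2,male,45000\n3,female,52000"),
  show_col_types = FALSE   # keeps the column-specification message quiet
)
toy
```

Typing the name of a data frame on its own line, as in the last line above, prints it to the Console.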

Explanation of output

R likes to print things in red sometimes – this does not mean there’s a problem. If there’s a problem, it will actually say ‘error’. The output here tells us that R has loaded the data, which has 10,000 rows and eight columns. It gives us the names of the columns (sex, native, ...) and tells us what sort of data each column contains: chr (short for character) means the data is words (e.g. ‘female’), and dbl (short for double) means the data is a number (e.g. ‘42’).

If you get an error here, please see common errors.

Inspecting data

Next, we’ll take a peek at these data. You can do this by clicking on the data in the Environment tab of RStudio, see Introduction to RStudio.

We can now see the data set (also known as a data frame). We can see that this data frame has 8 columns and 10000 rows. Each row is one person, and each column provides some information about them. Below is a description of each of the columns. Where you see NA this means this piece of data is missing for this person – quite common in some real datasets.

Here’s what each of the columns in the data set contains:

Column	Description	Values
ID	Unique anonymous participant number	1-10,000
sex	Biological sex of participant	male, female
native	Participant born in the US?	foreign, native
blind	Participant blind?	yes, no
hours	Number of hours worked per week	a number
job	Type of job held by participant	charity, nopay, private, public
income	Annual income in dollars	a number
education	Highest qualification obtained	grade-school, high-school, bachelor, master, doctor
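If you prefer to inspect a data frame from your script rather than by clicking on it, the following sketch shows a few commonly used commands. The data frame toy is made up for illustration; the same commands should work on cpsdata once it is loaded.

```r
library(tidyverse)

# Made-up data frame for illustration only (not the real cpsdata)
toy <- tibble(
  ID     = 1:4,
  sex    = c("female", "male", "female", "male"),
  hours  = c(40, NA, 35, 20),            # NA marks a missing value
  income = c(30000, 45000, 52000, 18000)
)

nrow(toy)      # number of rows (one per person)
ncol(toy)      # number of columns
glimpse(toy)   # one line per column: name, type, and the first few values
head(toy, 2)   # the first two rows
```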

Calculating a mean

Now we have these data, one question we can ask is “what is the average income of people in the U.S.?” (or, at least, in this sample). In this first example, we’re going to calculate the mean income.

I’m sure you learned about means in school but, as a reminder, you calculate a mean by adding up all the incomes and dividing by the number of incomes. Our sample has 10,000 participants, so this would be a long and tedious calculation – and we’d probably make an error.

It would also be a little bit tedious and error prone in a spreadsheet application (e.g. Excel, LibreOffice Calc). There are some very famous cases of these kinds of “Excel errors” in research, e.g. genetics, economics.

In R, we can calculate the mean instantly, and it’s harder to make the sorts of errors that are common in Excel-based analysis.
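As a small illustration of the add-up-and-divide recipe described above, here is a sketch using five made-up incomes (the vector incomes and the name by_hand are assumptions for illustration, not part of cpsdata):

```r
# Five made-up incomes (illustration only)
incomes <- c(20000, 30000, 40000, 50000, 60000)

# Add up all the incomes, then divide by how many there are
by_hand <- sum(incomes) / length(incomes)
by_hand          # 40000

# R's built-in mean() gives the same answer
mean(incomes)    # 40000
```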

To calculate mean income in R, add the following comment and command to your script and run it:

# Calculate mean income
cpsdata %>% summarise(mean(income))

# A tibble: 1 × 1
  `mean(income)`
           <dbl>
1         87293.

Your output will tell you the mean income in this sample – it’s the last number on the bottom right, and it’s approximately $87,000.

If you’re happy with the output you’ve got, move on to the next section. If you would like a more detailed explanation of this output, see more on tibbles.

If you get an error here, please see common errors.

RECORD YOUR ANSWER - Type the exact mean income into your Data Exercise template (Word document available from the DLE).

Explanation of command

This command has three components:

The bit on the left, cpsdata, is our data frame, which we loaded and named earlier.
The bit in the middle, %>%, is called a pipe. Its job is to send data from one part of your command to another. It is typed by pressing % then > then %, without spaces. So cpsdata %>% sends our data frame to the next part of our command.
The bit on the right, summarise(mean(income)) is itself made up of parts. The command summarise does as the name might suggest, it summarises a set of data (cpsdata in this case) into a single number, e.g. a mean. The mean command indicates that the type of summary we want is a mean (there are also a number of other types of summary, as we’ll see later). Finally, income is the name of the column of cpsdata we want to take the mean of – in this case, the income of each individual.
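One way to see what the pipe is doing is to write the same command with and without it. This is a sketch on a made-up data frame called toy (an assumption for illustration); the names piped and unpiped are also made up.

```r
library(tidyverse)

# Made-up data frame for illustration only
toy <- tibble(income = c(20000, 30000, 100000))

# With the pipe: send toy to summarise()
piped <- toy %>% summarise(mean(income))

# Without the pipe: give toy to summarise() as its first input
unpiped <- summarise(toy, mean(income))

piped
unpiped

# Optionally, give the summary column a shorter name
toy %>% summarise(mean_income = mean(income))
```

Both versions produce the same one-row table. The piped form tends to be easier to read once a command has several steps.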

Calculating a median

Now we’re going to calculate the median income of the people in this sample. As you learned in school, you calculate a median by putting all the numbers into rank order and then picking the number in the middle. As with the calculation of mean income, R allows us to calculate the median quickly and without error.
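The rank-and-pick-the-middle recipe can be sketched by hand on five made-up incomes (the names incomes, ranked and middle are assumptions for illustration). This is not the command you will need for the exercise below; it only shows what a median is.

```r
# Five made-up incomes (illustration only)
incomes <- c(52000, 18000, 250000, 30000, 41000)

# Put the numbers into rank order
ranked <- sort(incomes)
ranked

# Pick the number in the middle (the 3rd of 5)
middle <- ranked[(length(ranked) + 1) / 2]
middle    # 41000
```

With an even number of values there is no single middle number; the usual convention is to take the average of the two middle values.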

Add this comment to your script:

# Calculate median income

This time, I haven’t given you the command you need to type – your task is to work out what you need to type. Re-read the explanation above for clues if you need them. The way to indicate that the summary you want is a median is to use the command median.

If you’ve entered the correct command, you’ll get this answer:

# A tibble: 1 × 1
  `median(income)`
             <dbl>
1           56952.

RECORD YOUR ANSWER - Type the exact command you used to calculate median income into your Data Exercise template.

Dealing with missing data

To calculate the mean number of hours worked per week, we have to deal with the fact that there is some missing data - we don’t know for all 10,000 people how many hours they work in a week, because they didn’t all tell us. To get a mean of those who did tell us, we tell R to ignore the missing data, like this:

# Calculate mean hours worked
cpsdata %>% summarise(mean(hours, na.rm = TRUE))

# A tibble: 1 × 1
  `mean(hours, na.rm = TRUE)`
                        <dbl>
1                        38.9

Explanation

rm is short for ‘remove’, but ‘ignore’ would be a more accurate description, as this command doesn’t delete the NA entries in cpsdata, it just ignores them. So na.rm = TRUE means “ignore the missing data”.

If you get an error here, please see common errors.
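To see what na.rm = TRUE does on a small scale, here is a sketch with four made-up values, one of them missing (the vector hours is an assumption for illustration, not the hours column of cpsdata):

```r
# Made-up hours for four people; one did not answer (illustration only)
hours <- c(40, NA, 35, 45)

mean(hours)                  # NA: R cannot average an unknown value
mean(hours, na.rm = TRUE)    # 40: the mean of 40, 35 and 45

# The missing value is ignored, not deleted
hours
```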

Introduction to graphs

In the last two exercises, we found that median US income was much lower than mean US income. To help us explore why that might be, we’re going to look at the distribution of incomes. Again, we’re going to use a concept you learned in school – we’re going to produce a histogram.

A histogram is a graph that shows us how many people have an income that is within a number of equal, consecutive, ranges (also called bins). In this example, we’re going to count how many people earn $0-19,999, how many earn $20,000-39,999, and so on, until we reach the highest income in the sample. So, our bin width will be $20,000. We’re then going to represent these counts by the height of a series of bars.
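The counting step described above can be sketched by hand on six made-up incomes. The names incomes, bin_start and counts are assumptions for illustration, and this is not how the graphing command works internally; it just shows the idea of sorting values into $20,000-wide bins.

```r
# Six made-up incomes (illustration only)
incomes <- c(5000, 15000, 25000, 35000, 45000, 90000)

# Work out the lower edge of the $20,000-wide bin each income falls into
bin_start <- floor(incomes / 20000) * 20000
bin_start

# Count how many incomes fall into each bin
counts <- table(bin_start)
counts
```

Bins that contain nobody (here, $60,000-79,999) do not appear in this table; on a histogram they show up as a gap between bars.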

In R, we can do this with a single command:

# Plot histogram of incomes
cpsdata %>% ggplot(aes(income)) + geom_histogram(binwidth=20000)

Explanation of command

The first part, cpsdata %>% works the same way as in the previous examples. It takes the data from cpsdata and pipes it to our graphing command ggplot.
The command ggplot needs to know which of the columns you want to show on your graph, and we use the command aes (short for “aesthetics”) to specify this. In this case, we want to plot the incomes, so the command is ggplot(aes(income))
We also have to tell R how we want to graph the income data. The command geom_histogram says that we want a histogram (geom_ in this context just means graph). We also need to specify the bin width for our histogram, using binwidth.
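One detail that may be worth knowing: by default, ggplot2 centres its bins on multiples of the bin width, so the first bar can span -$10,000 to $10,000 rather than $0 to $20,000. The sketch below, on a made-up data frame called toy (an assumption for illustration), uses the boundary argument of geom_histogram to ask for a bin edge at 0, and layer_data to check which bins were actually used.

```r
library(tidyverse)

# Made-up data frame for illustration only
toy <- tibble(income = c(5000, 15000, 25000, 35000, 45000, 90000))

# boundary = 0 asks for a bin edge at 0, so bins run 0-20000, 20000-40000, ...
p <- toy %>% ggplot(aes(income)) + geom_histogram(binwidth = 20000, boundary = 0)
p

# layer_data() shows the bins ggplot2 actually used, and the count in each
bins <- layer_data(p)
bins %>% select(xmin, xmax, count)
```

For the exercises in this worksheet, the default alignment is fine; this is only relevant if you need the bars to line up with particular ranges.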

Customising graphs

There are a lot of ways to modify the standard graphs produced by R, to make them look the way you want. Here, we’re going to try just a few of them.

Changing the theme

We can change the overall look of a graph by changing the theme. In this example, we use a lighter background by adding the command + theme_light():

# Plot histogram with a white background
cpsdata %>% ggplot(aes(income)) + geom_histogram(binwidth=20000) + theme_light()

There are quite a lot of different themes. Try replacing light above with one of the following: bw, classic, gray, linedraw, minimal, void, dark

Changing the colour

We can also change the way particular parts of the graph look – for example, changing the colour of the bars on the histogram, using the command fill. Here’s one particularly nasty-looking example:

# Plot yellow histogram on a grey background
cpsdata %>% ggplot(aes(income)) + geom_histogram(fill = 'yellow', binwidth=20000) + theme_dark()

Try replacing yellow with some other colour name. R knows quite a lot of colour names.
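If you would like to see which colour names R recognises, the following sketch lists them and checks a couple of names. The two names tested are arbitrary examples chosen for illustration.

```r
# R's built-in colour names (colours() and colors() are the same command)
length(colours())       # how many names R knows
head(colours(), 5)      # the first five names

# Check whether a particular name is recognised
"steelblue" %in% colours()    # TRUE
"banana" %in% colours()       # FALSE
```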

Changing the labels

We can also change the labels that appear on the x-axis and on the y-axis. Here’s an example with some not-very-helpful labels:

# Plot a histogram with axis labels
cpsdata %>% ggplot(aes(income)) + geom_histogram(binwidth=20000) + 
  xlab('insert an x-axis label here') + ylab('insert a y-axis label here')

Lab book exercise - Upload a graph

Enter this comment into your script:

# Plot lab-book-exercise histogram

Now, write a command that generates a histogram of these data with a bin width of $50,000. The histogram bars should be blue, and you should use theme_bw. Give your x-axis and y-axis meaningful labels. If you get it right, your graph should look something like this (without the words “EXAMPLE PLOT”, of course):

Now, export your histogram, using the Export icon on RStudio’s Plots window, and selecting “Save as image…”. Give it a meaningful file name (e.g. “ExploringIncomes”) and click ‘Save’.

RECORD YOUR ANSWER - Psyc:EL and RStudio Online are entirely separate web pages that don’t talk to each other, so you’ll need to download your graph from RStudio Online and then upload it to Psyc:EL. Here’s how:

Adding graph to your Data exercise template

Instructions on how to download a file from RStudio Online are on the Analyzing Your Project Data worksheet. Read those instructions then return to this worksheet.

Open your computer’s Downloads folder (e.g. on a Windows 10 machine, click on the little yellow folder at the bottom of the screen, and then click on “Downloads”).
Find your file in that Downloads folder, and drag it to the appropriate part of your Data Exercise template.

Writing exercise

In class, we discussed how the histogram helped us to understand why the mean and median incomes were so different, and whether the mean or the median gave a better account of average US income. Here are the slides, if you’d like to look at them again.

On your Data Exercise template, you’ll find the question “Does the mean or the median give a better indication of average salary in this case?” Write a short answer to this question. You need not write more than a few sentences, but you should explain your reasoning.

Extension exercise

Take a look at the R Graph Gallery for lots of other examples of how to make graphs in R. Make a pretty graph of some aspect of this data set that interests you.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.

Andy Wills
