More on regression

Introduction
Getting started
Multiple regression with more than two predictors
Hierarchical regression
Evidence for individual predictors
Exercise 1
Exercise 2

Introduction

Previous worksheets introduced linear regression using a single predictor variable, and multiple regression with two predictors. This worksheet builds on this foundation, by explaining how to build models with more than two predictors.

Getting started

Load data

Open the rminr-data project we used previously.

Ensure you have the latest files by asking git to “pull” the repository. Select the Git tab, which is located in the row of tabs which includes the Environment tab. Click the Pull button with a downward pointing arrow. A window will open showing the files which have been pulled from the repository. Close the Git pull window. The going-further folder should contain the file religion-preproc.csv.

Next, create a new, empty R script and save it in the rminr-data folder as more-regression.R. Put all the commands from this worksheet into this file, and run them from there. Save your script regularly.

We’ll use some data from an undergraduate student dissertation which explored relationships between Polish (n=93) and British (n=104) people’s religious orientation, spiritual beliefs and emotional intelligence. We start by loading the data.

Enter these comments and commands into your script, and run them:

# Clear the environment
rm(list = ls()) 
# Load tidyverse
library(tidyverse)
# Load data
data <- read_csv('going-further/religion-preproc.csv')

Inspect and preprocess data

In order to follow the rest of this worksheet, you need to understand the data you are analyzing. Open the data data frame you have just loaded, by clicking on it in the Environment pane of RStudio. Here’s what each of the columns contains:

Demographics

Column	Description	Values
subj	Unique anonymous participant number	1-197
age	Age of participant	18-74
sex	Biological sex of participant	male, female
education	Highest education level of participant	no formal quals, GCSE or equiv, A level or equiv, degree, Technical, HNC, FD or equiv
religious	Did the participant identify with a recognised religion?	yes, no
nationality	Nationality of participant	British, Polish

We need to do a little bit of preprocessing of the education variable. Specifically, education is an ordered variable – people in this sample have a reported level of education which ranges from low (no formal quals) to high (degree). R does not know what order to put these levels in unless we tell it. The easiest way to do this is to recode the data, as you did previously in the cleaning up questionnaire data worksheet.

Enter these comments and commands into your script, and run them:

# Recode data
edu_map <- c('no formal quals' = 1, 
             'GCSE or equiv' = 2, 
             'A level or equiv' = 3,
              'Technical, HNC, FD or equiv' = 4, 
             'degree' = 5)
data <- data %>% mutate(edu_sc = recode(education, !!!edu_map))

Explanation of commands: As we have done previously, we define a set of mappings between text in the data frame (e.g.GCSE or equiv) and a number (for GCSEs, 2). We then use mutate and recode to create a new variable called edu_sc which gives us a score for educational level that ranges from 1 to 5.

Questionnaire responses

We now turn to the three scales that the participants completed. In the the data we loaded, the scale scores have already been calculated. In practice, you would need to calculate the scores for each scale from the raw data. This is explained in the Preprocessing scales worksheet.

Religious Orientation

Religious orientation (RO) was was measured with the amended Religious Orientation Scale (ROS). According to the ROS, people who are intrinsically religious treat religion as a spiritual end in and of itself. Those who are extrinsically religious practise religion for self-serving reasons, such as social status. Therefore, the ROS has two subscales.

Column	Description	Values
ro_i	Intrinsic relgious orientation	1-3
ro_e	Extrinsic relgious orientation	1-3

Spirituality

Spirituality was measured with the Spiritual Connection Questionnaire (SCQ). The SCQ has five subscales:

Column	Description	Values
happiness	Extent to which spirituality brings the participant happiness	1-7
places	Extent to which the participant feels spiritually connected to places	1-7
others	Extent to which the participant feels spiritually connected to others	1-7
nature	Extent to which the participant feels spiritually connected to nature	1-7
universe	Extent to which the participant feels spiritually connected to the universe	1-7

Trait Emotional Intelligence

In this study, emotional intelligence was treated as a trait; a personality factor relating to various aspects of emotions. This was measured using the Short Form Trait Emotional Intelligence Questionnaire (TEIQue-SF). The TEIQue-SF has four subscales: wellbeing, self-control, emotionality and sociability.

Column	Description	Values
tei	Total trait emotional intelligence score	1-7
wellbeing	Wellbeing	1-7
self_control	Self-control	1-7
emotionality	Emotionality	1-7
sociability	Sociability	1-7

Multiple regression with more than two predictors

The linear model you built in the multiple regression worksheet worksheet used two predictors. It’s straightforward to add additional predictors to a model. We’ll start with a model which predicts intrinsic religious orientation from the demographics variables in our Polish sample.

Fitting the model

Enter these comments and commands into your script, and run them:

# Select Polish participants; remove NA entries
polish <- data %>% filter(nationality == 'Polish')
polish <- polish %>% drop_na()
# Perform regression
oi_lm1 <- lm(ro_i ~ age + sex + edu_sc + religious, data = polish)
# Display results of regression
oi_lm1


Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious, data = polish)

Coefficients:
 (Intercept)           age       sexmale        edu_sc  religiousyes  
    1.675430      0.008393      0.006171      0.017407     -0.693304

Explanation of commands:

Command line 1 filters the data to only include Polish participants.

Command line 2 removes any participants for which we have missing data (shown as NA in the data frame). We have to do this because the lmBF function we’ll use later cannot cope with missing data.

Command line 3 builds a regression model that predicts intrinsic religious orientation, defined by ro_i ~. The variables on the right of the ~ are the predictors. As you can see, these are the four demographics variables age + sex + edu_sc + religious.

Command line 4 prints the result of fitting the regression model.

Explanation of output: We are given the equation of the line for the regression model. We can see from this output that the best fitting line involves a positive relation between age and ro_i, i.e. older people score higher on intrinsic religious orientation in this sample. For binary predictors, for example religious, which only have two levels in our dataset (yes or no for religious), the output indicates which way round R has decided to set these up. In this case, it says religiousyes, meaning that yes has been taken as the positive value. The co-efficient of -0.69 is negative, so this means that people who answer yes to this question scored lower on intrinsic religious orientation that those who replied no. Similarly, sex was set up with male as the positive value (sexmale), and the co-efficient (.006) is positive, so men scored higher than women on intrinsic religious orientation in this sample.

Proportion of variance explained

How much of the variance in intrinsic religious orientation is explained by the demographic variables? As explained in the how good was our model? worksheet, we can assess this using the \(R^2\) statistic.

Enter these comments and commands into your script, and run them:

# Load 'broom' package, for 'glance' command
library(broom)
# Display further results of regression
glance(oi_lm1)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.306         0.273 0.509      9.46 0.00000219     4  -65.0  142.  157.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Explanation of commands: Command line 1 loads the broom library which provides the glance() function. Command line 2 uses glance() to print some statistics for our model.

Explanation of output:

As explained in previous worksheets, the value of 0.273 in adj.r.squared tells us how much of the variance in ro_i is explained by the demographics variables. A model containing demographics variables explains 27.3% of the variance in intrinsic religious orientation. The other values in this output are beyond the scope of this worksheet and are not discussed here.

Evidence for the model

The multiple regression worksheet also showed you how to calculate a Bayes factor to decide whether a regression model is any better than a simpler model with no predictors.

Enter these comments and commands into your script, and run them:

# Load BayesFactor package
library(BayesFactor, quietly = TRUE)
# Calculate BayesFactor for regression model
oi_lmbf1 <- lmBF(ro_i ~ age + sex + edu_sc + religious,
                 data = data.frame(polish))
# Display results
oi_lmbf1

Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 8813.328 ±3.27%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS

Explanation of commands:

Command line 1 loads the BayesFactor package. Command lines 2-3 build a Bayesian model using the same outcome and predictors as above. Command line 4 displays information about the model.

Explanation of output:

The output [1] age + sex + edu_sc + religious reminds you of the predictors in your model. The number after the : is a Bayes Factor for the hypothesis that your model is a better predictor of the outcome than a simpler model with no predictors. That is, a model which just computes the average of all outcome scores (sometimes called the ‘Intercept only’ model). The Bayes Factor for this hypothesis is almost 9,000, which is strong evidence that these demographics variables help predict intrinsic religiosity.

Note: This Bayes Factor is for the whole model (ro_i ~ age + sex + edu_sc + religious), not for the individual predictors that make it up (e.g. age). So, this analysis does not tell you that, for example, that there is evidence that age on its own predicts intrinsic religiosity. We’ll return to this point towards the end of this worksheet.

Hierarchical regression

One use of multiple regression is to test a sequence of related hypotheses. Groups of variables (sometimes referred to as ‘blocks’) are added to a model in steps. At each step, a comparison is made to see if the model with more variables explains more or less variance than the simpler model from the previous step.

The order in which variables are added to the model is not arbitrary. Based on previous research, variables known to predict the outcome are entered first. Variables which test new hypotheses are added in subsequent steps. Each step builds on the previous one, which is why the approach is known as ‘hierarchical regression’.

So far, we have a model which shows that demographics help to predict intrinsic religiosity. According to the definition we gave above, intrinsically religious people treat religion as a spiritual end in and of itself. If that’s true, we would expect a model that includes spirituality predictor variables to be better than one with just the demographics variables. Again, by ‘better’, we mean ‘explain more variance’ in the outcome variable.

We can test this hypothesis by extending the demographics model to include the spirituality variables measured by the SCQ.

Fit the model and report \(R^2\)

Enter these comments and commands into your script, and run them:

# Perform second regression
oi_lm2 <- lm(ro_i ~ age + sex + edu_sc + religious +
               happiness + universe + others + nature + places,
             data = polish)
# Display results of regression
oi_lm2


Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious + happiness + 
    universe + others + nature + places, data = polish)

Coefficients:
 (Intercept)           age       sexmale        edu_sc  religiousyes  
    1.349620      0.003387     -0.037815     -0.046594     -0.586144  
   happiness      universe        others        nature        places  
    0.295375      0.004245     -0.043822     -0.045118     -0.020146

glance(oi_lm2)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.622         0.580 0.387      14.8 6.90e-14     9  -37.4  96.8  124.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Explanation of commands:

Command lines 1-3 build our new model. This includes the demographics variables, and adds the spirituality variables happiness + universe + others + nature + places. Comman line 4 shows the output of the model, and command line 5 gives us the \(R^2\) value.

Explanation of output:

For this model, adj.r.squared = 0.58, which means it explains 58% of the variance in intrinsic religious orientation, an increase of 30.7% over the model from step 1.

Evidence for the new model

Earlier, we found that a model with demographics predictors was better than a model with no predictors. To test the hypothesis that the spirituality variables help to explain the variance in intrinsic religiosity, we can compare our second model against the first.

We’ve compared Bayesian models in a couple of previous worksheets. For example, in the multiple regression worksheet, we compared a model where the relationship between grades and study time was assumed to be different for men and women, with one where it was assumed to be the same. You were also comparing two Bayesian models when you tested for an interaction in a Bayesian ANOVA.

We can use the same approach here.

Enter these comments and commands into your script, and run them:

# Calculate Bayes Factor for second regression model
oi_lmbf2 <- lmBF(ro_i ~ age + sex + edu_sc + religious +
               happiness + universe + others + nature + places,
               data = data.frame(polish))
# Calculate Bayes Factor for model 2 versus model 1
oi_lmbf2 / oi_lmbf1

Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places : 26971381 ±5.05%

Against denominator:
  ro_i ~ age + sex + edu_sc + religious 
---
Bayes factor type: BFlinearModel, JZS

Explanation of commands:

Command lines 1-3 build a Bayesian regression model with the demographic and spirituality predictors. Command line 4 compares the model with the demographics and spirituality variables, oi_lmbf2, against the model with just the demographics variables, oi_lmbf1. You can think of this as dividing a more complex model by a simpler one, but remember that this is actually a comparison of the evidence for the two models.

Explanation of output:

The Bayes Factor for this hypothesis is over 25 million, so we can confidently claim that a model containing these spirituality variables explains more variance in intrinsic religiosity, than a simpler model with just demographics variables.

Adding yet more variables

In this study, the researchers also hypothesized that trait emotional intelligence and extrinsic religiosity would further explain the variance in intrinsic religiosity. These variables were all added to the third model.

Enter these commands into your script, and run them:

# Perform third regression
oi_lm3 <- lm(ro_i ~ age + sex + edu_sc + religious +
               happiness + universe + others + nature + places +
               wellbeing + self_control + emotionality + sociability + ro_e,
             data = polish)
# Display results
oi_lm3


Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious + happiness + 
    universe + others + nature + places + wellbeing + self_control + 
    emotionality + sociability + ro_e, data = polish)

Coefficients:
 (Intercept)           age       sexmale        edu_sc  religiousyes  
    0.386301      0.004410      0.044791      0.011721     -0.303004  
   happiness      universe        others        nature        places  
    0.219556      0.057779     -0.095376     -0.037245      0.024456  
   wellbeing  self_control  emotionality   sociability          ro_e  
   -0.002467     -0.091592     -0.004978     -0.006091      0.643421

glance(oi_lm3)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.746         0.699 0.327      15.9 3.20e-17    14  -19.3  70.6  111.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Explanation of commands:

Copmmand lines 1-4 build our third model. This contains all of the variables in oi_lmbf2, and adds the emotional intelligence variables wellbeing + self_control + emotionality + sociability, and ro_e for extrinsic religiosity. Command line 5 shows the best fitting line, and command line 6 gives us the \(R^2\) value.

Explanation of output:

For this model, adj.r.squared = 0.7, which means it explains 70% of the variance in intrinsic religious orientation, an increase of 12% over the model from step 2.

Finally, we compare the Bayesian model, with the one from step 2.

Enter these comments and commands into your script, and run them:

# Calculate Bayes Factor for third model
oi_lmbf3 <- lmBF(ro_i ~ age + sex + edu_sc + religious +
               happiness + universe + others + nature + places +
               wellbeing + self_control + emotionality + sociability + ro_e,
               data = data.frame(polish))
# Compute Bayes Factor for model 3 versus model 2
oi_lmbf3 / oi_lmbf2

Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places + wellbeing + self_control + emotionality +     sociability + ro_e : 3940.64 ±4.26%

Against denominator:
  ro_i ~ age + sex + edu_sc + religious + happiness + universe + others + nature + places 
---
Bayes factor type: BFlinearModel, JZS

Explanation of commands:

Command lines 1-4 build the Bayesian model. Command line 5 compares this new model against the previous one.

Explanation of output:

The Bayes Factor for this hypothesis is over 4,000, which is strong evidence that a model with the additional variables explains more variance than the simpler model in step 2.

A note about overfitting

All of the regression models we’ve been creating and comparing were built based on the data from our sample. However, we’re normally looking for a model that will predict outcomes for any sample. Adding variables to a model can explain more variance in a particular sample, at the expense of being able to explain data in other samples. This is known as overfitting. As a general principle, we can reduce the risk of overfitting by preferring a simpler model over a more complex one, when adding variables doesn’t increase R² by much. Other methods, such as cross-validation, provide more precise tests of overfitting, but these are outside the scope of this worksheet.

Evidence for individual predictors

We earlier saw that there was strong evidence (\(BF > 8000\)) for the model ro_i ~ age + sex + edu_sc + religious, and we noted that this is evidence for the whole model, not for any one predictor (e.g. age). Nonetheless, you might sometimes be asked to report whether there is evidence for a particular predictor within a model individually having an effect on the outcome variable. You can answer such questions using the same approach as above. In other words, calculate the evidence for a model omitting that one predictor – ro_i ~ sex + edu_sc + religious – and compare it to the original model.

Enter these comments and commands into your script, and run them:

# Calculate Bayes Factor for model1, minus age
oi_lmbf1_minus_age <- lmBF(ro_i ~ sex + edu_sc + religious,
                           data = data.frame(polish))
# Calculate Bayes Factor for age in model 1
oi_lmbf1 / oi_lmbf1_minus_age

Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 1.076371 ±4.92%

Against denominator:
  ro_i ~ sex + edu_sc + religious 
---
Bayes factor type: BFlinearModel, JZS

In this case, the result of our analysis is inconclusive. The Bayes Factor is close to 1, so the evidence for and against age being an effective predictor in the model is about evenly matched. In contrast, there is strong evidence for relgion being an effective predictor within the model.

Enter these comments and commands into your script, and run them:

# Calculate Bayes Factor for model 1, minus religion
oi_lmbf1_minus_rel <- lmBF(ro_i ~ age + sex + edu_sc, data = data.frame(polish))
# Calculate Bayes Factor for religion in model 1
oi_lmbf1 / oi_lmbf1_minus_rel

Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 61924.17 ±3.67%

Against denominator:
  ro_i ~ age + sex + edu_sc 
---
Bayes factor type: BFlinearModel, JZS

Exercise 1

Using just the data from British participants, build a model which predicts trait emotional intelligence from the demographics variables. The adjusted R² and Bayes Factor should look like this:


Call:
lm(formula = tei ~ age + sex + edu_sc + religious, data = data.frame(british))

Coefficients:
 (Intercept)           age       sexmale        edu_sc  religiousyes  
     4.02501       0.02486       0.62775      -0.08031       0.17461

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.164         0.130 0.934      4.76 0.00150     4  -135.  282.  298.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 7.198244 ±1.14%

Against denominator:
  Intercept only 
---
Bayes factor type: BFlinearModel, JZS

Note: For reasons previously explained, your Bayes Factor may not be exactly the same as the one shown here. Any value between about 7 and 8 is fine.

Copy the R code you used for this exercise into PsycEL.

Exercise 2

Build a second model which predicts trait emotional intelligence from the demographics and spirituality variables. Compare this against the model you built in the previous exercise. The adjusted R² and Bayes Factor should look like this:


Call:
lm(formula = tei ~ age + sex + edu_sc + religious + happiness + 
    universe + others + nature + places, data = data.frame(british))

Coefficients:
 (Intercept)           age       sexmale        edu_sc  religiousyes  
     4.26541       0.02580       0.62405      -0.05840       0.08868  
   happiness      universe        others        nature        places  
    -0.20410      -0.00543       0.21218       0.01457      -0.07620

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.196         0.118 0.940      2.50  0.0133     9  -133.  288.  317.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places : 0.03844367 ±2.33%

Against denominator:
  tei ~ age + sex + edu_sc + religious 
---
Bayes factor type: BFlinearModel, JZS

Copy the R code you used for this exercise, along with appropriate comments, into PsycEL.

Write a sentence interpreting R² and the Bayes Factor for the model comparison.

This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.

More on regression

Paul Sharpe and Andy Wills

Contents

Introduction

Getting started

Load data

Inspect and preprocess data

Demographics

Questionnaire responses

Religious Orientation

Spirituality

Trait Emotional Intelligence

Multiple regression with more than two predictors

Fitting the model

Proportion of variance explained

Evidence for the model

Hierarchical regression

Fit the model and report \(R^2\)

Evidence for the new model

Adding yet more variables

A note about overfitting

Evidence for individual predictors

Exercise 1

Exercise 2