Previous worksheets introduced linear regression using a single predictor variable, and multiple regression with two predictors. This worksheet builds on this foundation, by explaining how to build models with more than two predictors.
Open the rminr-data
project we used previously.
Ensure you have the latest files by asking git to “pull
”
the repository. Select the Git
tab, which is located in the
row of tabs which includes the Environment
tab. Click the
Pull
button with a downward pointing arrow. A window will
open showing the files which have been pulled from the repository. Close
the Git pull
window. The going-further
folder
should contain the file religion-preproc.csv
.
Next, create a new, empty R script and save it in the
rminr-data
folder as more-regression.R
. Put
all the commands from this worksheet into this file, and run them from
there. Save your script regularly.
We’ll use some data from an undergraduate student dissertation which explored relationships between Polish (n=93) and British (n=104) people’s religious orientation, spiritual beliefs and emotional intelligence. We start by loading the data.
Enter these comments and commands into your script, and run them:
# Clear the environment
rm(list = ls())
# Load tidyverse
library(tidyverse)
# Load data
data <- read_csv('going-further/religion-preproc.csv')
In order to follow the rest of this worksheet, you need to understand
the data you are analyzing. Open the data
data frame you
have just loaded, by clicking on it in the Environment pane of RStudio.
Here’s what each of the columns contains:
Column | Description | Values |
---|---|---|
subj | Unique anonymous participant number | 1-197 |
age | Age of participant | 18-74 |
sex | Biological sex of participant | male, female |
education | Highest education level of participant | no formal quals, GCSE or equiv, A level or equiv, degree, Technical, HNC, FD or equiv |
religious | Did the participant identify with a recognised religion? | yes, no |
nationality | Nationality of participant | British, Polish |
We need to do a little bit of preprocessing of the
education
variable. Specifically, education
is
an ordered variable – people in this sample have a reported
level of education which ranges from low (no formal quals
)
to high (degree
). R does not know what order to put these
levels in unless we tell it. The easiest way to do this is to recode the
data, as you did previously in the cleaning
up questionnaire data worksheet.
Enter these comments and commands into your script, and run them:
# Recode data
edu_map <- c('no formal quals' = 1,
'GCSE or equiv' = 2,
'A level or equiv' = 3,
'Technical, HNC, FD or equiv' = 4,
'degree' = 5)
data <- data %>% mutate(edu_sc = recode(education, !!!edu_map))
Explanation of commands: As we have done previously,
we define a set of mappings between text in the data frame
(e.g.GCSE or equiv
) and a number (for GCSEs,
2
). We then use mutate
and recode
to create a new variable called edu_sc
which gives us a
score for educational level that ranges from 1 to 5.
We now turn to the three scales that the participants completed. In the the data we loaded, the scale scores have already been calculated. In practice, you would need to calculate the scores for each scale from the raw data. This is explained in the Preprocessing scales worksheet.
Religious orientation (RO) was was measured with the amended Religious Orientation Scale (ROS). According to the ROS, people who are intrinsically religious treat religion as a spiritual end in and of itself. Those who are extrinsically religious practise religion for self-serving reasons, such as social status. Therefore, the ROS has two subscales.
Column | Description | Values |
---|---|---|
ro_i | Intrinsic relgious orientation | 1-3 |
ro_e | Extrinsic relgious orientation | 1-3 |
Spirituality was measured with the Spiritual Connection Questionnaire (SCQ). The SCQ has five subscales:
Column | Description | Values |
---|---|---|
happiness | Extent to which spirituality brings the participant happiness | 1-7 |
places | Extent to which the participant feels spiritually connected to places | 1-7 |
others | Extent to which the participant feels spiritually connected to others | 1-7 |
nature | Extent to which the participant feels spiritually connected to nature | 1-7 |
universe | Extent to which the participant feels spiritually connected to the universe | 1-7 |
In this study, emotional intelligence was treated as a trait; a personality factor relating to various aspects of emotions. This was measured using the Short Form Trait Emotional Intelligence Questionnaire (TEIQue-SF). The TEIQue-SF has four subscales: wellbeing, self-control, emotionality and sociability.
Column | Description | Values |
---|---|---|
tei | Total trait emotional intelligence score | 1-7 |
wellbeing | Wellbeing | 1-7 |
self_control | Self-control | 1-7 |
emotionality | Emotionality | 1-7 |
sociability | Sociability | 1-7 |
The linear model you built in the multiple regression worksheet worksheet used two predictors. It’s straightforward to add additional predictors to a model. We’ll start with a model which predicts intrinsic religious orientation from the demographics variables in our Polish sample.
Enter these comments and commands into your script, and run them:
# Select Polish participants; remove NA entries
polish <- data %>% filter(nationality == 'Polish')
polish <- polish %>% drop_na()
# Perform regression
oi_lm1 <- lm(ro_i ~ age + sex + edu_sc + religious, data = polish)
# Display results of regression
oi_lm1
Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious, data = polish)
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
1.675430 0.008393 0.006171 0.017407 -0.693304
Explanation of commands:
Command line 1 filters the data to only include Polish participants.
Command line 2 removes any participants for which we have missing
data (shown as NA
in the data frame). We have to do this
because the lmBF
function we’ll use later cannot cope with
missing data.
Command line 3 builds a regression model that predicts intrinsic
religious orientation, defined by ro_i ~
. The variables on
the right of the ~
are the predictors. As you can see,
these are the four demographics variables
age + sex + edu_sc + religious
.
Command line 4 prints the result of fitting the regression model.
Explanation of output: We are given the equation
of the line for the regression model. We can see from this output
that the best fitting line involves a positive relation between
age
and ro_i
, i.e. older people score higher
on intrinsic religious orientation in this sample. For binary
predictors, for example religious
, which only have two
levels in our dataset (yes
or no
for
religious
), the output indicates which way round R has
decided to set these up. In this case, it says
religiousyes
, meaning that yes
has been taken
as the positive value. The co-efficient of -0.69
is
negative, so this means that people who answer yes
to this
question scored lower on intrinsic religious orientation that those who
replied no
. Similarly, sex
was set up with
male
as the positive value (sexmale
), and the
co-efficient (.006
) is positive, so men scored higher than
women on intrinsic religious orientation in this sample.
How much of the variance in intrinsic religious orientation is explained by the demographic variables? As explained in the how good was our model? worksheet, we can assess this using the \(R^2\) statistic.
Enter these comments and commands into your script, and run them:
# Load 'broom' package, for 'glance' command
library(broom)
# Display further results of regression
glance(oi_lm1)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.306 0.273 0.509 9.46 0.00000219 4 -65.0 142. 157.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Explanation of commands: Command line 1 loads the
broom
library which provides the glance()
function. Command line 2 uses glance()
to print some
statistics for our model.
Explanation of output:
As explained in previous worksheets, the value of 0.273 in
adj.r.squared
tells us how much of the variance in
ro_i
is explained by the demographics variables. A model
containing demographics variables explains 27.3% of the variance in
intrinsic religious orientation. The other values in this output are
beyond the scope of this worksheet and are not discussed here.
The multiple regression worksheet also showed you how to calculate a Bayes factor to decide whether a regression model is any better than a simpler model with no predictors.
Enter these comments and commands into your script, and run them:
# Load BayesFactor package
library(BayesFactor, quietly = TRUE)
# Calculate BayesFactor for regression model
oi_lmbf1 <- lmBF(ro_i ~ age + sex + edu_sc + religious,
data = data.frame(polish))
# Display results
oi_lmbf1
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 8813.328 ±3.27%
Against denominator:
Intercept only
---
Bayes factor type: BFlinearModel, JZS
Explanation of commands:
Command line 1 loads the BayesFactor
package. Command
lines 2-3 build a Bayesian model using the same outcome and predictors
as above. Command line 4 displays information about the model.
Explanation of output:
The output [1] age + sex + edu_sc + religious
reminds
you of the predictors in your model. The number after the :
is a Bayes Factor for the hypothesis that your model is a better
predictor of the outcome than a simpler model with no predictors. That
is, a model which just computes the average of all outcome scores
(sometimes called the ‘Intercept only’ model). The Bayes Factor for this
hypothesis is almost 9,000, which is strong evidence that these
demographics variables help predict intrinsic religiosity.
Note: This Bayes Factor is for the whole
model (ro_i ~ age + sex + edu_sc + religious
), not for the
individual predictors that make it up (e.g. age
). So, this
analysis does not tell you that, for example, that there is evidence
that age
on its own predicts intrinsic religiosity. We’ll
return to this point towards the end of this worksheet.
One use of multiple regression is to test a sequence of related hypotheses. Groups of variables (sometimes referred to as ‘blocks’) are added to a model in steps. At each step, a comparison is made to see if the model with more variables explains more or less variance than the simpler model from the previous step.
The order in which variables are added to the model is not arbitrary. Based on previous research, variables known to predict the outcome are entered first. Variables which test new hypotheses are added in subsequent steps. Each step builds on the previous one, which is why the approach is known as ‘hierarchical regression’.
So far, we have a model which shows that demographics help to predict intrinsic religiosity. According to the definition we gave above, intrinsically religious people treat religion as a spiritual end in and of itself. If that’s true, we would expect a model that includes spirituality predictor variables to be better than one with just the demographics variables. Again, by ‘better’, we mean ‘explain more variance’ in the outcome variable.
We can test this hypothesis by extending the demographics model to include the spirituality variables measured by the SCQ.
Enter these comments and commands into your script, and run them:
# Perform second regression
oi_lm2 <- lm(ro_i ~ age + sex + edu_sc + religious +
happiness + universe + others + nature + places,
data = polish)
# Display results of regression
oi_lm2
Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious + happiness +
universe + others + nature + places, data = polish)
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
1.349620 0.003387 -0.037815 -0.046594 -0.586144
happiness universe others nature places
0.295375 0.004245 -0.043822 -0.045118 -0.020146
glance(oi_lm2)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.622 0.580 0.387 14.8 6.90e-14 9 -37.4 96.8 124.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Explanation of commands:
Command lines 1-3 build our new model. This includes the demographics
variables, and adds the spirituality variables
happiness + universe + others + nature + places
. Comman
line 4 shows the output of the model, and command line 5 gives us the
\(R^2\) value.
Explanation of output:
For this model, adj.r.squared
= 0.58, which means it
explains 58% of the variance in intrinsic religious orientation, an
increase of 30.7% over the model from step 1.
Earlier, we found that a model with demographics predictors was better than a model with no predictors. To test the hypothesis that the spirituality variables help to explain the variance in intrinsic religiosity, we can compare our second model against the first.
We’ve compared Bayesian models in a couple of previous worksheets. For example, in the multiple regression worksheet, we compared a model where the relationship between grades and study time was assumed to be different for men and women, with one where it was assumed to be the same. You were also comparing two Bayesian models when you tested for an interaction in a Bayesian ANOVA.
We can use the same approach here.
Enter these comments and commands into your script, and run them:
# Calculate Bayes Factor for second regression model
oi_lmbf2 <- lmBF(ro_i ~ age + sex + edu_sc + religious +
happiness + universe + others + nature + places,
data = data.frame(polish))
# Calculate Bayes Factor for model 2 versus model 1
oi_lmbf2 / oi_lmbf1
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places : 26971381 ±5.05%
Against denominator:
ro_i ~ age + sex + edu_sc + religious
---
Bayes factor type: BFlinearModel, JZS
Explanation of commands:
Command lines 1-3 build a Bayesian regression model with the
demographic and spirituality predictors. Command line 4 compares the
model with the demographics and spirituality variables,
oi_lmbf2
, against the model with just the demographics
variables, oi_lmbf1
. You can think of this as dividing a
more complex model by a simpler one, but remember that this is actually
a comparison of the evidence for the two models.
Explanation of output:
The Bayes Factor for this hypothesis is over 25 million, so we can confidently claim that a model containing these spirituality variables explains more variance in intrinsic religiosity, than a simpler model with just demographics variables.
In this study, the researchers also hypothesized that trait emotional intelligence and extrinsic religiosity would further explain the variance in intrinsic religiosity. These variables were all added to the third model.
Enter these commands into your script, and run them:
# Perform third regression
oi_lm3 <- lm(ro_i ~ age + sex + edu_sc + religious +
happiness + universe + others + nature + places +
wellbeing + self_control + emotionality + sociability + ro_e,
data = polish)
# Display results
oi_lm3
Call:
lm(formula = ro_i ~ age + sex + edu_sc + religious + happiness +
universe + others + nature + places + wellbeing + self_control +
emotionality + sociability + ro_e, data = polish)
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
0.386301 0.004410 0.044791 0.011721 -0.303004
happiness universe others nature places
0.219556 0.057779 -0.095376 -0.037245 0.024456
wellbeing self_control emotionality sociability ro_e
-0.002467 -0.091592 -0.004978 -0.006091 0.643421
glance(oi_lm3)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.746 0.699 0.327 15.9 3.20e-17 14 -19.3 70.6 111.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Explanation of commands:
Copmmand lines 1-4 build our third model. This contains all of the
variables in oi_lmbf2
, and adds the emotional intelligence
variables
wellbeing + self_control + emotionality + sociability
, and
ro_e
for extrinsic religiosity. Command line 5 shows the
best fitting line, and command line 6 gives us the \(R^2\) value.
Explanation of output:
For this model, adj.r.squared
= 0.7, which means it
explains 70% of the variance in intrinsic religious orientation, an
increase of 12% over the model from step 2.
Finally, we compare the Bayesian model, with the one from step 2.
Enter these comments and commands into your script, and run them:
# Calculate Bayes Factor for third model
oi_lmbf3 <- lmBF(ro_i ~ age + sex + edu_sc + religious +
happiness + universe + others + nature + places +
wellbeing + self_control + emotionality + sociability + ro_e,
data = data.frame(polish))
# Compute Bayes Factor for model 3 versus model 2
oi_lmbf3 / oi_lmbf2
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places + wellbeing + self_control + emotionality + sociability + ro_e : 3940.64 ±4.26%
Against denominator:
ro_i ~ age + sex + edu_sc + religious + happiness + universe + others + nature + places
---
Bayes factor type: BFlinearModel, JZS
Explanation of commands:
Command lines 1-4 build the Bayesian model. Command line 5 compares this new model against the previous one.
Explanation of output:
The Bayes Factor for this hypothesis is over 4,000, which is strong evidence that a model with the additional variables explains more variance than the simpler model in step 2.
All of the regression models we’ve been creating and comparing were built based on the data from our sample. However, we’re normally looking for a model that will predict outcomes for any sample. Adding variables to a model can explain more variance in a particular sample, at the expense of being able to explain data in other samples. This is known as overfitting. As a general principle, we can reduce the risk of overfitting by preferring a simpler model over a more complex one, when adding variables doesn’t increase R2 by much. Other methods, such as cross-validation, provide more precise tests of overfitting, but these are outside the scope of this worksheet.
We earlier saw that there was strong evidence (\(BF > 8000\)) for the model
ro_i ~ age + sex + edu_sc + religious
, and we noted that
this is evidence for the whole model, not for any one predictor
(e.g. age
). Nonetheless, you might sometimes be asked to
report whether there is evidence for a particular predictor within a
model individually having an effect on the outcome variable.
You can answer such questions using the same approach as above. In other
words, calculate the evidence for a model omitting that one predictor –
ro_i ~ sex + edu_sc + religious
– and compare it to the
original model.
Enter these comments and commands into your script, and run them:
# Calculate Bayes Factor for model1, minus age
oi_lmbf1_minus_age <- lmBF(ro_i ~ sex + edu_sc + religious,
data = data.frame(polish))
# Calculate Bayes Factor for age in model 1
oi_lmbf1 / oi_lmbf1_minus_age
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 1.076371 ±4.92%
Against denominator:
ro_i ~ sex + edu_sc + religious
---
Bayes factor type: BFlinearModel, JZS
In this case, the result of our analysis is inconclusive. The Bayes
Factor is close to 1, so the evidence for and against age
being an effective predictor in the model is about evenly matched. In
contrast, there is strong evidence for relgion
being an
effective predictor within the model.
Enter these comments and commands into your script, and run them:
# Calculate Bayes Factor for model 1, minus religion
oi_lmbf1_minus_rel <- lmBF(ro_i ~ age + sex + edu_sc, data = data.frame(polish))
# Calculate Bayes Factor for religion in model 1
oi_lmbf1 / oi_lmbf1_minus_rel
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 61924.17 ±3.67%
Against denominator:
ro_i ~ age + sex + edu_sc
---
Bayes factor type: BFlinearModel, JZS
Using just the data from British participants, build a model which predicts trait emotional intelligence from the demographics variables. The adjusted R2 and Bayes Factor should look like this:
Call:
lm(formula = tei ~ age + sex + edu_sc + religious, data = data.frame(british))
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
4.02501 0.02486 0.62775 -0.08031 0.17461
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.164 0.130 0.934 4.76 0.00150 4 -135. 282. 298.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious : 7.198244 ±1.14%
Against denominator:
Intercept only
---
Bayes factor type: BFlinearModel, JZS
Note: For reasons previously explained, your Bayes Factor may not be exactly the same as the one shown here. Any value between about 7 and 8 is fine.
Copy the R code you used for this exercise into PsycEL.
Build a second model which predicts trait emotional intelligence from the demographics and spirituality variables. Compare this against the model you built in the previous exercise. The adjusted R2 and Bayes Factor should look like this:
Call:
lm(formula = tei ~ age + sex + edu_sc + religious + happiness +
universe + others + nature + places, data = data.frame(british))
Coefficients:
(Intercept) age sexmale edu_sc religiousyes
4.26541 0.02580 0.62405 -0.05840 0.08868
happiness universe others nature places
-0.20410 -0.00543 0.21218 0.01457 -0.07620
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.196 0.118 0.940 2.50 0.0133 9 -133. 288. 317.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Bayes factor analysis
--------------
[1] age + sex + edu_sc + religious + happiness + universe + others + nature + places : 0.03844367 ±2.33%
Against denominator:
tei ~ age + sex + edu_sc + religious
---
Bayes factor type: BFlinearModel, JZS
Copy the R code you used for this exercise, along with appropriate comments, into PsycEL.
Write a sentence interpreting R2 and the Bayes Factor for the model comparison.
This material is distributed under a Creative Commons licence. CC-BY-SA 4.0.