Evaluating experiments
Andy Wills
Correlational research
Correlational research in psychology is fundamentally limited. By correlational research, I mean the approach to psychology where one discovers that two variables co-vary, and concludes from this evidence that one may cause the other. As the website ‘Spurious Correlations’ illustrates, this method of research can lead to some daft conclusions. For example:
- Increased suicide rates in the US may be caused by the increase in science spending.
- Nicolas Cage’s appearance in films may increase deaths by drowning.
- Eating cheese may increase the risk you will be strangled by your bed sheets.
Piracy and global warming (not in lecture)
Another example, not covered in lectures, is the correlation between piracy and global warming. It is demonstrably the case that average global temperatures have been rising over the last few hundred years. It is also demonstrably true that the number of cases of piracy on the high seas has dropped over the same period. There is clearly an association — a correlation — between these two variables.
Does this mean that global warming is caused by the absence of sea pirates? Put another way, would an effective way of reducing global warming be to encourage piracy? Or perhaps it means that the absence of sea pirates is caused by global warming? Put another way, would an effective way of reducing piracy be to increase our carbon emissions?
Depression and memory
The previous examples were deliberately ridiculous, but the same problematic inferences are made in much of psychology. Any study that is purely correlational cannot demonstrate cause. For example, it’s well documented that depression is associated with over-general memory. Specifically, those with a history of depression seem to be worse at recalling the specifics of episodes (what, where, when) than those without such a history.
Does this mean that depression causes memory problems? Or perhaps that memory problems cause depression? Or perhaps both causal directions are in force? Depression causes memory problems, which make it difficult to recall information about the past, which in turn leads to becoming more depressed? Or perhaps depression and memory problems are both caused by some third factor (e.g. childhood trauma)? All of these theories are to be found in the literature, but they cannot be distinguished on the basis of correlational data.
Longitudinal data
Using longitudinal data does not solve the problem. For example, there’s a famous result that the use of night lights in a baby’s room is associated with myopia in their later life. Babies with night lights are more likely to be short-sighted in later life than babies without night lights.
This feels causal, because the assumed cause (night lights) occurs earlier in time than the assumed effect (myopia). Causes must precede effects, so one can eliminate the alternative explanation that myopia in later life causes the presence of night lights in infancy. It feels like there’s only one option left — night lights cause myopia. We should ban night lights, because this would reduce myopia. Indeed, such recommendations were made on the basis of this research (Quinn et al., 1999).
What longitudinal research does not rule out is the presence of a third factor that causes both the night lights and myopia. For example, myopia has a genetic component. Thus, if your parents are myopic, there’s a good chance you will become so, too. In addition, myopic people tend to prefer more brightly-lit environments (perhaps because this compensates for the myopia somewhat). So, having myopic parents could be the cause of both the presence of night lights and the later myopia (see e.g. Gwiazda et al., 2000).
Correlation does not imply causation
In summary, correlational research — including longitudinal correlational research — cannot establish causation. It is thus fundamentally limited. The limitation is much bigger than is sometimes realised, because it is extremely unlikely that any two variables are completely unrelated. It’s a complex, inter-related world, and we’re complex, inter-related beings. Many of the correlations we observe in psychology are extremely small. For example, in the study of personality, it is not unusual for a measure such as extroversion to explain as little as 5% of the variation in another variable. This is detectably different from no association (with methods you’ll be taught about later in your degree), but it’s frankly silly to have ‘no association’ as your competing theory, because more-or-less everything is associated with more-or-less everything else to some degree.
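To make the size of such an association concrete, here is a minimal sketch (not in the lecture) using simulated data in Python; the sample size and variable names are made up for illustration. A correlation of roughly r = 0.22 explains about 5% of the variance, and with a few hundred participants it is still reliably detected as different from zero.

```python
# A minimal sketch (not in the lecture) with simulated data: a correlation of
# about r = 0.22 explains roughly 5% of the variance, yet with a few hundred
# participants it is reliably detected as different from zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500                                    # hypothetical sample size
r_true = 0.22                              # r^2 is about 0.05, i.e. ~5% of variance

extraversion = rng.normal(size=n)          # made-up predictor
other_measure = r_true * extraversion + np.sqrt(1 - r_true**2) * rng.normal(size=n)

r_obs, p = stats.pearsonr(extraversion, other_measure)
print(f"r = {r_obs:.2f}, variance explained = {r_obs**2:.1%}, p = {p:.4f}")
```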
The Experimental Method
The best way we have of establishing causation is through the Experimental Method. In its simplest form, we take two groups of people. We do different things to those two groups, and measure something. The difference in what we do is called the independent variable. What we measure is called the dependent variable.
Depression therapy example
For example, let’s say we’re testing a new cognitive therapy for depression. We take two groups of people. One group gets six weeks of the new therapy. The other group gets nothing. At the end of six weeks, we take some measure of depression (e.g. the Beck Depression Inventory, BDI). The group that gets the therapy is less depressed than the group that does not.
Such an approach has the POTENTIAL to show that the therapy CAUSES a reduction in depression. But there are a number of ways in which a conclusion of causation might be unsound.
Pre-existing differences
The first is the possibility of pre-existing differences. What if the group who received the therapy were happier from the outset than those who did not? We can address this problem in two ways: detection, or prevention.
Detection — We could (and should) have taken BDI ratings before the therapy started. If the two groups were comparable on BDI before the treatment, this rules out a pre-existing difference in depression, at least as measured by the BDI.
Prevention — We construct our groups such that we eliminate the possibility of pre-existing differences. There are two main approaches to prevention — matching, or randomisation.
Matching — Take BDI measures for everyone. Allocate people to groups in such a way that the average BDI for the two groups is identical (or at least, minimally different).
Randomisation — Allocate people to groups randomly.
Matching is technically superior, if you are reasonably confident you know what the relevant variables are, and if there are a relatively small number of them. Here, we just used BDI, but what about age? Number of previous episodes of depression? All these things might affect how much BDI varies over six weeks (with or without treatment). In practice, we often randomise. Random allocation has the advantage that all variables (measured or otherwise) will be well matched if your groups are large enough. The problem is that your groups have to be very large for this to be likely to be true.
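As a rough illustration (not in the lecture), here is a minimal sketch in Python contrasting the two allocation strategies, using made-up baseline BDI scores. The matching shown is the simplest possible version, ranking on a single variable and alternating allocation; real matching procedures are often more elaborate.

```python
# A minimal sketch (hypothetical BDI scores) contrasting randomisation with
# simple matching on a single variable.
import numpy as np

rng = np.random.default_rng(2)
bdi = rng.normal(loc=20, scale=8, size=40)        # made-up baseline BDI for 40 volunteers

# Randomisation: shuffle the participants and split the list in half.
shuffled = rng.permutation(len(bdi))
rand_therapy, rand_control = shuffled[:20], shuffled[20:]

# Matching on BDI: rank by score, then alternate allocation so the
# group means on the matched variable are nearly identical.
ranked = np.argsort(bdi)
match_therapy, match_control = ranked[0::2], ranked[1::2]

for label, a, b in [("randomised", rand_therapy, rand_control),
                    ("matched   ", match_therapy, match_control)]:
    print(label, round(bdi[a].mean(), 1), round(bdi[b].mean(), 1))
```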
Back to our therapy experiment
So, if we have very large randomised groups, no pre-treatment difference in BDI, but a post-treatment difference — can we then conclude that the treatment CAUSED the reduction in depression (as measured by BDI)?
Not quite yet. Next, we have to look long and hard for possible CONFOUNDING VARIABLES.
Attrition
Let’s start with attrition. Attrition is where some participants drop out before the end of the study. This is not in itself a critical problem, but it becomes one if the attrition rates are different between your conditions.
Returning to our depression therapy example, let’s say that 20% of people receiving the therapy decide it is not for them, and drop out of the study. We therefore do not have post-treatment BDI for these people. Our control group — who do not receive treatment — are likely to have a different attrition rate. Let’s assume in this example that the attrition rate in the control group is 0%.
Let’s further assume, and this is critical, that attrition is not random. For example, let’s assume that the more depressed you are to start with, the more likely you are to not complete the therapy.
In the example, the 20% most depressed of the therapy group drop out, and neither the therapy nor the control condition has any effect on BDI. The therapy group still looks happier post-treatment than the control group, but this is an illusion caused by differential, non-random attrition.
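Here is a minimal sketch (not in the lecture) of that scenario, with made-up numbers: neither group changes, but removing the 20% most depressed members of the therapy group makes the therapy look effective.

```python
# A minimal sketch with made-up numbers: the therapy does nothing, but
# differential, non-random attrition makes it look as if it worked.
import numpy as np

rng = np.random.default_rng(3)
therapy = rng.normal(loc=25, scale=6, size=100)   # post-treatment BDI, therapy group (no real change)
control = rng.normal(loc=25, scale=6, size=100)   # post-treatment BDI, control group

# The 20 most depressed (highest-BDI) therapy participants drop out;
# attrition in the control group is 0%, so everyone there is measured.
therapy_completers = np.sort(therapy)[:80]

print(f"therapy completers: mean BDI = {therapy_completers.mean():.1f}")
print(f"control group:      mean BDI = {control.mean():.1f}")
```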
Placebo effect
The classic example of a placebo effect is that a pill with no active ingredients, which a participant believes to be a headache remedy, reduces headache symptoms. The participant’s expectation that the treatment will be effective is sufficient to reduce symptoms.
The lesson, often not heeded in drug testing, is that in order to assess whether your drug is effective, you need to compare against a placebo control: taking the drug versus taking the placebo, rather than taking the drug versus not taking it. It is now widely believed that the effects of anti-depressant medication are almost entirely placebo. The question is harder to address, but no less important, for psychological treatments.
Returning to our therapy example, one possibility is that the therapy itself is basically ineffective. The treatment group are happier than the control group because they have the expectation that they are receiving something that will work, whilst the control group — having received nothing — do not have this expectation.
In psychological therapy, there is no equivalent to the sugar pill — no treatment that everyone would agree is entirely inert. However, one can (and should) attempt to show the new treatment works better than existing treatment (or as well as existing treatment, if the new treatment is better in some other way, e.g. cheaper).
Unfortunately, this is seldom examined. Where it has been examined, the results are that pretty much anything works better than no treatment, but treatments do not differ from each other. For example, posting people a short DIY pamphlet on cognitive behavioural therapy appears to work as well as six weeks of one-to-one sessions with highly trained therapists.
Demand characteristics (not in lecture)
Another concept related to the placebo effect is that of demand characteristics. This is just a name for the idea that participants’ responses are affected by the desire to comply with what they think the experimenters want to see.
For example, there’s an effect called evaluative conditioning. The idea is that pairing something neutral with something people already like increases their liking of the neutral item. Belief in evaluative conditioning underlies much advertising. For example, I show you a picture of a soft drink can, and follow it with pictures of beautiful smiling people. Assuming you like beautiful smiling people, the idea is that this makes you like the drink more. This can be shown experimentally — get liking ratings of the drink, pair it repeatedly with something positive, take liking ratings again. Liking ratings go up, and they do not go up in a control condition where the drink and the smiles are presented, but unpaired.
The claim often made for this kind of experiment is that evaluative conditioning increases the liking of neutral stimuli. An alternative explanation is that the participant thinks: “What’s going on here? The experimenter is showing me this coke can and then smiley faces. I think they expect me to like coke more as a result. I wouldn’t want to disappoint them, so sure, let’s give it a higher rating than I did last time”.
Experimenter Effects
Hawthorne effect (not in lecture)
Although not covered in the lecture, it’s well worth spending some time reading up about the Hawthorne effect, which is an early example of how experimenter effects can occur in surprising ways. Read the Wikipedia article and/or the relevant section of the NOBA book chapter on organizational psychology.
Therapy example (not in lecture)
We considered an experiment that compared meditation-based therapy with relaxation training, and looked at effects on the BDI. The effect is that the meditation-based therapy improves happiness more than relaxation training. Pre-existing differences have been controlled for. There’s no differential attrition. Sounds compelling?
However, in this study, the meditation-based therapy is delivered by the people who developed it. The relaxation therapy is delivered by people who have no particular investment in relaxation therapy, but have been on a one-week training course in relaxation therapy.
An alternative hypothesis is that it is not the type of therapy that matters but some combination of how much the therapist believes in the treatment, how much experience they have in delivering it, and possibly their level of generalised expertise at making people feel happier.
You could control for this (or at least get some way towards controlling for it) by ensuring the levels of expertise, ideological commitment, and years of therapeutic experience were matched between conditions.
This is seldom done. What we do know, however, is that the effectiveness of a therapy is often revised steadily downwards as more and more studies are performed. The first studies are typically performed by those with a strong belief in the effectiveness of the treatment, the later studies by researchers who are more agnostic.
Data analysis
Another form of experimenter effect is bias in data analysis. This is perhaps most acute when the dependent measure is subjective. For example, say you use diary entries as a measure of happiness. Participants write about their feelings and the experimenter rates these entries for level of happiness. If the experimenter knows which condition the participant is in, this may bias their assessment of happiness.
Although not so often talked about, the problem can still occur with entirely machine-collected measurements (e.g. reaction time). The problem occurs because (as you will increasingly see throughout your degree), data analysis involves a number of steps, all of which have the potential for bias. If the experimenter knows what condition the participants are in, this knowledge could bias their decisions in a way that favours the experimenter’s hypothesis.
For example, say my theory predicts RT is higher in condition A than condition B. I find this result if I exclude people with reaction times >3 seconds, but not if I keep everyone in, and not if I exclude everyone with reaction times of less than 100 ms. I go with the >3 second exclusion. Am I sure that decision was unaffected by the fact it’s the only decision that results in a conclusion favourable to my theory?
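Here is a minimal sketch (not in the lecture) of this kind of analytic flexibility, using simulated reaction times. The distributions and cut-offs are made up, but the point is that the apparent difference between conditions shifts depending on which exclusion rule you pick.

```python
# A minimal sketch with simulated reaction times: the apparent difference
# between conditions changes with the exclusion rule, which is why that
# choice should be fixed before seeing the data (or made blind).
import numpy as np

rng = np.random.default_rng(4)
rt_a = rng.lognormal(mean=-0.5, sigma=1.0, size=60)   # condition A RTs in seconds (made up)
rt_b = rng.lognormal(mean=-0.5, sigma=1.0, size=60)   # condition B RTs in seconds (made up)

rules = {
    "keep everyone":       lambda rt: rt,
    "exclude RT > 3 s":    lambda rt: rt[rt <= 3.0],
    "exclude RT < 100 ms": lambda rt: rt[rt >= 0.1],
}
for name, rule in rules.items():
    diff = rule(rt_a).mean() - rule(rt_b).mean()
    print(f"{name}: mean(A) - mean(B) = {diff:+.3f} s")
```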
Blind testing
There’s a relatively straightforward answer to experimenter-effect problems in data analysis — blind testing. When analysing data, make sure you do not know which group is which. This is easy to achieve — get someone else to replace the meaningful labels in your data set with meaningless ones, and ask them to withhold the mapping from you until the analysis is complete.
The central concept here is called ‘blind testing’. Single-blind testing refers to experiments where the participant does not know which condition they are in. For example, the experiment has one condition with an active drug, another with a placebo. Participants are not told which condition they are in. Double-blind testing is single-blind testing plus the experimenters do not know which condition is which until after they have completed their analysis.
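Here is a minimal sketch (not in the lecture) of the label-swapping approach to blinding an analysis. The data and labels are made up; a colleague, not the analyst, would run this, keep the mapping to themselves, and hand over only the blinded data.

```python
# A minimal sketch of blinding an analyst to group membership. Someone other
# than the analyst runs this and withholds `mapping` until analysis is done.
import random

data = [("therapy", 18), ("control", 26), ("therapy", 21), ("control", 24)]  # made-up BDI scores

groups = sorted({label for label, _ in data})
codes = ["group_A", "group_B"]
random.shuffle(codes)                       # so the analyst cannot guess which code is which
mapping = dict(zip(groups, codes))          # kept secret from the analyst

blinded = [(mapping[label], score) for label, score in data]
print(blinded)                              # what the analyst sees during analysis
```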
Pre-registration
Pre-registration is the process by which we record our hypotheses, and our analysis plan, before we analyse the data. Often, this is done in a way that can be verified by a third party. By pre-registering, we can help to ensure that we’re not fooling ourselves: changing our hypotheses or analysis techniques after we’ve seen the data, and then inadvertently presenting what we now think as what we thought all along.
In the lecture, I used a Richard Feynman quote as context. This is an endorsement of some of his views about science, not an endorsement of the man himself. Feynman was misogynistic, and may have been physically abusive to his ex-wife. You can read more here.
Order effects (not in lecture)
Everything we’ve discussed so far has made the assumption that the two groups (treatment and control) contain different participants. This is a between-subjects design. Experiments can also have within-subjects designs. Where one employs within-subjects designs, it’s critical to consider order effects.
For example, imagine an experiment that assesses whether people react more quickly to visual or auditory alarm signals. For a within-subjects design, you might do something like the study shown on the slides. From the results on the slide, it looks like people can react to auditory signals faster. But auditory signals also come later in the experiment. So perhaps people have had more practice with the general experimental set-up, and are getting faster for that reason.
So, we now try the experiment the other way around. Looked at in isolation, this second set of results could be a fatigue effect. It’s a long, boring experiment and people slow down later on because they are bored.
However, we can control for order effects properly by randomly allocating half the participants to each order. If, in this example, we then find that auditory signals are responded to faster irrespective of order, the effect cannot be explained as mere practice or fatigue.
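Here is a minimal sketch (not in the lecture) of this kind of counterbalancing, with made-up participant identifiers: half the participants are randomly assigned auditory-then-visual, and the other half visual-then-auditory.

```python
# A minimal sketch of counterbalancing order in a within-subjects design.
import random

random.seed(5)
participants = [f"p{i:02d}" for i in range(1, 21)]   # hypothetical participant IDs
random.shuffle(participants)

half = len(participants) // 2
orders = {p: ("auditory", "visual") for p in participants[:half]}
orders.update({p: ("visual", "auditory") for p in participants[half:]})

for p in sorted(orders)[:4]:                # show the first few allocations
    print(p, orders[p])
```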
Difference versus no-difference designs
Difference versus no-difference designs are common in psychology, but they are weak ways to run experiments. Avoid them if at all possible. The term refers to experiments where the alternative against which your preferred theory is tested is a null (no difference). For example, your preferred theory is that people differ in how quickly they react to auditory and visual alarm signals. Your alternative is that there is no difference. This is often called a “null” hypothesis, but perhaps more properly a “nil” hypothesis — a hypothesis of no difference.
Difference versus no-difference designs are problematic because experimental control is never perfect. For example, it is never going to be the case that modality (auditory versus visual) is literally the only thing differing between the conditions of your experiment. So, the null/nil hypothesis is almost certainly wrong, and detectably so if you test enough people. Thus the result of the experiment is known before you run it, and so there was no point in running it at all.
Having a directional test is a bit better. Your preferred theory here is that e.g. auditory is faster than visual. Your alternative is that there is no difference. It’s possible your experiment would disprove your preferred theory — if visual turns out to be faster than auditory. So, there was a point to running this study. Even better, compare theories that make opposite predictions. For example, one well-established theory predicts auditory > visual whilst another predicts visual > auditory. This is known as strong inference.