Tufts OpenCourseware
  • Describe the basic methodology of conducting an RCT
  • Highlight the RCT as the gold standard of epidemiological studies
  • Introduce students to a methodology for critiquing an RCT

1. Introduction

1.1. Of Mice and First Year Medical Students

In an attempt to impress Medical School Admissions Committees, you spent your last two years in college giving cola to mice and seeing how well they performed on the mouse intelligence grid. And after all this research, The New York Journal of Mice published your abstract. In it you demonstrated that the cola mice outperformed the control mice on the mouse intelligence grid. You showed that mouse epinephrine levels were higher in the cola mice, which may have caused them to be more alert than the non-cola mice and hence do better on the grid. Big stuff!

Now as an already accepted Tufts Medical student, you don’t need to impress the Admissions Committee anymore. (You never really liked them anyway.) But you’ve got this theory that if medical students would drink a cola, they too would do great on the mouse intelligence grid.......I mean exams.

So how would you test your theory? Would you get your best friends to drink a cola and see how well they scored? Well, that probably wouldn’t help either way as it depends on who your best friends are. Or maybe you’d try to get the whole class to drink a cola and see what the average exam score was. Well, that doesn’t help either because you wouldn’t know how they would have done if they didn’t drink a cola. Or maybe you’d just ask people if they drank a cola and what their last exam score was, and compare their average to that of the non-cola drinkers. Face it. The only ones who’d want to tell you their true exam scores are the ones who aced the exam, i.e. those who went to UMASS/Amherst, root for a favorite football team and favorite baseball team, think snowboarders should not be allowed to ride lifts with skiers, etc.

1.2. Intervention Trial

After consultations with the learned, intelligent and gifted faculty, you decide to do a Randomized Controlled Trial, or what some call an intervention trial. And in your context, drinking a cola might be considered a new “medical therapy” for high exam scores. Since you’re studying (experimenting) on people, you’ll need to make a presentation before the Tufts Medical Student Institutional Review Board. Without their approval, the law says you can’t do anything. Done. Approved.

The next thing you might want to do is to carefully define your research hypothesis: First Year Tufts Medical Students who drink at least one 12-ounce can of a cola within seven days prior to the epi-bio exam will have a higher average exam score than those who do not.

You might ask yourself if you really want to let all students into your trial. Maybe some students are allergic to a cola or some have medical conditions that are contraindications to drinking a cola. Maybe you have reasons not to study senior citizens, defined by you as students older than - forbid the thought – age 24. In short, you need very specific exclusion criteria. When that’s said and done, your potential eligible number of volunteers might be down to 130 students.

You put up flyers and only 50 eligible students volunteer. These 50 students must then be told what would be expected of them in the trial and the potential pros and cons of participating, a process called informed consent. Ten decide not to sign the informed consent form so you’re left with 40 students (subjects). Elements of informed consent appear in the table below.

1.2.1. Federal Requirements for Informed Consent

Federal Requirements for Informed Consent*
Introduction noting the research nature of the study
Purpose of study
Description of study procedures (identifying any that are experimental)
Duration of subject involvement
Potential risks or discomforts of participation
Potential benefits of participation
Alternatives (medical treatments or other courses of action, if any)
Confidentiality of records statement
Compensation for injury statement (for greater than minimal risk studies)
Contact persons
Statement of voluntary participation
Unforeseen risks statement (if applicable)
Reasons for involuntary termination of participation (if applicable)
Additional costs to participate (if any)
Consequences for withdrawal (adverse health/welfare effects if any)
New findings statement (to be provided if relevant)
Number of subjects (if it may have an impact on the decision to participate)
Payments (incentives and/or expense reimbursements if any)

*Adapted from Protecting Study Volunteers In Research, A Manual for Investigative Sites by Cynthia Dunn, MD and Gary Chadwick, PharmD, MPH, University of Rochester Medical Center, 1999; permission granted by CenterWatch,

Here’s the key: you want the results from your sample of 40 (n = 40) to hopefully reflect truth, that is, the result you’d obtain if everyone eligible for the study had actually participated in the study. In other words, you want to make an inference about the population from the sample result. (Much of statistics is about making an inference about a population from a sample selected from that population. Statistics is not about a baseball player's batting average, or a soccer player's attempts on goal.)

Proceeding with the trial, you’ve determined that you’d like 20 subjects to follow the cola protocol and 20 control subjects (controls) to go about their normal routine, except that they’re not allowed to drink any cola.

Your first thought might be to hand-pick 20 students for each study arm, but your enthusiasm for your research hypothesis might subconsciously lead to you selecting the smarter students for the cola arm. In that case, even if the cola arm did better than the control arm, you’d be criticized for stacking the deck against the control arm. In this situation, your intervention might have nothing to do with the result and it could therefore lead to an erroneous conclusion. When confounding – an unequal distribution of independent risk factors for the outcome – is not addressed, and there are ways to do this that you’ll learn about, the study’s conclusion might be erroneous.

The study protocol should also include the stop-study criteria. During the course of the study, a group of independent analysts called the data safety monitoring board (DSMB) periodically reviews the data according to a predefined schedule. If the data suggest that it is unethical to continue the trial because one study arm is clearly having better outcomes than the other (and “better” must be precisely defined before the study begins), then the study is terminated.

Trials can also be terminated because there is clearly no difference between trial arms. This is a futility stopping rule. For example, the “Higher versus Lower Positive End-Expiratory Pressures in Patients with ARDS” trial in the July 22, 2004 issue of the New England Journal of Medicine was stopped according to this rule. “The data and safety monitoring board stopped the trial at the second interim analysis, after 549 patients had been enrolled, on the basis of the specified futility stopping rule,” wrote the authors. “At this time it was calculated that if the study had continued to the planned maximal enrollment of 750 patients, the probability of demonstrating the superiority of the higher peep strategy was less than 1% under the alternative hypothesis based on the unadjusted mortality difference.”

Getting back to your trial, you randomize the subjects, a process by which each subject has an equal probability of being assigned to either study arm. To assure that the process is truly random, a computer randomly determines which subjects are assigned to each arm. Randomization tends to produce treatment arms that are comparable, meaning the baseline characteristics (intelligence, study habits, previous epi-bio courses, etc.) are about the same for each arm. The roll of the dice should accomplish this for you...usually.
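As a rough illustration of what "a computer randomly determines" could look like, the sketch below shuffles 40 subject labels and splits them into two arms of 20. The subject labels, the seed and the function name are all made up for this example; a real trial would follow whatever allocation procedure its protocol specifies.

```python
import random

def randomize(subject_ids, seed=None):
    """Shuffle the subjects and split them evenly into two study arms.

    Every subject has the same 50% chance of landing in either arm.
    """
    rng = random.Random(seed)      # seeded only so the example is repeatable
    shuffled = list(subject_ids)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"cola": shuffled[:half], "control": shuffled[half:]}

# 40 hypothetical consenting subjects, labelled S01..S40
subjects = [f"S{i:02d}" for i in range(1, 41)]
arms = randomize(subjects, seed=2024)

print("cola arm:   ", len(arms["cola"]), "subjects")
print("control arm:", len(arms["control"]), "subjects")
```

Shuffling and then cutting the list in half fixes the arm sizes at 20 and 20. Flipping a coin for each subject would also give equal probabilities, but the arms could end up unequal in size.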

Point: Only an RCT randomizes. With other study designs there is always a heightened concern that the process of selecting subjects has led to confounding and an erroneous conclusion. The RCT is the gold standard because it alone randomizes, a process that by its very nature should (no guarantees) lead to comparable study arms.

By the way, you really wouldn’t care if by rotten luck of randomization one study arm had subjects who ate a lot more grapes than those in the other study arm. Because there’s no evidence that eating grapes is associated with an epi-bio exam score, the outcome of interest, it doesn’t matter. Under this assumption, eating grapes is not a potential confounder.

You would, however, care if one study arm had subjects who studied epi-bio an average of three hours each day while subjects in the other study arm only studied an average of 20 minutes each day. The arm that studied more would be expected to do better on the exam. (You will learn later in the course that there are ways to fix confounding, including a process called adjustment.)

So the trial begins and along the way you might want to know if subjects are really adhering to their treatment protocol, a process called assessing compliance. An efficacy analysis would only analyze those subjects that were compliant with their assigned protocol. Alternatively, in an intention-to-treat analysis subjects are analyzed according to the study arm to which they were assigned.

Some subjects will quit the study for reasons that may or may not be related to their assigned treatment arm. These subjects are lost to follow-up. If possible, you’d like to know why each subject left the study and what their exam scores were.

After the exam you’ll calculate the average (mean) test score in each arm. Using the appropriate statistical test, you’ll make a determination as to whether or not an observed difference between the two study arms is likely or unlikely due to chance alone. (More on hypothesis testing and statistical tests later in the course.)
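The notes leave the choice of statistical test for later in the course. Purely as a sketch of the idea of "chance alone," the example below uses a permutation test, which is one possible choice and not necessarily the test the course will teach. It asks how often a random reshuffling of the arm labels produces a difference in means at least as large as the one observed. The exam scores are invented.

```python
import random
from statistics import fmean

def permutation_p_value(arm_a, arm_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in mean scores."""
    rng = random.Random(seed)
    observed = abs(fmean(arm_a) - fmean(arm_b))
    pooled = list(arm_a) + list(arm_b)
    n_a = len(arm_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign arm labels at random
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Invented exam scores, for illustration only
cola_scores = [88, 92, 85, 90, 94, 87, 91, 89, 93, 86]
control_scores = [78, 82, 75, 80, 84, 77, 81, 79, 83, 76]

p = permutation_p_value(cola_scores, control_scores)
print(f"mean difference: {fmean(cola_scores) - fmean(control_scores):.1f}")
print(f"approximate p-value: {p:.4f}")
```

A small p-value says that a difference this large would rarely arise from the shuffle alone. It says nothing about whether bias or confounding produced it.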

In the end you’ll form your conclusions, write your paper, and submit it for publication in the Annals of Tufts Medical Students. That Nobel Prize in Medicine is around the corner! And to think it all started with mice and first year medical students.

1.3. Evidence Based Learning

The above example aside, properly conducting an RCT is serious business. Depending on a trial’s results, patient care treatment decisions might be recommended. To make such decisions, physicians would like evidence, preferably evidence from RCTs. But that’s not always possible. Sometimes patients are treated based on decisions not initially supported by RCTs or even other studies. That’s the art of medicine. When possible, however, physicians would like to know the evidence supporting various patient care decisions.

1.4. Evaluating an RCT

The staff of McMaster University in Hamilton, Canada, has published methods that can be used to evaluate articles in the medical literature. The method described below is adapted from a seminar I attended at McMaster about 10 years ago. The critique can be divided into two main categories:

1.4.1. internal validity

Internal validity concerns itself with the study design, i.e. could the study design have led to an erroneous conclusion. This occurs due to errors of chance, bias and/or confounding. Critiquing the internal validity is about assessing the wicked ways of chance, bias and confounding.

1.4.2. external validity, which some authors call generalizability

External validity, on the other hand, concerns itself with the clinical usefulness of the results. This assumes, of course, that the internal validity passes muster. If the internal validity is rotten, i.e. the conclusions are probably erroneous, then there can be no meaningful external validity.

1.5. Internal Validity

1.5.1. Do the subjects likely meet the entry criteria as specified by the authors?

Example: Authors may say they intend to study patients with angina. The reader would want to know how the diagnosis of angina was made. Was it based on a physician’s opinion of the patient’s description of chest pain? If yes, then one might wonder how accurate the diagnoses were, meaning there might be a material number of subjects in the study who really don’t have angina. Or did the entry criteria include an abnormal test, such as an abnormal exercise stress test or a cardiac catheterization? In that case, there is fairly objective evidence that the subjects really had angina. This is an important consideration because the authors might conclude their results are referable to angina patients, but if they had a lot of non-angina patients in the trial, the results are really referable to a mix of angina and non-angina patients.

1.5.2. Are the study arms comparable?

This is the point where the reader must ask if there was a level playing field, i.e. was one group more likely to get the outcome not because of their intervention, but because their baseline characteristics made them more likely to get the outcome. This point forces the reader to make a determination as to whether or not there might be confounding in the study and whether or not the authors addressed it by performing an adjustment, for example.

Typically, an article will show the baseline characteristics in Table 1. The reader should make a determination as to whether or not the authors obtained all of the important baseline data on the subjects (the potential confounders) and whether or not randomization worked to equally distribute the potential confounders.

This usually requires an “eyeball test”. Look across the groups and form an opinion as to whether or not the study arms appear to be comparable. If not, look to see if the authors considered this potential confounding and then performed an adjustment. If they did, you can conclude the authors adequately addressed the potential confounding problem even if you don’t fully understand the mechanics of adjustment.

By the way, an advantage of randomization is that it not only tends to equally distribute known potential confounders, but also unknown potential confounders. Suppose three years into a study someone publishes a blockbuster article showing that some factor is important to the very outcome the investigators are assessing. Even if that occurs, it’s likely that by chance alone these initially unknown confounders were equally distributed, but you’ll never know for sure because they won’t appear in Table 1.

1.5.3. What was the loss to follow-up?

Concerns can be raised if the loss to follow-up is high, such as greater than 20%. The concern is that people might be leaving the study based on a factor related to the treatment protocol or the outcome. If this occurs, especially in one study arm more than the other(s), it can lead to erroneous conclusions. Ideally, the loss to follow-up should be low, less than 10%. However, there may be some studies in which one could not realistically expect a low loss to follow-up.
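The arithmetic is simple, but it helps to do it arm by arm. The sketch below uses the 20% and 10% figures mentioned above as rough flags. The subject counts are invented, and the "intermediate" label is only this example's shorthand for anything in between.

```python
def loss_to_follow_up(randomized, completed):
    """Percentage of randomized subjects who were lost to follow-up."""
    if randomized <= 0 or not 0 <= completed <= randomized:
        raise ValueError("need 0 <= completed <= randomized and randomized > 0")
    return 100.0 * (randomized - completed) / randomized

def flag(percent_lost):
    """Rough label based on the 10% and 20% rules of thumb in the text."""
    if percent_lost > 20:
        return "high"
    if percent_lost < 10:
        return "low"
    return "intermediate"

# (randomized, completed) per arm -- invented numbers
arm_counts = {"cola": (20, 19), "control": (20, 15)}

for arm, (randomized, completed) in arm_counts.items():
    lost = loss_to_follow_up(randomized, completed)
    print(f"{arm}: {lost:.0f}% lost to follow-up ({flag(lost)})")
```

In this made-up example the overall loss is 15%, which hides the fact that one arm lost 5% and the other 25%. That imbalance is exactly the situation the paragraph above warns about.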

1.5.4. Was compliance assessed?

For obvious reasons, the researchers want subjects to adhere to their assigned protocols. It would be unrealistic, however, to expect 100% compliance with all protocols. This would even hold for trials that are conducted on hospitalized subjects where decision trees are potentially complicated and ongoing patient care decisions are not so simple to sort out in the context of the trial.

The researchers should therefore make some attempt to assess compliance, even though this can be difficult. If the trial involves subjects taking medications as outpatients, the investigators might give a specific number of pills to the subjects and ask them to return with the pill bottle at their next visit. A simple subtraction can help the investigators determine if the proper number of pills has been taken. Of course there is still no guarantee that the patients took the pills exactly as told. Subjects might also catch on to why a pill count is being done, in which case they might show up with the correct number of pills. The notion here is that subjects are volunteers and by their very nature they may not want to fess up to not adhering to the protocol out of a desire to please the investigators.

Sometimes blood and/or urine tests can be done to determine if pills have been taken. As above, subjects might only take their pills a few days before the visit in which case the investigators might be misled about compliance. In short, assessing compliance can be difficult. Nevertheless, the reader will want evidence that the researchers made a reasonable attempt to assess compliance, and what the data showed.

As mentioned earlier in these notes, researchers must decide how they will analyze non-compliers. In an intention-to-treat analysis, subjects are analyzed according to the groups to which they were assigned, whether or not they complied. This type of analysis is intended to yield study results that might also be expected in real life, since patients seen in physician offices might be noncompliant too. In an efficacy analysis, non-compliers are not analyzed. In actuality there are many ways for researchers to crunch these data, and the reader should look for the particular details on how this was handled.
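To see how the two analyses can disagree, here is a minimal sketch using eight invented subjects. The record layout (assigned arm, compliant or not, exam score) and the function name are assumptions made for the example, not anything prescribed by a particular trial.

```python
from statistics import fmean

# (assigned_arm, compliant_with_protocol, exam_score) -- invented values
records = [
    ("cola", True, 90), ("cola", True, 86), ("cola", False, 70), ("cola", True, 88),
    ("control", True, 80), ("control", False, 92), ("control", True, 78), ("control", True, 82),
]

def arm_means(rows, compliant_only=False):
    """Mean score per assigned arm, optionally dropping non-compliers."""
    scores_by_arm = {}
    for arm, compliant, score in rows:
        if compliant_only and not compliant:
            continue  # an efficacy analysis leaves non-compliers out
        scores_by_arm.setdefault(arm, []).append(score)
    return {arm: fmean(scores) for arm, scores in scores_by_arm.items()}

intention_to_treat = arm_means(records)                  # everyone, as assigned
efficacy = arm_means(records, compliant_only=True)       # compliers only

print("intention-to-treat:", intention_to_treat)
print("efficacy:          ", efficacy)
```

With these made-up numbers the intention-to-treat means are nearly identical (83.5 vs. 83.0), while the efficacy analysis shows an 8-point gap (88.0 vs. 80.0). Which answer is more useful depends on the question being asked, which is why the reader needs to know which analysis the authors ran.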

1.5.5. Was the study blinded?

In a single blinded study the subjects do not know their study arm assignment but the investigators do know. In a double-blinded study neither the investigators nor subjects know the assignments. In an unblinded study both subjects and investigators know the study arm assignments.

Why does blinding matter? Assume, for example, that a study is assessing pain and the subject knows the medication to which she has been assigned. If the subject has a bias about the medication, a bias that could arise from such sources as an internet search or opinions of others, it could cause the subject to subconsciously overestimate or underestimate the true pain level.

An investigator too can have a bias. If investigators are themselves determining the outcome, such as the subject’s level of joint inflammation, they might subconsciously overestimate or underestimate the true level of inflammation based on this bias. Sometimes, however, it’s medically necessary for physicians to know which study arm the subject is in. In this case the investigators might have an outcome assessment committee, separate from the treating physicians, that determines the outcomes in a blinded fashion.

It’s particularly important to blind studies where the outcome is subjective, such as pain level. If the study’s outcome is death, obviously an objective outcome, it would be difficult to argue that the lack of blinding led to an overestimate or underestimate of the number of deaths.

Sometimes studies can’t be blinded, such as a trial comparing a surgical treatment to a medical treatment for some disease. Even in these circumstances the reader should make a determination as to whether the lack of blinding could have led to an erroneous study conclusion.

So a bias can occur when unblinded subjects themselves determine or have a significant role in determining the outcome of interest, such as pain level. If the subjects have a bias about the assigned treatment, this bias can cause the outcome to be erroneously classified in one direction, be it the over direction or the under direction. This can lead to differential misclassification of the outcome. Some authors call this non-random misclassification of the outcome; the two terms are synonymous. The purpose of blinding, therefore, is to avoid differential misclassifications of outcomes.

Misclassification bias can still occur even though the study is double-blinded. In this scenario it might not be clear to the investigators how to classify an outcome; the more subjective the outcome, the larger this potential problem. For example, it might seem to you that a heart attack (myocardial infarction, MI) is an objective outcome, like death. However, there are situations when it is not clear whether or not a subject has had an MI. In these situations, the outcome may be misclassified, but it is not misclassified because of a subconscious bias. It’s misclassified because the investigators aren’t sure how to classify the outcome, MI or not.

Such misclassified outcomes should be randomly misclassified, some misclassified one way and some misclassified the other way. This type of bias is therefore called random misclassification of the outcome, which some authors call non-differential misclassification of the outcome. You will learn later in the course that this type of bias tends to push the study’s observed results towards the null hypothesis.
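The pull towards the null can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that outcome assessment catches 80% of true outcomes (sensitivity) and correctly clears 90% of non-outcomes (specificity), and that these error rates are the same in both arms, which is what makes the misclassification non-differential. The true risks are invented too.

```python
def observed_risk(true_risk, sensitivity, specificity):
    """Expected proportion of an arm classified as having the outcome."""
    true_positives = sensitivity * true_risk
    false_positives = (1 - specificity) * (1 - true_risk)
    return true_positives + false_positives

# Invented true risks of the outcome in each arm
true_a, true_b = 0.30, 0.10
# Same error rates in both arms = non-differential misclassification
sens, spec = 0.80, 0.90

obs_a = observed_risk(true_a, sens, spec)
obs_b = observed_risk(true_b, sens, spec)

print(f"true relative risk:     {true_a / true_b:.2f}")
print(f"observed relative risk: {obs_a / obs_b:.2f}")
```

Here the true relative risk of 3.0 shrinks to about 1.8. The observed value sits closer to 1.0, the null, but still on the same side of it.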

(These concepts will be reviewed in more detail in Lecture 4 - Threats to Validity.)

1.5.6. Were the outcomes, as defined by the study, correctly assessed? Do you agree with the study’s definition of the outcomes?

If the outcome is MI, how is it defined? Is it symptoms alone, or symptoms plus EKG abnormalities, and if so what kind of EKG abnormalities? Are abnormal cardiac enzymes required for the diagnosis of MI, and if so in what pattern?

The more subjective the outcome the more important it is to precisely define the outcome. For example, we have a common understanding of death but we might disagree on what defines pain or a urinary tract infection.

1.5.7. If the study results were not statistically significant, did the study report the power?

If the study concluded there was no difference in outcomes between the study arms, how likely was the study to detect a true difference of some magnitude in the outcomes?

Said differently, what was the study’s power to detect a true difference in outcomes if there really was a true difference? Was the power 50%, meaning if there were a true difference of a certain magnitude between the study arms they only had a 50% chance of detecting it? Or was the power 80%? The reader’s interpretation of a negative study result should be based in part on the study’s power. As it turns out, small studies are more prone to false negative conclusions than larger studies.
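To get a feel for why small studies are prone to false negatives, the sketch below estimates power for a comparison of two means using a normal approximation with a known standard deviation. That simplification is this example's assumption, not the method the course will teach, and the 5-point true difference and 10-point standard deviation are invented.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(true_diff, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided comparison of two means.

    Assumes equal arm sizes and a known, common standard deviation.
    """
    z = NormalDist()                      # standard normal distribution
    se = sd * sqrt(2 / n_per_arm)         # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)     # about 1.96 when alpha = 0.05
    shift = abs(true_diff) / se
    # Probability that the test statistic lands in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (20, 64, 200):
    power = approx_power(true_diff=5, sd=10, n_per_arm=n)
    print(f"n = {n:3d} per arm -> power of about {power:.0%}")
```

Under these assumptions a trial with 20 subjects per arm has only about a 35% chance of detecting a true 5-point difference, while 64 per arm gives roughly 81%. A "negative" result from the smaller trial says very little.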

The lectures on statistics will review this concept in more detail.

1.5.8. Forming an Opinion about the Internal Validity

After a review of these seven points, the reader must form an overall opinion about the trial’s internal validity. If you raised concerns about some of these points, then you must decide whether these concerns are or are not a big deal, i.e. could they have led to an erroneous conclusion. Sometimes you won’t be totally sure and you’ll give the trial the benefit of the doubt.

If you conclude the internal validity is seriously flawed, you should state why and how the study’s conclusions might be erroneous. It’s not enough for the educated reader to simply say the results were probably biased because confounding or misclassification, for example, could have occurred. You should state how such confounding or misclassification could have led to an erroneous conclusion.

For example, you might conclude that because a trial was not blinded, subjects assigned to treatment A might have overestimated its pain reduction benefits. In such a case you should say that this misclassification bias likely led to an overestimation of the treatment’s true benefit. As you learn about relative risks and odds ratios later in the course, you’ll be able to frame your critique in terms of an over or underestimation of the true relative risk or odds ratio.

If the internal validity fails, there is no external validity. Assuming however that you think the internal validity is fine, then a review of external validity is warranted.

1.6. External Validity

The points to consider for external validity are not as concrete as those for internal validity. You should try to think through the various scenarios under which the results might or might not be applicable. Following are some basic considerations:

1.6.1. To whom are the study results generalizable?

The notion here is that a trial only studies a sample of subjects from a population and one must question whether the subjects are representative of the population at large. For example, people around the world have heart disease, so are the results of patients studied in the U.S. likely generalizable to everyone else in the world with heart disease? Suppose the trial only had a few patients with diabetes. Would the results be applicable to diabetics? The reader should not automatically conclude that because the trial did not have wide representation from all possible subgroups of subjects, that the results are not referable to all of these subgroups. However, if the reader believes that such a lack of representation might be important, then the reader should offer an opinion as to why the results might be different for patients in these subgroups.

It is clearly important for physicians to ask themselves if the patients in the trial are “like” the patients they treat in clinical practice. If not, the physician must decide if it is appropriate to apply the treatment to her or his own patients. An article in the August 5, 2004 edition of the New England Journal of Medicine touches on the point about generalizing results from subjects in a trial to patients in the “real world.” In “Treatment of Heart Failure with Spironolactone – Trial and Tribulation,” Drs. John McMurray and Eileen O’Meara comment on a difference in adverse events (high potassium levels and kidney dysfunction) between subjects who were in the trial and observations made in the general public subsequent to the release of the trial’s data and apparent physician practice changes. “How can we explain these findings, assuming that they reflect cause and effect, and why has the experience in clinical practice been so different from that in the clinical trial?” they write. “A number of causes are likely,” they continue. “One generic explanation is the clear difference between the patients in the RALES trial and those in the ‘real world.’ In part, these differences reflect the restrictive criteria for inclusion and exclusion that are common to all clinical trials. Inevitably, these conditions mean that physicians must extrapolate beyond trials to their own practice. When this extrapolation is appropriate and when it becomes inappropriate is a matter of educated clinical judgment.”

1.6.2. Could my patients obtain the treatment?

You might be convinced that a surgical intervention is really better than medical treatment, but is this intervention only available in a few hospitals?

1.6.3. Do the subjects in the trial resemble my patients?

There have been trials that have shown some benefits to very closely controlling blood sugar levels in patients with diabetes mellitus. Patients in such trials are likely to be highly motivated. Your question, therefore, might be how many of your patients would be as motivated to adhere to such a protocol.

1.6.4. What’s the treatment cost?

You might argue that society could not afford to pay for a treatment that you feel is too costly for the apparent benefit.

1.6.5. Are the trial results believable?

You might legitimately conclude that the trial’s internal validity was good, but you’re not sure if the conclusions should yet be believed. As you will learn later in the course, not all associations are causations. This seems like a funny concept, but trials look for associations between the study arms and the outcomes. It’s another quantum leap to assume that the association is a cause and effect relationship. (More on this cause and effect topic in Lecture 2 - Observational Studies.) One of the causality criteria, for example, is consistency with other studies. Do these results concur with other similar studies? If yes, it’s another brick in the wall in making the case for causality and not just an association.

Chapter 13 of the Gordis text, From Association to Causation: Deriving Inferences from Epidemiologic Studies, discusses this concept in detail.

1.6.6. Are the results clinically significant?

A trial may show that some small difference in treatment arms is unlikely due to chance alone (statistically significant), but does this mean that patients really benefit? For example, a very large trial might correctly conclude that antihypertensive A lowers diastolic blood pressure 1 mm Hg more than antihypertensive B. Even if that average difference of 1 mm Hg is real, does this mean that patients will get any benefit? The message here is don’t automatically equate statistical significance with clinical significance. (The concept of statistical significance will be taught in more detail later in the course.)
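The blood pressure example can be made concrete with a rough calculation. The sketch below assumes a standard deviation of 10 mm Hg for diastolic pressure and uses a normal approximation for the p-value; both are assumptions made for illustration, and the arm sizes are invented.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(mean_diff, sd, n_per_arm):
    """Approximate two-sided p-value for a difference between two means.

    Assumes equal arm sizes and a known, common standard deviation.
    """
    se = sd * sqrt(2 / n_per_arm)          # standard error of the difference
    z = abs(mean_diff) / se
    return 2 * NormalDist().cdf(-z)        # area in both tails

# Same 1 mm Hg difference, two very different trial sizes
for n in (100, 10_000):
    p = two_sided_p(mean_diff=1.0, sd=10.0, n_per_arm=n)
    print(f"n = {n:6d} per arm -> p = {p:.3g}")
```

With 100 subjects per arm the 1 mm Hg difference gives a p-value of about 0.48. With 10,000 per arm the same difference gives a p-value far below 0.001. The size of the effect, and therefore its clinical relevance, has not changed at all.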

1.6.7. Was the result a surrogate end point?

Another type of outcome to carefully consider is a so-called surrogate end point. A recent article in the Journal of the American Medical Association (JAMA) defines a surrogate end point as a laboratory measurement or a physical sign used as a substitute for a clinically meaningful end point that measures how a patient feels, functions or survives. The authors of this paper note that the use of surrogate end points may be beneficial or harmful. They note that the use of a surrogate end point may lead to rapid and appropriate dissemination of new treatments.

For example, a trial may demonstrate that regimen A increases bone density more than regimen B in postmenopausal women. What one is really interested in is not bone density per se, but the prevention of osteoporosis complications. Bone density is a surrogate end point for reducing osteoporosis complications.

The authors of this JAMA paper suggest the reader consider some points when evaluating a study with a surrogate end point, such as: Is there a strong, independent, consistent association between the surrogate end point and the clinical end point?

1.6.8. Were there differences in outcomes for sub-groups in the trial?

The abstract of the New England Journal of Medicine article of December 21, 2000, titled Phenylpropanolamine and the Risk of Hemorrhagic Stroke, reads in part: “Phenylpropanolamine is commonly found in appetite suppressants and cough or cold remedies. Case reports have linked the use of products containing phenylpropanolamine to hemorrhagic stroke, often after the first use of these products. To study the association, we designed a case-control study.” The roughly 2,000 study subjects were 55% women and 45% men.

The authors’ conclusions were: “The results suggest that phenylpropanolamine in appetite suppressants, and possibly in cough and cold remedies, is an independent risk factor for hemorrhagic stroke in women.” The authors also wrote that, “No significantly increased risk of hemorrhagic stroke was observed among men who used a cough or cold remedy that contained phenylpropanolamine. Because no male subject reported the use of appetite suppressants containing phenylpropanolamine and only two reported the first use of a product containing phenylpropanolamine, we could not determine whether men are at increased risk for hemorrhagic stroke under these conditions.”

Note that the authors reported a different risk for men vs. women who used a cough or cold remedy containing phenylpropanolamine. This is an example of interaction, also called effect modification. This is not an internal validity bias. Rather, it is an observed risk difference between different groups in the study, in this case men vs. women.

(This concept will be discussed in more detail in Lecture 4 - Threats to Validity.)

2. Ancillary Material

2.1. Readings

2.1.1. Required

  • Read Chapter 6, Assessing the Efficacy of Preventive and Therapeutic Measures: Randomized Trials, Gordis Text