- Define commonly used statistical terms for hypothesis testing
- Discuss the role of chance and assessing its role in internal validity
- Contrast clinical significance with statistical significance


# 1. Overview

## 1.1. ASA vs. Med B

The evidence is quite convincing that certain patients with coronary artery disease (CAD) have a reduced incidence of subsequent cardiac events when they take an aspirin (ASA) a day. Suppose, however, that you have good reason to believe that hypothetical Med B, which has itself already been shown to provide protective effects in CAD patients, might actually be better than ASA. After obtaining all the required approvals, you conduct a trial in which 100 CAD patients are randomly assigned to take an ASA each day and another 100 CAD patients are randomly assigned to take Med B each day. The trial will be conducted over a five-year period. The outcome of interest is myocardial infarction (MI).

Assume that all patients completed the trial. At the end of the trial, 20% of the patients assigned to the ASA arm had an MI, and 17% of the patients assigned to the Med B arm had an MI. Should you now conclude that Med B is really better than ASA?

The trial’s internal validity must be assessed, of course. Were biases introduced into the study that may have led to an erroneous conclusion? Another internal validity consideration pertains to the role of chance in obtaining these trial results, i.e. could the difference between 20% and 17% be due to chance alone and not represent any real (true) difference? That is, if the same study were repeated many times, perhaps the outcomes would tip a little bit in each direction such that, in totality, no difference between ASA and hypothetical Med B would be seen.

When evaluating the role of chance, you might think of a coin flip. If a coin is tossed the same way 50 times, and one gets 28 heads and 22 tails, does that mean that the coin is weighted to yield a head more often than a tail? Or is it possible that if you flip the coin a bunch more times, you’re likely to get pretty much half heads and half tails, suggesting the coin is not weighted to yield heads more than tails? One way to look at this question is to quantify probabilities of outcomes.
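To make the coin example concrete, here is a minimal sketch of how the probability of an outcome could be quantified. It assumes a fair coin and independent flips; the function name `prob_at_least` is made up for this illustration.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more heads in n independent flips with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance of seeing 28 or more heads in 50 flips of a fair (unweighted) coin
p_28_or_more = prob_at_least(28, 50)
print(f"P(at least 28 heads in 50 fair flips) = {p_28_or_more:.3f}")
```

Under these assumptions the result is a little under one in four, so 28 heads out of 50 is not a surprising outcome for an unweighted coin.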

## 1.2. P value

In reviewing the results of ASA vs. Med B, the statistician tells you that if there were no true difference between the protective effects of ASA and Med B, the probability of getting a 3% or more difference (20% - 17% = 3%) due to chance alone is 0.10. That mathematically derived 0.10 is the so-called P value. Statistical tests yield P values.

The statement “if there were no true difference between ASA and Med B” refers to the null hypothesis. The null hypothesis usually states there is no difference between study arms.

The null hypothesis for this trial would be: “In patients with CAD, there is no difference in the five year cumulative incidence of MIs between patients assigned to take an ASA a day and patients assigned to take hypothetical Med B a day.” Researchers also form an alternative hypothesis, or what some authors call a research hypothesis, which disagrees with the null hypothesis. If the null hypothesis states that there is no difference between ASA and Med B, the alternative hypothesis would state that there is a difference between ASA and Med B.

### 1.2.1. This alternative hypothesis can actually be defined in two ways:

- Two-Tailed Alternative Hypothesis: There is a difference between ASA and Med B re the outcome. (Perhaps ASA is better than Med B, or perhaps Med B is better than ASA. The two-tailed alternative hypothesis does not specify the direction of difference.)
- One-Tailed Alternative Hypothesis: ASA is better than Med B re the outcome.
- One-Tailed Alternative Hypothesis: Med B is better than ASA re the outcome.

Before the trial begins (a priori), the researchers must decide whether they will have a two-tailed alternative hypothesis or a one-tailed alternative hypothesis. Of course, if they choose a one-tailed alternative hypothesis, they need to specify the direction of the tail. Whether investigators choose a two-tailed or a one-tailed alternative hypothesis matters. The process of calculating the P value incorporates whether or not a two-tailed or a one-tailed alternative hypothesis has been chosen. P values will have a different value when calculated under a two-tailed alternative hypothesis rather than a one-tailed alternative hypothesis.
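A small sketch can show why the choice of tails matters. It assumes the statistical test produces a standard normal test statistic, and the value of `z` is hypothetical (it is not calculated from the ASA vs. Med B trial). The helper name `upper_tail` is also invented for illustration.

```python
from math import erfc, sqrt

def upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))

z = 1.645  # hypothetical test statistic, NOT taken from the trial

one_tailed_p = upper_tail(z)           # direction of the difference fixed a priori
two_tailed_p = 2 * upper_tail(abs(z))  # a difference in either direction counts

print(f"one-tailed P = {one_tailed_p:.3f}")
print(f"two-tailed P = {two_tailed_p:.3f}")
```

With a symmetric test statistic like this one, the two-tailed P value is twice the one-tailed P value, so the very same data can land on different sides of a cutoff depending on the hypothesis chosen before the trial began.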

Most authorities advocate using a two-tailed alternative hypothesis unless there is strong evidence that the true difference could not lie in one of the two potential directions. For example, if there is strong evidence that Med B could not be better than ASA re the outcome, it might be acceptable to state a one-tailed alternative hypothesis that ASA is better than Med B.

Returning to the P value, it’s an important concept for you to appreciate that the P value is calculated under the assumption that the null hypothesis is correct. A little history lesson on the P value follows:

R.A. Fisher first proposed the P value in the 1920s. Writing in the June 15, 1999 edition of the Annals of Internal Medicine, Dr. Steven N. Goodman of Johns Hopkins Medical School notes that, “Fisher proposed it as an informal index to be used as a measure of discrepancy between the data and the null hypothesis...Fisher suggested that it be used as part of the fluid, non-quantifiable process of drawing conclusions from observations, a process that included combining the P value in some unspecified way with background information.”

Apparently, not everyone embraced the P value Fisher proposed. “Perhaps the most powerful criticism was that it was a measure of evidence that did not take into account the size of the observed effect,” Goodman writes. “A small effect (difference in outcome...mdk) in a study with a large sample size can have the same P value as a large effect in a small study.” Goodman notes that this criticism became the foundation for investigators reporting results as confidence intervals “which represent the range of effects that are compatible with the data.” (The confidence interval is the subject of a future lecture.)

So the P value addresses probabilities, the likelihood of chance outcomes occurring under the assumption that the null hypothesis is true. The P value does not comment on how likely it is that the null hypothesis is incorrect, e.g. a P value of 0.30 does not mean there is a 70% chance that the null hypothesis is incorrect. Why is that? Notes Goodman, “....the P value is calculated on the assumption that the null hypothesis is true. It cannot, therefore, be a direct measure of the probability that the null hypothesis is false.”

## 1.3. Level of Significance (Alpha)

You may have already reasoned that simply knowing the P value does not tell you whether a result is or is not due to chance alone. In fact, chance is always a potential explanation for an observed result. It is theoretically possible to flip an evenly weighted coin the same way a million times in a row and get a million heads. The probability of that occurring is ½ to the millionth power, a probability infinitesimally close to zero. But it’s not an absolute 0% probability.
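To get a feel for just how small ½ to the millionth power is, the sketch below works with logarithms, because the number itself is far too small for ordinary floating-point arithmetic to represent. The variable names are illustrative only.

```python
from math import log10

n_flips = 1_000_000

# log10 of (1/2)^n is n * log10(1/2); this avoids computing the tiny number directly
log10_prob = n_flips * log10(0.5)

print(f"The probability is roughly 10^{log10_prob:.0f}")
```

The exponent is hugely negative, yet the probability is still greater than zero, which is the point being made above.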

What you need, therefore, is a rule, or better yet a guidepost, to help you decide if a result is likely due to chance alone, meaning it is not statistically significant, or if it is unlikely due to chance alone, meaning it is statistically significant:

- Statistically Significant = unlikely due to chance alone
- Not Statistically Significant = likely due to chance alone

This is not just a math game. If you report that the difference is likely real (statistically significant), then patients might be placed on Med B by their physicians. If you report that it is unlikely that the difference is real (is not statistically significant), then patients might not be placed on Med B. There are clinical implications to being wrong either way.

So how do you make these conclusions equipped only with a probability? It’s the ole line in the sand routine. To make a decision, you need a point such that any P value larger than (above) that point will lead to one conclusion, and a P value smaller than (below) that point will lead to the opposite conclusion. That point, or line in the sand, is most often set at 0.05, and it’s called Alpha. When reading the statistical section of an article in the medical literature, the statement often reads, “Alpha was set at 0.05.”

When alpha is set at 0.05, a P value of 0.10, for example, indicates the results of the study are not statistically significant and the conclusion (right or wrong) will be that chance is the likely explanation for the observed results. Said differently, the conclusion will be that the observed results are unlikely to represent real treatment differences.

When alpha is set at 0.05, a P value of 0.02, for example, indicates the results of the study are statistically significant and the conclusion (right or wrong) will be that chance is an unlikely explanation for the observed results. Said differently, the conclusion will be that the observed results are likely to represent real treatment differences.

Here are a few more examples to demonstrate the point when alpha is set at 0.05:

| P value | Conclusion |
| --- | --- |
| 0.001 | Statistically Significant |
| 0.60 | Not Statistically Significant |
| 0.04 | Statistically Significant |
| 0.20 | Not Statistically Significant |
| 0.02 | Statistically Significant |

Assume Alpha is now set at 0.01 and see how some conclusions (in bold) change:

| P value | Conclusion |
| --- | --- |
| 0.001 | Statistically Significant |
| 0.60 | Not Statistically Significant |
| 0.04 | **Not Statistically Significant** |
| 0.20 | Not Statistically Significant |
| 0.02 | **Not Statistically Significant** |

You will recall that the null hypothesis states that there is no true treatment difference. When the P value is less than or equal to Alpha, and the conclusion is that the result is statistically significant, the conclusion is that the null hypothesis should be rejected. When the P value is greater than Alpha, there is insufficient evidence to reject the null hypothesis.

It is possible, therefore, to make the following statements:

- When the P value is less than or equal to Alpha, the null hypothesis is rejected and the conclusion is that chance is an unlikely explanation for the observed results.
- When the P value is greater than Alpha, one fails to reject the null hypothesis and the conclusion is that chance is a likely explanation for the observed results.
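The decision rule in these two statements can be sketched in a few lines. The function name `conclusion` is an assumption made for illustration; the P values are the same ones used in the example tables for Alpha of 0.05 and 0.01.

```python
def conclusion(p_value: float, alpha: float = 0.05) -> str:
    """Apply the rule: reject the null hypothesis when P <= Alpha."""
    if p_value <= alpha:
        return "Statistically Significant"
    return "Not Statistically Significant"

p_values = [0.001, 0.60, 0.04, 0.20, 0.02]

for alpha in (0.05, 0.01):
    print(f"Alpha = {alpha}")
    for p in p_values:
        print(f"  P = {p:<5} -> {conclusion(p, alpha)}")
```

Notice that nothing about the data changes between the two runs; only the line in the sand moves.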

# 2. Type I Error

As an astute clinician, you recognize that conclusions about the role of chance are based on where Alpha is set. Even though most of the medical literature sets Alpha at 0.05, it’s a subjective dividing point, even if everyone more or less agrees with this subjective dividing point. Simply saying a priori where you will set Alpha doesn’t mean that your resulting conclusions, after the P value is obtained, will be correct. In fact, your conclusions may be wrong.

When a P value is less than or equal to Alpha, the conclusion is that chance is an unlikely explanation of the observed results. Suppose that in reality chance was the cause of the observed result. In this situation the conclusion is wrong. You concluded chance probably didn’t do it but truth is that chance did do it. It’s a false positive result; false because the conclusion is wrong, and positive because you concluded the difference was likely real. This type of mistake in conclusion is called a Type I Error, sometimes called an Alpha Error.

## 2.1. Multiple Comparisons

Hypotheses should be established a priori. But given that studies collect a lot of data concerning the subjects, investigators may be tempted to assess other hypotheses after the data are in (post hoc analyses). The investigators are presumably searching for unsuspected relationships and associations. This is often called a “fishing expedition” or a “data-dredging” exercise. At other times investigators conduct sub-group analyses pertaining to the hypothesis. For example, if Med B is better than ASA re the outcome, does the same hold true for subsets of individuals, such as those over 55 years old and those younger, or men vs. women, or smokers vs. non-smokers, or patients with CAD for more than five years vs. less than five years, etc.?

Whether an investigator is on a fishing expedition or conducting sub-group analyses, the point is that multiple statistical tests, multiple comparisons if you will, are being performed. The concern with multiple comparisons is that it increases the probability of a Type I error.

If the null hypothesis is true, then up to 5% of the time a Type I error will occur when Alpha is set at 0.05. That 0.05 is the amount of error an investigator is willing to accept if the null hypothesis is true. If one performs five statistical tests pertaining to a null hypothesis that is true, the probability that at least one of the tests will have a false positive conclusion is more than 0.05.

Given the null hypothesis is true, there is a 95% chance of not committing a Type I error on any one test. If two independent tests are done, the probability that no Type I error occurs in either test is (0.95)(0.95), about 90%, which means that there is about a 10% probability that a Type I error has occurred in at least one of the two analyses. Do five tests and the probability of committing no Type I error at all is about 77%, meaning there is about a 23% probability that a Type I error occurred at least once.
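The arithmetic above can be reproduced with a short sketch. It assumes the tests are independent of one another and that the null hypothesis is true for every test; the function name is invented for this example.

```python
def prob_at_least_one_type_i(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more false positives across n independent tests
    when the null hypothesis is true for all of them."""
    prob_no_error = (1 - alpha) ** n_tests
    return 1 - prob_no_error

for n in (1, 2, 5, 10):
    print(f"{n:>2} tests -> P(at least one Type I error) = {prob_at_least_one_type_i(n):.2f}")
```

The probability climbs steadily with every additional comparison, which is why a long list of sub-group analyses deserves a skeptical eye.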

One statistics textbook notes that, “If you torture the data long enough it will eventually confess!” That text recommends that the results of subsidiary hypotheses be presented in an exploratory manner as results needing further testing with other studies.

One way to reduce the probability of committing a Type I error when multiple comparisons are done is to divide Alpha by the number of tests and use that number as the new Alpha. So if Alpha is originally set at 0.05, and five tests are being done, the new Alpha is 0.05/5 = 0.01. So a P value from any of the five comparisons must be less than or equal to 0.01 to be considered statistically significant.
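This divide-Alpha-by-the-number-of-tests approach is commonly called the Bonferroni correction. The sketch below applies it and then checks the overall chance of at least one Type I error, assuming independent tests and a true null hypothesis throughout. The function and variable names are illustrative.

```python
def adjusted_alpha(alpha: float, n_tests: int) -> float:
    """New per-test Alpha: the original Alpha divided by the number of tests."""
    return alpha / n_tests

original_alpha = 0.05
n_tests = 5

new_alpha = adjusted_alpha(original_alpha, n_tests)

# Overall chance of one or more false positives when each test uses the new Alpha
overall_error = 1 - (1 - new_alpha) ** n_tests

print(f"new Alpha per test = {new_alpha:.3f}")
print(f"overall P(at least one Type I error) = {overall_error:.3f}")
```

Under these assumptions the overall error probability stays just under the original 0.05, at the cost of making each individual comparison harder to call statistically significant.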

# 3. Type II Error

If there’s something called a Type I Error, chances are there must be something called a Type II error. When a P value is greater than Alpha, the conclusion is that chance is a likely explanation for the observed results. Suppose truth is that chance wasn’t the cause of the observed results, i.e. there really is a true difference between the treatment arms. This is a false negative result; false because the conclusion is wrong, and negative because you concluded the difference in the observed results doesn’t represent a true difference between the treatment arms.

This type of mistake in conclusion is called a Type II Error, sometimes called a Beta Error, or simply Beta. The following statements should make sense to you:

- If Alpha is set at 0.05, and the resulting P value is 0.04, a Type II error can’t occur
- If Alpha is set at 0.05, and the resulting P value is 0.02, a Type I error can occur
- If Alpha is set at 0.05, and the resulting P value is 0.25, a Type I error can’t occur
- If Alpha is set at 0.05, and the resulting P value is 0.40, a Type II error can occur
- There is no such thing as a study result which has a 0% probability of leading to a false conclusion
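The first four statements follow one simple pattern, sketched below: the conclusion you reach determines which kind of mistake is even possible. The function name `possible_error` is made up for this example.

```python
def possible_error(p_value: float, alpha: float = 0.05) -> str:
    """Return the only type of error that the resulting conclusion could be."""
    if p_value <= alpha:
        # Null hypothesis rejected: the only possible mistake is a false positive
        return "Type I"
    # Failed to reject the null hypothesis: the only possible mistake is a false negative
    return "Type II"

for p in (0.04, 0.02, 0.25, 0.40):
    print(f"P = {p:.2f} -> only a {possible_error(p)} error can occur")
```

The function says which error could have occurred, not whether one actually did; that cannot be known from a single study.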

## 3.1. Did a Mistake In Conclusion Occur in the Study I Just Read?

The difficult question becomes whether or not an error actually occurred with a single study’s conclusion. A meta-analysis is a type of study which groups the results of different studies that have addressed the issue with the same type of study design. A meta-analysis may be particularly beneficial if the results of individual, similar studies form different conclusions. By combining all the individual study data into one big bucket as if one larger study were done instead of several smaller ones, the authors hope to form a conclusion that avoids Type I or Type II errors from the individual studies.

There are, however, many potential concerns with the process used in a meta-analysis, concerns that will be addressed in later course lectures. (Hints: Were the subjects the same in each study? Were there different inclusion and exclusion criteria for the subjects? Were the subjects followed in the same way? Were the treatment protocols exactly identical? Were the studies performed at large teaching hospitals and small community hospitals? Were the outcomes measured in exactly the same ways?)

## 3.2. Power

Studies with small sample sizes are prone to Type II errors, false negative conclusions. You, of course, want to minimize the probability of a Type II error, so you might ask the jolly statistician, “How many patients do I need?” Knowing the best answer to a question is to ask another question, the party-animal statistician will respond, “Need for what?”

That response hinges on how much comfort you want to avoid a Type II error. If there really is a difference in the treatment arms, do you want a 50% chance of avoiding a Type II error, or a 75% chance, or a 99.99% chance? It’s unfortunately a statistical fact-of-life that the more comfort you want to avoid a Type II error, the more subjects you need in your study, subjects that may not be easy to identify and recruit. It’s the classic trade-off.

It doesn’t end there. The Delta Chi statistician has another question for you: how big of a true difference do you want to detect? Are you interested in detecting a true 1% difference, or a true 5% difference, or what? Statistical fact-of-life number two is that the smaller the difference you want to detect....you guessed it....the more subjects you need. It’s another classic trade-off. The MOST CLASSIC trade-off involved Babe Ruth being traded by the Boston Red Sox to those despicable, low-life New York Yankees.

Researchers answer the latter question by deciding upon a clinically important difference that they would want to detect. For example, is it really clinically important to detect a true 1% difference between ASA and Med B, or is it 5% or whatever? It’s a subjective clinical call that balances with the reality of requiring more subjects to detect smaller differences.

On occasion, you might read an article that says something like, “To obtain a power of 80%, it was determined that 200 patients were needed to participate in the study.” This is an incomplete power sentence. It doesn’t say what difference the power is calculated to detect. We should see something like this, “To obtain an 80% power, it was determined that 200 patients were needed to detect a difference of at least 10% between the comparison groups.” More unfortunate still, sometimes an article makes no statement at all concerning power.

Returning to definitions, a study’s power is its ability to detect a true difference of at least a certain magnitude. It is always calculated under the assumption that the alternative hypothesis is true. It can be quantified as 1 minus the probability of a Type II (Beta) Error, or sometimes simply stated as 1 minus Beta. If the probability of a Type II error occurring (which can be calculated, though the method is not taught in this course) is 40%, the power is 60%. If the probability of a Type II error is 80%, the power is 20%. If the power is 30%, Beta is 70%. Again, the researchers must specify the minimum amount of true difference they want to detect.
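For the curious, here is one rough way power might be approximated for a trial comparing two proportions. This is a sketch under stated assumptions, not the method any particular study used: it relies on a normal approximation, a two-tailed test, and equal arm sizes, and it ignores the tiny chance of a significant result in the wrong direction. The true event rates of 20% and 17% are assumed purely for illustration, and the function name is invented.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p1: float, p2: float, n_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed comparison of two proportions."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # cutoff implied by Alpha
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    return std_normal.cdf(abs(p1 - p2) / se - z_crit)

# Assumed true rates: 20% vs. 17%, with 100 patients per arm
power = approx_power(0.20, 0.17, n_per_arm=100)
beta = 1 - power  # probability of a Type II error

print(f"power = {power:.2f}, Beta = {beta:.2f}")
```

Under these assumptions, 100 patients per arm gives a power well below the levels researchers usually aim for, and raising `n_per_arm` pushes the power up.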

### 3.2.1. The following statements should now make sense to you:

- Power is about avoiding Type II errors
- Power is about avoiding false negative conclusions
- Power is about getting a P value less than or equal to Alpha when there really is a true treatment difference of a certain magnitude
- Power is not about avoiding Type I errors
- Keeping all other factors (heretofore undefined) constant, the best way to increase power is to increase the sample size
- Studies with smaller sample sizes have a lower power than studies with larger sample sizes
- If Beta is 20%, the power is 80%
- If the power is 10%, Beta is 90%
- It will take more subjects to have 90% power to detect a 5% difference than it will take to have a 90% power to detect a 20% difference
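The last statement can be illustrated with a common textbook approximation for the number of subjects needed per arm when comparing two proportions. Treat it as a sketch: it assumes a two-tailed test, equal arm sizes, and a normal approximation, and the event rates (50% vs. 45%, and 50% vs. 30%) are hypothetical values chosen only to produce a 5% and a 20% difference.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, power: float = 0.90, alpha: float = 0.05) -> int:
    """Approximate subjects per arm to detect a true difference between p1 and p2."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # from the chosen Alpha
    z_beta = std_normal.inv_cdf(power)           # from the desired power (1 - Beta)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print("5% difference, 90% power: ", n_per_arm(0.50, 0.45))
print("20% difference, 90% power:", n_per_arm(0.50, 0.30))
print("5% difference, 80% power: ", n_per_arm(0.50, 0.45, power=0.80))
```

Both statistical facts-of-life show up here: shrinking the difference you want to detect, or asking for more power, drives the required sample size up.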

Researchers generally want at least an 80% power to detect the difference they think is clinically important. The study’s power should be noted in the statistical section of the study. It’s especially important for the authors to describe the study’s power whenever observed differences are not statistically significant. Why? Because the possibility of a Type II error exists, and you’d like to have that possibility quantified.

Suppose, for instance, you read an article that concludes there is no five-year survival difference between medicine A and medicine B in the treatment of lung cancer, with a P value of 0.30. Would you feel differently about this conclusion if the power were 90% vs. 10%? Of course you would. A 10% power would mean that the study only had a 10% probability of detecting a true difference of the specified magnitude, which means it had a 90% chance of not detecting a true difference of the specified magnitude. You might legitimately ask yourself why the authors would even do a study with a power of only 10%.

The following interpretations should make sense to you:

Statement in article: “The trial had an 80% power to detect a 10% difference between medication D and medication E.”

Interpretation: If there were a true difference of 10% or more between medication D and medication E, the trial would have an 80% probability of detecting that difference. (By the way, how would this difference be detected? It would be detected by obtaining a P value less than or equal to Alpha.)

Statement in article: “There was a 90% power to detect a 20% difference in five-year survival between patients placed on chemotherapy protocol A or chemotherapy protocol B.”

Interpretation: If there were a true difference of 20% or more in five-year survival between protocol A and protocol B, the trial had a 90% probability of detecting that difference. (Again, how is that true difference detected? It’s detected by obtaining a P value less than or equal to Alpha.)

You’ll recall that a P value is described under the assumption that the null hypothesis is true, i.e. “if there were no true difference between the treatment arms, the probability...” Power is described under the assumption that the null hypothesis is false, i.e. “if there were a true difference of a certain magnitude between the treatment arms...”

# 4. Hypothesis Testing

The process described above to arrive at a conclusion about the study results is called hypothesis testing, which can be detailed as follows:

- State the null hypothesis
- State the alternative hypothesis
- Set Alpha
- Choose the appropriate statistical test to analyze the study’s data
- Compare the P value resulting from the statistical test to Alpha
- State a conclusion about the role of chance in obtaining the observed results
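The steps above can be sketched end to end. Everything specific here is an assumption made for illustration: the event counts are hypothetical (they are not the ASA vs. Med B results), and a pooled two-proportion z-test with a normal approximation stands in for "the appropriate statistical test," which in practice depends on the study design and data.

```python
from math import erfc, sqrt

def two_proportion_p_value(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Two-tailed P value from a pooled two-proportion z-test."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)  # event rate if the null hypothesis is true
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return erfc(abs(z) / sqrt(2))  # equals 2 * P(Z >= |z|)

# Steps 1-2: null = no difference between arms; alternative = a difference (two-tailed)
# Step 3: set Alpha a priori
alpha = 0.05

# Step 4: run the chosen test on hypothetical data (30/100 events vs. 15/100 events)
p_value = two_proportion_p_value(30, 100, 15, 100)

# Steps 5-6: compare the P value to Alpha and state the conclusion
if p_value <= alpha:
    conclusion = "reject the null hypothesis"
else:
    conclusion = "fail to reject the null hypothesis"

print(f"P = {p_value:.3f}; with Alpha = {alpha}, {conclusion}")
```

Whichever conclusion comes out, remember that it could still be a Type I or Type II error; the procedure manages the probability of being wrong, it does not eliminate it.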