- Discuss the relationship between p-values and confidence intervals
- Review concepts of hypothesis testing, the p-value, Type I and II Errors and power using examples


# 1. Hypothesis testing

P-values are generated by a formal process called hypothesis testing. The process begins by developing a research question. For example, does the new medication, lovastatin, reduce cholesterol levels? The research question is converted into a formal scientific hypothesis, which has two parts: the null hypothesis and the alternate hypothesis. The null hypothesis states that the medication has no effect on cholesterol. In the setting of a clinical trial with treatment and placebo groups, the null hypothesis would be phrased, “Persons (i.e., a population of persons) treated with lovastatin have the same cholesterol levels as persons not treated with lovastatin.” The alternate hypothesis would be stated, “Persons treated with lovastatin have different (higher or lower) cholesterol levels than persons not treated with lovastatin.” This alternate hypothesis is stated as a 2-tailed hypothesis, which considers it possible that lovastatin has the opposite effect of that anticipated by the researchers. Statistical testing is always done as an exercise to disprove the null hypothesis – in this case, the hypothesis of no difference in cholesterol levels between persons treated with lovastatin and persons not treated with lovastatin.

Note that we are not hypothesizing about whether there is a difference in cholesterol levels between the treatment and placebo groups in our trial. Even if lovastatin had no effect at all on cholesterol, it is very unlikely that the mean cholesterol level of the “treatment” group would be exactly the same as the mean cholesterol of the placebo group. We are really asking whether the between-group difference in cholesterol we observe in the study could be explained simply by sampling error.

The concepts underlying statistical testing can seem daunting to the novice, but once you understand the concepts of the central limit theorem, the rest is really rather simple.

If there were no effect of lovastatin on cholesterol, we would expect, on average, no difference in cholesterol levels between populations of persons treated with lovastatin and populations of persons not treated – a zero difference. Thus, the central limit theorem predicts that a sampling distribution of the difference in cholesterol between patients treated with lovastatin and patients treated with placebo would be Gaussian and distributed around a mean of zero. You would be able to plot this sampling distribution of between-group differences only if you could run exactly the same clinical trial in several hundred hospitals across the US simultaneously. We would plot the average difference in cholesterol between the treated and placebo groups from the hundreds of clinical trials and see whether the average result fell on zero.
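
The hundreds-of-hospitals thought experiment can be sketched in code. This is a hypothetical simulation (the cholesterol mean of 200 mg/dL, SD of 30, and group size of 50 are made-up numbers), using only Python's standard library:

```python
import random
import statistics

random.seed(0)

def simulate_null_trial(n_per_group=50, true_mean=200, sd=30):
    """One hypothetical trial in which lovastatin has no effect:
    both groups are drawn from the same cholesterol distribution."""
    treated = [random.gauss(true_mean, sd) for _ in range(n_per_group)]
    placebo = [random.gauss(true_mean, sd) for _ in range(n_per_group)]
    return statistics.mean(treated) - statistics.mean(placebo)

# Repeat the "same" trial many times, as if run at hundreds of hospitals,
# and look at the distribution of the between-group differences.
diffs = [simulate_null_trial() for _ in range(2000)]

# Mean near 0, spread near the theoretical standard error of the
# difference, sqrt(30**2/50 + 30**2/50) = 6.
print(round(statistics.mean(diffs), 2), round(statistics.stdev(diffs), 2))
```

The histogram of `diffs` would be the (approximately Gaussian) sampling distribution centered on zero that the central limit theorem predicts.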

In theory we could do that, but in practice we are usually dealing with only one study and therefore only one between-group difference in cholesterol. (That’s pretty humbling when you realize that some of these trials can cost several million dollars.) To the novice, this one number (a difference in mean cholesterol values between two groups) might seem pretty useless, but in fact it is not at all useless. We can use our knowledge of the central limit theorem to test whether the difference in average cholesterol values between the treated and placebo groups is too big (positive or negative) to be consistent with the null hypothesis of no difference in average cholesterol. This is another way of saying that the average difference we found could not have come from a theoretical sampling distribution of differences with a mean of zero.

But how far away from zero does the difference we observe between our study sample groups have to be before we say it indicates a true difference in the effect of treatment vs. placebo on cholesterol? We use a value called “alpha” to make that judgment call. We say that if the result we obtained falls in the extreme tails of a theoretical sampling distribution with a mean of zero, we will reject the null hypothesis that there is no difference in cholesterol levels between groups. The extreme tails are usually the lowest 2.5% and highest 2.5%, which corresponds to a two-tailed alpha of 5%.

# 2. The alpha region

The investigator establishes the criterion for rejection of the null hypothesis before the study is initiated. This keeps the investigator from changing the criterion after the data have been examined. The most commonly accepted alpha criterion is 5%. If the study result (in this case, a between-group difference in mean cholesterol levels) falls in the most extreme 5% of the theoretical sampling distribution corresponding to the null hypothesis, then the null hypothesis is rejected.

# 3. The p-value

The p-value is calculated from the difference in mean cholesterol between the treatment and placebo groups. We usually use a t-test for a study of this design; the t-test turns the difference in mean cholesterol into a p-value. The p-value is the probability of obtaining, by random sampling error (or chance) alone, a result as extreme or more extreme than that obtained, if the null hypothesis were true. Using our example of a clinical trial of lovastatin, the p-value would be interpreted as the chance of obtaining a between-group difference in mean cholesterol levels as large or larger than the one observed solely through sampling error from a theoretical distribution of between-group differences with a true mean of zero (i.e., the null hypothesis).
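
As a sketch of how a between-group difference becomes a two-tailed p-value: the code below uses a large-sample normal (z) approximation rather than the exact t distribution of the t-test named above, because the normal tail area can be computed with the standard library alone. The cholesterol values in the usage example are hypothetical.

```python
import math
import statistics

def two_tailed_p(group_a, group_b):
    """Two-sample test of a difference in means. For large samples the
    t statistic is close to standard normal, so this sketch uses the
    normal tail area (a common large-sample shortcut)."""
    na, nb = len(group_a), len(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    se = math.sqrt(statistics.variance(group_a) / na +
                   statistics.variance(group_b) / nb)
    z = diff / se
    # P(|Z| >= |z|) under a null sampling distribution centered on zero
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical cholesterol values (mg/dL) for illustration only:
treated = [192, 185, 188, 179, 195, 183, 190, 186, 181, 194,
           187, 184, 191, 178, 189, 182, 193, 180, 196, 185]
placebo = [205, 198, 210, 202, 207, 199, 204, 208, 201, 206,
           203, 197, 209, 200, 211, 196, 212, 195, 208, 202]
print(two_tailed_p(treated, placebo))  # very small: far out in the tails
```

A difference this large relative to its standard error sits deep in the tails of the null sampling distribution, so the p-value is tiny.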

# 4. The statistical test

We now have a p-value calculated from our study result and a criterion for rejection of the null hypothesis. If the p-value we obtained from our study result is smaller than the alpha value, we reject the null hypothesis. If the p-value calculated from our study result is greater than or equal to the alpha value, we do not reject the null hypothesis.
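
The decision rule is simple enough to state directly in code; note the boundary case, where a p-value exactly equal to alpha does not reject:

```python
def decide(p_value, alpha=0.05):
    """Apply the pre-specified rejection rule: reject the null only
    when the p-value falls strictly below alpha (alpha is fixed
    before the study begins)."""
    return "reject null" if p_value < alpha else "fail to reject null"

print(decide(0.03))  # reject null
print(decide(0.05))  # fail to reject null (p equal to alpha does not reject)
```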

Note: The p-value is not a magic number. It only tells you how likely you were to have found a result as extreme or more extreme than the one you obtained by chance alone, if the null hypothesis were true.

The p-value does not tell you if lovastatin really does lower cholesterol.

The p-value does not tell you if lovastatin does not lower cholesterol.

The p-value does not tell you whether lovastatin is a good medication for people with high cholesterol.

The p-value does not tell you if there was bias in your study.

# 5. Type I and type II error

When the results of our statistical test lead us to a false conclusion, we have made a type I or type II error. When we falsely reject the null hypothesis, we have committed a type I error. The type I error rate corresponds to alpha, the criterion for rejection of the null. It is logical that this is the case: we know that 5% of samples will fall in the tails of the theoretical sampling distribution when the null hypothesis is true. Thus, we would expect that 5% of clinical trials of lovastatin would find a between-group difference in mean cholesterol that fell in the tails of the sampling distribution even if lovastatin had no effect on cholesterol. When we falsely fail to reject the null hypothesis, we have committed a type II error. This can happen when our sample size is too small to detect the difference in cholesterol that really exists between those treated and those not treated. This brings us to the concept of power.
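
You can check by simulation that alpha really is the type I error rate: generate many trials in which the null hypothesis is true and count how often it is (falsely) rejected. This hypothetical sketch uses a large-sample z approximation, and the means, SDs, and sample sizes are made-up:

```python
import math
import random
import statistics

random.seed(1)

def null_trial_p(n=50, mean=200, sd=30):
    """Simulate one trial in which the drug truly has no effect,
    then return the two-tailed large-sample p-value."""
    a = [random.gauss(mean, sd) for _ in range(n)]
    b = [random.gauss(mean, sd) for _ in range(n)]
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = diff / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Fraction of null trials that are falsely rejected at alpha = 0.05:
trials = [null_trial_p() for _ in range(4000)]
type_i_rate = sum(p < 0.05 for p in trials) / len(trials)
print(round(type_i_rate, 3))  # close to 0.05, as the text predicts
```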

# 6. Power

Statistical power is a good thing. Power is the probability that your study will detect a true difference when one exists. You get more power with a larger sample size in your study. In other words, with more power you are less likely to falsely fail to reject the null hypothesis (a type II error).
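
The link between sample size and power can also be shown by simulation: give the drug a true effect and see how often trials of different sizes reject the null. The 10 mg/dL effect, the SD of 30, and the sample sizes are made-up numbers, and the sketch uses a large-sample z approximation:

```python
import math
import random
import statistics

random.seed(2)

def trial_rejects(n, effect=10, mean=200, sd=30, alpha=0.05):
    """One trial in which treatment truly lowers cholesterol by `effect`;
    returns True if the null is (correctly) rejected."""
    treated = [random.gauss(mean - effect, sd) for _ in range(n)]
    placebo = [random.gauss(mean, sd) for _ in range(n)]
    diff = statistics.mean(treated) - statistics.mean(placebo)
    se = math.sqrt(statistics.variance(treated) / n +
                   statistics.variance(placebo) / n)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(diff / se) / math.sqrt(2))))
    return p < alpha

def power(n, reps=2000):
    """Estimated power: the fraction of trials that detect the true effect."""
    return sum(trial_rejects(n) for _ in range(reps)) / reps

small, large = power(20), power(80)
print(small, large)  # power grows with sample size
```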

# 7. The relation between the p-value and the confidence interval

The p-value and the confidence interval are mathematically related to one another. Simply put, when the two-sided p-value is less than 5%, the 95% confidence interval will exclude the null value, and vice versa. Therefore, the confidence interval also tells you whether the null hypothesis is rejected.

Note: The value of the “null” depends on the study design and the type of analysis done. In the example used in this lecture, the null value is zero, corresponding to zero difference in cholesterol levels between those persons treated and not treated with lovastatin. If we were calculating a relative risk or odds ratio, the null value would be 1, not zero.
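
The agreement between the 95% confidence interval and the two-sided test can be sketched as follows, using a large-sample normal approximation and hypothetical data; the interval excludes the null value of zero exactly when p < 0.05:

```python
import math
import statistics

def diff_ci_and_p(a, b, z_crit=1.96):
    """Large-sample 95% CI for the difference in means, plus the
    two-tailed p-value. Because both are built from the same standard
    error, the CI excludes 0 exactly when p < 0.05."""
    na, nb = len(a), len(b)
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / na + statistics.variance(b) / nb)
    ci = (diff - z_crit * se, diff + z_crit * se)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(diff / se) / math.sqrt(2))))
    return ci, p

# Hypothetical cholesterol values (mg/dL):
treated = [188, 192, 185, 190, 195, 187, 191, 189, 193, 186]
placebo = [205, 198, 210, 202, 207, 199, 204, 208, 201, 206]
ci, p = diff_ci_and_p(treated, placebo)
print(ci, p)  # the whole interval is below 0, and p is below 0.05
```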

# 8. SUMMARY OF HYPOTHESIS TESTING

## 8.1. Basic principles:

- Every time you see a p-value, someone tested a hypothesis
- Your data comes from a sample you select; you hope your sample represents a larger population you’d like to draw conclusions about.
- A statistic is a number (e.g., a proportion, an average, or a disease rate) you calculate from your data sample. The statistic is an attempt to describe the truth (a parameter) about a larger population using your sample.
- Statistics (for example, from two arms of an RCT) may be “statistically significant” for a few reasons:
  1. Random variation: by chance the groups’ results were different
  2. True difference: the two treatment arms really had different results
  3. Sampling error or bias: the two groups were chosen poorly and are different for reasons that have nothing to do with the treatment
- Hypothesis testing allows a researcher to decide whether a difference between numbers is more likely due to random variation (1) or a true difference (2). We can’t get around bias (3) with statistics.

## 8.2. How to test a hypothesis:

- Write a null hypothesis
- Write an alternative hypothesis: determines if your test will be one-sided or two-sided (usually two).
- Set alpha (that is, the level of probability you will accept as unlikely related to chance alone), usually=0.05 (“the line in the sand”)
- Generate a test statistic: a number you calculate to describe your sample results. There are many types of test statistics. The type of test statistic used depends on the type of data you have. More about test statistics later.
- Compare your test statistic to a known/published distribution for that type of test statistic in order to find the p-value: the probability of obtaining, by chance alone, a test statistic as extreme or more extreme than yours, given that the null hypothesis is true.
- If the p-value is less than alpha (usually 0.05), then the groups differed more than would be expected by chance alone. Reject the null hypothesis and conclude that the groups are different. If the p-value is greater than or equal to alpha, then the difference between the groups could be due to chance alone. Do not reject the null hypothesis; conclude that the difference between the groups may be due to chance alone.
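
The checklist above can be walked through end to end. The data here are hypothetical, and a large-sample z statistic stands in for the t-test that would normally be preferred with samples this small:

```python
import math
import statistics

# 1. Null hypothesis: treated and placebo have the same mean cholesterol.
# 2. Alternative (two-sided): the means differ.
alpha = 0.05                       # 3. the "line in the sand"

# Hypothetical cholesterol values (mg/dL):
treated = [183, 176, 190, 171, 188, 179, 185, 174, 181, 186]
placebo = [201, 194, 208, 189, 206, 197, 203, 192, 199, 204]

# 4. Generate a test statistic describing the sample result.
diff = statistics.mean(treated) - statistics.mean(placebo)
se = math.sqrt(statistics.variance(treated) / len(treated) +
               statistics.variance(placebo) / len(placebo))
z = diff / se

# 5. Compare it to the reference distribution to get a two-tailed p-value.
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 6. Compare the p-value to alpha and conclude.
conclusion = "reject null" if p < alpha else "cannot reject null"
print(round(p, 4), conclusion)
```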