
Tufts OpenCourseware
Author: Janet E.A. Forrester, Ph.D.

1. Introduction

The purpose of this lecture is to familiarize you with correlation and regression techniques seen in the medical literature. Regression is a whole family of statistical techniques to deal with data of many types. The overall objective of this lecture is to teach you the fundamentals of regression, and to introduce you to the names and uses of the regression techniques seen in the medical literature.

2. Scatterplots

A scatterplot is a plot of points that represent the values of the two continuous variables for each individual in the data set. Scatterplots provide a visual description of the relation between two continuous variables. Extreme values of either variable are called outliers.

3. Correlation

3.1. Pearson Correlation Coefficient

The Pearson correlation coefficient is a measure of the association between two continuous variables. When you calculate the Pearson correlation coefficient, your answer is a single number that provides information on the strength of the linear association of two variables.

The correlation coefficient ranges from -1 to +1. A correlation of zero indicates no linear association.

The sign of the correlation coefficient provides important information. If the correlation coefficient is positive, high values of one variable are associated with high values of the other variable. If the correlation coefficient is negative, low values of one variable are associated with high values of the other variable. For example, diastolic blood pressure tends to rise with age: thus the two variables are positively correlated. Conversely, because heart rate tends to be lower in persons who exercise frequently, heart rate will correlate negatively with exercise frequency.

The correlation coefficient, r, reflects the strength of the linear relationship between two variables. As a rough guide, correlations from 0 to 0.25 (or 0 to -0.25) indicate little or no relationship; 0.25 to 0.50 (or -0.25 to -0.50), a fair degree of relationship; 0.50 to 0.75 (or -0.50 to -0.75), a moderate degree of relationship; and correlations over 0.75 (or under -0.75), a strong relationship.

The Pearson correlation coefficient provides the best estimate of the linear association between two variables only when the joint distribution of the variables is Gaussian. When the joint distribution is Gaussian we can correctly test hypotheses based on the data. This requirement is not met if either variable fails to follow a Gaussian distribution.

If you are the data analyst, you must always look at your data. If the distribution of either variable is not Gaussian, there are two options: 1) transform the variable so that it is normally distributed on the transformed scale, and use the transformed value as the analysis variable (the natural logarithm is often a good choice); or 2) calculate the Spearman correlation coefficient (see below).
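As a concrete illustration, the Pearson coefficient can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations. A minimal sketch in Python; the age and blood pressure values are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r: covariance of x and y over the product of their SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up age (years) and diastolic BP (mm Hg) values
age = [25, 35, 45, 55, 65]
dbp = [70, 74, 78, 80, 85]
r = pearson_r(age, dbp)  # close to +1: DBP rises with age in these data
```

In practice one would use a statistical package, but the arithmetic is no more than this.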

Hypothesis testing with the correlation coefficient. If the two variables have a Gaussian joint distribution, we can proceed with hypothesis testing:

H0: There is no linear relationship between variable A and variable B (r=0).

HA: There is a linear relationship between variable A and variable B (r≠0).

3.2. Spearman correlation coefficient

Sometimes the variables are not normally distributed, as with skewed data. It is then appropriate to use the Spearman correlation coefficient. The Spearman correlation coefficient is simply a Pearson correlation done on variables that have been ranked from highest to lowest. The calculation is done on the rank rather than the original value of the variable. Ranking effectively handles outliers. The Spearman correlation coefficient ranges from -1 to +1, and is interpreted in the same way as the Pearson coefficient: negative values indicate an inverse relationship and positive values indicate a direct relationship. Calculating a Pearson correlation coefficient on data that are non-Gaussian in distribution may lead to false conclusions.

Remember, estimates of correlation coefficients, either Pearson or Spearman, reflect the "closeness" to a linear relationship between two variables. One must look at the data in a scatterplot to be sure that the data suggest a linear (vs. non-linear) relationship between the two variables.

4. Simple Linear Regression

When a relationship between two variables exists, one can predict the value of one variable when the value of the other variable is known. The idea behind simple linear regression is to draw the best-fitting line through the points of the scatter plot to represent the relationship and then use this line for prediction. The regression line, like any other straight line, can be represented by a simple equation which should be familiar to you, y = mx +b, where y is called the dependent variable, x is the independent or explanatory variable, m is the slope and b is the y intercept, or the value of y when x = 0.

Linear regression analysis quantifies how well the data fit a particular equation. This equation is a model that the investigator develops based on subject matter knowledge. In regression analysis, the model will have a single variable to be predicted from one or more explanatory variables. There are many kinds of regression that differ based on how the variables are expressed, whether they are continuous or categorical, and whether or not the equation is linear. If one understands simple linear regression, its extensions follow readily.

The research question will often dictate which of a pair of variables is the explanatory (independent) variable and which is the predicted (dependent) variable. Exposure variables are independent and outcome variables are dependent. For example, let us say one is interested in systolic blood pressure in relation to age. In this example, the research question could be the prediction of systolic blood pressure from age. Thus, systolic blood pressure would be the predicted variable and age would be the explanatory variable.

In the simple two variable (bivariate) situation, the model statement is just the equation of a line. The line describes the linear relationship between the two variables. The magnitude of the slope describes the amount that the dependent variable changes for each unit change in the independent variable. The slope depends on the scale used. If age in the example above were measured in months, the slope of the relation with systolic blood pressure would be less than if age were expressed in years. The sign of the slope indicates the direction of the relationship. If the slope is positive, the predicted variable increases as the explanatory variable increases. If the slope is negative, the predicted variable decreases as the explanatory variable increases.

Most often you will see regression equations written as below. The population parameters are depicted with Greek letters. In real life, of course, the relationship is never perfect because we are generally using statistics from a sample to estimate parameters, and statistics have some error. Because of this error it is unlikely that the points will fall exactly on the regression line, so we include an error term in our model statement:

Y = β0 + β1X + ε,

where β0 is the population intercept and β1 is the population slope.

An alternate form of the equation uses the sample statistics (Roman letters). There is no need for an error term here because the error is implicit in the use of the statistics (as opposed to the parameters).

Y' = b0 + b1X

The estimates of the slope and intercept are derived by the method of least squares to estimate the best-fit line to the data.

Statisticians call the slope the "regression (or beta) coefficient". It tells you the unit change in the predicted variable you can expect for each unit change in the explanatory variable. The sign of the regression coefficient tells you about the direction of the relationship. For example:

Systolic blood pressure in mm Hg = constant + (slope)(age in years)

is the regression equation for the prediction of systolic blood pressure from age. A slope of +0.45 means that for each additional year of age there is a 0.45 mm Hg increase in systolic blood pressure.
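The least-squares fit described above can be sketched directly: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. The age and systolic blood pressure values below are invented for illustration:

```python
def least_squares(x, y):
    """Least-squares estimates for y = b0 + b1*x:
    slope b1 = covariance(x, y) / variance(x); intercept b0 = mean(y) - b1*mean(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return b0, b1

# Invented sample: predict systolic BP (mm Hg) from age (years)
age = [30, 40, 50, 60, 70]
sbp = [118, 121, 128, 132, 136]
b0, b1 = least_squares(age, sbp)   # b1 = 0.47 mm Hg per year for these data
pred_55 = b0 + b1 * 55             # predicted SBP for a 55-year-old
```

Once the line is fit, prediction is just substitution of an age into the equation, as in the last line.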

5. Hypothesis testing in regression:

H0: The slope of the line that relates variable X and variable Y is zero (no linear relationship or no change in Y with a change in X).

HA: The slope of the line that relates variable X and variable Y is not zero (i.e. some linear relationship).

The beta coefficient has a Gaussian sampling distribution, or a t distribution if the variance used is an estimate (vs. the population parameter). Thus, the theoretic underpinnings of the test of the null hypothesis of the slope go back to the Central Limit Theorem.

The beta coefficient is mathematically related to the correlation coefficient. If the beta coefficient in a simple linear regression is statistically significant, then the correlation coefficient will also be statistically significant. The strength of a relationship between Y and X cannot be evaluated using the beta coefficient because the slope of the line depends on the units in which X and Y are measured (for example height measured in inches vs. centimeters). The appropriate way to measure the strength of the linear association between X and Y is by using the correlation coefficient.

6. The coefficient of determination

The correlation coefficient does not have an intuitive interpretation. However, its square, the coefficient of determination, does. The coefficient of determination, r2, is the amount of variation in the dependent variable accounted for by the independent variable.

For example, if the correlation between systolic blood pressure and age is r=0.6, r2 of the simple regression model would be 0.36. We would interpret r2 to mean that 36% of the variability in systolic blood pressure among people is accounted for by differences in their age. The remaining 64%, is variability in blood pressure that is unaccounted for by age differences. Other factors, such as weight, diet and exercise habits, may account for the remaining 64% variability in blood pressure among people.

7. Assumptions underlying linear regression

For the regression equation to be valid in prediction, certain assumptions must be met:

For each value of X, the Y variable must be normally distributed, with a mean at the predicted value, Y'. The SD of Y is assumed to be the same for every value of X (an attribute known as homoscedasticity). It is sometimes necessary to transform the X variable to meet these assumptions. Statisticians conduct regression diagnostics to assure themselves that the necessary assumptions are met. If the assumptions are not met, then the estimated slope and p value may be invalid.

8. Regression and correlation compared

Correlation and regression are closely related, but there are some important differences. The correlation coefficient is independent of the units of measurement. The regression coefficients (slope and intercept) will change as units of measurement change. Furthermore, the regression of X on Y is not the same as the regression of Y on X; which variable is dependent and which is independent matters. In contrast, the correlation of Y with X is the same as the correlation of X with Y.

9. Other types of regression

9.1. Multiple regression analysis:

There are many situations that cannot be handled by simple linear regression. You may be interested in determining a set of variables that predict an outcome. Alternatively, you may wish to remove the confounding effects of a third variable, or evaluate a variable as an effect modifier. All of these situations (and some others) can be addressed by multiple regression analysis.

Multiple linear regression is just an extension of simple linear regression in which we have more than one independent variable. As an example, we might be interested in the relation of age and weight to systolic blood pressure, or, in examining the relation of age to systolic blood pressure, we may simply want to remove the confounding effects of weight. Weight can be thought of as a confounder in the relationship of age to blood pressure, since weight tends to increase with age, and blood pressure increases with weight, independent of age. (That is the operational definition of confounding.)

Written as a model:

Systolic blood pressure = constant + age + weight

Now we will be estimating two coefficients:

In equation form: Y = β0 + β1X1 + β2X2 + ε,

where X1 is age, β1 the coefficient for age, X2 is weight, β2 the coefficient for weight, and ε represents the error in the fit of the model.

The right side of the equation in a regression analysis can include nominal, ordinal and continuous variables. Multiple variables can be either additional explanatory variables or confounders. When confounding variables are introduced as independent variables, the model is said to be "adjusted" for the potentially confounding effects of these variables. As in the example above, age adjustment is accomplished by introducing age as an independent variable in a model that contains a specific exposure variable of interest and an outcome variable.

The coefficient for a given variable is interpreted in the same way as it is in simple regression. It is the amount by which the dependent variable changes for each unit change in the independent variable, while accounting for the other variables in the equation. In this example, β1 represents the increase in systolic blood pressure for each unit change in age, independent of weight (or "controlling or adjusting for weight"). Another way of thinking of this is the amount by which systolic blood pressure increases for each unit increase in age in people who have the same weight.
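For two predictors, the adjusted coefficients can be obtained by solving the normal equations on mean-centered data. A sketch with fabricated age, weight, and blood pressure values constructed so that the model fits exactly:

```python
def two_predictor_ols(x1, x2, y):
    """OLS fit of y = b0 + b1*x1 + b2*x2 via the normal equations on centered data."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [a - m1 for a in x1]
    d2 = [a - m2 for a in x2]
    dy = [a - my for a in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det   # age coefficient, adjusted for weight
    b2 = (s11 * s2y - s12 * s1y) / det   # weight coefficient, adjusted for age
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Fabricated data following SBP = 100 + 0.4*age + 0.5*weight exactly
age    = [30, 40, 50, 60]
weight = [60, 80, 70, 90]
sbp    = [142, 156, 155, 169]
b0, b1, b2 = two_predictor_ols(age, weight, sbp)  # recovers 100, 0.4, 0.5
```

With more predictors the same idea generalizes to matrix algebra, which is why statistical software is used in practice.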

There is a coefficient of determination in the multivariate context as well, R2. It describes the amount of variation in the dependent variable accounted for by the set of independent variables in the equation.

The process of building these models is not as easy as it sounds. It is necessary to examine the relationship between each of the independent variables and the dependent variable, and then examine the relations among the independent variables. If there is some relationship between independent variables, it will have an effect on the final model and its interpretation. Confounding is one example of dependency between independent variables. Effect modification is another example.

10. Other simple and multiple regression techniques common in the medical literature: Logistic Regression

In logistic regression, instead of fitting a continuous dependent variable, we fit a dichotomous dependent variable (died: yes/no). Because epidemiologists are often interested in the presence or absence of a health condition, logistic regression is widely used in medical epidemiology. Another reason for its widespread use is that the coefficients from logistic regression analysis can be interpreted as odds ratios. Logistic regression also allows control of confounding variables.

For example, one could use logistic regression to predict the probability of stroke in relation to gender while controlling for several possible independent variables such as age or hypertension. When the independent variables are dichotomous, it will be possible to estimate odds ratios. In the stroke example, it might turn out that men are 3.5 times more likely than women to have a stroke. With the other variables in the logistic equation, this result would be stated as follows: "After controlling for age and hypertension, men are 3.5 times more likely than women to suffer a stroke." If one were interested in expressing the relation between stroke and hypertension in terms of odds ratios, one could transform the continuous variable into quartiles of blood pressure and estimate odds ratios for the quartiles.
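The odds-ratio interpretation follows from the fact that a logistic regression coefficient is a log odds ratio, so exponentiating the coefficient recovers the odds ratio. A tiny sketch; the coefficient value here is hypothetical:

```python
from math import exp, log

# A logistic regression coefficient is a log odds ratio, so OR = exp(coefficient).
beta_male = log(3.5)          # hypothetical adjusted coefficient for male sex
odds_ratio = exp(beta_male)   # recovers the odds ratio: men vs. women, 3.5
```

This is why software output for logistic regression typically reports both the coefficient and its exponentiated value.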

11. Survival analysis

Often we are interested in knowing how long people with a life-threatening disease survive following treatment. An example would be chemotherapy vs. radiation in the treatment of cancer. We could use the 5-year survival rate and calculate the proportion of people surviving in the two treatment groups with a simple relative risk calculation. However, that would not account adequately for loss-to-follow-up, which is common in studies with long follow-up. For the same reasons, mean and median survival times are not adequate.

Life tables account for differences in follow-up time and also account for changes in survival rate over time. Life tables break the follow-up period of the study into pieces and then calculate the proportion of people at risk at the start of each interval who survive until the start of the next interval. This allows the use of information from all individuals as long as they are in the study. Loss-to-follow-up is reflected in fewer numbers of study participants at risk in the next interval. There are two ways to calculate life tables: the actuarial method and the Kaplan-Meier method. The fundamental difference between these two methods is the definition of the interval. The actuarial method uses a fixed interval, often the yearly anniversary following the end of treatment. The Kaplan-Meier method uses the next death, whenever it occurs, to define the end of the last interval and the start of the next interval. The two methods give similar results. These results can be displayed in survival curves.
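The interval logic described above can be sketched as a bare-bones Kaplan-Meier estimator: at each death time, multiply the running survival probability by the fraction of those still at risk who survive. The follow-up times and event indicators below are invented (0 marks a participant lost to follow-up, i.e. censored):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.
    times:  follow-up time for each participant
    events: 1 if the participant died at that time, 0 if censored
    Returns (time, cumulative survival probability) at each death time."""
    data = sorted(zip(times, events))
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(e for tt, e in data if tt == t)
        at_risk = sum(1 for tt, _ in data if tt >= t)  # still in study just before t
        if deaths:
            surv *= (at_risk - deaths) / at_risk
            curve.append((t, surv))
        i += sum(1 for tt, _ in data if tt == t)  # skip everyone with this time
    return curve

# Hypothetical follow-up in months; event 0 = lost to follow-up (censored)
times = [6, 7, 10, 15, 19, 25]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Notice how the censored participants (months 7 and 19) contribute to the at-risk count for earlier deaths but then drop out, exactly as the text describes.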

In the above example, we would be interested to know whether the survival curve of cancer patients treated with chemotherapy differs from that of patients treated with radiation. The appropriate statistical test to answer that question is the Log-Rank Test. This test starts with the null hypothesis that there is no real difference in the survival of cancer patients treated with chemotherapy vs. those treated with radiation, and then assesses how likely it is that any differences observed in the curves generated by our study are due to chance alone.

The log rank test is a crude test that does not allow us to control for confounding or examine possible effect modification. To do that we use Cox proportional hazards regression. Detailed examination of these models is beyond the scope of this course. You only need to know that when you see a reference to the log rank test in a medical article, a crude test for differences in survival curves was conducted. If you see reference to a Cox regression model, you know that this analysis was done to examine the difference in survival between groups while controlling for confounding and also to look for possible effect modification.

12. Conclusion

The brief descriptions above are provided to give you a sense of the flexibility and utility of regression analysis for understanding health data. As you read the biomedical literature, the methods sections will often include descriptions of regression techniques used to analyze the study data. Even if you come upon some kind of regression you have never heard of, just remember that all regression procedures involve manipulating multiple variables simultaneously for three basic purposes: 1) to determine which variables predict a given outcome, 2) to control confounding, and 3) to evaluate effect modification.

13. Ancillary Material

13.1. Readings

13.1.1. Required

  • Read Pagano 5.1, 5.2; 18.1, 18.2, 18.3; 19.1; 20.1, 20.2