
Tufts OpenCourseware
Author: P. Newby, Sc.D., M.P.H.
Objectives

  • To gain a deeper understanding of the epidemiologic research process by analyzing a dataset together
  • To build a regression model

1. Introduction

At this point, all of the major concepts in the course have been covered. We have learned about different types of study designs and statistical tests used in biomedical research. We have practiced reading journal articles to examine whether the design and statistical tests used in a study were appropriate and whether the conclusions reached by the authors were sound.

But what if you were conducting your own study? Would you know how to begin? What statistical tests would you employ? Would you be able to identify a confounder or effect modifier? How would you build your regression model?

In this lecture, we will analyze a dataset together, in which regression is used to test a hypothesis. We hope that this real-life example will bring to life the concepts we’ve learned and provide a better understanding of what the process is actually like. By going through our example we will also have the chance to review some of the major concepts in epidemiology and biostatistics that we have learned in this course.

This will be an interactive lecture, in which I will ask questions throughout to help you review the concepts and apply them as we analyze our dataset and build our regression model. Our example dataset comes from a nutritional epidemiologic study, a prospective cohort examining the relation between diet and body mass index (BMI). [The original article was published in The American Journal of Clinical Nutrition 2003;77(6):1417-1425 (Newby et al).]


In this lecture, we are focusing on the data analysis portion of the research process. This assumes that all of the other important steps in conducting the study have already been performed.

2.1. Review Questions:

  • What was the study design employed?
  • What was the population studied?
  • How was the sample selected?
  • Was a power calculation performed?
  • What types of data were collected and how?
  • What is the study’s hypothesis or hypotheses?
  • What are the main exposure and outcome variables and what types of data are they?

Sometimes these questions don’t impact the data analysis, but most of the time they do. Practically speaking, the analyst will follow the study protocol, which usually describes the types of statistical tests that will be used to examine the data. However, routine statistical tests are not always described in the protocol, and the analyst uses common statistical tests to analyze the dataset.

Because the focus of this lecture is on regression and model building, we will simply review the initial steps used in analyzing a dataset but will not spend a lot of time discussing them. However, before we can begin analyzing the data, there are a few steps that need to be considered in the “pre-analysis” phase, briefly discussed below.


“What they don’t tell you in biostatistics class is that 85% of the work is actually in
creating an analytic dataset.”

Usually all an analyst receives is a dataset. Several important steps need to be taken before the data can be analyzed and the two most important steps are discussed below.

3.1. Make sure the dataset is clean!

The first question an analyst asks when they receive a dataset for analysis is “Are the data clean?” In other words, are the values valid and ready to be analyzed? This is actually THE most important step of the analysis.

Unfortunately, it’s arguably the least interesting. Returning to the “garbage in, garbage out” analogy, we can use all of the proper statistical tests but if our data were not “clean” from the outset, our results may be seriously biased or even completely invalid.

Even if the dataset is purported to be “clean,” a good analyst will still clean the data being used in their own study. Data cleaning generally involves carefully examining each variable used in the analysis, including exposures, outcomes, and every covariate that will enter the regression analysis. Therefore, we usually run descriptive statistics on each of these variables to examine their values. We often test (continuous) variables for normality, especially if normality is required by the statistical tests we will later employ.
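As a concrete sketch, here is how these checks might look in Python with pandas and SciPy (simulated data and hypothetical variable names; the course does not prescribe any particular software):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical dataset being cleaned: BMI should be roughly normal;
# smoking is categorical (0/1), with 9 as a hypothetical data-entry error code.
df = pd.DataFrame({
    "bmi": rng.normal(26, 4, 500),
    "smoking": rng.choice([0, 1, 9], size=500, p=[0.6, 0.35, 0.05]),
})

# Continuous variable: inspect the distribution and test for normality.
print(df["bmi"].describe())            # min/max flag implausible values
stat, p = stats.shapiro(df["bmi"])     # Shapiro-Wilk normality test
print(f"Shapiro-Wilk p = {p:.3f}")

# Categorical variable: a frequency table exposes invalid codes.
print(df["smoking"].value_counts())    # the code 9 stands out here
```

`describe()` surfaces implausible minima and maxima for continuous variables, while a frequency table surfaces invalid codes for categorical ones.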

Review Questions:

  • What is an example of a continuous variable and how can we examine it to see if it is “clean”?
  • What is an example of a categorical variable and how can we examine it to see if it is “clean”?

Cleaning data can be incredibly time consuming, and we will not focus on it further here. In brief, the analyst will first check for missing data. In almost all cases, data are missing for the main exposure or outcome variables or the covariates, and decisions must be made as to how this should be handled. Once the “missing data” issue is resolved, variables will be checked to see whether values are biologically plausible. Even if values are biologically plausible, they may still be statistical outliers. Data may need to be excluded or recoded to the highest plausible value, for example.
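A minimal sketch of this workflow in Python (hypothetical values; the plausible range and the recode-versus-exclude decision are study-specific):

```python
import numpy as np
import pandas as pd

# Hypothetical raw variable: adult heights in cm with missing and implausible entries.
height = pd.Series([162.0, 175.0, np.nan, 181.0, 999.0, 158.0, np.nan, 17.3])

print(height.isna().sum())             # step 1: count missing values

# Step 2: flag biologically implausible values (hypothetical adult range).
implausible = (height < 120) | (height > 220)
print(height[implausible])             # 999.0 and 17.3 are flagged

# One possible decision: set implausible values to missing, then drop (or impute).
cleaned = height.mask(implausible)
analytic = cleaned.dropna()
```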

Many complicated questions arise in the data cleaning process and oftentimes an epidemiologist will work closely with a biostatistician to ensure that the methods used to clean the data are reasonable and do not compromise the validity of the data or introduce bias. In fact, the issue of missing data is so prevalent and problematic in research that there are many biostatisticians who make a career out of how to handle missing data!

3.2. Employ restrictions needed for your particular study question

Once all the data are clean, the final dataset will need to be created. The parameters of an individual dataset depend on the question being studied. For example, if you are analyzing data from a prospective study to see who develops cancer, you will probably exclude all individuals with cancer at baseline to start with a “healthy” cohort. Age restrictions may also be employed. A good analyst will have biological knowledge of the disease process that will help to guide what restrictions should be employed and will also look to other examples in the scientific literature to see how other analysts have created their datasets.
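A sketch of such restrictions in Python (hypothetical variable names and a hypothetical age window):

```python
import pandas as pd

# Hypothetical baseline data: exclude prevalent cancer cases and restrict age,
# mirroring the "healthy cohort" restriction described above.
baseline = pd.DataFrame({
    "id":           [1, 2, 3, 4, 5, 6],
    "age":          [45, 72, 38, 55, 91, 60],
    "cancer_at_bl": [0, 1, 0, 0, 0, 0],
})

analytic = baseline[(baseline["cancer_at_bl"] == 0) &
                    baseline["age"].between(40, 80)]   # hypothetical age window
print(analytic["id"].tolist())
```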

We now have our analytic dataset and are ready to begin our analysis.


4.1. Step 1: Conduct descriptive statistics for Table 1

Remember the “Table 1” from all of the journal articles we have read? The first step in the analysis is often to run “descriptive statistics” for the dataset, which are used in the first table that presents “Sample Characteristics.” This is not only necessary for the reader to help them better understand the sample being studied, but it is also helpful for the analyst…for the same reason! That is, it can alert the analyst to potential confounders or effect modifiers that s/he will want to look out for in the regression analysis.

Descriptive statistics can be reported in several different ways and it again depends on the study question being asked. Most often, data are presented stratified by men and women.
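A “Table 1” of this kind can be produced directly from the analytic dataset; for example, in Python (simulated data, hypothetical characteristics):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Simulated sample: Table 1 is often mean (SD) of each characteristic by sex.
df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "age": rng.normal(55, 10, n),
    "bmi": rng.normal(26, 4, n),
})

table1 = df.groupby("sex")[["age", "bmi"]].agg(["mean", "std"]).round(1)
print(table1)
```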


Table 1
Sample characteristics of 459 women and men participating in the Baltimore Longitudinal Study of Aging. (Source: U.S. Dept. of Health & Human Services, NIH)

Sample Characteristics

Review Questions:

  • Does the above table tell us whether the characteristics are significantly different among men and women? What statistical test could we have performed to show this?
  • Is the statistical test the same for continuous and nominal data?

4.2. Step 2: Conduct additional tests to describe the data to the reader

Of course the additional tests run on the data and presented in the table depend on the study question. In this example, we used a statistical procedure to derive what we called “dietary patterns,” which groups individuals together in non-overlapping groups according to differences in mean intakes for a set of foods/food groups. We then presented additional “characteristics” for each pattern, as described in our Table 2 below.
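One common way to form non-overlapping dietary patterns of this kind is cluster analysis; a sketch with k-means in Python follows (simulated intakes for two latent patterns; the paper’s actual derivation procedure may differ from this illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)

# Simulated food-group intakes (servings/day) for two hypothetical patterns:
# columns are [fruit/vegetables, sweets].
healthy = rng.normal([4.0, 1.0], 0.5, size=(100, 2))  # high fruit/veg, low sweets
sweets  = rng.normal([1.0, 4.0], 0.5, size=(100, 2))  # low fruit/veg, high sweets
intakes = np.vstack([healthy, sweets])

# k-means assigns each subject to exactly one (non-overlapping) pattern.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(intakes)
print(np.bincount(labels))   # group sizes
```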

Table 2
Nutrient intakes (mean ± SE) and sample characteristics (N, %) across the five eating patterns identified at baseline among 459 women and men participating in the Baltimore Longitudinal Study of Aging. (Source: U.S. Dept. of Health & Human Services, NIH)

Additional Characteristics

Review Questions:

  • Did the above table test for statistical significance in nutrient intakes and sample characteristics across dietary patterns?
  • What does the above table tell us about potential confounding and/or effect modification?

4.3. Step 3: Building a regression model

We’ve finally reached the most interesting part of the analysis. Following our example, does diet predict change in BMI? We are specifically testing whether the white bread, alcohol, sweets, and meat-and-potatoes pattern predicts larger changes in BMI compared to the “healthy” dietary pattern. Therefore, our outcome (dependent) variable is change in BMI and our main exposure (independent) variables are each of the separate dietary patterns.

Because our outcome is a continuous variable, we will use a multiple linear regression procedure. (Remember that our outcome must be normally distributed to use this procedure.) We are using a “multiple” regression procedure since we will be entering multiple variables into the model.
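As an illustration, a multiple linear regression of a continuous outcome on several predictors can be fit in Python with statsmodels (simulated data; the variable names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300

# Simulated predictors and a continuous outcome built from known coefficients.
df = pd.DataFrame({"age": rng.normal(55, 10, n),
                   "baseline_bmi": rng.normal(26, 4, n)})
df["delta_bmi"] = 0.02 * df["age"] - 0.05 * df["baseline_bmi"] + rng.normal(0, 0.5, n)

# "Multiple" regression: multiple predictors entered into one model.
model = smf.ols("delta_bmi ~ age + baseline_bmi", data=df).fit()
print(model.params)
```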

There are several steps that are taken to build the regression model that is presented in the paper, and the model building procedure is usually described in the Methods section of the paper.

4.3.1. Step 3.1: Build the core regression model using univariate regression analysis

In the first step, a univariate regression model is tested with the main predictor(s) and main outcome only. In this case, we will first test the model:

ΔBMI = constant + white bread + alcohol + sweets + meat-and-potato

Note: Because our main predictor variable(s) is a categorical variable, we have created what we call a “dummy” or “indicator” variable, in which a subject is assigned a ‘1’ if they fall into the category and a ‘0’ if they don’t. Then, the “reference group” – in this case, the “healthy” dietary pattern – is omitted from the model. Therefore, our regression coefficients will represent the change in BMI for those in the white bread, alcohol, sweets, or meat-and-potato pattern COMPARED TO those in the healthy pattern. (Our model will not tell us, for example, whether those in the sweets pattern will have a greater change in BMI compared to those in the alcohol pattern, although we could change our reference group to the alcohol pattern if we were interested in learning this.)
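Dummy coding with a chosen reference group can be done as follows (hypothetical pattern labels matching the example; the “healthy” column is dropped so it serves as the reference):

```python
import pandas as pd

# Hypothetical pattern assignments; the "healthy" pattern is the reference.
df = pd.DataFrame({"pattern": ["healthy", "sweets", "alcohol",
                               "white bread", "meat-and-potato", "healthy"]})

# One 0/1 indicator column per pattern; drop the reference group's column.
dummies = pd.get_dummies(df["pattern"]).drop(columns="healthy")
print(dummies.columns.tolist())
print(dummies.sum())   # subjects per non-reference pattern
```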

We keep the above “core” model and refer to the coefficients for each of the diet patterns when building the larger, multivariate model. The beta coefficients from this model are known as “unadjusted beta coefficients,” since no other variables are included in the model.

4.3.2. Step 3.2: Testing for confounding

Our previous tables, if they have tested for statistical significance, may indicate that there are potential confounders in our dataset that we need to look out for when building our model. Our analyses thus far indicate that a number of different dietary variables and sample characteristics (from Table 2, shown earlier) are significantly related to dietary pattern, but we still don’t know whether those same variables are related to change in BMI. Therefore, it is useful to perform univariate regression analyses for every variable that you are considering entering into the model (all covariates).


  • ΔBMI = constant + age
  • ΔBMI = constant + sex
  • ΔBMI = constant + baseline BMI
  • ΔBMI = constant + smoking
  • ΔBMI = constant + education
  • ΔBMI = constant + vitamin use
  • ΔBMI = constant + physical activity
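The univariate screening above can be sketched as a loop over candidate covariates (simulated data in which age truly predicts the outcome and vitamin use does not):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300

# Simulated covariates: age truly predicts change in BMI; vitamin use does not.
df = pd.DataFrame({"age": rng.normal(55, 10, n),
                   "vitamin_use": rng.integers(0, 2, n)})
df["delta_bmi"] = 0.05 * df["age"] + rng.normal(0, 1, n)

# One univariate regression per candidate covariate, mirroring the list above.
pvals = {}
for covariate in ["age", "vitamin_use"]:
    fit = smf.ols(f"delta_bmi ~ {covariate}", data=df).fit()
    pvals[covariate] = fit.pvalues[covariate]
    print(f"{covariate}: beta={fit.params[covariate]:.3f}, p={pvals[covariate]:.3g}")
```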

Review Questions:

  • How can we tell whether our regression coefficients are significant?
  • Is there another way of testing whether two variables are associated, other than regression analysis?

4.3.3. Step 3.3: Begin building the multivariate regression model

We now use the knowledge we’ve gained from our confounding tests to begin building the multivariate regression model. We start with our core model, which again in this case is

ΔBMI = constant + white bread + alcohol + sweets + meat-and-potato

Most analysts then start adding each of the potential confounders, one at a time, to the above model, to test whether the variable was in fact a confounder.


ΔBMI = constant + white bread + alcohol + sweets + meat-and-potato + age

Performing the above regression is another way we can tell whether age was a confounder of the relation between diet and change in BMI. If the original regression coefficients from our univariate model for each of the diet pattern variables change more than 10% from their original estimate, we say that the relation was confounded by age, and we would leave the term in the model. (Note that the “10% rule” is a rule of thumb.) Likely, age was also significantly related to diet in an earlier analysis and was also independently related to change in BMI in its own univariate analysis. (Remember that there are many types of statistical tests that can be performed to assess whether variables are related.)
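The “10% rule” check can be sketched as follows (simulated data constructed so that age is related to both the exposure and the outcome, i.e., a true confounder):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500

# Simulated confounding: age drives both the exposure and the outcome.
age = rng.normal(55, 10, n)
exposure = (age + rng.normal(0, 10, n) > 55).astype(int)  # exposure depends on age
delta_bmi = 0.05 * age + 0.3 * exposure + rng.normal(0, 1, n)
df = pd.DataFrame({"age": age, "exposure": exposure, "delta_bmi": delta_bmi})

# Compare the exposure coefficient before and after adjusting for age.
crude = smf.ols("delta_bmi ~ exposure", data=df).fit().params["exposure"]
adjusted = smf.ols("delta_bmi ~ exposure + age", data=df).fit().params["exposure"]

pct_change = abs(adjusted - crude) / abs(crude) * 100
print(f"crude={crude:.3f}  adjusted={adjusted:.3f}  change={pct_change:.0f}%")
```

Because the coefficient shifts by well over 10% once age enters the model, the rule of thumb would keep age in the model.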

A similar procedure is performed for each of the potential confounders, in which they are separately added to the “core” model (containing just the main predictors) to test for confounding. At the very least, these steps are taken for the “usual suspects” that are often confounders (e.g., age, sex, ethnicity, education, income). Additional potential confounders vary, of course, by study question.


In fact, model building can become more complicated, since often “multiple confounding” is occurring in the data. For example (not in this example), income and education are both related to diet, as well as to weight, as well as to each other. Therefore, many epidemiologists choose to be “safe” and simply include them in the model together to adjust for confounding. Many studies have enough statistical power to do this, but if there is limited power (and a model with many terms has less power to detect an association), one needs to be much more cautious about simply adding all potential confounders to the model. If this procedure is followed, it is still useful to the reader to see how the coefficients changed when the models were adjusted for potential confounders.

Review Question:

  • When should a variable NOT be added to the regression model?

4.3.4. Step 3.4: Testing for interaction

The next phase of the model building procedure will test for effect modification. Analysts differ in how they approach effect modification in an analysis. Usually, effect modification is only tested when there is a biologic reason to suspect it or previous literature in the field suggests it. Unlike confounding, tests for effect modification are performed once the final (or potentially penultimate) model is built.

In this example, we thought it possible that the effect of diet on change in BMI could be modified by sex or age. We have previously learned that stratifying a sample by the potential modifier can show us whether a variable is an effect modifier or not. When you are using a regression analysis to test for effect modification, an interaction term is created and then added to your final model. A rule of thumb is that if the term is significant at the P < 0.01 level, then the dataset should be reanalyzed separately (e.g., for women and for men). If the term is not significant, the term will be dropped from the final model.
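Testing an interaction term might look like this (simulated data with a genuine exposure-by-sex interaction; the 0/1 sex coding is hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 600

# Simulated effect modification: the exposure effect differs by sex.
sex = rng.integers(0, 2, n)        # 0 = women, 1 = men (hypothetical coding)
exposure = rng.integers(0, 2, n)
delta_bmi = 0.2 * exposure + 1.0 * exposure * sex + rng.normal(0, 1, n)
df = pd.DataFrame({"sex": sex, "exposure": exposure, "delta_bmi": delta_bmi})

# The formula expands exposure*sex into both main effects plus the interaction.
fit = smf.ols("delta_bmi ~ exposure * sex", data=df).fit()
p_interaction = fit.pvalues["exposure:sex"]
print(f"interaction p = {p_interaction:.4f}")

if p_interaction < 0.01:           # rule of thumb from the text
    print("stratify: analyze women and men separately")
```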

4.3.5. Step 3.5: Refining the model and selecting a final model

There are a number of other steps that can be taken to refine the final model, which we will not discuss here. For example, age is often not linearly related to the outcome, so we can introduce quadratic or other age terms into the model to improve the model fit. In these cases, we often come up with a set of candidate final models and are not sure which best fits the data. There are statistical tests that can be performed to assess which model fits the data better, and these tests can help you select a final model. In all cases, again, biologic knowledge of the relationship being examined and knowledge of the literature in the field can aid you in both building your model and selecting a final model.
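For example, a linear and a quadratic age model can be compared with a fit statistic such as AIC (simulated curvilinear data; nested models could also be compared with an F-test):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 400

# Simulated curvilinear age effect: change in BMI rises then falls with age.
age = rng.uniform(30, 80, n)
delta_bmi = -0.002 * (age - 55) ** 2 + rng.normal(0, 0.5, n)
df = pd.DataFrame({"age": age, "delta_bmi": delta_bmi})

linear = smf.ols("delta_bmi ~ age", data=df).fit()
quadratic = smf.ols("delta_bmi ~ age + I(age ** 2)", data=df).fit()

# Lower AIC indicates a better-fitting model.
print(f"linear AIC={linear.aic:.1f}  quadratic AIC={quadratic.aic:.1f}")
```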

4.3.6. Step 3.6: Presenting the final model

Example: Final regression model presented

Table 3
Regression coefficients (β) and standard errors (SE) for dietary patterns for predicting relative change in body mass index (BMI) and change in waist circumference, comparing each pattern to the Healthy pattern, among adults participating in the Baltimore Longitudinal Study of Aging. (Source: U.S. Dept. of Health & Human Services, NIH)

Final Regression Model

In the Table, the models we have presented are:

ΔBMI = constant + white bread + alcohol + sweets + meat-and-potato + age + sex + baseline BMI
ΔBMI = constant + white bread + alcohol + sweets + meat-and-potato + age + sex + baseline BMI + ethnicity + physical activity + past smoking + current smoking + vitamin supplement use

Review Questions:

  • How do we interpret the regression coefficients from the equation above?
  • Do the models suggest that confounding was occurring?

5. Conclusion

In conclusion, although building a regression model can be the quickest part of the whole analysis, the steps leading up to the model building are of critical importance in ensuring that you have a valid analytic dataset. In addition, there are many steps that can be taken during the model building itself that will lead to the final model. The steps I’ve outlined above will help you in building a regression model, but there is an art to it that only comes with experience and knowing your subject area.

6. Ancillary Material

6.1. Readings

6.1.1. Required

  • Review notes on regression (Lecture # 10 and associated reading)
  • Review notes on descriptive statistics (Lecture #3 and associated reading)
  • Review notes on confounding and effect modification (Lecture #4 and associated reading)