- Define risk and rate ratios
- Define the various measurements of rates
- Define Z scores and percentiles
- Discuss the bell-shaped curve and its attributes
- Demonstrate methods of displaying data
- Contrast the various types of data
- Describe commonly used statistical terms
- Discuss the natural history of disease
In Lecture 2 - Observational Studies we explored observational studies as a basic analytic tool for trying to unravel causal events, such as answering the question: Is bed rest better than surgery for treating lower back pain? To date we have introduced you to two varieties of analytic studies: intervention studies and observational studies. Descriptive studies were characterized as “hypothesis generating” studies that often lead to analytic investigations.
This lecture explores the basic tools that epidemiologists and statisticians use to describe health events in human populations.
2. Natural History of Disease
Every disease or health event has a natural history, a signature pattern that occurs time and again in the absence of any outside intervention. As discussed previously, this pattern often provides clues as to causes. One example is the observation that death rates from suicide are directly related to latitude: as one moves closer to the Arctic Circle, suicide rates (and alcoholism rates) increase. In other words, simple observations of the distribution of these events in human populations led to these discoveries.
However, understanding the natural history of disease can also provide important clues that can be used to devise prevention programs and strategies. For example, we know from studying the natural history of cervical cancer that it is a slow growing cancer that progresses through various stages over a relatively long period of time before developing into an invasive, life-threatening, phase. Knowledge of this progression in normal risk women allows us to develop recommendations for Pap smear screening intervals for women that are optimal in terms of identifying potentially abnormal cells and removing them before they progress to invasive disease. The natural history of prostate cancer is a lot less well understood. As a consequence, there is controversy about when and how to screen men.
In describing the natural history of disease epidemiologists use characteristics of person, place and time.
- Characteristics of person include, but are not limited to, factors such as age, sex, race, educational level, income, family history, etc.
- Place factors may include city or town, workplace, e.g. the part of the building.
- Time would include the time of year, day, progression over time, the speed of development of disease, etc.
For example, consider the data below (Cook County Medical Examiner’s Office, heat-related deaths, July 1995. Source: CDC, Department of Health and Human Services).
|FIGURE 1. Number of heat-related deaths, by date of occurrence and race of decedent, and heat index, by date - Chicago, July 11-27, 1995|
From this diagram you should be able to identify at least one characteristic each of person, place and time associated with these heat-related deaths in Cook County during July 1995.
3. Commonly Used Descriptive Terms.
Epidemiologists use several descriptive terms to describe disease patterns. They include:
Epidemic: The occurrence in a community or region of a group of illnesses of a similar nature clearly in excess of normal expectancy. The term epidemic is therefore relative to what a community or region ordinarily experiences.
Endemic: The constant presence of a disease within a given geographic area or population group.
Pandemic: The constant presence of a disease across a wide area or region. (Gordis states that pandemic refers to a worldwide epidemic.)
4. Descriptive Statistics
[Please note that the discussion of descriptive statistics below is a whirlwind overview intended as review for most people. Chapters 2 and 3 of Pagano and Gauvreau provide a much more in depth treatment of these topics.]
4.1. Types of Data:
Different types of data are used in epidemiological studies:
- Nominal Data: Pagano defines nominal data as having values that fall into unordered categories or classes. Examples of nominal data include hair color, gender, state of birth, etc. Nominal data that take one of two forms, such as having an outcome or not having an outcome, are called dichotomous or binary data.
- Ordinal Data: As noted in Pagano & Gauvreau, “When the order among categories becomes important, the observations are referred to as ordinal data.” Examples include stage of lung cancer and grade
- Discrete Data: Discrete data are those that are restricted to specified values, often integers or counts. An example of discrete data is a rank ordering of the leading causes of death or the number of prenatal visits (see Pagano & Gauvreau, Table 2.3).
- Continuous Data: “Data that represent measurable quantities but are not restricted to taking on certain specified values are known as continuous data.” Examples include height and weight that can be reported as accurate as the measuring device. “The only limiting factor for a continuous observation is the degree of accuracy with which it can be measured,” Pagano & Gauvreau, p.11.
An easy way to keep these terms straight is to think of these data types as comprising a hierarchy from nominal at the lowest and least informative level to continuous data at the highest and most informative level.
Understanding the data types you are analyzing is important because it determines the most appropriate way to summarize and depict your data. It also dictates the types of statistical tests that should be used to analyze the data. (You will learn more about this concept later in the course.)
It is also important to appreciate the fact that data gathered at a higher level can always be converted to data at a lower level. However, the reverse is not true. For example, if you collect data on the number of cigarettes smoked per day, discrete data, you can convert these data to ordinal data: < ½ a pack; ½ a pack to 1 pack; >1 pack. While this may have made it easier to summarize the data, information has been lost in the conversion. Had you collected these data by category, it would have been impossible to express them as discrete data as the actual number of cigarettes smoked per day would not be known.
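The one-way nature of this conversion can be sketched in a few lines of Python. This is only an illustration: the function name `smoking_category`, the sample counts, and the assumption that one pack holds 20 cigarettes are not from the lecture.

```python
# Assumption (not stated in the lecture): one pack holds 20 cigarettes.
CIGARETTES_PER_PACK = 20


def smoking_category(cigarettes_per_day):
    """Collapse a discrete count into one of three ordered categories."""
    half_pack = CIGARETTES_PER_PACK / 2
    if cigarettes_per_day < half_pack:
        return "< 1/2 pack"
    if cigarettes_per_day <= CIGARETTES_PER_PACK:
        return "1/2 pack to 1 pack"
    return "> 1 pack"


# Hypothetical daily counts for five smokers (illustrative values only).
counts = [4, 10, 15, 20, 32]
categories = [smoking_category(c) for c in counts]
print(categories)
```

Smokers reporting 10, 15 and 20 cigarettes all land in the same category, so the original counts cannot be recovered from the categories alone.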
Raw data is often little more than a jumble of numbers and hence very difficult to handle. We need to find ways to make sense out of this chaos of data such that we can extract information from the data and communicate it to others. We do this through data depiction, data summarization and data transformation.
4.2. Data Depiction: From Chaos to Order
The reasons for displaying data are twofold: 1) so that as investigators we can have a better look at the distribution of data; and 2) to communicate this same information to others quickly.
The techniques for data depiction and data display are too numerous to describe here. However, they generally fall into categories of tables, charts or graphs. The process involves moving from the "chaos" of hundreds, if not thousands, of raw numbers on a page to the relative order of a table or graph.
Consider for a moment a recently completed study of second hand smoke exposure in restaurant and bar workers. There were N=53 (or 53 subjects in this study). There were 43 women and 10 men in the study (nominal data). A simple bar chart conveys the same information far more quickly.
Ordinal data can also be displayed conveniently via bar charts or tables. Consider the table and bar chart below:
|Age Distribution of Restaurant and Bar Workers N=53|
|Frequency||Percent||Valid Percent||Cumulative Percent|
The table does an excellent job of displaying a lot of information quickly, the age categories, relative percents and cumulative percents. Yet the chart is far more efficient in instantly conveying information about the distribution. This is an example of data that were collected as discrete data, age in years, and then converted to ordinal data, i.e. ordered age categories.
Continuous data are most often displayed using tables and graphs. Tables are used as above to categorize and summarize, while graphs provide an overall visual representation of the data, e.g. the range, the most common value, the average value.
Consider for a moment the data below on urine cotinine exposure from the study of restaurant and bar workers. (Cotinine is a metabolite of nicotine found in urine, saliva and blood. It provides a biomarker that can be used to quantify the level of exposure to nicotine from second-hand smoke or other sources).
|Cotinine Levels in ng/ml of Restaurant and Bar Workers|
|Category||N||Rel %||Cumm %|
|Cotinine Levels in ng/ml of Restaurant and Bar Workers|
By reducing raw data into tabular form, we can quickly learn a lot about the nature of the distribution, the range of values and where the most common values lie. This graphical representation is called a "histogram", sometimes also called a frequency distribution. The area under the curve between any two values provides a view of the relative frequency for those values. The area under the entire curve represents 100% of the distribution. Notice that a quick glance at this curve suggests that it is not symmetrical about a midpoint. (We will return to this concept shortly).
Often we want to reduce a complex data set to just a few numbers that summarize major themes or tendencies in the data. For nominal or ordinal data, we can simply use proportions or percents, e.g. the percentage of bar workers who live with a smoker is 15%.
4.3. Samples from Populations
Research most often is conducted using samples of individuals from larger populations. Clearly, it’s impractical if not impossible to study all people with cardiac arrhythmias, so we select a sample of people for study. We collect data on these subjects, draw conclusions from the data and make inferences to the larger population. (Some of the problems that can occur in this process are discussed later in the course). Please notice that the notation used when samples are involved is different from the notation used when dealing with entire populations. Roman (Latin) letters are used for sample statistics, while Greek letters are used when referring to the counterpart population values. In fact the term statistic is used when referring to samples, while the term parameter is used when referring to population values.
4.4. The Gaussian Distribution
Many different types of data distributions are encountered in medicine. The Gaussian, or "normal", distribution is among the most important. Its importance stems from the fact that the characteristics of this theoretical distribution underlie many aspects of both descriptive and inferential statistics.
The normal distribution is defined by the following characteristics:
Note: If these criteria are not met then the distribution is not a Gaussian distribution.
Skewed data are not normally distributed. When the tail is to the right, as you look at the distribution, the data are called right or positively skewed. When the tail is to the left, they are called left or negatively skewed.
4.5. Data Reduction/Summarization
There are two major ways we try to characterize data that are normally distributed: 1) provide a summary value for the middle value or central tendency; and 2) provide an estimate of the amount of variability we observe.
4.5.1. Measures of Central tendency:
The mean (average) is the most appropriate measure of central tendency for a normal distribution but it is not such a good measure if the distribution is skewed. This is because the mean is influenced by extreme values. A lot of extreme values will tend to pull the mean in the direction of the skew. In this instance the median, or middle value, is the most appropriate measure to use. (Return to the data on urine cotinine. Is it skewed? Which is a better measure of central tendency, the mean or median?) In the case of the cotinine data provided earlier, the mean is 12.7 while the median is 8.6. The mean and the median are the most commonly used measures of central tendency, while a third measure, the mode, or most common value, is used rarely. It is more of a descriptive term.
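The pull of extreme values on the mean can be seen in a small sketch. The numbers below are hypothetical right-skewed values chosen for easy arithmetic; they are not the cotinine data from the study.

```python
import statistics

# Hypothetical right-skewed values (NOT the study's cotinine data):
# eight small values and one extreme value in the right tail.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 54.0]

mean_value = statistics.mean(values)      # pulled toward the extreme value
median_value = statistics.median(values)  # middle value, unaffected by the tail

print(f"mean = {mean_value}, median = {median_value}")
```

Here the single extreme value doubles the mean relative to the median, which is the same pattern (mean above median) reported for the cotinine data.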
4.5.2. Measures of Variability:
One measure of variability is the range, the distance from the highest to the lowest value. In the case of the urine cotinine data the range is from 0 to 54.9, which is 54.9. The problem with the range as a measure of variability is that it uses only two numbers from the distribution essentially ignoring the rest of the information contained in the data set. A better measure is the variance. The variance is in effect the average of the squared deviations of each individual value from the mean. (Note: Most often you will use the square root of the variance, the standard deviation). The larger the variance or standard deviation, the larger the spread in the distribution. Note that although the standard deviation can be calculated in a skewed distribution, it has no meaning, i.e. it cannot be interpreted as if it were calculated from a normal distribution.
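As a rough sketch, the range, variance and standard deviation can be computed step by step as follows. The data values are hypothetical. Note one assumption beyond the text: when the data are a sample, the sum of squared deviations is usually divided by n - 1 rather than n, so both versions are shown.

```python
import statistics

# Hypothetical values chosen so the arithmetic is easy to follow.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: uses only the two most extreme values.
data_range = max(values) - min(values)

# Variance: based on the squared deviation of every value from the mean.
mean_value = statistics.mean(values)
squared_deviations = [(v - mean_value) ** 2 for v in values]
population_variance = sum(squared_deviations) / len(values)    # divide by n
sample_variance = sum(squared_deviations) / (len(values) - 1)  # divide by n - 1

# Standard deviation: the square root of the variance.
sample_sd = sample_variance ** 0.5

print(data_range, population_variance, round(sample_variance, 2), round(sample_sd, 2))
```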
4.6. Data Transformation-Percentiles
It is sometimes useful to transform data from pure raw scores or values to percentiles. The percentile defines the percentage of values in the distribution that are at or below a specified value. For example, the 95th percentile is the value at or below which 95% of all the values in the distribution lie. (The interquartile range is the 75th percentile minus the 25th percentile).
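A minimal sketch of both ideas, using hypothetical values: `percentile_rank` is an illustrative helper (not a library function) that applies the "at or below" definition directly, and the quartiles come from Python's `statistics.quantiles`. Several conventions exist for interpolating quartiles, so other software may give slightly different values.

```python
import statistics


def percentile_rank(values, x):
    """Percentage of values in the distribution that are at or below x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)


# Hypothetical raw scores.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

rank_of_six = percentile_rank(values, 6)  # 6 of the 10 values are <= 6

# Quartiles (25th, 50th, 75th percentiles) and the interquartile range.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(rank_of_six, q1, q3, iqr)
```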
By transforming ‘raw’ data or scores to percentiles, you are in essence moving from a system of units or measures that may not be universally understood, e.g. ng/ml of urine cotinine from our previous example, to one that is universally understood, percents. For example, if you were a bar worker and you were told that your urine cotinine level was 8.6 ng/ml, it might not mean much to you. However, if you were told that you had a value of cotinine that put you in the 60th percentile, it would immediately communicate something to you about where you stood relative to your colleagues in the distribution.
If data are normally distributed, or near normally distributed, transforming data from a raw score to percentiles is easy. Consider the case of a distribution of subject heart rates in the figure below. You can see instantly that if your pulse or heart rate is 85 bpm that you are in the 84th percentile. You do not need to know anything about heart rates or anything else in order to understand that there are a lot of people with lower rates than you in the distribution.
|Normal distribution for heart rates graph|
The z-score allows one to carry out such a transformation from raw data. The formula for the z-score is:
$$z = \frac{x_i - \mu}{\sigma}$$
- $x_i$ = raw score
- $\mu$ = population mean
- $\sigma$ = standard deviation
- $z$ = ‘z score’, the ‘raw’ score transformed to the number of standard deviation units the score is from the mean
Using a table of the normal distribution (z table) one can then find the percentile associated with the z for any given raw score (see Pagano & Gauvreau Table A.3, p. A-9). Example: If your heart rate is 68 bpm, knowing that heart rates follow a Gaussian (normal) distribution with a mean of 75 bpm and a standard deviation of 10 bpm yields a z-score of -0.7, i.e. (68 - 75)/10. Consulting the table shows that a z of -0.7 corresponds to the 24.2nd percentile. In other words, 24.2% of people have a pulse rate lower than yours. 2
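The table lookup can also be reproduced in software. The sketch below uses `NormalDist` from Python's standard library in place of the printed z table; the helper name `z_score` is illustrative, and the mean of 75 bpm and standard deviation of 10 bpm are the values given in the example.

```python
from statistics import NormalDist

MEAN_BPM = 75.0  # population mean from the heart rate example
SD_BPM = 10.0    # standard deviation from the heart rate example


def z_score(raw, mean, sd):
    """Number of standard deviations the raw score lies from the mean."""
    return (raw - mean) / sd


heart_rates = NormalDist(mu=MEAN_BPM, sigma=SD_BPM)

z_68 = z_score(68, MEAN_BPM, SD_BPM)
pct_68 = 100 * heart_rates.cdf(68)  # percent of people with a rate below 68 bpm
pct_85 = 100 * heart_rates.cdf(85)  # percent of people with a rate below 85 bpm

print(f"z = {z_68:.1f}, percentile for 68 bpm = {pct_68:.1f}, for 85 bpm = {pct_85:.1f}")
```

A heart rate of 85 bpm is one standard deviation above the mean, which is why it falls at about the 84th percentile.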
It is important to note that there are other reasons for transforming data. Statisticians often prefer to work with normal distributions. Skewed distributions can be made to approximate normal distributions by a variety of techniques, such as taking the log of each value. One must remember, however, that the resulting normal distribution is a distribution of log values.
2 Note: To save space the table provides information only on the upper tail of the normal distribution. However, since the normal distribution is symmetrical the upper and lower tails are the same. The minus sign indicates that in this case you are dealing with the lower tail of the distribution.
4.7. Descriptive Epidemiology
Between July 13 and July 21, 1995, 1177 deaths occurred in Chicago, an 85% increase over the number that occurred in the same interval in 1994 (N=637). Of these deaths, 465 were identified as “heat related”. (Exposures that culminate in deaths are referred to as mortal events while exposures that culminate in disease, disability or injury are referred to as morbid events).
The case fatality rate is the number of individuals dying of a disease during a specified time period divided by the number of individuals with the disease. In contrast, the mortality rate is the number of deaths from a disease during a specified time period divided by the number of individuals in the population during the specified time period. Note that it is possible for a disease to have a high case fatality rate but a low mortality rate. This could occur with a rare but rapidly fatal disease.
The problem in making comparisons with other years is that we do not know how the denominator population may have changed from one year to the next. Epidemiologists use rates in making comparisons between events that occur at different times, in different places or in different populations.
A rate is defined as:
$$\text{Rate} = \frac{\text{number of events (e.g. deaths) during a defined time interval}}{\text{population at risk during the same time interval}}$$
To calculate a rate one needs:
- a defined time interval (day, week, month, year, decade)
- the number of events occurring in the time interval, e.g. a year
- an estimate (or count) of the population at risk in the time interval, e.g. a year
- a multiplier or constant (x 10; x 100; x 1,000, etc.).
The one-week mortality rates for Cook County in 1995 vs. 1994 look like this:
|1994 637/5,135,132 x 100,000 = 12.4 per 100,000 per week|
|1995 1177/5,185,152 x 100,000 = 22.7 per 100,000 per week|
Note: Census Bureau statistics for the population of Cook County 1994 and 1995 respectively: 5,135,132 and 5,185,152.
This represents an approximately 83% increase in the mortality rate in the second week of July between 1994 and 1995, a huge increase. Clearly something other than changing demographics was responsible for this difference. Rates are a measure of the intensity with which events are occurring in a defined time period.
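The calculation can be checked with a short sketch that recomputes both rates from the death counts and Census populations quoted above. The function name `rate_per` is illustrative only.

```python
def rate_per(events, population_at_risk, multiplier=100_000):
    """Events divided by the population at risk, scaled by a constant multiplier."""
    return events / population_at_risk * multiplier


# Deaths and Census populations quoted in the text.
rate_1994 = rate_per(637, 5_135_132)
rate_1995 = rate_per(1177, 5_185_152)

# Relative change in the rate from 1994 to 1995.
percent_increase = (rate_1995 / rate_1994 - 1) * 100

print(f"1994: {rate_1994:.1f} per 100,000 per week")
print(f"1995: {rate_1995:.1f} per 100,000 per week")
print(f"Increase: {percent_increase:.0f}%")
```

Because the population grew only slightly between the two years, the increase in the rate is close to the increase in the raw number of deaths.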
Some caveats about rates follow:
- Rates measure the intensity with which events are occurring but they do not tell us the magnitude of a problem in terms of the number of people involved. For example, if you do not know the population of Cook County, you do not know how many people died during this week of July 1995 simply by looking at the rate.
- Comparing rates can be perilous unless you are certain that the populations being compared are at risk and have the same underlying characteristics. (See section on rate adjustment.)
- The choice of a constant by which you multiply a rate is dictated by convention rather than any absolute rules. For example, by convention total (all cause) mortality rates are expressed per 100,000 while infant mortality rates are expressed per 1,000. Usually a constant is chosen such that multiplying by it results in at least one integer to the left of the decimal point.
- Certain rates in public health have very specific definitions which you must make sure you understand before trying to interpret them, e.g. case fatality rate, disease specific mortality rate, perinatal mortality rate. (See Gordis text for more details)
One very special type of rate that is used universally in epidemiology is the incidence rate. The incidence rate, sometimes called cumulative incidence, is defined as follows:
$$\text{Inc} = \frac{\text{number of NEW cases of disease from the population at risk during a specified period}}{\text{number of people at risk at the beginning of the specified period}} \times k \ (\text{constant})$$
Because the incidence measures the number of NEW events, it provides an index of the speed or velocity of propagation of a disease in a population. Changes in incidence over time, either up or down, suggest that a disease may be waxing or waning in importance.
(Note: the Cook County mortality rate for July 1995 is an example of an incidence measure. Why?)
4.7.3. Incidence Density
Another measure of incidence that is often used in studies that follow subjects over time, e.g. prospective cohort studies and clinical trials, is incidence density (I.D.). The numerator for incidence density is calculated in the same way as for cumulative incidence, i.e. the number of NEW events in a pre-defined interval, e.g. one week, one month, two years, etc. What is different is the denominator. To calculate the denominator one sums the length of time at risk for each individual in the population. (This would obviously be a very time consuming task if there were a great many subjects in a study). Often, however, this information is simply not available, in which case one usually relies on cumulative incidence. The advantage of incidence density is that it provides a more precise measure of risk since subjects may not always be followed for the same length of time.
Thirty (30) subjects recently diagnosed with diabetes were followed for periods that ranged from one to five years. At the end of the study, five subjects showed signs of peripheral vascular disease. There were a total of 130 person-years of follow-up.
|I.D. = 5/ 130 = .038 or 3.8 per 100 P-Y|
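A brief sketch of the same calculation follows. The first call uses the figures from the diabetes example (5 events, 130 person-years). The second shows how the denominator is built by summing individual follow-up times; those four follow-up times and the function name `incidence_density` are hypothetical.

```python
def incidence_density(new_events, person_time, per=100):
    """New events divided by total person-time at risk, scaled by a multiplier."""
    return new_events / person_time * per


# Figures from the diabetes example: 5 new cases over 130 person-years.
id_diabetes = incidence_density(5, 130)

# Hypothetical illustration of the denominator: years at risk for four subjects.
follow_up_years = [1.0, 2.5, 4.0, 5.0]
person_years = sum(follow_up_years)
id_small_study = incidence_density(1, person_years)

print(f"{id_diabetes:.1f} per 100 P-Y; {id_small_study:.1f} per 100 P-Y")
```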
Another important public health measure is prevalence. Prevalence, which is a proportion, is defined as:
$$\text{Prevalence} = \frac{\text{number of individuals with a disease or characteristic at a specific time}}{\text{number of individuals in the population at that specified time}}$$
Prevalence is expressed as a percent, e.g. the prevalence of smoking in the U.S. in 1999 was 24%, or as a decimal, 0.24.
- Prevalence is a "cross-sectional" measure, i.e. it is measured at one point in time.
- Prevalence includes both "old" and "new cases", i.e. both people who began smoking prior to 1999 and those who just began in 1999.
- Prevalence, unlike incidence, provides an estimate of the burden of disease. If the prevalence of Type II diabetes is 3% in the U.S., that suggests that 7.65 million (0.03 x 255 million) people have this disease, a substantial number.
4.7.5. Relationship between Incidence and Prevalence.
It turns out that prevalence and incidence are related to one another such that
|Prevalence = Incidence x Duration (avg) (NOTE: This assumes incidence remains constant.)|
In other words, if you know any two of these measures you can compute the third. The relationship also shows how prevalence responds to change: if either incidence or average duration rises while the other holds steady, prevalence rises.
|Scenario:||HCV has likely been around for many years, although it wasn't recognized as a disease entity until 1991. The two most important sources of HCV infection are infected blood or tissue and sharing of needles by injection drug users. Since blood, blood products and tissue for organ transplants have been screened for HCV since 1992, this is no longer a source of new infections. Clinically apparent disease does not usually occur for 15 to 20 years after infection.|
|Question:||Is the prevalence of HCV increasing or decreasing? Why?|
|Question:||Recent advances in therapy, showing as high as a 35% "cure" rate, means that people who are infected and develop symptoms are living longer and longer. What is the likely impact on the prevalence of disease?|
|Question:||If a vaccine for HCV is developed (thought to be unlikely in the near future) what would be the impact on the prevalence of disease? On incidence?|
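When working through questions like these, it can help to try numbers in the relationship Prevalence = Incidence x Duration. The sketch below uses made-up values (an incidence of 2 per 1,000 per year and an average duration of 10 years); they are not estimates for HCV or any real disease, and the function names are illustrative.

```python
def prevalence_from(incidence, avg_duration):
    """Prevalence = Incidence x average Duration (assumes constant incidence)."""
    return incidence * avg_duration


def duration_from(prevalence, incidence):
    """Rearranged form: average Duration = Prevalence / Incidence."""
    return prevalence / incidence


# Hypothetical values for illustration only.
incidence_per_year = 0.002  # 2 new cases per 1,000 people per year
avg_duration_years = 10

baseline = prevalence_from(incidence_per_year, avg_duration_years)
longer_survival = prevalence_from(incidence_per_year, 2 * avg_duration_years)
fewer_new_cases = prevalence_from(incidence_per_year / 2, avg_duration_years)

print(baseline, longer_survival, fewer_new_cases)
```

With incidence held fixed, doubling the average duration doubles prevalence; with duration held fixed, halving incidence halves it.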
4.7.6. Using Rates to Make Comparisons
Consider the following data showing the incidence of Type II diabetes (adult onset diabetes) in two communities in the Southwestern United States in 1999.
|Town A||Town B|
|Age Group||Pop||# New Cases||Rate||Pop||# New Cases||Rate|
The rate ratio is one way to compare these two sets of rates. Based on this we might conclude that Town A has a 35% higher incidence of Type II diabetes compared to B.
|Rate Ratio = 27.6/20.4 = 1.35|
Yet look at the age-specific rates for Town A compared with Town B. These stratum-specific rates (age-specific in this case) are referred to as category-specific rates. They are identical. How is it that each community has the same age-specific incidence rates but different overall rates? (We refer to these overall rates as crude rates). The answer of course lies in the fact that while the rates are the same for each age category, the two communities have different age distributions, there being more older people in Town A than B.
Fortunately, there are statistical methods available to make adjustments in these crude rates such that in making comparisons between towns A & B we can take into account differences in the age distribution. In one of the more common procedures, called direct adjustment, one takes the category specific rates for each community and applies them to a population with a standardized population distribution.
This procedure then results in an estimated number of incident cases for each town (given a standard population) and ultimately allows the calculation of an adjusted rate for each town and, from these, an adjusted rate ratio. The procedure looks like this:
|Standard Population||Rate for A||Expected # A||Rate for B||Expected # B|
|The Adjusted Rate Ratio = 24/24=1|
Interpretation: One may conclude from this that the incidence of Type II diabetes in the two towns is virtually identical when one takes into account differences in the age distribution of the two populations.
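Direct adjustment can be sketched with a small example. All numbers below are hypothetical and are not the Town A and Town B data from the tables: two age groups, identical age-specific rates in both towns, an older age distribution in Town A, and an arbitrary standard population.

```python
# Hypothetical age-specific incidence rates per 1,000 (identical in both towns).
rates_a = {"under 50": 10, "50 and over": 40}
rates_b = {"under 50": 10, "50 and over": 40}

# Hypothetical age distributions: Town A has more older people than Town B.
town_a = {"under 50": 4000, "50 and over": 6000}
town_b = {"under 50": 6000, "50 and over": 4000}

# Hypothetical standard population used for the adjustment.
standard = {"under 50": 5000, "50 and over": 5000}


def overall_rate_per_1000(population, rates_per_1000):
    """Apply category-specific rates to a population; return the overall rate."""
    expected_cases = sum(population[g] * rates_per_1000[g] / 1000 for g in population)
    return expected_cases / sum(population.values()) * 1000


# Crude rates: each town's rates applied to its own age distribution.
crude_a = overall_rate_per_1000(town_a, rates_a)
crude_b = overall_rate_per_1000(town_b, rates_b)

# Adjusted rates: each town's rates applied to the same standard population.
adjusted_a = overall_rate_per_1000(standard, rates_a)
adjusted_b = overall_rate_per_1000(standard, rates_b)

print(f"crude rate ratio = {crude_a / crude_b:.2f}")
print(f"adjusted rate ratio = {adjusted_a / adjusted_b:.2f}")
```

The crude rates differ only because the towns have different age distributions; once both sets of rates are applied to the same standard population, the difference disappears.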
One important caveat about comparing rates is that you should try to take into account factors that are associated with the disease or health events being counted and that may be distributed differently in the communities being compared.
4.7.7. The Concept of Risk
The notion of risk is a pivotal concept in epidemiology. It may be thought of as the probability or likelihood of moving from one health "state" to another, i.e. the risk of dying, or becoming infected with cryptosporidium, or developing a cardiac arrhythmia. Observational studies often yield estimates of risk, such as the risk of developing osteoarthritis of the knees if you are overweight.
Consider the following scenario:
Researchers, wanting to study the relationship between obesity and osteoarthritis (OA) of the knees, recruited 20,000 women to participate in a 15-year follow-up study. Subjects were screened at the beginning of the study for any evidence of existing OA of the knees as well as for obesity. Women with pre-existing OA were eliminated from the study. The subjects were subsequently examined annually for evidence of OA and classified as either OA + or OA -. Their heights and weights were also measured. At the conclusion of the study, 240 of the 8,000 women who were classified as obese had been diagnosed as having OA , while 130 of the normal weight women showed evidence of OA of the knees.
Question. Given this study, how would you assess the risk of developing OA given a history of obesity?
A great way to analyze problems like this is with what's called a contingency table, in this case a 2 x 2 or "fourfold" table.
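The convention most often used in epidemiology is to present the contingency table such that the rows represent the "exposure" (obesity) while the columns represent the "outcome" (OA). Hence the data from the scenario above would look like this:
|Exposure||OA (+)||OA (-)||Totals|
|Obese||240||7,760||8,000|
|Normal weight||130||11,870||12,000|
|Totals||370||19,630||20,000|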
The association between obesity and the incidence of OA can then be assessed using the ratio of incidence rates for those in the exposed and the non-exposed group. (Please note that this is the 15 year incidence rate and not the one year rate).
The rate ratio would then be:
|Rate ratio = (240/8000) ÷ (130/12000) ≈ .03/.01 = 3|
How would you interpret this rate ratio?
The name of this special rate ratio is relative risk. It is defined as the incidence of disease in the exposed divided by the incidence in the non-exposed.
$$\text{Relative Risk} = \frac{\text{Incidence in the exposed}}{\text{Incidence in the non-exposed}}$$
Interpreting Relative Risk:
- RR = 1: The risk in the exposed equals the risk in the non-exposed; there is no association between the exposure and the outcome
- RR > 1: The risk in the exposed is greater than the risk in the non-exposed; this is a positive association
- RR < 1: The risk in the exposed is less than the risk in the non-exposed; this is a negative association
The relative risk is almost always cited when epidemiologists seek to describe the strength of an association between an exposure and outcome. There is a convention in labeling the cells of a contingency table using the lower case letters of the alphabet. You should learn this convention since it is almost universally used.
|Exposure||Y (+)||N (-)||Totals|
|Y (+)||a||b||a+b|
|N (-)||c||d||c+d|
RR = [a/(a+b)] ÷ [c/(c+d)]
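Using the cell labels a, b, c and d, the relative risk for the obesity and OA scenario can be computed as in the sketch below. The function name `relative_risk` is illustrative. Computed without rounding the incidence in the non-exposed group to .01, the result is about 2.77, which the text rounds to 3.

```python
def relative_risk(a, b, c, d):
    """RR = [a / (a + b)] / [c / (c + d)] for a 2 x 2 table.

    a: exposed with outcome      b: exposed without outcome
    c: non-exposed with outcome  d: non-exposed without outcome
    """
    incidence_exposed = a / (a + b)
    incidence_non_exposed = c / (c + d)
    return incidence_exposed / incidence_non_exposed


# Cell counts from the scenario: 240 of 8,000 obese women and
# 130 of 12,000 normal weight women developed OA.
a, b = 240, 8000 - 240
c, d = 130, 12000 - 130

rr = relative_risk(a, b, c, d)
print(f"RR = {rr:.2f}")
```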
There is a special case of the relative risk called the odds ratio. The odds ratio is used when assessing the association from a case-control study. In such studies one cannot calculate the incidence of disease directly since the investigator sets the size of the study population somewhat arbitrarily. In case-control studies the relative risk can be estimated using the odds ratio. It is defined as:
|Odds Ratio = (a x d) / (b x c)|
The odds ratio is interpreted in exactly the same manner as the relative risk.
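Purely as an arithmetic illustration, the sketch below applies the odds ratio formula to the cell counts from the obesity and OA scenario. That scenario is a cohort study, so the relative risk can be calculated directly there; the point is only to show that when the outcome is uncommon, the odds ratio comes out close to the relative risk (about 2.77 for these data). The function name `odds_ratio` is illustrative.

```python
def odds_ratio(a, b, c, d):
    """OR = (a x d) / (b x c) for a 2 x 2 table with cells a, b, c, d."""
    return (a * d) / (b * c)


# Cell counts from the obesity and OA scenario, used here only for illustration.
a, b = 240, 7760    # obese: OA+, OA-
c, d = 130, 11870   # normal weight: OA+, OA-

or_estimate = odds_ratio(a, b, c, d)
print(f"OR = {or_estimate:.2f}")
```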
Note: It is not unusual to see Odds Ratios reported for other kinds of studies as well, e.g. cohort studies. These Odds Ratios are the result of analysis of the data using a technique called logistic regression. Logistic regression allows a researcher to examine a study outcome while controlling for other variables that may account for the results observed. Although interpreted similarly, these Odds Ratios should not be confused with the calculation of Odds Ratios in case-control studies.
Attributable Risk
If relative risk measures the strength of an association then attributable risk measures the actual amount of illness or disease we can ascribe to a given exposure, i.e. the amount of disease attributable to the exposure. Attributable risk is sometimes referred to as the risk difference for reasons that should be obvious.
Attributable risk is defined as:
|AR = Incidence in the exposed (E+) - Incidence in the non-exposed (E-)|
From the OA example above, the attributable risk of developing OA if one were obese can be defined as:
|AR ≈ .03 - .01 = .02 (or about 20 per 1,000 per 15 years)|
Interpretation. Approximately 20 cases of OA per 1,000 people exposed (obese) are the result of obesity, while about 10 cases per 1000 are attributable to other factors.
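The same subtraction can be sketched in code using the unrounded incidences from the scenario. The function name `attributable_risk` is illustrative. Without rounding the incidence in the non-exposed group to .01, the result is about 19 per 1,000 over 15 years, consistent with the approximate figure of 20 per 1,000.

```python
def attributable_risk(incidence_exposed, incidence_non_exposed):
    """Risk difference: incidence in the exposed minus incidence in the non-exposed."""
    return incidence_exposed - incidence_non_exposed


# 15-year incidences from the scenario.
incidence_obese = 240 / 8000
incidence_normal_weight = 130 / 12000

ar = attributable_risk(incidence_obese, incidence_normal_weight)
ar_per_1000 = ar * 1000

print(f"AR = {ar_per_1000:.1f} per 1,000 per 15 years")
```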
The tools and techniques of descriptive epidemiology are critical both because their application can lead to the identification of important etiologic factors and because descriptive epidemiologic research often suggests possible hypotheses for future investigation.
6. Ancillary Material
- Read Chapter 12, Data Presentation, Pagano Text
- Read Sections 3.1 and 3.2 of Chapter 3, Numerical Summary Measures, Pagano Text
- Read Section 6.5 of Chapter 6, Probability, Pagano Text
- Read Chapter 3, Measuring the Occurrence of Disease, Gordis Text
- Read Chapter 10, Estimating Risk, Is There an Association?, Gordis Text
- Read Attributable Risk Section, Chapter 11, More on Risk, Estimating the Potential for Prevention, pages 172-174, Gordis Text
- Read Section 7.4, The Normal Distribution, pages 177-185