Tufts OpenCourseware


Mean:
The average; outliers (very large or very small values) affect the mean. Means are typically reported when data are normally distributed.
Median:
The middle value: half the values lie above the median and half the values lie below the median; outliers do not affect the median. Medians are typically reported when data are skewed.

1,3,5,7,9 Median = 5
1,3,3,5,5,6 Median = (3+5)/2 = 4
Mode:
Most common value.

1,3,5,5,7 Mode = 5
1,3,3,5,5,6 Bimodal = 3 and 5
Range:
Largest value minus smallest value.

100, 300, 420, 600, 900 Range = (900-100) = 800
Variance:
Measures the spread of the values in a distribution.
Standard Deviation:
The square root of the variance; SD = √Variance.
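The measures above can be checked on the six-value example list with a minimal sketch using Python's standard statistics module. It assumes the sample (n - 1) form of the variance, which the glossary does not specify.

```python
import statistics

values = [1, 3, 3, 5, 5, 6]  # the even-sized example list used above

mean = statistics.mean(values)
median = statistics.median(values)        # average of the two middle values
modes = statistics.multimode(values)      # every most-common value
value_range = max(values) - min(values)   # largest minus smallest
variance = statistics.variance(values)    # sample variance (n - 1 denominator, an assumption)
sd = statistics.stdev(values)             # square root of the sample variance

print(mean, median, modes, value_range, variance, sd)
```

Because the list has an even number of values, the median is the average of the two middle values, and because 3 and 5 tie for most common, the list is bimodal.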
Normal (Gaussian) Distribution:
Bell-shaped curve
  • Mean = median
  • Symmetric around the mean
  • One SD above and below the mean contains about 68% of values in the distribution
  • Two SDs above and below the mean contain about 95% of values in the distribution
  • Three SDs above and below the mean contain about 99.7% of values in the distribution
  • The area under the curve is 100%
Standard Normal Distribution:
A normal distribution with the mean set at 0 and the standard deviation set at 1; other numbers correspond to the distance from the mean in standard deviations; e.g. +1 is one standard deviation above the mean; -2.6 is 2.6 standard deviations below the mean (see Z score below).
Z Score:
The number of standard deviations a score lies above or below the mean: Z = (score - mean) / SD. Referring the Z score to the standard normal distribution allows one to determine the percentile of a particular score. For example, if the mean grade on an exam is 80, with a population standard deviation of 10, then a score of 90 has a Z score of (90 - 80) / 10 = 1. This indicates that the score of 90 is one standard deviation above the mean.
Percentile:
Describes the percent of values at or below that value.

A student's exam score is 82, and it's at the 95th percentile of exam scores: 95% of those taking the exam scored 82 or lower.
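The exam example can be worked through in a short sketch that converts a score to a Z score and then to a percentile. It assumes the exam scores are normally distributed; the function name z_score is hypothetical.

```python
from statistics import NormalDist

def z_score(score, mean, sd):
    """Distance of a score from the mean, in standard deviations."""
    return (score - mean) / sd

z = z_score(90, 80, 10)                  # exam example: mean 80, SD 10, score 90
percentile = NormalDist().cdf(z) * 100   # percent of scores at or below this value

print(z, round(percentile, 1))
```

Under the normality assumption, a Z score of one corresponds to roughly the 84th percentile.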
Interquartile Range:
The range of values between the 25th and 75th percentiles.
Skewed Distribution:
Left (negative) skew has tail on left; right (positive) skew has tail on right.
Transformation:
A statistical method of taking skewed data and transforming it to normally distributed data; often done by log or exponential transformations. By transforming the data, the statistician may use parametric statistical tests rather than nonparametric statistical tests.
T-distribution:
Different distribution for each sample size; based on degrees of freedom; it is symmetric about the mean, like a normal distribution; it's more variable than the normal distribution; as sample size increases, the T-distribution approaches (looks more like) a normal distribution.
Central Limit Theorem:
An important statistical theorem that says that given a sufficiently large sample size, the distribution of sample means will be approximately normal regardless of the shape of the original population distribution.
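A small simulation can illustrate the theorem. The sketch below repeatedly draws samples from a right-skewed (exponential) population and looks at the means of those samples. The population, sample size, number of repetitions and seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    # Draw n values from a right-skewed exponential population (mean 1, SD 1).
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

sample_size = 50
means = [sample_mean(sample_size) for _ in range(2000)]

center = statistics.mean(means)             # should sit near the population mean of 1
spread = statistics.stdev(means)            # should sit near SD / sqrt(n)
expected_spread = 1.0 / sample_size ** 0.5

print(round(center, 3), round(spread, 3), round(expected_spread, 3))
```

Although individual values are strongly skewed, the sample means cluster around the population mean with a spread close to the population SD divided by the square root of the sample size.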
Standard Error of the Mean (SEM):
Describes the variability in a distribution of sample means; mathematically it's the sample standard deviation divided by the square root of the sample size; the SEM is a smaller number than the SD.
95% Confidence Interval:
Describes an interval (two points) constructed so that, upon repeated sampling, 95% of such intervals will contain the true value; for a mean, the 95% CI is x-bar ± t (SEM). Example: Based on a random sample, the mean height of American medical students is 68", 95% CI 67", 69". Interpretation: With 95% confidence, the true average height of American medical students lies somewhere between 67" and 69". Or, if same-sized random samples were drawn repeatedly, 95% of the resulting confidence intervals would contain the true average height.
P-value:
The probability of getting a result as extreme as or more extreme than the observed outcome by chance alone, i.e. assuming the null hypothesis is true.
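The SEM and the confidence interval for a mean can be sketched as follows. The ten heights are hypothetical, and the t multiplier of 2.262 (two-sided 95% value for 9 degrees of freedom) is taken from a standard t table rather than computed.

```python
import statistics

heights = [66, 67, 67, 68, 68, 68, 69, 69, 70, 68]  # hypothetical heights in inches

n = len(heights)
x_bar = statistics.mean(heights)
sem = statistics.stdev(heights) / n ** 0.5   # sample SD divided by sqrt(sample size)

t_crit = 2.262   # assumed table value: two-sided 95%, df = n - 1 = 9
lower = x_bar - t_crit * sem
upper = x_bar + t_crit * sem

print(x_bar, round(sem, 3), round(lower, 2), round(upper, 2))
```

A different sample size needs a different t value, since the t-distribution depends on the degrees of freedom.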
Alpha Value (Level of Significance):
Specifies the value which the researchers will use to make a determination as to whether or not the results are likely or unlikely due to chance alone; it's the old line in the sand routine; usually set at 0.05;
  • When the P-value is greater than the alpha value, authors conclude results are likely due to chance alone (not a true difference), and are not statistically significant.
  • When the P-value is less than or equal to the alpha value, authors conclude results are unlikely due to chance alone (there is a true difference), and are statistically significant.
Null Hypothesis:
Usually states there is no difference between the groups being assessed.
  • The mean IQ of BUSM 2s = The mean IQ of TUSM 2s
Alternative (Research) Hypothesis:
Usually states there is a difference between the groups being assessed.
  • Two-Tailed: The mean IQ of BUSM 2s ≠ The mean IQ of TUSM 2s
  • One-Tail: The mean IQ of BUSM 2s > The mean IQ of TUSM 2s
  • One-Tail: The mean IQ of BUSM 2s < The mean IQ of TUSM 2s
If the p-value is greater than the Alpha Value:
Fail to reject the null hypothesis.
If the p-value is less than or equal to the Alpha Value:
Reject the null hypothesis.
Type I (Alpha) Error:
Incorrectly rejecting the null hypothesis; a false positive conclusion
Type II (Beta) Error:
Failing to reject the null hypothesis when the alternative hypothesis is correct; a false negative conclusion.
Power:
The probability of finding a specified difference, or larger, when a true difference exists; the probability of correctly rejecting the null hypothesis.
  • Power = [1 - (probability of a Type II Error)]
  • Keeping other variables constant, the best way to increase power is to increase sample size
  • Studies with small sample sizes tend to have low power
  • Generally speaking, one likes to see a power of 80% or more to detect a clinically relevant difference
Multiple Comparisons:
Performing multiple statistical tests without changing alpha increases the probability of a Type I error. The Bonferroni Correction is one method to address this, whereby alpha is divided by the number of statistical tests to be performed to yield a new alpha, e.g. 0.05/5 (tests) yields a new alpha of 0.01.
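The arithmetic behind the correction can be sketched in a few lines. The familywise_error function assumes the tests are independent, which the glossary does not state; both function names are hypothetical.

```python
def bonferroni_alpha(alpha, n_tests):
    """New per-test alpha after the Bonferroni Correction."""
    return alpha / n_tests

def familywise_error(alpha, n_tests):
    # Chance of at least one Type I error across all tests,
    # assuming the tests are independent (an assumption of this sketch).
    return 1 - (1 - alpha) ** n_tests

adjusted = bonferroni_alpha(0.05, 5)
uncorrected_risk = familywise_error(0.05, 5)
corrected_risk = familywise_error(adjusted, 5)

print(adjusted, round(uncorrected_risk, 3), round(corrected_risk, 3))
```

With five independent tests at an unchanged alpha of 0.05, the chance of at least one false positive is well above 5%; after the correction it falls back to just under 5%.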
Discrete Variables:
Can be placed in categories;
  • Nominal Data: gender, medical school, hair color, board certification
  • Ordinal Data: Order exists; e.g. stage of lung cancer, severity of illness, school grade
Continuous Variables:
Have infinite number of values within a given range; limited by accuracy of measurement.
  • height, age, weight
Parametric Tests:
Performed on data that are normally distributed.
Nonparametric Tests:
Performed on data that are skewed, i.e. not normally distributed.
Chi-square (χ²):
A statistical test to determine if there is a difference between two proportions; performed on discrete data.
  • Is the proportion of BUSMs on financial aid different from the proportion of TUSMs on financial aid?
  • Is there a difference between vaccinated students who develop influenza vs. non-vaccinated students who develop influenza?
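For a 2x2 table, the chi-square statistic can be computed by hand, as in the sketch below. The counts are hypothetical, no continuity correction is applied, and the p-value formula is the one that holds for one degree of freedom only. The function also reports the smallest expected cell count, which is what decides whether a small-sample exact test should be preferred.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    # Smallest expected cell = smallest row total x smallest column total / n.
    smallest_expected = min(row1, row2) * min(col1, col2) / n
    return statistic, p_value, smallest_expected

# Hypothetical counts: influenza (yes, no) among vaccinated and non-vaccinated students.
statistic, p_value, smallest_expected = chi_square_2x2(10, 90, 25, 75)

print(round(statistic, 2), round(p_value, 4), smallest_expected)
```

Here the p-value is below an alpha of 0.05, so the null hypothesis of equal proportions would be rejected for these made-up counts.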
Fisher's Exact Test:
Like the Chi-square test but usually done with small sample sizes, defined as any "expected" cell with a number less than five.
Student t-test (two-sample t-test):
Performed to determine if there is a difference in the means of two groups; only done on normally distributed data.
  • Is there a difference between the mean height of BUSMs vs. TUSMs?
One-sample t-test:
Performed to determine if there is a difference between a known population mean and a sample mean: is the average height of women in Tufts Medical School different from the known average height of all women in medical school?
ANOVA (Analysis of Variance):
Like the Student t-test except it is used when comparing three or more groups; performed on normally distributed data: Is there a difference in the average height of women in Tufts, BU and Harvard Medical Schools?
Paired t-test:
Performed to determine if there is a difference between a before and after measurement of the same subjects (dependent data); note that measurements are made on the same subjects; performed on normally distributed data.
  • Is there a difference between the systolic BP of subjects first placed on medicine A then placed on medicine B?
Pearson Correlation:
Assesses whether or not there is a linear (straight line) relationship between two continuous variables; it does not matter which variable is on the X or Y axis; expressed as r = Pearson's Correlation Coefficient; r varies from -1 to +1; -1 is perfect negative correlation, +1 is perfect positive correlation and 0 is no linear correlation; correlation does not imply causality; requires random selection of subjects and a normal distribution of both variables (bivariate normality).
Spearman Rank Correlation:
The nonparametric counterpart of Pearson correlation; performed on ranked values, so it can be used on ordinal data or data that are not normally distributed.
Simple Linear Regression:
Describes the straight line relationship between two continuous variables, i.e. does knowing X help to predict Y; has the equation of a straight line, Y = a + bX.
  • Y = Dependent (outcome) variable
  • X = Independent (predictor) variable
  • a = Y intercept when X = 0
  • b = Slope (change in Y per unit change of X)
  • It does matter which variable is placed on the X-axis and which variable is placed on the Y-axis.
R² (Coefficient of Determination):
From regression; equals the square of Pearson's Correlation Coefficient from correlation; represents the proportion of variance in Y that can be predicted by knowing X; ranges between 0 and 1.
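The link between the regression line, Pearson's r and the coefficient of determination can be sketched with ordinary least squares on a handful of hypothetical data points. The function name linear_fit and the data are assumptions for illustration.

```python
def linear_fit(xs, ys):
    """Least-squares line Y = a + bX plus Pearson's r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                  # slope: change in Y per unit change of X
    a = mean_y - b * mean_x        # Y intercept when X = 0
    r = sxy / (sxx * syy) ** 0.5   # Pearson's correlation coefficient
    return a, b, r

xs = [1, 2, 3, 4, 5]   # hypothetical predictor values
ys = [2, 4, 5, 4, 5]   # hypothetical outcome values
a, b, r = linear_fit(xs, ys)
r_squared = r ** 2     # share of the variance in Y predicted by X

print(round(a, 2), round(b, 2), round(r, 3), round(r_squared, 2))
```

Swapping xs and ys leaves r unchanged but gives a different slope and intercept, which is why the choice of axis matters for regression and not for correlation.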
Multiple Regression Analysis:
Has two or more independent variables; particularly valuable as a way to adjust for confounding by adding the potential confounder as an independent variable.
Logistic Regression Analysis:
Performed when the dependent variable (Y) is dichotomous.
Primary Screening:
Performed to prevent a disease; e.g. screening for hypercholesterolemia to prevent heart disease.
Secondary Screening:
Performed to reduce the impact of a disease; e.g. mammography to detect early breast cancer.
Sensitivity:
Given disease is present, the probability of testing positive.
  • Of 100 people with Hepatitis A, 80 test positive: the sensitivity is 80%
Specificity:
Given disease is absent, the probability of testing negative.
  • Of 50 people without SLE, 45 have a negative ANA test: the specificity of the ANA for SLE is 90%
Predictive Value Positive:
Given the test is positive, the probability disease is present
  • 100 women have a suspicious mammogram for breast CA, and after bx 20 are diagnosed with breast cancer: the predictive value positive of the mammogram for breast cancer was 20%.

Predictive Value Negative:
Given the test is negative, the probability disease is absent
  • 100 men have a negative ETT, and after cardiac cath 10 are diagnosed with CAD: the predictive value negative of the ETT for CAD was 90%
  • Sensitivity = TP / (TP + FN)
  • Specificity = TN / (TN + FP)
  • Predictive Value Positive = TP / (TP + FP)
  • Predictive Value Negative = TN / (TN + FN)
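The four test characteristics can be computed together from the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The counts in this sketch are hypothetical, as is the function name.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Test characteristics from the four cells of a test-vs-disease table."""
    return {
        "sensitivity": tp / (tp + fn),   # positive test given disease present
        "specificity": tn / (tn + fp),   # negative test given disease absent
        "ppv": tp / (tp + fp),           # disease present given positive test
        "npv": tn / (tn + fn),           # disease absent given negative test
    }

# Hypothetical counts: 100 diseased (80 test positive), 50 non-diseased (45 test negative).
metrics = diagnostic_metrics(tp=80, fp=5, fn=20, tn=45)

for name, value in metrics.items():
    print(name, round(value, 3))
```

Sensitivity and specificity divide by the disease columns, while the predictive values divide by the test-result rows, so the predictive values change with how common the disease is in the group tested.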
Descriptive Studies:
Describe data, generally without assessing causal associations.
  • Case Report (An interesting observation)
  • Case Series (Interesting Observations)
  • Correlation Study (Done on large populations; e.g. the per capita meat consumption in New Jersey and the prevalence of CAD in New Jersey)
  • Prevalence (Cross-Sectional) Study: Simultaneously assesses exposures and outcomes in a group; often done as a questionnaire; e.g. a questionnaire administered to women in their 50s which asks questions about diet, exercise, etc. and the diagnoses of certain diseases, such as hypertension, heart disease, etc.
Random Sample:
Each subject in the underlying population has an equal probability of being selected in the sample.
Stratified Random Sample:
The population is first stratified (divided) into groups, as male and female, and samples are then randomly drawn from each group. This method might be used, for example, if researchers want to assure an equal number of men and women are in the sample.
Analytic Studies:
Generally assess for causal associations.
  • Case Control: Subjects are selected based on outcome (disease) and exposure(s) is then assessed. Example: Men with and without lung cancer are asked if they smoke cigarettes.
  • Prospective Cohort: Subjects with and without an exposure are prospectively followed to determine an outcome(s). Example: A group of smokers and a group of non-smokers are followed to determine who develops lung cancer.
  • Retrospective Cohort: Groups with and without an exposure at some past time are identified (as through HMO records) to determine who develops an outcome(s). Example: A group of smokers in 1970 and a group of non-smokers in 1970 are identified in HMO records, and records are reviewed to determine who developed lung cancer through 1990.
  • Randomized Controlled Trial: Gold-standard; randomization helps to assure comparability between comparison groups; avoids confounding bias. Example: Patients with CAD are randomized to receive medication A or medication B; subjects are followed with death from MI as the outcome of interest. A Cross-Over Study occurs when subjects are told to switch (cross over) their treatment arm during a study. Example: Subjects are randomly assigned to medicine A or medicine B. After four weeks, subjects on medicine A switch to medicine B and subjects initially on medicine B switch to medicine A.
  • Meta-analysis: Analyzes data from several similar studies to present one overall conclusion

RCT: Pros: gold-standard; best design to assess for causality between exposure (trial arms) and outcomes; generates incidence data; best study design to avoid confounding bias; can assess multiple outcomes (primary and secondary outcomes). Cons: may take time = $ and loss-to-follow-up of subjects; there can be ethical reasons not to conduct.

Prospective Cohort Study: Pros: generates incidence data; no ethical concerns; can assess multiple outcomes; good for rare exposures. Cons: bad for rare outcomes; may take time = $ and loss-to-follow-up; subjects may change exposure status, e.g. smokers quit smoking during the study.

Retrospective Cohort Study: Pros: easy to do, so less $; can assess multiple outcomes; good for rare exposures. Cons: bad for rare outcomes; dealing with retrospective data; incomplete data.

Case Control: Pros: good for rare diseases; can assess multiple exposures; quick to do, so less time and $. Cons: subject to several biases, such as recall bias, interviewer bias and selection bias; retrospective data; bad for rare exposures.

Meta-analysis: Pros: increases power by combining study results; may help to clarify truth when studies have different conclusions; quick to do, so less $. Cons: studies not done exactly the same way; studies with negative conclusions may be less likely to be published and so not included in the meta-analysis, so-called publication bias.

Informed Consent:
The process of informing subjects about the potential risk and benefits of participating in a clinical trial.
Intention-to-Treat Analysis:
A type of analysis for randomized controlled trials. Subjects are analyzed with the group to which they were assigned, whether or not they complied with the assigned study arm.
Efficacy Analysis:
In contrast to intention-to-treat analyses, only compliant subjects are analyzed. The FDA prefers intention-to-treat analyses vs. efficacy analyses for external validity reasons, i.e. patients in the office might have similar compliance to patients in the study, so an intention-to-treat analysis might better predict actual clinical outcomes.
Phase One Study:
Assesses safety of a new treatment
Phase Two Study:
Assesses efficacy of a new treatment
Phase Three Study:
Compares new treatment to standard therapy
Internal Validity:
A review of a study to determine if the study's conclusions were erroneous due to chance, bias or confounding. This helps the reviewer to determine how well the researchers conducted their study.
External Validity:
Assuming the internal validity of a study is acceptable, how clinically important are the results and how well do they apply to patients outside the study.
Prevalence:
The proportion of subjects in a group with a certain disease; includes new and old cases; a snapshot in time; each snapshot can have a different prevalence as new cases are diagnosed while others are cured or die.
  • Of 100 students in a group, 2 have influenza: the prevalence of influenza is 2%
Incident Cases:
New cases occurring over a defined period of time.
Cumulative Incidence (Attack Rate):
The number of outcomes occurring over a specified period of time in a group that was initially free of the outcome of interest
  • 100 influenza free students are followed for the three winter months and 20 develop influenza: the cumulative incidence of influenza was 20% over three months
Incidence Rate (Density):
Describes the number of new cases, in a group initially disease free, as a function of person time; has a funny denominator, like person-years, or person-months;
  • The incidence of influenza in TUSMs for 1996 was 2 cases per 100 person-months
Prevalence =
(Incidence Rate) × (Average Duration of Disease)
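The frequency measures above differ mainly in their denominators, which the following sketch makes explicit using the influenza examples. The function names and the average duration of half a month are hypothetical, and the prevalence relationship holds only as an approximation when the disease frequency is stable over time.

```python
def prevalence(existing_cases, group_size):
    """Proportion of a group with the disease at one point in time."""
    return existing_cases / group_size

def cumulative_incidence(new_cases, initially_disease_free):
    """Proportion of an initially disease-free group developing the disease."""
    return new_cases / initially_disease_free

def incidence_rate(new_cases, person_time):
    """New cases per unit of person-time (e.g. per person-month)."""
    return new_cases / person_time

p = prevalence(2, 100)               # 2 of 100 students have influenza
ci = cumulative_incidence(20, 100)   # 20 of 100 develop influenza over three months
ir = incidence_rate(2, 100)          # 2 cases per 100 person-months

avg_duration_months = 0.5            # hypothetical average duration of illness
approx_prevalence = ir * avg_duration_months

print(p, ci, ir, approx_prevalence)
```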
Absolute Risk:
The probability of going from a healthy state to an ill state, e.g. the probability that a 50 year old female with hypertension will develop an MI over the next 20 years.
Relative Risk:
Measures the strength of association between an exposure and outcome; it compares the risk of an outcome between groups;
  • RR > 1 = positive risk factor, e.g. cigarette smoking and lung cancer
  • RR < 1 = negative risk factor, or protective effect, e.g. high HDL levels and CAD
  • RR of 1 = no association, e.g. eating grapes and mesothelioma
  • RR = Incidence of disease in the exposed / Incidence of disease in the non-exposed
  • RR = [A / (A+B)] / [C / (C+D)]
  • RR can also be calculated from some other data, as RR = Prevalence of disease in Population A / Prevalence of disease in Population B
                    Disease
                    Yes   No
    Exposure  Yes    A     B
    Exposure  No     C     D
Odds Ratio:
An approximation of relative risk; must be used in case control studies and may be used in cohort studies or RCTs; interpreted the same way as relative risk.
  • OR = (A)(D) / (B)(C)
Attributable Risk:
The excess disease in the exposed that can be attributed to the exposure; e.g. how much lung cancer in smokers can be attributed to smoking.
  • AR = Incidence of disease in exposed - Incidence of disease in non-exposed
  • AR = [A / (A+B)] - [C / (C+D)]
Confounding:
An erroneous study conclusion when a factor is associated with an exposure and is itself an independent risk factor for the outcome; confounding is a type of bias that threatens internal validity.
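Relative risk, odds ratio and attributable risk all come from the same four cells, where A and B are the exposed with and without disease and C and D are the non-exposed with and without disease. The counts in this sketch are hypothetical, as is the function name.

```python
def association_measures(a, b, c, d):
    """a, b = exposed with/without disease; c, d = non-exposed with/without disease."""
    risk_exposed = a / (a + b)       # incidence of disease in the exposed
    risk_unexposed = c / (c + d)     # incidence of disease in the non-exposed
    return {
        "relative_risk": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
        "attributable_risk": risk_exposed - risk_unexposed,
    }

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 non-exposed develop disease.
measures = association_measures(a=30, b=70, c=10, d=90)

for name, value in measures.items():
    print(name, round(value, 3))
```

In this made-up cohort the outcome is fairly common, so the odds ratio is noticeably larger than the relative risk; the two get closer as the outcome becomes rarer.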
  • It appears that individuals who drink alcohol have a higher relative risk for lung cancer than those who do not drink alcohol. The results are misleading (confounded) since alcohol users are more likely to smoke cigarettes than non-drinkers, and cigarettes are an independent risk factor for lung cancer

Researchers can control for confounding by:

  • Restriction: individuals with known risk factors for the outcome can't participate in study
  • Stratified Analysis of results (compares strata specific results)
  • Adjustment (direct and indirect) for risk factor
  • Multivariate Analysis (the potential confounder is added as an independent variable)
  • Matching: For cohort and case control studies, potential confounders are assigned equally to each arm
Effect Modification (Interaction):
Occurs when the association between the outcome and the exposure vary by some third factor.
  • The relative risk of developing lung cancer in smokers less than 30 years old is different than the relative risk of developing lung cancer in smokers older than 30 years old, i.e. age was an effect modifier
Surveillance Bias:
Occurs when one cohort is followed more closely than the other cohort such that the group followed more closely is more likely to be diagnosed with the disease; this yields an erroneous over or underestimation of the risk of developing the disease between the two cohorts. Example: A group of women on estrogen-progesterone might be followed more closely for the development of DVT than another group of women not on estrogen-progesterone such that DVT appears to occur more frequently in the former.
Crude Rates:
The actual observed rate of outcomes in a defined population.
Adjusted Rates:
Consider differences (potential confounders) between populations, such as age and gender, such that the crude rates do not present erroneous conclusions
  • One would expect to see a higher incidence of CAD in a retirement community vs. a community of younger workers, but the age and sex adjusted CAD incidence rates would probably not differ.
Herd Immunity:
The disease protection of an unimmunized individual because the population is immunized; the unimmunized person is unlikely to come into contact with the disease, and so has a low risk, as others are immunized.
Epidemic Curve:
Plots an epidemic by person (who gets it), place (where do they get it) and time (when do they get it).
Kaplan-Meier Survival Curves:
A method of drawing curves that shows survival of different groups over time. The log-rank test can be used to determine if there is a statistically significant difference between the survival curves.
Bias:
A systematic error with a study leading to an erroneous estimation of the association between the exposure and the outcome. Examples:
  • Recall bias: In case control studies where individuals, by virtue of their health status, differentially recall their exposure status
  • Selection bias: In retrospective studies, occurs when subjects differentially enter a study based on an association with the exposure (case control) or outcome (retrospective cohort)
  • Information bias: Data is erroneously categorized in exposure or outcome status.
  • Interviewer bias: Occurs in a case control study where unblinded interviewers obtain data differently in the diseased vs. control groups.
Blinding:
Studies are double-blinded when neither the investigators nor subjects know the arm to which subjects have been assigned. In a single-blinded study, the subjects do not know their study arm but the investigators do know. Double-blinding helps to avoid differential (non-random) misclassification of outcomes, a situation where the unblinded investigator unintentionally but differentially assigns outcomes based on his/her bias regarding the treatment. The more subjective the outcome, the more important it is to blind the subjects so their biases will not lead to a misclassification of outcomes, e.g. the degree of morning joint pain.
Cause-Effect Relationship:
Associations might not be cause-effect relationships; must rule out chance, bias and confounding as alternate explanations for the observed results. When assessing causal relationships, consider the following:
  • Strength of association, how large, or small, is the relative risk or odds ratio?
  • Consistency with results from other similar studies
  • Biological credibility, is there a known biologic mechanism that can be cited as an explanation?
  • Dose response relationship, e.g. the more one smokes, the higher the relative risk of developing lung cancer
  • Temporal relationship, does the time course from exposure to disease make sense?
Epidemic:
Greater than expected disease frequency in a defined population.
Pandemic:
Greater than expected disease frequency in a large defined area, such as North America.
Endemic:
Constant presence of a disease in a defined population.
Case Fatality Rate:
Proportion of patients with a disease who die of that disease.
  • 10 people develop virus A, and 1 dies: the CFR of virus A is 10%
Mortality Rate:
Proportion of people who die from a disease in a defined population.
  • 10 people in a population of 1000 die of virus A: the mortality rate of this virus in this population is 1%