Epidemiology for mathematicians: “Looking at wildflowers from horseback”

Epidemiology for mathematicians: “Looking at wildflowers from horseback” • David Ozonoff, MD, MPH • Boston University School of Public Health • DIMACS Working Group on Order Theory in Epidemiology • March 7, 2005

Tutorial overview and goals • The landscape of epidemiology • What is epidemiology? • Who is an epidemiologist? • Who employs them? • Kinds of epidemiology • How epidemiologists think • What kinds of things do they work with? • What kinds of things are they interested in?

Tutorial overview and goals, cont’d • Some language and concepts of epidemiology • Language of occurrence measures • Study designs • Causal inference

I. Landscape, perspective, language What is epidemiology? Who is an epidemiologist? Who employs epidemiologists? Flavors of epidemiology: Descriptive, analytic Epi and mathematics: models and patterns Some examples of epidemiological thinking

Some definitions of epidemiology • Study of health and illness in populations (Kleinbaum, Kupper and Morgenstern) • Study of the distribution and determinants of disease frequency in human populations (MacMahon and Pugh; Susser) • Study of the occurrence of illness (Rothman I) • Theoretical epidemiology: discipline of how to study the occurrence of phenomena of interest in the health field (Miettinen) [NB: not illness centered]

Some more (cynical) definitions • Rothman II: “Unfortunately, there seem to be more definitions of epidemiology than there are epidemiologists. Some have defined it in terms of its methods. While the methods of epidemiology may be distinctive, it is more typical to define a branch of science in terms of its subject matter rather than its tools….If the subject of epidemiologic inquiry is taken to be the occurrence of disease and other health outcomes, it is reasonable to infer that the ultimate goal of most epidemiologic research is the elaboration of causes that can explain patterns of disease occurrence.” • Schneiderman: Epidemiology is the practice of criticizing other epidemiologists

Consensus notions • Deals with populations, not individuals • Deals with (frequency of) occurrences of health related events • Has a (major but not exclusive) concern with causes (“determinants”) of disease patterns in populations

Remarks • Public health perspective • “Flavors”: Analytic versus descriptive epidemiology • Causal inference: assumptions • Disease occurrence is not random. • Systematic investigation of different populations can identify causal and preventive factors • Observational versus experimental sciences • Chronic disease and infectious disease epidemiology • What is “theoretical epidemiology”?

Some examples • Do environmental exposures increase risk of disease? • John Snow: cholera epidemic of 1854 • Contaminated water and leukemia in Woburn, MA • Are vitamin supplements beneficial? • Does Vitamin E lower risk of Alzheimer’s Disease? • Folic acid and risk of neural tube (birth) defects • Do behavioral interventions reduce risk behaviors? • Community-based studies to change diets • Peer interventions to reduce HIV-risk behaviors

Who is an epidemiologist? • Relatively new in medical science • Precursors: John Graunt (17th century), John Snow (19th century) • Rise as a profession: Wade Hampton Frost at JHU • 1950s and 1960s: CDC and consolidation as professional discipline, still mainly physicians • 1960s+: Infectious disease -> Chronic disease epi • Professionalization • Doctoral degrees in epidemiology • Now most epidemiologists are not docs

Who employs epidemiologists? • Public sector • State and federal health officials • Communicable and chronic disease programs • Infectious disease, outbreak investigations • Cancer registries, environmental studies, program areas in substance abuse, health services, etc., etc. • Research at CDC, NIH, academia, etc. • Private sector • Industry (chemical companies, drug companies) • Consultants • Academia, NGOs

“Flavors” of epidemiology • Descriptive epidemiology • Analytic epidemiology (finding “risk factors”, a.k.a. “causes”)

Descriptive epidemiology • Describe patterns of disease by: person, place, time • Good for monitoring public’s health (e.g., surveillance, vital events) • Used for administrative purposes (e.g., planning) • Good for generating hypotheses

NB: Disease patterns and the Science of patterns

Description • Two kinds • Tabulations or summaries only (no inference or estimation) • Inference • Prediction to other populations (“generalization”; surveys and polling) • “True” value in face of noise • May also assume data produced by underlying population model and try to describe it • Parametric: particular functional form assumed • Parameter = value that indexes a family of functions, e.g., mean and std deviation of Normal distribution • Non-parametric: data-driven estimate of underlying density or distribution

A word about “models” and “patterns” (our usage) • Models are high level, “global” descriptions of all or most of dataset • Descriptive or inferential component • Examples • Regression models, mixture models, Markov models • Patterns are “local” features of data • Perhaps only a few people or a few variables • Also descriptive or inferential • Descriptive: look for people with “unusual” features • Inferential: Predict which people have “unusual” features • Examples: Association rules, mode or gap in density function, outliers, inflection point in regression, symptom clusters, geographic “hot spots”, predict disease from symptoms

Models and patterns, cont’d • Epidemiologists use both but more interested in patterns, i.e., more interested in “structure” that is local than “structure” that is global • George Box: “All models are wrong but some models are useful” describes epi viewpoint • But epidemiologists tend to think of patterns as “real,” even if misleading

Warning: word “model” differs by context but is usually some kind of metaphor • Metaphor: a figure of speech literally denoting one kind of thing but used to represent or reason about another kind of thing • Examples: fashion model, model citizen (represent an “ideal”); scale model; animal model; mathematical model; model of an axiomatic system; regression model

Question: What do we learn from the following examples? Describing populations by person, place and time: illustrating how epidemiologists think

Person (age, sex, race) • Death rates per 10^5 US population from coronary disease, by age and sex, 1981

Place • Where are the rates of disease the highest and lowest? Malignant Melanoma of Skin

Place

A Variation on Place: Migrant Studies • Mortality rates (per 100,000) due to stomach cancer

Time • Does the frequency of disease now differ from the past?

What is a Population? • How an epidemiologist would put it • Group of people with a common characteristic such as age, race, sex, geographic location, occupation, etc. Two types of populations, based on whether membership is permanent or transient: • Fixed population or cohort: membership is permanent and defined by an event. Ex. Atomic bomb survivors, persons born in 1980 • Dynamic population: membership is transient and defined by being in or out of a “state.” Ex. Members of HMO Blue, residents of the City of Boston

First step, summary description • Tabulate data by selected features of person, place, time • What are characteristics of population members? (how many of each sex, race, etc.) And combinations of these features (How many white women? Employed? Etc.)

Constructing contingency table from “raw data” • “raw data” consists of listing of each subject and his or her attributes:

One-way tables • A one-dimensional contingency table (CT) is just a frequency table, i.e., a table that gives the number of subjects with each attribute

Two-way tables • Most contingency tables are (at least) two-way, i.e., they cross-classify two attributes

Or in more familiar form… Sex by handedness and age • But this shows only some of the possible two-way tables; it does not cross-classify handedness against age, for example

What is a Population? How a mathematician might put it • A population is a triple, (G, M, I) • Two sets, G and M; G is a set of “people” or “subjects”, M is a set of features the subjects might “have” • A relation I, I ⊆ G × M • Interpretation: r = (g, m) ∈ I means that subject g ∈ G “has” attribute m ∈ M

Contingency tables (“cross-tabs”) • Mainstay of data preparation, inspection and analysis • Requires study design based operations • Sampling → set of n subjects in set G • Variable selection (classification scheme) → set of m variables in set M • E.g., age, sex, disease status (as indicator variables) • Measurement → binary relation I ⊆ G × M • E.g., ordered pair (case 2, female=yes) is typical member of I • We call the triple (G, M, I) a data structure for the contingency table (also called a formal context in FCA literature) • Simple formulation allows use of rich mathematical theory • Much more about this from Alex Pogel
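To make the (G, M, I) formulation concrete, here is a minimal sketch of how a two-way table could be read off a formal context. The subjects, attributes and relation below are hypothetical toy data, and the helper names has and two_way_table are illustrative only, not part of the tutorial.

```python
from collections import Counter

# Hypothetical toy context: subjects G, attributes M, relation I ⊆ G × M.
G = ["case 1", "case 2", "case 3", "case 4"]
M = ["female", "diseased"]
I = {
    ("case 1", "diseased"),
    ("case 2", "female"),
    ("case 2", "diseased"),
    ("case 3", "female"),
}

def has(g, m):
    """True when subject g 'has' attribute m, i.e. (g, m) is in I."""
    return (g, m) in I

def two_way_table(row_attr, col_attr):
    """Cross-classify every subject in G by two binary attributes."""
    counts = Counter((has(g, row_attr), has(g, col_attr)) for g in G)
    # Report all four cells, including empty ones.
    return {(r, c): counts[(r, c)] for r in (True, False) for c in (True, False)}

table = two_way_table("female", "diseased")
for (is_female, is_diseased), n in table.items():
    print(f"female={is_female!s:5} diseased={is_diseased!s:5} n={n}")
```

Each cell of the table is just the size of a subset of G picked out by the relation I, which is why the same structure can be handed to order-theoretic tools.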

Quantification: Disease frequency • Goal will be to see if occurrence of disease differs in populations with different characteristics or experiences (note comparison is at heart of this) • Quantify disease occurrence in a population at certain point or period of time • Population (counting, absolute scale) • How big? • Composition? • Occurrence (counting, absolute scale) • Existing cases? New cases? • Time • Calendar time? (NB: interval scale, preserved under positive linear transformations) • Duration of time (NB: ratio scale, preserved under similarity transformations) • More about this in Fred Roberts’s tutorial

Ex. Hypothetical Frequency of AIDS in Two Cities

          # new cases   time period   population
City A    58            1985          25,000
City B    35            1985-86       7,000

Annual "rate" of AIDS
City A = 58/25,000/1 yr = 232/100,000/yr
City B = 35/7,000/2 yrs = 17.5/7,000/yr = 250/100,000/yr

Make it easy to compare rates (i.e., make them “commensurable”) by using the same population unit (say, per 100,000 people) and time period (say, 1 year) • NB: Commensurability is a property of the underlying relational system used in measurement (treated in Roberts tutorial)
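The two-city comparison can be sketched in a few lines. The function name annual_rate_per_100k is an illustrative assumption; the numbers are the hypothetical ones from the slide.

```python
def annual_rate_per_100k(new_cases, population, years):
    """New cases per 100,000 people per year (a common, commensurable unit)."""
    if population <= 0 or years <= 0:
        raise ValueError("population and years must be positive")
    return new_cases / population / years * 100_000

city_a = annual_rate_per_100k(58, 25_000, 1)   # observed during 1985
city_b = annual_rate_per_100k(35, 7_000, 2)    # observed during 1985-86

print(f"City A: {city_a:.0f} per 100,000 per year")
print(f"City B: {city_b:.0f} per 100,000 per year")
```

Putting both cities on the same population unit and time unit is what allows the raw counts (58 versus 35) to be compared at all.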

Three kinds of quantitative measures of frequency of occurrence • Used to relate number of cases of disease, size of population, time • Proportion: numerator is subset of denominator, often expressed as a percentage • Ratio: division of one number by another, numbers don't have to be related • Rate: time (sometimes space) is intrinsic part of denominator, term is often misused (e.g., “birthrate”) • Need to specify if measure represents events or people

(Point) Prevalence (P) Quantifies number of existing cases of disease in a population at a point in time • P = Number of existing cases of disease (at a given point in time)/ total population • Ex. City A has 7000 people with arthritis on Jan 1st, 2002 • Population of City A = 70,000 • Prevalence of Arthritis on Jan 1st = .10 or 10% Prevalence is a proportion

Incidence: quantifies number of (a) new cases of disease that (b) develop in a population at risk (c) during a specified time period • Three key ideas: • New disease events, or for diseases that can occur more than once, usually first occurrence of disease • Population at risk (candidate population): can't have disease already, should have relevant organs • Enough time must pass for a person to move from health to disease

Two Types of Incidence Measures • Cumulative Incidence (“Attack Rate”) (abbreviated Cum Inc, CI) • Incidence Rate (“Incidence Density”) (abbreviated I, IR, ID)

Incidence rate (I, IR) = (# new cases of disease) / (total person-time of observation) • Also called incidence density (ID)

Accrual of Person-Time [figure: follow-up timelines for three subjects on a calendar axis marked at successive Januaries (1981 and 1982 legible); x = outcome of interest]
Subject 1: 1.1 person-years (PY), ends with x
Subject 2: 1.2 PY, ends with x
Subject 3: 2.2 PY, no outcome
Total: 4.5 PY • Incidence rate = 2 cases/4.5 PY

Some Ways to Accrue 100PY • 100 people followed 1 year each = 100 py • 10 people followed 10 years each= 100 py • 50 people followed 1 year plus 25 people followed 2 years = 100 py Time unit for person-time = year, month or day Person-time = person-year, person-month, person-day
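The person-time bookkeeping above can be sketched as follows. The FollowUp record and the incidence_rate helper are hypothetical names introduced for illustration; the follow-up times are those of the three subjects in the person-time figure, and the groups are the three ways of accruing 100 person-years listed above.

```python
from dataclasses import dataclass

@dataclass
class FollowUp:
    person_years: float   # time observed while at risk
    had_outcome: bool     # True if follow-up ended with the outcome (x)

def incidence_rate(records):
    """Return (new cases, total person-years, cases per person-year)."""
    cases = sum(1 for r in records if r.had_outcome)
    person_years = sum(r.person_years for r in records)
    return cases, person_years, cases / person_years

# The three subjects from the person-time figure.
subjects = [FollowUp(1.1, True), FollowUp(1.2, True), FollowUp(2.2, False)]
cases, total_py, rate = incidence_rate(subjects)
print(f"{cases} cases / {total_py:.1f} PY = {rate:.3f} per person-year")

# Three ways of accruing 100 person-years, as (people, years each) groups.
ways_to_100_py = [
    [(100, 1)],
    [(10, 10)],
    [(50, 1), (25, 2)],
]
totals = [sum(people * years for people, years in way) for way in ways_to_100_py]
print(totals)
```

The denominator only records how much at-risk time was observed, not how it was distributed over people, so very different cohorts can contribute the same person-time.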

Ex.: (Cohort) study of risk of breast cancer among women with hyperthyroidism • Followed 1,762 women ---> 30,324 py • Average of 17 years of follow-up per woman • Ascertained 61 cases of breast cancer • Incidence rate = 61/30,324 py = .00201/y = 201/100,000 py (.00201 x 100,000 p/100,000 p)

Dimensions • Prevalence = people/people: no dimension • Cumulative incidence = people/people: no dimension • Incidence rate = people/people-time: dimension is time^-1

Types of (instantaneous) rates • Relative rate (person-time or incidence rate) • Absolute rate (used in infectious disease epi and health services) • “Rate” is also used where units do not involve time, such as accidents per passenger mile or cases per unit area

Relationship between prevalence and incidence: P = IR x D • Prevalence depends on incidence rate and duration of disease (duration lasts from onset of disease to its termination) • If incidence is low but duration is long, prevalence is relatively high • If incidence is high but duration is short, prevalence is relatively low • This is an example of Little’s equation in queuing theory: time-average number of units in the system = arrival rate x average delay time per unit • This equation is true if ...

Conditions for equation to be true: • Steady state • IR constant • Distribution of durations constant • Prevalence of disease is low (less than 10%) In queuing theory terms: strictly stationary process in steady state conditions
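A small deterministic sketch may help show why the steady-state condition matters. It assumes, purely for illustration, that a constant number of new cases arrives each day and that every case lasts exactly the same number of days; the function name and the numbers are hypothetical. It counts existing cases rather than a proportion of the population.

```python
def existing_cases_over_time(arrivals_per_day, duration_days, n_days):
    """Existing cases on each day, given constant arrivals and fixed duration."""
    counts = []
    for day in range(n_days):
        # A case starting on day s is active on `day` if s <= day < s + duration_days.
        first_active_start = max(0, day - duration_days + 1)
        active_start_days = day - first_active_start + 1
        counts.append(arrivals_per_day * active_start_days)
    return counts

arrivals_per_day = 3     # plays the role of the incidence (arrival) rate
duration_days = 10       # plays the role of the disease duration
counts = existing_cases_over_time(arrivals_per_day, duration_days, n_days=30)

print("start-up phase:", counts[:duration_days - 1])
print("steady state:  ", counts[duration_days - 1:])
print("arrival rate x duration =", arrivals_per_day * duration_days)
```

During the start-up phase the number of existing cases is still climbing and the product rule fails; once the first cases begin to terminate, the count settles at arrival rate times duration, which is Little's equation in this toy setting.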

Figuring duration from prevalence and incidence • Lung cancer incidence rate = 45.9/100,000 py • Prevalence of lung cancer = 23/100,000 • D = P/IR = (23/100,000 p)/(45.9/100,000 py) = 0.5 years • Conclusion: individuals with lung cancer survive on average about 6 months from diagnosis to death
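The rearrangement D = P/IR can be checked numerically. The function name mean_duration_years is an illustrative assumption, and the calculation presumes the steady-state, low-prevalence conditions under which P = IR x D holds.

```python
def mean_duration_years(prevalence, incidence_rate_per_year):
    """Average disease duration implied by P = IR x D (steady state assumed)."""
    if incidence_rate_per_year <= 0:
        raise ValueError("incidence rate must be positive")
    return prevalence / incidence_rate_per_year

prevalence = 23 / 100_000          # existing cases per person (dimensionless)
incidence_rate = 45.9 / 100_000    # new cases per person-year

duration = mean_duration_years(prevalence, incidence_rate)
print(f"D = {duration:.2f} years, i.e. about {duration * 12:.0f} months")
```

Note that the population units cancel, leaving a quantity measured in years, as the dimensional analysis of prevalence and incidence rate requires.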

Uses of Prevalence and Incidence Measures • Prevalence: administration, planning • Incidence: etiologic research (problems with prevalence since it combines IR and D), planning

Common measures of disease frequency for public health • Crude death (mortality) rate: total number of deaths from all causes per 1,000 people, for one year (also cause-specific, age-specific, race-specific death rates) • Live-birth rate: total number of live births per 1,000 people (sometimes per 1,000 women of childbearing age), for one year • Infant mortality rate: # deaths of infants under 1 year of age per 1,000 live births, for one year

Frequency measures used in infectious disease epidemiology • Attack rate: (# cases of disease that develop during a defined period) / (# in population at risk at start of period); usually used for infectious disease outbreaks • Case fatality rate: (# deaths) / (# cases of disease), for a defined period of time • Survival rate: (# living cases) / (# cases of disease), for a defined period of time
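All three outbreak measures are proportions, so one helper covers them. The outbreak counts below are invented for illustration only, and the helper name proportion is an assumption rather than anything from the tutorial.

```python
def proportion(numerator, denominator):
    """A proportion: the numerator must count a subset of the denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    if not 0 <= numerator <= denominator:
        raise ValueError("numerator must be a subset count of the denominator")
    return numerator / denominator

# Hypothetical outbreak over a defined period (made-up numbers).
at_risk_at_start = 200
cases = 50
deaths_among_cases = 5

attack_rate = proportion(cases, at_risk_at_start)
case_fatality_rate = proportion(deaths_among_cases, cases)
survival_rate = proportion(cases - deaths_among_cases, cases)

print(f"attack rate        = {attack_rate:.0%}")
print(f"case fatality rate = {case_fatality_rate:.0%}")
print(f"survival rate      = {survival_rate:.0%}")
```

Despite their names, none of these has time in the denominator, which is the sense in which the word “rate” is often loosely used; case fatality and survival, defined over the same cases and period, sum to one.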
