Statistical Methods for Health Intelligence Lecture 2: Perspectives, Data Types & Summaries

Iain Buchan, University of Manchester, buchan@man.ac.uk

Course Material 1: Basic Text • Medical Statistics, 4th Ed, Campbell, Machin & Walters, Wiley 2007 • Statistical knowledge level: Public health practitioner • How are you getting on? • Are you using any other learning materials?

Your Participation • Today: questions about your reading • Take notes on my comments • Prepare to reproduce exercises in R

Course Material 2: R • Statistics: An Introduction Using R, Crawley, Wiley 2005 • cran.r-project.org • Reproduce each example in course text • Prepare to submit R scripts for assessment

Course Material: Optional • Probability and Random Variables: a beginner’s guide, Stirzaker, Cambridge University Press 1999 • Bad Science, Goldacre, Fourth Estate Ltd, 2008

Define • statistics • quantitative information about a topic • Statistics • The measurement of uncertainty

The Statistical Movement Circa 1900: Galton, Pearson, Edgeworth and Yule establish Statistics as a discipline Early/mid 1900s: Fisher consolidates statistical methods and experimental philosophy

Think • Whose perspective is Chapter 1? • Medical Statistician • Why must the Informatician look wider? • May not have the luxury of study design • Data- vs. hypothesis-driven research • Maximise information validity & utility

Health Statistics 1600-1860 Reasoning Summarisation Knowledge Observation

Health Statistics 1860-≈2000/now Reasoning Summarisation & Statistical Modelling Knowledge Observation ± Experimentation

Early/mid 1900s: Greenwood, Bradford-Hill & Doll push Statistics into medical research Evidence Based Medicine Causality Clinical Trials Mid-late 1900s: Cochrane pushes for the routine application of randomised clinical trials and leaves the evidence based medicine movement in his wake Effectiveness & Efficiency

Hypothesis-driven Research

Define • Epidemiology • the study of the distribution and determinants of disease and health-related states in populations JM Last, 2000

Define • Confounding factor • A factor associated with both exposure and outcome but not on the causal pathway about which the inference is being made • What confounded the water cancer vs. water fluoridation example in the book?

Causal Inference Exposure Outcome Causal pathway Association Confounder

Sieving Associations C = caffeine, MI = myocardial infarction (heart attack) Disciplined approach to causal inference, Bradford-Hill: Criteria (temporality, strength, dose-response, consistency, plausibility, consideration of alternatives, open to experiment, specificity, coherence)

Hard to Make a Confident Causal Inference • Plausible pathway to link outcome to exposure • Same results if repeated in a different time, place, person • Exposure precedes outcome • Strong relationship ± dose effect • Causal factor relates only to the outcome in question • Outcome falls if risk factor removed...

Think • What is the most important question a Statistician wants a medic to ask? • How might I be wrong? • In designing my study • In making an inference about an association • In generalising my inference beyond the study population • Statisticians are understandably conservative; Informaticians must be carefully informative

Exhausted Epidemiology Platform Problem 1: Dwindling hits from tools to detect independent “causes” Problem 2: Knowledge can’t be managed by reading papers any more The big public health problems e.g. Type 2 Diabetes have “complex webs of causes” The “data-set” and structure extend beyond the study’s observations

Evidence limits showing • Epidemiology has exhausted the big simple causes of ill health • Many trials have weak external validity • Public health interventions are largely unstudied Many patterns of ill health in society remain unexplained via conventional studies

Need Statistical Informatics Data Necessary Complexity of Models Human Resource

Define • Statistical Data-types & Measurement Scales • Categorical: qualitative measuring • Binary/Dichotomous • Nominal > 2 categories, without order • Ordinal (loose) • Nominal with order • Ordinal (ties = lack of measurement sensitivity) • Numerical: quantitative measuring • Counts • Continuous (any value in a range) • Interval (fixed and defined, meaningful mean difference) • Ratio (zero means something)

Caution • Don’t treat ordered nominal data as interval! • Why? • Give examples? • Relate these to software requirements
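
One way to approach the "Why?" above is to see what happens to a mean when the numeric codes attached to ordered categories change. The sketch below is written in Python purely for illustration (the same arithmetic is easy to reproduce in R); the pain ratings, the two groups and both coding schemes are hypothetical and not taken from the course text. Both codings preserve the order of the categories, yet the comparison of means flips while the comparison of medians does not.

```python
from statistics import mean, median

# Hypothetical ordinal ratings: none < mild < moderate < severe.
group_a = ["mild", "mild", "mild", "mild", "mild"]
group_b = ["none", "none", "none", "none", "severe"]

# Two codings that both respect the order; the spacing between
# categories is arbitrary because the scale is not interval.
equal_steps = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}
wide_top = {"none": 0, "mild": 1, "moderate": 2, "severe": 10}

def summarise(group, coding):
    """Return (mean, median) of the numeric codes for one group."""
    scores = [coding[label] for label in group]
    return mean(scores), median(scores)

for name, coding in [("equal steps", equal_steps), ("wide top", wide_top)]:
    mean_a, median_a = summarise(group_a, coding)
    mean_b, median_b = summarise(group_b, coding)
    print(name, "| means:", mean_a, mean_b, "| medians:", median_a, median_b)
```

Under the first coding group A has the higher mean; under the second group B does. The median ordering is the same under both, which is one reason order-based summaries are usually preferred for ordinal data.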

Discuss • Why categorise continuous data? • Meaningful thresholds (e.g. Hypertensive) • Compact summary / easy presentation • Easier analysis (good / bad?) • Avoid regression to the mean (homework)

Think • What is audit? • A quality improvement process that seeks to improve a service through systematic review against explicit criteria and implementing change • How does this differ from research? • Ethics • Constrained design • What is a natural experiment? • Homework...

Summarise Binary Data: r/n • Describe a proportion • r = outcome or feature present (numerator) • n = number of subjects observed (denominator) • p = r/n; RR = p1/p2; (A)RD = |p2 - p1| • Relative Risk (RR) abuse • Pill ↑ risk of DVT by (RR =) 2: statistically significant, clinically insignificant (2 women in 10,000 pill-years)
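
The contrast between relative and absolute measures can be sketched with a few lines of arithmetic. The example below uses Python for illustration only; the counts are hypothetical, chosen loosely to echo a doubling of a rare risk, and are not the figures behind the pill example.

```python
def proportion(r, n):
    """p = r / n: subjects with the outcome over subjects observed."""
    return r / n

# Hypothetical counts (an assumption for illustration).
p1 = proportion(2, 10_000)  # exposed group
p2 = proportion(1, 10_000)  # unexposed group

relative_risk = p1 / p2              # RR = p1 / p2
absolute_risk_diff = abs(p2 - p1)    # (A)RD = |p2 - p1|

print("RR  =", relative_risk)
print("ARD =", absolute_risk_diff)
```

With these counts the risk doubles, yet the absolute difference is about one extra case per 10,000. Reporting the RR alone hides how small the change is.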

Summarise Binary Data: r/n~t • Describe a rate • r = outcome/success/failure (numerator) • n = number of subjects observed (denominator) • t = time over which subjects observed • n*t = person time – why important? • Some may drop out or be lost to follow-up • (incidence) rate IR = r/(n*t), i.e. events per unit of person time • IRR = IR1/IR2; IRD = |IR2 - IR1|
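
Why person time matters becomes clearer when subjects contribute unequal follow-up. The sketch below (Python, for illustration) sums each subject's observed time instead of multiplying n by a fixed t, so that drop-outs count only for the time they were actually observed. The follow-up records and the helper name `incidence_rate` are hypothetical.

```python
def incidence_rate(follow_up):
    """Events per unit of person time.

    follow_up is a list of (years_observed, had_event) pairs, one per subject.
    """
    events = sum(1 for _, had_event in follow_up if had_event)
    person_time = sum(years for years, _ in follow_up)
    return events / person_time

# Hypothetical follow-up data; some subjects leave the study early.
group_1 = [(5.0, False), (2.5, True), (1.0, False), (4.0, True)]
group_2 = [(5.0, False), (5.0, False), (3.0, True), (2.0, False)]

ir1 = incidence_rate(group_1)  # 2 events over 12.5 person-years
ir2 = incidence_rate(group_2)  # 1 event over 15 person-years

irr = ir1 / ir2        # incidence rate ratio
ird = abs(ir2 - ir1)   # incidence rate difference

print("IR1 =", ir1, "IR2 =", ir2)
print("IRR =", irr, "IRD =", ird)
```

Dividing by the number of subjects alone would treat the subject observed for one year the same as the subject observed for five.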

[Figure: Percentage excess deaths in North vs. South England, by year, for males and females (vertical axis 0% to 25%). Source: John Hacking & Iain Buchan, pre-publication 2009]

Summarise Binary Data: Crosstabs • Variables C1-Ck – what is a crosstab? • Cross-tabulate categorical variables, say disease registration by gender: 2 by 2 → r by c tables • Usually two way or two dimensional • Models may need higher dimensions, say disease registration by gender by speciality • Is a data cube the same? • Data Cube: A relational aggregation operator generalizing group-by, crosstab, and subtotals
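
A crosstab is simply a count of subjects in each combination of category levels. The sketch below builds a two-way table of disease registration by gender in Python, for illustration only; the records are hypothetical.

```python
from collections import Counter

# Hypothetical subject-level records: (gender, registration status).
records = [
    ("female", "registered"), ("female", "not registered"),
    ("male", "registered"), ("male", "registered"),
    ("female", "registered"), ("male", "not registered"),
    ("male", "not registered"), ("female", "registered"),
]

# Count each (row, column) combination.
counts = Counter(records)
row_levels = sorted({gender for gender, _ in records})
col_levels = sorted({status for _, status in records})

# Nested dict: table[row][column] = cell count (0 if never observed).
table = {
    row: {col: counts[(row, col)] for col in col_levels}
    for row in row_levels
}

for row in row_levels:
    print(row, table[row])
```

Adding a third variable such as speciality only lengthens the tuple being counted, which is one way to see the link between higher-dimensional crosstabs and the data cube idea.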

Contingency Table
Dimension 1 (columns): Exposure/Treatment/Category 1 • Dimension 2 (rows): Outcome/Status/Category 2
                    Exposure Present    Exposure Absent
Outcome Present            a                   b
Outcome Absent             c                   d

Summarise Binary Data: Odds • How do odds differ from risk/proportion/probability? • Ratio of occurrence to non-occurrence • Odds = p/(1-p) • OR = (a/c)/(b/d) = ad/bc • p = a/(a+c), so if a << c then a/(a+c) ≈ a/c and OR ≈ RR • OR_success = 1/OR_failure, not so for RR • Tractable computation with log odds
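
The relationships on this slide can be checked numerically. The sketch below is in Python for illustration and assumes the cell layout implied by p = a/(a+c): a and c are the exposed with and without the outcome, b and d the unexposed with and without the outcome. The counts are hypothetical.

```python
import math

def odds(p):
    """Odds: ratio of occurrence to non-occurrence, p / (1 - p)."""
    return p / (1 - p)

# Hypothetical 2 by 2 counts with a rare outcome (a << c, b << d).
a, b, c, d = 10, 5, 990, 995

risk_exposed = a / (a + c)
risk_unexposed = b / (b + d)

rr = risk_exposed / risk_unexposed   # relative risk of the outcome
odds_ratio = (a / c) / (b / d)       # equals ad / bc

# Same table, analysed for non-occurrence of the outcome.
rr_non = (c / (a + c)) / (d / (b + d))
or_non = (c / a) / (d / b)

print("RR =", rr, "OR =", odds_ratio)
print("1/OR =", 1 / odds_ratio, "OR for non-occurrence =", or_non)
print("1/RR =", 1 / rr, "RR for non-occurrence =", rr_non)
print("log OR =", math.log(odds_ratio))
```

Because the outcome is rare here, the OR (about 2.01) sits close to the RR (2). Swapping occurrence for non-occurrence inverts the OR exactly, whereas the RR for non-occurrence is close to 1 and nowhere near 1/RR. On the log scale that inversion is just a change of sign, which is part of what makes log odds convenient to compute with.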

Caution • If the odds ratio is interpreted as a relative risk it will always overstate any effect size: the odds ratio is smaller than the relative risk for odds ratios of less than one, and bigger than the relative risk for odds ratios of greater than one • The extent of overstatement increases as both the initial risk increases and the odds ratio departs from unity • However, serious divergence between the odds ratio and the relative risk occurs only with large effects on groups at high initial risk. Therefore qualitative judgments based on interpreting odds ratios as though they were relative risks are unlikely to be seriously in error • In studies which show reductions in risk (odds ratios of less than one), the odds ratio will never underestimate the relative risk by a greater percentage than the level of initial risk • In studies which show increases in risk (odds ratios of greater than one), the odds ratio will be no more than twice the relative risk so long as the odds ratio times the initial risk is less than 100%
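
The size of the overstatement described above can be explored by converting an odds ratio back to a relative risk for a given initial risk. The sketch below (Python, for illustration) assumes that "initial risk" means the risk in the comparison group; the function name `rr_from_or` and the grid of values are hypothetical choices.

```python
def rr_from_or(odds_ratio, initial_risk):
    """Relative risk implied by an odds ratio at a given initial risk.

    Assumes initial_risk is the risk in the comparison group (0 < risk < 1).
    """
    initial_odds = initial_risk / (1 - initial_risk)
    new_odds = odds_ratio * initial_odds
    new_risk = new_odds / (1 + new_odds)
    return new_risk / initial_risk

for odds_ratio in (0.5, 2.0, 5.0):
    for initial_risk in (0.01, 0.1, 0.5):
        rr = rr_from_or(odds_ratio, initial_risk)
        print(f"OR={odds_ratio:<4} initial risk={initial_risk:<5} RR={rr:.3f}")
```

At an initial risk of 1% the two measures almost coincide; at 50% an odds ratio of 2 corresponds to a relative risk of only about 1.33, so reading the OR as an RR exaggerates the effect most in high-risk groups.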

Visualise Categorical Data • When is a pie chart useful? • Seldom: arguably only in metaphor • How do you add dimensions to a bar chart? • Cluster • When is a 3D effect useful? • Not in 2D concepts! • Showing additional dimensions e.g. 2nd level cluster

What is arguably wrong with this visualisation?

Preparation for 15 Feb • Read chapters 4,5,6 to understand natural distributions and sampling • Return to chapter 3, run the examples in R and generate some alternative examples • Prepare to show ideal visualisations and summaries with your R scripts
