Nuoo-Ting (Jassy) Molitor 1 Chris Jackson 2 With Nicky Best, Sylvia Richardson 1

Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products
Nuoo-Ting (Jassy) Molitor (1), Chris Jackson (2)
With Nicky Best and Sylvia Richardson (1)
(1) Department of Epidemiology and Public Health, Imperial College, London
(2) MRC Biostatistics Unit, Cambridge
jassy.molitor@imperial.ac.uk
chris.jackson@mrc-bsu.cam.ac.uk
http://www.bias-project.org.uk

Outline
• Motivation for combining different data sources
• Case study: Chlorination Study
• Data sources
• Statistical modeling
• Simulation and real data analysis

Observational studies
• Filled with many uncertainties other than random error:
  - missing values
  - unobserved confounders
  - measurement errors
  - selection bias
• These uncertainties are hard to identify within a single data set

Combining multiple data sources
• Research questions are complicated in nature, and a single data set may not be able to provide a sufficient answer.
• Example: Puzzle

Case study Combining birth register, survey and census data to study effects of water disinfection by-products on risk of low birth weight

Example of combining different data sources: Chlorination Study
• Chlorine reacts with natural organic matter and/or the chemical compound bromide to form organic and inorganic byproducts:
  - bromate
  - chlorite
  - haloacetic acids (HAA5)
  - total trihalomethanes (THMs)
• Environmental exposure: chlorine byproducts (THMs)
• Outcome: low birth weight (LBW), i.e. birth weight < 2.5 kg, split by gestational age into
  - LBWP: LBW and pre-term (born before 37 weeks)
  - LBWF: LBW and full-term (born at 37 weeks or later)
• Covariates: mother's race/ethnicity, mother's smoking status, mother's age during the pregnancy, baby's sex

Available data sources related to the Chlorination Study: why do we need them?
• Administrative data (NBR): deal with
  - the small % of LBW in the population
  - the inconclusive link between LBW and THMs
• Aggregate data:
  - imputing missing covariates
• Survey data (MCS):
  - adjust for important subject-level covariates
  - allows us to examine different types of LBW

Summary of data sources
• NBR (national birth registry): administrative data (large); power, no selection bias
  - observed postcode
  - missing smoking and race/ethnicity
  - missing baby's gestational age
• MCS (millennium cohort study): survey data (subset of NBR); low power, selection bias
  - observed postcode
  - observed smoking and race/ethnicity
  - observed baby's gestational age
• Aggregate data (UK)
  - observed postcode
  - Census 2001: region-level race/ethnicity composition
  - Consumer survey (CACI): region-level tobacco expenditure

Building the sub-model: disease sub-model for MCS
Multinomial logistic regression for MCS (m: subject index for MCS; r: region index)
\[ y_{rm} \sim \mathrm{Multinomial}(p_{rm,1:3},\, 1) \]
\[ \log(p_{rm,2} / p_{rm,1}) = b_{10} + b_{11}\,\mathrm{THM}_{rm} + b_{12}\,C_{rm} \]
\[ \log(p_{rm,3} / p_{rm,1}) = b_{20} + b_{21}\,\mathrm{THM}_{rm} + b_{22}\,C_{rm} \]
• y: birth weight indicator (1: normal, 2: LBWP, 3: LBWF)
• THM: THM (chlorine byproduct) exposure
• C: covariates such as race/ethnicity and smoking, observed only in the MCS
• [Diagram: THM_{rm} and C_{rm} (known) and the disease model parameters (unknown) point to y_{rm} (known)]
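The two log-odds equations of the disease sub-model can be turned into category probabilities by exponentiating and normalising against the reference category (normal). The following is a minimal sketch of that arithmetic only, not the authors' implementation; the function name and the coefficient values are hypothetical and chosen purely for illustration.

```python
import math

def multinomial_logit_probs(thm, c, b1, b2):
    """Return (p1, p2, p3) for the categories normal, LBWP, LBWF.

    b1 = (b10, b11, b12) and b2 = (b20, b21, b22) are the coefficients of
    the two log-odds equations; category 1 (normal) is the reference.
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c  # log(p2 / p1)
    eta3 = b2[0] + b2[1] * thm + b2[2] * c  # log(p3 / p1)
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return (1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom)

# Hypothetical coefficient values, for illustration only.
b1 = (-3.0, 0.2, 0.5)
b2 = (-3.5, 0.3, 0.4)
p = multinomial_logit_probs(thm=1, c=1, b1=b1, b2=b2)
print(p)
```

Taking the log of p[1] / p[0] recovers the linear predictor of the first equation, which is a quick way to check that the coding matches the model as written.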

Building the sub-model: disease sub-model for NBR
Multinomial logistic regression for NBR (n: subject index for NBR; r: region index)
\[ y_{rn} \sim \mathrm{Multinomial}(p_{rn,1:3},\, 1) \]
\[ \log(p_{rn,2} / p_{rn,1}) = b_{10} + b_{11}\,\mathrm{THM}_{rn} + b_{12}\,C_{rn} \]
\[ \log(p_{rn,3} / p_{rn,1}) = b_{20} + b_{21}\,\mathrm{THM}_{rn} + b_{22}\,C_{rn} \]
• The LBWP and LBWF categories of y_{rn} are missing because gestational age is missing in the NBR
• C: covariates such as race/ethnicity and smoking (missing in the NBR, but observed in the MCS)
• [Diagram: THM_{rn} (known), C_{rn} (unknown) and the disease model parameters (unknown) point to y_{rn}]

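Missing outcome model: impute LBWP and LBWF for NBR
• MCS: birth weight (BW) and gestational age (G-age) are both observed, so y_{rm} (normal, LBWP or LBWF) is known
• NBR: birth weight is observed but G-age is missing, so only normal vs. LBW is known; the split of LBW into LBWP and LBWF has to be imputed
• [Diagram: NBR and MCS disease sub-models side by side, each with THM, C and the disease model parameters pointing to y; known and unknown nodes are marked]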

Building the sub-model: missing covariate model
• Impute C_{rn} (missing in the NBR) using the aggregate data A_r and the MCS data C_{rm}
• Because the missing covariates, race and smoking, are binary variables, we use a multivariate probit model to account for their correlation
• [Diagram: the missing covariate model parameters (unknown) and the aggregate data A_r (known) point to C_{rn} (unknown, NBR) and C_{rm} (known, MCS)]

Multivariate Probit Model (Chib & Greenberg, 1998)
• Race: 1 = nonwhite (Asian, Black, Others); 0 = white
• Smoke: 1 = yes; 0 = no
• Define underlying continuous variables (Smoke*, Race*), which are allowed to be correlated
• Smoke = I(Smoke* > 0) and Race = I(Race* > 0)
• S: sampling stratum, included to adjust for selection bias
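The latent-variable construction can be illustrated by simulation: draw two correlated standard normal errors, add them to latent means, and threshold at zero. This is a minimal sketch under simplifying assumptions: the latent means are fixed hypothetical numbers here, whereas in the model they would depend on the aggregate data and the sampling stratum, and the correlation value is also hypothetical.

```python
import math
import random

def draw_smoke_race(mu_smoke, mu_race, rho, rng):
    """Draw one (Smoke, Race) pair from a bivariate probit model.

    mu_smoke, mu_race: latent means (hypothetical values in this sketch).
    rho: correlation between the latent errors.
    """
    e1 = rng.gauss(0.0, 1.0)
    e2 = rng.gauss(0.0, 1.0)
    smoke_star = mu_smoke + e1
    # Build the second error so that its correlation with e1 equals rho.
    race_star = mu_race + rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
    return int(smoke_star > 0), int(race_star > 0)

rng = random.Random(0)
draws = [draw_smoke_race(-0.5, -1.0, 0.3, rng) for _ in range(20000)]
smoke_rate = sum(s for s, _ in draws) / len(draws)
race_rate = sum(r for _, r in draws) / len(draws)
print(smoke_rate, race_rate)
```

Each marginal rate should be close to the standard normal CDF evaluated at the latent mean, while rho controls how often the two indicators agree.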

Unified model
• NBR disease sub-model and MCS disease sub-model: THM and C point to y (normal, LBWP, LBWF), with the disease model parameters (unknown) in both
• Missing outcome model: imputes the LBWP / LBWF split of y_{rn} in the NBR
• Missing covariate sub-model: the missing covariate model parameters (unknown) and the aggregate data A_r (known) inform C_{rn} (unknown) and C_{rm} (known)

1. Disease Model (y = {1, 2, 3})
2. Missing Outcome Model
3. Missing Covariates Model (Multivariate Probit)
Notation: i: subject index; r: region; u: index for the category of outcome; N_m: group of subjects who had a missing outcome (y_miss); y_obs: observed outcome; X: observed covariates

Investigating the performance of the unified model
• Structure: A (aggregate) → C (0/1) → Y (1, 2, 3), linked by the missing covariate model and the missing outcome model
• Good performance of the model depends on
  - how well the aggregate data can inform C (the covariate)
  - how strongly C and Y are linked
• We can examine the following 4 data scenarios:
  1. Strong (A → C), Strong (C → Y)
  2. Strong (A → C), Weak (C → Y)
  3. Weak (A → C), Strong (C → Y)
  4. Weak (A → C), Weak (C → Y)

Simulation Study
Step 1: Create data (N = 1333) under the 4 data scenarios
Step 2: Missing assignment
  - randomly choose 80% of subjects and treat their C as missing
  - only 10% of individuals with outcomes in categories 2 or 3 were assigned to be missing
  - repeat Step 2 to generate 20 replicate samples
Step 3: Compare the predictions based on
  - an analysis using fully observed data (no imputation)
  - an analysis using partially observed data (imputation)
Note: partially observed data were analyzed under various models
  - Covariate sub-model (examining A → C)
  - Outcome sub-model (examining C → Y)
  - Unified model (examining A → C and C → Y)
  - Unified model with cut
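The missing-assignment step can be sketched as follows. This is one possible reading of the slide, stated as an assumption: covariates are masked for a random 80% of all subjects, and the outcome is masked for a random 10% of the subjects whose outcome is in category 2 or 3. The function name and the outcome counts used below are hypothetical; only N = 1333 and the 20 replicates come from the slide.

```python
import random

def assign_missing(outcomes, rng, p_cov_missing=0.8, p_out_missing=0.1):
    """Return index sets of subjects with masked covariates / masked outcomes."""
    n = len(outcomes)
    # Mask C for a random 80% of all subjects.
    cov_missing = set(rng.sample(range(n), round(p_cov_missing * n)))
    # Mask the outcome for a random 10% of subjects in categories 2 or 3.
    low_bw = [i for i, y in enumerate(outcomes) if y in (2, 3)]
    out_missing = set(rng.sample(low_bw, round(p_out_missing * len(low_bw))))
    return cov_missing, out_missing

# Hypothetical outcome counts that add up to N = 1333.
outcomes = [1] * 1200 + [2] * 80 + [3] * 53

# 20 replicate missingness patterns, one seed per replicate.
replicates = [assign_missing(outcomes, random.Random(seed)) for seed in range(20)]
cov_missing, out_missing = replicates[0]
print(len(cov_missing), len(out_missing))
```

Seeding each replicate separately keeps the patterns reproducible, so the same masks can be reused when the covariate, outcome and unified models are compared.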

Examining the imputation of missing covariates: one level (A → C)
• The model should assign a higher probability of a covariate pattern to subjects whose true covariates correspond to that pattern than to those whose true pattern is different
• Moving from strong A → C to weak A → C, the ability to discriminate the true covariate pattern decreases
• [Figure: panels for strong A → C and weak A → C]

Examining the imputation of missing covariates: two levels (A → C and C → Y)
• Feedback from the outcome model is beneficial to covariate imputation.
• The predicted probabilities of the covariate pattern (C = 0,0) are better able to discriminate between subjects whose true covariates are C = 0,0 and those whose are not, in particular in the weak C scenarios.

Examining the impact of the imputation model on the Y-C association
• Outcome vs. unified model
  - the unified model has a higher MSE than the outcome model (more missing values need to be imputed)
• Unified vs. unified with cut
  - a strong Y-C association helps reduce the MSE, but a weak Y-C association does not

Real data analysis: a water company in northern England
• Data restricted to singleton births
• Period: Sep 2000 to Aug 2001
• Subjects: MCS 1333 + NBR 7945 = total 9278
• MCS: completely observed information
• NBR: missing race, missing smoke, missing outcome at levels 2 (LBWP) and 3 (LBWF)
• Missing % in race and smoke: ~85%
• Missing % in outcome: ~7%

Real data analysis: a water company in northern England
• Exposure variable: THMs
  - dichotomized into 2 groups
  - low-medium exposure group (<= 60 µg/l): 57.35%
  - high exposure group (> 60 µg/l): 42.65%
  - estimated in a separate model for MCS and NBR (Whitaker et al., 2005)
• In addition to race and smoking, we also adjust for baby's sex and maternal age, which are observed in both MCS and NBR
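The dichotomization of the exposure can be written as a one-line rule. This is a minimal sketch: the function name and the example concentrations are hypothetical, and only the 60 µg/l cutoff comes from the slide.

```python
def thm_group(thm_ug_per_l, cutoff=60.0):
    """Return 1 for the high exposure group (> cutoff), else 0 (low-medium)."""
    return 1 if thm_ug_per_l > cutoff else 0

# Hypothetical THM concentrations in µg/l.
example = [35.2, 60.0, 60.1, 82.4]
groups = [thm_group(x) for x in example]
print(groups)
```

A value exactly at the cutoff falls in the low-medium group, matching the "<= 60" and "> 60" definitions above.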

Models for real data analysis: no imputation vs. imputation
a. Multinomial logistic regression model for MCS data (Bayesian): no imputation
b. Bayesian multiple bias model for combined NBR, MCS and aggregate data: impute missing outcome and covariates

Results for the real data analysis (low birth weight full-term vs. normal)
* 95% Bayesian credible interval
All parameter estimates are adjusted for baby's sex and maternal age

Conclusion
• There is evidence for an association between THM exposure and full-term low birth weight.
• Combining the datasets can
  - increase the statistical power of the survey data
  - alleviate bias due to confounding in the administrative data
• We must allow for the selection mechanism of the survey when combining data

THANKS • Mireille Toledano • Mark Nieuwenhuijsen • James Bennett • Peter Hambly • Daniela Fecht • John Molitor
