Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products
Nuoo-Ting (Jassy) Molitor (1), Chris Jackson (2)
With Nicky Best, Sylvia Richardson (1)
(1) Department of Epidemiology and Public Health, Imperial College, London
(2) MRC Biostatistics Unit, Cambridge
jassy.molitor@imperial.ac.uk
chris.jackson@mrc-bsu.cam.ac.uk
http://www.bias-project.org.uk
Outline
• Motivation for combining different data sources
• Case study: Chlorination Study
• Data sources
• Statistical modeling
• Simulation and real data analysis
Observational studies
• Full of uncertainties other than random error:
  • missing values
  • unobserved confounders
  • measurement errors
  • selection bias
• These uncertainties are hard to identify within a single data set
Combining multiple data sources
• Research questions are complicated in nature, and a single data set may not be able to provide a sufficient answer.
• Example: Puzzle
Case study Combining birth register, survey and census data to study effects of water disinfection by-products on risk of low birth weight
Example of combining different data sources: Chlorination Study
Chlorine reacts with natural organic matter and/or the chemical compound bromide to form organic and inorganic byproducts:
• bromate
• chlorite
• haloacetic acids (HAA5)
• total trihalomethanes (THMs)
Environmental exposure: chlorine byproducts (THMs)
Outcome: low birth weight (LBW), split by gestational age into
• LBW and pre-term (LBWP)
• LBW and full-term (LBWF)
Covariates:
• mother's race/ethnicity
• baby's sex
• mother's smoking status
• maternal age during the pregnancy
Definitions:
• LBW: baby's birth weight is less than 2.5 kg
• LBWP: LBW baby born at less than 37 weeks
• LBWF: LBW baby born at 37 weeks or later
Available data sources related to the Chlorination Study: why do we need them?
• Administrative data (NBR): deals with
  • the small % of LBW in the population
  • the inconclusive link between LBW and THMs
• Aggregate data: used for imputing missing covariates
• Survey data (MCS):
  • adjusts for important subject-level covariates
  • allows us to examine different types of LBW
Summary of data sources
• NBR (national birth registry): administrative data (large), with statistical power and no selection bias
  • Observed: postcode
  • Missing: smoking and race/ethnicity
  • Missing: baby's gestational age
• MCS (millennium cohort study): survey data (subset of NBR), with low power and selection bias
  • Observed: postcode
  • Observed: smoking and race/ethnicity
  • Observed: baby's gestational age
• Aggregate data (UK)
  • Observed: postcode
  • Census 2001: region-level race/ethnicity composition
  • Consumer survey (CACI): region-level tobacco expenditure
Building the sub-model: disease sub-model for MCS
Multinomial logistic regression for MCS:
$y_{rm} \sim \text{Multinomial}(p_{rm,1:3},\, 1)$
$\log(p_{rm,2} / p_{rm,1}) = b_{10} + b_{11}\,\text{THM}_{rm} + b_{12}\,C_{rm}$
$\log(p_{rm,3} / p_{rm,1}) = b_{20} + b_{21}\,\text{THM}_{rm} + b_{22}\,C_{rm}$
• m: subject index for MCS
• r: region index
• y: birth weight indicator (1: normal, 2: LBWP, 3: LBWF)
• THM: THM (chlorine byproduct) exposure
• C: covariates such as race/ethnicity and smoking, which are missing elsewhere and observed only in the MCS
• b: disease model parameters
(Diagram: THM_rm and C_rm feed into y_rm, with categories normal, LBWP and LBWF; the legend distinguishes unknown from known quantities.)
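The two log-odds equations can be turned back into category probabilities, which may help when reading the model. The following is a minimal sketch only: the function name and the coefficient values are hypothetical and chosen for illustration, not taken from the study.

```python
import math

def multinomial_logit_probs(thm, c, b1, b2):
    """Return (p1, p2, p3) for categories 1: normal, 2: LBWP, 3: LBWF.

    b1 = (b10, b11, b12) and b2 = (b20, b21, b22) are the coefficients of the
    two log-odds equations, with category 1 (normal) as the reference.
    """
    eta2 = b1[0] + b1[1] * thm + b1[2] * c   # log(p2 / p1)
    eta3 = b2[0] + b2[1] * thm + b2[2] * c   # log(p3 / p1)
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return (1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom)

# Illustrative coefficient values only; they are not estimates from the study.
probs = multinomial_logit_probs(thm=1, c=1,
                                b1=(-3.0, 0.2, 0.5),
                                b2=(-3.5, 0.3, 0.4))
```

The three probabilities sum to one by construction, and taking the log of `probs[1] / probs[0]` recovers the first linear predictor.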
Building the sub-model: disease sub-model for NBR
Multinomial logistic regression for NBR:
$y_{rn} \sim \text{Multinomial}(p_{rn,1:3},\, 1)$
$\log(p_{rn,2} / p_{rn,1}) = b_{10} + b_{11}\,\text{THM}_{rn} + b_{12}\,C_{rn}$
$\log(p_{rn,3} / p_{rn,1}) = b_{20} + b_{21}\,\text{THM}_{rn} + b_{22}\,C_{rn}$
• n: subject index for NBR
• r: region index
• b: disease model parameters
• Missing LBWP and LBWF outcomes are due to missing gestational age
• C: covariates such as race/ethnicity and smoking (missing in the NBR, but observed in the MCS)
(Diagram: THM_rn and C_rn feed into y_rn, with categories normal, LBWP and LBWF; the legend distinguishes unknown from known quantities.)
Missing outcome model: impute LBWP and LBWF for NBR
• In both NBR and MCS, birth weight (BW) separates normal from LBW
• Gestational age (G-age) separates LBW into LBWP and LBWF; it is missing in the NBR and observed in the MCS
(Diagram: the NBR and MCS disease sub-models side by side, each with THM and C feeding into y; the legend distinguishes known from unknown quantities.)
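One way to picture the imputation step: for an NBR subject known to be LBW but with unknown gestational age, the outcome can only be category 2 or 3, so it can be drawn with probability p2 / (p2 + p3). The sketch below assumes this conditional form. The slides do not spell out the exact imputation rule, and the function name and probabilities are hypothetical.

```python
import random

def impute_lbw_type(p2, p3, rng):
    """Draw 2 (LBWP) or 3 (LBWF) for a subject known to be LBW.

    Assumption: conditional on low birth weight, P(y = 2) = p2 / (p2 + p3).
    """
    prob_preterm = p2 / (p2 + p3)
    return 2 if rng.random() < prob_preterm else 3

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Illustrative probabilities only; not values from the study.
draws = [impute_lbw_type(0.06, 0.02, rng) for _ in range(10000)]
share_preterm = draws.count(2) / len(draws)  # should be close to 0.75
```

In a full Bayesian fit this draw would be repeated at every MCMC iteration with the current values of p2 and p3, so that uncertainty about the missing outcome propagates to the disease model parameters.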
Building the sub-model: missing covariate model
• Impute C_rn (NBR) using the aggregate data A_r and the MCS data C_rm
• Because the missing covariates (race and smoking) are binary variables, we use a multivariate probit model to account for their correlation
(Diagram: the missing covariate model parameters link C_rn, C_rm and the aggregate data A_r; the legend distinguishes unknown from known quantities.)
Multivariate Probit Model (Chib & Greenberg, 1998)
• Race: 1: nonwhite (Asian, Black, Others); 0: white
• Smoke: 1: yes; 0: no
• Define underlying continuous variables (Smoke*, Race*), which are allowed to be correlated
• Smoke = I(Smoke* > 0) and Race = I(Race* > 0)
• S: sampling stratum, included to adjust for selection bias
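The latent-variable construction can be illustrated by simulating two correlated standard normal errors and thresholding them at zero. This is a sketch under stated assumptions: the latent means and the correlation are made-up constants, whereas in the model described here they would depend on the aggregate data and the sampling stratum.

```python
import math
import random

def draw_race_smoke(mean_race, mean_smoke, rho, rng):
    """Draw (race, smoke) from a bivariate probit with latent correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    # Build a second standard normal error with correlation rho to z1.
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    race_star = mean_race + z1     # latent Race*
    smoke_star = mean_smoke + z2   # latent Smoke*
    return int(race_star > 0), int(smoke_star > 0)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
# Hypothetical latent means and correlation, for illustration only.
sample = [draw_race_smoke(-1.0, -0.5, 0.4, rng) for _ in range(5000)]
share_nonwhite = sum(race for race, _ in sample) / len(sample)
```

With a latent mean of -1.0, the share of Race = 1 should sit near the standard normal probability of exceeding 1, which is roughly 0.16.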
Unified model
(Diagram: the NBR disease sub-model and the MCS disease sub-model, each with THM and C feeding into y with categories normal, LBWP and LBWF, are joined by the missing outcome model. The missing covariate sub-model links C_rn and C_rm to the aggregate data A_r through the missing covariate model parameters. The legend distinguishes known from unknown quantities.)
1. Disease model (y = {1, 2, 3})
2. Missing outcome model
3. Missing covariates model (multivariate probit)
Notation:
• i: subject index
• N_m: group of subjects who had a missing outcome (y_miss)
• r: region
• u: index for the category of outcome
• y_obs: observed outcome
• X: observed covariates
Investigating the performance of the unified model
• A (aggregate) → C (0/1): missing covariate model
• C (0/1) → Y (1, 2, 3): missing outcome model
• Good performance of the model depends on
  • how well the aggregate data can inform C (covariate)
  • how strongly C and Y are linked
We can examine the following 4 data scenarios:
1. Strong (A → C), strong (C → Y)
2. Strong (A → C), weak (C → Y)
3. Weak (A → C), strong (C → Y)
4. Weak (A → C), weak (C → Y)
Simulation Study
Step 1: Create data (N=1333) under the scenarios
Step 2: Missing assignment
• randomly choose 80% of subjects and treat their C as missing
• only 10% of individuals with outcomes in categories 2 or 3 are assigned to be missing
Repeat step 2 to generate 20 replicate samples
Step 3: Compare the prediction based on
• an analysis using fully observed data (no imputation)
• with an analysis using partially observed data (imputation)
Note: partially observed data were analyzed under various models
• Covariate sub-model (examining A → C)
• Outcome sub-model (examining C → Y)
• Unified model (examining A → C and C → Y)
• Unified model with cut
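Step 2 can be sketched as below. The sketch reads the second rule as "10% of the subjects whose outcome is 2 or 3 have that outcome set to missing", which is an interpretation of the slide rather than a confirmed detail. The function name, the seed and the outcome frequencies are hypothetical.

```python
import random

def assign_missing(outcomes, rng, frac_c_missing=0.8, frac_y_missing=0.1):
    """Return (c_missing, y_missing): sets of subject indices set to missing."""
    n = len(outcomes)
    # Covariates C: a random 80% of all subjects.
    c_missing = set(rng.sample(range(n), round(frac_c_missing * n)))
    # Outcomes: a random 10% of the subjects in categories 2 or 3.
    lbw = [i for i, y in enumerate(outcomes) if y in (2, 3)]
    y_missing = set(rng.sample(lbw, round(frac_y_missing * len(lbw))))
    return c_missing, y_missing

rng = random.Random(2)  # fixed seed so the sketch is reproducible
# Hypothetical outcomes for N = 1333 subjects; category weights are illustrative.
outcomes = rng.choices([1, 2, 3], weights=[0.9, 0.06, 0.04], k=1333)
# Repeating the assignment gives the 20 replicate samples.
replicates = [assign_missing(outcomes, rng) for _ in range(20)]
```

Each replicate keeps the same simulated data and changes only which values are hidden, so differences between replicates reflect the missingness pattern alone.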
Examining the imputation of missing covariates: one level (A → C)
• The model assigns a higher probability of a covariate pattern to subjects whose true covariates correspond to that pattern than to those whose true pattern is different
• From strong A → C to weak A → C, the ability to discriminate the true covariate pattern decreases
Examining the imputation of missing covariates: two levels (A → C and C → Y)
• Feedback from the outcome model is beneficial to covariate imputation
• The predicted probabilities of covariate pattern (C=0,0) are better able to discriminate between subjects whose true covariates are C=0,0 and those whose are not, particularly in the weak C scenarios
Examining the impact of the imputation model on the Y-C association
• Outcome vs. unified model
  • The unified model has a higher MSE than the outcome model (more missing values need to be imputed)
• Unified vs. unified with cut
  • A strong Y-C association helps reduce the MSE, but a weak Y-C association does not
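The slides do not define how the MSE is computed. A common choice, assumed here, is the mean squared difference between the estimates from the replicate samples and a reference value such as the estimate from the fully observed data. All numbers below are made up for illustration.

```python
def mean_squared_error(estimates, reference):
    """Average squared deviation of replicate estimates from a reference value."""
    return sum((e - reference) ** 2 for e in estimates) / len(estimates)

# Made-up numbers for illustration; not results from the study.
reference = 0.50                         # e.g. estimate from fully observed data
outcome_model = [0.48, 0.53, 0.51, 0.47]
unified_model = [0.40, 0.62, 0.55, 0.41]

mse_outcome = mean_squared_error(outcome_model, reference)
mse_unified = mean_squared_error(unified_model, reference)
```

In this toy input the unified-model estimates scatter more widely around the reference, so `mse_unified` exceeds `mse_outcome`.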
Real data analysis: a water company in northern England
Data
• Restricted to: singleton births
• Period: Sep 2000 to Aug 2001
• Subjects: MCS 1333 + NBR 7945 = 9278 in total
• MCS: completely observed information
• NBR: missing race, missing smoke, and missing outcome at levels 2 (LBWP) and 3 (LBWF)
• Missing % in race and smoke: ~85%
• Missing % in outcome: ~7%
Real data analysis: a water company in northern England
• Exposure variable: THMs
  • Dichotomized into 2 groups
    • low-medium exposure group (<= 60 µg/l): 57.35%
    • high exposure group (> 60 µg/l): 42.65%
  • Estimated in a separate model for MCS and NBR (Whitaker et al, 2005)
• In addition to race and smoke, we also adjust for covariates observed in both MCS and NBR:
  • baby's sex
  • maternal age
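The dichotomization amounts to a single threshold comparison. A minimal sketch, assuming the cut-off of 60 is in µg/l and using hypothetical concentration values:

```python
def thm_group(thm_concentration, cutoff=60.0):
    """Return 0 for low-medium exposure (<= cutoff), 1 for high exposure (> cutoff)."""
    return 1 if thm_concentration > cutoff else 0

# Hypothetical THM concentrations, for illustration only.
groups = [thm_group(x) for x in (35.2, 60.0, 60.1, 88.0)]
```

A value exactly at the cut-off falls in the low-medium group, matching the "<=" in the group definition.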
Models for real data analysis: no imputation vs. imputation
a. Multinomial logistic regression model for MCS data (Bayesian): no imputation
b. Bayesian multiple bias model for combined NBR, MCS and aggregate data: impute missing outcome and covariates
Results for the real data analysis (low birth-weight full-term vs. normal)
* 95% Bayesian credible interval
All parameter estimates are adjusted for baby's sex and maternal age
Conclusion
• There is evidence of an association between THM exposure and low birth-weight full-term.
• Combining the datasets can
  • increase the statistical power of the survey data
  • alleviate bias due to confounding in the administrative data
• We must allow for the selection mechanism of the survey when combining data
THANKS • Mireille Toledano • Mark Nieuwenhuijsen • James Bennett • Peter Hambly • Daniela Fecht • John Molitor