

Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products
Nuoo-Ting (Jassy) Molitor (1), Chris Jackson (2), with Nicky Best and Sylvia Richardson (1)
(1) Department of Epidemiology and Public Health


Presentation Transcript


  1. Bayesian graphical models for combining mismatched administrative and survey data: application to low birth weight and water disinfection by-products
     Nuoo-Ting (Jassy) Molitor (1), Chris Jackson (2), with Nicky Best and Sylvia Richardson (1)
     (1) Department of Epidemiology and Public Health, Imperial College, London
     (2) MRC Biostatistics Unit, Cambridge
     jassy.molitor@imperial.ac.uk, chris.jackson@mrc-bsu.cam.ac.uk
     http://www.bias-project.org.uk

  2. Outline
     • Motivation for combining different data sources
     • Case study: Chlorination Study
     • Data sources
     • Statistical modelling
     • Simulation and real data analysis

  3. Observational studies
     • Filled with many uncertainties other than random error:
       • missing values
       • unobserved confounders
       • measurement error
       • selection bias
     • These uncertainties are hard to identify within a single data set

  4. Combining multiple data sources
     • Research questions are complicated in nature, and a single data set may not be able to provide a sufficient answer.
     • Example: puzzle

  5. Case study Combining birth register, survey and census data to study effects of water disinfection by-products on risk of low birth weight

  6. Example of combining different data sources: Chlorination Study
     • Chlorine reacts with natural organic matter and/or the chemical compound bromide, producing organic and inorganic by-products:
       • bromate
       • chlorite
       • haloacetic acids (HAA5)
       • total trihalomethanes (THMs)
     • Environmental exposure: chlorine by-products (THMs)
     • Outcome: low birth weight (LBW, birth weight < 2.5 kg), split by gestation age into LBW and pre-term (LBWP) and LBW and full-term (LBWF)
     • Covariates: mother's race/ethnicity, baby's sex, mother's smoking status, mother's age during the pregnancy
     • Definitions:
       • LBW: the baby's birth weight is less than 2.5 kg
       • LBWP: LBW babies born at less than 37 weeks
       • LBWF: LBW babies born at 37 weeks or later
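The outcome definitions on this slide can be written as a small classification rule. The sketch below is illustrative only: the function name and the use of `None` for a missing gestation age are assumptions, and the 1/2/3 coding follows the birth weight indicator used later in the deck (1: normal, 2: LBWP, 3: LBWF).

```python
from typing import Optional

LBW_THRESHOLD_KG = 2.5  # slide definition: LBW is birth weight < 2.5 kg
TERM_WEEKS = 37         # slide definition: pre-term is < 37 weeks


def classify_birth(weight_kg: float, gestation_weeks: Optional[float]) -> Optional[int]:
    """Return 1 (normal), 2 (LBWP) or 3 (LBWF).

    Returns None when the baby is LBW but gestation age is unknown,
    because LBWP and LBWF cannot then be told apart.
    """
    if weight_kg >= LBW_THRESHOLD_KG:
        return 1  # normal weight: gestation age is not needed
    if gestation_weeks is None:
        return None  # LBW, but the sub-type is missing
    return 2 if gestation_weeks < TERM_WEEKS else 3
```

A record with normal birth weight is classified even when gestation age is absent, which is why only the LBWP/LBWF split is affected by missing gestation age.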

  7. Available data sources related to the Chlorination Study: why do we need them?
     • Administrative data (NBR)
       • Deals with the small % of LBW in the population
       • Deals with the inconclusive link between LBW and THMs
     • Aggregate data
       • Imputing missing covariates
     • Survey data (MCS)
       • Adjusts for important subject-level covariates
       • Allows different types of LBW to be examined

  8. Summary of data sources
     • NBR (national birth registry): administrative data (large), giving power and no selection bias
       • Observed postcode
       • Missing smoking and race/ethnicity
       • Missing baby's gestation age
     • MCS (millennium cohort study): survey data (subset of NBR), with low power and selection bias
       • Observed postcode
       • Observed smoking and race/ethnicity
       • Observed baby's gestation age
     • Aggregate data (UK)
       • Observed postcode
       • Census 2001: region-level race/ethnicity composition
       • Consumer survey (CACI): region-level tobacco expenditure

  9. Building the sub-model: disease sub-model for MCS
     Multinomial logistic regression for MCS:
       $y_{rm} \sim \mathrm{Multinomial}(p_{rm,1:3}, 1)$
       $\log(p_{rm,2} / p_{rm,1}) = b_{10} + b_{11}\,\mathrm{THM}_{rm} + b_{12}\,C_{rm}$
       $\log(p_{rm,3} / p_{rm,1}) = b_{20} + b_{21}\,\mathrm{THM}_{rm} + b_{22}\,C_{rm}$
     • m: subject index for MCS; r: region index
     • y: birth weight indicator (1: normal, 2: LBWP, 3: LBWF)
     • THM: THM (chlorine by-product) exposure
     • C: missing covariates such as race/ethnicity and smoking, only observed in the MCS
     [Diagram: the disease model parameters, THM_rm and C_rm feed into the outcome y_rm (normal, LBWP, LBWF); nodes are marked as unknown or known.]
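To make the two log-odds equations concrete, the sketch below converts them into the three outcome probabilities for one subject. It is a minimal illustration, not the authors' code: the function name is an assumption, the coefficient tuples stand for (b10, b11, b12) and (b20, b21, b22), and C is treated as a single numeric covariate for simplicity.

```python
import math
from typing import Sequence, Tuple


def outcome_probabilities(
    thm: float,
    c: float,
    b1: Sequence[float],  # (b10, b11, b12): LBWP versus normal
    b2: Sequence[float],  # (b20, b21, b22): LBWF versus normal
) -> Tuple[float, float, float]:
    """Return (p1, p2, p3) for normal, LBWP and LBWF."""
    eta2 = b1[0] + b1[1] * thm + b1[2] * c  # log(p2 / p1)
    eta3 = b2[0] + b2[1] * thm + b2[2] * c  # log(p3 / p1)
    denom = 1.0 + math.exp(eta2) + math.exp(eta3)
    return 1.0 / denom, math.exp(eta2) / denom, math.exp(eta3) / denom
```

Category 1 (normal) is the reference category, so its linear predictor is fixed at zero and the three probabilities always sum to one.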

  10. Building the sub-model: disease sub-model for NBR
      Multinomial logistic regression for NBR:
        $y_{rn} \sim \mathrm{Multinomial}(p_{rn,1:3}, 1)$
        $\log(p_{rn,2} / p_{rn,1}) = b_{10} + b_{11}\,\mathrm{THM}_{rn} + b_{12}\,C_{rn}$
        $\log(p_{rn,3} / p_{rn,1}) = b_{20} + b_{21}\,\mathrm{THM}_{rn} + b_{22}\,C_{rn}$
      • n: subject index for NBR; r: region index
      • Missing LBWP and LBWF outcomes are due to missing gestation age
      • C: missing covariates such as race/ethnicity and smoking (missing in the NBR, but observed in the MCS)
      [Diagram: the disease model parameters, THM_rn and C_rn feed into the outcome y_rn (normal, LBWP, LBWF); nodes are marked as unknown or known.]

  11. Missing outcome model: impute LBWP and LBWF for NBR
      [Diagram: the NBR and MCS disease sub-models side by side. In each, the disease model parameters, THM and C feed into the outcome (y_rn for NBR, y_rm for MCS). Birth weight (BW) separates normal from LBW, and gestation age (G-age) separates LBW into LBWP and LBWF; G-age is missing in the NBR. Nodes are marked as known or unknown.]

  12. Building the sub-model: missing covariate model
      • Impute C_rn using the aggregate data and the MCS data
      • Since our missing covariates, such as race and smoking, are binary variables, we use a multivariate probit model to account for their correlation
      [Diagram: the missing covariate model parameters and the aggregate data A_r inform C_rn (NBR) and C_rm (MCS); nodes are marked as unknown or known.]

  13. Multivariate probit model (Chib & Greenberg, 1998)
      • Race: 1 = nonwhite (Asian, Black, Others), 0 = white
      • Smoke: 1 = yes, 0 = no
      • Race and Smoke are allowed to be correlated
      • Define underlying continuous variables (Smoke*, Race*), with Smoke = I(Smoke* > 0) and Race = I(Race* > 0)
      • S: sampling stratum, included to adjust for selection bias
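The latent-variable construction can be illustrated by simulating one subject's (Smoke, Race) pair from a bivariate probit. This is a sketch under stated assumptions rather than the model fitted in the talk: the function name, the unit-variance latent errors and the fixed latent means are hypothetical. In the actual model the means would presumably depend on the aggregate data and the sampling stratum S, which the transcript does not spell out.

```python
import math
import random
from typing import Tuple


def draw_smoke_race(
    mean_smoke: float,   # hypothetical latent mean for Smoke*
    mean_race: float,    # hypothetical latent mean for Race*
    rho: float,          # correlation between the two latent errors
    rng: random.Random,
) -> Tuple[int, int]:
    """Draw (Smoke, Race) by thresholding correlated latent normals at zero."""
    e1 = rng.gauss(0.0, 1.0)
    e2 = rng.gauss(0.0, 1.0)
    smoke_star = mean_smoke + e1
    # Build a second standard normal error with correlation rho to e1.
    race_star = mean_race + rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
    return int(smoke_star > 0), int(race_star > 0)
```

With a positive `rho`, the two binary covariates agree more often than independent draws would, which is the correlation the multivariate probit is meant to capture.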

  14. Unified model
      [Diagram: the NBR disease sub-model, the MCS disease sub-model, the missing outcome model and the missing covariate sub-model combined in one graph. The disease model parameters, THM_rn and C_rn feed into y_rn; the disease model parameters, THM_rm and C_rm feed into y_rm (normal, LBWP, LBWF). The missing covariate model parameters and the aggregate data A_r inform C_rn and C_rm. Nodes are marked as known or unknown.]

  15. Components of the unified model
      • (1) Disease model (y = {1, 2, 3})
      • (2) Missing outcome model
      • (3) Missing covariates model (multivariate probit)
      Notation:
      • i: subject index
      • r: region
      • u: index for the category of the outcome
      • N_m: group of subjects who had a missing outcome (y_miss)
      • y_obs: observed outcome
      • X: observed covariates

  16. Investigating the performance of the unified model
      A (aggregate) → C (0/1) → Y (1, 2, 3), linked by the missing covariate model (A → C) and the missing outcome model (C → Y)
      • Good performance of the model depends on
        • how well the aggregate data can inform C (the covariate)
        • how strongly C and Y are linked
      • We can examine the following 4 data scenarios:
        1. Strong (A → C), strong (C → Y)
        2. Strong (A → C), weak (C → Y)
        3. Weak (A → C), strong (C → Y)
        4. Weak (A → C), weak (C → Y)

  17. Simulation study
      • Step 1: Create data (N = 1333) under the scenarios
      • Step 2: Missing assignment
        • randomly choose 80% of subjects and treat their C as missing
        • only 10% of individuals with outcomes in categories 2 or 3 are assigned to be missing
      • Repeat step 2 to generate 20 replicate samples
      • Step 3: Compare the prediction based on an analysis using fully observed data (no imputation) with an analysis using partially observed data (imputation)
      • Note: the partially observed data were analysed under various models
        • Covariate sub-model (examining A → C)
        • Outcome sub-model (examining C → Y)
        • Unified model (examining A → C and C → Y)
        • Unified model with cut
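The missing-assignment step can be sketched as follows. This is one reading of the slide, not the authors' simulation code: it assumes the 80% applies to all subjects' covariates and the 10% applies within the subjects whose outcome is 2 or 3. The function name and the use of `None` as the missing marker are illustrative.

```python
import random
from typing import List, Optional, Sequence, Tuple


def assign_missing(
    covariates: Sequence[Tuple[int, int]],
    outcomes: Sequence[int],
    rng: random.Random,
    p_cov: float = 0.8,  # share of subjects whose C is set to missing
    p_out: float = 0.1,  # share of category 2/3 outcomes set to missing
) -> Tuple[List[Optional[Tuple[int, int]]], List[Optional[int]]]:
    """Return copies of the covariates and outcomes with missing values marked as None."""
    n = len(covariates)
    cov_missing = set(rng.sample(range(n), round(p_cov * n)))
    # Only outcomes in categories 2 (LBWP) and 3 (LBWF) can become missing.
    lbw = [i for i, y in enumerate(outcomes) if y in (2, 3)]
    out_missing = set(rng.sample(lbw, round(p_out * len(lbw))))
    c_obs = [None if i in cov_missing else c for i, c in enumerate(covariates)]
    y_obs = [None if i in out_missing else y for i, y in enumerate(outcomes)]
    return c_obs, y_obs
```

Calling the function 20 times with differently seeded generators on the same complete data would give the 20 replicate samples described in the slide.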

  18. Examining the imputation of missing covariates: one level (A → C)
      • A good imputation assigns a higher probability of a covariate pattern to subjects whose true covariates correspond to that pattern than to those whose true pattern is different
      • Panels: strong A → C, weak A → C
      • The ability to discriminate the true covariate pattern decreases

  19. Examining the imputation of missing covariates: two levels (A → C and C → Y)
      • Feedback from the outcome model is beneficial to covariate imputation
      • The predicted probabilities of the covariate pattern (C = 0,0) are better able to discriminate between subjects whose true covariates are C = 0,0 and those whose are not, particularly in the weak C scenarios

  20. Examining the impact of the imputation model on the Y-C association
      • Outcome model vs unified model
        • The unified model has a higher MSE than the outcome model (more missing values need to be imputed)
      • Unified model vs unified model with cut
        • A strong Y-C association helps reduce the MSE, but a weak Y-C association does not

  21. Real data analysis: a water company in northern England
      • Data restricted to singleton births
      • Period: Sep 2000 to Aug 2001
      • Subjects: MCS 1333 + NBR 7945 = total 9278
        • MCS: completely observed information
        • NBR: missing race, missing smoking, and missing outcome at levels 2 (LBWP) and 3 (LBWF)
      • Missing % in race and smoking: ~85%
      • Missing % in outcome: ~7%

  22. Real data analysis: a water company in northern England
      • Exposure variable: THMs
        • Dichotomized into 2 groups
          • low-medium exposure group (<= 60 µg/l): 57.35%
          • high exposure group (> 60 µg/l): 42.65%
        • Estimated in a separate model for MCS and NBR (Whitaker et al, 2005)
      • In addition to race and smoking, we also adjust for baby's sex and mother's age, both observed in the MCS and the NBR
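The dichotomization of the exposure is a simple threshold rule. The sketch below uses illustrative names and assumes the cut-off of 60 is in micrograms per litre; it shows the rule together with the share of subjects falling in the high-exposure group.

```python
from typing import Sequence

THM_CUTOFF_UG_PER_L = 60.0  # cut-off from the slide; the unit is assumed to be µg/l


def exposure_group(thm_ug_per_l: float) -> int:
    """Return 0 for low-medium exposure (<= 60) and 1 for high exposure (> 60)."""
    return int(thm_ug_per_l > THM_CUTOFF_UG_PER_L)


def share_high(values: Sequence[float]) -> float:
    """Proportion of subjects in the high-exposure group."""
    return sum(exposure_group(v) for v in values) / len(values)
```

Applied to the study's estimated THM concentrations, `share_high` would correspond to the 42.65% reported for the high-exposure group.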

  23. Models for real data analysis: no imputation vs imputation
      a. Multinomial logistic regression model for MCS data (Bayesian): no imputation
      b. Bayesian multiple bias model for combined NBR, MCS and aggregate data: impute missing outcome and covariates

  24. Results for the real data analysis (low birth weight full-term vs normal)
      * 95% Bayesian credible interval
      All parameter estimates are adjusted for baby's sex and mother's age

  25. Conclusion
      • There is evidence of an association between THM exposure and full-term low birth weight.
      • Combining the datasets can
        • increase the statistical power of the survey data
        • alleviate bias due to confounding in the administrative data
      • The selection mechanism of the survey must be allowed for when combining data

  26. THANKS • Mireille Toledano • Mark Nieuwenhuijsen • James Bennett • Peter Hambly • Daniela Fecht • John Molitor
