
Data Mining Cardiovascular Bayesian Networks

Data Mining Cardiovascular Bayesian Networks. Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡ (Danny Liew‡, Sophie Rogers‡, Lucas Hope†). †School of Computer Science & Software Engineering, ‡Dept. of Epidemiology & Preventive Medicine, Monash University


Presentation Transcript


  1. Data Mining Cardiovascular Bayesian Networks • Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡ (Danny Liew‡, Sophie Rogers‡, Lucas Hope†) • †School of Computer Science & Software Engineering, ‡Dept. of Epidemiology & Preventive Medicine, Monash University • www.datamining.monash.edu.au/bnepi

  2. Overview • Problem: assessment of risk for coronary heart disease (CHD) • 1. Knowledge Engineering • 2. Data Mining • 3. Evaluation • Diagram labels: Busselton Study data; 2 epidemiological models; medical experts; Bayesian network software (Netica); causal discovery (CaMML) + other learners

  3. Knowledge Engineering BNs from the medical literature • The Australian Busselton Study • surveys every 3 years, 1966-1981, > 8,000 participants • mortality follow-up via the WA death register plus manual checks • Cox proportional-hazards model, fitted to 2,258 participants from the 1978 cohort • CHD event base rates: 23% for men, 14% for women • The German PROCAM Study • 1979-1985, follow-up every 2 years, > 25,000 participants • Scoring model (based on Cox), ~5,000 men • CHD event base rates: ~6% • General question: are models transferable across populations?
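Both source models on this slide are based on Cox proportional-hazards regression. As a rough sketch of how such a model turns risk factors into a 10-year risk, the conventional form is risk = 1 - S0(t)^exp(β·(x - x̄)), where S0(t) is the baseline survival at the horizon. The function name, coefficients, means and baseline survival below are hypothetical placeholders, not values from the Busselton or PROCAM papers.

```python
import math

def cox_risk(x, beta, x_mean, baseline_survival):
    """Risk of an event by the horizon t under a Cox model.

    x                 -- risk-factor values for one person
    beta              -- one log-hazard coefficient per risk factor
    x_mean            -- cohort means used to centre the risk factors
    baseline_survival -- S0(t): survival at the horizon for an average person
    """
    linear_predictor = sum(b * (xi - mi) for b, xi, mi in zip(beta, x, x_mean))
    return 1.0 - baseline_survival ** math.exp(linear_predictor)

# Hypothetical two-factor model: age (years) and systolic BP (mmHg).
beta = [0.05, 0.02]
x_mean = [50.0, 130.0]
s0_10yr = 0.90

average_person = cox_risk(x_mean, beta, x_mean, s0_10yr)
older_hypertensive = cox_risk([65.0, 160.0], beta, x_mean, s0_10yr)
```

A person at the cohort means gets exactly the baseline risk, 1 - S0(t); positive coefficients push the risk up for values above the mean.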

  4. The Busselton BN: nodes

  5. The Busselton BN: arcs • BNs summarize the joint distribution: P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S) • All nodes have an associated conditional prob. distribution • Figure labels: predictor variables; 10-year risk of CHD event; uninformative
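The factorization on this slide can be checked numerically. The sketch below keeps the slide's variable names S, B, Al and At but assumes each is binary, and every probability in it is invented for illustration; the slides do not give the conditional probability tables.

```python
from itertools import product

# Hypothetical numbers: the slide gives only the factorization, not the CPTs.
p_s = {True: 0.3, False: 0.7}            # P(S)
p_b_given_s = {True: 0.6, False: 0.2}    # P(B=True | S)
p_al_given_s = {True: 0.5, False: 0.4}   # P(Al=True | S)
p_at_given_s = {True: 0.25, False: 0.1}  # P(At=True | S)

def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(s, b, al, at):
    """P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)."""
    return (p_s[s]
            * bern(p_b_given_s[s], b)
            * bern(p_al_given_s[s], al)
            * bern(p_at_given_s[s], at))

# Summing the factorized joint over all 16 assignments must give 1.
total = sum(joint(*values) for values in product([True, False], repeat=4))
```

With binary variables this factorization needs 1 + 2 + 2 + 2 = 7 free parameters, against 15 for an unconstrained joint distribution over four binary variables.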

  6. The Busselton BN: discretization • Figure labels: binary nodes; discretization choices

  7. The Busselton BN: reasoning

  8. The Busselton BN: reasoning

  9. The Busselton BN: reasoning • Figure labels: Normal; Bad cholesterol; Heavy smoking

  10. The Busselton BN: reasoning • More risk factors!
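The reasoning slides show the risk node updating as evidence is entered on the predictors. A minimal sketch of that kind of update, assuming a toy network with two independent binary predictors (smoking and cholesterol) that are parents of a CHD node; the structure is a simplification and all numbers are hypothetical, not taken from the Busselton BN:

```python
# Hypothetical toy network: Smoking -> CHD <- Cholesterol.
p_smoke = 0.2  # P(heavy smoking)
p_chol = 0.3   # P(bad cholesterol)

# P(CHD event | smoking, cholesterol)
p_chd = {
    (False, False): 0.05,
    (False, True): 0.10,
    (True, False): 0.15,
    (True, True): 0.30,
}

def risk(smoke=None, chol=None):
    """P(CHD | evidence). A predictor left as None is unobserved and summed out."""
    total = 0.0
    for s in (True, False):
        if smoke is not None and s != smoke:
            continue
        weight_s = 1.0 if smoke is not None else (p_smoke if s else 1 - p_smoke)
        for c in (True, False):
            if chol is not None and c != chol:
                continue
            weight_c = 1.0 if chol is not None else (p_chol if c else 1 - p_chol)
            total += weight_s * weight_c * p_chd[(s, c)]
    return total

no_evidence = risk()
heavy_smoker = risk(smoke=True)
smoker_bad_chol = risk(smoke=True, chol=True)
```

Each additional observed risk factor replaces an average over the population with a specific row of the table, which is why the risk estimate climbs as more adverse factors are entered.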

  11. A risk assessment tool for clinicians • Previous tool: TAKEHEART • Combine risk assessment (probability) with costs.

  12. Risk Assessment Tool: example • Young, predictor not observed: don't treat • Young, predictor observed: don't treat • Old, predictor not observed: treat • Not so old, predictor not observed: treat
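Slide 11 says the tool combines risk (a probability) with costs, and the example above shows treat / don't treat outcomes. One simple way to combine the two is an expected-cost comparison, sketched below. The cost figures, the relative risk reduction and the function names are hypothetical, and the slides do not say which decision rule the tool actually uses.

```python
def expected_costs(p_event, cost_event, cost_treatment, relative_risk_reduction):
    """Return (expected cost without treatment, expected cost with treatment)."""
    no_treat = p_event * cost_event
    treated_risk = p_event * (1.0 - relative_risk_reduction)
    treat = cost_treatment + treated_risk * cost_event
    return no_treat, treat

def should_treat(p_event, cost_event, cost_treatment, relative_risk_reduction):
    """Treat only when treatment lowers the expected cost."""
    no_treat, treat = expected_costs(
        p_event, cost_event, cost_treatment, relative_risk_reduction
    )
    return treat < no_treat

# Hypothetical inputs: event cost 100,000, treatment cost 3,000, 30% risk reduction.
low_risk_decision = should_treat(0.05, 100_000, 3_000, 0.30)   # e.g. a young patient
high_risk_decision = should_treat(0.25, 100_000, 3_000, 0.30)  # e.g. an older patient
```

Under this rule treatment pays off once the avoided expected event cost exceeds the treatment cost, so the same treatment can be declined at low risk and recommended at high risk.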

  13. CaMML: a causal learner • Developed at Monash University • Data mines BNs from epidemiological data • Minimum message length (MML) metric: trades off model complexity against goodness of fit • MCMC search over model space
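The complexity versus fit trade-off can be illustrated with a crude two-part code: bits to state the model's parameters plus bits to state the data given the model. The sketch below decides whether a single arc X -> Y is worth its extra parameter. It is a simplified stand-in (a fixed 0.5·log2(N) bits per parameter and smoothed frequency estimates), not the MML metric CaMML uses, and it omits the MCMC search; all names and data are hypothetical.

```python
import math

def bernoulli_bits(ones, zeros):
    """Bits to encode a binary sequence using a smoothed frequency estimate."""
    n = ones + zeros
    p = (ones + 0.5) / (n + 1.0)
    return -(ones * math.log2(p) + zeros * math.log2(1.0 - p))

def two_part_length(pairs, with_arc):
    """Total bits for binary (x, y) pairs, with or without the arc X -> Y."""
    n = len(pairs)
    bits_per_param = 0.5 * math.log2(n)  # crude, assumed cost of one parameter
    if not with_arc:
        ones = sum(y for _, y in pairs)
        return bits_per_param + bernoulli_bits(ones, n - ones)
    total = 2 * bits_per_param  # one P(Y=1 | X=x) parameter per value of X
    for x_val in (0, 1):
        ys = [y for x, y in pairs if x == x_val]
        total += bernoulli_bits(sum(ys), len(ys) - sum(ys))
    return total

# Hypothetical data sets of 100 cases each.
dependent = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(1, 1)] * 40
independent = [(0, 0)] * 25 + [(0, 1)] * 25 + [(1, 0)] * 25 + [(1, 1)] * 25

arc_helps = two_part_length(dependent, True) < two_part_length(dependent, False)
arc_hurts = two_part_length(independent, True) > two_part_length(independent, False)
```

When Y really depends on X, the better fit saves more bits than the extra parameter costs; when it does not, the arc only adds parameter cost and the simpler model gives the shorter message.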

  14. CaMML: example BN

  15. CaMML: example BN

  16. Evaluation • Predicting 10-year risk of CHD using Busselton data • Metrics: • ROC curves (area under curve) • Bayesian Information Reward (BIR) • Experiment 1: • Compare Busselton, PROCAM and CaMML BNs • Experiment 2: • Compare CaMML and other standard machine learners (from Weka) • 90-10 training/testing split, 10-fold cross-validation
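The evaluation protocol above can be sketched in a few lines: an area-under-ROC computation and a 10-fold split in which each fold trains on 90% of the cases and tests on 10%. The AUC here is computed as the fraction of (event, non-event) pairs ranked correctly, which equals the area under the ROC curve. Function names are hypothetical, and the fold assignment is a plain round-robin rather than whatever Weka or the authors used.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of k round-robin folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
uninformative = auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
splits = list(k_fold_indices(100, k=10))
```

An AUC of 0.5 corresponds to a predictor that cannot rank cases at all, and 1.0 to one that ranks every event above every non-event.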

  17. Experiment 1: ROC Results • Area under curve (AUC) • Extremes: Everyone at risk! / No-one at risk! • Figure label: priors

  18. Experiment 2: ROC Results

  19. Experiment 2: Bayesian Info Reward
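The slides do not give the BIR formula. The sketch below assumes the commonly stated form of Bayesian Information Reward: for each class, the reward is log(p'/p) if the class is the true one and log((1-p')/(1-p)) otherwise, where p' is the predicted probability and p the prior, and the per-case score is the average over classes. Treat the formula, the function name and the example numbers as assumptions; the 14% prior echoes the base rate for women quoted on slide 3 and is used only for illustration.

```python
import math

def bir_single(pred, prior, true_class):
    """Information reward (in bits) for one case, averaged over classes.

    pred, prior -- probabilities per class, each strictly between 0 and 1
    true_class  -- index of the class that actually occurred
    """
    total = 0.0
    for i, (p_hat, p0) in enumerate(zip(pred, prior)):
        if i == true_class:
            total += math.log2(p_hat / p0)
        else:
            total += math.log2((1.0 - p_hat) / (1.0 - p0))
    return total / len(pred)

prior = [0.86, 0.14]  # [no CHD event, CHD event]

# The case below did have a CHD event (true class index 1).
echoing_prior = bir_single([0.86, 0.14], prior, 1)
confident_right = bir_single([0.10, 0.90], prior, 1)
confident_wrong = bir_single([0.90, 0.10], prior, 1)
```

Under this form a learner that simply reports the prior earns zero, so the score measures information gained over the base rate, which is what makes it an "absolute" rather than a ranking measure. Predicted probabilities of exactly 0 or 1 are excluded because the logarithm is undefined or infinite there.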

  20. Summary of Results • Experiment 1 (models of whole data) • PROCAM model does at least as well as Busselton • On Busselton data • For both "relative" (ROC) and "absolute" (BIR) risk • CaMML models do as well • But much simpler: only 4 nodes matter to CHD10! • Experiment 2 (cross-validation of learners) • Logistic regression does best on both metrics • Statistically powerful: only 1 parameter per arc • No search required: structure is given • No discretization necessary
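The "only 1 parameter per arc" point can be made concrete by counting parameters. A node with k binary parents needs one probability per parent configuration in a full conditional probability table, whereas a logistic model needs one weight per parent plus an intercept. The sketch below assumes binary parents and a binary child; the function names and the example weights are hypothetical.

```python
import math

def cpt_parameters(n_binary_parents):
    """Free parameters in a full CPT for a binary child: one per parent configuration."""
    return 2 ** n_binary_parents

def logistic_parameters(n_parents):
    """Free parameters in a logistic model: one weight per arc plus an intercept."""
    return n_parents + 1

def logistic_risk(x, weights, intercept):
    """P(event | x) under a logistic model."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Parameter counts as the number of parents grows.
counts = [(k, cpt_parameters(k), logistic_parameters(k)) for k in (2, 4, 8)]

# Hypothetical weights for four binary risk factors.
baseline = logistic_risk([0, 0, 0, 0], [0.8, 0.6, 0.5, 0.4], -3.0)
all_present = logistic_risk([1, 1, 1, 1], [0.8, 0.6, 0.5, 0.4], -3.0)
```

With the same amount of data, each logistic weight is estimated from all cases, while each CPT entry is estimated only from the cases matching one parent configuration. The price is the assumption that effects add on the log-odds scale.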

  21. Conclusions • Busselton & PROCAM models appear to perform equally well on Busselton data, using an absolute risk measure (BIR) from the literature • CaMML results suggest the data have high variance and are too weak to support inference to complex models. Combining data would help.

  22. Future directions • Improve data mining by • Adding prior knowledge to search • Assessing whether data sources can be combined; if so, do so • Investigate combination of continuous and discrete variables in data mining and modeling • Develop new TAKEHEART model using BNs (taking the best from experts, literature, data mining) • with intervention modeling (Causal Reckoner) • with decision support • with GUI, usable by clinicians

  23. References • G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Munster (PROCAM) study. Circulation, 105(3):310-315, 2002. • M.W. Knuiman, H.T. Vu and H. C. Bartholomew. Multivariate risk estimation for coronary heart disease: the Busselton Health Study, Australian & New Zealand Journal of Public Health, 22:747-753, 1998. • C.S. Wallace and K.B. Korb. Learning Linear Causal Models by MML Sampling, In A. Gammerman, editor, Causal Models and Intelligent Data Management, pages 89-111. Springer-Verlag, 1999. www.datamining.monash.edu.au/software/camml • C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering cardiovascular Bayesian networks from the literature, Technical Report 2005/170, School of CSSE, Monash University, 2005.
