
Dental Data Mining: Practical Issues and Potential Pitfalls

Stuart A. Gansky, University of California, San Francisco, Center to Address Disparities in Children’s Oral Health. Support: US DHHS/NIH/NIDCR U54 DE14251.


Presentation Transcript


  1. Dental Data Mining: Practical Issues and Potential Pitfalls
     Stuart A. Gansky, University of California, San Francisco
     Center to Address Disparities in Children’s Oral Health
     Support: US DHHS/NIH/NIDCR U54 DE14251

  2. What is Knowledge Discovery and Data Mining (KDD)?
     • “Semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data” – MIT Tech Review (2001)
     • Interface of
        • Artificial Intelligence – Machine Learning
        • Computer Science – Engineering – Statistics
     • Association for Computing Machinery Special Interest Group on Knowledge Discovery in Data and Data Mining (ACM SIGKDD), which sponsors the KDD Cup

  3. Data Mining as Alchemy: Pb → Au

  4. Some Potential KDD Applications in Oral Health Research
     • Large surveys (e.g., NHANES)
     • Longitudinal studies (e.g., VA Aging Study)
     • Disease registries (e.g., SEER)
     • Digital diagnostics (radiographic & others)
     • Molecular biology (e.g., PCR, microarrays)
     • Health services research / claims data
     • Provider and workforce databases

  5. Supervised Learning
     • Regression
     • k nearest neighbor
     • Trees (CART, MART, boosting, bagging)
     • Random Forests
     • Multivariate Adaptive Regression Splines (MARS)
     • Neural Networks
     • Support Vector Machines
     Unsupervised Learning
     • Hierarchical clustering
     • k-means

  6. KDD Steps
     • Collect & Store: Sample, Merge, Warehouse
     • Pre-Process: Clean, Impute, Transform, Standardize, Register
     • Analyze: Supervised, Unsupervised, Visualize
     • Validate: Internal (Split Sample, Cross-validate, Bootstrap), External
     • Act: Intervene, Set Policy

  7. Data Quality

  8. Example – Caries
     • Predicting disease with traditional logistic regression may run into modelling difficulties: nonlinearity (ANN better) & interactions (CART better) (Kattan et al, Comp Biomed Res, ’98)
     • Goal: compare the performance of logistic regression with popular data mining techniques (tree and artificial neural network models) in dental caries data
     • CART in caries (Stewart & Stamm, JDR, ’91)

  9. Example study – child caries
     • Background: ~20% of children have ~80% of caries (tooth decay)
     • University of Rochester longitudinal study (Leverett et al, J Dent Res, 1993)
     • 466 1st-2nd graders caries-free at baseline
     • Saliva samples & exams every 6 months
     • Goal: Predict 24-month caries incidence (output)

  10. 18-month Predictors (Inputs)
      • Salivary bacteria
         • Mutans Streptococci (log10 CFU/ml)
         • Lactobacilli (log10 CFU/ml)
      • Salivary chemistry
         • Fluoride (ppm)
         • Calcium (mmol/l)
         • Phosphate (ppm)

  11. Modeling Methods
      • Logistic Regression
      • Neural Networks
      • Decision Trees

  12. Logistic Regression Models [Schematic surface: logit (primary dentition caries) over log10 Mutans Streptococci and fluoride (F, ppm)]

  13. Tree Models [Schematic surface: logit (primary dentition caries) over log10 Mutans Streptococci and fluoride (F, ppm)]

  14. Artificial Neural Networks [Schematic surface: logit (primary dentition caries) over log10 Mutans Streptococci and fluoride (F, ppm)]

  15. Artificial Neural Network (p-r-1) [Diagram: p inputs x1, ..., xp feed r hidden-layer neurons h1, ..., hr through weights wij; the hidden neurons feed the single output y through weights wj]

  16. Common Mistakes with ANN (Schwarzer et al, Stat Med, 2000)
      • Too many parameters for sample size
      • No validation
      • No model complexity penalty (e.g., Akaike Information Criterion (AIC))
      • Incorrect misclassification estimation
      • Implausible function
      • Incorrectly described network complexity
      • Inadequate statistical competitors
      • Insufficient comparison with statistical competitors

  17. Validation
      • Split sample (70% training / 30% validation): the validation set gives an unbiased estimate of misclassification
      • K-fold cross-validation: mean squared error (Brier score)
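The K-fold idea on the validation slide can be sketched in a few lines. This is a minimal illustration, not the analysis behind the slides: the function names, the toy "predict the training prevalence" model, and the 20 hypothetical outcomes are all assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def brier_score(y_true, p_pred):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def cross_validated_brier(y, k=5, seed=0):
    """Average held-out Brier score over k folds.

    Toy model (an assumption for illustration): every held-out child gets
    the caries prevalence observed in the training folds as its prediction.
    """
    scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        prevalence = sum(train) / len(train)
        test = [y[i] for i in held_out]
        scores.append(brier_score(test, [prevalence] * len(test)))
    return sum(scores) / len(scores)

# Hypothetical outcomes: 3 cases among 20 children (15% prevalence).
y = [1] * 3 + [0] * 17
print(cross_validated_brier(y, k=5))
```

Each child is held out exactly once, so every prediction is scored on data the model did not see during fitting. A real analysis would replace the toy prevalence model with the logistic, tree, or neural network fit.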

  18. Why Validate? Example: Overfitting in 2 Dimensions

  19. Data

  20. Linear Fit to Data

  21. High Degree Polynomial Fit to Data

  22. 10-Fold Cross-validation

  23. 10-Fold Cross-validation

  24. 10-Fold Cross-validation

  25. Caries Example Model Settings
      • Logit
         • Stepwise selection
         • Alpha=.05 to enter, alpha=.20 to stay
         • AIC to judge additional predictors
      • Tree
         • Splitting criterion: Gini index
         • Pruning: Proportion correctly classified
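For readers unfamiliar with the Gini splitting criterion named in the tree settings, here is a minimal sketch for a binary outcome. The function names and the example counts are hypothetical; the slides do not show how the software computed the criterion.

```python
def gini(n_pos, n_neg):
    """Gini impurity of a node with n_pos cases and n_neg non-cases.

    For a binary outcome this is 1 - p^2 - (1 - p)^2 = 2p(1 - p).
    """
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    p = n_pos / n
    return 2 * p * (1 - p)

def gini_decrease(parent, left, right):
    """Impurity reduction from splitting `parent` into `left` and `right`.

    Each argument is a (n_pos, n_neg) tuple; the child impurities are
    weighted by the share of the parent's observations they receive.
    """
    n = sum(parent)
    weighted = (sum(left) / n) * gini(*left) + (sum(right) / n) * gini(*right)
    return gini(*parent) - weighted

# Hypothetical node: 15 cases and 85 non-cases, split into two children.
print(gini_decrease((15, 85), (10, 10), (5, 75)))
```

A tree-growing algorithm evaluates candidate cut points on each predictor and keeps the split with the largest decrease.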

  26. ANN Settings
      • Artificial Neural Network (5-3-1 = 22 df)
      • Multilayer perceptron
      • 5 Preliminary runs
      • Levenberg-Marquardt optimization
      • No weight decay parameter
      • Average error selection
      • 3 Hidden nodes/neurons
      • Activation function: hyperbolic tangent
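The "5-3-1 = 22 df" figure follows from counting weights and biases: each of the 3 hidden neurons has 5 input weights plus a bias (18), and the output has 3 weights plus a bias (4). The sketch below shows the count and a forward pass with tanh hidden units. The logistic output function is an assumption (the slides name only the hidden activation), and the function names and zero weights are purely illustrative.

```python
import math

def mlp_parameter_count(p, r):
    """Weights and biases in a p-r-1 network: r*(p + 1) hidden plus (r + 1) output."""
    return r * (p + 1) + (r + 1)

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass of a p-r-1 multilayer perceptron.

    Hidden layer uses tanh, as on the slide. The logistic output is an
    assumption made here so the result can be read as a probability.
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(hidden_w, hidden_b)]
    z = sum(w * hj for w, hj in zip(out_w, h)) + out_b
    return 1.0 / (1.0 + math.exp(-z))

print(mlp_parameter_count(5, 3))  # 5 inputs, 3 hidden neurons

# Hypothetical all-zero weights, only to show the call shape.
x = [0.0] * 5
print(mlp_forward(x, [[0.0] * 5] * 3, [0.0] * 3, [0.0] * 3, 0.0))
```

The same count gives 15 parameters for 2 hidden neurons and 29 for 4, which is one way to judge network size against the number of children available for training.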

  27. ANN Sensitivity Analyses
      • Random seeds: 5 values
         • No differences
      • Weight decay parameters: 0, .001, .005, .01, .25
         • Only slight differences for .01 and .25
      • Hidden nodes/neurons: 2, 3, 4
         • 3 seems best

  28. Tree Model
      Overall primary caries prevalence: 15% (N=322 training, N=144 validation)
      Legend: nodes are marked by whether node prevalence is above or below the overall prevalence (15%)
      Splits shown, with caries prevalence in each node:
      • log10 MS < 7.08: 15% | log10 MS ≥ 7.08: 91%
      • log10 LB < 3.05: 10% | log10 LB ≥ 3.05: 23%
      • F ≥ .110: 0% | F < .110: 100%
      • log10 MS < 3.91: 3% | log10 MS ≥ 3.91: 14%
      • F < .056: 22% | F ≥ .056: 25%

  29. Receiver Operating Characteristic (ROC) Curves
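The area under an ROC curve can be read as the probability that a randomly chosen child with caries receives a higher predicted risk than a randomly chosen child without. A minimal sketch of that pairwise definition follows; the function name and the example scores are hypothetical, and the slides do not say how the AUC values were computed.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.

    Counts the share of (case, non-case) pairs in which the case has the
    higher score; tied scores count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks for two non-cases and two cases.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 means the ranking is no better than chance, and 1.0 means every case outranks every non-case.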

  30. Cumulative Captured Response Curves

  31. Lift Chart
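Cumulative captured response and lift both start from ranking children by predicted risk. The sketch below uses common definitions (share of all cases found in the top fraction, and prevalence in the top fraction relative to overall prevalence); these definitions, the function names, and the example data are assumptions, since the slides show only the charts.

```python
def _top_outcomes(y_true, scores, fraction):
    """Outcomes of the top `fraction` of observations, ranked by score."""
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    return [y for _, y in ranked[:n_top]]

def cumulative_captured_response(y_true, scores, fraction):
    """Share of all cases that fall in the top `fraction` by predicted risk."""
    return sum(_top_outcomes(y_true, scores, fraction)) / sum(y_true)

def lift(y_true, scores, fraction):
    """Prevalence in the top `fraction` divided by the overall prevalence."""
    top = _top_outcomes(y_true, scores, fraction)
    return (sum(top) / len(top)) / (sum(y_true) / len(y_true))

# Hypothetical data: 10 children, 2 cases, scores already in risk order.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
print(cumulative_captured_response(y, s, 0.2), lift(y, s, 0.2))
```

A lift above 1 in the top group means the model concentrates cases there more than random selection would, which matters when only a fraction of children can be targeted for prevention.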

  32. Logistic Regression
                 Beta   Std Err   Odds Ratio   95% CI
      log10 MS   .238   .072      1.27         1.10 – 1.46
      log10 LB   .311   .070      1.36         1.19 – 1.57
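The odds ratios and confidence intervals in this table follow from the coefficients and standard errors by exponentiation. A minimal sketch, assuming a Wald-type interval with z = 1.96 (the slide does not state the interval method; the function name is hypothetical):

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald 95% CI from a logistic regression coefficient.

    Returns (exp(beta), exp(beta - z*se), exp(beta + z*se)).
    """
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

# Coefficients and standard errors as reported on the slide.
for name, beta, se in [("log10 MS", 0.238, 0.072), ("log10 LB", 0.311, 0.070)]:
    or_, lo, hi = odds_ratio_ci(beta, se)
    print(f"{name}: OR {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Rounded to two decimals, these calculations reproduce the tabled values. Each odds ratio is per one-unit increase in log10 CFU/ml, that is, per tenfold increase in bacterial count.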

  33. MARS – MS at 4 Times

  34. Predicted Quintiles [Plot: standardized LOGMS4 (y-axis, -2 to 2) by Rank for Variable PR_ANN (x-axis, ranks 0 to 4)]

  35. Predicted Quintiles [Plot: standardized LOGLB4 (y-axis, -2 to 2) by Rank for Variable PR_ANN (x-axis, ranks 0 to 4)]

  36. 5-fold CV Results
                  Logit   Tree   ANN
      RMS error   .365    .363   .362
      AUC         .680    .553   .707

  37. Summary
      • Data quality and study design are paramount
      • Utilize multiple methods
      • Be sure to validate
      • Graphical displays help interpretations
      • KDD methods may provide advantages over traditional statistical models in dental data

  38. Prediction as good as the data and model
