Data, Databases, and Discovery
Data, Databases, and Discovery. Andy Novobilski, PhD, UT Chattanooga Computer Science. Nuts and Bolts Research Methods Symposium, UT College of Medicine Chattanooga, September 29, 2006. An Introduction to Knowledge Discovery. Data Collection, Data Validation, Preprocessing of Data
Presentation Transcript
Data, Databases, and Discovery • Andy Novobilski, PhD • UT Chattanooga Computer Science • Nuts and Bolts Research Methods Symposium • UT College of Medicine Chattanooga • September 29, 2006
An Introduction to Knowledge Discovery • Data Collection • Data Validation • Preprocessing of Data • Mining the Data • Comparing Methods
Data Collection … • Paper or Electronic? • Fingernet • Continuous or Discrete? • And the Understatement of the Year … • Health Insurance Portability and Accountability Act of 1996 • The HIPAA website http://www.hipaa.org/ links to the government’s website http://aspe.hhs.gov/admnsimp/, which states “Administrative Simplification in the Health Care Industry”
… And Raw Storage … • Alphanumeric Data • Excel Worksheets • Comma/Tab Delimited Text Files • XML: The Extensible Markup Language (http://www.xml.com/) • Binary Data • Images: GIF, BMP, EPS • Streaming Data • HL7: http://www.hl7.org/ (http://en.wikipedia.org/wiki/HL7) • DICOM: http://medical.nema.org/
… Stored in a Relational Manner • Relational Databases • Inexpensive: MS Access • Expensive: MS SQL Server, Oracle, Sybase, … • Free (sort of … open source): MySQL, PostgreSQL • Licensing Varies by Usage
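To make "stored in a relational manner" concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. SQLite is not one of the products named on the slide; it is used here only because it needs no server. The table names, column names, and values are hypothetical.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Two related tables (hypothetical schema): one row per patient,
# many visit rows pointing back to a patient.
conn.executescript("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        gender     TEXT
    );
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        patient_id INTEGER REFERENCES patient(patient_id),
        temp_f     REAL
    );
""")

conn.executemany("INSERT INTO patient VALUES (?, ?)", [(1, "F"), (2, "M")])
conn.executemany(
    "INSERT INTO visit VALUES (?, ?, ?)",
    [(10, 1, 98.6), (11, 1, 99.1), (12, 2, 98.2)],
)

# The join recombines the two tables on the shared key.
rows = conn.execute("""
    SELECT p.patient_id, p.gender, COUNT(v.visit_id), AVG(v.temp_f)
    FROM patient p
    JOIN visit v ON v.patient_id = p.patient_id
    GROUP BY p.patient_id
    ORDER BY p.patient_id
""").fetchall()

for row in rows:
    print(row)
```

Keeping patients and visits in separate tables means a patient's gender is stored once, however many visits are recorded.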
Data Validation • Patient 002 is a … • Pregnant Male (hit the 9 instead of 0) • With Ice Water in His Veins (misplaced decimal) • Who Might or Might Not Smoke (missing data)
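The three problems with Patient 002 can each be caught by a simple rule. The sketch below is one possible way to express such rules; the field names, the 0/1 coding, and the plausible temperature range are assumptions made for illustration, not taken from the talk.

```python
def validate(record):
    """Return a list of human-readable problems found in one patient record."""
    problems = []

    # Cross-field consistency: a keying error in one field contradicts another.
    if record.get("gender") == "M" and record.get("pregnant") == 1:
        problems.append("pregnant male")

    # Range check: a misplaced decimal pushes the value outside plausible limits.
    # The 90-110 degree F window is an assumed threshold.
    temp = record.get("temp_f")
    if temp is not None and not (90.0 <= temp <= 110.0):
        problems.append("temperature out of range")

    # Missing data: the value was never recorded.
    if record.get("smoker") is None:
        problems.append("smoking status missing")

    return problems


# Hypothetical record mirroring the slide's Patient 002.
patient_002 = {"id": "002", "gender": "M", "pregnant": 1, "temp_f": 9.86, "smoker": None}
issues = validate(patient_002)
print(issues)
```

Returning a list of problems rather than stopping at the first one lets a reviewer fix every issue in a record in a single pass.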
Preprocessing the Data • Clean-up • Out of Scope vs. Out of Family • Feature Extraction • Data Aggregation • Feature Transformation • Normalization • Principal Component Analysis
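As an illustration of the normalization step only, the sketch below shows two common rescalings, min-max scaling and z-scores, using the standard library. The slide does not say which form of normalization is meant, so both are shown; the `ages` values are made up.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly onto [0, 1]. Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation.

    Assumes the values are not all identical (non-zero deviation).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [20, 30, 40, 50, 60]  # hypothetical feature column
scaled = min_max(ages)
standardized = z_score(ages)
print(scaled)
print(standardized)
```

Either rescaling puts features measured in different units on a comparable footing before distance-based methods such as clustering are applied.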
Turning Data into Information • Data Mining … • Clustering • Decision Trees • Neural Networks • Bayesian Networks
Clustering • K-Means • [Figure: K-means example with cases labeled Y and N]
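K-means alternates between assigning each case to its nearest centroid and moving each centroid to the mean of its cases. The sketch below is a bare-bones version of that loop with a fixed iteration count; the sample points, the seed, and the choice of initial centroids (a random sample of the data) are assumptions for illustration.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]

    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid where it is
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))

    return centroids, clusters


# Two visibly separate groups of hypothetical 2-D cases.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

Note that the algorithm never sees the Y/N labels; it groups cases purely by distance, and the labels can be compared against the clusters afterwards.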
Decision Trees • Division of Data Based on Information Gain • White Box • [Figure: example decision tree splitting on Gender (M/F), Smoker (Y/N), and Age, with Y/N outcomes at the leaves]
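Information gain is the drop in entropy of the outcome after splitting the data on an attribute; a tree builder picks the attribute with the largest gain at each node. The sketch below computes it for a toy table. The rows and the attribute names are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical cases: here smoking alone determines the outcome.
rows = [
    {"gender": "M", "smoker": "Y", "outcome": "Y"},
    {"gender": "M", "smoker": "N", "outcome": "N"},
    {"gender": "F", "smoker": "Y", "outcome": "Y"},
    {"gender": "F", "smoker": "N", "outcome": "N"},
]

gain_smoker = information_gain(rows, "smoker", "outcome")
gain_gender = information_gain(rows, "gender", "outcome")
best = max(["gender", "smoker"], key=lambda a: information_gain(rows, a, "outcome"))
print(gain_smoker, gain_gender, best)
```

Because each split is a readable test on a named attribute, the resulting tree can be inspected directly, which is what makes the method a white box.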
Neural Networks • Functional Approximation to Data • Black Box • Most Common is Feed Forward, Back Propagation • Considerations in Training the Network • Many Types of Neural Networks • Difficulties with Discrete Data • Missing Data Requires Careful Consideration • [Figure: Case Data, Forecast]
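For readers who have not seen one, the sketch below is a very small feed-forward network trained with back propagation, written with the standard library only. The layer size, learning rate, epoch count, seed, and the toy OR data are all assumptions chosen to keep the example short; real training needs the considerations the slide lists.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, hidden=3, epochs=2000, rate=0.5, seed=1):
    """Train a one-hidden-layer network; returns (forward, loss history)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each weight list ends with a bias term.
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(x):
        # zip stops at the shorter sequence, so the bias is added separately.
        h = [sigmoid(sum(w * v for w, v in zip(ws, x)) + ws[-1]) for ws in w_h]
        y = sigmoid(sum(w * v for w, v in zip(w_o, h)) + w_o[-1])
        return h, y

    def loss():
        return sum((forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    history = [loss()]
    for _ in range(epochs):
        for x, t in samples:
            h, y = forward(x)
            # Back propagation: output error first, then pushed back to hidden units.
            delta_o = (y - t) * y * (1 - y)
            delta_h = [delta_o * w_o[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                w_o[j] -= rate * delta_o * h[j]
            w_o[-1] -= rate * delta_o
            for j in range(hidden):
                for i in range(n_in):
                    w_h[j][i] -= rate * delta_h[j] * x[i]
                w_h[j][-1] -= rate * delta_h[j]
        history.append(loss())
    return forward, history


# Toy case data: logical OR of two binary inputs.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
predict, history = train(OR_SAMPLES)
print("loss before:", history[0], "loss after:", history[-1])
```

The trained weights are just numbers with no direct clinical reading, which is the sense in which the slide calls the method a black box.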
Bayesian Networks • Belief Networks • White Box • Causal Orientation • Beliefs are Updated Based on New Information • Nodes Can Serve as Both Evidence and Query Points • Handles Missing Data Gracefully
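The points about updating beliefs, using any node as evidence or query, and coping with missing data can be seen in a three-node chain. The sketch below does inference by brute-force enumeration; the network structure (smoker → disease → test) and every probability in it are invented for illustration and are not taken from the talk or the cited paper.

```python
from itertools import product

# Hypothetical conditional probability tables.
P_SMOKER = 0.3                          # P(smoker)
P_DISEASE = {True: 0.2, False: 0.05}    # P(disease | smoker)
P_TEST = {True: 0.9, False: 0.1}        # P(test positive | disease)

NAMES = ("smoker", "disease", "test")


def joint(smoker, disease, test):
    """Probability of one complete assignment, following the chain structure."""
    p = P_SMOKER if smoker else 1 - P_SMOKER
    p *= P_DISEASE[smoker] if disease else 1 - P_DISEASE[smoker]
    p *= P_TEST[disease] if test else 1 - P_TEST[disease]
    return p


def query(variable, evidence):
    """P(variable is True | evidence), summing over every unobserved node."""
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if all(world[k] == v for k, v in evidence.items()):
            totals[world[variable]] += joint(**world)
    return totals[True] / (totals[True] + totals[False])


prior = query("disease", {})                                     # no evidence yet
after_test = query("disease", {"test": True})                    # smoker status missing
after_both = query("disease", {"test": True, "smoker": True})    # more evidence
smoker_given_test = query("smoker", {"test": True})              # same node as a query
print(prior, after_test, after_both, smoker_given_test)
```

A missing smoking status needs no special handling here: the unobserved node is simply summed out, and the belief is revised again if the value later becomes known.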
An Example • Novobilski, Andrew, F. Fesmire, D. Sonnemaker. "Mining Bayesian Networks to Forecast Adverse Outcomes Related to Acute Coronary Syndrome." The 17th International FLAIRS Conference, 2004.
Comparing Models – The ROC Curve • The Receiver Operating Characteristic (ROC) Curve • Plots the percentage of true positives against the percentage of false positives as the cutoff value is varied from everyone classified as ill to everyone classified as healthy. • The area under the curve provides a consistent measure of model fitness that varies between 0 and 100 when expressed as a percentage.
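The construction of the curve can be sketched in a few lines: sweep the cutoff across the model's scores, record the true and false positive rates at each cutoff, and integrate. The sketch takes the 0-to-100 fitness measure to be the area under the curve, which is an assumption since the slide does not name it. The scores and labels are invented, and the sweep runs from the strictest cutoff (no one classified as ill) to the loosest.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutoff.

    labels use 1 for ill and 0 for healthy; a case is called ill when its
    score is at or above the cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: no one called ill
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule, as a fraction of 1."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical model outputs and true outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = auc(points)
print(points)
print("area as a percentage:", 100 * area)
```

Because the area depends only on how the model ranks ill cases relative to healthy ones, it can be compared across methods whose raw scores are on different scales.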
An Illustration • [Figure: Healthy and Ill groups separated by a Cutoff Value]
In Summary … • A Process to Consider … • Collect, Validate, Preprocess, Mine, Compare • Excellent Software is Available • Both Commercial and Open Source • Sample Data Is Available
Thank You! • Questions and/or Comments are Welcome … • Dr. Andy Novobilski • UT Chattanooga Computer Science • 615 McCallie Ave., Dept. 2302, Chattanooga, TN 37403 • (423) 425-4202 • Andy-Novobilski@utc.edu • http://www.utc.edu/Faculty/Andy-Novobilski