Data mining and its application and usage in medicine

Data mining and its application and usage in medicine By Radhika

Data Mining and Medicine • History • Past 20 years with relational databases • More dimensions to database queries • Earliest and most successful area of data mining • Mid 1800s: London hit by infectious disease • Two theories • Miasma theory: bad air propagated disease • Germ theory: water-borne • Advantages • Discover trends even when we don’t understand reasons • Caveat: may discover irrelevant patterns that confuse rather than enlighten • Protection against unaided human inference of patterns: provides quantifiable measures and aids human judgment • Data Mining • Patterns persistent and meaningful • Knowledge Discovery in Databases (KDD)

The future of data mining • 10 biggest killers in the US • Data mining = Process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data

Major Issues in Medical Data Mining • Heterogeneity of medical data • Volume and complexity • Physician’s interpretation • Poor mathematical categorization • Canonical Form • Solution: standard vocabularies, interfaces between different data sources for data integration, design of electronic patient records • Ethical, Legal and Social Issues • Data Ownership • Lawsuits • Privacy and Security of Human Data • Expected benefits • Administrative Issues

Why Data Preprocessing? • Patient records consist of clinical and lab parameters and results of particular investigations, specific to tasks • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • Noisy: containing errors or outliers • Inconsistent: containing discrepancies in codes or names • Temporal: parameters of chronic diseases • No quality data, no quality mining results! • Data warehouse needs consistent integration of quality data • In the medical domain, handling incomplete, inconsistent or noisy data requires people with domain knowledge
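
The three defects named on this slide (incomplete, noisy, inconsistent) can be checked mechanically before any mining starts. The following is a minimal sketch in Python, not the procedure used for the deck's data set: the record fields, the `SEX_CODES` map and the `BMI_RANGE` limits are hypothetical and would in practice come from people with domain knowledge.

```python
# Hypothetical patient records; None marks a missing attribute value.
records = [
    {"id": 1, "sex": "F", "glucose": 148, "bmi": 33.6},
    {"id": 2, "sex": "female", "glucose": None, "bmi": 26.6},  # incomplete, inconsistent code
    {"id": 3, "sex": "F", "glucose": 137, "bmi": 430.1},       # noisy: implausible BMI
    {"id": 4, "sex": "f", "glucose": 89, "bmi": 28.1},         # inconsistent code
]

SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}  # assumed canonical codes
BMI_RANGE = (10.0, 80.0)                                      # assumed plausible range


def clean(records):
    """Normalize codes, then separate usable records from rejected ones."""
    kept, rejected = [], []
    for rec in records:
        rec = dict(rec)  # work on a copy, leave the input untouched
        # Inconsistent: map every spelling of a code to one canonical form.
        rec["sex"] = SEX_CODES.get(rec["sex"].lower(), rec["sex"])
        if any(value is None for value in rec.values()):
            rejected.append((rec["id"], "incomplete"))
        elif not BMI_RANGE[0] <= rec["bmi"] <= BMI_RANGE[1]:
            rejected.append((rec["id"], "noisy"))
        else:
            kept.append(rec)
    return kept, rejected


kept, rejected = clean(records)
```

Rejecting a record is only one option; whether a missing value should instead be imputed is a domain decision that this sketch does not address.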

What is Data Mining? The KDD Process • [Figure: the KDD process] Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge

From Tables and Spreadsheets to Data Cubes • A data warehouse is based on a multidimensional data model that views data in the form of a data cube • A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions • Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) • Fact table contains measures (such as dollars_sold) and keys to each of related dimension tables • W. H. Inmon:“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”

Data Warehouse vs. Heterogeneous DBMS • Data warehouse: update-driven, high performance • Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis • Do not contain most current information • Query processing does not interfere with processing at local sources • Store and integrate historical information • Support complex multidimensional queries

Data Warehouse vs. Operational DBMS • OLTP (on-line transaction processing) • Major task of traditional relational DBMS • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. • OLAP (on-line analytical processing) • Major task of data warehouse system • Data analysis and decision making • Distinct features (OLTP vs. OLAP): • User and system orientation: customer vs. market • Data contents: current, detailed vs. historical, consolidated • Database design: ER + application vs. star + subject • View: current, local vs. evolutionary, integrated • Access patterns: update vs. read-only but complex queries

Why Separate Data Warehouse? • High performance for both systems • DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery • Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation • Different functions and different data: • Missing data: Decision support requires historical data which operational DBs do not typically maintain • Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources • Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Typical OLAP Operations • Roll up (drill-up): summarize data • by climbing up hierarchy or by dimension reduction • Drill down (roll down): reverse of roll-up • from higher level summary to lower level summary or detailed data, or introducing new dimensions • Slice and dice: • project and select • Pivot (rotate): • reorient the cube, visualization, 3D to series of 2D planes. • Other operations • drill across: involving (across) more than one fact table • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
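
Roll-up and slice can be illustrated on a very small fact table. This is a sketch under stated assumptions: the rows are hypothetical, the dimension names follow the `item` and `time` dimensions mentioned earlier in the deck, and a real OLAP server would run these operations over a warehouse rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact table rows: (item_name, quarter, month, dollars_sold)
facts = [
    ("aspirin", "Q1", "Jan", 100),
    ("aspirin", "Q1", "Feb", 150),
    ("aspirin", "Q2", "Apr", 120),
    ("insulin", "Q1", "Jan", 300),
    ("insulin", "Q2", "May", 250),
]


def roll_up(facts, key):
    """Sum dollars_sold over all rows that share the same key(row)."""
    totals = defaultdict(int)
    for row in facts:
        totals[key(row)] += row[3]
    return dict(totals)


def slice_(facts, quarter):
    """Select the sub-cube for one value of the quarter dimension."""
    return [row for row in facts if row[1] == quarter]


# Roll up by climbing the time hierarchy: month -> quarter.
by_quarter = roll_up(facts, lambda r: (r[0], r[1]))
# Roll up by dimension reduction: drop the time dimension entirely.
by_item = roll_up(facts, lambda r: r[0])
# Slice: keep only the first quarter.
q1 = slice_(facts, "Q1")
```

Drill-down is the reverse direction: going from `by_item` back to `by_quarter` requires the more detailed rows, which is why the warehouse keeps them.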

Multi-Tiered Architecture • [Figure: multi-tiered data warehouse architecture] • Data Sources: operational DBs and other sources • Extract, Transform, Load, Refresh, with Monitor & Integrator and Metadata • Data Storage: Data Warehouse and Data Marts • OLAP Engine: OLAP Server, which serves the front end • Front-End Tools: Analysis, Query, Reports, Data mining

Steps of a KDD Process • Learning the application domain: • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing: (may take 60% of effort!) • Data reduction and transformation: • Find useful features, dimensionality/variable reduction, invariant representation. • Choosing functions of data mining • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge

Common Techniques in Data Mining • Predictive Data Mining • Most important • Classification: Relate one set of variables in data to response variables • Regression: estimate some continuous value • Descriptive Data Mining • Clustering: Discovering groups of similar instances • Association rule extraction • Variables/Observations • Summarization of group descriptions

Leukemia • Different types of cells look very similar • Given a number of samples (patients) • Can we diagnose the disease accurately? • Predict the outcome of treatment? • Recommend the best treatment based on previous treatments? • Solution: Data mining on micro-array data • 38 training patients, 34 testing patients, about 7000 attributes per patient • 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)

Clustering/Instance Based Learning • Uses specific instances to perform classification rather than general IF THEN rules • Nearest Neighbor classifier • Among the most studied algorithms for medical purposes • Clustering: partitioning a data set into several groups (clusters) such that • Homogeneity: Objects belonging to the same cluster are similar to each other • Separation: Objects belonging to different clusters are dissimilar to each other • Three elements • The set of objects • The set of attributes • Distance measure
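
A nearest neighbor classifier in its simplest form (1-NN) labels a new instance with the class of the closest stored instance. The sketch below assumes Euclidean distance and uses hypothetical (glucose, BMI) pairs; real use would need scaled attributes and a distance measure chosen for the data.

```python
import math


def nearest_neighbor(train, query):
    """Return the label of the training instance closest to query (1-NN)."""
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical training instances: ((glucose, bmi), class label)
train = [
    ((85, 22.0), "healthy"),
    ((90, 24.5), "healthy"),
    ((160, 33.0), "diabetic"),
    ((175, 36.5), "diabetic"),
]

label_a = nearest_neighbor(train, (150, 30.0))
label_b = nearest_neighbor(train, (88, 23.0))
```

No general rule is learned here: the stored instances themselves are the model, which is the contrast with IF THEN rules drawn on the slide.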

Measure the Dissimilarity of Objects • Find best matching instance • Distance function • Measure the dissimilarity between a pair of data objects • Things to consider • Usually very different for interval-scaled, boolean, nominal, ordinal and ratio-scaled variables • Weights should be associated with different variables based on applications and data semantics • Quality of a clustering result depends on both the distance measure adopted and its implementation

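Minkowski Distance • Minkowski distance: a generalization, $d(i,j) = \left(\sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q}\right)^{1/q}$ for objects with $p$ attributes • If q = 2, d is Euclidean distance • If q = 1, d is Manhattan distance • Example: for $X_i = (1,7)$ and $X_j = (7,1)$, q = 1 gives 6 + 6 = 12 and q = 2 gives $\sqrt{6^2 + 6^2} \approx 8.49$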
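
The slide's example can be checked with a few lines of Python. The function below is a direct transcription of the usual Minkowski definition (sum of absolute coordinate differences raised to q, then the q-th root); the name `minkowski` is ours.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


xi, xj = (1, 7), (7, 1)           # the two points from the slide
manhattan = minkowski(xi, xj, 1)  # q = 1: 6 + 6
euclidean = minkowski(xi, xj, 2)  # q = 2: square root of 72
```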

Binary Variables • A contingency table for binary data, comparing Object i with Object j • Simple matching coefficient

Dissimilarity between Binary Variables • Example [Table: contingency table for Object 1 and Object 2]
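
The table and formula for this slide did not survive, so the sketch below falls back on the common textbook convention, which is an assumption here: count the attributes where both objects are 1 (a), only the first is 1 (b), only the second is 1 (c), and both are 0 (d), then take the simple matching dissimilarity as (b + c) / (a + b + c + d). The two objects are hypothetical.

```python
def contingency(obj_i, obj_j):
    """Return the counts (a, b, c, d) of the 2x2 contingency table."""
    a = b = c = d = 0
    for vi, vj in zip(obj_i, obj_j):
        if vi and vj:
            a += 1      # both 1
        elif vi and not vj:
            b += 1      # 1 in object i only
        elif not vi and vj:
            c += 1      # 1 in object j only
        else:
            d += 1      # both 0
    return a, b, c, d


def simple_matching_dissimilarity(obj_i, obj_j):
    """Fraction of attributes on which the two objects disagree."""
    a, b, c, d = contingency(obj_i, obj_j)
    return (b + c) / (a + b + c + d)


# Hypothetical binary attribute vectors (e.g. six yes/no findings).
object_1 = (1, 0, 1, 0, 0, 0)
object_2 = (1, 0, 1, 0, 1, 0)
table = contingency(object_1, object_2)
dissimilarity = simple_matching_dissimilarity(object_1, object_2)
```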

k-Means algorithm • Initialization • Arbitrarily choose k objects as the initial cluster centers (centroids) • Iteration until no change • For each object Oi • Calculate the distances between Oi and the k centroids • (Re)assign Oi to the cluster whose centroid is the closest to Oi • Update the cluster centroids based on current assignment
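
The initialization and iteration steps listed above describe the k-means procedure, and they can be sketched in plain Python as follows. This is an illustration, not a tuned implementation: the "arbitrary" initial centroids are simply the first k objects, the distance is Euclidean, and the points are hypothetical.

```python
import math


def k_means(objects, k, max_iter=100):
    """Cluster points by alternating reassignment and centroid updates."""
    centroids = [tuple(o) for o in objects[:k]]  # arbitrary choice: first k objects
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(o, centroids[c]))
            for o in objects
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Update each centroid to the mean of its current members.
        for c in range(k):
            members = [o for o, a in zip(objects, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical 2-D objects forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
assignment, centroids = k_means(points, k=2)
```

On this data the second point starts as its own centroid and is pulled into the first cluster on the second pass, which shows why the loop runs until the assignment stops changing.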

k-Means Clustering Method • [Figure: current clusters → cluster means → objects relocated → new clusters]

Dataset • Data set from UCI repository • http://kdd.ics.uci.edu/ • 768 female Pima Indians evaluated for diabetes • After data cleaning 392 data entries

Hierarchical Clustering • Groups observations based on dissimilarity • Compacts database into “labels” that represent the observations • Measure of similarity/Dissimilarity • Euclidean Distance • Manhattan Distance • Types of Clustering • Single Link • Average Link • Complete Link

Hierarchical Clustering: Comparison • [Figure: clusterings of six points under single-link, complete-link, average-link and centroid distance]

Compare Dendrograms • [Figure: dendrograms of the six points for single-link, complete-link, average-link and centroid distance]

Which Distance Measure is Better? • Each method has both advantages and disadvantages; application-dependent • Single-link • Can find irregular-shaped clusters • Sensitive to outliers • Complete-link, Average-link, and Centroid distance • Robust to outliers • Tend to break large clusters • Prefer spherical clusters
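
The four cluster-to-cluster distances compared above differ only in how they combine the pairwise distances between members. The sketch below assumes Euclidean distance between points; the function names and the two small clusters are ours.

```python
import math
from itertools import product


def single_link(c1, c2):
    """Distance between the closest pair of members."""
    return min(math.dist(p, q) for p, q in product(c1, c2))


def complete_link(c1, c2):
    """Distance between the farthest pair of members."""
    return max(math.dist(p, q) for p, q in product(c1, c2))


def average_link(c1, c2):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))


def centroid_distance(c1, c2):
    """Distance between the two cluster means."""
    def mean(cluster):
        return tuple(sum(col) / len(cluster) for col in zip(*cluster))
    return math.dist(mean(c1), mean(c2))


# Hypothetical clusters.
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (5, 0)]
d_single = single_link(cluster_a, cluster_b)
d_complete = complete_link(cluster_a, cluster_b)
d_average = average_link(cluster_a, cluster_b)
d_centroid = centroid_distance(cluster_a, cluster_b)
```

Because single-link looks only at the closest pair, one stray point between two groups can merge them, which is the outlier sensitivity noted on the slide.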

Dendrogram from dataset (single link) • Minimum spanning tree through the observations • The single observation that is last to join the cluster is a patient whose blood pressure is in the bottom quartile, whose skin thickness is in the bottom quartile and whose BMI is in the bottom half • Her insulin, however, was the largest, and she is a 59-year-old diabetic

Dendrogram from dataset (complete link) • Maximum dissimilarity between observations in one cluster when compared to another

Dendrogram from dataset (average link) • Average dissimilarity between observations in one cluster when compared to another

Supervised versus Unsupervised Learning • Supervised learning (classification) • Supervision: Training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on training set • Unsupervised learning (clustering) • Class labels of training data are unknown • Given a set of measurements, observations, etc., need to establish existence of classes or clusters in data

Classification and Prediction • Derive models that use patient-specific information to aid clinical decision making • A priori decision on predictors and variables to predict • No method to find predictors that are not present in the data • Numeric Response • Least Squares Regression • Categorical Response • Classification trees • Neural Networks • Support Vector Machine • Decision models • Prognosis, Diagnosis and treatment planning • Embed in clinical information systems

Least Squares Regression • Find a linear function of the predictor variables that minimizes the sum of squared differences from the response • Supervised learning technique • Predict insulin in our dataset from glucose and BMI
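
One way to fit such a linear function is to solve the normal equations directly. The sketch below does this in plain Python with Gauss-Jordan elimination; it is meant to show the idea, not to replace a numerical library. The (glucose, BMI) rows are synthetic and the response is generated from a known linear rule, so these are not values from the diabetes data set.

```python
def least_squares(rows, response):
    """Fit response ~ b0 + b1*x1 + ... + bp*xp via the normal equations."""
    X = [[1.0, *row] for row in rows]  # prepend the intercept column
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * y for x, y in zip(X, response))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Synthetic predictors (glucose, bmi) and a response built from a known rule.
rows = [(90, 22.0), (120, 30.5), (150, 27.0), (100, 35.0), (170, 41.0)]
insulin = [10 + 0.5 * glucose + 2 * bmi for glucose, bmi in rows]
coefficients = least_squares(rows, insulin)  # [intercept, glucose, bmi]
```

Since the synthetic response is exactly linear, the fit recovers the generating coefficients up to rounding; on real data the residuals would not vanish.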

Decision Trees • Decision tree • Each internal node tests an attribute • Each branch corresponds to attribute value • Each leaf node assigns a classification • ID3 algorithm • Uses training objects with known class labels to classify testing objects • Rank attributes with information gain measure • Minimal height • least number of tests to classify an object • Used in commercial tools, e.g., Clementine • ASSISTANT • Deal with medical datasets • Incomplete data • Discretize continuous variables • Prune unreliable parts of tree • Classify data

Decision Trees

Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Attributes are categorical (if continuous-valued, they are discretized in advance) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all training examples are at the root • Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain) • Examples are partitioned recursively based on selected attributes
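
The greedy, top-down procedure described above can be sketched compactly. This is a minimal ID3-style illustration assuming categorical attributes and information gain as the selection measure; it has no pruning and no handling of missing values. The six training rows are hypothetical and are not the deck's training set.

```python
import math
from collections import Counter


def entropy(labels):
    """Information needed to classify a tuple drawn from labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning on attr."""
    n = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:   # all examples in one class: leaf
        return labels[0]
    if not attrs:               # nothing left to test: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attrs if a != best],
        )
    return tree


# Hypothetical training examples.
rows = [
    {"age": "<=30", "history": "no"},
    {"age": "<=30", "history": "yes"},
    {"age": "30-40", "history": "no"},
    {"age": "30-40", "history": "yes"},
    {"age": ">40", "history": "no"},
    {"age": ">40", "history": "yes"},
]
labels = ["no", "yes", "yes", "yes", "no", "yes"]
tree = build_tree(rows, labels, ["age", "history"])
```

On this toy data `history` has the higher gain, so it becomes the root test; a different training set can of course put another attribute at the root.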

Training Dataset

Construction of A Decision Tree for “Condition X” • Root [P1,…P14] Yes: 9, No: 5, split on Age? • Age <=30: [P1,P2,P8,P9,P11] Yes: 2, No: 3, split on History • History = no: [P1,P2,P8] Yes: 0, No: 3, leaf NO • History = yes: [P9,P11] Yes: 2, No: 0, leaf YES • Age 30…40: [P3,P7,P12,P13] Yes: 4, No: 0, leaf YES • Age >40: [P4,P5,P6,P10,P14] Yes: 3, No: 2, split on Vision • Vision = excellent: [P6,P14] Yes: 0, No: 2, leaf NO • Vision = fair: [P4,P5,P10] Yes: 3, No: 0, leaf YES

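Entropy and Information Gain • $S$ contains $s_i$ tuples of class $C_i$ for $i \in \{1, \dots, m\}$, with $s = \sum_{i=1}^{m} s_i$ • Information measures info required to classify any arbitrary tuple: $I(s_1,\dots,s_m) = -\sum_{i=1}^{m} \frac{s_i}{s}\log_2\frac{s_i}{s}$ • Entropy of attribute A with values $\{a_1,a_2,\dots,a_v\}$: $E(A) = \sum_{j=1}^{v} \frac{s_{1j}+\dots+s_{mj}}{s}\, I(s_{1j},\dots,s_{mj})$, where $s_{ij}$ is the number of tuples of class $C_i$ with $A = a_j$ • Information gained by branching on attribute A: $\mathrm{Gain}(A) = I(s_1,\dots,s_m) - E(A)$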

Entropy and Information Gain • Select attribute with the highest information gain (or greatest entropy reduction) • Such attribute minimizes information needed to classify samples
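
As a worked check, the class counts from the “Condition X” tree can be plugged in: 9 Yes and 5 No at the root, and (2 Yes, 3 No), (4 Yes, 0 No), (3 Yes, 2 No) in the three Age branches. The figures below are our own arithmetic, rounded to three decimals, with $0 \log_2 0$ taken as 0.

```latex
\begin{align*}
I(9,5) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
I(2,3) &= I(3,2) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971,
\qquad I(4,0) = 0 \\
E(\text{Age}) &= \tfrac{5}{14}\, I(2,3) + \tfrac{4}{14}\, I(4,0) + \tfrac{5}{14}\, I(3,2) \approx 0.694 \\
\mathrm{Gain}(\text{Age}) &= I(9,5) - E(\text{Age}) \approx 0.247
\end{align*}
```

Branching on Age therefore removes roughly a quarter of a bit of uncertainty per tuple; the attribute whose gain is largest is the one placed at the node.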

Rule Induction • IF conditions THEN Conclusion • e.g., CN2 • Concept description: • Characterization: provides a concise and succinct summarization of given collection of data • Comparison: provides descriptions comparing two or more collections of data • Training set, testing set • Imprecise • Predictive Accuracy • P/(P+N)

Example used in a Clinic • Hip arthroplasty: a trauma surgeon predicts a patient’s long-term clinical status after surgery • Outcome evaluated during follow-ups for 2 years • 2 modeling techniques • Naïve Bayesian classifier • Decision trees • Bayesian classifier • P(outcome=good) = 11/20 = 0.55 • Probability gets updated as more attributes are considered • P(timing=good|outcome=good) = 9/11 (0.818) • P(outcome=bad) = 9/20 • P(timing=good|outcome=bad) = 5/9

Nomogram

Bayesian Classification • Bayesian classifier vs. decision tree • Decision tree: predicts the class label • Bayesian classifier: statistical classifier; predicts class membership probabilities • Based on Bayes theorem; estimates posterior probability • Naïve Bayesian classifier: • Simple classifier that assumes attribute independence • High speed when applied to large databases • Comparable in performance to decision trees

Bayes Theorem • Let X be a data sample whose class label is unknown • Let Hi be the hypothesis that X belongs to a particular class Ci • P(Hi) is class prior probability that X belongs to a particular class Ci • Can be estimated by ni/n from training data samples • n is the total number of training data samples • ni is the number of training data samples of class Ci • Bayes theorem: $P(H_i \mid X) = \frac{P(X \mid H_i)\,P(H_i)}{P(X)}$
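
The update from prior to posterior can be traced with the counts quoted in the hip arthroplasty example: 11 of 20 outcomes good, timing good in 9 of the 11 good outcomes and in 5 of the 9 bad ones. The sketch below applies Bayes theorem to that single attribute using exact fractions; the function and variable names are ours.

```python
from fractions import Fraction

# Priors P(Hi), estimated as ni / n from the 20 training cases.
prior = {"good": Fraction(11, 20), "bad": Fraction(9, 20)}
# Likelihoods P(timing=good | outcome), as quoted in the clinic example.
likelihood_timing_good = {"good": Fraction(9, 11), "bad": Fraction(5, 9)}


def posterior(prior, likelihood):
    """Apply Bayes theorem: P(H|X) = P(X|H) P(H) / P(X)."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())  # P(X), summed over all hypotheses
    return {h: joint[h] / evidence for h in joint}


post = posterior(prior, likelihood_timing_good)
```

Observing good timing raises the probability of a good outcome from 11/20 to 9/14. A naïve Bayesian classifier repeats this step for each further attribute, treating the attributes as independent given the class.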
