
Data Mining: Discovering Information From Bio-Data



Presentation Transcript


  1. Data Mining: Discovering Information From Bio-Data Presented by: Hongli Li & Nianya Liu, University of Massachusetts Lowell

  2. Introduction • Data Mining Background • Process • Functionalities • Techniques • Two examples • Short Peptide • Clinical Records • Conclusion

  3. Data Mining Background - Process

  4. Functionalities • Classification • Cluster Analysis • Outlier Analysis • Trend Analysis • Association Analysis

  5. Techniques • Decision Tree • Bayesian Classification • Hidden Markov Models • Support Vector Machines • Artificial Neural Networks

  6. Technique 1 – Decision Tree

  7. Technique 2 – Bayesian Classification • Based on Bayes' theorem • Simple, but comparable to decision tree and neural network classifiers in many applications.
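As an illustration of the Bayesian classification idea on the slide, here is a minimal naive Bayes sketch on made-up categorical data (the data, attribute names, and class labels are invented for the example; the slides do not specify an implementation):

```python
# Minimal naive Bayes classifier on categorical data (toy example).
# Picks the class maximizing P(C) * product of P(attr_value | C).
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Return class priors and per-attribute conditional counts."""
    priors = Counter(labels)
    cond = defaultdict(Counter)          # (attr_index, class) -> value counts
    for row, cls in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, cls)][v] += 1
    return priors, cond

def predict_nb(priors, cond, row):
    total = sum(priors.values())
    best, best_p = None, -1.0
    for cls, n in priors.items():
        p = n / total                    # prior P(C)
        for i, v in enumerate(row):
            # Laplace smoothing avoids zero probabilities
            p *= (cond[(i, cls)][v] + 1) / (n + 2)
        if p > best_p:
            best, best_p = cls, p
    return best

rows = [("high", "old"), ("high", "young"), ("low", "old"), ("low", "young")]
labels = ["sick", "sick", "well", "well"]
priors, cond = train_nb(rows, labels)
print(predict_nb(priors, cond, ("high", "old")))  # -> sick
```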

  8. Technique 3 – Hidden Markov Model

  9. Technique 4 – Support Vector Machine • SVMs find the maximum-margin hyperplane that separates the classes • The hyperplane can be represented as a linear combination of training points • The algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space • By defining a kernel function, we can locate a separating hyperplane in the feature space and classify points in that space
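The key idea on the slide, a decision function written as a linear combination of training points accessed only through dot products (a kernel), can be sketched with a toy kernel perceptron. This is illustrative only: it shares the dual, kernelized form of an SVM but does not do maximum-margin optimization, and the data points are invented:

```python
# Kernel-style dual classifier: the separating function is a linear
# combination of training points, touched only through a kernel (dot product).
def linear_kernel(a, b):
    return sum(x * y for x, y in zip(a, b))

def train(points, labels, kernel=linear_kernel, epochs=20):
    alpha = [0] * len(points)            # dual coefficient per training point
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(points, labels)):
            s = sum(a * yl * kernel(xl, x)
                    for a, xl, yl in zip(alpha, points, labels))
            if y * s <= 0:               # misclassified: strengthen this point
                alpha[i] += 1
    return alpha

def classify(alpha, points, labels, x, kernel=linear_kernel):
    s = sum(a * y * kernel(xt, x) for a, xt, y in zip(alpha, points, labels))
    return 1 if s > 0 else -1

pts = [(2.0, 2.0), (3.0, 3.0), (-2.0, -1.0), (-3.0, -2.0)]
ys = [1, 1, -1, -1]
alpha = train(pts, ys)
print(classify(alpha, pts, ys, (4.0, 4.0)))    # -> 1
print(classify(alpha, pts, ys, (-4.0, -3.0)))  # -> -1
```

Swapping `linear_kernel` for a nonlinear kernel changes the feature space without changing the algorithm, which is the point the slide makes.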

  10. Example 1 – Short Peptides • Problem: identify T-cell epitopes from melanoma antigens • Training set: 602 HLA-DR4-binding peptides, 713 non-binding peptides • Solution: neural networks

  11. Neural Networks – Single Computing Element

  12. Neural Networks Classifier • Sparse coding: each amino acid is a 20-bit one-hot vector, e.g. alanine = 10000000000000000000 • 9 residues × 20 bits = 180 bits per input
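The sparse coding on the slide can be written out directly; the peptide here is an arbitrary 9-residue example, not one from the paper's training set:

```python
# Sparse (one-hot) coding of peptides: each of the 20 amino acids maps to a
# 20-bit vector with a single 1; a 9-residue peptide gives 9 * 20 = 180 bits.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard one-letter codes

def encode(peptide):
    bits = []
    for aa in peptide:
        vec = [0] * 20
        vec[AMINO_ACIDS.index(aa)] = 1
        bits.extend(vec)
    return bits

print("".join(map(str, encode("A"))))  # -> 10000000000000000000 (alanine)
x = encode("ALANINEST")                # arbitrary 9-residue peptide
print(len(x))                          # -> 180
print(sum(x))                          # -> 9 (one bit set per residue)
```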

  13. Where  is a fixed leaning rate • Where is the output of the computing element of the first layer • And  is the difference between the output y and correct output t. Neural Networks – Error Back-Propagation • Squared error: • Adjustment

  14. Result & Remarks • Success rate: 60% • A systematic experimental study is very expensive • A highly accurate prediction method can reduce the cost • Other alternatives exist

  15. Data Mining: Discovering Information from Clinical Records

  16. Problem • Known data (clinical records) → predict unknown data • How to analyze the known data? → training data • How to test unknown data? → prediction

  17. Problem • The data has many attributes. Ex: with 8 attributes there are 2300 combinations of attributes for one class. It is impossible to evaluate them all manually

  18. Problem • One example: eight attributes for diabetic patients: (1) Number of times pregnant (2) Plasma glucose (3) Diastolic blood pressure (4) Triceps skin fold thickness (5) Two-hour serum insulin (6) Body mass index (7) Diabetes pedigree (8) Age

  19. CAEP – Classification by Aggregating Emerging Patterns • A classification (known data) and prediction (unknown data) algorithm

  20. CAEP – Classification by Aggregating Emerging Patterns • Steps: (1) Training data → discover all the emerging patterns (2) Training data → sum and normalize the differentiating weights of these emerging patterns (3) Training data → choose the class with the largest normalized score as the winner (4) Test data → compute the score of the test instance and make a prediction

  21. CAEP: Emerging Pattern • Definition: an emerging pattern is a pattern of attribute values whose frequency increases significantly from one class to another

  22. CAEP: Classification • Definition: (1) Discover the factors that differentiate the two groups (2) Find a way to use these factors to predict which group a new patient belongs to

  23. CAEP: Method • Discretize the dataset into a binary one • Item: a pair (attribute, interval), e.g. (age, >45) • Instance: a set of items, such that an item (A, v) is in t if and only if the value of attribute A of t is within the interval v
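The discretization step above can be sketched as follows; the cut points and attribute names here are made up for illustration, not the ones used in the actual study:

```python
# Turn a numeric record into the binary items (attribute, interval) it
# satisfies, as in the CAEP preprocessing step. Intervals are hypothetical.
INTERVALS = {
    "age": [(0, 30), (30, 45), (45, 200)],
    "bmi": [(0, 25), (25, 30), (30, 100)],
}

def to_items(record):
    """Map {attribute: value} to the set of items (attribute, interval)."""
    items = set()
    for attr, value in record.items():
        for lo, hi in INTERVALS[attr]:
            if lo <= value < hi:
                items.add((attr, (lo, hi)))
    return items

t = to_items({"age": 52, "bmi": 31.4})
print(sorted(t))  # -> [('age', (45, 200)), ('bmi', (30, 100))]
```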

  24. Clinical Record • 768 women • Diabetic instances: 161 (21%) • Non-diabetic instances: 546 (71%)

  25. CAEP: Support • Support of X (an itemset) in class D • Definition: the ratio of the number of instances in D containing X to the total number of instances in D • Formula: supp_D(X) = count_D(X) / |D| • Meaning: a high supp_D(X) means attribute pattern X occurs in many instances of the class • Example: how many people in the diabetic class are older than 60? (item: age > 60) 148/161 ≈ 91%
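A minimal sketch of the support computation, on invented toy instances rather than the real clinical data:

```python
# Support of an itemset X in a class: the fraction of that class's instances
# that contain X (instances are sets of items; X matches via subset test).
def support(itemset, instances):
    if not instances:
        return 0.0
    hits = sum(1 for inst in instances if itemset <= inst)
    return hits / len(instances)

diabetic = [{("age", ">60"), ("bmi", ">30")},
            {("age", ">60")},
            {("bmi", ">30")},
            {("age", ">60"), ("bmi", ">30")}]
print(support({("age", ">60")}, diabetic))  # -> 0.75
```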

  26. CAEP: Growth • Growth rate of X • Definition: the comparison of the supports of the same itemset in the two classes • Formula: growth_D(X) = supp_D(X) / supp_D'(X) • Meaning: a high growth(X) means X is much more likely to occur in class D than in class D' • Example: 91% of the diabetic class is older than 60, versus 10% of the non-diabetic class, so growth(>60) = 91% / 10% ≈ 9
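The growth-rate formula can be sketched directly; the two toy classes below are invented, and the zero-support case is handled as an infinite ("jump") growth rate:

```python
# Growth rate of an itemset X from class D' to class D:
# growth(X) = supp_D(X) / supp_D'(X).
def support(itemset, instances):
    hits = sum(1 for inst in instances if itemset <= inst)
    return hits / len(instances) if instances else 0.0

def growth(itemset, d, d_prime):
    s, s2 = support(itemset, d), support(itemset, d_prime)
    if s2 == 0:
        return float("inf") if s > 0 else 0.0  # "jump" emerging pattern
    return s / s2

diabetic     = [{"old"}, {"old"}, {"old"}, {"old", "thin"}, {"thin"}]
non_diabetic = [{"old"}, {"thin"}, {"thin"}, {"thin"}, {"thin"}]
# supp = 4/5 in one class vs 1/5 in the other
print(round(growth({"old"}, diabetic, non_diabetic), 6))  # -> 4.0
```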

  27. CAEP: Likelihood • likelihood_D(X) • Definition: the ratio of the number of instances containing X in class D to the total number of instances containing X in both classes • Formula 1: likelihood_D(X) = supp_D(X) × |D| / (supp_D(X) × |D| + supp_D'(X) × |D'|) • Formula 2 (if D and D' are roughly equal in size): likelihood_D(X) = supp_D(X) / (supp_D(X) + supp_D'(X)) • Example 1: 91% × 223 / (91% × 223 + 10% × 545) ≈ 203/257 ≈ 79% • Example 2: 91% / (91% + 10%) ≈ 90.1%
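The likelihood formula with the slide's own numbers (91% support in a diabetic class of 223, 10% in a non-diabetic class of 545); the slide's 78.99% comes from rounding the intermediate products:

```python
# Class likelihood of an itemset X:
# likelihood_D(X) = supp_D(X)*|D| / (supp_D(X)*|D| + supp_D'(X)*|D'|)
def likelihood(supp_d, size_d, supp_d2, size_d2):
    num = supp_d * size_d
    return num / (num + supp_d2 * size_d2)

print(round(likelihood(0.91, 223, 0.10, 545), 4))  # -> 0.7883
# Equal-size shortcut: supp_D / (supp_D + supp_D')
print(round(0.91 / (0.91 + 0.10), 4))              # -> 0.901
```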

  28. CAEP: Evaluation • Sensitivity: the ratio of the number of correctly predicted diabetic instances to the number of diabetic instances. Example: 60 correctly predicted / 100 diabetic = 60% • Specificity: the ratio of the number of correctly predicted diabetic instances to the number of instances predicted diabetic. Example: 60 correctly predicted / 120 predicted = 50% • Accuracy: the percentage of instances correctly classified. Example: 60 correctly predicted / 180 ≈ 33%
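The three ratios above, computed from counts. Note that the slide's "specificity" (correct positive predictions over all positive predictions) is what is more commonly called precision:

```python
# Evaluation ratios as defined on the slide, using its worked numbers.
def sensitivity(correct_pos, actual_pos):
    return correct_pos / actual_pos

def specificity_as_defined(correct_pos, predicted_pos):
    # The slide's definition; usually this quantity is called precision.
    return correct_pos / predicted_pos

def accuracy(correct, total):
    return correct / total

print(sensitivity(60, 100))              # -> 0.6
print(specificity_as_defined(60, 120))   # -> 0.5
print(round(accuracy(60, 180), 2))       # -> 0.33
```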

  29. CAEP: Evaluation • Using a single attribute for class prediction: high accuracy, but low sensitivity (identifies only 30% of diabetic instances)

  30. CAEP: Prediction • Consider all attributes: accumulate the scores of all emerging patterns of class D that instance t contains • Formula: score(t, D) = Σ_X likelihood_D(X) × supp_D(X), summed over emerging patterns X ⊆ t • Prediction: score(t, D) > score(t, D') → t belongs to class D

  31. CAEP: Normalize • If the numbers of emerging patterns differ significantly, i.e. one class D has more emerging patterns than another class D', an instance of D' may still score higher for D • Raw score: score(t, D) = Σ_X likelihood_D(X) × supp_D(X) • Normalized score: norm_score(t, D) = score(t, D) / base_score(D) • Prediction: norm_score(t, D) > norm_score(t, D') → t belongs to class D
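The aggregation-and-normalization step can be sketched as follows. The emerging patterns, their likelihoods and supports, and the base scores below are all invented for the example (in CAEP the base score is derived from the training scores, e.g. a percentile):

```python
# Aggregate score: sum likelihood*support over the class's emerging patterns
# contained in instance t, then normalize by a per-class base score.
def score(instance, eps):
    """eps: list of (itemset, likelihood, support) for one class."""
    return sum(l * s for itemset, l, s in eps if itemset <= instance)

def norm_score(instance, eps, base):
    return score(instance, eps) / base

# Hypothetical emerging patterns for classes D and D'
eps_d  = [({"old"}, 0.9, 0.8), ({"old", "high_bmi"}, 0.95, 0.5)]
eps_d2 = [({"young"}, 0.85, 0.7)]

t = {"old", "high_bmi"}
sd  = norm_score(t, eps_d,  base=1.0)  # 0.9*0.8 + 0.95*0.5 = 1.195
sd2 = norm_score(t, eps_d2, base=0.6)  # no pattern of D' matches t -> 0.0
print("D" if sd > sd2 else "D'")       # -> D
```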

  32. CAEP: Comparison with C4.5 and CBA

  33. CAEP: Modify • Problem: CAEP produces a very large number of emerging patterns. Example: with 8 attributes, 2300 emerging patterns

  34. CAEP: Modify • Reduce the number of emerging patterns • Method: prefer strong emerging patterns over their weaker relatives • Example: X1 has infinite growth but very small support; X2 has lower growth but much larger support, say 30 times that of X1. In such a case X2 is preferred because it covers many more cases than X1 • There is no loss in prediction performance from this reduction of emerging patterns

  35. CAEP: Variations • JEP: uses exclusively emerging patterns whose supports increase from zero to nonzero, called jump emerging patterns; performs well when there are many jump emerging patterns • DeEPs: defers more of the training until a test instance arrives, so the patterns used are customized for that instance; slightly better accuracy, and it incorporates new training data easily

  36. Relevance Analysis • Data mining algorithms are in general exponential in complexity • Relevance analysis: exclude the attributes that do not contribute to the classification process • Makes much higher-dimensional datasets tractable • Not always useful for lower-ranking dimensions

  37. Conclusion • Covered the classification and prediction aspects of data mining • Methods include decision trees, mathematical formulas (e.g. Bayesian classification), artificial neural networks, and emerging patterns • They are applicable to a large variety of classification applications • CAEP has good predictive accuracy on all the data sets tested
