
Data Mining: Discovering Information From Bio-Data



Presentation Transcript


  1. Data Mining: Discovering Information From Bio-Data Presented by: Hongli Li & Nianya Liu, University of Massachusetts Lowell

  2. Introduction • Data Mining Background • Process • Functionalities • Techniques • Two examples • Short Peptide • Clinical Records • Conclusion

  3. Data Mining Background - Process

  4. Functionalities • Classification • Cluster Analysis • Outlier Analysis • Trend Analysis • Association Analysis

  5. Techniques • Decision Tree • Bayesian Classification • Hidden Markov Models • Support Vector Machines • Artificial Neural Networks

  6. Technique 1 – Decision Tree

  7. Technique 2 – Bayesian Classification • Based on Bayes' theorem • Simple, but comparable to decision tree and neural network classifiers in many applications.
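As an illustration of the Bayesian classification idea on the slide, here is a minimal naive Bayes sketch on made-up categorical data (the data, attribute names, and class labels are invented for the example; the slides do not specify an implementation):

```python
# Minimal naive Bayes classifier on categorical data (toy example).
# Picks the class maximizing P(C) * product of P(attr_value | C).
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Return class priors and per-attribute conditional counts."""
    priors = Counter(labels)
    cond = defaultdict(Counter)          # (attr_index, class) -> value counts
    for row, cls in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, cls)][v] += 1
    return priors, cond

def predict_nb(priors, cond, row):
    total = sum(priors.values())
    best, best_p = None, -1.0
    for cls, n in priors.items():
        p = n / total                    # prior P(C)
        for i, v in enumerate(row):
            # Laplace smoothing avoids zero probabilities
            p *= (cond[(i, cls)][v] + 1) / (n + 2)
        if p > best_p:
            best, best_p = cls, p
    return best

rows = [("high", "old"), ("high", "young"), ("low", "old"), ("low", "young")]
labels = ["sick", "sick", "well", "well"]
priors, cond = train_nb(rows, labels)
print(predict_nb(priors, cond, ("high", "old")))  # -> sick
```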

  8. Technique 3 – Hidden Markov Model

  9. Technique 4 – Support Vector Machine • SVMs find the maximum-margin hyperplane that separates the classes • The hyperplane can be represented as a linear combination of training points • The algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space • By defining a kernel function, we can locate a separating hyperplane in the feature space and classify points in that space
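The key idea on the slide, a decision function written as a linear combination of training points accessed only through dot products (a kernel), can be sketched with a toy kernel perceptron. This is illustrative only: it shares the dual, kernelized form of an SVM but does not do maximum-margin optimization, and the data points are invented:

```python
# Kernel-style dual classifier: the separating function is a linear
# combination of training points, touched only through a kernel (dot product).
def linear_kernel(a, b):
    return sum(x * y for x, y in zip(a, b))

def train(points, labels, kernel=linear_kernel, epochs=20):
    alpha = [0] * len(points)            # dual coefficient per training point
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(points, labels)):
            s = sum(a * yl * kernel(xl, x)
                    for a, xl, yl in zip(alpha, points, labels))
            if y * s <= 0:               # misclassified: strengthen this point
                alpha[i] += 1
    return alpha

def classify(alpha, points, labels, x, kernel=linear_kernel):
    s = sum(a * y * kernel(xt, x) for a, xt, y in zip(alpha, points, labels))
    return 1 if s > 0 else -1

pts = [(2.0, 2.0), (3.0, 3.0), (-2.0, -1.0), (-3.0, -2.0)]
ys = [1, 1, -1, -1]
alpha = train(pts, ys)
print(classify(alpha, pts, ys, (4.0, 4.0)))    # -> 1
print(classify(alpha, pts, ys, (-4.0, -3.0)))  # -> -1
```

Swapping `linear_kernel` for a nonlinear kernel changes the feature space without changing the algorithm, which is the point the slide makes.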

  10. Example 1 – Short Peptides • Problem: identify T-cell epitopes from melanoma antigens • Training set: 602 HLA-DR4-binding peptides, 713 non-binding peptides • Solution: neural networks

  11. Neural Networks – Single Computing Element

  12. Neural Networks Classifier • Sparse coding: each amino acid is a 20-bit one-hot vector, e.g. alanine = 10000000000000000000 • 9 residues × 20 bits = 180 bits per input
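The sparse coding on the slide can be written out directly; the peptide here is an arbitrary 9-residue example, not one from the paper's training set:

```python
# Sparse (one-hot) coding of peptides: each of the 20 amino acids maps to a
# 20-bit vector with a single 1; a 9-residue peptide gives 9 * 20 = 180 bits.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard one-letter codes

def encode(peptide):
    bits = []
    for aa in peptide:
        vec = [0] * 20
        vec[AMINO_ACIDS.index(aa)] = 1
        bits.extend(vec)
    return bits

print("".join(map(str, encode("A"))))  # -> 10000000000000000000 (alanine)
x = encode("ALANINEST")                # arbitrary 9-residue peptide
print(len(x))                          # -> 180
print(sum(x))                          # -> 9 (one bit set per residue)
```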

  13. Where  is a fixed leaning rate • Where is the output of the computing element of the first layer • And  is the difference between the output y and correct output t. Neural Networks – Error Back-Propagation • Squared error: • Adjustment

  14. Result & Remarks • Success rate: 60% • A systematic experimental study is very expensive • A highly accurate prediction method can reduce the cost • Other alternatives exist

  15. Data Mining: Discovering Information from Clinical Records

  16. Problem • Known data (clinical records) → predict unknown data • How to analyze the known data? → training data • How to test unknown data? → prediction

  17. Problem • The data has many attributes. Ex: with 8 attributes there are 2300 combinations of attributes for one class. It is impossible to evaluate them all manually

  18. Problem • One example: eight attributes for diabetic patients: (1) Number of times pregnant (2) Plasma glucose (3) Diastolic blood pressure (4) Triceps skin fold thickness (5) Two-hour serum insulin (6) Body mass index (7) Diabetes pedigree (8) Age

  19. CAEP – Classification by Aggregating Emerging Patterns • A classification (known data) and prediction (unknown data) algorithm

  20. CAEP – Classification by Aggregating Emerging Patterns • Steps: (1) Training data → discover all the emerging patterns (2) Training data → sum and normalize the differentiating weights of these emerging patterns (3) Training data → choose the class with the largest normalized score as the winner (4) Test data → compute the score of the test instance and make a prediction

  21. CAEP: Emerging Pattern • Definition: an emerging pattern is a pattern of attribute values whose frequency increases significantly from one class to another

  22. CAEP: Classification • Definition: (1) Discover the factors that differentiate the two groups (2) Find a way to use these factors to predict which group a new patient belongs to

  23. CAEP: Method • Discretize the dataset into a binary one • Item: a pair (attribute, interval), e.g. (age, >45) • Instance: a set of items, such that an item (A, v) is in t if and only if the value of attribute A of t is within the interval v
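The discretization step above can be sketched as follows; the cut points and attribute names here are made up for illustration, not the ones used in the actual study:

```python
# Turn a numeric record into the binary items (attribute, interval) it
# satisfies, as in the CAEP preprocessing step. Intervals are hypothetical.
INTERVALS = {
    "age": [(0, 30), (30, 45), (45, 200)],
    "bmi": [(0, 25), (25, 30), (30, 100)],
}

def to_items(record):
    """Map {attribute: value} to the set of items (attribute, interval)."""
    items = set()
    for attr, value in record.items():
        for lo, hi in INTERVALS[attr]:
            if lo <= value < hi:
                items.add((attr, (lo, hi)))
    return items

t = to_items({"age": 52, "bmi": 31.4})
print(sorted(t))  # -> [('age', (45, 200)), ('bmi', (30, 100))]
```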

  24. Clinical Record • 768 women • Diabetic instances: 161 (21%) • Non-diabetic instances: 546 (71%)

  25. CAEP: Support • Support of X (an itemset) in class D • Definition: the ratio of the number of instances in D containing X to the total number of instances in D • Formula: supp_D(X) = count_D(X) / |D| • Meaning: a high supp_D(X) means attribute pattern X occurs in many instances of the class • Example: how many people in the diabetic class are older than 60? (item: age > 60) 148/161 ≈ 91%
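A minimal sketch of the support computation, on invented toy instances rather than the real clinical data:

```python
# Support of an itemset X in a class: the fraction of that class's instances
# that contain X (instances are sets of items; X matches via subset test).
def support(itemset, instances):
    if not instances:
        return 0.0
    hits = sum(1 for inst in instances if itemset <= inst)
    return hits / len(instances)

diabetic = [{("age", ">60"), ("bmi", ">30")},
            {("age", ">60")},
            {("bmi", ">30")},
            {("age", ">60"), ("bmi", ">30")}]
print(support({("age", ">60")}, diabetic))  # -> 0.75
```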

  26. CAEP: Growth • Growth rate of X • Definition: the comparison of the supports of the same itemset in the two classes • Formula: growth_D(X) = supp_D(X) / supp_D'(X) • Meaning: a high growth(X) means X is much more likely to occur in class D than in class D' • Example: 91% of the diabetic class is older than 60, versus 10% of the non-diabetic class, so growth(>60) = 91% / 10% ≈ 9
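The growth-rate formula can be sketched directly; the two toy classes below are invented, and the zero-support case is handled as an infinite ("jump") growth rate:

```python
# Growth rate of an itemset X from class D' to class D:
# growth(X) = supp_D(X) / supp_D'(X).
def support(itemset, instances):
    hits = sum(1 for inst in instances if itemset <= inst)
    return hits / len(instances) if instances else 0.0

def growth(itemset, d, d_prime):
    s, s2 = support(itemset, d), support(itemset, d_prime)
    if s2 == 0:
        return float("inf") if s > 0 else 0.0  # "jump" emerging pattern
    return s / s2

diabetic     = [{"old"}, {"old"}, {"old"}, {"old", "thin"}, {"thin"}]
non_diabetic = [{"old"}, {"thin"}, {"thin"}, {"thin"}, {"thin"}]
# supp = 4/5 in one class vs 1/5 in the other
print(round(growth({"old"}, diabetic, non_diabetic), 6))  # -> 4.0
```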

  27. CAEP: Likelihood • likelihood_D(X) • Definition: the ratio of the number of instances containing X in class D to the total number of instances containing X in both classes • Formula 1: likelihood_D(X) = supp_D(X) × |D| / (supp_D(X) × |D| + supp_D'(X) × |D'|) • Formula 2 (if D and D' are roughly equal in size): likelihood_D(X) = supp_D(X) / (supp_D(X) + supp_D'(X)) • Example 1: 91% × 223 / (91% × 223 + 10% × 545) ≈ 203/257 ≈ 79% • Example 2: 91% / (91% + 10%) ≈ 90.1%
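The likelihood formula with the slide's own numbers (91% support in a diabetic class of 223, 10% in a non-diabetic class of 545); the slide's 78.99% comes from rounding the intermediate products:

```python
# Class likelihood of an itemset X:
# likelihood_D(X) = supp_D(X)*|D| / (supp_D(X)*|D| + supp_D'(X)*|D'|)
def likelihood(supp_d, size_d, supp_d2, size_d2):
    num = supp_d * size_d
    return num / (num + supp_d2 * size_d2)

print(round(likelihood(0.91, 223, 0.10, 545), 4))  # -> 0.7883
# Equal-size shortcut: supp_D / (supp_D + supp_D')
print(round(0.91 / (0.91 + 0.10), 4))              # -> 0.901
```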

  28. CAEP: Evaluation • Sensitivity: the ratio of the number of correctly predicted diabetic instances to the number of diabetic instances. Example: 60 correctly predicted / 100 diabetic = 60% • Specificity: the ratio of the number of correctly predicted diabetic instances to the number of instances predicted diabetic. Example: 60 correctly predicted / 120 predicted = 50% • Accuracy: the percentage of instances correctly classified. Example: 60 correctly predicted / 180 ≈ 33%
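The three ratios above, computed from counts. Note that the slide's "specificity" (correct positive predictions over all positive predictions) is what is more commonly called precision:

```python
# Evaluation ratios as defined on the slide, using its worked numbers.
def sensitivity(correct_pos, actual_pos):
    return correct_pos / actual_pos

def specificity_as_defined(correct_pos, predicted_pos):
    # The slide's definition; usually this quantity is called precision.
    return correct_pos / predicted_pos

def accuracy(correct, total):
    return correct / total

print(sensitivity(60, 100))              # -> 0.6
print(specificity_as_defined(60, 120))   # -> 0.5
print(round(accuracy(60, 180), 2))       # -> 0.33
```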

  29. CAEP: Evaluation • Using a single attribute for class prediction: high accuracy, but low sensitivity (identifies only 30% of diabetic instances)

  30. CAEP: Prediction • Consider all attributes: accumulate the scores of all emerging patterns of class D that instance t contains • Formula: score(t, D) = Σ_X likelihood_D(X) × supp_D(X), summed over emerging patterns X ⊆ t • Prediction: score(t, D) > score(t, D') → t belongs to class D

  31. CAEP: Normalize • If the numbers of emerging patterns differ significantly, i.e. one class D has more emerging patterns than another class D', an instance of D' may still score higher for D • Raw score: score(t, D) = Σ_X likelihood_D(X) × supp_D(X) • Normalized score: norm_score(t, D) = score(t, D) / base_score(D) • Prediction: norm_score(t, D) > norm_score(t, D') → t belongs to class D
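The aggregation-and-normalization step can be sketched as follows. The emerging patterns, their likelihoods and supports, and the base scores below are all invented for the example (in CAEP the base score is derived from the training scores, e.g. a percentile):

```python
# Aggregate score: sum likelihood*support over the class's emerging patterns
# contained in instance t, then normalize by a per-class base score.
def score(instance, eps):
    """eps: list of (itemset, likelihood, support) for one class."""
    return sum(l * s for itemset, l, s in eps if itemset <= instance)

def norm_score(instance, eps, base):
    return score(instance, eps) / base

# Hypothetical emerging patterns for classes D and D'
eps_d  = [({"old"}, 0.9, 0.8), ({"old", "high_bmi"}, 0.95, 0.5)]
eps_d2 = [({"young"}, 0.85, 0.7)]

t = {"old", "high_bmi"}
sd  = norm_score(t, eps_d,  base=1.0)  # 0.9*0.8 + 0.95*0.5 = 1.195
sd2 = norm_score(t, eps_d2, base=0.6)  # no pattern of D' matches t -> 0.0
print("D" if sd > sd2 else "D'")       # -> D
```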

  32. CAEP: Comparison with C4.5 and CBA

  33. CAEP: Modify • Problem: CAEP produces a very large number of emerging patterns. Example: with 8 attributes, 2300 emerging patterns

  34. CAEP: Modify • Reduce the number of emerging patterns • Method: prefer strong emerging patterns over their weaker relatives • Example: X1 has infinite growth but very small support; X2 has lower growth but much larger support, say 30 times that of X1. In such a case X2 is preferred because it covers many more cases than X1 • There is no loss in prediction performance from this reduction of emerging patterns

  35. CAEP: Variations • JEP: uses exclusively emerging patterns whose supports increase from zero to nonzero, called jump emerging patterns; performs well when there are many jump emerging patterns • DeEPs: defers more of the training until a test instance arrives, so the patterns used are customized for that instance; slightly better accuracy, and it incorporates new training data easily

  36. Relevance Analysis • Data mining algorithms are in general exponential in complexity • Relevance analysis: exclude the attributes that do not contribute to the classification process • Makes much higher-dimensional datasets tractable • Not always useful for lower-ranking dimensions

  37. Conclusion • Covered the classification and prediction aspects of data mining • Methods include decision trees, mathematical formulas (e.g. Bayesian classification), artificial neural networks, and emerging patterns • They are applicable to a large variety of classification applications • CAEP has good predictive accuracy on all the data sets tested
