Data Mining and Bioinformatics

Data Mining and Bioinformatics. April 30, 2004. What is Data Mining?. Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage. (SAS Institute)

mayda

Presentation Transcript


  1. Data Mining and Bioinformatics April 30, 2004

  2. What is Data Mining? • Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage. (SAS Institute) • Example: detecting suspicious transactions with credit cards

  3. A Newer Definition • Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

  4. The “Beers and Diapers” Story • Analyze sales records • Beers & diapers frequently occur together in customer orders • Put beers next to diapers • Sales volume increases dramatically • Explanation?
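One way to make "frequently occur together" concrete is with the support, confidence, and lift measures commonly used in market-basket analysis. The sketch below is only an illustration: the transactions are invented toy data, and the function names are hypothetical, not taken from the presentation.

```python
# Hypothetical toy transactions; NOT data from the presentation.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"diapers", "milk"},
    {"beer", "chips"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent in basket | antecedent in basket)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support.

    A value above 1 means the items co-occur more often than they would
    if they were purchased independently.
    """
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


print(support({"beer", "diapers"}, transactions))       # share of baskets with both
print(confidence({"diapers"}, {"beer"}, transactions))  # P(beer | diapers)
print(lift({"diapers"}, {"beer"}, transactions))
```

A lift above 1 only says the items co-occur more than chance would suggest; it does not by itself supply the explanation the slide asks for.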

  5. Why Do Data Mining • Do you know the differences between the following concepts? • Data • Information • Knowledge • Difference between data mining and data analysis • The latter is more specific

  6. What do We Aim to Mine? • Relationships and summaries • Models (global summary of a data set) • Linear equations, clusters, graphs, tree structures • Prediction, classification, interpretation • Patterns (local, restricted regions) • Recurrent patterns, rules • Unusualness - Anomaly detection • Analogy to data compression

  7. The Whole KDD Process • KDD: Knowledge Discovery in Databases • Selecting the target data • Preprocessing the data • Transforming them if necessary • Performing data mining to extract patterns and relationships • Interpreting and assessing the discovered structures

  8. Data Mining Techniques • Many of them originate from statistics, machine learning, or pattern recognition • General steps • Determine the nature and structure of the representation to be used • Decide how to quantify and compare how well different representations fit the data (score function) • Choose an algorithmic process to optimize the score function • Decide what principles of data management are required to implement the algorithm efficiently • Example: Regression analysis X = aY + b • Credit card spending vs Annual income
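The general steps above can be sketched for the regression example. Here the representation is a straight line, the score function is assumed to be the sum of squared errors, and the optimizing algorithm is the closed-form least-squares solution. The income and spending figures are made up, and treating income as the predictor of spending is an assumption; the slide does not say which variable plays which role.

```python
def fit_line(predictor, response):
    """Least-squares fit of response = a * predictor + b (closed form)."""
    n = len(predictor)
    mean_p = sum(predictor) / n
    mean_r = sum(response) / n
    # Covariance-like and variance-like sums around the means.
    s_pr = sum((p - mean_p) * (r - mean_r) for p, r in zip(predictor, response))
    s_pp = sum((p - mean_p) ** 2 for p in predictor)
    a = s_pr / s_pp
    b = mean_r - a * mean_p
    return a, b


def sum_squared_error(a, b, predictor, response):
    """Score function: how badly the line a * predictor + b fits the data."""
    return sum((r - (a * p + b)) ** 2 for p, r in zip(predictor, response))


# Hypothetical data (thousands of dollars); NOT from the presentation.
annual_income = [30.0, 45.0, 60.0, 80.0, 100.0]
card_spending = [5.0, 6.5, 8.0, 10.0, 12.0]

a, b = fit_line(annual_income, card_spending)
print(f"spending ~ {a:.3f} * income + {b:.3f}")
print("score:", sum_squared_error(a, b, annual_income, card_spending))
```

Any other line, for example one with a slightly different slope, yields a larger score on the same data, which is what "optimizing the score function" means here.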

  9. Techniques • Regression/Fitting • Clustering • Neural networks • Bayesian networks • Hidden Markov models

  10. Example: Naïve Bayesian

  11. Naïve Bayesian - Continued • 9 yes samples (out of 14): • 2 sunny, 3 cool, 3 high, 3 true • Prob of yes: 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053 • 5 no samples (out of 14): • 3 sunny, 1 cool, 4 high, 3 true • Prob of no: 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206 • Yes / No = 20.5% / 79.5%
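The arithmetic on this slide can be checked with exact fractions. The sketch below assumes a windy=true count of 3 among the 9 yes samples, which is the value that reproduces the stated 0.0053; the variable names are illustrative only.

```python
from fractions import Fraction as F

# Class prior times the per-attribute conditional probabilities for the
# query (sunny, cool, high, true).
# Assumption: 3 of the 9 "yes" samples have windy = true.
score_yes = F(9, 14) * F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)
score_no = F(5, 14) * F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)

# The two scores are unnormalized; dividing by their sum gives
# probabilities that add up to 1.
total = score_yes + score_no
p_yes = score_yes / total
p_no = score_no / total

print(f"score(yes) = {float(score_yes):.4f}")
print(f"score(no)  = {float(score_no):.4f}")
print(f"P(yes) = {float(p_yes):.1%}, P(no) = {float(p_no):.1%}")
```

The products themselves are not probabilities of the classes; only after normalization do they give the 20.5% / 79.5% split.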

  12. Clustering • Iterative clustering • K-means • Hierarchical clustering • Agglomerative method • Probabilistic model-based clustering • EM (Expectation Maximization)
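As an illustration of the iterative approach, here is a minimal K-means sketch in plain Python. It assumes squared Euclidean distance and a naive deterministic initialization (the first k points serve as starting centroids); practical implementations usually initialize more carefully. The sample points are invented.

```python
def kmeans(points, k, max_iterations=100):
    """Minimal K-means: alternate assignment and centroid-update steps."""
    # Assumption: the first k points are used as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

K-means makes hard assignments of points to clusters, whereas the EM approach listed on the slide assigns each point a probability of belonging to each cluster.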

  13. Data Mining Applications • Interdisciplinary • statistics, databases, machine learning, pattern recognition, AI, visualization, etc. • Applications: • Marketing – sales model, Finance – loan decision • Insurance – risk analysis, Telecom – load prediction • Web/text mining, Surveillance – security • Bioinformatics …

  14. In Bioinformatics • Analysis of Microarray Data • Mining free text • Structural genomics – protein crystallization • Predicting structure from sequence • Common theme: complex data, fast growing (outgrowing our processing power)

  15. Hybridization of Sample to Probe

  16. Data Collection and Preprocessing • Microarray Expression Data • Fluorescence level • Noisy

  17. Data Representations

  18. Microarray Experiment Result

  19. Machine Learning Tasks • Design of Microarrays • Probes (67 features) w/ fluorescence value → learn to choose the best probes for a new gene • Biological Applications of Microarrays • Classify new examples • Predict the functional category of genes • Cluster genes based on similarity • Cluster experimental conditions • Learn a Bayesian network (that captures the joint prob distribution over the expression levels of genes)

  20. A Support Vector Machine

  21. Cluster Analysis

  22. Bayesian Network

  23. Machine Learning Tasks (cont’d) • Medical Applications of Microarrays • Cell disease classification • Predicting existing disease classes • Predicting the prognosis • Predicting the drug response of different patients

  24. Disease Diagnosis Models

  25. Factors That Affect Drug Response

  26. Wrap It Up • Data mining has great potential • Danger: don’t over predict • S&P index = function of the previous year’s butter production, cheese production, sheep population in Bangladesh and US? • Finally - don’t expect it to answer all questions
