
Introduction to data mining

Peter van der Putten, Leiden Institute of Advanced Computer Science, Leiden University, putten@liacs.nl. Transcriptomics and Proteomics in Zebrafish workshop, Leiden University, March 9, 2006.



  1. Introduction to data mining • Peter van der Putten • Leiden Institute of Advanced Computer Science, Leiden University • putten@liacs.nl • Transcriptomics and Proteomics in Zebrafish workshop, Leiden University, March 9, 2006

  2. Presentation Outline • Objective • Present the basics of data mining • Gain understanding of the potential for applying it in the bioinformatics domain

  3. Agenda Today • Data mining definitions • Before Starting to Mine…. • Descriptive Data Mining • Dimension Reduction & Projection • Clustering • Association rules • Predictive data mining concepts • Classification and regression • Bioinformatics applications • Predictive data mining techniques • Logistic Regression • Nearest Neighbor • Decision Trees • Naive Bayes • Neural Networks • Evaluating predictive models • Demonstration (optional)

  4. The Promise…

  5. The Promise…

  6. The Promise…

  7. The Solution… • NCBI Tools for data mining: • Nucleotide sequence analysis • Protein sequence analysis • Structures • Genome analysis • Gene expression • Data mining or not?

  8. What is data mining?

  9. Sources of (artificial) intelligence • Reasoning versus learning • Learning from data • Patient data • Genomics, proteomics • Customer records • Stock prices • Piano music • Criminal mug shots • Websites • Robot perceptions • Etc.

  10. Some working definitions…. • ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably • Data mining = • The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data • Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ….

  11. Some working definitions…. • Bioinformatics = • The research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data [http://www.bisti.nih.gov/]. • Or, more pragmatically: bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems [Wikipedia Nov 2005]

  12. Bioinformatics and data mining • From sequence to structure to function • Genomics (DNA), Transcriptomics (RNA), Proteomics (proteins), Metabolomics (metabolites) • Pattern matching and search • Sequence matching and alignment • Structure prediction • Predicting structure from sequence • Protein secondary structure prediction • Function prediction • Predicting function from structure • Protein localization • Expression analysis • Genes: microarray data analysis etc. • Proteins • Regulation analysis

  13. Bioinformatics and data mining • Classical medical and clinical studies • Medical decision support tools • Text mining on medical research literature (MEDLINE) • Spectrometry, Imaging • Systems biology and modeling biological systems • Population biology & simulation • Spin-off: biologically inspired computational learning • Evolutionary algorithms, neural networks, artificial immune systems

  14. Genomic Microarrays – Case Study • Problem: • Leukemia (different types of leukemia cells look very similar) • Given data for a number of samples (patients), can we • Accurately diagnose the disease? • Predict outcome for given treatment? • Recommend best treatment? • Solution • Data mining on microarray data

  15. Microarray data • 50 most important genes • Rows: genes • Columns: samples / patients

  16. Example: ALL/AML data • 38 training patients, 34 test patients, ~7,000 patient attributes (microarray gene data) • 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML) • Use training data to build a diagnostic model (ALL vs AML) • Results on test data: • 33/34 correct; the 1 error may be a mislabeled sample

  17. Some working definitions…. • ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably • Data mining = • The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data • Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ….

  18. The Knowledge Discovery Process

  19. Some working definitions…. • Concepts: kinds of things that can be learned • Aim: intelligible and operational concept description • Example: the relation between patient characteristics and the probability of being diabetic • Instances: the individual, independent examples of a concept • Example: a patient, candidate drug etc. • Attributes: measuring aspects of an instance • Example: age, weight, lab tests, microarray data etc. • Pattern or attribute space

  20. Data mining tasks • Descriptive data mining • Matching & search: finding instances similar to x • Clustering: discovering groups of similar instances • Association rule extraction: if a & b then c • Summarization: summarizing group descriptions • Link detection: finding relationships • … • Predictive data mining • Classification: classify an instance into a category • Regression: estimate some continuous value

  21. Before starting to mine…. • Pima Indians Diabetes Data • X = body mass index • Y = age

  22. Before starting to mine….

  23. Before starting to mine….

  24. Before starting to mine…. • Attribute Selection • This example: InfoGain by Attribute • Keep the most important ones
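The InfoGain ranking mentioned on this slide can be sketched in a few lines. This is a minimal illustration, assuming discrete attributes and the usual entropy-based definition of information gain; the toy attributes `bmi_high` and `age_high` and their values are hypothetical and are not taken from the Pima data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a discrete attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: two discrete attributes and one class label.
labels = ["pos", "pos", "neg", "neg"]
bmi_high = ["yes", "yes", "no", "no"]   # separates the classes perfectly
age_high = ["yes", "no", "yes", "no"]   # carries no class information

gains = {
    "bmi_high": info_gain(bmi_high, labels),
    "age_high": info_gain(age_high, labels),
}
# Rank attributes by gain; a filter method would keep the top few.
ranked = sorted(gains, key=gains.get, reverse=True)
```

Continuous attributes such as body mass index would first need to be discretized (or split on a threshold) before this score applies.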

  25. Before starting to mine…. • Types of Attribute Selection • Univariate versus multivariate (subset selection) • The fact that attribute x is a strong univariate predictor does not necessarily mean it will add predictive power to a set of predictors already used by a model • Filter versus wrapper • Wrapper methods involve the subsequent learner (classifier or other)

  26. Dimension Reduction • Projecting high-dimensional data into a lower-dimensional space • Principal Component Analysis • Independent Component Analysis • Fisher Mapping, Sammon’s Mapping etc. • Multidimensional Scaling • ….
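As a rough illustration of projecting data onto a lower dimension, the sketch below finds the first principal component of a small data set and projects every instance onto it. It is a simplification under stated assumptions: it uses power iteration on the covariance matrix rather than a full eigendecomposition, extracts only one component, and the data values are made up for illustration.

```python
def mean_center(rows):
    """Subtract the per-attribute mean from every instance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of mean-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Made-up 3-attribute data that varies mostly along one direction.
data = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.4], [3.0, 5.9, 0.6], [4.0, 8.0, 0.5]]
centered = mean_center(data)
pc1 = first_component(covariance(centered))

# One coordinate per instance: the 3-D data reduced to 1-D.
projected = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
```

In practice a numerical library would be used to obtain all components at once; the point here is only that each instance ends up described by fewer numbers.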

  27. Data Mining Tasks: Clustering • Clustering is the discovery of groups in a set of instances • Groups are different, instances in a group are similar • In 2- to 3-dimensional pattern space you could just visualise the data and leave the recognition to a human end user • [Figure: scatter plot with axes e.g. weight and e.g. age]

  28. Data Mining Tasks: Clustering • Clustering is the discovery of groups in a set of instances • Groups are different, instances in a group are similar • In 2- to 3-dimensional pattern space you could just visualise the data and leave the recognition to a human end user • In >3 dimensions this is not possible • [Figure: scatter plot with axes e.g. weight and e.g. age]

  29. Clustering Techniques • Hierarchical algorithms • Agglomerative • Divisive • Partition based clustering • K-Means • Self Organizing Maps / Kohonen Networks • Probabilistic Model based • Expectation Maximization / Mixture Models

  30. Hierarchical clustering • Agglomerative / Bottom up • Start with single-instance clusters • At each step, join the two closest clusters • Method to compute distance between clusters x and y: single linkage (distance between the closest points in x and y), average linkage (average distance between all pairs of points), complete linkage (distance between the furthest points), centroid • Distance measure: Euclidean, correlation etc. • Divisive / Top down • Start with all data in one cluster • Split into two clusters based on distance measure / split utility • Proceed recursively on each subset • Both methods produce a dendrogram
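The agglomerative procedure on this slide can be sketched as follows. This is a naive illustration, assuming Euclidean distance and single linkage, with made-up 2-D points; the function names and the `target` stopping parameter are hypothetical, and a real implementation would record merge heights to draw the dendrogram.

```python
from math import dist  # Euclidean distance, Python 3.8+

def single_linkage(a, b):
    """Distance between the closest points of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, linkage, target):
    """Start with single-instance clusters; repeatedly join the two closest."""
    clusters = [[p] for p in points]
    merges = []  # order of merges, i.e. the dendrogram from the bottom up
    while len(clusters) > target:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Made-up points: two tight pairs and one isolated instance.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerate(points, single_linkage, target=3)
```

Complete or average linkage can be swapped in by passing a different function that takes the maximum or the mean of the pairwise distances instead of the minimum.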

  31. Levels of Clustering • [Figure: levels of clustering, agglomerative versus divisive; source: Dunham, 2003]

  32. Hierarchical Clustering Example • Clustering Microarray Gene Expression Data • Gene expression measured using microarrays, studied under a variety of conditions • On budding yeast Saccharomyces cerevisiae • Efficiently groups together genes of known similar function • Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro

  33. Hierarchical Clustering Example • Method • Genes are the instances, samples the attributes! • Agglomerative • Distance measure = correlation • Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro
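To make "distance measure = correlation" concrete, here is a small sketch. The slide does not give the exact formula, so this assumes the common choice of one minus the Pearson correlation; the gene profiles are made up, with one value per sample.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """Assumed form: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)

# Made-up profiles: each gene is an instance, each sample an attribute.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, different magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend

d_ab = correlation_distance(gene_a, gene_b)
d_ac = correlation_distance(gene_a, gene_c)
```

Unlike Euclidean distance, this measure treats `gene_a` and `gene_b` as identical because only the shape of the profile across samples matters, not its absolute level.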

  34. Simple Clustering: K-means • Pick a number (k) of cluster centers (at random) • Cluster centers are sometimes called codes, and the k codes a codebook • Assign every item to its nearest cluster center • E.g. Euclidean distance • Move each cluster center to the mean of its assigned items • Repeat until convergence • Change in cluster assignments less than a threshold • (Source: KDnuggets)
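The steps on this slide can be sketched as below. This is a minimal version under a few assumptions: Euclidean distance, convergence defined as no change at all in the assignments (a threshold of zero), and made-up 2-D points. The `centers` and `seed` parameters are hypothetical conveniences so that the run is reproducible.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def k_means(points, k, centers=None, max_iter=100, seed=0):
    """Return (centers, assignment) after iterating assign/move until stable."""
    if centers is None:
        # Pick k cluster centers (codes) at random from the data.
        centers = random.Random(seed).sample(points, k)
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # Assign every item to its nearest cluster center.
        new_assignment = [min(range(k), key=lambda c: dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # converged: no item changed cluster
        assignment = new_assignment
        # Move each center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment

# Made-up data: two well-separated groups; both initial codes start in the first group.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, assignment = k_means(points, k=2, centers=[(1, 1), (2, 1)])
```

Even with both initial codes in the same group, the second code is pulled toward the far group after the first update, and the assignments settle on the two groups.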

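  35. K-means example, step 1 • Initially distribute codes randomly in pattern space • [Figure: points in the X-Y plane with codes k1, k2, k3; source: KDnuggets]

  36. K-means example, step 2 • Assign each point to the closest code • [Figure: points in the X-Y plane with codes k1, k2, k3; source: KDnuggets]

  37. K-means example, step 3 • Move each code to the mean of all its assigned points • [Figure: old and new positions of codes k1, k2, k3 in the X-Y plane; source: KDnuggets]

  38. K-means example, step 2 • Repeat the process: reassign the data points to the codes • Q: Which points are reassigned? • [Figure: points in the X-Y plane with codes k1, k2, k3; source: KDnuggets]

  39. K-means example • Repeat the process: reassign the data points to the codes • Q: Which points are reassigned? • [Figure: points in the X-Y plane with codes k1, k2, k3; source: KDnuggets]

  40. K-means example • Re-compute cluster means • [Figure: points in the X-Y plane with codes k1, k2, k3; source: KDnuggets]

  41. K-means example • Move cluster centers to cluster means • [Figure: points in the X-Y plane with codes k1, k2, k3; source: KDnuggets]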

  42. K-means clustering summary • Advantages • Simple, understandable • Items automatically assigned to clusters • Disadvantages • Must pick number of clusters beforehand • All items forced into a cluster • Sensitive to outliers • Extensions • Adaptive k-means • K-medoids (based on a median-like representative instance, the medoid, instead of the mean) • Example: 1, 2, 3, 4, 100 → average 22, median 3
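The outlier example on this slide is easy to check. The sketch below also computes a medoid for the same values, assuming the usual definition (the instance with the smallest total distance to all others), to show why a medoid-based center is less sensitive to the outlier than the mean.

```python
from statistics import mean, median

values = [1, 2, 3, 4, 100]  # 100 is the outlier from the slide

average = mean(values)    # pulled far toward the outlier
middle = median(values)   # unaffected by how extreme the outlier is

# Medoid: the actual instance minimizing the total distance to all instances.
medoid = min(values, key=lambda c: sum(abs(c - v) for v in values))
```

Here the average is 22, while both the median and the medoid are 3, so a center based on either of the latter stays with the bulk of the data.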

  43. Biological Example • Clustering of yeast cell images • Two clusters are found • Left cluster primarily cells with thick capsule, right cluster thin capsule • caused by media, proxy for sick vs healthy

  44. Self Organizing Maps (Kohonen Maps) • Claim to fame • Simplified models of cortical maps in the brain • Things that are near in the outside world link to areas near in the cortex • For a variety of modalities: touch, motor, …. up to echolocation • Nice visualization • From a data mining perspective: • SOMs are simple extensions of k-means clustering • Codes are connected in a lattice • In each iteration, codes neighboring the winning code in the lattice are also allowed to move
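The data mining view of a SOM, in which the winning code and its lattice neighbors all move toward the presented instance, can be sketched as below. This is a simplified illustration under several assumptions: a 1-D lattice instead of the usual 2-D grid, a Gaussian neighborhood function, and a fixed learning rate and radius (real SOM training normally shrinks both over time). The parameter names and the toy points are hypothetical.

```python
import random
from math import dist, exp

def train_som(points, n_codes, epochs=50, lr=0.5, radius=1.0, seed=0):
    """Train a SOM whose codes sit at lattice positions 0 .. n_codes - 1."""
    rng = random.Random(seed)
    dim = len(points[0])
    codes = [[rng.random() for _ in range(dim)] for _ in range(n_codes)]
    for _ in range(epochs):
        for p in rng.sample(points, len(points)):  # present instances in random order
            # Winning code: the one closest to the instance in pattern space.
            winner = min(range(n_codes), key=lambda i: dist(p, codes[i]))
            for i in range(n_codes):
                # Neighborhood strength depends on lattice distance to the winner.
                h = exp(-((i - winner) ** 2) / (2 * radius ** 2))
                codes[i] = [c + lr * h * (x - c) for c, x in zip(codes[i], p)]
    return codes

# Made-up 2-D instances inside the unit square.
points = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9), (0.8, 0.9)]
codes = train_som(points, n_codes=3)
```

Setting the neighborhood strength to zero for every code except the winner turns this into an online variant of k-means, which is the sense in which a SOM extends it.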

  45. SOM • 10x10 SOM • Gaussian distribution

  46. SOM

  47. SOM

  48. SOM

  49. SOM example

  50. Famous example: Phonetic Typewriter • SOM lattice below left is trained on spoken letters; after convergence, codes are labeled • Creates a ‘phonotopic’ map • Spoken word creates a sequence of labels
