Microarray Data Analysis

Presentation Transcript


  1. Microarray Data Analysis • Data preprocessing and visualization • Supervised learning • Machine learning approaches • Unsupervised learning • Clustering and pattern detection • Gene regulatory region prediction based on co-regulated genes • Linkage between gene expression data and gene sequence/function databases • …

  2. Unsupervised learning • Supervised methods • Can only validate or reject hypotheses • Cannot lead to discovery of unexpected partitions • Unsupervised learning • No prior knowledge is used • Explore structure of data on the basis of correlations and similarities

  3. DEFINITION OF THE CLUSTERING PROBLEM Eytan Domany

  4. CLUSTER ANALYSIS YIELDS DENDROGRAM T (RESOLUTION) Eytan Domany

  5. BUT WHAT ABOUT THE OKAPI? Eytan Domany

  6. Centroid methods – K-means • Data points at $X_i$, $i = 1, \dots, N$ • Centroids at $Y_\nu$, $\nu = 1, \dots, K$ • Assign data point $i$ to centroid $\nu$: $S_i = \nu$ • Cost $E$: $E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} (X_i - Y_{S_i})^2$ • Minimize $E$ over $S_i$, $Y_\nu$ Eytan Domany

  7. K-means • “Guess” K=3 Eytan Domany

  8. K-means • Start with random positions of centroids. Iteration = 0 Eytan Domany

  9. K-means • Start with random positions of centroids. • Assign each data point to closest centroid. Iteration = 1 Eytan Domany

  10. K-means • Start with random positions of centroids. • Assign each data point to closest centroid. • Move centroids to center of assigned points Iteration = 2 Eytan Domany

  11. K-means • Start with random positions of centroids. • Assign each data point to closest centroid. • Move centroids to center of assigned points • Iterate till minimal cost Iteration = 3 Eytan Domany
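The steps listed on the K-means slides can be sketched in plain Python. This is a minimal illustration, not the presenter's code: the names `sq_dist` and `kmeans`, the `max_iter` and `seed` parameters, and the choice to start the centroids at K randomly chosen data points are all assumptions made for the example. The cost is taken to be the sum of squared distances from each point to its assigned centroid.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignment, centroids, cost) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Assumption: initial centroids are K distinct data points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the cost has stopped decreasing
        assignment = new_assignment
        # Move each centroid to the center (mean) of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, cost
```

Calling `kmeans(points, k=3)` with different `seed` values is a simple way to see how much the result depends on the initial centroid positions.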

  12. K-means - Summary • Fast algorithm: compute distances from data points to centroids • Result depends on initial centroids’ position • Must preset K • Fails for “non-spherical” distributions

  13. Agglomerative Hierarchical Clustering • Initially, each point is its own cluster • At each step, merge the pair of nearest clusters • Need to define the distance between the new cluster and the other clusters • Single Linkage: distance between closest pair • Complete Linkage: distance between farthest pair • Average Linkage: average distance between all pairs, or distance between cluster centers • Dendrogram: the height of each join is the distance between the joined clusters • The dendrogram induces a linear ordering of the data points Eytan Domany
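The agglomerative procedure and the three linkage rules can be sketched as follows. This is a naive illustration that recomputes every pairwise distance at each step; the function names `cluster_distance` and `agglomerate`, the use of Euclidean distance, and the all-pairs reading of average linkage are assumptions for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b, given as tuples of point indices."""
    pair_dists = [dist(points[i], points[j]) for i in a for j in b]
    if linkage == "single":      # closest pair
        return min(pair_dists)
    if linkage == "complete":    # farthest pair
        return max(pair_dists)
    if linkage == "average":     # mean over all pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="average"):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # initially each point = cluster
    merges = []
    while len(clusters) > 1:
        # Greedy step: pick the pair of nearest clusters and merge them.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        merges.append((a, b, cluster_distance(a, b, points, linkage)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges
```

The third element of each merge is the height at which that join would be drawn in the dendrogram.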

  14. Hierarchical Clustering - Summary • Results depend on distance update method • Greedy iterative process • NOT robust against noise • No inherent measure to identify stable clusters • Average Linkage – the most widely used clustering method in gene expression analysis

  15. Heat map • Breast cancer data (Nature, 2002)

  16. Cluster both genes and samples • Samples should cluster together according to the experimental design • Often a way to catch labelling errors or heterogeneity in samples

  17. Epinephrine Treated Rat Fibroblast Cell

  18. Heat map • Correlation coefficient • Normalized across each gene
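The slide does not say which normalization was applied. One common choice, assumed here purely for illustration, is to z-score each gene's expression profile so that every row of the heat map has mean 0 and standard deviation 1. The function name `normalize_gene` is hypothetical.

```python
from statistics import mean, pstdev


def normalize_gene(profile):
    """Z-score one gene's expression values across samples."""
    m = mean(profile)
    s = pstdev(profile)  # population standard deviation
    if s == 0:
        # A flat profile carries no pattern; map it to all zeros.
        return [0.0] * len(profile)
    return [(x - m) / s for x in profile]


def normalize_matrix(rows):
    """Apply the per-gene normalization to every row (gene) of a matrix."""
    return [normalize_gene(row) for row in rows]
```

After this step, genes with the same expression pattern but different absolute levels get the same colors in the heat map.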

  19. Distance Issues • Euclidean distance • Pearson distance • Example genes: g1, g2, g3, g4
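The difference between the two distances can be seen on a few made-up profiles. The sketch below assumes "Pearson distance" means one minus the Pearson correlation coefficient; the profiles `a`, `b`, and `c` are hypothetical and are not the g1 to g4 values from the slide.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance: sensitive to the absolute expression level."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: sensitive only to the shape of the profile.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    var_x = sum((p - mx) ** 2 for p in x)
    var_y = sum((q - my) ** 2 for q in y)
    return 1 - cov / sqrt(var_x * var_y)


# Hypothetical profiles: b has the same shape as a at ten times the level,
# c has similar levels to a but the opposite trend.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = [4, 3, 2, 1]
```

Under Euclidean distance `a` is closer to `c` than to `b`, while under Pearson distance `a` and `b` are as close as possible and `a` and `c` are as far apart as possible. Which behavior is wanted depends on whether expression level or expression pattern matters for the question at hand.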

  20. Exercise • Use Average Linkage Algorithm and Manhattan distance.
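The data for the exercise is not included in the transcript, so the sketch below only shows the two ingredients needed to work it by hand on a distance matrix. The helper names are hypothetical, and the size-weighted update shown is the standard way to get the all-pairs average distance without going back to the original points.

```python
def manhattan(x, y):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(x, y))


def average_linkage_update(d_ki, d_kj, n_i, n_j):
    """Distance from cluster k to the cluster formed by merging i and j.

    d_ki, d_kj: current distances from k to i and from k to j.
    n_i, n_j:   number of points in i and in j.
    The size-weighted mean equals the average over all point pairs.
    """
    return (n_i * d_ki + n_j * d_kj) / (n_i + n_j)
```

In a hand calculation, each step finds the smallest entry in the distance matrix, merges that pair, and replaces their two rows and columns with a single one computed by `average_linkage_update`.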

  21. Exercise

  22. Issues in Cluster Analysis • A lot of clustering algorithms • A lot of distance/similarity metrics • Which clustering algorithm runs faster and uses less memory? • How many clusters after all? • Are the clusters stable? • Are the clusters meaningful?

  23. Which Clustering Method Should I Use? • What is the biological question? • Do I have a preconceived notion of how many clusters there should be? • How strict do I want to be? Split or join? • Can a gene be in multiple clusters? • Hard or soft boundaries between clusters

  24. The End • Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it. • We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have. • We wish you all a wonderful summer break!