Gene Expression Data and Cluster Analysis

Presentation Transcript


  1. Gene Expression Data and Cluster Analysis http://staff.washington.edu/kayee/research.html

  2. A gene expression data set • Snapshot of activities in the cell • Each chip represents an experiment: • time course • tissue samples (normal/cancer) • Data matrix X with n genes and p experiments; X_ij is the entry for gene i in experiment j

  3. What is clustering? • Group similar objects together • Objects in the same cluster (group) are more similar to each other than to objects in different clusters • Exploratory data analysis tool: finds patterns in large data sets • Unsupervised approach: does not use prior knowledge about the data

  4. Applications of clustering gene expression data • Cluster the genes → functionally related genes • Cluster the experiments → discover new subtypes of tissue samples • Cluster both genes and experiments → find sub-patterns

  5. Examples of clustering algorithms • Hierarchical clustering algorithms, e.g. [Eisen et al. 1998] • K-means, e.g. [Tavazoie et al. 1999] • Self-organizing maps (SOM), e.g. [Tamayo et al. 1999] • CAST [Ben-Dor, Yakhini 1999] • Model-based clustering algorithms, e.g. [Yeung et al. 2001]

  6. Overview • Similarity/distance measures • Hierarchical clustering algorithms • Made popular by Stanford, e.g. [Eisen et al. 1998] • K-means • Made popular by many groups, e.g. [Tavazoie et al. 1999] • Model-based clustering algorithms [Yeung et al. 2001]

  7. How to define similarity? • Similarity measures: • A measure of pairwise similarity or dissimilarity • Examples: • Correlation coefficient • Euclidean distance [Figure: the raw matrix (n genes × p experiments) is turned into an n × n gene-by-gene similarity matrix; X and Y mark two genes being compared]

  8. Similarity measures (for those of you who enjoy equations…) • Euclidean distance • Correlation coefficient
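
The equation images for this slide did not survive in the transcript. As a sketch, the standard textbook definitions for two expression profiles X and Y measured over p experiments are given below. The symbols (X_j, Y_j, and the bars for means) are assumptions chosen to match the n × p matrix notation used earlier and may differ from the slide's own notation.

```latex
d(X, Y) = \sqrt{\sum_{j=1}^{p} (X_j - Y_j)^2}

r(X, Y) = \frac{\sum_{j=1}^{p} (X_j - \bar{X})(Y_j - \bar{Y})}
               {\sqrt{\sum_{j=1}^{p} (X_j - \bar{X})^2}\;\sqrt{\sum_{j=1}^{p} (Y_j - \bar{Y})^2}},
\qquad \bar{X} = \frac{1}{p}\sum_{j=1}^{p} X_j, \quad \bar{Y} = \frac{1}{p}\sum_{j=1}^{p} Y_j
```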

  9. Example • Correlation(X,Y) = 1, Distance(X,Y) = 4 • Correlation(X,Z) = -1, Distance(X,Z) = 2.83 • Correlation(X,W) = 1, Distance(X,W) = 1.41

  10. Lessons from the example • Correlation: direction only • Euclidean distance: magnitude & direction • Array data is noisy → need many experiments to robustly estimate pairwise similarity
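
A minimal Python sketch of this lesson. The vectors below are hypothetical (the slide's X, Y, Z, W were in a figure that is not in the transcript), and the function names are chosen here for illustration. A shifted copy of a profile keeps correlation 1 while its Euclidean distance grows, and a mirrored copy has correlation -1.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical profiles over p = 4 experiments (not the slide's data).
x = [1.0, 2.0, 3.0, 2.0]
shifted = [v + 2.0 for v in x]   # same shape, higher magnitude
mirrored = [-v for v in x]       # opposite direction

print(pearson(x, shifted), euclidean(x, shifted))
print(pearson(x, mirrored), euclidean(x, mirrored))
```

Correlation ignores the shift entirely, while Euclidean distance reports it, which is the "direction only" versus "magnitude & direction" contrast.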

  11. Clustering algorithms • From pairwise similarities to groups • Inputs: • Raw data matrix or similarity matrix • Number of clusters or some other parameters

  12. Hierarchical Clustering [Hartigan 1975] • Agglomerative (bottom-up) • Algorithm: • Initialize: each item is its own cluster • Iterate: • select the two most similar clusters • merge them • Halt: when the required number of clusters is reached • The sequence of merges is drawn as a dendrogram
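
The loop above can be sketched in a few lines of Python. This is a naive illustration, not the implementation used in any of the cited software: it works on distances (smaller means more similar), the names `agglomerate` and `closest_pair_distance` are made up here, and the default cluster distance is one possible choice among several.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def closest_pair_distance(a, b, points):
    """Distance between clusters a and b (lists of point indices),
    taken as the distance of their closest pair of members."""
    return min(euclidean(points[i], points[j]) for i in a for j in b)

def agglomerate(points, k, cluster_distance=closest_pair_distance):
    # Initialize: each item is its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []  # record of (cluster_a, cluster_b, distance), i.e. the dendrogram
    # Halt when the required number of clusters is reached.
    while len(clusters) > k:
        best = None
        # Select the two most similar (closest) clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Merge them.
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical 2-D data: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerate(points, k=3)
print(sorted(sorted(c) for c in clusters))
```

Swapping in a different `cluster_distance` function changes the linkage rule without touching the merge loop.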

  13. Hierarchical: Single Link • cluster similarity = similarity of the two most similar members • (−) Potentially long and skinny clusters • (+) Fast

  14. Example: single link [figure: items 1 to 5]

  15. Example: single link [figure: items 1 to 5]

  16. Example: single link [figure: items 1 to 5]

  17. Hierarchical: Complete Link • cluster similarity = similarity of the two least similar members • (+) Tight clusters • (−) Slow

  18. Example: complete link [figure: items 1 to 5]

  19. Example: complete link [figure: items 1 to 5]

  20. Example: complete link [figure: items 1 to 5]

  21. Hierarchical: Average Link • cluster similarity = average similarity of all pairs • (+) Tight clusters • (−) Slow
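
To compare the three linkage rules side by side, here is a small sketch that evaluates each on the same pair of clusters. It is written in terms of distances rather than similarities, so "most similar members" becomes the minimum distance and "least similar members" the maximum. The two clusters are hypothetical points on a line.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pair_distances(a, b):
    """All distances between a member of cluster a and a member of cluster b."""
    return [euclidean(p, q) for p, q in product(a, b)]

def single_link(a, b):
    return min(pair_distances(a, b))      # two most similar members

def complete_link(a, b):
    return max(pair_distances(a, b))      # two least similar members

def average_link(a, b):
    d = pair_distances(a, b)
    return sum(d) / len(d)                # average over all pairs

# Hypothetical clusters; pairwise distances are 3, 6, 2 and 5.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (6.0, 0.0)]

print(single_link(a, b), complete_link(a, b), average_link(a, b))
```

Single link only looks at the nearest pair, which is why it can chain together long, skinny clusters, while complete link penalizes any far-apart pair and so favors tight ones.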

  22. Software: TreeView [Eisen et al. 1998] • Fig. 1 in Eisen’s 1998 PNAS paper • Time course of serum stimulation of primary human fibroblasts • cDNA arrays with approx. 8600 spots • Similar to average-link • Free download at: http://rana.lbl.gov/EisenSoftware.htm

  23. Overview • Similarity/distance measures • Hierarchical clustering algorithms • Made popular by Stanford, e.g. [Eisen et al. 1998] • K-means • Made popular by many groups, e.g. [Tavazoie et al. 1999] • Model-based clustering algorithms [Yeung et al. 2001]

  24. Partitional: K-Means [MacQueen 1965] [figure with labels 1, 2, 3]

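  25. Details of k-means • Iterate until convergence: • Assign each data point to the closest centroid • Compute the new centroid of each cluster • Objective function: minimize the sum of squared distances from each data point to the centroid of its cluster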
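
A minimal pure-Python sketch of the two alternating steps. The initialization (k randomly sampled data points), the `seed` and `max_iter` parameters, and the function names are assumptions made for this illustration; the slide does not specify them.

```python
import random

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # assumed initialization: k random data points
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each data point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                       # converged: no point changed cluster
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                 # leave an empty cluster's centroid unchanged
                centroids[c] = mean(members)
    # Objective: within-cluster sum of squared distances to the centroids.
    objective = sum(squared_distance(p, centroids[a])
                    for p, a in zip(points, assignment))
    return assignment, centroids, objective

# Hypothetical data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids, objective = kmeans(points, k=2)
print(assignment, objective)
```

Because the result depends on the starting centroids, a common practice is to run the procedure from several random starts and keep the run with the smallest objective.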

  26. Properties of k-means • Fast • Proven to converge to a local optimum • In practice, converges quickly • Tends to produce spherical, equal-sized clusters • Related to the model-based approach • Gavin Sherlock’s Xcluster: http://genome-www.stanford.edu/~sherlock/cluster.html

  27. What we have seen so far • Definition of clustering • Pairwise similarity: • Correlation • Euclidean distance • Clustering algorithms: • Hierarchical agglomerative • K-means • Different clustering algorithms → different clusters • Clustering algorithms always produce clusters

  28. Which clustering algorithm should I use? • Good question • No definite answer: on-going research

  29. Examples of clustering

  30. Traditional hierarchical clustering tree • Networks with community structure • Connect the pair of nodes with the strongest link, then the next strongest link, …

  31. Uncovering underlying modularity • Using topological overlap (scale: no overlap ↔ perfect overlap) • Based on the number of nodes to which both i and j are linked (+1 if i and j are directly linked)
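
A sketch of how such an overlap score could be computed from an adjacency list. The slide text only preserves the count (common neighbours, plus 1 for a direct link); dividing by the smaller of the two degrees, so that the score runs from 0 (no overlap) to 1 (perfect overlap), is an assumption made here, as are the function name and the toy graph.

```python
def topological_overlap(adj, i, j):
    """Overlap between nodes i and j in an undirected graph given as
    {node: set_of_neighbours}. Normalization by min degree is an assumption."""
    common = len(adj[i] & adj[j])          # nodes linked to both i and j
    direct = 1 if j in adj[i] else 0       # +1 if i and j are directly linked
    smaller_degree = min(len(adj[i]), len(adj[j]))
    if smaller_degree == 0:
        return 0.0
    return (common + direct) / smaller_degree

# Hypothetical graph: triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

print(topological_overlap(adj, "a", "b"))  # same triangle
print(topological_overlap(adj, "c", "d"))  # linked, no shared neighbour
print(topological_overlap(adj, "a", "e"))  # far apart
```

The resulting pairwise overlap values can serve as the similarity matrix fed into a hierarchical clustering routine.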

  32. Modules in the E. coli metabolism

  33. The structure of pyrimidine metabolism

  34. Any better methods? • Using global information, not local information (i.e. using dynamic & whole-network information)

  35. Locally not important, because degree = 2 • But globally very important, because it connects two groups!

  36. Betweenness Centrality (BC) [Freeman, 1977] • $b_{ij}(k)$: the fraction of the shortest paths between i and j that pass through k • “How influential is the k-th node for the communication between i and j?” • Example: the BC at k contributed by the communication from i to j [figure with nodes i, j, k; value not recoverable from the transcript] • Accumulate over all ordered pairs: $b(k) = \sum_{i \neq j} b_{ij}(k)$
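
A brute-force Python sketch of this definition for an unweighted, undirected graph. It counts shortest paths with breadth-first search and sums the fractions over all ordered pairs (i, j); faster algorithms exist, and the function names and the toy graph (two triangles joined through a degree-2 node k, echoing the previous slide) are illustrative assumptions.

```python
from collections import deque

def bfs_counts(adj, source):
    """Return (dist, sigma): shortest-path length from source to each reachable
    node, and the number of distinct shortest paths to it."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj):
    info = {s: bfs_counts(adj, s) for s in adj}
    bc = {k: 0.0 for k in adj}
    for i in adj:
        dist_i, sigma_i = info[i]
        for j in adj:
            if j == i or j not in dist_i:
                continue
            for k in adj:
                if k in (i, j) or k not in dist_i:
                    continue
                dist_k, sigma_k = info[k]
                # k lies on a shortest i-j path iff the distances add up.
                if j in dist_k and dist_i[k] + dist_k[j] == dist_i[j]:
                    bc[k] += sigma_i[k] * sigma_k[j] / sigma_i[j]
    return bc

# Hypothetical graph: triangles a-b-c and d-e-f joined through node k.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "k"},
    "k": {"c", "d"},
    "d": {"k", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
bc = betweenness(adj)
print(bc["k"], bc["c"], bc["a"])
```

Node k has only two links, yet every path between the two triangles runs through it, so its betweenness is the highest in the graph.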

  37. Algorithm for finding communities (destructive approach using global information) • 1) Calculate the betweenness for all edges in the network. • 2) Remove the edge with the highest betweenness. • 3) Recalculate the betweenness for edges affected by the removal. • 4) Repeat from step 2 until no edges remain.
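
The four steps can be sketched as follows for a small unweighted, undirected graph. To stay short, this version recomputes the betweenness of every edge after each removal instead of only the affected ones, and it records the component structure each time the network splits. All function names and the example graph are assumptions for illustration.

```python
from collections import deque

def bfs_counts(adj, source):
    """Shortest-path lengths and shortest-path counts from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], sigma[v] = dist[u] + 1, 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def edge_betweenness(adj):
    """Sum over ordered pairs (s, t) of the fraction of shortest s-t paths
    that use each edge; keys are frozenset({u, v})."""
    info = {s: bfs_counts(adj, s) for s in adj}
    eb = {}
    for u in adj:
        for v in adj[u]:                      # edge traversed in direction u -> v
            dist_v, sigma_v = info[v]
            total = 0.0
            for s in adj:
                dist_s, sigma_s = info[s]
                if u not in dist_s:
                    continue
                for t in dist_v:
                    if t != s and dist_s[u] + 1 + dist_v[t] == dist_s[t]:
                        total += sigma_s[u] * sigma_v[t] / sigma_s[t]
            key = frozenset((u, v))
            eb[key] = eb.get(key, 0.0) + total
    return eb

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp = set(bfs_counts(adj, s)[0])
            seen |= comp
            comps.append(comp)
    return comps

def split_communities(adj):
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    history = [components(adj)]
    while any(adj.values()):                      # step 4: until no edges remain
        eb = edge_betweenness(adj)                # steps 1 and 3 (full recomputation)
        u, v = max(eb, key=eb.get)                # step 2: highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > len(history[-1]):
            history.append(comps)                 # the network just split
    return history

# Hypothetical graph: triangles a-b-c and d-e-f joined by the single edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
history = split_communities(adj)
print(sorted(sorted(c) for c in history[1]))
```

The successive splits in `history` form a top-down tree, the counterpart of the bottom-up dendrogram produced by agglomerative clustering.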

  38. Modular model of Yeast filamentation network

  39. Apply to the metabolic networks of 43 organisms

  40. Hierarchical clustering tree of T. pallidum

  41. One cluster from M. pneumoniae (sugar import & DNA replication)

  42. Community type of ordering vs. shell type of ordering [figure labels: “When far from the root”, “When close to the root”]

  43. Schematic picture of metabolic networks

  44. Taxonomy using metabolic networks • J. Podani, Z.N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabasi, E. Szathmary [Nature Genetics 29, 54 (2001)]

  45. What is taxonomy?

  46. Three domains of Life

  47. What makes a eukaryote a eukaryote? • The name means "true nut" or "true kernel" in Greek; the "nut" is in fact the nucleus of the eukaryotic cell, a membrane sac that contains the cell's DNA. • Unlike bacteria and archaea, eukaryotes have their DNA in linear pieces that are bound up with special proteins (histones) to make chromosomes (normally visible only in dividing cells).

  48. Bacteria • Bacteria lack the membrane-bound nuclei of eukaryotes [figure: transmission electron micrograph of a typical bacterium, E. coli]

  49. Archaea • Basic archaeal structure: the three primary regions of an archaeal cell are the cytoplasm, cell membrane, and cell wall [figure: the three regions labelled, with an enlargement of the cell membrane structure at right] • Archaeal cell membranes are chemically different from those of all other living things, including a "backwards" glycerol molecule and isoprene derivatives in place of fatty acids.

  50. Differences between “typical” prokaryotes and eukaryotes • Prokaryotes: diameter ~2 microns; DNA small and circular; no membrane-bound organelles; RNA and protein synthesized in the same compartment; no cytoskeleton; cell cycle ~20 min • Eukaryotes: diameter ~20 microns; DNA long, linear, and in chromosomes; nucleus, mitochondria, +chloroplasts; RNA synthesized in the nucleus, proteins made in the cytoplasm; cytoskeleton; cell cycle ~24 hr
