Gene Expression Data and Cluster Analysis

Gene Expression Data and Cluster Analysis http://staff.washington.edu/kayee/research.html

A gene expression data set
• Snapshot of activities in the cell
• Each chip represents an experiment:
  • time course
  • tissue samples (normal/cancer)
[Figure: data matrix of n genes by p experiments, with entries Xij]

What is clustering?
• Group similar objects together
• Objects in the same cluster (group) are more similar to each other than objects in different clusters
• Data exploratory tool: to find patterns in large data sets
• Unsupervised approach: does not make use of prior knowledge of the data

Applications of clustering gene expression data
• Cluster the genes → functionally related genes
• Cluster the experiments → discover new subtypes of tissue samples
• Cluster both genes and experiments → find sub-patterns

Examples of clustering algorithms
• Hierarchical clustering algorithms, e.g. [Eisen et al. 1998]
• K-means, e.g. [Tavazoie et al. 1999]
• Self-organizing maps (SOM), e.g. [Tamayo et al. 1999]
• CAST [Ben-Dor, Yakhini 1999]
• Model-based clustering algorithms, e.g. [Yeung et al. 2001]

Overview
• Similarity/distance measures
• Hierarchical clustering algorithms
  • Made popular by Stanford, e.g. [Eisen et al. 1998]
• K-means
  • Made popular by many groups, e.g. [Tavazoie et al. 1999]
• Model-based clustering algorithms [Yeung et al. 2001]

How to define similarity?
• Similarity measures:
  • A measure of pairwise similarity or dissimilarity
  • Examples:
    • Correlation coefficient
    • Euclidean distance
[Figure: raw matrix X (n genes by p experiments) mapped to a similarity matrix Y (n genes by n genes)]

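Similarity measures (for those of you who enjoy equations…)
For two genes X and Y measured over p experiments:
• Euclidean distance: $d(X, Y) = \sqrt{\sum_{j=1}^{p} (X_j - Y_j)^2}$
• Correlation coefficient: $r(X, Y) = \frac{\sum_{j=1}^{p} (X_j - \bar{X})(Y_j - \bar{Y})}{\sqrt{\sum_{j=1}^{p} (X_j - \bar{X})^2} \, \sqrt{\sum_{j=1}^{p} (Y_j - \bar{Y})^2}}$, where $\bar{X} = \frac{1}{p} \sum_{j=1}^{p} X_j$ and $\bar{Y} = \frac{1}{p} \sum_{j=1}^{p} Y_j$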

Example
• Correlation(X, Y) = 1, Distance(X, Y) = 4
• Correlation(X, Z) = -1, Distance(X, Z) = 2.83
• Correlation(X, W) = 1, Distance(X, W) = 1.41

Lessons from the example
• Correlation: direction only
• Euclidean distance: magnitude & direction
• Array data is noisy → need many experiments to robustly estimate pairwise similarity
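The two lessons above can be checked numerically. The sketch below uses only the Python standard library; the profiles `gene_a`, `gene_b`, and `gene_c` are hypothetical values chosen for illustration and are not the X, Y, Z, W of the slide example.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression profiles over four experiments (assumed data).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [3.0, 4.0, 5.0, 6.0]  # same shape as gene_a, shifted up by 2
gene_c = [4.0, 3.0, 2.0, 1.0]  # mirror image of gene_a

corr_ab = pearson_correlation(gene_a, gene_b)
dist_ab = euclidean_distance(gene_a, gene_b)
corr_ac = pearson_correlation(gene_a, gene_c)
dist_ac = euclidean_distance(gene_a, gene_c)

print("a vs b: correlation", corr_ab, "distance", dist_ab)
print("a vs c: correlation", corr_ac, "distance", dist_ac)
```

Here `gene_b` is perfectly correlated with `gene_a` yet sits at a nonzero distance from it, because correlation ignores the constant shift while Euclidean distance does not.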

Clustering algorithms
• From pairwise similarities to groups
• Inputs:
  • Raw data matrix or similarity matrix
  • Number of clusters or some other parameters

Hierarchical Clustering [Hartigan 1975]
• Agglomerative (bottom-up)
• Algorithm:
  • Initialize: each item is a cluster
  • Iterate:
    • select the two most similar clusters
    • merge them
  • Halt: when the required number of clusters is reached
[Figure: dendrogram]
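The initialize / iterate / halt loop above can be sketched in a few lines. This is a naive illustration under stated assumptions, not the implementation used by any particular tool: it uses Euclidean distance (so "most similar" means "smallest distance"), single-link cluster distance as the default, and a hypothetical set of 2-D points.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(cluster_a, cluster_b, points):
    """Distance between the two closest members of the two clusters."""
    return min(euclidean(points[i], points[j])
               for i in cluster_a for j in cluster_b)

def agglomerative(points, num_clusters, linkage=single_link):
    # Initialize: each item is its own cluster (a list of point indices).
    clusters = [[i] for i in range(len(points))]
    merges = []  # record of (cluster, cluster, distance) for each merge
    # Halt: stop when the required number of clusters is reached.
    while len(clusters) > num_clusters:
        # Iterate: select the two most similar (closest) clusters ...
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # ... and merge them.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical 2-D points (assumed data): two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, num_clusters=3)
print(sorted(sorted(c) for c in clusters))
```

The `merges` list records the order and height of each merge, which is the information a dendrogram displays. The pairwise search makes this sketch slow for large data sets; it is meant only to mirror the steps on the slide.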

Hierarchical: Single Link
• cluster similarity = similarity of the two most similar members
- Potentially long and skinny clusters
+ Fast

Example: single link
[Figure: worked example on items 1 to 5]

Hierarchical: Complete Link
• cluster similarity = similarity of the two least similar members
+ Tight clusters
- Slow

Example: complete link
[Figure: worked example on items 1 to 5]

Hierarchical: Average Link
• cluster similarity = average similarity of all pairs
+ Tight clusters
- Slow
[Figure: example with items A, 1, 2, 3]
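The three linkage rules differ only in how they summarize the pairwise values between two clusters. The sketch below states them in terms of Euclidean distance rather than similarity (an assumption made here), so "most similar members" becomes the minimum distance and "least similar members" the maximum. The two clusters are hypothetical.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_distances(cluster_a, cluster_b):
    """All distances between one member of cluster_a and one of cluster_b."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]

def single_link_distance(cluster_a, cluster_b):
    # Two most similar members: the smallest pairwise distance.
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_link_distance(cluster_a, cluster_b):
    # Two least similar members: the largest pairwise distance.
    return max(pairwise_distances(cluster_a, cluster_b))

def average_link_distance(cluster_a, cluster_b):
    # Average over all cross-cluster pairs.
    d = pairwise_distances(cluster_a, cluster_b)
    return sum(d) / len(d)

# Hypothetical clusters on a line (assumed data).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]

single = single_link_distance(cluster_a, cluster_b)
complete = complete_link_distance(cluster_a, cluster_b)
average = average_link_distance(cluster_a, cluster_b)
print(single, complete, average)
```

For these points the cross-cluster distances are 3, 6, 2, and 5, so the single, complete, and average link distances are 2, 6, and 4 respectively.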

Software: TreeView [Eisen et al. 1998]
• Fig. 1 in Eisen’s 1998 PNAS paper
• Time course of serum stimulation of primary human fibroblasts
• cDNA arrays with approx. 8600 spots
• Similar to average-link
• Free download at: http://rana.lbl.gov/EisenSoftware.htm

Overview
• Similarity/distance measures
• Hierarchical clustering algorithms
  • Made popular by Stanford, e.g. [Eisen et al. 1998]
• K-means
  • Made popular by many groups, e.g. [Tavazoie et al. 1999]
• Model-based clustering algorithms [Yeung et al. 2001]

Partitional: K-Means [MacQueen 1965]
[Figure: illustration with labels 1, 2, 3]

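Details of k-means
• Iterate until convergence:
  • Assign each data point to the closest centroid
  • Compute the new centroid of each cluster
• Objective function: minimize the within-cluster sum of squared distances, $\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$, where $C_k$ is the k-th cluster and $\mu_k$ is its centroid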
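The assign / recompute loop above can be written as a short sketch. It assumes squared Euclidean distance, takes the starting centroids as an argument (the choice of initial centroids is left to the caller), and runs on hypothetical 2-D points; it is an illustration rather than a tuned implementation.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iter):
        # Assign each data point to the closest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: squared_distance(p, centroids[k]))
                  for p in points]
        # Compute the new centroid of each cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            new_centroids.append(mean(members) if members else centroids[k])
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    # Objective: within-cluster sum of squared distances.
    objective = sum(squared_distance(p, centroids[label])
                    for p, label in zip(points, labels))
    return labels, centroids, objective

# Hypothetical data (assumed): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
labels, centroids, objective = kmeans(points, [points[0], points[3]])
print(labels, centroids, objective)
```

Because the result depends on the starting centroids, a common practice is to run the procedure from several starting points and keep the run with the smallest objective value.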

Properties of k-means
• Fast
• Proven to converge to a local optimum
• In practice, converges quickly
• Tends to produce spherical, equal-sized clusters
• Related to the model-based approach
• Gavin Sherlock’s Xcluster: http://genome-www.stanford.edu/~sherlock/cluster.html

What we have seen so far…
• Definition of clustering
• Pairwise similarity:
  • Correlation
  • Euclidean distance
• Clustering algorithms:
  • Hierarchical agglomerative
  • K-means
• Different clustering algorithms → different clusters
• Clustering algorithms always spit out clusters

Which clustering algorithm should I use?
• Good question
• No definite answer: ongoing research

Examples of clustering

Traditional hierarchical clustering tree
• Networks with community structure
• Connect the pair of nodes with the strongest link, then the next strongest link, …

Uncovering underlying modularity
• Using topological overlap, ranging from no overlap to perfect overlap
• Based on the number of nodes to which both i and j are linked (+1 if i and j are directly linked)
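The slide gives only the count of shared neighbors. The sketch below turns it into a score between 0 and 1 by dividing by the smaller of the two node degrees; that normalization is an assumption made here (it is one common definition of topological overlap), and the small graph is hypothetical.

```python
def topological_overlap(adjacency, i, j):
    """Topological overlap of nodes i and j in an undirected graph.

    adjacency maps each node to the set of its neighbors.
    Assumes both nodes have at least one neighbor.
    """
    neighbors_i = adjacency[i]
    neighbors_j = adjacency[j]
    # Number of nodes to which both i and j are linked ...
    common = len(neighbors_i & neighbors_j)
    # ... plus 1 if i and j are directly linked.
    direct = 1 if j in neighbors_i else 0
    # Assumed normalization: divide by the smaller degree.
    return (common + direct) / min(len(neighbors_i), len(neighbors_j))

# Hypothetical graph (assumed): two triangles joined by the edge c-d.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

overlap_ab = topological_overlap(adjacency, "a", "b")  # same triangle
overlap_cd = topological_overlap(adjacency, "c", "d")  # bridge between triangles
overlap_ae = topological_overlap(adjacency, "a", "e")  # different triangles
print(overlap_ab, overlap_cd, overlap_ae)
```

Nodes inside the same triangle score 1 (perfect overlap), the two bridge endpoints score 1/3, and nodes in different triangles score 0 (no overlap), which is the contrast a clustering tree built on this measure exploits.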

Modules in the E. coli metabolism

The structure of pyrimidine metabolism

Any better methods? Use global information, not local information (i.e., dynamic & whole-network information)

Locally not important, because degree = 2. But globally very important, because it connects two groups!

Betweenness Centrality (BC) [Freeman, 1977]
[Figure: nodes i, j, and k]
• $b_{ij}(k)$ = fraction of the shortest paths between i and j that pass through k
• “How influential is the k-th node to the communication between i and j?”
• Example: the BC at k contributed by the communication from i to j is …
• Accumulate over all ordered pairs: $b(k) = \sum_{i \neq j} b_{ij}(k)$
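The definition above can be computed directly by counting shortest paths with breadth-first search. This is a straightforward sketch for small, undirected, unweighted graphs, not an efficient algorithm; the function names and the four-node graph are hypothetical.

```python
from collections import deque

def bfs_counts(adjacency, source):
    """Return (dist, sigma): shortest-path lengths from source and the
    number of distinct shortest paths from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma

def pair_betweenness(adjacency, i, j, k):
    """Fraction of shortest i-j paths that pass through k (k not an endpoint)."""
    if k in (i, j):
        return 0.0
    dist_i, sigma_i = bfs_counts(adjacency, i)
    dist_k, sigma_k = bfs_counts(adjacency, k)
    if j not in dist_i or k not in dist_i or j not in dist_k:
        return 0.0                       # some node is unreachable
    if dist_i[k] + dist_k[j] != dist_i[j]:
        return 0.0                       # k is on no shortest i-j path
    return sigma_i[k] * sigma_k[j] / sigma_i[j]

def betweenness(adjacency, k):
    """Accumulate the pair contributions over all ordered pairs (i, j)."""
    nodes = list(adjacency)
    return sum(pair_betweenness(adjacency, i, j, k)
               for i in nodes for j in nodes if i != j)

# Hypothetical 4-cycle (assumed): i and j are joined by two routes,
# one through k and one through a.
adjacency = {"i": ["a", "k"], "j": ["a", "k"],
             "a": ["i", "j"], "k": ["i", "j"]}

b_ij_k = pair_betweenness(adjacency, "i", "j", "k")
b_k = betweenness(adjacency, "k")
print(b_ij_k, b_k)
```

In this graph half of the shortest paths between i and j pass through k, so the pair contribution is 0.5, and summing over the ordered pairs (i, j) and (j, i) gives a total of 1.0 for k.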

Algorithm for finding communities (destructive way, using global information)
1. Calculate the betweenness for all edges in the network.
2. Remove the edge with the highest betweenness.
3. Recalculate the betweenness for edges affected by the removal.
4. Repeat from step 2 until no edges remain.
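The four steps above can be sketched as follows. For simplicity this version recomputes the betweenness of every edge after each removal instead of only the affected ones, assumes an undirected, unweighted graph without self-loops, and runs on a hypothetical graph; the function names are illustrative.

```python
from collections import deque

def edge_betweenness(adjacency):
    """Edge betweenness via breadth-first search from every node.
    Each unordered node pair is counted twice (once per direction)."""
    score = {frozenset((u, v)): 0.0 for u in adjacency for v in adjacency[u]}
    for source in adjacency:
        dist, sigma, preds = {source: 0}, {source: 1.0}, {source: []}
        order, queue = [], deque([source])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adjacency[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Push path counts back from the farthest nodes toward the source.
        delta = {u: 0.0 for u in order}
        for w in reversed(order):
            for u in preds[w]:
                flow = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += flow
                delta[u] += flow
    return score

def components(adjacency):
    """Connected components as a list of node sets."""
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adjacency[u] - comp)
        seen |= comp
        result.append(comp)
    return result

def remove_edges_by_betweenness(adjacency):
    graph = {u: set(vs) for u, vs in adjacency.items()}  # work on a copy
    removal_order, partitions = [], []
    n_components = len(components(graph))
    while any(graph.values()):                  # until no edges remain
        score = edge_betweenness(graph)         # steps 1 and 3
        edge = max(score, key=score.get)        # step 2
        u, v = tuple(edge)
        graph[u].discard(v)
        graph[v].discard(u)
        removal_order.append(edge)
        comps = components(graph)
        if len(comps) > n_components:           # the network just split
            n_components = len(comps)
            partitions.append(comps)
    return removal_order, partitions

# Hypothetical graph (assumed): two triangles joined by the edge c-d.
adjacency = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
             "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
removal_order, partitions = remove_edges_by_betweenness(adjacency)
print(removal_order[0], partitions[0])
```

The edge joining the two triangles carries every shortest path between them, so it is removed first and the network splits into the two triangles. The sequence of splits recorded in `partitions` is what a hierarchical tree of communities would be drawn from.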

Modular model of Yeast filamentation network

Apply to the metabolic networks of 43 organisms

Hierarchical clustering tree of T. pallidum

One cluster from M. pneumoniae (sugar import & DNA replication)

Community type of ordering vs. shell type of ordering
• When far from the root
• When close to the root

Schematic picture of metabolic networks

Taxonomy by using metabolic networks
J. Podani, Z.N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabasi, E. Szathmary [Nature Genetics 29, 54 (2001)]

What is taxonomy?

Three domains of Life

Eukaryote
What makes a eukaryote a eukaryote? The name means "true nut" or "true kernel" in Greek; the "nut" is in fact the nucleus of the eukaryotic cell, a membrane sac that contains the cell's DNA. Unlike bacteria and archaea, eukaryotes have their DNA in linear pieces that are bound up with special proteins (histones) to make chromosomes (normally visible only in dividing cells).

Bacteria
Bacteria lack the membrane-bound nuclei of eukaryotes.
[Figure: transmission electron micrograph of a typical bacterium, E. coli]

Archaea
Basic archaeal structure: the three primary regions of an archaeal cell are the cytoplasm, cell membrane, and cell wall. Above, these three regions are labelled, with an enlargement at right of the cell membrane structure. Archaeal cell membranes are chemically different from those of all other living things, including a "backwards" glycerol molecule and isoprene derivatives in place of fatty acids.

Differences between “typical” prokaryotes and eukaryotes
Prokaryotes
• Diameter ~2 microns
• DNA small and circular
• No membrane-bound organelles
• RNA and protein synthesized in the same compartment
• No cytoskeleton
• Cell cycle ~20 min
Eukaryotes
• Diameter ~20 microns
• DNA long, linear, and in chromosomes
• Nucleus, mitochondria, + chloroplasts
• RNA synthesized in the nucleus, proteins made in the cytoplasm
• Cytoskeleton
• Cell cycle ~24 hr
