Learn from chips: Microarray data analysis and clustering • CS 374 • Yu Bai • Nov. 16, 2004
Outline • Background & motivation • Algorithms overview • Fuzzy k-means clustering (1st paper) • Independent component analysis (2nd paper)
CHIP-ing away at medical questions • Why does cancer occur? • Diagnosis • Treatment • Drug design • Molecular-level understanding • Snapshot of gene expression: the "(DNA) microarray"
Spot your genes • [Diagram: RNA is isolated from a cancer cell and a normal cell, labeled with Cy5 and Cy3 dyes, and hybridized against known gene sequences spotted on a glass slide (chip)]
Matrix of expression • [Diagram: rows = Gene 1 … Gene N, columns = Exp 1, Exp 2, Exp 3; entries are expression values E1, E2, E3]
Why care about "clustering"? • [Diagram: the expression matrix (genes × E1–E3) reordered so that genes with similar profiles group together] • Discover functional relations: similar expression ⇒ functionally related • Assign function to unknown genes • Find which gene controls which other genes
A review: microarray data analysis • Supervised (classification) • Unsupervised (clustering) • "Heuristic" methods: hierarchical clustering, k-means clustering, self-organizing maps, others • Probability-based methods: principal component analysis (PCA), independent component analysis (ICA), others
Heuristic methods: distance metrics
1. Euclidean distance: D(X,Y) = sqrt[(x1−y1)² + (x2−y2)² + … + (xn−yn)²]
2. (Pearson) correlation coefficient: R(X,Y) = (1/n) ∑ [(xi − E(x))/σx]·[(yi − E(y))/σy], where σx = sqrt(E(x²) − E(x)²) and E(x) is the expected value of x; R = 1 if x = y, and R = 0 if E(xy) = E(x)E(y)
3. Other choices for distances…
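As an illustration, a minimal NumPy sketch of the two metrics above; the toy expression profiles are made up.

```python
import numpy as np

def euclidean(x, y):
    # D(X,Y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def pearson(x, y):
    # R(X,Y) = (1/n) * sum_i [(x_i - E(x))/sigma_x] * [(y_i - E(y))/sigma_y]
    xc = (x - x.mean()) / x.std()
    yc = (y - y.mean()) / y.std()
    return np.mean(xc * yc)

# toy expression profiles for two genes across 5 experiments
g1 = np.array([2.0, 1.5, 0.3, -0.8, -1.2])
g2 = np.array([1.8, 1.2, 0.1, -0.9, -1.0])
print(euclidean(g1, g2), pearson(g1, g2))
```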
Hierarchical clustering • Easy • Result depends on where the grouping starts • The "tree" structure can be hard to interpret
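For illustration, a short SciPy sketch of hierarchical clustering on a gene × experiment matrix; the random stand-in data, the average-linkage method, the correlation metric, and the cut level are all assumptions of this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# rows = genes, columns = experiments (random stand-in data)
X = np.random.randn(100, 6)

# build an average-linkage tree on correlation distance
Z = linkage(X, method="average", metric="correlation")

# cutting the tree is the interpretive step: the number of clusters is a choice
labels = fcluster(Z, t=5, criterion="maxclust")
```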
k-means clustering • Overall optimization • How many clusters (k)? • How to initialize? • Local minima • Generally, heuristic methods have no established way to determine the "correct" number of clusters or to choose the "best" algorithm
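A minimal scikit-learn sketch of k-means; k, the number of restarts, and the random stand-in data are illustrative, and choosing them is exactly the difficulty the slide points out.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(100, 6)            # genes x experiments (stand-in data)

# k must be fixed up front; multiple restarts mitigate (not solve) local minima
km = KMeans(n_clusters=8, n_init=20, random_state=0).fit(X)
labels = km.labels_                     # hard assignment of each gene to one cluster
centers = km.cluster_centers_
```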
Probability-based methods: Principal component analysis (PCA) • Pearson 1901; Everitt 1992; Basilevsky 1994 • Common use: reduce dimensionality & filter noise • Goal: find "uncorrelated" components that account for as much of the variance of the initial variables as possible • (Linear) "uncorrelated": E[xy] = E[x]·E[y] for x ≠ y
PCA algorithm • [Diagram: the expression matrix X (genes × Exp 1–n) projected onto eigenarrays] • "Column-centered" matrix: A • Covariance matrix: AᵀA • Eigenvalue decomposition: AᵀA = U Λ Uᵀ, where U holds the eigenvectors (principal components, "eigenarrays") and Λ the eigenvalues • Digest the principal components • Gaussian assumption
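The steps above translate directly into NumPy; a sketch assuming genes are rows and experiments are columns of a stand-in matrix X.

```python
import numpy as np

X = np.random.randn(500, 10)           # genes x experiments (stand-in data)

A = X - X.mean(axis=0)                 # "column-centered" matrix A
C = A.T @ A                            # covariance matrix (up to a 1/(n-1) factor)

evals, U = np.linalg.eigh(C)           # eigendecomposition: C = U diag(evals) U^T
order = np.argsort(evals)[::-1]        # sort principal components by eigenvalue
evals, U = evals[order], U[:, order]

scores = A @ U                         # projection of the genes onto the eigenarrays
```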
Are biologists satisfied? • [Figure: super-Gaussian model of expression levels for gene groups driven by biological regulators, ribosome biogenesis, and the energy pathway] • Biological processes are non-Gaussian • "Faithful" vs. "meaningful"
Equal to “source separation” Source 1 Source 2 Mixture 1 ?
Independent vs. uncorrelated • Independence of sources, E[g(x)f(y)] = E[g(x)]·E[f(y)] for x ≠ y and any functions g, f, is stronger than uncorrelatedness • Example of two mixtures: y1 = 2·x1 + 3·x2, y2 = 4·x1 + x2 • [Figure: scatter plots of sources x1, x2 and mixtures y1, y2, comparing the principal components with the independent components]
Independent component analysis (ICA) • Simplified notation • Find the "unmixing" matrix W that makes the recovered components s1, …, sm as independent as possible
(Linear) ICA algorithm • "Likelihood function" = log(probability of observation) • Y = WX • p(x) = |det W|·p(y), with p(y) = Π pi(yi) • L(y, W) = log p(x) = log|det W| + Σ log pi(yi)
(Linear) ICA algorithm • Super-Gaussian model for the source densities pi • Find the W that maximizes L(y, W)
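To make the maximization step concrete, here is a minimal sketch of maximum-likelihood ICA by natural-gradient ascent, assuming a super-Gaussian 1/cosh prior (whose score function is −tanh); the learning rate, iteration count, and toy Laplace sources are illustrative choices, not the exact algorithm from the lecture.

```python
import numpy as np

def ml_ica(X, n_iter=2000, lr=0.01, seed=0):
    """Maximize L(y, W) = log|det W| + sum_i log p_i(y_i) with p_i super-Gaussian,
    using the natural-gradient update dW = (I - tanh(Y) Y^T / T) W."""
    rng = np.random.default_rng(seed)
    n, T = X.shape                              # n mixtures, T samples
    W = np.eye(n) + 0.1 * rng.standard_normal((n, n))
    for _ in range(n_iter):
        Y = W @ X                               # current source estimates
        grad = (np.eye(n) - np.tanh(Y) @ Y.T / T) @ W
        W += lr * grad
    return W

# two super-Gaussian (Laplace) sources mixed linearly -- toy data
rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 5000))
A = np.array([[2.0, 3.0], [4.0, 1.0]])
W = ml_ica(A @ S)
Y = W @ (A @ S)        # recovered sources, up to scaling and permutation
```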
First paper: Gasch et al. (2002) Genome Biology 3, 1-59 • Improve the detection of conditional coregulation in gene expression by fuzzy k-means clustering
Biology is “fuzzy” • Many genes are conditionally co-regulated • k-mean clustering vs. fuzzy k-mean: Xi: expression of ith gene Vj: jth cluster center
FuzzyK flowchart • 1st cycle: initialize the centroids Vj with the PCA eigenvectors • Centroid update: Vj′ = Σi m²(Xi,Vj) W(Xi) Xi / Σi m²(Xi,Vj) W(Xi), where the weight W(Xi) evaluates the correlation of Xi with the other genes • 2nd and 3rd cycles: remove correlated genes (> 0.7) and repeat
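A hedged sketch of one FuzzyK update cycle: the weighted centroid formula matches the flowchart, but the membership function m and the per-gene weight W(Xi) below follow generic fuzzy c-means conventions (Euclidean distance, fuzziness exponent 2) rather than the correlation-based definitions used in the paper.

```python
import numpy as np

def fuzzy_kmeans_step(X, V, weights, fuzz=2.0, eps=1e-9):
    """One cycle: recompute memberships m, then the weighted centroids V.
    X: genes x experiments, V: k x experiments, weights: per-gene weights W(Xi)."""
    # distance of every gene to every centroid
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + eps
    # fuzzy memberships (generic fuzzy c-means form)
    m = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (fuzz - 1.0)), axis=2)
    # weighted centroid update: V_j = sum_i m_ij^2 W_i X_i / sum_i m_ij^2 W_i
    num = (m ** 2 * weights[:, None]).T @ X
    den = (m ** 2 * weights[:, None]).sum(axis=0)[:, None]
    return num / den, m

# toy usage: 200 genes, 6 experiments, 8 centroids, uniform weights
X = np.random.randn(200, 6)
V = X[np.random.choice(200, 8, replace=False)]
V, m = fuzzy_kmeans_step(X, V, weights=np.ones(200))
```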
FuzzyK performance • k is more "definitive" • Recovers the clusters found by classical methods • Uncovers new gene clusters (e.g. cell wall and secretion factors) • Reveals new promoter sequences
Second paper: ICA is so new… • Lee et al. (2003) Genome Biology 4, R76 • Systematic evaluation of ICA with respect to other clustering methods (PCA, k-means)
From linear to non-linear • Linear ICA: X = AS, where X is the expression matrix (N conditions × K genes), si is an independent vector of K gene levels, and xj = Σi aji si • Non-linear ICA: X = f(AS)
How to do non-linear ICA? • Construct a feature space F • Map X to Ψ in F (input space ℝⁿ → feature space ℝᴸ; normally L > N) • ICA of Ψ
Kernel trick • Kernel function: k(xi, xj) = Φ(xi)·Φ(xj), where xi, xj ∈ ℝⁿ (columns of X) are mapped to Φ(xi), Φ(xj) in feature space • Constructing F = constructing ΦV = {Φ(v1), Φ(v2), …, Φ(vL)} to be a basis of F, i.e. rank(ΦVᵀΦV) = L; the vectors {v1, …, vL} are chosen from {xi} • ΦVᵀΦV is the L × L matrix with entries k(vi, vj) • Mapped points in F: Ψ[xi] = (ΦVᵀΦV)^(−1/2) [k(v1, xi), …, k(vL, xi)]ᵀ
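A sketch of this mapping with a Gaussian kernel; the kernel choice, its width sigma, the number of basis vectors L, and the eigenvalue clipping used to form the inverse square root are all assumptions of this illustration. ICA is then run on the rows of Ψ rather than on X directly.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_feature_map(X, L=20, sigma=1.0, seed=0):
    """Map the data points (rows of X) into an L-dimensional kernel feature space."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=L, replace=False)]   # basis points v_1..v_L
    K_V = gaussian_kernel(V, V, sigma)                  # Phi_V^T Phi_V
    # (Phi_V^T Phi_V)^(-1/2) via eigendecomposition, clipping tiny eigenvalues
    w, U = np.linalg.eigh(K_V)
    w = np.clip(w, 1e-10, None)
    K_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
    k_x = gaussian_kernel(V, X, sigma)                  # k(v_l, x_i) for all points
    return (K_inv_sqrt @ k_x).T                         # Psi[x_i] as rows

Psi = kernel_feature_map(np.random.randn(200, 6))       # stand-in data
```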
ICA-based clustering • Independent components yi = (yi1, yi2, …, yiK), i = 1, …, M • "Load": the jth entry of yi is the load of the jth gene • Two clusters per component: Cluster(i,1) = {gene j | yij is among the C%·K largest loads in yi}, Cluster(i,2) = {gene j | yij is among the C%·K smallest loads in yi}
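A sketch of the two-clusters-per-component rule, with scikit-learn's FastICA standing in for the ICA step; the matrix sizes, the number of components M, and the cutoff C are made-up values.

```python
import numpy as np
from sklearn.decomposition import FastICA

X = np.random.randn(20, 5000)          # N conditions x K genes (stand-in data)
M = 10                                  # number of independent components
# rows of Y are the components; each row holds the loads of the K genes
Y = FastICA(n_components=M, random_state=0).fit_transform(X.T).T

C = 0.05                                # keep the top/bottom C% of loads
n_take = int(C * Y.shape[1])
clusters = []
for y in Y:
    order = np.argsort(y)
    clusters.append({
        "largest":  set(order[-n_take:]),   # Cluster(i,1): largest loads
        "smallest": set(order[:n_take]),    # Cluster(i,2): smallest loads
    })
```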
Evaluate biological significance • [Diagram: clusters derived from the independent components are matched against GO functional classes] • Calculate the p-value for each pair (Cluster i, GO class j): the probability that they share that many genes by chance
Evaluate biological significance • With g genes in the microarray data, f genes in the functional class, n genes in the cluster, and k genes shared: • Prob of sharing i genes = C(f, i)·C(g−f, n−i) / C(g, n) • "P-value": p = 1 − Σ(i=0 to k−1) C(f, i)·C(g−f, n−i) / C(g, n) • True positive rate = k/n • Sensitivity = k/f
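The p-value above is the upper tail of a hypergeometric distribution, which scipy.stats provides directly; the counts g, f, n, k below are made-up numbers for illustration.

```python
from scipy.stats import hypergeom

g = 6000    # genes in the microarray data
f = 120     # genes annotated to the functional (GO) class
n = 80      # genes in the cluster
k = 12      # genes shared by the cluster and the class

# p = 1 - sum_{i=0}^{k-1} C(f,i) C(g-f,n-i) / C(g,n)  =  P(X >= k)
p_value = hypergeom.sf(k - 1, g, f, n)

true_positive_rate = k / n    # fraction of the cluster that is in the class
sensitivity = k / f           # fraction of the class recovered by the cluster
print(p_value, true_positive_rate, sensitivity)
```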
Who is better? • Conclusion: ICA-based clustering is generally better
References • Su-In Lee (2002), group talk: "Microarray data analysis using ICA" • Altman et al. (2001) "Whole-genome expression analysis: challenges beyond clustering", Curr. Opin. Struct. Biol. 11, 340 • Hyvärinen et al. (1999) "Survey on independent component analysis", Neural Computing Surveys 2, 94-128 • Alter et al. (2000) "Singular value decomposition for genome-wide expression data processing and modeling", PNAS 97, 10101 • Harmeling et al., "Kernel feature spaces and nonlinear blind source separation"