Learn from chips: Microarray data analysis and clustering • CS 374 • Yu Bai • Nov. 16, 2004
Outline • Background & motivation • Algorithms overview • Fuzzy k-means clustering (1st paper) • Independent component analysis (2nd paper)
CHIP-ing away at medical questions • Why does cancer occur? • Diagnosis • Treatment • Drug design • Molecular-level understanding • Snapshot of gene expression: the "(DNA) microarray"
Spot your genes • [Diagram: RNA is isolated from a cancer cell and a normal cell, labeled with Cy5 and Cy3 dyes, and hybridized against known gene sequences spotted on a glass slide (chip)]
Matrix of expression • [Diagram: rows = Gene 1 … Gene N, columns = Exp 1, Exp 2, Exp 3; entries are expression values E1, E2, E3]
Why care about "clustering"? • [Diagram: the expression matrix (genes × E1–E3) reordered so that genes with similar profiles group together] • Discover functional relations: similar expression ⇒ functionally related • Assign function to unknown genes • Find which gene controls which other genes
A review: microarray data analysis • Supervised (classification) • Unsupervised (clustering) • "Heuristic" methods: hierarchical clustering, k-means clustering, self-organizing maps, others • Probability-based methods: principal component analysis (PCA), independent component analysis (ICA), others
Heuristic methods: distance metrics
1. Euclidean distance: D(X,Y) = sqrt[(x1−y1)² + (x2−y2)² + … + (xn−yn)²]
2. (Pearson) correlation coefficient: R(X,Y) = (1/n) ∑ [(xi − E(x))/σx]·[(yi − E(y))/σy], where σx = sqrt(E(x²) − E(x)²) and E(x) is the expected value of x; R = 1 if x = y, and R = 0 if E(xy) = E(x)E(y)
3. Other choices for distances…
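As an illustration, a minimal NumPy sketch of the two metrics above; the toy expression profiles are made up.

```python
import numpy as np

def euclidean(x, y):
    # D(X,Y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def pearson(x, y):
    # R(X,Y) = (1/n) * sum_i [(x_i - E(x))/sigma_x] * [(y_i - E(y))/sigma_y]
    xc = (x - x.mean()) / x.std()
    yc = (y - y.mean()) / y.std()
    return np.mean(xc * yc)

# toy expression profiles for two genes across 5 experiments
g1 = np.array([2.0, 1.5, 0.3, -0.8, -1.2])
g2 = np.array([1.8, 1.2, 0.1, -0.9, -1.0])
print(euclidean(g1, g2), pearson(g1, g2))
```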
Hierarchical clustering • Easy • Result depends on where the grouping starts • The "tree" structure can be hard to interpret
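For illustration, a short SciPy sketch of hierarchical clustering on a gene × experiment matrix; the random stand-in data, the average-linkage method, the correlation metric, and the cut level are all assumptions of this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# rows = genes, columns = experiments (random stand-in data)
X = np.random.randn(100, 6)

# build an average-linkage tree on correlation distance
Z = linkage(X, method="average", metric="correlation")

# cutting the tree is the interpretive step: the number of clusters is a choice
labels = fcluster(Z, t=5, criterion="maxclust")
```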
k-means clustering • Overall optimization • How many clusters (k)? • How to initialize? • Local minima • Generally, heuristic methods have no established way to determine the "correct" number of clusters or to choose the "best" algorithm
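A minimal scikit-learn sketch of k-means; k, the number of restarts, and the random stand-in data are illustrative, and choosing them is exactly the difficulty the slide points out.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(100, 6)            # genes x experiments (stand-in data)

# k must be fixed up front; multiple restarts mitigate (not solve) local minima
km = KMeans(n_clusters=8, n_init=20, random_state=0).fit(X)
labels = km.labels_                     # hard assignment of each gene to one cluster
centers = km.cluster_centers_
```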
Probability-based methods: Principal component analysis (PCA) • Pearson 1901; Everitt 1992; Basilevsky 1994 • Common use: reduce dimensionality & filter noise • Goal: find "uncorrelated" components that account for as much of the variance of the initial variables as possible • (Linear) "uncorrelated": E[xy] = E[x]·E[y] for x ≠ y
PCA algorithm • [Diagram: the expression matrix X (genes × Exp 1–n) projected onto eigenarrays] • "Column-centered" matrix: A • Covariance matrix: AᵀA • Eigenvalue decomposition: AᵀA = U Λ Uᵀ, where U holds the eigenvectors (principal components, "eigenarrays") and Λ the eigenvalues • Digest the principal components • Gaussian assumption
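The steps above translate directly into NumPy; a sketch assuming genes are rows and experiments are columns of a stand-in matrix X.

```python
import numpy as np

X = np.random.randn(500, 10)           # genes x experiments (stand-in data)

A = X - X.mean(axis=0)                 # "column-centered" matrix A
C = A.T @ A                            # covariance matrix (up to a 1/(n-1) factor)

evals, U = np.linalg.eigh(C)           # eigendecomposition: C = U diag(evals) U^T
order = np.argsort(evals)[::-1]        # sort principal components by eigenvalue
evals, U = evals[order], U[:, order]

scores = A @ U                         # projection of the genes onto the eigenarrays
```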
Are biologists satisfied? • [Figure: super-Gaussian model of expression levels for gene groups driven by biological regulators, ribosome biogenesis, and the energy pathway] • Biological processes are non-Gaussian • "Faithful" vs. "meaningful"
Equal to “source separation” Source 1 Source 2 Mixture 1 ?
Independent vs. uncorrelated • Independence of sources, E[g(x)f(y)] = E[g(x)]·E[f(y)] for x ≠ y and any functions g, f, is stronger than uncorrelatedness • Example of two mixtures: y1 = 2·x1 + 3·x2, y2 = 4·x1 + x2 • [Figure: scatter plots of sources x1, x2 and mixtures y1, y2, comparing the principal components with the independent components]
Independent component analysis (ICA) • Simplified notation • Find the "unmixing" matrix W that makes the recovered components s1, …, sm as independent as possible
(Linear) ICA algorithm • "Likelihood function" = log(probability of observation) • Y = WX • p(x) = |det W|·p(y), with p(y) = Π pi(yi) • L(y, W) = log p(x) = log|det W| + Σ log pi(yi)
(Linear) ICA algorithm • Super-Gaussian model for the source densities pi • Find the W that maximizes L(y, W)
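To make the maximization step concrete, here is a minimal sketch of maximum-likelihood ICA by natural-gradient ascent, assuming a super-Gaussian 1/cosh prior (whose score function is −tanh); the learning rate, iteration count, and toy Laplace sources are illustrative choices, not the exact algorithm from the lecture.

```python
import numpy as np

def ml_ica(X, n_iter=2000, lr=0.01, seed=0):
    """Maximize L(y, W) = log|det W| + sum_i log p_i(y_i) with p_i super-Gaussian,
    using the natural-gradient update dW = (I - tanh(Y) Y^T / T) W."""
    rng = np.random.default_rng(seed)
    n, T = X.shape                              # n mixtures, T samples
    W = np.eye(n) + 0.1 * rng.standard_normal((n, n))
    for _ in range(n_iter):
        Y = W @ X                               # current source estimates
        grad = (np.eye(n) - np.tanh(Y) @ Y.T / T) @ W
        W += lr * grad
    return W

# two super-Gaussian (Laplace) sources mixed linearly -- toy data
rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 5000))
A = np.array([[2.0, 3.0], [4.0, 1.0]])
W = ml_ica(A @ S)
Y = W @ (A @ S)        # recovered sources, up to scaling and permutation
```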
First paper: Gasch et al. (2002) Genome Biology 3, 1-59 • Improve the detection of conditional coregulation in gene expression by fuzzy k-means clustering
Biology is “fuzzy” • Many genes are conditionally co-regulated • k-mean clustering vs. fuzzy k-mean: Xi: expression of ith gene Vj: jth cluster center
FuzzyK flowchart • 1st cycle: initialize the centroids Vj with the PCA eigenvectors • Centroid update: Vj′ = Σi m²(Xi,Vj) W(Xi) Xi / Σi m²(Xi,Vj) W(Xi), where the weight W(Xi) evaluates the correlation of Xi with the other genes • 2nd and 3rd cycles: remove correlated genes (> 0.7) and repeat
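A hedged sketch of one FuzzyK update cycle: the weighted centroid formula matches the flowchart, but the membership function m and the per-gene weight W(Xi) below follow generic fuzzy c-means conventions (Euclidean distance, fuzziness exponent 2) rather than the correlation-based definitions used in the paper.

```python
import numpy as np

def fuzzy_kmeans_step(X, V, weights, fuzz=2.0, eps=1e-9):
    """One cycle: recompute memberships m, then the weighted centroids V.
    X: genes x experiments, V: k x experiments, weights: per-gene weights W(Xi)."""
    # distance of every gene to every centroid
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + eps
    # fuzzy memberships (generic fuzzy c-means form)
    m = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (fuzz - 1.0)), axis=2)
    # weighted centroid update: V_j = sum_i m_ij^2 W_i X_i / sum_i m_ij^2 W_i
    num = (m ** 2 * weights[:, None]).T @ X
    den = (m ** 2 * weights[:, None]).sum(axis=0)[:, None]
    return num / den, m

# toy usage: 200 genes, 6 experiments, 8 centroids, uniform weights
X = np.random.randn(200, 6)
V = X[np.random.choice(200, 8, replace=False)]
V, m = fuzzy_kmeans_step(X, V, weights=np.ones(200))
```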
FuzzyK performance • k is more "definitive" • Recovers the clusters found by classical methods • Uncovers new gene clusters (e.g. cell wall and secretion factors) • Reveals new promoter sequences
Second paper: ICA is so new… • Lee et al. (2003) Genome Biology 4, R76 • Systematic evaluation of ICA with respect to other clustering methods (PCA, k-means)
From linear to non-linear • Linear ICA: X = AS, where X is the expression matrix (N conditions × K genes), si is an independent vector of K gene levels, and xj = Σi aji si • Non-linear ICA: X = f(AS)
How to do non-linear ICA? • Construct a feature space F • Map X to Ψ in F (input space ℝⁿ → feature space ℝᴸ; normally L > N) • ICA of Ψ
Kernel trick • Kernel function: k(xi, xj) = Φ(xi)·Φ(xj), where xi, xj ∈ ℝⁿ (columns of X) are mapped to Φ(xi), Φ(xj) in feature space • Constructing F = constructing ΦV = {Φ(v1), Φ(v2), …, Φ(vL)} to be a basis of F, i.e. rank(ΦVᵀΦV) = L; the vectors {v1, …, vL} are chosen from {xi} • ΦVᵀΦV is the L × L matrix with entries k(vi, vj) • Mapped points in F: Ψ[xi] = (ΦVᵀΦV)^(−1/2) [k(v1, xi), …, k(vL, xi)]ᵀ
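A sketch of this mapping with a Gaussian kernel; the kernel choice, its width sigma, the number of basis vectors L, and the eigenvalue clipping used to form the inverse square root are all assumptions of this illustration. ICA is then run on the rows of Ψ rather than on X directly.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_feature_map(X, L=20, sigma=1.0, seed=0):
    """Map the data points (rows of X) into an L-dimensional kernel feature space."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=L, replace=False)]   # basis points v_1..v_L
    K_V = gaussian_kernel(V, V, sigma)                  # Phi_V^T Phi_V
    # (Phi_V^T Phi_V)^(-1/2) via eigendecomposition, clipping tiny eigenvalues
    w, U = np.linalg.eigh(K_V)
    w = np.clip(w, 1e-10, None)
    K_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
    k_x = gaussian_kernel(V, X, sigma)                  # k(v_l, x_i) for all points
    return (K_inv_sqrt @ k_x).T                         # Psi[x_i] as rows

Psi = kernel_feature_map(np.random.randn(200, 6))       # stand-in data
```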
ICA-based clustering • Independent components yi = (yi1, yi2, …, yiK), i = 1, …, M • "Load": the jth entry of yi is the load of the jth gene • Two clusters per component: Cluster(i,1) = {gene j | yij is among the C%·K largest loads in yi}, Cluster(i,2) = {gene j | yij is among the C%·K smallest loads in yi}
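A sketch of the two-clusters-per-component rule, with scikit-learn's FastICA standing in for the ICA step; the matrix sizes, the number of components M, and the cutoff C are made-up values.

```python
import numpy as np
from sklearn.decomposition import FastICA

X = np.random.randn(20, 5000)          # N conditions x K genes (stand-in data)
M = 10                                  # number of independent components
# rows of Y are the components; each row holds the loads of the K genes
Y = FastICA(n_components=M, random_state=0).fit_transform(X.T).T

C = 0.05                                # keep the top/bottom C% of loads
n_take = int(C * Y.shape[1])
clusters = []
for y in Y:
    order = np.argsort(y)
    clusters.append({
        "largest":  set(order[-n_take:]),   # Cluster(i,1): largest loads
        "smallest": set(order[:n_take]),    # Cluster(i,2): smallest loads
    })
```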
Evaluate biological significance • [Diagram: clusters derived from the independent components are matched against GO functional classes] • Calculate the p-value for each pair (Cluster i, GO class j): the probability that they share that many genes by chance
Evaluate biological significance • With g genes in the microarray data, f genes in the functional class, n genes in the cluster, and k genes shared: • Prob of sharing i genes = C(f, i)·C(g−f, n−i) / C(g, n) • "P-value": p = 1 − Σ(i=0 to k−1) C(f, i)·C(g−f, n−i) / C(g, n) • True positive rate = k/n • Sensitivity = k/f
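The p-value above is the upper tail of a hypergeometric distribution, which scipy.stats provides directly; the counts g, f, n, k below are made-up numbers for illustration.

```python
from scipy.stats import hypergeom

g = 6000    # genes in the microarray data
f = 120     # genes annotated to the functional (GO) class
n = 80      # genes in the cluster
k = 12      # genes shared by the cluster and the class

# p = 1 - sum_{i=0}^{k-1} C(f,i) C(g-f,n-i) / C(g,n)  =  P(X >= k)
p_value = hypergeom.sf(k - 1, g, f, n)

true_positive_rate = k / n    # fraction of the cluster that is in the class
sensitivity = k / f           # fraction of the class recovered by the cluster
print(p_value, true_positive_rate, sensitivity)
```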
Who is better? • Conclusion: ICA-based clustering is generally better
References • Su-In Lee (2002), group talk: "Microarray data analysis using ICA" • Altman et al. (2001) "Whole-genome expression analysis: challenges beyond clustering", Curr. Opin. Struct. Biol. 11, 340 • Hyvärinen et al. (1999) "Survey on independent component analysis", Neural Computing Surveys 2, 94-128 • Alter et al. (2000) "Singular value decomposition for genome-wide expression data processing and modeling", PNAS 97, 10101 • Harmeling et al., "Kernel feature spaces and nonlinear blind source separation"