
Learn from chips: Microarray data analysis and clustering CS 374 Yu Bai Nov. 16, 2004


Presentation Transcript


  1. Learn from chips: Microarray data analysis and clustering. CS 374, Yu Bai, Nov. 16, 2004

  2. Outline • Background & motivation • Algorithms overview • Fuzzy k-means clustering (1st paper) • Independent component analysis (2nd paper)

  3. CHIP-ing away at medical questions • Why does cancer occur? • Diagnosis • Treatment • Drug design • Molecular level understanding • Snapshot of gene expression “(DNA) Microarray”

  4. Spot your genes [Figure: known gene sequences spotted on a glass slide (chip); RNA isolated from a cancer cell and a normal cell, labeled with Cy5 and Cy3 dyes]

  5. Matrix of expression [Figure: expression matrix with rows Gene 1, Gene 2, …, Gene N and columns E1, E2, E3 (Exp 1, Exp 2, Exp 3)]

  6. Why care about “clustering”? • Discover functional relations: similar expression suggests functionally related genes • Assign function to unknown genes • Find which genes control which other genes [Figure: two views of the expression matrix (genes × E1, E2, E3) with the gene rows in different orders]

  7. A review: microarray data analysis • Supervised (classification) • Unsupervised (clustering) • “Heuristic” methods: - Hierarchical clustering - k-means clustering - Self-organizing map - Others • Probability-based methods: - Principal component analysis (PCA) - Independent component analysis (ICA) - Others

  8. Heuristic methods: distance metrics • 1. Euclidean distance: $D(X,Y)=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\dots+(x_n-y_n)^2}$ • 2. (Pearson) correlation coefficient: $R(X,Y)=\frac{1}{n}\sum_i \frac{x_i-E(x)}{\sigma_x}\cdot\frac{y_i-E(y)}{\sigma_y}$, with $\sigma_x=\sqrt{E(x^2)-E(x)^2}$ and $E(x)$ the expected value of x; R = 1 if x = y, R = 0 if $E(xy)=E(x)E(y)$ • 3. Other choices for distances…
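
As a minimal sketch of the two metrics on this slide, the following uses only the Python standard library. The function names and the three toy profiles are assumptions made for illustration; the correlation uses the population (1/n) standard deviation, matching the slide's definition.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation using population (1/n) standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

# Hypothetical profiles over four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, twice the amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]   # mirror image of gene_a

dist_ab = euclidean(gene_a, gene_b)
corr_ab = pearson(gene_a, gene_b)
corr_ac = pearson(gene_a, gene_c)
```

Here gene_a and gene_b are far apart in Euclidean terms yet perfectly correlated, which is why the choice of metric changes which genes end up grouped together.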

  9. Hierarchical clustering • Easy • Depends on where to start the grouping • “Tree” structure is troublesome to interpret [Figure: hierarchical clustering tree over E1, E2, E3]

  10. K-means clustering • Overall optimization • How many clusters (k)? • How to initialize? • Local minima. Generally, heuristic methods have no established means to determine the “correct” number of clusters or to choose the “best” algorithm.
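
The sensitivity to initialization can be seen in a small sketch of plain k-means. The function name, the toy profiles and both sets of starting centers are assumptions chosen for illustration, not data from the talk.

```python
import math

def kmeans(points, centers, n_iter=20):
    """Plain k-means: assign each point to its nearest center, then re-average."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]

# Starting from one point in each group recovers the two groups.
labels_good, _ = kmeans(profiles, centers=[profiles[0], profiles[3]])

# Starting far away leaves one cluster empty and lumps everything together.
labels_bad, _ = kmeans(profiles, centers=[[100.0, 100.0], [101.0, 101.0]])
```

Both runs use k = 2 on the same data; only the starting centers differ, and the second run never escapes its poor solution.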

  11. Probability-based methods: Principal component analysis (PCA) • Pearson 1901; Everitt 1992; Basilevsky 1994 • Common use: reduce dimension & filter noise • Goal: find “uncorrelated” component(s) that account for as much of the variance in the initial variables as possible • (Linear) “uncorrelated”: $E[xy]=E[x]E[y]$, x ≠ y

  12. PCA algorithm • “Column-centered” matrix: A • Covariance matrix: $A^T A$ • Eigenvalue decomposition: $A^T A = U \Lambda U^T$ • U: eigenvectors (principal components) • Λ: eigenvalues • Digest principal components • Gaussian assumption [Figure: expression matrix X (genes × Exp1-n) and its decomposition into U, Λ and eigenarrays]
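
The steps above can be sketched without a linear algebra library by swapping in power iteration for the full eigenvalue decomposition. This is an assumption made to keep the example self-contained, and it recovers only the leading eigenvector of $A^T A$. The function name and data are illustrative.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Leading eigenvector and eigenvalue of A^T A for column-centered A."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    # Column-centered matrix A.
    a = [[r[c] - means[c] for c in range(d)] for r in rows]
    # Unnormalized covariance matrix A^T A.
    cov = [[sum(a[k][i] * a[k][j] for k in range(n)) for j in range(d)]
           for i in range(d)]
    u = [1.0] + [0.5] * (d - 1)  # arbitrary start vector
    for _ in range(n_iter):
        w = [sum(cov[i][j] * u[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    # Rayleigh quotient u^T (A^T A) u gives the eigenvalue for unit-length u.
    eigval = sum(u[i] * sum(cov[i][j] * u[j] for j in range(d)) for i in range(d))
    return u, eigval

# Two perfectly co-varying experiments: all variance lies along (1, 1).
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
u, lam = first_principal_component(data)
```

For a full set of components, a library eigen-solver would normally replace the loop; the centering and covariance steps stay the same.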

  13. Are biologists satisfied? • Biological process is non-Gaussian • “Faithful” vs. “meaningful” [Figure: expression levels of Gene1…Gene5 under biological regulators such as ribosome biogenesis and energy pathway; super-Gaussian model]

  14. Equivalent to “source separation” [Figure: Source 1 and Source 2 combine into Mixture 1; the sources are the unknowns]

  15. Independent vs. uncorrelated • Independence of the sources, $E[g(x)f(y)]=E[g(x)]E[f(y)]$ for x ≠ y and any functions g and f, is stronger than being uncorrelated • Two mixtures: y1 = 2*x1 + 3*x2, y2 = 4*x1 + x2 [Figure: scatter plots of sources (x1, x2) and mixtures (y1, y2), with principal components and independent components marked]

  16. Independent component analysis (ICA) • Simplified notation • Find the “unmixing” matrix A which makes s1,…, sm as independent as possible
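
A small worked example may help show why independence is the stronger condition. The discrete variable below and the choice g(x) = x², f(y) = y are assumptions picked for illustration: y is completely determined by x, yet the two are uncorrelated.

```python
def mean(values):
    return sum(values) / len(values)

# x takes the values -1, 0, 1 with equal probability; y = x^2 depends on x.
xs = [-1.0, 0.0, 1.0]
ys = [x * x for x in xs]

# Uncorrelated: E[xy] equals E[x] * E[y], so this gap is zero.
e_xy = mean([x * y for x, y in zip(xs, ys)])
uncorrelated_gap = e_xy - mean(xs) * mean(ys)

# Not independent: with g(x) = x^2 and f(y) = y the factorization fails.
e_gf = mean([(x * x) * y for x, y in zip(xs, ys)])
independence_gap = e_gf - mean([x * x for x in xs]) * mean(ys)
```

The first gap vanishes while the second equals 2/9, so a method that only removes correlation (such as PCA) can leave this kind of dependence in place.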

  17. (Linear) ICA algorithm • “Likelihood function” = log(probability of observation) • $Y = WX$ • $p(x) = |\det W|\, p(y)$ • $p(y) = \prod_i p_i(y_i)$ • $L(y,W) = \log p(x) = \log|\det W| + \sum_i \log p_i(y_i)$

  18. (Linear) ICA algorithm • Super-Gaussian model • Find the W that maximizes L(y,W)
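
To make the objective concrete, here is a hedged sketch that evaluates $L(y,W) = \log|\det W| + \sum_i \log p_i(y_i)$ for a 2×2 W. The Laplace density $p_i(y) = \tfrac{1}{2}e^{-|y|}$ stands in for the super-Gaussian model (an assumption; the slides do not name the density), and the sources, mixing matrix and function name are illustrative. It only scores candidate matrices; it does not perform the maximization.

```python
import math

def log_likelihood(w, observations):
    """Mean over observations of log|det W| + sum_i log p_i(y_i), with y = W x.

    Assumes a 2x2 W and a Laplace source density p_i(y) = 0.5 * exp(-|y|).
    """
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    total = 0.0
    for x in observations:
        y = [w[0][0] * x[0] + w[0][1] * x[1],
             w[1][0] * x[0] + w[1][1] * x[1]]
        total += math.log(abs(det)) + sum(math.log(0.5) - abs(v) for v in y)
    return total / len(observations)

# Hypothetical sparse sources mixed as x1 = 2*s1 + 3*s2, x2 = 4*s1 + s2.
sources = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
mixed = [[2.0 * s[0] + 3.0 * s[1], 4.0 * s[0] + 1.0 * s[1]] for s in sources]

w_true = [[-0.1, 0.3], [0.4, -0.2]]    # inverse of the mixing matrix
w_identity = [[1.0, 0.0], [0.0, 1.0]]  # no unmixing at all

ll_true = log_likelihood(w_true, mixed)
ll_identity = log_likelihood(w_identity, mixed)
```

The true unmixing matrix scores higher than the identity here; an actual ICA fit would search over W, typically by gradient ascent on this quantity.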

  19. First paper: Gasch, et al. (2002) Genome Biology, 3, 1-59. Improving the detection of conditional coregulation in gene expression by fuzzy k-means clustering

  20. Biology is “fuzzy” • Many genes are conditionally co-regulated • k-means clustering vs. fuzzy k-means: Xi: expression of the ith gene; Vj: jth cluster center

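  21. FuzzyK flowchart • Initial Vj = PCA eigenvectors • Centroid update: $V_j' = \frac{\sum_i m_{X_i V_j}^2\, W_{X_i}\, X_i}{\sum_i m_{X_i V_j}^2\, W_{X_i}}$ • The weight $W_{X_i}$ evaluates the correlation of Xi with others • Remove correlated genes (>0.7) [Flowchart: 1st cycle, 2nd cycle, 3rd cycle, with removal of correlated genes (>0.7)]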
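
The centroid update on this slide, $V_j' = \sum_i m^2 W X_i / \sum_i m^2 W$, can be sketched as follows. The membership formula is not recoverable from the transcript, so the code assumes the standard fuzzy c-means form (inverse squared distance, normalized over clusters); function names, toy profiles and unit weights are likewise assumptions.

```python
def memberships(x, centers):
    """Membership of profile x in each cluster.

    Assumption: standard fuzzy c-means with fuzziness exponent 2, i.e.
    inverse squared Euclidean distance normalized across clusters.
    """
    d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    if 0.0 in d2:  # x sits exactly on a center: full membership there
        return [1.0 if d == 0.0 else 0.0 for d in d2]
    inv = [1.0 / d for d in d2]
    return [v / sum(inv) for v in inv]

def update_center(profiles, m_j, weights):
    """V_j' = sum_i m_ij^2 * W_i * X_i / sum_i m_ij^2 * W_i."""
    coeff = [m * m * w for m, w in zip(m_j, weights)]
    total = sum(coeff)
    return [sum(c * p[k] for c, p in zip(coeff, profiles)) / total
            for k in range(len(profiles[0]))]

profiles = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
centers = [[0.0, 0.0], [2.0, 0.0]]
weights = [1.0, 1.0, 1.0]  # hypothetical gene weights W_Xi

m = [memberships(p, centers) for p in profiles]
m_cluster0 = [row[0] for row in m]
new_center0 = update_center(profiles, m_cluster0, weights)
```

The third gene lies midway between the two centers, so it belongs half to each; unlike in hard k-means, it pulls both centroids toward itself.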

  22. FuzzyK performance • k is more “definitive” • Recovers clusters found by classical methods • Uncovers new gene clusters • Reveals new promoter sequences [Figure: cell wall and secretion factors]

  23. Second paper: ICA is so new… Lee, et al. (2003) Genome Biology, 4, R76. Systematic evaluation of ICA with respect to other clustering methods (PCA, k-means)

  24. From linear to non-linear • Linear ICA: $X = AS$ • X: expression matrix (N conditions × K genes) • $s_i$: independent vector of K gene levels • $x_j=\sum_i a_{ji} s_i$ • Or non-linear ICA: $X = f(AS)$

  25. How to do non-linear ICA? • Construct feature space F • Map X to Ψ in F • ICA of Ψ [Figure: input space $\mathbb{R}^n$ mapped to feature space $\mathbb{R}^L$; normally, L > N]

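  26. Kernel trick • Kernel function: $k(x_i,x_j)=\Phi(x_i)^T\Phi(x_j)$; $x_i$ (ith column of X) and $x_j$ in $\mathbb{R}^n$ are mapped to $\Phi(x_i)$, $\Phi(x_j)$ in feature space • Construct F = construct $\Phi_V=\{\Phi(v_1), \Phi(v_2),\dots,\Phi(v_L)\}$ to be the basis of F, i.e. $\mathrm{rank}(\Phi_V^T\Phi_V)=L$; choose vectors $\{v_1,\dots,v_L\}$ from $\{x_i\}$ • $\Phi_V^T\Phi_V = \begin{bmatrix} k(v_1,v_1) & \cdots & k(v_1,v_L) \\ \vdots & & \vdots \\ k(v_L,v_1) & \cdots & k(v_L,v_L) \end{bmatrix}$ • Mapped points in F: $\Psi[x_i] = \begin{bmatrix} k(v_1,v_1) & \cdots & k(v_1,v_L) \\ \vdots & & \vdots \\ k(v_L,v_1) & \cdots & k(v_L,v_L) \end{bmatrix}^{-1/2} \begin{bmatrix} k(v_1,x_i) \\ \vdots \\ k(v_L,x_i) \end{bmatrix}$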
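
The identity $k(x_i,x_j)=\Phi(x_i)^T\Phi(x_j)$ can be checked on a small case. The degree-2 polynomial kernel below is an assumption chosen because its feature map can be written out by hand; it is not necessarily the kernel used in the paper. The sketch stops at the matrix $\Phi_V^T\Phi_V$ and does not compute the inverse square root.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel on 2-D inputs: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces poly_kernel."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def gram(vectors, kernel):
    """Matrix of k(v_a, v_b), i.e. Phi_V^T Phi_V without ever forming Phi."""
    return [[kernel(a, b) for b in vectors] for a in vectors]

xi, xj = [1.0, 2.0], [3.0, -1.0]
via_kernel = poly_kernel(xi, xj)                         # stays in input space
via_phi = sum(a * b for a, b in zip(phi(xi), phi(xj)))   # goes through F
k_vv = gram([xi, xj], poly_kernel)
```

Both routes give the same inner product, which is what allows the feature space to be used through kernel evaluations alone.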

  27. ICA-based clustering • Independent component $y_i=(y_{i1},y_{i2},\dots,y_{iK})$, i=1,…,M • “Load”: the jth entry of $y_i$ is the load of the jth gene • Two clusters per component: Cluster$_{i,1}$ = {gene j | $y_{ij}$ is among the (C% × K) largest loads in $y_i$}; Cluster$_{i,2}$ = {gene j | $y_{ij}$ is among the (C% × K) smallest loads in $y_i$}
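
The two-clusters-per-component rule can be sketched as a simple ranking of loads. The function name, the rounding of C% × K to a whole number of genes, and the example loads are assumptions for illustration.

```python
def clusters_from_component(loads, c_percent):
    """Return (high, low) gene-index clusters for one independent component."""
    k = len(loads)
    size = max(1, round(c_percent / 100.0 * k))  # C% of K genes, at least one
    order = sorted(range(k), key=lambda j: loads[j])
    low = order[:size]            # genes with the smallest loads
    high = order[-size:][::-1]    # genes with the largest loads, biggest first
    return high, low

# Hypothetical loads of K = 10 genes on one component.
loads = [0.1, -2.3, 0.0, 1.9, -0.2, 3.1, 0.4, -1.7, 0.3, -0.1]
high, low = clusters_from_component(loads, c_percent=20)
```

With C = 20 and K = 10, each cluster holds two genes: genes 5 and 3 carry the largest loads, genes 1 and 7 the smallest. A gene can appear in clusters from several components, since each component is ranked separately.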

  28. Evaluate biological significance • Clusters from ICs (Cluster 1, …, Cluster n) vs. functional classes (GO 1, …, GO m) • Calculate the p-value for each pair (Cluster i, GO j): the probability that they share that many genes by chance

  29. Evaluate biological significance • Microarray data: g genes; functional class: f genes; cluster: n genes; shared: k genes • Prob of sharing i genes = $\binom{f}{i}\binom{g-f}{n-i} \Big/ \binom{g}{n}$ • “P-value”: $p = 1-\sum_{i=0}^{k-1} \binom{f}{i}\binom{g-f}{n-i} \Big/ \binom{g}{n}$ • True positive = k/n • Sensitivity = k/f

  30. Who is better? Conclusion: ICA-based clustering is generally better
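
The hypergeometric calculation on this slide can be sketched with `math.comb` from the standard library. The function names and the small numbers are assumptions for illustration; the sum here starts at i = 0 so that it covers every overlap smaller than k.

```python
from math import comb

def prob_share(i, g, f, n):
    """Probability that a cluster of n genes shares exactly i genes with a
    functional class of f genes, out of g genes on the array."""
    return comb(f, i) * comb(g - f, n - i) / comb(g, n)

def p_value(k, g, f, n):
    """Probability of sharing at least k genes by chance."""
    return 1.0 - sum(prob_share(i, g, f, n) for i in range(k))

# Hypothetical sizes: 10 genes on the array, a class of 4, a cluster of 3
# that shares 2 genes with the class.
g, f, n, k = 10, 4, 3, 2
p = p_value(k, g, f, n)
true_positive = k / n
sensitivity = k / f
```

With these numbers, an overlap of two or more genes arises by chance one time in three, so such a pair would not count as significant.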

  31. References • Su-In Lee (2002), group talk: “Microarray data analysis using ICA” • Altman, et al. 2001, “Whole-genome expression analysis: challenges beyond clustering”, Curr. Opin. Struct. Biol. 11, 340 • Hyvärinen, et al. 1999, “Survey on independent component analysis”, Neural Comp. Surv. 2, 94-128 • Alter, et al. 2000, “Singular value decomposition for genome-wide expression data processing and modeling”, PNAS 97, 10101 • Harmeling, et al., “Kernel feature spaces & nonlinear blind source separation”
