
HCS 825 Class Project

HCS 825 Class Project. By: Jianyang Liu. Temporal gene expression mapping. Data analysis of large-scale temporal gene expression mapping of central nervous system development. Many kinds of Gene expression data: RT – PCR (used for temporal gene expression of CNS---Central Nervous System)  …


Presentation Transcript


  1. HCS 825 Class Project By: Jianyang Liu

  2. Temporal gene expression mapping • Data analysis of large-scale temporal gene expression mapping of central nervous system development

  3. Many kinds of gene expression data: • RT-PCR (used for temporal gene expression of the CNS, the Central Nervous System) •  … • cDNA microarray

  4. Data Description • The time points can be replaced by any other conditions…

  5. Basic Idea • Genes with similar functions should have similar expression profiles. • Sometimes the expression profile can tell us about function • Many approaches to clustering data…

  6. Euclidean distance • A measure for the difference between gene expression patterns. • $D(A,B) = \sum_{i=1}^{N} (A_i - B_i)^2$ • + Easy to calculate, intuitive • - Affected by amplitude
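The distance formula above can be sketched in a few lines. This is a minimal illustration in Python (the slides themselves use SAS); the three gene profiles are made-up values, not data from the project. It shows the amplitude effect: `gene_b` has the same shape as `gene_a` but ends up far away from it.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two expression profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of time points")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical profiles over four time points (illustrative values only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same shape as gene_a, twice the amplitude
gene_c = [1.0, 2.0, 3.0, 5.0]  # almost identical to gene_a

d_ab = squared_euclidean(gene_a, gene_b)
d_ac = squared_euclidean(gene_a, gene_c)
print(d_ab, d_ac)  # 30.0 1.0
```

Although `gene_a` and `gene_b` rise in exactly the same pattern, their distance is thirty times larger than the distance between `gene_a` and `gene_c`.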

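  7. Correlation 1) Pearson’s correlation • Based on actual values and designed to detect a linear relationship • $r = SSP / \sqrt{SSX \cdot SSY}$ (SSP = sum of cross products of deviations from the means; SSX, SSY = sums of squared deviations) • + Detects both positive and negative relationships; scale invariant • - Sensitive to outliers, not intuitive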
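As a rough sketch of the Pearson formula above, the following Python snippet (Python is used only for illustration; the slides use SAS) computes SSP, SSX and SSY directly. The profiles are hypothetical values chosen so the results are easy to check by hand.

```python
import math


def pearson_r(x, y):
    """Pearson correlation written as r = SSP / sqrt(SSX * SSY)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ssp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ssx = sum((xi - mean_x) ** 2 for xi in x)
    ssy = sum((yi - mean_y) ** 2 for yi in y)
    return ssp / math.sqrt(ssx * ssy)


# Hypothetical profiles (illustrative values only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # gene_a scaled by 2
gene_c = [4.0, 3.0, 2.0, 1.0]  # gene_a reversed

r_ab = pearson_r(gene_a, gene_b)  # scale invariant: perfectly correlated
r_ac = pearson_r(gene_a, gene_c)  # perfectly anti-correlated
print(r_ab, r_ac)
```

Unlike the Euclidean distance, the doubled amplitude of `gene_b` has no effect here, which is the scale invariance listed as an advantage.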

  8. 2) Spearman’s correlation • It is based on the ranks of the items rather than on their actual values. • $R = 1 - \dfrac{6 \sum D^2}{N (N^2 - 1)}$, where R = rank correlation coefficient, D = difference between the ranks of two items, N = number of observations. • + Non-parametric • - Less sensitive
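The rank formula above can be checked with a small Python sketch (again Python rather than the SAS used in the slides). It assumes there are no tied values, which is the case in which the formula is exact; the data are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to N. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_r(x, y):
    """Spearman rank correlation: R = 1 - 6 * sum(D^2) / (N * (N^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical measurements (illustrative values only).
x = [0.1, 0.5, 0.9, 4.0, 20.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

print(ranks(x))  # [1, 2, 3, 4, 5]
print(ranks(y))  # [1, 3, 2, 5, 4]
print(spearman_r(x, y))
```

Here the rank differences are 0, -1, 1, -1, 1, so the sum of D² is 4 and R = 1 - 24/120 = 0.8.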

  9. Comparison of Pearson and Spearman corr • r = Pearson’s correlation coefficient for {(x_i, y_i)} = 0.249 • r_S = Spearman’s correlation = Pearson’s correlation coefficient for the rank pairs {(a_i, b_i)} = 0.786
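The relationship on this slide, that Spearman's coefficient is Pearson's coefficient applied to ranks, can be illustrated with a short Python sketch. The data below are invented for the example and are not the points behind the 0.249 and 0.786 values; they only show how a single extreme value lowers Pearson's r while leaving the rank correlation untouched.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: SSP / sqrt(SSX * SSY)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ssp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ssx = sum((xi - mean_x) ** 2 for xi in x)
    ssy = sum((yi - mean_y) ** 2 for yi in y)
    return ssp / math.sqrt(ssx * ssy)


def ranks(values):
    """Rank values from 1 (smallest) to N. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


# Hypothetical data: monotonic, but the last y value is an extreme outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 1000.0]

r = pearson_r(x, y)                   # pulled down by the outlier
r_s = pearson_r(ranks(x), ranks(y))   # Spearman = Pearson on the ranks
print(round(r, 3), r_s)
```

Because the ordering of `y` is unchanged by the outlier, the rank-based coefficient stays at 1 while Pearson's r drops to roughly 0.71.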

  10. Before clustering… • These correlation coefficients need to be converted to distances, d = 1 - r, because clustering is based on a distance matrix. After conversion… • A typical algorithm, called neighbor joining here: • Start with each gene in its own cluster • Pick the two closest clusters and join them • Repeat until only one cluster remains
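The conversion d = 1 - r can be sketched as follows. This Python snippet is only an illustration of the idea (the slides do the same step in a SAS data step); the gene names and profiles are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: SSP / sqrt(SSX * SSY)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ssp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ssx = sum((xi - mean_x) ** 2 for xi in x)
    ssy = sum((yi - mean_y) ** 2 for yi in y)
    return ssp / math.sqrt(ssx * ssy)


def correlation_distance_matrix(profiles):
    """Return gene names and the matrix of d = 1 - r for every gene pair."""
    names = list(profiles)
    matrix = [
        [1.0 - pearson_r(profiles[a], profiles[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical genes and expression profiles (illustrative values only).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with geneA
}

names, dist = correlation_distance_matrix(profiles)
for name, row in zip(names, dist):
    print(name, [round(d, 3) for d in row])
```

With this conversion, perfectly correlated genes are at distance 0, uncorrelated genes at 1, and perfectly anti-correlated genes at 2.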

  11. Hierarchical Classification • There are three typical algorithms to decide on the distance between two clusters: • Choose the shortest distance between pairs of genes in the two clusters (nearest neighbor, single linkage) • Choose the average distance (UPGMA) • Choose the longest distance (complete linkage) • Where to go? (Dxy < Dxz, Dxy < Dyz) [Diagram: X and Y joined at a midpoint, with Z apart]
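The joining procedure and the three linkage rules can be sketched together. The Python code below is a simple, unoptimized illustration and not the SAS PROC CLUSTER implementation; the function names and the three-point distance matrix for X, Y and Z are assumptions made for the example.

```python
def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters (tuples of item indices)."""
    pairwise = [dist[i][j] for i in c1 for j in c2]
    if linkage == "single":      # shortest pairwise distance
        return min(pairwise)
    if linkage == "complete":    # longest pairwise distance
        return max(pairwise)
    if linkage == "average":     # mean pairwise distance (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)


def agglomerate(dist, linkage="average"):
    """Join the two closest clusters repeatedly until one remains."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical distances for X (0), Y (1), Z (2): Dxy < Dxz and Dxy < Dyz.
dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 6.0],
    [4.0, 6.0, 0.0],
]

for linkage in ("single", "average", "complete"):
    print(linkage, agglomerate(dist, linkage))
```

In all three cases X and Y are joined first at distance 1. The rules differ only in where Z attaches to the XY cluster: at 4 (single), 5 (average) or 6 (complete).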

  12. Comparison of Distance Computing Approaches • Single Linkage: can identify long, thin clusters; can be subject to “chaining” • Complete Linkage: identifies tight, spherical clusters • Average Linkage: compromise between single and complete linkage; less sensitive to outliers

  13. What the Somogyi group got: Fitch clustering graph

  14. Wave 1 • Wave 2 • Wave 3 • Wave 4 • Constant

  15. SAS code for Euclidean Distances
      data one;
        title 'Cluster with Euclidean Distances';
        infile 'c:\one.csv' delimiter=',' firstobs=2;
        input E11 E13 E15 E18 E21 P0 P7 P14 A gene $;
      run;
      /* PROC CLUSTER uses Euclidean distances by default. */
      /* If the data are already distances, type=distance is needed. */
      /* Change method= for a different distance computing approach (sin, com). */
      proc cluster data=one method=ave out=two /* nosquare */;
        id gene;
        var E11 E13 E15 E18 E21 P0 P7 P14 A;
      run;
      goptions htext=.2 fontres=presentation htitle=2;
      proc tree data=two horizontal vpages=2 hpages=3 maxh=2.5;
        id gene;
      run;
      quit;

  16. No graphic option modified

  17. Part of Euclidean Distance (squared) Graph

  18. Pearson and Spearman corr analysis
      data one;
        infile 'c:\one.csv' delimiter=',' firstobs=2;
        input E11 E13 E15 E18 E21 P0 P7 P14 A gene $;
      run;
      proc transpose data=one out=two;
        id gene;
      run;
      /* Use outs= instead of outp= for Spearman corr */
      proc corr noprint outp=twop;
        var keratin--DD632;
      run;
      data three (drop=_type_ _name_ type=distance);
        set twop;
        gene = _name_;             /* helps retain the gene name on the graph */
        if _type_ eq 'CORR';
        array numbs keratin--DD632;
        do over numbs;
          numbs = 1 - numbs;       /* 1 - r turns the correlation into a distance */
        end;
      run;
      proc cluster data=three(type=distance) method=ave out=four;
        var keratin--DD632;        /* specify the distance columns */
        id gene;
      run;
      title 'Tree graph of Pearson correlation';
      proc tree data=four;
        id gene;
      run;
      quit;

  19. Graph with and without id gene

  20. Pearson correlation cluster

  21. Spearman correlation cluster

  22. Why principal components? • Measuring more variables allows for a more exact model, but makes the correct model exponentially harder to find. • Theory: The goal of a PC analysis is to explain the variance-covariance structure of the variables through linear combinations of the variables.

  23. Proposed Methodology • From the linear combinations, the factor loadings for each component help explain which variables contribute the most to the variance. • Hypothesis: If a gene has different expression levels for each class, then that gene will have a moderate to high degree of variability. Therefore, I’m interested in those genes with high factor loadings for each component.
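To make the idea of loadings concrete, the sketch below finds the first principal component of a small covariance matrix by power iteration. It is written in Python with invented data and is not what SAS does internally; note also that it works on the covariance matrix, whereas PROC PRINCOMP, as far as I know, analyzes the correlation matrix unless the COV option is given.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix; rows are observations, columns variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_principal_component(cov, iterations=200):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, v


# Hypothetical data: 4 observations of 3 variables (illustrative values).
# Variable 2 is twice variable 1; variable 3 barely changes.
rows = [
    [1.0, 2.0, 0.5],
    [2.0, 4.0, 0.4],
    [3.0, 6.0, 0.6],
    [4.0, 8.0, 0.5],
]

cov = covariance_matrix(rows)
eigenvalue, loadings = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
print([round(x, 3) for x in loadings])
print(round(eigenvalue / total_variance, 4))  # share of variance explained
```

The second variable receives the largest loading and the nearly constant third variable the smallest, which is the sense in which loadings point to the variables that contribute most to the variance.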

  24. Principal component analysis SAS code
      data one;
        title 'Example of Proc princomp with Expression Data';
        infile 'c:\one.csv' delimiter=',' firstobs=2;
        input E11 E13 E15 E18 E21 P0 P7 P14 A gene $;
      proc princomp out=prin;
        var E11 E13 E15 E18 E21 P0 P7 P14 A;
      proc plot;
        plot prin2*prin1 $ gene / vpos=28;
        plot prin2*prin3 $ gene / vpos=28;
      run;

  25. Eigenvalues matrix

  26. Prin1 by Prin2 graph • Different genes such as actin, NFL, NFM, NMDA1, etc. are detected as a distant group. • [Plot: Prin2 (vertical axis, about -4 to 4) against Prin1 (horizontal axis, -5 to 15), points labeled by gene. Labels set apart from the main cloud include cellubr, actin, nestin, NFL, GFAP, NMDA1, NFM and GRg1.]

  27. Prin2 by Prin3 graph • Similar genes as in the previous graph are detected as a distant group. • [Plot: Prin2 (vertical axis, about -4 to 4) against Prin3 (horizontal axis, -5.0 to 7.5), points labeled by gene. Labels set apart from the main cloud include actin, cellubr, nestin, NFL, NMDA1, GFAP, NFM and GRg1.]

  28. Part of Euclidean Clustering Graph (Squared) • Deviating genes such as actin, NFL, NFM, NMDA1, etc. are also detected as a distant group.

  29. Acknowledgement • Many thanks to: • Dr. Francis SHELIA J • Prof. BISHOP, BERT LUDVIG

  30. The End Thank You!
