MINING THE GENE EXPRESSION MATRIX: INFERRING GENE RELATIONSHIPS FROM LARGE SCALE GENE EXPRESSION DATA

Presentation Transcript


  1. MINING THE GENE EXPRESSION MATRIX: INFERRING GENE RELATIONSHIPS FROM LARGE SCALE GENE EXPRESSION DATA Patrik D'haeseleer, Xiling Wen, Stefanie Fuhrman, and Roland Somogyi Information Processing in Cells and Tissues, pp. 203-212, 1998 Presented by Bin He

  2. Motivations • it is necessary to determine large-scale temporal gene expression patterns • to decipher the logic of gene regulation, we should aim to monitor the expression levels of all genes simultaneously

  3. Gene time series • assay the expression levels of large numbers of genes in a tissue at different time points • the relative amounts of mRNA produced at these time points provide a gene expression time series for each gene

  4. Gene Expression Matrix • Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L., and Somogyi, R., 1997, Large-scale temporal gene expression mapping of CNS development, Proc. Natl. Acad. Sci., in press

  5. Previous Approach • Euclidean distance and information theoretic measures to cluster the genes into related expression time series • A significant problem with this approach is the variety of measures that can be used • Each measure produces a unique clustering of gene expression patterns

  6. Contributions • determining significant relationships between individual genes, based on: • linear correlation • rank correlation • information theory

  7. Linear correlation ------positive correlation • positive linear correlation

  8. Linear correlation ------negative correlation • negative linear correlation

  9. Linear correlation ------restriction • for 112 different genes, 112x111/2 = 6216 pairs of expression time series need to be examined • to restrict the number of relationships, we might want to test which correlations are significantly larger than a certain value
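As a rough illustration of this all-pairs scan, the sketch below computes the Pearson correlation for every pair of series with NumPy. The `expr` array and its 112 × 9 shape are placeholders for the paper's data, not the real expression matrix.

```python
import numpy as np

# Placeholder expression matrix: 112 genes x 9 time points (illustrative only).
rng = np.random.default_rng(0)
expr = rng.normal(size=(112, 9))

r = np.corrcoef(expr)                # 112 x 112 matrix of pairwise Pearson r
iu = np.triu_indices_from(r, k=1)    # upper triangle: one entry per gene pair
pair_r = r[iu]
print(pair_r.size)                   # -> 6216 = 112 * 111 / 2
```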

  10. Linear correlation ------restriction • For instance, to find those relationships in which at least 50% of the variance is explained by the correlation, i.e. ρ² > 0.5, we need |r| > 0.96 to reject, at the 1% significance level, the null hypothesis that |ρ| ≤ 0.7071
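One way to arrive at such a cutoff is the Fisher z-transform of the correlation coefficient; the sketch below illustrates that approach, not the paper's exact calculation. The number of time points `n = 9` is an assumption, and the resulting threshold (about 0.95) is only close to the |r| > 0.96 quoted above, since the exact value depends on the test and data used.

```python
import numpy as np
from scipy.stats import norm

def r_threshold(rho0=np.sqrt(0.5), n=9, alpha=0.01):
    """Smallest |r| rejecting H0: |rho| <= rho0 at level alpha (one-sided, Fisher z)."""
    z0 = np.arctanh(rho0)                          # Fisher transform of the null value
    z_crit = z0 + norm.ppf(1 - alpha) / np.sqrt(n - 3)
    return np.tanh(z_crit)

print(round(r_threshold(), 3))                     # about 0.95 under these assumptions
```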

  11. Linear correlation ------visualization • residual-variance-based distance measure • d = 1 - r² • d = 0 if perfectly correlated, d = 1 if uncorrelated • multidimensional scaling • map time series into a two-dimensional plane
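A minimal sketch of this visualization step, assuming a (series × time points) array and using scikit-learn's MDS as one possible multidimensional-scaling implementation; the paper does not specify a particular library, and the data below are placeholders.

```python
import numpy as np
from sklearn.manifold import MDS

# Placeholder data: e.g. 34 highly correlated series over 9 time points.
rng = np.random.default_rng(0)
expr = rng.normal(size=(34, 9))

d = 1.0 - np.corrcoef(expr) ** 2                   # d = 1 - r^2: residual-variance distance
np.fill_diagonal(d, 0.0)                           # self-distance is exactly zero
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(d)      # embed the series in a 2D plane
print(coords.shape)                                # -> (34, 2)
```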

  12. Linear correlation ------visualization • Multidimensional scaling of 34 time series with high correlation

  13. Nonlinear correlation ------Model • Spearman rank correlation, rs • a measure of monotonic relationships • can be used for non-Gaussian distributions • 491 pairs of expression time series, involving 98 genes, have a significant rs, ranging from -0.979 to 0.996
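A small illustration of why rank correlation helps here, using scipy. The two series are made up: `y` is a monotonic but strongly non-linear function of `x`, so the Spearman rs is 1.0 while the Pearson r is noticeably lower.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

x = np.linspace(0.0, 1.0, 9)         # 9 illustrative "time points"
y = np.exp(5.0 * x)                  # monotonic, strongly non-linear response

rs, _ = spearmanr(x, y)              # rank correlation: 1.0 (perfectly monotonic)
r, _ = pearsonr(x, y)                # linear correlation: clearly below 1
print(rs, r)
```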

  14. Nonlinear correlation ------Example • High rank correlation but low linear correlation between mGluR1 and GRa2

  15. Information Theory ------mutual information • if H(A) and H(B) are the entropies of sources A and B respectively, and H(A,B) the joint entropy of the sources, then M(A,B) = H(A) + H(B) - H(A,B) • the discrete form is much easier to use • we need to discretize the time series by partitioning the expression levels into bins
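A minimal sketch of this measure for already-binned series. The helper names (`entropy`, `mutual_information`, `discretize`) and the equal-width binning are our assumptions; the paper only specifies the formula M(A,B) = H(A) + H(B) - H(A,B).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a discrete series."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(a_bin, b_bin):
    """M(A,B) = H(A) + H(B) - H(A,B) for two binned series."""
    joint = a_bin * (b_bin.max() + 1) + b_bin      # encode each (A, B) pair as one symbol
    return entropy(a_bin) + entropy(b_bin) - entropy(joint)

def discretize(x, bins=3):
    """Equal-width binning of a continuous series into levels 0 .. bins-1."""
    edges = np.histogram_bin_edges(x, bins)[1:-1]  # internal bin boundaries only
    return np.digitize(x, edges)
```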

  16. Information Theory ------Bin size • The fewer bins we use to discretize the data, the more information about the original time series we ignore. • On the other hand, too fine a binning will leave us with too few points per bin to get a reasonable estimate of the frequency of each bin
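A tiny illustration of this trade-off (the 9-point series is an assumption for illustration): as the bin count grows, most bins end up with 0 or 1 samples, so the per-bin frequency estimates become unreliable.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=9)                     # one illustrative expression series
for bins in (2, 3, 5, 9):
    counts, _ = np.histogram(x, bins=bins)
    print(bins, counts)                    # more bins -> sparser, noisier counts
```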

  17. Information Theory ------Mapping • Some time series map to the same discretized series • In total, from 112 unique continuous-valued time series we get 91 discretized time series
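A sketch of the resulting deduplication, reusing the `discretize()` helper from the earlier sketch. The array and its shape are placeholders; the 112 → 91 reduction quoted above comes from the paper's data, not from this example.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(112, 9))                       # placeholder continuous series

binned = np.vstack([discretize(row) for row in expr])  # discretize each series
unique_binned = np.unique(binned, axis=0)              # drop series that became identical
print(binned.shape[0], "->", unique_binned.shape[0])
```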

  18. Information Theory ------Mapping

  19. Information Theory ------Mapping • some discretized series are related by a one-to-one mapping, i.e. a permutation of the bin numbers • for such pairs, H(A) = H(B) = M(A,B) • e.g. rows 3 and 4 • we replace such time series by a single series, leaving us with a set of 77 unique, non-equivalent time series
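A sketch of the equivalence test implied here, reusing the `entropy()` and `mutual_information()` helpers from above: if B is A with its bin numbers permuted (a one-to-one mapping), then H(A) = H(B) = M(A,B). The example series are made up.

```python
import numpy as np

def equivalent_up_to_relabeling(a_bin, b_bin, tol=1e-12):
    """True if the two binned series differ only by a permutation of bin numbers."""
    ha, hb = entropy(a_bin), entropy(b_bin)
    m = mutual_information(a_bin, b_bin)
    return abs(ha - hb) < tol and abs(ha - m) < tol

a = np.array([0, 0, 1, 2, 2, 1, 0, 1, 2])
b = np.array([2, 2, 0, 1, 1, 0, 2, 0, 1])   # same pattern with bins relabeled 0->2, 1->0, 2->1
print(equivalent_up_to_relabeling(a, b))    # -> True: keep only one of the two series
```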

  20. Information Theory ------Measurement • symmetric measures • M(A,B)/max(H(A),H(B)) • M(A,B)/H(A,B) • asymmetric measures • relative mutual information R(A,B) = M(A,B)/H(B) • R(A,B) = 1.0 means that all the information about time series B is contained in time series A
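Sketches of these normalized measures, again reusing the helpers above; the function names are ours, not the paper's.

```python
def symmetric_mi_max(a_bin, b_bin):
    """Symmetric: M(A,B) / max(H(A), H(B))."""
    return mutual_information(a_bin, b_bin) / max(entropy(a_bin), entropy(b_bin))

def symmetric_mi_joint(a_bin, b_bin):
    """Symmetric: M(A,B) / H(A,B)."""
    joint = a_bin * (b_bin.max() + 1) + b_bin          # same pair encoding as before
    return mutual_information(a_bin, b_bin) / entropy(joint)

def relative_mi(a_bin, b_bin):
    """Asymmetric R(A,B) = M(A,B) / H(B); equals 1.0 when A fully determines B."""
    return mutual_information(a_bin, b_bin) / entropy(b_bin)
```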

  21. Conclusion • Linear correlation can be used very effectively to detect linear relationships • detect relationships not captured by Euclidean distance, such as high negative correlations • Rank correlation can be used to detect non-linear relationships • much more robust with respect to the distribution of expression levels • Information theory can be used to detect genes whose (binned) expression patterns share information • It will detect any mapping from time series A to B
