Data Mining the Yeast Genome Expression and Sequence Data

Presentation Transcript


  1. Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute

  2. Why yeast is interesting to industry • Easy to work with; the first fully sequenced eukaryotic model organism • 30% of genes have analogs in human • Most known human disease genes have homologues in yeast • Of interest in itself to the food industry

  3. Genetic networks [Figure: DNA carrying promoter1/gene1 through promoter4/gene4; labels: transcription, transcription factors, RNA, translation, proteins]

  4. Mining the Yeast Expression Data • Long-term goals: • reconstructing the gene regulation networks and relating them to metabolic pathways • Short-term goals: • correlating gene expression profiles with gene functional classes and using this to predict gene functions • correlating gene expression profiles with promoter regions

  5. Yeast microarray

  6. Yeast gene expression during diauxic shift (DeRisi et al.) Cells from an exponentially growing yeast culture were inoculated into fresh medium and, after an initial period, harvested at seven 2-hour intervals. Their mRNA was isolated and fluorescently labeled cDNA was prepared. Two different fluorescent labels were used: one for the cells harvested at each successive time-point, the other for the cells harvested at the first time-point (the reference measurement). The cDNA from each time-point, together with the reference cDNA, was hybridized to a microarray with approximately 6400 DNA sequences representing ORFs of the yeast genome. The relative fluorescence intensity measured for each element reflects the relative abundance of the corresponding mRNA.
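The measurement described on this slide can be sketched as a log ratio per array element. This is a minimal illustration only: the base-2 logarithm, the function names and the intensity values are assumptions, not taken from the slides.

```python
import math

def log_ratio(sample_intensity, reference_intensity):
    """Log2 of the sample/reference fluorescence ratio for one array element.

    Base 2 is an assumption; the slides only say "logarithm of expression ratio".
    """
    return math.log2(sample_intensity / reference_intensity)

def expression_profile(sample_intensities, reference_intensities):
    """Log-ratio profile of one gene across the time-points."""
    return [log_ratio(s, r) for s, r in zip(sample_intensities, reference_intensities)]

# Hypothetical intensities for one gene at seven time-points.
sample = [100, 100, 100, 100, 200, 400, 50]
reference = [100, 100, 100, 100, 100, 100, 100]
profile = expression_profile(sample, reference)
```

On a log2 scale, a two-fold change in relative abundance corresponds to a step of 1 in either direction.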

  7. Visualizing the data (expression profile of the “first” 250 genes)

  8. Average expression level of genes at the respective time-points

  9. Three approaches • Finding correlations between gene expression profiles and their functional classes • Building decision trees for predicting gene functional classes from their expression data • In silico discovery of putative transcription factor binding sites in the regions upstream of genes with similar expression profiles (to appear in Genome Research, Dec. 1998)

  10. Gene distribution across the functional classes

  11. Energy gene subclasses in the yeast (less frequent ones merged into one)

  12. Gene expression for energy genes during the diauxic shift at the seven time-points

  13. Expression profiles of respiration genes

  14. Expression profiles of fermentation genes

  15. Average expression levels at the 7 time-points and for energy class genes during diauxic shift

  16. Average expression levels at all time-points and for all energy classes

  17. Energy classes distribution

  18. Energy classes distribution

  19. Decision tree for respiration genes

  20. Decision tree for fermentation

  21. Tricarboxylic acid, respiration and reserves decision tree

  22. Clustering the gene expression profiles by discretization of the gene expression measurement space [Figure: logarithm of expression ratio (axis from -2 to 2) plotted against time points 1 to 7] Corresponding discrete pattern: 000012-1 • Put the genes mapping to the same discrete pattern in a cluster
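A minimal sketch of the discretization idea on this slide: each log ratio is mapped to an integer level, and genes with identical level sequences share a cluster. The slide does not state the thresholds, so truncation toward zero with a step of 1 is an assumption here, as are the gene names and values.

```python
import math
from collections import defaultdict

def discretize(log_ratios, step=1.0):
    """Map each log ratio to an integer level by truncating toward zero.

    The step size (threshold spacing) is an assumed parameter.
    """
    return tuple(math.trunc(x / step) for x in log_ratios)

def cluster_by_pattern(profiles, step=1.0):
    """Group gene names by their discrete pattern."""
    clusters = defaultdict(list)
    for gene, log_ratios in profiles.items():
        clusters[discretize(log_ratios, step)].append(gene)
    return dict(clusters)

# Hypothetical log-ratio profiles at seven time-points.
profiles = {
    "geneA": [0.1, -0.2, 0.3, 0.4, 1.2, 2.5, -1.1],
    "geneB": [0.0, 0.5, -0.4, 0.2, 1.7, 2.1, -1.6],
    "geneC": [0.2, 0.1, 0.0, -0.3, 0.4, 0.6, 0.1],
}
clusters = cluster_by_pattern(profiles)
```

Here geneA and geneB both map to the pattern 0 0 0 0 1 2 -1 and land in one cluster, while geneC maps to the all-zero pattern. Changing `step` changes which genes are grouped together.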

  23. Organization of a typical yeast promoter [Figure: promoter layout; labels: URS, URS, TATA, I, Coding Region, RNA; distances: 40 - 120 bp, 20 - 700 bp, 40 - 60 bp]

  24. In silico discovery of transcription factor binding sites from expression data Take data from gene expression level measurements (from DNA array technologies) -> Cluster together genes with similar expression profiles -> Take sequences upstream from the genes in each cluster -> Look for sequence patterns overrepresented in a cluster

  25. Clustering genes by similar expression profiles • Put in each cluster all genes that map to the same discrete pattern • Different thresholds give different clustering systems • We obtained 32 different clusters containing from 10 to 77 genes and 11 clusters containing at least 25 genes

  26. Hypothesis to test • Genes with similar expression profiles may be regulated by similar regulatory mechanisms and thus may contain similar transcription factor binding sites

  27. Discovering regulatory elements in gene upstream sequences • Take the sequences of a certain length (e.g., 300 bp) upstream of all genes with a certain expression profile • Look for a priori unknown sequence patterns that are over-represented in these regions (taking into account the other upstream regions as background)
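The two steps on this slide can be sketched as follows. This is a simplification under stated assumptions: only forward-strand genes with 0-based start coordinates are handled, and patterns are restricted to plain fixed-length substrings (k-mers) rather than the more general patterns used in the talk. All names are hypothetical.

```python
from collections import Counter

def upstream(chromosome, gene_start, length=300):
    """Return up to `length` bases immediately before gene_start.

    Assumes a forward-strand gene and a 0-based start coordinate.
    """
    return chromosome[max(0, gene_start - length):gene_start]

def kmer_sequence_counts(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts

# Toy upstream regions; real ones would be about 300 bp long.
regions = ["CCCCT", "ACCCC", "GGGGG"]
counts = kmer_sequence_counts(regions, 4)
```

Counting sequences rather than total occurrences keeps one repeat-rich region from dominating; comparing these counts with the counts over the other upstream regions gives the over-representation signal.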

  28. Pattern discovery in biosequences • Group together sequences thought to have common biological (structural, functional) properties, ignoring the purely sequence (syntactic) properties • Study the purely syntactic properties of these sequences, ignoring their biological (semantic) properties.

  29. Problem of “noise” • Gene expression measurement accuracy is about a factor of 2 (in 95% of cases) • Clusters are very dependent on the clustering method and thresholds • The same expression profile does not necessarily mean the same regulation mechanism

  30. Dealing with noise • One cannot look for patterns common to the whole set of strings, only for patterns overrepresented in the set • Look for sets of patterns covering the set • Use “negative” or background sequences

  31. More powerful algorithms than those currently available are needed • We used a new, more powerful algorithm based on a suffix-tree representation of the sequence space (implemented by Jaak Vilo at Helsinki University) • We looked systematically for all patterns discriminating the upstream regions in the clusters from randomly selected upstream regions

  32. Use of negative sequences Looking for patterns that are overrepresented in the sequences upstream from genes in a cluster in comparison to all other upstream sequences

  33. The rating function • Given two sets S+ and S- and a pattern P, return a rating R(S+, S-, P) • Two rating functions that we used: • ratio: number of sequences in S+ matching P divided by number of sequences in S- matching P • probability that the pattern can occur in S+ “by chance”, assuming that the occurrences in S- are “by chance” and using the binomial distribution
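A minimal sketch of the two rating functions, under stated assumptions: a pattern is matched as a regular expression in which "." stands for any single base, the binomial rating is read as the upper-tail probability of seeing at least the observed number of matching sequences in S+, and S- is non-empty. The function names are hypothetical and this is not the authors' implementation.

```python
import math
import re

def count_matching(pattern, sequences):
    """Number of sequences containing at least one match of the pattern."""
    rx = re.compile(pattern)
    return sum(1 for s in sequences if rx.search(s))

def ratio_rating(pos, neg, pattern):
    """Matching sequences in S+ divided by matching sequences in S-."""
    n_pos = count_matching(pattern, pos)
    n_neg = count_matching(pattern, neg)
    if n_neg == 0:
        return math.inf if n_pos else 0.0
    return n_pos / n_neg

def binomial_rating(pos, neg, pattern):
    """P(X >= k) for X ~ Binomial(|S+|, p), with p estimated from S-."""
    k = count_matching(pattern, pos)
    n = len(pos)
    p = count_matching(pattern, neg) / len(neg)
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Toy data: S+ is a cluster's upstream regions, S- is the background.
s_pos = ["AACCCCTA", "CCCCTGGG", "TTTT"]
s_neg = ["CCCCT", "AAAA", "GGGG", "TTTT"]
ratio = ratio_rating(s_pos, s_neg, "CCCCT")
p_chance = binomial_rating(s_pos, s_neg, "CCCCT")
```

A smaller binomial rating means the pattern is less likely to appear in S+ that often by chance. The ratio as written uses raw counts, so it is only comparable between patterns when the sizes of S+ and S- are fixed.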

  34. The sequence pattern discovery experiment • We ran the algorithm on upstream sequences (length 2 * 300) of all the 32 gene clusters • Each cluster produced hundreds of overrepresented patterns • The problem of validation

  35. Some discovered sequence patterns from clusters of upstream sequences • Clusters with an increase in the expression level after time-point 6: CCCCT - known to be a stress-responsive motif • Clusters with a decrease in the expression level after time-point 6: ATCC..T..A - RAP1 protein; ATC..TAC - RAP1, REB1, BAF1; ATTTCA…T - GA-BF protein

  36. Statistical validation of the discovered patterns • For each cluster choose a random set of upstream regions of the same size • Run the pattern discovery algorithm on the random set as well as on the cluster • Compare the scores of the patterns discovered from the cluster and from the random set
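The validation procedure can be sketched as below. This is an illustration under assumptions, not the authors' code: pattern discovery is replaced by scoring every plain k-mer with a binomial tail probability, the background is the full set of upstream regions (including the subset itself, which keeps the estimated probability above zero), and the names, toy sequences and trial count are hypothetical.

```python
import math
import random
from collections import Counter

def kmer_sequence_counts(sequences, k):
    """For each k-mer, count how many sequences contain it."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def best_score(subset, all_regions, k=4):
    """Smallest chance probability over all k-mers found in the subset."""
    sub = kmer_sequence_counts(subset, k)
    bg = kmer_sequence_counts(all_regions, k)
    return min(
        (binomial_tail(sub[m], len(subset), bg[m] / len(all_regions)) for m in sub),
        default=1.0,
    )

def validate(cluster, all_regions, trials=50, k=4, seed=0):
    """Compare the cluster's best score with same-size random sets."""
    rng = random.Random(seed)
    observed = best_score(cluster, all_regions, k)
    random_scores = [
        best_score(rng.sample(all_regions, len(cluster)), all_regions, k)
        for _ in range(trials)
    ]
    as_good = sum(1 for s in random_scores if s <= observed) / trials
    return observed, random_scores, as_good

cluster = ["ACCCCTA", "GCCCCTG", "TCCCCTT"]
others = ["ATGATGAT", "GATTACAG", "TTGACATA", "AGAGAGAG", "TGCATGCA", "GTTAACGT"]
observed, random_scores, as_good = validate(cluster, cluster + others)
```

If random sets of the same size often score as well as the cluster, the patterns found in the cluster are not convincing. With toy data this small the comparison only demonstrates the mechanics.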

  37. Conclusions • The discovered patterns are in accordance with the existing knowledge • Transcription factor binding sites can be discovered in silico from gene expression data • More refined and validated gene expression measurements are needed

  38. Acknowledgements • Inge Jonassen (Bergen) • Jaak Vilo, Esko Ukkonen (Helsinki) • Alistair Ewing, Neil Skilling (Quadstone Ltd - developers of Decisionhouse data mining software) • BIOVIS and BIOSTANDARDS projects from the EU at EBI