
Gene discovery using combined signals from genome sequence and natural selection






Presentation Transcript


  1. Gene discovery using combined signals from genome sequence and natural selection • Michael Brent, Washington University • The mouse genome analysis group

  2. Genes are read out via mRNA & processing GENSIPS

  3. RNA Processing

  4. A typical human gene structure

  5. In a mammalian genome • Finding all the genes is hard • Mammalian genomes are large: 5,051 miles of 10pt type (Raleigh to Tripoli, Libya) • Only about 1.5% is protein coding (Raleigh to Winston-Salem)

  6. Genes are fairly unconstrained • Intron length is highly variable: ~5% are 40-100 nt long; ~3% are longer than 30,000 nt • Distance between genes is highly variable: from 10^3 to 10^6 nt or more (probably)

  7. Exons per gene (RefSeq)

  8. Background is not random • Segmental duplications: entire regions duplicate, then diverge slowly • Processed pseudogenes: spliced transcripts integrate back into the genome; sequence is similar to source genes; generally not functional

  9. Gene prediction: two approaches • 1. Transcript-based (e.g., GeneWise): map experimentally determined sequences of spliced transcripts to their genomic source; map transcript sequences to genomic regions that could produce similar transcripts • 2. De novo (genome only): model DNA patterns characteristic of gene components (splice donor and acceptor, protein coding sequence, translation start and stop)

  10. Advantages and disadvantages • Transcript-based • Advantage: conservative (evidence of transcription for every exon) • Disadvantage: conservative (can’t find “truly novel” genes) • Still subject to error

  11. Advantages and disadvantages • De novo • Advantage 1: less biased toward known transcripts and toward transcripts that can be sequenced easily • Advantage 2: genome sequencing is easy • Disadvantages: no direct evidence of transcription; presumably, more false positives

  12. Single-genome de novo: Genscan • Strengths: for mammalian sequence, one of the best single-genome, de novo gene predictors; widely used to great practical advantage; de facto standard for mammalian sequence • Limitations: predicts >45K genes (best est.: 25-30K); predicts >315K exons (best est.: 200K-250K); gets only 9% of known genes exactly right*

  13. Dual-genome de novo • We developed algorithms that use two genomes to • Reduce the number of false positives • Refine the details of the structures

  14. Single-genome de novo method • Probability model: assigns probability to annotated DNA sequences, e.g. 5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’ [figure labels: Intron, 5’ UTR, Exon] • Optimization algorithm: given a DNA sequence, find the most probable annotation, according to the model

  15. Genscan’s generative model [figure: states Intron, Exon, Intron generating CCATGGCGTCTTCAGGCAGTGACTC]

  16. Genscan’s generative model • States correspond to gene features • Model generates DNA sequence by passing through states • The probability of an annotated DNA sequence is the probability of generating the DNA sequence by passing through the states corresponding to the annotation • Generalized HMM
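The "find the most probable annotation" step can be illustrated with Viterbi decoding on a toy model. This is only a sketch under loud assumptions: it uses a plain two-state HMM rather than Genscan's generalized HMM, and the states, transition values, and emission values below are invented for illustration, not taken from Genscan. The input sequence is assumed to be non-empty.

```python
import math

def viterbi(seq, states, log_start, log_trans, log_emit):
    """Most probable state path for seq under a plain (non-generalized) HMM."""
    # best[-1][s]: log-probability of the best path so far that ends in state s
    best = [{s: log_start[s] + log_emit[s][seq[0]] for s in states}]
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for ch in seq[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][ch]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the annotation
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

def logs(table):
    """Convert a table of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical toy parameters: GC-rich "exon" state, AT-rich "intron" state
STATES = ["exon", "intron"]
LOG_START = logs({"exon": 0.5, "intron": 0.5})
LOG_TRANS = {
    "exon": logs({"exon": 0.9, "intron": 0.1}),
    "intron": logs({"exon": 0.1, "intron": 0.9}),
}
LOG_EMIT = {
    "exon": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
    "intron": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),
}

path, logp = viterbi("GCGCGCATATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print("".join("E" if s == "exon" else "I" for s in path))  # EEEEEEIIIIII
```

A generalized HMM additionally lets each state emit a whole variable-length segment (for example an exon with an explicit length distribution), which this per-nucleotide sketch does not model.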

  17. Dual genome prediction • Input: target and informant genomes • Idea: patterns of evolution since the last common ancestor may reveal gene structure

  18. Two conservation signals • 1. Local alignment signal: selective pressures differ by feature; this leaves a characteristic signature • 2. Structural signal: locations of introns tend to be conserved

  19. Characteristic local alignments
      Coding exon
        human TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
              |||||||||||||||||||| || ||||| || || |||
        mouse TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
      Intron (non-coding)
        human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
              ||||||     ||  | |||||||||  || ||   ||
        mouse CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
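The difference between the two alignments on this slide can be made concrete by counting matches, mismatches, and gap columns. The helper below is a small illustrative sketch; the function name and the choice of summary statistics are assumptions, not part of the talk.

```python
def alignment_stats(a, b):
    """Count (matches, mismatches, gap columns) in a gapped pairwise alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gaps += 1          # one side has no nucleotide in this column
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps

# The two human/mouse alignments shown on the slide
exon = alignment_stats(
    "TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC",
    "TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC",
)
intron = alignment_stats(
    "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC",
    "CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT",
)
print(exon)    # (34, 5, 0)
print(intron)  # (24, 8, 7)
```

In this example the coding exon aligns without gaps and with few mismatches, while the intron alignment contains gaps and more mismatches, which is the kind of signature the local alignment signal relies on.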

  20. Conservation of intron location

  21. Align→predict→filter→test [pipeline figure: WU-BLAST (align) → TWINSCAN (predict) → Aligned Intron Filter (filter) → Validation by RT-PCR (test); the example sequences repeat the coding-exon alignment from slide 19]

  22. TWINSCAN [figure: a representation change turns the alignment TCTGCCACC / TCAGCTACT into the conservation sequence ||:||:||, which is then passed to gHMM decoding]

  23. BLAST Alignments [figure: Target, Informant]

  24. Projecting BLAST Alignments [figure: Target, Informant]

  25. Projecting BLAST Alignments [figure: Target, Informant]

  26. Projecting BLAST Alignments [figure: Target, Informant]

  27. Projecting BLAST Alignments [figure: Target, Informant]

  28. Conservation sequence • Synthetic (projected) local alignment • Pair each nucleotide of the target with • “|” if it is aligned and identical
        human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
              ||||||         | |||||||||  || ||   ||
        mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT

  29. Conservation sequence • Synthetic (projected) local alignment • Pair each nucleotide of the target with • “|” if it is aligned and identical • “:” if it is aligned to mismatch or gap
        human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
              ||||||         |:|||||||||::||:||   ||:
        mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT

  30. Conservation sequence • Synthetic (projected) local alignment • Pair each nucleotide of the target with • “|” if it is aligned and identical • “:” if it is aligned to mismatch or gap • “.” if it is unaligned
        human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
              ||||||.........|:|||||||||::||:||   ||:
        mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT

  31. Conservation sequence • Pair each nucleotide of the target with • “|” if it is aligned and identical • “:” if it is aligned to mismatch or gap • “.” if it is unaligned
        human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
              ||||||.........|:|||||||||::||:||   ||:

  32. Conservation sequence • Pair each nucleotide of the target with • “|” if it is aligned and identical • “:” if it is aligned to mismatch or gap • “.” if it is unaligned
        human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
              ||||||.........|:|||||||||::||:||||:
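The three rules for building a conservation sequence can be sketched in a few lines. The function below is an illustration, not TWINSCAN's code: the block representation (a 0-based target start plus the two gapped aligned strings) is an assumption, and the blocks are assumed not to overlap on the target, whereas the real pipeline has to choose among overlapping BLAST alignments.

```python
def conservation_sequence(target_len, blocks):
    """Build a conservation sequence for a target of length target_len.

    blocks: list of (target_start, aligned_target, aligned_informant), where
    target_start is the 0-based target index of the block's first nucleotide
    and gaps are written as '-'. Blocks are assumed not to overlap.
    """
    cons = ["."] * target_len              # "." = unaligned
    for start, tgt, inf in blocks:
        if len(tgt) != len(inf):
            raise ValueError("aligned strings must have equal length")
        pos = start
        for t, i in zip(tgt, inf):
            if t == "-":
                continue                   # gap in target: no target nucleotide here
            # "|" = aligned and identical, ":" = aligned to mismatch or gap
            cons[pos] = "|" if t == i else ":"
            pos += 1
    return "".join(cons)

# The human (target) / mouse (informant) example from the slides
human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC"
blocks = [
    (0, "CTAGAG", "CTAGAG"),
    (15, "AAACAGGTACCGCAGTGC---CCC", "AGACAGGTACCATAGGGCTCTCCT"),
]
print(human)
print(conservation_sequence(len(human), blocks))
# ||||||.........|:|||||||||::||:||||:
```

The output has exactly one symbol per target nucleotide, so the columns where the target has a gap disappear, as on slide 32.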

  33. Twinscan: Extending the model • Probability model: assigns probability to annotated DNA paired with its conservation sequence:
        5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
          |||........|:||||:|||||||||:||::||
      [figure labels: Intron, 5’ UTR, Exon] • Optimization: given DNA and conservation sequence, find the most probable annotation, according to the model

  34. Twinscan • Each state “generates” DNA and conservation sequence independently • The probability of an annotated DNA and conservation sequence is the probability of generating the DNA and the conservation sequence by passing through the corresponding states
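Because each state generates the DNA and the conservation sequence independently, the joint probability of an annotation factors into a DNA term and a conservation term at every position. The sketch below shows that factorization for a per-nucleotide toy model; the state names and all probability values are hypothetical, and the models in Twinscan itself are richer than these zeroth-order tables.

```python
import math

def joint_log_prob(states, dna, cons, start, trans, dna_emit, cons_emit):
    """Log-probability of generating dna and cons along the given state path."""
    if not (len(states) == len(dna) == len(cons)) or not states:
        raise ValueError("states, dna and cons must be non-empty and equally long")
    logp = math.log(start[states[0]])
    for k, (s, base, c) in enumerate(zip(states, dna, cons)):
        if k > 0:
            logp += math.log(trans[states[k - 1]][s])
        # Independence: the emission probability is a product of two factors
        logp += math.log(dna_emit[s][base]) + math.log(cons_emit[s][c])
    return logp

# Hypothetical toy parameters (not Twinscan's): exons are mostly conserved
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
DNA_EMIT = {s: {b: 0.25 for b in "ACGT"} for s in START}
CONS_EMIT = {"exon": {"|": 0.80, ":": 0.15, ".": 0.05},
             "intron": {"|": 0.30, ":": 0.20, ".": 0.50}}

dna, cons = "ACGT", "||||"
as_exon = joint_log_prob(["exon"] * 4, dna, cons, START, TRANS, DNA_EMIT, CONS_EMIT)
as_intron = joint_log_prob(["intron"] * 4, dna, cons, START, TRANS, DNA_EMIT, CONS_EMIT)
print(as_exon > as_intron)  # True: a fully conserved stretch favors the exon label
```

With uniform DNA emissions, as here, the conservation sequence alone decides between the two annotations, which isolates the extra signal the second genome contributes.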

  35. Performance Evaluation • RefSeq • A set of ~13,000 “known” mRNAs • Represents ~40-50% of human genes • Usually, only one of several splices • Mapping to genome is imperfect • Best available gold standard


  40. Short-term goal • All multi-exon human genes: predict accurately; integrate information from more genomes; verify at least one intron experimentally; follow up with full-length verification

  41. Acknowledgments • Funding agencies: National Institutes of Health (NHGRI); National Science Foundation (DBI) • Sequencing centers: Sanger, Whitehead, Wash. U. • My group: Ian Korf, Paul Flicek, Evan Keibler, Ping Hu • Collaborators: Roderic Guigo, Josep Abril, Genis Parra; Pankaj Agarwal; Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis

  42. Other clades • Plants: Arabidopsis thaliana, cabbage, rice • Nematodes: C. elegans, C. briggsae • Fungi: Cryptococcus neoformans (JEC21, H99)

  43. Pair HMM algorithms (SLAM,…) • Input is orthologous sequences • Aligns and predicts simultaneously, using a joint probability model • Predicts orthologous genes in 2 sequences • All predicted CDS is aligned • Some aligned regions are not predicted CDS (labeled conserved non-coding sequence)

  44. The algorithms (SLAM,…) • sgp2 • Alignment before prediction (tblastx) • Predicts genes in target sequence only • Doesn’t need orthologous input sequences • Paralogs & low-coverage shotgun can help • Modifies scores of all potential exons: at each base, adds the tblastx score of the best overlapping local alignment (roughly) to the geneid scores of that potential exon
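The exon rescoring described for sgp2 can be sketched from a literal reading of the slide's rough description. Everything here is an assumption made for illustration: the function name, the half-open coordinate convention, the `weight` parameter, and the flat per-base sum. The real program works with tblastx HSPs in specific reading frames and its exact scoring differs.

```python
def rescore_exon(exon_start, exon_end, base_score, hsps, weight=1.0):
    """Add alignment evidence to a potential exon's score.

    exon_start, exon_end: half-open target interval [start, end) of the exon.
    base_score: the single-genome (geneid-style) score of the exon.
    hsps: list of (start, end, score) local alignments, half-open on the target.
    weight: hypothetical scaling factor for the alignment bonus.
    """
    bonus = 0.0
    for pos in range(exon_start, exon_end):
        # Scores of all local alignments that cover this base
        covering = [score for (s, e, score) in hsps if s <= pos < e]
        if covering:
            bonus += max(covering)   # best overlapping alignment at this base
    return base_score + weight * bonus

hsps = [(0, 15, 2.0), (12, 30, 3.0)]
print(rescore_exon(10, 20, 5.0, hsps))   # 33.0
print(rescore_exon(40, 50, 5.0, hsps))   # 5.0 (no alignment support)
```

Exons with no overlapping alignment keep their original score, so the alignment only ever adds evidence in this sketch.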

  45. The algorithms • TWINSCAN • Alignment before prediction (blastn) • Predicts in target sequence only • Modifies scores of all potential exons, UTRs, splice sites, start and stop models: at each base, applies a feature-specific scoring model (estimated for this purpose) to the best overlapping local alignment and adds the result to the Genscan scores for that feature

  46. % Aligned, CDS vs. other

  47. Syntenic Gene Prediction (sgp2) [figure tracks: tblastx HSPs, HSP projections, query sequence, geneid exons, SGP exons]

  48. Why work on gene finding? • Genes are • Components responsible for biological function (variations cause human disease / susceptibility) • Controls for modifying biological function (human gene therapy; agriculture; nanotechnology, etc.)
