Gene discovery using combined signals from genome sequence and natural selection • Michael Brent, Washington University • The mouse genome analysis group
& processing Genes are read out via mRNA GENSIPS
RNA Processing
A typical human gene structure
In a mammalian genome • Finding all the genes is hard • Mammalian genomes are large • 5,051 miles of 10pt type • Raleigh to Tripoli, Libya • Only about 1.5% protein coding • Raleigh to Winston-Salem
Genes are fairly unconstrained • Intron length is highly variable • ~5% are 40-100 nt long • ~3% are longer than 30,000 nt • Distance between genes is highly variable • From 10^3 to 10^6 nt or more (probably)
Exons per gene (RefSeq)
Background is not random • Segmental duplications • Entire regions duplicate, then diverge slowly • Processed pseudogenes • Spliced transcripts integrate back into the genome • Sequence is similar to source genes • Generally not functional
Gene prediction: two approaches • 1. Transcript-based (e.g., GeneWise) • Map experimentally determined sequences of spliced transcripts to their genomic source • Map transcript sequences to genomic regions that could produce similar transcripts • 2. De novo (genome only) • Model DNA patterns characteristic of gene components • Splice donor and acceptor • Protein coding sequence • Translation start and stop
Advantages and disadvantages • Transcript-based • Advantage: conservative • Evidence of transcription for every exon • Disadvantage: conservative • Can’t find “truly novel” genes • Still subject to error
Advantages and disadvantages • De novo • Advantage 1: Less biased toward • Known transcripts • Transcripts that can be sequenced easily • Advantage 2: Genome sequencing is easy • Disadvantages • No direct evidence of transcription • Presumably, more false positives
Single-genome de novo: Genscan • Strengths • For mammalian sequence, one of the best single-genome, de novo gene predictors • Widely used to great practical advantage • De facto standard for mammalian sequence • Limitations • Predicts >45K genes (best est.: 25-30K) • Predicts >315K exons (best est.: 200K-250K) • Gets only 9% of known genes exactly right*
Dual genome de novo • We developed algorithms that use two genomes to • Reduce the number of false positives • Refine the details of the structures
Single-genome de novo method • Probability model • Assigns probability to annotated DNA sequences: • 5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’ • Optimization algorithm • Given a DNA sequence, find the most probable annotation, according to the model Intron 5’ UTR Exon
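The optimization step on this slide (find the most probable annotation for a DNA sequence under a probability model) can be illustrated with the Viterbi algorithm on a toy hidden Markov model. This is only a sketch under loud assumptions: the two states, all probabilities, and the function name `viterbi` are invented for illustration, and a plain HMM is used here instead of the richer model Genscan actually uses.

```python
import math

STATES = ("exon", "intron")

# Toy parameters: illustrative assumptions, NOT Genscan's trained values.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},    # GC-rich
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
}


def viterbi(seq):
    """Return (best log probability, most probable state path) for seq."""
    # Score of the best path ending in each state after the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per subsequent base
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this base.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


if __name__ == "__main__":
    logp, path = viterbi("ATATATGCGCGCGCATATAT")
    print(round(logp, 3), path)
```

The returned path is the "annotation": one state label per base. Working in log space avoids numerical underflow on long sequences.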
Genscan’s generative model [Figure: sequence CCATGGCGTCTTCAGGCAGTGACTC labeled Intron, Exon, Intron]
Genscan’s generative model • States correspond to gene features • Model generates DNA sequence by passing through states • The probability of annotated DNA sequence is the probability of • generating the DNA sequence • by passing through states corresponding to the annotation • Generalized HMM
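The idea that an annotated sequence gets its probability from the states it passes through can be sketched for a generalized HMM, where each state emits a whole segment whose length has its own distribution. Everything below is an assumption made for illustration: the two states, the geometric length distributions, the emission tables, and the name `annotated_log_prob` are hypothetical and are not taken from Genscan.

```python
import math


def geometric(mean):
    """Length distribution with the given mean: P(d) = (1 - p)^(d - 1) * p."""
    p = 1.0 / mean
    return lambda d: (1 - p) ** (d - 1) * p


# Toy model: illustrative assumptions only.
MODEL = {
    "exon": {
        "length": geometric(150),
        "emit": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    },
    "intron": {
        "length": geometric(1000),
        "emit": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    },
}
INIT = {"exon": 0.5, "intron": 0.5}
# In a generalized HMM a state emits a whole segment, so there are no
# self-transitions: an exon segment is always followed by an intron here.
TRANS = {("exon", "intron"): 1.0, ("intron", "exon"): 1.0}


def annotated_log_prob(segments):
    """Log probability of an annotation given as [(state, dna_segment), ...]."""
    logp = math.log(INIT[segments[0][0]])
    prev = None
    for state, dna in segments:
        if prev is not None:
            logp += math.log(TRANS[(prev, state)])             # state change
        logp += math.log(MODEL[state]["length"](len(dna)))     # segment length
        logp += sum(math.log(MODEL[state]["emit"][b]) for b in dna)  # bases
        prev = state
    return logp


if __name__ == "__main__":
    annotation = [("intron", "ATTAT"), ("exon", "GCGGCC"), ("intron", "TATTA")]
    print(annotated_log_prob(annotation))
```

Comparing `annotated_log_prob` across alternative segmentations of the same DNA is what the optimization step does, only exhaustively and efficiently.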
Dual genome prediction • Input • Target and informant genomes • Idea • Patterns of evolution since the last common ancestor may reveal gene structure
Two conservation signals • 1. Local alignment signal • Selective pressures differ by feature • This leaves a characteristic signature • 2. Structural signal • Locations of introns tend to be conserved
Characteristic local alignments
Coding exon
human  TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
       |||||||||||||||||||| || ||||| || || |||
mouse  TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Intron (non-coding)
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||     ||  | |||||||||  || ||   ||
mouse  CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
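The difference between the two alignments on this slide can be quantified with a few lines of code. The sketch below counts matches, gap columns, and mismatch columns for each pair of rows; the function name `alignment_stats` and the returned dictionary layout are assumptions made for illustration. In the coding example the mismatches fall every third column, which is what synonymous third-position changes would look like, although the slide does not state the reading frame.

```python
# Alignment rows copied from the slide.
CODING = (
    "TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC",  # human
    "TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC",  # mouse
)
INTRON = (
    "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC",  # human
    "CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT",  # mouse
)


def alignment_stats(row_a, row_b):
    """Summarize a pairwise alignment given as two equal-length gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    matches = gaps = 0
    mismatch_columns = []  # 1-based alignment columns
    for col, (a, b) in enumerate(zip(row_a, row_b), start=1):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatch_columns.append(col)
    return {
        "matches": matches,
        "gaps": gaps,
        "mismatch_columns": mismatch_columns,
        "identity": matches / len(row_a),
    }


if __name__ == "__main__":
    for name, rows in (("coding exon", CODING), ("intron", INTRON)):
        print(name, alignment_stats(*rows))
```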
Conservation of intron location
Align→predict→filter→test [Figure: pipeline from WU-BLAST alignment, through TWINSCAN prediction and the aligned-intron filter, to validation by RT-PCR]
TWINSCAN [Figure: a local alignment (TCTGCCACC over TCAGCTACT) goes through a representation change into a conservation sequence (TCTGCCACC with ||:||:||), which feeds gHMM decoding]
BLAST Alignments [Figure: local alignments between Target and Informant]
Projecting BLAST Alignments [Figure: alignments projected from Informant onto Target]
Conservation sequence • Synthetic (projected) local alignment • Pair each nucleotide of the target with • “|” if it is aligned and identical • “:” if it is aligned to mismatch or gap • “.” if it is unaligned
human         CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
              ||||||.........|:|||||||||::||:||   ||:
mouse         CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Conservation sequence (gap columns in the target removed)
human         CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
conservation  ||||||.........|:|||||||||::||:||||:
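The three rules for building a conservation sequence translate directly into code. The sketch below assumes the projected alignment is supplied as a list of blocks, each giving a 0-based start position in the target plus the gapped target and informant rows; that input layout and the name `conservation_sequence` are assumptions made for illustration, not Twinscan's actual data format.

```python
def conservation_sequence(target_len, blocks):
    """Build the conservation string for a target of length target_len.

    blocks: iterable of (target_start, target_row, informant_row), where the
    two rows are equal-length gapped alignment rows and target_start is the
    0-based target position of the first target base in the block.
    """
    cons = ["."] * target_len          # "." = unaligned by default
    for start, target_row, informant_row in blocks:
        pos = start
        for t, i in zip(target_row, informant_row):
            if t == "-":
                continue               # gap in target: no target base to label
            # "|" for an identical base; ":" for a mismatch or informant gap.
            cons[pos] = "|" if t == i else ":"
            pos += 1
    return "".join(cons)


if __name__ == "__main__":
    target = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC"
    blocks = [
        (0, "CTAGAG", "CTAGAG"),
        (15, "AAACAGGTACCGCAGTGC---CCC", "AGACAGGTACCATAGGGCTCTCCT"),
    ]
    print(target)
    print(conservation_sequence(len(target), blocks))
```

Because the output has exactly one symbol per target base, it can be fed to the gene model alongside the DNA without any coordinate bookkeeping.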
Twinscan: Extending the model • Probability model • Assigns probability to annotated DNA: • 5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’ • |||........|:||||:|||||||||:||::|| • Optimization • Given DNA and conservation sequence, find the most probable annotation, according to the model Intron 5’ UTR Exon
Twinscan • Each state “generates” DNA and conservation sequence independently • Probability of annotated DNA and conservation sequence is probability of generating the DNA and conservation sequence by passing through corresponding states
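Because each state generates the DNA and the conservation sequence independently, the joint emission probability factorizes into a DNA term and a conservation term. The sketch below shows that factorization with toy zero-order emission tables; all numbers and the name `joint_emission_log_prob` are assumptions for illustration, and the real models are richer than single-symbol tables.

```python
import math

# Toy emission tables: illustrative assumptions only.
DNA_EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
CONS_EMIT = {
    "exon": {"|": 0.80, ":": 0.15, ".": 0.05},    # mostly aligned, identical
    "intron": {"|": 0.40, ":": 0.20, ".": 0.40},  # often unaligned
}


def joint_emission_log_prob(state, dna, cons):
    """Log P(dna, cons | state), with the two tracks emitted independently."""
    if len(dna) != len(cons):
        raise ValueError("DNA and conservation sequence must be the same length")
    return sum(
        math.log(DNA_EMIT[state][base]) + math.log(CONS_EMIT[state][symbol])
        for base, symbol in zip(dna, cons)
    )


if __name__ == "__main__":
    dna = "ACGT"  # equally likely under both toy DNA models
    for cons in ("||||", "...."):
        best = max(DNA_EMIT, key=lambda s: joint_emission_log_prob(s, dna, cons))
        print(cons, "->", best)
```

With DNA that is uninformative on its own, the conservation track alone decides which state is more probable, which is the extra signal the second genome contributes.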
Performance Evaluation • RefSeq • A set of ~13,000 “known” mRNAs • Represents ~40-50% of human genes • Usually, only one of several splice forms • Mapping to genome is imperfect • Best available gold standard
Short term goal • All multi-exon human genes • Predict accurately • Integrate information from more genomes • Verify at least one intron experimentally • Follow up with full-length verification
Acknowledgments • Funding agencies • National Institutes of Health (NHGRI) • National Science Foundation (DBI) • Sequencing centers • Sanger, Whitehead, Wash. U. • My group • Ian Korf, Paul Flicek, Evan Keibler, Ping Hu • Collaborators • Roderic Guigo, Josep Abril, Genis Parra • Pankaj Agarwal • Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis
Other clades • Plants • Arabidopsis thaliana, cabbage, rice • Nematodes • C. elegans, C. briggsae • Fungi • Cryptococcus neoformans (JEC21, H99)
Pair HMM algorithms (SLAM,…) • Input is orthologous sequences. • Aligns and predicts simultaneously, using a joint probability model • Predicts orthologous genes in 2 sequences • All predicted CDS is aligned • Some aligned regions are not predicted CDS • Labeled conserved non-coding sequence
The algorithms • sgp2 • Alignment before prediction (tblastx) • Predicts genes in target sequence only • Doesn’t need orthologous input sequences • Paralogs & low-coverage shotgun can help • Modifies scores of all potential exons by • At each base, adding the tblastx score of the best overlapping local alignment (roughly) • To the geneid score of that potential exon
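The rescoring rule on this slide can be sketched as follows. This is a rough illustration only: it assumes each local alignment is reduced to a half-open target interval with a single per-base score, and the names `rescore_exon`, `hsps`, and `weight` are hypothetical rather than part of sgp2 or geneid.

```python
def rescore_exon(exon_start, exon_end, base_score, hsps, weight=1.0):
    """Add alignment support to a candidate exon's score.

    exon_start, exon_end: half-open target interval [start, end) of the exon.
    base_score: the exon's score from the single-genome predictor.
    hsps: list of (start, end, per_base_score) local alignments on the target,
          also half-open. The per-base score is an assumed simplification.
    weight: hypothetical factor balancing the two score scales.
    """
    bonus = 0.0
    for pos in range(exon_start, exon_end):
        # Scores of all alignments that cover this base.
        covering = [score for start, end, score in hsps if start <= pos < end]
        if covering:
            bonus += max(covering)  # best overlapping alignment at this base
    return base_score + weight * bonus


if __name__ == "__main__":
    hsps = [(0, 15, 2.0), (12, 30, 3.0)]
    print(rescore_exon(10, 20, 5.0, hsps))
```

An exon with no overlapping alignment keeps its original score, so unsupported candidates are not penalized, only left behind by supported ones.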
The algorithms • TWINSCAN • Alignment before prediction (blastn) • Predicts in target sequence only • Modifies scores of all potential exons, UTRs, splice sites, start and stop models by • At each base, applying a feature-specific scoring model (estimated for this purpose) • To the best overlapping local alignment, and adding the result • To the Genscan score for that feature
% Aligned, CDS vs. other
Syntenic Gene Prediction (sgp2) [Figure: tracks for the query sequence, tblastx HSPs, HSP projections, geneid exons, and SGP exons]
Why work on gene finding? • Genes are • Components responsible for biological function • Variations cause human disease / susceptibility • Controls for modifying biological function • Human gene therapy • Agriculture • Nanotechnology, etc.