Gene discovery using combined signals from genome sequence and natural selection

Michael Brent
Washington University
The mouse genome analysis group

Genes are read out via mRNA & processing GENSIPS

RNA Processing

A typical human gene structure

In a mammalian genome
• Finding all the genes is hard
• Mammalian genomes are large
  • 5,051 miles of 10pt type
  • Raleigh to Tripoli, Libya
• Only about 1.5% protein coding
  • Raleigh to Winston-Salem

Genes are fairly unconstrained
• Intron length is highly variable
  • ~5% are 40-100 nt long
  • ~3% are longer than 30,000 nt
• Distance between genes is highly variable
  • From 10^3 to 10^6 nt or more (probably)

Exons per gene (RefSeq)

Background is not random
• Segmental duplications
  • Entire regions duplicate, then diverge slowly
• Processed pseudogenes
  • Spliced transcripts integrate back into the genome
  • Sequence is similar to source genes
  • Generally not functional

Gene prediction: two approaches
• 1. Transcript-based (e.g., GeneWise)
  • Map experimentally determined sequences of spliced transcripts to their genomic source
  • Map transcript sequences to genomic regions that could produce similar transcripts
• 2. De novo (genome only)
  • Model DNA patterns characteristic of gene components
    • Splice donor and acceptor
    • Protein coding sequence
    • Translation start and stop

Advantages and disadvantages
• Transcript-based
  • Advantage: conservative
    • Evidence of transcription for every exon
  • Disadvantage: conservative
    • Can’t find “truly novel” genes
  • Still subject to error

Advantages and disadvantages
• De novo
  • Advantage 1: Less biased toward
    • Known transcripts
    • Transcripts that can be sequenced easily
  • Advantage 2: Genome sequencing is easy
  • Disadvantages
    • No direct evidence of transcription
    • Presumably, more false positives

Single-genome de novo: Genscan
• Strengths
  • For mammalian sequence, one of the best single-genome, de novo gene predictors
  • Widely used to great practical advantage
  • De facto standard for mammalian sequence
• Limitations
  • Predicts >45K genes (best est.: 25-30K)
  • Predicts >315K exons (best est.: 200K-250K)
  • Gets only 9% of known genes exactly right*

Dual genome de novo
• We developed algorithms that use two genomes to
  • Reduce the number of false positives
  • Refine the details of the structures

Single-genome de novo method
• Probability model
  • Assigns probability to annotated DNA sequences:
    5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
• Optimization algorithm
  • Given a DNA sequence, find the most probable annotation, according to the model
[Figure labels: Intron, 5’ UTR, Exon]
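The "find the most probable annotation" step can be illustrated with Viterbi decoding on a toy two-state HMM. This is only a sketch: the states, the transition and emission probabilities, and the function name `viterbi` are all made up for illustration and are not Genscan's model, which is a generalized HMM with many more states.

```python
import math

# Hypothetical toy parameters (NOT Genscan's): two states, single-base emissions.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def viterbi(seq):
    """Return (log probability, state path) of the most probable annotation."""
    # best[s] = log probability of the best path that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            p = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            best[s] = prev[p] + math.log(TRANS[p][s]) + math.log(EMIT[s][base])
            pointers[s] = p
        back.append(pointers)
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


log_prob, annotation = viterbi("TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT")
print(log_prob, "".join("E" if s == "exon" else "i" for s in annotation))
```

Working in log space avoids numerical underflow on long sequences, which matters at genome scale.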

Genscan’s generative model
[Figure labels: Intron, Exon, Intron; sequence CCATGGCGTCTTCAGGCAGTGACTC]

Genscan’s generative model
• States correspond to gene features
• Model generates DNA sequence by passing through states
• The probability of annotated DNA sequence is the probability of
  • generating the DNA sequence
  • by passing through states corresponding to the annotation.
Generalized HMM
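As a rough illustration of how a generalized HMM assigns a probability to an annotated sequence, the sketch below multiplies, segment by segment, a transition probability, a length probability, and the probability of the bases emitted by that state. Every number, the geometric length distribution, and the names `annotated_log_prob` and `length_log_prob` are assumptions made for this example; they are not Genscan's parameters.

```python
import math

# Hypothetical toy parameters (NOT Genscan's).
BASE_PROBS = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
MEAN_LENGTH = {"exon": 150.0, "intron": 1000.0}
INITIAL = {"exon": 0.5, "intron": 0.5}
# In a generalized HMM a state emits a whole segment, so no self-transitions.
TRANSITION = {("exon", "intron"): 1.0, ("intron", "exon"): 1.0}


def length_log_prob(state, length):
    """Log probability of a segment length (geometric, an assumption)."""
    p = 1.0 / MEAN_LENGTH[state]
    return (length - 1) * math.log(1.0 - p) + math.log(p)


def annotated_log_prob(seq, annotation):
    """annotation: list of (state, start, end), half-open, tiling seq in order."""
    total, prev_state, pos = 0.0, None, 0
    for state, start, end in annotation:
        if start != pos or end <= start:
            raise ValueError("segments must tile the sequence in order")
        if prev_state is None:
            total += math.log(INITIAL[state])
        else:
            total += math.log(TRANSITION[(prev_state, state)])
        total += length_log_prob(state, end - start)
        total += sum(math.log(BASE_PROBS[state][b]) for b in seq[start:end])
        prev_state, pos = state, end
    if pos != len(seq):
        raise ValueError("segments must cover the whole sequence")
    return total


seq = "CCATGGCGTCTTCAGGCAGTGACTC"
print(annotated_log_prob(seq, [("intron", 0, 5), ("exon", 5, 18), ("intron", 18, 25)]))
```

The segment boundaries in the usage line are arbitrary; they are not the annotation shown on the slide.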

Dual genome prediction
• Input
  • Target and informant genomes
• Idea
  • Patterns of evolution since the last common ancestor may reveal gene structure

Two conservation signals
• 1. Local alignment signal
  • Selective pressures differ by feature
  • This leaves a characteristic signature
• 2. Structural signal
  • Locations of introns tend to be conserved

Characteristic local alignments

Coding exon
human  TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
       |||||||||||||||||||| || ||||| || || |||
mouse  TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC

Intron (non-coding)
human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||     ||  | |||||||||  || ||   ||
mouse  CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
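The difference between the two alignments on this slide can be quantified with a few lines of code. The helper name `alignment_stats` is made up for this sketch. Two 39-column examples illustrate the kind of signature meant here; they are not statistical evidence on their own.

```python
def alignment_stats(top, bottom):
    """Return (identical columns, gap columns, total columns) for two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    identical = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    gaps = sum(1 for a, b in zip(top, bottom) if a == "-" or b == "-")
    return identical, gaps, len(top)


# The two human/mouse alignments shown on the slide.
coding = alignment_stats("TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC",
                         "TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC")
intron = alignment_stats("CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC",
                         "CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT")
print(coding)  # (34, 0, 39): no gaps, few mismatches
print(intron)  # (24, 7, 39): gaps and more mismatches
```

In the coding example the mismatches are isolated and there are no gaps, while the intron example contains gaps of 4 and 3 columns.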

Conservation of intron location

Align→predict→filter→test
[Figure: pipeline with stages WU-BLAST, TWINSCAN, Aligned Intron Filter, and Validation (RT-PCR); example human/mouse alignment TCTGCCACC / TCAGCTACT]

TWINSCAN
[Figure: representation change, then gHMM decoding]

Alignment
  TCTGCCACC
  || || ||
  TCAGCTACT

Conservation sequence
  TCTGCCACC
  ||:||:||

BLAST Alignments
[Figure labels: Target, Informant]

Projecting BLAST Alignments
[Figure labels: Target, Informant]

Conservation sequence
Synthetic (projected) local alignment
• Pair each nucleotide of the target with
  • “|” if it is aligned and identical
  • “:” if it is aligned to mismatch or gap
  • “.” if it is unaligned

human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
       ||||||.........|:|||||||||::||:||   ||:
mouse  CTAGAG         AGACAGGTACCATAGGGCTCTCCT

Conservation sequence (gap columns in the target removed)

human  CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
       ||||||.........|:|||||||||::||:||||:
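The three pairing rules can be written as a short function. This is a minimal sketch, not TWINSCAN's code. The function name `conservation_sequence` is made up, and it assumes the projected alignment arrives as two equal-length rows in which a space in the informant row marks an unaligned target base and "-" marks a gap.

```python
def conservation_sequence(target_row, informant_row, unaligned=" "):
    """Return (target without gaps, conservation sequence) for a projected alignment."""
    if len(target_row) != len(informant_row):
        raise ValueError("rows must have equal length")
    target, cons = [], []
    for t, i in zip(target_row, informant_row):
        if t == "-":
            # Gap in the target: there is no target nucleotide to label.
            continue
        target.append(t)
        if i == unaligned:
            cons.append(".")   # unaligned
        elif i == t:
            cons.append("|")   # aligned and identical
        else:
            cons.append(":")   # aligned to a mismatch or to a gap in the informant
    return "".join(target), "".join(cons)


human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC"
mouse = "CTAGAG" + " " * 9 + "AGACAGGTACCATAGGGCTCTCCT"
target, cons = conservation_sequence(human, mouse)
print(target)  # CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
print(cons)    # ||||||.........|:|||||||||::||:||||:
```

The result has exactly one symbol per target nucleotide, so it can be decoded alongside the DNA sequence position by position.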

Twinscan: Extending the model
• Probability model
  • Assigns probability to annotated DNA:
    5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
      |||........|:||||:|||||||||:||::||
• Optimization
  • Given DNA and conservation sequence, find the most probable annotation, according to the model
[Figure labels: Intron, 5’ UTR, Exon]

Twinscan
• Each state “generates” DNA and conservation sequence independently
• Probability of annotated DNA and conservation sequence is probability of generating the DNA and conservation sequence by passing through corresponding states
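Because each state generates the two sequences independently, the joint emission probability for a state factors into a DNA term and a conservation term. The sketch below shows that factorization with single-symbol emission tables. All probabilities and the name `joint_log_emission` are invented for illustration, and the actual Twinscan models may be richer than this.

```python
import math

# Hypothetical per-state emission tables (illustrative values only).
DNA_EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
CONS_EMIT = {
    "exon": {"|": 0.80, ":": 0.15, ".": 0.05},
    "intron": {"|": 0.40, ":": 0.25, ".": 0.35},
}


def joint_log_emission(state, dna, cons):
    """log P(dna, cons | state) = log P(dna | state) + log P(cons | state)."""
    if len(dna) != len(cons):
        raise ValueError("need one conservation symbol per nucleotide")
    return sum(math.log(DNA_EMIT[state][b]) + math.log(CONS_EMIT[state][c])
               for b, c in zip(dna, cons))


dna = "TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT"
cons = "|||........|:||||:|||||||||:||::||"
for state in ("exon", "intron"):
    print(state, joint_log_emission(state, dna, cons))
```

In log space the conservation term is simply added to the single-genome score, so the same decoding algorithm can be reused unchanged.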

Performance Evaluation
• RefSeq
  • A set of ~13,000 “Known” mRNAs
  • Represents ~40-50% of human genes
  • Usually, only one of several splice variants
  • Mapping to genome is imperfect
  • Best available gold standard

Short term goal
• All multi-exon human genes
  • Predict accurately
  • Integrate information from more genomes
  • Verify at least one intron experimentally
  • Follow up with full-length verification

Acknowledgments
• Funding agencies
  • National Institutes of Health (NHGRI)
  • National Science Foundation (DBI)
• Sequencing centers
  • Sanger, Whitehead, Wash. U.
• My group
  • Ian Korf, Paul Flicek, Evan Keibler, Ping Hu
• Collaborators
  • Roderic Guigo, Josep Abril, Genis Parra
  • Pankaj Agarwal
  • Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis

Other clades
• Plants
  • Arabidopsis thaliana, cabbage, rice
• Nematodes
  • C. elegans, C. briggsae
• Fungi
  • Cryptococcus neoformans (JEC21, H99)

Pair HMM algorithms (SLAM,…)
• Input is orthologous sequences.
• Aligns and predicts simultaneously, using a joint probability model
• Predicts orthologous genes in 2 sequences
• All predicted CDS is aligned
• Some aligned regions are not predicted CDS
  • Labeled conserved non-coding sequence

The algorithms (SLAM,…)
• sgp2
  • Alignment before prediction (tblastx)
  • Predicts genes in target sequence only
  • Doesn’t need orthologous input sequences
    • Paralogs & low-coverage shotgun can help
  • Modifies scores of all potential exons, by
    • At each base, adding the tblastx score of the best overlapping local alignment (roughly)
    • To the geneid scores of that potential exon
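The exon rescoring described for sgp2 can be sketched as follows. This is a loose illustration of the "roughly" on the slide, not sgp2's actual formula. It assumes each HSP carries a per-base score and half-open target coordinates. The names `best_hsp_score_per_base` and `rescore_exon` and the `weight` parameter are hypothetical.

```python
def best_hsp_score_per_base(length, hsps):
    """For each target base, keep the score of the best overlapping HSP (0.0 if none).

    hsps: iterable of (start, end, score) in half-open target coordinates.
    """
    best = [0.0] * length
    for start, end, score in hsps:
        for pos in range(max(start, 0), min(end, length)):
            best[pos] = max(best[pos], score)
    return best


def rescore_exon(exon_start, exon_end, base_score, per_base, weight=1.0):
    """Add the per-base alignment scores under the exon to its single-genome score."""
    return base_score + weight * sum(per_base[exon_start:exon_end])


# Two overlapping HSPs projected onto a 10-base target.
per_base = best_hsp_score_per_base(10, [(0, 5, 1.0), (3, 8, 2.0)])
print(per_base)                             # [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 0.0, 0.0]
print(rescore_exon(2, 6, 5.0, per_base))    # 12.0
```

Taking the maximum at each base rather than the sum keeps overlapping alignments of the same region from being counted more than once.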

The algorithms
• TWINSCAN
  • Alignment before prediction (blastn)
  • Predicts in target sequence only
  • Modifies scores of all potential exons, UTRs, splice sites, start and stop models, by
    • At each base, applying a feature-specific scoring model (estimated for this purpose) to the best overlapping local alignment
    • Adding the result to Genscan scores for that feature

% Aligned, CDS vs. other

Syntenic Gene Prediction (sgp2)
[Figure tracks: tblastx HSPs, HSP projections, query sequence, geneid exons, SGP exons]

Why work on gene finding?
• Genes are
  • Components responsible for biological function
  • Variations cause human disease / susceptibility
  • Controls for modifying biological function
    • Human gene therapy
    • Agriculture
    • Nanotechnology, etc.
