This document outlines innovative genome annotation techniques that leverage knowledge-based approaches using mRNA and protein methods. The focus is on improving the accuracy of cDNA-to-genome alignments through refined parameter estimation, addressing sequencing errors and polymorphisms, and ensuring no junk splice sites are present. This workshop summarizes advancements from the ENCODE project, highlighting the integration of EST information with modeling tools like N-SCAN and Twinscan to enhance gene prediction capabilities. The emphasis is on generating high-quality alignments for reliable gene structure definitions.
A knowledge-based approach to integrated genome annotation
Michael Brent
Washington University
Outline of our process
• MGC validated clones + RefSeq NM’s
• Remove all with frame shifts
• Fill with spliced Hs mRNA & EST
• Threaded de novo predictions
• Tools shown in the diagram: Paragon aligner, BLAT, N-SCAN + EST
ENCODE Workshop
Paragon aligner
Manimozhiyan Arumugam with Chaochun Wei
Better EST/cDNA-to-genome alignment
• Idea
  • Go beyond minimizing mismatches and gaps
  • Accurate probabilities in correct alignments
  • Estimate parameters for each sequence set
Better EST/cDNA alignment
• Two sources of mismatches & gaps
  • Error (sequencing, RT)
    • Quality values give local error probabilities. Not used here.
  • Polymorphism (RNA vs. genome strains)
    • Gap vs. indel rates are different
• Parameters must vary with sequence quality & source strains/polymorphism rates
  • E.g. prefer non-matches in low quality bases
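The dependence of alignment parameters on error and polymorphism rates can be sketched as follows. This is an illustration only, not Paragon's parameterization: the function names, the independence assumption, and the example rates are all hypothetical. The Phred relation P = 10^(-Q/10) is the standard definition of a quality value; as the slide notes, quality values are not used in the aligner described here.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Standard Phred definition: error probability for quality value q."""
    return 10.0 ** (-q / 10.0)

def mismatch_prob(error_rate: float, polymorphism_rate: float) -> float:
    """Probability that an aligned cDNA base differs from the genome.

    Assumption (not from the talk): errors and polymorphisms occur
    independently, and the rare case where one cancels the other is ignored.
    """
    return 1.0 - (1.0 - error_rate) * (1.0 - polymorphism_rate)

def match_mismatch_log_scores(error_rate: float, polymorphism_rate: float):
    """Log-probabilities of emitting a match or one specific mismatch."""
    p_mis = mismatch_prob(error_rate, polymorphism_rate)
    # Mismatch mass is split evenly over the three other bases (toy choice).
    return math.log(1.0 - p_mis), math.log(p_mis / 3.0)

# Hypothetical rates: a finished cDNA versus a low-quality EST read.
finished = match_mismatch_log_scores(error_rate=0.0001, polymorphism_rate=0.001)
low_qual = match_mismatch_log_scores(phred_to_error_prob(10), 0.001)
```

With these toy numbers a mismatch is penalized far less in the low-quality read than in the finished cDNA, which is the behaviour the last bullet asks for.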
Better EST/cDNA alignment
• Introns
  • Accurate probabilities in correct alignments
  • GT/AG vs. GC/AG vs. AT/AC
  • Absolutely no junk splice sites
  • Not clear what to do with polymorphic sites
  • Long introns are rarer than short introns
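A minimal sketch of an intron term with these properties is shown below. The probability table, the geometric length distribution, and the name `intron_log_prob` are placeholders chosen for illustration; they are not the values or the model used in Paragon.

```python
import math

# Placeholder probabilities for the three splice-site pairs named on the
# slide. Illustrative only; not estimated from any data set.
SPLICE_PAIR_PROB = {
    ("GT", "AG"): 0.98,
    ("GC", "AG"): 0.015,
    ("AT", "AC"): 0.005,
}

def intron_log_prob(intron: str, mean_length: float = 5000.0) -> float:
    """Log-probability of an intron under a toy model.

    The score is the log of (splice-pair probability) times a geometric
    length distribution, so longer introns always score lower. Any
    donor/acceptor pair outside the table gets -inf, i.e. it is never
    allowed ("no junk splice sites").
    """
    if len(intron) < 4:
        return float("-inf")
    donor, acceptor = intron[:2].upper(), intron[-2:].upper()
    p_pair = SPLICE_PAIR_PROB.get((donor, acceptor))
    if p_pair is None:
        return float("-inf")
    p_stop = 1.0 / mean_length
    # Geometric length model: P(L) = (1 - p_stop)^(L - 1) * p_stop
    log_length = (len(intron) - 1) * math.log(1.0 - p_stop) + math.log(p_stop)
    return math.log(p_pair) + log_length

canonical = intron_log_prob("GT" + "A" * 100 + "AG")
rare = intron_log_prob("GC" + "A" * 100 + "AG")
junk = intron_log_prob("CC" + "A" * 100 + "TT")
```

A hard -inf for unlisted pairs is one simple way to forbid junk splice sites; how polymorphic splice sites should be handled is left open here, as it is on the slide.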
Small exon in finished cDNA

STANDARD TOOL (EST_GENOME)

GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100

GENOME     401 CCGGGACTACCTCATGAGGTGACG-Agcgcc.......tgtagCACTTCT 16339
               ||||||||||||||||| || ||| |>>>>> 15907 >>>>>  |||||
BC000810   101 CCGGGACTACCTCATGA-GT-ACGCA.................--CTTCT 129

GENOME   16340 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 16389
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810   130 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 179

OUR PAIR HMM

GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
               ||||||||||||||||||||||||||||||||||||||||||||||||||
BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100

GENOME     401 CCGGGACTACCTCATGAGGTGAC.......AATAGTACGGTAAG...... 13006
               ||||||||||||||||||>>>>> 12584 >>>>>||||>>>>> 3326
BC000810   101 CCGGGACTACCTCATGAG.................TACG........... 122

GENOME   13007 TGTAGCACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 13046
               >>>>>|||||||||||||||||||||||||||||||||||||||||||||
BC000810   123 .....CACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 167
Blind test
• Test set
  • 100 alignment pairs of MGC clones to genome
  • Paragon & EST_genome differ on all of them
  • Output format identical
• Evaluation
  • Curator attempting to explain discrepancies
• Result
  • 37 cases where biological evidence favors one alignment
  • In 31/37 the Paragon alignment is supported
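To gauge how lopsided 31 of 37 is, one can run a one-sided sign test against the null hypothesis that either aligner is equally likely to be the supported one. This test is not part of the talk; it is a back-of-the-envelope check that treats the 37 decided cases as independent, and the function name is made up for the example.

```python
from math import comb

def one_sided_sign_test(successes: int, trials: int) -> float:
    """P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favourable / 2 ** trials

# Counts taken from the slide: Paragon supported in 31 of 37 decided cases.
p_value = one_sided_sign_test(31, 37)
```

Under these assumptions the p-value comes out at roughly 2 × 10^-5, so a split this uneven would be very unlikely if the two aligners were equally good on the decided cases.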
Future directions
• UTR vs. ORF
  • Polymorphism is more common in UTR
  • And 3rd position in ORF
• Conservation
  • Use alignments to distinguish true from false
    • Splice sites, introns
    • Codons
    • Polymorphisms (analogous to quality values)
Conceptual shift
• Traditional view
  • cDNA data “speaks for itself”. Theory neutral.
  • Alignment = counting matches, mismatches, gaps
  • cDNA = genome annotation
Conceptual shift
• Our view
  • More knowledge = better alignments & annotations
  • cDNA is very useful evidence re: gene structure
    • Need to align it correctly
    • Need to determine its completeness
    • If not complete, predict the remainder
  • Gene prediction & cDNA alignment are the same problem
    • cDNA/EST just adds another information source
N-SCAN_EST
Chaochun Wei
TWINSCAN/N-SCAN_EST
• Goal: integrate EST information with TWINSCAN to
  • improve accuracy where EST evidence exists
  • without losing the ability to predict novel genes.
Twinscan_est
Generating EST-alignment Sequence
Modeling EST alignment sequence
• Probability models
  • In each HMM state
    • Separate models for EST alignment sequence
    • Probabilities of DNA, conservation sequence, and EST sequence are multiplied.
  • Very similar to models of genomic alignments
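The multiplication of the three per-state probabilities can be sketched as a sum of logs. Everything concrete below is hypothetical: the symbol alphabets, the two state names, and the probabilities are toy values, and each model is position-independent for brevity, whereas a real gene finder would typically use higher-order models.

```python
import math

# Toy per-state models (hypothetical symbols and numbers, for illustration).
# cons symbols: "|" aligned match, ":" aligned mismatch, "." unaligned
# EST symbols:  "E" covered by an aligned EST block, "I" inside an EST
#               alignment gap (putative intron), "N" no EST alignment
STATE_MODELS = {
    "coding_exon": {
        "dna": {"A": 0.24, "C": 0.26, "G": 0.27, "T": 0.23},
        "cons": {"|": 0.70, ":": 0.20, ".": 0.10},
        "est": {"E": 0.60, "I": 0.02, "N": 0.38},
    },
    "intron": {
        "dna": {"A": 0.27, "C": 0.21, "G": 0.22, "T": 0.30},
        "cons": {"|": 0.30, ":": 0.25, ".": 0.45},
        "est": {"E": 0.03, "I": 0.45, "N": 0.52},
    },
}

def log_emission(state: str, base: str, cons_symbol: str, est_symbol: str) -> float:
    """Log of Pr(base) * Pr(cons_symbol) * Pr(est_symbol) in the given state."""
    models = STATE_MODELS[state]
    return (
        math.log(models["dna"][base])
        + math.log(models["cons"][cons_symbol])
        + math.log(models["est"][est_symbol])
    )

exon_score = log_emission("coding_exon", "G", "|", "E")
intron_score = log_emission("intron", "G", "|", "E")
```

Because the EST term is just one more factor, a position with no EST evidence (symbol "N" in this toy alphabet) still gets a score from the DNA and conservation models, which is how novel genes remain predictable.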
Multi-genome methods: N-SCAN
Samuel Gross with Randall Brown
N-SCAN: Using multi-genome alignments
• Motivation
  • Many genomes should give stronger signal of negative selection than two
  • Lots of genomes are being sequenced
• Methods
  • Extend Twinscan to a phylogenetic tree model
  • At each site, mutation rate & pattern of tolerated substitutions depend on function
Example
• A multiple alignment that (A) is and (B) is not typical of the splice boundary shown
Using mutation patterns for improving gene prediction
• Tree hidden Markov model
  • Each state
    • generates columns of a multiple alignment
    • by a substitution process
    • along the branches of a phylogenetic tree
Challenges
• Columns are not correct, orthologous
  • Sequencing error
  • Alignment error
  • Change of function (I am not a mouse!)
Differences from EXONIPHY
• Approach
  • Estimate models of actual alignments, not evolutionary processes
• Model
  • Independent substitution probabilities on each branch of the tree
  • 6 characters: A, C, G, T, gap, unaligned
  • Condition backwards from target genome
Using mutation patterns for improving gene prediction
• Traditional factorization
  • Pr(a2) Pr(a1|a2) Pr(h|a1) Pr(m|a1) Pr(c|a2)
• N-SCAN factorization
  • Pr(h) Pr(a1|h) Pr(a2|a1) Pr(m|a1) Pr(c|a2)
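The two factorizations can be compared on a toy example. The sketch below reads h, m, and c as three observed species with h as the target, and a1, a2 as unobserved ancestors (a1 above h and m, a2 above a1 and c); that tree shape is inferred from the formulas, not stated on the slide. The 4-letter alphabet, the uniform prior, and the single symmetric substitution table are simplifications made up for the example; N-SCAN itself uses 6 characters and a separate table per branch.

```python
from itertools import product

ALPHABET = "ACGT"  # toy alphabet; the talk's model also has gap and unaligned

def cond(child: str, parent: str, stay: float = 0.7) -> float:
    """Toy symmetric table Pr(child | parent), shared by every branch."""
    if child == parent:
        return stay
    return (1.0 - stay) / (len(ALPHABET) - 1)

def prior(x: str) -> float:
    """Uniform marginal over the alphabet (toy choice)."""
    return 1.0 / len(ALPHABET)

def column_prob_traditional(h: str, m: str, c: str) -> float:
    """Root at ancestor a2 and condition downward; sum out a1 and a2."""
    return sum(
        prior(a2) * cond(a1, a2) * cond(h, a1) * cond(m, a1) * cond(c, a2)
        for a1, a2 in product(ALPHABET, repeat=2)
    )

def column_prob_nscan(h: str, m: str, c: str) -> float:
    """Root at the target h and condition backwards; sum out a1 and a2."""
    return sum(
        prior(h) * cond(a1, h) * cond(a2, a1) * cond(m, a1) * cond(c, a2)
        for a1, a2 in product(ALPHABET, repeat=2)
    )

conserved = column_prob_nscan("A", "A", "A")
diverged = column_prob_nscan("A", "C", "G")
```

With a symmetric table and uniform prior the two functions return the same value for every column. The practical difference is in what gets estimated: rooting at the target means every factor is conditioned, directly or through ancestors, on the sequence being annotated.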
Preliminary study in human
Fin