530 likes | 650 Vues
Gene prediction has evolved significantly from early models like GENSCAN to advanced techniques such as TWINSCAN and Hidden Markov Models (HMMs). Understanding the structure of genes, including attributes like open reading frames (ORFs), introns, and exons, is crucial. Current methods utilize probabilistic models and comparative genomics to enhance prediction accuracy, revealing insights into the human genome's complexity. As we continue to discover novel gene variants and splicing patterns, computational approaches will play a vital role in mapping the remaining unknown genes.
E N D
Genes ATG TAG TAA TGA • Gene RNA Protein • Proteins are about 500 AA long • Genes are about 1500bp long
ORF Scanning • In “lower” organisms, genes are contiguous • We expect about 1 stop codon per 64bp • If we see a long ORF, it’s probably a gene! • And conversely, all genes are long ORFs
Introns AG GT AG GT TGA TAA TAG ATG • Drosophila: • 3.4 introns per gene on average • mean intron length 475, mean exon length 397 • Human: • 8.8 introns per gene on average • mean intron length 4400, mean exon length 165 • ORF scanning is defeated
Splicing AG GT AG GT TGA TAA TAG ATG AG GT AG GT AG TGA TAA TAG ATG
Needles in a Haystack • Human genome is about 3.2Gbp • 20,000 – 25,000 genes • 78% intergenic, 20% introns, 2% coding
Gotta Find ‘Em All • 60-85% of all human genes have been found, mostly by random EST sequencing • This probably won’t work for the rest • For most genes, only one splice variant is known • If we can computationally predict a gene, we have a cheap experiment (RT-PCR) to verify
Looking For Clues • Signals used by the cell • 99% of introns begin with GT, end with AG • 0.8% of introns begin with GC, end with AG • Gene begins with ATG • Gene ends with TAG, TAA, or TGA • Other properties of genes • Exons have characteristic lengths • Base composition of exons is characteristic due to genetic code • Exons tend to be conserved between species • Pattern of conservation is three-periodic
Three-Periodicity • Most amino acids can be coded for by more than one DNA triplet (codon) • Usually, the degeneracy is in the last position Human CCTGTT (Proline, Valine) Mouse CCAGTC (Proline, Valine) Rat CCAGTC (Proline, Valine) Dog CCGGTA (Proline, Valine) Chicken CCCGTG (Proline, Valine)
Hidden Markov Models • The de facto standard for gene prediction • Probabilistic finite state machine • Transition to a state, emit a character, transition to a new state • Many independence assumptions CDS NC ACG
HMMs For Gene Prediction • Generative model • Define P(X, Y) as a product of many independent terms P(ACG) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits A) • Terms are of the forms P(yi | yi-1) and P(xi | yi) • Trained by collecting counts
HMMs For Gene Prediction • To predict genes given a sequence X, calculate argmaxY P(Y | X) = argmaxY P(X, Y) / P(X) = argmaxY P(X, Y)
Generalized Hidden Markov Models • Like a HMM, but state durations are explicit • Transition to a state, pick a duration d, emit d characters, transition to a new state • Dynamic programming algorithm complexity goes from O(N2L) to O(N2LK) • K is the maximum state duration • Not so bad in practice
Predicting Genes With HMMs • Given a sequence, we can calculate the most likely annotation Intron Initial Exon Internal Exon Final Exon Single Exon Inter- genic GGTGAGGTGACCAAGAACGTGTTGACAGTA GGTGAGGTGACCAAGAACGTGTTGACAGTA GGTGAGGTGACCAAGAACGTGTTGACAGTA GGTGAGGTGACCAAGAACGTGTTGACAGTA
The Past: GENSCAN • Chris Burge, Stanford, 1997 • Before the Human Genome Project • No alignments available • People still thought there were 100,000 human genes
The GENSCAN Model • Output probabilities for NC and CDS depend on previous 5 bases (5th-order) • P(Xi | Xi-1, Xi-2, Xi-3, Xi-4, Xi-5) • Each CDS frame has its own model • Special 2nd-order positional models for start codon, stop codon, and acceptor site • Even fancier model for donor sites • Maximal dependence decomposition (MDD) • Long-range dependencies • Separate model for different isochores
GENSCAN Performance • First program to do well on realistic sequences • Multiple genes in both orientations • Pretty good sensitivity, poor specificity • 70% exon Sn, 40% exon Sp • Not enough exons per gene • Was the best gene predictor for about 4 years
Mouse A A G G T G Mouse A A T G T G Chicken A A G G T G Chicken A A _ A C G Comparative Gene Prediction A B Intron Intron Exon Exon -3 -2 -1 +1 +2 +3 -3 -2 -1 +1 +2 +3 Human A A G G T G Human A A G G T G
The Recent Past: TWINSCAN • Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001 • Uses an informant sequence to help predict genes • For human, informant is normally mouse • Informant sequence consists of three characters • Match: | • Mismatch: : • Unaligned: . • Informant sequence assumed independent of target sequence
The TWINSCAN Model • Just like GENSCAN, except adds models for conservation sequence • 5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites • One CDS model for all frames • Many informants tried, but mouse seems to be at the “sweet spot”
TWINSCAN Performance • Slightly more sensitive than GENSCAN, much more specific • Exon sensitivity/specificity about 75% • Much better at the gene level • Most genes are mostly right, about 25% exactly right • Was the best gene predictor for about 4 years
The Present: N-SCAN • Gross and Brent, Washington University in St. Louis, 2005 • If one informant sequence is good, let’s try more! • Also several other improvements on TWINSCAN
N-SCAN Improvements • Multiple informants • Richer models of sequence evolution • Frame-specific CDS conservation model • Conserved noncoding sequence model • 5’ UTR structure model
HMM Outputs • GENSCAN • TWINSCAN • N-SCAN Target GGTGAGGTGACCAAGAACGTGTTGACAGTA Target GGTGAGGTGACCAAGAACGTGTTGACAGTA Conservation |||:||:||:|||||:||||||||...... sequence Target GGTGAGGTGACCAAGAACGTGTTGACAGTA Informant1 GGTCAGC___CCAAGAACGTGTAG...... Informant2 GATCAGC___CCAAGAACGTGTAG...... Informant3 GGTGAGCTGACCAAGATCGTGTTGACACAA ...
Two-Component Output Distributions • Target sequence model • Phylogenetic model for informants • Product gives the probability of a multiple alignment column
Inference • Slightly-modified version of Felsenstein’s algorithm • At each of the O(N) nodes, we calculate 6o+1 summations over 6o+1 values • Total time complexity is O(N • 62(o+1))
Training • Simple with labeled multiple alignment of all sequences • Can use known genes as a labeling • Don’t know ancestral genome sequences • Treat them as missing data and use EM
CPD Parameterizations • Each Bayesian network of order o has (2N-1)(6o+1)(6o+1-1) free parameters • We can reduce this number by restricting the form of the CPDs • Partially reversible models • Relative frequency of DNA k-mers remains constant as sequence evolves • Gaps and unaligned regions introduced over time
N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models • Root (target) node is observed • Can use existing single-sequence models • Can use higher-order models • Can estimate target sequence model optimally
N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models • No assumption of homogeneous substitution process • Gaps and unaligned regions can be treated naturally • Robust against • Function-changing mutation • Alignment error • Sequencing error • The price is many more parameters
Conservation Score Coefficient • N-SCAN uses log-likelihood scores internally. The score of a position i under state S is • Values of k between 0.3 and 0.6 result in the best performance • Performance is roughly constant in this range
Whole-Genome Human Gene Prediction • Annotations used were cleaned RefSeqs • 16,259 genes • 20,837 transcripts • N-SCAN used human, mouse, rat, chicken alignment
The Future(?): CONTRAST • New gene predictor currently in the works • Based not on a generalized HMM, but a semi-Markov conditional random field (SCRF)
HMMs For Gene Prediction • Generative model • Define P(X, Y) as a product of many independent terms P(ACG) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits A) • Terms are of the forms P(yi | yi-1) and P(xi | yi) • Trained by collecting counts
HMMs For Gene Prediction • To predict genes given a sequence X, calculate argmaxY P(Y | X) = argmaxY P(X, Y) / P(X) = argmaxY P(X, Y) • Advantage: simplicity • Extremely fast training, efficient inference • Disadvantage: simplicity • Makes many unwarranted independence assumptions • Inaccurate model will get us into trouble
When HMMs Go Wrong • Normal HMM training optimizes wrong function • We use P(Y | X) for prediction, but we’re optimizing P(X, Y) = P(Y | X) P(X) • This means we may prefer parameters that lead to worse predictions if they assign a higher probability to the sequence
NNC A 2% B 2% C 96% NC A 3% B 2% C 95% CDS A 49% B 49% C 2% CNS A 96% B 2% C 2% CDS A 49% B 49% C 2% NC A 3% B 2% C 95% CDS A 3% B 95% C 2% When HMMs Go Wrong …CCCCCCCCCCCCCAAAAAAAAAACCCC…CCCCCCCBBABAAABBABBABCC… A = Conserved triplet B = Synonymous substitution C = Nonsynonymous substitution
Can We Fix It? • Directly optimize • No closed form solution • But function and gradient can be calculated efficiently using DP • If we’re going to numerically optimize anyway, might as well switch to a more expressive model
CRFs For Gene Prediction • Discriminative model • Define P(Y | X) as a product of many terms • Individual terms are not probabilities! • Terms are of the form fj(yi-1, yi, X, i) wj • The Good • Independence assumptions much weaker than in HMMs • Inference complexity is the same as for HMM • The Bad • Training requires numerical optimization of (convex) likelihood function
The Math CRFs HMMs
HMMs vs. CRFs y1 y2 y3 y4 y5 y6 … HMM x1 x2 x3 x4 x5 x6 y1 y2 y3 y4 y5 y6 … CRF x1 x2 x3 x4 x5 x6