Doug Raiford Lesson 3. Gene Prediction. What’s the problem. Have a fully sequenced genome How identify the genes? What do we know so far?. Look for start and stop codons. Remember Start codon codes for methionine Stop codons do not code for an amino acid
Doug Raiford Lesson 3 Gene Prediction
What’s the problem? • We have a fully sequenced genome • How do we identify the genes? • What do we know so far?
Look for start and stop codons • Remember: the start codon (ATG) codes for methionine; the stop codons (TAG, TAA, TGA) do not code for an amino acid • Does every ATG mark the beginning of a gene? • Does every TAG, TAA, or TGA mark the end?
In frame • The start and stop codons must be “in frame” • A set of codons must fit between them, so the length is evenly divisible by three • Open reading frame (ORF): a series of codons bracketed by in-frame start and stop codons
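A minimal sketch of an in-frame ORF scan on one strand. The function name `find_orfs` and the `min_codons` parameter are illustrative assumptions, not something the slides define; the sketch reports each ORF from its first ATG to the first in-frame stop.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def find_orfs(seq, min_codons=0):
    """Return (start, end) index pairs for ORFs on the given strand.

    start is the index of the 'A' in ATG; end is exclusive and includes
    the stop codon, so (end - start) is always divisible by three.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                       # three reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):  # step codon by codon
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                        # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                     # look for the next ORF
    return orfs

print(find_orfs("CCATGAAATTTTAGCC"))
```

Raising `min_codons` is one simple way to apply a length cutoff; the reverse strand would need the same scan on the reverse complement.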
Gene length • In real genes, the distance between start and stop codons tends to be longer than expected by chance • How long would we expect that distance to be in random sequence?
Randomly drawn nucleotides • There are 64 different codons • A given codon should show up at random about once every 64 codons, or 192 nts (64 × 3) • 3 of the 64 are stop codons • Expect a stop codon 3 times in every 64 codons, or once every 21 1/3 codons (21 1/3 × 3 = 64 nts)
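The arithmetic above can be checked with a short sketch. It assumes all 64 codons are equally likely (uniform, independent nucleotides), which real genomes only approximate; the helper name `prob_no_stop` is illustrative.

```python
from fractions import Fraction

n_codons = 4 ** 3                      # 64 possible codons
p_stop = Fraction(3, n_codons)         # 3 of them are stop codons

expected_gap_codons = 1 / p_stop       # mean spacing between stops, in codons
expected_gap_nts = expected_gap_codons * 3

def prob_no_stop(k):
    """Chance that k consecutive random codons contain no stop codon."""
    return float((1 - p_stop) ** k)

print(expected_gap_codons)   # 64/3, i.e. 21 1/3 codons
print(expected_gap_nts)      # 64 nts
print(prob_no_stop(100))     # a 100-codon stop-free run is rare by chance
```

Under this assumption a stop-free stretch of 100 codons has less than a 1% chance of arising at random, which is why long ORFs are suggestive of genes.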
How far beyond 64? • E. coli (Escherichia coli) has 4356 genes • Gene length: min 44 nts, max 8621 nts • 8 genes are shorter than 64 nts • 143 are shorter than 128 nts (3%) • A length cutoff is a good start, but more is needed: approximately 77,000 ORFs on each strand are longer than 2 × the expected length
Parts of a gene • To “find” a gene, look for nt sequences that look like the parts of a gene • [Figure: RNA polymerase on a gene with a promoter region, a coding region running from the start codon (‘ATG’ = methionine) to a non-coding stop codon (‘TAA’, ‘TAG’, or ‘TGA’), and a terminator region]
Upstream region • Attracts the polymerase • Specific sequences • Gene regulation: each promoter has a unique pattern • Motifs: TTGACA for the -35 sequence, TATAAT for the -10 sequence • [Figure labels: polymerase binding (-35 and -10), transcription start site, ribosomal binding site, start codon, coding region]
Genes of a feather… • Slightly different -35 and -10 motifs attract different sigma factors • Genes with similar upstream regions tend to be related: they are expressed similarly
Termination region (downstream) • A hairpin • Followed by a U-run in the mRNA (an A-run in the DNA template strand)
Termination • Weak uracil-adenine pairing in the U-run, coupled with the hairpin binding to the NusA protein bound to the polymerase • [Figure: polymerase with the mRNA U-run (UUUUUUU) paired to the DNA A-run (AAAAAAAA)]
Motifs • How do we find them? • Difficult: motifs are fuzzy, not carved in stone • Consensus motifs: TTGACA for the -35 sequence, TATAAT for the -10 sequence
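One simple way to cope with fuzzy motifs, before reaching for a statistical model, is to allow a few mismatches against the consensus. The sketch below is only an illustration of that idea; `scan_motif` and `max_mismatches` are hypothetical names, and a mismatch count is a much cruder score than the probabilistic models used in practice.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def scan_motif(seq, motif, max_mismatches=1):
    """Return (position, matched_text, mismatches) for every window of seq
    that differs from motif in at most max_mismatches positions."""
    seq, motif = seq.upper(), motif.upper()
    width = len(motif)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        d = hamming(window, motif)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits

print(scan_motif("GGTATAATGG", "TATAAT", max_mismatches=0))
print(scan_motif("GGTATCATGG", "TATAAT", max_mismatches=1))
```

Loosening `max_mismatches` finds more true sites but also more chance matches, which is the trade-off that motivates a statistical treatment.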
Motifs • Hidden Markov Models (HMMs) are often used • All about the statistics • Markov chain: a series of events along with their probabilities • [Figure: state machine that loops on G, C, A, or T from Start, then steps through the motif letters to “Yay! I found one”]
Hidden Markov Models • The previous slide was a “state machine” representation • An HMM should have states and observations • The states are “hidden” • [Figure: HMM state diagram for TATAAT; background states emit A, C, G, T with probability .25 each, motif states emit T, A, T, A, A, T]
Probabilities • Each state has a probability of “emitting” any given observation • Each state has a probability of “transitioning” to any given next state • [Figure: TATAAT HMM state diagram annotated with emission and transition probabilities]
Representing a model: two matrices • Transition probability matrix (TRANS): rows represent the current state; columns represent the state to which a transition will occur; each entry is the probability of that transition • Emission probability matrix (EMIS): rows represent states; columns represent the observation emitted; each entry is the probability of that emission
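As a sketch of the two-matrix representation, the block below writes out the six-state TATA model whose numbers appear on the deck's Matlab example slide. The state labels, the helper names, and the choice of plain Python lists are assumptions made for illustration.

```python
import math

SYMBOLS = "ACGT"   # column order of the emission matrix
# Hypothetical labels: background, four motif positions (T, A, T, A), tail.
STATES = ["background", "T1", "A2", "T3", "A4", "tail"]

# TRANS[i][j] = probability of moving from state i to state j.
TRANS = [
    [16/17, 1/17, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]

# EMIS[i][k] = probability that state i emits SYMBOLS[k].
EMIS = [
    [.25, .25, .25, .25],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [.25, .25, .25, .25],
]

def is_row_stochastic(matrix):
    """Every row of a probability matrix must sum to 1."""
    return all(math.isclose(sum(row), 1.0) for row in matrix)

def emission(state, symbol):
    """Look up the probability that a state emits a given nucleotide."""
    return EMIS[state][SYMBOLS.index(symbol)]

print(is_row_stochastic(TRANS), is_row_stochastic(EMIS))
print(emission(1, "T"), emission(0, "G"))
```

Checking that each row sums to 1 is a cheap sanity test worth running whenever the matrices are edited by hand.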
General approach • Requires a subject matter expert to build a model • Often start with a state for each position in a possible match • Example: looking for something similar to TATAAT that might not have both A’s, might have an extra one in the first slot, and never has G’s or C’s • [Figure: TATAAT HMM state diagram with background states emitting A, C, G, T at .25 each]
Model • Also need a state for non-participating (background) regions • [Figure: TATAAT HMM state diagram; the background states emit A, C, G, T at .25 each]
Model • Make a first guess as to the probabilities • Maybe from the state associated with the first T to A is 100% • Then 50%/50% whether A or T • Then 50%/50% whether A or T • Then 100% T • [Figure: TATAAT HMM state diagram with the first-guess probabilities]
Can “train” the model • Baum-Welch or Viterbi algorithm • Pass the algorithm a sequence of observations and a first guess as to the probabilities • It refines the probability matrices • Assumes that the sequence adheres to the underlying probabilities • Traverses the states, keeping track of the actual frequency of emissions and transitions • Adjusts the matrices accordingly
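The "keep track of frequencies, then adjust the matrices" step can be sketched as a counting pass. This assumes a state path is already available (Viterbi-style training gets one by decoding with the current matrices and repeats until the path stops changing; Baum-Welch uses expected counts over all paths instead). The function name `reestimate` and the `pseudocount` parameter are illustrative.

```python
def reestimate(obs, path, n_states, n_symbols, pseudocount=0.0):
    """Re-estimate transition and emission matrices from one labelled path.

    obs  : list of observation indices
    path : list of state indices, same length as obs
    """
    trans = [[pseudocount] * n_states for _ in range(n_states)]
    emis = [[pseudocount] * n_symbols for _ in range(n_states)]

    for t, (state, symbol) in enumerate(zip(path, obs)):
        emis[state][symbol] += 1               # count what each state emitted
        if t + 1 < len(path):
            trans[state][path[t + 1]] += 1     # count each observed transition

    def normalize(rows):
        # Turn counts into probabilities; rows never visited stay all zero.
        out = []
        for row in rows:
            total = sum(row)
            out.append([x / total for x in row] if total else list(row))
        return out

    return normalize(trans), normalize(emis)

trans, emis = reestimate(obs=[0, 3, 0, 3], path=[0, 1, 0, 1],
                         n_states=2, n_symbols=4)
print(trans)
print(emis)
```

A small positive `pseudocount` keeps unseen transitions or emissions from being frozen at probability zero, which matters when the training sequence is short.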
Using the model • Called checking the posterior probabilities • Given a sequence, consider all possible paths through the model • Multiply the associated probabilities along each path • The path with the highest probability is the most likely path through the hidden states • Dynamic programming avoids enumerating every path: the Viterbi algorithm finds the most probable path, and the “forward algorithm” sums over all paths • A location in the sequence where the most probable states are “TATAAT” is a match • [Figure: reduced TATA model; background states emit A, C, G, T at .25 each, with probability 16/17 of staying in the background and 1/17 of entering the motif states T, A, T, A]
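To illustrate how dynamic programming replaces path enumeration, the sketch below computes the probability of a short sequence twice: once by summing over every path, and once with the forward recursion. It uses the six-state TATA model from the deck's Matlab example and assumes the model starts in the background state; the function names are illustrative.

```python
import itertools
import math

TRANS = [[16/17, 1/17, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1]]
EMIS = [[.25, .25, .25, .25], [0, 0, 0, 1], [1, 0, 0, 0],
        [0, 0, 0, 1], [1, 0, 0, 0], [.25, .25, .25, .25]]
INIT = [1, 0, 0, 0, 0, 0]   # assumption: start in the background state

def brute_force(obs, trans, emis, init):
    """Sum the probability of obs over every possible state path."""
    n, total = len(trans), 0.0
    for path in itertools.product(range(n), repeat=len(obs)):
        p = init[path[0]] * emis[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emis[path[t]][obs[t]]
        total += p
    return total

def forward(obs, trans, emis, init):
    """Same quantity via the forward recursion: O(T * n^2) instead of n^T."""
    n = len(trans)
    alpha = [init[s] * emis[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [emis[j][o] * sum(alpha[i] * trans[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)

obs = ["ACGT".index(ch) for ch in "GTATAC"]
print(brute_force(obs, TRANS, EMIS, INIT))
print(forward(obs, TRANS, EMIS, INIT))
```

For six observations the brute-force version already visits 6^6 paths, while the forward recursion does a fixed amount of work per position.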
Example using Matlab • Matlab is very useful for matrix operations
seq = ['a','g','c','g','a','t','a','c','g','c','g','a','t','c','g','a','t','a','t','a','g','t','g','c']
% the same sequence encoded as a=1, c=2, g=3, t=4
seq = [1,3,2,3,1,4,1,2,3,2,3,1,4,2,3,1,4,1,4,1,3,4,3,2]
% emission matrix: one row per state, columns are A C G T
EMIS = [.25,.25,.25,.25;
        0,0,0,1;
        1,0,0,0;
        0,0,0,1;
        1,0,0,0;
        .25,.25,.25,.25]
% transition matrix: row = current state, column = next state
TRANS = [16/17,1/17,0,0,0,0;
         0,0,1,0,0,0;
         0,0,0,1,0,0;
         0,0,0,0,1,0;
         0,0,0,0,0,1;
         0,0,0,0,0,1]
[Figure: reduced TATA model; background states emit A, C, G, T at .25 each, with probability 16/17 of staying in the background and 1/17 of entering the motif states T, A, T, A]
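As a cross-check of the example above, here is a minimal Viterbi sketch in Python using the same sequence and matrices. It assumes the first observation is emitted from the background state, a convention chosen here for simplicity that may differ from a given toolbox's; states are numbered from 0, and the function names are illustrative.

```python
import math

SEQ = "agcgatacgcgatcgatatagtgc"
OBS = ["acgt".index(ch) for ch in SEQ]        # a=0, c=1, g=2, t=3

TRANS = [[16/17, 1/17, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1]]
EMIS = [[.25, .25, .25, .25], [0, 0, 0, 1], [1, 0, 0, 0],
        [0, 0, 0, 1], [1, 0, 0, 0], [.25, .25, .25, .25]]
INIT = [1, 0, 0, 0, 0, 0]   # assumption: start in the background state

def log(x):
    """Log that maps probability 0 to -infinity instead of raising."""
    return math.log(x) if x > 0 else float("-inf")

def viterbi(obs, trans, emis, init):
    """Return the most probable state path for obs (log-space)."""
    n = len(trans)
    score = [log(init[s]) + log(emis[s][obs[0]]) for s in range(n)]
    back = []                                  # back-pointers per step
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log(trans[i][j]))
            score.append(prev[best] + log(trans[best][j]) + log(emis[j][o]))
            ptr.append(best)
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):                 # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]

path = viterbi(OBS, TRANS, EMIS, INIT)
motif_start = path.index(1)                    # first motif state
print(path)
print(motif_start, SEQ[motif_start:motif_start + 4])
```

The decoded path stays in the background until the `tata` near the end of the sequence, steps through the four motif states there, and finishes in the trailing background state.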
Sites • GeneMark (Georgia Institute of Technology): http://exon.biology.gatech.edu/ • Genscan: http://genes.mit.edu/GENSCAN.html • Genie (Berkeley): http://www.fruitfly.org/seq_tools/genie.html • Glimmer (University of Maryland): http://www.cbcb.umd.edu/software/GlimmerHMM/
Elaborate models • Can include all regions in the model • States for each position in each region • Coding region could be a simple set of three regions • [Figure labels: polymerase binding (-35 sequence TTGACA, -10 sequence TATAAT), transcription start site, ribosomal binding site, start codon, coding region, termination region]
Used in many applications • Classic example: the states are rainy or sunny • If we know whether someone is walking, shopping, or cleaning, we can predict the state • [Figure: hidden states linked by emissions to observations]
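A small sketch of the weather example for a single observation, using Bayes' rule. The probabilities below are illustrative numbers assumed for this sketch (the slide does not give any), and transitions between days are ignored.

```python
# Assumed, illustrative parameters (not taken from the slides).
PRIOR = {"rainy": 0.6, "sunny": 0.4}
EMIT = {
    "rainy": {"walking": 0.1, "shopping": 0.4, "cleaning": 0.5},
    "sunny": {"walking": 0.6, "shopping": 0.3, "cleaning": 0.1},
}

def posterior(activity):
    """P(state | one observed activity) via Bayes' rule."""
    joint = {s: PRIOR[s] * EMIT[s][activity] for s in PRIOR}
    total = sum(joint.values())
    return {s: p / total for s, p in joint.items()}

for activity in ("walking", "shopping", "cleaning"):
    print(activity, posterior(activity))
```

With these numbers, seeing someone cleaning makes "rainy" the likelier hidden state and seeing them walking makes "sunny" likelier; a full HMM would also weigh the previous day's state.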
State is hidden • If something observable depends on an underlying state, an HMM can be used • For motifs, the sequence is visible; whether or not a region is a promoter site is not
Probabilities • Each state has a probability of emitting any given observation • Each state has a probability of transitioning to any given next state • [Figure: probabilistic parameters of a hidden Markov model (example): x = states, y = possible observations, a = state transition probabilities, b = output probabilities]