210 likes | 329 Vues
The Gibbs Motif Sampler. By: Andrea Smith Introduction to Bioinformatics December 9 , 2004. Background:. Designed by Eric C. Rouchka and Bill Thompson Identifies motifs or conserved regions within sequences of DNA or protein. Used by several tools such as AlignACE. Background:.
E N D
The Gibbs Motif Sampler By: Andrea Smith Introduction to Bioinformatics December 9 , 2004
Background: • Designed by Eric C. Rouchka and Bill Thompson • Identifies motifs or conserved regions within sequences of DNA or protein. • Used by several tools such as AlignACE.
Background: • “Site” sampling assumes that a given nucleotide or amino acid sequence contains only one motif element of a particular motif type. • “Motif” and “recursive” sampling assume • that a given nucleotide or amino acid sequence • can contain zero or more.
The Gibbs Sampler uses a stochastic • process. Each time the program is run a • different result may be displayed. • To account for this the • Gibbs sampler can be run several times to • see which motifs are selected • most often.
There are two steps to the Gibbs algorithm used to find the most probable motif pattern: Initialization and Predictive Uptake step: Random locations are selected in each sequence but one to create an initial “null” model. This initial alignment is used to calculate the residue frequencies for both the motif and the “background”
Sampling step: The background probabilities are given weights and the motif start position is chosen based on these weights. Basically the sampler randomly selects locations that could be the motif based on the first model. This cycle is continued until a motif model that makes the most sense based on the calculated probabilities.
Example 1: Saccharomyces cervisiae PHO genes • gene or cluster of genes submitted to be • searched using FASTA format: >PHO5 PHO5 upstream sequence, from -800 to -1, size 800 TTTTACACATCGGACTGATAAGTTACTACTGCACATTGGCATTAGCTAGGAGGGCATCCA AGTAATAATTGCGAGAAACGTGACCCAACTTTGTTGTAGGTCCGCTCCTTCTAATAATCG CTTGTATCTCTACATATGTTCTATTTACTGACCGAAAGTAGCTCGCTACAATAATAATGTTGACCTGATGTCAGTCCCCACGCTAATAGCGGCGTGTCGCACGCTCTCTTTACAGGACG CGGAGACCGGCATTACAAGGATCCGAAAGTTGTATTCAACAAGAATGCGCAAATATGTC ACGTATTTGGAAGTCATCTTATGTGCGCTGCTTTAATGTTTTCTCATGTAAGCGGACGTC GTCTATAAACTTCAAACGAAGGTAAAAGGTTCATAGCGCTTTTTCTTTGTCTGCACAAAG AAATATATATTAAATTAGCACGTTTTCGCATAGAACGCAACTGCACAATGCCAAAAAAAG TAAAAGTGATTAAAAGAGTTAATTGAATAGGCAATCTCTAAATGAATCGATACAACCTTG GCACTCACACGTGGGACTAGCACAGACTAAATTTATGATTCTGGTCCCTGTTTTCGAAGA ATCGCACATGCCAAATTATCAAATTGGTCACCTTACTTGGCAAGGCATATACCCATTTG GGATAAGGGTAAACATCTTTGAATTGTCGAAATGAAACGTATATAAGCGCTGATGTTTTG CTAAGTCGAGGTTAGTATGGCTTCATCTCTCATGAGAATAAGAACAACAACAAATAGAGC AAGCAAATTCGAGATTACCA • results are processed and displayed via e-mail:
Gibbs 2.06.018 Sep 22 2004Data file: /tmp/gibbs2530/data.txtOutput file: /tmp/gibbs2530/outfile.txtBackground Composition Model file: /tmp/gibbs2530/data.txt_info-detCurrent directory: /tmp/gibbs2530The following options are set:Concentrated Region False Sequence type FalseCollapsed Alphabet False Pseudocount weight True Use Expectation/Maximization False Don't Xnu sequence FalseHelp flag False Near optimal cutoff True Number of iterations True Don't fragment FalseDon't use map maximization False Repeat regions FalseOutput file True Informed priors file FalsePlateau periods True palindromic sequence FalseDon't Reverse complement False Number of seeds True Seed Value False Pseudosite weight True Suboptimal sampler output False Overlap FalseAllow width to vary False Wilcoxon signed rank FalseSample along length False Output Scan File FalseOutput prior file False Modular Sampler FalseIgnore Spacing Model False Sample Background FalseBkgnd Comp Model True Init from prior FalseHomologous Seq pairs False Parallel Tempering FalseGroup Sampler False No progress info FalseFragment from middle False Verify Mode FalseAlternate sample on k False No freq. soln. True Calc. def. pseudo wt. False Motif/Recur smpl FalsePhylogenetic Sampling False Supress Near Opt. True site_samp = 1nMotifLen = 10nAlphaLen = 4nNumMotifs = 5dPseudoCntWt = 0.1dPseudoSiteWt = 0.8nMaxIterations = 500lSeedVal = 1101873584nPlateauPeriods = 50nSeeds = 20nNumMotifTypes = 1dCutoff = 0.5RevComplement = 1glOverlapParam = 0Rcutoff factor = 0.001Post Plateau Samples = 0Frag/Shft Per. = 5Frag width = 15 First is a list of options specified for the run…
(Advanced options include varying motif width, allowing fragmentation, sampling reverse complements, etc. For this example only the minimum amount of data required for sampling was provided- FASTA format and motif width.)
Next is a table indicating the frequency of each residue in each position of the motif throughout the alignment: Motif model (residue frequency x 100)____________________________________________Pos. # a t c g Info_____________________________ 1 | 100 . . . 1.2 3 | 100 . . . 1.2 4 | . . 100 . 1.9 5 | . . 16 83 1.4 6 | . 100 . . 1.4 7 | 66 33 . . 0.6 8 | . 100 . . 1.4 9 | 100 . . . 1.2 10 | . 100 . . 1.4 11 | 100 . . . 1.2nonsite 33 28 19 18site 46 33 11 8
From the frequency table and prior calculations a probability model (MAP) is calculated indicating the probability of each residue appearing in each position within the motif: Motif probability model____________________________________________Pos. # a t c g ____________________________________________ 1 | 0.949 0.022 0.015 0.014 3 | 0.949 0.022 0.015 0.014 4 | 0.026 0.022 0.938 0.014 5 | 0.026 0.022 0.169 0.783 6 | 0.026 0.945 0.015 0.014 7 | 0.641 0.330 0.015 0.014 8 | 0.026 0.945 0.015 0.014 9 | 0.949 0.022 0.015 0.014 10 | 0.026 0.945 0.015 0.014 11 | 0.949 0.022 0.015 0.014 Background probability model 0.326 0.283 0.195 0.181
Next is the alignment of the motifs found: 10 columnsNum Motifs: 6 1, 1 695 aaatg AAACGTATATA agcgc 705 1.00 F PHO5 PHO5 upstream sequence, from -800 to -1, size 2, 1 80 tatct AAACGTTTATA ttttt 90 0.96 F PHO8 PHO8 upstream sequence, from -180 to -1, size 3, 1 672 aagta AAACGTATATA agctc 682 1.00 F PHO11 PHO11 upstream sequence, from -800 to -1, si 4, 1 286 gccaa AGACGTATATA ctaag 276 1.00 R PHO81 PHO81 upstream sequence, from -800 to -1, si 4, 2 429 gctaa AGACCTTTATA atcaa 419 0.00 R PHO81 PHO81 upstream sequence, from -800 to -1, si 5, 1 668 ctcgt ATACGTATATA tataa 678 1.00 F PHO84 PHO84 upstream sequence, from -800 to -1, si * *********
Finally, the values used and map probability are provided so that the results are reproducible: Log Motif portion of MAP for motif a = -28.16387Log Fragmentation portion of MAP for motif a = -2.19722Log Background portion of Map = -4518.82616Log Alignment portion of Map = -44.12383Log Site/seq portion of Map = 0.00000Log Null Map = -4600.15275Log Map = 6.84166
Gibbs 2.06.018 Sep 22 2004Data file: /tmp/gibbs8904/data.txtOutput file: /tmp/gibbs8904/outfile.txtCurrent directory: /tmp/gibbs8904The following options are set:Concentrated Region False Sequence type FalseCollapsed Alphabet False Pseudocount weight True Use Expectation/Maximization False Don't Xnu sequence FalseHelp flag False Near optimal cutoff True Number of iterations True Don't fragment FalseDon't use map maximization False Repeat regions FalseOutput file True Informed priors file FalsePlateau periods True palindromic sequence FalseDon't Reverse complement False Number of seeds True Seed Value False Pseudosite weight True Suboptimal sampler output False Overlap FalseAllow width to vary False Wilcoxon signed rank FalseSample along length False Output Scan File FalseOutput prior file False Modular Sampler FalseIgnore Spacing Model False Sample Background FalseBkgnd Comp Model False Init from prior FalseHomologous Seq pairs False Parallel Tempering FalseGroup Sampler False No progress info FalseFragment from middle False Verify Mode FalseAlternate sample on k False No freq. soln. True Calc. def. pseudo wt. False Motif/Recur smpl FalsePhylogenetic Sampling False Supress Near Opt. True site_samp = 1nMotifLen = 10nAlphaLen = 20nNumMotifs = 3dPseudoCntWt = 0.1dPseudoSiteWt = 0.8nMaxIterations = 500lSeedVal = 1101948109nPlateauPeriods = 50nSeeds = 20nNumMotifTypes = 1dCutoff = 0.5RevComplement = 0glOverlapParam = 0Rcutoff factor = 0.001Post Plateau Samples = 0Frag/Shft Per. = 5Frag width = 15Sequences to be Searched:_________________________#1 gi|54042252|sp|P66940|TPIS_MYCTU Triosephosphate isomerase (TIM) (Triose-phosphate isomerase)#2 gi|32130219|sp|Q8CTD5|TPIS_STAEP Triosephosphate isomerase (TIM) (Triose-phosphate isomerase)#3 gi|20140628|sp|Q8ZJK9|TPIS_YERPE Triosephosphate isomerase (TIM) (Triose-phosphate isomerase)Processed Sequence Length: 769 Total sequence length: 769 Example 2 uses the amino acid sequence of the enzyme triose phosphate from three different organisms to find motifs… The parameters provided and the sequences to be searched are listed first.
Again, two matrices are provided, one with the residue frequencies and the resulting probability matrix: Motif model (residue frequency x 100)____________________________________________Pos. # a v c d e f g h i w k l m n y p q r s t Info_____________________________________________________________________________________________ 1 | . 100 . . . . . . . . . . . . . . . . . . 2.8 2 | . . . . . . . . 100 . . . . . . . . . . . 3.2 3 | 100 . . . . . . . . . . . . . . . . . . . 2.3 4 | . . . . . . . . . . . . . . 100 . . . . . 4.3 5 | . . . . 100 . . . . . . . . . . . . . . . 3.1 6 | . . . . . . . . . . . . . . . 100 . . . . 4.0 7 | . 33 . . . . . . 66 . . . . . . . . . . . 2.3 8 | . . . . . . . . . 100 . . . . . . . . . . 5.9 9 | 100 . . . . . . . . . . . . . . . . . . . 2.3 10 | . . . . . . . . 100 . . . . . . . . . . . 3.2nonsite 13 8 1 5 6 1 9 2 6 . 6 7 1 3 2 3 3 3 5 4site 20 13 . . 10 . . . 26 10 . . . . 10 10 . . . .
Probability Matrix: Motif probability model____________________________________________Pos. # a v c d e f g h i w k l m n y p q r s t ____________________________________________ 1 | 0.013 0.917 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 2 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.915 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 3 | 0.922 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 4 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.912 0.003 0.003 0.003 0.005 0.004 5 | 0.013 0.008 0.001 0.005 0.915 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 6 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.912 0.003 0.003 0.005 0.004 7 | 0.013 0.311 0.001 0.005 0.006 0.002 0.008 0.002 0.612 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 8 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.910 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 9 | 0.922 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 10 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.915 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 Background probability model 0.130 0.083 0.012 0.059 0.065 0.020 0.092 0.027 0.056 0.004 0.064 0.079 0.014 0.038 0.023 0.030 0.038 0.031 0.053 0.043
Motif Alignment: 10 columnsNum Motifs: 3 1, 1 168 qigsv VIAYEPVWAI gtgrv 177 1.00 F gi|54042252|sp|P66940|TPIS_MYCTU Triosephosphate i 2, 1 165 qlkev VIAYEPIWAI gtgks 174 1.00 F gi|32130219|sp|Q8CTD5|TPIS_STAEP Triosephosphate i 3, 1 163 afega VIAYEPIWAI gtgks 172 1.00 F gi|20140628|sp|Q8ZJK9|TPIS_YERPE Triosephosphate i **********Column 1 : Sequence Number, Site NumberColumn 2 : Left End LocationColumn 4 : Motif ElementColumn 5 : Right End LocationColumn 6 : Probability of ElementColumn 7 : Forward Motif (F) or Reverse Complement (R) Column 8 : Sequence Description from Fast A input
Values Input: Log Motif portion of MAP for motif a = -37.08646Log Fragmentation portion of MAP for motif a = -0.00000Log Background portion of Map = -2165.36239Log Alignment portion of Map = -19.63910Log Site/seq portion of Map = 0.00000Log Null Map = -2254.63689Log Map = 32.54894log MAP = sum of motif and fragmentation parts of MAP + background + alignment + sites/seq - NullFrequency Map = 32.548944Nearopt Map = 32.548944Maximal Map = 32.548944Elapsed time: 2.650000 secs
Things to watch out for: sequence not in FASTA format, motif length too long, motif length too short, results show repeats in low complexity regions • For example, by changing the motif width parameter from 10 to 20, the results look significantly different: Num Motifs: 3 1, 1 76 lsphd SGAYTGDVSGAFLAKLGCSYVVVGHSERR tyhne 104 1.00 F gi|54042252|sp|P66940|TPIS_MYCTU Triosephosphate i 2, 1 73 ayfee SGAYTGETSPVALSELGVKYVVIGHSERR dyfhe 101 1.00 F gi|32130219|sp|Q8CTD5|TPIS_STAEP Triosephosphate i 3, 1 71 vdvnl SGAFTGETSAEMLKDIGAQYIIIGHSERR tyhqe 99 1.00 F gi|20140628|sp|Q8ZJK9|TPIS_YERPE Triosephosphate i
Conclusion: • The Gibbs method is that it can predict biological motifs • in what was formerly called “junk DNA”. • Only about 3% of the genome codes • for genes. • Using these statistical methods • regulatory signals within the genomic junk can be • identified (potentially leading to treatment of genetic conditions • ranging from birth defects to cancer.)
References • http://bayesweb.wadsworth.org/gibbs/gibbs.html • http://kbrin.a-bldg.louisville.edu/~rouchka/CECS694/Lecture4.ppt • http://rsat.ulb.ac.be/rsat/tutorials/tut_gibbs.html