220 likes | 314 Vues
Explore how BLAST parameters impact alignment outcomes and optimize settings for varying sequence homology. Use mutated sequences to observe sequence evolution with the Point Accepted Mutation (PAM) model. Understand BLAST heuristic approaches, including Suffix Tree and Lookup table. Learn about BLAST versions, arguments, and practical applications for aligning real sequences. Experiment with BLAST on your machine for multiple sequence analysis. Utilize BLAST output examples to interpret alignment scores. Enhance your understanding of substitution scores and apply concepts to blast mRNA sequences for human and cattle genetics. Expected homology levels and efficiency in BLAST alignments are discussed.
E N D
Blast in practice BINF350, Tutorial 4 Karen Marshall
Aim • Examine how blast parameters (e.g. scoring scheme, word length) affect the alignment outcome • To optimise blast parameters for alignments with different levels of sequence homology
v Practical: Part 1 • Start with an ~200 bp original DNA sequence • Simulation mutation events over time and collect sequences • Blast original sequence against mutated sequences • Repeat blasts using different parameters Mutated sequences Blast Original sequence
Simulation of mutated sequences • Point accepted mutation (PAM) model of molecular evolution • 1 PAM = 1 mutation per 100 bases on average • 1 PAM 99.0% sequence homology • 10 PAM 90.6% sequence homology • 50 PAM 63.5% sequence homology • Concept of forward and backwards mutation
for each ‘successive PAM’ for each ‘nucleotide’ if (rand > 0.01) do not mutate else if (rand <=0.01) mutate by random selection from the non-identical bases
Step 1 2 3 BLAST - Heuristic • Suffix Tree • Lookup table • Words/seeds • Location • Threshold T • Larger seq file
BLAST http://www.ncbi.nih.gov/BLAST/blast_whatsnew.shtml • BLAST 2.2.8 release notes • Correction to tblastx alignment computation • ia32-linux now requires glibc 2.2.5 • Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20040204/ncbi.tar.gz . • Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.8/ . • BLAST 2.2.7 release notes • Standalone BLAST is now available for amd64-linux. • formatdb now restricts volume sizes to 1G on 32-bit platforms for performance reasons. • The -A option has been removed from formatdb, that is, all databases will be created with ASN.1 deflines. • tblastn query concatenation now works correctly on 64-bit platforms. • The wwwblast source code has been merged into the C toolkit tree and is no longer distributed with the binaries. • Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20040202/ncbi.tar.gz . • Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.7/ .
BLAST on your own machine • Allows you to BLAST multiple sequences • most web versions are single sequence only • Steps • Sequence files in FASTA format Can have multiple sequences in each file but no duplicates • Format larger sequence file into a database Formatdb –i dbfile.txt –p F –o T • Perform BLAST using appropriate switches BLASTALL –p BLASTN –d dbfile.txt –i comp.txt –o out.txt
BLAST 2.2.8 • Arguments see appendix of handout • –W for seed word length (default = 11) • -r reward for a match (default = 1) • -q penalty for a mismatch (default = 3) • -G cost to open a gap • -E cost to extend a gap • -F filter query sequence • -e to set threshold expectation (threshold for HSP before gaps are included) • -m to specify different output options
Example of BLAST output: -m3 Score E Sequences producing significant alignments: (bits) Value 1_10 170 3e-046 0_0 170 3e-046 4_10 115 2e-029 2_10 107 4e-027 5_10 96 2e-023 3_10 96 2e-023 4_20 68 3e-015 2_20 68 3e-015 5_20 56 1e-011 QUERY 1 agattcactggtgtggcaagttgtctctcagactgtacatgcattaaaattttgcttggc 60 1_10 1 ............................................................ 60 0_0 1 ............................................................ 60 4_10 3 ....t.....c......ag..................a.................... 60 2_10 1 ............a..c....a...........a................g.......... 60 5_10 2 ........c......a.........g............................c.... 60 3_10 1 .................g........t.....................c.....a..... 60 4_20 3 ....t.....c......ag....a.....g.......a.................... 60 2_20 1 ............a..c...ta...........aa......c..a.....g..... 55 5_20 4 ......c..c...a....g....g..............a......c......c.... 60
Substitution scores • Optimal substitution scores were derived for different PAM distances / sequence homologies (States et al., 1991) • Of importance is the match to mismatch score ratio
Substitution scores • ‘Better’ substitution matrices exist, but not yet implemented in most BLAST software
Practical: Part 2 • Apply concepts from Part 1 to ‘real sequences’ • BLAST mRNA sequence for human and cattle INFG to an ~1/2 Mb sequence of human DNA • Use optimal blast parameters for expected homology Human DNA Blast Human INFG mRNA Cattle INFG mRNA
Expected levels of sequence homology • Varies for sequences being considered and genomic region Human to mouse comparison, from …
Efficiency of BLAST • Human to cattle coding sequence ~85% homology (~PAM 15)
INFG mRNA sequences • Extracted from NCBI website using batch entrez INFG_refseq.txt >gi|10835170|ref|NM_000619.1| Homo sapiens interferon, gamma (IFNG), mRNA TGAAGATCAGCTATTAGAAGAGAAAGATCAGTTAAGTCCTTTGGACCTGATCAGCTTGATACAAGAACTA CTGATTTCAACTTCTTTGGCTTAATTCTCTCGGAAACGATGAAATATACAAGTTATATCTTGGCTTTTCA GCTCTGCATCGTTTTGGGTTCTCTTGGCTGTTACTGCCAGGACCCATATGTAAAAGAAGCAGAAAACCTT AAGAAATATTTTAATGCAGGTCATTCAGATGTAGCGGATAATGGAACTCTTTTCTTAGGCATTTTGAAGA ATTGGAAAGAGGAGAGTGACAGAAAAATAATGCAGAGCCAAATTGTCTCCTTTTACTTCAAACTTTTTAA AAACTTTAAAGATGACCAGAGCATCCAAAAGAGTGTGGAGACCATCAAGGAAGACATGAATGTCAAGTTT TTCAATAGCAACAAAAAGAAACGAGATGACTTCGAAAAGCTGACTAATTATTCGGTAACTGACTTGAATG TCCAACGCAAAGCAATACATGAACTCATCCAAGTGATGGCTGAACTGTCGCCAGCAGCTAAAACAGGGAA GCGAAAAAGGAGTCAGATGCTGTTTCAAGGTCGAAGAGCATCCCAGTAATGGTTGTCCTGCCTGCAATAT TTGAATTTTAAATCTAAATCTATTTATTAATATTTAACATTATTTATATGGGGAATATATTTTTAGACTC ATCAATCAAATAAGTATTTATAATAGCAACTTTTGTGTAATGAAAATGAATATCTATTAATATATGTATT ATTTATAATTCCTATATCCTGTGACTGTCTCACTTAATCCTTTGTTTTCTGACTAATTAGGCAAGGCTAT GTGATTACAAGGCTTTATCTCAGGGGCCAACTAGGCAGCCAACCTAAGCAAGATCCCATGGGTTGTGTGT TTATTTCACTTGATGATACAATGAACACTTATAAGTGAAGTGATACTATCCAGTTACTGCCGGTTTGAAA ATATGCCTGCAATCTGAGCCAGTGCTTTAATGGCATGTCAGACAGAACTTGAATGTGTCAGGTGACCCTG ATGAAAACATAGCATCTCAGGAGATTTCATGCCTGGTGCTTCCAAATATTGTTGACAACTGTGACTGTAC CCAAATGGAAAGTAACTCATTTGTTAAAATTATCAATATCTAATATATATGAATAAAGTGTAAGTTCACA ACT >gi|31982948|ref|NM_174086.1| Bos taurus interferon, gamma or immune type [interferon gamma type 2] (IFNG), mRNA ATTAGAAAAGAAAGATCAGCTACCTCCTTGGGACCTGATCATAACACAGGAGCTACCGATTTCAACTACT CCGGCCTAACTCTCTCCTAAACAATGAAATATACAAGCTATTTCTTAGCTTTACTGCTCTGTGGGCTTTT GGGTTTTTCTGGTTCTTATGGCCAGGGCCAATTTTTTAGAGAAATAGAAAACTTAAAGGAGTATTTTAAT GCAAGTAGCCCAGATGTAGCTAAGGGTGGGCCTCTCTTCTCAGAAATTTTGAAGAATTGGAAAGATGAAA
Human Chr12 sub-sequence • Extracted from USCS ‘Golden Path’ website • chr12:66,589,493-67,085,092 ~ ½ Mb • does contain INFG gene • Repeats masked to lower case >hg16_dna range=chr12:66589493-67085092 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=lower CATTCATTACTTTTATAAGGTTTCTCTCTGGTATGCATCTGACTTACATC ATGGGAAAGCTAGTTTCATGACTCCTTTGGAATAGTTGTGGTCCTGAATA TGGAAAATCAATTAATGAATAGCTTAAAGCACAATAGTCAACAAATAGAT GTGAAAATTCTTTGTGAACTTTAAAGTCTTACTTAAACGTGAGATATTAT ATACAGTGTTTTATGTtagactgtgagcttgttaaagaaagaactatgcc ttctttttctttctaccagttccagtgcctcgtacaacatagaaaccata agtgtttttgaaagagcaaatGAATATTGGAAGGAGTAAGGTGATAGCTA AAGCTAAAACAATGTTTAGGGAGAACAACTGAAACAAAAGCAGCATTTGT GTCTTAAACTCATGGCCTCTGAAACAGCCTTGATAGATAGTAGAGAGGGT CAGATAGAGAGAGCCTGACTCAGAGATTGGGAAGCCCTATATGGTTGGAA GAGAAAGTAAGAGGAGACCCAAAGTATTAGACCACAGAAAGAAGTTCTAA TAGTCAGTGTCAAGAGATTCAGCAGGAGGTTGTGTATCAGGATTTGGGTT TGGGAGTGGTATGGAGCTTACCTATCTCTAAAACGAGCAGGAGGGCAAAA ATGAATCCCAGTCCCAAAGAATTCACTAATGGCCAGCAAACCAACACAGG AACCCCAGCACAGACACACAAGATAGGAAACCAGTTGTTGAAACTACAAT GTAACGGGGCTGATTTAATAAAAACCTGTTACATGAGTTATAGGtttttt ttttttttttttttttttAATGTATGTGCCCCACCTTAGGAAAGCCAGAA ATAATGGCAACGAAGAAATATTCATTCACAGTGAGAAAGCCATTAGAACG TTGGCTGGAACCTAGGGGCATATCGAGGGCCCACGTGGGAAGGACAATGA CAACTTGTTTAGTCCTCACTGGTTTCCCAGTCTGTGGATCTTATTTGAAT hs_chr12_subseq.txt
Human INFG gene • From USCS ‘Golden Path website’ genome browser
Assessment • Submit • for either Part 1 or Part 2 the BLAST output, concatenated into one file and annotated • a short summary / discussion of the concepts covered in this practical (< 500 words)
References • Strongly recommend BLAST tutorial on NCBI site • http://www.ncbi.nlm.nih.gov/BLAST/tutorial/ Altschul-1.html • Further “Bioinformatics for quantitative geneticists course notes” J. McEwan • http://www-personal.une.edu.au/~jvanderw/ aabc_materials2004.htm#ModuleC