Using Local Tools: BLAST
BCHB524 2008 Lecture 11
BCHB524 - 2008 - Edwards

Outline
• Install and run blast from NCBI
  • Download
  • Format sequence databases
  • Run by hand
• Running blast and interpreting results
  • Directly and using BioPython
• Exercises
  • Lecture 9 exercises

Local Tools
• Sometimes web-based services don't do it.
• For blast:
  • Too many query sequences
  • Need to search a novel sequence database
  • Need to change rarely used parameters
  • Web-service is too slow
• For other tools:
  • No web-service?
  • No interactive web-site?
  • Insufficient back-end computational resources?

Download standalone blast
• In Windows, make a folder "BLAST" in your "My Documents" folder
• Google "NCBI Blast"
  • …or go to http://www.ncbi.nlm.nih.gov/BLAST
• Click on the "Help" tab
• Under "Other BLAST Information", click on "Download BLAST Software and Databases"
• From the table under "Executables", find the download link at row "win32-ia32" and column "blast"
• Right-click on the download link and Save As…
• Put the file in your new "BLAST" folder
• In Windows, double-click on the downloaded file.

Download standalone blast
• Folders: bin, data, doc
• Create folder: db

Download standalone blast
• Look in doc: double-click to open the web-page documentation
• Download gunzip.py from the course homepage into db

Download BLAST databases
• Follow the link (above Executables) for the NCBI BLAST database FTP site:
  • ftp://ftp.ncbi.nlm.nih.gov/blast/db/
• The .tar.gz files contain databases already formatted for BLAST
• The FASTA directory contains compressed (.gz) FASTA format sequence databases.
• We'll download yeast.aa.gz and yeast.nt.gz to the db folder

Uncompress FASTA databases
• Select "Run…" from the "Start" menu
• In the "Open" dialog box, type "cmd" and click OK

Uncompress FASTA databases
• cd My Documents
• cd BLAST
• cd db
• dir

Uncompress FASTA databases
• gunzip.py yeast.*.gz
• dir
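
The course's gunzip.py script is not reproduced in these slides. As a rough sketch of what such a step might look like, assuming Python 3 and only the standard gzip, os, and shutil modules (the function name gunzip_file is hypothetical, not part of the course script):

```python
import gzip
import os
import shutil


def gunzip_file(gz_path):
    """Decompress gz_path (e.g. yeast.aa.gz) next to itself; return the new path."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a .gz file: %r" % gz_path)
    out_path = os.path.splitext(gz_path)[0]  # drop the trailing ".gz"
    # Stream the data so a large database never has to fit in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Calling gunzip_file("yeast.aa.gz") from inside the db folder would leave the compressed file in place and write yeast.aa beside it.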

Format FASTA databases
• cd ..
• bin\formatdb.exe -i db\yeast.aa -p T -o T
• bin\formatdb.exe -i db\yeast.nt -p F -o T
• dir db
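
The two formatdb commands differ only in the -p flag (T for protein, F for nucleotide). A minimal sketch of scripting them, assuming Python 3; the helper names formatdb_args and format_database are hypothetical, and only the flags shown on the slide are used:

```python
import subprocess


def formatdb_args(formatdb_exe, fasta_path, is_protein):
    """Build the formatdb argument list used on the slide."""
    return [
        formatdb_exe,
        "-i", fasta_path,                   # input FASTA file
        "-p", "T" if is_protein else "F",   # T = protein, F = nucleotide
        "-o", "T",                          # parse sequence IDs and build an index
    ]


def format_database(formatdb_exe, fasta_path, is_protein):
    """Run formatdb; raises CalledProcessError if it exits with an error."""
    subprocess.check_call(formatdb_args(formatdb_exe, fasta_path, is_protein))
```

From the BLAST folder, format_database(r"bin\formatdb.exe", r"db\yeast.aa", True) would correspond to the first command above.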

Download formatdb databases
• The .tar.gz files contain databases already formatted for BLAST
• Download to BLAST\db and use the gunzip.py program to uncompress and unpack
• For example, download
  • refseq_protein.00.tar.gz and refseq_protein.01.tar.gz
• Uncompress and unpack
  • gunzip.py refseq_protein.*.tar.gz
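
For the pre-formatted .tar.gz archives, uncompressing and unpacking can be done in one pass. The sketch below assumes Python 3 and the standard tarfile module; unpack_database is a hypothetical name, and it is not a description of how gunzip.py itself works:

```python
import tarfile


def unpack_database(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    # Mode "r:gz" handles gzip decompression and tar unpacking together.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)  # only use on archives from a trusted source
    return names
```

For example, unpack_database("refseq_protein.00.tar.gz", ".") run inside the db folder would place the database volume files there.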

Running BLAST from the command-line
• We need a query sequence to search:
• Copy and paste this FASTA file into notepad and save as "query.fasta" in the BLAST folder

>gi|6319267|ref|NP_009350.1| Yal049cp
MASNQPGKCCFEGVCHDGTPKGRREEIFGLDTYAAGSTSPKEKVIVILTDVYGNKFNNVLLTADKFASAGYMVFVPDILF
GDAISSDKPIDRDAWFQRHSPEVTKKIVDGFMKLLKLEYDPKFIGVVGYCFGAKFAVQHISGDGGLANAAAIAHPSFVSI
EEIEAIDSKKPILISAAEEDHIFPANLRHLTEEKLKDNHATYQLDLFSGVAHGFAARGDISIPAVKYAKEKVLLDQIYWF
NHFSNV
>gi|6319268|ref|NP_009351.1| Yal048cp
MTKETIRVVICGDEGVGKSSLIVSLTKAEFIPTIQDVLPPISIPRDFSSSPTYSPKNTVLIDTSDSDLIALDHELKSADV
IWLVYCDHESYDHVSLFWLPHFRSLGLNIPVILCKNKCDSISNVNANAMVVSENSDDDIDTKVEDEEFIPILMEFKEIDT
CIKTSAKTQFDLNQAFYLCQRAITHPISPLFDAMVGELKPLAVMALKRIFLLSDLNQDSYLDDNEILGLQKKCFNKSIDV
NELNFIKDLLLDISKHDQEYINRKLYVPGKGITKDGFLVLNKIYAERGRHETTWAILRTFHYTDSLCINDKILHPRLVVP
DTSSVELSPKGYRFLVDIFLKFDIDNDGGLNNQELHRLFKCTPGLPKLWTSTNFPFSTVVNNKGCITLQGWLAQWSMTTF
LNYSTTTAYLVYFGFQEDARLALQVTKPRKMRRRSGKLYRSNINDRKVFNCFVIGKPCCGKSSLLEAFLGRSFSEEYSPT
IKPRIAVNSLELKGGKQYYLILQELGEQEYAILENKDKLKECDVICLTYDSSDPESFSYLVSLLDKFTHLQDLPLVFVAS
KADLDKQQQRCQIQPDELADELFVNHPLHISSRWLSSLNELFIKITEAALDPGKNTPGLPEETAAKDVDYRQTALIFGST
VGFVALCSFTLMKLFKSSKFSK
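
Copy-and-paste can easily mangle a FASTA file, so it may be worth checking query.fasta before searching with it. A small sketch, assuming Python 3 and no BioPython; read_fasta is a hypothetical helper that simply treats each ">" line as the start of a new record:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)
```

Looping over read_fasta(open("query.fasta")) and printing each header with len(sequence) should list two records if the file was saved correctly.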

Running BLAST from the command-line
• Run the BLAST command:
• …and check out the result in query.txt.

Interpreting blast results
• Parsing text-format BLAST results is hard:
  • Use XML format output where possible (-m 7)
  • Use BioPython's BLAST parser

    from Bio.Blast import NCBIXML
    result_handle = open("query.xml")
    for blast_result in NCBIXML.parse(result_handle):
        for alignment in blast_result.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 1e-5:
                    print '****Alignment****'
                    print 'sequence:', alignment.title
                    print 'length:', alignment.length
                    print 'e value:', hsp.expect

Running BLAST from Python
• Python can run other programs, including blast, and capture the output

    import os
    command = r'bin\blastall.exe -p blastp -i query.fasta -d db\yeast.aa'
    result_handle = os.popen(command)
    for line in result_handle:
        # Echo each query header, set off by blank lines
        if line.startswith('Query='):
            print '\n' + line.rstrip() + '\n'
        # Echo the one-line hit summaries for RefSeq entries
        if line.startswith('ref|'):
            print line.rstrip()
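
The same idea can be written with the subprocess module, which reports a failed blastall run instead of silently yielding no output. This is a sketch assuming Python 3 (the slides use Python 2); the function names are hypothetical, and the filtering logic is kept separate so it can be tried on any list of lines:

```python
import subprocess


def summarize_blast_text(lines):
    """Keep the 'Query=' lines and the hit summaries starting with 'ref|'."""
    kept = []
    for line in lines:
        line = line.rstrip()
        if line.startswith("Query=") or line.startswith("ref|"):
            kept.append(line)
    return kept


def run_blastp(blastall_exe, query_path, db_path):
    """Run blastall with the slide's options and return the filtered lines."""
    args = [blastall_exe, "-p", "blastp", "-i", query_path, "-d", db_path]
    # check=True raises CalledProcessError if blastall exits with an error.
    done = subprocess.run(args, stdout=subprocess.PIPE,
                          universal_newlines=True, check=True)
    return summarize_blast_text(done.stdout.splitlines())
```

From the BLAST folder, run_blastp(r"bin\blastall.exe", "query.fasta", r"db\yeast.aa") would return the same lines the loop above prints.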

Running BLAST from BioPython
• Will automatically format results as XML

    from Bio.Blast import NCBIStandalone
    blast_db = r'db\yeast.aa'
    blast_query = r'query.fasta'
    blast_exe = r'bin\blastall.exe'
    result_handle, error_handle = NCBIStandalone.blastall(blast_exe, "blastp",
                                                          blast_db, blast_query)
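
If NCBIStandalone is not available in your BioPython installation, XML output can also be requested directly with the -m 7 option mentioned earlier. A sketch assuming Python 3; the helper names are hypothetical, and -o is blastall's output-file option:

```python
import subprocess


def blastall_xml_args(blastall_exe, program, db_path, query_path, out_path):
    """Build a blastall argument list that writes XML results to out_path."""
    return [
        blastall_exe,
        "-p", program,     # e.g. "blastp"
        "-d", db_path,     # formatted database, e.g. db\yeast.aa
        "-i", query_path,  # FASTA query file
        "-m", "7",         # XML output
        "-o", out_path,    # write results to this file
    ]


def run_blast_to_xml(blastall_exe, program, db_path, query_path, out_path):
    """Run blastall and return the path of the XML file it wrote."""
    subprocess.check_call(
        blastall_xml_args(blastall_exe, program, db_path, query_path, out_path))
    return out_path
```

The resulting file can then be opened and handed to NCBIXML.parse in place of result_handle.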

NCBI Blast Parsing
• Results need to be parsed in order to be useful…

    from Bio.Blast import NCBIXML
    for blast_result in NCBIXML.parse(result_handle):
        for alignment in blast_result.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 1e-5:
                    print '****Alignment****'
                    print 'sequence:', alignment.title
                    print 'length:', alignment.length
                    print 'e value:', hsp.expect
                    print hsp.query[0:75] + '...'
                    print hsp.match[0:75] + '...'
                    print hsp.sbjct[0:75] + '...'

NCBI Blast Parsing
• Each blast result contains multiple alignments, each between the query sequence and one database sequence
• Each alignment consists of multiple high-scoring segment pairs (HSPs)
• Each HSP has stats like expect, score, gaps, and aligned sequence chunks
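
The result / alignment / HSP nesting is visible directly in the XML file. The sketch below walks it with Python 3's standard xml.etree.ElementTree rather than BioPython. It assumes the element names of NCBI's BLAST XML output (Iteration, Hit, Hsp, Hsp_evalue, and so on), and in particular that each Iteration carries an Iteration_query-def element, which may not hold for every blastall version; significant_hsps is a hypothetical name:

```python
import xml.etree.ElementTree as ET


def significant_hsps(xml_source, max_evalue=1e-5):
    """Return (query, hit, e-value) tuples for HSPs with e-value < max_evalue."""
    results = []
    root = ET.parse(xml_source).getroot()
    for iteration in root.iter("Iteration"):   # one per query sequence
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):      # one per database sequence
            title = hit.findtext("Hit_def")
            for hsp in hit.iter("Hsp"):        # one per aligned segment pair
                evalue = float(hsp.findtext("Hsp_evalue"))
                if evalue < max_evalue:
                    results.append((query, title, evalue))
    return results
```

xml_source can be a file name such as "query.xml" or an open file handle.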

NCBI Blast Parsing
• Blast parsing skeleton

    from Bio.Blast import NCBIXML
    for blast_result in NCBIXML.parse(result_handle):
        # each blast_result corresponds to one query sequence
        # blast_result.query is query description, etc.
        # blast_result.descriptions contains one-line summary of alignments
        for alignment in blast_result.alignments:
            # each alignment corresponds to one database sequence
            # alignment.title is database description
            for hsp in alignment.hsps:
                # each query/database alignment consists of multiple
                # high-scoring segment pair (HSP) alignment "chunks"
                # HSP statistics are here
                # hsp.expect, hsp.score, hsp.positives, hsp.gaps
                pass  # replace with your own processing of each HSP

Lab exercises
• Try each of the examples shown in these slides.
• Read through NCBI's documentation for the standalone tools.
• Experiment with the different BLAST tools (blastn, tblastx, etc.) and the other included programs (blastclust, megablast).

Lab exercises
• Find putative fruit fly / yeast orthologs
  • Download FASTA file drosph.aa.gz from NCBI
  • Download FASTA file yeast.aa.gz from NCBI
  • Uncompress and format each FASTA file for BLAST
  • Search fruit fly proteins against yeast proteins
  • For each fruit fly query, output the best yeast protein with a significant HSP
  • For each yeast query, output the best fruit fly protein with a significant HSP
  • Find fruit fly / yeast protein pairs which are mutual best hits.
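
As a starting point for the last step, the mutual-best-hit logic can be separated from the BLAST parsing. The sketch below assumes Python 3 and that each search has already been reduced to (query, subject, e-value) tuples for significant HSPs; the function names and identifiers are hypothetical, and ties are broken by whichever hit is seen first:

```python
def best_hits(hits):
    """Map each query to its lowest e-value subject.

    hits is an iterable of (query, subject, evalue) tuples.
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(fly_to_yeast, yeast_to_fly):
    """Return sorted (fly, yeast) pairs that are each other's best hit."""
    pairs = []
    for fly, yeast in fly_to_yeast.items():
        if yeast_to_fly.get(yeast) == fly:
            pairs.append((fly, yeast))
    return sorted(pairs)
```

Here fly_to_yeast would come from best_hits applied to the fruit fly versus yeast search, and yeast_to_fly from the reverse search.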