
A new and improved system to order, produce, search, and maintain BLAST databases

Tom Madden, IEB seminar, May 19, 2011.



  1. A new and improved system to order, produce, search, and maintain BLAST databases. Tom Madden, IEB seminar, May 19, 2011.

  2. What is BLAST?
      • Basic Local Alignment Search Tool.
      • Calculates similarity for biological sequences.
      • Produces local alignments: only a portion of each sequence must be aligned.
      • Uses statistical theory to determine if a match might have occurred by chance.
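To illustrate what "local alignment" means, here is a minimal sketch of Smith-Waterman scoring. This is not how BLAST itself searches (BLAST uses faster heuristics), and the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b.

    Uses a linear gap penalty. Scores are floored at zero, which is what
    lets an alignment start and end anywhere inside either sequence.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Take the best of: start fresh, extend diagonally, or open a gap.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Only the shared core ACGTACG aligns; the unrelated flanks are ignored.
    print(smith_waterman_score("TTTTACGTACGTTTTT", "GGGACGTACGGGG"))
```

Because the score never drops below zero, the mismatching flanks do not penalize the shared core, so only a portion of each sequence needs to align.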

  3. Projects since last review
      • BLAST web page redesign and MyBLAST system.
      • BLAST+ library and applications.
      • Delta-BLAST.
      • BLAST Database Pipeline redesign and BlastDBInfo.

  4. Outline
      • Summary of the problem.
      • The “current” system.
      • The “new” system (BlastDB-pipeline).
      • Future plans.

  5. The problem
      • There are BLAST databases such as nt, est, htgs, gss, and nr produced by the ID team.
      • The NCBI also provides domain-specific databases with contents specified by different NCBI groups.
      • Examples: HTGS with phases 0, 1, 2, and 3 for a specified organism; RefSeq RNAs annotated on genomic RefSeqs included in an annotation run; DNA sequences used in GEO.
      • BLAST users need to be able to find and search these databases.

  6. BLAST database statistics
      • 15140 DNA databases, 3903 protein databases.
      • Largest database: WGS with 190 billion bases.
      • Smallest database: Escherichia_coli_o157_h7_str__ec4486_WGS with 444 bases.
      • How many contain only genomic DNA?
      • How many contain only cDNA?
      • How many databases for any given taxid?
      • How many contain only RefSeq entries?

  7. Current system
      • Built in the last century.
      • Different groups at the NCBI produce many of the domain-specific databases.
      • They must assemble the required sequences (in FASTA?), produce a BLAST database, and then request rdist.
      • Problems with the current system:
          • Redundant effort (many groups writing the same script).
          • No overall tracking of databases.
          • Issues with full disks, bad scripts, empty files.
          • Issues with outdated databases.
          • Issues with documentation.
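As a rough picture of the "produce a BLAST database" step that each group scripted for itself, the sketch below builds a command line for `makeblastdb` from the BLAST+ package. The slides do not say which tool or options the groups actually used, so the tool choice, the helper name `makeblastdb_command`, and the file names are all assumptions.

```python
from typing import List


def makeblastdb_command(fasta_path: str, db_name: str, title: str,
                        is_protein: bool = False) -> List[str]:
    """Build (but do not run) a makeblastdb command line.

    Hypothetical helper for illustration; the flags are standard
    BLAST+ makeblastdb options.
    """
    return [
        "makeblastdb",
        "-in", fasta_path,                              # input FASTA file
        "-dbtype", "prot" if is_protein else "nucl",    # protein or nucleotide
        "-out", db_name,                                # base name of the database files
        "-title", title,                                # human-readable title
        "-parse_seqids",                                # index the sequence identifiers
    ]


if __name__ == "__main__":
    # File and database names below are made up for the example.
    cmd = makeblastdb_command("mouse_est.fasta", "mouse_est", "Mouse ESTs")
    print(" ".join(cmd))
```

With BLAST+ installed, the list could be passed to `subprocess.run(cmd, check=True)`. Every group maintaining its own variant of such a script is the redundant effort the slide describes.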

  8. New system: Blastdb-pipeline
      • Joint effort of BLAST and ID teams.
      • A group at NCBI "orders" a database. The group determines the contents of the database.
      • Metadata is produced for each database; it can be retrieved through eutils.
      • Can be used to customize web pages.
      • Will replace the current system.

  9. BLASTDB-Pipeline (sketch) [Diagram; labels: NCBI staff, ID Team, BLAST DB order/update, BLASTDB-SVC, Database on disk, Produce metadata and statistics, Blastdbinfo (Entrez database), BLAST search page.]

  10. Ordering a database (two and a half ways).
      • Define the database as an Entrez query. Example queries:
          • RefSeqGene: “refseqgene[keyword]”
          • GEO: “nucleotide_geoprofiles[filter]”
          • Mouse ESTs: “txid10090[orgn] AND (gbdiv_est[prop])”
      • Specify a GenColl accession.
      • Upload a "raw" database. Discouraged, but needed for gnomon, UniVec, etc.
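The organism-plus-division queries above follow a regular pattern, so they can be composed from a taxid and a property name. A minimal sketch, assuming a hypothetical helper `organism_division_query` (not part of Blastdb-pipeline):

```python
def organism_division_query(taxid: int, division_property: str) -> str:
    """Compose an Entrez query restricting an organism to one GenBank division.

    Hypothetical helper for illustration; it follows the pattern of the
    "Mouse ESTs" example query on the slide.
    """
    return f"txid{taxid}[orgn] AND ({division_property}[prop])"


if __name__ == "__main__":
    # Reproduces the Mouse ESTs example query (taxid 10090, EST division).
    print(organism_division_query(10090, "gbdiv_est"))
```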

  11. Metadata part 1: the sequence
      • Sequence sources: SNP, GenBank, Gnomon, RefSeq, SRA, trace, PDB, or SwissProt.

  12. What can we learn from an Entrez query?
      • Database: Pongo abelii ESTs
      • Entrez query: txid9601[orgn] AND (gbdiv_est[prop])
      • Pongo abelii (Taxid: 9601)
      • Type: cDNA
      • Strategy: EST
      • Source: GenBank
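The inference on this slide can be sketched as simple pattern matching on the query text. This is an illustration only: the slides do not describe how the pipeline actually derives metadata, and the `DIVISION_HINTS` table and function name are assumptions that cover just the EST case shown.

```python
import re
from typing import Dict, Optional

# Hypothetical lookup: GenBank division property -> (type, strategy, source).
# Only the case shown on the slide is filled in.
DIVISION_HINTS = {
    "gbdiv_est": ("cDNA", "EST", "GenBank"),
}


def infer_metadata(query: str) -> Dict[str, Optional[str]]:
    """Guess taxid, type, strategy and source from an Entrez query string."""
    taxid = re.search(r"txid(\d+)\[orgn\]", query, flags=re.IGNORECASE)
    division = re.search(r"(gbdiv_\w+)\[prop\]", query, flags=re.IGNORECASE)
    hints = (None, None, None)
    if division:
        hints = DIVISION_HINTS.get(division.group(1).lower(), hints)
    return {
        "taxid": taxid.group(1) if taxid else None,
        "type": hints[0],
        "strategy": hints[1],
        "source": hints[2],
    }


if __name__ == "__main__":
    print(infer_metadata("txid9601[orgn] AND (gbdiv_est[prop])"))
```

Resolving the taxid to the organism name (Pongo abelii for 9601) would need a taxonomy lookup, which this sketch leaves out.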

  13. Metadata part 2: the rest
      • Species-level taxid (e.g., 9606 for Homo sapiens).
      • BioProject ID.
      • Title.
      • Description (extended title).
      • Genome collection assembly name.
      • Entrez query.
      • Keywords.

  14. Other metadata sources
      • Genome collections can provide metadata.
      • Submitter provides metadata for uploaded databases.
      • Trace has an XML dump.

  15. Eutils access

  16. Uploaded database

  17. BLASTDB-Pipeline (sketch) [Diagram; labels: NCBI staff, ID Team, BLAST DB order/update, BLASTDB-SVC, Database on disk, Produce metadata and statistics, Blastdbinfo (Entrez database), BLAST search page.]

  18. BlastDBInfo query: “guinea pig” AND genomic [SeqType]

  19. BlastDBInfo query: clostridium difficile
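Since Blastdbinfo is described as an Entrez database reachable through eutils, queries like the two above could be sent to the E-utilities `esearch` endpoint. The sketch below only builds the request URL. It assumes the database is exposed under the name `blastdbinfo` and is still available, neither of which the slides confirm.

```python
from urllib.parse import urlencode

# Standard E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, db: str = "blastdbinfo", retmax: int = 20) -> str:
    """Build an esearch URL; db="blastdbinfo" is an assumed database name."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    print(esearch_url('"guinea pig" AND genomic[SeqType]'))
    print(esearch_url("clostridium difficile"))
```

Fetching either URL would return matching record IDs as XML, which could then be passed to `esummary` to retrieve the per-database metadata.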

  20. Blastdbinfo statistics
      • 821 cDNA databases.
      • 7320 genomic databases.
      • 513 RefSeq databases.
      • 10 Cavia porcellus (guinea pig) databases.

  21. New system supports production databases
      • GEO
      • RefSeqGene
      • SNP
      • Top-level databases (nt, est, htgs, nr)
      • SRA
      • Trace
      • RefSeq Assembled Genomes

  22. Blast.cgi produces Assembled Genomes pages

  23. How to use the new system
      • Discuss the need for a BLAST database with your supervisor.
      • Look at the “Blastdb-Pipeline end user manual”, available in SharePoint at IEB > Molecular Software Section > BLAST > BLAST db dump process redesign.
      • Log in to NCBILS once (with NIH username and password).
      • Have your supervisor email blastsoft@ncbi.nlm.nih.gov and request that you be given permissions in Blastdb-Pipeline.
      • Submit your database order.

  24. Future plans (wild-eyed speculation)
      • Find databases and/or pages based upon organism or some other criteria.
      • Produce on-the-fly reports about a BLAST database.
      • Add link from BLAST report back to BioProjects.
      • Add link to WGS master record to a BLAST page.

  25. Finding Databases

  26. Database documentation

  27. Acknowledgements • Yan Raytselis • Christiam Camacho • Yuri Merezhuk • Irena Zaretskaya • Ilya Dondoshansky • Misha Kimelman • Eugene Yaschenko • Anatoly Mnev • Mike DiCuccio • Avi Kimchi • Paul Kitts • Francoise Thibaud-Nissen • Sergei Resenchuk • Deanna Church • Grisha Starchenko • Aaron Gussman • Pramod Paranthaman • Mark Johnson • Amanjeev Sethi • Jeff Beck • Michael Domrachev • Eric Sayers • Tao Tao • Peter Cooper • Wayne Matten • Scott McGinnis
