EMBOSS as a DAS Client
Peter Rice (pmr@ebi.ac.uk), Mahmut Uludag (uludag@ebi.ac.uk)
3rd March 2011

EMBOSS: A quick introduction
• European Molecular Biology Open Software Suite
• Open source package for sequence analysis
• ANSI C source code
• GPL licensed applications, LGPL libraries
• 200+ applications
• 100+ third party applications in 15 associated packages
• Project started 1996 at Sanger Centre and HGMP
• Now based at EBI
• Release 6.3.0 15th July 2010
• Funded by UK-BBSRC and EMBL-EBI

EMBOSS history
• Project started at Sanger Centre and SEQNET August 1996
• Alan moved from SEQNET 1997 (Wellcome funding)
• Peter moved to Lion Bioscience 2000 (CCP11-BBSRC/MRC)
• Peter moved to EBI 2003
• HGMP closed 2005: Alan+Jon moved to EBI
• BBSRC funding (limited) 2006-2009
• BBSRC BBR funding 2009-2011
• Major new developments
  • New data types
  • New data sources
  • Built-in ontologies

EMBOSS command line interface
• EMBOSS applications run from the command line
• This is not the only interface
• There are over 100 interfaces and packaged systems available
  • Web interfaces
  • Graphical user interfaces (GUIs)
  • Web services
• All applications have a command definition file (.acd)
  • Defines all inputs, outputs, and other options
  • Read at startup
  • Contains all command line options with descriptions
  • Template for any other interface

EMBOSS command line example

    % antigenic
    Input protein sequence(s): uniprot:actb1_fugru
    Minimum length of antigenic region [6]:
    Output report [actb1_fugru.antigenic]:

    % antigenic uniprot:actb1_fugru -auto
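
The two invocations above (interactive, and non-interactive with -auto) suggest how a wrapper script might assemble an EMBOSS command line. The following is a minimal sketch, not part of EMBOSS: the helper name `build_emboss_command` is hypothetical, and it assumes each keyword argument corresponds to a qualifier declared in the application's ACD file (for example `minlen` for antigenic).

```python
import shlex


def build_emboss_command(app, usa, auto=True, **qualifiers):
    """Return an argv list for an EMBOSS application (hypothetical helper).

    Each keyword argument becomes a "-name value" pair; the names are
    assumed to match qualifiers declared in the application's ACD file.
    """
    argv = [app, usa]
    for name, value in qualifiers.items():
        argv += [f"-{name}", str(value)]
    if auto:
        # -auto suppresses prompting, as in the second example on the slide
        argv.append("-auto")
    return argv


if __name__ == "__main__":
    argv = build_emboss_command("antigenic", "uniprot:actb1_fugru", minlen=8)
    print(shlex.join(argv))
    # To run it for real (requires EMBOSS on PATH):
    #   import subprocess
    #   subprocess.run(argv, check=True)
```

Building an argv list rather than a shell string avoids quoting problems when a sequence address contains characters that are special to the shell.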

EMBOSS ACD File

    application: antigenic [
      documentation: "Finds antigenic sites in proteins"
      groups: "Protein:Motifs"
    ]

    section: input [
      information: "Input section"
      type: "page"
    ]

      seqall: sequence [
        parameter: "Y"
        type: "proteinstandard"
      ]

    endsection: input

    section: required [
      information: "Required section"
      type: "page"
    ]

      integer: minlen [
        standard: "Y"
        minimum: "1"
        maximum: "50"
        default: "6"
        information: "Minimum length of antigenic region"
      ]

    endsection: required

    section: output [
      information: "Output section"
      type: "page"
    ]

      report: outfile [
        parameter: "Y"
        rformat: "motif"
        multiple: "Y"
        taglist: "int:pos=Max_score_pos"
      ]

    endsection: output
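
Because the ACD file is described as the template for any other interface, it can help to see how a third-party tool might read one. The sketch below is an illustration only: it handles just the simple `type: name [ attribute: "value" ... ]` form shown on the slide, the names `parse_acd` and `ACD_SAMPLE` are hypothetical, and it is not the parser EMBOSS itself uses.

```python
import re

# Abbreviated copy of the antigenic ACD shown on the slide (straight quotes).
ACD_SAMPLE = '''
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
]
section: input [ information: "Input section" type: "page" ]
seqall: sequence [ parameter: "Y" type: "proteinstandard" ]
endsection: input
integer: minlen [
  standard: "Y"
  default: "6"
  information: "Minimum length of antigenic region"
]
'''

# One definition: "kind: name [ ... ]". Simplification: assumes no "]"
# appears inside a quoted attribute value.
BLOCK = re.compile(r'(\w+):\s*(\w+)\s*\[(.*?)\]', re.S)
# One attribute inside a block: name: "value"
ATTR = re.compile(r'(\w+):\s*"([^"]*)"')


def parse_acd(text):
    """Return a list of {kind, name, attrs} dicts for each ACD definition."""
    return [
        {"kind": kind, "name": name, "attrs": dict(ATTR.findall(body))}
        for kind, name, body in BLOCK.findall(text)
    ]


if __name__ == "__main__":
    for item in parse_acd(ACD_SAMPLE):
        if item["kind"] not in ("application", "section"):
            default = item["attrs"].get("default", "(none)")
            print(f"-{item['name']}: {item['kind']}, default {default}")
```

A web form or GUI generator could walk the same list to produce one input widget per qualifier, using the `information` attribute as the prompt.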

EMBOSS ACD File with EDAM Annotation

    application: antigenic [
      documentation: "Finds antigenic sites in proteins"
      groups: "Protein:Motifs"
      relations: "EDAM:0000201 topic Immunological analysis"
      relations: "EDAM:0000416 operation Epitope mapping"
    ]

    section: input [
      information: "Input section"
      type: "page"
    ]

      seqall: sequence [
        parameter: "Y"
        type: "proteinstandard"
        relations: "EDAM:0001219 data Pure protein sequence"
        relations: "EDAM:0000849 data Sequence record"
        relations: "EDAM:0002178 data 1 or more"
      ]

    endsection: input

    section: required [
      information: "Required section"
      type: "page"
    ]

      integer: minlen [
        standard: "Y"
        minimum: "1"
        maximum: "50"
        default: "6"
        information: "Minimum length of antigenic region"
        relations: "EDAM:0001249 data Sequence length"
      ]

    endsection: required

    section: output [
      information: "Output section"
      type: "page"
    ]

      report: outfile [
        parameter: "Y"
        rformat: "motif"
        multiple: "Y"
        taglist: "int:pos=Max_score_pos"
        relations: "EDAM:0001534 data Peptide immunogenicity report"
      ]

    endsection: output
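
Each `relations:` value on this slide appears to follow the pattern "EDAM:<id> <namespace> <label>". As a sketch under that assumption (the pattern is inferred from the examples shown, and the names `EdamRelation` and `parse_relation` are hypothetical), such a value could be split like this:

```python
import re
from typing import NamedTuple


class EdamRelation(NamedTuple):
    term_id: str    # numeric part of the EDAM identifier, e.g. "0000416"
    namespace: str  # "topic", "operation" or "data" in the slide examples
    label: str      # human-readable term name


# Assumed layout: EDAM:<digits> <namespace> <free-text label>
RELATION = re.compile(r'^EDAM:(\d+)\s+(\w+)\s+(.+)$')


def parse_relation(value):
    """Split one ACD relations value into its three parts."""
    match = RELATION.match(value.strip())
    if match is None:
        raise ValueError(f"not an EDAM relation: {value!r}")
    return EdamRelation(*match.groups())


if __name__ == "__main__":
    for text in (
        "EDAM:0000201 topic Immunological analysis",
        "EDAM:0000416 operation Epitope mapping",
        "EDAM:0001249 data Sequence length",
    ):
        print(parse_relation(text))
```

Grouping the parsed relations by namespace would give, for each application, its topics, operations, and the data types of its inputs and outputs.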

Documentation & books
Three books at typesetting stage.
• Administrators’ Manual
• Users’ Manual
• Developers’ Manual
Concomitant major revision of EMBOSS website.
Automation of website content addition.
Books to form basis of new website content.

EMBOSS: Sequences
Uniform Sequence Address (USA): URL-style naming, derived from the familiar "VMS logical name" syntax used by SRS and GCG.
• database:entryname
  • embl:ecompa                   ID or accession can be used in this way
  • uniprot-id:opsd_bovin         SRS syntax for query by ID
  • embl-acc:x13776               SRS syntax for query by accession
• format::filename
  • fasta::/users/pmr/paamir.fa   Filename with specific format
  • ecoompa.genbank               With no format, can try all formats
• format::filename:entryname
  • fasta::unfinished:AH6.1       Most formats allow multiple sequences
• Also @listfile and asis::gctgactgactgatg
• Queries: database-field:query   SRS syntax for id, acc, sv, des, key, org
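
The USA forms listed above can be told apart by their punctuation alone. The following sketch classifies the examples from the slide; it is a simplification written for illustration (the function name `parse_usa` and the returned dictionary keys are hypothetical, database names are assumed not to contain a hyphen, and the real EMBOSS parser handles many more cases).

```python
def parse_usa(usa):
    """Classify a Uniform Sequence Address into its components."""
    if usa.startswith("@"):
        # @listfile: a file containing one USA per line
        return {"kind": "listfile", "file": usa[1:]}
    if "::" in usa:
        fmt, rest = usa.split("::", 1)
        if fmt == "asis":
            # asis::SEQUENCE gives the sequence literally
            return {"kind": "asis", "sequence": rest}
        # format::filename or format::filename:entryname
        name, _, entry = rest.partition(":")
        return {"kind": "file", "format": fmt, "file": name,
                "entry": entry or None}
    if ":" in usa:
        # database:entryname or database-field:query
        db, _, query = usa.partition(":")
        db, _, field = db.partition("-")
        return {"kind": "database", "db": db, "field": field or None,
                "query": query}
    # Bare filename: the format has to be guessed by trying each in turn
    return {"kind": "file", "format": None, "file": usa, "entry": None}


if __name__ == "__main__":
    for example in ("embl:ecompa", "uniprot-id:opsd_bovin",
                    "fasta::/users/pmr/paamir.fa", "fasta::unfinished:AH6.1",
                    "@listfile", "asis::gctgactgactgatg"):
        print(example, "->", parse_usa(example))
```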

New data resources
• Aim to read “all” public data resources
• Follow cross-references (explicit and implied)
  • UniProt
  • EMBL/GenBank/DDBJ
  • Other
• Servers
  • Multiple data resources through a single server definition
  • DAS, Ensembl, BioMart, WsEbeye, DbFetch, SRS
  • Cache files of resource definitions for server
• Data resource catalogue (drcat)
  • 600+ data resources
  • Query terms and URLs
  • EDAM annotation of resources, formats, identifiers, terms

Data resource catalogue (drcat)

    ID      ArachnoServer
    Acc     DB-0145
    Name    ArachnoServer
    Desc    Spider toxin database
    URL     http://www.arachnoserver.org
    Cat     Organism-specific databases
    Taxon   6845 | Arachnida
    EDAMres 0000621 | Organism-specific
    EDAMdat 0002400 | Toxin annotation
    EDAMid  0002578 | ArachnoServer ID
    Xref    SP_explicit | ArachnoServer ID;Toxin name
    Query   Toxin annotation | HTML | ArachnoServer ID | www.arachnoserver.org/toxincard.html?id=%s
    Example ArachnoServer ID | AS000014
    CCmisc  BMC Genomics 10:375-375(2009); [Pubmed: 19674480]
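
The Query line in this record pairs a data type, a format, and an identifier type with a URL template containing %s. As a rough sketch of how a client could use such a record (the helper names `parse_drcat` and `query_url` are hypothetical, and the tag-then-value layout with "|" separators is assumed from the record shown), the example identifier can be substituted into the template:

```python
# Part of the ArachnoServer record from the slide, one tag per line.
DRCAT_RECORD = """\
ID      ArachnoServer
Acc     DB-0145
Desc    Spider toxin database
URL     http://www.arachnoserver.org
Taxon   6845 | Arachnida
Query   Toxin annotation | HTML | ArachnoServer ID | www.arachnoserver.org/toxincard.html?id=%s
Example ArachnoServer ID | AS000014
"""


def parse_drcat(text):
    """Map each tag to a list of lines, each line split on '|' into fields."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(" ")
        fields = [field.strip() for field in value.split("|")]
        record.setdefault(tag, []).append(fields)
    return record


def query_url(record, identifier):
    """Fill the %s placeholder of the first Query line with an identifier."""
    _data_type, _fmt, _id_type, template = record["Query"][0]
    return template.replace("%s", identifier)


if __name__ == "__main__":
    record = parse_drcat(DRCAT_RECORD)
    _id_type, example_id = record["Example"][0]
    print(query_url(record, example_id))
```

Storing each tag as a list of lines allows for records that carry several Query or Xref lines.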

EMBOSS Data Types
• Sequences
  • Nucleotide (DNA and RNA)
  • Protein
• Features
  • Attached to sequences
  • Independent data objects
• Bio-Ontologies (OBO)
• Taxonomy (NCBI)
• Data Resources
• Assembled reads
• Text
  • Text, HTML, XML

New data types
• Reuse "USA" syntax
  • [Server:]Dbname:identifier    Database has an access method
  • [Server:]Dbname-field:query   General field names
• Data types: features, bio-ontologies, taxonomy, etc.
• Access methods: HTTP, DAS, BioMart, Ensembl, ...
• Multiple types and formats for a server/resource
  • type: "sequence features"
  • format: "embl fasta"

EMBOSS Query Language
• Query fields are now general
  • Any field queryable by the access method (DAS, SRS, …)
  • Any index created by indexing applications
  • Any query term in the data resource catalogue
• Multiple queries combined
  • For one data resource
  • AND, OR, … to combine queries

DAS Server Definitions

    SERVER das [
      method: "dassource"
      type: "sequence, features"
      url: "http://www.dasregistry.org/das/"
      comment: "access sequence/feature sources listed on das registry (http://www.dasregistry.org/das/)"
      cachefile: "server.dassource"
    ]

DAS Server Definitions

    SERVER ensembldas [
      method: "dassource"
      type: "sequence, features"
      url: "http://www.ensembl.org/das/"
      comment: "access sequence/feature sources on ensembl das server (http://www.ensembl.org/das/)"
      cachefile: "server.ensembldas"
    ]

DAS Example

    DB Ensembl_Human_Genes [
      method: das
      type: "Sequence, Features"
      taxon: "9606"
      format: "das, dasgff"
      url: http://www.ebi.ac.uk/das-srv/genedas/das/Homo_sapiens.Gene_ID.reference
      example: "ENSG00000139618"
      comment: "The Ensembl human Gene_ID reference source, serving sequences and non-location features."
      hasaccession: "N"
      identifier: "segment"
      fields: "segment, type, category, categorize, feature_id"
    ]

Ensembl DAS Example

    DB Felis_catus_CAT_prediction_transcript [
      method: das
      type: "Nucfeatures"
      taxon: "9685"
      format: "dasgff"
      url: http://www.ensembl.org/das/Felis_catus.CAT.prediction_transcript
      example: "scaffold_209987[1:550]"
      comment: "Annotation source for Felis_catus prediction_transcript"
      hasaccession: "N"
      identifier: "segment"
      fields: "segment, type, category, categorize, feature_id"
    ]
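
The SERVER and DB definitions on the last few slides share one layout: a keyword, a name, and a bracketed list of `attribute: value` pairs whose values may be quoted or bare. The sketch below reads that layout into dictionaries. It is an illustration only, not the EMBOSS parser: `parse_definitions` is a hypothetical name, and the scanner assumes quoted values contain no embedded double quotes. It scans attribute by attribute rather than searching for the closing bracket, because values such as "scaffold_209987[1:550]" themselves contain brackets.

```python
import re

DEFINITIONS = '''
SERVER ensembldas [
  method: "dassource"
  type: "sequence, features"
  url: "http://www.ensembl.org/das/"
  cachefile: "server.ensembldas"
]

DB Felis_catus_CAT_prediction_transcript [
  method: das
  type: "Nucfeatures"
  format: "dasgff"
  url: http://www.ensembl.org/das/Felis_catus.CAT.prediction_transcript
  example: "scaffold_209987[1:550]"
  identifier: "segment"
]
'''

HEADER = re.compile(r'\s*(SERVER|DB)\s+(\S+)\s*\[')
ATTR = re.compile(r'\s*(\w+):\s*(?:"([^"]*)"|(\S+))')  # quoted or bare value
END = re.compile(r'\s*\]')


def parse_definitions(text):
    """Return a list of {kind, name, attrs} dicts for SERVER/DB blocks."""
    definitions = []
    pos = 0
    while True:
        head = HEADER.match(text, pos)
        if head is None:
            break
        kind, name = head.groups()
        pos = head.end()
        attrs = {}
        while END.match(text, pos) is None:
            attr = ATTR.match(text, pos)
            if attr is None:
                raise ValueError(f"cannot parse {name!r} near offset {pos}")
            key, quoted, bare = attr.groups()
            attrs[key] = quoted if quoted is not None else bare
            pos = attr.end()
        pos = END.match(text, pos).end()
        definitions.append({"kind": kind, "name": name, "attrs": attrs})
    return definitions


if __name__ == "__main__":
    for definition in parse_definitions(DEFINITIONS):
        print(definition["kind"], definition["name"],
              definition["attrs"].get("url"))
```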

EMBOSS Query Language
• das:ensembl_human_genes:ENSG00000139618
• ensembldas:Felis_catus_CAT_prediction_transcript:scaffold_209987[1:550]
• das:Homo_sapiens_GRCh37_transcript:10[32889611:32973347]
• das:uniprot:P00280
• das:cath:5pti
• das:uniparc:UPI000000000A
• das:Homo_sapiens_GRCh37_reference-{segment: 11 & type: supercontig}
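
The identifier queries above have the shape server:database:identifier, optionally followed by a [start:end] range. As a sketch of what a DAS client has to do with such a query, the code below splits it into parts and builds a DAS 1.x `features` request of the form `features?segment=id:start,stop`. The function names are hypothetical, field queries in braces are deliberately rejected rather than handled, and whether EMBOSS issues exactly this request is an assumption; the source URL must come from the matching DB definition.

```python
import re
from urllib.parse import quote

# [server:]database:identifier, optionally followed by [start:end]
QUERY = re.compile(
    r"^(?:(?P<server>\w+):)?(?P<db>\w+):(?P<ident>[^\[{]+)"
    r"(?:\[(?P<start>\d+):(?P<end>\d+)\])?$"
)


def parse_query(query):
    """Split an identifier query into server, db, ident, start and end."""
    compact = "".join(query.split())  # tolerate the spacing used on slides
    match = QUERY.match(compact)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    parts = match.groupdict()
    for key in ("start", "end"):
        if parts[key] is not None:
            parts[key] = int(parts[key])
    return parts


def das_features_url(source_url, ident, start=None, end=None):
    """Build a DAS 1.x features request for one segment of a source."""
    segment = ident if start is None else f"{ident}:{start},{end}"
    return f"{source_url.rstrip('/')}/features?segment={quote(segment, safe=':,')}"


if __name__ == "__main__":
    parts = parse_query(
        "ensembldas:Felis_catus_CAT_prediction_transcript:scaffold_209987[1:550]"
    )
    # URL taken from the Felis_catus DB definition shown earlier in the deck
    source = "http://www.ensembl.org/das/Felis_catus.CAT.prediction_transcript"
    print(das_features_url(source, parts["ident"], parts["start"], parts["end"]))
```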

EMBOSS Query Language: Future
• Ontology-based searches of data resources
  • Taxonomy
  • EDAM terms
    • Resources
    • Data types
    • Identifiers
    • Descriptions
• Search for applications matching data types
  • Sequences and features
  • Nucleotide and protein
  • …
• Support for DAS advanced query ...

Acknowledgements
• EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Martin Senger, Tom Oinn, Jaina Mistry, Rodrigo Lopez, Sharmilla Pillai, Hamish McWilliam
• RFCGR/HGMP: Alan Bleasby, Jon Ison, Tim Carver, Hugh Morgan, Claude Beazley, Lisa Mullan, Damian Counsell, Gary Williams, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop
• Sanger Institute: Ian Longden, Richard Bruskiewich, Simon Kelley
• LION: Mahmut Uludag, Thomas Laurent, Bijay Jassal, Bren Vaughan, Thure Etzold
• National bioinformatics service providers in: Norway, Spain, Italy, Netherlands, Germany, Belgium, Russia, China, Canada, Australia, Argentina
• Others: Catherine Letondal, Don Gilbert, Rodger Staden, Bill Pearson, Webb Miller, Marie-Laetitia Denayer, Amandine Schurmann, Gabriele Weiler, Luke McCarthy, David Mathog, David Bauer, Henrikki Almusa, Thomas Siegmund, Scott Markel, Darryl Leon, Bastien Chevreux, Ivo Hofacker, ...
• IBM, Hewlett-Packard, (Compaq), Apple, SGI, Sun, LION bioscience, SciTegic, Cambridge University Press
• Open-Bio Foundation, Sourceforge, Debian, Fedora, CEH ...
And the British Antarctic Survey

http://emboss.sourceforge.net
http://emboss.open-bio.org/wiki/Latest_developments