EMBRACE and EMBOSS

Integrating everything and Integrated by everything
Peter Rice, EBI (pmr@ebi.ac.uk)
June 2006

EMBRACE and EMBOSS

EMBRACE is an EC-funded Network of Excellence with 18 partners, developing an integrated set of services for the major bioinformatics data resources and analysis tools. The EMB name was selected after two previous names were rejected. It stands for "European Model for Bioinformatics Research And Community Education" and has no connection with EMBL.

EMBOSS is now 10 years old, with the project team hosted by EMBL-EBI, providing open source libraries and over 200 applications for sequence analysis. EMBOSS has its roots at EMBL Heidelberg, but started at the Sanger Centre and the UK EMBnet node. The EMB name reflects the EMBL and EMBnet origins as "European Molecular Biology Open Software Suite".

EMBRACE Network of Excellence - 18 partners with data resources, analysis tools, expertise in grid technology and experimental biologists.
• Graham Cameron, Peter Rice, Alan Bleasby - EBI, Cambridge, GB
• Toby Gibson - EMBL, Heidelberg, DE
• Andreas Gisel - Institute of Biomedical Technologies, Section Bari, CNR, IT
• Teresa Attwood - University of Manchester, GB
• Marco Pagni - Swiss Institute of Bioinformatics, CH
• Erik Bongcam-Rudloff - LCB/BMC, Uppsala, SE
• Vincent Breton - CNRS, Clermont Ferrand, FR
• Søren Brunak - CBS, Lyngby, DK
• José-María Carazo - CNB, Madrid, ES
• Arne Elofsson - DBB, Stockholm, SE
• Daniel Kahn - INRA/CNRS, Toulouse, FR
• Ralf Herwig - MPI für Molekulare Genetik, Berlin, DE
• Eija Korpelainen - CSC, Espoo, FI
• Christine Orengo - University College London, GB
• Yitzhak Pilpel - Weizmann Institute of Science, IL
• Gert Vriend - CMBI, Nijmegen, NL
• Alfonso Valencia - INTA-CAB, Madrid, ES
• Christian Bryne - University of Bergen, NO

EMBRACE Overview
This kind of programming is hard to do. EMBRACE aims to make it easier, and within the reach of experimental biologists. To do this, we need an interoperable set of services and clients that can both find and make use of them.

EMBRACE aims to enable ...
• a scientist to invoke the latest and best version of a given program without any concern for its physical location
• the program to find the most up-to-date data without help from the user
• workflows to automatically take advantage of whatever compute power is available
• workflows to deliver results in a way which any user can understand
• the scientist to follow connections to other relevant data and tools using all the straightforward idioms of web browsing and hyperlinks.

EMBRACE: Interconnectivity
[Figure labels: Application interface, User interface, Application]

EMBRACE: Approaches
• Defining an application interface
• Design from the view of the user/application
• Browser example
  • User provides a query and a data type
  • Generate a list of results by data resource
  • Expand and browse the list, following links
  • Select some or all as input to analysis tools
  • Requires human-readable definitions
• Automation
  • A similar example, but with a program selecting and launching the analysis
  • Requires machine-readable definitions

EMBRACE Data Content
• DNA sequence information
• Protein sequence information
• Genome annotation
• Macromolecular Structure Data
• Expression information
• Literature
• Orthologs
• Untranslated regions
• Protein Families Alignments
• Protein/protein-associations
• Structural domains
• Gene3D
• ORFandDB
• SNPs in regulatory regions
• 3D Electron Microscopy data

EMBRACE Analysis Tools
• EMBOSS
• DNA sequence analysis
• Protein sequence analysis
• Pattern matching
• Genome annotation
• Expert systems
• Hidden Markov Models
• Homology searches
• Phylogenetic analysis
• Protein structure analysis
• Protein structure comparison
• Protein domain mapping
• Microarrays and gene expression
• Bioinformatics workflows
• Bioinformatics tool environments
• Protein structure prediction
• Electron microscopy
• Electron microscope tomography
• Systems biology modelling
• Text mining

EMBRACEgrid
[Figure labels: Web services, Grid services, Information world, Infrastructure world]
Requires: Data management, Data replication, Service discovery, Computing
[Table: ratings of web services and grid services against each requirement; values as extracted: OK, ??, OK, KO, KO, ??, KO, OK]
• Lack of infrastructure providing low-level services
• Instability and lack of robustness
• Standards still evolving, and implementations lagging behind

EMBRACE: Data Content Services
• Promised deliverables are prototypes
• Webservice technology
• Content provided by EBI and EMBL Heidelberg
• Access to:
  • Nucleotide sequence data resources
  • Protein sequence data resources
  • Protein motif resources
• Technology choices kept flexible
  • SOAP webservices from EBI
  • BioMart from EBI
  • Existing services from other partners

EMBRACE: Analysis Tools Services
• Promised deliverables are prototypes
• Webservice technology
• Content provided by EBI
• Access to:
  • Sequence analysis tools (EMBOSS etc.)
  • Protein structure analysis tools (EMBOSS/EMBASSY etc.)
• Technology choices kept flexible
  • SOAP webservices
  • SOAPlab project (EBI/MyGrid)
  • Life Science Analysis Engine standard (OMG)
• Integration also implies
  • Tools will access data resources via EMBRACE interfaces

EMBRACE: Technology Choice
• Promised deliverable is a survey of webservice and grid technologies
• Will be made publicly available
• To cover:
  • European Grids and Bioinformatics (EGEE etc.)
  • Webservice standards
  • Grid service standards
  • Current standards
  • Emerging standards
  • Recommendations on technology adoption
  • Recommendations on further technology watch
• Technology test cases
  • Designed to demonstrate technology
  • Designed to show improvements in technology
  • Designed to highlight problems

EMBRACE: Test Cases
• EMBRACE is driven by biological test cases:
  • 4 initial test cases in the proposal
  • Workshop (Uppsala, 2005) defined new test cases
  • Partners illustrating use of their content/tool resources
• Test cases described in detail
• Template adopted from BioMOBY
• Implement template solutions
• Identify missing components
• Set priorities
• ... and fill in the gaps

EMBRACE: Outreach
• First workshops have been internal (inreach)
• In 2006, workshops will be mixed with outreach
• EMBRACE is aimed at skilled bioinformaticians
• Need to address needs of biological researchers
• EMBRACE provides a programming interface to services
• Biologists need a simple "browser"
• EMBRACE will need a simple interface to demonstrate utility
• Example interfaces:
  • Taverna (EBI/MyGrid/OMII-UK)
  • Other workflow systems
  • Simple program examples
  • Simple script examples
  • "The Big Red Button"

EMBRACE Year Two
• Prototype content services to become standard
• Prototype tool services to become standard
• Further prototypes beyond sequence data
• Established technology choice
• Well documented test cases
• Good links to biological research community
• Selected collaborators
• Willing to explore emerging technologies
• Biological (and practical) use cases

EMBOSS: History
• EMBOSS started in March 1996
• First requirements based on a list of long-standing problems in existing commercial software (GCG), and the need for public source code
• First "ajax" library written August 1996
• 30 potential developer/user sites identified November 1996 (EMBnet Helsinki)
• Wellcome Trust proposal February 1997 (Sanger, HGMP and EBI)
• Accepted August 1997
• Project started November 1997
• EMBOSS 1.0.0 released on 15th July 2000
• EMBOSS 2.0.0 released on 15th July 2002
• EMBOSS 3.0.0 released on 15th July 2005
• EMBOSS 4.0.0 will be released on 15th July 2006

Original Target Users
Each of the following groups had their own special needs which EMBOSS aimed to satisfy:
• Sanger Centre genomic sequencing and analysis groups
• RFCGR/HGMP registered academic users (about 10,000)
• EMBnet service providers in 30+ other countries with over 30,000 users
• Academic users everywhere
• Pharmaceutical and biotechnology industry
• Bioinformatics developers

Seqret
• Seqret is a very simple application
• It reads a sequence USA (in any format, from anywhere)
• It writes a sequence USA (in any format)
• If you tell it the sequence has feature annotation:
  • It reads the features (in any format)
  • It writes the features (in any format)
• Seqret has 13 lines of code

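The source code: seqret.c

    #include "emboss.h"

    int main(int argc, char **argv)
    {
        AjPSeqall seqall;     /* stream of input sequences */
        AjPSeqout outseq;     /* output sequence destination */
        AjPSeq seq = NULL;    /* sequence currently being copied */

        /* Process the command line for the application named "seqret" */
        embInit("seqret", argc, argv);

        /* Fetch the input and output named "sequence" and "seqout" */
        seqall = ajAcdGetSeqall("sequence");
        outseq = ajAcdGetSeqout("seqout");

        /* Copy every input sequence to the output */
        while (ajSeqallNext(seqall, &seq))
            ajSeqWrite(outseq, seq);

        ajSeqWriteClose(outseq);
        ajExit();
    }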
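The USA (Uniform Sequence Address) that seqret reads, for example embl:paamir, names where a sequence comes from. As a rough illustration only, the C sketch below splits the two simplest forms, database:entry and format::file, into their parts. The struct usa_parts type and the split_usa and copy_part functions are hypothetical names invented for this example; they are not part of the EMBOSS libraries, whose own USA handling covers many more cases (list files, ranges and so on) that this sketch ignores.

```c
#include <string.h>

/* Hypothetical container for the parts of a USA. An empty string means
 * "not given". The field sizes are arbitrary choices for this sketch. */
struct usa_parts {
    char format[32];    /* e.g. "fasta", from a "format::" prefix */
    char source[128];   /* database name or file name */
    char entry[64];     /* entry name after a single ':' */
};

/* Copy n characters of src into dst and terminate the string.
 * Returns 0 on success, -1 if the text does not fit. */
static int copy_part(char *dst, size_t dstsize, const char *src, size_t n)
{
    if (n >= dstsize)
        return -1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return 0;
}

/* Split "format::source:entry", where the format and entry are optional.
 * Returns 0 on success, -1 on an empty source or an oversized part. */
int split_usa(const char *usa, struct usa_parts *out)
{
    const char *rest = usa;
    const char *sep = strstr(usa, "::");
    size_t fmt_len = 0;

    if (sep != NULL) {                 /* optional "format::" prefix */
        fmt_len = (size_t)(sep - usa);
        rest = sep + 2;
    }

    const char *colon = strchr(rest, ':');
    size_t src_len = colon ? (size_t)(colon - rest) : strlen(rest);
    const char *entry = colon ? colon + 1 : "";

    if (src_len == 0)
        return -1;
    if (copy_part(out->format, sizeof out->format, usa, fmt_len) != 0)
        return -1;
    if (copy_part(out->source, sizeof out->source, rest, src_len) != 0)
        return -1;
    if (copy_part(out->entry, sizeof out->entry, entry, strlen(entry)) != 0)
        return -1;
    return 0;
}
```

Under these assumptions, split_usa("embl:paamir", &parts) fills in the source "embl" and the entry "paamir" with an empty format, while "fasta::paamir.fasta" gives the format "fasta" and the source "paamir.fasta".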

EMBOSS Quality Control
• Nightly build with no compiler warnings
• 2,000 test runs (including expected fail conditions)
• 150 valgrind memory leak tests
• Code documentation validation and indexing
• ACD file validation
• ACD documentation completeness
• Program documentation: description, command line qualifiers, example run(s) and input/output files
• Web site updates

Disaster proof software licences

• 1977: Fred Sanger sequences ΦX174 with computing by Rodger Staden
• 1996: EMBOSS started by Peter Rice (Sanger) and Alan Bleasby (SEQNET Daresbury), in collaboration with Thure Etzold (EBI)
• 1997: Funding approved by the Wellcome Trust
• 1998: SEQNET relocated to Hinxton (HGMP)
• 1999: Thure goes to LION Bioscience
• 2000: Peter leaves Sanger; EMBOSS goes to Alan at HGMP
• 2001: LION (Peter) adds EMBOSS to SRS and updates EMBOSS
  • CCP11 funding for EMBOSS development
• 2002: Peter leaves LION
• 2003: Peter joins EBI, integrating EMBOSS in myGrid services
  • Medical Research Council terminates funding for Rodger Staden
  • MRC still "owns" the Staden package. Rodger Staden retires.
  • HGMP is renamed after Rosalind Franklin (by MRC)
• 2004: April 1st: MRC announces RFCGR will be closed within 15 months
• 2005: Alan Bleasby and Jon Ison move to EBI; Tim Carver moves to Sanger
All the code is still licensed to everyone under (L)GPL.

Users: Are you a Man or a Mouse?

Command Line
EMBOSS has many possible command lines:
• Prompting for required values
    % seqret
    What sequence []: embl:paamir
    Output file [paamir.fasta]:
• Unix style
    % seqret embl:paamir -send 100 -auto
    % seqret embl:paamir -se 100 -auto
    % seqret -se 100 embl:paamir -auto
• GCG style
    % seqret embl:paamir -send=100 -auto
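The command lines above accept both -send and the shorter -se, and the GCG style attaches the value with "=". One plausible way to get that behaviour is unique-prefix matching against the list of known qualifiers. The C sketch below illustrates the idea only; it is not the EMBOSS command-line parser. The function names resolve_qualifier and parse_qualifier are hypothetical, and the qualifier list is a small illustrative subset rather than the set defined for any real application.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative subset of qualifier names, assumed for this sketch. */
static const char *const known_qualifiers[] = {
    "sbegin", "send", "sreverse", "sformat", "auto"
};

/* Resolve the first len characters of name against the known qualifiers.
 * Returns the full name on an exact or unique-prefix match, and NULL if
 * the name is unknown or ambiguous. Requires len <= strlen(name). */
const char *resolve_qualifier(const char *name, size_t len)
{
    const char *match = NULL;
    int ambiguous = 0;
    size_t count = sizeof known_qualifiers / sizeof known_qualifiers[0];

    if (len == 0)
        return NULL;
    for (size_t i = 0; i < count; i++) {
        const char *q = known_qualifiers[i];
        if (strncmp(q, name, len) != 0)
            continue;              /* not a prefix of this qualifier */
        if (q[len] == '\0')
            return q;              /* exact match always wins */
        if (match != NULL)
            ambiguous = 1;         /* a second prefix match */
        match = q;
    }
    return ambiguous ? NULL : match;
}

/* Parse one token such as "-se" or "-send=100". Returns the resolved
 * qualifier name, or NULL if the token is not a recognised qualifier.
 * *value points at the text after '=' or is NULL when there is none;
 * in the Unix style the caller would read the value from the next token. */
const char *parse_qualifier(const char *arg, const char **value)
{
    *value = NULL;
    if (arg[0] != '-')
        return NULL;               /* e.g. a USA such as embl:paamir */

    const char *name = arg + 1;
    size_t len = strcspn(name, "=");
    if (name[len] == '=')
        *value = name + len + 1;
    return resolve_qualifier(name, len);
}
```

With this list, "-se" resolves to send because no other name starts with "se", whereas "-s" alone is rejected as ambiguous. A token without a leading hyphen is left for the caller to treat as a positional value, which is what lets the qualifier appear before or after the sequence.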

Web Interface (wEMBOSS)

Web interface (SRS)

GUI Interfaces: Jemboss

GUI Interfaces: Taverna

Where are we now?

New grant vision
For the new grant we were asked to present a vision:
• Genomics (whole genome analysis)
• Phylogenetics (beyond phylip)
• Gene expression (microarray data standards)
• Biostatistics (R and BioConductor)
• Proteomics (2d gel, MS, etc)
• Genetic linkage
• Chemistry (small molecules)
All these ideas came from the 2005 User Survey.
We have funding only for core development (so far).

Extending core EMBOSS
There are many other things we can do:
• Workflows
• Automatic support for the 100+ interfaces
• Generating XML files
• Notification of changes to ACD standard
• Testing
• Ontologies
• Graphics library
• Database indexing
• Non-sequence data access

EMBOSS Books
• Three books are planned after 4.0.0
• Text ownership stays with the EMBOSS team for reuse
• Publishers: Cambridge University Press
• Programmer's guide
  • After a major code refactoring effort
  • Automated generation of code examples
• Administrator's guide
  • Installing and maintaining EMBOSS code
  • Managing data resources
  • Supporting in-house developers
• User's guide
  • Aimed at experimental biologists

EMBOSS and Industry
• Celera were the first industrial users
  • And the first to provide funding (for the SRS interface)
• Hardware manufacturers offer machines and compilers
  • IBM, HP, Apple
• Our latest partners are SciTegic/Accelrys
  • Pipeline Pilot
  • Independent Software Vendor partnership

Pipelining Heterogeneous Tools
[Figure: heterogeneous (BioJava, Perl, PROSITE, EMBOSS, and GCG) tools for sequence annotation]

The SciTegic Challenge
• Pipeline Pilot runs on Linux
  • BioPerl interface to launch EMBOSS
  • EMBOSS team to maintain the BioPerl code
• Pipeline Pilot runs on Windows
  • EMBOSS team to support EMBOSSWIN
• Why? Because we can do it, and we expect the GCG development team will find it difficult!

We need help
• Encouraging more developers
  • CUP books
  • Developer training courses - not in Hinxton
  • Course in Indiana May 2005
  • Sponsorship offer from Newcastle, UK
  • Willing to travel anywhere!!!
• Emboss-submit@emboss.open-bio.org
  • Henrikki Almusa and Medicel (Helsinki)
• Suggestions for new applications
• Collaborations in proposed new areas

Acknowledgements
• (HGMP/RFCGR): Gary Williams, Tim Carver, Hugh Morgan, Claude Beesley, Damian Counsell, Val Curwen, Mark Faller, Sinead O’Leary, Thon deBoer, Martin Bishop
• LION: (Thomas Laurent), (Bijay Jassal), Thure Etzold
• Sanger: (Ian Longden), (Richard Bruskiewich), Simon Kelley, (Ewan Birney)
• EBI: Peter Rice, Alan Bleasby, Jon Ison, Lisa Mullan, (Martin Senger), Tom Oinn, Rodrigo Lopez, Mahmut Uludag, Shaun McGlinchey
• EMBnet: UK, Norway, Italy, Germany, Belgium, Argentina, China, Turkey, Israel, Canada, Manchester
• Others: Don Gilbert, Will Gilbert, Rodger Staden, Bill Pearson, Catherine Letondal, Luke McCarthy, Susan Jean Johns, David Bauer, Andrew Lyall, Henrikki Almusa, Melody Clark, ....
