Wrapping third-party analytical services for caBIG. Taverna-caBIG project. Stian Soiland-Reyes Alexandra Nenadic University of Manchester, UK. September 2009. http://www.mygrid.org.uk/dev/wiki/display/caGrid. Agenda. Project overview Primary goals Service selection Why these services?
 
                
Wrapping third-party analytical services for caBIG • Taverna-caBIG project • Stian Soiland-Reyes, Alexandra Nenadic • University of Manchester, UK • September 2009 • http://www.mygrid.org.uk/dev/wiki/display/caGrid
Agenda • Project overview • Primary goals • Service selection • Why these services? • Why wrapping? • Wrapping benefits? • How we did it • How does it work • Architecture • UML models • Example client and outputs • Project info
Project overview • Taverna-caBIG cooperation on several levels: • caGrid-enabling third-party analytical services • Taverna Workbench enhancements for: • Semantic search of caBIG services • Invocation of caBIG services from Taverna workflows • Support for secure caBIG services (interacting with the GAARDS infrastructure prior to service invocation) • This presentation addresses the caGrid-enablement of third-party analytical services (wrapping + achieving silver-level compatibility)
Primary goals • Identify two publicly available analytical services currently accessible through Taverna • Wrap, i.e. caGrid-enable, the services: • Design the wrapper services in UML and semantically describe/annotate them using caBIG’s tooling (EA + SIW) • Wrap/implement and deploy them as standard caBIG services on caGrid (Introduce)
Analytical service selection • Services were selected in collaboration with the caBIG Workflow Working Group, led by Juli Klemm • Winners: • NCBI BLAST service hosted by EBI (European Bioinformatics Institute) • Protein and nucleotide sequence similarity search service • InterProScan service hosted by EBI • Scans a range of protein signatures in the InterPro warehouse against a protein sequence
Why these services? • Freely available • Highly reliable, hosted by EBI • Widely used by the scientific community • Can be combined with existing caBIG tools in biologically meaningful workflows • caBIO, GridPIR, etc.
NCBI BLAST service • A popular sequence similarity search tool using local sequence alignment • Supports sequences of proteins, DNA, RNA • Searches sequences in a whole range of databases: • UNIPROT, NCBI, EMBL, etc. • SOAP web service hosted by EMBL-EBI
InterProScan service • The InterPro warehouse integrates various databases of protein domains and functional sites • Searches the InterPro warehouse using protein signature recognition methods, e.g. blastprodom, gene3d, hmmpfam, hmmsmart, scanregexp, profilescan, etc. • SOAP web service hosted by EMBL-EBI
Why wrap the services? • The original services use various data formats for inputs/outputs (although all are XML) • These formats do not conform to the caBIG compatibility rules • The output format was not even compatible with the input format • Requirements for the wrapped service: • Translate the input data from caBIG-compatible XML to the XML format understood by the analytical service • Convert the received results back to a format understood by caBIG clients
NCBI BLAST Output (Untranslated)
<?xml version="1.0"?>
<EBIApplicationResult xmlns="http://www.ebi.ac.uk/schema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://www.ebi.ac.uk/schema/ApplicationResult.xsd">
  <Header>...</Header>
  <SequenceSimilaritySearchResult>
    <hits total="1">
      <hit number="1" database="uniprot" id="WAP_RAT" ac="P01174" length="137" description="Whey acidic protein OS=Rattus norvegicus GN=Wap PE=1 SV=2">
        <alignments total="1">
          <alignment number="1">
            <score>763</score>
            <bits>298</bits>
            <expectation>8e-80</expectation>
            <identity>100</identity>
            <positives>100</positives>
            <querySeq start="1" end="137">MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIAAGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ</querySeq>
            <pattern>MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIAAGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ</pattern>
            <matchSeq start="1" end="137">MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIAAGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ</matchSeq>
          </alignment>
        </alignments>
      </hit>
    </hits>
  </SequenceSimilaritySearchResult>
</EBIApplicationResult>
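To make the translation step concrete, here is a minimal sketch of reading the untranslated EBI result with the standard Java DOM API. It is an illustration only, not the wrapper's actual conversion code: the class `BlastHitExtractor` and the flat `Hit` record are hypothetical names, and only the first alignment of each hit is read.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class BlastHitExtractor {

    /** Hypothetical flat record for illustration; not a class from the wrapper's UML model. */
    static final class Hit {
        final String database;
        final String id;
        final String accession;
        final int length;
        final double score;
        final double eValue;

        Hit(String database, String id, String accession, int length, double score, double eValue) {
            this.database = database;
            this.id = id;
            this.accession = accession;
            this.length = length;
            this.score = score;
            this.eValue = eValue;
        }
    }

    /** Parses an EBIApplicationResult document and returns one Hit per <hit> element. */
    static List<Hit> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the EBI result declares a default namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Hit> hits = new ArrayList<>();
            // "*" matches any namespace, so the sketch does not depend on the schema URI.
            NodeList hitNodes = doc.getElementsByTagNameNS("*", "hit");
            for (int i = 0; i < hitNodes.getLength(); i++) {
                Element hit = (Element) hitNodes.item(i);
                hits.add(new Hit(
                        hit.getAttribute("database"),
                        hit.getAttribute("id"),
                        hit.getAttribute("ac"),
                        Integer.parseInt(hit.getAttribute("length")),
                        Double.parseDouble(firstText(hit, "score")),
                        Double.parseDouble(firstText(hit, "expectation"))));
            }
            return hits;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse BLAST result", e);
        }
    }

    /** Text of the first descendant element with the given local name. */
    private static String firstText(Element parent, String localName) {
        return parent.getElementsByTagNameNS("*", localName).item(0).getTextContent().trim();
    }
}
```

Calling `BlastHitExtractor.parse(xml)` on the document above would yield a single `Hit` for `WAP_RAT`; a real wrapper would map these values onto the annotated domain classes instead of a flat record.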
InterProScan Output (Untranslated)
<EBIInterProScanResults xmlns="http://www.ebi.ac.uk/schema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://www.ebi.ac.uk/schema/InterProScanResult.xsd">
  <Header>...</Header>
  <interpro_matches>
    <protein id="uniprot|P01174|WAP_RAT" length="137" crc64="1C2E8ADA9FD97949">
      <interpro id="IPR008197" name="Whey acidic protein, 4-disulphide core" type="Domain" parent_id="IPR015874">
        <child_list><rel_ref ipr_ref="IPR008198"/></child_list>
        <contains><rel_ref ipr_ref="IPR002098"/></contains>
        <classification id="GO:0030414" class_type="GO">
          <category>Molecular Function</category>
          <description>protease inhibitor activity</description>
        </classification>
        <match id="G3DSA:4.10.75.10" name="Whey_acidic_protein_4-diS_core" dbname="GENE3D">
          <location start="77" end="128" score="9.899996308397199E-5" status="T" evidence="Gene3D"/>
        </match>
        <match id="PF00095" name="WAP" dbname="PFAM">
          <location start="30" end="72" score="6.30000254573025E-5" status="T" evidence="HMMPfam"/>
          <location start="79" end="126" score="1.59999889349247E-14" status="T" evidence="HMMPfam"/>
        </match>
      </interpro>
      <interpro id="IPR008198" name="Proteinase inhibitor I17" type="Domain" parent_id="IPR008197"> ...</interpro>
    </protein>
  </interpro_matches>
</EBIInterProScanResults>
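The domain locations in the InterProScan result can be pulled out in a similar way. The sketch below uses the standard `javax.xml.xpath` API with `local-name()` tests so that it does not depend on the namespace URI. The class `InterProMatchLister` and its one-line text summary are hypothetical and chosen only for illustration; the wrapper itself returns annotated domain objects, not strings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InterProMatchLister {

    /** Returns one "DBNAME MATCH_ID start-end" string per <location>, in document order. */
    static List<String> listLocations(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every <location> that sits directly under a <match>, in any namespace.
        String expression = "//*[local-name()='match']/*[local-name()='location']";
        try {
            NodeList locations = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> summaries = new ArrayList<>();
            for (int i = 0; i < locations.getLength(); i++) {
                Element location = (Element) locations.item(i);
                Element match = (Element) location.getParentNode();
                summaries.add(match.getAttribute("dbname") + " " + match.getAttribute("id")
                        + " " + location.getAttribute("start") + "-" + location.getAttribute("end"));
            }
            return summaries;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate InterProScan result", e);
        }
    }
}
```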
Motivational workflow • http://www.myexperiment.org/workflows/230 • This Taverna workflow uses both BLAST and InterProScan, which can be replaced with wrapped versions of the services • Workflow diagram annotations: • Web service that looks up protein sequences in a database; will be replaced with the caBIG service caBIO • Shim that splits a string into a list of FASTA strings • Nested workflow that internally invokes NCBI BLAST and checks job status before fetching results • Nested workflow that internally invokes InterProScan and checks job status before fetching results
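As an illustration of the shim in this workflow, a string containing several FASTA records can be split into a list with one record per entry. The sketch below is a plain Java stand-in, not the workflow's actual shim; the class name `FastaSplitter` is hypothetical, and it assumes each record starts with a `>` header line.

```java
import java.util.ArrayList;
import java.util.List;

class FastaSplitter {

    /** Splits multi-record FASTA text into one string per record (header plus sequence lines). */
    static List<String> split(String multiFasta) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : multiFasta.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines between records
            }
            if (trimmed.startsWith(">")) {
                // A header line closes the previous record and opens a new one.
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder(trimmed);
            } else if (current != null) {
                current.append('\n').append(trimmed);
            }
            // Sequence lines that appear before any header are dropped in this sketch.
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }
}
```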
Benefits of wrapped services • Making analytical services from other service providers available to caBIG users • Wrapped services are caBIG Silver Level compatible: • Ensures shared meaning and interoperability between these and other caBIG services • Data can be exchanged and understood between services
How we wrapped the services (1) • Making the services ‘silver’ involved: • Modelling the data in UML using Enterprise Architect (EA) • Exporting the model from EA to XMI • Semantically annotating the XMI file with the SIW tool, using caBIG’s vocabularies/ontologies • Generating Common Data Elements (CDEs) for the service inputs/outputs, which were reviewed by the curation team and loaded into the caDSR production database • Loading the annotated XMI back into EA to update the UML model
How we wrapped the services (2) • From EA, the UML model was exported to a set of XSD files • The XSD files were imported into the Introduce tool, which was used to generate the skeleton APIs of the wrapped services • Axis2 was used to invoke the original InterProScan and NCBI BLAST services from the wrapper services • The wrapped services are asynchronous; job status and results are available as WSRF resource properties and can be subscribed to using WS-Notification. There is also a synchronous version where polling is done from the client side.
How it works • Client: using the client library, calls the wrapped WSRF web service • Service: converts the input to the original format, submits the converted input to the original service, and returns a Job Resource that references the jobID • Client: subscribes to notifications from the Job Resource • Job Monitor (server): for all jobs, checks status using the jobID and notifies the client on completion • Client library: requests the output data • Job Resource: converts the data from the original format and returns the converted data to the client
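The sequence of steps can be sketched in plain Java as follows. This is a simplified in-memory model under stated assumptions, not the WSRF implementation: `OriginalService`, `JobResource`, `monitorOnce` and `runSync` are hypothetical names, the notifications are ordinary callbacks where the real service uses WS-Notification, and the two converters are passed in as functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

class JobLifecycleSketch {

    enum Status { RUNNING, DONE, ERROR }

    /** Hypothetical stand-in for the original (non-caBIG) analytical service. */
    interface OriginalService {
        String submit(String originalFormatInput); // returns a jobID
        Status checkStatus(String jobId);
        String fetchResult(String jobId);          // result in the original format
    }

    /** Plays the role of the job resource: holds the jobID and notifies subscribers. */
    static final class JobResource {
        final String jobId;
        private Status status = Status.RUNNING;
        private final List<Consumer<Status>> subscribers = new ArrayList<>();

        JobResource(String jobId) { this.jobId = jobId; }

        Status getStatus() { return status; }

        void subscribe(Consumer<Status> subscriber) { subscribers.add(subscriber); }

        /** Records a status change and notifies subscribers only when it differs. */
        void update(Status newStatus) {
            if (newStatus != status) {
                status = newStatus;
                for (Consumer<Status> subscriber : subscribers) {
                    subscriber.accept(newStatus);
                }
            }
        }
    }

    /** One pass of the job monitor: refresh every job that is still running. */
    static void monitorOnce(OriginalService service, List<JobResource> jobs) {
        for (JobResource job : jobs) {
            if (job.getStatus() == Status.RUNNING) {
                job.update(service.checkStatus(job.jobId));
            }
        }
    }

    /** Synchronous variant: convert, submit, poll, then convert the result back. */
    static String runSync(OriginalService service, UnaryOperator<String> toOriginal,
                          UnaryOperator<String> fromOriginal, String input,
                          int maxPolls, long sleepMillis) {
        JobResource job = new JobResource(service.submit(toOriginal.apply(input)));
        List<JobResource> jobs = new ArrayList<>();
        jobs.add(job);
        for (int i = 0; i < maxPolls && job.getStatus() == Status.RUNNING; i++) {
            monitorOnce(service, jobs);
            if (job.getStatus() == Status.RUNNING) {
                sleep(sleepMillis);
            }
        }
        if (job.getStatus() != Status.DONE) {
            throw new IllegalStateException(
                    "Job " + job.jobId + " did not complete; status " + job.getStatus());
        }
        return fromOriginal.apply(service.fetchResult(job.jobId));
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while polling", e);
        }
    }
}
```

Keeping the status check in `monitorOnce` means the same code path serves both styles: an asynchronous client subscribes to the `JobResource` and waits for a callback, while `runSync` simply drives the monitor in a loop.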
Reused several data elements • Green classes in diagram reused from IRWG • Sequence, NucleicAcidSequence • DatabaseCrossReference • GeneGenomicIdentifier et al. • Red UML classes in diagram reused from PIR • ProteinSequence • Partial reuse of attributes in ProteinDomainLocation
Example client NCBI Blast
NCBIBlastClient client = new NCBIBlastClient(url);
NCBIBlastInput input = new NCBIBlastInput();

// Identify the query protein by database cross-reference ("id" on the slide)
ProteinSequenceRepresentation sequenceRepresentation = new ProteinSequenceRepresentation();
ProteinGenomicIdentifier proteinId = new ProteinGenomicIdentifier();
proteinId.setDataSourceName("uniprot");
proteinId.setCrossReferenceId("wap_rat");
sequenceRepresentation.setProteinId(proteinId);
input.setSequenceRepresentation(sequenceRepresentation);

// Search parameters
NCBIBlastInputParameters params = new NCBIBlastInputParameters();
params.setEmail("mannen@soiland-reyes.com");
params.setQueryDatabase(new MolecularSequenceDatabase("", "uniprot"));
params.setBlastProgram(BLASTProgram.BLASTP);
input.setNcbiBLASTInputParameters(params);

// Synchronous call: the client utility polls until the job finishes or times out
NCBIBlastClientUtils clientUtils = new NCBIBlastClientUtils(client);
NCBIBlastOutput ncbiBlastOut = clientUtils.ncbiBlastSync(input, TIMEOUT_SECONDS * 1000);

// Walk the returned similarities and alignments ("data" on the slide)
SequenceSimilarity[] similarities = ncbiBlastOut.getSequenceSimilarities();
for (SequenceSimilarity similarity : similarities) {
    for (Alignment align : similarity.getAlignments()) {
        SequenceFragment querySequenceFragment = align.getQuerySequenceFragment();
        System.out.print("Q: " + querySequenceFragment.getSequence().getValue());
        (..)
Example SOAP input NCBI Blast
<service:NcbiBlastRequest
    xmlns:service="http://www.mygrid.org.uk/2009/cagrid/servicewrapper/service/NCBIBlast"
    xmlns="gme://Taverna-caGrid.caBIG/1.0/uk.org.mygrid.cagrid.domain.ncbiblast"
    xmlns:irwg="http://www.mygrid.org.uk/2009/cagrid/servicewrapper/imported/IRWG"
    xmlns:common="gme://Taverna-caGrid.caBIG/1.0/uk.org.mygrid.cagrid.domain.common"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <service:nCBIBlastInput>
    <NCBIBlastInput>
      <ncbiBLASTInputParameters>
        <blastProgram>BLASTP</blastProgram>
        <email>mannen@soiland-reyes.com</email>
        <queryDatabase>
          <common:name>uniprot</common:name>
          <common:description/>
        </queryDatabase>
      </ncbiBLASTInputParameters>
      <sequenceRepresentation xsi:type="irwg:ProteinSequenceRepresentation">
        <irwg:proteinId>
          <irwg:crossReferenceId>wap_rat</irwg:crossReferenceId>
          <irwg:dataSourceName>uniprot</irwg:dataSourceName>
        </irwg:proteinId>
      </sequenceRepresentation>
    </NCBIBlastInput>
  </service:nCBIBlastInput>
</service:NcbiBlastRequest>
Example client output NCBI Blast
Running NCBI Blast client
uk.org.mygrid.cagrid.servicewrapper.service.ncbiblast.example.ExampleNCBIBlastClient -url <service url>
-- Using default service at http://cagrid.taverna.org.uk:8080/wsrf/services/cagrid/NCBIBlast
Calling NCBI Blast synchronously
(Set -DGLOBUS_LOCATION=/Users/bob/cagrid/ws-core-4.0.3 to do asynchronous client calls)
Found 50 similarities
Similarity in uniprot:WAP_RAT (sequence length:137)
1 alignments
Alignment score=763.0 bits=298.0 eValue=1.0E-79
Q: MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIA AGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ 1-137
P: MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIA AGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ
M: MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIA AGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ 1-137
Similarity in uniprot:Q3UQ94_MOUSE (sequence length:140)
1 alignments
Alignment score=465.0 bits=183.0 eValue=4.0E-45
Q: MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIA A-GPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVI 1-134
P: MRC ISLVLGLLALEVALA+NL+E VFNSVQSM S E TECI CQTNEECAQN MCCP SCGR+ KTPVNI V KAG CPWN +QMI+ + GPCP CS D +CSG MKCC C+M+C P P+ ++I
M: MRCLISLVLGLLALEVALAQNLEEQVFNSVQSMFPKASPIEGTECIICQTNEECAQNAMCCPGSCGRTRKTPVNIGVPKAGFCPWNLLQMIS STGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPPVPEVWSII 1-134
Example SOAP output NCBI Blast
<NCBIBlastOutput xmlns:xsd="http://www.w3.org/2001/XMLSchema" ...>
  <sequenceSimilarities>
    <accessionNumber>P01174</accessionNumber>
    <description>Whey acidic protein OS=Rattus norvegicus GN=Wap PE=1 SV=2</description>
    <sequenceLength>137</sequenceLength>
    <sequenceId>
      <irwg:crossReferenceId>WAP_RAT</irwg:crossReferenceId>
      <irwg:dataSourceName>uniprot</irwg:dataSourceName>
    </sequenceId>
    <alignments>
      <bits>298.0</bits>
      <eValue>1.0E-79</eValue>
      <identity>100</identity>
      <positives>100</positives>
      <score>763.0</score>
      <sequenceSimilarityPattern>MRCSISLVLGLLALEVAL..ISFQ</sequenceSimilarityPattern>
      <matchSequenceFragment>
        <end>137</end>
        <start>1</start>
        <sequence>
          <irwg:value>MRCSISLVLGLLALEVAL..ISFQ</irwg:value>
          <irwg:valueInFastaFormat xsi:nil="true"/>
        </sequence>
      </matchSequenceFragment>
      <querySequenceFragment>
        <end>137</end>
        <start>1</start>
        <sequence>
          <irwg:value>MRCSISLVLGLLALEVAL..ISFQ</irwg:value>
          <irwg:valueInFastaFormat xsi:nil="true"/>
        </sequence>
      </querySequenceFragment>
    </alignments>
  </sequenceSimilarities>
</NCBIBlastOutput>
Project info • On gForge: https://gforge.nci.nih.gov/projects/taverna-cagrid/ • On myGrid wiki: http://www.mygrid.org.uk/dev/wiki/display/caGrid/Home • Source and documentation available via Subversion: https://gforge.nci.nih.gov/svnroot/taverna-cagrid/