PANACEA WP3: The Platform. WP participants: UPF, ILC, ILSP, LG, DCU, ELDA. Final Annual Review, 19th February 2013. Marc Poch, UPF (marc.pochriera@upf.edu)
Summary
• Objectives
• Platform components / Demo
• Achievements
• Functional platform
• Interoperability: Travelling Object, Common Interfaces, format converters, etc.
• Scalability
• WP7 Evaluation
• Conclusions and future work
Objectives
• Develop a platform (a space of interoperability defined by standardized protocols and common interfaces) for the easy integration of software components, tools and methodologies deployed as web services, in order to configure a factory that automates the acquisition, processing and annotation of language resources.
• WP3.1 (T1-T6): Architecture and design of the platform
• WP3.2 (T15-T30): Workflow editor and engine
• WP3.3 (T7-T30): Common interfaces, middleware and temporary files, journaling, etc.
• WP3.4 (T15-T30): The Registry
• WP3.5 (T7-T30): Deployment of web services of the components supplied by WP4 to WP6
From local tools to sharing workflows
Platform tools and portals
PANACEA Platform: uses, adapts and improves myGrid tools for eScience (used in biology, social science, music, astronomy, multimedia and chemistry).
• Web Services: share tools (remotely run distributed tools). Soaplab; JAX-WS, Axis, CXF, etc. SOAP or REST.
• Registry: share and find web services. BioCatalogue. PANACEA Registry: registry.elda.org
• Workflows: call / chain web services. Taverna (www.taverna.org.uk); other clients: Java, Python, Perl, etc.
• Social Network: share and find workflows. myExperiment. PANACEA myExperiment: myexperiment.elda.org
Technological option: Web Services
• Easy deployment of command-line tools as WS (Java, Python, C++, UIMA, etc.)
• Clients: Java, Python, Perl, Taverna, etc.
• No coding needed! Only metadata
• "Polling" techniques for long-lasting tasks
• Web form to run the web services
• URL input / output ready
• PANACEA improvement for SOAP messaging (network usage and memory)
• PANACEA limit for multiple users (parallel request limit system)
[Diagram: Soaplab 2 (SOAP) web services; Taverna workflow editor; BioCatalogue registry; myExperiment social network]
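The polling technique mentioned on this slide can be sketched as follows. This is a minimal Python illustration of the general idea, not Soaplab's client API: `get_status` is a hypothetical callable standing in for whatever status call a service exposes, and the status strings are assumptions.

```python
import time
from typing import Callable

def poll_until_done(get_status: Callable[[], str],
                    interval_s: float = 1.0,
                    max_polls: int = 100) -> str:
    """Ask a job for its status repeatedly instead of holding one long request open."""
    # Hypothetical terminal states; a real service defines its own.
    terminal = {"COMPLETED", "FAILED"}
    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        time.sleep(interval_s)  # many short requests, so no single call times out
    raise TimeoutError("job did not finish within the polling budget")

if __name__ == "__main__":
    # Simulated job that reports RUNNING twice before completing.
    states = iter(["RUNNING", "RUNNING", "COMPLETED"])
    print(poll_until_done(lambda: next(states), interval_s=0.01))
```

The client never waits on one long HTTP call; it trades a little extra traffic for robustness on long-lasting tasks.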
Technological option: Registry
• User-friendly GUI
• Free, open source, continuously maintained
• Search function
• User ratings (user feedback)
• Service annotations and Language Categorization (PANACEA)
• Monitoring system (web service status and data results). Statuses: Passed, Warning, Failed, Unchecked
Technological option: Taverna
• User-friendly GUI
• Free and open source
• Continuously maintained (v. 2.4)
• SOAP and REST web services
• Credential manager (passwords, certificates, etc.)
• Multiple file processing ("lists")
• PANACEA workflows, best practices, videos, etc.:
  • Parallelization, error recovery ("retries"), polling
• PANACEA collaboration: bug fixing and pre-release tests
Demos
• Previous Review:
  • PANACEA Registry / PANACEA myExperiment
  • Run Web Services and Workflows
  • Design and merging of workflows in Taverna
• Final Review: specific examples
  • Creation of a bilingual dictionary
  • Twitter NLP
  • Web cleaner and anonymizer
  • PANACEA Registry / PANACEA myExperiment
Demos I: Creation of a bilingual dictionary
• http://myexperiment.elda.org/workflows/93
• Input: pairs of Basic XCES documents
  • English: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/1.xml
  • French: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/191.xml
• Sentence alignment: Hunalign (3rd party tool) (Interoperability)
• PoS tagging: Treetagger (3rd party tool) (Interoperability)
• Build phrase tables: Moses (3rd party tool) (Interoperability)
• Bilingual dictionary extractor
Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_bilingual_dictionary_extraction_v01.mp4
Demos II: Twitter NLP + Registry (3rd party tool)
• This web service is based on the Twitter NLP tool developed by Noah's ARK group.
• Noah's ARK group is Noah Smith's research group at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
• Search for the WS in the Registry
• Check the monitoring system
• Use the web client with example data
Demos III: Web cleaner and anonymizer
• http://myexperiment.elda.org/workflows/98
• Input: a list of URLs to process
  • Example: a web article from www.fifa.com
• ILSP Web cleaner and text extractor WS
• UPF Anonymizer WS
  • Internally calls Freeling NER WS (3rd party tool) (Interoperability)
• Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_web_cleaner_and_anonymization_v01.mp4
WP3 Achievements
• Functional and Operational Platform
  • Multiple tools, webs and features
  • Ready to use
  • Usability
  • Real Users
• Interoperability
  • Common Interfaces
  • Travelling Object
  • 3rd party tools integration
  • Format converters
• Scalability
  • Web service scalability: long-lasting tasks
  • Workflow design optimization: robustness
  • Machine resources: handling parallel requests
Functional and Operational Platform
• PANACEA Registry
  • 157 web services. PANACEA WS benefits: WS are easy to deploy (low maintenance cost)
  • More than 1300 annotations (Usability / Doc.)
  • A cloud of 164 tags
  • Monitoring system: WS up and running 94.82% since their deployment (97%) (Availability)
• PANACEA myExperiment
  • 74 shared workflows
  • Storage System (Usability)
Functional and Operational Platform: Tutorials and Documentation
• Tutorials
  • Specific and general tutorials
  • More than 12 videos (Usability)
  • Frequently Asked Questions
• Documentation
  • Registry annotations, tags and categories
  • Common Interfaces documentation: XML, web, etc.
  • Travelling Objects documentation
Functional and Operational Platform: Users
• WP7 Validators
• Linguatech (WP8)
• Qualia (Business intelligence)
• CNGL (Centre for Next Generation Localisation)
• INCYTA (Translation)
• Master's and PhD students make use of the PANACEA platform
• http://ws02.iula.upf.edu/panacea/statistics/upf-statistics.html
Interoperability
• Three levels of interoperability:
  • COMMUNICATION PROTOCOLS: SOAP, REST
  • DATA
  • PARAMETERS
[Figure: Tool A and Tool B exchanging data in two scenarios, captioned 'Tool B does not "understand" format N!' and 'All tools understand the previous format']
Common Interface
• A Common Interface (CI) defines the mandatory parameters for every functionality:
  • http://panacea-lr.eu/en/info-for-professionals/documents/
  • http://registry.elda.org
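As a rough illustration of what a Common Interface enforces, the sketch below checks a request against the mandatory parameters of a functionality. The functionality names and parameter names used here are hypothetical placeholders, not taken from the PANACEA CI specification.

```python
# Hypothetical CI table: functionality -> mandatory parameter names.
# The real definitions live in the PANACEA Common Interfaces documentation.
COMMON_INTERFACES = {
    "pos_tagging": {"input", "language"},
    "sentence_alignment": {"source_input", "target_input"},
}

def missing_parameters(functionality: str, params: dict) -> set:
    """Return the mandatory parameters that a request fails to supply."""
    required = COMMON_INTERFACES[functionality]
    return required - set(params)

if __name__ == "__main__":
    # A request that forgets the (assumed) 'language' parameter.
    print(missing_parameters("pos_tagging", {"input": "http://example.org/doc.xml"}))
```

Because every service of the same functionality accepts the same mandatory parameters, one service can be swapped for another in a workflow without rewiring its inputs.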
Travelling Object
• The Travelling Object (TO) is the common data and metadata format used in PANACEA to make components understand each other. (Interoperability)
• TO1 is the minimal common vertical in-line format, based on the XCES standard, used by the deployed tools since the first version of the platform
• TO2 uses the GrAF standard: the Graph Annotation Format (Ide and Suderman, 2007) is the XML serialization of LAF (ISO 24612, 2009)
• LMF for lexical resources
• CoNLL for parsers
• Converters and adapted WS outputs
Format Converters
31 format converters on the PANACEA Registry
• Freeling to TO. CNR http://registry.elda.org/services/207
• KAF to TO. CNR http://registry.elda.org/services/208
• Basic XCES to txt. CNR http://registry.elda.org/services/209
• PoS tag. (Freeling / Treetagger) to GrAF. UPF http://registry.elda.org/services/142
• Dependency parsing (Freeling) to GrAF. UPF http://registry.elda.org/services/197
• Dependency CoNLL to GrAF. CNR http://registry.elda.org/services/254
• Word doc to txt. UPF http://registry.elda.org/services/112
• In-house MWE to LMF. CNR http://registry.elda.org/services/296
• PDF to text. UPF http://registry.elda.org/services/116
• Multiple encodings converter (ISO, UTF, etc.). UPF http://registry.elda.org/services/114
• Aligner to TO. DCU http://registry.elda.org/services/69
• Sentence alignment to TMX. DCU http://registry.elda.org/services/219
• Treetagger to MOSES. DCU http://registry.elda.org/services/275
• UIMA to GrAF. ILSP http://registry.elda.org/services/182
• META-SHARE metadata generators http://myexperiment.elda.org/workflows/96
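The simplest kind of converter in this list, an encodings converter, can be illustrated in a few lines. This is a sketch of the general technique (decode with the source encoding, then re-encode), not the code behind the registered service; the function name and defaults are assumptions.

```python
def convert_encoding(data: bytes,
                     source: str = "iso-8859-1",
                     target: str = "utf-8") -> bytes:
    """Re-encode raw bytes from one character encoding to another."""
    text = data.decode(source)   # bytes -> Unicode text
    return text.encode(target)   # Unicode text -> bytes in the target encoding

if __name__ == "__main__":
    latin1_bytes = "café".encode("iso-8859-1")
    print(convert_encoding(latin1_bytes))
```

Placing such a converter between two services lets a tool that only reads UTF-8 consume output produced in another encoding.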
3rd party tools integration
• The PANACEA WS wrapper (Soaplab) and the CI make it easy for WS providers to integrate 3rd party tools.
• ILSP tools are UIMA tools (UIMA)
• Freeling (UPC)
• Treetagger (University of Stuttgart)
• Twitter NLP (Carnegie Mellon University)
• MALT Parser (Uppsala University)
• DeSR (Università di Pisa)
• MOSES / Giza++
• DELiC4MT (MT evaluation) (DCU)
• Berkeley tagger, parser, aligner (University of California, Berkeley)
Web Services Scalability
• Web services are being deployed using Soaplab 2.3.2:
  • Service providers only need to use metadata (ACD) files (Usability)
  • Web client application to test WSs: Spinet (Usability)
  • PANACEA developers have been in contact with Soaplab developers (Collaboration)
• SOAP protocol standard (Interoperability)
  • WS can be called from Taverna or other workflow editors
  • WS can be called from many programming languages: Python, Perl, Ruby, Java, etc.
• Soaplab polling to avoid client timeouts (Scalability)
• PANACEA improvements (Scalability)
  • Parallel request limit system
  • SOAP messaging optimization
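The idea behind a parallel request limit can be sketched with a semaphore. This is an illustrative Python sketch and an assumption about the general approach: the slides do not describe how the PANACEA limit is implemented inside Soaplab, and `LimitedService` is a hypothetical name.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class LimitedService:
    """Wrap a callable so that at most `limit` calls run at the same time."""

    def __init__(self, func, limit: int):
        self._func = func
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous calls observed

    def __call__(self, *args):
        with self._slots:  # blocks while `limit` calls are already running
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return self._func(*args)
            finally:
                with self._lock:
                    self._active -= 1

if __name__ == "__main__":
    def slow_upper(doc: str) -> str:
        time.sleep(0.01)  # stand-in for real processing time
        return doc.upper()

    service = LimitedService(slow_upper, limit=3)
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(service, ["a", "b", "c", "d", "e"]))
    print(results, "peak concurrency:", service.peak)
```

Extra requests wait for a free slot instead of overloading the machine, which is what keeps a small server responsive when several users call it at once.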
Workflow design optimization: Robustness
• Building workflows with Taverna
  • Version 2.4.2 (Scalability)
  • Polling (Soaplab) (Scalability)
    • Long-lasting web service calls without timeouts
  • Retries (Scalability)
  • Parallelization (Scalability)
  • Tutorials and videos (Usability)
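The retry and parallelization settings listed on this slide can be mimicked in plain Python to show the idea. This is a sketch under assumptions, not Taverna's implementation: `call_with_retries` and `process_documents` are hypothetical helpers, and `service` stands for any web service call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_with_retries(service, item, retries: int = 3, delay_s: float = 0.0):
    """Call service(item), retrying on failure up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return service(item)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            time.sleep(delay_s)

def process_documents(service, documents, parallel: int = 10, retries: int = 3):
    """Process documents with up to `parallel` concurrent calls, keeping input order."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(lambda d: call_with_retries(service, d, retries),
                             documents))

if __name__ == "__main__":
    attempts = {}

    def flaky_service(doc):
        # Simulated service that fails the first time it sees each document.
        attempts[doc] = attempts.get(doc, 0) + 1
        if attempts[doc] == 1:
            raise RuntimeError("temporary failure")
        return doc.upper()

    print(process_documents(flaky_service, ["a", "b", "c"], parallel=2, retries=1))
```

Retries absorb transient service failures, so one bad call does not abort a run over many documents, while the worker pool bounds how many calls are in flight.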
Machine Resources: handling parallel requests. Parallelization level 3 (3 parallel requests per service * 2 services = 6 concurrent requests)
Machine Resources: handling parallel requests. Parallelization level 10 (10 parallel requests per service * 2 services = 20 concurrent requests)
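The arithmetic behind the two parallelization levels is the per-service level multiplied by the number of services in the workflow. A small sketch (the function name is illustrative):

```python
def concurrent_requests(parallel_per_service: int, services: int) -> int:
    """Total requests in flight when every service runs at the same parallelization level."""
    return parallel_per_service * services

print(concurrent_requests(3, 2))   # 6
print(concurrent_requests(10, 2))  # 20
```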
Machine Resources: handling parallel requests
• From 1x to 10x experiment: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_parallelization_scalability_v01.mp4
  • Two Taverna instances running at the same time
  • 100 documents to be processed
  • One workflow with NO parallelization, the other with 10x
  • The same server: ws04 with 8 GB RAM and 4 CPUs
• More resources > more parallel requests
Machine Resources: handling parallel requests
• Conclusions:
  • PANACEA fulfils the large data scalability goal (Scalability)
  • Requirements:
    • Robust WS deployment: Soaplab (with PANACEA improvements) or other robust frameworks
    • Taverna 2.4
    • Workflow design must follow the PANACEA massive data tutorial (retries, polling, etc.)
• The architecture is highly scalable: growth is just a matter of resources
• EMBL-EBI (European Bioinformatics Institute in Cambridge):
  • 200 servers
  • 2000 cores
  • Server request balancing
  • Software, etc.
  • More than 50000 Freeling WS parallel requests
• Typical PANACEA server:
  • 2 - 4 cores
  • 4 - 8 GB RAM
  • 30 - 100 GB HDD
  • 100 Freeling WS parallel requests
Conclusions
• Functional platform
  • Web services software
  • Registry / myExperiment
• Usability for users and providers
• Interoperability:
  • Data formats
  • Common Interfaces
• Tutorials and Documentation
• Scalability
The future
• Authentication Web Services (Business opportunity)
  • Institutions and companies can sell their services and/or machine resources
• Automatically build workflows (Usability and interoperability)
  • Based on input data and the user's desired output, etc.
• Data visualization tools / widgets (Usability)
• Improve total throughput (Scalability)
  • With more machine resources we can achieve faster experiment results
  • Software optimization: task splitting and parallelization
• Publications with experiments (Research)
  • Researchers could link their publications to real experiments (WS, workflows, data, etc.)
  • Fostering research by making experiments easily replicable
  • Improved experiments: more data, more machine resources, faster results, etc.
Thank you. Questions?