This presentation focuses on the technical infrastructure of CLARIN, emphasizing the need for an open and sustainable marketplace that integrates existing Language Resources and Technology (LRT) in a scalable manner. It discusses the preparation phase for funding, evaluation, and development contributions. The interaction model, tasks, and working groups within CLARIN Technical Infrastructure (TI) are outlined, with a clear emphasis on open source development and collaboration. The presentation also highlights the importance of collaboration between different partners and the allocation of funds.
CLARIN WP2 Tech Infra • WP2 Breakout 2008 • open discussion about all aspects • my slides are just the pacemaker
Focus • CLARIN Technical Infrastructure • must offer an open marketplace for “ALL” resources and tools • needs to be the supermarket in the LRT area • must be open, extensible and web-based • must not be bound by decisions of external groups • must build upon the experience of many experts and on work already done • must re-use existing components that fit the open policy • must not be fiction – but sustainability is more important • must be coherent with administrative/organizational/financial constraints • all code to be developed must be open source and free to use (academics) • CLARIN TI is about integration and interoperability of existing LRT • CLARIN TI is about scaling up – so not so much about fundamentally new ideas
Preparation Phase • EC funds: in prep phase (3 years) only piloting and tests to allow cost estimation • national funds: at the end some demo cases • evaluation: national evaluations probably already in 2 years • necessary: early domain-wide services at the web site (repository/archiving, ISOcat, PID, translation, etc.) • development contribution in WP2 largely from national funds – real money • who has got them so far? • tasks of WP leaders • all writing and overhead by WP leaders (except financial admin) • all interaction national-EC level by WP leaders • clear separation between EC level and national level activities • all reports/deliverables subject to extensive discussion in EB • special workshops and seminars planned • WP members should be able to just focus on content • real work organized in “small” working groups • when national funds are available, WG leaders from sites other than MPI
Institutes / Persons • MPI: Peter Wittenburg, Dieter van Uytvanck, Daan Broeder, Marc Kemps-Snijders • INL: Jeannine Beeken, Remco van Veenendaal • ELDA: Khalid Choukri, Viktoria Arranz • DFKI: Thierry Declerck • ILC: Nicoletta Calzolari, Riccardo Del Gratta • WROCUT: Maciej Piasecki, Bartosz Broda • OTA: Martin Wynne • UPF: Nuria Bel, Anna Guardiola, Santiago Bel • ILSP: Stelios Piperidis, Maria Gavrilidou • RACAI: Dan Tufis, Epapadat? • USFD: Wim Peters, Adam Funk • Helsinki: Tero Aalto (CSC), (Kimmo Koskenniemi) • Lund: Sven Strömqvist • Leipzig: Gerhard Heyer • Latvia: Inguna Skadina • Leuven: Ineke Schuurman • Utrecht: Jan Odijk, (Steven Krauwer) • UniVie: Gerhard Budin, Csilla Bornemisza • Tübingen: Erhard Hinrichs, Lothar Lemnitzer, Andreas Witt
WP2 Tasks/WGs • WG2.1: Centers Network Formation MPI T0 d=rep/T6 • types of services, types of centers, requirements • WG2.2: Federation Foundation MPI T0 d=rep/T6 • LRT requirements, architecture, schemas, criteria, selection • WG2.3: Federation Building MPI T6 d=dev/T18/36 • analysis, component installation + adaptation, agreements • WG2.4: Registry Requirements MPI T0 d=rep/T9 • experiences, new requirements, model+schema, ISO • WG2.5: Registry Infrastructure MPI T6 d=dev/T18/36 • design, building of components, integration, testing • WG2.6: Web-Services and Workflow Requirements ? T6 d=rep/T12/24 • analysis, experiences, requirements • WG2.7: WS and WF Creation ? T12 d=dev/T24/36 • encapsulation methods, services development, WF test tool • WG2.8: Service & Application Building ? T24 d=dev/T36 • show cases, cross-searching, LREP interaction • WG2.9: Cost estimates MPI T21 d=rep/T24/36 • for construction + operation phase
WP Interaction • WP2 – WP5: close interaction required • WP2 all aspects that have to do with LRT as a whole • WP5 all internal LRT aspects • WP2 more the IT side – WP5 more the linguistic side • WP2 the specifications, frameworks etc – WP5 the integration • WP2 – WP7: close interaction required • smooth LRT domain only once licensing and trust agreements are settled • WP2 all specifications from technical/infrastructure perspective • WP7 all agreements from a formal/legal perspective • WP2 – WP3/4: WP2 needs to listen to requirements and wishes • WP2 will provide information about possibilities/plans etc • WP2 – WP6: WP2 (all partners) needs to provide information for dissemination • WP2 should participate in national training programs • WP2 – WP8: WP2 needs to listen to constraints • WP2 needs to contribute estimates (costs etc)
WG2.1 Service Centers • need to add a persistent infrastructure layer on top of the landscape formed by accidental and temporary collaborations, one that is easily accessible for everyone and that offers high availability so that humanities scholars can rely on it • perhaps different types of centers dependent on the services they give • fundamental deal: researchers give their babies and get seamless access to more • centers need to change their attitudes – they have to offer a true service mentality, a new form of openness and technical accessibility, and little bureaucratic overhead • access suitable for humanities research – unpredictable access patterns
WG2.1 Types of Services • LR Services • uploading and integration of new resources/versions • long-term data preservation • curation of LR (immediate conversion, ...) • allowing access to LR via web services • building virtual collections • finding appropriate tools • LT Services • offer web services to execute LT • allow combining LT into larger workflows (chains of operations) • conversion services/translation services/ .... • Infrastructure Services • ISOcat services for concepts, terminology, relations • PID service for unique and persistent identifiers • metadata services (all sorts of usages, all sorts of LRT) • etc • Advisory Services (WP6) – where to locate general advice???
WG2.1 Requirements • determine types of centers dependent on types of services • need a taxonomy of services (WP5) • determine together with WP3/4 what the required “business” models are • what are humanities researchers expecting – it’s much more than NLP! • searching for patterns in large virtual collections, easy application of tools • how much and which overhead is allowed to not hamper innovation • nevertheless proper handling of IPR • determine technical requirements for service centers • example: proper repository/archiving system • determine general requirements • what is the national support • what kind of expertise is around • what is the size and capability of the staff • geographic spreading – political aspects • prepare a call for participation in a first network of centers (together with WG2.2) • analyze all applications and determine local situation • make a selection based on criteria
WG2.2/3 LRT Federation • what are we going for? • alternative to Google model of “centralized” data management and ownership • distributed but nevertheless coordinated model that • can act as a juridical person to make contracts • can interact with LRT providers about licenses etc (goal is simplification) • can interact with national identity federations to establish trust relations and make deals • if we don’t organize ourselves, others will take over control of scientific data • competition requires efficient operations
[Figure labels: eJournal Service Providers; Trust Agreements; Schema; national Identity Federations; Trust Agreement; LRT Service Providers]
WG2.2 Federation Requirements • Technical Requirements • joint metadata registry for resources and tools based on long experience (see WG2.4/5) • support for virtual collections with resources from different archives based on metadata and for combining services to more complex operations • unique way of referencing electronic resources and fragments in federation • single sign-on/identity principles in federation • all based on trusted and signed certificates (quality of certificates) • Trust Relations (together with WP7) • trust agreements with national identity federation • CLARIN needs to build a federation based on simplified and unified rules for licensing, accessing, user authentication etc • what kind of auditing is required?
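The idea of a virtual collection built from metadata, with resources referenced by persistent identifiers rather than by location, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Record` fields, the archive names and the `hdl:0000/...` identifiers are hypothetical and do not come from any CLARIN specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One metadata record; all field names are hypothetical."""
    pid: str            # persistent identifier (made-up values below)
    archive: str        # center that physically holds the resource
    language: str
    resource_type: str


def build_virtual_collection(records, **criteria):
    """Return the PIDs of all records whose metadata match every criterion.

    The collection is only a list of references: the resources stay in
    their home archives and are addressed through their PIDs.
    """
    return [
        r.pid
        for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Hypothetical registry content spread over three archives.
RECORDS = [
    Record("hdl:0000/a1", "archive-A", "nl", "corpus"),
    Record("hdl:0000/b7", "archive-B", "nl", "lexicon"),
    Record("hdl:0000/c3", "archive-C", "nl", "corpus"),
    Record("hdl:0000/c9", "archive-C", "pl", "corpus"),
]

dutch_corpora = build_virtual_collection(
    RECORDS, language="nl", resource_type="corpus"
)
print(dutch_corpora)
```

The selection spans two archives, which is the point of the requirement: the user works with one collection while storage and ownership stay distributed.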
WG2.2 Work • understand the requirements of LRT providers wrt a federation • interact with TERENA, national IF and organizations to understand the trends (attribute sets and values, schemas, architecture) • usage of attributes in debate (mostly EduPerson, but different MPG requirements) • grid integration to get applications under the distributed AA scheme (NL project) • what are the trust agreements in federation – also a matter of licenses (WP7) • criteria for centers to participate in federation (see DAM-LR) (strong enough, nat. support, appropriate staff, national grid support, ...) • launch a call for participation in first round • check the situation of all applicants – can’t take everyone in preparatory phase • check includes the repository and service structure • promised to have at least 10 participating centers
WG2.3 Federation Building • follow-up WG, mainly with the selected centers • deep analysis of local repository, service and authentication structure • get certificates and PKI working • run training courses for local experts • install components (repository, MD infra, Handle System for PID) • install main AAI component Shibboleth and set up schema etc (help from CC) • adapt, test and integrate all components • make some agreements with national IF (WP7) • procedure in two steps – a lot of detail work • first the “safe” candidates • later the other candidates
WG2.4/5 LRT Registry • joint metadata registry for resources and tools based on long experience • Metadata is an increasingly important research resource • Metadata is part of a proper registry mechanism • need to do a deep analysis of current practices and trends • domain-crossing initiatives: DC, OAI-PMH • working distributed infrastructures IMDI, OLAC, … in the domain • projects and organizations: ELDA catalogue, TEI usage, CHILDES, etc. • initiatives for tools: DFKI tool registry • technologies at web service level (UDDI, ebXML, WSDL, etc.) • importance of ontology-based IS like “LT World” • know the current limitations • coverage is still too little in LRT domain • inappropriate descriptions, fixed schemas, non-suitable vocabularies • hardly any localization • little customization (virtual collections, dynamic abstraction, faceted search, ...) • hardly any use or support for PIDs • not suitable for web services domain
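As a concrete reminder of what OAI-PMH harvesting looks like, here is a small Python sketch that builds `ListRecords` request URLs. The verb and the `metadataPrefix` and `resumptionToken` arguments are part of the OAI-PMH protocol, and `oai_dc` is the Dublin Core prefix every repository must support. The base URL and the token value are placeholders, and the helper name `list_records_url` is invented for this illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when continuing
    a partial list, it is sent alone with the verb, without metadataPrefix.
    """
    params = {"verb": "ListRecords"}
    if resumption_token is not None:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not a real archive.
BASE = "https://archive.example.org/oai"

first_page = list_records_url(BASE)
next_page = list_records_url(BASE, resumption_token="abc123")

print(first_page)
print(next_page)
```

A harvesting gateway would fetch the first URL, read the resumption token from the response if one is present, and repeat until no token is returned.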
WG2.4 Registry Requirements • deep analysis of current practices and their limitations • determining the essentials • concept registration and PIDs are essential - schemas less so • integration of relevant concepts in ISOcat and creation of relations • requirements for specific resource types and sub-disciplines • what is required to extend to LRT web services • how to integrate the huge amount of legacy material • get a proper LRT taxonomy as basis (WP5) • establish a board for the ISOcat MD profile • is social tagging an issue? • how to split responsibility between national contributors • discuss ODD component model implications and opportunities • specify a flexible component-based CLARIN “standard” that can be submitted to ISO as well (model, core + extensions) • specify the requirements for the infrastructure (tools, portals, reps, harvesting gateways, etc)
WG2.5 Registry Building • follow-up of WG2.4 with those partners who are capable • design components and component schema • integrate relevant concepts in ISOcat, create relations, enter localizations and sub-discipline variants • design set of infrastructure components and define APIs • decide about code development aspects and task division • develop code • do the code integration, testing etc • set up portals • give help for integration of existing collections and services (with WP5) • write manuals, run training courses etc • Registry should be finished after three years!
WG2.6/7 WS and WF • essential for CLARIN is a flexible and simple way to integrate and combine resources and tools (virtual collections, chain of operations, profile match, etc) • need to understand what humanities researchers find simple • need a deep analysis of the current practices and trends • the registry part of this will be dealt with in WG2.4/5 (UDDI, ebXML, etc) • there are general suggestions such as • WSDL, REST for interface specification • BPEL, JBPM, Yahoo Pipes, etc for workflow languages and graphical WF options • there is domain-specific knowledge and experience such as • within GATE (SAFE), UIMA, Bricks, at RACAI, at MPI etc • how to get all the LRT which is out there into an SOA • how to achieve an open and flexible setup • how much standardization is required wrt formats etc (strong/weak typing) • what kind of conversion routines are required • what are the special requirements of grid computing applications (http might be too slow) and services such as media streaming
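The questions about typing and conversion routines can be made concrete with a toy Python sketch of a chain of operations. It is an illustration under stated assumptions, not a proposed CLARIN design: the `Service` class, the format labels and the three example services are all hypothetical. Each service declares the format it consumes and produces; `plan_chain` checks adjacent services and inserts a registered converter where the formats differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Service:
    """A wrapped tool with declared formats (hypothetical model)."""
    name: str
    consumes: str                     # label of the accepted input format
    produces: str                     # label of the emitted output format
    run: Callable[[object], object]


def plan_chain(services, converters):
    """Return the chain with converters inserted where formats mismatch.

    `converters` maps (from_format, to_format) to a converter Service.
    A ValueError is raised when a mismatch cannot be bridged, which is
    the strong-typing behaviour; weak typing would skip the check.
    """
    if not services:
        return []
    planned = [services[0]]
    for nxt in services[1:]:
        prev = planned[-1]
        if prev.produces != nxt.consumes:
            key = (prev.produces, nxt.consumes)
            if key not in converters:
                raise ValueError(f"no converter from {key[0]} to {key[1]}")
            planned.append(converters[key])
        planned.append(nxt)
    return planned


def execute(chain, data):
    """Run the services one after another on the data."""
    for service in chain:
        data = service.run(data)
    return data


# Three toy services standing in for real language technology.
tokenizer = Service("tokenizer", "text", "tokens", lambda text: text.split())
to_table = Service(
    "tokens-to-table", "tokens", "token-table",
    lambda toks: [(i, t) for i, t in enumerate(toks)],
)
counter = Service("counter", "token-table", "count", lambda rows: len(rows))

chain = plan_chain([tokenizer, counter], {("tokens", "token-table"): to_table})
result = execute(chain, "open and sustainable infrastructure")
print([s.name for s in chain], result)
```

The user asked for two steps only; the conversion step was added by the planner. How many such converters are needed depends directly on how much format standardization is agreed.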
WG2.6 WS & WF Requirements • carry out the deep analysis • understand in detail what humanities researchers would like to have and which degree of complexity they can handle (WP3/4) • understand in detail what other initiatives have been trying out • analyze chains of operations in LRT to understand the interoperability issues (WP5) • analyze requirements of profile matching • determine how ISOcat etc can be included to achieve interoperability at tag level • hold workshops with experts • write requirements specification document • detect necessity of additional standards (WP5)
WG2.7 WS & WF Building • the general goal is to • include a large number of resources into the service landscape (WP5) • include a number of tools into the service landscape (WP5) • show the potential of an open ws domain at the end • implement a first simple graphical WF framework (or re-use existing stuff) • estimate costs for the construction and operation phase, i.e. broad coverage • for the selection of LRT components • need an architectural design and a framework • design and develop encapsulation/wrapping methods • create services • create the required conversion services • develop a first simple WF tool and/or re-use existing stuff • carry out some grid computing tests based on fast, shared file systems
WG2.8 Basic Services • any suggestions are welcome – until now only vague commitments such as • domain-wide searches • metadata • content (which architecture?, which rights?, ...) • combined • LREP profile matching service • more ideas to come from humanities projects
WG2.8 Not to forget • offer first services asap • web site as center point • should have mirror sites at a certain moment (not just one portal) • examples • MPI will offer deposit, annotation/LMF curation and access services • RACAI can offer some technology services • Sheffield etc can offer services? • does anyone have a translation service? • etc
WP2 Work Distribution • open for any suggestions • constraint: we have to fulfill the constraints given by the TA • my original suggestion (see overview) – but it is TA theory • various WG registrations so far (see overview) • the big questions: • how much power is available per institute? • is there additional national support for you beyond the few pm?
WP2 Procedure • start today • forming the WG for the first half year • start with providing documents • regular video conferencing • personal meetings when necessary • meeting at LREC if possible • workshop with experts • in summer, a workshop together with WP5