WebLicht: Linguistic Chaining Tool for Language Corpus Data

WebLicht Application and “Workspaces” Erhard Hinrichs & Thomas Zastrow University Tübingen

Outline • Web-based Linguistic Chaining Tool (WebLicht) for incremental filtering and access of language corpus data • WebLicht– Motivation • WebLicht - Architecture • WebLicht– Future Requirements • Test Case – Gutenberg Corpus

CLARIN (Common Language Resource and Technology Infrastructure Network) • is committed to establishing an integrated and interoperable RI • supporting easy access and use of language • aims to overcome the current fragmentation and offer a stable, • persistent and extendable infrastructure • it will offer its services to researchers and scholars across a wide • spectrum of domains in particular in the humanities and soc sciences • ESFRI roadmap project; implementation phase starts in 2011 CLARIN Mission

Typical CLARIN userscenario • Scenario: A PhDstudentinvestigates regional differences in vocabulary and in wordcollocations in different variants of German . • Data: large text corporaavailable at BBAW in Berlin, at theAustrianAcademy of Science in Vienna, the Swiss Text Corpus Project in Basel, and at EURAC, Bolzano. • Tools fortargeteddataaccess: WebLichtofferscustomizablechains of web servicesforfiltering and analyzingthedata

WebLicht - Motivation • Many linguistic resources (corpora, dictionaries, …) and tools (tokenizer, tagger, parser, …) are available • Most of them are implemented to run on local machines. This can be inconvenient and error-prone • Requirements: go beyond “do-it-yourself” and “download-first” strategies • The CLARIN solution: Make tools and resources available as webservices

WebLicht - Architecture • WebLichtis a SOA foraccessing and processingtext corpora • Developmentstarted in October 2008 • WebLichtconsists of thefollowingcomponents: • Distributedservices: offeringfunctionality (resources & tools) overthe (inter-)net. Implemented as webservices (ca. 90 at themoment) • Repository: storesmetadata and technicalinformationabouttheservices • Web 2.0 baseduserinterface:interactswiththeuser and combinesservices and informationfromtherepository. Access still possible via scripts / programmingcode

WebLicht - Architecture Stuttgart Tübingen Leipzig Stuttgart Leipzig Finland Romania Iceland UK Tübingen Repository Standard-conformant Text Corpus Encoding Web 2.0 Application for Tool Chaining and Execution Berlin

WebLicht – Architecture • Services are implemented as REST style webservices • HTTPs POST method is used to send data from the UI to the services • As client, anything which is able to use the HTTP protocol, can be used: • Browser • Commandline tools (wget, curl) • Programming Languages • Anyone can implement his/her own interface to WebLicht

WebLicht - Processing Chains

WebLicht - Results

WebLicht - Features • With RESTstylewebservices, everyone can implement a web service for WebLicht (4pages tutorial) • The SOA infrastructure is independent of programming languages or operating systems • The chaining algorithm is independent of the used dataformat • Form a legal point of view, the web services are still located in the institute where they were created

WebLicht– Future Requirements • Web services are synchronous: some linguistic annotation processes are very time consuming an asynchronous behavior of these service would be desirable • The processing power is limited by local computing resources Scalability only with strong centers possible • The current architecture is not sufficiently parallelized and therefore does not scale up: Accommodate a large number of simultaneous users Parallelization of processes

WebLicht– Future Requirements • Currently, users have to store the input data and their results on their local machines • Online storage in the form of personal workspaces with reliable backup solutions • Linguistic tools are typically developed in a variety of heterogeneous software environments and programming languages (Java, Perl, Python, C/C++, Prolog, Lisp, …) Encapsulation of individualserviceswithcommonAPIsforinteroperability • Currently, WebLicht services are limited to processing text corpora Extending webservices also to spoken language and multi-modal datasets (MPI is already working on this)

Test Case: Gutenberg Corpus • On the basis of these structure, a part of the free available Gutenberg Project was annotated in Tübingen • Ca. 20.000 textsfrom 800 authors • Runtime: ca. 3.5 weeks • Result: • 217 milliontokens (words), 533 millionconstituents, 110 GB data

Gutenberg Corpus – Analyzing • Fulltext index (Lucene) • Database for the linear part of the data • Tree-like structures can be analyzed with XML based techniques (Xpath, Xquery) • DOM based techniques are slow and performance hungry

Links etc. • Clarin Homepage: http://www.clarin.eu • The D-Spin homepage: http://www.d-spin.org • WebLicht (login via DFN AAI): https://weblicht.sfs.uni-tuebingen.de/ Erhard Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft Universität Tübingen Wilhelmstr. 19 D-72074 Tübingen thomas.zastrow@uni-tuebingen.de Erhard.hinrichs@uni-tuebingen.de

WebLicht - Combinations

WebLicht: Linguistic Chaining Tool for Language Corpus Data