180 likes | 263 Vues
WebLicht is a web-based linguistic tool for accessing and filtering language corpus data. Explore its architecture, future requirements, and capabilities for researchers. Find out how WebLicht supports diverse user scenarios and provides customizable chains of web services for data analysis. Learn about the implementation phase and the mission of CLARIN in creating an integrated research infrastructure. Discover the potential of WebLicht for advancing linguistic research across various domains.
E N D
WebLicht Application and “Workspaces” Erhard Hinrichs & Thomas Zastrow University Tübingen
Outline • Web-based Linguistic Chaining Tool (WebLicht) for incremental filtering and access of language corpus data • WebLicht– Motivation • WebLicht - Architecture • WebLicht– Future Requirements • Test Case – Gutenberg Corpus
CLARIN (Common Language Resource and Technology Infrastructure Network) • is committed to establishing an integrated and interoperable RI • supporting easy access and use of language • aims to overcome the current fragmentation and offer a stable, • persistent and extendable infrastructure • it will offer its services to researchers and scholars across a wide • spectrum of domains in particular in the humanities and soc sciences • ESFRI roadmap project; implementation phase starts in 2011 CLARIN Mission
Typical CLARIN userscenario • Scenario: A PhDstudentinvestigates regional differences in vocabulary and in wordcollocations in different variants of German . • Data: large text corporaavailable at BBAW in Berlin, at theAustrianAcademy of Science in Vienna, the Swiss Text Corpus Project in Basel, and at EURAC, Bolzano. • Tools fortargeteddataaccess: WebLichtofferscustomizablechains of web servicesforfiltering and analyzingthedata
WebLicht - Motivation • Many linguistic resources (corpora, dictionaries, …) and tools (tokenizer, tagger, parser, …) are available • Most of them are implemented to run on local machines. This can be inconvenient and error-prone • Requirements: go beyond “do-it-yourself” and “download-first” strategies • The CLARIN solution: Make tools and resources available as webservices
WebLicht - Architecture • WebLichtis a SOA foraccessing and processingtext corpora • Developmentstarted in October 2008 • WebLichtconsists of thefollowingcomponents: • Distributedservices: offeringfunctionality (resources & tools) overthe (inter-)net. Implemented as webservices (ca. 90 at themoment) • Repository: storesmetadata and technicalinformationabouttheservices • Web 2.0 baseduserinterface:interactswiththeuser and combinesservices and informationfromtherepository. Access still possible via scripts / programmingcode
WebLicht - Architecture Stuttgart Tübingen Leipzig Stuttgart Leipzig Finland Romania Iceland UK Tübingen Repository Standard-conformant Text Corpus Encoding Web 2.0 Application for Tool Chaining and Execution Berlin
WebLicht – Architecture • Services are implemented as REST style webservices • HTTPs POST method is used to send data from the UI to the services • As client, anything which is able to use the HTTP protocol, can be used: • Browser • Commandline tools (wget, curl) • Programming Languages • Anyone can implement his/her own interface to WebLicht
WebLicht - Features • With RESTstylewebservices, everyone can implement a web service for WebLicht (4pages tutorial) • The SOA infrastructure is independent of programming languages or operating systems • The chaining algorithm is independent of the used dataformat • Form a legal point of view, the web services are still located in the institute where they were created
WebLicht– Future Requirements • Web services are synchronous: some linguistic annotation processes are very time consuming an asynchronous behavior of these service would be desirable • The processing power is limited by local computing resources Scalability only with strong centers possible • The current architecture is not sufficiently parallelized and therefore does not scale up: Accommodate a large number of simultaneous users Parallelization of processes
WebLicht– Future Requirements • Currently, users have to store the input data and their results on their local machines • Online storage in the form of personal workspaces with reliable backup solutions • Linguistic tools are typically developed in a variety of heterogeneous software environments and programming languages (Java, Perl, Python, C/C++, Prolog, Lisp, …) Encapsulation of individualserviceswithcommonAPIsforinteroperability • Currently, WebLicht services are limited to processing text corpora Extending webservices also to spoken language and multi-modal datasets (MPI is already working on this)
Test Case: Gutenberg Corpus • On the basis of these structure, a part of the free available Gutenberg Project was annotated in Tübingen • Ca. 20.000 textsfrom 800 authors • Runtime: ca. 3.5 weeks • Result: • 217 milliontokens (words), 533 millionconstituents, 110 GB data
Gutenberg Corpus – Analyzing • Fulltext index (Lucene) • Database for the linear part of the data • Tree-like structures can be analyzed with XML based techniques (Xpath, Xquery) • DOM based techniques are slow and performance hungry
Links etc. • Clarin Homepage: http://www.clarin.eu • The D-Spin homepage: http://www.d-spin.org • WebLicht (login via DFN AAI): https://weblicht.sfs.uni-tuebingen.de/ Erhard Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft Universität Tübingen Wilhelmstr. 19 D-72074 Tübingen thomas.zastrow@uni-tuebingen.de Erhard.hinrichs@uni-tuebingen.de