Experiences with UIMA from a User’s Perspective

Presentation Transcript


  1. Experiences with UIMA from a User’s Perspective • Dietmar Rösner, Manuela Kunze, Hany Mahgoub • University of Magdeburg, Knowledge Based Systems and Document Processing

  2. Overview • Introduction • GATE • UIMA • Conclusion Rösner, Kunze, Mahgoub: Experiences with UIMA from a User’s Perspective

  3. Introduction "IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies." • November 2005; Version 1.2.3 of UIMA is available

  4. Introduction really?

  5. Introduction • similarity/comparison of GATE and UIMA • frameworks • results are documents + annotations • pipeline processing • steps: • task definition • one corpus

  6. Evaluation Topics/Points • ease of getting acquainted with the system: • quality of documentation: completeness, clarity, up-to-date, …? • tutorials, use cases, …? • processing and linguistic resources? • lexica, gazetteer lists, tools • tools for resource maintenance and extension? • quality: self-explanatory, robust, comfortable • speed of processing? • single docs vs. large corpora? • limitations, suggestions for improvement? • support for im-/export of a variety of document formats?

  7. Task of the Experiment • process a corpus of web sites • to detect and extract information relevant for tourists • opening times of museums, prices of hotels, … • corpus: • 30 tourism web sites about Egypt • an additional 20 web sites about Washington, New York, London • output: • Prolog facts for a reasoner • Questions: • Which museum is now open? • …

  8. Excerpts from the Corpus • The Egyptian Museum is open the hours: 9am-5pm daily • The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm • Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter) • 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri • …
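
The question from the task slide, "Which museum is now open?", can be pictured with a small Python sketch. This is not the authors' Prolog reasoner. The table layout and the function name `open_museums` are assumptions, "Example Museum" is a made-up name attached to the unnamed "10am-2pm, 6pm-9pm" excerpt, and day restrictions such as "Sat-Wed" are ignored.

```python
from datetime import time

# Opening hours as (opens, closes) pairs. The Egyptian Museum hours come from
# the corpus excerpt; "Example Museum" is a hypothetical name for the
# unnamed "10am-2pm, 6pm-9pm" excerpt.
OPENING_HOURS = {
    "Egyptian Museum": [(time(9, 0), time(17, 0))],
    "Example Museum": [(time(10, 0), time(14, 0)), (time(18, 0), time(21, 0))],
}


def open_museums(hours, now):
    """Return the sorted names of all museums with an interval containing now."""
    return sorted(
        name
        for name, intervals in hours.items()
        if any(opens <= now < closes for opens, closes in intervals)
    )
```

Calling `open_museums(OPENING_HOURS, time(15, 0))` would return only the Egyptian Museum, because the second entry is closed between its two intervals.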

  9. Overview • Introduction • GATE • UIMA • Conclusion

  10. GATE: General Architecture for Text Engineering • a suite of tools for language processing and information extraction • rule-based modular IE system (ANNIE) • language- and domain-independent processing resources • open and extensible architecture • aims to provide uniform access to various linguistic and ontological resources • http://gate.ac.uk/

  11. GATE: General Architecture for Text Engineering • a software infrastructure for NLP researchers; based on three main elements: • an architecture • describing the components composing a language processing system • a framework • could be used as a basis for building such systems • a graphical development environment • a set of tools and • components for language engineers

  12. GATE: General Architecture for Text Engineering • GATE is distributed with an IE system called ANNIE • relies on finite state algorithms and the Java Annotation Pattern Engine (JAPE) language • comprises a set of core Processing Resources (PRs): • Tokeniser • Gazetteers • POS tagger • Sentence Splitter • Semantic Tagger (JAPE transducer) • Orthomatcher (orthographic coreference) • …

  13. GATE: ANNIE [Cunningham et al.: Developing Language Processing Components with GATE; Version 3 (a User Guide)]

  14. GATE Application • several Processing Resources: Tokenizer, Hash Gazetteer (with new/extended gazetteer lists), JAPE Transducer, ... • example input: *The Military Museum* Summer: 8am-5:30pm; Winter: 9pm-5pm … • diagram labels: ANNIE English Tokenizer, Gazetteer lists, JAPE Transducer • JAPE rules annotate: • intervals of times and restrictions • names of museums, fragments of times and restrictions
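
The gazetteer step can be pictured with a minimal Python sketch. It is not GATE's Hash Gazetteer. The list entries and the function name `lookup` are assumptions made for illustration; only the major type `org_base` is taken from the JAPE rule on slide 15.

```python
import re

# Hypothetical gazetteer: major type -> list entries (illustrative only).
GAZETTEER = {"org_base": ["Museum", "Hotel"]}


def lookup(text, gazetteer):
    """Return sorted (begin, end, major_type) spans for whole-word list matches."""
    annotations = []
    for major_type, entries in gazetteer.items():
        for entry in entries:
            pattern = r"\b" + re.escape(entry) + r"\b"
            for match in re.finditer(pattern, text):
                annotations.append((match.start(), match.end(), major_type))
    return sorted(annotations)
```

Each span plays the role of a Lookup annotation that a later rule can test, in the way the JAPE rule tests `Lookup.majorType`.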

  15. Museum information in JAPE

      Rule: egyptmuseums
      (
        ({SpaceToken})
        ({Token.kind == word})
        ({SpaceToken})
        {Lookup.majorType == org_base}   // from gazetteer lists
        ({SpaceToken})?
        (({Token.kind == punctuation})|({Token.kind == word})|({SpaceToken}))*
        ({timeinfo})                     // annotation by JAPE transducer
      ):museum
      -->
      :museum.sight = {rule = "egyptmuseums"}

      • timeinfo, defined by JAPE rules, detects patterns like: • 9am-5pm, 6pm-9pm • 8am-4:30pm, 8:30am-4:30pm, 8:30am-4pm • 5:00PM-7:00PM, 10:00am-5:00pm • …
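
The time patterns listed for timeinfo can be approximated with a regular expression. The following Python sketch is an assumption about what such a pattern might look like, not the authors' JAPE grammar; the names `CLOCK`, `INTERVAL`, `to_24h` and `find_intervals` are invented for this example.

```python
import re

# Hypothetical pattern: a clock time such as 9am, 8:30am or 5:00PM,
# a hyphen, then a second clock time.
CLOCK = r"(\d{1,2})(?::(\d{2}))?\s*([ap]m)"
INTERVAL = re.compile(CLOCK + r"\s*-\s*" + CLOCK, re.IGNORECASE)


def to_24h(hour, minute, meridiem):
    """Convert a 12-hour clock reading to an (hour, minute) pair in 24-hour time."""
    hour = int(hour) % 12              # 12am -> 0; 12pm -> 0, then +12 below
    if meridiem.lower() == "pm":
        hour += 12
    return hour, int(minute or 0)      # a missing minute part counts as :00


def find_intervals(text):
    """Return every time interval in text as ((h, m), (h, m)) tuples."""
    return [
        (to_24h(*m.group(1, 2, 3)), to_24h(*m.group(4, 5, 6)))
        for m in INTERVAL.finditer(text)
    ]
```

Normalising to 24-hour pairs at this stage keeps later steps, such as writing timestamps into facts, free of am/pm handling.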

  16. GATE: Presentation of Results • type and location of every extracted annotation in the document • figure labels: Annotations; Museums Information

  17. GATE: Results • information annotated in the documents: • names of museums, hotels • names of tourist places in Egypt • times, time intervals • time restrictions • prices, intervals of prices (hotel prices and museum prices) • names of pharaohs, queens

  18. GATE: Evaluation (documentation) • good • illustrative examples (tutorial), but not enough, especially about JAPE rules • can be used without knowledge of Java programming • but experience with Java programming is an advantage when using Java in JAPE rules

  19. GATE: Evaluation (processing and linguistic resources) • many processing resources available (ANNIE) • tokenisers • POS taggers • parsers • gazetteers • sentence splitter • … • additional PRs: • gazetteer collector • PRs for Machine Learning • various exporters • annotation set transfer, etc.

  20. GATE: Evaluation (tools for resource maintenance and extension) • editor for gazetteer lists • corpus manager • text editor and debugger for JAPE rules

  21. GATE: Evaluation (speed of processing) • there is no measurement of processing time in the GATE tool

  22. GATE: Evaluation (single docs vs. large corpora) • corpus pipeline vs. document pipeline

  23. GATE: Evaluation (limitations, suggestions for improvement) • no limitations: • everything is possible, and it is not necessary to implement it yourself • for getting started: • processing and linguistic resources are available within the distribution

  24. GATE: Evaluation (im-/export of document formats) • import: • supports a variety of document formats: HTML, RTF, email, SGML and plain text • in all cases the format is analysed and converted into a single unified model of annotation • export: • documents, corpora and annotations in databases of various sorts • required: Java application (CREOLE)

  25. Overview • Introduction • GATE • UIMA • Conclusion

  26. UIMA: Unstructured Information Management Architecture • a software architecture for developing and deploying unstructured information management (UIM) applications • UIM application: a software system that analyses large volumes of unstructured information to • discover, • organize, and • deliver relevant knowledge to the end user • the software architecture specifies • component interfaces, data representations, … • http://www.research.ibm.com/UIMA/

  27. UIMA: Unstructured Information Management Architecture • CAS: Common Analysis Structure • CPM: Collection Processing Manager • component descriptions from the architecture diagram: • … may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS. • … takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers. • … interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata. • … consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database. [Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
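
The component roles quoted above can be pictured as a chain of plain functions. The following Python sketch does not use the UIMA SDK or its Java interfaces; the `Cas` class and all function names are invented stand-ins, and the tag stripping is far cruder than a real HTML parser.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Toy stand-in for a CAS: document text plus (type, begin, end) annotations."""
    text: str
    annotations: list = field(default_factory=list)


def initialize(raw_html):
    """CAS Initializer role: strip tags from raw HTML to obtain plain text."""
    return Cas(text=re.sub(r"<[^>]+>", "", raw_html))


def read_collection(documents):
    """Collection Reader role: yield one CAS per document in the collection."""
    for raw in documents:
        yield initialize(raw)


def annotate_times(cas):
    """Analysis Engine role: enrich the CAS with annotations for clock times."""
    for match in re.finditer(r"\d{1,2}(?::\d{2})?[ap]m", cas.text):
        cas.annotations.append(("Time", match.start(), match.end()))
    return cas


def consume(cases):
    """CAS Consumer role: collect the covered text of every annotation."""
    return [[cas.text[b:e] for _, b, e in cas.annotations] for cas in cases]


def run_pipeline(documents, engines):
    """Apply each engine in order to every CAS, then pass the results to the consumer."""
    enriched = []
    for cas in read_collection(documents):
        for engine in engines:
            cas = engine(cas)
        enriched.append(cas)
    return consume(enriched)
```

Passing a list of engines mirrors the idea of an aggregate: each engine receives the CAS enriched by the one before it.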

  28. UIMA: Unstructured Information Management Architecture • Analysis Engine (AE): • a component that analyzes artifacts (e.g. documents) and infers information about them • consists of two parts: • Java classes (typically packaged as one or more JAR files) and • AE descriptors (one or more XML files), which contain • the configuration settings for the Analysis Engine as well as • a description of the AE’s input and output requirements. [Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]

  29. UIMA Application • several annotators (like a pipeline) • example input: *Fraunces Tavern Museum* 54 Pearl St. - 1-212-425-1778 Tuesday-Friday, 12pm-5pm; … • diagram labels: regular expressions (three times); restrictions; interval of times; time pattern (window covering two time intervals and a restriction); museum pattern (window covering a museum and opening hours); museum information • Prolog facts:

      museumopen('Fraunces Tavern Museum ', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
      museumopen('Fraunces Tavern Museum ', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
      museumopen('Fraunces Tavern Museum ', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
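
Facts of the shape shown on this slide can be produced with a few lines of code. The following Python sketch is only an illustration of the output format, not the authors' CAS Consumer; the function name `museumopen_fact` and its parameters are assumptions.

```python
from datetime import date, datetime, time


def museumopen_fact(name, day, opens, closes):
    """Format one museumopen/3 fact with ISO 8601 start and end timestamps."""
    start = datetime.combine(day, opens).isoformat()
    end = datetime.combine(day, closes).isoformat()
    quoted = name.replace("'", "\\'")   # escape quotes inside the Prolog atom
    return f"museumopen('{quoted}', '{start}', '{end}')."


if __name__ == "__main__":
    print(museumopen_fact("Fraunces Tavern Museum", date(2005, 12, 1),
                          time(12, 0), time(17, 0)))
```

One fact per day keeps a question such as "Which museum is now open?" down to a comparison of the current timestamp against the two stored ones.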

  30. UIMA: Results • information annotated in the documents: • names of museums, hotels • times, time intervals • time restrictions • prices, intervals of prices (hotel prices) • keywords for museum category • names of pharaohs (annotated with a correction of misspellings) • hotel and museum information is exported into Prolog facts and into a short textual summary • templates filled with the detected information • hotels: Price information about Cosmopolitan Hotel : $157 • museums: *** *Fraunces Tavern Museum* *** Open from 12:00:00 to 17:00:00; Restriction: Tuesday-Friday

  31. UIMA: Evaluation (documentation) • good • illustrative examples (tutorial) • completeness: some topics are described only very briefly • prior knowledge of Java and Eclipse is helpful

  32. UIMA: Evaluation (processing and linguistic resources) • annotators only from the tutorial • sentence annotation • word annotation • date/time annotators • examples for using regular expressions etc. • external resources can be integrated: • lexical resources as external resources (text files) • existing processing resources • implementation of an interface is necessary

  33. UIMA: Evaluation (tools for resource maintenance and extension) • specific Eclipse component editors or • simple text editors

  34. UIMA: Evaluation (speed of processing) • faster than GATE? • the CPE gives detailed information about the processing time of each module
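
Per-module timing of the kind reported by the CPE can be imitated in a few lines. The following Python sketch is a generic illustration and has nothing to do with the CPE implementation; `run_timed` and the stage names are invented for this example.

```python
import time


def run_timed(stages, document):
    """Run (name, function) stages in order and record seconds spent in each."""
    timings = {}
    for name, stage in stages:
        started = time.perf_counter()
        document = stage(document)
        timings[name] = time.perf_counter() - started
    return document, timings


if __name__ == "__main__":
    result, timings = run_timed(
        [("strip", str.strip), ("upper", str.upper)], "  9am-5pm  "
    )
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.6f} s")
```

Timing each stage separately shows which annotator dominates the run, which a single end-to-end measurement would hide.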

  35. UIMA: Evaluation (single docs vs. large corpora) • Collection Reader • document(s) from a directory • adapt extensions into preprocessing (CAS Initializer) • e.g., extraction of text fragments from an HTML document

  36. UIMA: Evaluation (limitations, suggestions for improvement) • no limitations: • everything is possible, but it requires implementation or interfacing by the user • wish: • more processing and linguistic resources within the distribution

  37. UIMA: Evaluation (im-/export of document formats) • import: CAS Initializer • export: CAS Consumer • transform annotations into any other format • export of • document + annotations • only annotations • required: Java application

  38. Overview • Introduction • GATE • UIMA • Conclusion

  39. Conclusion • intended use • GATE: academic/scientific application • tools available • comfortable GUI • UIMA: more commercial • plain framework • simplified definition of (complex) results structures • simplified pre- and postprocessing of annotations • in sum: incommensurable

  40. Conclusion • both are extensible • no final judgement about whether to use GATE or UIMA • depends on • your task • task description • expected results • which processing resources are necessary • your preferences for interface • prefer the Eclipse environment (or other Java editors) • prefer a comfortable GUI • or use both

  41. Conclusion • found in the UIMA Forum: "I see UIMA and GATE as complementary rather than competitive, and each can gain from the strengths of the other. GATE was originally developed as a research tool, and has features suited to rapid prototyping of text processing code, like JAPE (a language for defining finite-state transducers over annotations on a document). UIMA is more targetted at robust deployment of applications, with strong typing of feature structures and better support for distributed processing. We're currently working on writing a translation layer to allow UIMA analysis components to be used in GATE and vice-versa. It's not in a releasable state just yet, but we hope to release something in the near future. Keep your eye on http://gate.ac.uk/ for details." Ian Roberts (GATE developer)
