
ANNIC ANNotations In Context

ANNIC ANNotations In Context. GATE Training Course October 2006 Kalina Bontcheva (with help from Niraj Aswani). Motivation - I. Need for efficient corpus indexing and querying arises frequently both in machine learning-based and human-engineered NLP systems.


Presentation Transcript


  1. ANNIC: ANNotations In Context. GATE Training Course, October 2006. Kalina Bontcheva (with help from Niraj Aswani)

  2. Motivation - I
     • The need for efficient corpus indexing and querying arises frequently in both machine learning-based and human-engineered NLP systems.
     • Language engineers rely on intuition when writing patterns, trying to strike the ideal balance between specificity and coverage. This requires a series of informed guesses, which are then validated by testing the resulting rule set over a corpus.

  3. Motivation - II
     • Needed: a system that allows querying the information contained in a corpus in more flexible ways than simple full-text search (e.g. identifying share movements like "BT shares ended up 36p").
     • Required: a system that can index and query both linguistic metadata and document content flexibly, and that also allows the derived rule set to be validated with minimal effort.

  4. ANNIC - ANNotations In Context
     • Description: a full-featured annotation indexing and search engine, developed as part of GATE
     • Powered by: Apache Lucene
     • What can be indexed: documents in any format supported by GATE (e.g. XML, HTML, RTF, e-mail, plain text)
     • Indexing of linguistic metadata: extensive indexing of document content and of the linguistic information (annotations and features) associated with it, independent of document format

  5. ANNIC - ANNotations In Context
     • What is special: indexing and extraction of information from overlapping annotations and features
     • Result: matching texts in the corpus, displayed within the context of linguistic annotations (not just text, as is customary for keyword-in-context (KWIC) systems)
     • Interface: an advanced GUI gives a graphical view of annotation mark-up over the text, along with the ability to build new queries interactively
     • Where to use: as a first step in rule development for NLP systems, since it enables the discovery and testing of patterns in corpora

  6. The Pattern Syntax
     • JAPE: Java Annotation Patterns Engine in GATE
       - It executes the JAPE grammar phases; each phase consists of regular-expression pattern/action rules over annotations
       - The LHS represents an annotation pattern, e.g. {Title}{Token.orth == "upperInitial"}
       - The RHS describes the action to be taken when the pattern is found, e.g. annotate the above pattern as a Person
     • ANNIC indexes documents together with their annotations and features, and lets users issue queries written as the LHS part of a JAPE pattern/action rule, e.g. {Person} {Token.string == "from"} {Organization}
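To make the LHS query form concrete, here is a minimal Python sketch that parses a flat sequence of `{Type}` or `{Type.feature=="value"}` terms and matches them against a list of annotations. It is an illustration only, not ANNIC's or JAPE's parser: the names `Term`, `parse_terms`, `term_matches` and `find_matches` are hypothetical, annotations are modelled as plain dicts, and the sketch assumes the annotations form a simple non-overlapping sequence (real GATE annotations can overlap).

```python
import re
from typing import NamedTuple, Optional


class Term(NamedTuple):
    """One {...} term of an LHS pattern (hypothetical helper type)."""
    ann_type: str
    feature: Optional[str]
    value: Optional[str]


# One term: an annotation type, optionally followed by .feature=="value"
TERM_RE = re.compile(r'\{\s*(\w+)\s*(?:\.\s*(\w+)\s*==\s*"([^"]*)"\s*)?\}')


def parse_terms(pattern: str) -> list:
    """Split a flat LHS pattern into Term tuples (no operators handled here)."""
    return [Term(*m.groups()) for m in TERM_RE.finditer(pattern)]


def term_matches(term: Term, ann: dict) -> bool:
    """True if the annotation has the term's type and, if given, its feature value."""
    if ann["type"] != term.ann_type:
        return False
    if term.feature is None:
        return True
    return ann.get("features", {}).get(term.feature) == term.value


def find_matches(terms: list, anns: list) -> list:
    """Return start indices where consecutive annotations satisfy the terms in order."""
    if not terms:
        return []
    hits = []
    for i in range(len(anns) - len(terms) + 1):
        if all(term_matches(t, a) for t, a in zip(terms, anns[i:])):
            hits.append(i)
    return hits


# Toy annotation sequence (invented data, for illustration only)
annotations = [
    {"type": "Token", "features": {"string": "met"}},
    {"type": "Person", "features": {}},
    {"type": "Token", "features": {"string": "from"}},
    {"type": "Organization", "features": {}},
]
query = parse_terms('{Person} {Token.string=="from"} {Organization}')
print(find_matches(query, annotations))
```

The sketch deliberately ignores grouping, alternation and repetition; those operators change how many annotation sequences a single query can match.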

  7. Kleene Operators
     • ANNIC supports two Kleene operators, "+" and "*", each with an upper bound n
       - ({A})+n: between one and n occurrences of annotation {A}
       - ({A})*n: between zero and n occurrences of annotation {A}
     • The | (OR) operator is also supported
       - {A}({B} | {C}) ⇒ {A}{B} | {A}{C}
       - {A}({B} | {C})+2 ⇒ ({A}({B} | {C})) | ({A}({B} | {C})({B} | {C})) ⇒ ({A}{B}) | ({A}{C}) | ({A}{B}{B}) | ({A}{B}{C}) | ({A}{C}{B}) | ({A}{C}{C})
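The expansion on this slide can be reproduced mechanically. The following Python sketch enumerates every flat annotation sequence that a pattern with bounded "+", bounded "*" and OR can match. It is an assumption-laden illustration of the expansion idea, not ANNIC's query processor: `Alt`, `Repeat` and `expand` are hypothetical names, and patterns are written as Python objects rather than parsed from query text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Alt:
    """Alternation: exactly one of the options, e.g. ({B} | {C})."""
    options: tuple


@dataclass(frozen=True)
class Repeat:
    """Bounded repetition: low is 1 for '+', 0 for '*'; high is the bound n."""
    body: object
    low: int
    high: int


def _concat(parts):
    """Flatten a tuple of sequences into one sequence."""
    return tuple(name for part in parts for name in part)


def expand(pattern):
    """Return every flat sequence (tuple of annotation names) the pattern can match."""
    if isinstance(pattern, str):  # a single annotation such as "A"
        return [(pattern,)]
    if isinstance(pattern, Alt):
        return [seq for option in pattern.options for seq in expand(option)]
    if isinstance(pattern, Repeat):
        body = expand(pattern.body)
        result = []
        for count in range(pattern.low, pattern.high + 1):
            for combo in product(body, repeat=count):
                result.append(_concat(combo))
        return result
    # Otherwise: a list of sub-patterns matched one after another
    return [_concat(combo) for combo in product(*(expand(p) for p in pattern))]


# {A}({B} | {C})+2 from the slide
for seq in expand(["A", Repeat(Alt(("B", "C")), 1, 2)]):
    print("".join("{%s}" % name for name in seq))
```

Because the number of alternatives grows quickly with n and with the number of OR branches, an upper bound on repetition keeps the expanded query set finite.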

  8. ANNIC PRs
     • ANNIC Index PR (processing resource): indexes document content and metadata from a given corpus
     • Parameters
       - Corpus (serialized corpus)
       - Base token annotation type (e.g. Token)
       - Annotation types and features to be excluded (e.g. SpaceToken)
       - Index location
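As a rough picture of how the indexing parameters interact, the Python sketch below builds a toy in-memory positional index. It is not the Lucene-based index ANNIC actually writes, and it has no notion of an index location on disk; `build_index`, the dict-based annotation format and the `Type.feature=value` key scheme are all assumptions made for illustration. Positions are counted in base tokens, excluded types are skipped, and an annotation that overlaps several base tokens is recorded with the first and last token position it covers.

```python
from collections import defaultdict


def build_index(docs, base_type="Token", exclude=frozenset({"SpaceToken"})):
    """Map each annotation type (and Type.feature=value) to postings.

    A posting is (doc_id, first_base_token_position, last_base_token_position).
    """
    index = defaultdict(list)
    for doc_id, anns in docs.items():
        # Base tokens, in text order, define the position axis
        base = sorted((a for a in anns if a["type"] == base_type),
                      key=lambda a: a["start"])
        for ann in anns:
            if ann["type"] in exclude:
                continue
            covered = [pos for pos, tok in enumerate(base)
                       if ann["start"] <= tok["start"] and tok["end"] <= ann["end"]]
            if not covered:
                continue  # annotation spans no base token, so it has no position
            posting = (doc_id, covered[0], covered[-1])
            index[ann["type"]].append(posting)
            for feature, value in ann.get("features", {}).items():
                index["%s.%s=%s" % (ann["type"], feature, value)].append(posting)
    return index


# Toy document "BT shares rose" with character offsets (invented data)
docs = {
    "doc1": [
        {"type": "Token", "start": 0, "end": 2, "features": {"string": "BT"}},
        {"type": "SpaceToken", "start": 2, "end": 3, "features": {}},
        {"type": "Token", "start": 3, "end": 9, "features": {"string": "shares"}},
        {"type": "SpaceToken", "start": 9, "end": 10, "features": {}},
        {"type": "Token", "start": 10, "end": 14, "features": {"string": "rose"}},
        {"type": "Organization", "start": 0, "end": 2, "features": {}},
    ]
}
index = build_index(docs)
print(index["Organization"])
print(index["Token.string=shares"])
```

Storing the Organization annotation at the same position as the token "BT" is what lets a query mix annotation types and token features in one pattern.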

  9. ANNIC PRs
     • ANNIC Search PR: searches over indexed documents
     • Parameters
       - Corpus (serialized corpus) OR one or more index locations
       - Limit (maximum number of matching patterns to return)
       - Context window (number of base tokens to show as context on each side, left and right)
       - Query (JAPE LHS pattern)
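The limit and context window parameters can be pictured with a short Python sketch. This is only an illustration of the idea under simple assumptions, not the Search PR's implementation: `in_context` and `show_hits` are hypothetical names, the document is a plain list of base-token strings, and each hit is a half-open (start, end) range of token positions assumed to come from some earlier matching step.

```python
def in_context(tokens, start, end, window):
    """Return (left context, match, right context) for tokens[start:end].

    The window is counted in base tokens and is clipped at the document edges.
    """
    left = tokens[max(0, start - window):start]
    match = tokens[start:end]
    right = tokens[end:end + window]
    return left, match, right


def show_hits(tokens, hits, window, limit):
    """Return at most `limit` hits, each with its surrounding context."""
    return [in_context(tokens, start, end, window) for start, end in hits[:limit]]


# Toy sentence from the motivation slide, split on whitespace
tokens = "BT shares ended up 36p".split()
for left, match, right in show_hits(tokens, [(1, 3)], window=2, limit=10):
    print(" ".join(left), "[" + " ".join(match) + "]", " ".join(right))
```

A larger window gives more surrounding text per hit at the cost of a wider result display; the limit caps how many hits are returned at all.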

  10. ANNIC Viewer

  11. ANNIC
     • DEMO
       - Index a corpus processed with ANNIE
       - Query the corpus
         {Person}
         {Organization}({Token})+10{Person}
     • QUESTIONS

  12. Thank You! This talk: http://gate.ac.uk/sale/talks/gate-course-oct06/annic.ppt
