A Flexible and Extensible Architecture for Linguistic Annotation
Steven Bird*, David Day†, John Garofolo‡, John Henderson†, Christophe Laprun‡ and Mark Liberman*
* Linguistic Data Consortium, University of Pennsylvania
† MITRE Corporation
‡ National Institute of Standards and Technology
Tradition: Create formats and tools for each research domain
[Diagram labels: RDB, SGML]
• The existing bazaar of formats and tools discourages exchange and reuse
Background
• Participant "Troika" motivated by application needs
  • NIST work in evaluation infrastructure
  • LDC work in corpus building and annotation graph research
  • MITRE work in multi-modal visualization/annotation, extraction technology, Alembic Workbench
• Began collaboration in early summer '99
• Initially, exploring the feasibility of fitting together existing resources under the Bird & Liberman annotation graph formalism
• Early goals
  • develop the ability to construct flexible and extensible tools and data formats for existing research domains and applications
  • focus task: create formats to support the ACE infrastructure
• Project has evolved substantially as we continue to explore new domains and uses
Base Ontology for Linguistic Annotation of Signals
• Establishing an annotation requires specifying:
  • The source signal that is being annotated
  • The particular region of the signal about which one wants to say something
  • The content of the annotation being asserted about that region of the signal
[Diagram: Signal, Region, Annotation]
The Annotation Graph Model
• The Annotation Graph model, a proper subset of the more general case, addresses annotation for one-dimensional signals (text, audio)
  • intervals specified with start and end nodes
  • nodes have (optional) offsets
  • annotations specified as labeled arcs between nodes
  • labels are fielded records (attributes + values)
  • collection of annotations => annotation graph
• Formal definition
  • labeled directed acyclic graph, with a partial time function on nodes (see Bird & Liberman 2000)
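The model on this slide can be illustrated with a small Python sketch. It is not the ATLAS API: the class and method names (Node, Arc, AnnotationGraph, annotate, is_acyclic) are hypothetical and chosen only for illustration. Nodes carry an optional offset (the partial time function), annotations are arcs labeled with a fielded record, and the graph can be checked for cycles.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    """A graph node; the offset is optional (partial time function)."""
    name: str
    offset: Optional[float] = None


@dataclass
class Arc:
    """An annotation: a labeled arc between a start node and an end node."""
    start: Node
    end: Node
    label: Dict[str, str]  # fielded record: attribute -> value


@dataclass
class AnnotationGraph:
    """A collection of annotations (hypothetical class, not the ATLAS API)."""
    arcs: List[Arc] = field(default_factory=list)

    def annotate(self, start: Node, end: Node, **fields: str) -> Arc:
        # When both offsets are known, the interval must not run backwards.
        if (start.offset is not None and end.offset is not None
                and start.offset > end.offset):
            raise ValueError("start offset is later than end offset")
        arc = Arc(start, end, dict(fields))
        self.arcs.append(arc)
        return arc

    def is_acyclic(self) -> bool:
        # Depth-first search; reaching a node already on the current path
        # means the arcs form a cycle.
        successors: Dict[Node, List[Node]] = {}
        for arc in self.arcs:
            successors.setdefault(arc.start, []).append(arc.end)
        state: Dict[Node, int] = {}  # 1 = on current path, 2 = finished

        def visit(node: Node) -> bool:
            if state.get(node) == 1:
                return False
            if state.get(node) == 2:
                return True
            state[node] = 1
            ok = all(visit(nxt) for nxt in successors.get(node, []))
            state[node] = 2
            return ok

        return all(visit(node) for node in list(successors))


# Two word-level arcs and one phrase-level arc over the same stretch of signal.
graph = AnnotationGraph()
n0, n1, n2 = Node("n0", 0.0), Node("n1"), Node("n2", 0.9)
graph.annotate(n0, n1, type="word", text="hello")
graph.annotate(n1, n2, type="word", text="world")
graph.annotate(n0, n2, type="phrase", text="hello world")
```

The middle node n1 has no offset, which is the case the partial time function allows for: its position is constrained only by the arcs that meet at it.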
ATLAS Generalized Model
• The generalized model has been designed to accommodate non-linear signals such as images:
  • annotation elements describing regions within signals with signal pointer(s) and content-bearing attributes
      <Annotation>
        <Source> <Region> … </Region> </Source>
        <Content> … </Content>
      </Annotation>
  • annotation sets containing clusters of annotation elements
  • annotations may be treated as signals themselves
  • standoff annotations provide alignment of annotations & signals
[Diagram: Annotation, Region, Content, Signal]
Extensibility
• Impossible to anticipate all the varieties of "linguistic signals" and the ways one might wish to annotate them
• ATLAS includes a mechanism for declaring new signal classes and defining new ways of carving out regions of those signals via
  • the definition of an anchor type for the new signal class
  • the creation of an anchor "plug-in" component
• ATLAS will support general purpose signal classes for popular linguistic resource types
  • Signals: text, audio, images, video
  • Symbol tables: word lists, part-of-speech tagsets, …
  • Attribute value matrices: dictionaries, thesauri, knowledge representation propositions, …
  • Tree databases: Treebanks, …
  • Signal alignments: bilingual corpora, …
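One way to picture the anchor plug-in idea is a registry keyed by signal class. The slides do not show the actual ATLAS mechanism, so the following Python sketch is an assumption throughout: ANCHOR_TYPES, anchor_type, interval_anchor, box_anchor and make_region are hypothetical names.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: signal class name -> function that validates the
# values describing a region of that kind of signal and returns an anchor.
ANCHOR_TYPES: Dict[str, Callable[..., Tuple]] = {}


def anchor_type(signal_class: str):
    """Decorator that registers an anchor plug-in for a signal class."""
    def register(func: Callable[..., Tuple]) -> Callable[..., Tuple]:
        ANCHOR_TYPES[signal_class] = func
        return func
    return register


@anchor_type("audio")
def interval_anchor(start_msec: int, end_msec: int) -> Tuple:
    # One-dimensional signals: a region is a start/end interval.
    if start_msec > end_msec:
        raise ValueError("interval runs backwards")
    return ("interval", start_msec, end_msec)


@anchor_type("image")
def box_anchor(x: int, y: int, width: int, height: int) -> Tuple:
    # Two-dimensional signals: a region is a bounding box.
    if width < 0 or height < 0:
        raise ValueError("negative box size")
    return ("box", x, y, width, height)


def make_region(signal_class: str, *values: int) -> Tuple:
    """Build a region using whichever plug-in handles the signal class."""
    try:
        build = ANCHOR_TYPES[signal_class]
    except KeyError:
        raise ValueError(f"no anchor type registered for {signal_class!r}") from None
    return build(*values)


audio_region = make_region("audio", 453, 497)
image_region = make_region("image", 10, 20, 30, 40)
```

Supporting a new signal class then means registering one more function, with no change to make_region or to code that already handles other signal classes.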
ATLAS Layers
• Approach: Separate/abstract physical and logical levels from application-specific levels for maximum flexibility.
• Physical level provides a persistent representation of logical level data for long-term storage, exchange, and pipelining
  • XML-based ATLAS Interchange Format (AIF)
  • Relational database implementation
• Logical level provides a structural framework for the manipulation of annotation data
  • annotation elements and sets
  • atomic operators (creation, manipulation, destruction)
• Application level specifies semantic interpretation of annotation data and provides user interfaces
  • application-specific (developer-provided)
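The separation of levels can be sketched in Python as follows. This is an illustration under stated assumptions, not ATLAS code: AnnotationSet, Storage and JsonStorage are hypothetical names, and JSON stands in for a physical format only because it is in the Python standard library. The real physical level is AIF or a relational database.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict


class AnnotationSet:
    """Logical level: annotation elements plus atomic operators."""

    def __init__(self) -> None:
        self._elements: Dict[str, dict] = {}

    def create(self, ann_id: str, **content: str) -> None:
        if ann_id in self._elements:
            raise KeyError(f"annotation {ann_id!r} already exists")
        self._elements[ann_id] = dict(content)

    def update(self, ann_id: str, **content: str) -> None:
        self._elements[ann_id].update(content)

    def destroy(self, ann_id: str) -> None:
        del self._elements[ann_id]

    def elements(self) -> Dict[str, dict]:
        # Return copies so callers cannot bypass the operators above.
        return {k: dict(v) for k, v in self._elements.items()}


class Storage(ABC):
    """Physical level: a persistent representation of logical-level data."""

    @abstractmethod
    def dump(self, annotations: AnnotationSet) -> str: ...

    @abstractmethod
    def load(self, data: str) -> AnnotationSet: ...


class JsonStorage(Storage):
    """Stand-in backend (assumption); ATLAS itself uses AIF or an RDB."""

    def dump(self, annotations: AnnotationSet) -> str:
        return json.dumps(annotations.elements(), sort_keys=True)

    def load(self, data: str) -> AnnotationSet:
        restored = AnnotationSet()
        for ann_id, content in json.loads(data).items():
            restored.create(ann_id, **content)
        return restored


# Application-level code touches only the logical operators.
annotations = AnnotationSet()
annotations.create("a1", type="transcription", text="hello")
annotations.update("a1", speaker="A")
serialized = JsonStorage().dump(annotations)
round_trip = JsonStorage().load(serialized)
```

Because the application sees only AnnotationSet, swapping the storage backend does not require changes at the application level.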
Layered Solution
[Diagram. Applications: Visualization and Exploration, Extraction Systems, Annotation Tools, Query Systems, Automatic Aligners, Evaluation Software, Conversion Tools. ATLAS API. ATLAS Core: ATLAS Logical Level, ATLAS Physical Level. Storage: RDB, AIF files.]
ATLAS Architecture
[Diagram: component groups labeled VC1, VC2, …, VCn; SC1, SC2, …, SCn; AC1, AC2, …, ACn; EC1, EC2, …, ECn, under the headings Visualization, Search/Access, Annotation, Format Exchange, connected to ATLAS Public Services, the Internal Representation, and ATLAS Private Services]
• Persistent Storage
  • RDBMS
  • flat files (AIF)
• XML Processing
  • DTD validation
  • XML parser
  • XSLT
• Data Access
  • file sharing
  • network protocols
  • multi-user/collaboration
  • privacy
ATLAS Interchange Format: An Example
[Callouts: annotation set, signal types, annotation element, source signal, standoff content]
    <AnnotationSet id="http://ace.program/ocr/9801.10/9801.10.omni.xml">
      <Signal mime-class="AUDIO" mime-type="wav" encoding="wav" ID="Audio1"/>
      <Signal mime-class="TEXT" mime-type="PLAIN" encoding="UTF8" ID="Text1"/>
      <Annotation id="a1" type="transcription">
        <Source>
          <Region Signal="Audio1" type="interval">
            <Value type="integer" role="start" unit="msec">453</Value>
            <Value type="integer" role="end" unit="msec">497</Value>
          </Region>
        </Source>
        <Content>
          <Region Signal="Text1" type="interval">
            <Value type="integer" role="start" unit="char">25</Value>
            <Value type="integer" role="end" unit="char">29</Value>
          </Region>
        </Content>
      </Annotation>
      <Annotation id="a2" type="transcription"> … </Annotation>
      …
    </AnnotationSet>
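To show how a standoff annotation like the one on this slide could be read, here is a minimal Python sketch using the standard library XML parser. It is not part of ATLAS. The element and attribute names come from the slide; the helper names (AIF_SAMPLE, read_region, read_annotations) are hypothetical, and writing the Signal elements as empty elements is an assumption made so that the sample is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the slide, reduced to one annotation.
AIF_SAMPLE = """\
<AnnotationSet id="http://ace.program/ocr/9801.10/9801.10.omni.xml">
  <Signal mime-class="AUDIO" mime-type="wav" encoding="wav" ID="Audio1"/>
  <Signal mime-class="TEXT" mime-type="PLAIN" encoding="UTF8" ID="Text1"/>
  <Annotation id="a1" type="transcription">
    <Source>
      <Region Signal="Audio1" type="interval">
        <Value type="integer" role="start" unit="msec">453</Value>
        <Value type="integer" role="end" unit="msec">497</Value>
      </Region>
    </Source>
    <Content>
      <Region Signal="Text1" type="interval">
        <Value type="integer" role="start" unit="char">25</Value>
        <Value type="integer" role="end" unit="char">29</Value>
      </Region>
    </Content>
  </Annotation>
</AnnotationSet>
"""


def read_region(region):
    """Return (signal id, start, end, unit) for an interval Region element."""
    values = {v.get("role"): int(v.text) for v in region.findall("Value")}
    unit = region.find("Value").get("unit")
    return region.get("Signal"), values["start"], values["end"], unit


def read_annotations(xml_text):
    """Collect each Annotation's source and content regions."""
    root = ET.fromstring(xml_text)
    result = []
    for ann in root.findall("Annotation"):
        result.append({
            "id": ann.get("id"),
            "type": ann.get("type"),
            "source": read_region(ann.find("Source/Region")),
            "content": read_region(ann.find("Content/Region")),
        })
    return result


annotations = read_annotations(AIF_SAMPLE)
```

Here the content of the annotation is itself a region of another signal (characters 25 to 29 of Text1), so the transcription stays in its own file and the annotation only aligns it with milliseconds 453 to 497 of Audio1.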
Potential ATLAS Applications
• Corpora:
  • data exchange/reuse, consistent metadata formats
  • multi-layered/multi-linked annotation
  • multi-lingual dictionaries, aligned multi-lingual data
  • aligned multi-modal data (audio/video/image/text)
  • lexicons with varying levels of structure
• Tools
  • modular/reusable annotation components
  • development infrastructure
  • conversion tools
• Applications
  • internal/external data representation
  • faster prototyping and development
  • evaluation
  • data pipelining and plug-and-play data exchange
  • document segmentation/zoning
ATLAS Projects Underway
• Evaluation Formats:
  • ACE Entity Detection and Tracking (EDT) Evaluation
  • DARPA/NIST ASR/Segmentation scoring
• Corpora:
  • NSF linguistic exploration project on low-density languages
  • NSF Talkbank
  • UMD Image Recognition Evaluation Corpus
• Tools:
  • LDC annotation tools
  • MITRE Alembic Workbench
  • Emu speech database access tools
  • DGA speech Transcriber
  • next generation SCLITE
Development Status
• ATLAS Prototype Suite implemented:
  • ATLAS Interchange Format (AIF) XML DTD
  • Annotation graph API definition
  • Core API implementations (C++, Java) for annotation graphs
• Extending the architecture for new signal types
• Defining query language
• Currently soliciting research community input
  • ACE, TIDES, DARPA ASR, ISLE, CES, industry ...
• Complete ATLAS 1.0 (Beta) (Sep. 2000)
  • Internal representation, AIF, basic query language, sample applications (transcription/annotation tools, conversion tools)
• Open Source ATLAS (Winter, 2000-2001)
• ATLAS Website:
  • http://www.nist.gov/speech/atlas