This presentation details a structured approach to managing annotation schemes for richly annotated corpora, focusing on XML as a standard framework for language annotation. It addresses the challenges of chaotic annotation schemes and proposes a hierarchical representation that facilitates classification and reuse of annotations. The presentation includes examples of annotation sessions, explores human and automatic annotation integration, and examines multiple parentage within annotation schemes. Key conclusions stress the need for structural clarity in language resource annotation.
Hierarchical XML Layers Representation for Heavily Annotated Corpora
Dan Cristea (dcristea@infoiasi.ro), Cristina Butnariu (cris@infoiasi.ro)
“Al. I. Cuza” University of Iaşi, Faculty of Computer Science, and Romanian Academy – the Iaşi Branch, Institute for Theoretical Computer Science
XML in LR annotation
• A de facto framework to support language annotation
• Used to:
  • record experts' views on linguistic phenomena in corpora
  • store intermediate results in pipeline NLP applications
  • post NLP results
• BUT:
  • annotation schemes are chaotic and not reusable
  • many annotations share parts in common
  • not all layers are useful for the task at hand
LREC 2004 – Workshop on Richly Annotated Corpora
Presentation
• Motivation for a structural view on annotation schemes
• Proposal for a hierarchical representation
  • circular references
  • classification within the hierarchy
  • operations within the hierarchy
• Conclusions
An annotation session • a source XML annotated document • a database image of the annotation [Figure labels: Annotation session; both; or; DTD file]
A sequence of annotation sessions [Figure labels: Annotation session; Annotation session; DTD1; DTD2]
Mixing human with automatic annotation [Figure labels: Automatic annotation; Manual annotation; DTD1; DTD2]
Multiple parentage of a scheme [figure sequence]
The hierarchy – a DAG representation [Figure: DAG with nodes ST-ROOT, ST-SEG, ST-PAR, ST-TOK, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-NP, ST-VP, ST-POS, ST-COREF, ST-COREF-IN-SEG]
Definition of a scheme
<scheme name="scheme-name" parents="list-of-parents">
  <tag name="tag-name" attributes="list-of-attributes"/>
  …
  <ref source-tag="tag-name" source-attribute="attribute-name" target-tag="tag-name" target-attribute="attribute-name"/>
  …
</scheme>
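A minimal sketch of how a scheme definition of this shape could be loaded into memory. It is not the authors' Java system. The `Scheme` class and `parse_scheme` function are hypothetical names, the whitespace-separated reading of `parents` and `attributes` is an assumption (the slide does not fix a separator), and the contents given here to ST-COREF are invented for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Scheme:
    """In-memory image of one <scheme> element (hypothetical structure)."""
    name: str
    parents: list
    tags: dict = field(default_factory=dict)  # tag-name -> set of attribute names
    refs: set = field(default_factory=set)    # (source-tag, source-attr, target-tag, target-attr)


def parse_scheme(xml_text):
    """Parse a <scheme> definition; lists are assumed to be whitespace-separated."""
    root = ET.fromstring(xml_text)
    scheme = Scheme(name=root.get("name"), parents=root.get("parents", "").split())
    for tag in root.findall("tag"):
        scheme.tags[tag.get("name")] = set(tag.get("attributes", "").split())
    for ref in root.findall("ref"):
        scheme.refs.add((ref.get("source-tag"), ref.get("source-attribute"),
                         ref.get("target-tag"), ref.get("target-attribute")))
    return scheme


# Illustrative scheme only: the real content of ST-COREF is not given in the slides.
EXAMPLE = """
<scheme name="ST-COREF" parents="ST-NP">
  <tag name="NP" attributes="id head-id coref"/>
  <ref source-tag="NP" source-attribute="coref" target-tag="NP" target-attribute="id"/>
</scheme>
"""
coref = parse_scheme(EXAMPLE)
print(coref.name, coref.parents, sorted(coref.tags["NP"]))
```

Keeping attributes as sets and references as tuples makes the later set comparisons between schemes straightforward.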
The subsumption relation
A node A subsumes a node B in the hierarchy (B is a descendant of A) iff:
• any tag-name of A is also in B;
• any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name in B;
• any semantic relation which holds in A also holds in B;
• B has at least one tag-name which is not in A, and/or there is at least one tag-name in B with at least one attribute that is not in the list of attributes of the homonymous tag-name in A, and/or there is at least one semantic relation which holds in B and does not hold in A.
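The four conditions above can be sketched as a strict-containment test. This is an illustration under stated assumptions: a scheme is represented as a dict from tag-name to a set of attribute names plus a set of semantic relations, and the function name `subsumes` and the two sample schemes are hypothetical.

```python
def subsumes(a_tags, a_refs, b_tags, b_refs):
    """Return True if scheme A subsumes scheme B under the four conditions."""
    # Conditions 1 and 2: every tag-name of A, with all its attributes, is in B.
    for tag, attrs in a_tags.items():
        if tag not in b_tags or not attrs <= b_tags[tag]:
            return False
    # Condition 3: every semantic relation of A also holds in B.
    if not a_refs <= b_refs:
        return False
    # Condition 4: B adds at least one tag, attribute, or semantic relation.
    extra_tag = any(tag not in a_tags for tag in b_tags)
    extra_attr = any(b_tags[tag] - a_tags[tag] for tag in a_tags)
    extra_ref = bool(b_refs - a_refs)
    return extra_tag or extra_attr or extra_ref


# Hypothetical scheme contents, for illustration only.
np_tags, np_refs = {"NP": {"id", "head-id"}}, set()
coref_tags = {"NP": {"id", "head-id", "coref"}}
coref_refs = {("NP", "coref", "NP", "id")}

print(subsumes(np_tags, np_refs, coref_tags, coref_refs))  # True
print(subsumes(coref_tags, coref_refs, np_tags, np_refs))  # False
print(subsumes(np_tags, np_refs, np_tags, np_refs))        # False
```

The last call shows that the relation is strict: because of the fourth condition, a node does not subsume itself.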
Example
<?xml version="1.0" encoding="ISO-8859-1" ?>
<ROOT>
  <SEG id="0">
    <NP head-id="2" id="0">
      <TOK id="2" pos="N" lemma="Winston">Winston</TOK>
    </NP>
    <TOK id="3" pos="V" lemma="be">was</TOK>
    <TOK id="4" pos="ING" lemma="dream">dreaming</TOK>
    <TOK id="5" pos="PREP" lemma="of">of</TOK>
    <NP head-id="7" id="2">
      <NP head-id="6" id="1" coref="0">
        <TOK id="6" pos="PRON" lemma="he">his</TOK>
      </NP>
      <TOK id="7" pos="N" lemma="mother">mother</TOK>
    </NP>
    <TOK id="8" pos="PUNCT">.</TOK>
  </SEG>
  <SEG id="1">
    <NP head-id="9" id="3" coref="0">
      <TOK id="9" pos="PRON" lemma="he">He</TOK>
    </NP>
    <TOK id="10" pos="V" lemma="must">must</TOK>
    <TOK id="11" pos="PUNCT">,</TOK>
  </SEG>
  <SEG id="2">
    <NP head-id="12" id="4" coref="0">
      <TOK id="12" pos="PRON" lemma="he">he</TOK>
    </NP>
    <TOK id="13" pos="V" lemma="think">thought</TOK>
    <TOK id="14" pos="PUNCT">,</TOK>
  </SEG>
</ROOT>
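To make the coreference layer of the example concrete, the sketch below follows each `coref` attribute to the NP it points at. It assumes, from the example itself, that `coref` holds the `id` of another NP; the function names are hypothetical, and the document is shortened to its first two segments.

```python
import xml.etree.ElementTree as ET

# First two segments of the example document.
DOC = """<ROOT>
<SEG id="0">
<NP head-id="2" id="0"><TOK id="2" pos="N" lemma="Winston">Winston</TOK></NP>
<TOK id="3" pos="V" lemma="be">was</TOK>
<TOK id="4" pos="ING" lemma="dream">dreaming</TOK>
<TOK id="5" pos="PREP" lemma="of">of</TOK>
<NP head-id="7" id="2">
<NP head-id="6" id="1" coref="0"><TOK id="6" pos="PRON" lemma="he">his</TOK></NP>
<TOK id="7" pos="N" lemma="mother">mother</TOK>
</NP>
<TOK id="8" pos="PUNCT">.</TOK>
</SEG>
<SEG id="1">
<NP head-id="9" id="3" coref="0"><TOK id="9" pos="PRON" lemma="he">He</TOK></NP>
<TOK id="10" pos="V" lemma="must">must</TOK>
<TOK id="11" pos="PUNCT">,</TOK>
</SEG>
</ROOT>"""


def np_text(np):
    """Surface text of an NP: its tokens, nested ones included, joined by spaces."""
    return " ".join(tok.text for tok in np.iter("TOK"))


def coref_links(xml_text):
    """Return (anaphor text, antecedent text) pairs, in document order."""
    root = ET.fromstring(xml_text)
    by_id = {np.get("id"): np for np in root.iter("NP")}
    links = []
    for np in root.iter("NP"):
        target = np.get("coref")
        if target is not None:
            links.append((np_text(np), np_text(by_id[target])))
    return links


links = coref_links(DOC)
print(links)  # [('his', 'Winston'), ('He', 'Winston')]
```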
How can circular references be notated? <SEG id="seg0" head-id="vp0"> Winston <VP id="vp0" in-seg="seg0">was dreaming</VP> of his mother </SEG>
Representing circular references: SEG annotation <SEG id="seg0"> Winston was dreaming of his mother </SEG> [Hierarchy: ST-ROOT, ST-SEG]
Representing circular references: VP annotation Winston <VP id="vp0"> was dreaming </VP> of his mother [Hierarchy: ST-ROOT, ST-VP]
Representing circular references: SEG refers into VP <SEG id="seg0" head-id="vp0"> Winston <VP id="vp0"> was dreaming </VP> of his mother </SEG> [Hierarchy: ST-ROOT, ST-SEG, ST-VP, ST-SEG-TO-VP]
Representing circular references: VP refers into SEG <SEG id="seg0"> Winston <VP id="vp0" in-seg="seg0"> was dreaming </VP> of his mother </SEG> [Hierarchy: ST-ROOT, ST-SEG, ST-VP, ST-VP-TO-SEG]
Representing circular references: keeping all references <SEG id="seg0" head-id="vp0"> Winston <VP id="vp0" in-seg="seg0"> was dreaming </VP> of his mother </SEG> [Hierarchy: ST-ROOT, ST-SEG, ST-VP, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP]
Representing circular references: delete unnecessary layers [Hierarchy before and after: ST-ROOT, ST-SEG, ST-VP and ST-SEG-VP remain; ST-SEG-TO-VP and ST-VP-TO-SEG are deleted]
Under what conditions can a document interact with a hierarchy? • Compatibility of names • Matching of semantic relations
Under what conditions can a document interact with a hierarchy?
• Compatibility of names = tag and attribute names
  • simple translation
  • expanding/shrinking values
msd="Ncmso" expands into a set of elementary features: pos="noun" type="common" gender="masculine" number="singular" case="oblique"
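The expanding/shrinking of values can be sketched as a positional lookup. The table below is an assumption: it treats the `msd` value as one character per feature and contains only the codes that occur in the slide's own example, so it is not a complete tagset. The names `MSD_NOUN_POSITIONS`, `expand_msd` and `shrink_features` are hypothetical.

```python
# Position-by-position lookup, limited to the values shown in the example.
MSD_NOUN_POSITIONS = [
    ("pos", {"N": "noun"}),
    ("type", {"c": "common"}),
    ("gender", {"m": "masculine"}),
    ("number", {"s": "singular"}),
    ("case", {"o": "oblique"}),
]


def expand_msd(msd):
    """Expand a compact msd code into elementary feature-value pairs."""
    if len(msd) != len(MSD_NOUN_POSITIONS):
        raise ValueError(f"unexpected msd length: {msd!r}")
    features = {}
    for char, (name, values) in zip(msd, MSD_NOUN_POSITIONS):
        if char not in values:
            raise ValueError(f"unknown code {char!r} for feature {name!r}")
        features[name] = values[char]
    return features


def shrink_features(features):
    """Inverse operation: pack elementary features back into an msd code."""
    chars = []
    for name, values in MSD_NOUN_POSITIONS:
        inverse = {long: short for short, long in values.items()}
        chars.append(inverse[features[name]])
    return "".join(chars)


expanded = expand_msd("Ncmso")
print(expanded)
print(shrink_features(expanded))  # Ncmso
```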
Under what conditions can a document interact with a hierarchy?
• Matching of semantic relations
  • only by explicit declaration
  • automatic detection (intersection of attribute value ranges) is prone to errors
Operations on the lattice: classification
• Automatic classification of a document on the lattice proceeds in two steps:
  • the witness-collection is formed:
    • the document is parsed → tag declarations
    • semantic-relations declaration in the header → ref declarations
  • the witness-collection is “classified” down the hierarchy
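The first step, forming the tag part of the witness-collection, can be sketched as a single pass over the parsed document. This is an illustration only: the function name `witness_tags` is hypothetical, and the header's semantic-relation declarations are left out because the slides do not show their format.

```python
import xml.etree.ElementTree as ET


def witness_tags(xml_text):
    """Collect every tag-name in the document with the attributes seen on it."""
    tags = {}
    for elem in ET.fromstring(xml_text).iter():
        # Updating a set from elem.attrib adds the attribute names (dict keys).
        tags.setdefault(elem.tag, set()).update(elem.attrib)
    return tags


SAMPLE = (
    '<ROOT><SEG id="0">'
    '<NP head-id="2" id="0"><TOK id="2" pos="N" lemma="Winston">Winston</TOK></NP>'
    '<TOK id="8" pos="PUNCT">.</TOK>'
    '</SEG></ROOT>'
)
witness = witness_tags(SAMPLE)
print(witness)
```

Attributes are accumulated across all occurrences of a tag, so `TOK` is recorded with `lemma` even though the punctuation token lacks it.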
Operations on the lattice: classification
• The “programming by classification” paradigm of Mellish & Reiter (1993)
• the witness collection satisfies the restrictions of a node collection (is classified under it) if the features of the node collection are a subset of the features of the witness collection
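The subset test above suggests a simple top-down walk, sketched here under simplifying assumptions: features are flattened to plain sets of tag-names, the root has no features, and the edges of the small hierarchy are invented for illustration (the slides' actual edges are not recoverable). The function name `classify` is hypothetical and this is not the authors' algorithm.

```python
def classify(witness, features, children, root="ST-ROOT"):
    """Return the most specific nodes whose features are a subset of the witness."""
    result, visited, stack = set(), set(), [root]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        # Children whose restrictions the witness collection satisfies.
        matching = [c for c in children.get(node, []) if features[c] <= witness]
        if matching:
            stack.extend(matching)  # keep descending
        else:
            result.add(node)        # nothing more specific matches below here
    return result


# Hypothetical fragment of the hierarchy; the edges are assumptions.
features = {
    "ST-ROOT": set(),
    "ST-SEG": {"SEG"},
    "ST-NP": {"NP"},
    "ST-VP": {"VP"},
    "ST-SEG-NP-VP": {"SEG", "NP", "VP"},
}
children = {
    "ST-ROOT": ["ST-SEG", "ST-NP", "ST-VP"],
    "ST-SEG": ["ST-SEG-NP-VP"],
    "ST-NP": ["ST-SEG-NP-VP"],
    "ST-VP": ["ST-SEG-NP-VP"],
}

print(classify({"SEG", "NP", "TOK"}, features, children))
print(classify({"SEG", "NP", "VP"}, features, children))
```

A document carrying SEG, NP and TOK lands under both ST-SEG and ST-NP, while one carrying SEG, NP and VP reaches the node with multiple parents.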
Operations on the lattice: classification • Automatic classification of a document on the lattice [Figure sequence over the hierarchy ST-ROOT, ST-SEG, ST-PAR, ST-TOK, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-NP, ST-VP, ST-POS, ST-COREF, ST-COREF-IN-SEG. Labels shown in the first run: superior borderline, inferior borderline, new node ST-SEG-NP-VP-1. Labels shown in the second run: superior borderline, new node ST-NP-PP]
Operations on the lattice: merge [Figure: hierarchy with nodes ST-ROOT, ST-PAR, ST-TOK, ST-NP, ST-SEG, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-VP, ST-POS, ST-COREF, ST-COREF-IN-SEG; label ST-NP-SEG]
Operations on the lattice: extract [Figure: hierarchy with nodes ST-ROOT, ST-SEG, ST-PAR, ST-TOK, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-NP, ST-VP, ST-COREF, ST-COREF-IN-SEG; label ST-POS]
Conclusions
• Propose a data structure facilitating:
  • Definition and exploitation of annotation schemes
  • Visualization of the hierarchy
  • Representation of circular references
  • Concurrent annotations
  • Automatic classification
• Operations
  • initialize-hierarchy
  • classify
  • merge
  • extract
• System developed in Java, freely available on request
Acknowledgements The research presented in this paper has been partly supported by the EC-funded IST-2000-29388 Balkanet project and by the Balkanet-MEC project funded by the Romanian Ministry of Education and Research.
Thank you…