This presentation describes the annotation of multiple bilingual and parallel text corpora with a common representation (an interlingua), with the aim of observing how the same meaning is realized on the surface across and within languages. It covers the methods and tools used in the annotation process, reports inter-annotator agreement and annotation time, and outlines plans for merging IL1 representations to produce IL2.
Columbia, CRL/NMSU, ISI/USC, LTI/CMU, MITRE, UMIACS/UMD
• David Farwell, Stephen Helmreich: Computing Research Laboratory/New Mexico State University
• Lori Levin, Teruko Mitamura: Language Technologies Institute/Carnegie Mellon University
• Bonnie Dorr, Rebecca Green: Institute for Advanced Computer Studies/University of Maryland
• Eduard Hovy: Information Sciences Institute/University of Southern California
• Keith Miller, Florence Reeder: MITRE Corporation
• Owen Rambow, Nizar Habash: Columbia University
What we annotate
• Multiple comparable bilingual text corpora
• Parallel text corpora
• Multiple translations of texts
• Genre: newspaper texts / DARPA corpus
Goals
• Common representation (interlingua)
• Common methodology and tools
• Observe and catalogue different surface realizations of the same meaning across and within languages
Annotation Process
• Text is syntactically parsed (Connexor / IL0)
• Parses are reviewed and corrected (TrEd)
• Annotation to IL1 (Tiamat)
  • Content words annotated for sense (Omega)
  • Arguments annotated for thematic role (LCS)
• 2 English translations of 6 articles
• Source languages: Arabic, French, Hindi, Japanese, Korean, Spanish
• 12 annotators, 2 at each site
• Total: 144 texts annotated to the IL1 level
Results: Agreement & Time
• Tools (Tiamat)
• Manuals (IL0 for 7 languages, IL1)
• Inter-annotator agreement: kappa = .83 (mK), .66 (wn), .59 (theta-roles)
• Annotation time: 4 hours/annotator/text, 250 words/text, 2 annotators/text = approx. 2 person-years for 100K words at IL1
• Next step: merge IL1 representations and develop transformation algorithms to produce IL2
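The agreement and time figures above can be made concrete with a small sketch. The slides do not say which kappa variant was used, so the code assumes plain two-annotator Cohen's kappa. The example labels are invented for illustration, and the 2,000 working hours per person-year is an assumption, not a figure from the slides. Under that assumption the effort comes to 1.6 person-years, in line with the slide's rounded "approx. 2".

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the slides only say "kappa"; this is the standard
    two-annotator form, not necessarily the variant the project used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / n**2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def annotation_hours(total_words, words_per_text=250,
                     hours_per_annotator_text=4, annotators_per_text=2):
    """Total annotator hours, using the per-text figures from the slide."""
    texts = total_words / words_per_text
    return texts * hours_per_annotator_text * annotators_per_text


# Hypothetical theta-role labels from two annotators (illustration only).
annotator_1 = ["agent", "theme", "agent", "goal"]
annotator_2 = ["agent", "theme", "theme", "goal"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))

hours = annotation_hours(100_000)   # 400 texts * 4 h * 2 annotators
HOURS_PER_PERSON_YEAR = 2000        # assumption, not from the slides
print(hours, hours / HOURS_PER_PERSON_YEAR)
```

Kappa discounts the agreement that two annotators would reach by chance, which is why it is usually lower than raw percentage agreement on the same data.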