
Coreferencing Treebank data using CESAC

Annotating and analysing IS in corpora of historical English, Berlin, 13-14 November 2009. Coreferencing Treebank data using CESAC. Overview: CESAC (goals; coreference types: operationalizing IS; input and output), inter-rater agreement, example.


Presentation Transcript


  1. Annotating and analysing IS in corpora of historical English (Berlin, 13-14 November 2009): Coreferencing Treebank data using CESAC

  2. Contents
     Overview
     • CESAC
       - Goals
       - Coreference types: operationalizing IS
       - Input and output
     • Inter-rater agreement
     • Example
     • Summary and conclusion

  3. CESAC goals
     • Overall goal
       - Referring from any one constituent to any other constituent
     • More specifically
       - Source and destination: IP/phrase/node or DP/lexeme/endnode
       - Attributes
         • Coreference type
         • Distance measure
         • (NP type: definite, indefinite, etc.)
         • (Animacy)

  4. CESAC coreference types: operationalizing IS
     • Two basic rules
       - Do not omit possible coreference information
       - A source should be linked to the nearest possible destination
     • Labels encoding different forms of anaphoricity
       - Identity: ‘Jacqueline plays the cello. She is an amazing musician’
       - Cross Speech: ‘John said to Paul: “Why don’t you play the guitar?”’
       - Inferred: ‘Do you see that house? They say the kitchen is extremely spacious’
       - World knowledge (separate category): ‘According to Burt Reynolds, all dogs go to heaven’
     • Cross Speech > Identity > Inferred
     • Encoding facts vs. encoding interpretations: objective data
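The ordering "Cross Speech > Identity > Inferred" on slide 4 can be read as a precedence among labels when more than one could apply to the same link. That reading is an assumption, since the slide does not spell it out. A minimal Python sketch under that assumption follows; the names `CorefType` and `pick_type` are hypothetical and are not part of CESAC. World knowledge is left out because the slide treats it as a separate category.

```python
from enum import IntEnum


class CorefType(IntEnum):
    """Hypothetical encoding of the slide's ordering; a higher value wins."""
    INFERRED = 1
    IDENTITY = 2
    CROSS_SPEECH = 3


def pick_type(candidates):
    """Return the highest-ranked label among the candidate labels.

    Assumes ">" on the slide means "takes precedence over".
    """
    return max(candidates)


chosen = pick_type([CorefType.IDENTITY, CorefType.CROSS_SPEECH])
print(chosen.name)
```

With this encoding, a pronoun that is both identical to its antecedent and sits across a speech boundary would be labelled Cross Speech.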

  5. The input: Penn-Treebank (CESAC input 1)
     • Standard Penn-Treebank format
       - Collection of <nodes>
       - Each <node> consists of
         • Brackets: (…)
         • Label: (NP …)
         • Other node: (NP (N …) )
         • Lexeme: (N man)
         • Possibly <lexeme>+<node>: (P to(NP him))
       - Attributes in label
         • (NP-ACC (PRO^A hine))
       - Extra-textual data in CODE nodes
         • (CODE <TEXT: +tyl+aste>)
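The node structure listed on slide 5 (brackets, a label, then child nodes and/or a lexeme) can be illustrated with a small recursive reader. This is a sketch only, not CESAC's own parser: the function name `parse` and the nested-list representation are choices made for this example.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse(text):
    """Parse one bracketed tree into nested lists: [label, child, ...].

    Children are either strings (lexemes) or nested lists (nodes), so the
    mixed <lexeme>+<node> case such as (P to(NP him)) needs no special handling.
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())       # nested node
            else:
                items.append(tokens[pos])  # label or lexeme
                pos += 1
        pos += 1  # consume the closing bracket
        return items

    return node()


print(parse("(NP-ACC (PRO^A hine))"))
print(parse("(P to(NP him))"))
```

The first item of each list is the node label, so case attributes such as `-ACC` or `^A` stay attached to the label and can be split off later if needed.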

  6. The input: Penn-Treebank (CESAC input 2)
     Diagram callouts: matching brackets, node, label, end node.

     ( (CODE <T06080009600,11.4>)
       (IP-MAT (CONJ And)
               (NP-NOM (D^N +t+at) (N^N folc))
               (NP-ACC (PRO^A hine))
               (ADVP-TMP (ADV^T +ta))
               (PP (P mid)
                   (NP-DAT (ADJ^D unasecgendlicre) (N^D wur+dmynte)))
               (PP (P to)
                   (NP-DAT (N^D scipe)))
               (VBDI gel+addon)
               (. ,))
       (ID coapollo,ApT:11.4.183))

     ( (IP-MAT (CONJ and)
               (NP-NOM (NR^N Apollonius))
               (NP-ACC-1 (PRO^A hi))
               (VBDI b+ad)
               (IP-INF (NP-ACC-SBJ *ICH*-1)
                       (QP-ACC (Q^A ealle))
                       (VB $gretan)))
       (ID coapollo,ApT:11.4.184))

  7. Enriched Penn-Treebank (CESAC output 1)
     • Penn-Treebank format
     • Enriched with coreference information
       - Source node ID
       - Destination node ID
       - Coreference type
       - Coreference distance (derivable)
     • Destination node example
         (NP-SBJ (CODE <Coref_Id="339"_/>) (NPR Crist))
     • Source node example
         (NP-OB1 (CODE <Coref_Id="20"_Ref="21"_Type="Identity"_NdDist="16"_/>) (PRO hem) )
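Because the coreference information on slide 7 sits in CODE nodes with a regular attribute syntax, it can be pulled out with two regular expressions. The sketch below assumes every coreference CODE node looks like the two examples on the slide (attributes joined by underscores, closed by `/>`); the helper name `coref_attrs` is hypothetical.

```python
import re

# Matches a whole coreference CODE node and captures its attribute string.
CODE_NODE = re.compile(r"\(CODE <Coref_([^>]*?)_?/>\)")
# Matches one Name="value" pair; names are letters only, so the "_" separators are skipped.
ATTR = re.compile(r'([A-Za-z]+)="([^"]*)"')


def coref_attrs(tree_text):
    """Return one attribute dict per coreference CODE node, in text order."""
    return [dict(ATTR.findall(m.group(1))) for m in CODE_NODE.finditer(tree_text)]


source = '(NP-OB1 (CODE <Coref_Id="20"_Ref="21"_Type="Identity"_NdDist="16"_/>) (PRO hem) )'
destination = '(NP-SBJ (CODE <Coref_Id="339"_/>) (NPR Crist))'

print(coref_attrs(source))
print(coref_attrs(destination))
```

A node with only an `Id` is a pure destination; a node that also carries `Ref`, `Type` and `NdDist` is a source pointing at the node whose `Id` equals its `Ref`.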

  8. Enriched Penn-Treebank (CESAC output 2)
     [Tree diagrams: each node shown before and after enrichment]
     • Destination node: (NP-SBJ (NPR Crist)) gains a CODE child, <Coref Id="310">
     • Source node: (NP-OB1 (PRO hem)) gains a CODE child, <Coref Id="340" Ref="310" Type="Identity">
     • <node> = one-or-more <node> OR <lexeme>

  9. Enriched Penn-Treebank (CESAC output 3)
     Diagram callout: <lexeme> + <node>

     ( (IP-MAT (CONJ and)
               (NP-NOM *con* (CODE <Coref_Id="1488"_Ref="1489"_Type="Identity"_NdDist="12"_/>))
               (VBD l+adde)
               (NP-ACC (CODE <Coref_Id="1476"_Ref="1477"_Type="Identity"_NdDist="8"_/>)
                       (PRO^A hine))
               (PP (P mid)
                   (NP-DAT-RFL (CODE <Coref_Id="1487"_Ref="1488"_Type="Identity"_NdDist="6"_/>)
                               (PRO^D him)))
               (PP (P to)
                   (NP-DAT (PRO$ his (CODE <Coref_Id="1486"_Ref="1487"_Type="Identity"_NdDist="5"_/>))
                           (N^D huse))))
       (ID coapollo,ApT:12.16.209))

  10. Enriched Penn-Treebank (CESAC output 4)
      [Tree diagrams: the PP shown before and after enrichment]
      • Source node: in (PP (P to) (NP-DAT (PRO$ his))), the CODE node <Coref Id="20" Ref="21" Type="Identity"> is attached under PRO$, next to the lexeme his
      • <node> = one-or-more <node> OR <node> <lexeme> OR <lexeme>

  11. Inter-rater agreement 1
      • Two features measured
        - Coreference destination (node ID)
        - Coreference type
      • Adapted version of Cohen’s kappa: κ > .6
      • Two important problems
        - Identity vs. cross speech
        - Omission of link
      • Solutions
        - Create new rule(s)
        - Adapt/specify existing rule(s)

  12. Inter-rater agreement 2
      • Tool used to calculate inter-rater agreement concerning
        - Coreference destination (feature 1): κ = .67
        - Coreference type (feature 2): κ = .66
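The slides report an adapted version of Cohen's kappa but do not describe the adaptation. For orientation only, the sketch below computes the standard, unweighted Cohen's kappa for two raters, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The label sequences are invented for illustration and are not CESAC data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Standard (unadapted) Cohen's kappa for two equally long label sequences."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty set of items")

    # Observed agreement: share of items on which the raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1:  # both raters used one and the same label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented example: coreference types assigned by two raters to four links.
rater_a = ["Identity", "Identity", "Inferred", "CrossSpeech"]
rater_b = ["Identity", "Inferred", "Inferred", "CrossSpeech"]
print(round(cohens_kappa(rater_a, rater_b), 2))
```

The same function applies to feature 1 if the destination node IDs chosen by each rater are passed in as the labels.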

  13. Example 1

  14. Example 2
      • Clean text fragment with translation

          Ant warschipe hire easkeð. Hweonene cumest tu fearlac deaðes munegunge. Ich cume he seið of helle.

          And Worship him asked, ‘From where come you, Fearlac, death’s reminder?’ ‘I come’, he said, ‘from hell.’

      • Text fragment in CESAC coreference file

          170.64 [2031 ant[2033 warschipe][2035 hire] easkeð. Hweonene[2042[2043 ] cumest [2045 tu][2047 fearlac[2049 deaðes munegunge]]] .]
          170.65 [2053[2054 Ich] cume[2057[2058 he] seið] of[2063 helle] .]

      • Text fragment in Penn-Treebank file

          ( (IP-MAT (CONJ ant)
                    (NP-SBJ (N warschipe))
                    (NP-OB1 (PRO hire))
                    (VBP easke+d)
                    (, .)
                    (CP-QUE-SPE (WADVP-1 (WADV Hweonene))
                      (IP-SUB-SPE (ADVP-DIR *T*-1)
                        (VBP cumest)
                        (NP-SBJ (CODE <Coref_Id="346"_Ref="345"_Type="CrossSpeech"_NdDist="10"_/>)
                                (PRO tu))
                        (NP-VOC (CODE <Coref_Id="50"_Ref="346"_Type="Identity"_NdDist="2"_/>)
                                (N fearlac)
                                (NP-PRN (CODE <Coref_Id="49"_Ref="50"_Type="Identity"_NdDist="2"_/>)
                                        (N$ dea+des)
                                        (N munegunge)))))
                    (E_S .))
            (ID CMSAWLES,170.64))

          ( (IP-MAT-SPE (NP-SBJ (CODE <Coref_Id="347"_Ref="346"_Type="CrossSpeech"_NdDist="9"_/>)
                                (PRO Ich))
                        (VBP cume)
                        (IP-MAT-PRN (NP-SBJ (CODE <Coref_Id="348"_Ref="347"_Type="CrossSpeech"_NdDist="4"_/>)
                                            (PRO he))
                                    (VBP sei+d))
                        (PP (P of)
                            (NP (NPR helle)))
                        (E_S .))
            (ID CMSAWLES,170.65))
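Since each source points to its nearest destination, a referent is recovered by following `Ref` links from node to node. The sketch below walks such a chain using the `Id`/`Ref` pairs visible in the Penn-Treebank fragment of slide 14; the function name `chain` and the plain-dictionary representation are assumptions made for this example.

```python
def chain(links, start):
    """Follow Ref links from `start` until a node with no outgoing link is reached."""
    path = [start]
    seen = {start}
    while path[-1] in links:
        nxt = links[path[-1]]
        if nxt in seen:  # guard against accidental cycles in the annotation
            break
        path.append(nxt)
        seen.add(nxt)
    return path


# Id -> Ref pairs taken from the CODE nodes in the slide 14 fragment.
links = {"346": "345", "50": "346", "49": "50", "347": "346", "348": "347"}

print(chain(links, "348"))  # he -> Ich -> tu -> node 345
print(chain(links, "49"))   # deaðes munegunge -> fearlac -> tu -> node 345
```

Both chains end at node 345, which lies outside the fragment shown, so `he`, `Ich`, `tu` and `fearlac` all resolve to the same earlier mention.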

  15. Summary and conclusion
      • Annotation program CESAC
        - Input: standard Penn-Treebank
        - Output: relatively easy to analyse
        - Inter-rater agreement measured
      • Operationalizing IS
        - 4 coreference types
        - As objective as possible: facts vs. interpretations
      • Plans
        - Fixed set of coreference types
        - Larger corpus of coreferenced texts

  16. Thank you for your attention!
