Off-line (and On-line) Text Analysis for Computational Lexicography

Off-line (and On-line) Text Analysis for Computational Lexicography. Hannah Kermes, Algorithmische Syntax, 21.12.2004. Motivation: maintenance of consistency and completeness within lexica; computer-assisted methods; lexical engineering; scalable lexicographic work process.

Presentation Transcript


  1. Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes Algorithmische Syntax 21.12.2004

  2. Motivation • maintenance of consistency and completeness within lexica • computer-assisted methods • lexical engineering • scalable lexicographic work process • processes reproducible on large amounts of text • statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research • full parsers are not robust enough • need for analysis tools that meet the specific needs of corpus linguistic studies

  3. Information needed • syntactic information • subcategorization patterns • semantic information • selectional preferences, collocations • synonyms • multi-word units • lexical classes • morphological information • case, number, gender • compounding and derivation

  4. Requirements for the tool • it has to work on unrestricted text • shortcomings in the grammar should not lead to a complete failure to parse • no manual checking should be required • should provide a clearly defined interface • annotation should follow linguistic standards

  5. Requirements for the annotation • head lemma • morpho-syntactic information • lexical-semantic information • structural and textual information • hierarchical representation

  6. A corpus linguistic approach

  7. Hypothesis The better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time consuming and difficult the grammar development, and the slower the parsing process.

  8. Three different dimensions • type of grammar • symbolic grammar • probabilistic grammar • type of grammar development • hand-written grammar • learning methods • depth of analysis • analysis on token level only • full parsing • partial parsing

  9. Classical chunk definition • Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template • Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head

  10. Problems for extraction • Kübler and Hinrichs (2001): work on chunk parsing has "focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances."

  11. An example • Chunk analysis: [PC mit kleinen ], [PC über die Köpfe ] [NC der Apostel ] [NC gesetzten Flammen ] (gloss: with small, above the heads, the apostles, set flames) • Hierarchical analysis: [PP mit [NP [AP kleinen ], [AP über [NP die Köpfe [NP der Apostel ] ] gesetzten ] Flammen ] ] (gloss: with small, above the heads the apostles set flames) • 'with small flames set above the heads of the apostles'

  12. Problems for extraction • four NCs instead of only one NP • AN-pair: • gesetzten + Flammen • kleine + Flammen • NN-pair Köpfe + Apostel needs agreement information • VN-pair setzen + Flammen needs information about the deverbal character of gesetzten • a more complex analysis is needed • PCs and NCs need to be combined
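
The AN-pairs listed above only become extractable once the adjective phrases are attached to the noun phrase they modify. The following is a minimal sketch of that idea, not the extraction code used with YAC. The nested-tuple tree, the leaf labels (`P`, `DET`, `ADJ`, `N`) and the function names are hypothetical and chosen for illustration; word forms are used instead of lemmas.

```python
# Hypothetical tree encoding: (label, word) for leaves, (label, [children]) otherwise.
TREE = ("PP", [
    ("P", "mit"),
    ("NP", [
        ("AP", [("ADJ", "kleinen")]),
        ("AP", [
            ("P", "über"),
            ("NP", [("DET", "die"), ("N", "Köpfe"),
                    ("NP", [("DET", "der"), ("N", "Apostel")])]),
            ("ADJ", "gesetzten"),
        ]),
        ("N", "Flammen"),
    ]),
])


def head(node, leaf_label):
    """Return the last direct leaf child with the given label, or None."""
    _, children = node
    for label, content in reversed(children):
        if label == leaf_label and isinstance(content, str):
            return content
    return None


def an_pairs(node):
    """Collect (adjective, noun) pairs for every AP directly inside an NP."""
    label, content = node
    if isinstance(content, str):  # leaf: nothing to extract
        return []
    pairs = []
    if label == "NP":
        noun = head(node, "N")
        for child in content:
            if child[0] == "AP" and not isinstance(child[1], str):
                adjective = head(child, "ADJ")
                if adjective and noun:
                    pairs.append((adjective, noun))
    for child in content:  # descend into embedded phrases
        pairs.extend(an_pairs(child))
    return pairs


print(an_pairs(TREE))
```

On the flat chunk analysis the same lookup would find only gesetzten + Flammen, because kleinen sits in a separate PC with no link to the noun.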

  13. Simple solution • PP → PC (PC|NC)* • theoretical motivation? • rule covers this particular example, other examples might need additional rules • rule is vague and largely underspecified • not very reliable • internal structure is mainly left opaque
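
To make the vagueness of this rule concrete, here is a small sketch that applies PP → PC (PC|NC)* to a sequence of chunk labels. The input format (one label per chunk) and the function name are assumptions for illustration only.

```python
def find_simple_pps(labels):
    """Return (start, end) index spans that match PP -> PC (PC|NC)*."""
    spans = []
    i = 0
    while i < len(labels):
        if labels[i] == "PC":
            j = i + 1
            # Greedily absorb every following PC or NC, whatever its function.
            while j < len(labels) and labels[j] in ("PC", "NC"):
                j += 1
            spans.append((i, j))
            i = j
        else:
            i += 1
    return spans


# The four chunks of the example collapse into one span.
print(find_simple_pps(["PC", "PC", "NC", "NC"]))
# Any NC that merely follows a PC is absorbed as well.
print(find_simple_pps(["VC", "PC", "NC", "VC", "NC"]))
```

The rule returns only a span. It says nothing about which chunk modifies which, and it absorbs any noun chunk that happens to follow a prepositional chunk, which is why the slide calls it underspecified and unreliable.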

  14. Complex solution • NP → NC NCgen • PP → preposition NP • AP → PP adjective • NP → AP* noun

  15. Complex solution • solution for this particular example only • large number of rules needed • rules have to be repeated for every instance of a complex phrase • in order to support extractions, the classic chunk concept has to be extended

  16. Conclusion: Chunking, Full Parsing, YAC • Full parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output • Chunking: flat non-recursive structures; simple grammar; robust and efficient; non-ambiguous output

  17. A recursive chunker for unrestricted German text • recursive chunker for unrestricted German text • fully automatic analysis • main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora

  18. General aspects • based on a symbolic regular expression grammar • grammar rules written in CQP • basis: tokenization, PoS-tagging, lemmatization, agreement information (tools named on the slide: TreeTagger, IMSLex)

  19. A typical chunker • robust – works on unrestricted text • works fully automatically • does not provide full but partial analysis of text • no highly ambiguous attachment decisions are made

  20. YAC goes beyond • extends the chunk definition of Abney • recursive embedding • post-head embedding • provides additional information about annotated chunks • head lemma • agreement information • lexical-semantic and structural properties

  21. Extended chunk definition A chunk is a continuous part of an intra-clausal constituent including recursion and pre-head as well as post-head modifiers but no PP-attachment, or sentential elements.

  22. Technical Framework [diagram labels: Perl-Scripts, rule application, post-processing, lexicon, annotation of results, corpus, grammar rules]

  23. Output formats • CQP format, used for: • interactive grammar development • parsing • extraction • an XML format, used for: • hierarchy building • extraction • data exchange

  24. Advantages of the system • efficient work even with large corpora • modular query language • interactive grammar development • powerful post-processing of rules

  25. Linguistic coverage • Adverbial phrases (AdvP) • schön stark (beautifully strong) • daher (from there); irgendwoher (from anywhere) • heim (home); querfeldein (cross-country) • innen (inside); überall (everywhere) • "sehr bald" (very soon) • jetzt (now); damals (at that time)

  26. Linguistic coverage • Adjectival phrases (AP) • möglich (possible) • schreiend lila (screamingly purple) • rund zwei Meter hohe (around two meters high) • über die Köpfe der Apostel gesetzten (gloss: above the heads of the apostles set; 'set above the heads of the apostles')

  27. Linguistic coverage • Noun phrases (NP) • Oktober (October); er (he) • 4,9 Milliarden Euro (4.9 billion euros) • "Frankensteins Fluch" ("Frankenstein's curse") • kleine, über die Köpfe der Apostel gesetzten Flammen (gloss: small, above the heads of the apostles set flames; 'small flames set above the heads of the apostles')

  28. Linguistic coverage • Prepositional phrases (PP) • davon (thereof) • zwischen Basel und St. Moritz (between Basel and St. Moritz) • mit kleinen, über die Köpfe der Apostel gesetzten Flammen (gloss: with small, above the heads of the apostles set flames; 'with small flames set above the heads of the apostles')

  29. Linguistic coverage • Verbal complexes (VC) • gemunkelt (rumored) • muß gerechnet werden (gloss: has counted to be; 'has to be counted') • zu bekommen (to get) • bekommen zu haben (gloss: gotten to have; 'to have gotten')

  30. Linguistic coverage • Clauses (CL) • … , daß selbst Ravel sich amüsiert hätte. … , that even Ravel himself enjoyed had. '… , that even Ravel would have enjoyed.' • … , die man in der griechischen Tragödie findet. … , which one in the Greek tragedy finds. '… , which one finds in the Greek tragedy.'

  31. Linguistic coverage • Clauses (CL) • … , Instrumente selbst zu bauen. … , instruments oneself to build. ' … , to build instruments oneself.' • … , um einen Kaffee zu trinken. … , in order a coffee to drink. '… , in order to drink a coffee.'

  32. Feature annotation • head lemma • morpho-syntactic information • lexical-semantic properties

  33. Feature annotation

  34. Head lemma • lemma attribute at the head position • normally a single token • multi-word proper nouns have a multi-token head lemma • a separated verbal prefix is included in the head lemma of the VC: kommt … an → ankommen (arrive) • head lemma of PP: preposition:noun
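
The two composite cases above can be sketched in a few lines. The function names and arguments are hypothetical, and the sketch assumes that lemmatization has already supplied the verb, preposition and noun lemmas.

```python
def vc_head_lemma(verb_lemma, separated_prefix=None):
    """Head lemma of a verbal complex; a separated prefix is re-attached."""
    return (separated_prefix or "") + verb_lemma


def pp_head_lemma(preposition_lemma, noun_lemma):
    """Head lemma of a PP in the form preposition:noun."""
    return f"{preposition_lemma}:{noun_lemma}"


print(vc_head_lemma("kommen", "an"))   # kommt ... an
print(vc_head_lemma("munkeln"))        # no separated prefix
print(pp_head_lemma("mit", "Flamme"))
```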

  35. Morpho-syntactic information • intersection of the morpho-syntactic information of relevant elements • invariant elements are not considered • no guessing involved to solve ambiguities

  36. Agreement Information den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| <ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Gen:F:Pl:Def|Gen:F:Pl:Ind|Gen:F:Sg:Def|Gen:F:Sg:Ind|Gen:M:Pl:Def|Gen:M:Pl:Ind|Gen:M:Sg:Def|Gen:M:Sg:Ind|Gen:M:Sg:Nil|Gen:N:Pl:Def|Gen:N:Pl:Ind|Gen:N:Sg:Def|Gen:N:Sg:Ind|Gen:N:Sg:Nil|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr> <nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>

  37. Agreement Information den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| <ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr> <nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>

  38. Agreement Information den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| <ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr> <nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>

  39. Agreement Information <np_agr |Akk:M:Sg:Def|> den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def| <ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr> <nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr> </np_agr> <np_agr |Akk:M:Sg:Def|>
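
The step from the individual feature sets to `|Akk:M:Sg:Def|` on the NP is a plain set intersection. Below is a minimal sketch of that step, assuming the `|`-separated value format shown on the slides. The function names are illustrative, and the value for vierten is an excerpt of the slide's much longer list, not the full set. Invariant elements are modelled here as `None`, which is an assumption of this sketch.

```python
def parse_agr(value):
    """Split a value such as '|Akk:M:Sg:Def|Dat:F:Pl:Def|' into a set."""
    return {feature for feature in value.split("|") if feature}


def intersect_agr(values):
    """Intersect the agreement sets of all elements that carry one.

    Elements without agreement information (None) are skipped, mirroring
    'invariant elements are not considered'.
    """
    sets = [parse_agr(v) for v in values if v is not None]
    if not sets:
        return set()
    return set.intersection(*sets)


DEN = "|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|"
# Excerpt only: the slide lists many more readings for 'vierten'.
VIERTEN_EXCERPT = "|Akk:M:Sg:Def|Akk:M:Sg:Ind|Dat:M:Pl:Def|Gen:M:Sg:Def|"
PLATZ = ("|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|"
         "Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|")

print(intersect_agr([DEN, VIERTEN_EXCERPT, PLATZ]))
```

Because only the intersection is kept, no reading is guessed. If the sets share more than one reading, all of them stay on the phrase.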

  40. Lexical-semantic properties • important for parsing as well as for extraction • properties can be triggers for specific internal structures, functions, and usages • properties inherent in the corpus • PoS-tags: Johann/NE Sebastian/NE Bach/NE • text markers: "Wilhelm/NE Meisters/NN Lehrjahre/NN"

  41. Lexical-semantic properties • properties determined by external knowledge sources (lexica, ontologies, word lists) • locality: hier (here); dort (there); Stuttgart • temporality: Jahr (year); damals (at that time) • derivation: gesetzten (set), deverbal adjective

  42. Lexical-semantic properties • structural information • complex embeddings • [AP [PP über die Köpfe der Apostel ] gesetzten ] (gloss: above the heads of the apostles set; 'set above the heads of the apostles') • [AP [NP der "Inkatha"-Partei ] angehörenden ] (gloss: to the Inkatha-party belonging; 'belonging to the Inkatha-party')

  43. Some properties of NPs

  44. Other lexical-semantic properties • VC with separated prefix: pref, e.g. Er kommt an (he arrives) • PP with contracted preposition and article: fus, e.g. am Bahnhof (at the station) • complex APs embedding PPs: pp, e.g. über die Köpfe der Apostel gesetzten (gloss: above the heads of the apostles set; 'set above the heads of the apostles') • AP with deverbal adjectives: vder

  45. Chunking process [diagram labels: First Level, Second Level, Third Level, Corpus (three times), Lexicon]

  46. First level • basic (non-recursive) chunks • chunks with specific internal structure • Ende September (end of September) • Jahre später (years later) • 21. Juli 2003 • Johann Sebastian Bach • lexical information is introduced • within the rules themselves • within the Perl-scripts
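
A first-level rule for a chunk with specific internal structure, such as the date 21. Juli 2003, can be sketched as a pattern over tokens plus a small word list. This is an illustration in Python, not one of YAC's CQP rules. It assumes that the tokenizer keeps the ordinal "21." as a single token, and the month list stands in for the lexical information mentioned above.

```python
import re

# Lexical information: German month names.
MONTHS = {
    "Januar", "Februar", "März", "April", "Mai", "Juni",
    "Juli", "August", "September", "Oktober", "November", "Dezember",
}
DAY = re.compile(r"\d{1,2}\.")   # ordinal day, e.g. "21."
YEAR = re.compile(r"\d{4}")      # four-digit year


def find_date_chunks(tokens):
    """Return (start, end) token spans of the form <day>. <month> <year>."""
    spans = []
    i = 0
    while i <= len(tokens) - 3:
        day, month, year = tokens[i:i + 3]
        if DAY.fullmatch(day) and month in MONTHS and YEAR.fullmatch(year):
            spans.append((i, i + 3))
            i += 3
        else:
            i += 1
    return spans


print(find_date_chunks(["am", "21.", "Juli", "2003", "in", "Stuttgart"]))
```

Keeping such specific patterns in their own level means the general second-level rules never need to know what a date looks like.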

  47. Advantages • specific rules do not interact with main parsing rules • additional (e.g. domain specific) rules can be included easily • main parsing rules can be kept simple • number of main parsing rules can be kept small

  48. Second level • main parsing level • relatively simple and general rules • AP → AdvP? (PP|NP)* AC • NP → Determiner? Cardinal? AP* NC • PP → Preposition (NP|AdvP) • complex (recursive) structures are built in several iterations
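
The following sketch shows how these three general rules can build a recursive structure through repeated application. It is written in Python rather than CQP and is not YAC's implementation. The abbreviated labels (`Det`, `Card`, `Prep`), the tuple encoding and the control strategy (apply the leftmost matching rule, then start over) are assumptions made for this illustration.

```python
import re

# The three second-level rules, written over space-terminated labels.
RULES = [
    ("AP", r"(?:AdvP )?(?:(?:PP|NP) )*AC "),
    ("NP", r"(?:Det )?(?:Card )?(?:AP )*NC "),
    ("PP", r"Prep (?:NP|AdvP) "),
]
# The lookbehind forces every match to start at a label boundary.
COMPILED = [(lhs, re.compile(r"(?<![^ ])" + rhs)) for lhs, rhs in RULES]


def reduce_once(nodes):
    """Apply the leftmost matching rule once; return None if no rule matches."""
    text = "".join(label + " " for label, _ in nodes)
    best = None
    for lhs, pattern in COMPILED:
        match = pattern.search(text)
        if match and (best is None or match.start() < best[1].start()):
            best = (lhs, match)
    if best is None:
        return None
    lhs, match = best
    start = text[:match.start()].count(" ")  # character offset -> node index
    end = text[:match.end()].count(" ")
    return nodes[:start] + [(lhs, nodes[start:end])] + nodes[end:]


def parse(nodes):
    """Iterate until no rule applies; return the nodes and the iteration count."""
    iterations = 0
    while True:
        reduced = reduce_once(nodes)
        if reduced is None:
            return nodes, iterations
        nodes, iterations = reduced, iterations + 1


def bracket(node):
    """Render a node in the bracket notation used on the slides."""
    label, content = node
    if isinstance(content, str):
        return content
    return "[" + label + " " + " ".join(bracket(c) for c in content) + " ]"


TOKENS = [("Det", "eine"), ("Prep", "für"), ("Det", "den"),
          ("NC", "Anwender"), ("AC", "verständliche"), ("NC", "Sprache")]
result, iterations = parse(TOKENS)
print(iterations, " ".join(bracket(n) for n in result))
```

For eine für den Anwender verständliche Sprache the sketch needs four iterations (inner NP, PP, AP, outer NP), and none of the rules mentions the embedding explicitly. The leftmost-first strategy happens to suit this head-final example; other strategies may be needed for other constructions.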

  49. Rule blocks

  50. Complexity of phrases • complexity of phrases is achieved by the embedding of complex structures rather than by complex rules • [NP eine [AP verständliche ] Sprache ] (an understandable language) • [NP eine [AP für den Anwender verständliche ] Sprache ] (gloss: a for the user understandable language; 'a language understandable for the user')
