
Off-line (and On-line) Text Analysis for Computational Lexicography






Presentation Transcript


  1. Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

  2. Introduction • Motivation • computational lexicography • corpus linguistics • Approaches to text analysis • symbolic vs. probabilistic approaches • hand-written vs. learned • on-line queries vs. chunking vs. full parsing • Requirements • for the extraction tool • for the corpus annotation • classical chunking

  3. Motivation • maintenance of consistency and completeness within lexica • computer-assisted methods • lexical engineering • scalable lexicographic work process • processes reproducible on large amounts of text • statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus-linguistic research • full parsers are not robust enough • need for analysis tools that meet the specific needs of corpus-linguistic studies

  4. Dictionaries • for human use • printed monolingual dictionaries • electronic dictionaries • machine-readable dictionaries for NLP applications

  5. Printed monolingual dictionaries • intend to cover most important semantic and syntactic aspects • maintenance of consistency and completeness is a problem: • information is missing • entries are incomplete • information is not consistent • language changes have to be covered

  6. Electronic dictionaries • enormous amounts of information can be stored in a compact format • search engines allow for easy and fast access to desired data • users can choose how much and what kind of information they are interested in • reference corpus as additional knowledge source

  7. Machine readable dictionaries • NLP applications need detailed and consistent information about words • detailed morphological information • subcategorization frames of verbs, adjectives, nouns • specific syntactic information • selectional preferences • collocations • idiomatic usage

  8. Information needed • syntactic information • subcategorization patterns • semantic information • selectional preferences, collocations • synonyms • multi-word units • lexical classes • morphological information • case, number, gender • compounding and derivation

  9. Requirements for the tool • it has to work on unrestricted text • shortcomings in the grammar should not lead to a complete failure to parse • no manual checking should be required • should provide a clearly defined interface • annotation should follow linguistic standards

  10. Requirements for the annotation • head lemma • morpho-syntactic information • lexical-semantic information • structural and textual information • hierarchical representation

  11. A corpus linguistic approach

  12. Hypothesis The better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time consuming and difficult the grammar development, and the slower the parsing process.

  13. Three different dimensions • type of grammar • symbolic grammar • probabilistic grammar • type of grammar development • hand-written grammar • learning methods • depth of analysis • analysis on token level only • full parsing • partial parsing

  14. Symbolic approaches • precise rules can be formulated • lexical knowledge can be included • results can be predicted and controlled • sometimes not sufficient to solve ambiguities • only phenomena which are explicit in the grammar can be dealt with

  15. Unification-based grammars • usually complex grammars • model the hierarchical structure of language • handle attachment ambiguities • determine relations among constituents and their grammatical function • extensive use of lexical information • richness and complexity of rules not only resolve ambiguities but also produce them • usually a large number of possible analyses
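The agreement checking at the heart of unification-based grammars can be illustrated with a toy feature-structure unifier. This is my own minimal sketch, not the formalism discussed in the talk; the feature names (`case`, `num`, `gend`) are illustrative assumptions.

```python
def unify(fs1, fs2):
    """Unify two flat feature structures (dicts); return None on a clash."""
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat in result and result[feat] != val:
            return None  # feature clash: unification fails
        result[feat] = val
    return result

# German NP agreement: determiner and noun must agree in case/number/gender
det  = {"case": "nom", "num": "sg", "gend": "fem"}
noun = {"case": "nom", "num": "sg", "gend": "fem"}
agree = unify(det, noun)  # succeeds: all shared features are compatible
```

The same mechanism explains the slide's caveat: every added feature that can clash also multiplies the ways an analysis can fail or fork.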

  16. Context-free Grammars (CFG) • formal grammars consisting of a set of recursive rewriting rules • small and modular grammar • minimal interaction among rules • parsing process usually fast • covers only basic aspects of language • robustness rules are used to overcome shortcomings in the grammar
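How small and fast a CFG-based analysis can be is easy to see with a toy CYK recognizer. The grammar and lexicon below are invented for illustration (not the grammar described in the slides) and cover only known tokens.

```python
from itertools import product

# toy CFG in Chomsky normal form: binary rules and lexical entries
BINARY = {("NP", "VP"): "S", ("DET", "N"): "NP", ("V", "NP"): "VP"}
LEXICAL = {"die": "DET", "Katze": "N", "Maus": "N", "jagt": "V"}

def cyk(tokens):
    """Return True iff the token sequence derives the start symbol S."""
    n = len(tokens)
    # chart[i][j] = set of nonterminals spanning tokens[i..j]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        chart[i][i].add(LEXICAL[tok])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for left, right in product(chart[i][k], chart[k + 1][j]):
                    if (left, right) in BINARY:
                        chart[i][j].add(BINARY[(left, right)])
    return "S" in chart[0][n - 1]
```

The minimal rule interaction the slide mentions shows up directly: each cell is filled independently from smaller spans, so the grammar stays modular.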

  17. Probabilistic approaches • supervised or unsupervised training of rules • all possible analyses are produced • no need for comprehensive lexical or linguistic knowledge • rules can be left underspecified • depend on the training corpus • highly frequent phenomena are preferred over low-frequency phenomena

  18. Probabilistic context-free grammar • CFG rules enriched by probability • make use of underspecification • not as fast as CFG • special case: head lexicalized context-free grammar • unsupervised • grammar rules are indexed by the lemma of the syntactic head • extraction is performed on the rule set rather than on the annotated corpus

  19. Hand-written rules • good control of the rule system • negative evidence can be taken into account • depends heavily on the expertise of the grammar writer

  20. Learning grammar rules • infer grammars from text corpora • extensional syntactic descriptions (annotations) are turned into intensional descriptions (rules) • optimal or suboptimal training data • new resources in the form of text corpora can be exploited • more or less independent of the knowledge of the grammar developer • depends heavily on the learning corpus • needs an annotated, well-balanced corpus
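Turning extensional descriptions into intensional ones can be sketched as reading rule counts off annotated trees. The nested-tuple tree encoding and the example sentence are assumptions made for illustration only.

```python
from collections import Counter

def extract_rules(tree, counts):
    """Read CFG rules off a bracketed tree given as nested tuples
    (label, child, ...); leaf children are plain token strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1          # one observed rule per local tree
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, counts)

# a single annotated tree from a (hypothetical) learning corpus
tree = ("S",
        ("NP", ("DET", "die"), ("N", "Katze")),
        ("VP", ("V", "schläft")))
counts = Counter()
extract_rules(tree, counts)
```

Run over a whole treebank, the counter directly reflects the slide's caveat: whatever is frequent in the learning corpus dominates the resulting rule set.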

  21. Memory-based learning • special case of learning • most prominent is data-oriented parsing (DOP) • corpus fragments are stored and, as such, replace the grammar • language generation and analysis are performed by combining the memorized fragments • needs a structurally annotated corpus • the training corpus has great impact on the performance of the system • highly sensitive to suboptimal data • needs large storage capacity

  22. Annotation on token level • usually a form of pattern matching • completely flexible • does not depend on previous syntactic analysis • easily adaptable to different text types • full syntactic analysis has to be performed by extraction queries • queries can become rather complex • often restricted to simple contexts
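Token-level annotation as pattern matching can be sketched with a regular expression over a string of word/TAG pairs. The encoding and the STTS-style tags are illustrative assumptions; real on-line query tools work on richer representations.

```python
import re

# tagged tokens encoded as one string of word/TAG pairs (STTS-style tags)
tagged = "die/ART kleinen/ADJA Flammen/NN brennen/VVFIN hell/ADJD ./$."

# query: a determiner, any number of attributive adjectives, then a noun
NP_PATTERN = re.compile(r"\S+/ART (?:\S+/ADJA )*\S+/NN")

matches = NP_PATTERN.findall(tagged)
```

Even this simple noun-phrase query shows the trade-off the slide names: the pattern is flexible and needs no prior syntactic analysis, but every extraction query must redo the syntactic work itself, and complex contexts quickly make such patterns unwieldy.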

  23. Full Parsing • provides rich and detailed information about structures, relations and functions • extraction queries simply have to collect the annotated information • slow parsing speed • lack of robustness • depends heavily on prerequisite lexical information • ambiguous output

  24. Chunking • relatively simple grammar rules • no need for extensive linguistic and lexicographic information • robust • usually non-hierarchical and non-recursive structures • annotated structures are simple and convey less information

  25. Classical chunk definition • Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template • Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head
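Abney's template idea, function words gathered around a content-word head, can be sketched as a greedy chunker over tagged tokens. This is a hypothetical toy for illustration, not YAC or any of the systems discussed; the tag sets are illustrative.

```python
# STTS-like tag classes for a minimal Abney-style chunker: a chunk runs
# from the first function word to its head, with no post-head material
FUNCTION = {"ART", "APPR"}   # determiners, prepositions
PRE_HEAD = {"ADJA", "CARD"}  # attributive adjectives, numerals
HEAD     = {"NN", "NE"}      # common and proper nouns

def chunk(tagged):
    """tagged: list of (word, tag) pairs; returns list of word chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in FUNCTION:
            current = [word]             # a function word opens a chunk
        elif tag in PRE_HEAD and current:
            current.append(word)         # pre-head modifier
        elif tag in HEAD:
            current.append(word)
            chunks.append(current)       # the head closes the chunk
            current = []
        else:
            current = []                 # anything else breaks the chunk
    return chunks
```

Note how the 1996 definition is built in: the chunk extends from the beginning of the constituent to its head, and everything after the head is cut off.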

  26. State-of-the-art systems • CASS parser • finite-state cascades • flat, non-recursive structures • small lexicon (tag-fixes) • information about the head is given as an attribute • Conexor • symbolic constraint grammar parser • full-fledged grammar for English (ENGCG) • German: • simple, non-recursive structure • no lexical information available • head lemma indicated by a special tag

  27. State-of-the-art systems • KaRoParse • top-down bottom-up parser • includes recursion • internal structure is flat and non-hierarchical • no agreement or lexical information • Schiehlen's chunker • symbolic context free grammar • recursion • no head lemma or lexical-semantic information • needs optimally tokenized text (including MWL recognition)

  28. State-of-the-art systems • Chunkie • uses the TnT tagger to assign tree fragments to sequences of PoS tags • recursion in pre-head position (maximal depth of three) • head lemma information, yet no agreement or lexical information • Cascaded Markov Models • stochastic context-free grammar rules • several layers, each layer serving as input to the next • hierarchical phrases, including complex recursion • head lemma information, yet no agreement or lexical information

  29. Problems for extraction • Kübler and Hinrichs (2001): research has "focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analyses can be combined into larger structures for complete utterances."

  30. An example • chunker output: [PC mit kleinen], [PC über die Köpfe] [NC der Apostel] [NC gesetzten Flammen] • full analysis: [PP mit [NP [AP kleinen], [AP über [NP die Köpfe [NP der Apostel]] gesetzten] Flammen]] • gloss: mit 'with', kleinen 'small', über die Köpfe der Apostel 'above the heads of the apostles', gesetzten 'set', Flammen 'flames' • 'with small flames set above the heads of the apostles'

  31. Problems for extraction • four NCs instead of only one NP • AN-pair: • gesetzten + Flammen • kleine + Flammen • NN-pair Köpfe + Apostel needs agreement information • VN-pair setzen + Flammen needs information about the deverbal character of gesetzten • a more complex analysis is needed • PCs and NCs need to be combined

  32. Simple solution PP → PC (PC|NC)* • theoretical motivation? • rule covers this particular example; other examples might need additional rules • rule is vague and largely underspecified • not very reliable • internal structure is mainly left opaque
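Read as a regular expression over a sequence of chunk labels, the rule's vagueness is easy to see: it greedily swallows every following PC or NC regardless of whether they belong to the PP. The label-string encoding is a hypothetical illustration.

```python
import re

# the slide's robustness rule PP -> PC (PC|NC)* as a regex over a
# (hypothetical) string of chunk labels, one label per chunk
labels = "PC PC NC NC VC"
m = re.match(r"PC( (?:PC|NC))*", labels)
pp_span = m.group(0)  # everything up to, but not including, the VC
```

For the example sentence this happens to give the right span, but the match carries no internal structure at all, which is exactly the objection raised on the slide.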

  33. Complex solution • NP → NC NCgen • PP → preposition NP • AP → PP adjective • NP → AP* noun

  34. Complex solution • solution for this particular example only • large number of rules needed • rules have to be repeated for every instance of a complex phrase • in order to support extractions, the classic chunk concept has to be extended

  35. Chunking vs. Full Parsing vs. YAC • full parsing: full hierarchical representation • complex grammar • not very robust • ambiguous output • chunking: flat non-recursive structures • simple grammar • robust and efficient • non-ambiguous output

  36. Conclusion • recursive chunking is a workable compromise between depth of analysis and robustness • extracted data show correlations between • collocational preferences • subcategorization frames • semantic classes of adjectives • and, to a certain extent, distributional preferences

  37. General Concept • a recursive chunker for unrestricted German text • technical framework • CWB • CQP • output formats • advantages of the architecture • general framework of YAC • linguistic coverage • feature annotation • chunking process

  38. A recursive chunker for unrestricted German text • recursive chunker for unrestricted German text • fully automatic analysis • main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora

  39. General aspects • based on a symbolic regular expression grammar • grammar rules written in CQP • basis: • tokenization • PoS-tagging • lemmatization • agreement information • resources: TreeTagger, IMSLex

  40. A typical chunker • robust – works on unrestricted text • works fully automatically • does not provide full but partial analysis of text • no highly ambiguous attachment decisions are made

  41. YAC goes beyond • extends the chunk definition of Abney • recursive embedding • post-head embedding • provides additional information about annotated chunks • head lemma • agreement information • lexical-semantic and structural properties

  42. Extended chunk definition A chunk is a continuous part of an intra-clausal constituent, including recursion and pre-head as well as post-head modifiers, but no PP-attachment or sentential elements.

  43. Technical Framework [architecture diagram: Perl scripts drive rule application, post-processing, and annotation of results, drawing on the corpus, the grammar rules, and a lexicon]

  44. Technical framework - CQP • regular expression matching on token and annotation strings • tests for membership in user specific word lists • feature set operations • constraints to specify dependencies

  45. Perl-Scripts • invocation of CQP • processing of the results • annotation of the results into the corpus

  46. Postprocessing • values can be checked • values can be changed • values can be compared • range of structures can be changed

  47. Output formats • CQP format, used for: • interactive grammar development • parsing • extraction • an XML format, used for: • hierarchy building • extraction • data exchange

  48. Advantages of the system • efficient work even with large corpora • modular query language • interactive grammar development • powerful post-processing of rules

  49. Linguistic coverage • Adverbial phrases (AdvP) • schön stark (beautifully strong) • daher (from there); irgendwoher (from anywhere) • heim (home); querfeldein (cross-country) • innen (inside); überall (everywhere) • sehr bald (very soon) • jetzt (now); damals (at that time)

  50. Linguistic coverage • Adjectival phrases (AP) • möglich (possible) • schreiend lila (screamingly purple) • rund zwei Meter hohe (around two meters high) • über die Köpfe der Apostel gesetzten ('set above the heads of the apostles')
