
Issues in Computational Linguistics: Grammar Engineering






Presentation Transcript


  1. Issues in Computational Linguistics: Grammar Engineering. Dick Crouch and Tracy King

  2. Outline • What is a deep grammar? • How to engineer them: • robustness • integrating shallow resources • ambiguity • writing efficient grammars • real world data

  3. What is a shallow grammar? • often trained automatically from marked-up corpora • part-of-speech tagging • chunking • trees

  4. POS tagging and Chunking • Part of speech tagging: I/PRP saw/VBD her/PRP duck/VB ./PUNCT vs. I/PRP saw/VBD her/PRP$ duck/NN ./PUNCT • Chunking: • general chunking [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]. (Abney) • NP chunking [NP President Clinton] visited [NP the Hermitage] in [NP Leningrad]
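A minimal sketch of the tagging step, using NLTK's off-the-shelf tagger as a stand-in (the slides do not name a tool, and NLTK resource names vary across versions):

```python
# POS tagging with an off-the-shelf tagger. Note the PRP vs. PRP$ and
# VB vs. NN ambiguity on "her duck" from the slide's example.
import nltk

# Resource names differ across NLTK versions; these are the classic ones.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I saw her duck.")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('saw', 'VBD'), ('her', 'PRP$'), ('duck', 'NN'), ('.', '.')]
```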

  5. Treebank grammars • Phrase structure tree (c-structure) • Annotations for heads, grammatical functions • [Figure: Collins parser output]

  6. Deep grammars • Provide detailed syntactic/semantic analyses • LFG (ParGram), HPSG (LinGO, Matrix) • Grammatical functions, tense, number, etc. Mary wants to leave. subj(want~1,Mary~3) comp(want~1,leave~2) subj(leave~2,Mary~3) tense(leave~2,present) • Usually manually constructed • linguistically motivated rules

  7. Why would you want one? • Meaning-sensitive applications • overkill for many NLP applications • Applications which use shallow methods for English may not be able to for "free" word order languages • can read many functions off of trees in English: SUBJ: NP sister to VP [S [NP Mary] [VP left]] OBJ: first NP sister to V [S [NP Mary] [VP saw [NP John]]] • need other information in German, Japanese, etc.
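The configurational reading-off of SUBJ and OBJ can be made concrete with a short sketch over nltk.Tree (a stand-in representation, not from the slides):

```python
# Reading grammatical functions off an English c-structure, per the slide:
# SUBJ = NP sister to VP under S; OBJ = first NP sister to V inside VP.
from nltk import Tree

def subj_obj(sent):
    subj = obj = None
    if sent.label() == "S":
        for child in sent:
            if isinstance(child, Tree) and child.label() == "NP":
                subj = " ".join(child.leaves())
            elif isinstance(child, Tree) and child.label() == "VP":
                for vp_child in child:
                    if isinstance(vp_child, Tree) and vp_child.label() == "NP":
                        obj = " ".join(vp_child.leaves())
                        break
    return subj, obj

tree = Tree.fromstring("(S (NP Mary) (VP (V saw) (NP John)))")
print(subj_obj(tree))  # ('Mary', 'John')
```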

  8. Deep analysis matters…if you care about the answer • Example: A delegation led by Vice President Philips, head of the chemical division, flew to Chicago a week after the incident. • Question: Who flew to Chicago? • Candidate answers: • division (closest noun), head (next closest), V.P. Philips (next): shallow but wrong • delegation (furthest away, but subject of flew): deep and right

  9. Applications of Language Engineering • [Figure: applications plotted by domain coverage (narrow to broad) against functionality (low to high), with shallow and deep technology frontiers. Examples range from manually-tagged keyword search, Microsoft Paperclip, and restricted dialogue, through post-search sifting (Google, Alta Vista, AskJeeves) and document base management, up to good translation, useful summary, natural dialogue, knowledge fusion, and autonomous knowledge filtering.]

  10. Traditional Problems • Time consuming and expensive to write • Not robust • want output for any input • Ambiguous • Slow • Other gating items for applications that need deep grammars

  11. Why deep analysis is difficult • Languages are hard to describe • Meaning depends on complex properties of words and sequences • Different languages rely on different properties • Errors and disfluencies • Languages are hard to compute • Expensive to recognize complex patterns • Sentences are ambiguous • Ambiguities multiply: explosion in time and space

  12. How to overcome this • Engineer the deep grammars • theoretical vs. practical • what is good enough • Integrate shallow techniques into deep grammars • Experience based on broad-coverage LFG grammars (ParGram project)

  13. Robustness: Sources of Brittleness • missing vocabulary • you can't list all the proper names in the world • missing constructions • there are many constructions theoretical linguistics rarely considers (e.g. dates, company names) • easy to miss even core constructions • ungrammatical input • real world text is not always perfect • sometimes it is really horrendous

  14. Real world Input • Other weak blue-chip issues included Chevron, which went down 2 to 64 7/8 in Big Board composite trading of 1.3 million shares; Goodyear Tire & Rubber, off 1 1/2 to 46 3/4, and American Express, down 3/4 to 37 1/4. (WSJ, section 13) • ``The croaker's done gone from the hook – (WSJ, section 13) • (SOLUTION 27000 20) Without tag P-248 the W7F3 fuse is located in the rear of the machine by the charge power supply (PL3 C14 item 15. (Eureka copier repair tip)

  15. Missing vocabulary • Build vocabulary from the output of shallow methods • fast • extensive • accurate • Finite-state morphologies • Part of speech taggers

  16. Finite State Morphologies • falls -> fall +Noun +Pl, fall +Verb +Pres +3sg • Mary -> Mary +Prop +Giv +Fem +Sg • vienne -> venir +SubjP +SG {+P1|+P3} +Verb • Build lexical entry on-the-fly from the morphological information • have canonicalized stem form • have significant grammatical information • do not have subcategorization
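As a stand-in for a real finite-state morphology, a toy lookup that returns (stem, tags) analyses like those above (the table and function name are hypothetical):

```python
# Toy stand-in for a finite-state morphology: each surface form maps to all
# of its analyses, a canonicalized stem plus grammatical tags. Note what is
# missing: no subcategorization information.
ANALYSES = {
    "falls": [("fall", ["+Noun", "+Pl"]),
              ("fall", ["+Verb", "+Pres", "+3sg"])],
    "Mary":  [("Mary", ["+Prop", "+Giv", "+Fem", "+Sg"])],
}

def analyze(surface):
    """Return all (stem, tags) analyses, or [] for unknown words."""
    return ANALYSES.get(surface, [])

print(analyze("falls"))
```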

  17. Building lexical entries • Lexical entries:
-unknown N @(COMMON-NOUN %stem).
+Noun N-SFX @(PERS 3).
+Pl N-NUM @(NUM pl).
• Rule: NOUN -> N N-SFX N-NUM.
• Templates:
COMMON-NOUN :: (^ PRED)='%stem' (^ NTYPE)=common
PERS(3) :: (^ PERS)=3
NUM(pl) :: (^ NUM)=pl
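A rough Python analogue of these templates (names and data structures are hypothetical): each morphological tag contributes f-structure features, and the stem fills PRED:

```python
# Rough analogue of the XLE templates above: map morphological tags to
# f-structure features, building the entry on the fly from an analysis.
TEMPLATES = {
    "+Noun": {"NTYPE": "common", "PERS": 3},  # @(COMMON-NOUN ...), @(PERS 3)
    "+Pl":   {"NUM": "pl"},                   # @(NUM pl)
}

def build_entry(stem, tags):
    fstructure = {"PRED": stem}
    for tag in tags:
        fstructure.update(TEMPLATES.get(tag, {}))
    return fstructure

print(build_entry("fall", ["+Noun", "+Pl"]))
# {'PRED': 'fall', 'NTYPE': 'common', 'PERS': 3, 'NUM': 'pl'}  (cf. slide 18)
```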

  18. Building lexical entries • C-structure for falls: [Noun [N fall] [N-SFX +Noun] [N-NUM +Pl]] • F-structure for falls: [ PRED 'fall', NTYPE common, PERS 3, NUM pl ]

  19. Guessing words • Use an FST guesser if the morphology doesn't know the word • Capitalized words can be proper nouns • Saakashvili -> Saakashvili +Noun +Proper +Guessed • -ed words can be past tense verbs or adjectives • fumped -> fump +Verb +Past +Guessed, fumped +Adj +Deverbal +Guessed • Languages with more morphology allow for better guessers
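A toy guesser implementing the slide's two heuristics (the patterns and tag names are illustrative only; a real FST guesser would be far richer):

```python
# Toy guesser: capitalized words may be proper nouns; -ed words may be past
# tense verbs or deverbal adjectives. Tags mirror the slide's examples.
import re

def guess(surface):
    analyses = []
    if surface[0].isupper():              # capitalized: proper noun?
        analyses.append((surface, ["+Noun", "+Proper", "+Guessed"]))
    if re.search(r"\wed$", surface):      # -ed: past verb or adjective?
        analyses.append((surface[:-2], ["+Verb", "+Past", "+Guessed"]))
        analyses.append((surface, ["+Adj", "+Deverbal", "+Guessed"]))
    return analyses

print(guess("Saakashvili"))  # proper-noun guess
print(guess("fumped"))       # fump +Verb +Past, fumped +Adj +Deverbal
```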

  20. Using the lexicons • Rank the lexical lookup • overt entry in lexicon • entry built from information from morphology • entry built from information from guesser • Use the most reliable information • Fall back only as necessary
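A sketch of the ranked lookup, with toy stand-ins for the three sources; the chain falls back only when a more reliable source has nothing to offer:

```python
# Ranked lexical lookup: prefer overt lexicon entries, then the morphology,
# then the guesser. All three sources here are toy stand-ins.
def lexicon_lookup(w):   # 1. overt, hand-written entries (most reliable)
    return {"appears": [("appear", ["+Verb", "+Pres", "+3sg"])]}.get(w, [])

def morph_lookup(w):     # 2. finite-state morphology
    return {"falls": [("fall", ["+Noun", "+Pl"])]}.get(w, [])

def guesser_lookup(w):   # 3. guesser (last resort)
    return [(w, ["+Noun", "+Proper", "+Guessed"])] if w[0].isupper() else []

def lookup(word):
    for source in (lexicon_lookup, morph_lookup, guesser_lookup):
        analyses = source(word)
        if analyses:                      # fall back only as necessary
            return analyses
    return []

print(lookup("Saakashvili"))  # only the guesser knows this one
```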

  21. Missing constructions • Even large hand-written grammars are not complete • new constructions, especially with new corpora • unusual constructions • Generally longer sentences fail • one error can destroy the parse • Build up as much as you can; stitch together the pieces

  22. Grammar engineering approach • First try to get a complete parse • If that fails, build up chunks that get complete parses • Have a fall back for things without even chunk parses • Link these chunks and fall backs together in a single structure
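One way to realize this strategy is a greedy longest-chunk cover with a per-token fall back. The parser stub below is hypothetical and only recognizes the chunk from the sample output on the next slide:

```python
# Greedy fragment fall back: try a full parse; if that fails, cover the
# input with the longest parsable chunks, then bare tokens, and link the
# pieces in a single FRAGMENTS structure.
GOOD_SPANS = {("the", "dog", "appears")}      # stub for the real parser

def parse(tokens):
    return ("S", list(tokens)) if tuple(tokens) in GOOD_SPANS else None

def fragment_parse(tokens):
    full = parse(tokens)
    if full is not None:
        return full
    chunks, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):   # longest chunk first
            sub = parse(tokens[i:j])
            if sub is not None:
                chunks.append(sub)
                i = j
                break
        else:
            chunks.append(("TOKEN", tokens[i]))  # fall back: bare token
            i += 1
    return ("FRAGMENTS", chunks)

print(fragment_parse(["the", "the", "dog", "appears", "."]))
# ('FRAGMENTS', [('TOKEN', 'the'), ('S', ['the', 'dog', 'appears']),
#                ('TOKEN', '.')])   (the slide's version also drops the period)
```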

  23. Fragment Chunks: Sample output • Input: the the dog appears. • Split into: • "token" the • sentence "the dog appears" • ignore the period

  24. C-structure

  25. F-structure

  26. Ungrammatical input • Real world text contains ungrammatical input • typos • run-ons • cut and paste errors • Deep grammars tend to cover only grammatical strings • Two strategies • robustness techniques: guesser/fragments • dispreferred rules for ungrammatical structures

  27. Rules for ungrammatical structures • Common errors can be coded in the rules • want to know that an error occurred (e.g., feature in f-structure) • Disprefer parses of ungrammatical structures • tools for grammar writer to rank rules • two+ pass system • standard rules • rules for known ungrammatical constructions • default fall back rules

  28. Sample ungrammatical structures • Mismatched subject-verb agreement:
Verb3Sg = { SUBJ PERS = 3  SUBJ NUM = sg | BadVAgr }
• Missing copula:
VPcop ==> { Vcop: ^=!
          | e: (^ PRED)='NullBe<(^ SUBJ)(^ XCOMP)>'  MissingCopularVerb }
          { NP: (^ XCOMP)=! | AP: (^ XCOMP)=! | … }
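In Python terms, the Verb3Sg disjunction amounts to accepting the mismatch but recording a mark that later ranking can disprefer (a sketch; the f-structure encoding is made up):

```python
# Analogue of the Verb3Sg disjunction above: accept the agreement mismatch,
# but record a BadVAgr mark so the parse can be dispreferred in ranking.
def check_verb3sg(fstructure, marks):
    subj = fstructure.get("SUBJ", {})
    if not (subj.get("PERS") == 3 and subj.get("NUM") == "sg"):
        marks.append("BadVAgr")   # parse survives, but is dispreferred

marks = []
# "They sleeps": 3rd person plural subject with a 3sg verb form.
check_verb3sg({"PRED": "sleep",
               "SUBJ": {"PRED": "pro", "PERS": 3, "NUM": "pl"}}, marks)
print(marks)  # ['BadVAgr']
```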

  29. Robustness summary • Integrate shallow methods • for lexical items • morphologies • guessers • Fall back techniques • for missing constructions • fragment grammar • dispreferred rules

  30. Ambiguity • Deep grammars are massively ambiguous • Example: 700 sentences from section 23 of the WSJ • average # of words: 19.6 • average # of optimal parses: 684 • for 1-10 word sentences: 3.8 • for 11-20 word sentences: 25.2 • for 50-60 word sentences: 12,888

  31. Managing Ambiguity • Use packing to parse and manipulate the ambiguities efficiently (more tomorrow) • Trim early with shallow markup • fewer parses to choose from • faster parse time • Choose most probable parse for applications that need a single input

  32. Shallow markup • Part of speech marking as filter: I saw her duck/VB. • accuracy of tagger (very good for English) • can use partial tagging (verbs and nouns) • Named entities: <company>Goldman, Sachs & Co.</company> bought IBM. • good for proper names and times • hard to parse internal structure • Fall back technique if fail • slows parsing • accuracy vs. speed
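A sketch of tag filtering with a fall back when the filter would remove every analysis (the tag-to-feature mapping is hypothetical):

```python
# POS markup as a filter over morphological analyses: "duck/VB" keeps only
# the verb reading. If filtering would remove every analysis, fall back to
# the unfiltered set, trading speed for robustness as the slide notes.
COMPATIBLE = {"VB": "+Verb", "NN": "+Noun"}   # hypothetical tag mapping

def filter_analyses(analyses, pos_tag):
    wanted = COMPATIBLE.get(pos_tag)
    kept = [a for a in analyses if wanted in a[1]]
    return kept or analyses                   # fall back if nothing survives

duck = [("duck", ["+Noun", "+Sg"]), ("duck", ["+Verb", "+Base"])]
print(filter_analyses(duck, "VB"))            # only the verb analysis survives
```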

  33. Example shallow markup: Named entities • Allow tokenizer to accept marked up input: parse {<person>Mr. Thejskt Thejs</person> arrived.} • Tokenized string (TB = token boundary): Mr. Thejskt Thejs TB +NEperson TB arrived TB . TB, alongside the standard tokenization Mr(TB). TB Thejskt TB Thejs • Add lexical entries and rules for NE tags

  34. Resulting C-structure

  35. Resulting F-structure

  36. Results for shallow markup (Kaplan and King 2003)

  37. Choosing the most probable parse • Applications may want one input • or at least just a handful • Use stochastic methods to choose • efficient (XLE English grammar: 5% of parse time) • Need training data • partially labelled data is ok: [NP-SBJ They] see [NP-OBJ the girl with the telescope]
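Stochastic selection is often a log-linear model over parse features (as in XLE); a minimal argmax sketch with invented features and weights:

```python
# Minimal log-linear disambiguation sketch: score each parse as a weighted
# sum of its feature counts and keep the argmax. Features and weights are
# made up; real systems train them on (partially) labelled data.
WEIGHTS = {"pp_attach_noun": -0.3, "pp_attach_verb": 0.7}

def score(features):
    return sum(WEIGHTS.get(f, 0.0) * n for f, n in features.items())

parses = [
    # "see the girl with the telescope": PP modifies the girl vs. the seeing
    ("see(girl(with(telescope)))", {"pp_attach_noun": 1}),
    ("see(girl)+with(telescope)",  {"pp_attach_verb": 1}),
]
best = max(parses, key=lambda p: score(p[1]))
print(best[0])
```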

  38. Run-time performance • Many deep grammars are slow • Techniques depend on the system • LFG: exploit the context-free backbone; ambiguity packing techniques • Speed vs. accuracy trade-off • remove/disprefer peripheral rules • remove fall backs for shallow markup

  39. Development expense • Grammar porting • Starter grammar • Induced grammar bootstrapping • How cheap are shallow grammars? • training data can be expensive to produce

  40. Grammar porting • Use an existing grammar as the base for a new language • Languages must be typologically similar • Japanese-Korean • Balkan languages • Lexical porting via bilingual dictionaries • Main work is in testing and evaluation

  41. Starter grammar • Provide basic rules and templates • including for robustness techniques • Grammar writer: • chooses among them • refines them • Grammar Matrix for HPSG

  42. Grammar induction • Induce a core grammar from a treebank • compile rule generalizations • threshold rare rules • hand augment with features and fallback techniques • Requires • induction program • existing resources (treebank)
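A minimal induction sketch over NLTK's Penn Treebank sample: read off productions, count them, and threshold rare rules (the threshold value is arbitrary):

```python
# Induce a core CFG from a treebank: collect productions, count them, and
# drop rules below a frequency threshold.
from collections import Counter
import nltk

nltk.download("treebank", quiet=True)

counts = Counter(
    prod
    for tree in nltk.corpus.treebank.parsed_sents()
    for prod in tree.productions()
)
THRESHOLD = 10
core = [rule for rule, n in counts.items() if n >= THRESHOLD]
print(f"{len(counts)} rule types observed, {len(core)} kept")
```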

  43. Conclusions • Grammar engineering makes deep grammars feasible • robustness techniques • integration of shallow methods • Many current applications can use shallow grammars • Fast, accurate, broad-coverage deep grammars enable new applications
