Issues in Computational Linguistics: Grammar Engineering
Explore deep and shallow grammars, POS tagging, Chunking, Treebank grammars, Deep analysis benefits, and overcoming challenges in computational linguistics. Learn about practical applications, challenges, and solutions.
Issues in Computational Linguistics: Grammar Engineering • Dick Crouch and Tracy King
Outline • What is a deep grammar? • How to engineer them: • robustness • integrating shallow resources • ambiguity • writing efficient grammars • real world data
What is a shallow grammar? • often trained automatically from marked up corpora • part of speech tagging • chunking • trees
POS tagging and Chunking • Part of speech tagging: • I/PRP saw/VBD her/PRP duck/VB ./PUNCT • I/PRP saw/VBD her/PRP$ duck/NN ./PUNCT • Chunking: • general chunking: [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]. (Abney) • NP chunking: [NP President Clinton] visited [NP the Hermitage] in [NP Leningrad]
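The two taggings of the example sentence differ in only two tokens. A minimal sketch that makes the difference explicit, assuming the `word/TAG` notation used on the slide (the function name `parse_tagged` is illustrative, not part of any tagger):

```python
def parse_tagged(sentence):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the last "/", so a word containing "/" survives
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


verb_reading = parse_tagged("I/PRP saw/VBD her/PRP duck/VB ./PUNCT")
noun_reading = parse_tagged("I/PRP saw/VBD her/PRP$ duck/NN ./PUNCT")

# Keep only the positions where the two taggings disagree
differences = [(a, b) for a, b in zip(verb_reading, noun_reading) if a != b]
```

Only `her` (PRP vs. PRP$) and `duck` (VB vs. NN) end up in `differences`, which is the whole ambiguity a tagger has to resolve here.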
Treebank grammars • Phrase structure tree (c-structure) • Annotations for heads, grammatical functions • [Figure: Collins parser output]
Deep grammars • Provide detailed syntactic/semantic analyses • LFG (ParGram), HPSG (LinGO, Matrix) • Grammatical functions, tense, number, etc. Mary wants to leave. subj(want~1,Mary~3) comp(want~1,leave~2) subj(leave~2,Mary~3) tense(leave~2,present) • Usually manually constructed • linguistically motivated rules
Why would you want one? • Meaning sensitive applications • overkill for many NLP applications • Applications which use shallow methods for English may not be able to for "free" word order languages • can read many functions off of trees in English • SUBJ: NP sister to VP [S [NP Mary] [VP left]] • OBJ: first NP sister to V [S [NP Mary] [VP saw [NP John]]] • need other information in German, Japanese, etc.
Deep analysis matters… if you care about the answer • Example: A delegation led by Vice President Philips, head of the chemical division, flew to Chicago a week after the incident. • Question: Who flew to Chicago? • Candidate answers: • division: closest noun (shallow but wrong) • head: next closest (shallow but wrong) • V.P. Philips: next (shallow but wrong) • delegation: furthest away, but subject of flew (deep and right)
Applications of Language Engineering • [Figure: chart of language engineering applications. Axis labels: Functionality (Low, High), Domain Coverage (Narrow, Broad), Shallow, Deep, Synthesis. Items shown: Post-Search Sifting, Google, Alta Vista, AskJeeves, Autonomous Knowledge Filtering, Knowledge Fusion, Technology Frontier, Useful Summary, Good Translation, Natural Dialogue, Restricted Dialogue, Microsoft Paperclip, Document Base Management, Manually-tagged Keyword Search]
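The two English heuristics (SUBJ is the NP sister to VP, OBJ is the first NP sister to V) can be sketched as a tree walk. This is only an illustration under the bracket notation on the slide; the function names are made up here, and the approach is exactly what breaks down for freer word order languages.

```python
def parse_brackets(text):
    """Parse '[S [NP Mary] [VP left]]' into nested (label, children) tuples."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def node():
        nonlocal pos
        label = tokens[pos + 1]          # the token after "[" is the category
        pos += 2
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])   # a bare word
                pos += 1
        pos += 1                         # consume the closing "]"
        return (label, children)

    return node()


def subtrees(children, label):
    """Return the child subtrees that carry the given category label."""
    return [c for c in children if not isinstance(c, str) and c[0] == label]


def words(tree):
    """Collect the words under a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in words(child)]


def grammatical_functions(tree):
    """Read SUBJ and OBJ off an English S node by position alone."""
    _, children = tree
    functions = {}
    nps, vps = subtrees(children, "NP"), subtrees(children, "VP")
    if nps and vps:
        functions["SUBJ"] = " ".join(words(nps[0]))       # NP sister to VP
        objects = subtrees(vps[0][1], "NP")
        if objects:
            functions["OBJ"] = " ".join(words(objects[0]))  # first NP in VP
    return functions
```

For example, `grammatical_functions(parse_brackets("[S [NP Mary] [VP saw [NP John]]]"))` yields Mary as SUBJ and John as OBJ purely from configuration, with no case or agreement information consulted.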
Traditional Problems • Time consuming and expensive to write • Not robust • want output for any input • Ambiguous • Slow • Other gating items for applications that need deep grammars
Why deep analysis is difficult • Languages are hard to describe • Meaning depends on complex properties of words and sequences • Different languages rely on different properties • Errors and disfluencies • Languages are hard to compute • Expensive to recognize complex patterns • Sentences are ambiguous • Ambiguities multiply: explosion in time and space
How to overcome this • Engineer the deep grammars • theoretical vs. practical • what is good enough • Integrate shallow techniques into deep grammars • Experience based on broad-coverage LFG grammars (ParGram project)
Robustness: Sources of Brittleness • missing vocabulary • you can't list all the proper names in the world • missing constructions • there are many constructions theoretical linguistics rarely considers (e.g. dates, company names) • easy to miss even core constructions • ungrammatical input • real world text is not always perfect • sometimes it is really horrendous
Real world Input • Other weak blue-chip issues included Chevron, which went down 2 to 64 7/8 in Big Board composite trading of 1.3 million shares; Goodyear Tire & Rubber, off 1 1/2 to 46 3/4, and American Express, down 3/4 to 37 1/4. (WSJ, section 13) • ``The croaker's done gone from the hook – (WSJ, section 13) • (SOLUTION 27000 20) Without tag P-248 the W7F3 fuse is located in the rear of the machine by the charge power supply (PL3 C14 item 15. (Eureka copier repair tip)
Missing vocabulary • Build vocabulary based on the input of shallow methods • fast • extensive • accurate • Finite-state morphologies • Part of Speech Taggers
Finite State Morphologies • Finite-state morphologies • falls -> fall +Noun +Pl; fall +Verb +Pres +3sg • Mary -> Mary +Prop +Giv +Fem +Sg • vienne -> venir +SubjP +SG {+P1|+P3} +Verb • Build lexical entry on-the-fly from the morphological information • have canonicalized stem form • have significant grammatical information • do not have subcategorization
Building lexical entries • Lexical entries -unknown N @(COMMON-NOUN %stem). +Noun N-SFX @(PERS 3). +Pl N-NUM @(NUM pl). • Rule NOUN -> N N-SFX N-NUM. • Templates • COMMON-NOUN :: (^ PRED)='%stem' (^ NTYPE)=common • PERS(3) :: (^ PERS)=3 • NUM(pl) :: (^ NUM)=pl
Building lexical entries • C-structure for falls: [Noun [N fall] [N-SFX +Noun] [N-NUM +Pl]] • F-structure for falls: [ PRED 'fall', NTYPE common, PERS 3, NUM pl ]
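The path from morphological analysis to f-structure can be sketched in a few lines. This is a rough Python analogue of the templates above, not XLE code: the dictionary `TAG_FEATURES` and the helpers `common_noun`, `unify`, and `build_entry` are names invented for the illustration, and a flat dictionary stands in for a real f-structure.

```python
def common_noun(stem):
    """Analogue of the COMMON-NOUN template: PRED from the stem, NTYPE common."""
    return {"PRED": stem, "NTYPE": "common"}


# Analogue of the tag entries: +Noun contributes PERS 3, +Pl contributes NUM pl
TAG_FEATURES = {
    "+Noun": {"PERS": "3"},
    "+Pl": {"NUM": "pl"},
}


def unify(fstructure, features):
    """Add features to a flat f-structure, failing on a value clash."""
    for attribute, value in features.items():
        if fstructure.get(attribute, value) != value:
            raise ValueError(f"clash on {attribute}")
        fstructure[attribute] = value
    return fstructure


def build_entry(analysis):
    """Build an entry for an unknown noun from e.g. 'fall +Noun +Pl'."""
    stem, *tags = analysis.split()
    fstructure = common_noun(stem)
    for tag in tags:
        unify(fstructure, TAG_FEATURES[tag])
    return fstructure
```

Calling `build_entry("fall +Noun +Pl")` produces the same four attribute-value pairs as the f-structure shown for falls. Subcategorization is still absent, as the morphology slide notes.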
Guessing words • Use FST guesser if the morphology doesn't know the word • Capitalized words can be proper nouns • Saakashvili -> Saakashvili +Noun +Proper +Guessed • -ed words can be past tense verbs or adjectives • fumped -> fump +Verb +Past +Guessed; fumped +Adj +Deverbal +Guessed • Languages with more morphology allow for better guessers
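A guesser of this kind can be approximated with plain string tests. The real component is a finite-state transducer; the sketch below only mimics the two heuristics named on the slide, and the function name `guess` and the length check are assumptions of the illustration.

```python
def guess(word):
    """Return guessed morphological analyses for a word the morphology lacks."""
    analyses = []
    if word[:1].isupper():
        # Capitalized words can be proper nouns
        analyses.append(f"{word} +Noun +Proper +Guessed")
    if word.endswith("ed") and len(word) > 3:
        # -ed words can be past tense verbs or deverbal adjectives.
        # Stripping "ed" is naive: it would turn "loved" into "lov".
        analyses.append(f"{word[:-2]} +Verb +Past +Guessed")
        analyses.append(f"{word} +Adj +Deverbal +Guessed")
    return analyses
```

Every guessed analysis carries the `+Guessed` tag, so later stages can tell these entries apart from ones the morphology actually knows.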
Using the lexicons • Rank the lexical lookup • overt entry in lexicon • entry built from information from morphology • entry built from information from guesser • Use the most reliable information • Fall back only as necessary
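The ranking amounts to trying each source in order of reliability and stopping at the first one that answers. A minimal sketch, assuming the lexicon and morphology can be modelled as dictionaries and the guesser as a function (all three, and the toy data, are placeholders):

```python
def lookup(word, lexicon, morphology, guesser):
    """Return (source, entries) from the most reliable source that knows the word."""
    sources = (
        ("lexicon", lexicon.get),        # overt entry in the lexicon
        ("morphology", morphology.get),  # entry built from the morphology
        ("guesser", guesser),            # entry built from the guesser
    )
    for name, find in sources:
        entries = find(word)
        if entries:
            return name, entries
    return "none", []


# Toy resources, invented for the example
LEXICON = {"want": ["want V-SUBJ-COMP"]}
MORPHOLOGY = {"falls": ["fall +Noun +Pl", "fall +Verb +Pres +3sg"]}


def toy_guesser(word):
    return [f"{word} +Noun +Proper +Guessed"] if word[:1].isupper() else []
```

With these toy resources, `lookup("falls", LEXICON, MORPHOLOGY, toy_guesser)` is answered by the morphology, and the guesser is consulted only for a word such as `Saakashvili` that neither earlier source covers.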
Missing constructions • Even large hand-written grammars are not complete • new constructions, especially with new corpora • unusual constructions • Generally longer sentences fail • one error can destroy the parse • Build up as much as you can; stitch together the pieces
Grammar engineering approach • First try to get a complete parse • If fail, build up chunks that get complete parses • Have a fall back for things without even chunk parses • Link these chunks and fall backs together in a single structure
Fragment Chunks: Sample output • the the dog appears. • Split into: • "token" the • sentence "the dog appears" • ignore the period
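The sample output can be reproduced with a small fallback loop. This sketch assumes a greedy longest-chunk strategy and a toy three-word "grammar" (`is_sentence`); neither is claimed to be how the actual fragment grammar selects its chunks.

```python
DETERMINERS = {"the", "a"}
NOUNS = {"dog", "cat"}
VERBS = {"appears", "sleeps"}


def is_sentence(tokens):
    """Toy stand-in for 'gets a complete parse': Det Noun Verb only."""
    return (
        len(tokens) == 3
        and tokens[0] in DETERMINERS
        and tokens[1] in NOUNS
        and tokens[2] in VERBS
    )


def fragment_parse(tokens, is_complete=is_sentence):
    """Return a list of (kind, tokens) fragments covering the input."""
    tokens = [t for t in tokens if t != "."]        # ignore the period
    if is_complete(tokens):                         # first try a complete parse
        return [("sentence", tokens)]
    fragments, start = [], 0
    while start < len(tokens):
        # Look for the longest chunk starting here that parses completely
        for end in range(len(tokens), start, -1):
            if is_complete(tokens[start:end]):
                fragments.append(("sentence", tokens[start:end]))
                start = end
                break
        else:
            # Fall back: keep the word as an unanalysed token
            fragments.append(("token", [tokens[start]]))
            start += 1
    return fragments
```

Running `fragment_parse("the the dog appears .".split())` gives a `token` fragment for the stray first `the` followed by a `sentence` fragment for `the dog appears`, all held in one list so that no input is lost.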
Ungrammatical input • Real world text contains ungrammatical input • typos • run-ons • cut and paste errors • Deep grammars tend to only cover grammatical input • Two strategies • robustness techniques: guesser/fragments • dispreferred rules for ungrammatical structures
Rules for ungrammatical structures • Common errors can be coded in the rules • want to know that error occurred (e.g., feature in f-structure) • Disprefer parses of ungrammatical structure • tools for grammar writer to rank rules • two+ pass system • standard rules • rules for known ungrammatical constructions • default fall back rules
Sample ungrammatical structures • Mismatched subject-verb agreement Verb3Sg = { SUBJ PERS = 3 SUBJ NUM = sg |BadVAgr} • Missing copula VPcop ==> { Vcop: ^=! |e: (^ PRED)='NullBe<(^ SUBJ)(^XCOMP)>' MissingCopularVerb} { NP: (^ XCOMP)=! |AP: (^ XCOMP)=! | …}
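The multi-pass idea can be illustrated with the agreement case. In this sketch each pass is a function that returns analyses or an empty list; the word lists, the `MARK` attribute, and the pass names are invented, and the flat dictionaries only gesture at real f-structures.

```python
SUBJECT_NUMBER = {"dog": "sg", "dogs": "pl"}
VERB_NUMBER = {"barks": "sg", "bark": "pl"}


def standard_rules(tokens):
    """Accept subject-verb pairs only when number agrees."""
    if len(tokens) != 2:
        return []
    subject, verb = tokens
    if subject in SUBJECT_NUMBER and SUBJECT_NUMBER[subject] == VERB_NUMBER.get(verb):
        return [{"PRED": verb, "SUBJ": subject}]
    return []


def known_error_rules(tokens):
    """Accept mismatched agreement, recording that an error occurred."""
    if len(tokens) != 2:
        return []
    subject, verb = tokens
    if subject in SUBJECT_NUMBER and verb in VERB_NUMBER:
        return [{"PRED": verb, "SUBJ": subject, "MARK": "BadVAgr"}]
    return []


def fallback_rules(tokens):
    """Default fall back: keep the words as fragments."""
    return [{"FRAGMENTS": list(tokens)}]


PASSES = (
    ("standard", standard_rules),
    ("known-errors", known_error_rules),
    ("fallback", fallback_rules),
)


def parse_in_passes(tokens, passes=PASSES):
    """Try each rule set in order and stop at the first that yields analyses."""
    for name, rules in passes:
        analyses = rules(tokens)
        if analyses:
            return name, analyses
    return None, []
```

Because the error rules run only after the standard rules fail, a grammatical input such as `dog barks` never receives the `BadVAgr` mark, while `dogs barks` is still analysed and flagged.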
Robustness summary • Integrate shallow methods • for lexical items • morphologies • guessers • Fall back techniques • for missing constructions • fragment grammar • dispreferred rules
Ambiguity • Deep grammars are massively ambiguous • Example: 700 sentences from section 23 of WSJ • average # of words: 19.6 • average # of optimal parses: 684 • for 1-10 word sentences: 3.8 • for 11-20 word sentences: 25.2 • for 50-60 word sentences: 12,888
Managing Ambiguity • Use packing to parse and manipulate the ambiguities efficiently (more tomorrow) • Trim early with shallow markup • fewer parses to choose from • faster parse time • Choose most probable parse for applications that need a single input
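The benefit of packing is easiest to see by counting. The sketch below makes a strong simplifying assumption, namely that the ambiguous choices are independent of each other, which real packed representations do not require; the choice labels are made up.

```python
from itertools import product
from math import prod

# Each inner list holds the alternatives at one point of ambiguity
packed = [
    ["duck:noun", "duck:verb"],
    ["PP attaches to NP", "PP attaches to VP"],
    ["coordination:wide", "coordination:narrow", "coordination:flat"],
]


def count_readings(packed_choices):
    """Number of full readings, computed without enumerating them."""
    return prod(len(alternatives) for alternatives in packed_choices)


def packed_size(packed_choices):
    """Number of alternatives stored in the packed form."""
    return sum(len(alternatives) for alternatives in packed_choices)


def unpack(packed_choices):
    """Lazily enumerate the full readings, one tuple per reading."""
    return product(*packed_choices)
```

Here 7 stored alternatives stand for 12 readings. The stored size grows with the sum of the alternatives while the number of readings grows with their product, which is the gap that packing exploits.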
Shallow markup • Part of speech marking as filter I saw her duck/VB. • accuracy of tagger (v. good for English) • can use partial tagging (verbs and nouns) • Named entities • <company>Goldman, Sachs & Co.</company> bought IBM. • good for proper names and times • hard to parse internal structure • Fall back technique if fail • slows parsing • accuracy vs. speed
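Using a tag as a filter means keeping only the lexical readings that the markup allows, and returning to the full set when the filter leaves nothing. A minimal sketch, assuming `word/TAG` markup on some tokens only (partial tagging) and a toy lexicon; falling back per word rather than per sentence is a simplification.

```python
# Toy lexicon: each word maps to its possible categories
LEXICON = {
    "duck": ["NN", "VB"],
    "saw": ["NN", "VBD"],
}


def readings(token):
    """Return (word, category) readings, filtered by an optional '/TAG' suffix."""
    word, separator, tag = token.partition("/")
    candidates = LEXICON.get(word, [])
    if separator and tag in candidates:
        return [(word, tag)]                 # markup trims the alternatives
    # No tag, or a tag the lexicon cannot honour: keep every reading
    return [(word, category) for category in candidates]
```

`readings("duck/VB")` keeps a single reading where `readings("duck")` keeps two, so the parser has fewer analyses to build; the cost of a wrong tag is the extra fall back work mentioned on the slide.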
Example shallow markup: Named entities • Allow tokenizer to accept marked up input: parse {<person>Mr. Thejskt Thejs</person> arrived.} • tokenized string (two alternatives for the marked span, followed by the rest of the sentence): { Mr. Thejskt Thejs TB +NEperson | Mr(TB). TB Thejskt TB Thejs } TB arrived TB . TB • Add lexical entries and rules for NE tags
Results for shallow markup (Kaplan and King 2003)
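A tokenizer that accepts this markup can emit both the single named-entity token and the ordinary tokenization of the same span. The sketch below is an approximation with regular expressions rather than a finite-state tokenizer, and the slot structure it returns is invented for the example; it also ignores the optional token boundary inside `Mr.`.

```python
import re

NE_SPAN = re.compile(r"<(\w+)>(.*?)</\1>")     # e.g. <person>...</person>
PLAIN_TOKEN = re.compile(r"\w+|[^\w\s]")       # words or single punctuation marks


def tokenize(text):
    """Return a list of slots; each slot lists alternative token sequences."""
    slots, last = [], 0
    for match in NE_SPAN.finditer(text):
        # Ordinary text before the marked span: one alternative per slot
        slots += [[[tok]] for tok in PLAIN_TOKEN.findall(text[last:match.start()])]
        tag, span = match.group(1), match.group(2)
        slots.append([
            [span, "+NE" + tag],               # whole span as one NE token
            PLAIN_TOKEN.findall(span),         # ordinary tokenization
        ])
        last = match.end()
    slots += [[[tok]] for tok in PLAIN_TOKEN.findall(text[last:])]
    return slots
```

For the slide's input, the first slot holds the two alternatives for `Mr. Thejskt Thejs` and the remaining slots hold `arrived` and the final period, leaving the grammar free to prefer the `+NEperson` reading.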
Choosing the most probable parse • Applications may want one input • or at least just a handful • Use stochastic methods to choose • efficient (XLE English grammar: 5% of parse time) • Need training data • partially labelled data ok [NP-SBJ They] see [NP-OBJ the girl with the telescope]
Run-time performance • Many deep grammars are slow • Techniques depend on the system • LFG: exploit the context-free backbone • ambiguity packing techniques • Speed vs. accuracy trade off • remove/disprefer peripheral rules • remove fall backs for shallow markup
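Partially labelled data is usable because it still rules parses in or out: any parse containing the labelled brackets counts as compatible. The sketch below shows that check plus a simple weighted-feature score for picking one parse. The feature names and weights are invented, and the scoring is a generic stand-in rather than the model used in any particular system.

```python
PARSES = [
    {
        "name": "NP-attachment",
        "brackets": {("NP-SBJ", "They"), ("NP-OBJ", "the girl with the telescope")},
        "features": ["pp-attaches-to-np"],
    },
    {
        "name": "VP-attachment",
        "brackets": {("NP-SBJ", "They"), ("NP-OBJ", "the girl")},
        "features": ["pp-attaches-to-vp"],
    },
]

# Brackets taken from the partially labelled example on the slide
LABELLED = {("NP-SBJ", "They"), ("NP-OBJ", "the girl with the telescope")}

# Hypothetical feature weights, as if estimated from such data
WEIGHTS = {"pp-attaches-to-np": 0.4, "pp-attaches-to-vp": -0.1}


def consistent(parses, required_brackets):
    """Keep the parses that contain every labelled bracket."""
    return [p for p in parses if required_brackets <= p["brackets"]]


def most_probable(parses, weights):
    """Pick the parse whose features have the highest total weight."""
    def score(parse):
        return sum(weights.get(feature, 0.0) for feature in parse["features"])
    return max(parses, key=score)
```

Only the NP-attachment parse survives `consistent(PARSES, LABELLED)`, so the labelled sentence supplies a training signal without a full hand-built analysis.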
Development expense • Grammar porting • Starter grammar • Induced grammar bootstrapping • How cheap are shallow grammars? • training data can be expensive to produce
Grammar porting • Use an existing grammar as the base for a new language • Languages must be typologically similar • Japanese-Korean • Balkan • Lexical porting via bi-lingual dictionaries • Main work in testing and evaluation
Starter grammar • Provide basic rules and templates • including for robustness techniques • Grammar writer: • chooses among them • refines them • Grammar Matrix for HPSG
Grammar induction • Induce a core grammar from a treebank • compile rule generalizations • threshold rare rules • hand augment with features and fallback techniques • Requires • induction program • existing resources (treebank)
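The first two induction steps, compiling rules from trees and thresholding rare ones, can be sketched with a counter. The two-tree treebank and the tuple encoding of trees are toy assumptions; a real induction program would also generalize over rules and would need the hand-added features mentioned above.

```python
from collections import Counter

# Toy treebank: trees are (label, children); leaves are plain strings
TREEBANK = [
    ("S", [("NP", [("N", ["Mary"])]), ("VP", [("V", ["left"])])]),
    ("S", [("NP", [("N", ["Mary"])]),
           ("VP", [("V", ["saw"]), ("NP", [("N", ["John"])])])]),
]


def rules(tree):
    """Yield (lhs, rhs) phrase structure rules, skipping lexical nodes."""
    label, children = tree
    daughters = [c for c in children if not isinstance(c, str)]
    if daughters:
        yield (label, tuple(d[0] for d in daughters))
    for daughter in daughters:
        yield from rules(daughter)


def induce(treebank, min_count):
    """Count rules over the treebank and drop those seen fewer than min_count times."""
    counts = Counter(rule for tree in treebank for rule in rules(tree))
    return {rule: n for rule, n in counts.items() if n >= min_count}
```

With `min_count=2` the toy treebank keeps `S -> NP VP` and `NP -> N` and drops both VP rules, which shows the trade-off in thresholding: the grammar gets smaller, but rare yet legitimate rules are lost.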
Conclusions • Grammar engineering makes deep grammars feasible • robustness techniques • integration of shallow methods • Many current applications can use shallow grammars • Fast, accurate, broad-coverage deep grammars enable new applications