This lecture note explores the principles of features and unification in natural language processing, focusing on grammatical categories and their properties. It discusses the challenges of agreement at the grammatical category level, including count and mass nouns, and presents sample rules that account for features in syntactic structures. The notes also cover regular expressions, finite-state automata, and their applications in language processing, providing exercises and examples related to agreement in verb forms, noun phrases, and more, highlighting the complexities of linguistic structures.
EECS 595 / LING 541 / SI 661 Natural Language Processing Fall 2005 Lecture Notes #4
Introduction • Grammatical categories have properties • Constraint-based formalisms • Example: "this flights": agreement is difficult to handle at the level of grammatical categories • Example: "many water": count/mass nouns • Sample rule that takes features into account: S → NP VP (but only if the number of the NP is equal to the number of the VP)
Feature structures
[CAT NP, NUMBER SINGULAR, PERSON 3]
[CAT NP, AGREEMENT [NUMBER SG, PERSON 3]]
Feature paths: {x agreement number}
Unification
[NUMBER SG] ⊔ [NUMBER SG]: succeeds (+)
[NUMBER SG] ⊔ [NUMBER PL]: fails (−)
[NUMBER SG] ⊔ [NUMBER []] = [NUMBER SG]
[NUMBER SG] ⊔ [PERSON 3] = ?
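The unification cases on this slide can be tried out with a short Python sketch. It assumes a representation the notes do not prescribe: feature structures are nested dicts, atomic values are strings, and the empty dict stands for the unspecified value []. The function name unify is illustrative, and reentrancy (shared substructures) is left out.

```python
def unify(f, g):
    """Unify two feature structures; return the result, or None on failure.

    Assumed encoding: nested dicts for structures, strings for atomic
    values, and {} for the unspecified value [].
    """
    if isinstance(f, dict) and isinstance(g, dict):
        result = dict(f)
        for feature, value in g.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # a clash anywhere makes the whole unification fail
                result[feature] = merged
            else:
                result[feature] = value  # feature present only in g: copy it over
        return result
    if isinstance(f, dict) and not f:
        return g  # [] unifies with anything
    if isinstance(g, dict) and not g:
        return f
    return f if f == g else None  # atoms unify only when identical


print(unify({"NUMBER": "SG"}, {"NUMBER": "SG"}))  # succeeds
print(unify({"NUMBER": "SG"}, {"NUMBER": "PL"}))  # fails
print(unify({"NUMBER": "SG"}, {"NUMBER": {}}))    # [] is filled in
```

The remaining case from the slide can be checked by calling unify on the two structures directly.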
Agreement • S → NP VP, {NP AGREEMENT} = {VP AGREEMENT} • Does this flight serve breakfast? • Do these flights serve breakfast? • S → Aux NP VP, {Aux AGREEMENT} = {NP AGREEMENT}
Agreement • These flights • This flight • NP → Det Nominal, {Det AGREEMENT} = {Nominal AGREEMENT} • Verb → serve, {Verb AGREEMENT NUMBER} = PL • Verb → serves, {Verb AGREEMENT NUMBER} = SG
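As a rough illustration of the NP rule with its agreement constraint, the Python sketch below checks a determiner against a nominal. The tiny lexicon and the function name np_agrees are assumptions made for the example; only NUMBER is modelled.

```python
# Hypothetical mini-lexicon: each word carries a category and a NUMBER value.
LEXICON = {
    "this": {"CAT": "Det", "NUMBER": "SG"},
    "these": {"CAT": "Det", "NUMBER": "PL"},
    "flight": {"CAT": "Nominal", "NUMBER": "SG"},
    "flights": {"CAT": "Nominal", "NUMBER": "PL"},
}


def np_agrees(det, nominal):
    """Apply NP -> Det Nominal only if the two NUMBER values are equal."""
    d, n = LEXICON[det], LEXICON[nominal]
    return d["CAT"] == "Det" and n["CAT"] == "Nominal" and d["NUMBER"] == n["NUMBER"]


print(np_agrees("this", "flight"))    # agreement holds
print(np_agrees("this", "flights"))   # agreement is violated
```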
Subcategorization • VP → Verb, {VP HEAD} = {Verb HEAD}, {VP HEAD SUBCAT} = INTRANS • VP → Verb NP, {VP HEAD} = {Verb HEAD}, {VP HEAD SUBCAT} = TRANS • VP → Verb NP NP, {VP HEAD} = {Verb HEAD}, {VP HEAD SUBCAT} = DITRANS
Regular expressions • Searching for “woodchuck” • Searching for “woodchucks” with an optional final “s” • Regular expressions • Finite-state automata (singular: automaton)
Regular expressions • Basic regular expression patterns • Perl-based syntax (slightly different from other notations for regular expressions) • Disjunctions [abc] • Ranges [A-Z] • Negations [^Ss] • Optional characters ? and * • Wild cards . • Anchors ^ and $, also \b and \B • Disjunction, grouping, and precedence |
Writing correct expressions • Exercise: write a Perl regular expression to match the English article “the”:
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
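The five attempts can be compared on a test sentence. The sketch below uses Python's re module rather than Perl (the syntax of these particular patterns is the same in both); the sample text and the helper count_matches are made up for the illustration.

```python
import re

# The five successive attempts from the slide, as Python raw strings.
PATTERNS = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]


def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


# Hypothetical test sentence: a capitalised "The", a plain "the",
# "the" buried inside "other", and "the" followed by an underscore.
TEXT = "The cat saw the other the_25 one."

for pattern in PATTERNS:
    print(pattern, count_matches(pattern, TEXT))
```

Comparing the counts shows which attempts miss the sentence-initial "The", which ones wrongly fire inside "other", and how the \b version and the explicit character-class version differ on "the_25".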
A more complex example • Exercise: Write a regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”:
/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
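A hedged Python sketch of how such patterns might be applied to an advertisement. Two choices here are assumptions of the sketch: the dollar sign is escaped as \$, and the price pattern starts with (^|\W) instead of \b, because there is no word boundary between a space and "$". The function find_specs and the sample ad are invented for the example. The patterns only locate the mentions; deciding "more than" or "less than" would still need a numeric comparison.

```python
import re

PRICE = re.compile(r"(^|\W)\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")


def find_specs(ad):
    """Return the first price, clock speed, and disk size mentioned in ad."""
    found = {}
    for name, pattern in (("price", PRICE), ("speed", SPEED), ("disk", DISK)):
        match = pattern.search(ad)
        # strip() drops the non-word character that (^|\W) may have consumed
        found[name] = match.group(0).strip() if match else None
    return found


print(find_specs("Fast PC, 600 MHz, 40 Gb disk, only $999.99"))
```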
Substitutions and memory • Substitutions s/colour/color/ • Memory (\1, \2, etc. refer back to matches) s/([0-9]+)/<\1>/
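For comparison, the same two substitutions written with Python's re.sub, where \1 in the replacement string refers back to the first parenthesised group just as in the Perl notation. The input strings are invented examples.

```python
import re

# s/colour/color/
print(re.sub(r"colour", "color", "the colour of the sky"))

# s/([0-9]+)/<\1>/  (re.sub replaces every match, like Perl's /g flag)
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))
```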
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions
Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
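Steps 1 and 2 can be sketched in a few lines of Python. This is a simplified illustration, not Weizenbaum's program: the pronoun rewrites in FIRST_TO_SECOND are assumptions, the first matching rule wins (so the scoring of Step 3 is not modelled), and the function returns None when no rule applies.

```python
import re

# Step 1: first-person references become second-person ones (assumed rewrites).
FIRST_TO_SECOND = [
    (r"\bI'm\b", "YOU ARE"),
    (r"\bI am\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
]

# Step 2: reply rules taken from the slide, tried in order.
REPLY_RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def eliza_reply(utterance):
    """Return a reply for utterance, or None if no rule matches."""
    text = utterance
    for pattern, replacement in FIRST_TO_SECOND:
        text = re.sub(pattern, replacement, text)
    for pattern, template in REPLY_RULES:
        match = re.match(pattern, text)
        if match:
            return match.expand(template)  # fills in \1 from the match
    return None


print(eliza_reply("Men are all alike"))
print(eliza_reply("He says I'm depressed much of the time"))
```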
Finite-state automata • Finite-state automata (FSA) • Regular languages • Regular expressions
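Finite-state automata (machines) • The sheep language: baa! baaa! baaaa! baaaaa! ... • As a regular expression: /baa+!/ • [Figure: FSA with states q0, q1, q2, q3, q4; arcs q0 –b→ q1 –a→ q2 –a→ q3 –!→ q4, plus a self-loop labelled a on q3; q4 is the final state. Legend: state, transition, final state]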
Input tape • [Figure: the machine in state q0 at the start of a tape containing the symbols a b a ! b]
Finite-state automata • Q: a finite set of N states q0, q1, …, qN−1 • Σ: a finite input alphabet of symbols • q0: the start state • F: the set of final states, F ⊆ Q • δ(q,i): the transition function, which returns the next state for state q and input symbol i
The FSM toolkit and friends • Developed at AT&T Research (Riley, Pereira, Mohri, Sproat) • Download: http://www.research.att.com/sw/tools/fsm/tech.html and http://www.research.att.com/sw/tools/lextools/ • Tutorial available • 4 useful parts: FSM, Lextools, GRM, Dot (separate) • /data2/tools/fsm-3.6/bin • /data2/tools/lextools/bin • /data2/tools/dot/bin
D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
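A direct Python transcription of D-RECOGNIZE, offered as a sketch. The encoding is an assumption: the transition table is a dict keyed by (state, symbol) pairs, states are integers, and a missing key plays the role of an empty table cell. SHEEPTALK encodes the /baa+!/ machine.

```python
def d_recognize(tape, transitions, start, accept_states):
    """Return True (accept) or False (reject) for a deterministic FSA."""
    index = 0
    current_state = start
    while True:
        if index == len(tape):                     # end of input reached
            return current_state in accept_states
        key = (current_state, tape[index])
        if key not in transitions:                 # empty cell in the table
            return False
        current_state = transitions[key]
        index += 1


# Transition table for sheeptalk, /baa+!/: states 0..4, state 4 is final.
SHEEPTALK = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

print(d_recognize("baaa!", SHEEPTALK, 0, {4}))
print(d_recognize("aba!b", SHEEPTALK, 0, {4}))
```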
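Adding a failing state • [Figure: the sheeptalk FSA with states q0, q1, q2, q3, q4 and an added fail state qF; every otherwise missing transition on a, b, or ! leads to qF]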
Languages and automata • Formal languages: regular languages, non-regular languages • Deterministic vs. non-deterministic FSAs • Epsilon (ε) transitions
Using NFSAs to accept strings • Backup: add markers at choice points, then possibly return to unexplored markers • Look-ahead: look ahead in the input • Parallelism: look at alternatives in parallel
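Of the three strategies, parallelism is the shortest to sketch: keep the set of all states the machine could be in and advance them together. The Python below is one possible rendering under assumed conventions: transitions map (state, symbol) to a set of next states, and the empty string "" marks an ε-transition. The two example machines are non-deterministic variants of sheeptalk written for this sketch.

```python
def nd_recognize(tape, transitions, start, accept_states):
    """Simulate an NFSA by tracking every reachable state in parallel."""

    def epsilon_closure(states):
        # Add every state reachable through epsilon ("") transitions.
        closure, stack = set(states), list(states)
        while stack:
            state = stack.pop()
            for nxt in transitions.get((state, ""), ()):
                if nxt not in closure:
                    closure.add(nxt)
                    stack.append(nxt)
        return closure

    current = epsilon_closure({start})
    for symbol in tape:
        moved = set()
        for state in current:
            moved |= set(transitions.get((state, symbol), ()))
        current = epsilon_closure(moved)
        if not current:            # every alternative has died out
            return False
    return bool(current & set(accept_states))


# Variant 1: two arcs labelled "a" leave state 2.
NFSA_LOOP = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
# Variant 2: an epsilon arc from state 3 back to state 2.
NFSA_EPS = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {3}, (3, ""): {2}, (3, "!"): {4}}

print(nd_recognize("baaa!", NFSA_LOOP, 0, {4}))
print(nd_recognize("baaa!", NFSA_EPS, 0, {4}))
```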
More about FSAs • Transducers • Equivalence of DFSAs and NFSAs • Recognition as search: depth-first, breadth-first
Regular languages • Operations on regular languages and FSAs: concatenation, closure, union • Properties of regular languages (closed under concatenation, union (disjunction), intersection, difference, complementation, reversal, Kleene closure)
An exercise • J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.
Morphology and Finite-State Transducers
Morphemes • Stems, affixes • Affixes: prefixes; suffixes; infixes, e.g., hingi (borrow) – humingi (agent) in Tagalog; circumfixes, e.g., sagen – gesagt in German • Concatenative morphology • Templatic morphology (Semitic languages): lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)
Morphological analysis • rewrites • unbelievably
Inflectional morphology • Tense, number, person, mood, aspect • Five verb forms in English • 40+ forms in French • Six cases in Russian: http://www.departments.bucknell.edu/russian/language/case.html • Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)
Derivational morphology • Nominalization: computerization, appointee, killer, fuzziness • Formation of adjectives: computational, embraceable, clueless
Finite-state morphological parsing • Cats: cat +N +PL • Cat: cat +N +SG • Cities: city +N +PL • Geese: goose +N +PL • Ducks: (duck +N +PL) or (duck +V +3SG) • Merging: merge +V +PRES-PART • Caught: (catch +V +PAST-PART) or (catch +V +PAST)
Principles of morphological parsing • Lexicon • Morphotactics (e.g., plural follows noun) • Orthography (easy → easier) • Irregular nouns: e.g., geese, sheep, mice • Irregular verbs: e.g., caught, ate, eaten
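The interplay of lexicon, morphotactics, orthography, and irregular forms can be illustrated with a toy parser. The Python sketch below is a plain dictionary lookup, not a finite-state transducer; the word lists, the table of irregular forms, and the function parse are all assumptions made for the example, and only the -s and -ing suffixes are handled.

```python
# Hypothetical lexicon of stems and irregular surface forms.
NOUNS = {"cat", "city", "duck", "goose"}
VERBS = {"duck", "merge", "catch"}
IRREGULAR = {
    "geese": ["goose +N +PL"],
    "caught": ["catch +V +PAST-PART", "catch +V +PAST"],
}


def parse(word):
    """Return every analysis of word that this toy lexicon licenses."""
    w = word.lower()
    analyses = list(IRREGULAR.get(w, []))
    if w in NOUNS:
        analyses.append(f"{w} +N +SG")
    if w.endswith("s"):
        stems = [w[:-1]]
        if w.endswith("ies"):
            stems.append(w[:-3] + "y")      # orthography: cities -> city
        for stem in stems:
            # Morphotactics: the plural / 3SG suffix follows the stem.
            if stem in NOUNS:
                analyses.append(f"{stem} +N +PL")
            if stem in VERBS:
                analyses.append(f"{stem} +V +3SG")
    if w.endswith("ing"):
        for stem in (w[:-3], w[:-3] + "e"):  # orthography: merging -> merge
            if stem in VERBS:
                analyses.append(f"{stem} +V +PRES-PART")
    return analyses


for word in ("Cats", "Cities", "Geese", "Ducks", "Merging", "Caught"):
    print(word, parse(word))
```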
FSA for adjectives • Big, bigger, biggest • Cool, cooler, coolest, coolly • Red, redder, reddest • Clear, clearer, clearest, clearly, unclear, unclearly • Happy, happier, happiest, happily • Unhappy, unhappier, unhappiest, unhappily • What about: unbig, redly, and realest?
Using FSA for recognition • Is a string a legitimate word or not? • Two-level morphology: lexical level + surface level (Koskenniemi 1983) • Finite-state transducers (FST) – used for regular relations • Inversion and composition of FSTs
Orthographic rules • Beg/begging • Make/making • Watch/watches • Try/tries • Panic/panicked
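Each of the five pairs corresponds to a spelling rule that applies at the boundary between stem and suffix. The Python sketch below approximates them with regular-expression rewrites over an intermediate form in which "^" marks the morpheme boundary (for example "watch^s"). The rule formulations are rough assumptions for illustration and overgenerate on words outside this list (for instance, stress is ignored in consonant doubling); a real system would state them as transducers.

```python
import re

# (pattern over the intermediate form, replacement) pairs, tried in order.
SPELLING_RULES = [
    (r"(x|s|z|ch|sh)\^s$", r"\1es"),                      # e-insertion: watch^s
    (r"([^aeiou])y\^s$", r"\1ies"),                       # y-replacement: try^s
    (r"e\^(ing|ed)$", r"\1"),                             # e-deletion: make^ing
    (r"c\^(ing|ed)$", r"ck\1"),                           # k-insertion: panic^ed
    (r"(?<![aeiou])([aeiou])([bdgmnprt])\^(ing|ed)$", r"\1\2\2\3"),  # doubling: beg^ing
]


def surface_form(intermediate):
    """Map an intermediate form such as 'watch^s' to its surface spelling."""
    for pattern, replacement in SPELLING_RULES:
        result, count = re.subn(pattern, replacement, intermediate)
        if count:
            return result
    return intermediate.replace("^", "")   # no rule applies: just concatenate


for form in ("beg^ing", "make^ing", "watch^s", "try^s", "panic^ed", "cat^s"):
    print(form, "->", surface_form(form))
```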
Combining FST lexicon and rules • Cascades of transducers: the output of one becomes the input of another
Phonetic symbols • IPA • Arpabet • Examples
Using WFST for language modeling • Phonetic representation • Part-of-speech tagging
Some POS statistics • Preposition list from COBUILD • Single-word particles • Conjunctions • Pronouns • Modal verbs
Tagsets for English • Penn Treebank • Other tagsets (see Week 1 slides)
POS ambiguity • Degrees of ambiguity (DeRose 1988) • Rule-based POS tagging • ENGTWOL (Voutilainen et al.) • Sample rule: Adverbial-That rule (“it isn’t that odd”)
Given input: “that”
if (+1 A/ADV/QUANT); (+2 SENT-LIM); (NOT –1 SVOC/A); (not a verb like “consider”)
then eliminate non-ADV tags
else eliminate ADV tag