

  1. EECS 595 / LING 541 / SI 661 Natural Language Processing Fall 2005 Lecture Notes #4

  2. Features and unification

  3. Introduction • Grammatical categories have properties • Constraint-based formalisms • Example: "this flights": agreement is difficult to handle at the level of grammatical categories • Example: "many water": count/mass nouns • Sample rule that takes features into account: S → NP VP (but only if the number of the NP is equal to the number of the VP)

  4. Feature structures • A simple feature structure: [CAT NP, NUMBER SINGULAR, PERSON 3] • With a grouped AGREEMENT feature: [CAT NP, AGREEMENT [NUMBER SG, PERSON 3]] • Feature paths: {x AGREEMENT NUMBER}
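
One simple way to encode such structures in Python is as nested dictionaries, with a feature path realized as a sequence of keys; the helper name follow_path below is just an illustration, not part of the lecture material.

# Feature structures as nested dictionaries (a simplified sketch).
fs = {
    "CAT": "NP",
    "AGREEMENT": {"NUMBER": "SG", "PERSON": 3},
}

def follow_path(feature_structure, path):
    """Follow a feature path such as ('AGREEMENT', 'NUMBER') through a structure."""
    value = feature_structure
    for feature in path:
        value = value[feature]
    return value

print(follow_path(fs, ("AGREEMENT", "NUMBER")))   # -> SG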

  5. Unification • [NUMBER SG] ⊔ [NUMBER SG]: succeeds • [NUMBER SG] ⊔ [NUMBER PL]: fails • [NUMBER SG] ⊔ [NUMBER []] = [NUMBER SG] • [NUMBER SG] ⊔ [PERSON 3] = ?
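
A toy unification over the dictionary encoding sketched above, enough to reproduce the four cases on this slide; it deliberately ignores reentrancy (shared substructures), which a full unification grammar needs.

# Toy unification: atomic values must match, an absent feature unifies with
# anything, and failure is reported as None.
def unify(f1, f2):
    if f1 is None or f2 is None:
        return None
    if isinstance(f1, dict) and isinstance(f2, dict):
        result = dict(f1)
        for feature, value in f2.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None          # conflicting values: unification fails
                result[feature] = merged
            else:
                result[feature] = value  # unspecified on one side: just copy
        return result
    return f1 if f1 == f2 else None

print(unify({"NUMBER": "SG"}, {"NUMBER": "SG"}))   # {'NUMBER': 'SG'}
print(unify({"NUMBER": "SG"}, {"NUMBER": "PL"}))   # None (fails)
print(unify({"NUMBER": "SG"}, {}))                 # {'NUMBER': 'SG'}
print(unify({"NUMBER": "SG"}, {"PERSON": 3}))      # {'NUMBER': 'SG', 'PERSON': 3}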

  6. Agreement • S → NP VP, {NP AGREEMENT} = {VP AGREEMENT} • Does this flight serve breakfast? • Do these flights serve breakfast? • S → Aux NP VP, {Aux AGREEMENT} = {NP AGREEMENT}

  7. Agreement • These flights • This flight • NP → Det Nominal, {Det AGREEMENT} = {Nominal AGREEMENT} • Verb → serve, {Verb AGREEMENT NUMBER} = PL • Verb → serves, {Verb AGREEMENT NUMBER} = SG

  8. Subcategorization • VP → Verb, {VP HEAD} = {Verb HEAD}, {VP HEAD SUBCAT} = INTRANS • VP → Verb NP, {VP HEAD} = {Verb HEAD}, {VP HEAD SUBCAT} = TRANS • VP → Verb NP NP, {VP HEAD} = {Verb HEAD}, {VP HEAD SUBCAT} = DITRANS
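
A minimal sketch of how a parser might check these SUBCAT constraints; the three-verb lexicon and the rule_applies helper are invented for illustration (a real grammar places the SUBCAT value inside the verb's HEAD feature structure).

# Hypothetical subcategorization lexicon: each verb's entry carries a SUBCAT value.
LEXICON = {
    "disappear": {"SUBCAT": "INTRANS"},
    "serve":     {"SUBCAT": "TRANS"},
    "give":      {"SUBCAT": "DITRANS"},
}

# Each VP rule demands a particular SUBCAT value from its head verb.
VP_RULES = {
    "VP -> Verb":       "INTRANS",
    "VP -> Verb NP":    "TRANS",
    "VP -> Verb NP NP": "DITRANS",
}

def rule_applies(rule, verb):
    """A VP rule applies only if the verb's SUBCAT matches the rule's requirement."""
    return LEXICON[verb]["SUBCAT"] == VP_RULES[rule]

print(rule_applies("VP -> Verb NP", "serve"))      # True
print(rule_applies("VP -> Verb NP", "disappear"))  # False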

  9. Regular Expressions and Automata

  10. Regular expressions • Searching for “woodchuck” • Searching for “woodchucks” with an optional final “s” • Regular expressions • Finite-state automata (singular: automaton)

  11. Regular expressions • Basic regular expression patterns • Perl-based syntax (slightly different from other notations for regular expressions) • Disjunctions [abc] • Ranges [A-Z] • Negations [^Ss] • Optional characters ? and * • Wild cards . • Anchors ^ and $, also \b and \B • Disjunction, grouping, and precedence |
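
The same operators can be tried out in Python's re module, whose syntax is close to the Perl notation used in these notes; the test strings below are invented.

import re

print(re.findall(r"[abc]", "cab ride"))          # disjunction of characters
print(re.findall(r"[A-Z]\w+", "Ann met Bob"))    # range: capitalized words
print(re.findall(r"[^Ss]", "Sass"))              # negation: anything but S or s
print(re.search(r"colou?r", "color"))            # optional character
print(re.findall(r"beg.n", "begin began begun")) # wildcard
print(re.search(r"^The end$", "The end"))        # anchors
print(re.findall(r"cat|dog", "cat and dog"))     # disjunction of strings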

  12. Writing correct expressions • Exercise: write a Perl regular expression to match the English article "the":
    /the/
    /[tT]he/
    /\b[tT]he\b/
    /[^a-zA-Z][tT]he[^a-zA-Z]/
    /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
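
Trying the progression in Python on an invented sentence shows why the word-boundary version is the usual answer.

import re

text = "The other day, the theology student read them the news."

print(re.findall(r"the", text))         # naive: also matches inside "other", "theology", "them"
print(re.findall(r"\b[tT]he\b", text))  # word boundaries: only the article "The"/"the"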

  13. A more complex example • Exercise: Write a regular expression that will match "any PC with more than 500 MHz and 32 Gb of disk space for less than $1000":
    /$[0-9]+/
    /$[0-9]+\.[0-9][0-9]/
    /\b$[0-9]+(\.[0-9][0-9])?\b/
    /\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
    /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
    /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
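
The same patterns in Python on an invented ad; note that a literal dollar sign must be escaped as \$, since an unescaped $ is the end-of-line anchor.

import re

ad = "Brand new PC, 733 MHz, 32 Gb disk, only $999.99!"

price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
clock = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
disk  = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

for pattern in (price, clock, disk):
    match = pattern.search(ad)
    print(match.group(0) if match else "no match")   # $999.99 / 733 MHz / 32 Gb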

  14. Advanced operators

  15. Substitutions and memory • Substitutions s/colour/color/ • Memory (\1, \2, etc. refer back to matches) s/([0-9]+)/<\1>/
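
In Python, re.sub plays the role of s///, and \1 in the replacement refers back to the first capture group.

import re

print(re.sub(r"colour", "color", "my favourite colour"))    # -> my favourite color
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))          # -> the <35> boxes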

  16. Eliza [Weizenbaum, 1966]
    User: Men are all alike
    ELIZA: IN WHAT WAY
    User: They’re always bugging us about something or other
    ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
    User: Well, my boyfriend made me come here
    ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
    User: He says I’m depressed much of the time
    ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

  17. Eliza-style regular expressions • Step 1: replace first-person references with second-person references • Step 2: use additional regular expressions to generate replies • Step 3: use scores to rank possible transformations
    s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
    s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
    s/.* all .*/IN WHAT WAY/
    s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
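
A minimal Eliza-style loop in Python following the three steps; the pronoun swaps, rules, and scores below are illustrative, not Weizenbaum's originals.

import re

# Step 1: swap first-person for second-person references.
PRONOUN_SWAPS = [(r"\bI AM\b", "YOU ARE"), (r"\bMY\b", "YOUR"), (r"\bI\b", "YOU")]

# Steps 2-3: (score, pattern, reply) triples; the highest-scoring matching rule wins.
RULES = [
    (3, r".* YOU ARE (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (2, r".* ALL .*",                     "IN WHAT WAY"),
    (1, r".* ALWAYS .*",                  "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(utterance):
    text = " " + utterance.upper() + " "
    for pattern, replacement in PRONOUN_SWAPS:
        text = re.sub(pattern, replacement, text)
    for _, pattern, reply in sorted(RULES, reverse=True):
        if re.match(pattern, text):
            return re.sub(pattern, reply, text).strip()
    return "PLEASE GO ON"

print(eliza_reply("I am depressed much of the time"))  # -> I AM SORRY TO HEAR YOU ARE DEPRESSED
print(eliza_reply("Men are all alike"))                # -> IN WHAT WAY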

  18. Finite-state automata • Finite-state automata (FSA) • Regular languages • Regular expressions

  19. Finite-state automata (machines) • The sheep language: baa!, baaa!, baaaa!, baaaaa!, ... i.e. /baa+!/ • [Figure: an FSA with states q0 through q4, transitions labeled b, a, a (with a loop on a), and !; q4 is the final state]

  20. Input tape • [Figure: the automaton in its start state q0 reading an input tape containing the symbols a b a ! b]

  21. Finite-state automata • Q: a finite set of N states q0, q1, …, qN-1 • Σ: a finite input alphabet of symbols • q0: the start state • F: the set of final states • δ(q, i): the transition function

  22. State-transition tables
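
One convenient way to write down such a table for the sheep language /baa+!/ (slide 19) is as a Python dictionary keyed on (state, input symbol); the encoding below is just one choice, and missing entries stand for the empty cells of the table.

# Transition table for /baa+!/: (state, input symbol) -> next state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # loop: any number of additional a's
    (3, "!"): 4,
}
START, ACCEPT = 0, {4}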

  23. The FSM toolkit and friends • Developed at AT&T Research (Riley, Pereira, Mohri, Sproat) • Download: http://www.research.att.com/sw/tools/fsm/tech.html and http://www.research.att.com/sw/tools/lextools/ • Tutorial available • 4 useful parts: FSM, Lextools, GRM, Dot (separate) • /data2/tools/fsm-3.6/bin • /data2/tools/lextools/bin • /data2/tools/dot/bin

  24. D-RECOGNIZE
    function D-RECOGNIZE(tape, machine) returns accept or reject
      index ← Beginning of tape
      current-state ← Initial state of machine
      loop
        if End of input has been reached then
          if current-state is an accept state then
            return accept
          else
            return reject
        elsif transition-table[current-state, tape[index]] is empty then
          return reject
        else
          current-state ← transition-table[current-state, tape[index]]
          index ← index + 1
      end
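
The same algorithm as runnable Python, restated with a for loop over the tape and using a transition-table dictionary like the one sketched under slide 22.

def d_recognize(tape, transitions, start_state, accept_states):
    """Deterministic recognition: follow the transition table until the tape ends."""
    current_state = start_state
    for symbol in tape:
        if (current_state, symbol) not in transitions:
            return False                                   # reject: no transition
        current_state = transitions[(current_state, symbol)]
    return current_state in accept_states                  # accept only in a final state

TRANSITIONS = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

print(d_recognize("baaa!", TRANSITIONS, 0, {4}))   # True
print(d_recognize("ba!",   TRANSITIONS, 0, {4}))   # False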

  25. Adding a failing state • [Figure: the sheep-language FSA with an added fail state qF; every transition on a, b, or ! that is missing from the original machine leads to qF]

  26. Languages and automata • Formal languages: regular languages, non-regular languages • Deterministic vs. non-deterministic FSAs • Epsilon (ε) transitions

  27. Using NFSAs to accept strings • Backup: add markers at choice points, then possibly revisit underexplored markers • Look-ahead: look ahead in input • Parallelism: look at alternatives in parallel
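
A sketch of the backup strategy: choice points (state, tape position) are pushed onto an agenda and revisited when a path dies. The toy NFSA for /baa+!/ at the bottom is invented for the example, and epsilon transitions are not handled.

def nd_recognize(tape, transitions, start_state, accept_states):
    """Non-deterministic recognition by backup: an agenda of (state, position) choice points."""
    agenda = [(start_state, 0)]
    while agenda:
        state, pos = agenda.pop()           # take the most recent choice point (depth-first)
        if pos == len(tape) and state in accept_states:
            return True
        if pos < len(tape):
            # several next states may be possible for the same symbol
            for next_state in transitions.get((state, tape[pos]), []):
                agenda.append((next_state, pos + 1))
    return False

# NFSA for /baa+!/ in which state 2 guesses whether more a's follow.
TRANSITIONS = {(0, "b"): [1], (1, "a"): [2], (2, "a"): [2, 3], (3, "!"): [4]}

print(nd_recognize("baaaa!", TRANSITIONS, 0, {4}))  # True
print(nd_recognize("baa",    TRANSITIONS, 0, {4}))  # False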

  28. Using NFSAs

  29. More about FSAs • Transducers • Equivalence of DFSAs and NFSAs • Recognition as search: depth-first, breadth-first

  30. Recognition using NFSAs

  31. Regular languages • Operations on regular languages and FSAs: concatenation, closure, union • Properties of regular languages (closed under concatenation, union (disjunction), intersection, difference, complementation, reversal, Kleene closure)

  32. An exercise • J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.

  33. Morphology and Finite-State Transducers

  34. Morphemes • Stems, affixes • Affixes: prefixes, suffixes, infixes: hingi (borrow) – humingi (agent) in Tagalog, circumfixes: sagen – gesagt in German • Concatenative morphology • Templatic morphology (Semitic languages) : lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)

  35. Morphological analysis • rewrites • unbelievably

  36. Inflectional morphology • Tense, number, person, mood, aspect • Five verb forms in English • 40+ forms in French • Six cases in Russian:http://www.departments.bucknell.edu/russian/language/case.html • Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)

  37. Derivational morphology • Nominalization: computerization, appointee, killer, fuzziness • Formation of adjectives: computational, embraceable, clueless

  38. Finite-state morphological parsing • Cats: cat +N +PL • Cat: cat +N +SG • Cities: city +N +PL • Geese: goose +N +PL • Ducks: (duck +N +PL) or (duck +V +3SG) • Merging: merge +V +PRES-PART • Caught: (catch +V +PAST-PART) or (catch +V +PAST)
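
A lookup-table sketch (not a real finite-state transducer) of the surface-form-to-analysis mapping the slide illustrates; the tiny table below just restates the slide's examples.

# Toy surface-form -> analyses table standing in for an FST lexicon.
ANALYSES = {
    "cats":    ["cat +N +PL"],
    "cat":     ["cat +N +SG"],
    "cities":  ["city +N +PL"],
    "geese":   ["goose +N +PL"],
    "ducks":   ["duck +N +PL", "duck +V +3SG"],
    "merging": ["merge +V +PRES-PART"],
    "caught":  ["catch +V +PAST-PART", "catch +V +PAST"],
}

def parse(word):
    return ANALYSES.get(word.lower(), ["(unknown word)"])

print(parse("Ducks"))   # ambiguous: ['duck +N +PL', 'duck +V +3SG']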

  39. Principles of morphological parsing • Lexicon • Morphotactics (e.g., plural follows noun) • Orthography (easy → easier) • Irregular nouns: e.g., geese, sheep, mice • Irregular verbs: e.g., caught, ate, eaten

  40. FSA for adjectives • Big, bigger, biggest • Cool, cooler, coolest, coolly • Red, redder, reddest • Clear, clearer, clearest, clearly, unclear, unclearly • Happy, happier, happiest, happily • Unhappy, unhappier, unhappiest, unhappily • What about: unbig, redly, and realest?
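
A regular-expression stand-in for the adjective FSA, using only the roots listed on the slide and ignoring spelling changes such as the doubled g in bigger; it shows the overgeneration problem the last bullet asks about, since unbig, redly, and realest are all accepted.

import re

# un? (optional prefix), an adjective root, then an optional -er/-est/-ly suffix.
ADJ = re.compile(r"^(un)?(big|cool|red|clear|happy|real)(er|est|ly)?$")

for word in ["cooler", "unclearly", "clearest", "unbig", "redly", "realest"]:
    print(word, bool(ADJ.match(word)))   # all True; the last three are overgenerated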

  41. Using FSA for recognition • Is a string a legitimate word or not? • Two-level morphology: lexical level + surface level (Koskenniemi 83) • Finite-state transducers (FST) – used for regular relations • Inversion and composition of FST

  42. Orthographic rules • Beg/begging • Make/making • Watch/watches • Try/tries • Panic/panicked
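
A crude sketch of such spelling rules as Python substitutions applied at the morpheme boundary; real two-level rules are implemented as transducers, not string rewrites, and the rule set below covers only the slide's examples.

import re

def attach_suffix(stem, suffix):
    """Apply a few toy orthographic rules when gluing a suffix onto a stem."""
    word = stem + "^" + suffix                                 # ^ marks the morpheme boundary
    word = re.sub(r"e\^ing", "ing", word)                      # make + ing  -> making
    word = re.sub(r"([bgpt])\^ing", r"\1\1ing", word)          # beg + ing   -> begging
    word = re.sub(r"(ch|sh|s|x|z)\^s$", r"\1es", word)         # watch + s   -> watches
    word = re.sub(r"([^aeiou])y\^s$", r"\1ies", word)          # try + s     -> tries
    word = re.sub(r"c\^ed$", "cked", word)                     # panic + ed  -> panicked
    return word.replace("^", "")                               # default: plain concatenation

for stem, suffix in [("beg", "ing"), ("make", "ing"), ("watch", "s"), ("try", "s"), ("panic", "ed")]:
    print(stem, "+", suffix, "->", attach_suffix(stem, suffix))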

  43. Combining FST lexicon and rules • Cascades of transducers:the output of one becomes the input of another

  44. Weighted Automata

  45. Phonetic symbols • IPA • Arpabet • Examples

  46. Using WFST for language modeling • Phonetic representation • Part-of-speech tagging

  47. Word Classes and Part-of-Speech Tagging

  48. Some POS statistics • Preposition list from COBUILD • Single-word particles • Conjunctions • Pronouns • Modal verbs

  49. Tagsets for English • Penn Treebank • Other tagsets (see Week 1 slides)

  50. POS ambiguity • Degrees of ambiguity (DeRose 1988) • Rule-based POS tagging • ENGTWOL (Voutilainen et al.) • Sample rule: the Adverbial-That rule (“it isn’t that odd”):
    Given input: “that”
    if (+1 A/ADV/QUANT);    /* next word is an adjective, adverb, or quantifier */
       (+2 SENT-LIM);       /* followed by a sentence boundary */
       (NOT -1 SVOC/A);     /* previous word is not a verb like “consider” */
    then eliminate non-ADV tags
    else eliminate ADV tag
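
A very rough Python sketch of applying such a constraint to a sentence with ambiguous tags; the tag inventory, the simplified sentence-boundary test, and the example sentence are illustrative only, not the actual ENGTWOL machinery.

def apply_adverbial_that(words, candidate_tags):
    """candidate_tags[i] is the set of possible tags for words[i]."""
    for i, word in enumerate(words):
        if word.lower() != "that" or "ADV" not in candidate_tags[i]:
            continue
        next_ok = i + 1 < len(words) and candidate_tags[i + 1] & {"ADJ", "ADV", "QUANT"}
        end_ok = i + 2 >= len(words) or "." in words[i + 2]        # crude stand-in for SENT-LIM
        prev_ok = i == 0 or "SVOC/A" not in candidate_tags[i - 1]  # not a verb like "consider"
        if next_ok and end_ok and prev_ok:
            candidate_tags[i] = {"ADV"}            # eliminate non-ADV tags
        else:
            candidate_tags[i].discard("ADV")       # eliminate the ADV tag
    return candidate_tags

words = ["it", "is", "n't", "that", "odd", "."]
tags = [{"PRON"}, {"V"}, {"NEG"}, {"ADV", "DET", "COMP"}, {"ADJ"}, {"."}]
print(apply_adverbial_that(words, tags)[3])   # -> {'ADV'}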
