
CPSC 503 Computational Linguistics

CPSC 503 Computational Linguistics, Lecture 4. Giuseppe Carenini. Today (1/23): Finite State Transducers (FSTs) and morphological parsing; stemming (Porter Stemmer). Computational problems in morphology: recognition (recognize whether a string is an English word, using an FSA); parsing/generation: …





Presentation Transcript


  1. CPSC 503 Computational Linguistics, Lecture 4. Giuseppe Carenini. CPSC503 Spring 2004

  2. Today 1/23 • Finite State Transducers (FSTs) and Morphological Parsing • Stemming (Porter Stemmer)

  3. Computational problems in Morphology • Recognition: recognize whether a string is an English word (FSA) • Parsing/Generation: word ↔ stem, class, lexical features, e.g., lies ↔ lie +N +PL or lie +V +3SG • Stemming: word → stem

  4. Finite State Transducers (FSTs) • FSA cannot help …… • Need to extend FSA • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL” (or vice versa)

  5. FSTs as translators [Figure: an FST used as a translator, for parsing and for generation]

  6. Example [Figure: FST with states q0 to q7 and arcs l:l, i:i, e:e, +N:ε, +PL:s, +V:ε, +3SG:s] Transitions (as a translator): • l:l means read an l on one tape and write an l on the other (or vice versa) • +N:ε means read a +N symbol on one tape and write nothing on the other (or vice versa) • +PL:s means read +PL and write an s (or vice versa) • …

  7. Examples (as a translator) • surface: l i e s → lexical: ? • lexical: l i e +V +3SG → surface: ?

  8. Examples (as a recognizer and a generator) • Recognizer: lexical: l i e +V +3SG, surface: l i e s • Generator: lexical: ?, surface: ?

  9. FST definition • Q: a finite set of states • I, O: input and output alphabets (which may include ε) • Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O • q0: the start state • F: a set of accept/final states (F ⊆ Q) • δ: a transition relation that maps Q × Σ to Q
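
The definition above can be made concrete with a small sketch in Python. It encodes the "lie" transducer from slide 6 as a table of arcs and runs it in both directions. The state numbering (0 to 7), the names `ARCS`, `parse`, and `generate`, and the choice to write ε as an empty string are all assumptions made for illustration; the slides do not prescribe an implementation. The sketch also assumes, as in this machine, that ε only appears on the surface side.

```python
from typing import List, Set

EPS = ""  # epsilon: nothing is read or written on that tape

# state -> list of (lexical symbol, surface symbol, next state)
ARCS = {
    0: [("l", "l", 1)],
    1: [("i", "i", 2)],
    2: [("e", "e", 3)],
    3: [("+N", EPS, 4), ("+V", EPS, 5)],
    4: [("+PL", "s", 6)],
    5: [("+3SG", "s", 7)],
}
START = 0
FINALS = {6, 7}


def parse(surface: str) -> Set[str]:
    """Surface string -> all lexical analyses (every accepting path)."""
    results: Set[str] = set()
    stack = [(START, 0, "")]  # (state, position in surface, lexical output)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(surface) and state in FINALS:
            results.add(out)
        for lex, surf, nxt in ARCS.get(state, []):
            # Put a space before feature symbols such as +N for readability.
            piece = " " + lex if lex.startswith("+") else lex
            if surf == EPS:
                stack.append((nxt, pos, out + piece))
            elif surface.startswith(surf, pos):
                stack.append((nxt, pos + len(surf), out + piece))
    return results


def generate(lexical: List[str]) -> Set[str]:
    """Sequence of lexical symbols -> all surface strings."""
    results: Set[str] = set()
    stack = [(START, 0, "")]  # (state, position in lexical, surface output)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(lexical) and state in FINALS:
            results.add(out)
        for lex, surf, nxt in ARCS.get(state, []):
            if pos < len(lexical) and lexical[pos] == lex:
                stack.append((nxt, pos + 1, out + surf))
    return results


print(parse("lies"))                             # both analyses of "lies"
print(generate(["l", "i", "e", "+V", "+3SG"]))   # {'lies'}
```

Because `parse` explores every path, the surface form "lies" comes back with both the noun and the verb analysis, which is the ambiguity discussed later in the lecture.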

  10. FST can be used as… • Translators: input one string from I, output another from O (or vice versa) • Recognizers: input a string from I×O • Generators: output a string from I×O • Terminology warning!

  11. A step back: FSA can represent morphological knowledge • Lexicon: list of stems and affixes, together with basic information about them • Morphotactics: the rules governing the ordering of morphemes • Orthographic rules: model changes in morphemes when they combine

  12. FSA for inflectional morphology of plural [Figure: FSA with paths for some regular nouns and some irregular nouns]

  13. FST for inflectional morphology of plural [Figure: FST with paths for some regular nouns and some irregular nouns, including an o:i arc]

  14. Examples • surface: m i c e → lexical: ? • lexical: c a t +N +PL → surface: ?

  15. Problems/Challenges • Ambiguity: one word can correspond to multiple structures • Spelling changes: may occur when two morphemes are combined (inflectionally), e.g., butterfly + -s → butterflies

  16. Ambiguity • ND (nondeterministic) recognition: multiple paths through a machine may lead to an accept state (it does not matter which path was actually traversed) • In ND parsing the path to an accept state does matter: different paths represent different parses, and different outputs will result [Figure: FST with states q0 to q7 and arcs l:l, i:i, e:e, +N:ε, +PL:s, +V:ε, +PL:s]

  17. Ambiguity: more complex example • What’s the right parse for Unionizable? • Union-ize-able • Un-ion-ize-able • Each would represent a valid path through an FST for derivational morphology.

  18. Deal with Morphological Ambiguity • There are a number of ways to deal with this problem • Simply take the first output found • Find all the possible outputs (all paths) and return them all (without choosing) • Bias the search so that only one or a few likely paths are explored • Then use part-of-speech tagging to choose

  19. Spelling Changes • When morphemes are combined inflectionally, the spelling at the boundaries may change • Examples • E-insertion: when -s is added to a word, -e- is inserted if the word ends in -s, -z, -sh, -ch, or -x (e.g., kiss, miss, waltz, bush, watch, rich, box) • Y-replacement: when -s or -ed is added to a word ending in -y, the -y changes to -ie or -i respectively (e.g., try, butterfly)

  20. Solution: Multi-Tape Machines • Add intermediate tape • Use the output of one tape machine as the input to the next • Add intermediate symbols • ^ morpheme boundary • # word boundary

  21. Multi-Level Tape Machines [Figure: FST-1 feeding FST-2] • FST-1 translates between the lexical and the intermediate level • FST-2 handles the spelling changes (due to one rule) to the surface tape

  22. FST-1 for inflectional morphology of plural [Figure: FST-1 with paths for some regular nouns and some irregular nouns; arc labels include +PL:^s#, #, o:i, ε:s, ε:#, +PL:^]

  23. Example • lexical: f o x +N +PL → intermediate: ? • lexical: m o u s e +N +PL → intermediate: ?

  24. FST-2 for E-insertion (Intermediate to Surface) • E-insertion: when -s is added to a word, -e- is inserted if the word ends in -s, -z, -sh, -ch, or -x, as in fox^s# <-> foxes [Figure: E-insertion transducer; arc labels include #:ε]
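
To see what the E-insertion rule does to intermediate-level strings, here is a minimal Python sketch. It is not a transducer: it swaps in a regular-expression substitution for the FST, which covers only the intermediate-to-surface direction. The function name `e_insertion` is hypothetical, and erasing any leftover `^` and `#` at the end is an assumption about how the boundary symbols are cleaned up.

```python
import re


def e_insertion(intermediate: str) -> str:
    """Map an intermediate string such as 'fox^s#' to its surface form."""
    # Insert e between a stem ending in s, z, sh, ch, or x and the suffix s.
    with_e = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", intermediate)
    # Erase the remaining morpheme (^) and word (#) boundary symbols.
    return with_e.replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "watch^s#", "box^ing#"]:
    print(form, "->", e_insertion(form))
```

The rule fires only when the suffix is exactly `s` followed by the word boundary, so `box^ing#` passes through with just its boundary symbols removed.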

  25. Examples • intermediate: f o x ^ s # → surface: ? • intermediate: b o x ^ i n g # → surface: ?

  26. Where are we?

  27. Final Scheme: Part 1

  28. Final Scheme: Part 2

  29. Intersection(T1, T2) • States of T1 and T2: Q1 and Q2 • States of intersection: Q1 × Q2 • Transitions of T1 and T2: δ1, δ2 • Transitions of intersection: δ3, where δ3((xa, ya), i:c) = (xb, yb) iff δ1(xa, i:c) = xb AND δ2(ya, i:c) = yb
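
The intersection construction can be sketched directly from the definition above. In this Python sketch each transition function is assumed to be a dictionary from (state, (i, c)) to the next state, and every i:c pair, including any containing ε, is treated as one atomic label. The name `intersect` and the two toy machines are hypothetical.

```python
from typing import Dict, Hashable, Tuple

Label = Tuple[str, str]                         # a complex symbol i:c
Delta = Dict[Tuple[Hashable, Label], Hashable]  # (state, label) -> next state


def intersect(delta1: Delta, delta2: Delta) -> Delta:
    """Build delta3 over paired states: move only when both machines can."""
    delta3: Delta = {}
    for (xa, label1), xb in delta1.items():
        for (ya, label2), yb in delta2.items():
            if label1 == label2:  # same i:c on both machines
                delta3[((xa, ya), label1)] = (xb, yb)
    return delta3


# Two toy machines that agree on c:c but disagree on the +PL arc.
t1 = {(0, ("c", "c")): 1, (1, ("+PL", "s")): 2}
t2 = {("a", ("c", "c")): "b", ("b", ("+PL", "z")): "c"}
print(intersect(t1, t2))
```

Only the shared c:c arc survives, since an intersection transition exists only when both machines have a transition on the same pair.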

  30. Composition(T1, T2) • States of T1 and T2: Q1 and Q2 • States of composition: Q1 × Q2 • Transitions of T1 and T2: δ1, δ2 • Transitions of composition: δ3, where δ3((xa, ya), i:o) = (xb, yb) iff there exists c such that δ1(xa, i:c) = xb AND δ2(ya, c:o) = yb
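
Composition can be sketched the same way. This Python sketch again assumes dictionary-based transition functions and treats every symbol, ε included, as an ordinary label, so it does not handle the extra bookkeeping that ε-transitions need in a full implementation. Because different intermediate symbols c can lead to different targets for the same i:o, the result maps each key to a set of target states; that choice, the name `compose`, and the toy arcs are assumptions for illustration.

```python
from typing import Dict, Hashable, Set, Tuple

Label = Tuple[str, str]
Delta = Dict[Tuple[Hashable, Label], Hashable]
SetDelta = Dict[Tuple[Hashable, Label], Set[Hashable]]


def compose(delta1: Delta, delta2: Delta) -> SetDelta:
    """Pair an i:c arc of T1 with a c:o arc of T2 to get an i:o arc."""
    delta3: SetDelta = {}
    for (xa, (i, c1)), xb in delta1.items():
        for (ya, (c2, o)), yb in delta2.items():
            if c1 == c2:  # T1's output symbol is T2's input symbol
                delta3.setdefault(((xa, ya), (i, o)), set()).add((xb, yb))
    return delta3


# Toy arcs: T1 rewrites +PL as an intermediate symbol, T2 rewrites that as s.
t1 = {(0, ("+PL", "^s")): 1, (0, ("x", "x")): 0}
t2 = {("p", ("^s", "s")): "q", ("p", ("x", "x")): "p"}
print(compose(t1, t2))
```

The composed machine goes straight from +PL to s, with the intermediate symbol no longer visible.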

  31. Other important applications of FSTs in NLP • Segmentation: finding word boundaries in text (?!) • Shallow syntactic parsing: e.g., find only noun phrases • Dialogue Act Disambiguation: “right” (IUI-04) • Phonological Rules….

  32. FSTs in Practice • Install an FST package…… (pointers) • Describe your “formal language” (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer) • Your specification is compiled into an FST • NOTE: FSTs for the morphology of a natural language may have 10^5 to 10^7 states and arcs

  33. Computational problems in Morphology • Recognition: recognize whether a string is an English word (FSA) • Parsing/Generation (FST): word ↔ stem, class, lexical features, e.g., lies ↔ lie +N +PL or lie +V +3SG • Stemming: word → stem

  34. Stemmer • E.g., the Porter algorithm (Appendix B), which is based on a series of sets of simple cascaded rewrite rules: • ATIONAL → ATE (relational → relate) • ING → ε if stem contains vowel (motoring → motor) • Cascade of rules applied to: computerization • ization → -ize: computerize • ize → ε: computer • Errors occur: organization → organ, doing → doe, university → universe
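
The cascade idea can be illustrated with a toy sketch in Python that uses only the four rules quoted on this slide. This is not the Porter algorithm, which has many more rules and conditions; the function names, the rule order, and the vowel test (a, e, i, o, u only) are assumptions made for the example.

```python
import re

VOWEL = re.compile(r"[aeiou]")  # simplified vowel test (assumption)


def step_ational(word: str) -> str:
    # ATIONAL -> ATE
    return word[:-7] + "ate" if word.endswith("ational") else word


def step_ing(word: str) -> str:
    # ING -> epsilon, only if the remaining stem contains a vowel
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]
    return word


def step_ization(word: str) -> str:
    # IZATION -> IZE
    return word[:-7] + "ize" if word.endswith("ization") else word


def step_ize(word: str) -> str:
    # IZE -> epsilon
    return word[:-3] if word.endswith("ize") else word


CASCADE = [step_ational, step_ing, step_ization, step_ize]


def toy_stem(word: str) -> str:
    """Apply each rewrite step in turn; each step sees the previous output."""
    for step in CASCADE:
        word = step(word)
    return word


for w in ["relational", "motoring", "computerization", "organization", "sing"]:
    print(w, "->", toy_stem(w))
```

Even this toy cascade reproduces one of the errors listed above: organization is reduced to organ, because the rules look only at suffixes and never consult a lexicon.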

  35. Stemming is mainly used in Information Retrieval • Run a stemmer on the documents to be indexed • Run a stemmer on users’ queries • Compute similarity between queries and documents (based on the stems they contain)

  36. Porter as an FST • The original exposition of the Porter stemmer did not describe it as a transducer but… • Each stage is a separate transducer • The stages can be composed to get one big transducer

  37. Formalisms and associated Algorithms ↔ Linguistic Knowledge • State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers ↔ (English) Morphology • Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars ↔ Syntax • Logical formalisms (First-Order Logics) ↔ Semantics • AI planners ↔ Pragmatics, Discourse and Dialogue

  38. Next Time • Intro to probability and information theory • On your preferred source read about • Conditional probability • Bayes’ rule • Independence • Entropy • Conditional Entropy and Mutual Information

  39. Lexical to Intermediate Level

  40. FST for inflectional morphology of plural [Figure: FST with paths for some regular nouns and some irregular nouns]

  41. Foxes

  42. FST Review • FSTs allow us to take an input and deliver a structure based on it • Or… take a structure and create a surface form • Or take a structure and create another structure

  43. Formalisms and associated Algorithms ↔ Linguistic Knowledge • State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers ↔ (English) Morphology • Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars ↔ Syntax • Logical formalisms (First-Order Logics) ↔ Semantics • AI planners ↔ Pragmatics, Discourse and Dialogue

  44. Review • In many applications it’s convenient to decompose the problem into a set of cascaded transducers, where the output of one feeds into the input of the next.

  45. English Spelling Changes • We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

  46. FST can be used as… • Translators: input one string (a sequence from I), output another one (a sequence from O), or vice versa • Recognizers: input both strings (a sequence from I×O) • Generators: output both strings (a sequence from I×O)
