
CPSC 503 Computational Linguistics



Presentation Transcript


  1. CPSC 503 Computational Linguistics • Lecture 3 • Giuseppe Carenini • CPSC503 Winter 2008

  2. Subscribe to mailing list • Some more Intros • NLP@UBC

  3. Introductions • Your Name • Previous experience in NLP? • Why are you interested in NLP? • Are you thinking of NLP as your main research area? If not, what else do you want to specialize in? • Anything else…

  4. NLP research at UBC • TOPICS: Generation and Summarization of Evaluative Text (e.g., customer reviews); Summarization of conversations (emails, blogs, meetings) • PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students • SUPPORT: NSERC, Google, Business Objects (now SAP), Microsoft Research

  5. Linguistic Knowledge and the associated Formalisms and Algorithms • (English) Morphology: State Machines (no prob.), i.e., Finite State Automata (and Regular Expressions) and Finite State Transducers • Syntax: Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars • Semantics: Logical formalisms (First-Order Logics) • Pragmatics, Discourse and Dialogue: AI planners

  6. Computational tasks in Morphology • Recognition: recognize whether a string is an English/… word (FSA) • Parsing/Generation: word <-> stem, class, lexical features; e.g., bought <-> buy +V +PAST-PART or buy +V +PAST • Stemming: word -> stem

  7. Today (Sept 15) • Finite State Transducers (FSTs) and Morphological Parsing • Stemming (Porter Stemmer)

  8. FST definition • Q: a finite set of states • I, O: input and output alphabets (which may include ε) • Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O • q0: the start state • F: a set of accept/final states (F ⊆ Q) • δ: a transition relation that maps Q × Σ to 2^Q • E.g., |Q| = 3; I = {a, b, c, ε}; O = {a, b}; |Σ| = ?; 0 ≤ |δ| ≤ ?
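The definition above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `FST`, the method names, and the encoding of ε as the empty string are not from the slides.

```python
from dataclasses import dataclass, field

EPS = ""  # assumption: ε is encoded as the empty string


@dataclass
class FST:
    states: set   # Q
    start: str    # q0
    finals: set   # F, a subset of Q
    # delta maps (state, (i, o)) to a set of next states: Q × Σ -> 2^Q
    delta: dict = field(default_factory=dict)

    def add_arc(self, src, i, o, dst):
        """Add a transition labelled with the complex symbol i:o."""
        self.delta.setdefault((src, (i, o)), set()).add(dst)

    def accepts_pair(self, inp, out):
        """Recognizer view: does some path to a final state read inp and write out?"""
        agenda = [(self.start, 0, 0)]  # (state, position in inp, position in out)
        seen = set()
        while agenda:
            config = agenda.pop()
            if config in seen:
                continue
            seen.add(config)
            q, a, b = config
            if q in self.finals and a == len(inp) and b == len(out):
                return True
            for (src, (i, o)), targets in self.delta.items():
                # an ε side matches without consuming anything
                if src == q and inp.startswith(i, a) and out.startswith(o, b):
                    for dst in targets:
                        agenda.append((dst, a + len(i), b + len(o)))
        return False


# Toy machine: a:b followed by any number of c:ε
toy = FST(states={"q0", "q1"}, start="q0", finals={"q1"})
toy.add_arc("q0", "a", "b", "q1")
toy.add_arc("q1", "c", EPS, "q1")
print(toy.accepts_pair("acc", "b"))  # True
print(toy.accepts_pair("a", "a"))    # False
```

Storing δ as a dictionary of sets mirrors the "maps Q × Σ to 2^Q" wording: one state and one i:o label may lead to several next states.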

  9. An FST can be used as… • Translator: input one string from I, output another from O (or vice versa) • Recognizer: input a string from I × O • Generator: output a string from I × O • Terminology warning! E.g., if I = {a, b, c, ε}; O = {a, b}; …

  10. FST: inflectional morphology of the plural • [Figure: FST with branches for some regular nouns and some irregular nouns, e.g., an o:i arc] • Notes: X abbreviates X:X; arc labels read lexical:surface

  11. Examples • lexical: ___ <-> surface: m i c e • lexical: c a t +N +PL <-> surface: ___
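One way to work through the two examples is to run a tiny plural transducer in both directions. The sketch below is a toy under assumptions: the state names, the `ARCS` table, and the function `transduce` are hypothetical, and only "cat" and the plural of "mouse" are covered.

```python
EPS = ""  # assumption: ε is encoded as the empty string

# Arcs are (source, lexical symbol, surface symbol, target).
ARCS = [
    # regular noun "cat": each letter maps to itself (X:X)
    ("q0", "c", "c", "c1"), ("c1", "a", "a", "c2"), ("c2", "t", "t", "reg"),
    ("reg", "+N", EPS, "regN"),
    ("regN", "+SG", EPS, "end"),
    ("regN", "+PL", "s", "end"),
    # irregular plural "mice" for lexical "mouse"
    ("q0", "m", "m", "m1"), ("m1", "o", "i", "m2"), ("m2", "u", EPS, "m3"),
    ("m3", "s", "c", "m4"), ("m4", "e", "e", "irr"),
    ("irr", "+N", EPS, "irrN"), ("irrN", "+PL", EPS, "end"),
]


def transduce(tokens, read=0, state="q0", finals=("end",)):
    """Return every output sequence for tokens.

    read=0 reads the lexical side and writes the surface side (generation);
    read=1 reads the surface side and writes the lexical side (parsing).
    """
    results = []
    if not tokens and state in finals:
        results.append([])
    for arc in ARCS:
        src, dst = arc[0], arc[3]
        sym_in, sym_out = arc[1 + read], arc[2 - read]
        if src != state:
            continue
        if sym_in == EPS:
            rest = tokens                      # ε on the read side consumes nothing
        elif tokens and tokens[0] == sym_in:
            rest = tokens[1:]
        else:
            continue
        for tail in transduce(rest, read, dst, finals):
            results.append(([sym_out] if sym_out != EPS else []) + tail)
    return results


print(transduce(["c", "a", "t", "+N", "+PL"]))  # [['c', 'a', 't', 's']]
print(transduce(list("mice"), read=1))          # [['m', 'o', 'u', 's', 'e', '+N', '+PL']]
```

Because the function collects every successful path, it returns a list of analyses rather than a single one.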

  12. Computational Morphology: Problems/Challenges • (1) Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages) • (2) Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s -> butterflies

  13. Ambiguity: more complex example • What's the right parse for Unionizable? • Union-ize-able • Un-ion-ize-able • Each would represent a valid path through an FST for derivational morphology. • Both Adj…

  14. Dealing with Morphological Ambiguity • Find all the possible outputs (all paths) and return them all (without choosing) • Then use part-of-speech tagging to choose: look at the neighboring words

  15. (2) Spelling Changes • When morphemes are combined inflectionally, the spelling at the boundaries may change • Examples • E-insertion: when -s is added to a word, -e- is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box) • Y-replacement: when -s or -ed is added to a word ending in -y, the -y changes to -ie or -i respectively (e.g., try, butterfly)
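The two rules can be checked on the slide's own example words with a short sketch. The function names `add_s` and `add_ed` are hypothetical, and the restriction of Y-replacement to a consonant before the -y (so that "boy" gives "boys") is an added assumption that the slide does not state.

```python
import re


def add_s(word):
    """Attach -s, applying E-insertion and Y-replacement."""
    if re.search(r"(s|z|sh|ch|x)$", word):
        return word + "es"            # E-insertion: kiss -> kisses
    if re.search(r"[^aeiou]y$", word):
        return word[:-1] + "ies"      # Y-replacement: y -> ie before -s
    return word + "s"


def add_ed(word):
    """Attach -ed, applying Y-replacement."""
    if re.search(r"[^aeiou]y$", word):
        return word[:-1] + "ied"      # Y-replacement: y -> i before -ed
    return word + "ed"


for w in ["kiss", "waltz", "bush", "watch", "box", "try", "butterfly", "cat"]:
    print(w, "->", add_s(w))
print("try ->", add_ed("try"))
```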

  16. Solution: Multi-Tape Machines • Add an intermediate tape • Use the output of one tape machine as the input to the next • Add intermediate symbols: ^ (morpheme boundary), # (word boundary)

  17. Multi-Level Tape Machines • [Figure: FST-1, FST-2] • FST-1 translates between the lexical and the intermediate level • FST-2 handles the spelling changes (due to one rule) to the surface tape

  18. FST-1 for inflectional morphology of the plural (Lexical <-> Intermediate) • [Figure: FST-1 with branches for some regular nouns and some irregular nouns; arc labels include +PL:^s#, #, o:i, ε:s, ε:#, +PL:^]

  19. Examples • lexical: f o x +N +PL <-> intermediate: ___ • lexical: m o u s e +N +PL <-> intermediate: ___

  20. FST-2 for E-insertion (Intermediate <-> Surface) • E-insertion: when -s is added to a word, -e- is inserted if the word ends in -s, -z, -sh, -ch, -x, as in fox^s# <-> foxes • [Figure: FST-2; arc labels include #:ε]

  21. Examples • intermediate: f o x ^ s # <-> surface: ___ • intermediate: b o x ^ i n g # <-> surface: ___
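The lexical -> intermediate -> surface cascade can be imitated with two small functions. This sketch uses regular-expression rewriting as a stand-in for the two transducers, not an actual FST implementation; the function names are hypothetical and only regular nouns are handled.

```python
import re


def lexical_to_intermediate(stem, features):
    """Stand-in for FST-1 (regular nouns only): add ^ and # boundary symbols."""
    if features == ["+N", "+PL"]:
        return stem + "^s#"
    if features == ["+N", "+SG"]:
        return stem + "#"
    raise ValueError("unsupported feature bundle")


def intermediate_to_surface(form):
    """Stand-in for FST-2: apply E-insertion, then erase the boundary symbols."""
    # insert e when the stem ends in s, z, sh, ch, x and the suffix is exactly s
    form = re.sub(r"(s|z|sh|ch|x)\^(?=s#)", r"\1e", form)
    return form.replace("^", "").replace("#", "")


inter = lexical_to_intermediate("fox", ["+N", "+PL"])
print(inter)                                # fox^s#
print(intermediate_to_surface(inter))       # foxes
print(intermediate_to_surface("box^ing#"))  # boxing
```

The second call shows why the rule has to look at the suffix: the boundary after "box" triggers E-insertion before -s but not before -ing.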

  22. Where are we?

  23. Final Scheme: Part 1

  24. Final Scheme: Part 2

  25. Intersection(FST1, FST2) • States of FST1 and FST2: Q1 and Q2 • States of intersection: Q1 × Q2 • Transitions of FST1 and FST2: δ1, δ2 • Transitions of intersection: δ3 • For all i, j, n, m, a, b: δ3((q1i, q2j), a:b) = (q1n, q2m) iff δ1(q1i, a:b) = q1n AND δ2(q2j, a:b) = q2m • [Figure: arc q1i -a:b-> q1n and arc q2j -a:b-> q2m combine into arc (q1i, q2j) -a:b-> (q1n, q2m)]
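The intersection rule translates almost literally into code. In this sketch the function name `intersect` and the dictionary encoding of δ (a map from (state, (a, b)) to the next state) are assumptions, and the machines are taken to be deterministic with no ε labels; transducers with ε arcs are not closed under intersection in general.

```python
def intersect(delta1, delta2):
    """Build δ3 over pair states: keep an a:b arc only if both machines have it."""
    delta3 = {}
    for (q1, label1), n1 in delta1.items():
        for (q2, label2), n2 in delta2.items():
            if label1 == label2:
                delta3[((q1, q2), label1)] = (n1, n2)
    return delta3


d1 = {("p0", ("a", "b")): "p1", ("p0", ("c", "c")): "p0"}
d2 = {("r0", ("a", "b")): "r1", ("r0", ("a", "a")): "r0"}
print(intersect(d1, d2))
# {(('p0', 'r0'), ('a', 'b')): ('p1', 'r1')}
```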

  26. Composition(FST1, FST2) • States of FST1 and FST2: Q1 and Q2 • States of composition: Q1 × Q2 • Transitions of FST1 and FST2: δ1, δ2 • Transitions of composition: δ3 • For all i, j, n, m, a, b: δ3((q1i, q2j), a:b) = (q1n, q2m) iff there exists c such that δ1(q1i, a:c) = q1n AND δ2(q2j, c:b) = q2m • [Figure: arc q1i -a:c-> q1n and arc q2j -c:b-> q2m combine into arc (q1i, q2j) -a:b-> (q1n, q2m)]
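Composition differs from intersection only in how labels are matched: the output symbol of the first arc must equal the input symbol of the second. The sketch below assumes the same hypothetical dictionary encoding of δ and treats ε like any other symbol, which a full implementation would not do. Targets are stored in a set because different middle symbols c can yield the same a:b label.

```python
def compose(delta1, delta2):
    """Build δ3: an a:b arc exists if FST1 has a:c and FST2 has c:b for some c."""
    delta3 = {}
    for (q1, (a, c1)), n1 in delta1.items():
        for (q2, (c2, b)), n2 in delta2.items():
            if c1 == c2:
                delta3.setdefault(((q1, q2), (a, b)), set()).add((n1, n2))
    return delta3


d1 = {("p0", ("a", "x")): "p1"}
d2 = {("r0", ("x", "b")): "r1", ("r0", ("y", "b")): "r0"}
print(compose(d1, d2))
# {(('p0', 'r0'), ('a', 'b')): {('p1', 'r1')}}
```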

  27. FSTs in Practice • Install an FST package… (pointers) • Describe your "formal language" (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer) • Your specification is compiled into a single FST • Ref: "Finite State Morphology" (Beesley and Karttunen, 2003, CSLI Publications) • Complexity/Coverage: FSTs for the morphology of a natural language may have 10^5 to 10^7 states and arcs • Spanish (1996): 46 × 10^3 stems; 3.4 × 10^6 word forms • Arabic (2002?): 131 × 10^3 stems; 7.7 × 10^6 word forms

  28. Other important applications of FSTs in NLP • From segmenting words into morphemes to… • Tokenization: finding word boundaries in text (?!) … maxmatch • Finding sentence boundaries: punctuation… but "." is ambiguous; look at the example in Fig. 3.22 • Shallow syntactic parsing: e.g., find only noun phrases • Phonological Rules… (Chpt. ?11?)
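The "maxmatch" idea mentioned for tokenization is a greedy longest-match segmentation against a word list. A minimal sketch follows; the function name `max_match`, the toy lexicon, and the single-character fallback for unknown material are assumptions.

```python
def max_match(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1                       # nothing matched: emit one character
        tokens.append(text[i:j])
        i = j
    return tokens


lexicon = {"the", "theta", "table", "down", "there", "bled", "own"}
print(max_match("thetabledownthere", lexicon))
# ['theta', 'bled', 'own', 'there']
```

The output shows the weakness of the greedy strategy: "the table down there" is also a valid segmentation, but the longest first match "theta" rules it out.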

  29. Computational tasks in Morphology • Recognition: recognize whether a string is an English word (FSA) • Parsing/Generation: word <-> stem, class, lexical features; e.g., bought <-> buy +V +PAST-PART or buy +V +PAST • Stemming: word -> stem

  30. Stemmer • E.g., the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules: • (condition) S1 -> S2 • ATIONAL -> ATE (relational -> relate) • (*v*) ING -> ε if stem contains vowel (motoring -> motor) • Cascade of rules applied to "computerization": • -ization -> -ize: computerize • -ize -> ε: computer • Errors occur: organization -> organ, doing -> doe, university -> universe • Code freely available in most languages: Python, Java, …
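The cascade can be imitated with a toy two-stage rewriter that contains only the rules quoted on the slide. This is not the Porter algorithm: the grouping into stages, the names `STAGES`, `apply_stage`, and `toy_stem`, and the absence of Porter's other conditions are all simplifying assumptions.

```python
import re

VOWEL = re.compile(r"[aeiou]")

# Each rule is (suffix S1, replacement S2, optional condition on the remaining stem).
STAGES = [
    [("ational", "ate", None), ("ization", "ize", None)],
    [("ize", "", None), ("ing", "", lambda stem: bool(VOWEL.search(stem)))],
]


def apply_stage(word, rules):
    """Apply the first rule whose suffix matches; leave the word alone otherwise."""
    for suffix, replacement, condition in rules:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if condition is None or condition(stem):
                return stem + replacement
            return word
    return word


def toy_stem(word):
    """Run the stages in order, feeding each stage's output to the next."""
    for rules in STAGES:
        word = apply_stage(word, rules)
    return word


for w in ["computerization", "relational", "motoring", "sing"]:
    print(w, "->", toy_stem(w))
```

"sing" is left unchanged because the stem "s" that would remain contains no vowel, which is the role of the (*v*) condition.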

  31. Stemming is mainly used in Information Retrieval • Run a stemmer on the documents to be indexed • Run a stemmer on users' queries • Compute similarity between queries and documents (based on the stems they contain) • Seems to work especially well with smaller documents
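The three steps can be sketched with a deliberately crude stemmer and a set-overlap score. Everything here is an assumption made for illustration: `crude_stem` is not the Porter stemmer, and Jaccard overlap is only one of many possible similarity measures.

```python
def crude_stem(word):
    """Strip a few common suffixes, keeping at least three characters of stem."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stems(text):
    """The set of stems in a whitespace-tokenized, lower-cased text."""
    return {crude_stem(w) for w in text.lower().split()}


def similarity(query, document):
    """Jaccard overlap between the stem sets of the query and the document."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0


print(stems("indexing documents"))
print(similarity("indexing documents", "index the document"))
```

Without stemming the query and the document above share no word forms at all; after stemming they share "index" and "document".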

  32. Porter as an FST • The original exposition of the Porter stemmer did not describe it as a transducer but… • Each stage is a separate transducer • The stages can be composed to get one big transducer

  33. Next Time • Read handout • Probability • Stats • Information theory • Next Lecture: • finish Chpt 3, 3.10-11 • Start Probabilistic Models for NLP (Chpt. 4, 4.1 – 4.2 and 5.9!)
