
CMSC 723: Intro to Computational Linguistics

CMSC 723: Intro to Computational Linguistics. February 11, 2003 Lecture 3: Finite-State Morphology Prof. Bonnie J. Dorr and Dr. Nizar Habash TAs: Nitin Madnani and Nate Waisbrot. Plan for Today’s Lecture. Morphology: Definitions and Problems What is Morphology? Topology of Morphologies



  1. CMSC 723: Intro to Computational Linguistics • February 11, 2003 • Lecture 3: Finite-State Morphology • Prof. Bonnie J. Dorr and Dr. Nizar Habash • TAs: Nitin Madnani and Nate Waisbrot

  2. Plan for Today’s Lecture • Morphology: Definitions and Problems • What is Morphology? • Topology of Morphologies • Approaches to Computational Morphology • Lexicons and Rules • Computational Morphology Approaches • Assignment 2

  3. Morphology • The study of the way words are built up from smaller meaning-bearing units called morphemes • Abstract versus Realized • HOP +PAST → hop +ed → hopped → /hapt/ • Each step depends on context

  4. Phonology and Morphology • Phonology vs. Orthography • Historical spelling • night, nite • attention, mission, fish • Script Limitations • Spoken English has 14 vowels • heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough • English Alphabet has 5 • Use vowel combinations: far fair fare • Consonantal doubling (hopping vs. hoping)

  5. Syntax and Morphology • Phrase-level agreement • Subject-Verb • John studies hard (STUDY+3SG) • Noun-Adjective • Las vacas hermosas • Sub-word phrasal structures (figure labels: conj, prep, noun, article, plural, poss) • שבספרינו • ש+ב+ספר+ים+נו • That+in+book+PL+Poss:1PL • Which are in our books

  6. Topology of Morphologies • Concatenative vs. Templatic • Derivational vs. Inflectional • Regular vs. Irregular

  7. Concatenative Morphology • Morpheme+Morpheme+Morpheme+… • Stems: also called lemma, base form, root, lexeme • hope+ing → hoping, hop+ing → hopping • Affixes • Prefixes: Anti+dis+establishmentarianism • Suffixes: Antidisestablish+ment+arian+ism • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German • Agglutinative Languages • uygarlaştıramadıklarımızdanmışsınızcasına • uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına • Behaving as if you are among those whom we could not cause to become civilized

  8. Templatic Morphology • Roots and Patterns • Root: K T B (Arabic ك ت ب, Hebrew כ ת ב) • Arabic pattern مَ ? ? و ? + root → مكتوب maktuub “written” • Hebrew pattern ? ? ו ? + root → כתוב ktuuv “written”

  9. Templatic Morphology: Root Meaning • KTB: writing “stuff” • book: كتاب • write: كتب, כתב • spelling: כתיב • library: مكتبة • letter: مكتوب, מכתב • address: כתובת • office: مكتب • writer: كاتب, כתב
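As a rough illustration of the root-and-pattern idea on slides 8 and 9, the sketch below fills numbered slots in a pattern with the consonants of a root. The Latin transliterations and the slot notation (1, 2, 3) are assumptions made for this example and do not come from the slides; real templatic systems also apply phonological changes (such as Hebrew b → v in ktuuv) that this sketch ignores.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill the digit slots 1, 2, 3 of a pattern with the root consonants."""
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch
        for ch in pattern
    )


ROOT = "ktb"  # the K-T-B "writing" root, in a simplified transliteration

# Hypothetical pattern notation: digits mark root-consonant slots.
PATTERNS = {
    "ma12uu3": "written",
    "1i2aa3": "book",
    "1aa2i3": "writer",
    "ma12a3": "office",
}

for pattern, gloss in PATTERNS.items():
    print(pattern, "->", interdigitate(ROOT, pattern), f"({gloss})")
```

The same function works for any three-consonant root, which is the point of separating roots from patterns.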

  10. Inflectional vs. Derivational • Word Classes • Parts of speech: noun, verb, adjectives, etc. • Word class dictates how a word combines with morphemes to form new words

  11. Derivational morphology • Nominalization: computerization, appointee, killer, fuzziness • Formation of adjectives: computational, clueless, embraceable • CatVar: Categorial Variation Database http://clipdemos.umiacs.umd.edu/catvar/

  12. Inflectional morphology • Adds: Tense, number, person, mood, aspect • Word class doesn’t change • Word serves new grammatical role • Five verb forms in English • Other languages have (lots more)

  13. Nouns and Verbs (in English) • Nouns have simple inflectional morphology • cat • cat+s, cat+’s • Verbs have more complex morphology

  14. Regulars and Irregulars • Nouns • Cat/Cats • Mouse/Mice, Ox/Oxen, Goose/Geese • Verbs • Walk/Walked • Go/Went, Fly/Flew

  15. Regular (English) Verbs

  16. Irregular (English) Verbs

  17. “To love” in Spanish

  18. Computational Morphology • Finite State Morphology • Finite State Transducers (FST) • Input/Output • Analysis/Generation

  19. Computational Morphology • WORD → STEM (+FEATURES)* • cats → cat +N +PL • cat → cat +N +SG • cities → city +N +PL • geese → goose +N +PL • ducks → (duck +N +PL) or (duck +V +3SG) • merging → merge +V +PRES-PART • caught → (catch +V +PAST-PART) or (catch +V +PAST)

  20. Computational Morphology • The Rules and the Lexicon • General versus Specific • Regular versus Irregular • Accuracy, speed, space • The Morphology of a language • Approaches • Lexicon only • Rules only • Lexicon and Rules • Finite-state Automata • Finite-state Transducers

  21. Lexicon-only Morphology • The lexicon lists all surface level and lexical level pairs • No rules …? • Analysis/Generation is easy • Very large for English • What about Arabic or Turkish? • Chinese?
     Sample entries (surface form, stem, analysis):
       acclaim        acclaim       $N$
       acclaim        acclaim       $V+0$
       acclaimed      acclaim       $V+ed$
       acclaimed      acclaim       $V+en$
       acclaiming     acclaim       $V+ing$
       acclaims       acclaim       $N+s$
       acclaims       acclaim       $V+s$
       acclamation    acclamation   $N$
       acclamations   acclamation   $N+s$
       acclimate      acclimate     $V+0$
       acclimated     acclimate     $V+ed$
       acclimated     acclimate     $V+en$
       acclimates     acclimate     $V+s$
       acclimating    acclimate     $V+ing$
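A lexicon-only approach amounts to two lookup tables, one per direction. The sketch below is a minimal illustration using a few of the entries from slide 21; the function names `analyze` and `generate` are made up for this example.

```python
from collections import defaultdict

# A few (surface, stem, analysis) entries copied from the slide.
ENTRIES = """\
acclaim acclaim N
acclaim acclaim V+0
acclaimed acclaim V+ed
acclaimed acclaim V+en
acclaims acclaim N+s
acclaims acclaim V+s
"""

analyses = defaultdict(list)  # surface form -> [(stem, analysis), ...]
surfaces = defaultdict(list)  # (stem, analysis) -> [surface form, ...]
for line in ENTRIES.splitlines():
    surface, stem, tag = line.split()
    analyses[surface].append((stem, tag))
    surfaces[(stem, tag)].append(surface)


def analyze(word):
    """Analysis: surface form to every listed (stem, analysis) pair."""
    return analyses.get(word, [])


def generate(stem, tag):
    """Generation: (stem, analysis) to every listed surface form."""
    return surfaces.get((stem, tag), [])


print(analyze("acclaimed"))
print(generate("acclaim", "V+s"))
print(analyze("cats"))  # unlisted words get no analysis at all
```

Both directions are trivial lookups, but every form must be listed explicitly, which is why the table grows so quickly for morphologically rich languages.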

  22. Lexicon and Rules: FSA Inflectional Morphology • English Noun Lexicon • English Noun Rule

  23. FSA: English Verb Inflectional Morphology

  24. FSA for Derivational Morphology: Adjectival Formation

  25. More Complex Derivational Morphology

  26. Using FSAs for Recognition: English Nouns and their Inflection

  27. Morphological Parsing • Finite-state automata (FSA) • Recognizer • One-level morphology • Finite-state transducers (FST) • Two-level morphology • PC-Kimmo (Koskenniemi 83) • input-output pair
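To make the FSA-as-recognizer idea concrete, here is a small sketch of a letter-level automaton for English nouns: regular nouns may take a plural -s, while irregular singulars and plurals are listed whole. The word lists and function names are assumptions for illustration, and no spelling rules are applied, so a regular noun such as "fox" would wrongly yield "foxs".

```python
def build_noun_fsa(reg_nouns, irreg_sg, irreg_pl):
    """Build a deterministic letter-level FSA; state 0 is the start state."""
    transitions = {}   # (state, letter) -> next state
    accepting = set()

    def add_path(word):
        state = 0
        for letter in word:
            key = (state, letter)
            if key not in transitions:
                # Each new arc leads to a fresh state.
                transitions[key] = len(transitions) + 1
            state = transitions[key]
        return state

    for word in irreg_sg + irreg_pl:
        accepting.add(add_path(word))
    for word in reg_nouns:
        accepting.add(add_path(word))        # singular
        accepting.add(add_path(word + "s"))  # regular plural
    return transitions, accepting


def accepts(fsa, word):
    """Return True if the FSA ends in an accepting state on this word."""
    transitions, accepting = fsa
    state = 0
    for letter in word:
        state = transitions.get((state, letter))
        if state is None:
            return False
    return state in accepting


NOUN_FSA = build_noun_fsa(["cat", "dog"], ["goose", "mouse"], ["geese", "mice"])
for w in ["cats", "geese", "gooses", "ca"]:
    print(w, accepts(NOUN_FSA, w))
```

A recognizer only answers yes or no; it does not say which stem or features a word has, which is what motivates moving to transducers.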

  28. Terminology for PC-Kimmo • Upper = lexical tape • Lower = surface tape • Characters correspond to pairs, written a:b • If “a:a”, write “a” for shorthand • Two-level lexical entries • # = word boundary • ^ = morpheme boundary • Other = “any feasible pair that is not in this transducer”

  29. Four-Fold View of FSTs • As a recognizer • As a generator • As a translator • As a set relater

  30. Nominal Inflection FST

  31. Lexical and Intermediate Tapes

  32. Spelling Rules

  33. Chomsky and Halle Notation • ε → e / {x, s, z} ^ __ s #
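The rule on slide 33 reads: insert e after a morpheme boundary (^) when the boundary follows x, s, or z and is followed by s and the word boundary (#). A minimal sketch with a regular expression, assuming intermediate-tape strings such as `fox^s#`; the function names are made up for this example.

```python
import re


def e_insertion(intermediate: str) -> str:
    """Apply  ε → e / {x, s, z} ^ __ s #  to an intermediate-tape string."""
    # Match x, s, or z followed by ^, but only when "s#" comes next.
    return re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)


def to_surface(intermediate: str) -> str:
    """Apply the rule, then erase the boundary symbols ^ and #."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")


for form in ["fox^s#", "kiss^s#", "cat^s#", "fox#"]:
    print(form, "->", to_surface(form))
```

Only the three contexts listed on the slide are covered here; a fuller rule set would need further contexts and further rules.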

  34. Intermediate-to-Surface Transducer

  35. State Transition Table

  36. Two-Level Morphology

  37. Sample Run

  38. FST Properties • Inversion • T⁻¹ = inversion of T • Input/Output switched • Composition • T1 maps I1 to O1 • T2 maps I2 to O2 • T1 ∘ T2 maps I1 to O2
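Inversion and composition can be illustrated without building real transducers by treating each transducer as the finite set of (input, output) pairs it relates. This is a simplification chosen for the example, not how FST toolkits represent machines, and the sample pairs are assumed.

```python
def invert(t):
    """T⁻¹: swap input and output in every pair."""
    return {(out, inp) for inp, out in t}


def compose(t1, t2):
    """T1 ∘ T2: feed each output of T1 into T2."""
    return {
        (inp, out2)
        for inp, mid in t1
        for mid2, out2 in t2
        if mid == mid2
    }


# Lexical -> intermediate, then intermediate -> surface (toy relations).
T1 = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
T2 = {("fox^s#", "foxes"), ("cat^s#", "cats")}

generator = compose(T1, T2)   # lexical -> surface
analyzer = invert(generator)  # surface -> lexical

print(sorted(generator))
print(sorted(analyzer))
```

Inverting the composed relation turns a generator into an analyzer, which is why a single transducer can serve both directions.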

  39. FSTs and ambiguity • Kimmo Demo • Parse Example 1: unionizable • union +ize +able • un+ ion +ize +able • Parse Example 2: assess • assess(V) • ass(N) +ess(N) • Parse Example 3: tender • tender(AJ) • ten(Num) +d(AJ) +er(CMP)

  40. What to do about Global Ambiguity? • Accept first successful structure • Run parser through all possible paths • Bias the search in some manner
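The three strategies can be compared on the "unionizable" example with a toy segmenter that explores every path. The morph list below is an assumption for illustration (it uses the surface form "iz" for -ize before -able), and "fewest morphemes" is just one possible bias, not one prescribed by the lecture.

```python
def segmentations(word, morphs):
    """Return every way to split word into a sequence of known morphs."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in morphs:
            for rest in segmentations(word[end:], morphs):
                results.append([prefix] + rest)
    return results


# Hypothetical surface-morph inventory for this one example.
MORPHS = {"un", "ion", "union", "iz", "able"}

all_parses = segmentations("unionizable", MORPHS)  # run through all paths
first_parse = all_parses[0]                        # accept first success
biased_parse = min(all_parses, key=len)            # bias: fewest morphemes

print(all_parses)
print(first_parse)
print(biased_parse)
```

With this search order the first parse found is the unintended un+ion reading, which shows why accepting the first successful structure can be risky.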

  41. Computational Morphology • The Rules and the Lexicon • General versus Specific • Regular versus Irregular • Accuracy, speed, space • The Morphology of a language • Approaches • Lexicon only • Rules only • Lexicon and Rules • Finite-state Automata • Finite-state Transducers

  42. Lexicon-Free Morphology: Porter Stemmer • Lexicon-Free FST Approach • By Martin Porter (1980) http://www.tartarus.org/%7Emartin/PorterStemmer/ • Cascade of substitutions given specific conditions • GENERALIZATIONS -> GENERALIZATION -> GENERALIZE -> GENERAL -> GENER • Porter Stemmer Game

  43. Porter Stemmer Definitions • C = consonant = not (A, E, I, O, U, or Y preceded by C) • V = vowel = not C • M = Measure: Word = C*(V+C+){M}V* • M=0 TR, EE, TREE, Y, BY • M=1 TROUBLE, OATS, TREES, IVY • M=2 TROUBLES, PRIVATE, OATEN, ORRERY • Conditions • *S - stem ends with S • *v* - stem contains a V • *d - stem ends with double C • -DD, -ZZ • *o - stem ends CVC, where the second C is not W, X or Y • -WIL, -SOB
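The measure M counts the vowel-run-then-consonant-run sequences in a word. A small sketch that follows the definitions above, with function names chosen for this example:

```python
def cv_pattern(word: str) -> str:
    """Map each letter to C or V; Y counts as a vowel only after a consonant."""
    pattern = ""
    for ch in word.lower():
        if ch in "aeiou":
            pattern += "V"
        elif ch == "y" and pattern.endswith("C"):
            pattern += "V"
        else:
            pattern += "C"
    return pattern


def measure(word: str) -> int:
    """Count the places where a vowel is immediately followed by a consonant."""
    return cv_pattern(word).count("VC")


for w in ["TR", "EE", "TREE", "Y", "BY",
          "TROUBLE", "OATS", "TREES", "IVY",
          "TROUBLES", "PRIVATE", "OATEN", "ORRERY"]:
    print(w, cv_pattern(w), measure(w))
```

Counting V-to-C transitions gives the same number as matching the pattern C*(V+C+){M}V*, because each V+C+ block contains exactly one such transition.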

  44. Porter Stemmer • Condition key: *<S> = stem ends with <S>; *v* = stem contains a V; *d = stem ends with double C; *o = stem ends with CVC, where the second C is not W, X or Y
     Step 1: Plural Nouns and Third Person Singular Verbs
       SSES -> SS    caresses -> caress
       IES -> I      ponies -> poni, ties -> ti
       SS -> SS      caress -> caress
       S -> ε        cats -> cat
     Step 2a: Verbal Past Tense and Progressive Forms
       (M>0) EED -> EE       feed -> feed, agreed -> agree
       i. (*v*) ED -> ε      plastered -> plaster, bled -> bled
       ii. (*v*) ING -> ε    motoring -> motor, sing -> sing
     Step 2b: If 2a.i or 2a.ii is successful, Cleanup
       AT -> ATE    conflat(ed) -> conflate
       BL -> BLE    troubl(ed) -> trouble
       IZ -> IZE    siz(ed) -> size
       (*d and not (*L or *S or *Z)) -> single letter    hopp(ing) -> hop, tann(ed) -> tan, hiss(ing) -> hiss, fizz(ed) -> fizz
       (M=1 and *o) -> E    fail(ing) -> fail, fil(ing) -> file
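Step 1 has no conditions, so it can be sketched as an ordered list of suffix rewrites in which the first matching suffix wins. This is a minimal illustration of that one step, not a full stemmer; the names are chosen for the example.

```python
# (suffix, replacement) pairs in the order given on the slide.
STEP1_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step1(word: str) -> str:
    """Apply the first Step 1 rule whose suffix matches, then stop."""
    for suffix, replacement in STEP1_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step1(w))
```

The SS -> SS rule looks like a no-op, but it matters: it matches before the bare S rule and so protects words like "caress" from losing their final s.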

  45. Porter Stemmer • Step 3: Y -> I
       (*v*) Y -> I    happy -> happi, sky -> sky

  46. Porter Stemmer • Step 4: Derivational Morphology I: Multiple Suffixes
       (m>0) ATIONAL -> ATE    relational -> relate
       (m>0) TIONAL -> TION    conditional -> condition, rational -> rational
       (m>0) ENCI -> ENCE      valenci -> valence
       (m>0) ANCI -> ANCE      hesitanci -> hesitance
       (m>0) IZER -> IZE       digitizer -> digitize
       (m>0) ABLI -> ABLE      conformabli -> conformable
       (m>0) ALLI -> AL        radicalli -> radical
       (m>0) ENTLI -> ENT      differentli -> different
       (m>0) ELI -> E          vileli -> vile
       (m>0) OUSLI -> OUS      analogousli -> analogous
       (m>0) IZATION -> IZE    vietnamization -> vietnamize
       (m>0) ATION -> ATE      predication -> predicate
       (m>0) ATOR -> ATE       operator -> operate
       (m>0) ALISM -> AL       feudalism -> feudal
       (m>0) IVENESS -> IVE    decisiveness -> decisive
       (m>0) FULNESS -> FUL    hopefulness -> hopeful
       (m>0) OUSNESS -> OUS    callousness -> callous
       (m>0) ALITI -> AL       formaliti -> formal
       (m>0) IVITI -> IVE      sensitiviti -> sensitive
       (m>0) BILITI -> BLE     sensibiliti -> sensible
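The Step 4 rules differ from Step 1 in that each rewrite is guarded by a condition on the measure of the remaining stem. The sketch below covers only a subset of the rules and includes a compact measure function so that it runs on its own; the names and the rule subset are choices made for this example.

```python
def measure(stem: str) -> int:
    """Porter's measure: number of vowel-to-consonant transitions."""
    pattern = ""
    for ch in stem.lower():
        is_vowel = ch in "aeiou" or (ch == "y" and pattern.endswith("C"))
        pattern += "V" if is_vowel else "C"
    return pattern.count("VC")


# Subset of the Step 4 rules; longer overlapping suffixes come first.
STEP4_RULES = [
    ("ational", "ate"),
    ("tional", "tion"),
    ("ization", "ize"),
    ("ation", "ate"),
    ("izer", "ize"),
    ("fulness", "ful"),
    ("biliti", "ble"),
]


def step4(word: str) -> str:
    """Rewrite the first matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in STEP4_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Only the first matching suffix is considered, pass or fail.
            return stem + replacement if measure(stem) > 0 else word
    return word


for w in ["relational", "conditional", "rational", "vietnamization"]:
    print(w, "->", step4(w))
```

The measure condition is what keeps "rational" intact: stripping ATIONAL would leave the stem "r", whose measure is 0.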

  47. Porter Stemmer • Step 5: Derivational Morphology II: More Multiple Suffixes
       (m>0) ICATE -> IC    triplicate -> triplic
       (m>0) ATIVE -> ε     formative -> form
       (m>0) ALIZE -> AL    formalize -> formal
       (m>0) ICITI -> IC    electriciti -> electric
       (m>0) ICAL -> IC     electrical -> electric
       (m>0) FUL -> ε       hopeful -> hope
       (m>0) NESS -> ε      goodness -> good

  48. Porter Stemmer • Step 6: Derivational Morphology III: Single Suffixes
       (m>1) AL -> ε       revival -> reviv
       (m>1) ANCE -> ε     allowance -> allow
       (m>1) ENCE -> ε     inference -> infer
       (m>1) ER -> ε       airliner -> airlin
       (m>1) IC -> ε       gyroscopic -> gyroscop
       (m>1) ABLE -> ε     adjustable -> adjust
       (m>1) IBLE -> ε     defensible -> defens
       (m>1) ANT -> ε      irritant -> irrit
       (m>1) EMENT -> ε    replacement -> replac
       (m>1) MENT -> ε     adjustment -> adjust
       (m>1) ENT -> ε      dependent -> depend
       (m>1 and (*S or *T)) ION -> ε    adoption -> adopt
       (m>1) OU -> ε       homologou -> homolog
       (m>1) ISM -> ε      communism -> commun
       (m>1) ATE -> ε      activate -> activ
       (m>1) ITI -> ε      angulariti -> angular
       (m>1) OUS -> ε      homologous -> homolog
       (m>1) IVE -> ε      effective -> effect
       (m>1) IZE -> ε      bowdlerize -> bowdler

  49. Porter Stemmer
     Step 7a: Cleanup
       (m>1) E -> ε                 probate -> probat, rate -> rate
       (m=1 and not *o) E -> ε      cease -> ceas
     Step 7b: More Cleanup
       (m>1 and *d and *L) -> single letter    controll -> control, roll -> roll

  50. Porter Stemmer • Errors of Omission • European / Europe • analysis / analyzes • matrices / matrix • noise / noisy • explain / explanation • Errors of Commission • organization / organ • doing / doe • generalization / generic • numerical / numerous • university / universe
