RePortS: A Simpler, Intuitive Approach to Morpheme Induction

Emily Pitler, Samarth Keshava (Yale University)


Presentation Transcript


  1. RePortS: A Simpler, Intuitive Approach to Morpheme Induction • Emily Pitler, Samarth Keshava • Yale University

  2. Goals • Segment English words into morphemes • Simple algorithm • Minimize assumptions and “magic numbers”

  3. Approach • Identify common morphemes in the language • “prefix” and “suffix” lists • Use these to segment the test words

  4. Intuition and Motivation • After removing a potential morpheme, the remaining word fragment is often still a word • Examples: • training = train+ing • chairman = chair+man • insufferable = insuffer+able • We do not use this test to segment words

  5. Intuition and Motivation • Use fluctuations in transitional probabilities (Harris 1955, Hafer and Weiss 1974) • Examples: • Expect Pr(t | repor) ≈ 1 • Expect Pr(s | report) < 1 • Because there are other words such as reported, reporting, report, etc.
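
The slides do not say how the transitional probabilities are estimated. One plausible reading, stated here as an assumption, is a ratio of counts, where $N(x)$ is a hypothetical symbol for the number of corpus words beginning with the letter string $x$:

```latex
\Pr(c \mid w) \;=\; \frac{N(wc)}{N(w)}
```

With made-up counts $N(\text{repor}) = 10$, $N(\text{report}) = 10$ and $N(\text{reports}) = 3$, this gives $\Pr(t \mid \text{repor}) = 1$ and $\Pr(s \mid \text{report}) = 0.3$, which is the drop at a morpheme boundary that the slide describes.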

  6. Four Steps • Preprocessing: build the lexicographic trees • Score word fragments to determine morphemes • Prune the morpheme lists • Segment words using the trees and morpheme lists

  7. Step 1: Build the trees • We build a “forward tree” and a “backward tree” • We use these trees to calculate transitional probabilities in O(1) time
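
A minimal sketch of what such trees could look like, assuming each node stores how many word tokens pass through it. The names `TrieNode`, `build_tree`, `find` and `transition_prob`, and the toy counts, are illustrative and not taken from the authors' Perl program. Once the node for a fragment has been reached, the probability of the next letter is a single dictionary lookup and a division, which is where the O(1) cost comes from.

```python
class TrieNode:
    def __init__(self):
        self.count = 0       # number of word tokens passing through this node
        self.children = {}   # next letter -> TrieNode


def build_tree(word_counts, reverse=False):
    """Build a letter tree from {word: frequency}.

    reverse=True reads each word right to left, giving the backward tree.
    """
    root = TrieNode()
    for word, freq in word_counts.items():
        letters = word[::-1] if reverse else word
        node = root
        node.count += freq
        for ch in letters:
            node = node.children.setdefault(ch, TrieNode())
            node.count += freq
    return root


def find(root, fragment):
    """Walk down from the root; return the node for `fragment`, or None."""
    node = root
    for ch in fragment:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def transition_prob(node, ch):
    """Pr(ch | fragment), given the node already reached for that fragment."""
    child = node.children.get(ch)
    if child is None or node.count == 0:
        return 0.0
    return child.count / node.count


# Toy corpus with made-up frequencies (an assumption for illustration).
toy_counts = {"report": 5, "reports": 3, "reported": 1, "reporting": 1}
forward = build_tree(toy_counts)
backward = build_tree(toy_counts, reverse=True)

p_t = transition_prob(find(forward, "repor"), "t")    # every "repor" continues with "t"
p_s = transition_prob(find(forward, "report"), "s")   # only some "report" continue with "s"
```

Walking to a fragment's node still costs time proportional to the fragment's length; only the final probability lookup is constant time.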

  8. Hypothetical section of the forward tree

  9. Step 2: Scoring morphemes • Example: scoring “s” in “reports” • Check if “report” is a word in the corpus • Check if Pr(t | repor) ≈ 1 • Check if Pr(s | report) < 1 • If “s” passes all three tests, we add 19 to its suffix score; otherwise we subtract 1
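
The three tests can be sketched as follows for suffixes. This is an illustration under stated assumptions, not the authors' implementation: the slides do not give the cutoff for "≈ 1", so `high=0.95` is a hypothetical parameter; the sketch scores every split of every word type once; and it uses a flat table of prefix counts instead of the trees to stay short. Prefixes would be scored the same way on reversed words.

```python
from collections import Counter


def prefix_counts(word_counts):
    """Count, for every letter string, the word tokens that begin with it."""
    counts = Counter()
    for word, freq in word_counts.items():
        for i in range(len(word) + 1):
            counts[word[:i]] += freq
    return counts


def score_suffixes(word_counts, high=0.95, reward=19, penalty=1):
    """Score candidate suffixes with the +19/-1 scheme from the slides.

    `high` (the cutoff for "approximately 1") is an assumed value.
    """
    counts = prefix_counts(word_counts)
    scores = Counter()
    for word in word_counts:
        for cut in range(1, len(word)):
            stem, suffix = word[:cut], word[cut:]
            is_word = stem in word_counts                      # test 1: "report" is a word
            p_into_stem = counts[stem] / counts[stem[:-1]]     # test 2: Pr(t | repor)
            p_out_of_stem = counts[stem + suffix[0]] / counts[stem]  # test 3: Pr(s | report)
            if is_word and p_into_stem >= high and p_out_of_stem < 1.0:
                scores[suffix] += reward
            else:
                scores[suffix] -= penalty
    return scores


# Toy corpus with made-up frequencies (an assumption for illustration).
toy_counts = {
    "report": 5, "reports": 3, "reported": 1, "reporting": 1,
    "train": 4, "trains": 2, "training": 2,
}
scores = score_suffixes(toy_counts)
```

On this toy corpus "s" and "ing" each pass twice (after "report" and after "train"), while a fragment such as "g" fails both times because "reportin" and "trainin" are not words.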

  10. Step 2: Scoring morphemes • We declare fragments to be morphemes if they have positive scores • +19/-1 scheme • Chosen so that a fragment's score is positive iff it passes more than 5% of its tests • More frequent morphemes have higher scores • Any positive multiple of these numbers would produce the same results
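
The 5% threshold follows from a short calculation. Writing $p$ for the number of tests a fragment passes out of $n$ tests in total (symbols introduced here for illustration):

```latex
\text{score} = 19p - (n - p) = 20p - n,
\qquad
\text{score} > 0 \iff \frac{p}{n} > \frac{1}{20} = 5\%.
```

Scaling both numbers by any $k > 0$ gives $k(20p - n)$, which has the same sign and ranks fragments in the same order, so for example +38/-2 would select and order the same morphemes.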

  11. Step 3: Pruning • Don’t want “er”, “s” and “ers” all in the morpheme list • Remove any morpheme composed of two other morphemes with higher scores
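
One way to read the pruning rule, sketched under assumptions: a morpheme is dropped when it can be split into two pieces that are both on the list and both score higher than it does. The slides do not say whether the check runs against the list before or after pruning; this sketch uses the list before pruning. The function name `prune` and the scores below are made up.

```python
def prune(scores):
    """Keep positive-scoring morphemes, dropping any that splits into
    two other candidates that both have higher scores."""
    candidates = {m: s for m, s in scores.items() if s > 0}
    kept = {}
    for morph, score in candidates.items():
        composite = any(
            candidates.get(morph[:i], float("-inf")) > score
            and candidates.get(morph[i:], float("-inf")) > score
            for i in range(1, len(morph))
        )
        if not composite:
            kept[morph] = score
    return kept


# Hypothetical scores, chosen only to exercise the rule.
example_scores = {"s": 900, "er": 400, "ers": 120, "ing": 700, "x": -3}
kept = prune(example_scores)
```

Here "ers" is removed because "er" and "s" both outscore it, and "x" is removed because its score is not positive.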

  12. Top English Morphemes • Top 10 of the 808 morphemes in the “prefix” list: un, re, dis, non, over, mis, in, sub, pre, inter

  13. Top English Morphemes • Top 10 of the 987 morphemes in the “suffix” list: s, ly, ness, ing, ed, al, ism, less, ist, able

  14. Top English Morphemes • Prefixes and suffixes later in the list: well, water, servo, make, quick, ier, box, town, line, more

  15. Step 4: Segmenting Words • politeness = polite+ness or politenes+s? • Use transitional probabilities again • Expect Pr(n | polite) < Pr(s | politenes) • Peel off the morpheme with the smallest probability (unless all probabilities are 1)
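
A rough sketch of the peeling step for suffixes only, assuming the probability attached to a candidate suffix is that of its first letter given the remaining stem. The function name `peel_suffixes`, the suffix list and the toy counts are hypothetical, and the flat prefix-count table stands in for the forward tree. Prefixes would be peeled the same way using the backward tree.

```python
from collections import Counter


def prefix_counts(word_counts):
    """Count, for every letter string, the word tokens that begin with it."""
    counts = Counter()
    for word, freq in word_counts.items():
        for i in range(len(word) + 1):
            counts[word[:i]] += freq
    return counts


def peel_suffixes(word, suffixes, counts):
    """Repeatedly peel the suffix whose boundary has the smallest
    transitional probability, stopping when every candidate has probability 1."""
    pieces = []
    while True:
        best = None
        for suf in suffixes:
            stem = word[: len(word) - len(suf)]
            if not suf or not stem or not word.endswith(suf) or counts[stem] == 0:
                continue
            p = counts[stem + suf[0]] / counts[stem]   # e.g. Pr(n | polite)
            if p < 1.0 and (best is None or p < best[0]):
                best = (p, suf)
        if best is None:
            return [word] + pieces
        pieces.insert(0, best[1])
        word = word[: len(word) - len(best[1])]


# Toy corpus with made-up frequencies (an assumption for illustration).
toy_counts = {"polite": 6, "politely": 2, "politeness": 2}
counts = prefix_counts(toy_counts)
suffix_list = ["s", "ness", "ly"]

segmented = peel_suffixes("politeness", suffix_list, counts)
```

In this toy corpus every "politenes" continues with "s", so that split has probability 1 and is never peeled, while only a fraction of "polite" tokens continue with "n", so "ness" is peeled and the result is polite+ness.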

  16. Results • English results • On the provided 532-word Gold Standard • On the organizers’ test data

  17. Results • Breakdown • Contribution of the different intuitions

  18. Results • Finnish • Turkish

  19. Simple and Effective • Based on intuition, not a complex model • How we personally would segment words • Program was relatively short: 252 lines of Perl • Other variations had slightly better F-scores • Best mixture of performance and elegance

  20. Thank you for listening. Emily Pitler Samarth Keshava
