1 / 68

Natural Language Processing

Natural Language Processing. Finite State Morphology. Slides adapted from Jurafsky & Martin, Speech and Language Processing. Outline. Finite State Automata (English) Morphology Finite State Transducers. Finite State Automata. Let ’ s start with the sheep language from Chapter 2 /baa+!/.

selden
Télécharger la présentation

Natural Language Processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Natural Language Processing Finite State Morphology Slides adapted from Jurafsky & Martin, Speech and Language Processing

  2. Outline • Finite State Automata • (English) Morphology • Finite State Transducers

  3. Finite State Automata • Let’s start with the sheep language from Chapter 2 • /baa+!/

  4. Sheep FSA • We can say the following things about this machine • It has 5 states • b, a, and ! are in its alphabet • q0 is the start state • q4 is an accept state • It has 5 transitions

  5. More Formally • You can specify an FSA by enumerating the following things. • The set of states: Q • A finite alphabet: Σ • A start state • A set of accept/final states • A transition relation that maps QxΣ to Q

  6. Yet Another View If you’re in state 1 and you’re looking at an a, go to state 2 • The guts of FSAs can ultimately be represented as tables

  7. Recognition • Recognition is the process of determining if a string should be accepted by a machine • Or… it’s the process of determining if a string is in the language we’re defining with the machine • Or… it’s the process of determining if a regular expression matches a string • Those all amount the same thing in the end

  8. Recognition • Traditionally, (Turing’s notion) this process is depicted with a tape.

  9. Recognition • Simply a process of starting in the start state • Examining the current input • Consulting the table • Going to a new state and updating the tape pointer • Until you run out of tape

  10. D-Recognize

  11. Key Points • Deterministic means that at each point in processing there is always one unique thing to do (no choices). • D-recognize is a simple table-driven interpreter • The algorithm is universal for all unambiguous regular languages. • To change the machine, you simply change the table.

  12. Recognition as Search • You can view this algorithm as a trivial kind of state-space search. • States are pairings of tape positions and state numbers. • Operators are compiled into the table • Goal state is a pairing with the end of tape position and a final accept state • It is trivial because?

  13. Non-Determinism

  14. Non-Determinism • Yet another technique • Epsilon transitions • Key point: these transitions do not examine or advance the tape during recognition

  15. Equivalence • Non-deterministic machines can be converted to deterministic ones with a fairly simple construction • That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones in terms of the languages they can accept

  16. ND Recognition • Two basic approaches (used in all major implementations of regular expressions, see Friedl 2006) • Either take a ND machine and convert it to a D machine and then do recognition with that. • Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).

  17. Non-Deterministic Recognition: Search • In a ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine. • But not all paths directed through the machine for an accept string lead to an accept state. • No paths through the machine lead to an accept state for a string not in the language.

  18. Non-Deterministic Recognition • So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept. • Failure occurs when all of the possible paths for a given string lead to failure.

  19. Example

  20. Example

  21. Example

  22. Example

  23. Example

  24. Example

  25. Example

  26. Example

  27. Key Points • States in the search space are pairings of tape positions and states in the machine. • By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

  28. Why Bother? • Non-determinism doesn’t get us more formal power and it causes headaches so why bother? • More natural (understandable) solutions • Not always equivalent

  29. Compositional Machines • Formal languages are just sets of strings • Therefore, we can talk about various set operations (intersection, union, concatenation) • This turns out to be a useful exercise

  30. Union

  31. Concatenation

  32. Words • Finite-state methods are particularly useful in dealing with a lexicon • Many devices, most with limited memory, need access to large lists of words • And they need to perform fairly sophisticated tasks with those lists • So we’ll first talk about some facts about words and then come back to computational methods

  33. English Morphology • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes • We can usefully divide morphemes into two classes • Stems: The core meaning-bearing units • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

  34. English Morphology • We can further divide morphology up into two broad classes • Inflectional • Derivational

  35. Inflectional Morphology • Inflectional morphology concerns the combination of stems and affixes where the resulting word: • Has the same word class as the original • Serves a grammatical/semantic purpose that is • Different from the original • But is nevertheless transparently related to the original

  36. Nouns and Verbs in English • Nouns are simple • Markers for plural and possessive • Verbs are only slightly more complex • Markers appropriate to the tense of the verb

  37. Regulars and Irregulars • It is a little complicated by the fact that some words misbehave (refuse to follow the rules) • Mouse/mice, goose/geese, ox/oxen • Go/went, fly/flew • The terms regular and irregular are used to refer to words that follow the rules and those that don’t

  38. Regular and Irregular Verbs • Regulars… • Walk, walks, walking, walked, walked • Irregulars • Eat, eats, eating, ate, eaten • Catch, catches, catching, caught, caught • Cut, cuts, cutting, cut, cut

  39. Derivational Morphology • Derivational morphology is the messy stuff that no one ever taught you. • Quasi-systematicity • Irregular meaning change • Changes of word class

  40. Derivational Examples • Verbs and Adjectives to Nouns

  41. Derivational Examples • Nouns and Verbs to Adjectives

  42. Morphology and FSAs • We’d like to use the machinery provided by FSAs to capture these facts about morphology • Accept strings that are in the language • Reject strings that are not • And do so in a way that doesn’t require us to in effect list all the words in the language

  43. Start Simple • Regular singular nouns are ok • Regular plural nouns have an -s on the end • Irregulars are ok as is

  44. Simple Rules

  45. Now Plug in the Words

  46. Derivational Rules If everything is an accept state how do things ever get rejected?

  47. Parsing/Generation vs. Recognition • We can now run strings through these machines to recognize strings in the language • But recognition is usually not quite what we need • Often if we find some string in the language we might like to assign a structure to it (parsing) • Or we might have some structure and we want to produce a surface form for it (production/generation) • Example • From “cats” to “cat +N +PL”

  48. Finite State Transducers • The simple story • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL”

  49. FSTs

  50. Applications • The kind of parsing we’re talking about is normally called morphological analysis • It can either be • An important stand-alone component of many applications (spelling correction, information retrieval) • Or simply a link in a chain of further linguistic analysis

More Related