1 / 58

CS60057 Speech &Natural Language Processing

CS60057 Speech &Natural Language Processing. Autumn 2007. Lecture 2 26 July 2007. Why is NLP difficult?. Because Natural Language is highly ambiguous. Syntactic ambiguity The president spoke to the nation about the problem of drug use in the schools from one coast to the other.

alain
Télécharger la présentation

CS60057 Speech &Natural Language Processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS60057Speech &Natural Language Processing Autumn 2007 Lecture 2 26 July 2007 Natural Language Processing

  2. Why is NLP difficult? • Because Natural Language is highly ambiguous. • Syntactic ambiguity • The president spoke to the nation about the problem of drug use in the schools from one coast to the other. • has 720 parses. • Ex: • “to the other” can attach to any of the previous NPs (ex. “the problem”), or the head verb  6 places • “from one coast” has 5 places to attach • … Natural Language Processing

  3. Why is NLP difficult? • Word category ambiguity • book -->verb? or noun? • Word sense ambiguity • bank --> financial institution? building? or river side? • Words can mean more than their sum of parts • make up a story • Fictitious worlds • People on mars can fly. • Defining scope • People like ice-cream. • Does this mean that all (or some?) people like ice cream? • Language is changing and evolving • I’ll email you my answer. • This new S.U.V. has a compartment for your mobile phone. • Googling, … Natural Language Processing

  4. Dealing with Ambiguity • Four possible approaches: • Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels. • Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures. Natural Language Processing

  5. Resolve Ambiguities • We will introduce models and algorithms to resolve ambiguities at different levels. • part-of-speech tagging -- Deciding whether duck is verb or noun. • word-sense disambiguation -- Deciding whether make is create or cook. • lexical disambiguation -- Resolution of part-of-speech and word-sense ambiguities are two important kinds of lexical disambiguation. • syntactic ambiguity -- her duck is an example of syntactic ambiguity, and can be addressed by probabilistic parsing. Natural Language Processing

  6. Resolve Ambiguities (cont.) I made her duck S S NP VP NP VP I V NP NP I V NP made her duck made DET N her duck Natural Language Processing

  7. Dealing with Ambiguity • Three approaches: • Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels. • Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures. • Syntax proposes/semantics disposes approach • Probabilistic approaches based on making the most likely choices Natural Language Processing

  8. Models and Algorithms • By models I mean the formalisms that are used to capture the various kinds of linguistic knowledge we need. • Algorithms are then used to manipulate the knowledge representations needed to tackle the task at hand. Natural Language Processing

  9. Models to Represent Linguistic Knowledge • Different formalisms (models) are used to represent the required linguistic knowledge. • State Machines -- FSAs, HMMs, ATNs, RTNs • Formal Rule Systems -- Context Free Grammars, Unification Grammars, Probabilistic CFGs. • Logic-based Formalisms -- first order predicate logic, some higher order logic. • Models of Uncertainty -- Bayesian probability theory. Natural Language Processing

  10. Algorithms • Many of the algorithms that we’ll study will turn out to be transducers; algorithms that take one kind of structure as input and output another. • Unfortunately, ambiguity makes this process difficult. This leads us to employ algorithms that are designed to handle ambiguity of various kinds Natural Language Processing

  11. Algorithms • In particular.. • State-space search • To manage the problem of making choices during processing when we lack the information needed to make the right choice • Dynamic programming • To avoid having to redo work during the course of a state-space search • CKY, Earley, Minimum Edit Distance, Viterbi, Baum-Welch Natural Language Processing

  12. State Space Search • States represent pairings of partially processed inputs with partially constructed representations. • Goals are inputs paired with completed representations that satisfy some criteria. • As with most interesting problems the spaces are normally too large to exhaustively explore. • We need heuristics to guide the search • Criteria to trim the space Natural Language Processing

  13. Dynamic Programming • Don’t do the same work over and over. • Avoid this by building and making use of solutions to sub-problems that must be invariant across all parts of the space. Natural Language Processing

  14. Languages • Languages: 39,000 languages and dialects (22,000 dialects in India alone) • Top languages: • Chinese/Mandarin (885M), • Spanish (332M), • English (322M), • Bengali (189M), • Hindi (182M), • Portuguese (170M), Russian (170M), Japanese (125M) • Source: www.sil.org/ethnologue, www.nytimes.com • Internet: English (128M), Japanese (19.7M), German (14M), Spanish (9.4M), French (9.3M), Chinese (7.0M) • Usage: English (1999-54%, 2001-51%, 2003-46%, 2005-43%) • Source: www.computereconomics.com Natural Language Processing

  15. The Description of Language • Language = Words and Rules •  Dictionary (vocabulary) + Grammar • Dictionary • set of words defined in the language. open (dynamic) • Traditional - paper based • Electronic - machine readable dictionaries; can be obtained from paper-based • Grammar • set of rules which describe what is allowable in a language • Classic Grammars • meant for humans who know the language • definitions and rules are mainly supported by examples • no (or almost no) formal description tools; cannot be programmed • Explicit Grammar (CFG, Dependency Grammars, Link Grammars,...) • formal description • can be programmed & tested on data (texts)

  16. Levels of (Formal) Description • 6 basic levels (more or less explicitly present in most theories): and beyond (pragmatics/logic/...) meaning (semantics) (surface) syntax morphology phonology phonetics/orthography • Each level has an input and output representation • output from one level is the input to the next (upper) level • sometimes levels might be skipped (merged) or split

  17. Phonetics/Orthography • Input: • acoustic signal (phonetics) / text (orthography) • Output: • phonetic alphabet (phonetics) / text (orthography) • Deals with: • Phonetics: • consonant & vowel (& others) formation in the vocal tract • classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles • intonation • Orthography: normalization, punctuation, etc.

  18. Phonology • Input: • sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes] • Output: • sequence of phonemes (~ (lexical) letters; in an abstract alphabet) • Deals with: • relation between sounds and phonemes (units which might have some function on the upper level) • e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)

  19. Morphology • Input: • sequence of phonemes (~ (lexical) letters) • Output: • sequence of pairs (lemma, (morphological) tag) • Deals with: • composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding) • e.g. quotations ~ quote/V + -ation(der.V->N) + NNS.

  20. (Surface) Syntax • Input: • sequence of pairs (lemma, (morphological) tag) • Output: • sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms • Deals with: • the relation between lemmas & morphological categories and the sentence structure • uses syntactic categories such as Subject, Verb, Object,... • e.g.: I/PP1 see/VB a/DT dog/NN ~ • ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S

  21. Meaning (semantics) • Input: • sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions) • Output: • sentence structure (tree) with annotated nodes (semantic lemmas, (morpho-syntactic) tags, deep functions) • Deals with: • relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other cat’s • e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ • (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)

  22. ...and Beyond • Input: • sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions) • Output: • logical form, which can be evaluated (true/false) • Deals with: • assignment of objects from the real world to the nodes of the sentence structure • e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ • see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...])[Time:bef 99/9/27/14:15][Place:39ş19’40”N76ş37’10”W]

  23. Three Views • Three equivalent formal ways to look at what we’re up to (not including tables) Regular Expressions Finite State Automata Regular Languages Natural Language Processing

  24. Transition • Finite-state methods are particularly useful in dealing with a lexicon. • Lots of devices, some with limited memory, need access to big lists of words. • So we’ll switch to talking about some facts about words and then come back to computational methods Natural Language Processing

  25. MORPHOLOGY Natural Language Processing

  26. Morphology • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes (morph = shape, logos = word) • We can usefully divide morphemes into two classes • Stems: The core meaning bearing units • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions • Prefix: un-, anti-, etc • Suffix: -ity, -ation, etc • Infix: are inserted inside the stem • Tagalog: um + hingi humingi • Circumfixes – precede and follow the stem • English doesn’t stack more affixes. • But Turkish can have words with a lot of suffixes. • Languages, such as Turkish, tend to string affixes together are called agglutinative languages. Natural Language Processing

  27. Surface and Lexical Forms • The surface level of a word represents the actual spelling of that word. • geliyorum eats cats kitabım • The lexical level of a word represents a simple concatenation of morphemes making up that word. • gel +PROG +1SG • eat +AOR • cat +PLU • kitap +P1SG • Morphological processors try to find correspondences between lexical and surface forms of words. • Morphological recognition/ analysis – surface to lexical • Morphological generation/ synthesis – lexical to surface Natural Language Processing

  28. Morphology: Morphemes & Order • Handles what is an isolated form in written text • Grouping of phonemes into morphemes • sequence deliverables ~deliver, able and s(3 units) • Morpheme Combination • certain combinations/sequencing possible, other not: • deliver+able+s, but not able+derive+s; noun+s, but not noun+ing • typically fixed (in any given language)

  29. Inflectional & Derivational Morphology • We can also divide morphology up into two broad classes • Inflectional • Derivational • Inflectional morphology concerns the combination of stems and affixes where the resulting word • Has the same word class as the original • Serves a grammatical/semantic purpose different from the original After a combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change. • eat / eats pencil / pencils • After a combination with an derivational morpheme, the meaning and the class of the actual stem usually change. • compute / computer do / undo friend / friendly • Uygar / uygarlaş kapı /kapıcı • The irregular changes may happen with derivational affixes. Natural Language Processing

  30. Morphological Parsing • Morphological parsing is to find the lexical form of a word from its surface form. • cats -- cat +N +PLU • cat -- cat +N +SG • goose -- goose +N +SG or goose +V • geese -- goose +N +PLU • gooses -- goose +V +3SG • catch -- catch +V • caught -- catch +V +PAST or catch +V +PP • There can be more than one lexical level representation for a given word. (ambiguity) Natural Language Processing

  31. Morphological Analysis • Analyzing words into their linguistic components (morphemes). • Morphemes are the smallest meaningful units of language. cars car+PLU giving give+PROG AsachhilAma AsA+PROG+PAST+1st I/We was/were coming • Ambiguity: More than one alternatives flies flyVERB+PROG flyNOUN+PLU mAtAla kare Natural Language Processing

  32. Fly + s  flys  flies (y i rule) • Duckling Go-getter  get + er Doer  do + er Beer  ? What knowledge do we need? How do we represent it? How do we compute with it? Natural Language Processing

  33. Knowledge needed • Knowledge of stems or roots • Duck is a possible root, not duckl We need a dictionary (lexicon) • Only some endings go on some words • Do + er ok • Be + er – not ok • In addition, spelling change rules that adjust the surface form • Get + er – double the t getter • Fox + s – insert e – foxes • Fly + s – insert e – flys – y to i – flies • Chase + ed – drop e - chased Natural Language Processing

  34. Put all this in a big dictionary (lexicon) • Turkish – approx 600  106 forms • Finnish – 107 • Hindi, Bengali, Telugu, Tamil? • Besides, always novel forms can be constructed • Anti-missile • Anti-anti-missile • Anti-anti-anti-missile • …….. • Compounding of words – Sanskrit, German Natural Language Processing

  35. Morphology: From Morphemes to Lemmas & Categories • Lemma: lexical unit, “pointer” to lexicon • typically is represented as the “base form”, or “dictionary headword” • possibly indexed when ambiguous/polysemous: • state1 (verb), state2 (state-of-the-art), state3 (government) • from one or more morphemes (“root”, “stem”, “root+derivation”, ...) • Categories: non-lexical • small number of possible values (< 100, often < 5-10)

  36. Morphology Level: The Mapping • Formally: A+  2(L,C1,C2,...,Cn) • A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes) • L is the set of possible lemmas, uniquely identified • Ci are morphological categories, such as: • grammatical number, gender, case • person, tense, negation, degree of comparison, voice, aspect, ... • tone, politeness, ... • part of speech (not quite morphological category, but...) • A, L and Ci are obviously language-dependent

  37. Morphological Analysis (cont.) • Relatively simple for English. • But for many Indian languages, it may be more difficult. Examples Inflectional and Derivational Morphology. • Common tools: Finite-state transducers Natural Language Processing

  38. Simple Rules Natural Language Processing

  39. Adding in the Words Natural Language Processing

  40. Derivational Rules Natural Language Processing

  41. Parsing/Generation vs. Recognition • Recognition is usually not quite what we need. • Usually if we find some string in the language we need to find the structure in it (parsing) • Or we have some structure and we want to produce a surface form (production/generation) • Example • From “cats” to “cat +N +PL”and back Natural Language Processing

  42. Finite State Transducers • The simple story • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around. Natural Language Processing

  43. FSTs Natural Language Processing

  44. +N:ε +PL:s c:c a:a t:t Transitions • c:c means read a c on one tape and write a c on the other • +N:ε means read a +N symbol on one tape and write nothing on the other • +PL:s means read +PL and write an s Natural Language Processing

  45. Typical Uses • Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA). • And we’ll write to the second tape using the other symbols on the transitions. Natural Language Processing

  46. Ambiguity • Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. • Didn’t matter which path was actually traversed • In FSTs the path to an accept state does matter since differ paths represent different parses and different outputs will result Natural Language Processing

  47. Ambiguity • What’s the right parse for • Unionizable • Union-ize-able • Un-ion-ize-able • Each represents a valid path through the derivational morphology machine. Natural Language Processing

  48. Ambiguity • There are a number of ways to deal with this problem • Simply take the first output found • Find all the possible outputs (all paths) and return them all (without choosing) • Bias the search so that only one or a few likely paths are explored Natural Language Processing

  49. The Gory Details • Of course, its not as easy as • “cat +N +PL” <-> “cats” • As we saw earlier there are geese, mice and oxen • But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes • Cats vs Dogs • Fox and Foxes Natural Language Processing

  50. Multi-Tape Machines • To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next • So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols Natural Language Processing

More Related