
Introduction to Language Modeling


Presentation Transcript


  1. Introduction to Language Modeling Alex Acero

  2. Acknowledgments • Joshua Goodman, Scott MacKenzie for many slides

  3. Outline • Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework

  4. Outline • Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework

  5. Probability – definition [Venn diagram: all babies ⊃ baby boys ⊃ babies named John] • P(X) means the probability that X is true • P(baby is a boy) ≈ 0.5 (% of total that are boys) • P(baby is named John) ≈ 0.001 (% of total named John)

  6. Joint probabilities [Venn diagram: babies, baby boys, John, brown eyes] • P(X, Y) means the probability that X and Y are both true, e.g. P(brown eyes, boy)

  7. Conditional probabilities [Venn diagram: babies, baby boys, John] • P(X|Y) means the probability that X is true when we already know Y is true • P(baby is named John | baby is a boy) ≈ 0.002 • P(baby is a boy | baby is named John) ≈ 1

  8. Bayes rule [Venn diagram: babies, baby boys, John] • From the definition of conditional probability: P(X|Y) = P(X, Y) / P(Y) • P(baby is named John | baby is a boy) = P(baby is named John, baby is a boy) / P(baby is a boy) = 0.001 / 0.5 = 0.002
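As a quick sanity check of this arithmetic (not part of the original deck), a minimal Python sketch with made-up counts consistent with the figures above:

```python
# Minimal sketch; the counts are illustrative, matching the slide's
# rough figures of P(boy) ~ 0.5 and P(named John) ~ 0.001.
babies = 1_000_000
boys = 500_000           # P(boy) ~ 0.5
boys_named_john = 1_000  # P(named John) ~ 0.001 (assume all Johns are boys)

p_boy = boys / babies
p_john_and_boy = boys_named_john / babies

# P(John | boy) = P(John, boy) / P(boy)
print(p_john_and_boy / p_boy)  # 0.002
```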

  9. Outline • Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework

  10. 90% Removed k r ce r oz y t a e t h c o , d o a a o b a a e g v t a l a ss m n t i s n f d w i t l h e n s - w le e a r i w f e n t e h r w e , h v e r d l Wi

  11. 80% Removed D r s r s n h s d f w e e b e i e in th w i r o a t e t ar e t e k i t i c i ver e . as d on, ho ve n o n an o h t h sp r f a e of s a h o a u o e e n a n s - au h m r s s o h l e ld a e s n in i f li w h mas u i e i m g in i i o i t e o t w t a g z a N n d

  12. 70% Removed D uc ore frown d on i h r id n ter a h t ee a ent o e whi e in f fro y seeme t l a s ac c s e ad A l ei d h h lf wa s a s, th mo t, a ha th p o v at sa s e a a i i g t aught m r t h y ne s a g t at wa m es e e o i laug s e fr s p k f h gr nes i f t e fu an i c u l s m of r i u l f a he ef rt o il , v , ro ar t land il

  13. 60% Removed k ruc res f w n th r id t e fr z n t r y. T r d stri d b a rece t n t w i cove n o r s a d t e e e to e n h ot er, l o no s e ad ng l g . A st n eig d v e l d. h n t e w a d s ti , i es , wi mo e t, so a d co t a pi it i as ev hat dn ss he a n it a ht r b t f l u e te bl t n ess a u e t r e s the sm e t p i ugh o d a he fr d p ta in h g ne s llib l It as e a ful d mm c le w e nity lau h g a t e l t o and ff t . I a he W d s ge fr z a e N th a W ld

  14. 50% Removed D k r ce fore t o on i h r si the fr en wa erw tr s ha ee i ed y a ec nt wi d f the r whit erin o f an e e med t lean t war ch o er, ck a d mi ou in g i t. t le r ig d r he l d la d t el as d so i if , w t o m v ent, s lone d d hat h spi t f t was t ven th sa e s h as a h nt n i f aug t , t f l u t mo e ter ble than an a ss - lau ter hat as m h as t s ile f he ph n a ghte ol as he os a d r a ing e grimn ss o n a il y. a h m s er u in om a i d e it laug n a t fu i y f l e he ff t of if . It he W l , t e s ag , f e - ed N r n ld

  15. 40% Removed Dark s r c res ow e n eit er side the froze w erwa . T trees h d b n tr p d b c t w n o heir white c v ring of f s and h y s med e n towa ds e c o her la k and i o s, i e f l gh . A as ilence reig ed o he and. The lan e f a a de l tion, life e s, i mo ment, so n nd cold h the pi t o it s n t ev n hat o sa nes . h e wa hin n it of laughter, bu a lau h r e t rib e a dn - laught r a was mi hle s as the i e of t e s a la hte ol as he ro t a d ar ak n f th g mne of inf llib i . It wa he mas erful n incom un b e w s om of t rnity l ugh n at e lity f fe nd e effort f ife. I as t e , savage, fro n- h a r hland Wi d.

  16. 30% Removed Da s ruce fores r on ith r i e the froz n ater ay The t s ha een st ippe b re e wi d f heir white c vering of ost, an they e d o ean towards eac othe , lac a d om nou , n t e f in ight. A vast il nc re ned o e he la d. The l d s l wa d s a ion, if s, without m men o e and old that t e pi t of it was not e en tha o sadnes T ere s a hint in t of laug r, ut o a laugh e o er ble t n a y sadn s a la ghte at as irthless s the mil o h phi x, a ghter cold as the fro t a d par ak ng f the grim es of n al b l ty was th m s r ul and co mu i able wi m of ter la ghi t t e futil ty o if an the ef rt of li e. It as the Wild, h avage, froze - hearte Northland Wil .

  17. 20% Removed Dark spruce forest frowned n either side the roze ate wa . The trees had ee stripp d by ecent wi d f thei white coverin of frost d they eemed to lean towa ds e ch o h r, bl ck and mino s, in the ad ng l ght. A vas sil n e reigned over the land. The land i sel was e olatio , ifeless, wit out movem n , s lon n cold ha e spi it of s n t eve hat adn s. The e was hi i it of a ght , but f a laughter ore ible t an any s ne s - a laughter t a was mi hless as he mile of he sphinx, a aug ter col as the f ost nd arta in of th grimn ss of i fall bility It as the asterful and inc mmunicabl wisdo of e ernity ugh ng at the futilit of li and t e for of ife. It w the Wild, the avage, rozen- hea t d N rthla d W ld.

  18. 10% Removed Dark s ru e forest frowned on either side the frozen waterw y. The trees had bee stripped by a recent ind of t eir w ite covering of rost, and they seemed o lean towards each ot er, black and ominous, in the fading li h . A vast silence reigned ver the land The land itself w s a deso ation, lifel ss, without ovement, s l n and cold hat the spiri of it wa not even that of sa ness. here was a hint in it of laughte , but of a l ug ter more terrible than any sadness - a ughte that as mirthless s he smile of the sphinx a laug ter cold as the rost a d p rt k ng of the grimness of infal i ility. It as the masterful and incommunica le wisdom of eternity laughing at th futility of life an the effort f life. It was e Wild, the sa ag , froz n- earte Northland Wild.

  19. 0% Removed Dark spruce forest frowned on either side the frozen waterway. The trees had been stripped by a recent wind of their white covering of frost, and they seemed to lean towards each other, black and ominous, in the fading light. A vast silence reigned over the land. The land itself was a desolation, lifeless, without movement, so lone and cold that the spirit of it was not even that of sadness. There was a hint in it of laughter, but of a laughter more terrible than any sadness - a laughter that was mirthless as the smile of the sphinx, a laughter cold as the frost and partaking of the grimness of infallibility. It was the masterful and incommunicable wisdom of eternity laughing at the futility of life and the effort of life. It was the Wild, the savage, frozen- hearted Northland Wild. From Jack London’s “White Fang”

  20. Language as Information • Information is commonly measured in “bits” • Since language is highly redundant, perhaps it can be viewed somewhat like information • Can language be measured or coded in “bits”? • Sure. Examples include… • ASCII (7 bits per “symbol”) • Unicode (16 bits per symbol) • But coding schemes, such as ASCII or Unicode, do not account for the redundancy in the language • Two questions: • Can a coding system be developed for a language (e.g., English) that accounts for the redundancy in the language? • If so, how many bits per symbol are required?

  21. How many bits? • ASCII codes have seven bits • 2^7 = 128 codes • Codes include… • 33 control codes • 95 symbols, including 26 uppercase letters, 26 lowercase letters, space, 42 “other” symbols • In general, if we have n symbols, the number of bits to encode them is log₂ n (note: log₂ 128 = 7) • What about bare bones English – 26 letters plus space? • How many bits?

  22. How many bits? (2) • It takes log₂ 27 ≈ 4.75 bits/character to encode bare bones English • But, what about redundancy in English? • Since English is highly redundant, is there a way to encode letters in fewer bits? • Yes • How many bits? • The answer (drum roll please)…

  23. How many bits? (3) • The minimum number of bits to encode English is (approximately)… • 1 bit/character • How is this possible? • E.g., Huffman coding • n-grams • More importantly, how is this answer computed? • Want to learn how? Read… Shannon, C. E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30, 51-64.
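To make “bits per character” concrete, here is a minimal sketch (not from the slides) that estimates the zeroth-order entropy of a text from its character frequencies alone. Ignoring context, this is only an upper bound; the context-based experiments Shannon describes push English toward ~1 bit/character:

```python
import math
from collections import Counter

def bits_per_char(text: str) -> float:
    """Zeroth-order entropy estimate: -sum p(c) * log2 p(c) over characters."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

sample = "dark spruce forest frowned on either side the frozen waterway"
print(bits_per_char(sample))
# ~4 bits/char: already below the uniform log2(27) = 4.75 baseline,
# and context models reduce it much further
```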

  24. Disambiguation • A special case of prediction is disambiguation • Consider the telephone keypad… • Is it possible to enter text using this keypad? • Yes. But the keys are ambiguous.

  25. Ambiguity Continuum • 53 keys → 27 keys → 8 keys → 1 key • Less ambiguity → more ambiguity

  26. Coping With Ambiguity • There are two approaches to disambiguating the telephone keypad • Explicit • Use additional keys or keystrokes to select the desired letter • E.g., multitap • Implicit • Add “intelligence” (i.e., a language model) to the interface to guess the intended letter • E.g., T9, Letterwise

  27. Multitap • Press a key once for the 1st letter, twice for the 2nd letter, and so on • Example... 84433.778844422255.22777666966.33366699. → th e q u i c k b r o wn f o x • 58867N7777.66688833777.84433.55529999N999.36664. → ju mp s o v e r th e l az y do g • But, there is a problem. When consecutive letters are on the same key, additional disambiguation is needed. Two techniques: (i) timeout, (ii) a special “next letter” (N) key¹ to explicitly segment the letters (see “ps” in “jumps” and “zy” in “lazy” above). ¹ Nokia phones: timeout = 1.5 seconds, “next letter” = down-arrow
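A minimal sketch (not from the slides) of the multitap encoding just described, using the standard phone keypad layout and an explicit “N” next-letter key:

```python
# Standard phone keypad: digit -> letters on that key
KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
KEY_OF = {ch: d for d, letters in KEYPAD.items() for ch in letters}

def multitap(word: str) -> str:
    """Encode a word as multitap key presses, inserting 'N' (the explicit
    next-letter key) when consecutive letters share a key."""
    out = []
    prev_key = None
    for ch in word.lower():
        key = KEY_OF[ch]
        if key == prev_key:
            out.append('N')  # segment same-key letters
        out.append(key * (KEYPAD[key].index(ch) + 1))
        prev_key = key
    return ''.join(out)

print(multitap("jumps"))  # 58867N7777 -- the N between 'p' and 's'
print(multitap("lazy"))   # 55529999N999 -- the N between 'z' and 'y'
```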

  28. T9 • Product of Tegic Communications (www.tegic.com), a subsidiary of Nuance Communications • Licensed to many mobile phone companies • The idea is simple: • one key = one character • A language model works “behind the scenes” to disambiguate • Example (next slide)...

  29. Guess the Word • C O M P U T E R • Number of word stems to consider: 3 × 3 × 3 × 4 × 3 × 3 × 3 × 4 = 11,664

  30. “Quick Brown Fox” Using T9 • 843.78425.27696.369.58677.6837.843.5299.364. → the quick brown fox jumps over the lazy dog • But, there is a problem. The key sequences are ambiguous and other words may exist for some sequences. In decreasing probability: 843 → the, tie, vie; 78425 → quick, stick; 27696 → brown, crown; 58677 → jumps, lumps, muds; 5299 → jazz, lazy; 364 → dog, fog • So the first guess would read “the quick brown fox jumps over the jazz dog”
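A minimal sketch (not from the slides) of T9-style implicit disambiguation: index a vocabulary by digit sequence, then rank candidates with a unigram language model. The vocabulary and counts here are made up purely to reproduce the slide's ranking:

```python
from collections import defaultdict

KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
KEY_OF = {ch: d for d, letters in KEYPAD.items() for ch in letters}

def digits(word: str) -> str:
    """Map a word to its one-key-per-letter digit sequence."""
    return ''.join(KEY_OF[c] for c in word.lower())

# Hypothetical unigram counts standing in for the language model
vocab = {'the': 1000, 'tie': 40, 'vie': 2,
         'jazz': 50, 'lazy': 30, 'dog': 80, 'fog': 20}

index = defaultdict(list)
for w, count in vocab.items():
    index[digits(w)].append((count, w))

def t9(seq: str):
    """Candidate words for a key sequence, most probable first."""
    return [w for _, w in sorted(index[seq], reverse=True)]

print(t9('843'))   # ['the', 'tie', 'vie']
print(t9('5299'))  # ['jazz', 'lazy'] -- the slide's "jazz dog" problem
```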

  31. Keystrokes Per Character (KSPC) • Earlier examples used the “quick brown fox” phrase, which has 44 characters (including one character after each word) • Multitap and T9 require very different keystroke sequences • Compare…

  32. Formulas [slide shows the KSPC formulas as an image not captured in the transcript]
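The formula slide itself did not survive the transcript, but KSPC is simply total keystrokes divided by total characters produced. A minimal sketch under that assumption, reusing the slide-27 multitap sequences (the multitap() helper above would reproduce them):

```python
PHRASE = "the quick brown fox jumps over the lazy dog"

# Multitap key sequence from slide 27 (dots separate words, not presses)
MULTITAP_SEQ = ("84433.778844422255.22777666966.33366699."
                "58867N7777.66688833777.84433.55529999N999.36664.")

# One extra press per word for the space (44 characters total,
# counting one character after each word, as on slide 31)
spaces = PHRASE.count(' ') + 1
multitap_keys = len(MULTITAP_SEQ.replace('.', '')) + spaces
t9_keys = sum(len(w) + 1 for w in PHRASE.split())  # one press/letter + space

chars = len(PHRASE) + 1  # 44
print(f"Multitap KSPC ~ {multitap_keys / chars:.2f}")  # ~2.0
print(f"T9 KSPC ~ {t9_keys / chars:.2f}")              # ~1.0
```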

  33. Outline • Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework

  34. Speech Recognition [block diagram: Input Speech → Feature Extraction → Pattern Classification (Decoding, Search) → “Hello World” with Confidence Scoring (0.9) (0.8); the decoder draws on an Acoustic Model, Word Lexicon, and Language Model; the ASR output feeds SLU → DM → SLG → TTS]

  35. Language Modeling in ASR • Some sequences of words sound alike, but not all of them are good English sentences • I went to a party / Eye went two a bar tea • Rudolph the red nose reindeer / Rudolph the Red knows rain, dear / Rudolph the Red Nose reigned here

  36. Language Modeling in ASR • This lets the recognizer make the right guess when two different sentences sound the same. For example: • It’s fun to recognize speech? • It’s fun to wreck a nice beach?

  37. Humans have a Language Model • The ultimate goal is for a speech recognizer to perform as well as a human being. A lot of research has been done in psychology: • The *eel was on the shoe • The *eel was on the car • People are capable of adjusting to the right context • removes ambiguities • limits possible words • Very good language models already exist for dedicated applications (e.g. medical, where there is a lot of standardization)

  38. A bad language model

  39. A bad language model

  40. A bad language model Herman is reprinted with permission from LaughingStock Licensing Inc., Ottawa Canada. All rights reserved.

  41. A bad language model

  42. What’s a Language Model? • A language model is a probability distribution over word sequences • P(“And nothing but the truth”) ≈ 0.001 • P(“And nuts sing on the roof”) ≈ 0

  43. What’s a language model for? • Speech recognition • Machine translation • Handwriting recognition • Spelling correction • Optical character recognition • Typing in Chinese or Japanese • (and anyone doing statistical modeling)

  44. How Language Models work • Hard to compute P(“And nothing but the truth”) • Step 1: Decompose the probability using the chain rule: P(“And nothing but the truth”) = P(“And”) x P(“nothing” | “And”) x P(“but” | “And nothing”) x P(“the” | “And nothing but”) x P(“truth” | “And nothing but the”) • Step 2: Approximate with trigrams, conditioning each word on only the two previous words: P(“And nothing but the truth”) ≈ P(“And”) x P(“nothing” | “And”) x P(“but” | “And nothing”) x P(“the” | “nothing but”) x P(“truth” | “but the”)

  45. Example: how do we find probabilities? • Get real text, and start counting! • P(“the” | “nothing but”) ≈ C(“nothing but the”) / C(“nothing but”) • Training set: • “John read her book” • “I read a different book” • “John read a book by Mulan”
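A minimal sketch (not from the slides) of this counting recipe on the three-sentence training set, with <s> and </s> sentence-boundary markers:

```python
from collections import Counter

corpus = ["John read her book",
          "I read a different book",
          "John read a book by Mulan"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(word, prev):
    """Maximum-likelihood bigram estimate: C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("read", "John"))  # 2/2 = 1.0
print(p("a", "read"))     # 2/3
print(p("book", "a"))     # 1/2
```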

  46. Example (continued) • These bigram probabilities help us estimate the probability for the sentence as: P(John read a book) = P(John|<s>) P(read|John) P(a|read) P(book|a) P(</s>|book) = 2/3 x 1 x 2/3 x 1/2 x 2/3 ≈ 0.148 • Then the cross-entropy: -(1/4) log₂ 0.148 ≈ 0.689 • So perplexity = 2^0.689 ≈ 1.61 • Comparison: Wall Street Journal text (5000 words) has a bigram perplexity of 128
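The slide's numbers can be reproduced with a short self-contained sketch (not from the original deck) that builds the bigram counts and computes perplexity:

```python
import math
from collections import Counter

corpus = ["John read her book",
          "I read a different book",
          "John read a book by Mulan"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def perplexity(sentence: str) -> float:
    """2 ** (per-word cross-entropy) under the MLE bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(math.log2(bigrams[(a, b)] / unigrams[a])
                   for a, b in zip(words, words[1:]))
    return 2 ** (-log_prob / len(sentence.split()))  # the slide's 1/4 factor

print(perplexity("John read a book"))  # ~1.61
```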

  47. N-gram example • To calculate this probability, we need to compute both the number of times “am” is preceded by “I”, and the number of times “here” is preceded by “I am”. • All four sound the same; the right decision can only be made by the language model.

  48. Outline • Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework

  49. Evaluation • How can you tell a good language model from a bad one? • Run a machine translation system, a speech recognizer (or your application of choice), and calculate the word error rate • Slow • Specific to your system

  50. Evaluation: Perplexity Intuition • Ask a speech recognizer to recognize digits: “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity 10 • Ask a speech recognizer to recognize 30,000 names at Microsoft – hard – perplexity 30,000 • Ask a speech recognizer to recognize “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), and 30,000 names (1 in 120,000 each) – perplexity 54 • Perplexity is the weighted equivalent branching factor
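A quick numerical check of the third example (a sketch, not from the slides), computing perplexity as 2 to the entropy of the distribution:

```python
import math

# "Operator", "Technical support", "sales": 1/4 each;
# 30,000 names at 1/120,000 each (3/4 + 30000/120000 = 1 total)
probs = [0.25] * 3 + [1 / 120_000] * 30_000

entropy = -sum(p * math.log2(p) for p in probs)
print(2 ** entropy)  # ~52.6, close to the slide's quoted 54
```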
