
Lecture 13


dougg

Presentation Transcript


  1. Lecture 13 Corpus Linguistics I CS 4705

  2. From Knowledge-Based to Corpus-Based Linguistics • A Paradigm Shift begins in the 1980s • Seeds planted in the 1950s (Harris, Firth) • Cut off by Chomsky • Renewal due to • Interest in practical applications (ASR, MT, …) • Availability at major industrial labs of powerful machines and large amounts of storage • Increasing availability of large online texts and speech data • Crossover efforts with ASR community, fostered by DARPA

  3. For many practical tasks, statistical methods perform better • Less knowledge required by researchers

  4. Next Word Prediction • An ostensibly artificial task: predicting the next word in a sequence. • From a NY Times story... • Stocks plunged this …. • Stocks plunged this morning, despite a cut in interest rates • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ... • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began

  5. Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last … • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.

  6. Human Word Prediction • Clearly, at least some of us have the ability to predict future words in an utterance. • How? • Domain knowledge • Syntactic knowledge • Lexical knowledge

  7. Claim • A useful part of the knowledge needed to allow Word Prediction (guessing the next word) can be captured using simple statistical techniques. • In particular, we'll rely on the notion of the probability of a sequence (e.g., sentence) and the likelihood of words co-occurring
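
The claim above can be illustrated with a minimal bigram sketch: count which words follow which, then turn the counts into estimated probabilities for the next word. The toy corpus and the function names `train_bigrams` and `predict_next` are assumptions made for illustration; they are not from the lecture.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
CORPUS = [
    "stocks plunged this morning",
    "stocks rose this morning",
    "stocks plunged this afternoon",
    "rates fell this morning",
]


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    follow = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            follow[prev][nxt] += 1
    return follow


def predict_next(follow, prev):
    """Return (word, estimated probability) pairs, most likely first."""
    counts = follow[prev]
    total = sum(counts.values())
    # An unseen context has no counts, so the result is an empty list.
    return [(word, count / total) for word, count in counts.most_common()]


model = train_bigrams(CORPUS)
predictions = predict_next(model, "this")
print(predictions)  # [('morning', 0.75), ('afternoon', 0.25)]
```

A model this small only conditions on one previous word; it says nothing about the domain or syntactic knowledge listed on the previous slide.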

  8. Why would we want to do this? • Why would anyone want to predict a word? • If you can predict the next word, you can rank the likelihood of sequences containing various alternative words, that is, alternative hypotheses • You can assess the likelihood/goodness of a hypothesis

  9. Many NLP problems can be modeled as mapping from one string of symbols to another. • In statistical language applications, knowledge of the source (e.g., a statistical model of word sequences) is referred to as a Language Model or a Grammar

  10. Why is this useful? • Example applications that employ language models: • Speech recognition • Handwriting recognition • Spelling correction • Machine translation systems • Optical character recognizers

  11. Real Word Spelling Errors • They are leaving in about fifteen minuets to go to her house. • The study was conducted mainly be John Black. • The design an construction of the system will take more than a year. • Hopefully, all with continue smoothly in my absence. • Can they lave him my messages? • I need to notified the bank of…. • He is trying to fine out.

  12. Handwriting Recognition • Assume a note is given to a bank teller, which the teller reads as I have a gub. (cf. Woody Allen) • NLP to the rescue …. • gub is not a word • gun, gum, Gus, and gull are words, but gun has a higher probability in the context of a bank

  13. For Spell Checkers • Collect a list of commonly substituted words • piece/peace, whether/weather, their/there ... • Whenever you encounter one of these words in a sentence, construct the alternative sentence as well • Assess the goodness of each and choose the one (word) with the more likely sentence • E.g. • On Tuesday, the whether • On Tuesday, the weather
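
The spell-checking procedure on this slide might be sketched as follows. The confusion sets are the pairs listed on the slide; the training sentences, the add-one smoothed bigram score, and all function names are assumptions chosen for illustration, not the lecture's method. Tokenization is plain whitespace splitting, so the example input has no punctuation.

```python
from collections import Counter

# Confusion sets from the slide.
CONFUSION_SETS = [{"piece", "peace"}, {"whether", "weather"}, {"their", "there"}]

# Toy training text (hypothetical).
TRAINING = [
    "on tuesday the weather was cold",
    "the weather on monday was warm",
    "i wonder whether the bank is open",
    "ask whether the weather will change",
]


def bigram_counts(sentences):
    """Collect unigram and bigram counts, with <s> marking sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sentence_score(words, unigrams, bigrams):
    """Product of add-one smoothed bigram estimates (a rough approximation)."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + words
    score = 1.0
    for prev, cur in zip(padded, padded[1:]):
        score *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return score


def correct(sentence, unigrams, bigrams):
    """For each confusable word, keep the alternative with the likelier sentence."""
    words = sentence.lower().split()
    for i, word in enumerate(words):
        for group in CONFUSION_SETS:
            if word in group:
                candidates = [words[:i] + [alt] + words[i + 1:] for alt in sorted(group)]
                words = max(candidates, key=lambda c: sentence_score(c, unigrams, bigrams))
    return " ".join(words)


UNIGRAMS, BIGRAMS = bigram_counts(TRAINING)
fixed = correct("on tuesday the whether", UNIGRAMS, BIGRAMS)
print(fixed)  # on tuesday the weather
```

With so little training text the scores are dominated by smoothing; the point is only that each alternative sentence is scored and the better one kept.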

  14. The Noisy Channel Model • A probabilistic model developed by Claude Shannon to model communication (as over a phone line) • I → Noisy Channel → O • $\hat{I} = \operatorname{argmax}_I \Pr(I \mid O) = \operatorname{argmax}_I \Pr(I)\,\Pr(O \mid I)$ • $\hat{I}$: the most likely input • $\Pr(I)$: the prior probability of I • $\Pr(I \mid O)$: the probability of I given O • $\Pr(O \mid I)$: the probability that O is the output if I is the input
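
The second form of the argmax follows from Bayes' rule: Pr(I|O) = Pr(O|I) Pr(I) / Pr(O), and Pr(O) is the same for every candidate input, so it can be dropped when comparing candidates. A minimal sketch of the decoding step, applied to the "gub" note from the handwriting slide, is given here. Every probability in it is invented for illustration, and the name `decode` is an assumption.

```python
# Hypothetical numbers, chosen only to illustrate the computation.
# PRIOR[i]   stands for Pr(I = i) in the context of a bank.
# CHANNEL[i] stands for Pr(O = "gub" | I = i), the chance of misreading i as "gub".
PRIOR = {"gun": 0.60, "gum": 0.25, "Gus": 0.05, "gull": 0.10}
CHANNEL = {"gun": 0.30, "gum": 0.30, "Gus": 0.30, "gull": 0.05}


def decode(prior, likelihood):
    """Return the input I that maximises Pr(I) * Pr(O | I)."""
    return max(prior, key=lambda i: prior[i] * likelihood.get(i, 0.0))


best = decode(PRIOR, CHANNEL)
print(best)  # gun
```

Here the channel model cannot separate gun, gum, and Gus, so the prior decides; with a different prior the same code would pick a different word.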

  15. Review: Basic Probability • Prior Probability (or unconditional probability) • P(A), where A is some event • Possible events: it raining, the next person you see being Scandinavian, a child getting the measles, the word ‘warlord’ occurring in the newspaper • Conditional Probability • P(A | B) • the probability of A, given that we know B • E.g. it raining, given that we know it’s October; the next person you see being Scandinavian, given that you’re in Sweden, the word ‘warlord’ occurring in a story about Afghanistan

  16. Example: F F F F F F I I I I • P(Finn) = .6 • P(skier) = .5 • P(skier|Finn) = .67 • P(Finn|skier) = .8
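
The four numbers on this slide are tied together by Bayes' rule, P(Finn|skier) = P(skier|Finn) P(Finn) / P(skier). The short check below assumes a population of 10 people with 6 Finns, 5 skiers, and 4 Finns who ski; these counts are inferred from the stated probabilities and are not given on the slide.

```python
from fractions import Fraction

# Assumed counts, inferred from the probabilities on the slide.
TOTAL = 10
FINNS = 6
SKIERS = 5
SKIING_FINNS = 4

p_finn = Fraction(FINNS, TOTAL)                     # 3/5 = .6
p_skier = Fraction(SKIERS, TOTAL)                   # 1/2 = .5
p_skier_given_finn = Fraction(SKIING_FINNS, FINNS)  # 2/3, about .67

# Bayes' rule.
p_finn_given_skier = p_skier_given_finn * p_finn / p_skier

print(float(p_finn_given_skier))  # 0.8
```

Counting directly gives the same answer: 4 of the 5 skiers are Finns.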

  17. Next class • Midterm • Next class: • Hindle & Rooth 1993 • Begin studying semantics, Ch. 14
