
A fast and simple neural probabilistic language model for natural language processing


Presentation Transcript


  1. A fast and simple neural probabilistic language model for natural language processing. Presenter: Yifei Guo. Supervisor: David Barber

  2. Statistical language model • Goal: model the joint distribution of words in a sentence • Task: predict the next word Wn from the preceding n-1 words, called the context • Example: the n-gram algorithm • Pro: simplicity – conditional probability tables for P(Wn | context) are estimated by smoothing word n-tuple counts • Con: curse of high dimensions – e.g. a 10-word sentence with a vocabulary of 10,000 words leads to on the order of 10,000^10 free parameters
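
The n-gram recipe on this slide can be made concrete with a toy example. The snippet below is a minimal sketch (not code from the presentation) of a trigram table with add-one smoothing; the toy corpus and function names are illustrative only.

```python
# Minimal trigram language model with add-one (Laplace) smoothing.
from collections import defaultdict

corpus = ("the cat is walking in the kitchen . "
          "the dog was running in the room .").split()
vocab = sorted(set(corpus))

# Count word triples (w1, w2, w3) and their (w1, w2) contexts.
tri_counts = defaultdict(int)
ctx_counts = defaultdict(int)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    tri_counts[(w1, w2, w3)] += 1
    ctx_counts[(w1, w2)] += 1

def p_next(w3, context):
    """P(w3 | w1, w2) with add-one smoothing over the vocabulary."""
    w1, w2 = context
    return (tri_counts[(w1, w2, w3)] + 1) / (ctx_counts[(w1, w2)] + len(vocab))

print(p_next("walking", ("cat", "is")))   # seen triple: relatively high
print(p_next("running", ("cat", "is")))   # unseen triple: smoothed, small but non-zero
```

Even this toy table shows the scaling problem the slide points to: the number of possible n-tuples grows exponentially with n, so most of them are never observed.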

  3. Neural probabilistic language model • Uses distributed representations of words to address the curse of high dimensions • (Figure: neural network architecture) • Example: generalize from "The cat is walking in the kitchen" to "The dog was running in the room", and likewise to "The cat was walking in the kitchen", "The cat was running in the kitchen", "The cat is running in the kitchen", "The cat is running in the room", "The dog was walking in the room", "The dog was walking in the kitchen", "The dog is running in the room", "The dog was running in the kitchen", …
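
The generalization on this slide relies on similar words getting nearby feature vectors. Below is a tiny hand-made sketch (not the presentation's model, vectors are invented) of that idea: because the toy vectors for cat/dog and kitchen/room are close, any smooth scoring function assigns them similar scores, so probability mass spreads to the unseen sentence variants.

```python
# Toy distributed representations: nearby vectors give nearby scores.
import numpy as np

emb = {
    "cat":     np.array([0.9, 0.1, 0.0]),
    "dog":     np.array([0.8, 0.2, 0.0]),   # close to "cat"
    "kitchen": np.array([0.0, 0.1, 0.9]),
    "room":    np.array([0.0, 0.2, 0.8]),   # close to "kitchen"
}

w = np.array([1.0, -0.5, 2.0])              # some learnt weight vector (toy values)

def score(word):
    """A smooth function of the word's feature vector."""
    return float(w @ emb[word])

print(score("cat"), score("dog"))           # similar scores
print(score("kitchen"), score("room"))      # similar scores
```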

  4. NPLM • Quantifies the compatibility between the next word and its context with a score function • Words are mapped to real-valued feature vectors learnt from the data • The distribution over the next word is defined by the score function
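
A minimal sketch of how a score function defines the next-word distribution via a softmax, P(w | h) ∝ exp(s(w, h)). The bilinear score used here (context projection dotted with each candidate word's vector) is one common choice in the spirit of log-bilinear models, and all parameter values are random placeholders rather than learnt ones.

```python
# Score function -> next-word distribution (toy parameters).
import numpy as np

V, D = 10, 4                       # vocabulary size, feature dimension
rng = np.random.default_rng(0)
R = rng.normal(size=(V, D))        # per-word feature vectors (learnt in practice)
C = rng.normal(size=(2, D, D))     # one combination matrix per context position
b = np.zeros(V)                    # per-word bias

def scores(context_ids):
    """s(w, h) for every candidate w: predicted context vector dotted with each word vector."""
    q_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    return R @ q_hat + b

def next_word_distribution(context_ids):
    s = scores(context_ids)
    s = s - s.max()                # numerical stability
    p = np.exp(s)
    return p / p.sum()

p = next_word_distribution([1, 3])
print(p.round(3), p.sum())         # a proper distribution over the 10-word vocabulary
```

The normalising sum in the softmax is exactly the term that makes maximum-likelihood training expensive, which is the subject of the next slide.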

  5. Maximum likelihood learning • ML training of the NPLM is tractable but expensive • Computing the gradient of the log-likelihood takes time linear in the vocabulary size: $\frac{\partial}{\partial\theta}\log P_\theta^h(w) = \frac{\partial}{\partial\theta} s_\theta(w,h) - \sum_{w'} P_\theta^h(w')\,\frac{\partial}{\partial\theta} s_\theta(w',h)$ • Importance sampling approximation (Bengio & Senécal, 2003) • Sample words from a proposal distribution and reweight the gradient • Stability issues: need either many samples or an adaptive proposal distribution
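
To see the cost concretely: the gradient with respect to the per-word bias is one-hot(w) minus P(·|h), so the expensive part is the model distribution itself, which requires a sum over the whole vocabulary. The sketch below (a toy illustration, not the paper's implementation) estimates that term instead from k samples of a proposal with self-normalised importance weights; the uniform proposal is an assumption made here for simplicity.

```python
# Exact next-word distribution vs. an importance-sampled estimate of it.
import numpy as np

rng = np.random.default_rng(1)
V = 10
s = rng.normal(size=V)                       # scores s(w', h) for one context h
p = np.exp(s - s.max()); p /= p.sum()        # exact P(. | h): needs a sum over all V words

def is_estimate_of_p(s, q, k):
    """Estimate E_{P(.|h)}[one-hot(w')] = P(.|h) from k samples of the proposal q."""
    samples = rng.choice(len(q), size=k, p=q)
    w = np.exp(s[samples]) / q[samples]      # importance weights exp(s(w',h)) / q(w')
    w /= w.sum()                             # self-normalise the weights
    est = np.zeros(len(q))
    np.add.at(est, samples, w)               # scatter-add the weights per sampled word
    return est

q = np.full(V, 1.0 / V)                      # uniform proposal (illustrative choice)
print(np.abs(p - is_estimate_of_p(s, q, k=5000)).max())  # shrinks as k grows
```

The slide's stability issue shows up here as the variance of the weights exp(s)/q: when the proposal is a poor match for the model, a few huge weights dominate the estimate.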

  6. A fast and simple solution: noise-contrastive estimation • Idea: fit a density model by discriminating between samples from the data distribution and samples from a known noise distribution • Fit the model to the data: maximize the expected log-posterior of the data/noise labels D
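
A minimal numerical sketch of the NCE objective for a single context, assuming a uniform noise distribution and one data word against k noise words. The model term uses raw exponentiated scores with no normaliser, which is the point of NCE; the toy setup and names are illustrative, not the presentation's code.

```python
# NCE objective for one (context, data word) pair against k noise samples:
# P(D=1 | w, h) = p_model(w|h) / (p_model(w|h) + k * p_noise(w)).
import numpy as np

rng = np.random.default_rng(0)
k = 5                                        # noise samples per data word
V = 10
p_noise = np.full(V, 1.0 / V)                # known noise distribution (uniform here)

def unnorm_model(scores, w):
    """Unnormalised model probability exp(s(w, h)); no sum over the vocabulary."""
    return np.exp(scores[w])

def nce_objective(scores, data_word, noise_words):
    """Expected log-posterior of the correct data/noise labels D."""
    pm_d = unnorm_model(scores, data_word)
    obj = np.log(pm_d / (pm_d + k * p_noise[data_word]))           # label D = 1
    for wn in noise_words:
        pm_n = unnorm_model(scores, wn)
        obj += np.log(k * p_noise[wn] / (pm_n + k * p_noise[wn]))  # label D = 0
    return obj

scores = rng.normal(size=V)                  # s(w, h) for some context h
noise = rng.choice(V, size=k, p=p_noise)
print(nce_objective(scores, data_word=3, noise_words=noise))
```

Maximising this objective with gradient steps never touches the vocabulary-wide normalising sum, which is where the speed-up over maximum likelihood comes from.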

  7. The strength of NCE • Allows working with unnormalized distributions • The gradient of the objective is better behaved than the importance-sampling gradient, since the weights are always between 0 and 1 • The NCE gradient converges to the ML gradient as k tends to infinity
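
Both claims on this slide can be read off the form of the expected NCE gradient. The expression below is a sketch following the standard NCE analysis, with notation introduced here rather than taken from the slides: the per-word weight k P_n/(P_θ^h + k P_n) always lies between 0 and 1, and it tends to 1 as k grows, at which point the expression reduces to the maximum-likelihood gradient.

```latex
% P_\theta^h = model distribution, \tilde{P}^h = data distribution,
% P_n = noise distribution, k = number of noise samples per data word.
\frac{\partial}{\partial\theta} J^{h}(\theta)
  = \sum_{w} \frac{k\,P_n(w)}{P_\theta^{h}(w) + k\,P_n(w)}
    \Bigl(\tilde{P}^{h}(w) - P_\theta^{h}(w)\Bigr)
    \frac{\partial}{\partial\theta} \log P_\theta^{h}(w)
  \;\xrightarrow[\;k \to \infty\;]{}\;
    \sum_{w} \Bigl(\tilde{P}^{h}(w) - P_\theta^{h}(w)\Bigr)
    \frac{\partial}{\partial\theta} \log P_\theta^{h}(w).
```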

  8. Application: MSR Sentence Completion Challenge • Task: given a sentence with a missing word, find the most appropriate word among five candidate choices • Training dataset: five Sherlock Holmes novels
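
A sketch of how a trained language model can answer a completion question: plug each candidate into the blank and keep the one that gives the sentence the highest total log-probability. The helper names and the toy stand-in model below are assumptions for illustration; a real NPLM's log-probabilities would be used in their place.

```python
# Scoring a sentence-completion question with a (stand-in) language model.
def sentence_log_prob(words, log_prob_next_word, n=3):
    """Sum of log P(w_i | previous n-1 words) over the sentence."""
    return sum(log_prob_next_word(words[i], words[i - n + 1:i])
               for i in range(n - 1, len(words)))

def complete(sentence_with_blank, candidates, log_prob_next_word):
    """Return the candidate giving the completed sentence the highest log-probability."""
    scored = []
    for c in candidates:
        words = [c if w == "___" else w for w in sentence_with_blank]
        scored.append((sentence_log_prob(words, log_prob_next_word), c))
    return max(scored)[1]

# Toy stand-in model that mildly prefers "kitchen"; purely illustrative.
def toy_log_prob(word, context):
    return 0.1 if word == "kitchen" else 0.0

sentence = "the cat is walking in the ___".split()
print(complete(sentence, ["kitchen", "room", "garden", "car", "sky"], toy_log_prob))
```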

  9. Thank you 
