
A fast and simple neural probabilistic language model for natural language processing


Presentation Transcript


  1. A fast and simple neural probabilistic language model for natural language processing. Presenter: Yifei Guo. Supervisor: David Barber

  2. Statistical language model • Goal: model the joint distribution of words in a sentence • Task: predict the next word Wn from the preceding n-1 words, called the context • Example: the n-gram algorithm • Pro: simplicity – conditional probability tables for P(Wn | context) are estimated by smoothing word n-tuple counts • Con: curse of high dimensions – e.g. a 10-word sentence with a vocabulary of 10,000 words leads to on the order of 10,000^10 free parameters
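
The n-gram recipe on this slide can be made concrete with a toy example. The snippet below is a minimal sketch (not code from the presentation) of a trigram table with add-one smoothing; the toy corpus and function names are illustrative only.

```python
# Minimal trigram language model with add-one (Laplace) smoothing.
from collections import defaultdict

corpus = ("the cat is walking in the kitchen . "
          "the dog was running in the room .").split()
vocab = sorted(set(corpus))

# Count word triples (w1, w2, w3) and their (w1, w2) contexts.
tri_counts = defaultdict(int)
ctx_counts = defaultdict(int)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    tri_counts[(w1, w2, w3)] += 1
    ctx_counts[(w1, w2)] += 1

def p_next(w3, context):
    """P(w3 | w1, w2) with add-one smoothing over the vocabulary."""
    w1, w2 = context
    return (tri_counts[(w1, w2, w3)] + 1) / (ctx_counts[(w1, w2)] + len(vocab))

print(p_next("walking", ("cat", "is")))   # seen triple: relatively high
print(p_next("running", ("cat", "is")))   # unseen triple: smoothed, small but non-zero
```

Even this toy table shows the scaling problem the slide points to: the number of possible n-tuples grows exponentially with n, so most of them are never observed.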

  3. Neural probabilistic language model • Uses distributed representations of words to address the curse of high dimensions • (Figure: neural network architecture) • Example: generalize from "The cat is walking in the kitchen" to "The dog was running in the room", and likewise to "The cat was walking in the kitchen", "The cat was running in the kitchen", "The cat is running in the kitchen", "The cat is running in the room", "The dog was walking in the room", "The dog was walking in the kitchen", "The dog is running in the room", "The dog was running in the kitchen", …
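
The generalization on this slide relies on similar words getting nearby feature vectors. Below is a tiny hand-made sketch (not the presentation's model, vectors are invented) of that idea: because the toy vectors for cat/dog and kitchen/room are close, any smooth scoring function assigns them similar scores, so probability mass spreads to the unseen sentence variants.

```python
# Toy distributed representations: nearby vectors give nearby scores.
import numpy as np

emb = {
    "cat":     np.array([0.9, 0.1, 0.0]),
    "dog":     np.array([0.8, 0.2, 0.0]),   # close to "cat"
    "kitchen": np.array([0.0, 0.1, 0.9]),
    "room":    np.array([0.0, 0.2, 0.8]),   # close to "kitchen"
}

w = np.array([1.0, -0.5, 2.0])              # some learnt weight vector (toy values)

def score(word):
    """A smooth function of the word's feature vector."""
    return float(w @ emb[word])

print(score("cat"), score("dog"))           # similar scores
print(score("kitchen"), score("room"))      # similar scores
```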

  4. NPLM • Quantifies the compatibility between the next word and its context with a score function • Words are mapped to real-valued feature vectors learnt from the data • The distribution over the next word is defined by the score function
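
A minimal sketch of how a score function defines the next-word distribution via a softmax, P(w | h) ∝ exp(s(w, h)). The bilinear score used here (context projection dotted with each candidate word's vector) is one common choice in the spirit of log-bilinear models, and all parameter values are random placeholders rather than learnt ones.

```python
# Score function -> next-word distribution (toy parameters).
import numpy as np

V, D = 10, 4                       # vocabulary size, feature dimension
rng = np.random.default_rng(0)
R = rng.normal(size=(V, D))        # per-word feature vectors (learnt in practice)
C = rng.normal(size=(2, D, D))     # one combination matrix per context position
b = np.zeros(V)                    # per-word bias

def scores(context_ids):
    """s(w, h) for every candidate w: predicted context vector dotted with each word vector."""
    q_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    return R @ q_hat + b

def next_word_distribution(context_ids):
    s = scores(context_ids)
    s = s - s.max()                # numerical stability
    p = np.exp(s)
    return p / p.sum()

p = next_word_distribution([1, 3])
print(p.round(3), p.sum())         # a proper distribution over the 10-word vocabulary
```

The normalising sum in the softmax is exactly the term that makes maximum-likelihood training expensive, which is the subject of the next slide.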

  5. Maximum likelihood learning • ML training of the NPLM is tractable but expensive • Computing the gradient of the log-likelihood takes time linear in the vocabulary size: $\frac{\partial}{\partial\theta}\log P_\theta^h(w) = \frac{\partial}{\partial\theta} s_\theta(w,h) - \sum_{w'} P_\theta^h(w')\,\frac{\partial}{\partial\theta} s_\theta(w',h)$ • Importance sampling approximation (Bengio & Senécal, 2003) • Sample words from a proposal distribution and reweight the gradient • Stability issues: need either many samples or an adaptive proposal distribution
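
To see the cost concretely: the gradient with respect to the per-word bias is one-hot(w) minus P(·|h), so the expensive part is the model distribution itself, which requires a sum over the whole vocabulary. The sketch below (a toy illustration, not the paper's implementation) estimates that term instead from k samples of a proposal with self-normalised importance weights; the uniform proposal is an assumption made here for simplicity.

```python
# Exact next-word distribution vs. an importance-sampled estimate of it.
import numpy as np

rng = np.random.default_rng(1)
V = 10
s = rng.normal(size=V)                       # scores s(w', h) for one context h
p = np.exp(s - s.max()); p /= p.sum()        # exact P(. | h): needs a sum over all V words

def is_estimate_of_p(s, q, k):
    """Estimate E_{P(.|h)}[one-hot(w')] = P(.|h) from k samples of the proposal q."""
    samples = rng.choice(len(q), size=k, p=q)
    w = np.exp(s[samples]) / q[samples]      # importance weights exp(s(w',h)) / q(w')
    w /= w.sum()                             # self-normalise the weights
    est = np.zeros(len(q))
    np.add.at(est, samples, w)               # scatter-add the weights per sampled word
    return est

q = np.full(V, 1.0 / V)                      # uniform proposal (illustrative choice)
print(np.abs(p - is_estimate_of_p(s, q, k=5000)).max())  # shrinks as k grows
```

The slide's stability issue shows up here as the variance of the weights exp(s)/q: when the proposal is a poor match for the model, a few huge weights dominate the estimate.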

  6. A fast and simple solution: noise-contrastive estimation • Idea: fit a density model by discriminating between samples from the data distribution and samples from a known noise distribution • Fit the model to the data: maximize the expected log-posterior of the data/noise labels D
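
A minimal numerical sketch of the NCE objective for a single context, assuming a uniform noise distribution and one data word against k noise words. The model term uses raw exponentiated scores with no normaliser, which is the point of NCE; the toy setup and names are illustrative, not the presentation's code.

```python
# NCE objective for one (context, data word) pair against k noise samples:
# P(D=1 | w, h) = p_model(w|h) / (p_model(w|h) + k * p_noise(w)).
import numpy as np

rng = np.random.default_rng(0)
k = 5                                        # noise samples per data word
V = 10
p_noise = np.full(V, 1.0 / V)                # known noise distribution (uniform here)

def unnorm_model(scores, w):
    """Unnormalised model probability exp(s(w, h)); no sum over the vocabulary."""
    return np.exp(scores[w])

def nce_objective(scores, data_word, noise_words):
    """Expected log-posterior of the correct data/noise labels D."""
    pm_d = unnorm_model(scores, data_word)
    obj = np.log(pm_d / (pm_d + k * p_noise[data_word]))           # label D = 1
    for wn in noise_words:
        pm_n = unnorm_model(scores, wn)
        obj += np.log(k * p_noise[wn] / (pm_n + k * p_noise[wn]))  # label D = 0
    return obj

scores = rng.normal(size=V)                  # s(w, h) for some context h
noise = rng.choice(V, size=k, p=p_noise)
print(nce_objective(scores, data_word=3, noise_words=noise))
```

Maximising this objective with gradient steps never touches the vocabulary-wide normalising sum, which is where the speed-up over maximum likelihood comes from.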

  7. The strength of NCE • Allows working with unnormalized distributions • The gradient of the objective is better behaved than the importance-sampling gradient, since the weights are always between 0 and 1 • The NCE gradient converges to the ML gradient as k tends to infinity
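
Both claims on this slide can be read off the form of the expected NCE gradient. The expression below is a sketch following the standard NCE analysis, with notation introduced here rather than taken from the slides: the per-word weight k P_n/(P_θ^h + k P_n) always lies between 0 and 1, and it tends to 1 as k grows, at which point the expression reduces to the maximum-likelihood gradient.

```latex
% P_\theta^h = model distribution, \tilde{P}^h = data distribution,
% P_n = noise distribution, k = number of noise samples per data word.
\frac{\partial}{\partial\theta} J^{h}(\theta)
  = \sum_{w} \frac{k\,P_n(w)}{P_\theta^{h}(w) + k\,P_n(w)}
    \Bigl(\tilde{P}^{h}(w) - P_\theta^{h}(w)\Bigr)
    \frac{\partial}{\partial\theta} \log P_\theta^{h}(w)
  \;\xrightarrow[\;k \to \infty\;]{}\;
    \sum_{w} \Bigl(\tilde{P}^{h}(w) - P_\theta^{h}(w)\Bigr)
    \frac{\partial}{\partial\theta} \log P_\theta^{h}(w).
```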

  8. Application: MSR Sentence Completion Challenge • Task: given a sentence with a missing word, find the most appropriate word among five candidate choices • Training dataset: five Sherlock Holmes novels
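
A sketch of how a trained language model can answer a completion question: plug each candidate into the blank and keep the one that gives the sentence the highest total log-probability. The helper names and the toy stand-in model below are assumptions for illustration; a real NPLM's log-probabilities would be used in their place.

```python
# Scoring a sentence-completion question with a (stand-in) language model.
def sentence_log_prob(words, log_prob_next_word, n=3):
    """Sum of log P(w_i | previous n-1 words) over the sentence."""
    return sum(log_prob_next_word(words[i], words[i - n + 1:i])
               for i in range(n - 1, len(words)))

def complete(sentence_with_blank, candidates, log_prob_next_word):
    """Return the candidate giving the completed sentence the highest log-probability."""
    scored = []
    for c in candidates:
        words = [c if w == "___" else w for w in sentence_with_blank]
        scored.append((sentence_log_prob(words, log_prob_next_word), c))
    return max(scored)[1]

# Toy stand-in model that mildly prefers "kitchen"; purely illustrative.
def toy_log_prob(word, context):
    return 0.1 if word == "kitchen" else 0.0

sentence = "the cat is walking in the ___".split()
print(complete(sentence, ["kitchen", "room", "garden", "car", "sky"], toy_log_prob))
```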

  9. Thank you 
