

  1. A Hierarchical Bayesian Language Model based on Pitman-Yor Processes. A Rough Translation by Davis Cho, Jianxiong Wang, John Yuan, and Nelly Lyu

  2. Overview The main goal of this paper is to introduce a language model based on the Pitman-Yor process. This is a hierarchical nonparametric Bayesian model, which uses Bayesian inference to update latent parameters. The Pitman-Yor process is a nonparametric extension of the Dirichlet process. The benefit of this language model is that it avoids the overfitting problems of N-gram models. However, it is important to note that inference for such a model is quite complex and requires Markov chain Monte Carlo sampling.

  3. Overview/Motivation • Previously in class, we discussed several ways to smooth the distribution of the data points using different smoothing techniques, such as • Laplace: assume every event has been seen once • Jelinek-Mercer (a.k.a. interpolation): mix higher-order n-gram estimates with lower-order ones • Absolute discounting: subtract a discount from each frequency • Kneser-Ney: account for the number of previous context words • But Teh proposes a better way to do things!

  4. Agenda Bayesian Philosophy / Bayesian Inference; Bayesian Parametric vs. Bayesian Nonparametric; Bayesian Hierarchical Modeling; Dirichlet Process; Pitman-Yor Process; Chinese Restaurant Process; Hierarchical Chinese Restaurant for NLP; Inference (Gibbs Sampling); Experimental Results

  5. Presentation Timings • Bayesian Overview - 20 minutes (Nelly): Bayesian example; prior and posterior example; parametric example; nonparametric example • Dirichlet - 20 minutes (Davis?): Hierarchical Dirichlet Process; Pitman-Yor • Hypothetical break • Chinese Restaurant example - 25 minutes (John): why it is called a Chinese Restaurant; rich get richer, preferential attachment • Inference / Gibbs Sampling - 20 minutes (Jianxiong)

  6. What is Bayesian Philosophy? • A Bayesian approach to a problem starts with the formulation of a model that we hope is adequate to describe the situation of interest. • We then formulate a prior distribution over the unknown parameters of the model, which is meant to capture our beliefs about the situation before seeing the data. • After observing some data, we apply Bayes' rule to obtain a posterior distribution for these unknowns, which takes account of both the prior and the data.

  7. What does “Bayesian inference” even mean? Bayesian inference = guessing in the style of Bayes. Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

  8. Steps of Bayesian Inference Identify the observed data you are working with. Construct a probabilistic model to represent the data (likelihood). Specify prior distributions over the parameters of your probabilistic model (prior). Collect data and apply Bayes' rule to re-allocate credibility across the possible parameter values (posterior).

  9. Bayesian Coin Flips Example

  10. Prior

  11. Putting it together: Posterior ∝ Likelihood × Prior
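
To make the coin-flip slides concrete, here is a minimal sketch in Python of a Beta-Binomial update. The prior hyperparameters and the flip sequence are illustrative assumptions, since the slides' actual figures are not preserved in the transcript.

```python
# Beta prior + Binomial likelihood -> Beta posterior (conjugate update).
# Prior hyperparameters and data below are illustrative, not from the slides.
a_prior, b_prior = 2, 2                      # Beta(2, 2): mild belief the coin is fair
flips = [1, 1, 1, 0, 1, 0, 1, 1]             # 1 = heads, 0 = tails
heads = sum(flips)
tails = len(flips) - heads

# Posterior ∝ Likelihood × Prior  =>  Beta(a_prior + heads, b_prior + tails)
a_post, b_post = a_prior + heads, b_prior + tails
posterior_mean = a_post / (a_post + b_post)  # mean of Beta(a, b) is a / (a + b)

print(f"Posterior: Beta({a_post}, {b_post})")
print(f"Posterior mean of P(heads): {posterior_mean:.3f}")
```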

  12. Bayesian Parametric • Parametric models assume some fixed, finite set of parameters Θ. Given the parameters, future predictions x are independent of the observed data D: P(x | Θ, D) = P(x | Θ). Therefore Θ captures everything there is to know about the data.
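
As a tiny illustration of the claim P(x | Θ, D) = P(x | Θ): once the parameter is fixed and known, observing data does not change predictions, in contrast to the Bayesian coin update sketched above. The numbers here are made up for illustration.

```python
# With a fixed, known parameter, predictions ignore the observed data:
# P(x | Θ, D) = P(x | Θ). Numbers are purely illustrative.
theta = 0.5                          # known probability of heads
data = [1, 1, 1, 1, 1, 1, 1, 1]      # eight heads in a row (deliberately unused: that is the point)

p_heads_given_data = theta           # conditioning on data changes nothing here
p_heads_without_data = theta
assert p_heads_given_data == p_heads_without_data
print("P(next flip = heads) =", theta, "regardless of the data")
```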

  13. Let’s see a parametric example. E.g., given some measurement points about mosquitoes in Asia, how many species are there? K = 4? K = 5? K = 6? K = 8?

  14. What are Nonparametric Methods? • Nonparametric: modeling with an unbounded number of parameters. • Why is this useful for NLP?

  15. Bayesian nonparametric example: Music Genre

  16. N-Gram Model Review. Let's take the 3-gram model as an example. How do we find the probabilities for each context? The goal is to predict the next word given a context.
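
A minimal sketch of maximum-likelihood trigram estimation on a made-up toy corpus, to show where the probabilities come from and why unseen trigrams get probability zero, which is the overfitting problem that smoothing (and this paper) addresses. The corpus and words are hypothetical.

```python
from collections import defaultdict

# Maximum-likelihood trigram estimates on a toy corpus (illustrative only).
corpus = "the boy plays soccer the girl plays soccer the boy plays basketball".split()

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2, w3)] += 1
    bigram_counts[(w1, w2)] += 1

def p_mle(w3, w1, w2):
    """P(w3 | w1, w2) by raw counts; zero for any unseen trigram."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(p_mle("soccer", "boy", "plays"))   # seen trigram: 0.5
print(p_mle("tennis", "boy", "plays"))   # unseen trigram: 0.0 -> why we smooth
```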

  17. Hierarchical Bayes, Dirichlet, and Pitman-Yor Processes

  18. Hierarchical Bayes Model. Bayesian hierarchical modelling is a statistical model written in multiple levels (hierarchical form) that estimates the parameters of the posterior distribution using the Bayesian method (a.k.a. Bayesian inference; more on this later!). A very, very simple view for now (more on this later as well!): [Diagram: a hierarchy of word distributions, labelled P(w2); P(w1 | w2), P(w2 | w2); P(w | w1, w2), P(w | w2, w2).]

  19. Dirichlet Process • Dirichlet process: essentially a “sampled” distribution obtained from an original distribution, such that the expected value of the sampled distribution is the same as the expected value of the original distribution • The original distribution can be continuous or discrete, while the new distribution will be discrete • Let H be the original distribution we want to sample from, 𝞪 the “concentration” of the samples taken, and G the resulting distribution • Often denoted as: G ~ DP(𝞪, H)

  20. Dirichlet Process • Let's see an example: • Let H be a Normal distribution with mean μ = 0 and variance σ² = 1, denoted H = N(0, 1) • With 𝞪 = 1, G ~ DP(𝞪, N(0, 1)) would look something like the draws shown on the slide • Notice that the resulting distributions only slightly resemble the normal distribution. Why?
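
A hedged sketch of drawing G ~ DP(𝞪 = 1, N(0, 1)) using the truncated stick-breaking construction, one standard way to realise such a draw; the slides do not say how their figures were generated. The truncation level and seed are arbitrary choices for illustration.

```python
import numpy as np

def draw_from_dp(alpha, base_sampler, truncation=1000, rng=None):
    """Approximate draw G ~ DP(alpha, H) via truncated stick-breaking.

    Returns atom locations (drawn from H) and their weights.
    Truncation and seed are arbitrary illustration choices.
    """
    rng = rng or np.random.default_rng(0)
    betas = rng.beta(1.0, alpha, size=truncation)              # stick-breaking proportions
    remaining = np.concatenate(([1.0], np.cumprod(1 - betas[:-1])))
    weights = betas * remaining                                 # weights sum to ~1
    atoms = base_sampler(rng, truncation)                       # atoms drawn from H
    return atoms, weights

# H = N(0, 1) and alpha = 1, as in the slide's example.
atoms, weights = draw_from_dp(alpha=1.0,
                              base_sampler=lambda rng, n: rng.normal(0, 1, n))
print("Largest-weight atoms:", atoms[np.argsort(weights)[-5:]])
print("Total weight captured:", weights.sum())
```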

  21. Dirichlet Process 𝞪 Value • 𝞪 is inversely proportional to how “concentrated” the data is • Higher 𝞪 is less concentrated, while lower 𝞪 is more concentrated • Why is this the case? • The Dirichlet process draws samples from the original distribution according to the following criterion (in this context, H is the original distribution): having already drawn x_1, ..., x_n, (a) draw a new value from H with probability 𝞪 / (𝞪 + n), and (b) repeat a previously seen value x_k with probability n_k / (𝞪 + n), where n_k is the number of times x_k has been drawn.

  22. How 𝞪 Affects the New Distribution • Equation (a): if 𝞪 is low, we are less likely to draw a new value from H as the number of draws increases; if 𝞪 is high, we are more likely to draw a new value from H. • Equation (b): if 𝞪 is low, we are more likely to pick a previously seen value; if 𝞪 is high, we are less likely to pick a previously seen value. • Thus, variance is higher when we have a higher 𝞪 level, because we pick more new values from H.

  23. 𝞪 Examples • [Figure: draws from the Dirichlet process DP(𝞪, N(0, 1))] • The four rows use different 𝞪 (top to bottom: 1, 10, 100 and 1000) • Each row contains three repetitions of the same experiment. • As seen from the graphs, draws from a Dirichlet process are discrete distributions, and they become less concentrated (more spread out) with increasing 𝞪 • The probabilities of each value on the x-axis decrease drastically as 𝞪 increases too!
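
A small simulation of the sequential rule from the previous slides, run at the same 𝞪 values as the figure (1, 10, 100, 1000), just to show that larger 𝞪 yields more unique values. This is my own sketch, not the code behind the slides' plots.

```python
import numpy as np

def dp_predictive_draws(alpha, n_draws, rng):
    """Sequential draws using the DP predictive rule with H = N(0, 1):
    new value from H w.p. alpha/(alpha+n), otherwise repeat a past value."""
    draws = []
    for n in range(n_draws):
        if rng.random() < alpha / (alpha + n):
            draws.append(rng.normal(0, 1))        # equation (a): new draw from H
        else:
            draws.append(draws[rng.integers(n)])  # equation (b): reuse an old value
    return draws

# Illustrative simulation, not the code behind the slides' figures.
rng = np.random.default_rng(0)
for alpha in (1, 10, 100, 1000):
    unique = len(set(dp_predictive_draws(alpha, 2000, rng)))
    print(f"alpha={alpha:>4}: {unique} unique values in 2000 draws")
```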

  24. Problems? • As previously discussed in other papers, we don't want the probability of a word to drop to 0. • This is why we use smoothing! • However, if we use a Dirichlet process, many values are left unseen, especially when 𝞪 is low. • This is why we need some control over the tail-end behavior of the distribution that we obtain from the original distribution. • We want to see more unique values!

  25. Pitman-Yor Process! • G ~ PYP(𝞪, d, H) • d is a discount factor • t is the number of different values drawn so far • First equation: draw a new x_n from H with probability (𝞪 + d·t) / (𝞪 + n), which increases the probability of doing a new draw • Second equation: set x_n = x_k, a previously drawn value, with probability (n_k − d) / (𝞪 + n), which decreases the probability of assigning a draw to a previous value • We now have better control!

  26. Improvement! • [Figure 1] X-axis: number of words drawn; Y-axis: number of unique words. Left: d = 0.5 and 𝞪 = 1 (bottom), 10 (middle) and 100 (top). Right: same, with 𝞪 = 10 and d = 0 (bottom), 0.5 (middle) and 0.9 (top). • [Figure 2] X-axis: number of words drawn; Y-axis: proportion of words appearing only once. Left: d = 0.5 and 𝞪 = 1 (bottom), 10 (middle), 100 (top). Right: 𝞪 = 10 and d = 0 (bottom), 0.5 (middle) and 0.9 (top).
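
A hedged simulation of the Pitman-Yor predictive rule from slide 25, counting unique values for the same settings as the right-hand plots (𝞪 = 10; d = 0, 0.5, 0.9). It only reproduces the qualitative trend; the exact curves depend on the random seed.

```python
import numpy as np

def py_unique_count(alpha, d, n_draws, rng):
    """Sequential Pitman-Yor draws; returns the number of unique values.
    New value w.p. (alpha + d*t)/(alpha + n); otherwise value k w.p. (n_k - d)/(alpha + n)."""
    counts = []                                   # counts[k] = times value k was drawn
    for n in range(n_draws):
        t = len(counts)
        if rng.random() < (alpha + d * t) / (alpha + n):
            counts.append(1)                      # brand-new value from H
        else:
            # conditional on reusing an old value, pick k proportional to (n_k - d)
            probs = (np.array(counts) - d) / (n - d * t)
            counts[rng.choice(t, p=probs)] += 1
    return len(counts)

# Illustrative sketch, not the code behind the slides' plots.
rng = np.random.default_rng(0)
for d in (0.0, 0.5, 0.9):
    print(f"alpha=10, d={d}: {py_unique_count(10, d, 5000, rng)} unique values in 5000 draws")
```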

  27. Chinese Restaurant Process: What is the Chinese Restaurant Process; Probabilities of Joining a Table; Power Law Behavior; Parameters, Pitman-Yor vs. Dirichlet; Hierarchical Chinese Restaurant Process for NLP

  28. Chinese Restaurant Process - Analogy to Dirichlet ➔ A Chinese restaurant with an infinite number of circular tables ➔ Each table has infinite capacity ➔ N customers enter the restaurant; each joins an existing table or a new one. ➔ At time t, there are M ≤ N tables ➔ The resulting process is exchangeable

  29. Probability of Joining a Table • Probability of joining a new table: (θ + d·T) / (θ + C) • Probability of joining an existing table k: (c_k − d) / (θ + C) • Where: θ, d are parameters; C is the total number of customers; c_k is the number of customers at table k; T is the number of distinct tables

  30. Probability of Joining a Table • To join a new table: [(Param θ) + (# Distinct Tables) × (Param d)] / [(Param θ) + (Total Number of Customers)] • To join a current table (for each table): [(Number of Customers at that table) − (Param d)] / [(Param θ) + (Total Number of Customers)] • Notice something interesting about these formulas?

  31. Power Law. Only two quantities are adjustable: 1. Number of customers at table k (rich get richer): if you increase this, you increase the chance of future customers joining that table. 2. Number of distinct tables: if you increase this, you increase the chance that a future customer will create a new table.

  32. Example Restaurant Demo. Let's say 6 friends go to grab lunch; however, in this restaurant each table serves a different type of cuisine. The table choices are Stir-fry, Steak, Salad, Soup, and Sandwiches (S1-S5). Let's go through two restaurant examples. For simplicity, let d = 0.5 and θ = 10.
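
A hedged replay of the lunch demo with d = 0.5 and θ = 10. The actual seatings walked through on the original slides are not recoverable, so the outcome below is just one random realisation; the cap of five tables (one per cuisine) is specific to this demo, not part of the general process.

```python
import random

random.seed(7)                     # one arbitrary realisation of the demo
theta, d = 10.0, 0.5
cuisines = ["Stir-fry", "Steak", "Salad", "Soup", "Sandwiches"]  # S1-S5

tables = []                        # tables[k] = number of friends seated at table k
for friend in range(6):
    c, t = sum(tables), len(tables)
    p_new = (theta + d * t) / (theta + c)
    # the 5-cuisine cap is demo-specific; the general process has unbounded tables
    if t < len(cuisines) and random.random() < p_new:
        tables.append(1)                                  # open a new table / cuisine
    else:
        # join existing table k with probability proportional to (c_k - d)
        weights = [ck - d for ck in tables]
        k = random.choices(range(t), weights=weights)[0]
        tables[k] += 1

for k, ck in enumerate(tables):
    print(f"{cuisines[k]}: {ck} friend(s)")
```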

  33. Parameter d: Why Pitman-Yor is Better than Dirichlet

  34. Why do we not want the # of individual words to drop to 15%?

  35. Hierarchical Chinese Restaurant for NLP. Now let's go back to our Chinese restaurant and try to run our algorithm recursively in an NLP context. Remember our earlier two probabilities: • To predict a new word given a context: [(Param θ) + (# Distinct Words Predicted) × (Param d)] / [(Param θ) + (Total Number of Words Predicted)] • To predict an existing word: [(Number of Times We Predicted That Word) − (Param d)] / [(Param θ) + (Total Number of Words Predicted)]

  36. What does Hierarchical Structure Mean? Create a hierarchy of word distributions. Use the Chinese Restaurant Process to continuously draw words. Each layer of the hierarchy is based on a previous context from the layer above. Each layer contains restaurants (groups of nodes) and tables (nodes). Each restaurant is a context. Each table is a word.

  37. Sample Hierarchy [Diagram: a tree of restaurants. G0 is the root. Below it are G1 = "Soccer" and G2 = "Basketball", with tables such as Love (c=2), Practice (c=10), Play (c=10). Below those are G3 = "Play Soccer" and G4 = "Practice Soccer", with tables such as Boy (c=10), Girl (c=10), Messi (c=10). The leaves correspond to full contexts such as "Girl Play Soccer", "Boy Play Soccer", and "Messi Play Soccer".]

  38. Pseudo Code
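
The transcript does not preserve the pseudocode from this slide, so here is a hedged sketch of the recursive prediction rule described on the previous slides, using their simplification of one "table" per distinct word per context. The counts, the context, and the vocabulary size are made-up placeholders; a real model would learn these from a corpus, with the seating sampled by the Gibbs sampler described later.

```python
from collections import defaultdict

# Hedged sketch of the hierarchical predictive rule from the previous slides.
# Counts below are hypothetical placeholders, not trained values.
theta, d = 1.0, 0.5
VOCAB_SIZE = 10_000                      # base case: uniform over the vocabulary

# counts[context][word] = number of times `word` was predicted after `context`
counts = defaultdict(lambda: defaultdict(int))
counts[("play", "soccer")]["messi"] = 10
counts[("soccer",)]["messi"] = 12
counts[()]["messi"] = 15

def p_word(word, context):
    """P(word | context), backing off to the parent context (drop the earliest word)."""
    if context is None:                              # below the root: uniform prior
        return 1.0 / VOCAB_SIZE
    c = counts[context]
    total = sum(c.values())                          # total words predicted here
    distinct = len(c)                                # distinct words predicted here
    parent = context[1:] if context else None        # () backs off to the uniform base
    p_backoff = (theta + d * distinct) / (theta + total) * p_word(word, parent)
    p_here = max(c.get(word, 0) - d, 0) / (theta + total)
    return p_here + p_backoff

print(p_word("messi", ("play", "soccer")))   # seen in this context: high probability
print(p_word("tennis", ("play", "soccer")))  # unseen everywhere: falls back toward uniform
```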

  39. Inference. What is Bayesian inference? Posterior ∝ Likelihood × Prior. Posterior distribution. Posterior predictive distribution (which is a Chinese Restaurant Process).

  40. Posterior Distribution. The posterior distribution over the latent vectors C = {G_v : all contexts v} and parameters Θ = {θ_m, d_m : 0 ≤ m ≤ n−1} is p(C, Θ | D) = p(C, Θ, D) / p(D) ∝ p(C, D | Θ) p(Θ). Here p(Θ) is generated from a Dirichlet process (our prior), and p(C, D | Θ) is a multinomial distribution. The product of these two terms is also a Dirichlet distribution (it can be generated through a Dirichlet process).

  41. Predictive Posterior Distribution. What is the probability of a test word w after a context u? p(w | u, D) = ∫ p(w | u, C, Θ) p(C, Θ | D) dC dΘ. However, this integral is difficult to compute, so we use a Monte Carlo approximation: average p(w | u, C, Θ) over samples drawn from the posterior.

  42. Predictive Posterior Distribution
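
A hedged sketch of just the averaging step of the Monte Carlo approximation: given posterior samples (e.g. from the Gibbs sampler on the next slide), the predictive probability is the average of the per-sample predictive probabilities. Both `posterior_samples` and `p_word_given_sample` are hypothetical placeholders, not APIs from the paper.

```python
def predictive_probability(word, context, posterior_samples, p_word_given_sample):
    """Monte Carlo estimate: p(w | u, D) ≈ (1/M) * sum_m p(w | u, sample_m).

    `posterior_samples` stands in for the seating arrangements / parameters
    drawn by the sampler; `p_word_given_sample` stands in for the per-sample
    Chinese-restaurant predictive rule from the earlier slides.
    """
    per_sample = [p_word_given_sample(word, context, s) for s in posterior_samples]
    return sum(per_sample) / len(per_sample)
```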

  43. Gibbs Sampling. We use Gibbs sampling to obtain the posterior samples {S, Θ}. Gibbs sampling is based on Markov chain Monte Carlo (MCMC).
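
For intuition about what Gibbs sampling is in general (resample each variable from its conditional given the rest), here is a minimal sketch on a bivariate normal. This is not the paper's sampler, which instead resamples the table assignment of each word token given all the others.

```python
import numpy as np

# Generic Gibbs sampler for a bivariate standard normal with correlation rho:
# alternately resample each coordinate from its conditional given the other.
# (The paper's sampler resamples customer seatings, not Gaussian coordinates.)
rho = 0.8
rng = np.random.default_rng(0)
x, y = 0.0, 0.0
samples = []
for _ in range(5000):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # x | y ~ N(rho*y, 1 - rho^2)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # y | x ~ N(rho*x, 1 - rho^2)
    samples.append((x, y))

samples = np.array(samples[500:])                  # drop burn-in
print("Empirical correlation:", np.corrcoef(samples.T)[0, 1])
```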

  44. Results. The paper uses perplexity for evaluation, and it shows that this model is better than the state-of-the-art method, interpolated Kneser-Ney. It can compete with modified Kneser-Ney as well.

  45. Conclusion. The language model based on the Pitman-Yor process is a hierarchical nonparametric Bayesian model, which uses Bayesian inference to update latent parameters. The Pitman-Yor process is a nonparametric extension of the Dirichlet process. The benefit of this language model is that it avoids the overfitting problems of N-gram models, and it provides a coherent probabilistic explanation for the model.

  46. References
Teh, Y. W.; Jordan, M. I.; Beal, M. J.; Blei, D. M. (2006). "Hierarchical Dirichlet Processes" (PDF). Journal of the American Statistical Association. 101 (476): 1566–1581. CiteSeerX 10.1.1.5.9094. doi:10.1198/016214506000000302.
Wikipedia contributors. (2019, April 4). Bayesian hierarchical modeling. In Wikipedia, The Free Encyclopedia. Retrieved 15:39, April 22, 2019, from https://en.wikipedia.org/w/index.php?title=Bayesian_hierarchical_modeling&oldid=890951273
Wikipedia contributors. (2019, February 7). Dirichlet process. In Wikipedia, The Free Encyclopedia. Retrieved 15:41, April 22, 2019, from https://en.wikipedia.org/w/index.php?title=Dirichlet_process&oldid=882242821
Wikipedia contributors. (2019, February 16). Hierarchical Dirichlet process. In Wikipedia, The Free Encyclopedia. Retrieved 15:43, April 22, 2019, from https://en.wikipedia.org/w/index.php?title=Hierarchical_Dirichlet_process&oldid=883602247
