
Probabilistic Model of Sequences



  1. Probabilistic Model of Sequences Ata Kaban The University of Birmingham

  2. Sequence • Example 1: a b a c a b a b a c • Example 2: 1 0 0 1 1 0 1 0 0 1 • Example 3: 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 • Roll a six-sided die N times: you get a sequence. Roll it again: you get another sequence. • What is a sequence? Each example above is a sequence of characters from an alphabet. • Alphabet 1 = {a, b, c}, Alphabet 2 = {0, 1}, Alphabet 3 = {1, 2, 3, 4, 5, 6}

  3. Probabilistic Model • Model = a system that simulates the sequence under consideration • Probabilistic model = a model that produces different outcomes with different probabilities • It includes uncertainty • It can therefore simulate a whole class of sequences and assign a probability to each individual sequence • Could you simulate any of the sequences on the previous slide?

  4. Random sequence model • Back to the die example (the die could be loaded) • The model of a single roll has 6 parameters: p1, p2, p3, p4, p5, p6 • Here, p_i is the probability of throwing i • To be probabilities, these must be non-negative and must sum to one • What is the probability of the sequence [1, 6, 3]? p1*p6*p3 • NOTE: in the random sequence model, the individual symbols in a sequence do not depend on each other. This is the simplest sequence model.
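
A minimal sketch of this model in Python (the fair-die parameter values are an illustration, not fixed by the slide):

# Random (i.i.d.) sequence model: P(sequence) = product of per-symbol probabilities.

def sequence_probability(sequence, params):
    """Probability of a sequence when symbols are drawn independently."""
    prob = 1.0
    for symbol in sequence:
        prob *= params[symbol]
    return prob

# Illustrative parameters: a fair six-sided die.
fair_die = {i: 1 / 6 for i in range(1, 7)}
print(sequence_probability([1, 6, 3], fair_die))  # p1 * p6 * p3 = (1/6)**3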

  5. Maximum Likelihood parameter estimation • The parameters of a probabilistic model are typically estimated from a large set of trusted examples, called a training set. • Example (t = tail, h = head): [t t t h t h h t] • Count up the frequencies: t: 5, h: 3 • Compute probabilities: p(t) = 5/(5+3), p(h) = 3/(5+3) • These are the Maximum Likelihood (ML) estimates of the parameters of the coin. • Does it make sense? • What if you know the coin is fair?
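
A minimal sketch of the frequency-based (ML) estimator for the coin example above:

# Maximum Likelihood estimates for the random sequence model are just
# the relative frequencies of each symbol in the training set.
from collections import Counter

def ml_estimate(training_sequence):
    counts = Counter(training_sequence)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}

print(ml_estimate(['t', 't', 't', 'h', 't', 'h', 'h', 't']))
# {'t': 0.625, 'h': 0.375}, i.e. p(t) = 5/8 and p(h) = 3/8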

  6. Overfitting • A fair coin has probabilities p(t) = 0.5, p(h) = 0.5 • If you throw it 3 times and get [t, t, t], then the ML estimates from this sequence are p(t) = 1, p(h) = 0. • Consequently, from these estimates, the probability of e.g. the sequence [h, t, h, t] = …………. • This is an example of what is called overfitting. Overfitting is the greatest enemy of Machine Learning! • Solution 1: get more data • Solution 2: build what you already know into the model (one concrete option is sketched below; we will return to this during the module).
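
A sketch of Solution 2; the slide does not name a specific method, so Laplace (add-one) smoothing is used here purely as an illustration of building prior knowledge into the estimate:

# ML estimation from [t, t, t] alone gives p(h) = 0, so every sequence containing
# h gets probability zero: overfitting.  Adding a pseudocount for each symbol
# (Laplace smoothing, an illustrative choice) keeps rare outcomes possible.
from collections import Counter

def smoothed_estimate(training_sequence, alphabet, pseudocount=1):
    counts = Counter(training_sequence)
    total = len(training_sequence) + pseudocount * len(alphabet)
    return {s: (counts[s] + pseudocount) / total for s in alphabet}

print(smoothed_estimate(['t', 't', 't'], alphabet=['h', 't']))
# {'h': 0.2, 't': 0.8}  -- h is now rare, but no longer impossible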

  7. Why is it called Maximum Likelihood? • It can be shown that using the frequencies to compute probabilities maximises the total probability of the observed sequences given the model, P(Data | parameters), which is called the likelihood.

  8. Probabilities • We have two dice, D1 and D2 • The probability of rolling i given die D1 is written P(i|D1). This is a conditional probability. • Pick a die at random with probability P(Dj), j = 1 or 2. The probability of picking die Dj and rolling i is called a joint probability: P(i, Dj) = P(Dj) P(i|Dj). • For any events X and Y, P(X, Y) = P(X|Y) P(Y) • If we know P(X, Y), then the so-called marginal probability P(X) can be computed by summing the joint probability over all possible values of Y: P(X) = sum over Y of P(X, Y).
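
A small worked example with made-up numbers: pick one of two dice with P(D1) = P(D2) = 1/2, where D1 is fair (P(6|D1) = 1/6) and D2 is loaded (P(6|D2) = 1/2). The joint probability of picking D2 and rolling a 6 is P(6, D2) = P(D2) P(6|D2) = 1/2 * 1/2 = 1/4, and the marginal probability of rolling a 6 is P(6) = P(6, D1) + P(6, D2) = 1/2 * 1/6 + 1/2 * 1/2 = 1/3.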

  9. Now, we show that maximising P(Data|parameters) for the random sequence model leads to the frequency-based computation that we did intuitively.
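
A sketch of that argument (standard notation, not taken from the slides): if symbol i occurs n_i times in the training data, the likelihood is L = p_1^n_1 * p_2^n_2 * ..., so log L = sum_i n_i log p_i. Maximising this subject to sum_i p_i = 1 with a Lagrange multiplier lambda gives n_i / p_i = lambda for every i, so each p_i is proportional to n_i; the constraint then fixes p_i = n_i / N, where N is the total number of observed symbols, which is exactly the relative frequency.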

  10. Why did we bother? Because in more complicated models we cannot ‘guess’ the result.

  11. Markov Chains • Further examples of sequences: • Bio-sequences • Web page request sequences while browsing • These are no longer random (independent-symbol) sequences; they have a time structure. • How many parameters would a model of such dependencies have? • We need to make simplifying assumptions to end up with a reasonable number of parameters • The first order Markov assumption: each observation depends only on the immediately previous one, not on any longer history • Markov Chain = a sequence model which makes the Markov assumption

  12. Markov Chains • The probability of a Markov sequence factorises as P(x1, x2, ..., xN) = P(x1) P(x2|x1) P(x3|x2) ... P(xN|xN-1) • The alphabet's symbols are also called states • Once the parameters are estimated from training data, the Markov chain can be used for prediction • Amongst others, Markov Chains are successful for web browsing behaviour prediction
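
A minimal sketch of this computation; the initial and transition probabilities below are made-up illustrations:

# Probability of a sequence under a first-order Markov chain:
# P(x1, ..., xN) = P(x1) * P(x2|x1) * ... * P(xN|xN-1)

def markov_sequence_probability(sequence, initial, transition):
    prob = initial[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= transition[prev][curr]
    return prob

# Illustrative parameters over the alphabet {a, b, c}.
initial = {'a': 0.5, 'b': 0.3, 'c': 0.2}
transition = {'a': {'a': 0.1, 'b': 0.6, 'c': 0.3},
              'b': {'a': 0.7, 'b': 0.2, 'c': 0.1},
              'c': {'a': 0.4, 'b': 0.4, 'c': 0.2}}
print(markov_sequence_probability(['a', 'b', 'a', 'c'], initial, transition))
# 0.5 * 0.6 * 0.7 * 0.3 = 0.063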

  13. Markov Chains • A Markov Chain is stationary if its transition probabilities are the same at every time step. • We assume stationary models here. • Then the parameters of the model consist of the transition probability matrix and the initial state probabilities.

  14. ML parameter estimation • We can derive how to compute the parameters of a Markov Chain from data, using Maximum Likelihood, as we did for random sequences. • The ML estimate of the transition matrix is again very intuitive: the estimated probability of going from state i to state j is the number of observed i-to-j transitions divided by the total number of transitions out of state i. • Remember from the slide on probabilities that P(X, Y) = P(X|Y) P(Y), i.e. P(X|Y) = P(X, Y) / P(Y).
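
A minimal sketch of this estimator (count the observed transitions, then normalise each row so it sums to one); the training string is just an illustration:

# ML estimate of the transition matrix: p(j|i) = count(i -> j) / count(i -> anything)
from collections import defaultdict

def estimate_transition_matrix(sequence):
    counts = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev][curr] += 1
    matrix = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        matrix[state] = {nxt: c / total for nxt, c in next_counts.items()}
    return matrix

print(estimate_transition_matrix(list("abbacabcbb")))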

  15. Simple example • If it is raining today, it will rain tomorrow with probability 0.8, so the contrary has probability 0.2 • If it is not raining today, it will rain tomorrow with probability 0.6, so the contrary has probability 0.4 • Build the transition matrix • Be careful which numbers need to sum to one and which don't (each row of the matrix sums to one). Such a matrix is called a stochastic matrix. • Q: It rained all week, including today. What does this model predict for tomorrow? Why? What does it predict for a day from tomorrow? (*Homework)
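
A sketch of the mechanics for this example (working out and interpreting the numbers is the homework); numpy is assumed here just for the matrix product:

# Rain example: rows are "today", columns are "tomorrow"; each row sums to one.
import numpy as np

# state 0 = rain, state 1 = no rain
P = np.array([[0.8, 0.2],
              [0.6, 0.4]])

today = np.array([1.0, 0.0])   # it is raining today
tomorrow = today @ P           # one-step prediction
day_after = today @ P @ P      # two-step prediction
print(tomorrow, day_after)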

  16. Examples of Web Applications • HTTP request prediction: • To predict the probabilities of the next requests from the same user based on the history of requests from that client. • Adaptive Web navigation: • To build a navigation agent which suggests which other links would be of interest to the user based on the statistics of previous visits. • The predicted link does not strictly have to be a link present in the Web page currently being viewed. • Tour generation: • Is given as input the starting URL and generates a sequence of states (or URLs) using the Markov chain process.

  17. Building Markov Models from Web Log Files • A Web log file is a collection of records of user requests for documents on a Web site. An example record: 177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] "GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 "http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)" • The transition matrix can be seen as a graph • Link pair: (r - referrer, u - requested page, w - hyperlink weight) • Link graph: the state diagram of the Markov Chain • a directed weighted graph • a hierarchy from the homepage down to multiple levels
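
A rough sketch of extracting (referrer, requested page) link pairs from such records, assuming the combined log format shown above; real log files need more robust parsing:

# Extract (referrer, requested page) pairs: the raw material for the link graph
# and the transition counts.  Only well-formed GET lines are handled here.
import re

LOG_PATTERN = re.compile(r'"GET (?P<url>\S+) HTTP/[\d.]+" \d+ \S+ "(?P<referrer>[^"]*)"')

def link_pairs(log_lines):
    pairs = []
    for line in log_lines:
        m = LOG_PATTERN.search(line)
        if m and m.group('referrer') != '-':
            pairs.append((m.group('referrer'), m.group('url')))
    return pairs

line = ('177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] '
        '"GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 '
        '"http://www.ulst.ac.uk/studaffairs/accomm.html" '
        '"Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"')
print(link_pairs([line]))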

  18. Link Graph: an example (University of Ulster site) Zhu et al. 2002 State diagram: - Nodes: states - Weighted arrows: number of transitions

  19. Experimental Results (Sarukkai, 2000) • Simulations: • ‘Correct link’ refers to the actual link chosen at the next step. • ‘Depth of the correct link’ is measured by counting the number of links which have a probability greater than or equal to that of the correct link. • Over 70% of correct links are in the top 20 scoring states. • Difficulties: very large state space

  20. Simple exercise • Build the Markov transition matrix of the following sequence: [a b b a c a b c b b d e e d e d e d] State space: {…………….}

  21. Further topics • Hidden Markov Model • Does not make the Markov assumption on the observed sequence • Instead, it assumes that the observed sequence was generated by another sequence which is unobservable (hidden), and this other sequence is assumed to be Markovian • More powerful • Estimation is more complicated • Aggregate Markov model • Useful for clustering sub-graphs of a transition graph

  22. HMM at an intuitive level • Suppose that we know all the parameters of the following HMM, as shown on the state-diagram below. What is the probability of observing the sequence [A,B] if the initial state is S1? The same question if the initial state is chosen randomly with equal probabilities. ANSWER: If the initial state is S1: 0.2*(0.4*0.8+0.6*0.7) = 0.148. In the second case: 0.5*0.148+0.5*0.3*(0.3*0.7+0.7*0.8) = 0.1895.
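
The state diagram itself is not reproduced in this transcript, but its parameters can be read back from the arithmetic in the answer above; a sketch of the corresponding forward computation:

# Forward computation of P(observations) for the two-state HMM; the parameter
# values below are reconstructed from the worked answer (the original state
# diagram is not reproduced here).
transition = {'S1': {'S1': 0.4, 'S2': 0.6},
              'S2': {'S1': 0.7, 'S2': 0.3}}
emission = {'S1': {'A': 0.2, 'B': 0.8},
            'S2': {'A': 0.3, 'B': 0.7}}

def observation_probability(observations, initial):
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: p * emission[s][observations[0]] for s, p in initial.items()}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[r] * transition[r][s] for r in alpha) * emission[s][obs]
                 for s in transition}
    return sum(alpha.values())

print(observation_probability(['A', 'B'], {'S1': 1.0, 'S2': 0.0}))   # 0.148
print(observation_probability(['A', 'B'], {'S1': 0.5, 'S2': 0.5}))   # 0.1895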

  23. Conclusions • Probabilistic Model • Maximum Likelihood parameter estimation • Random sequence model • Markov chain model --------------------------------- • Hidden Markov Model • Aggregate Markov Model

  24. Any questions?
