
Maximum Likelihood Estimation & Expectation Maximization. Lecture 3, Oct 5, 2011. CSE 527 Computational Biology, Fall 2011. Instructor: Su-In Lee. TA: Christopher Miles. Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022.

## Maximum Likelihood Estimation & Expectation Maximization


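**Maximum Likelihood Estimation & Expectation Maximization**
Lecture 3, Oct 5, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022

**Outline**
• Probabilistic models in biology
• Model selection problem
• Mathematical foundations
• Bayesian networks
• Learning from data
• Maximum likelihood estimation
• Maximum a posteriori (MAP)
• Expectation and maximization

**Parameter Estimation**
• Assumptions
  • Fixed network structure
  • Fully observed instances of the network variables: $D = \{d[1], \dots, d[M]\}$. For example, one instance is $\{i^0, d^1, g^1, l^0, s^0\}$.
• Maximum likelihood estimation (MLE)!
[Figure: "Parameters" of the Bayesian network, from Koller & Friedman]

**The Thumbtack example**
• Parameter learning for a single variable.
• Variable
  • $X$: an outcome of a thumbtack toss
  • $Val(X) = \{\text{head}, \text{tail}\}$
• Data
  • A set of thumbtack tosses: $x[1], \dots, x[M]$

**Maximum likelihood estimation**
• Say that $P(x = \text{head}) = \Theta$, $P(x = \text{tail}) = 1 - \Theta$
• $P(\text{HHTTHHH}\dots\langle M_h \text{ heads}, M_t \text{ tails}\rangle; \Theta) = \Theta^{M_h}(1-\Theta)^{M_t}$
• Definition: The likelihood function
  • $L(\Theta : D) = P(D; \Theta)$
• Maximum likelihood estimation (MLE)
  • Given data $D = \text{HHTTHHH}\dots\langle M_h \text{ heads}, M_t \text{ tails}\rangle$, find $\Theta$ that maximizes the likelihood function $L(\Theta : D)$.

**MLE for the Thumbtack problem**
• Given data $D = \text{HHTTHHH}\dots\langle M_h \text{ heads}, M_t \text{ tails}\rangle$,
• MLE solution $\Theta^* = M_h / (M_h + M_t)$.
• Proof: take the logarithm of the likelihood, differentiate, and set the derivative to zero:
$$\log L(\Theta : D) = M_h \log\Theta + M_t \log(1-\Theta)$$
$$\frac{\partial \log L}{\partial \Theta} = \frac{M_h}{\Theta} - \frac{M_t}{1-\Theta} = 0 \;\Rightarrow\; \Theta^* = \frac{M_h}{M_h + M_t}$$

**Continuous space**
• Assuming sample $x_1, x_2, \dots, x_n$ is from a parametric distribution $f(x \mid \Theta)$, estimate $\Theta$.
• Say that the $n$ samples are from a normal distribution with mean $\mu$ and variance $\sigma^2$.

**Continuous space (cont.)**
Let $\theta_1 = \mu$, $\theta_2 = \sigma^2$. The likelihood is
$$L(\theta_1, \theta_2 : x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\theta_2}} \exp\!\left(-\frac{(x_i - \theta_1)^2}{2\theta_2}\right)$$
and setting the partial derivatives of $\log L$ to zero gives
$$\hat\theta_1 = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat\theta_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\theta_1)^2$$

**Any Drawback?**
• Is it biased?
• Is $\hat\theta_2$? Yes. As an extreme, when $n = 1$, $\hat\theta_2 = 0$.
• The MLE systematically underestimates $\theta_2$. Why? A bit harder to see, but think about $n = 2$. Then $\hat\theta_1$ is exactly between the two sample points, the position that exactly minimizes the expression for $\hat\theta_2$. Any other choices for $\theta_1$, $\theta_2$ make the likelihood of the observed data slightly lower.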
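But it's actually pretty unlikely that two sample points would be chosen exactly equidistant from, and on opposite sides of, the mean, so the MLE systematically underestimates $\theta_2$.

**Maximum a posteriori**
• Incorporating priors. How?
• MLE vs MAP estimation:
$$\Theta_{\text{MLE}} = \arg\max_{\Theta} P(D \mid \Theta), \qquad \Theta_{\text{MAP}} = \arg\max_{\Theta} P(\Theta \mid D) = \arg\max_{\Theta} P(D \mid \Theta)\,P(\Theta)$$

**MLE for general problems**
• Learning problem setting
  • A set of random variables $X$ from unknown distribution $P^*$
  • Training data $D$ = $M$ instances of $X$: $\{d[1], \dots, d[M]\}$
  • A parametric model $P(X; \Theta)$ (a 'legal' distribution)
• Define the likelihood function:
  • $L(\Theta : D) = \prod_{m=1}^{M} P(d[m]; \Theta)$
• Maximum likelihood estimation
  • Choose parameters $\Theta^*$ that satisfy: $L(\Theta^* : D) = \max_{\Theta} L(\Theta : D)$

**MLE for Bayesian networks**
• Structure $G$ over $x_1, x_2, x_3, x_4$:
  $P_G = P(x_1, x_2, x_3, x_4) = P(x_1)\,P(x_2)\,P(x_3 \mid x_1, x_2)\,P(x_4 \mid x_1, x_3)$
  More generally? $P_G = \prod_i P(x_i \mid pa_i)$
• Parameters $\theta$: $\Theta_{x_1}$, $\Theta_{x_2}$, $\Theta_{x_3 \mid x_1, x_2}$, $\Theta_{x_4 \mid x_1, x_3}$ (more generally $\Theta_{x_i \mid pa_i}$)
• Given $D$: $x[1], \dots, x[m], \dots, x[M]$, where $x[m] = (x_1[m], x_2[m], x_3[m], x_4[m])$, estimate $\theta$.
• Likelihood decomposition:
$$L(\theta : D) = \prod_{m=1}^{M} P(x[m]; \theta) = \prod_{i} L_i(\Theta_{x_i \mid pa_i} : D)$$
• The local likelihood function for $X_i$ is:
$$L_i(\Theta_{x_i \mid pa_i} : D) = \prod_{m=1}^{M} P(x_i[m] \mid pa_i[m]; \Theta_{x_i \mid pa_i})$$

**Bayesian network with table CPDs**

| | The Thumbtack example | The Student example |
|---|---|---|
| Structure | $X$ | Intelligence, Difficulty → Grade |
| Joint distribution | $P(X)$ | $P(I, D, G) = P(I)\,P(D)\,P(G \mid I, D)$ |
| Parameters $\theta$ | $\theta$ | $\theta_I$, $\theta_D$, $\theta_{G \mid I,D}$ |
| Data | $D$: {H … x[m] … T} | $D$: {(i1,d1,g1) … (i[m],d[m],g[m]) …} |
| Likelihood function $L(\theta : D) = P(D; \theta)$ | $\theta^{M_h}(1-\theta)^{M_t}$ | $\prod_m P(i[m]; \theta_I)\,P(d[m]; \theta_D)\,P(g[m] \mid i[m], d[m]; \theta_{G \mid I,D})$ |
| MLE solution | $\theta^* = M_h / (M_h + M_t)$ | relative frequencies, e.g. $\theta^*_{g \mid i,d} = M[i,d,g] / M[i,d]$, where $M[\cdot]$ counts the matching instances in $D$ |

**Maximum Likelihood Estimation Review**
• Find parameter estimates which make observed data most likely
• General approach, as long as tractable likelihood function exists
• Can use all available information

**Example: Gene Expression**
• "Coding" regions: instructions for making the proteins
• "Regulatory" regions (regulons): instructions for when and where to make them
• Regulatory regions contain "binding sites" (6-20 bp).
• "Binding sites" attract a special class of proteins, known as "transcription factors".
• Bound transcription factors can initiate transcription (making RNA).
• Proteins that inhibit transcription can also be bound to their binding sites.
• What turns genes on (producing a protein) and off?
• When is a gene turned on or off?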
• Where (in which cells) is a gene turned on?
• How many copies of the gene product are produced?

**Regulation of Genes**
[Figure sequence (four slides): DNA with a regulatory element (binding sites) next to a gene, a transcription factor (protein), RNA polymerase (protein), and, in the last frame, a new protein. Source: M. Tompa, U. of Washington]

**The Gene regulation example**
• What determines the expression level of a gene?
• What are observed and hidden variables?
  • e.G, e.TF's: observed; Process.G: hidden variable that we want to infer!
[Figure: network with nodes e.TF1, e.TF2, e.TF3, e.TF4, …, e.TFN (expression level of each TF), Process.G (biological process the gene is involved in; values p1, p2, p3), and e.G (expression level of a gene)]

**The Gene regulation example**
• How about BS.G's? How deterministic is the sequence of a binding site? How much do we know?
[Figure: the same network extended with nodes BS1.G, …, BSN.G (whether the gene has each TF's binding site, e.g. BS1.G = Yes, BSN.G = Yes)]

**Not all data are perfect**
• Most MLE problems are simple to solve with complete data.
• Available data are "incomplete" in some way.

**Outline**
• Learning from data
• Maximum likelihood estimation (MLE)
• Maximum a posteriori (MAP)
• Expectation-maximization (EM) algorithm

**Continuous space revisited**
• Assuming sample $x_1, x_2, \dots, x_n$ is from a mixture of parametric distributions,
[Figure: sample points $x_1, x_2, \dots, x_m$ and $x_{m+1}, \dots, x_n$ along the $x$ axis]

**A real example**
• CpG content of human gene promoters
[Figure: GC frequency. Source: "A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters", Saxonov, Berg, and Brutlag, PNAS 2006;103:1412-1417]

**Mixture of Gaussians**
• Parameters $\theta$
  • means: $\mu_1$, $\mu_2$
  • variances: $\sigma_1^2$, $\sigma_2^2$
  • mixing parameters: $\tau_1$, $\tau_2 = 1 - \tau_1$
• P.D.F.: $f(x) = \tau_1 f_1(x) + \tau_2 f_2(x)$, where $f_j$ is the density of $N(\mu_j, \sigma_j^2)$

**Apply MLE?**
• Likelihood: $L(\theta : x_1, \dots, x_n) = \prod_{i=1}^{n} \sum_{j=1}^{2} \tau_j f_j(x_i)$
• No closed form solution known for finding $\theta$ maximizing $L$.
• However, what if we knew the hidden data? Let $z_{ij} = 1$ if $x_i$ was drawn from $f_j$, and $0$ otherwise.

**EM as Chicken vs Egg**
• IF $z_{ij}$ known, could estimate parameters $\theta$
  • e.g., only points in cluster 2 influence $\mu_2$, $\sigma_2$.
• IF parameters $\theta$ known, could estimate $z_{ij}$
  • e.g., if $|x_i - \mu_1|/\sigma_1 \ll |x_i - \mu_2|/\sigma_2$, then $z_{i1} \gg z_{i2}$
• BUT we know neither; (optimistically) iterate:
  • E-step: calculate expected $z_{ij}$, given parameters
  • M-step: do "MLE" for parameters $(\mu, \sigma)$, given $E(z_{ij})$
• Overall, a clever "hill-climbing" strategy
• Convergence provable? YES

**"Classification EM"**
• If $z_{ij} < 0.5$, pretend it's 0; if $z_{ij} > 0.5$, pretend it's 1, i.e., classify points as component 1 or 2
• Now recalculate $\theta$, assuming that partition
• Then recalculate $z_{ij}$, assuming that $\theta$
• Then recalculate $\theta$, assuming new $z_{ij}$, etc., etc.

**Full EM**
• $x_i$'s are known; $\Theta$ unknown. Goal is to find the MLE $\Theta$ of: $L(\Theta : x_1, \dots, x_n)$ (observed data likelihood; the $z_{ij}$'s are hidden)
• Would be easy if $z_{ij}$'s were known, i.e., consider $L(\Theta : x_1, \dots, x_n, z_{11}, z_{12}, \dots, z_{n2})$ (complete data likelihood)
• But $z_{ij}$'s are not known.
• Instead, maximize the expected complete data likelihood $E[L(\Theta : x_1, \dots, x_n, z_{11}, z_{12}, \dots, z_{n2})]$, where the expectation is over the distribution of the hidden data ($z_{ij}$'s).

**The E-step**
• Find $E(z_{ij})$, i.e., $P(z_{ij} = 1)$
• Assume $\theta$ known & fixed.
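Let
  • A: the event that $x_i$ was drawn from $f_1$
  • B: the event that $x_i$ was drawn from $f_2$
  • D: the observed data $x_i$
• Then, the expected value of $z_{i1}$ is $P(A \mid D)$. By Bayes' rule, writing $\tau_1 = P(A)$ and $\tau_2 = P(B)$ for the mixing parameters:
$$P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D)} = \frac{\tau_1 f_1(x_i)}{\tau_1 f_1(x_i) + \tau_2 f_2(x_i)}$$

**Complete data likelihood**
• Recall: $z_{ij} = 1$ if $x_i$ was drawn from $f_j$, and $0$ otherwise
• so, correspondingly,
$$L(x_i, z_{i1}, z_{i2} \mid \theta) = \begin{cases} \tau_1 f_1(x_i) & \text{if } z_{i1} = 1 \\ \tau_2 f_2(x_i) & \text{otherwise} \end{cases}$$
• Formulas with "if's" are messy; can we blend more smoothly? Yes, with the $z_{ij}$ as exponents:
$$L(x_i, z_{i1}, z_{i2} \mid \theta) = \big(\tau_1 f_1(x_i)\big)^{z_{i1}} \big(\tau_2 f_2(x_i)\big)^{z_{i2}}$$

**M-step**
• Find $\theta$ maximizing $E[\log(\text{Likelihood})]$:
$$E[\log L] = \sum_{i=1}^{n} \sum_{j=1}^{2} E[z_{ij}] \big(\log \tau_j + \log f_j(x_i)\big)$$

**EM summary**
• Fundamentally an MLE problem
• Useful if analysis is more tractable when 0/1 hidden data $z$ known
• Iterate:
  • E-step: estimate $E(z)$ for each $z$, given $\theta$
  • M-step: estimate $\theta$ maximizing $E(\log \text{likelihood})$ given $E(z)$, where "$E(\log L)$" is with respect to random $z \sim E(z) = p(z = 1)$

**EM Issues**
• EM is guaranteed not to decrease the likelihood with any E-M iteration, hence will converge.
• But may converge to local, not global, max.
• Issue is intrinsic (probably), since EM is often applied to NP-hard problems (including clustering, above, and motif-discovery, soon)
• Nevertheless, widely used, often effective

**Acknowledgement**
• Profs Daphne Koller & Nir Friedman, "Probabilistic Graphical Models"
• Prof Larry Ruzzo, CSE 527, Autumn 2009
• Prof Andrew Ng, ML lecture notes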
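
**Sketch: EM for a two-component Gaussian mixture**
The E-step / M-step iteration described in the slides can be illustrated with a minimal Python sketch. It is not taken from the lecture: the function names (`normal_pdf`, `log_likelihood`, `em_two_gaussians`), the `var_floor` guard, the initialisation (means at the sample extremes, shared variance, equal mixing weights), and the synthetic data are all assumptions made for illustration. The mixing weights are called `taus` here. The M-step uses the standard weighted-MLE updates for a Gaussian mixture, with $E[z_{ij}]$ acting as soft counts.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, mus, variances, taus):
    """Observed-data log-likelihood: sum_i log sum_j tau_j * f_j(x_i)."""
    return sum(
        math.log(sum(t * normal_pdf(x, m, v) for m, v, t in zip(mus, variances, taus)))
        for x in xs
    )


def em_two_gaussians(xs, n_iter=100, var_floor=1e-6):
    """EM for a 1-D mixture of two Gaussians (illustrative sketch).

    Returns (mus, variances, taus, trace), where trace holds the
    log-likelihood before the first iteration and after each iteration.
    """
    # Assumed initialisation: means at the extremes, shared variance, equal weights.
    mus = [min(xs), max(xs)]
    mean = sum(xs) / len(xs)
    var0 = max(sum((x - mean) ** 2 for x in xs) / len(xs), var_floor)
    variances = [var0, var0]
    taus = [0.5, 0.5]
    trace = [log_likelihood(xs, mus, variances, taus)]

    for _ in range(n_iter):
        # E-step: E[z_ij] = tau_j f_j(x_i) / sum_k tau_k f_k(x_i) (Bayes' rule).
        resp = []
        for x in xs:
            w = [t * normal_pdf(x, m, v) for m, v, t in zip(mus, variances, taus)]
            total = sum(w)
            resp.append([wj / total for wj in w])

        # M-step: weighted MLE per component, using E[z_ij] as soft counts.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            weighted_sq = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs))
            variances[j] = max(weighted_sq / nj, var_floor)
            taus[j] = nj / len(xs)

        trace.append(log_likelihood(xs, mus, variances, taus))

    return mus, variances, taus, trace


# Synthetic data (an assumption for the demo): 300 draws from N(-2, 1)
# and 200 draws from N(3, 0.5^2).
random.seed(0)
sample = [random.gauss(-2.0, 1.0) for _ in range(300)]
sample += [random.gauss(3.0, 0.5) for _ in range(200)]

mus, variances, taus, trace = em_two_gaussians(sample)
print("means:", mus)
print("variances:", variances)
print("mixing weights:", taus)
```

Plotting or printing `trace` is a simple way to observe the hill-climbing behaviour: the log-likelihood should not go down from one iteration to the next. Different initialisations can end at different local maxima, which is the issue raised under "EM Issues".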
