Essential Probability & Statistics

Presentation Transcript


  1. Essential Probability & Statistics (Lecture for CS397-CXZ Algorithms in Bioinformatics) Jan. 23, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

  2. Purpose of Prob. & Statistics • Deductive vs. plausible reasoning • Incomplete knowledge -> uncertainty • How do we quantify inference under uncertainty? • Probability: models of random processes/experiments (how data are generated) • Statistics: drawing conclusions about the whole population based on samples (inference from data)

  3. Basic Concepts in Probability • Sample space: all possible outcomes, e.g., tossing 2 coins, S = {HH, HT, TH, TT} • Event: E ⊆ S; E happens iff the outcome is in E, e.g., E = {HH} (all heads), E = {HH, TT} (same face) • Probability of an event: 0 ≤ P(E) ≤ 1, s.t. P(S) = 1 (the outcome is always in S) and P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅

  4. Basic Concepts of Prob. (cont.) • Conditional probability: P(B|A) = P(A ∩ B)/P(A) • P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B) • So, P(A|B) = P(B|A)P(A)/P(B) • For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A) • Total probability: If A1, …, An form a partition of S, then P(B) = P(B ∩ S) = P(B ∩ A1) + … + P(B ∩ An) • So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) (Bayes' Rule)

  5. Interpretation of Bayes' Rule • Hypothesis space: H = {H1, …, Hn}; Evidence: E • The posterior probability of Hi is proportional to the likelihood of the data/evidence if Hi is true, times the prior probability of Hi • If we want to pick the most likely hypothesis H*, we can drop P(E) (spelled out below)
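Written out, with the slide's labels attached (the displayed formula itself is not preserved in this transcript, but it follows directly from Bayes' Rule on the previous slide):

```latex
\underbrace{P(H_i \mid E)}_{\text{posterior prob. of } H_i}
  = \frac{\overbrace{P(E \mid H_i)}^{\text{likelihood of evidence if } H_i \text{ is true}}
          \;\;\overbrace{P(H_i)}^{\text{prior prob. of } H_i}}{P(E)},
\qquad
H^* = \arg\max_{H_i} P(E \mid H_i)\, P(H_i)
```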

  6. Random Variable • X: S → ℝ (a “measure” of the outcome) • Events can be defined according to X • E(X=a) = {si | X(si) = a} • E(X ≤ a) = {si | X(si) ≤ a} • So, probabilities can be defined on X • P(X=a) = P(E(X=a)) • P(X ≤ a) = P(E(X ≤ a)) (F(a) = P(X ≤ a): cumulative dist. func.) • Discrete vs. continuous random variable (think of “partitioning the sample space”; illustrated below)
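A minimal Python sketch of the idea, using the two-coin sample space from slide 3; the helper names `X` and `prob` are illustrative, not from the lecture:

```python
from itertools import product

# A random variable is just a function on the sample space.
# Here X counts the number of heads when tossing 2 fair coins.
sample_space = ["".join(s) for s in product("HT", repeat=2)]   # {HH, HT, TH, TT}
p_outcome = {s: 1 / len(sample_space) for s in sample_space}   # uniform model

def X(outcome: str) -> int:
    """Number of heads in the outcome."""
    return outcome.count("H")

def prob(event) -> float:
    """P(E) = sum of outcome probabilities over the event set."""
    return sum(p_outcome[s] for s in event)

# P(X = 1) = P({HT, TH}) = 0.5
print(prob({s for s in sample_space if X(s) == 1}))
# CDF: F(a) = P(X <= a) for a = 0, 1, 2
print([prob({s for s in sample_space if X(s) <= a}) for a in range(3)])  # [0.25, 0.75, 1.0]
```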

  7. An Example • Think of a DNA sequence as the result of tossing a 4-faced die many times independently • P(AATGC) = p(A)p(A)p(T)p(G)p(C) • A model specifies {p(A), p(C), p(G), p(T)}, e.g., all 0.25 (random model M0) • P(AATGC|M0) = 0.25*0.25*0.25*0.25*0.25 • Comparing 2 models: • M1: coding regions • M2: non-coding regions • Decide whether AATGC is more likely a coding region (sketched in code below)
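A minimal sketch of the model-comparison idea, assuming made-up nucleotide probabilities for M1 and M2 (real values would be estimated from annotated coding/non-coding sequences):

```python
import math

M0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # random model
M1 = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}   # hypothetical coding model
M2 = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}   # hypothetical non-coding model

def log_likelihood(seq: str, model: dict) -> float:
    """log p(seq | model) under independent positions (sum of log probabilities)."""
    return sum(math.log(model[base]) for base in seq)

seq = "AATGC"
print(math.exp(log_likelihood(seq, M0)))             # 0.25**5 ~= 0.000977
# Log-likelihood ratio: positive favors M1 (coding), negative favors M2 (non-coding).
print(log_likelihood(seq, M1) - log_likelihood(seq, M2))
```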

  8. Probability Distributions • Binomial: number of successes out of N trials • Gaussian: sum of N independent R.V.'s (approximately, by the central limit theorem) • Multinomial: getting ni occurrences of outcome i (standard forms below)
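For reference, the textbook forms of the three distributions (the slide's own formulas are not preserved in this transcript):

```latex
% Binomial: k successes in N trials, each with success probability p
P(X = k) = \binom{N}{k} p^{k} (1-p)^{N-k}

% Gaussian with mean \mu and variance \sigma^2
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,
       \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

% Multinomial: n_i occurrences of outcome i, with \sum_i n_i = N
P(n_1,\dots,n_k) = \frac{N!}{n_1!\cdots n_k!}\; p_1^{n_1}\cdots p_k^{n_k}
```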

  9. Parameter Estimation • General setting: • Given a (hypothesized & probabilistic) model that governs the random experiment • The model gives a probability of any data p(D|θ) that depends on the parameter θ • Now, given actual sample data X={x1,…,xn}, what can we say about the value of θ? • Intuitively, take your best guess of θ -- “best” means “best explaining/fitting the data” • Generally an optimization problem

  10. Maximum Likelihood Estimator • Data: a sequence d with counts c(w1), …, c(wN), and length |d| • Model: multinomial M with parameters {p(wi)} • Likelihood: p(d|M) • Maximum likelihood estimator: M̂ = argmax_M p(d|M) • We tune the p(wi) to maximize the log-likelihood l(d|M), using the Lagrange multiplier approach (to enforce Σi p(wi) = 1) and setting the partial derivatives to zero • ML estimate: p(wi) = c(wi)/|d| (derivation sketched below)
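A brief reconstruction of the Lagrange-multiplier derivation the slide outlines, assuming the usual multinomial log-likelihood:

```latex
% Log-likelihood plus the constraint \sum_i p(w_i) = 1
l(d \mid M) = \sum_{i=1}^{N} c(w_i) \log p(w_i)
\quad\Rightarrow\quad
\mathcal{L} = \sum_{i=1}^{N} c(w_i) \log p(w_i)
              + \lambda\Big(1 - \sum_{i=1}^{N} p(w_i)\Big)

% Setting \partial \mathcal{L} / \partial p(w_i) = 0 gives p(w_i) = c(w_i)/\lambda;
% the constraint forces \lambda = \sum_i c(w_i) = |d|, hence
\hat{p}(w_i) = \frac{c(w_i)}{|d|}
```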

  11. Maximum Likelihood vs. Bayesian • Maximum likelihood estimation • “Best” means “the data likelihood reaches its maximum” • Problem: unreliable with small samples • Bayesian estimation • “Best” means being consistent with our “prior” knowledge while explaining the data well • Problem: how to define the prior?

  12. Bayesian Estimator • ML estimator: M̂ = argmax_M p(d|M) • Bayesian estimator: • First consider the posterior: p(M|d) = p(d|M)p(M)/p(d) • Then take the mean or mode of the posterior distribution • p(d|M): sampling distribution (of the data) • p(M) = p(θ1, …, θN): our prior on the model parameters • A conjugate prior can be interpreted as “extra”/“pseudo” data • The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution, its parameters acting as “extra”/“pseudo” counts, e.g., αi ∝ p(wi|REF)
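For concreteness, the standard Dirichlet density over the multinomial parameters (a reconstruction; the slide's displayed formula is not preserved here):

```latex
% Dirichlet prior with pseudo-count parameters \alpha_i > 0
p(\theta_1,\dots,\theta_N \mid \alpha_1,\dots,\alpha_N)
  = \frac{\Gamma\!\big(\sum_{i=1}^{N}\alpha_i\big)}{\prod_{i=1}^{N}\Gamma(\alpha_i)}
    \prod_{i=1}^{N} \theta_i^{\alpha_i - 1}

% Conjugacy: multiplying by a multinomial likelihood \prod_i \theta_i^{c(w_i)}
% yields another Dirichlet, with parameters \alpha_i + c(w_i).
```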

  13. Dirichlet Prior Smoothing (cont.) • Posterior distribution of the parameters: again a Dirichlet, now with parameters αi + c(wi) • The predictive distribution is the same as the posterior mean, which gives the Bayesian estimate (what happens as |d| grows large?) (see the reconstruction below)
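A hedged reconstruction of the smoothing formulas this slide refers to, assuming the Dirichlet(α1, …, αN) prior of the previous slide:

```latex
% Posterior over the multinomial parameters
p(\theta \mid d) = \mathrm{Dirichlet}\big(\alpha_1 + c(w_1), \dots, \alpha_N + c(w_N)\big)

% Posterior mean = predictive probability of w_i (the Bayesian estimate);
% as |d| \to \infty the counts dominate and it approaches the ML estimate c(w_i)/|d|.
\hat{p}(w_i \mid d) = \frac{c(w_i) + \alpha_i}{|d| + \sum_{j=1}^{N} \alpha_j}
```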

  14. Illustration of Bayesian Estimation • [Figure: the likelihood p(D|θ) for data D = (c1, …, cN), the prior p(θ), and the posterior p(θ|D) ∝ p(D|θ)p(θ), with the prior mode, the posterior mode, and the ML estimate θml marked on the θ axis]

  15. Basic Concepts in Information Theory • Entropy: Measuring uncertainty of a random variable • Kullback-Leibler divergence: comparing two distributions • Mutual Information: measuring the correlation of two random variables

  16. Entropy • Entropy H(X) measures the average uncertainty of a random variable X • Example: a fair coin has H(X) = 1 bit; a completely biased coin has H(X) = 0 • Properties: H(X) ≥ 0; minimum 0 (when X is deterministic); maximum log M, where M is the total number of values X can take
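The definition the slide relies on, with base-2 logarithms so that entropy is measured in bits:

```latex
H(X) = -\sum_{x} p(x) \log_2 p(x)

% Example: a coin with P(\text{heads}) = p has
H(X) = -p \log_2 p - (1-p)\log_2 (1-p)
% which equals 1 bit at p = 0.5 and 0 bits at p \in \{0, 1\}.
```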

  17. Interpretations of H(X) • Measures the “amount of information” in X • Think of each value of X as a “message” • Think of X as a random experiment (20 questions) • Minimum average number of bits needed to compress values of X • The more random X is, the harder it is to compress: a fair coin has the maximum information and is hardest to compress; a biased coin has some information and can be compressed to <1 bit on average; a completely biased coin has no information and needs 0 bits (see the sketch below)
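A quick numerical check of these claims; `entropy` is an illustrative helper, not code from the lecture:

```python
import math

def entropy(probs) -> float:
    """H(X) = -sum p * log2(p) over outcomes with nonzero probability (in bits)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Average bits needed per toss for the three coins discussed above:
print(entropy([0.5, 0.5]))    # fair coin          -> 1.0 bit (hardest to compress)
print(entropy([0.9, 0.1]))    # biased coin        -> ~0.469 bits (< 1 bit on average)
print(entropy([1.0, 0.0]))    # completely biased  -> 0.0 bits (no information)
```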

  18. Cross Entropy H(p,q) • What if we encode X with a code optimized for a wrong distribution q? • Expected # of bits = H(p,q) • Intuitively, H(p,q) ≥ H(p), with equality iff q = p (see the definition below)
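The standard definition (the slide's displayed formula is not preserved in this transcript):

```latex
H(p, q) = -\sum_{x} p(x) \log_2 q(x)

% Since H(p,q) = H(p) + D(p \,\|\, q) and D(p \,\|\, q) \ge 0 (next slide),
% encoding with the wrong distribution q never takes fewer bits on average:
H(p, q) \ge H(p)
```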

  19. Kullback-Leibler Divergence D(p||q) • What if we encode X with a code optimized for a wrong distribution q? How many bits would we waste? D(p||q) = H(p,q) − H(p), also called relative entropy • Properties: D(p||q) ≥ 0; D(p||q) ≠ D(q||p) (not symmetric); D(p||q) = 0 iff p = q • KL-divergence is often used to measure the “distance” between two distributions • Interpretation: • For fixed p, D(p||q) and H(p,q) vary in the same way • If p is an empirical distribution, minimizing D(p||q) or H(p,q) over q is equivalent to maximizing the likelihood
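Again the standard form, consistent with the cross-entropy definition above:

```latex
D(p \,\|\, q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)} = H(p, q) - H(p)
% "Wasted bits": the extra expected code length paid for using a code
% optimized for q when the data actually follow p.
```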

  20. Cross Entropy, KL-Div, and Likelihood • Writing the likelihood and log-likelihood of a sample in terms of the empirical distribution shows that maximizing likelihood is the same as minimizing cross entropy (or KL-divergence) against the empirical distribution • This gives a criterion for estimating a good model (see below)
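One way to reconstruct the relation this slide points at, for a sequence d with counts c(w) and empirical distribution p̃(w) = c(w)/|d|:

```latex
\log p(d \mid \theta) = \sum_{w} c(w) \log \theta_w
  = |d| \sum_{w} \tilde{p}(w) \log \theta_w
  = -|d|\, H(\tilde{p}, \theta)

% Hence maximizing the (log-)likelihood over \theta is equivalent to minimizing
% the cross entropy H(\tilde{p}, \theta), and therefore D(\tilde{p} \,\|\, \theta).
```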

  21. Mutual Information I(X;Y) • Compares two distributions: the joint p(x,y) vs. the product of marginals p(x)p(y) • Can be written via the conditional entropy H(Y|X) (see below) • Properties: I(X;Y) ≥ 0; I(X;Y) = I(Y;X); I(X;Y) = 0 iff X & Y are independent • Interpretations: • Measures how much the uncertainty of X is reduced given information about Y • Measures the correlation between X and Y
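The standard identities behind these statements (the slide's formulas are not preserved in this transcript):

```latex
I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}
       = D\big(p(x,y) \,\|\, p(x)\,p(y)\big)
       = H(Y) - H(Y \mid X)
       = H(X) - H(X \mid Y)
```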

  22. What You Should Know • Probability concepts: sample space, event, random variable, conditional prob., multinomial distribution, etc. • Bayes' formula and its interpretation • Statistics: know how to compute the maximum likelihood estimate • Information theory concepts: entropy, cross entropy, relative entropy, conditional entropy, KL-divergence, mutual information, and their relationships
