Introduction to Probabilistic Models for Computational Biology. Lecture 2 – Oct 3, 2011. CSE 527 Computational Biology, Fall 2011. Instructor: Su-In Lee. TA: Christopher Miles. Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022.
Review: Gene Regulation
• Transcription: a gene in the DNA is copied into RNA (for example AUGCGCGUC).
• Translation: the RNA is translated into a protein (for example MRV).
• Transcription followed by translation is called "gene expression"; RNA is also subject to degradation.
• Gene regulation works like a switch: the "transcription factor binding site".
• Genes regulate each other's expression and activity, forming a genetic regulatory network.
[Figure: a DNA sequence with several genes, their RNA transcripts and protein products, and the resulting genetic regulatory network.]
Review: Variations in the DNA
• "Single nucleotide polymorphism (SNP)"
• Sequence variations perturb the genetic regulatory network.
[Figure: single-nucleotide changes marked on the DNA sequence, with the affected RNAs, proteins, and regulatory network edges marked X.]
Outline
• Probabilistic models in biology
  • Model selection problems
• Mathematical foundations
  • Bayesian networks
  • Reference: Probabilistic Graphical Models: Principles and Techniques, Koller & Friedman, The MIT Press
• Learning from data
  • Maximum likelihood estimation
  • Expectation and maximization
Example 1
• How are a nucleotide change in the DNA, blood pressure, and heart disease related?
• There can be several "models"…
[Figure: three alternative graphs (joined by "OR"), each over the nodes DNA alteration, Blood pressure, and Heart disease.]
Example 2
• How do genes A, B, and C regulate each other's expression levels (mRNA levels)?
• There can be several models…
[Figure: alternative graphs (joined by "OR") over the nodes A, B, and C.]
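Probabilistic graphical models
[Figure: an expression data matrix with rows Gene A, Gene B, Gene C and columns Exp 1, Exp 2, …, Exp N (N instances), next to candidate graphs Model I, Model II, Model III, … over A, B, and C.]
• A probabilistic graphical model is a graphical representation of statistical dependencies.
• Question: what are the statistical dependencies between the expression levels of genes A, B, and C?
• Score each candidate by the probability that model x is true given the data.
• Model selection: $\arg\max_x P(\text{model } x \text{ is true} \mid \text{Data})$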
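One common way to make the model selection objective concrete is to apply Bayes' rule to the models themselves. The sketch below assumes a prior $P(M_x)$ over candidate models, which the slide does not mention; $M_x$ is shorthand introduced here for "model x is true".

```latex
\[
\hat{x}
  = \arg\max_x P(M_x \mid \text{Data})
  = \arg\max_x \frac{P(\text{Data} \mid M_x)\, P(M_x)}{P(\text{Data})}
  = \arg\max_x P(\text{Data} \mid M_x)\, P(M_x)
\]
```

The last step holds because $P(\text{Data})$ does not depend on $x$. With a uniform prior over models, the criterion reduces to comparing $P(\text{Data} \mid M_x)$.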
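Probability Theory Review
• Assume random variables A and B with $Val(A) = \{a_1, a_2, a_3\}$ and $Val(B) = \{b_1, b_2\}$.
• Conditional probability
  • Definition: $P(A \mid B) = P(A, B) / P(B)$, for $P(B) > 0$
  • Chain rule: $P(A, B) = P(A)\, P(B \mid A)$
  • Bayes' rule: $P(A \mid B) = P(B \mid A)\, P(A) / P(B)$
• Probabilistic independence: A and B are independent if $P(A, B) = P(A)\, P(B)$.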
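The review items can be checked numerically on a small table. This is a minimal sketch: the joint distribution below is hypothetical (the numbers are not from the lecture), and the function names are illustrative.

```python
from itertools import product

# Hypothetical joint distribution P(A, B); numbers are illustrative only.
joint = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.20,
    ("a2", "b1"): 0.15, ("a2", "b2"): 0.25,
    ("a3", "b1"): 0.05, ("a3", "b2"): 0.25,
}
vals_a = ["a1", "a2", "a3"]
vals_b = ["b1", "b2"]

# Marginals: sum the joint over the other variable.
p_a = {a: sum(joint[(a, b)] for b in vals_b) for a in vals_a}
p_b = {b: sum(joint[(a, b)] for a in vals_a) for b in vals_b}


def p_a_given_b(a, b):
    """Definition of conditional probability: P(a | b) = P(a, b) / P(b)."""
    return joint[(a, b)] / p_b[b]


def p_b_given_a(b, a):
    """P(b | a) = P(a, b) / P(a)."""
    return joint[(a, b)] / p_a[a]


def bayes_a_given_b(a, b):
    """Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)."""
    return p_b_given_a(b, a) * p_a[a] / p_b[b]


def is_independent(tol=1e-9):
    """A and B are independent iff P(a, b) = P(a) P(b) for every pair."""
    return all(
        abs(joint[(a, b)] - p_a[a] * p_b[b]) <= tol
        for a, b in product(vals_a, vals_b)
    )
```

For this particular table, `p_a_given_b` and `bayes_a_given_b` agree on every pair, while `is_independent()` is false because, for instance, P(a1, b1) differs from P(a1) P(b1).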
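Probabilistic Representation
• Joint distribution P over $\{x_1, \dots, x_n\}$
  • Each $x_i$ is binary
  • The joint table needs $2^n - 1$ independent entries
• If the x's are independent
  • $P(x_1, \dots, x_n) = P(x_1) \cdots P(x_n)$, which needs only n parameters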
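A quick sketch of how fast the gap between the two representations grows for binary variables. The function names are illustrative, not from the lecture.

```python
def full_joint_entries(n: int) -> int:
    """Independent entries in a joint table over n binary variables.

    There are 2**n assignments; the last probability is fixed because
    the table must sum to 1.
    """
    return 2 ** n - 1


def independent_entries(n: int) -> int:
    """One Bernoulli parameter per variable when all variables are independent."""
    return n


# Compare the two counts for a few sizes.
comparison = {n: (full_joint_entries(n), independent_entries(n)) for n in (3, 10, 20)}
```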
Conditional Parameterization
• The Diabetes example
  • Genetic risk (G), Diabetes (D)
  • Val(G) = {g1, g0}, Val(D) = {d1, d0}
• $P(G, D) = P(G)\, P(D \mid G)$
  • $P(G)$: prior distribution
  • $P(D \mid G)$: conditional probability distribution (CPD)
[Figure: Genetic risk → Diabetes]
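A minimal sketch of the conditional parameterization for the two-variable diabetes example. All probability values are hypothetical placeholders chosen for illustration; the lecture text gives no numbers.

```python
# Hypothetical numbers, for illustration only (not from the lecture).
p_g = {"g1": 0.2, "g0": 0.8}  # prior P(G)
p_d_given_g = {               # CPD P(D | G), one row per value of G
    "g1": {"d1": 0.30, "d0": 0.70},
    "g0": {"d1": 0.05, "d0": 0.95},
}

# Joint via the chain rule: P(G, D) = P(G) P(D | G).
joint = {
    (g, d): p_g[g] * p_d_given_g[g][d]
    for g in p_g
    for d in ("d1", "d0")
}

# Marginal P(D = d1), obtained by summing the joint over G.
p_d1 = sum(joint[(g, "d1")] for g in p_g)
```

The representation uses three numbers (one for the prior, one per row of the CPD), the same count as the $2^2 - 1$ independent entries of the joint table, so the saving only appears once independence assumptions are added.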
Naïve Bayes Model - Example
• Elaborating the diabetes example
  • Genetic risk (G), Diabetes (D), Hypertension (H)
  • Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
  • The joint table has 8 entries (7 of them independent)
• If D and H are independent given G,
  • $P(G, D, H) = P(G)\, P(D \mid G)\, P(H \mid G)$
  • 5 independent parameters; more compact than the joint
[Figure: Genetic risk → Diabetes, Genetic risk → Hypertension]
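The two counts can be reproduced with a short calculation. This is a sketch with illustrative function names; it counts independent parameters, so the full joint over three binary variables comes out as 7 rather than 8.

```python
def full_joint_params(cardinalities):
    """Independent entries of a joint table: product of cardinalities minus 1."""
    total = 1
    for c in cardinalities:
        total *= c
    return total - 1


def naive_bayes_params(k, finding_cardinalities):
    """Independent parameters of a naive Bayes model.

    (k - 1) for the class prior P(C), plus k * (m - 1) for each finding
    with m values, since P(x_i | C) has one distribution per class value.
    """
    return (k - 1) + sum(k * (m - 1) for m in finding_cardinalities)


# Diabetes example: class G (2 values), findings D and H (2 values each).
joint_count = full_joint_params([2, 2, 2])
nb_count = naive_bayes_params(2, [2, 2])
```

With more findings the gap widens: the joint grows exponentially in the number of findings, while the naive Bayes count grows linearly.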
Naïve Bayes Model
• A class C, where $Val(C) = \{c_1, \dots, c_k\}$
• Finding variables $x_1, \dots, x_n$
• Naïve Bayes assumption
  • The findings are conditionally independent given the individual's class.
• The model factorizes as: $P(C, x_1, \dots, x_n) = P(C) \prod_{i=1}^{n} P(x_i \mid C)$
• The Diabetes example
  • Class: Genetic risk; findings: Diabetes, Hypertension
Naïve Bayes Model - Example
• Medical diagnosis system
  • Class C: disease
  • Findings X: symptoms
• Computing the confidence: $P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$
• Drawbacks
  • Strong (conditional independence) assumptions
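A minimal sketch of how a naive Bayes diagnosis system could compute its confidence in each disease, assuming the posterior is obtained by normalizing $P(C) \prod_i P(x_i \mid C)$. The disease, symptoms, and all probabilities are hypothetical, and `naive_bayes_posterior` is an illustrative name.

```python
from math import prod


def naive_bayes_posterior(prior, cpds, findings):
    """Return P(C | findings) under the naive Bayes assumption.

    prior:    {class_value: P(C = class_value)}
    cpds:     {finding: {class_value: {value: P(finding = value | C)}}}
    findings: {finding: observed value}
    """
    # Unnormalized score per class: P(c) * prod_i P(x_i | c).
    unnorm = {
        c: prior[c] * prod(cpds[name][c][value] for name, value in findings.items())
        for c in prior
    }
    z = sum(unnorm.values())  # normalizer, equal to P(findings)
    return {c: score / z for c, score in unnorm.items()}


# Hypothetical disease/symptom numbers, for illustration only.
prior = {"flu": 0.1, "healthy": 0.9}
cpds = {
    "fever": {"flu": {True: 0.8, False: 0.2}, "healthy": {True: 0.05, False: 0.95}},
    "cough": {"flu": {True: 0.7, False: 0.3}, "healthy": {True: 0.10, False: 0.90}},
}
posterior = naive_bayes_posterior(prior, cpds, {"fever": True, "cough": True})
```

Because the findings are treated as independent given the class, correlated symptoms are effectively counted twice, which is one way the strong assumption can make the model overconfident.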
Bayesian Network
• Directed acyclic graph (DAG)
  • Node: a random variable
  • Edge: direct influence of one node on another
• The Diabetes example revisited
  • Genetic risk (G), Diabetes (D), Hypertension (H)
  • Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
[Figure: Genetic risk → Diabetes, Genetic risk → Hypertension]
Bayesian Network Semantics
• A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables $X_1, \dots, X_n$.
  • $Pa_{X_i}$: parents of $X_i$ in G
  • $NonDescendants_{X_i}$: variables in G that are not descendants of $X_i$
• G encodes the following set of conditional independence assumptions, called the local Markov assumptions and denoted by $I_L(G)$:
  • For each variable $X_i$: $(X_i \perp NonDescendants_{X_i} \mid Pa_{X_i})$
[Figure: an example DAG over nodes x1, …, x11.]
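The local Markov assumptions can be read off a graph mechanically. The sketch below does this for the small diabetes network (G → D, G → H), assuming the graph is given as a dictionary from each node to its parents; the function names are illustrative.

```python
def descendants(children, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), list(children.get(node, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(children.get(cur, []))
    return seen


def local_markov(parents):
    """For each X, return (NonDescendants(X) minus Pa(X), Pa(X)).

    Each pair (S, P) stands for the statement (X is independent of S given P).
    Parents are dropped from the left-hand side because conditioning on
    them makes their inclusion redundant.
    """
    # Invert the parent map to get a child map.
    children = {}
    for child, pas in parents.items():
        for p in pas:
            children.setdefault(p, []).append(child)
    statements = {}
    for x, pas in parents.items():
        nondesc = set(parents) - descendants(children, x) - {x}
        statements[x] = (nondesc - set(pas), set(pas))
    return statements


# Diabetes network: G -> D and G -> H.
parents = {"G": [], "D": ["G"], "H": ["G"]}
statements = local_markov(parents)
```

For this graph the only non-trivial statements are that D is independent of H given G, and symmetrically that H is independent of D given G, which is the naive Bayes assumption.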
The Genetics Example
• Variables
  • B: blood type (a phenotype)
  • G: genotype of the gene that encodes a person's blood type; one of <A,A>, <A,B>, <A,O>, <B,B>, <B,O>, <O,O>
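In this example the CPD $P(B \mid G)$ can be deterministic. The sketch below assumes the standard ABO rules (A and B are codominant, O is recessive), which the slide text does not spell out; `blood_type` is an illustrative name.

```python
GENOTYPES = [("A", "A"), ("A", "B"), ("A", "O"), ("B", "B"), ("B", "O"), ("O", "O")]


def blood_type(genotype):
    """Map a genotype (pair of alleles) to a blood type under ABO rules."""
    alleles = set(genotype)
    if alleles == {"A", "B"}:
        return "AB"  # A and B are codominant
    if "A" in alleles:
        return "A"   # A dominates O
    if "B" in alleles:
        return "B"   # B dominates O
    return "O"       # only <O,O> gives type O


# Deterministic CPD: P(B = b | G = g) is 1 for exactly one b per genotype.
cpd = {
    g: {b: 1.0 if blood_type(g) == b else 0.0 for b in ("A", "B", "AB", "O")}
    for g in GENOTYPES
}
```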
Bayesian Network Joint Distribution
• Let G be a Bayesian network graph over the variables $X_1, \dots, X_n$. We say that a distribution P factorizes according to G if P can be expressed as: $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_{X_i})$
• A Bayesian network is a pair (G, P) where P factorizes over G, and where P is specified as a set of CPDs associated with G's nodes.
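A minimal sketch of evaluating a joint probability as the product $\prod_i P(X_i \mid Pa_{X_i})$, using the diabetes network (G → D, G → H). The data layout and the CPD values are assumptions made for illustration; none of the numbers come from the lecture.

```python
def joint_probability(assignment, parents, cpds):
    """P(x_1, ..., x_n) = prod_i P(x_i | pa_i) for a full assignment.

    parents: {var: tuple of parent names}
    cpds:    {var: {tuple of parent values: {value: probability}}}
    """
    p = 1.0
    for var, pas in parents.items():
        pa_vals = tuple(assignment[q] for q in pas)  # () for root nodes
        p *= cpds[var][pa_vals][assignment[var]]
    return p


# Diabetes network G -> D, G -> H with hypothetical CPD values.
parents = {"G": (), "D": ("G",), "H": ("G",)}
cpds = {
    "G": {(): {"g1": 0.2, "g0": 0.8}},
    "D": {("g1",): {"d1": 0.30, "d0": 0.70}, ("g0",): {"d1": 0.05, "d0": 0.95}},
    "H": {("g1",): {"h1": 0.40, "h0": 0.60}, ("g0",): {"h1": 0.10, "h0": 0.90}},
}
p = joint_probability({"G": "g1", "D": "d1", "H": "h0"}, parents, cpds)
```

Since every CPD row sums to 1, the products over all full assignments also sum to 1, so the factorization always defines a valid joint distribution.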
The Student Example
• More complex scenario
  • Course difficulty (D), quality of the recommendation letter (L), Intelligence (I), SAT (S), Grade (G)
  • Val(D) = {easy, hard}, Val(L) = {strong, weak}, Val(I) = {i1, i0}, Val(S) = {s1, s0}, Val(G) = {g1, g2, g3}
• The joint distribution requires 47 independent entries ($2 \cdot 2 \cdot 2 \cdot 2 \cdot 3 - 1 = 47$)
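The Student Bayesian network
• Joint distribution
  • $P(I, D, G, S, L) = P(I)\, P(D)\, P(G \mid I, D)\, P(S \mid I)\, P(L \mid G)$
[Figure: the Student Bayesian network, from Koller & Friedman.]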
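The saving from the network structure can be counted directly. The sketch below assumes the Student network structure from Koller & Friedman (I and D have no parents, G depends on I and D, S depends on I, L depends on G); `bn_param_count` is an illustrative name.

```python
from math import prod


def bn_param_count(cardinality, parents):
    """Independent parameters of a Bayesian network with table CPDs.

    Each node X contributes (|Val(X)| - 1) free entries for every joint
    assignment of its parents.
    """
    return sum(
        (cardinality[x] - 1) * prod(cardinality[p] for p in parents[x])
        for x in cardinality
    )


# Student example: number of values per variable and assumed parent sets.
cardinality = {"I": 2, "D": 2, "G": 3, "S": 2, "L": 2}
parents = {"I": [], "D": [], "G": ["I", "D"], "S": ["I"], "L": ["G"]}

full_joint = prod(cardinality.values()) - 1        # explicit joint table
network = bn_param_count(cardinality, parents)     # factorized form
```

Under these assumptions the explicit joint needs 47 independent entries while the network needs 15 (1 + 1 + 8 + 2 + 3).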
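Parameter Estimation
• Assumptions
  • Fixed network structure
  • Fully observed instances of the network variables: D = {d[1], …, d[M]}; for example, one instance is {i0, d1, g1, l0, s0}
• The "parameters" of the Bayesian network are the entries of its CPDs.
• Estimate them by maximum likelihood estimation (MLE).
[Figure: the Student Bayesian network with its CPDs, from Koller & Friedman.]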
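With a fixed structure and fully observed data, the maximum likelihood estimate of a table CPD reduces to relative frequencies: the count of (parent values, child value) divided by the count of the parent values. The sketch below illustrates this for $P(S \mid I)$; the data set is hypothetical and `mle_cpd` is an illustrative name.

```python
from collections import Counter


def mle_cpd(data, child, parents):
    """MLE of P(child | parents) from fully observed instances.

    data: list of dicts mapping variable name to observed value.
    Returns {(parent_values, child_value): estimate}. Combinations never
    seen in the data get no entry (an implicit estimate of zero).
    """
    joint_counts = Counter((tuple(d[p] for p in parents), d[child]) for d in data)
    parent_counts = Counter(tuple(d[p] for p in parents) for d in data)
    return {
        (pa, val): n / parent_counts[pa]
        for (pa, val), n in joint_counts.items()
    }


# Hypothetical fully observed instances (only I and S shown).
data = [
    {"I": "i0", "S": "s0"}, {"I": "i0", "S": "s0"}, {"I": "i0", "S": "s1"},
    {"I": "i1", "S": "s1"}, {"I": "i1", "S": "s1"}, {"I": "i1", "S": "s0"},
    {"I": "i1", "S": "s1"},
]
theta = mle_cpd(data, "S", ["I"])
```

Each CPD is estimated independently of the others, which is why full observability makes learning straightforward; with missing values the counts are no longer available directly.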
Acknowledgement • Profs Daphne Koller & Nir Friedman, “Probabilistic Graphical Models”