This course explores the application of probabilistic models in computational biology, focusing on mathematical foundations, Bayesian networks, and model selection problems. Through lectures, students learn about gene regulation, RNA degradation, and the impact of DNA variations on genetic regulatory networks. The course emphasizes learning from data using techniques such as maximum likelihood estimation and expectation-maximization. Key examples include the relationship between genetic risk and diseases, and how genes interact to influence expression levels.
Introduction to Probabilistic Models for Computational Biology
Lecture 2 – Oct 3, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022
Review: Gene Regulation
[Figure: a DNA sequence containing several genes. Each gene is transcribed into RNA (e.g. AUGCGCGUC) and translated into protein (e.g. MRV); this is "gene expression". RNA is also subject to degradation. A transcription factor binding site acts as a switch for gene regulation. Together the genes form a genetic regulatory network.]
• Genes regulate each other's expression and activity.
Review: Variations in the DNA
[Figure: the same DNA sequence with single-nucleotide changes, and the affected RNA, protein, and regulatory network components marked with X.]
• "Single nucleotide polymorphism (SNP)"
• Sequence variations perturb the genetic regulatory network.
Outline
• Probabilistic models in biology
  • Model selection problems
• Mathematical foundations
  • Bayesian networks
  • Probabilistic Graphical Models: Principles and Techniques, Koller & Friedman, The MIT Press
• Learning from data
  • Maximum likelihood estimation
  • Expectation-maximization
Example 1
• How are a change in a DNA nucleotide, blood pressure, and heart disease related?
• There can be several "models"…
[Figure: three candidate graphs over DNA alteration, blood pressure, and heart disease, separated by "OR".]
Example 2
• How do genes A, B and C regulate each other's expression levels (mRNA levels)?
• There can be several models…
[Figure: candidate regulatory graphs over genes A, B, and C, separated by "OR".]
[Figure: an expression data matrix with rows Gene A, Gene B, Gene C and columns Exp 1, Exp 2, …, Exp N (N instances), next to candidate graphs Model I, Model II, Model III, … over A, B, and C.]
• Probabilistic graphical models
  • A graphical representation of statistical dependencies.
  • Statistical dependencies between expression levels of genes A, B, C?
• Probability that model x is true given the data
  • Model selection: $\arg\max_x P(\text{model } x \text{ is true} \mid \text{Data})$
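The model selection rule above can be sketched in a few lines. This is only an illustration: the log-likelihood values and the uniform prior over models are made-up assumptions, not numbers from the lecture.

```python
import math

# Hypothetical log-likelihoods log P(Data | model) for three candidate models.
# These numbers are invented for illustration only.
log_likelihood = {"Model I": -120.4, "Model II": -118.9, "Model III": -125.0}

# Assumed uniform prior P(model) over the candidates.
prior = {m: 1.0 / len(log_likelihood) for m in log_likelihood}

def posterior(log_lik, prior):
    """Return P(model | Data) via Bayes' rule, normalizing in log space."""
    log_joint = {m: log_lik[m] + math.log(prior[m]) for m in log_lik}
    max_log = max(log_joint.values())  # subtract the max for numerical stability
    unnorm = {m: math.exp(v - max_log) for m, v in log_joint.items()}
    z = sum(unnorm.values())
    return {m: u / z for m, u in unnorm.items()}

post = posterior(log_likelihood, prior)
best = max(post, key=post.get)  # argmax_x P(model x | Data)
print(best, post)
```

With a uniform prior the selected model is simply the one with the highest likelihood; a non-uniform prior would shift the choice.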
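Probability Theory Review
• Assume random variables A and B with Val(A) = {a1, a2, a3}, Val(B) = {b1, b2}
• Conditional probability
  • Definition: $P(A \mid B) = P(A, B) / P(B)$
  • Chain rule: $P(A, B) = P(A)\, P(B \mid A)$
  • Bayes' rule: $P(A \mid B) = P(B \mid A)\, P(A) / P(B)$
• Probabilistic independence: $P(A, B) = P(A)\, P(B)$, equivalently $P(A \mid B) = P(A)$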
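As a quick numerical check of the chain rule and Bayes' rule, the sketch below uses a hypothetical joint table over A and B with the value sets given above. The probabilities are invented for illustration.

```python
# Hypothetical joint distribution P(A, B); the entries sum to 1.
joint = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.20,
    ("a2", "b1"): 0.25, ("a2", "b2"): 0.05,
    ("a3", "b1"): 0.15, ("a3", "b2"): 0.25,
}

def marginal_a(a):
    """P(A = a), obtained by summing out B."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    """P(B = b), obtained by summing out A."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def cond_a_given_b(a, b):
    """P(A = a | B = b) from the definition of conditional probability."""
    return joint[(a, b)] / marginal_b(b)

def cond_b_given_a(b, a):
    """P(B = b | A = a) from the definition of conditional probability."""
    return joint[(a, b)] / marginal_a(a)

# Chain rule: P(a2, b1) = P(a2) * P(b1 | a2)
chain = marginal_a("a2") * cond_b_given_a("b1", "a2")

# Bayes' rule: P(a2 | b1) = P(b1 | a2) * P(a2) / P(b1)
bayes = cond_b_given_a("b1", "a2") * marginal_a("a2") / marginal_b("b1")
direct = cond_a_given_b("a2", "b1")

# Independence holds only if P(a, b) = P(a) * P(b) for every pair.
independent = all(
    abs(joint[(a, b)] - marginal_a(a) * marginal_b(b)) < 1e-12 for (a, b) in joint
)
print(chain, bayes, direct, independent)
```

In this table A and B are dependent, so the independence check comes out false.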
Probabilistic Representation
• Joint distribution P over {x1,…, xn}
  • Each xi is binary
  • $2^n - 1$ independent entries
• If the x's are independent
  • $P(x_1, \dots, x_n) = P(x_1) \cdots P(x_n)$
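To make the size difference concrete, the following sketch counts free parameters for n binary variables under the two representations. The function names are illustrative only.

```python
def full_joint_params(n):
    """Free parameters of an unrestricted joint over n binary variables."""
    return 2 ** n - 1  # one entry is fixed by normalization

def independent_params(n):
    """Free parameters when all n binary variables are independent."""
    return n  # one Bernoulli parameter per variable

for n in (3, 10, 20):
    print(n, full_joint_params(n), independent_params(n))
```

The full joint grows exponentially in n, while the fully independent model grows linearly.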
Conditional Parameterization
• The Diabetes example
  • Genetic risk (G), Diabetes (D)
  • Val(G) = {g1, g0}, Val(D) = {d1, d0}
• P(G,D) = P(G) P(D|G)
  • P(G): prior distribution
  • P(D|G): conditional probability distribution (CPD)
[Figure: graph Genetic risk → Diabetes.]
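A minimal sketch of this parameterization: the joint P(G,D) is assembled from a prior and a CPD. The probability values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical prior P(G) and CPD P(D | G); the values are illustrative only.
p_g = {"g1": 0.2, "g0": 0.8}
p_d_given_g = {
    "g1": {"d1": 0.30, "d0": 0.70},
    "g0": {"d1": 0.05, "d0": 0.95},
}

# P(G, D) = P(G) * P(D | G)
joint = {(g, d): p_g[g] * p_d_given_g[g][d] for g in p_g for d in p_d_given_g[g]}

# Marginal P(D = d1), obtained by summing out G.
p_d1 = joint[("g1", "d1")] + joint[("g0", "d1")]
print(joint, p_d1)
```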
Naïve Bayes Model - Example
• Elaborating the diabetes example,
  • Genetic risk (G), Diabetes (D), Hypertension (H)
  • Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
  • The full joint distribution has 8 entries
• If D and H are independent given G,
  • P(G,D,H) = P(G) P(D|G) P(H|G)
  • 5 independent parameters; more compact than the joint
[Figure: graph Genetic risk → Diabetes, Genetic risk → Hypertension.]
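The sketch below builds the factored joint for this example from hypothetical CPD values and checks numerically that diabetes and hypertension are independent given genetic risk, but not marginally. All numbers are assumptions made for illustration.

```python
from itertools import product

# Hypothetical parameters; the values are illustrative only.
p_g = {"g1": 0.2, "g0": 0.8}
p_d_given_g = {"g1": {"d1": 0.30, "d0": 0.70}, "g0": {"d1": 0.05, "d0": 0.95}}
p_h_given_g = {"g1": {"h1": 0.40, "h0": 0.60}, "g0": {"h1": 0.10, "h0": 0.90}}

# P(G, D, H) = P(G) * P(D | G) * P(H | G)
joint = {
    (g, d, h): p_g[g] * p_d_given_g[g][d] * p_h_given_g[g][h]
    for g, d, h in product(p_g, ("d1", "d0"), ("h1", "h0"))
}

def marg(g=None, d=None, h=None):
    """Sum joint entries consistent with the fixed values (None = summed out)."""
    return sum(
        p for (gi, di, hi), p in joint.items()
        if g in (None, gi) and d in (None, di) and h in (None, hi)
    )

# Conditional independence: P(d, h | g) = P(d | g) * P(h | g) for all values.
cond_indep = all(
    abs(marg(g, d, h) / marg(g)
        - (marg(g, d) / marg(g)) * (marg(g, h=h) / marg(g))) < 1e-12
    for g, d, h in joint
)

# Marginally, D and H are dependent because they share the cause G.
marginally_indep = abs(marg(d="d1", h="h1") - marg(d="d1") * marg(h="h1")) < 1e-12
print(cond_indep, marginally_indep)
```

The five free parameters are one for P(G), two for P(D|G), and two for P(H|G).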
Naïve Bayes Model
• A class variable C where Val(C) = {c1,…,ck}.
• Finding variables x1,…,xn
• Naïve Bayes assumption
  • The findings are conditionally independent given the individual's class.
  • The model factorizes as: $P(C, x_1, \dots, x_n) = P(C) \prod_{i=1}^{n} P(x_i \mid C)$
• The Diabetes example
  • Class: Genetic risk; findings: Diabetes, Hypertension
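Naïve Bayes Model - Example
• Medical diagnosis system
  • Class C: disease
  • Findings X: symptoms
• Computing the confidence: $\dfrac{P(C = c_1 \mid x_1, \dots, x_n)}{P(C = c_2 \mid x_1, \dots, x_n)} = \dfrac{P(C = c_1)}{P(C = c_2)} \prod_{i=1}^{n} \dfrac{P(x_i \mid C = c_1)}{P(x_i \mid C = c_2)}$
• Drawbacks
  • Strong assumptions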
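A minimal sketch of a naïve Bayes diagnosis computation, assuming two classes and two binary symptoms. The class names, the prior, and the CPD values are all hypothetical, and `naive_bayes_posterior` is an illustrative helper, not part of any library.

```python
import math

def naive_bayes_posterior(prior, cpds, findings):
    """Posterior P(C | x_1..x_n) under the naive Bayes assumption.

    prior:    {class: P(C = class)}
    cpds:     {class: [P(x_i = 1 | class) for each finding i]}
    findings: list of observed 0/1 values, one per finding
    """
    log_score = {}
    for c, p_c in prior.items():
        s = math.log(p_c)
        for p_x1, x in zip(cpds[c], findings):
            s += math.log(p_x1 if x == 1 else 1.0 - p_x1)
        log_score[c] = s
    m = max(log_score.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - m) for c, s in log_score.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

# Hypothetical numbers: P(disease) = 0.1, two symptoms, both observed present.
prior = {"disease": 0.1, "healthy": 0.9}
cpds = {"disease": [0.8, 0.6], "healthy": [0.1, 0.2]}
post = naive_bayes_posterior(prior, cpds, [1, 1])
print(post)
```

Because each symptom contributes an independent factor, correlated symptoms get counted as separate evidence, which is the "strong assumptions" drawback.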
Bayesian Network
• Directed acyclic graph (DAG)
  • Node: a random variable
  • Edge: direct influence of one node on another
• The Diabetes example revisited
  • Genetic risk (G), Diabetes (D), Hypertension (H)
  • Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
[Figure: graph Genetic risk → Diabetes, Genetic risk → Hypertension.]
Bayesian Network Semantics
• A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables $X_1, \dots, X_n$.
  • $\mathrm{Pa}_{X_i}$: parents of $X_i$ in G
  • $\mathrm{NonDescendants}_{X_i}$: variables in G that are not descendants of $X_i$.
• G encodes the following set of conditional independence assumptions, called the local Markov assumptions, and denoted by $I_L(G)$:
  • For each variable $X_i$: $(X_i \perp \mathrm{NonDescendants}_{X_i} \mid \mathrm{Pa}_{X_i})$
[Figure: example DAG over nodes x1,…,x11.]
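The local Markov assumptions can be read off a graph mechanically. The sketch below does so for the diabetes network (G → D, G → H), with the graph given as a hypothetical parent-list dictionary; the helper names are illustrative.

```python
def descendants(children, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), list(children.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(children.get(n, []))
    return seen

def local_markov(parents):
    """Map each X to (non-descendants of X other than its parents, parents of X).

    Each entry reads: X is independent of the first set given the second.
    Parents are left out of the first set because conditioning on them makes
    that part of the statement trivial.
    """
    nodes = set(parents)
    children = {n: [] for n in nodes}
    for child, pas in parents.items():
        for p in pas:
            children[p].append(child)
    result = {}
    for x in nodes:
        nondesc = nodes - descendants(children, x) - {x} - set(parents[x])
        result[x] = (nondesc, set(parents[x]))
    return result

# Diabetes example: Genetic risk (G) is the parent of Diabetes (D) and Hypertension (H).
parents = {"G": [], "D": ["G"], "H": ["G"]}
assumptions = local_markov(parents)
print(assumptions)
```

For this graph the only non-trivial statements are that D is independent of H given G, and symmetrically H of D given G.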
The Genetics Example
• Variables
  • B: blood type (a phenotype)
  • G: genotype of the gene that encodes a person's blood type; <A,A>, <A,B>, <A,O>, <B,B>, <B,O>, <O,O>
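One way to picture the relationship between these two variables is as a deterministic CPD P(B | G). The sketch below assumes the standard ABO inheritance rule (A and B codominant, O recessive) and blood type values A, B, AB, O; the slide itself does not list Val(B), so those values are an assumption.

```python
# Deterministic mapping from genotype (unordered allele pair) to blood type,
# assuming standard ABO genetics: A and B are codominant, O is recessive.
BLOOD_TYPE = {
    ("A", "A"): "A", ("A", "O"): "A",
    ("B", "B"): "B", ("B", "O"): "B",
    ("A", "B"): "AB", ("O", "O"): "O",
}

def p_blood_given_genotype(b, g):
    """P(B = b | G = g): 1 for the blood type determined by g, else 0."""
    g = tuple(sorted(g))  # genotypes are unordered pairs of alleles
    return 1.0 if BLOOD_TYPE[g] == b else 0.0

print(p_blood_given_genotype("A", ("O", "A")))
```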
Bayesian Network Joint Distribution
• Let G be a Bayesian network graph over the variables $X_1, \dots, X_n$. We say that a distribution P factorizes according to G if P can be expressed as: $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})$
• A Bayesian network is a pair (G,P) where P factorizes over G, and where P is specified as a set of CPDs associated with G's nodes.
The Student Example
• More complex scenario
  • Course difficulty (D), quality of the recommendation letter (L), Intelligence (I), SAT (S), Grade (G)
  • Val(D) = {easy, hard}, Val(L) = {strong, weak}, Val(I) = {i1, i0}, Val(S) = {s1, s0}, Val(G) = {g1, g2, g3}
• The joint distribution has 2·2·2·2·3 = 48 entries, i.e. 47 independent parameters
The Student Bayesian network
• Joint distribution
  • P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G)
[Figure: the Student network, from Koller & Friedman.]
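To see how much the factorization saves, the sketch below counts free parameters for the Student network, assuming the Koller & Friedman structure in which Grade depends on Intelligence and Difficulty, SAT on Intelligence, and Letter on Grade. The helper names are illustrative.

```python
import math

# Number of values of each variable, as given in the Student example.
card = {"D": 2, "I": 2, "G": 3, "S": 2, "L": 2}

# Assumed structure of the Student network (Koller & Friedman).
parents = {"D": [], "I": [], "G": ["I", "D"], "S": ["I"], "L": ["G"]}

def full_joint_params(card):
    """Free parameters of the unrestricted joint distribution."""
    return math.prod(card.values()) - 1

def bn_params(card, parents):
    """Free parameters of the factored form: for each node, (|Val(X)| - 1)
    per joint assignment of its parents."""
    return sum(
        (card[x] - 1) * math.prod(card[p] for p in parents[x])
        for x in card
    )

print(full_joint_params(card), bn_params(card, parents))
```

Under these assumptions the network needs 15 free parameters instead of 47.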
Parameter Estimation
• Assumptions
  • Fixed network structure
  • Fully observed instances of the network variables: D = {d[1],…,d[M]}
  • For example, one instance is {i0, d1, g1, l0, s0}
• Maximum likelihood estimation (MLE)!
[Figure from Koller & Friedman, annotated: "Parameters" of the Bayesian network.]
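With a fixed structure and fully observed data, the maximum likelihood estimate of each CPD entry is a relative frequency. The sketch below estimates P(I) and P(S | I) from a tiny hypothetical data set; `mle_cpd` is an illustrative helper and the instances are invented.

```python
from collections import Counter

def mle_cpd(data, child, parents):
    """MLE of P(child | parents) as count(parents, child) / count(parents)."""
    joint_counts = Counter()
    parent_counts = Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint_counts[(pa, row[child])] += 1
        parent_counts[pa] += 1
    return {(pa, v): c / parent_counts[pa] for (pa, v), c in joint_counts.items()}

# Hypothetical fully observed instances (only I and S shown for brevity).
data = [
    {"I": "i0", "S": "s0"}, {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s0"}, {"I": "i0", "S": "s1"},
    {"I": "i1", "S": "s1"}, {"I": "i1", "S": "s1"},
    {"I": "i1", "S": "s1"}, {"I": "i1", "S": "s0"},
]

p_i = mle_cpd(data, "I", [])            # keys look like ((), "i0")
p_s_given_i = mle_cpd(data, "S", ["I"])  # keys look like (("i1",), "s1")
print(p_i, p_s_given_i)
```

Value combinations that never occur in the data get no entry here, which corresponds to an estimated probability of zero.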
Acknowledgement • Profs Daphne Koller & Nir Friedman, “Probabilistic Graphical Models”