Conditional Graphical Models for Protein Structure Prediction
Conditional Graphical Models for Protein Structure Prediction. Yan Liu Language Technologies Institute School of Computer Science Carnegie Mellon University Oct 24, 2006. Nobelprize.org. DSCTFTTAAAAKAGKAKAG. Protein sequence. +. Protein function. Protein structure.
Conditional Graphical Models for Protein Structure Prediction • Yan Liu • Language Technologies Institute, School of Computer Science, Carnegie Mellon University • Oct 24, 2006
Snapshot of Cell Biology • Figure: protein sequence (DSCTFTTAAAAKAGKAKAG), protein structure, and protein function (courtesy of Nobelprize.org)
Protein Structures and Functions Example: triple beta-spiral fold Adenovirus Fibre Shaft Virus Capsid Courtesy of Nobelprize.org
Protein Structure Determination • Lab experiments: time- and labor-consuming • X-ray crystallography: Nobel Prize, Kendrew & Perutz, 1962 • NMR spectroscopy: Nobel Prize, Kurt Wüthrich, 2002 • The gap between sequence and structure necessitates computational methods of protein structure determination • 3,023,461 sequences vs. 36,247 resolved structures (1.2%) • Figures: 1MBN, 1BUS
Protein Structure Hierarchy We focus on predicting the topology of the structures from sequences APAFSVSPASGACGPECA
Major Challenges • Protein structures are non-linear • Long-range dependencies • Structural similarity often does not indicate sequence similarity • Sequence alignment reaches twilight zone (under 25% similarity) β-α-β motif Ubiquitin (blue) Ubx-Faf1 (gold)
Previous Work • Sequence similarity perspective • Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997] • Profile HMM, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998] • Window-based methods, e.g. PSIPRED [Jones, 2001] • Physical forces perspective • Homology modeling or threading, e.g. Threader [Jones, 1998] • Structural biology perspective • Methods of careful design for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al, 2001] • Annotations: Fail to capture the structural properties; Generative models based on physical free energy; Hard to generalize due to the various informative features
Structured Prediction • Many prediction tasks involve outputs with correlations or constraints • Output structures: sequence, tree, grid • Example input: John ate the cat . • Example input: SEQUENCEXS…WGIKQLQAR, output: HHHCCCEEE…EECCCCEEE • Fundamental importance in many areas • Potential for significant theoretical and practical advances
Graphical Models • A graphical model is a graph representation of probability dependencies [Pearl 1993; Jordan 1999] • Node: random variables • Edges: dependency relations • Directed graphical model (Bayesian networks) • Undirected graphical model (Markov random fields)
Conditional Random Fields • Hidden Markov model (HMM)[Rabiner, 1989] • Conditional random fields (CRFs)[Lafferty et al, 2001] • Model conditional probability directly • Allow arbitrary dependencies in observation • Adaptive to different loss functions and regularizers • Promising results in multiple applications
Protein Structure Prediction • Dependency between residues (single observation) • Dependency between components (subsequences of observations)
Outline • Brief introduction to protein structures • Graphical models for structured-prediction • Conditional graphical models for protein structure prediction • General framework • Specific models • Experiment results • Conclusion and discussion
Our Solution: Conditional Graphical Models Local dependency Long-range dependency • Outputs Y = {M, {Wi} }, where Wi = {pi, qi, si} • Feature definition • Node feature • Local interaction feature • Long-range interaction feature
Conditional Graphical Models (II) • Conditional probability given observed sequences x is defined as • Prediction: • Training phase: learn the model parameters λ • Minimize the regularized negative log loss • Iterative search algorithms seek the direction in which the empirical feature values agree with their expectations under the model
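The formulas on this slide did not survive in the transcript. The block below is a hedged reconstruction, assuming the standard log-linear CRF form [Lafferty et al, 2001]. The clique set C_G, the feature functions f_k, the N training pairs (x^(j), y^(j)), and the Gaussian (L2) regularizer with variance σ² are assumptions, not taken from the slide.

```latex
% Conditional probability (assumed log-linear form over cliques c of the graph G)
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, \mathbf{y}_c) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'}
  \exp\Big( \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, \mathbf{y}'_c) \Big)

% Prediction
\mathbf{y}^{*} = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})

% Training: regularized negative log loss (the L2 regularizer is an assumption)
L(\lambda) = -\sum_{j=1}^{N} \log P(\mathbf{y}^{(j)} \mid \mathbf{x}^{(j)})
  + \frac{\lVert \lambda \rVert^{2}}{2\sigma^{2}}

% Gradient: expected minus empirical feature values, plus the regularizer term
\frac{\partial L}{\partial \lambda_k}
  = \sum_{j=1}^{N} \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}^{(j)})}
      \Big[ \sum_{c} f_k(\mathbf{x}^{(j)}, \mathbf{y}_c) \Big]
  - \sum_{j=1}^{N} \sum_{c} f_k(\mathbf{x}^{(j)}, \mathbf{y}^{(j)}_c)
  + \frac{\lambda_k}{\sigma^{2}}
```

Without the regularizer term, the gradient vanishes exactly when the empirical feature values equal their expectations, which is the agreement condition the last bullet refers to.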
Major Components • Graph topology • Secondary structure prediction: CRF, kernel CRF • Tertiary fold recognition: Segmentation CRF, Chain graph model • Quaternary fold recognition: Linked segmentation CRF • Efficient inference • Prefer exact inference with O(nd) complexity • Resort to approximate inference • Features • Allows flexible and rich feature definition
Protein Secondary Structure Prediction • Given a protein sequence, predict its secondary structure assignments • Three classes: helix (H), sheet (E) and coil (C) • Input: APAFSVSPASGACGPECA • Output: CCEEEEECCCCCHHHCCC
CRF on Secondary Structure Prediction [Liu et al, Bioinformatics 2004] • Example labeling: C C E E … C • Node semantics - secondary structure assignment • Graphical model - conditional random fields (CRFs) or kernel CRF • Inference algorithm - efficient inference exists, such as the forward-backward or Viterbi algorithm
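The Viterbi decoding mentioned on this slide can be sketched as follows for a linear-chain model over the three classes H, E and C. This is a minimal illustration, not the implementation from the cited paper: the function name `viterbi` and the score tables `node_scores` and `trans_scores` are hypothetical stand-ins for the log-potentials that a trained CRF would supply.

```python
STATES = ["H", "E", "C"]  # helix, sheet, coil


def viterbi(node_scores, trans_scores):
    """Return the highest-scoring label string for a linear chain.

    node_scores[t][s]  : log-potential of label s at position t (assumed given)
    trans_scores[p][s] : log-potential of label p followed by label s
    """
    best = [dict(node_scores[0])]  # best[t][s]: best score of a path ending in s at t
    back = []                      # back[t-1][s]: best predecessor of s at position t
    for t in range(1, len(node_scores)):
        cur, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + trans_scores[p][s])
            cur[s] = best[-1][prev] + trans_scores[prev][s] + node_scores[t][s]
            ptr[s] = prev
        best.append(cur)
        back.append(ptr)
    # Trace back from the best final label.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))
```

With transition scores that penalize label switches, an isolated label that is only weakly supported by its node score gets smoothed away, which is the kind of neighbor dependency a chain-structured model adds over independent per-residue classification.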
Protein Fold Recognition and Alignment • Protein fold: identifiable regular arrangement of secondary structural elements • Different from previous simple fold classification • Provides important information and novel biological insights • Training phase, then testing phase • Input: ..APAFSVSPASGACGPECA.. • Output 1: Does the target fold exist? (Yes) • Output 2: ..NNEEEEECCCCCHHHCCC..
Conditional Graphical Model for Fixed Template Fold [Liu et al, RECOMB 2005] • Node semantics - secondary structure elements of variable lengths • Graphical model - segmentation conditional random fields (SCRFs) • Inference - forward-backward and Viterbi-like algorithms can be derived given some assumptions • Example: β-α-β motif
Conditional Graphical Model for Repetitive Fold Recognition [Liu et al, ICML 2005] • Node semantics - two-layer segmentation Y = {M, {Ξi}, T} • Level 1: envelope, or one repeat; level 2: components of one repeat • Graphical model - chain graph model • A graph consisting of directed and undirected graphs • Inference - forward-backward algorithm and Viterbi-like algorithm
Conditional Graphical Model for Quaternary Fold Recognition [Liu et al, IJCAI 2007] • Node semantics - secondary structure elements and/or simple folds • Graphical model - linked segmentation CRF (L-SCRF) • Fixed template and/or repetitive subunits • Inter-chain and intra-chain interactions
Approximate Inference • Varying dimensionality requires reversible jump MCMC sampling [Green, 1995; Schmidler et al, 2001] • Four types of Metropolis proposals • State switching • Position switching • Segment split • Segment merge • Simulated annealing reversible jump MCMC [Andrieu et al, 2000] • Replace the sample with RJ MCMC • Theoretically converges to the global optimum
Conditional Graphical Models for Protein Structure Prediction
Model Roadmap • Models: conditional random fields, kernel CRFs, segmentation CRFs, chain graph model, linked segmentation CRFs • Extensions along the roadmap: kernelization, segment correlations, local and global tradeoff, inter-chain segment correlations • Generalized as conditional graphical models
Outline • Brief introduction to protein structures • Graphical models for structured prediction • Conditional graphical models for protein structure prediction • Experiment results • Fold recognition • Fold alignment prediction • Discovery of potential membership proteins • Conclusion and discussion
Experiments: Target Fold • Right-handed β-helix fold [Yoder et al, 1993] • Bacterial infection of plants, binding the O-antigen and so on • Leucine-rich repeats (LRR) [Kobe & Deisenhofer, 1994] • Structural framework for protein-protein interaction
Experiments: Target Quaternary Fold • Triple beta-spirals [van Raaij et al. Nature 1999] • Virus fibers in adenovirus, reovirus and PRD1 • Double barrel trimer [Benson et al, 2004] • Coat protein of adenovirus, PRD1, STIV, PBCV
Tertiary Fold Recognition: β-Helix Fold • Histogram and ranks for known β-helices against PDB-minus dataset • Chain graph model reduces the real running time of the SCRF model by around 50 times
Quaternary Fold Recognition: Triple β-Spirals • Histogram and ranks for known triple β-spirals against PDB-minus dataset
Quaternary Fold Recognition: Double Barrel-Trimer • Histogram and ranks for known double barrel-trimer against PDB-minus dataset
Fold Alignment Prediction: β-Helix • Predicted alignment for known β-helices on cross-family validation
Fold Alignment Prediction: LRR and Triple β-Spirals • Predicted alignment for known LRRs using chain graph model (left) and triple β-spirals using L-SCRFs
Discovery of Potential β-Helices • Hypothesize potential β-helices from UniProt reference databases • Full list can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html • Verification on proteins with later resolved structures from different organisms • 1YP2: potato tuber ADP-glucose pyrophosphorylase • 1PXZ: major allergen from cedar pollen • GP14 of Shigella bacteriophage as a β-helix protein
Conclusion • Thesis Statement • Conditional graphical models are effective for protein structure prediction • Strong claims • Effective representation for protein structural properties • Flexibility to incorporate different kinds of informative features • Efficient inference algorithms for large-scale applications • Weak claims • Ability to handle long-range interactions • Best performance bounded by prior knowledge
Contribution and Limitation • Contribution to machine learning • Enrichment of graphical models • Formulation to incorporate domain knowledge • Contribution to computational biology • Effective for protein structure prediction and fold recognition • Solutions for the long-range interactions (inter-chain and intra-chain) • Limitation • Manual feature extraction • Difficulty in verification • High complexity
Future Work • Computational biology: protein structure prediction, protein function and protein-protein interaction prediction, drug target design • Machine learning: graph-based semi-supervised learning, active learning for structured data, graph topology learning
Acknowledgement • Jaime Carbonell, Eric Xing, John Lafferty, Vanathi Gopalakrishnan • Chris Langmead, Yiming Yang, Roni Rosenfeld, Peter Weigele, Jonathan King, Judith Klein-Seetharaman, Ivet Bahar, James Conway and many more • And fellow graduate students …
Features for Tertiary Fold Recognition • Node features • Regular expression template, HMM profiles • Secondary structure prediction scores • Segment length • Inter-node features • β-strand side-chain alignment scores • Preferences for parallel alignment scores • Distance between adjacent B23 segments • Features are general and easy to extend
Discovery of Potential Double Barrel-Trimer • Potential proteins suggested in [Benson, 2005]
Inference Algorithm for SCRF • Backward-forward algorithm* • Viterbi algorithm* • p(state y_r ends at r | x_{l+1} x_{l+2} … x_{r-1} x_r and state y_l ends at l) =
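The Viterbi-like recursion over variable-length segments can be sketched as below. This is a generic semi-Markov dynamic program, not the exact SCRF algorithm from the RECOMB 2005 paper: it assumes each segment depends only on the immediately preceding segment, and `segment_viterbi`, `seg_score` and `max_len` are hypothetical names introduced for illustration. The scoring function is assumed to return finite values.

```python
def segment_viterbi(n, states, seg_score, max_len):
    """Best labeled segmentation of positions 0..n-1.

    seg_score(prev_state, state, start, end) scores the segment [start, end)
    labeled `state` that follows a segment labeled `prev_state`
    (prev_state is None for the first segment). Returns (start, end, state) triples.
    """
    neg_inf = float("-inf")
    # best[r][s]: best score of a segmentation of x[0:r] whose last segment has label s
    best = [{s: neg_inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for r in range(1, n + 1):
        for s in states:
            for l in range(max(0, r - max_len), r):  # candidate start of last segment
                prevs = [None] if l == 0 else states
                for p in prevs:
                    base = 0.0 if p is None else best[l][p]
                    if base == neg_inf:
                        continue
                    cand = base + seg_score(p, s, l, r)
                    if cand > best[r][s]:
                        best[r][s] = cand
                        back[r][s] = (l, p)
    # Trace back from the best label of the final segment.
    s = max(states, key=lambda q: best[n][q])
    r, segments = n, []
    while s is not None:
        l, p = back[r][s]
        segments.append((l, r, s))
        r, s = l, p
    return segments[::-1]
```

The cost is O(n · max_len · |states|²) calls to `seg_score`, which is why bounding the segment length matters in practice. Replacing the max with a sum (in log space) gives the corresponding forward pass.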
Reversible jump MCMC Algorithm • Three types of proposals • Position switching: randomly select a segment j and draw a new position assignment d_j^(i+1) ~ U(d_{j-1}^(i), d_{j+1}^(i)) • Segment split: randomly select a segment j and split it into two segments, where (d_j^(i+1), d_{j+1}^(i+1)) = G(d_{j-1}^(i), u^(i)) and u^(i) ~ U • Segment merge: randomly select a segment j and merge segments j and j+1 • Simulated annealing reversible jump MCMC for computing y = argmax P(y|x) [Andrieu et al, 2000]
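The three proposal types can be illustrated with a deliberately simplified annealed search over segment boundaries. This sketch is not a faithful reversible jump sampler: it uses a plain Metropolis acceptance test and omits the proposal-ratio and Jacobian terms that reversible jump MCMC requires, and it ignores segment labels. The names `propose`, `anneal`, `score`, and the cooling schedule are all assumptions made for illustration; `score` stands in for log P(y|x) up to a constant.

```python
import math
import random


def propose(bounds, n, rng):
    """Apply one randomly chosen move to a sorted list of interior boundaries."""
    b = list(bounds)
    edges = [0] + b + [n]
    move = rng.choice(["position", "split", "merge"])
    if move == "position" and b:
        # Position switching: redraw boundary j uniformly between its neighbors.
        j = rng.randrange(len(b))
        b[j] = rng.randint(edges[j] + 1, edges[j + 2] - 1)
    elif move == "split":
        # Segment split: add a boundary strictly inside a segment of length >= 2.
        wide = [(lo, hi) for lo, hi in zip(edges, edges[1:]) if hi - lo >= 2]
        if wide:
            lo, hi = rng.choice(wide)
            b.append(rng.randint(lo + 1, hi - 1))
            b.sort()
    elif move == "merge" and b:
        # Segment merge: drop a boundary, joining segments j and j+1.
        b.pop(rng.randrange(len(b)))
    return b


def anneal(score, n, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over segmentations; returns (best_bounds, best_score)."""
    rng = random.Random(seed)
    cur, cur_s = [], score([])
    best, best_s = cur, cur_s
    t = t0
    for _ in range(steps):
        cand = propose(cur, n, rng)
        cand_s = score(cand)
        # Always accept improvements; accept worse states with annealed probability.
        if cand_s >= cur_s or rng.random() < math.exp((cand_s - cur_s) / t):
            cur, cur_s = cand, cand_s
            if cur_s > best_s:
                best, best_s = cur, cur_s
        t *= cooling
    return best, best_s
```

Because split and merge change the number of segments, the state space has varying dimensionality, which is the reason the slides turn to reversible jump proposals rather than a fixed-dimension sampler.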
Protein Structure Determination • Lab experiments: time- and labor-consuming • X-ray crystallography • NMR spectroscopy • Electron microscopy and many more • Computational methods: • Homology modeling: ≥ 30% sequence similarity • Fold recognition: < 30% sequence similarity • Ab initio methods: no template structure needed • Active research area in multiple scientific fields
Evaluation Measures • Q3 (accuracy) • Precision, recall • Segment overlap quantity (SOV) • Matthews correlation coefficient
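A minimal sketch of three of these measures, computed per residue on predicted and true label strings. The function names `q3` and `per_class_metrics` are hypothetical, and SOV is left out because its segment-based definition needs considerably more code. Metrics with a zero denominator are returned as 0.0 here, which is a convention chosen for this sketch.

```python
import math


def q3(pred, true):
    """Fraction of residues whose predicted class (H, E or C) is correct."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def per_class_metrics(pred, true, label):
    """Precision, recall and Matthews correlation coefficient for one class."""
    tp = sum(p == label and t == label for p, t in zip(pred, true))
    fp = sum(p == label and t != label for p, t in zip(pred, true))
    fn = sum(p != label and t == label for p, t in zip(pred, true))
    tn = len(true) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc
```

For example, comparing the hypothetical prediction `CCEEEECCCCCCHHHHCC` with the true assignment `CCEEEEECCCCCHHHCCC` gives 16 of 18 residues correct, and for the helix class a precision of 0.75 and a recall of 1.0.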
Outline • Brief introduction to protein structures • Discriminative graphical models • Generalized discriminative graphical models for protein fold recognition • Experiment results • Conclusion and discussion
Graphical Models for Structured Prediction • Conditional Random Fields • Model conditional probability directly, not joint probability • Allow arbitrary dependencies in observation (e.g. long range, overlapping) • Adaptive to different loss functions and regularizers • Promising results in multiple applications • Recent developments • Alternative estimation algorithms (Collins, 2002; Dietterich et al, 2004) • Alternative loss functions, use of kernels (Taskar et al, 2003; Altun et al, 2003; Tsochantaridis et al, 2004) • Bayesian formulation (Qi and Minka, 2005) and semi-Markov version (Sarawagi and Cohen, 2004)