Machine Learning Algorithms for Protein Structure Prediction
Jianlin Cheng
Institute for Genomics and Bioinformatics, School of Information and Computer Sciences
University of California, Irvine
2006
Outline
• Introduction
• 1D Prediction
• 2D Prediction (Beta-Sheet Topology)
• 3D Prediction (Fold Recognition)
• Publications and Bioinformatics Tools
Importance of Protein Structure Prediction
[Figure: sequence (AGCWY……), structure, and function in the cell]
Four Levels of Protein Structure
• Primary Structure: a directional sequence of amino acids (residues), Residue1, Residue2, …, joined by peptide bonds from the N terminus to the C terminus
• Secondary Structure: helix, strand, coil (alpha helix, beta strand / sheet, coil)
Four Levels of Protein Structure (continued)
• Tertiary Structure
• Quaternary Structure (complex), e.g. a G protein complex
1D: Secondary Structure Prediction
• Input sequence: MWLKKFGINLLIGQSV…
• Method: Neural Networks + Alignments
• Output: CCCCHHHHHCCCSSSSS… (helix, strand, coil)
• Accuracy: 78%
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
1D: Solvent Accessibility Prediction
• Input sequence: MWLKKFGINLLIGQSV…
• Method: Neural Networks + Alignments
• Output: eeeeeeebbbbbbbbeeeebbb… (e = exposed, b = buried)
• Accuracy: 79%
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
1D: Disordered Region Prediction Using Neural Networks
• Input sequence: MWLKKFGINLLIGQSV…
• Method: 1D-RNN
• Output: OOOOODDDDOOOOO… (D marks a disordered region)
• 93% true positives (TP) at 5% false positives (FP)
Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005
1D: Protein Domain Prediction Using Neural Networks
• Input: sequence (MWLKKFGINLLIGQSV…) + SS and SA
• 1D-RNN output: boundary labels NNNNNNNBBBBBNNNN…
• Inference/Cut: the predicted boundaries cut the chain into domains (example: HIV capsid protein, Domain 1 and Domain 2)
• Top ab-initio domain predictor in CAFASP4
Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.
1D: Predict Single-Site Mutation From Sequence Using Support Vector Machine
• Input sequence: …MWLAVFILINLK…
• Method: Support Vector Machine
• Correlation = 0.76
• First method to predict energy changes from sequence accurately
• Useful for protein engineering, protein design, and mutagenesis analysis
Cheng, Randall, and Baldi. Proteins, 2006
2D: Contact Map Prediction
• The 2D contact map is an n × n matrix over residue pairs (i, j), with i and j running from 1 to n, that summarizes the 3D structure
• Distance threshold = 8 Å
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
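A minimal sketch of how a contact map can be derived from a known 3D structure with the 8 Å threshold. The function name `contact_map`, the use of one representative point per residue, the empty diagonal, and the toy coordinates are assumptions for illustration; the slides do not specify these details.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix; entry (i, j) is 1 when residues i and j
    lie closer than `threshold` angstroms (the diagonal is left at 0)."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap

# Hypothetical coordinates: four residues spaced 3.8 angstroms apart on a line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
```

In this toy chain, residues two positions apart (7.6 Å) are in contact, while the two end residues (11.4 Å) are not.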
2D: Disulfide Bond Prediction
[Figure: Support Vector Machine (yes) → 2D-RNN → Graph Matching, predicting a disulfide bond between Cysteine i and Cysteine j]
[1] Baldi, Cheng, Vullo. NIPS, 2004.
[2] Cheng, Saigo, Baldi. Proteins, 2005
2D: Prediction of Beta-Sheet Topology
[Figure: a beta sheet between the N terminus and the C terminus, showing a beta strand and a beta residue pair]
• Ab-Initio Structure Prediction
• Fold Recognition
• Protein Design
• Protein Folding
Cheng and Baldi, Bioinformatics, 2005
An Example of Beta-Sheet Topology
[Figure: structure and beta sheets of protein 1VJG; strand order 4 5 in one sheet and 2 1 3 6 7 in the other]
• Level 1: strand, strand pair
• Level 2: strand alignment, pairing direction (antiparallel or parallel)
• Level 3: beta residue, residue pair (H-bond)
Three-Stage Prediction of Beta-Sheets
• Stage 1: Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003)
• Stage 2: Use beta-residue pairing probabilities to align beta-strands
• Stage 3: Predict beta-strand pairs and beta-sheet topology using graph algorithms
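Stage 1: Prediction of Beta-Residue Pairings Using 2D-Recursive Neural Networks
• Input matrix I (m×m): entry I_ij describes residue positions i and j of the sequence (…AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK….), with 20 inputs for residues, 3 for SS, and 2 for SA
• 2D-RNN: O = f(I)
• Output matrix O (m×m): O_ij is the pairing probability of residues i and j
• Target matrix T (m×m): T_ij is 0/1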
An Example (Target)
[Figure: beta-residue pairing map (target matrix) of protein 1VJG over strands 1 to 7, with antiparallel and parallel pairings marked]
Stage 2: Beta-Strand Alignment
• Use output probability matrix as scoring matrix
• Dynamic programming
• Disallow gaps and use the simplified search algorithm
• Two pairing directions: antiparallel and parallel
• Total number of alignments = 2(m+n-1) for two strands of lengths m and n
Strand Alignment and Pairing Matrix
• The alignment score is the sum of the pairing probabilities of the aligned residues
• The best alignment is the alignment with the maximum score
• Strand Pairing Matrix (figure: Strand Pairing Matrix of 1VJG)
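The gap-free search over both pairing directions can be sketched as follows. This is an illustrative enumeration under stated assumptions, not the authors' implementation: the function name `best_strand_alignment`, the representation of a strand as a list of residue indices, and the toy probability matrix are all hypothetical.

```python
def best_strand_alignment(prob, strand_a, strand_b):
    """Try every ungapped alignment of two strands in both directions.

    prob[i][j] is the predicted pairing probability of residues i and j;
    strand_a and strand_b are lists of residue indices.
    Returns (score, direction, pairs, number_of_alignments_tried).
    """
    m, n = len(strand_a), len(strand_b)
    best = (float("-inf"), None, [])
    tried = 0
    for direction in ("parallel", "antiparallel"):
        b = strand_b if direction == "parallel" else strand_b[::-1]
        for shift in range(-(m - 1), n):  # m + n - 1 relative offsets
            pairs = [(strand_a[i], b[i + shift])
                     for i in range(m) if 0 <= i + shift < n]
            # Alignment score = sum of pairing probabilities of aligned residues.
            score = sum(prob[i][j] for i, j in pairs)
            tried += 1
            if score > best[0]:
                best = (score, direction, pairs)
    return best + (tried,)

# Hypothetical 8-residue example: residues 0-2 pair antiparallel with 7-5.
prob = [[0.1] * 8 for _ in range(8)]
for i, j in [(0, 7), (1, 6), (2, 5)]:
    prob[i][j] = prob[j][i] = 0.9

score, direction, pairs, tried = best_strand_alignment(prob, [0, 1, 2], [5, 6, 7])
```

For two strands of length 3, the loop visits 2(3+3-1) = 10 alignments, matching the count given on the Stage 2 slide.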
Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Topology
[Figure: (a) seven strands of protein 1VJG in sequence order; (b) beta-sheet topology of protein 1VJG]
Minimum Spanning Tree Like Algorithm
[Figure: Strand Pairing Matrix and Strand Pairing Graph (SPG): (a) complete SPG, (b) true weighted SPG]
• Goal: Find a set of connected subgraphs that maximizes the sum of the alignment scores and satisfies the constraints
• Algorithm: Minimum Spanning Tree Like Algorithm
An Example of MST Like Algorithm (Strand Pairing Matrix of 1VJG, strands 1 to 7)
• Step 1: Pair strands 4 and 5 (sheet so far: 4 5)
• Step 2: Pair strands 1 and 2 (second sheet: 2 1)
• Step 3: Pair strands 1 and 3 (2 1 3)
• Step 4: Pair strands 3 and 6 (2 1 3 6)
• Step 5: Pair strands 6 and 7 (2 1 3 6 7)
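The greedy, spanning-tree-like selection walked through above can be sketched as below. The slides do not spell out the constraints or the stopping rule, so this sketch assumes two simplified ones: a strand has at most two partners, and a pair is accepted only if it brings in a strand that is still unpaired. The function names and the alignment scores are hypothetical; only the resulting pairing order mirrors the 1VJG example.

```python
def greedy_strand_pairing(scores, n_strands, max_partners=2):
    """scores: {(a, b): alignment score}. Returns (accepted pairs, partner sets)."""
    partners = {s: set() for s in range(1, n_strands + 1)}
    accepted = []
    # Visit candidate pairs from the highest alignment score to the lowest.
    for (a, b), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        brings_new_strand = not partners[a] or not partners[b]
        has_room = (len(partners[a]) < max_partners
                    and len(partners[b]) < max_partners)
        if brings_new_strand and has_room:
            partners[a].add(b)
            partners[b].add(a)
            accepted.append((a, b))
    return accepted, partners

def sheets(partners):
    """Sheets are the connected components of the accepted pairing graph."""
    seen, result = set(), []
    for start in partners:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            s = stack.pop()
            if s not in component:
                component.add(s)
                stack.extend(partners[s] - component)
        seen |= component
        result.append(sorted(component))
    return result

# Hypothetical alignment scores for seven strands (not taken from the slides).
scores = {(4, 5): 3.1, (1, 2): 2.8, (1, 3): 2.5, (3, 6): 2.2,
          (6, 7): 1.9, (2, 4): 0.6, (5, 7): 0.4, (1, 6): 0.3}
accepted, partners = greedy_strand_pairing(scores, n_strands=7)
found_sheets = sheets(partners)
```

With these toy scores the pairs are accepted in the same order as Steps 1 to 5, and the strands fall into two sheets, one with strands 4 and 5 and one with strands 1, 2, 3, 6, and 7.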
1. Beta Residue Pairing
2. Beta Strand Alignment
3. Beta Strand Pairing
3D Structure Prediction
Query sequence: MWLKKFGINLLIGQSV…
• Ab-Initio Structure Prediction
  - Simulation with a physical force field: protein folding
  - Contact map: reconstruction
  - Select structure with minimum free energy
• Template-Based Structure Prediction
  - Fold recognition matches the query protein (MWLKKFGINKH…) to a template in the Protein Data Bank, followed by alignment
A Machine Learning Information Retrieval Framework for Fold Recognition
Cheng and Baldi, Bioinformatics, 2006
[Figure: query protein (MWLKKFGIN……) → fold recognition by machine learning ranking over the Protein Data Bank → template → alignment]
Classic Fold Recognition Approaches: Sequence - Sequence Alignment
(Needleman and Wunsch, 1970. Smith and Waterman, 1981)
Query:    ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
Template: ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL
• Alignment (similarity) score
• Works for >40% sequence identity (close homologs in protein family)
Classic Fold Recognition Approaches: Profile - Sequence Alignment
(Altschul et al., 1997)
Query family (positions 1 to n):
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
• The query family is compared with the template by average score, or through a Position Specific Scoring Matrix or Hidden Markov Model
• More sensitive for distant homologs in superfamily (> 25% identity)
Classic Fold Recognition Approaches: Profile - Profile Alignment
(Rychlewski et al., 2000)
Query family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Template family:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM
• More sensitive for very distant homologs (> 15% identity)
Classic Fold Recognition Approaches: Sequence - Structure Alignment (Threading)
(Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994)
• Fit the query sequence (MWLKKFGINLLIGQS….) to a template structure and compute a fitness score
• Useful for recognizing similar folds without sequence similarity (no evolutionary relationship)
Integration of Complementary Approaches
[Figure: a query is sent over the Internet to FR Server 1, FR Server 2, and FR Server 3; a Meta Server forms the consensus]
(Lundstrom et al., 2001. Fischer, 2003)
• Reliability depends on availability of external servers
• Makes decisions on a handful of candidates
Machine Learning Classification Approach
• Support Vector Machine (SVM): proteins are assigned to Class 1, Class 2, …, Class m
• Classify individual proteins into several or dozens of structure classes (Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004)
• Problem 1: can’t scale up to thousands of protein classes
• Problem 2: doesn’t provide templates for structure modeling
Machine Learning Information Retrieval Framework
[Figure: Query-Template Pair → Score 1, Score 2, …, Score n → Relevance Function (e.g., SVM) → + / - → Rank]
• Extract pairwise features
• Comparison of two pairs (four proteins)
• Relevant or not (one score) vs. many classes
• Ranking of templates (retrieval)
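The retrieval step amounts to scoring every query-template pair with the relevance function and sorting templates by that score. A minimal sketch follows; `rank_templates`, the template names, and the lookup table of scores are hypothetical stand-ins for a trained relevance function such as an SVM.

```python
def rank_templates(query, templates, relevance):
    """Order templates from most to least relevant for one query.

    `relevance(query, template)` returns one score per pair; higher means
    the pair is more likely to share a fold.
    """
    scored = [(relevance(query, t), t) for t in templates if t != query]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [t for _, t in scored]

# Hypothetical relevance scores standing in for a trained model's output.
toy_scores = {("q", "t1"): -0.4, ("q", "t2"): 1.3, ("q", "t3"): 0.2}
ranking = rank_templates("q", ["t1", "t2", "t3"],
                         lambda q, t: toy_scores[(q, t)])
```

Because each pair gets a single relevance score, the same function applies unchanged however many folds the template library covers.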
Pairwise Feature Extraction
• Sequence / Family Information Features: cosine, correlation, and Gaussian kernel
• Sequence - Sequence Alignment Features: Palign, ClustalW
• Sequence - Profile Alignment Features: PSI-BLAST, IMPALA, HMMer, RPS-BLAST
• Profile - Profile Alignment Features: ClustalW, HHSearch, Lobster, Compass, PRC-HMM
• Structural Features: secondary structure, solvent accessibility, contact map, beta-sheet topology
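As an illustration of the first feature group, the three similarity measures can be computed between fixed-length vectors describing the query and the template. The sketch below assumes amino acid composition vectors as the input and a Gaussian kernel width `sigma` of 1.0; both choices, and all function names, are assumptions for illustration rather than details given on the slide.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 amino acids in a sequence."""
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def correlation(x, y):
    """Pearson correlation: the cosine of the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

def gaussian_kernel(x, y, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / sigma ** 2)

query = composition("ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL")
template = composition("ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN")
features = [cosine(query, template),
            correlation(query, template),
            gaussian_kernel(query, template)]
```

Each measure yields one number per query-template pair, so the three values can be placed side by side with the alignment and structural scores in a single feature vector.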
Relevance Function: Support Vector Machine Learning
[Figure: training data set in feature space; the Support Vector Machine learns a hyperplane separating positive pairs (same folds) from negative pairs (different folds)]
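Relevance Function: Support Vector Machine Learning
[Figure: two separating hyperplanes, (1) and (2), with their margins]
The formulas on this slide were lost in extraction; the standard SVM decision function with a Gaussian kernel is:
$$f(x) = \sum_i \alpha_i y_i K(x, x_i) + b$$
K is Gaussian Kernel:
$$K(x, y) = \exp\left(-\lVert x - y \rVert^2 / \sigma^2\right)$$
Here x_i are the training pairs, y_i = ±1 their labels, and α_i and b the learned parameters.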
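An SVM with a Gaussian kernel scores a feature vector as a kernel-weighted sum over its support vectors plus a bias. The sketch below evaluates such a decision function; the support vectors, labels, weights, bias, and kernel width are hypothetical toy values, not parameters from the talk.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma ** 2)

def decision_value(x, support_vectors, labels, alphas, bias, sigma=1.0):
    """Sum of alpha_i * y_i * K(x, x_i) over the support vectors, plus the bias."""
    return sum(alpha * y * gaussian_kernel(x, sv, sigma)
               for alpha, y, sv in zip(alphas, labels, support_vectors)) + bias

# Hypothetical model: one positive and one negative support vector.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [+1, -1]
alphas = [1.0, 1.0]

near_positive = decision_value((0.9, 0.8), support_vectors, labels, alphas, bias=0.0)
near_negative = decision_value((-0.9, -0.8), support_vectors, labels, alphas, bias=0.0)
```

A positive value marks the pair as relevant (same fold) and a negative value as irrelevant; the magnitude can be used directly as the ranking score.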
Training and Cross-Validation
• Standard benchmark (Lindahl’s dataset, 976 proteins)
• 976 × 975 query-template pairs (about 7,468 positives)
• Each query (1 to 976) contributes 975 pairs, one per template
• Train / Learn: 90% (queries 1-878)
• Test: 10% (queries 879-976); rank 975 templates for each query
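The split is made by query, so all 975 pairs of a given query fall on the same side. A small sketch of the bookkeeping for the one fold shown on the slide, with proteins numbered 1 to 976; the helper name `pairs_for_query` is hypothetical.

```python
N_PROTEINS = 976

def pairs_for_query(q, n=N_PROTEINS):
    """All query-template pairs for query q: every other protein is a template."""
    return [(q, t) for t in range(1, n + 1) if t != q]

train_queries = range(1, 879)               # queries 1-878 (about 90%)
test_queries = range(879, N_PROTEINS + 1)   # queries 879-976 (about 10%)

n_train_pairs = sum(len(pairs_for_query(q)) for q in train_queries)
n_test_pairs = sum(len(pairs_for_query(q)) for q in test_queries)
```

Keeping a query's pairs together means that at test time the model ranks templates for queries it never saw as queries during training.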
Results for Top Five Ranked Templates
• Family: close homologs, more identity
• Superfamily: distant homologs, less identity
• Fold: no evolutionary relation, no identity