Physics-Based Protein Modeling vs. Template-Based Modeling -- CASP7 & CASP8

Protein Structure Prediction by Global Optimization and its Application to Biological Systems • Jooyoung Lee • http://lee.kias.re.kr • Center for In Silico Protein Science, Korea Institute for Advanced Study, Seoul, Korea • Oct. 12, 2009 • Physics-Based Protein Modeling vs. Template-Based Modeling -- CASP7 & CASP8 • High-Accuracy Protein Modeling by Global Optimization • Accurate Protein 3D Modeling → Better Understanding of Biology???

KIAS Protein Folding Laboratory http://lee.kias.re.kr • Proteins are important • Proteins (antibodies, enzymes, hormones) control all cellular processes • sequence (genome) → structure (post-genome) → function • Going from sequence to structure (the Protein Folding Problem) is the scientific bottleneck: the most challenging problem of this century

Protein folding problem (example: human hemoglobin) • sequence → structure → function • For a given amino acid sequence (of size n), find the native structure of the protein. • Total # of protein structures: 10^n • Mathematically NOT a well-defined problem
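To give a sense of scale for the slide's estimate of roughly 10^n structures for a chain of n residues, the short sketch below counts conformations for a few chain lengths and converts each count into an exhaustive-enumeration time. The 10 states per residue follow the slide; the sampling rate of 10^15 conformations per second is an arbitrary assumption used only for illustration.

```python
# Rough arithmetic for the size of the conformational search space.
# STATES_PER_RESIDUE = 10 follows the slide's 10^n estimate.
# SAMPLES_PER_SECOND is an assumed, purely illustrative sampling rate.
STATES_PER_RESIDUE = 10
SAMPLES_PER_SECOND = 10**15
SECONDS_PER_YEAR = 365 * 24 * 3600


def n_conformations(n_residues: int) -> int:
    """Number of conformations if every residue has the same number of states."""
    return STATES_PER_RESIDUE ** n_residues


def years_to_enumerate(n_residues: int) -> float:
    """Years needed to visit every conformation once at the assumed rate."""
    return n_conformations(n_residues) / (SAMPLES_PER_SECOND * SECONDS_PER_YEAR)


for n in (10, 50, 100):
    print(f"n = {n:3d}: 10^{n} conformations, "
          f"{years_to_enumerate(n):.1e} years to enumerate")
```

Even at this generous assumed rate, exhaustive search is hopeless for a 100-residue chain, which is why the rest of the talk relies on global optimization and templates rather than enumeration.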

Protein folding problem • 1. Protein Structure Prediction: • For a given protein sequence, determine its 3D structure by computation • 2. Protein-Folding Mechanisms: • By what process does a protein fold into its native and biologically active conformation? • 3. Inverse Folding: • For a given protein structure, design its 1D sequence

Protein Structure Prediction • Physics-based approaches: Principle-based modeling • Accurate potential energy function • Powerful global optimization method → what we can do better than others • Ab initio, de novo, new fold targets (10-20%) • Informatics-based approaches: Template-based modeling • Map the original problem to a problem with a known solution → mapping problem (alignment problem) • Use templates (problems with solutions) to obtain the solution of the original problem (multiple alignment) • Comparative modeling, fold recognition (80-90%)

Conformational Space Annealing (CSA) • Narrows the search while maintaining diversity of sampling: “annealing in conformational space”. • J Comput Chem 18, 1222 (1997) • Phys Rev Lett 91, 080201 (2003)
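The idea of narrowing the search while keeping the sampling diverse can be sketched on a toy problem. The code below is a minimal illustration, not the authors' implementation: it keeps a fixed-size bank of locally minimized solutions and a distance cutoff `d_cut` that shrinks over time. A trial solution close to an existing bank member competes only with that member, while a trial from a new region competes with the worst member. The Rastrigin test function, the crude pattern-search minimizer, the crossover-plus-noise move, and the cutoff schedule (from half to one fifth of the initial average bank distance) are all assumptions chosen for illustration.

```python
import math
import random


def energy(x):
    """Rugged toy landscape (Rastrigin); global minimum 0 at the origin."""
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)


def local_minimize(x, step=0.25, tol=1e-4):
    """Crude pattern search: accept axis moves that lower the energy."""
    x, e = list(x), energy(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                e_trial = energy(trial)
                if e_trial < e:
                    x, e, improved = trial, e_trial, True
        if not improved:
            step /= 2
    return x, e


def csa(bank_size=20, n_dim=2, n_rounds=30, trials_per_round=20, seed=0):
    rng = random.Random(seed)
    # Bank of (coordinates, energy) pairs, each already locally minimized.
    bank = [local_minimize([rng.uniform(-5, 5) for _ in range(n_dim)])
            for _ in range(bank_size)]
    pairs = [(i, j) for i in range(bank_size) for j in range(i)]
    d_avg = sum(math.dist(bank[i][0], bank[j][0]) for i, j in pairs) / len(pairs)
    d_cut, d_min = d_avg / 2, d_avg / 5          # assumed cutoff schedule
    shrink = (d_min / d_cut) ** (1.0 / n_rounds)
    for _ in range(n_rounds):
        for _ in range(trials_per_round):
            seed_x, _ = rng.choice(bank)
            partner_x, _ = rng.choice(bank)
            # Trial move: mix coordinates of two bank members, add small noise.
            child = [p if rng.random() < 0.5 else s
                     for s, p in zip(seed_x, partner_x)]
            child = [c + rng.gauss(0, 0.3) for c in child]
            x, e = local_minimize(child)
            nearest = min(range(bank_size), key=lambda i: math.dist(x, bank[i][0]))
            if math.dist(x, bank[nearest][0]) < d_cut:
                target = nearest                  # similar: compete with neighbor
            else:
                target = max(range(bank_size), key=lambda i: bank[i][1])  # worst
            if e < bank[target][1]:
                bank[target] = (x, e)
        d_cut = max(d_min, d_cut * shrink)        # "annealing" of the cutoff
    return min(bank, key=lambda member: member[1]), bank


best, bank = csa()
print("lowest energy in bank:", best[1])
```

A large `d_cut` early on forces bank members to stay far apart (diversity); as it shrinks, the bank is allowed to concentrate around the low-energy regions it has found.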

Examples of successful optimizations • Optimization of ECEPP/3 for a 20-residue membrane-bound portion of melittin [Biopolymers 46, 103-115 (1998)] • Unbiased global optimization of Lennard-Jones clusters up to N = 201 [Phys Rev Lett 91, 080201 (2003)] • Ground state in the frustrated XY model and lattice Coulomb gas with f = 1/6 [Physica A 315, 314-320 (2002)] • Conformational space annealing and an off-lattice frustrated model protein [J Chem Phys 119, 10274-10279 (2003)] • Structure optimization of an off-lattice AB protein model [Phys Rev E 72, 011916 (2005), Submitted] • Efficient molecular docking using conformational space annealing [J Comput Chem 26, 78-87 (2005)] • Ground-state energy and energy landscape of the Sherrington-Kirkpatrick spin glass [Phys Rev B 76, 184412 (2007)] • Successful High Accuracy Template Based Modeling in the CASP7 experiments [Proteins 69, Suppl. 8, 83-89 (2007)] • Multiple sequence alignment by conformational space annealing [Biophysical J. 95, 4813-4819 (2008)]


What is CASP? • Critical Assessment of Techniques for Protein Structure Prediction (http://predictioncenter.gc.ucdavis.edu/). • The goal is to help advance methods for identifying protein structure from sequence. • Community-wide experiments held every two years since 1994 to prepare for the post-genomic era. • Blind prediction (and blind assessment). • Since CASP1 (1994), a total of 514 protein sequences have been predicted. • Since CASP5 (2002), ~200 methods have been tested in each CASP.

Protein Structure Prediction • Physics-based approaches: Principle-based modeling • Accurate potential energy function • Powerful global optimization method • Ab initio, de novo, new fold targets (10-20%) • Informatics-based approaches: Template-based modeling • Map the original problem to a problem with a known solution → mapping problem (alignment problem) • Use templates (problems with solutions) to obtain the solution of the original problem (multiple alignment) • Comparative modeling & fold recognition (80-90%)

HDEA: RMSD = 4.2 Å for 61 residues (80%, residues 25-85) • HDEA segment: RMSD = 2.9 Å for 27 residues (36%, residues 16-42) • PNAS 96, 5482
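For readers unfamiliar with the RMSD values quoted on this slide, the sketch below shows how a root-mean-square deviation between a model and a reference structure is computed over a chosen set of residues. It assumes the two coordinate sets are already optimally superimposed and listed in the same residue order; the superposition step itself is not shown, and the coordinates are made-up illustrative values, not HDEA data.

```python
import math


def rmsd(model_xyz, reference_xyz):
    """RMSD (in the units of the input, e.g. Å) between two coordinate lists.

    Assumes both lists are already superimposed and in the same residue order.
    """
    if len(model_xyz) != len(reference_xyz) or not model_xyz:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model_xyz, reference_xyz))
    return math.sqrt(squared / len(model_xyz))


# Hypothetical C-alpha coordinates: the model is the reference shifted by 3 Å in x.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(x + 3.0, y, z) for x, y, z in reference]
example_rmsd = rmsd(model, reference)
print(f"RMSD = {example_rmsd:.1f} Å")
```

Restricting the lists to a residue range, as the slide does for residues 25-85 and 16-42, gives the segment RMSD for that range.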

Past CASP Performances of KIAS protein folding lab - CASP5 (2002): 18th out of 165 teams in the new-fold category - CASP6 (2004): selected as one of 12 elite teams in the new-fold category - CASP6 example: T0199_D3 (FR/A, Nres=82, residues 145-226), native structure vs. Model 4

Physics & Protein Structure Prediction (I) Proteins are polypeptide chains containing many atoms, and the interaction between atoms is considered to be reasonably well described by physics and chemistry. However, there are only a few anecdotal examples of successful physics-based protein modeling (compared to the informatics-based method). Currently, protein structure prediction methods relying only on physics-based approaches do not work as well as informatics-based methods.

Protein Structure Prediction • Physics-based approaches: Principle-based modeling • Accurate potential energy function • Powerful global optimization method • Ab initio, de novo, new fold targets (10-20%) • Informatics-based approaches: Template-based modeling • Map the original problem to a problem with a known solution → mapping problem (alignment problem) • Use templates (problems with solutions) to obtain the solution of the original problem (multiple alignment) • Comparative modeling & fold recognition (80-90%)

Physics & Protein Structure Prediction (II) The goal is to achieve better protein modeling by fusing informatics-based methods with a principle of physics (global optimization). The task was to map template-based protein modeling onto a series of combinatorial optimization problems. The reality was learning TBM (template-based modeling) by making lots of mistakes in a real situation (CASP7).

CASP7 Experiment • 2006, May -- August • About 200 prediction methods are tested • Total of 104 targets (9 cancelled) • Three major categories: • High Accuracy Template Based Modeling (28 domains) • Use fine resolution measures for backbone assessment • Side-chains are also assessed • Only model 1s are considered • Template Based Modeling (108 domains) • Free Modeling (16 domains) • Physics-based methods have chances for providing competitive protein models • Official results are available from CASP7 conference homepage (11/26-11/30/2006) and Proteins CASP7 issue

Homology modeling (template-based modeling) methods in the literature • Conventional methods: • Minimal amount of computing resources • Human-power intensive (several days per target) • A series of decision-making procedures requires human expertise. • More advanced (and successful) methods: • Require some/significant computing power • Fragments are reassembled • Complicated score functions (not available to others) are optimized • TASSER by Zhang and Skolnick & ROSETTA by Baker • Our approach: • Problems are all mapped onto combinatorial optimization problems • Computing-power intensive (CSA is used) • Requires no human expertise (this is our first-ever TBM attempt in CASP) • Score functions are assembled from publicly available ones • Goal was to learn TBM by making mistakes in a real situation

We formulate protein modeling as a series of combinatorial optimization problems: • Multiple Sequence Alignment (MSA) → optimization of a frustrated system [Biophysical J. 95 4813-4819 (2008)]: • generate pair-wise alignments between all pairs • from each pair-wise alignment, generate residue-to-residue restraints → a library of restraints → a frustrated system • All-atom chain building from MSA → another combinatorial optimization problem, this time of the MODELLER energy function [Proteins 75 1010-1023 (2009)]: • the MODELLER energy is a collection of competing terms, including distance restraint terms from the MSA and stereochemistry terms → inherent frustration when dealing with more than one template • the MODELLER energy is treated as a black box for optimization • Side-chain modeling is a combinatorial optimization of rotamers for a given backbone structure
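The first step above, turning pair-wise alignments into a restraint library and scoring a multiple alignment against it, can be sketched as follows. This is a toy illustration under stated assumptions, not the published MSACSA scoring function: every restraint has equal weight, the score is simply the number of restraints an MSA satisfies, and the three short sequences and their pair-wise alignments are invented to show frustration (no MSA can satisfy all six restraints).

```python
from itertools import combinations


def aligned_pairs(row_a, row_b):
    """Residue-index pairs that share a column in two equal-length gapped rows."""
    pairs, ia, ib = [], 0, 0
    for ca, cb in zip(row_a, row_b):
        if ca != "-" and cb != "-":
            pairs.append((ia, ib))
        if ca != "-":
            ia += 1
        if cb != "-":
            ib += 1
    return pairs


def restraint_library(pairwise):
    """Map {(i, j): (gapped_i, gapped_j)} to a set of restraints (i, a, j, b)."""
    return {(i, a, j, b)
            for (i, j), (row_i, row_j) in pairwise.items()
            for a, b in aligned_pairs(row_i, row_j)}


def msa_score(msa, library):
    """Number of library restraints satisfied by the MSA (higher is better)."""
    return sum((i, a, j, b) in library
               for i, j in combinations(range(len(msa)), 2)
               for a, b in aligned_pairs(msa[i], msa[j]))


# Hypothetical sequences 0: ACD, 1: AD, 2: CD, with mutually inconsistent
# pair-wise alignments (sequence 1's A is paired with sequence 2's C).
pairwise = {
    (0, 1): ("ACD", "A-D"),
    (0, 2): ("ACD", "-CD"),
    (1, 2): ("AD", "CD"),
}
library = restraint_library(pairwise)

good_msa = ["ACD", "A-D", "-CD"]
bad_msa = ["ACD", "AD-", "CD-"]
print(len(library), msa_score(good_msa, library), msa_score(bad_msa, library))
```

Because the pair-wise alignments disagree with one another, the best achievable score here is below the library size; finding the MSA that satisfies the most restraints is the combinatorial optimization problem that a global optimizer such as CSA would be asked to solve.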

CASP Strategy • Proteins 69, Suppl. 8, 83-89 (2007) • Biophysical J. 95, 4813 (2008) • Proteins 75, 1010-1023 (2009)

CASP7 High Accuracy Template Based Modeling • z 0.995 • Proteins 69, Issue S8, 27 – 37 (2007)

CASP7 High Accuracy Template Based Modeling Proteins 69, Issue S8, 27 – 37 (2007)

The conclusion of the official CASP7 assessment for HA/TBM targets (Proteins 69, Issue S8, 38 – 56 (2007)) reads: “A number of groups did well in the HA/TBM category. Group 556 (LEE) stood out as the only group that performed near the top according to all criteria investigated: fold quality (particularly GDT-HA), side-chain rotamer quality, and molecular replacement model quality.”

Template Based Modeling (Skolnick) Proteins 69, Issue S8, 38 – 56 (2007)

CASP8 • 2008, May 5 – Aug 23. • Over 200 prediction methods are tested. • Total of 128 targets (6 cancelled). • We tried 2 methods: LEE and LEE-SERVER (server) • Partial assessment data released during the CASP8 meeting (Sardinia, Italy, Dec 3-7, 2008). • Highlights of LEE & LEE-SERVER prediction: • 50 HA-TBM targets: LEE & LEE-SERVER are 2 best methods • Binding site prediction: LEE & LEE-SERVER are 2 best methods. • Refinement category: Best model1 prediction by LEE

Top 20 methods sorted by GDT-HA for HA-TBM targets (http://casp.kias.re.kr) • Methods labeled in the plot: LEE, LEE-SERVER, BAKER-ROBETTA, keasar-server, Zhang-Server, YASARA, SAM-T08
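Since the ranking is by GDT-HA, a minimal sketch of that score may help. GDT-HA averages the percentage of residues whose C-alpha atoms lie within 0.5, 1, 2 and 4 Å of the reference structure. The sketch below is a simplification: it assumes a single, already computed list of per-residue distances after one superposition, whereas the assessment software searches over superpositions for each cutoff. The distances are made-up illustrative values.

```python
GDT_HA_CUTOFFS = (0.5, 1.0, 2.0, 4.0)  # Å


def gdt_ha(distances, cutoffs=GDT_HA_CUTOFFS):
    """Simplified GDT-HA (0-100) from per-residue C-alpha distances in Å.

    Assumes one fixed superposition; the real measure optimizes the
    superposition separately for each cutoff.
    """
    if not distances:
        raise ValueError("need at least one distance")
    hits = sum(1 for cutoff in cutoffs for d in distances if d <= cutoff)
    return 100.0 * hits / (len(distances) * len(cutoffs))


# Hypothetical per-residue deviations for a five-residue model.
example_distances = [0.3, 0.8, 1.5, 3.0, 6.0]
example_score = gdt_ha(example_distances)
print(f"GDT-HA = {example_score:.2f}")
```

The tight cutoffs are what make this a high-accuracy measure: a residue that is 3 Å off contributes to only one of the four terms.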

Predictor Group Rankings (CASP8 TBM category)

What can one do better with more accurate protein models? • Predict protein functions: • CASP8 performance (best HA-TBM prediction by LEE and LEE-SERVER) • Best binding site prediction → protein design • Suggest working mechanisms of proteins at atomic resolution (insulin analogs, in collaboration with Prof. H Shin) • Screen natural proteins to find more efficient enzymes: • Discovery of more efficient aminotransferases by protein modeling and docking simulation → confirmed by wet experiments, which validated a 30- to 60-fold increase in the reaction rate (collaboration with Prof. BG Kim) • Determine a protein complex structure by combining X-ray diffraction data and protein modeling (Cell 136, 85-96, Jan 9 2009, in collaboration with Prof. BH Oh)

Cell 136 85-96, Jan 9 2009

X-tal structure of condensin complex MukBEF Cell 136 85-96, 2009

Screening of ω-aminotransferase (ω-TA) for the asymmetric synthesis of chiral amine • Caulobacter ω-TA was selected and PSI-BLAST was run: 250 sequences were selected • The 250 sequences were multiply aligned and 4 subgroups were identified. • 51 sequences belong to ω-TA, and all of them were used for model building. • The models were docked with aminodiphenylmethane (ADPM), and the distance between PLP and the N atom of ADPM was measured.
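The last step of this screen, measuring the PLP-to-nitrogen distance in each docked model, reduces to a distance calculation and a sort. The sketch below is illustrative only: the model names and coordinates are hypothetical, the choice of a single PLP reference atom is an assumption (the slide does not say which PLP atom was used), and ranking shorter distances first is likewise an assumed criterion.

```python
import math


def plp_to_amine_distance(plp_atom_xyz, adpm_n_xyz):
    """Distance (Å) between a PLP reference atom and the ADPM nitrogen."""
    return math.dist(plp_atom_xyz, adpm_n_xyz)


# Hypothetical docked poses: model name -> (PLP reference atom, ADPM N atom).
docked_poses = {
    "model_A": ((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)),
    "model_B": ((1.0, 1.0, 1.0), (1.0, 1.0, 4.0)),
    "model_C": ((2.0, 0.0, 0.0), (2.0, 6.0, 8.0)),
}

# Assumed criterion: a shorter PLP-N distance suggests a more productive pose.
ranked_models = sorted(
    docked_poses, key=lambda name: plp_to_amine_distance(*docked_poses[name])
)
for name in ranked_models:
    print(name, f"{plp_to_amine_distance(*docked_poses[name]):.1f} Å")
```

In a real screen the coordinates would be read from the docked complex files for each of the modeled sequences.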

CASP8 Binding Site Prediction

T0391 (Human/Server) • Magenta: X-ray (3d89A) • Blue: LEE • Ligand: FES • Complex PDB code: 3d89 • HETERO ATOMS: FES • FES: 57 59 60 61 62 80 82 83 85 • Prediction: FES binding • Accuracy = 9/10 = 90% • Coverage = 9/9 = 100% • GDT-HA: 50.73
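The accuracy and coverage figures on this slide follow from comparing two sets of residue numbers. The sketch below reproduces the arithmetic, treating the nine residues listed for FES as the observed binding residues (an interpretation of the slide). The slide does not say which tenth residue was predicted incorrectly, so residue 58 is a hypothetical placeholder.

```python
def binding_site_metrics(predicted, observed):
    """Return (accuracy, coverage) for a predicted set of binding residues.

    accuracy = correct predictions / all predicted residues
    coverage = correct predictions / all observed binding residues
    """
    correct = len(predicted & observed)
    return correct / len(predicted), correct / len(observed)


# Residues listed for FES on the slide, taken here as the observed binding site.
observed_residues = {57, 59, 60, 61, 62, 80, 82, 83, 85}
# Ten predicted residues, nine of them correct; 58 is a hypothetical
# stand-in for the unlisted wrong prediction.
predicted_residues = observed_residues | {58}

accuracy, coverage = binding_site_metrics(predicted_residues, observed_residues)
print(f"accuracy = {accuracy:.0%}, coverage = {coverage:.0%}")
```

Accuracy penalizes over-prediction and coverage penalizes missed residues, so the two are reported together.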

NMR Protein Structure Determination • CSA vs. PDB • Better structure qualities • Comparable quality

Protein Structure Determination by X-ray crystallography & MR

Conclusions • We have successfully mapped template-based protein modeling onto three layers of combinatorial optimization problems: MSACSA, ModellerCSA and ROTCSA. • We have demonstrated that high-accuracy protein 3D modeling can be achieved simply by rigorous optimization of relevant score functions. • The proposed method requires a large amount of computational resources (100 CPU days per 300aa protein), but produces significantly better results. • There is room for improvement in template detection and loop modeling. • Application to real/experimental systems is in the preliminary stage but quite promising.

Acknowledgements • 3D modeling: Keehyoung Joo; Jinwoo Lee, Dept. of Math., Kwangwoon U.; Sung Jong Lee, Dept. of Phys., Suwon U. • Function: Mina Oh • NMR Structure Optimization: Jinhyuk Lee • Collaboration with experimental groups: Byung-Gee Kim, School of Chemical and Biological Engineering, SNU; Byung-Ha Oh, POSTECH (moved to KAIST); H Shin, Soongsil U.; DH Shin, Ewha Womans Univ. • Cluster computers: KIAS • http://casp.kias.re.kr/ for head-to-head comparison between servers

Thank You!
