Content Protein fold and structure Homology modeling Protein-protein docking

BL5203: Molecular Recognition & Interaction Lecture 6: Modeling Protein Structure and Protein-Protein Interaction Y.Z. Chen, Department of Pharmacy, National University of Singapore Tel: 65-6616-6877; Email: phacyz@nus.edu.sg; Web: http://bidd.nus.edu.sg • Content • Protein fold and structure • Homology modeling • Protein-protein docking

Sizes of protein databases 500M 1.6M 26K 1K

Swiss-Prot database

Protein structure classification • Protein world • Protein fold • Protein superfamily • Protein family • New fold

PDB New Fold Growth • The number of unique folds in nature is fairly small (possibly a few thousand) • 90% of new structures submitted to PDB in the past three years have folds similar to structures already in PDB • Figure legend: old folds, new folds, new PDB structures

Protein classification • Number of protein sequences grows exponentially • Number of solved structures grows exponentially • Number of new folds identified is very small (and close to constant) • Protein classification can • Generate an overview of structure types • Detect similarities (evolutionary relationships) between protein sequences

Problems in Protein Bioinformatics • 20,000 entries of proteins in the PDB • 1000 - 2000 distinct protein folds in nature • Thought to be only several thousand unique folds in all • Prediction of structure from sequence • Fold recognition • Fragment construction • Proteome annotation • Protein-protein docking

Protein folding code • Protein sequence -> protein folding code -> protein structure

Fold recognition • Match query sequence against library of known folds • Prediction of correct fold: the matched fold • Eisenberg et al. • Jones, Taylor, Thornton

Computational Requirements • 1 sequence search takes 12 mins (3 GHz) • Benchmarking on 100 proteins with 100 runs for a simplex search of parameter space = 80 days • 30 approaches explored = 7 years (on 1 CPU)
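The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide; the function and variable names are illustrative. It suggests that the slide's 80 days and 7 years are rounded values.

```python
# Back-of-envelope check of the benchmark cost, using the slide's inputs.
MINUTES_PER_SEARCH = 12   # one sequence search on a 3 GHz CPU
PROTEINS = 100            # benchmark set size
RUNS = 100                # runs of the simplex parameter search
APPROACHES = 30           # number of approaches explored


def benchmark_days(minutes_per_search, proteins, runs):
    """CPU days needed for one full benchmark on a single CPU."""
    return minutes_per_search * proteins * runs / (60 * 24)


one_benchmark = benchmark_days(MINUTES_PER_SEARCH, PROTEINS, RUNS)
all_approaches_years = one_benchmark * APPROACHES / 365

print(f"one benchmark: {one_benchmark:.1f} CPU days")
print(f"{APPROACHES} approaches: {all_approaches_years:.1f} CPU years")
```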

Types of Structure Prediction • De novo protein modeling • Methods seek to build three-dimensional protein models "from scratch" • Example: Rosetta • Comparative protein modeling • Uses previously solved structures as starting points, or templates • Example: protein threading

Factors that Make Protein Structure Prediction a Difficult Task • The number of possible structures that proteins may possess is extremely large, as highlighted by the Levinthal paradox • The physical basis of protein structural stability is not fully understood. • The primary sequence may not fully specify the tertiary structure. • chaperones • Direct simulation of protein folding is not generally tractable for both practical and theoretical reasons.

Homology Modeling • Homolog: a protein related to the target by divergent evolution from a common ancestor • 40% amino-acid identity with its homolog • No large insertions or deletions • Produces a predicted structure equivalent to a medium-resolution experimentally solved structure • 25% of known protein sequences fall in a safe area, implying they can be modeled reliably

Homology Modeling Defined • Homology modeling • Based on the reasonable assumption that two homologous proteins will share very similar structures. • Given the amino acid sequence of an unknown structure and the solved structure of a homologous protein, each amino acid in the solved structure is mutated computationally, into the corresponding amino acid from the unknown structure.
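The residue-by-residue mutation step described on this slide can be sketched as a walk along a target/template sequence alignment. This is a minimal illustration, not a modeling program: the function name `plan_mutations`, the gapped-string input format and the residue numbering are all assumptions, and no coordinates are built.

```python
def plan_mutations(target_aln, template_aln, first_resnum=1):
    """List the edits needed to turn the template into the target.

    Both arguments are rows of one pairwise alignment ('-' marks a gap).
    Returns (substitutions, insertions, deletions):
      substitutions: (template_resnum, template_aa, target_aa)
      insertions:    (resnum of preceding template residue, target_aa)
      deletions:     template_resnum with no counterpart in the target
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    substitutions, insertions, deletions = [], [], []
    resnum = first_resnum - 1  # number of the last template residue seen
    for tgt, tpl in zip(target_aln, template_aln):
        if tpl == "-":
            # Target residue without template coordinates: needs loop building.
            insertions.append((resnum, tgt))
            continue
        resnum += 1
        if tgt == "-":
            deletions.append(resnum)
        elif tgt != tpl:
            substitutions.append((resnum, tpl, tgt))
    return substitutions, insertions, deletions


subs, ins, dels = plan_mutations("ACD-FGH", "ACEKFG-")
print(subs, ins, dels)
```

Only the substitutions correspond to the simple "mutate in place" picture; insertions and deletions are where the "no large insertions or deletions" condition matters.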

Homology Modeling Limitations • Cannot study conformational changes • Cannot find new catalytic/binding sites • Brainstorm: lack of activity vs activity • Chymotrypsinogen, trypsinogen and plasminogen • 40% homologous • 2 active, 1 no activity, cannot explain why • Large bias towards structure of template • Models cannot be docked together

Why Homology Modeling? • Value in structure based drug design • Find common catalytic sites/molecular recognition sites • Use as a guide to planning and interpreting experiments • 70-80% chance that a protein has a fold similar to one already determined by X-ray crystallography or NMR spectroscopy • Sometimes it’s the only option or best guess

Protein Threading • A target sequence is threaded through the backbone structure of a collection of template proteins (fold library) • Quantitative measure of how well the sequence fits the fold • Based on assumptions • 3-D structures of proteins have characteristics that are semi-quantitatively predictable • reflect the physical-chemical properties of amino acids • Limited types of interactions allowed within folding

Fold Recognition Methods • Bowie, Lüthy and Eisenberg (1991) • 2 approaches to recognition methods • Derive a 1-D profile for each structure in the fold library and align the target sequence to these profiles • Identify amino acids based on core or external positions • Part of secondary structure • Consider the full 3-D structure of the protein template • Modeled as a set of inter-atomic distances • NP-hard (if interactions between multiple residues are included)

Protein Threading • The word threading implies that one drags the sequence (ACDEFG...) step by step through each location on each template
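The "dragging" picture can be sketched as an ungapped scan: place the sequence at every offset on every template and keep the best-scoring placement. Everything below is a toy assumption made for illustration: templates are reduced to strings of buried (`B`) and exposed (`E`) positions, and the scoring rule simply rewards hydrophobic residues in buried sites and polar residues in exposed sites.

```python
HYDROPHOBIC = set("AVILMFWC")  # assumed hydrophobic set, for illustration only


def position_score(aa, env):
    """+1 if the residue suits its environment ('B' buried, 'E' exposed), else -1."""
    return 1 if (aa in HYDROPHOBIC) == (env == "B") else -1


def thread(sequence, template_env):
    """Slide the sequence along one template without gaps.

    Returns (best_score, best_offset), or None if the template is too short.
    """
    best = None
    for offset in range(len(template_env) - len(sequence) + 1):
        score = sum(position_score(aa, template_env[offset + i])
                    for i, aa in enumerate(sequence))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def best_fold(sequence, fold_library):
    """Thread the sequence through every template and return the best match."""
    results = {name: thread(sequence, env) for name, env in fold_library.items()}
    results = {name: r for name, r in results.items() if r is not None}
    name = max(results, key=lambda n: results[n][0])
    return name, results[name]


library = {"foldA": "EEBEBEEBBE", "foldB": "EBBEEBEB"}  # hypothetical fold library
print(best_fold("ACDEFG", library))
```

Real threading programs also allow gaps and use richer potentials; this scan only shows the step-by-step placement idea.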

Protein Threading

Generalized Threading Score • Want to correctly recognize arrangements of residues • Building a score function • Potentials of mean force • From an optimization calculation • $G(r_{AB}) = -kT \ln(\rho_{AB} / \rho_{AB}^{\circ})$ • $G$: free energy • $k$ and $T$: Boltzmann's constant and temperature, respectively • $\rho_{AB}$: the observed frequency of AB pairs at distance $r$ • $\rho_{AB}^{\circ}$: the frequency of AB pairs at distance $r$ expected by chance • Z-score $= (E_{nat} - \langle E_{alt} \rangle) / \sigma_{E_{alt}}$ • Energy of the native structure minus the mean energy of all the wrong structures, divided by the standard deviation of those energies
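A minimal numerical sketch of the two quantities on this slide follows. It uses the conventional sign for a potential of mean force, under which pairs seen more often than chance get a negative (favourable) energy. The function names, the 300 K default, the molar units and the use of the population standard deviation are assumptions made for the example.

```python
import math
from statistics import mean, pstdev

K_B = 0.0019872  # Boltzmann constant per mole (gas constant), kcal/(mol*K)


def pair_potential(observed_freq, expected_freq, temperature=300.0):
    """Potential of mean force: G = -kT ln(rho_observed / rho_expected)."""
    return -K_B * temperature * math.log(observed_freq / expected_freq)


def z_score(native_energy, alternative_energies):
    """(E_nat - <E_alt>) / sigma(E_alt); strongly negative means well separated."""
    return ((native_energy - mean(alternative_energies))
            / pstdev(alternative_energies))


# A pair seen twice as often as chance is favourable (negative energy).
print(pair_potential(0.2, 0.1))
# A native energy far below the alternatives gives a large negative Z-score.
print(z_score(-10.0, [-2.0, 0.0, 2.0]))
```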

Scoring Different Folds • Goodness of fit score • Based on empirical energy function • Modify to take into account pairwise interactions and solvation terms • High score means good fit • Low score means nothing learned

Some Threading Programs • 3D-PSSM (ICNET). Based on sequence profiles, solvation potentials and secondary structure. • TOPITS (PredictProtein server) (EMBL). Based on coincidence of secondary structure and accessibility. • UCLA-DOE Structure Prediction Server (UCLA). Executes various threading programs and reports a consensus. • 123D+. Combines substitution matrix, secondary structure prediction, and contact capacity potentials. • SAM/HMM (UCSC). Based on Markov models of alignments of crystallized proteins. • FAS (Burnham Institute). Based on profile-profile matching algorithms of the query sequence with sequences from clustered PDB database. • PSIPRED-GenThreader (Brunel) • THREADER2 (Warwick). Based on solvation potentials and contacts obtained from crystallized proteins. • ProFIT CAME (Salzburg)

Process of 3D Structure Prediction by Threading • Does this protein have sequence similarity to others with a known structure? • Structure related information in the databases • Results from threading programs • Predicted folding comparison • Threading on the structure and mapping of the known data • A comparison between the threading predicted structure and the actual one

Protein Threading Based on Multiple Protein Structure Alignment • Tatsuya Akutsu and Kim Lan Sim, Human Genome Center, Institute of Medical Science, University of Tokyo • NP-hard if interactions between 2 or more amino acids are included • Determine multiple structural alignments based on pairwise structure alignments • Center Star Method

Center Star Method • Let $I_0$ be the maximum number of gap symbols placed before the first residue of $S_0$ in any of the alignments $A(S_0, S_1), \ldots, A(S_0, S_N)$. Let $I_{|S_0|}$ be the maximum number of gaps placed after the last character of $S_0$ in any of the alignments, and let $I_i$ be the maximum number of gaps placed between characters $S_{0,i}$ and $S_{0,i+1}$, where $S_{j,i}$ denotes the $i$-th letter of string $S_j$ • Create a string $S_0'$ by inserting $I_0$ gaps before $S_0$, $I_{|S_0|}$ gaps after $S_0$, and $I_i$ gaps between $S_{0,i}$ and $S_{0,i+1}$ • For each $S_j$ ($j > 0$), create a pairwise alignment $A(S_0', S_j)$ between $S_0'$ and $S_j$ by inserting gaps into $S_j$ so that deletion of the columns consisting only of gaps from $A(S_0', S_j)$ results in the same alignment as $A(S_0, S_j)$ • Simply arrange the $A(S_0', S_j)$'s into a single matrix $A$ (note that all $A(S_0', S_j)$'s have the same length).
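The merging procedure can be sketched in a few lines, assuming the pairwise alignments against the center are already available as pairs of gapped strings. The function names and the choice to left-justify inserted residues inside each gap slot are assumptions made for this illustration.

```python
def split_by_center(a0, aj):
    """Split one pairwise alignment into slots around the center's residues.

    inserts[i] holds the part of Sj aligned to gaps in slot i of the center
    (slot 0 is before the first residue, the last slot is after the last one);
    matches[i] is the Sj symbol aligned to residue i of the center.
    """
    inserts, matches = [""], []
    for c0, cj in zip(a0, aj):
        if c0 == "-":
            inserts[-1] += cj
        else:
            matches.append(cj)
            inserts.append("")
    return inserts, matches


def center_star(pairwise):
    """Merge pairwise alignments (a0, aj) that share one center into a matrix."""
    center = pairwise[0][0].replace("-", "")
    if any(a0.replace("-", "") != center for a0, _ in pairwise):
        raise ValueError("all alignments must use the same center sequence")
    parts = [split_by_center(a0, aj) for a0, aj in pairwise]
    n = len(center)
    # Widest gap run per slot over all pairwise alignments (the I values).
    widths = [max(len(ins[i]) for ins, _ in parts) for i in range(n + 1)]
    # Padded center string: the widest gap run is inserted in every slot.
    row0 = "-" * widths[0]
    for i, residue in enumerate(center):
        row0 += residue + "-" * widths[i + 1]
    rows = [row0]
    # Pad each Sj so that it lines up with the padded center.
    for ins, mat in parts:
        row = ins[0].ljust(widths[0], "-")
        for i in range(n):
            row += mat[i] + ins[i + 1].ljust(widths[i + 1], "-")
        rows.append(row)
    return rows


alignments = [("AC-GT", "ACCGT"), ("ACGT-", "A-GTT"), ("-ACGT", "TACG-")]
for row in center_star(alignments):
    print(row)
```

For the three toy alignments above, the rows come out as `-AC-GT-`, `-ACCGT-`, `-A--GTT` and `TAC-G--`, all of the same length.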

Simple Threading Algorithm • Apply simple score function based on structure alignment algorithm • Let X = x1…xN be the input amino acid sequence • Let Ci be the i-th column in A • Test and analyze results and/or apply constraints

Protein Threading with Constraints • Assume part of the input sequence xi…xi+k must correspond to part of the structure alignment cj…cj+k • Apply constraints
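One way to picture a constraint of this kind is to force the constrained block onto its columns and align the flanking regions freely. The sketch below is a hypothetical illustration and not the authors' score function: `column_score` (fraction of identical residues in a column), the linear gap penalty and the 0-based inclusive indices are all assumptions.

```python
GAP = -1.0  # hypothetical linear gap penalty


def column_score(aa, column):
    """Toy score: fraction of non-gap residues in the column identical to aa."""
    residues = [c for c in column if c != "-"]
    return sum(c == aa for c in residues) / len(residues) if residues else 0.0


def align_score(seq, columns):
    """Global alignment score of seq against alignment columns (dynamic programming)."""
    prev = [j * GAP for j in range(len(columns) + 1)]
    for i, aa in enumerate(seq, start=1):
        cur = [i * GAP]
        for j, col in enumerate(columns, start=1):
            cur.append(max(prev[j - 1] + column_score(aa, col),  # match
                           prev[j] + GAP,                         # skip residue
                           cur[j - 1] + GAP))                     # skip column
        prev = cur
    return prev[-1]


def constrained_score(seq, columns, i, j, k):
    """Force seq[i..i+k] onto columns[j..j+k]; align both flanks freely."""
    block = sum(column_score(seq[i + t], columns[j + t]) for t in range(k + 1))
    return (align_score(seq[:i], columns[:j]) + block
            + align_score(seq[i + k + 1:], columns[j + k + 1:]))


rows = ["ACD-F", "ACDEF", "GCD-F"]   # toy structure alignment matrix A
columns = list(zip(*rows))           # columns C_1 ... C_5
print(align_score("ACDF", columns))
print(constrained_score("ACDF", columns, 1, 1, 1))  # a constraint that agrees
print(constrained_score("ACDF", columns, 0, 1, 1))  # a constraint that conflicts
```

A constraint that agrees with the unconstrained optimum leaves the score unchanged, while a conflicting one can only lower it.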

Prediction Power • Entered in CASP3 competition • 17 predictions made • 3 targets evaluated as similar to correct folds • Only team to create a nearly correct model for structure T0043 • Best in competition • 8 evaluated as similar to correct

Next time…. • In depth detail of • Multiple structural alignment program • Multiprospector • Global Optimum Protein Threading with Gapped Alignment • Quality measures for protein threading models • Improvements on threading-based models

Gapped Alignment

Fragment based method 1 - Predict structure of segment • Trial structures for a local sequence taken from database of segments of known 3D structure

Fragment based method 2 - Construct trial model from segments

Fragment based method 3 - Identify good trial structures • 1 - Low resolution energy function used in initial search through conformational space • 2 - Side chains represented by single “centroid” pseudoatom • 3 - Major contributions from • Hydrophobic burial • Beta strand pairing • Steric overlap • Specific residue pair interactions • 4 - Models then refined using explicit rotamer based side chain representation and potential from design method

Fragment-based protein folding Cro repressor (1orc) observed

Computational Requirements • Methodology performs numerous simulations and looks for clusters • One simulation takes 3 mins (3 GHz) • Requires 1,000 simulations per protein = 2 days • Benchmark on 50 proteins = 100 days

3D-GENOMICS - proteome annotation Proteome sequences Annotation procedure Database sequences WWW MySQL database Database structures New research Functional data

Types of annotation • No similar sequence: orphan • Homology but no function or structure (example: E. coli Protein325) • Function (example: Enzyme ABC, EC 1.2.3.4) • Suggested membrane protein

3D-Genomics database-structural and functional annotation size

Computational requirements • Today 800,000 protein sequences. • Each sequence 15 mins to annotate on 2.5GHz cpu. • Time today = 8,000 cpu days = 2.5 months with 100 processor farm. • Need to update every 6 months. • No of sequences will double in 2-3 years and so will keep pace with increase in compute power.
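The annotation budget can be reproduced directly from the slide's inputs. The sketch below uses only those inputs plus an assumed average month length of 30.4 days; it suggests that the slide's 8,000 CPU days and 2.5 months are rounded-down figures.

```python
SEQUENCES = 800_000        # protein sequences to annotate
MINUTES_PER_SEQUENCE = 15  # on a 2.5 GHz CPU
PROCESSORS = 100           # size of the processor farm
DAYS_PER_MONTH = 30.4      # assumed average month length


def annotation_cost(sequences, minutes_each, processors):
    """Return (total CPU days, wall-clock days on the farm)."""
    cpu_days = sequences * minutes_each / (60 * 24)
    return cpu_days, cpu_days / processors


cpu_days, farm_days = annotation_cost(SEQUENCES, MINUTES_PER_SEQUENCE, PROCESSORS)
print(f"{cpu_days:.0f} CPU days, {farm_days / DAYS_PER_MONTH:.1f} months on the farm")
```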

Modelling protein-protein docking

Protein-protein docking • Coordinates of mol 1 • Coordinates of mol 2 • Experimental information • Rigid body search • List of possible complexes • Evaluate association energy • Flexibility to refine • List of complexes

Step 1 - Generating Complexes

Shape complementarity • Grid values: +1 and -15 • Overlap: +1 × -15 • Match: +1 × +1 • Grids A(i,j,k) and B(l,m,n) • $C = \sum \sum \sum A(i,j,k) \times B(l,m,n)$
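The grid correlation on this slide, C = sum of A(i,j,k) × B(l,m,n), can be sketched by brute force over translations of molecule B. The sparse dictionary grids, the cube-shaped toy molecule and the function names are assumptions for illustration; docking programs commonly evaluate the same sum for all translations at once with Fourier transforms.

```python
from itertools import product

SURFACE, CORE = 1, -15  # grid values shown on the slide


def cube(n):
    """Toy molecule A: an n x n x n block, outer layer surface, interior core."""
    return {(i, j, k): (SURFACE if 0 in (i, j, k) or n - 1 in (i, j, k) else CORE)
            for i, j, k in product(range(n), repeat=3)}


def correlation(grid_a, grid_b, shift):
    """C for one translation of B: sum of A(cell + shift) * B(cell)."""
    a, b, c = shift
    return sum(value * grid_a.get((i + a, j + b, k + c), 0)
               for (i, j, k), value in grid_b.items())


def best_translation(grid_a, grid_b, max_shift):
    """Try every integer translation in a cube and return (best C, shift)."""
    shifts = product(range(-max_shift, max_shift + 1), repeat=3)
    return max((correlation(grid_a, grid_b, s), s) for s in shifts)


mol_a = cube(3)
mol_b = {(0, 0, 0): 1, (1, 0, 0): 1}  # two-cell probe standing in for molecule B
print(correlation(mol_a, mol_b, (0, 0, 0)))  # both cells on the surface: match
print(correlation(mol_a, mol_b, (1, 1, 1)))  # one cell in the core: overlap
```

The large negative core value means that any penetration into the core outweighs a handful of surface matches.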

Electrostatic complementarity • Charge in 1 = Q(i,j,k) • Potential outside 2 = V(l,m,n) • $E = \sum \sum \sum Q(i,j,k) \times V(l,m,n)$

Step 2 - Modelling residue-residue interactions

Empirical residue pair potentials • Analyse residues packing across 90 hetero-protein interfaces • A pair of residues (a, b) pack if they have one atom-atom contact below the distance cut-off (4.5 Å) • $\mathrm{Score}(a,b) = \log_{10}\left(\frac{\text{observed no. of } a/b \text{ pairs}}{\text{expected no. of } a/b \text{ pairs}}\right)$
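The log-odds score on this slide can be sketched as follows. The contact test follows the slide's 4.5 Å atom-atom cut-off; the expected count is an assumption made here (residue types pairing at random according to their interface composition), since the slide does not say how it is derived. All names are illustrative.

```python
import math
from collections import Counter


def pack(atoms_a, atoms_b, cutoff=4.5):
    """True if any atom-atom distance between two residues is below the cut-off."""
    return any(math.dist(p, q) < cutoff for p in atoms_a for q in atoms_b)


def pair_scores(contacts, composition):
    """Score(a, b) = log10(observed a/b pairs / expected a/b pairs).

    contacts:    (a, b) residue-type pairs found packing across interfaces
    composition: residue type -> fraction of interface residues (assumed model)
    """
    observed = Counter(tuple(sorted(pair)) for pair in contacts)
    total = sum(observed.values())
    scores = {}
    for (a, b), n_obs in observed.items():
        # Unordered pairs of two different types can arise in two ways.
        p_random = composition[a] * composition[b] * (1 if a == b else 2)
        scores[(a, b)] = math.log10(n_obs / (total * p_random))
    return scores


contacts = [("L", "V")] * 6 + [("K", "E")] * 3 + [("K", "K")]
composition = {"L": 0.25, "V": 0.25, "K": 0.25, "E": 0.25}
scores = pair_scores(contacts, composition)
print(scores)
```

A positive score marks a pairing seen more often than the random model predicts, a negative one a pairing that is avoided.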

Step 3 - Including information about functional residues • From literature
