UNDERSTANDING INTELLIGENCE

Predicting Protein Structures and Structural Features on a Genomic ScalePierre BaldiSchool of Information and Computer SciencesInstitute for Genomics and BioinformaticsUniversity of California, Irvine

UNDERSTANDING INTELLIGENCE • Human intelligence (inverse problem) • AI (direct problem) • Choice of specific problems is key • Protein structure prediction is a good problem

PROTEINS R1 R3 | | Cα N Cβ Cα / \ / \ / \ / \ N Cβ Cα N Cβ | R2

Utility of Structural Information (Baker and Sali, 2001)

CAVEAT

REMARKS • Structure/Folding • Backbone/Full Atom • Homology Modeling • Fold Recognition (Threading) • Ab Initio (Physical Potentials/Molecular Dynamics, Statistical Mechanics/Lattice Models) • Statistical/Machine Learning (Training Sets, SS prediction) • Mixtures: ab-initio with statistical potentials, machine learning with profiles, etc.

PROTEIN STRUCTURE PREDICTION (ab initio)

Helices 1GRJ (Grea Transcript Cleavage Factor From Escherichia Coli)

Antiparallel β-sheets 1MSC (Bacteriophage Ms2 Unassembled Coat Protein Dimer)

Parallel β-sheets 1FUE (Flavodoxin)

Contact map

Secondary structure prediction

GRAPHICAL MODELS: BAYESIAN NETWORKS • X1, … ,Xn random variables associated with the vertices of a DAG = Directed Acyclic Graph • The local conditional distributions P(Xi|Xj: j parent of i) are the parameters of the model. They can be represented by look-up tables (costly) or other more compact parameterizations (Sigmoidal Belief Networks, XOR, etc). • The global distribution is the product of the local characteristics:P(X1,…,Xn) = Πi P(Xi|Xj : j parent of i)

DATA PREPARATION Starting point: PDB data base. · Remove sequences not determined by X ray diffraction. · Remove sequences where DSSP crashes. · Remove proteins with physical chain breaks (neighboring AA having distances exceeding 4 Angstroms) · Remove sequences with resolution worst than 2.5 Angstroms. · Remove chains with less than 30 AA. · Remove redundancy (Hobohm’s algorithm, Smith-Waterman, PAM 120, etc.) · Build multiple alignments (BLAST, PSI-BLAST, etc.)

SECONDARY STRUCTURE PROGRAMS · DSSP (Kabsch and Sander, 1983): works by assigning potential backbone hydrogen bonds (based on the 3D coordinates of the backbone atoms) and subsequently by identifying repetitive bonding patterns. ·STRIDE (Frishman and Argos, 1995): in addition to hydrogen bonds, it uses also dihedral angles. ·DEFINE (Richards and Kundrot, 1988): uses difference distance matrices for evaluating the match of interatomic distances in the protein to those from idealized SS.

SECONDARY STRUCTURE ASSIGNMENTS DSSP classes: • H = alpha helix • E = sheet • G = 3-10 helix • S = kind of turn • T = beta turn • B = beta bridge • I = pi-helix (very rare) • C = the rest CASP (harder) assignment: • α = H and G • β = E and B • γ = the rest Alternative assignment: • α = H • β = B • γ = the rest

ENSEMBLES

FUNDAMENTAL LIMITATIONS 100% CORRECT RECOGNITION IS PROBABLY IMPOSSIBLE FOR SEVERAL REASONS • SOME PROTEINS DO NOT FOLD SPONTANEOUSLY OR MAY NEED CHAPERONES • QUATERNARY STRUCTURE [BETA-STRAND PARTNERS MAY BE ON A DIFFERENT CHAIN] • STRUCTURE MAY DEPEND ON OTHER VARIABLES [ENVIRONMENT, PH] • DYNAMICAL ASPECTS • FUZZINESS OF DEFINITIONS AND ERRORS IN DATABASES

BB-RNNs

2D RNNs

2D INPUTS AA at positions i and j Profiles at positions i and j Correlated profiles at positions i and j + Secondary Structure, Accessibility, etc.

PERFORMANCE (%)

Protein Reconstruction Using predicted secondary structure and predicted contact map PDB ID : 1HCR, chain A Sequence: GRPRAINKHEQEQISRLLEKGHPRQQLAIIFGIGVSTLYRYFPASSIKKRMN True SS : CCCCCCCCHHHHHHHHHHHCCCCHHHHHHHCECCHHHHHHHCCCCCCCCCCC Pred SS : CCCCCCCHHHHHHHHHHHHCCCCHHHHEEHECHHHHHHHHCCCHHHHHHHCC PDB ID: 1HCR Chain A (52 residues) Model # 147 RMSD 3.47Å

Protein Reconstruction Using predicted secondary structure and predicted contact map PDB ID : 1BC8, chain C Sequence: MDSAITLWQFLLQLLQKPQNKHMICWTSNDGQFKLLQAEEVARLWGIRKNKPNMNYDKLSRALRYYYVKNIIKKVNGQKFVYKFVSYPEILNM True SS : CCCCCCHHHHHHHHCCCHHHCCCCEECCCCCEEECCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHCCEEECCCCCCEEEECCCCHHHCC Pred SS : CCCHHHHHHHHHHHHHCCCCCCEEEEECCCEEEEECCHHHHHHHHHHHCCCCCCCHHHHHHHHHHHHHCCCEEECCCCEEEEEEECCHHHHCC PDB ID: 1BC8 Chain C (93 residues) Model # 1714 RMSD 4.21Å

CASP6 Self AssessmentEvaluation based on GDT_TS of first submitted modelGDT_TS: Global Distance Test Total ScoreGDT_TS = (GDT_P1 + GDT_P2 + GDT_P4 + GDT_P8 ) / 4Pn : percentage of residues under distance cutoff n

Hard Target Summary • Top 10 groups displayed, of 65 registered servers • Assessment on25 new fold and fold recognition analogous target domains N: number of targets predicted Av.R.: average rank sumZ: sum of Z scores on all targets in set sumZpos: sum of Z scores for predictions with positive Z score group N Av.R. sumZ sumZpos BAKER-ROBETTA 25 9.12 27.81 27.94 baldi-group-server 24 10.04 20.47 22.33 Rokky 25 11.60 17.56 18.46 Pmodeller5 20 12.55 14.35 16.11 ZHOUSPARKS2 25 15.00 12.67 15.91 ACE 25 13.08 11.63 14.89 Pcomb2 24 16.04 10.82 13.58 RAPTOR 24 16.33 9.81 13.21 zhousp3 25 15.72 9.30 12.90 PROTINFO-AB 19 16.74 8.62 12.59

Hard Target Summary • Top 10 groups displayed, of 65 registered servers • Assessment on 19 new fold and fold recognition analogous target domains less than 120 residues N: number of targets predicted Av.R.: average rank sumZ: sum of Z scores on all targets in set sumZpos: sum of Z scores for predictions with positive Z score group N Av.R. sumZ sumZpos baldi-group-server 19 6.74 20.61 20.61 BAKER-ROBETTA 19 9.11 20.44 20.57 Rokky 19 12.11 12.30 13.20 PROTINFO-AB 16 12.63 11.53 12.59 ZHOUSPARKS2 19 15.32 8.91 11.87 Pcomb2 18 15.39 9.54 11.48 Pmodeller5 15 14.47 9.25 11.00 PROTINFO 18 16.22 8.66 10.56 ACE 19 14.21 6.98 10.24 RAPTOR 18 17.00 7.22 9.64

Target T0281Detailed Target Analysis • Target Information • Length: 70 amino acids • Resolution: 1.52 Å • PDB code: 1WHZ • Description: Hypothetical Protein From Thermus Thermophilus Hb8 • Domains: single domain • Assessment • GDT_TS server rank of our 1st model: 2 • GDT_TS: 51.07 • RMSD to native: 6.15

Target T0281Contact Map Comparison *note: true map is lower left True Map vs. Predicted Map True Map vs. Recovered Map

Target T0281Structure Comparison true structure predicted structure

Target T0281Structure Comparison Superposition True structure: thick trace Predicted structure: thin trace

Target T0280_2Detailed Target Analysis • Target Information • Length: 51 amino acids • Resolution: 2.00 Å • PDB code: 1WD5 • Description: Putative phosphoribosyl transferase, T. thermophilus • Domains: 2nd domain, residues 53-103 of 208 AA sequence • Assessment • GDT_TS server rank of our 1st model: 1 (also 1st among human groups) • GDT_TS: 54.41 • RMSD to native: 5.81

Target T0280_2Contact Map Comparison *note: true map is lower left True Map vs. Predicted Map True Map vs. Recovered Map

Target T0281Structure Comparison true structure predicted structure

Target T0281Structure Comparison Superposition True structure: thick trace Predicted structure: thin trace

THE SCRATCH SUITE www.igb.uci.edu • DOMpro: domains • DISpro: disordered regions • SSpro: secondary structure • SSpro8: secondary structure • ACCpro: accessibility • CONpro: contact number • DI-pro: disulphide bridges • BETA-pro: beta partners • CMAP-pro: contact map • CCMAP-pro: coarse contact map • CON23D-pro: contact map to 3D • 3D-pro: 3D structure (homology + fold recognition + ab-initio)

UNDERSTANDING INTELLIGENCE