Understand the importance of protein folding, structure levels, and bioinformatics tools for building accurate protein models. Explore how proteins function based on their 3D shapes and sequences.
Prof Shoba Ranganathan, Dept of Chemistry & Biomolecular Sciences, Macquarie University, Sydney (shoba.ranganthan@mq.edu.au) • Molecular Modeling: building a 3D protein structure from its sequence
Why protein structure? • In the factory of the living cell, proteins are the workers, performing a variety of tasks • Each protein adopts a particular folding pattern that determines its function • The 3D structure of a protein brings into close proximity residues that are far apart in the amino acid sequence
How does a protein fold? • Most newly synthesized proteins fold without assistance! • Ribonuclease A: the denatured protein could refold and recover its activity (C. Anfinsen, 1966) • “Structure implies function” • The amino acid sequence encodes the protein’s structural information
Understanding Protein Structure • A Quick Overview of Sequence Analysis • Finding a Structural Homologue • Template Selection • Aligning the Query Sequence to Template Structure(s) • Building the Model
The basics • Proteins are linear heteropolymers: one or more polypeptide chains • Repeat units: 20 amino acid residues • Chain lengths range from a few tens to thousands of residues • Three-dimensional shapes (“folds”) adopted vary enormously • Experimental methods: X-ray crystallography, electron microscopy and NMR (nuclear magnetic resonance)
The (L-)amino acid • [Figure: an L-amino acid, showing the backbone (amino group, Cα, carboxylate) and the side chain R = H, CH3, …]
Levels of protein structure • Zeroth: amino acid composition • Primary • This is simply the order of covalent linkages along the polypeptide chain, i.e. the sequence itself
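The zeroth level can be computed directly from a sequence. The following is a minimal Python sketch under stated assumptions: the function name `composition` is illustrative, and the example sequence is the query fragment from the WAP-domain alignment shown later in this deck.

```python
from collections import Counter

def composition(seq):
    """Return {residue: fraction} for a one-letter protein sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}

# Query fragment taken from the WAP-domain alignment later in the deck.
query = "KAGFCPWNLLQMISSTGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPP"
comp = composition(query)

# Print residues from most to least frequent.
for aa, frac in sorted(comp.items(), key=lambda kv: -kv[1]):
    print(f"{aa}: {frac:.1%}")
```

Composition discards residue order, which is why it sits below the primary structure in this hierarchy.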
Levels of protein structure • Secondary • Local organization of the protein backbone: α-helix, β-strand (which assemble into β-sheets), turn and interconnecting loop
Ramachandran / phi-psi plot • [Figure: plot of ψ against φ, with regions labelled β-sheet, α-helix (right handed) and α-helix (left handed)]
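The two axes of this plot are backbone torsion angles: φ is defined by the atoms C(i-1), N(i), Cα(i), C(i), and ψ by N(i), Cα(i), C(i), N(i+1). A minimal Python sketch of the torsion calculation follows; the helper names are illustrative, and the coordinates in the usage lines are synthetic points rather than atoms from a real structure.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in the range (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    b2_len = math.sqrt(_dot(b2, b2))
    b2_hat = tuple(x / b2_len for x in b2)
    m1 = _cross(b2_hat, n1)
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))

# Synthetic check: a planar zigzag (trans) and a planar U shape (cis).
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
print(abs(trans), cis)
```

For a residue i, φ would be `dihedral(C_prev, N, CA, C)` and ψ would be `dihedral(N, CA, C, N_next)`, where those four names stand for the corresponding atomic coordinates.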
Levels of protein structure • Tertiary • packing of secondary structure elements into a compact spatial unit • “Fold” or domain – this is the level to which structure prediction is currently possible
Levels of protein structure • Quaternary • Assembly of homo- or heteromeric protein chains • Usually the functional unit of a protein, especially for enzymes
Structural classes • All-α (helical) • All-β (sheet)
Structural classes • α/β (parallel β-sheet) • α+β (antiparallel β-sheet)
Structural information • Protein Data Bank: maintained by the Research Collaboratory for Structural Bioinformatics • http://www.rcsb.org/pdb • > 74,888 structures of proteins • Also contains structures of DNA, carbohydrates, protein-DNA complexes and numerous small ligand molecules.
The PDB data • Text files • Each entry is identified by a unique 4-character code, e.g. 1emg • The 1emg entry contains: • Header information • Atomic coordinates in Å (1 Ångstrom = 1.0e-10 m)
PDB Header details • identifies the molecule, any modifications, date of release of PDB entry • organism, keywords, method • Authors, reference, resolution if X-ray structure • Sequence, x-reference to sequence databases
HEADER GREENFLUORESCENT PROTEIN 12-NOV-98 1EMG
TITLE GREEN FLUORESCENT PROTEIN (65-67 REPLACED BY CRO, S65T
TITLE 2 SUBSTITUTION, Q80R)
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: GREEN FLUORESCENT PROTEIN;
COMPND 3 CHAIN: A;
COMPND 4 ENGINEERED: YES;
COMPND 5 MUTATION: 65 - 67 REPLACED BY CRO, S65T SUBSTITUTION, Q80R
COMPND 6 SUBSTITUTION;
COMPND 7 BIOLOGICAL_UNIT: MONOMER
The data itself • Coordinates for each heavy (non-hydrogen) atom from the first residue to the last • Any ligands (starting with HETATM) follow the biomacromolecule • O of water molecules (also HETATM) at the end
ATOM 1 N SER A 2 29.089 9.397 51.904 1.00 81.75
ATOM 2 CA SER A 2 27.883 10.162 52.185 1.00 79.71
ATOM 3 C SER A 2 26.659 9.634 51.463 1.00 82.64
ATOM 4 O SER A 2 26.718 8.686 50.686 1.00 81.02
ATOM 5 CB SER A 2 28.039 11.660 51.932 1.00 75.59
ATOM 6 OG SER A 2 27.582 12.038 50.639 1.00 43.28
-------
ATOM 1737 CD1 ILE A 229 39.535 21.584 52.346 1.00 41.62
TER 1738 ILE A 229
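The coordinate records above can be read programmatically. Here is a minimal Python sketch that parses the whitespace-separated records exactly as printed on this slide and measures one bond length. It rests on an assumption: real PDB files are fixed-column, and adjacent fields can run together, so production code should slice by column or use an established parser instead of `split()`. The function and key names are illustrative.

```python
import math

# First three records from the slide, whitespace-separated as shown there.
records = """\
ATOM 1 N SER A 2 29.089 9.397 51.904 1.00 81.75
ATOM 2 CA SER A 2 27.883 10.162 52.185 1.00 79.71
ATOM 3 C SER A 2 26.659 9.634 51.463 1.00 82.64
"""

def parse_atom(line):
    """Parse one ATOM/HETATM record in the simplified layout shown on the slide."""
    f = line.split()
    return {
        "serial": int(f[1]), "name": f[2], "res": f[3], "chain": f[4],
        "resseq": int(f[5]),
        "xyz": (float(f[6]), float(f[7]), float(f[8])),
        "occupancy": float(f[9]), "bfactor": float(f[10]),
    }

atoms = [parse_atom(l) for l in records.splitlines()
         if l.startswith(("ATOM", "HETATM"))]
by_name = {a["name"]: a for a in atoms}

# Distance between the backbone N and CA atoms of SER 2, in Å.
d = math.dist(by_name["N"]["xyz"], by_name["CA"]["xyz"])
print(f"N-CA distance: {d:.3f} Å")
```

The last two numeric fields are the occupancy and the temperature (B) factor; a high B-factor, as on the first residue here, indicates a poorly ordered atom.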
Structural Families • SCOP - Structural Classification Of Proteins • http://scop.mrc-lmb.cam.ac.uk/scop • FSSP – Family of Structurally Similar Proteins • http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+FSSP • CATH – Class, Architecture, Topology, Homology • http://www.cathdb.info/
Structure comparison facts • Proteins adopt a limited number of topologies. • Homologous sequences show very similar structures, with strong conservation in secondary structural elements: variations in non-conserved regions. • In the absence of sequence homology, some folds are preferred by vastly different sequences.
Structure comparison facts • The “active site” (a collection of functionally critical residues) is remarkably conserved, even when the protein fold is different. • Structural models (especially those based on homology) provide insights into possible function for new proteins. • Implications for: • protein engineering • ligand/drug design • function assignment of genomic data
Visualizing PDB information • RASMOL: most popular, available for all platforms (Sayle et al, 2005) http://www.bernstein-plus-sons.com/software/rasmol • DeepView Swiss-PDBViewer: from Swiss-Prot (Guex & Peitsch, 1997) http://spdbv.vital-it.ch/ • Chemscape Chime Plug-in: for PC and Mac http://accelrys.com/resource-center/downloads/freeware/ • PyMOL: Available for all platforms (DeLano, W.L. The PyMOL Molecular Graphics System, 2002) http://www.pymol.org/ • ICM: Very good, available for all platforms (Abagyan et al, 1994) http://www.molsoft.com/index.html
RASMOL views - SH2 domain • All-atom model • Space-filling model • Atom colors: N, O, C, S
RASMOL views – 1sha • Cα trace • Ribbon • Rainbow coloring: N to C • Coloring: by structural units
Homologous folds • Hemoglobin and erythrocruorin: 31% sequence identity
Analogous folds • Hemoglobin and phycocyanin: 9% sequence identity
Surface Properties Cro repressor – DNA complex • Basic residues in blue • Acidic residues in red
Mapping Functional Regions • Immunoglobulin λ light chain dimer • Hydrophobic residues in magenta • Hydrophilic and charged residues in cyan
Siblings and Cousins • Siblings or homologues: sequences with at least 30% sequence identity over an alignment length of at least 125 residues and conservation of function. • Cousins or paralogues: < 30% identity but with conservation of function • Both show structural conservation • Homologues located using a database search tool such as BLAST (free webserver): http://www.ncbi.nlm.nih.gov/BLAST • Paralogues require a more sensitive method such as PSI-BLAST
Multiple Sequence Alignment • Finding the best way to match the residues of related sequences • Identical residues must be lined up • The rest should be arranged, based on • observed substitution in protein families • chemical similarity • charge similarity • Where it is impossible to get the residues to line up, the biological concept of insertion/deletion is invoked: the ‘gap’ in alignments
MSA Methods • CLUSTALW / CLUSTALX (Thompson et al, 1997): freely available for all platforms and one of the best alignment programs http://bips.u-strasbg.fr/fr/Documentation/ClustalX/ • MAXHOM (Sander & Schneider, 1991): alignment based on maximum homology; available via the PredictProtein webserver, free for academics http://www.predictprotein.org/ • MALIGN (Johnson et al, 1994): freely available web server that uses MALIGN, based on the structural alignment of protein families http://caps.ncbs.res.in/iws/malign.html
Alignment Checks • Conservation of functionally important residues: e.g. the catalytic triad (Asp-Ser-His) that is essential for serine proteinase activity • Line-up of structurally important residues: e.g. cysteines forming disulfide bonds • Overall, maximizing the alignment of “like” residues • Completely conserved residues usually indicate some conserved structural or functional role, especially buried charges
Sequence Motifs & Patterns • From the analysis of the alignment of protein families • Conserved sequence features, usually associated with a specific function • PROSITE (Hulo et al, 2006) database for protein “signature” patterns: http://prosite.expasy.org/
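PROSITE signature patterns use a compact syntax that maps closely onto regular expressions: `x` is any residue, `[ST]` lists allowed residues, `{P}` lists forbidden residues, `(n)` or `(n,m)` gives a repeat count, and `<` / `>` anchor the pattern to the N- or C-terminus. The following is a minimal Python sketch of that translation; the function name is illustrative, it covers only the syntax elements just listed, and the two test sequences are made up for the example.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core.lower() == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        repeat = ""
        if lo:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# N-glycosylation site pattern (PROSITE PS00001).
glyco = prosite_to_regex("N-{P}-[ST]-{P}")
print(glyco)
print(bool(re.search(glyco, "AANGSA")), bool(re.search(glyco, "AANPSA")))
```

Short patterns like this one match by chance in many sequences, so a hit is a prompt for closer inspection rather than proof of function.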
Aligned Sequence Families • From alignments of homologous sequences: • PRINTS • PRODOM: http://prodom.prabi.fr/prodom/current/html/home.php • From Hidden Markov Model based methods: • PFAM: http://pfam.sanger.ac.uk/
Protein Domains • Most proteins are composed of structural subunits called domains • A domain is a compact unit of protein structure, usually associated with a function. • It is usually a “fold” - in the case of monomeric soluble proteins. • A domain comprises normally only one protein chain: rare examples involving 2 chains are known. • Domains can be shared between different proteins: like a LEGO block
Protein Architectures • Beads-on-a-string: sequential location: tyrosine-protein kinase receptor TIE-1 (immunoglobulin, EGF, fibronectin type-3 and protein kinase). • Domain insertions: “plugged-in” - pyruvate kinase (1pyk) • SMART: smart.embl-heidelberg.de Simple Modular Architecture Retrieval Tool
Dissection into Domains • Any sequence longer than about 125 residues should be routinely checked to see how many domains are present. • Conserved Domain Architecture Retrieval Tool (CDART) uses information in Pfam and SMART to assign domains along a sequence • E.g. NP_002917 shows similarity to G-protein regulators:
Structural Homologues • BLASTP vs. PDB database or PSI-BLAST: look for 4-character PDB ID • E < 0.005 • Domain coverage: at least 60% coverage is recommended • Gaps: we don’t want them. Choose between: • few gaps and reasonable similarity scores or • lots of gaps and high similarity scores?
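The two numeric criteria above (E < 0.005 and at least 60% coverage) are easy to apply as a filter over search hits. The following is a minimal Python sketch under stated assumptions: the `Hit` record and its field names are hypothetical and not a real BLAST output format, the hit IDs are placeholders, and ranking survivors by fewest gaps first is only one possible answer to the trade-off the slide poses.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    pdb_id: str        # placeholder identifier, not a real PDB entry
    evalue: float
    query_start: int   # 1-based, inclusive
    query_end: int
    gaps: int

def usable_templates(hits, query_length, max_evalue=0.005, min_coverage=0.60):
    """Keep hits passing the E-value and coverage thresholds from the slide."""
    kept = []
    for h in hits:
        coverage = (h.query_end - h.query_start + 1) / query_length
        if h.evalue < max_evalue and coverage >= min_coverage:
            kept.append((h, coverage))
    # Assumed preference: fewest gaps first, then lowest E-value.
    kept.sort(key=lambda hc: (hc[0].gaps, hc[0].evalue))
    return kept

hits = [
    Hit("tmp1", 1e-30, 1, 200, 2),
    Hit("tmp2", 1e-3, 1, 100, 0),    # fails coverage (50%)
    Hit("tmp3", 0.5, 1, 200, 0),     # fails E-value
]
for hit, cov in usable_templates(hits, query_length=200):
    print(hit.pdb_id, f"{cov:.0%}", hit.evalue)
```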
Small Proteins: Disulfide bonds • BLAST-type methods may not locate homologues, if Conserved Domain search is not turned on. • Are the Cys residues conserved? • Gaps: where are they on the structure?
gnl|Pfam|pfam00095, wap, WAP-type (Whey Acidic Protein) 'four-disulfide core'.
CD-Length = 46 residues, 100.0% aligned
Score = 43.9 bits (102), Expect = 1e-06
Q: 49 KAGFCPWNLLQMISSTGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPP 97
D:  1 KPGVCPWVSISE---AGQCLELNPCQSDEECPGNKKCCPGSCGMSCLTP 46
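Both questions on this slide (are the Cys residues conserved, and where are the gaps?) can be answered directly from the aligned strings. Here is a minimal Python sketch using the query (Q) and domain (D) rows exactly as shown above; the function and key names are illustrative.

```python
query  = "KAGFCPWNLLQMISSTGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPP"
domain = "KPGVCPWVSISE---AGQCLELNPCQSDEECPGNKKCCPGSCGMSCLTP"

def alignment_stats(a, b):
    """Summarize a pairwise alignment given as two equal-length gapped strings."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    pairs = list(zip(a, b))
    aligned = [(x, y) for x, y in pairs if x != "-" and y != "-"]
    return {
        "columns": len(pairs),
        "aligned": len(aligned),
        "identical": sum(x == y for x, y in aligned),
        # 0-based alignment columns where either row has a gap
        "gap_columns": [i for i, (x, y) in enumerate(pairs) if "-" in (x, y)],
        "cys_in_domain": sum(y == "C" for _, y in pairs),
        "cys_conserved": sum(x == "C" and y == "C" for x, y in pairs),
    }

stats = alignment_stats(query, domain)
print(stats)
print(f"identity: {stats['identical'] / stats['aligned']:.0%}")
```

If every domain cysteine is matched by a query cysteine, the four-disulfide core can plausibly be retained in a model; the gap columns can then be mapped onto the template structure to see whether they fall in a loop.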
Metal-binding domains C2H2 Zinc Finger • 2 Cys & 2 His binding to Zinc • Not detected even by CD-search in BLAST • Detected by Pfam & SMART • Sequence Pattern: #-X-C-X(1-5)-C-X3-#-X5-#-X2-H-X(3-6)-[H/C]
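The sequence pattern above can be written as a regular expression. The following is a minimal Python sketch under stated assumptions: the slide does not define `#`, so it is taken here to mean a hydrophobic residue from the set A, V, I, L, M, F, W, Y, and the test sequence is an illustrative finger-like string, not one from the source.

```python
import re

# Assumption: '#' on the slide denotes a hydrophobic residue.
HYDROPHOBIC = "AVILMFWY"
H = f"[{HYDROPHOBIC}]"

# #-X-C-X(1-5)-C-X3-#-X5-#-X2-H-X(3-6)-[H/C]
C2H2 = re.compile(
    f"{H}.C.{{1,5}}C.{{3}}{H}.{{5}}{H}.{{2}}H.{{3,6}}[HC]"
)

def find_zinc_fingers(seq):
    """Return (start, end, matched_text) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in C2H2.finditer(seq)]

# Illustrative sequence only.
example = "MSYKCPECGKSFSQKSNLQKHQRTHTG"
print(find_zinc_fingers(example))
```

A different definition of `#` changes which sequences match, so the residue set is the first thing to adjust if known zinc fingers are missed.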
Structure Prediction Methods • Secondary Structure Prediction: identify local structural elements such as helices, strands and loops. • > 75% accuracy achievable • PredictProtein or PHD http://www.predictprotein.org/ • PSIPRED http://bioinf.cs.ucl.ac.uk/psipred/ • SSPro http://scratch.proteomics.ics.uci.edu/
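The accuracy figure above is assumed here to refer to per-residue three-state accuracy (commonly called Q3): the fraction of residues whose predicted state (helix H, strand E, coil C) matches the observed one. A minimal Python sketch, with made-up state strings for illustration:

```python
def q3(predicted, observed):
    """Fraction of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Made-up example: H = helix, E = strand, C = coil.
observed  = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHHCCCEEECCC"
print(f"Q3 = {q3(predicted, observed):.1%}")
```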
Folds from Secondary Structure Predictions • Assembling secondary structure elements (SSEs) into folds is a combinatorial problem • Current methods depend on available structural data for mapping predictions: • FORESST http://abs.cit.nih.gov/foresst/foresst.html • TOPITS from the PHD server http://www.rostlab.org/papers/1995_topits/
Tertiary Structure Prediction • Fold recognition/Threading: < 20% identity typically • Best results obtained by combining several database search and knowledge-based tools: • 3D-PSSM http://www.sbg.bio.ic.ac.uk/~3dpssm/ • FUGUE http://tardis.nibio.go.jp/fugue/
One or many templates? • Sequence similarity: extract template sequences and align with query: select the most similar structure • Completeness: Missing data?
REMARK 465 MISSING RESIDUES
REMARK 465 THE FOLLOWING RESIDUES WERE NOT LOCATED IN THE
REMARK 465 EXPERIMENT. (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 465 IDENTIFIER; SSSEQ=SEQUENCE NUMBER; I=INSERTION CODE.)
REMARK 465
REMARK 465 M RES C SSSEQI
REMARK 465 MET A 1
REMARK 465 THR A 230
REMARK 470 M RES CSSEQI ATOMS
REMARK 470 GLU A 5 OE2
REMARK 470 GLU A 6 CG CD OE1 OE2
REMARK 470 GLU A 17 OE1
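The completeness check can be automated by collecting the residues listed under REMARK 465. Here is a minimal Python sketch that works on whitespace-separated lines as printed on this slide. It makes assumptions: residues are recognized by a three-field "name chain number" layout (optionally preceded by a model number), and insertion codes are not handled. The function name is illustrative.

```python
header = """\
REMARK 465 MISSING RESIDUES
REMARK 465 M RES C SSSEQI
REMARK 465 MET A 1
REMARK 465 THR A 230
REMARK 470 M RES CSSEQI ATOMS
REMARK 470 GLU A 5 OE2
"""

def missing_residues(lines):
    """Return (residue_name, chain, number) tuples listed under REMARK 465."""
    missing = []
    for line in lines:
        fields = line.split()
        if fields[:2] != ["REMARK", "465"]:
            continue
        rest = fields[2:]
        if len(rest) == 4 and rest[0].isdigit():
            rest = rest[1:]                  # drop a leading model number
        if (len(rest) == 3 and len(rest[1]) == 1
                and rest[2].lstrip("-").isdigit()):
            missing.append((rest[0], rest[1], int(rest[2])))
    return missing

gaps = missing_residues(header.splitlines())
print(gaps)
```

Here the unobserved residues are the two termini, which agrees with the coordinate records running from SER 2 to ILE 229; missing residues in the middle of a chain would be a stronger reason to look for another template.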
One or many templates? • X-ray or NMR? • Prefer the X-ray structure with the best resolution (lowest value in Å) • Prefer X-ray structures over NMR structures • For NMR, use the average over the ensemble • One or many? • Superimpose the templates on their Cα atoms • If two templates are very close, keep only one • Keep templates that provide new information
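These selection rules can be sketched as a ranking step followed by a redundancy filter. The following is a minimal Python sketch under stated assumptions: the template records and IDs are hypothetical, the pairwise RMSD values are supplied by the caller (superposition itself is not shown), and the 1.0 Å cutoff for "very close" is a placeholder because the slide gives no number.

```python
def rank_templates(templates):
    """X-ray structures first, ordered by resolution in Å (smaller is better)."""
    def key(t):
        is_xray = t["method"] == "X-RAY"
        resolution = t.get("resolution") if is_xray else None
        return (0 if is_xray else 1,
                resolution if resolution is not None else float("inf"))
    return sorted(templates, key=key)

def select_templates(templates, rmsd, cutoff=1.0):
    """Greedy filter: skip a template lying within `cutoff` Å of one already chosen.

    rmsd(id_a, id_b) must return the Cα RMSD in Å after superposition.
    """
    chosen = []
    for t in rank_templates(templates):
        if all(rmsd(t["id"], c["id"]) > cutoff for c in chosen):
            chosen.append(t)
    return chosen

# Hypothetical templates and pairwise RMSD values.
templates = [
    {"id": "nmr1", "method": "NMR"},
    {"id": "x2", "method": "X-RAY", "resolution": 2.5},
    {"id": "x1", "method": "X-RAY", "resolution": 1.8},
]
table = {frozenset(("x1", "x2")): 0.4,
         frozenset(("x1", "nmr1")): 3.0,
         frozenset(("x2", "nmr1")): 3.0}
picked = select_templates(templates, lambda a, b: table[frozenset((a, b))])
print([t["id"] for t in picked])
```

Because the filter walks the ranked list, the better-resolved structure of any near-identical pair is the one that survives.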
Many templates • Sequence alignment from structure comparison of templates (SSA) can be different from a simple sequence alignment (SA). • For model building, • align templates structurally • extract the corresponding SSA