Bioinformatics Tools


Bioinformatics • The use of computer science, mathematics, and information theory to model and analyze biological systems, especially systems involving genetic material. • Things I can do with a computer to improve and accelerate my work.

Applications of Bioinformatics • Manage and analyze data in very large databases • genetic info (DNA) • protein sequences, structures • Collections of scientific papers, experimental results • Compare sequences and structures • Do similar sequences or folds indicate that proteins have similar functions? • Modeling and prediction • Predict 3D structure from known structures (homology) or from a computational approach that uses no template structure (ab initio) • Prediction of function from structure • Molecular mechanics/molecular dynamics • Prediction of molecular interactions, docking • Perform energy minimization calculations • Predict useful mutations for protein engineering

Sources of Data • Sequence databases (EBI) • FASTA (sequence similarity) • http://www.ebi.ac.uk/Tools/fasta33/ • SwissProt (database of protein sequences) • http://expasy.org/sprot/ • 3D structure database: the RCSB – PDB • http://www.rcsb.org/pdb/home/home.do

Sequence Analyses • Sequence Alignment • Single or Multiple Sequences • Motif or Pattern Search • Prediction of Secondary Structure • Example sequence in FASTA format:
>1E9N:A|PDBID|CHAIN|SEQUENCE
MPKRGKKGAVAEDGDELRTEPEAKKSKTAAKKNDKEAAGEGPALYEDPPDQKTSPSGKPATLKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLPAELQELPGLSHQYWLAL

Sequence Alignment • Usually the first step in analyzing any new or unidentified sequence is to compare it against sequence databases to find existing homologues. • This might give you some idea of: • How the protein might fold • What other proteins it is related to • What its function might be • FASTA (http://www.ebi.ac.uk/Tools/fasta33/) • One of several web servers you can use for this. • Provides a similarity search against a protein database. • Lets you select the substitution matrix (BLOSUM50, BLOSUM62, etc.) for the search. • A substitution matrix describes the rate at which one character in a sequence changes to other character states over time. • Use a higher-numbered BLOSUM matrix to align two closely related sequences and a lower-numbered one for more divergent sequences.

Gaps In Sequence Alignments • When aligning sequences, the score is affected by how much penalty is assigned to gaps. • Larger gaps: • Assume greater evolutionary distance between the sequences • Should probably be assigned a higher penalty
Smaller gap, smaller penalty:
ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC
||||||||||||||||||||||||||| |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC
Larger gap, larger penalty:
ATCTTCAGTGTTTCCCCTGTTTTGCCC....................ATTTAGTTCGCTC
|||||||||||||||||||||||||||                    |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGCCCCCCCCCCCCCCCCCCCATTTAGTTCGCTC
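To make the effect of gap length concrete, here is a minimal Python sketch of one common scheme, an affine gap penalty, in which opening a gap costs more than extending it. The slides do not name a specific scheme, so the choice of affine penalties, the function names, and the values (10 to open, 0.5 per extra position) are all illustrative assumptions, not the defaults of any particular server.

```python
def affine_gap_penalty(gap_length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one contiguous gap: an opening cost plus a per-position
    extension cost. gap_open and gap_extend are hypothetical example values."""
    if gap_length <= 0:
        return 0.0
    return gap_open + gap_extend * (gap_length - 1)


def gap_lengths(aligned_row, gap_char="."):
    """Return the length of each run of gap characters in one aligned row."""
    runs, current = [], 0
    for ch in aligned_row:
        if ch == gap_char:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# The two gapped rows from the slide (gaps written as ".").
short_gap = "ATCTTCAGTGTTTCCCCTGTTTTGCCC" + "." * 1 + "ATTTAGTTCGCTC"
long_gap = "ATCTTCAGTGTTTCCCCTGTTTTGCCC" + "." * 20 + "ATTTAGTTCGCTC"

for name, row in [("short", short_gap), ("long", long_gap)]:
    total = sum(affine_gap_penalty(n) for n in gap_lengths(row))
    print(name, gap_lengths(row), total)
```

Under these assumed values the 20-position gap is penalized more than the 1-position gap, but far less than 20 separate gaps would be.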

Other Sequence Search Tools • BLAST and PSI-BLAST are also commonly used. • BLAST can be found at: • http://www.ncbi.nlm.nih.gov/BLAST/ • PSI-BLAST can be found at: • http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins&PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on

BLAST Entry Window • Enter FASTA sequence or upload file • Choose your search set • Select program

Results for mutant 1exr
Calmodulin At 1.68 Angstroms Resolution Length=148
Score = 267 bits (683), Expect = 2e-70, Method: Compositional matrix adjust.
Identities = 142/153 (92%), Positives = 142/153 (92%), Gaps = 8/153 (5%)
Query 1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 60
           AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct 1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 60
Query 61   GTIDFPEFLSLMARKMKEQDDQFDQSEEELIEAFKVFDRFFFGLISAAELRHV---LGEK 117
           GTIDFPEFLSLMARKMKEQD     SEEELIEAFKVFDR   GLISAAELRHV   LGEK
Sbjct 61   GTIDFPEFLSLMARKMKEQD-----SEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEK 115
Query 118  LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK 150
           LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK
Sbjct 116  LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK 148
Slide annotations: Inserted (DQFDQ in the query), Mutated (DGN to FFF), Deleted (MTN absent from the query)
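The Identities and Gaps figures in the report can be recomputed directly from the aligned rows. The following Python sketch does that for the alignment shown above; the function name is hypothetical, and the use of integer division for the percentages is an assumption chosen only because it matches the values printed in the report.

```python
def alignment_stats(query_rows, sbjct_rows, gap_char="-"):
    """Count identities, gap columns, and total columns over paired aligned rows."""
    identities = gaps = length = 0
    for q_row, s_row in zip(query_rows, sbjct_rows):
        if len(q_row) != len(s_row):
            raise ValueError("aligned rows must have equal length")
        for q, s in zip(q_row, s_row):
            length += 1
            if q == gap_char or s == gap_char:
                gaps += 1          # a gap in either sequence
            elif q == s:
                identities += 1    # same residue in both sequences
    return identities, gaps, length


# Aligned rows copied from the BLAST report on the slide.
query = [
    "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN",
    "GTIDFPEFLSLMARKMKEQDDQFDQSEEELIEAFKVFDRFFFGLISAAELRHV---LGEK",
    "LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK",
]
sbjct = [
    "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN",
    "GTIDFPEFLSLMARKMKEQD-----SEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEK",
    "LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK",
]

identities, gaps, length = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{length} ({identities * 100 // length}%)")
print(f"Gaps = {gaps}/{length} ({gaps * 100 // length}%)")
```

The three columns that are neither identical nor gapped correspond to the mutated residues (FFF in the query against DGN in the subject).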

Swiss-Prot/UniProt Database • Central hub for the collection of functional information on proteins. • amino acid sequence • protein name or description • taxonomic data and citation information • Access point to PROSITE • http://www.expasy.ch/tools/scanprosite/ • PROSITE scans a sequence and displays any associated motifs, or accepts a motif and returns related sequences.

What are Motifs? • A different approach for incorporating multiple sequence information into a database search is to use a motif. • Motifs do not assign a score at every position in an alignment, but describe the key residues that are conserved and define the family. This is sometimes called a "signature". • Example of pseudo-EF-Hand motif (Calciomics – Pattern Search, developed in Dr. Yang’s lab) • [LMVITNF]-[FY]-X(2)-[YHIVF]-[SAITV]-X(5,9)-[LIMV]-X(3)-[EDS]-[LFM]-[KRQL]-X(20,28)-[LQKF]-[DNG]-X(1)-[DNSC]-X(1)-[DKN]-X(4)-[FY]-X(1)-[EKS] • Specific residues can also be excluded by enclosing them in curly brackets, e.g. {DE}
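A motif written in this notation maps naturally onto a regular expression. The Python sketch below translates only the pattern elements shown above (square brackets, curly brackets, X, and repeat counts) and is not a full PROSITE parser; the function name, the short demo pattern, and the demo sequence are hypothetical.

```python
import re

# One pattern element: [ABC], {ABC}, or a single letter, with optional (n) or (m,n).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {DE} -> any residue except D or E
        elif core in ("X", "x"):
            core = "."                       # X -> any residue
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


# The pseudo-EF-hand pattern from the slide.
ef_hand = ("[LMVITNF]-[FY]-X(2)-[YHIVF]-[SAITV]-X(5,9)-[LIMV]-X(3)-[EDS]-[LFM]-"
           "[KRQL]-X(20,28)-[LQKF]-[DNG]-X(1)-[DNSC]-X(1)-[DKN]-X(4)-[FY]-X(1)-[EKS]")
ef_hand_regex = re.compile(prosite_to_regex(ef_hand))

# A short hypothetical pattern and sequence, just to show a match.
demo = re.compile(prosite_to_regex("[DN]-X(2)-{DE}-[FY]"))
hit = demo.search("AAADKKAFAA")
print(hit.group(), hit.start())
```

To scan a real protein, `ef_hand_regex.finditer(sequence)` would yield each non-overlapping match along with its position.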

Multiple Sequence Alignment (MSA) Alignments can provide information on: • domain structure • location of residues likely to be involved in protein function • Solvent exposure of residues • Evolutionary relationships • Build profiles for more sensitive searches What you can do with this information: • Create ‘signatures’ for pattern searching • Identify conserved vs. variable regions • Identify structural and/or functional motifs
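As a rough illustration of identifying conserved vs. variable regions, the Python sketch below scores each column of an alignment by the fraction of sequences that share its most common residue. The scoring rule, the function name, and the three-sequence toy alignment are assumptions made for illustration; real tools use more refined conservation measures.

```python
from collections import Counter


def column_conservation(aligned_seqs, gap_char="-"):
    """For each column, return the fraction of sequences carrying the most
    common residue. Gaps never count toward the most common residue."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res != gap_char)
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


# Hypothetical toy alignment (not taken from the slides).
toy_msa = ["DKDGDG", "DKDGNG", "DADGSG"]
scores = column_conservation(toy_msa)
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)
```

Fully conserved columns are candidates for a signature; columns with low scores mark variable positions that a pattern would leave as X.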

MSA with ClustalW http://www.ebi.ac.uk/Tools/clustalw2/index.html

Ca-O-C angles for a) Non-EF-Hand and b) EF-Hand

Distribution of SC angles • [Figure: unrooted N-J phylogenetic tree generated by TreeView, with regions R1 and R2 marked; leaf labels include S100, Calbindin D9k, Penta-EF, Parvalbumin, Osteonectin, Polcalcin, 1WDC C, 2BL0 C, 2HQ8, and 2H2K]

Distribution of MC angles • [Figure: unrooted N-J phylogenetic tree generated by TreeView, with regions R1 and R2 marked; leaf labels include S100, Calbindin D9k, Penta-EF, Parvalbumin, Osteonectin, Polcalcin, 1WDC C, 2BL0 C, 2HQ8, and 2H2K]

Secondary Structure: PDBSum • http://www.ebi.ac.uk/pdbsum/ • Predicted 2° structure from sequence • Either enter a PDB file or load a new/existing sequence

Secondary Structure: PDBSum 2oky

Protein Data Bank • http://www.rcsb.org/pdb/home/home.do • Comprehensive database of protein structures • Provides: • 3D structural data • FASTA sequence • Citation info (who solved it, related publications, etc.) • Experimental methods (X-ray diffraction, NMR) • Resolution • Classification (e.g. metal transporter) • Ligands, cofactors • Related PDB entries

PDB ATOM/HETATM Record Format • Occupancy: Indicates the frequency with which an atom is detected in a specific location. Where occupancy < 1.00, x-ray diffraction indicates more than one position, i.e. there is flexibility or disorder. • B-Factor: Thermal motion of the atom. A high B-factor implies uncertainty.
Data Record Partitioning
Columns  Field
1-6      Record name "ATOM  " or "HETATM"
7-11     Atom serial number
13-14    Chemical symbol (right justified)
18-20    Residue name
22       Chain identifier
23-26    Residue sequence number
31-38    X coordinate
39-46    Y coordinate
47-54    Z coordinate
55-60    Occupancy
61-66    Isotropic B-factor
77-78    Element symbol
Text View of PDB File
ATOM      1  N   ALA A  43      69.834  21.345  42.623  1.00 76.76           N
ATOM      2  CA  ALA A  43      69.016  22.376  41.988  1.00 72.63           C
ATOM      3  C   ALA A  43      67.991  21.777  41.038  1.00 63.96           C
ATOM      4  O   ALA A  43      66.942  22.368  40.784  1.00 56.68           O
ATOM      5  CB  ALA A  43      69.924  23.339  41.198  1.00 72.97           C
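Because the record is fixed-width, each field can be read by slicing the line at the column ranges listed above (1-based columns become 0-based Python slices). The sketch below is a minimal illustration with a hypothetical function name. It reads the full atom name from columns 13-16, the range the PDB format reserves for it and which contains the chemical symbol in columns 13-14. The sample line is assembled in pieces only so that the column positions are easy to verify.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM line into fields using fixed column positions."""
    if line[0:6] not in ("ATOM  ", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "record": line[0:6].strip(),       # columns 1-6
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
        "element": line[76:78].strip(),    # columns 77-78
    }


# First ATOM record from the slide, rebuilt with fixed-width spacing.
sample = (
    "ATOM      1  N   ALA A  43    "   # columns 1-30
    "  69.834  21.345  42.623"         # columns 31-54
    "  1.00 76.76"                     # columns 55-66
    + " " * 11 + "N"                   # columns 67-78
)

atom = parse_atom_record(sample)
print(atom["res_name"], atom["chain"], atom["res_seq"], atom["b_factor"])
```

Splitting on whitespace instead of slicing would break on real files, because adjacent fields can touch when values are wide or negative.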

PyMOL Viewer • Can save a session, including labels, angles, distances, etc. • These features can be turned on or off without loss of data.

Proteomics Tools: External tools to extract PDB Data • http://bip.weizmann.ac.il/oca-bin/lpccsu/ • LPC Analysis of interatomic Contacts in Ligand-Protein complexes • CSU Analysis of interatomic contacts in protein entries • OCA allows the user to rapidly search through the contents of the entire PDB Archive for entries obeying certain constraints • Ex. I want to find all proteins that have Zn2+ bound to structure, deposited in PDB between certain dates

Revising the PDB File • Adding Hydrogen Atoms (Required for using Delphi) • Reduce (http://kinemage.biochem.duke.edu/software/reduce.php) • Runs on Mac, Linux, Windows • Free to download • Sybyl (http://www.tripos.com) • Runs on Linux • Not free • Calculating Electrostatic Potential • Delphi (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:DelPhi) • Runs on Mac, Linux, Windows (C and Fortran Compilers req’d) • Free to download

Protein Structure: Adding Hydrogen SYBYL • In addition to adding hydrogen atoms to a PDB file, Sybyl can be used to compare structures, calculate RMSD values between structures, and perform minimization calculations.

Protein Structure Analysis • PONDR (Predictor of Naturally Disordered Regions) • (http://www.pondr.com/) • Internet-based • Not free • VADAR (Volume, Area, Dihedral Angle Reporter) • (http://redpoll.pharmacy.ualberta.ca/vadar/) • Internet-based • Free to use • Leigh Willard, Anuj Ranjan, Haiyan Zhang, Hassan Monzavi, Robert F. Boyko, Brian D. Sykes, and David S. Wishart, "VADAR: a web server for quantitative evaluation of protein structure quality", Nucleic Acids Res. 2003 July 1; 31(13): 3316-3319

Protein Structure Analysis: PONDR • Uses a series of neural network predictors (NNPs) that use sequence data to predict disorder (i.e. lack of fixed 3° structure) in a given region.

Protein Structure Analysis: VADAR • A compilation of 15+ algorithms for analyzing and assessing peptide and protein structures from PDB data. • Ramachandran plot: • Shows possible conformations of phi and psi angles for residues in a protein based on energy considerations. • Very useful for determining whether model structures are likely conformations • Disallowed regions involve steric clash (VDW distances) • [Figure: Ramachandran plot with the β-sheet, LH α-helix, and RH α-helix regions labeled; source: http://www.bmb.uga.edu/wampler/tutorial/prot2.html]
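Each point on a Ramachandran plot is a pair of backbone dihedral angles: phi is defined by the atoms C(i-1), N(i), CA(i), C(i), and psi by N(i), CA(i), C(i), N(i+1). The Python sketch below computes a signed dihedral angle from four 3D points using a standard projection formula. The function names and test coordinates are illustrative, and this is not a description of how VADAR itself computes the angles.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)                    # unit vector along the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))   # b0 projected off the bond axis
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))   # b2 projected off the bond axis
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Hypothetical coordinates: central bond along z, outer atoms in the xy directions.
cis = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
right_angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, right_angle, trans)
```

Applying this to the backbone atoms of consecutive residues in a PDB file gives the (phi, psi) pairs that the plot displays.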

Visualizing Electrostatic Potential: DelPhi and Grasp • DelPhi • (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:DelPhi) • SGI Unix • Free to download • GRASP (Graphical Representation and Analysis of Structural Properties) • (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:GRASP) • SGI Unix • Free to download

Visualizing Electrostatic Potential: DelPhi and Grasp • DelPhi takes as input a coordinate file of a molecule (PDB file), or equivalent data for geometrical objects, and calculates the electrostatic potential in and around the system using a finite-difference solution to the Poisson-Boltzmann equation. • It produces a modified PDB file and an emap file as input to third-party visualization software (e.g. GRASP). • GRASP then displays and manipulates the surfaces of molecules and their electrostatic properties.

Proteomics Tools: GetArea 1.1 • Quickly calculates the solvent accessible surface area or solvation energy of a protein molecule. • Ex. Is a proposed metal-binding site solvent accessible? • Output: Total Area, Area by Residue • http://pauli.utmb.edu/cgi-bin/get_a_form.tcl

Prediction and Design • Prediction of protein functional site • Prediction of protein structure • Design of protein functional site • Design of protein structure • Why prediction and design?

Protein Structure Prediction • Modeller • http://www.salilab.org/modeller/ • Homology modeling • Tasser • http://zhang.bioinformatics.ku.edu/I-TASSER/ • Threading • Rosetta • http://robetta.bakerlab.org/ • Ab initio • CASP (many others) • http://predictioncenter.org/ • A center providing objective testing of prediction programs

Protein Structure Homology Modeling: SWISS-MODEL • http://swissmodel.expasy.org/ • Submit a FASTA sequence (known or unknown) • Swiss-Model conducts a BLAST search to align the sequence with known structures • Builds a 3D output model that can be viewed using DeepView (expasy) • Graphic file can be saved • Many other features, including alignment modeling with MSAs. • [Figure: 1EXR.pdb viewed using DeepView]

Protein Structure Homology Modeling: PredictProtein • Similar to Swiss-Model and Modeller • Requires registration/login

Protein Structure Homology Modeling: Modeller

Modeller Šali and Blundell, JMB, 1993, Comparative protein modeling by satisfaction of spatial restraints

TASSER 1. Find templates (sequences with known structure) that share sequence similarity (global or local) with the query sequence. 2. Based on step 1, the query sequence is divided into aligned segments (which have a template) and unaligned segments. 3. A Monte Carlo method is used to connect the aligned segments. 4. The outputs (multiple possible structures) are clustered to obtain the final structure. Zhang and Skolnick, PNAS, 2004. Automated structure prediction of weakly homologous proteins on a genomic scale

Rosetta 1. Construct a fragment library for each three- and nine-residue segment. The fragments are extracted from observed structures in the PDB. 2. Model the structure of the fragments from the library. 3. Connect the fragments. 4. Rank the predicted structures according to a scoring function.

Programs for Predicting Metal Binding Sites • FEATURE • http://feature.stanford.edu/webfeature/ • Machine learning (Bayesian method) • MUG • http://chemistry.gsu.edu/faculty/Yang/Calciomics.htm • Geometric search to predict calcium binding sites • CHED • http://ligin.weizmann.ac.il/ched • Combines machine learning and geometric search to predict zinc and other transition metal binding sites.

FEATURE 1. The algorithm was designed and tested on protein holo structures. 2. The protein structure is embedded in a 3D grid. 3. Each grid point is evaluated by a probability scoring function (Wei and Altman). 4. The high-scoring points are the predicted Ca2+ locations. Wei and Altman, Protein Science, 1998

MUG • [Figure: Ca2+ coordination geometry with oxygen ligands; labels: Ca-O-C, dist(C,O) - dist(Ca,C), Monodentate, Bidentate, Observation; panels A, B, C, D; filters <6.0Å] • Wang, Kirberger, Qiu, Chen and Yang, Proteins, 2009

CHED 1. Use protein apo structures 2. Geometric search for a qualified triad of C, H, E, D 3. Side-chain rotation of an unqualified triad 4. Apply filters to the resulting qualified triad to classify it as a binding or non-binding triad • [Figure: triad distances d1, d2, d3; flow from qualified and unqualified triads to the output binding/nonbinding triad] • Babor et al., Proteins, 2008

Design Programs • DEZYMER (Hellinga) • Given a ligand and a protein with known structure, suggests residues to be mutated so that the resulting protein binds the ligand. • ORBIT (Mayo) • Given a backbone structure, designs a sequence that folds to that backbone. • Rosetta (Baker) • One program to treat diverse problems • Prediction and design

DEZYMER 1. Define the expected binding geometry. 2. Find backbone positions where, if appropriate side chains are added, the predefined geometry is satisfied. 3. Place the side chains and ligand, and optimize their positions. 4. Repack residues in positions other than the binding residues. If necessary, change the residue type. Hellinga and Richards, JMB, 1991. Construction of new ligand binding sites in proteins of known structure

ORBIT 1. Divide the target structure into three parts: core, surface and boundary 2. Core: Ala, Val, Leu, Ile, Phe, Tyr, Trp; Surface: Ala, Ser, Thr, His, Asp, Asn, Glu, Gln, Lys, and Arg; Boundary: union of the above two 3. 1.9 × 10^27 possible sequences 4. Select the best sequence efficiently, using dead-end elimination (DEE) • [Figures: Solution structure of the designed protein, stereoview showing the best-fit superposition of the 41 …; comparison between the designed backbone (averaged NMR structure, blue) and the target backbone (red)] • Dahiyat and Mayo, Science, 1997. De Novo Protein Design: Fully Automated Sequence Selection
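The size of the sequence space in step 3 follows from multiplying the number of allowed residues at each designed position. The Python sketch below works through that arithmetic. The residue sets come from step 2, but the position counts (1 core, 7 boundary, 18 surface) are an assumption not stated on the slide, chosen here because they reproduce the quoted figure.

```python
# Allowed residues per class, in one-letter codes, from step 2 above.
core = set("AVLIFYW")           # Ala, Val, Leu, Ile, Phe, Tyr, Trp
surface = set("ASTHDNEQKR")     # Ala, Ser, Thr, His, Asp, Asn, Glu, Gln, Lys, Arg
boundary = core | surface       # union of the two sets (Ala is shared)

# Assumed number of designed positions in each class (not given on the slide).
n_core, n_boundary, n_surface = 1, 7, 18

total = (len(core) ** n_core
         * len(boundary) ** n_boundary
         * len(surface) ** n_surface)
print(len(boundary), f"{total:.1e}")
```

A space of this size cannot be enumerated directly, which is why step 4 relies on dead-end elimination to discard options that cannot be part of the best solution.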

Supplemental Slides

Calciomics • Calciomics is a specialized area of biochemistry focusing on the study of calcium-binding biological macromolecules and proteins to understand the factors that contribute to calcium-binding affinity and the selectivity of proteins and calcium-dependent conformational change. • http://lithium.gsu.edu/faculty/Yang/Calciomics.htm
