Presentation Transcript

  1. Bioinformatics Tools

  2. Bioinformatics • The use of computer science, mathematics, and information theory to model and analyze biological systems, especially systems involving genetic material. • Things I can do with a computer to improve and accelerate my work.

  3. Applications of Bioinformatics • Manage and analyze data in very large databases • genetic info (DNA) • protein sequences, structures • Collections of scientific papers, experimental results • Compare sequences and structures • Do similar sequences or folding indicate proteins have similar functions? • Modeling and prediction • Predict 3D structure from known structures (homology) or from a computational approach that does not rely on a template structure (ab initio) • Prediction of function from structure • Molecular mechanics/molecular dynamics • Prediction of molecular interactions, docking • Perform energy minimization calculations • Predict useful mutations for protein engineering

  4. Sources of Data • Sequence databases (EBI) • FASTA (sequence similarity) • http://www.ebi.ac.uk/Tools/fasta33/ • SwissProt (database of protein sequences) • http://expasy.org/sprot/ • 3D structure database: the RCSB – PDB • http://www.rcsb.org/pdb/home/home.do

  5. Sequence Analyses • Sequence Alignment • Single or Multiple Sequences • Motif or Pattern Search • Prediction of Secondary Structure • Example FASTA record:
     >1E9N:A|PDBID|CHAIN|SEQUENCE
     MPKRGKKGAVAEDGDELRTEPEAKKSKTAAKKNDKEAAGEGPALYEDPPDQKTSPSGKPATLKICSWNVDGLRAWIKKKGLDWVKEEAPDILCLQETKCSENKLPAELQELPGLSHQYWLAL

  6. Sequence Alignment • Usually the first step in the analysis of any new/unidentified sequence is to compare it against sequence databases to find existing homologues. • This might give you some idea of: • How the protein might potentially fold • What other proteins it is related to • What its function might be • FASTA (http://www.ebi.ac.uk/Tools/fasta33/) • One of several web servers you can use for this. • Provides similarity search against a protein database. • Lets you select the substitution matrix (BLOSUM50, BLOSUM62, etc.) for the search. • A substitution matrix describes the rate at which one character in a sequence changes to other character states over time. • One would use a higher-numbered BLOSUM matrix for aligning two closely related sequences and a lower-numbered one for more divergent sequences.

  7. Gaps In Sequence Alignments • When aligning sequences, the score is affected by how much penalty is assigned to gaps in the sequence. • For larger gaps: • Assumes greater evolutionary distance between sequences • Probably should be assigned a higher penalty
     Smaller gap, smaller penalty:
     ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC
     ||||||||||||||||||||||||||| |||||||||||||
     ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC
     Larger gap, larger penalty:
     ATCTTCAGTGTTTCCCCTGTTTTGCCC....................ATTTAGTTCGCTC
     |||||||||||||||||||||||||||                    |||||||||||||
     ATCTTCAGTGTTTCCCCTGTTTTGCCCGCCCCCCCCCCCCCCCCCCCATTTAGTTCGCTC
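The effect of gap length on the score can be sketched with an affine gap penalty. This is only an illustration: `score_alignment` is a hypothetical helper, and the scoring values (match +1, mismatch -1, gap opening -5, gap extension -1) are assumptions chosen for the example, not values from the slide or from any particular server.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two pre-aligned, equal-length strings; '.' or '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a in ".-" or b in ".-":
            # The first gap column pays the opening cost; later columns
            # in the same run pay only the (assumed) extension cost.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# Flanking regions shared by both example alignments on the slide.
flank_left = "ATCTTCAGTGTTTCCCCTGTTTTGCCC"
flank_right = "ATTTAGTTCGCTC"

# One-base gap versus a twenty-base gap.
short_gap = score_alignment(flank_left + "." + flank_right,
                            flank_left + "G" + flank_right)
long_gap = score_alignment(flank_left + "." * 20 + flank_right,
                           flank_left + "G" + "C" * 19 + flank_right)
print(short_gap, long_gap)
```

With these assumed values the longer gap costs more than the shorter one, which is the behaviour the slide argues for.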

  8. Other Sequence Databases • BLAST and PSI-BLAST are also commonly used. • BLAST can be found at: • http://www.ncbi.nlm.nih.gov/BLAST/ • PSI-BLAST can be found at: • http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins&PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on

  9. BLAST Entry Window • Enter FASTA sequence or upload file • Choose your search set • Select program

  10. Results for mutant
      1exr Calmodulin At 1.68 Angstroms Resolution, Length=148
      Score = 267 bits (683), Expect = 2e-70, Method: Compositional matrix adjust.
      Identities = 142/153 (92%), Positives = 142/153 (92%), Gaps = 8/153 (5%)
      Query  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
                  AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
      Sbjct  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
      Query  61   GTIDFPEFLSLMARKMKEQDDQFDQSEEELIEAFKVFDRFFFGLISAAELRHV---LGEK  117
                  GTIDFPEFLSLMARKMKEQD     SEEELIEAFKVFDR   GLISAAELRHV   LGEK
      Sbjct  61   GTIDFPEFLSLMARKMKEQD-----SEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEK  115
      Query  118  LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK  150
                  LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK
      Sbjct  116  LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK  148
      Differences highlighted on the slide: Inserted (DQFDQ), Mutated (DGN to FFF), Deleted (MTN)
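The summary line of the BLAST result (identities, gaps, alignment length) can be recomputed from the aligned rows. The following is a minimal sketch; `alignment_stats` is a hypothetical helper, not part of BLAST, and the two strings are the Query and Sbjct rows from the slide joined end to end.

```python
def alignment_stats(query, subject):
    """Return (identities, gaps, length) for two gapped, equal-length strings."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    pairs = list(zip(query, subject))
    # A column is an identity when both rows carry the same residue.
    identities = sum(q == s and q != "-" for q, s in pairs)
    # A column is a gap when either row carries '-'.
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    return identities, gaps, len(pairs)


query = (
    "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"
    "GTIDFPEFLSLMARKMKEQDDQFDQSEEELIEAFKVFDRFFFGLISAAELRHV---LGEK"
    "LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK"
)
subject = (
    "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"
    "GTIDFPEFLSLMARKMKEQD-----SEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEK"
    "LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK"
)

identities, gaps, length = alignment_stats(query, subject)
print(f"Identities = {identities}/{length}, Gaps = {gaps}/{length}")
```

The counts agree with the reported summary: the 153 columns comprise 8 gap columns (5 inserted plus 3 deleted residues), 3 mutated positions, and 142 identical positions.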

  11. Swiss-Prot/UniProt Database • Central hub for the collection of functional information on proteins. • amino acid sequence • protein name or description • taxonomic data and citation information • Access point to ProSite • http://www.expasy.ch/tools/scanprosite/ • Prosite identifies sequences and displays any associated motifs, or accepts a motif and returns related sequences.

  12. What are Motifs? • A different approach for incorporating multiple sequence information into a database search is to use a motif. • Motifs do not assign a score at every position in an alignment, but describe the key residues that are conserved and define the family. Sometimes this is called a "signature". • Example of pseudo-EF-Hand motif (Calciomics – Pattern Search, developed in Dr. Yang’s lab) • [LMVITNF]-[FY]-X(2)-[YHIVF]-[SAITV]-X(5,9)-[LIMV]-X(3)-[EDS]-[LFM]-[KRQL]-X(20,28)-[LQKF]-[DNG]-X(1)-[DNSC]-X(1)-[DKN]-X(4)-[FY]-X(1)-[EKS] • Specific residues can also be excluded by enclosing them in curly brackets, e.g. {DE}
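A pattern written in this notation can be translated into a regular expression and run against a sequence. The sketch below is a simplified, hypothetical converter (`prosite_to_regex` is not the ScanProsite implementation); it handles only the elements used on the slide: bracketed alternatives, braced exclusions, the wildcard X, and repeat counts in parentheses.

```python
import re

# One pattern element: [ABC], {ABC}, a single residue letter or x/X,
# optionally followed by a repeat count (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a dash-separated PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.strip().split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue in ("X", "x"):
            token = "."                        # any residue
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"  # excluded residues
        else:
            token = residue                     # literal or [alternatives]
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


PSEUDO_EF_HAND = (
    "[LMVITNF]-[FY]-X(2)-[YHIVF]-[SAITV]-X(5,9)-[LIMV]-X(3)-[EDS]-[LFM]-[KRQL]"
    "-X(20,28)-[LQKF]-[DNG]-X(1)-[DNSC]-X(1)-[DKN]-X(4)-[FY]-X(1)-[EKS]"
)
ef_hand_regex = re.compile(prosite_to_regex(PSEUDO_EF_HAND))
print(ef_hand_regex.pattern)
```

Calling `ef_hand_regex.search(sequence)` on a protein sequence string then returns the first region matching the signature, or `None` if there is none.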

  13. Multiple Sequence Alignment (MSA) Alignments can provide information on: • domain structure • location of residues likely to be involved in protein function • Solvent exposure of residues • Evolutionary relationships • Build profiles for more sensitive searches What you can do with this information: • Create ‘signatures’ for pattern searching • Identify conserved vs. variable regions • Identify structural and/or functional motifs

  14. MSA with ClustalW http://www.ebi.ac.uk/Tools/clustalw2/index.html

  15. Ca-O-C angles for a) Non-EF-Hand and b) EF-Hand

  16. Distribution of SC angles [Figure: unrooted N-J phylogenetic tree generated by TreeView; groups R1 and R2 are marked; leaf labels include S100, Calbindin D9k, Penta-EF, Parvalbumin, Osteonectin, Polcalcin, 1WDC C, 2BL0 C, 2HQ8, 2H2K]

  17. Distribution of MC angles [Figure: unrooted N-J phylogenetic tree generated by TreeView; groups R1 and R2 are marked; leaf labels include S100, Calbindin D9k, Penta-EF, Parvalbumin, Osteonectin, Polcalcin, 1WDC C, 2BL0 C, 2HQ8, 2H2K]

  18. Secondary Structure: PDBSum • http://www.ebi.ac.uk/pdbsum/ • Predicted 2° structure from sequence • Either enter a PDB file or load a new/existing sequence

  19. Secondary Structure: PDBSum 2oky

  20. Protein Data Bank • http://www.rcsb.org/pdb/home/home.do • Comprehensive database of protein structures • Provides: • 3D structural data • Fasta sequence • Citation Info (who solved it, related publications, etc.) • experimental methods (X-Ray Diffraction, NMR) • resolution • classification (e.g. – metal transporter) • ligands, cofactors • Related PDB entries

  21. PDB ATOM/HETATM Record Format • Occupancy: Indicates the frequency with which an atom is detected in a specific location. Where occupancy < 1.00, x-ray diffraction indicates more than 1 position, i.e. there is flexibility or disorder. • B-Factor: Thermal motion of the atom. A high B-factor implies uncertainty.
      Data Record Partitioning (column ranges):
        1-6    Record name "ATOM  " or "HETATM"
        7-11   Atom serial number
        13-14  Chemical symbol (right justified)
        18-20  Residue name
        22     Chain identifier
        23-26  Residue sequence number
        31-38  X-coordinate
        39-46  Y-coordinate
        47-54  Z-coordinate
        55-60  Occupancy
        61-66  Isotropic B-factor
        77-78  Element symbol
      Text View of PDB File:
      ATOM      1  N   ALA A  43      69.834  21.345  42.623  1.00 76.76           N
      ATOM      2  CA  ALA A  43      69.016  22.376  41.988  1.00 72.63           C
      ATOM      3  C   ALA A  43      67.991  21.777  41.038  1.00 63.96           C
      ATOM      4  O   ALA A  43      66.942  22.368  40.784  1.00 56.68           O
      ATOM      5  CB  ALA A  43      69.924  23.339  41.198  1.00 72.97           C
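Because ATOM/HETATM records are fixed-width, they can be read by slicing each line at the column ranges listed on the slide. This is a minimal sketch: `Atom` and `parse_atom_line` are hypothetical names, and one detail is an assumption beyond the slide, namely that columns 13-16 are read together as the full atom name (e.g. "CA"), following the wwPDB format description.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line):
    """Parse one ATOM/HETATM record; slices are the 1-based columns minus one."""
    if line[0:6] not in ("ATOM  ", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),         # columns 7-11
        name=line[12:16].strip(),       # columns 13-16 (assumed full atom name)
        res_name=line[17:20].strip(),   # columns 18-20
        chain=line[21],                 # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
        occupancy=float(line[54:60]),   # columns 55-60
        b_factor=float(line[60:66]),    # columns 61-66
        element=line[76:78].strip(),    # columns 77-78
    )


# First record from the slide, assembled field by field so the widths are visible.
SAMPLE = (
    "ATOM  " "    1" " " " N  " " " "ALA" " " "A" "  43" " " "   "
    "  69.834" "  21.345" "  42.623" "  1.00" " 76.76" "          " " N"
)
atom = parse_atom_line(SAMPLE)
print(atom)
```

For real work an established parser (for example the one in Biopython) is usually the safer choice; the sketch only shows how the column table maps onto a record.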

  22. Pymol Viewer Can save session, including labels, angles, distances, etc. These features can be turned on or off without loss of data.

  23. Proteomics Tools: External tools to extract PDB Data • http://bip.weizmann.ac.il/oca-bin/lpccsu/ • LPC Analysis of interatomic Contacts in Ligand-Protein complexes • CSU Analysis of interatomic contacts in protein entries • OCA allows the user to rapidly search through the contents of the entire PDB Archive for entries obeying certain constraints • Ex. I want to find all proteins that have Zn2+ bound to structure, deposited in PDB between certain dates

  24. Revising the PDB File • Adding Hydrogen Atoms (Required for using Delphi) • Reduce (http://kinemage.biochem.duke.edu/software/reduce.php) • Runs on Mac, Linux, Windows • Free to download • Sybyl (http://www.tripos.com) • Runs on Linux • Not free • Calculating Electrostatic Potential • Delphi (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:DelPhi) • Runs on Mac, Linux, Windows (C and Fortran Compilers req’d) • Free to download

  25. Protein Structure: Adding Hydrogen SYBYL • In addition to adding hydrogen atoms to a PDB file, Sybyl can be used to compare structures, calculate RMSD values between structures, and perform minimization calculations.
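The RMSD between two structures is the square root of the mean squared distance between corresponding atoms. The sketch below illustrates only that formula; `rmsd` is a hypothetical helper, it assumes the two coordinate lists are already paired atom by atom and already superimposed, and it is not a description of how Sybyl computes the value.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example: the second structure is the first shifted by 1 Å along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(structure_a, structure_b))
```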

  26. Protein Structure Analysis • PONDR (Predictor of Naturally Disordered Regions) • (http://www.pondr.com/) • Internet-based • Not free • VADAR (Volume, Area, Dihedral Angle Reporter) • (http://redpoll.pharmacy.ualberta.ca/vadar/) • Internet-based • Free to use • Leigh Willard, Anuj Ranjan, Haiyan Zhang, Hassan Monzavi, Robert F. Boyko, Brian D. Sykes, and David S. Wishart, "VADAR: a web server for quantitative evaluation of protein structure quality", Nucleic Acids Res. 2003 July 1; 31 (13): 3316-3319

  27. Protein Structure Analysis: PONDR Uses a series of neural network predictors (NNPs) that take sequence data and predict disorder (i.e. lack of a fixed 3° structure) in a given region.

  28. Protein Structure Analysis: VADAR • A compilation of 15+ algorithms for analyzing and assessing peptide and protein structures from PDB data. • Ramachandran plot: • Shows the possible combinations of phi and psi angles for residues in a protein, based on energy considerations. • Very useful for determining whether model structures are likely conformations • Disallowed regions involve steric clash (VDW distances) • Regions labelled on the plot: β-sheet, LH α-helix, RH α-helix (figure source: http://www.bmb.uga.edu/wampler/tutorial/prot2.html)
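Each point on a Ramachandran plot is a pair of backbone dihedral angles: phi is defined by the atoms C(i-1), N, CA, C and psi by N, CA, C, N(i+1). The sketch below shows how a dihedral angle can be computed from four atom positions using the standard atan2 formulation; `dihedral` is a hypothetical helper and the coordinates are toy values, not taken from a structure.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy geometry: the central bond runs along z; only the last atom moves.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

Applying the same function to the backbone atoms of each residue in a PDB file would give the (phi, psi) pairs that the plot displays.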

  29. Visualizing Electrostatic Potential: DelPhi and Grasp • DelPhi • (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:DelPhi) • SGI Unix • Free to download • GRASP (Graphical Representation and Analysis of Structural Properties) • (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:GRASP) • SGI Unix • Free to download

  30. Visualizing Electrostatic Potential: DelPhi and Grasp DelPhi takes as input a coordinate file of a molecule (PDB file), or equivalent data for geometrical objects, and calculates the electrostatic potential in and around the system using a finite-difference solution to the Poisson-Boltzmann equation. It produces a modified PDB file and an emap file as input to third-party visualization software (e.g. GRASP). GRASP then displays and manipulates the surfaces of molecules and their electrostatic properties.

  31. Proteomics Tools: GetArea 1.1 • http://pauli.utmb.edu/cgi-bin/get_a_form.tcl • Quickly calculates the solvent accessible surface area or solvation energy of a protein molecule. • Ex. Is a proposed metal-binding site solvent accessible? • Output shown: Total Area; Area by Residue

  32. Prediction and Design • Prediction of protein functional site • Prediction of protein structure • Design of protein functional site • Design of protein structure • Why prediction and design?

  33. Protein Structure Prediction • Modeller • http://www.salilab.org/modeller/ • Homology modeling • Tasser • http://zhang.bioinformatics.ku.edu/I-TASSER/ • Threading • Rosetta • http://robetta.bakerlab.org/ • Ab initio • CASP (many others) • http://predictioncenter.org/ • A center providing objective testing of prediction programs

  34. Protein Structure Homology Modeling: SWISS-MODEL • http://swissmodel.expasy.org/ • Submit a FASTA sequence (known or unknown) • Swiss-Model conducts a BLAST search to align the sequence with known structures • Builds a 3D output model that can be viewed using DeepView (expasy) • Graphic file can be saved • Many other features, including alignment modeling with MSAs. [Figure: 1EXR.pdb viewed using DeepView]

  35. Protein Structure Homology Modeling: PredictProtein • Similar to Swiss-Model, Modeller • Requires registration/login

  36. Protein Structure Homology Modeling: Modeller

  37. Modeller Šali and Blundell, JMB, 1993, Comparative protein modeling by satisfaction of spatial restraints

  38. TASSER 1. Find templates (sequences with known structure) that share sequence similarity (global or local) with the query sequence. 2. Based on 1, the query sequence is divided into aligned segments (which have a template) and unaligned segments. 3. Use a Monte Carlo method to connect the aligned segments. 4. The outputs (multiple possible structures) are clustered and the final structure is obtained. Zhang and Skolnick, PNAS, 2004. Automated structure prediction of weakly homologous proteins on a genomic scale

  39. Rosetta 1. Construct a fragment library for each three- and nine-residue segment. The fragments are extracted from observed structures in the PDB. 2. Model the structure of the fragments from the library. 3. Connect the fragments. 4. Rank the predicted structures according to a scoring function.

  40. Programs for Predicting Metal Binding Site • FEATURE • http://feature.stanford.edu/webfeature/ • Machine learning (Bayesian method) • MUG • http://chemistry.gsu.edu/faculty/Yang/Calciomics.htm • Geometric search to predict calcium binding site • CHED • http://ligin.weizmann.ac.il/ched • Combine machine learning and geometric search to predict zinc and other transition metal binding sites.

  41. FEATURE • 1. Designed and tested their algorithm on protein holo structures. • 2. The protein structure is embedded into a 3D grid. • 3. Each grid point is evaluated by probability scoring function (Wei and Altman) • 4. The points of high score are the predicted Ca2+ location Wei and Altman, Protein Science, 1998

  42. MUG [Figure: Ca2+ coordination by carboxylate oxygens, monodentate vs. bidentate; geometric quantities shown include the Ca-O-C angle and dist(C,O) - dist(Ca,C); observation panels A-D; filters < 6.0 Å] Wang, Kirberger, Qiu, Chen and Yang, Proteins, 2009

  43. CHED • 1. Use protein apo structures • 2. Geometric search for a qualified triad of C, H, E, D residues • 3. Side-chain rotation of an unqualified triad • 4. Apply filters to the resulting qualified triad to classify it as a binding or non-binding triad [Flowchart: triad distances d1, d2, d3; qualified / unqualified; output binding/non-binding triad] Babor et al., Proteins, 2008

  44. Design Program • DEZYMER (Hellinga) • Given a ligand and a protein with known structure, suggest residues to be mutated so that the resulting protein binds the ligand. • ORBIT (Mayo) • Given a backbone structure, design a sequence such that it folds to that backbone. • Rosetta (Baker) • One program to treat diverse problems • Prediction and design

  45. DEZYMER 1. Define the expected binding geometry 2. Find backbone positions where, if appropriate side chains are added, the predefined geometry is satisfied 3. Place the side chains and ligand, and optimize their positions 4. Repack residues in positions other than the binding residues. If necessary, change the residue type. Hellinga and Richards, JMB, 1991. Construction of new ligand binding sites in proteins of known structure

  46. ORBIT 1. Divide the target structure into three parts: core, surface and boundary 2. Core: Ala, Val, Leu, Ile, Phe, Tyr, Trp; Surface: Ala, Ser, Thr, His, Asp, Asn, Glu, Gln, Lys, and Arg; Boundary: union of the above two 3. 1.9 × 10^27 possible sequences 4. Select the best sequence efficiently, using dead-end elimination (DEE) [Figures: solution structure of the designed protein, stereoview showing the best-fit superposition of the 41 ...; comparison between the designed backbone (averaged NMR structure, blue) and the target backbone (red)] Dahiyat and Mayo, Science, 1997. De Novo Protein Design: Fully Automated Sequence Selection
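The size of the sequence space quoted for ORBIT (about 1.9 × 10^27) follows from multiplying the number of allowed residue types at every designed position. The sketch below reproduces that order of magnitude. The residue sets are those listed on the slide; the position counts (1 core, 7 boundary, 18 surface) are an assumption that is not stated on the slide, used here because they reproduce the quoted figure.

```python
core = ["Ala", "Val", "Leu", "Ile", "Phe", "Tyr", "Trp"]
surface = ["Ala", "Ser", "Thr", "His", "Asp", "Asn", "Glu", "Gln", "Lys", "Arg"]
# Boundary positions may take any core or surface residue (Ala is in both sets).
boundary = sorted(set(core) | set(surface))

# ASSUMPTION: number of designed positions per class; not given on the slide.
assumed_positions = {"core": 1, "boundary": 7, "surface": 18}

n_sequences = (
    len(core) ** assumed_positions["core"]
    * len(boundary) ** assumed_positions["boundary"]
    * len(surface) ** assumed_positions["surface"]
)
print(f"{float(n_sequences):.1e}")
```

A space of this size cannot be enumerated sequence by sequence, which is why a pruning method such as dead-end elimination is needed in step 4.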

  47. Supplemental Slides

  48. Calciomics • Calciomics is a specialized area of biochemistry focusing on the study of calcium-binding biological macromolecules and proteins to understand the factors that contribute to calcium-binding affinity and the selectivity of proteins and calcium-dependent conformational change. • http://lithium.gsu.edu/faculty/Yang/Calciomics.htm