What you need to simulate flexibility A “how to do it” guide Andrew Emerson, CINECA, Bologna (Italy).
What you need to simulate flexibility • A system model • Software • Hardware • Time Parma School 2007
I have a scientific problem. Can I use simulation? • Before embarking on a simulation it is useful to see what has been done and what is being done. • Is my problem feasible? Parma School 2007
Computer (MD) simulation – some landmarks • 1959: First MD simulation (Alder and Wainwright) • Hard spheres at constant velocity. 500 particles on an IBM-704. Simulation time >2 weeks • 1964: First MD with a continuous potential (A. Rahman) • Lennard-Jones spheres (argon), 864 particles on a CDC 3600. 50,000 timesteps in >3 weeks • 1977: First large biomolecule (McCammon, Gelin and Karplus) • Bovine Pancreatic Trypsin Inhibitor. 500 atoms, 9.2 ps • 1998: First μs simulation (Duan and Kollman) • Villin headpiece subdomain HP-36. Simulation time on Cray T3D/T3E ~ several months (1953: First Monte Carlo simulation, Metropolis et al.) Parma School 2007
MD simulation – current milestones • 2006: MD simulation of the complete satellite tobacco mosaic virus (STMV) • 1 million atoms, 50 ns using NAMD on 46 AMD and 128 Altix nodes • Found that the capsid is unstable without the RNA Parma School 2007
MD simulation – current milestones • 2006: Longest run. Folding@home (Jayachandran, V. Vishal and Vijay S. Pande + the general public) • 500 μs of the villin headpiece protein (34 residues). Many short trajectories combined by Markovian state models. Parma School 2007
Hardware milestones • 1975: First supercomputer, the Cray-1 • 80 Mflops, $9M • 2006: Most powerful supercomputer, BlueGene/L (LLNL) • 128K processors, 360 Tflops, 32 TB memory. • 1,024 gigabit-per-second links to a global parallel file system to support fast input/output to disk. • Blue Matter MD program, designed to make use of all Blue Gene processors, e.g. μs simulation of rhodopsin. (For comparison, folding@home: equivalent to ~712 Tflops, peak ~150 Tflops – Wikipedia) Parma School 2007
But for most people… For classical MD simulations: • Simulations on 2–128 processors (or cores), memory ca. 1–2 GB/processor (but memory is rarely a problem for MD) • Systems containing < 10^4 atoms (incl. solvent) • Trajectories < 100 ns Haven't included ab-initio MD, Car-Parrinello, mixed MM/QM, surfaces, solid-state … Parma School 2007
Overall procedure for classical MD [Workflow diagram: raw structure → preparation software (GUI/manual: add H, add missing residues, apply patches, correct atom types + bond orders) → clean structure + solvent → final structure; final structure + other files (connectivity, atom types) + standard force-field + additional ff params + MD control/params → MD engine → trajectory + text output → analysis → paper] Parma School 2007
Software • Force-field • Preparation of the system • Molecular dynamics engine • Analysis programs Parma School 2007
Preparation of the system • At this stage you need to decide on the force-field and MD engine, because some preparation software is more suitable than others. • Best to use the software provided or suggested by the makers of the force-field or MD engine. • Examples • VMD for NAMD or other CHARMM-based programs • Antechamber for Amber • Accelrys Materials Studio for Discover • Sybyl/Tripos • Mixing different software packages often results in files with incompatible atom types or definitions (TIP3 or HOH? LP?) Parma School 2007
Preparation of the system • Do we need a custom program at all? • Very often an editor or unix commands can make modifications or analyse structures more quickly (a small example script is sketched below): • sed -e 's/TIP3/HOH/g' file1.pdb > file2.pdb (change the water definition for the force-field) • grep -v 'HETATM' 3DPI.pdb > output/protein.pdb (remove non-protein segments) • grep -c TIP3 file1.pdb (no. of waters in a pdb) • grep ">" file.fasta | wc -l (no. of sequences in a FASTA file) • For more expert users, program libraries or toolkits are available for writing your own programs (CDK, OpenEye, Python, VMD/Tcl, …) Parma School 2007
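To make this concrete, here is a minimal sketch that strings the one-liners above into a small clean-up script. The file names (3DPI.pdb, the output/ directory) and the TIP3 → HOH renaming are just the examples from this slide; the residue names you actually need depend on the force-field chosen.

```bash
#!/bin/bash
# Sketch only: quick PDB clean-up with standard unix tools before handing the
# structure to a preparation program.  File names follow the slide's examples.
set -e
in=3DPI.pdb
mkdir -p output

# strip non-protein HETATM records (waters, ions, ligands)
grep -v 'HETATM' "$in" > output/protein.pdb

# rename water residues to match the force-field convention (TIP3 -> HOH)
sed -e 's/TIP3/HOH/g' "$in" > output/renamed.pdb

# quick sanity checks
echo "waters in original file: $(grep -c TIP3 "$in")"
echo "protein ATOM records   : $(grep -c '^ATOM' output/protein.pdb)"
```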
Common force fields • Amber (proteins, DNA, …) • CHARMM (proteins, DNA, macromolecules) • ESFF (many elements, incl. metals) • CVFF (small molecules and macromolecules) • GROMOS (proteins, nucleotides, sugars, etc., for GROMACS) • OPLS (parameters optimised for liquid simulations) • MMFF94 (broad range of small organic molecules, often used in docking) Parma School 2007
Force fields • But it must be remembered that force-fields are empirical, i.e. approximations designed to agree with quantum-chemical calculations (e.g. MMFF94) and/or experimental results such as calorimetry data. • This doesn't mean they are the "right" values. • Parameters or the energy expression may have to be modified: • from the literature • from QM calculations (e.g. VMD provides a file for calculating partial charges with Gaussian) Parma School 2007
CASE Study 1. Parameters for an unfolding simulation of BLG (bovine β-lactoglobulin) • Aim: to perform very long trajectories of BLG in urea (10 M) + water and in water only, using NAMD/CHARMM. • We selected BLG because its structure and physico-chemical properties have been extensively investigated, it is easy and cheap to purify, and it can be studied by DGGE in parallel. Parma School 2007
β-lactoglobulin • Ligand-binding protein (transport protein) • 1498 atoms • 152 amino acids Parma School 2007
Simulation of proteinL • First validate the method + parameters with a previously studied (and smaller) system. • We have chosen proteinL, an immunoglobulin-binding protein isolated from the bacterium Peptostreptococcus magnus. • System: protein (61 aa), water (3384 molecules), urea (8 M) + NaCl • Experiment and simulation (GROMACS/GROMOS) say it should unfold at 480 K (water) in less than 10 ns; in 10 M urea at 400 K it unfolds in ~20 ns (α-helix) [Guerini Rocco et al.] Parma School 2007
[Figure: DSSP secondary-structure prediction of proteinL at 480 K; classes shown: coil, β-sheet, β-bridge, bend, turn, 3-helix, α-helix, 5-helix. With thanks to Ivano Eberini] Parma School 2007
CASE Study 1. Unfolding simulation of proteinL • However, we wish to use NAMD/CHARMM (why?). • We have repeated these results using the "classical" urea parameters of Caflisch and Karplus (1995) with CHARMM22. Parma School 2007
NAMD/CHARMM simulations of proteinL [Figure: DSSP plot, 0–12 ns, water only at 480 K; structure shown after 12 ns] Parma School 2007
NAMD/CHARMM simulations of proteinL [Figure: DSSP plot, 0–20 ns, water + 10 M urea; structure shown after 20 ns] Parma School 2007
NAMD/CHARMM simulations of proteinL • i.e. the standard urea parameters are not suitable for a NAMD/CHARMM unfolding experiment • Checked the web + literature and came up with some refined urea parameters for CHARMM from Jorgensen et al., based on OPLS (thanks to the CHARMM forum) • Reran the simulation, using the VEGA package to prepare the .inp files for CHARMM Parma School 2007
Refined parameters for CHARMM/urea [Figure: DSSP plot, 0–16 ns, proteinL + 10 M urea + water] Parma School 2007
MD engines • NAMD (University of Illinois) • The best-performing code in terms of parallel scalability; claimed to scale up to thousands of processors. • Written in Charm++ (a C++ library), so difficult to modify for FORTRAN/C programmers. Lacks some features. • GROMACS (Groningen University) • Very fast on a single processor, many features, excellent analysis facilities. User-extensible. • Poor parallel scalability (≤16 processors reported). • Amber, CHARMM • Popular and feature-rich, with well-validated standard force-fields. • Scalability dependent on the system. • Commercial (e.g. Tripos, Accelrys, MOE, Maestro) • Usually come with convenient graphical interfaces. • Without the source code it is difficult to implement them effectively on custom hardware. (Typical launch commands are sketched below.) Parma School 2007
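To ground the comparison, the two free engines are typically driven from the command line roughly as follows. This is only a sketch: binary names, the MPI launcher and option names vary between installations and versions, and the input files (md.conf, md.mdp, conf.gro, topol.top) are hypothetical.

```bash
# NAMD: a single configuration file, launched here with NAMD's charmrun wrapper
# (some builds use mpirun instead)
charmrun +p16 namd2 md.conf > md.log

# GROMACS: preprocess topology + parameters into a run input file, then run
grompp -f md.mdp -c conf.gro -p topol.top -o topol.tpr
mpirun -np 16 mdrun_mpi -s topol.tpr -deffnm md
```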
Case study 2 – Molecular Dynamics simulation of the NFκB Transcription Factor • The Rel/NF-kB proteins belong to one of the most studied transcription-factor families and are involved in many normal cellular and organismal processes: immune and inflammatory responses, development, cellular growth and apoptosis. • The activity of NF-kB is regulated primarily by interaction with inhibitory IkB proteins through the formation of a stable IkB/NF-kB complex. One of the best-characterised members of the NF-kB family is a heterodimer composed of p50 and p65 subunits, and the most studied NF-kB–IkB interaction is that with IkBα. • The protein complex simulated consists of the NF-kB p50/p65 heterodimer in the open conformation and the IkBα inhibitor. I. Eberini, A. Emerson, R. Gambari, E. Rossi, S. Giuliani, L. Piccagli, in progress Parma School 2007
Molecular Dynamics simulation of the NFkB Transcription Factor • The three-dimensional structures of the complexes suggest that the presence of IkBα allows a marked conformational change: the DNA-bound 'open' conformation of the p50-p65 heterodimer adopts, in the resting state, an IkBα-bound 'closed' conformation. [Fig. 1 – Starting structure for the MD simulation: NF-kB p65 and p50 subunits bound to the IkB-α inhibitor] Parma School 2007
Molecular Dynamics simulation of the NFκB Transcription Factor • A 13 ns molecular dynamics simulation of this complex was performed using the NAMD program with the following parameters: CHARMM27 force field, explicit solvent (TIP3P water), 300 K, and particle mesh Ewald (PME) for the electrostatic force calculation. (A minimal configuration along these lines is sketched below.) • The calculations were run in parallel on the IBM Linux cluster installed at CINECA. On this machine 1 ns took about 800 CPU hours using 16 processors (ca. 150,000 atoms). Parma School 2007
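For orientation, a bare-bones NAMD input reflecting the settings quoted above might look like the sketch below. The structure, coordinate and parameter file names and the box size are hypothetical, and a real production run would also include minimisation, equilibration, constant-pressure settings and restart files.

```bash
#!/bin/bash
# Sketch: generate a minimal NAMD configuration (CHARMM27, TIP3P, 300 K, PME)
# and launch it on 16 processors.  All file names and cell dimensions are
# hypothetical placeholders.
cat > nfkb_md.namd <<'EOF'
structure          nfkb.psf
coordinates        nfkb.pdb
paraTypeCharmm     on
parameters         par_all27_prot_na.prm
temperature        300
exclude            scaled1-4
1-4scaling         1.0
cutoff             12.0
switching          on
switchdist         10.0
pairlistdist       13.5
timestep           2.0
rigidBonds         all            ;# needed for a 2 fs timestep
# periodic cell (placeholder values; PME requires a periodic box)
cellBasisVector1   120.0   0.0   0.0
cellBasisVector2     0.0 120.0   0.0
cellBasisVector3     0.0   0.0 120.0
cellOrigin           0.0   0.0   0.0
PME                yes
PMEGridSpacing     1.0
langevin           on             ;# thermostat at 300 K
langevinTemp       300
langevinDamping    1.0
outputName         nfkb_run0
dcdfreq            1000
run                500000         ;# 1 ns at 2 fs/step
EOF

# the exact launcher depends on how NAMD was built on the cluster
charmrun +p16 namd2 nfkb_md.namd > nfkb_run0.log
```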
Molecular Dynamics simulation of the NFκB Transcription Factor [Fig. 2 – Root mean square deviation of the NF-kB and IkBα macromolecular complexes during 13 ns of MD] Parma School 2007
Molecular Dynamics simulation of the NFκB Transcription Factor [Fig. 3 – Extreme projections of the MD trajectory (0–7 ns) along the first eigenvector computed by essential dynamics (IkBα inhibitor not shown)] Parma School 2007
Experience of NFkB • Many desktop packages could not handle the large system size (150,000 atoms). Often used unix commands + VMD for manipulation. • Long runs needed to be split up and chained together for the batch system; bash/perl scripts to automate the process simplified life considerably (a sketch follows below). • Time was spent on file conversion and on labelling, organising and archiving trajectory files (DCD). • Analysis with GROMACS: • catdcd -o run_0-2.trr -otype trr run_0.dcd run_1.dcd • g_rms -f run_0-2.trr -s nfkb.pdb Parma School 2007
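A sketch of the kind of helper script mentioned above, which concatenates the DCD segments of a chained run and computes an RMSD with the GROMACS tools. File names are hypothetical; catdcd is distributed with VMD, and g_rms asks interactively which groups to fit and compare.

```bash
#!/bin/bash
# Sketch: join the pieces of a chained run and compute the RMSD, following the
# commands quoted on the slide above.  File names are hypothetical.
set -e

# collect the segments run_0.dcd, run_1.dcd, ... in numerical order
segments=$(ls run_*.dcd | sort -t_ -k2 -n)

# convert + concatenate into a single .trr file that GROMACS can read
catdcd -o run_all.trr -otype trr $segments

# RMSD with respect to the starting structure (g_rms prompts for the groups)
g_rms -f run_all.trr -s nfkb.pdb -o rmsd.xvg
```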
Hardware • Laptops or PCs are fine for setting up the system and for trial runs, but larger clusters or supercomputers are still used for production. • Supercomputers are difficult to use and expensive. What about Grids? Parma School 2007
Grids • Grid middleware links heterogeneous resources in a transparent way, hiding differences in hardware, operating systems and batch schedulers. • Major efforts include TeraGrid (US), EGEE and DEISA. • EGEE is a community grid based on virtual organisations • DEISA links together Europe's supercomputer centres • Grids have been used for "grand challenge" projects such as protein folding, earthquake simulation and climate modelling. Parma School 2007
Grids • It should be remembered that the Grid doesn't actually provide more resources. • Although it has been used for some very large simulations, its major advantage is the ease of linking heterogeneous resources and simplifying the user interface. • For some users the presence of grid middleware is a disadvantage: • performance overheads • fragility of the middleware • the need for digital certificates • Direct access to a supercomputer is still the most convenient solution for moderately expert computer users. Parma School 2007
Using supercomputers… • Despite massive advances in personal computers, all but small systems or short trajectories tend to be run on departmental clusters or larger computers. • Even for small or interactive software, some users prefer centralised resources to a PC (support, backup services, …). Parma School 2007
But supercomputers are difficult • Access is still almost exclusively via the command line, despite web or grid portals. • Forget about sophisticated graphics; sometimes graphics libraries (e.g. X11) are not even installed. • Some knowledge of UNIX is essential just to get started. • Batch or queueing systems (LSF, LoadLeveler, PBS, Torque), different file systems, interconnects, … (an example batch script is sketched below) • The program must be parallelised. If not, you might as well use a PC (at least for MD). Parma School 2007
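As an illustration of what running under a batch system involves, here is a sketch of a PBS/Torque submission script for a parallel NAMD job. The job name, queue, resource limits and launcher are hypothetical and machine-dependent; on LSF or LoadLeveler the directives look different but play the same role.

```bash
#!/bin/bash
#PBS -N nfkb_md              # job name
#PBS -l nodes=2:ppn=8        # 16 processors in total
#PBS -l walltime=24:00:00    # maximum wall-clock time
#PBS -q parallel             # hypothetical queue name

# start from the directory the job was submitted from
cd $PBS_O_WORKDIR

# launch the MD engine on the processors assigned by the scheduler
# (on other machines this might be mpirun, poe, srun, ...)
charmrun +p16 namd2 nfkb_md.namd > nfkb_md.log
```

The script would be submitted with something like `qsub run_md.pbs` and then waits in the queue until the requested resources become free.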
Benchmarking • Before using a parallel computer, the program + system should be benchmarked on the hardware you intend to use (a simple benchmarking loop is sketched below). • Published benchmarks can be very selective. Parma School 2007
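A sketch of such a benchmark: run a short job at increasing processor counts and pull the performance figure out of the log. The grep pattern assumes NAMD, which prints "Benchmark time" lines including a days/ns estimate; other codes report performance differently.

```bash
#!/bin/bash
# Sketch: short benchmark runs at increasing processor counts.
# Assumes NAMD and a configuration file bench.namd running a few thousand steps.
for p in 1 2 4 8 16 32; do
    charmrun +p$p namd2 bench.namd > bench_${p}.log
    # NAMD reports its own timing; keep the last estimate of each run
    perf=$(grep "Benchmark time" bench_${p}.log | tail -1)
    echo "procs=$p  $perf"
done
```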
Benchmarks – why did we use NAMD for NFkB? Speedup: R(N) = P(N) / P(1), where P = performance (e.g. ps/day) and N = number of processors. [Figure: speedup curves for NAMD (NFkB + solvent, 150k atoms) and GROMACS (BLG + solvent, 33k atoms)] Parma School 2007
Benchmarks Scalability (or efficiency): S(N) = R(N) / N = P(N) / (N × P(1)), where P = performance (e.g. ps/day). Parma School 2007
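A worked example of these two definitions, with hypothetical performance figures:

```bash
#!/bin/bash
# Sketch: speedup R and efficiency S from two (hypothetical) benchmark results.
P1=120       # ps/day on 1 processor
PN=1440      # ps/day on N processors
N=16
R=$(echo "scale=2; $PN / $P1" | bc)   # speedup    R = P(N) / P(1)
S=$(echo "scale=2; $R / $N"  | bc)    # efficiency S = R / N
echo "R = $R, S = $S on $N processors"   # here R = 12.00 and S = .75
```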
Benchmarks – hardware dependence [Figure: benchmark results on an IBM Power5 (SP5) system, 64 nodes, 8 processors/node] Parma School 2007
Benchmarks – hardware dependence [Figure: performance in ps/day on different hardware] …but this does not take into account how long a job has to wait in the batch queue! Parma School 2007
Final comments • Although laptops and personal computers are often used to prepare the system and run small tests, production runs are still performed on supercomputers. • At our centre, we haven't noticed a drop in usage. Parma School 2007
Final comments • Supercomputers are difficult to use and require knowledge of UNIX, batch systems, disk technologies, etc. • Benchmarking of codes is essential; it is worth running MD simulations on a PC first. • The main problem, though, is the lack of (good) standards: much time is spent converting files from one program or format to another. • Many attempts to standardise chemical information in a consistent and lossless way (e.g. CML) have not gained wide acceptance. Parma School 2007
Acknowledgements • I. Eberini and his group (Uni. Milan) • L. Piccagli and R. Gambari (Uni. Ferrara) • S. Giuliani and E. Rossi (CINECA) • Users of CINECA’s systems • Developers and providers of NAMD, GROMACS, VMD, Swissprot, Amber, Charmm, … Parma School 2007