Prof. D. VELMURUGAN DEPARTMENT OF CRYSTALLOGRAPHY & BIOPHYSICS

High Throughput Technique in Structural Bioinformatics: Application to Catalase, an enzyme of 57 kDa molecular weight
By Prof. D. VELMURUGAN
DEPARTMENT OF CRYSTALLOGRAPHY & BIOPHYSICS, UNIVERSITY OF MADRAS, GUINDY CAMPUS, CHENNAI – 600 025

One of the main interests in the molecular biosciences is understanding structure-function relations, and X-ray crystallography plays a major role in this. Ab initio solution of the crystal structures of small molecules is possible with atomic-resolution diffraction data, usually at ~0.8 Å, and most small-molecule crystal structures are solved using direct-methods programs. Macromolecules have mainly been solved at lower than atomic resolution, and this has necessitated determination of initial phases either by molecular replacement or from experimental techniques such as MIR or MAD.

During the last decade, considerable advances have taken place in the data-collection facilities and techniques available to the macromolecular crystallographer. To obtain better X-ray intensity data, new techniques such as cryo-temperature data collection, halide soaking and passing of Ar, Ne, Hg gas have been developed. With these advances, more data sets are reaching atomic resolution. The possibility of obtaining atomic-resolution data even for macromolecules prompted direct-methods practitioners to try to extend direct methods, in combination with other macromolecular techniques, to the structure solution of macromolecules.

X-ray Crystallography has become a central tool in modern drug and target discovery, providing important insights into molecular interactions and biological function. The past few years have seen many advances in the methods underlying macromolecular crystallography such as protein production, crystallization, cryo-crystallography and synchrotron technology. Together these advances mean that X-ray data can be collected extremely quickly for many different crystals and ligand-bound complexes. The challenge is to ensure rapid and accurate interpretation of the data to provide valuable structural information. The High Throughput Crystallography (HTC) Consortium offers scientists a valuable new dimension to the drug discovery process. The HTC Consortium aims to accelerate crystallographic structure determination by developing new science as well as utilizing current technology to go from initial phasing through to structure refinement and analysis while minimizing the amount of human intervention that is required. The ability to examine in atomistic detail the interactions between many different proteins and ligands provides scientists unprecedented insight into the mechanics of drug binding.

Rapid and revolutionary developments in genome sciences, combinatorial chemistry, informatics and robotics are having major impacts on drug discovery. Genome sequencing projects in man and micro-organisms have provided an unprecedented number of potential drug targets. These have given impetus to the study of protein expression (proteomics) and structure (structural genomics), and have allowed a clearer description of drug targets as molecular components of disease processes. At the same time, there is a rapidly expanding range of screening technologies, as well as consolidation in medicinal chemistry arising from the combinatorial approaches that were pioneered in the 1990s. These developments have created an environment for the emergence of new strategies for drug discovery. High-Throughput Crystallography is essential for structure-based lead discovery, a strategy that combines features of random screening and rational structure-based design.

More than 29,000 protein structures are deposited in the Protein Data Bank (PDB), and more than 150,000 sequence entries exist in SWISS-PROT for which three-dimensional structures are not available. In structural genomics, one is interested in determining structures as quickly as possible in order to identify new folds, and this has opened up the field of “High Throughput Crystallography”. An understanding of the three-dimensional structure (fold) helps to correlate structure with the function of the molecule. High Throughput Crystallography using automated procedures allows quicker elimination of structures that have the same fold as those already deposited, when analyzing thousands of macromolecular structures for which functional assignments are not yet known.

ACORN is a comprehensive and efficient phasing procedure involving direct methods for the determination of protein structures when atomic-resolution data (better than 1.2 Å) are available (Foadi et al., 2000; McAuley et al., 2001; Yao, 2002; Foadi, 2003; Dodson & Yao, 2003). The starting fragment can be a small idealized piece of secondary structure (Rajakannan et al., 2004a, b; Selvanayagam et al., 2004) or an experimental substructure, such as a metal or a set of S, Se or similar atoms that can be located from anomalous scattering measurements. ACORN then uses a combination of approaches, most importantly dynamic density modification, to develop a refined set of phases. Key to the procedure is the use of a correlation coefficient, computed on reflections held out of the map calculation, as a criterion of phase quality.

Dynamic Density Modification (DDM) is designed to modify the densities in three steps:

$$\rho' = 0 \quad \text{if } \rho < 0$$
$$\rho' = \rho\,\tanh\{0.2\,[\rho/\sigma(\rho)]^{3/2}\} \quad \text{if } \rho > 0$$
$$\rho' = kn\sigma(\rho) \quad \text{if } \rho' > kn\sigma(\rho)$$

• It sets all negative densities to zero.
• It modifies the positive densities according to the ratio $\rho/\sigma(\rho)$.
• It truncates the modified densities to a value of $kn\sigma(\rho)$, where k is a constant given by the user (default value is 3) and n is the cycle number of DDM.
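The three DDM steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the ACORN source code: it assumes the positive-density rule is ρ' = ρ·tanh{0.2[ρ/σ(ρ)]^(3/2)}, that σ(ρ) is the standard deviation of the whole map, and that the map is supplied as a flat list of density values. The function name `ddm` and its arguments are hypothetical.

```python
import math
from statistics import pstdev


def ddm(rho, cycle, k=3.0):
    """Apply one pass of dynamic density modification to a list of map values.

    rho   : density values of the map (flat list of floats)
    cycle : DDM cycle number n (1, 2, 3, ...)
    k     : user constant (the text gives 3 as the default)
    """
    sigma = pstdev(rho)  # assumption: sigma(rho) is the map standard deviation
    if sigma == 0.0:
        # A completely flat map carries no information to enhance.
        return [0.0 for _ in rho]
    ceiling = k * cycle * sigma  # truncation level k*n*sigma(rho)
    modified = []
    for value in rho:
        if value <= 0.0:
            new_value = 0.0  # step 1: negative density is set to zero
        else:
            # step 2: positive density is scaled according to rho/sigma
            new_value = value * math.tanh(0.2 * (value / sigma) ** 1.5)
        # step 3: truncate anything above k*n*sigma
        modified.append(min(new_value, ceiling))
    return modified


density = [-0.5, 0.0, 0.2, 1.0, 8.0]
modified = ddm(density, cycle=1)
```

Because tanh is always below 1, weak positive density is suppressed much more strongly than strong peaks, and the ceiling knσ(ρ) rises with the cycle number so that peaks are allowed to grow gradually.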

The reflections are divided into three groups (strong, medium and weak) according to their normalized structure-factor (E) values. The strong reflections (E > 1.2) are used in the phase refinement by the Dynamic Density Modification (DDM) and Patterson superposition (SUPP) procedures. Both the strong reflections and the weak reflections (E < 0.1) are used in Sayre-equation refinement (SER). The medium reflections (0.1 < E < 1.2) are used to calculate a correlation coefficient (CC) for each potential solution of DDM.
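The grouping by E value can be illustrated as follows. This is a minimal sketch using the thresholds quoted above (1.2 and 0.1); the function name `partition_reflections` is hypothetical, and placing reflections that fall exactly on a threshold in the medium group is an assumption, since the text does not say how boundary values are treated.

```python
def partition_reflections(e_values, weak_max=0.1, strong_min=1.2):
    """Return the indices of strong, medium and weak reflections.

    e_values   : normalized structure-factor amplitudes, one per reflection
    weak_max   : reflections with E below this value are 'weak'
    strong_min : reflections with E above this value are 'strong'
    """
    groups = {"strong": [], "medium": [], "weak": []}
    for index, e in enumerate(e_values):
        if e > strong_min:
            groups["strong"].append(index)   # used in DDM, SUPP and SER
        elif e < weak_max:
            groups["weak"].append(index)     # used in SER
        else:
            groups["medium"].append(index)   # held out to compute the CC
    return groups


groups = partition_reflections([0.05, 0.5, 1.5, 1.2, 0.1])
```

Keeping the medium reflections out of the map calculation is what lets the CC act as an independent check on each DDM solution.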

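An important component of ACORN is a CC that describes the extent to which the magnitudes of the calculated normalized structure factors (Ec) resemble the observed normalized structure-factor amplitudes (Eo). A fragment in a particular position and orientation in the unit cell will have an associated set of structure factors, and the CC is expressed by

$$\mathrm{CC} = \frac{\langle |E_o||E_c| \rangle - \langle |E_o| \rangle \langle |E_c| \rangle}{\sigma_o\,\sigma_c}$$

where $\sigma = \left(\langle |E|^2 \rangle - \langle |E| \rangle^2\right)^{1/2}$, evaluated separately for the observed ($\sigma_o$) and calculated ($\sigma_c$) amplitudes.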
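A short sketch of how such a CC might be computed is given here. It assumes the usual linear correlation form, CC = (⟨EoEc⟩ − ⟨Eo⟩⟨Ec⟩)/(σo σc) with σ = (⟨E²⟩ − ⟨E⟩²)^½, applied to paired lists of amplitudes; the function name `correlation_coefficient` is hypothetical.

```python
import math


def correlation_coefficient(e_obs, e_calc):
    """Linear correlation between observed and calculated |E| values."""
    if len(e_obs) != len(e_calc) or not e_obs:
        raise ValueError("need two non-empty lists of equal length")
    n = len(e_obs)
    mean_o = sum(e_obs) / n
    mean_c = sum(e_calc) / n
    mean_oc = sum(o * c for o, c in zip(e_obs, e_calc)) / n
    # sigma = sqrt(<E^2> - <E>^2); max() guards against tiny negative round-off
    sigma_o = math.sqrt(max(sum(o * o for o in e_obs) / n - mean_o ** 2, 0.0))
    sigma_c = math.sqrt(max(sum(c * c for c in e_calc) / n - mean_c ** 2, 0.0))
    if sigma_o == 0.0 or sigma_c == 0.0:
        raise ValueError("CC is undefined when one set of amplitudes is constant")
    return (mean_oc - mean_o * mean_c) / (sigma_o * sigma_c)


cc_perfect = correlation_coefficient([0.2, 0.5, 0.9, 1.1], [0.4, 1.0, 1.8, 2.2])
cc_inverse = correlation_coefficient([0.2, 0.5, 0.9, 1.1], [1.1, 0.8, 0.4, 0.2])
```

The CC is insensitive to an overall scale factor between the two sets, so the calculated amplitudes do not need to be on the same scale as the observed ones.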

Ec and CC values are calculated from the starting fragment for all reflections to find the correct orientation and position in molecular replacement (MR) or random MR or for single random-atom searching. In phase refinement Ec and CC values are calculated from the modified map for medium reflections, which are not used for computing the map, to indicate solutions of DDM.

The ACORN procedure, as implemented in CCP4, is divided into two parts, ACORN-MR and ACORN-PHASE, as illustrated in the flow diagram.

ACORN-MR deals with finding the position of a fragment of the structure, even a single atom, that provides an initial set of estimated phases. This set is passed to ACORN-PHASE, where phase refinement by a number of real-space processes is performed. For locating a single atom, this approach randomly generates thousands of positions in the asymmetric unit. For each random position, the normalized structure-factor values and the corresponding CC are calculated for all reflections. The 1000 sets with the highest CCs are saved as starting points for further calculations. In most cases, the solution is found among the top 100 sets. This approach can be used to determine a native protein structure from atomic-resolution (AR) data if the structure contains at least one heavy atom (sulphur or heavier).
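The random single-atom search can be sketched as below. This is a toy illustration under stated assumptions, not the ACORN-MR implementation: the space group is taken to be P2₁ (chosen only so that a single atom gives amplitudes that vary with position), the atom is treated as a point scatterer, and the names `SYMOPS`, `calc_amplitudes`, `correlation` and `random_atom_search` are hypothetical. The "observed" data in the demonstration are synthetic.

```python
import cmath
import math
import random

# Assumption: space group P2_1 (unique axis b), used purely for illustration.
SYMOPS = [
    lambda x, y, z: (x, y, z),
    lambda x, y, z: (-x, y + 0.5, -z),
]


def calc_amplitudes(position, hkl_list):
    """Structure-factor amplitudes of one point atom and its symmetry mates."""
    amplitudes = []
    for h, k, l in hkl_list:
        total = 0j
        for op in SYMOPS:
            x, y, z = op(*position)
            total += cmath.exp(2j * math.pi * (h * x + k * y + l * z))
        amplitudes.append(abs(total))
    return amplitudes


def correlation(a, b):
    """Linear correlation coefficient; returns 0.0 if either set is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b
    var_a = sum(x * x for x in a) / n - mean_a ** 2
    var_b = sum(y * y for y in b) / n - mean_b ** 2
    if var_a <= 0.0 or var_b <= 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def random_atom_search(e_obs, hkl_list, n_trials, n_keep, seed=0):
    """Score random fractional positions by CC and keep the best n_keep."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_trials):
        position = (rng.random(), rng.random(), rng.random())
        cc = correlation(e_obs, calc_amplitudes(position, hkl_list))
        scored.append((cc, position))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n_keep]


# Synthetic demonstration: amplitudes generated from one known atom position.
hkl_list = [(h, k, l)
            for h in range(-2, 3) for k in range(-2, 3) for l in range(-2, 3)
            if (h, k, l) != (0, 0, 0)]
true_position = (0.13, 0.27, 0.41)
e_obs = calc_amplitudes(true_position, hkl_list)
best = random_atom_search(e_obs, hkl_list, n_trials=300, n_keep=10)
```

In a real run the retained positions would each be passed on to phase refinement; here the ranked list `best` simply shows how the CC orders the trial positions.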

Foadi (2003) has given a detailed explanation of the reasons for the failure of ACORN when the resolution is lower (worse) than 1.2 Å. At atomic resolution, two neighbouring atomic peaks are separate entities and DDM enhances both of them. At lower resolution, the two peaks merge into a single peak; DDM merely enhances this merged peak, and no improvement in the phases can be expected. The present work overcomes this problem at low resolution by using fragments as seed phasing information.

The use of ACORN in solving a 57 kDa macromolecule with atomic-resolution (0.88 Å) and truncated (1.5 Å resolution) synchrotron data: Micrococcus lysodeikticus catalase (Murshudov et al., 2002)

Details of the crystallographic data, helices, sheets and sets

Ab initio phasing using ACORN

ACORN was run with 5000 random single-atom trials and the 40 positions with the highest CCs were selected. ACORN refined the phases from the random-atom trials using DDM and this led to a solution with a good CC. In this run, 78 cycles of DDM increased the CC for the medium-E reflections from 0.0285 to 0.5246 in 14.2 hours of CPU time. In this ab initio case, 8 chains (482 residues) could be built automatically with ARP/wARP (Perrakis et al., 1999) followed by REFMAC (Murshudov et al., 1999). Manual model building was carried out for the missing residues, and the final Rw and Rf values are 14.0 and 16.2% respectively. Superposition (using PROFIT) of the backbone atoms of this structure on the backbone atoms of the same structure solved by the conventional technique gives an r.m.s. deviation of 0.143 Å.

Details of ACORN, ARP/wARP and REFMAC results for ab initio case

Applications of truncated data at 1.5 Å resolution

For set 23 (minimum input), all sheets and one helix (helix 4), containing 76 residues in total, were given as input to ACORN. Here, the ACORN-PHASE option was selected for the structure solution. The R-factor and correlation coefficient for the medium-E reflections of the initial model are 54.2% and 0.0469, respectively. Within 56 cycles of DDM the R-factor and correlation coefficient reached 53.9% and 0.0771, indicating a good solution. The phases were then fed to ARP/wARP (Perrakis et al., 1999) followed by REFMAC (Murshudov et al., 1999). After the initial model building by ARP/wARP, the Rw and Rf values were 44.8 and 44.4% respectively. This initial model was refined for ten cycles of auto-building, with five cycles of REFMAC in each auto-building cycle. At the end of this stage, ARP/wARP had built 212 residues and the Rw and Rf values were 28.9 and 36.3% respectively. An iterative cycle carried out with these output phases revealed 481 of the 503 residues with a connectivity index of 0.97.

Manual model building was carried out in the missing regions, as the densities were clear. After the manual model building, 20 cycles of maximum-likelihood refinement were performed using REFMAC, and solvent atoms were updated after the refinement using the ARP/wARP ‘build solvent atoms’ script. The final Rw and Rf values were 13.6 and 15.6% respectively. The backbone of this final model was superimposed on the structure solved conventionally by the molecular replacement method. The root-mean-square deviation was 0.176 Å and the details are shown in Table 2. The results for sets 1-16 and 23 are also shown in Table 2. Figs 3a to 3q show the final models obtained when each of the sets was used as ‘seed-phasing’ information for ACORN. Table 2 lists the ACORN statistics, the ARP/wARP details and the final results obtained in each of these cases.
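For readers less familiar with the Rw and Rf values quoted above, the following sketch shows how such residuals are commonly computed. It takes Rw and Rf to mean the R-factors of the working and free reflection sets (an assumption, since the abbreviations are not expanded in the text), uses the standard definition R = Σ| |Fo| − k|Fc| | / Σ|Fo|, and derives a single overall scale k from the working set. The function names and the toy amplitudes are hypothetical; this is not how REFMAC itself scales the data.

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """Crystallographic residual between observed and calculated amplitudes."""
    numerator = sum(abs(o - scale * c) for o, c in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


def r_work_and_free(f_obs, f_calc, free_flags):
    """Return (R_work, R_free) given a True/False free flag per reflection."""
    work = [(o, c) for o, c, is_free in zip(f_obs, f_calc, free_flags) if not is_free]
    free = [(o, c) for o, c, is_free in zip(f_obs, f_calc, free_flags) if is_free]
    # Assumption: one overall scale factor, taken from the working set only.
    scale = sum(o for o, _ in work) / sum(c for _, c in work)
    r_work = r_factor([o for o, _ in work], [c for _, c in work], scale)
    r_free = r_factor([o for o, _ in free], [c for _, c in free], scale)
    return r_work, r_free


# Toy data: the last two reflections are flagged as the free set.
f_obs = [10.0, 10.0, 20.0, 20.0]
f_calc = [8.0, 12.0, 18.0, 22.0]
free_flags = [False, False, True, True]
r_work, r_free = r_work_and_free(f_obs, f_calc, free_flags)
```

Because the free reflections are excluded from refinement, the gap between the two values gives an indication of over-fitting.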

Table 2. Details of ACORN phasing, ARP/wARP model building and REFMAC refinement

Seed phasing using Cα atoms

Only the 503 Cα atoms from the known structure were used as seed phasing for ACORN, with the truncated data extending to 1.3 Å resolution. A successful model could be built with 474 amino acids (a.a.), the backbone atoms of which had an r.m.s. deviation of 0.132 Å from the actual structure (1gwe). To mimic this ‘seed feeding’ in real situations, mean positional errors (MPE, hereafter) of 0.1 and 0.2 Å were introduced into the above Cα atoms using MOLEMAN (Kleywegt, 1992-2004). Successful models could be built with 483 and 481 a.a. for input fragments with MPE of 0.1 and 0.2 Å respectively. The backbone atoms of these models had r.m.s. deviations of 0.169 and 0.163 Å respectively from the actual structure (1gwe).
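The idea of introducing a mean positional error and then measuring the deviation can be sketched as follows. This is not MOLEMAN's procedure, which is not described here; it is a hypothetical stand-in that draws a random Gaussian shift for every atom and rescales the shifts so that their mean length equals the requested MPE. The r.m.s. deviation is computed directly on paired coordinates without any superposition step. All function names and the toy coordinates are assumptions.

```python
import math
import random


def perturb(coords, mpe, seed=0):
    """Shift every atom randomly so that the mean shift length equals mpe (in Å)."""
    rng = random.Random(seed)
    shifts = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in coords]
    lengths = [math.sqrt(dx * dx + dy * dy + dz * dz) for dx, dy, dz in shifts]
    scale = mpe / (sum(lengths) / len(lengths))  # rescale to the requested mean
    return [(x + dx * scale, y + dy * scale, z + dz * scale)
            for (x, y, z), (dx, dy, dz) in zip(coords, shifts)]


def mean_displacement(a, b):
    """Mean distance between paired atoms of two coordinate sets."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def rmsd(a, b):
    """Root-mean-square deviation between paired atoms (no superposition)."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


# Toy 'C-alpha trace': 50 points spaced 3.8 Å apart along a line.
calpha = [(3.8 * i, 0.0, 0.0) for i in range(50)]
noisy = perturb(calpha, mpe=0.1)
```

With this construction the r.m.s. deviation is always at least as large as the mean displacement, so an MPE of 0.1 Å corresponds to a slightly larger r.m.s. error on the input fragment.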

Results of ACORN and ARP/wARP using only Cα atoms (1gwe). Resolution 20-1.3 Å

PDB i.d.: 1gwe    Total residues: 503

Input                                  Auto built
Cα atoms (503)                         474 residues
Cα atoms with 0.1 Å error              483 residues
Cα atoms with 0.2 Å error              481 residues

Seed phasing using 120 a.a. as polyala

The first 120 a.a. from the actual structure were treated as a polyala model and the above procedures were carried out to obtain the final model. The results are detailed in the table. With the 120 residues as a polyala model, ARP/wARP was able to build 111 residues in 15 chains when the above procedures were followed. An iterative cycle carried out with this output as input revealed 480 of the 503 residues with a connectivity index of 0.98. In the case of the first 120 residues as a polyala model with 0.1 Å MPE, ARP/wARP initially built only 6948 dummy atoms. Two iterative cycles carried out with this as input finally built 481 residues. These two models have r.m.s. deviations of 0.176 and 0.173 Å respectively from the backbone atoms of the actual structure (1gwe).

Results of ACORN and ARP/wARP using polyala model (5 atoms/a.a.) (1gwe). Resolution 20-1.5 Å

PDB i.d.: 1gwe    Total residues: 503

Input                                               Auto built
First 120 a.a. as polyala model                     480 residues
First 120 a.a. as polyala model, MPE of 0.1 Å       481 residues

STEREO VIEW OF THE ELECTRON DENSITY (2Fo-Fc) MAP SUPERPOSED WITH FINAL MODEL (Input: polyala model for the first 120 a.a. with an MPE of 0.1 Å)

STEREO VIEW OF THE FINAL ELECTRON DENSITY (2Fo-Fc) MAP STARTING WITH THE POLYALA MODEL OF THE FIRST 120 a.a. WITH AN MPE OF 0.1 Å

FINAL ELECTRON DENSITY (2Fo-Fc) MAP FOR POLYALA MODEL

ELECTRON DENSITY (2Fo-Fc) MAP FOR HEME GROUP IN POLYALA MODEL

Seed phasing using Ncap, Ccap and Middle portions of helices/sheets

A minimum of 76 residues was found to be sufficient for seed phasing with 1.5 Å truncated data to solve the three-dimensional structure of catalase (Selvanayagam et al., 2004). Instead of feeding entire helices or sheets, either the N-cap or C-cap regions or the middle portions of the helices or sheets can be fed as input for seed phasing. A successful model can be built in these cases also. The results obtained are listed in the table.

Results of ACORN and ARP/wARP using Ncap, Ccap and Middle portions of helices/sheets (1gwe). Resolution 20-1.5 Å

Input                                          Auto built
Ncap region of helices/sheets (76 a.a.)        470 residues
Ccap region of helices/sheets (76 a.a.)        474 residues
Middle region of helices/sheets (76 a.a.)      479 residues

Black shaded regions correspond to the input residues from 1gwe.

Conclusion

• Based on the published work and the work being carried out by our group (Rajakannan et al., 2004a; 2004b), it is now clear that very little information (about 15% of the structure) is needed to determine the structure of a protein using ACORN.
• Ours is the first application of ACORN using seed-phasing information to solve a larger protein (57 kDa) when the resolution extends only to 1.5 Å.
• Among the multiple solutions, the correct solution can be identified in all trials with high reliability by means of the correlation coefficient; hence high-resolution and fairly complete diffraction data enable one to solve a protein ab initio in a relatively short time.
• ACORN has great potential to establish itself as a program for high-throughput structure determination.
• Currently, in order to extend the applicability of ACORN to lower resolutions, the seed phasing has been obtained from the native structure itself (as the structure had already been solved by traditional macromolecular crystallographic methods). A data-mining approach to supply fragments from PDB entries is in progress.

References

Banumathi, S., Rajakannan, V., Velmurugan, D., Dauter, Z., Dauter, M., Tsai, M. D. & Sekar, K. (2002). Japanese Crystallographic Society Meeting, Poster, P3-II-27, 123.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760-763.
Dodson, E. J. & Yao, J.-X. (2003). Crystallogr. Rev. 9, 67-72.
Foadi, J. (2003). Crystallogr. Rev. 9, 43-65.
Foadi, J., Woolfson, M. M., Dodson, E. J., Wilson, K. S., Yao, J.-X. & Chao-de, Z. (2000). Acta Cryst. D56, 1137-1147.
Kleywegt, G. J. (1992-2004). Uppsala University, Uppsala, Sweden. Unpublished program.
McAuley, K. E., Yao, J.-X., Dodson, E. J., Lehmbeck, J., Ostergaard, P. R. & Wilson, K. S. (2001). Acta Cryst. D57, 1571-1578.
Murshudov, G. N., Lebedev, A., Vagin, A. A., Wilson, K. S. & Dodson, E. J. (1999). Acta Cryst. D55, 247-255.
Murshudov, G. N., Grebenko, A. I., Brannigan, J. A., Antson, A. A., Barynin, V. V., Dodson, G. G., Dauter, Z., Wilson, K. S. & Melik-Adamyan, W. R. (2002). Acta Cryst. D58, 1972-1982.
Perrakis, A., Morris, R. M. & Lamzin, V. S. (1999). Nature Struct. Biol. 6, 458-463.
Rajakannan, V., Velmurugan, D., Yamane, T., Dauter, Z., Dauter, M., Tsai, M. D. & Sekar, K. (2002). Japanese Crystallographic Society Meeting, Poster, P3-I-22, 84.
Rajakannan, V., Yamane, T., Shirai, T., Kobayshi, T., Ito, S. & Velmurugan, D. (2003). International Symposium on Diffraction Structural Biology, Tsukuba, Japan, 28-31 May 2003, Poster P-085.
Rajakannan, V., Yamane, T., Shirai, T., Kobayshi, T., Ito, S. & Velmurugan, D. (2004a). J. Synchrotron Rad. 11, 64-67.
Rajakannan, V., Selvanayagam, S., Yamane, T., Shirai, T., Kobayshi, T., Ito, S. & Velmurugan, D. (2004b). J. Synchrotron Rad. 11, 358-362.
Selvanayagam, S., Velmurugan, D. & Yamane, T. (2004). Asian Crystallographic Association Meeting (AsCA’04), Poster P0165.
Velmurugan, D., Rajakannan, V., Yamane, T., Dauter, Z. & Sekar, K. (2002). Japanese Crystallographic Society Meeting, Poster, P3-II-26, 122.
Yao, J.-X. (2002). Acta Cryst. D58, 1941-1947.
