Data quality and model parameterisation

Martyn Winn
CCP4, Daresbury Laboratory, U.K.
Prague, April 2009

Model Parameters
E.g. asymmetric unit contains n copies of a protein of N atoms
Coordinates
• 3 x N x n xyz co-ordinates, or ...
• 6 x M x n if each protein is modelled as M rigid bodies, or ...
• ~ 0.5 x N x n torsion angles
Displacement parameters
• 1 x N x n B factors, or ...
• 6 x N x n anisotropic U factors, or ...
• 20 x M x n if each protein has M TLS groups
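The counts on this slide can be tabulated with a few lines of Python. This is only a sketch: the function name `count_parameters` and its keyword names are made up for illustration, and the torsion-angle figure uses the approximate factor of 0.5 quoted above.

```python
def count_parameters(N, n, M=None, coords="xyz", adp="iso"):
    """Rough parameter count for n copies of a protein of N atoms.

    M is the number of rigid bodies / TLS groups per protein copy
    (only needed for coords="rigid" or adp="tls").
    """
    coord_counts = {
        "xyz": 3 * N * n,                                # x, y, z per atom
        "rigid": 6 * M * n if M is not None else None,   # 3 rotations + 3 translations per body
        "torsion": round(0.5 * N * n),                   # approximate, as on the slide
    }
    adp_counts = {
        "iso": 1 * N * n,                                # one B factor per atom
        "aniso": 6 * N * n,                              # six U components per atom
        "tls": 20 * M * n if M is not None else None,    # 20 parameters per TLS group
    }
    c, a = coord_counts[coords], adp_counts[adp]
    if c is None or a is None:
        raise ValueError("M must be given for rigid-body or TLS parameterisation")
    return c + a


# Hypothetical example: 2 copies of a 1000-atom protein.
print(count_parameters(1000, 2))                              # xyz + isotropic B
print(count_parameters(1000, 2, adp="aniso"))                 # xyz + anisotropic U
print(count_parameters(1000, 2, M=3, coords="rigid", adp="tls"))
```

The point of the comparison is how quickly the count drops when atoms are grouped: rigid bodies plus TLS need a few hundred parameters where individual anisotropic atoms need many thousands.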

Model Parameters (2)
Occupancies
• Usually fixed at 1.0 for protein
• ... except for alternative conformations (usually sum to 1.0)
• Water/ligand occupancies
Scaling parameters etc.
• k_overall, B_overall, k_Babinet, B_Babinet, k_solvent, B_solvent
• twin fraction
Ultra-high resolution
• Multipolar expansion coefficients
• Interatomic scatterers

Reflection Data
Number of independent reflections, dependent on:
• spacegroup
• resolution
• completeness
For each reflection, one has at least F/sigF. Might also have reliable experimental phases φ or F(+)/F(-).
• How many reflections to include?
• What I/σI is acceptable for refinement?
• Answer: include ALL reflections, no matter how weak
• ... unless systematic errors
• ... different answer for phasing
• ... quoted resolution may be lower

Data / parameter ratio
Refinement means minimise -log(likelihood):
• Nonlinear function of model parameters.
• Global minimum and many local minima.
• Need good data/parameter ratio.
• Strong dependence on resolution.
• No strong dependence on protein size.
Generally not enough data ...
• Reduce number of parameters - constraints
• Add data - restraints

Restraints
Expected geometry of the protein
• treated as additional data
    bond lengths
    bond angles
    torsions / dihedral (but not φ,ψ)
    chirality (e.g. chiral volume)
    planarity
    non-bonded (VdW, H-bonds, etc.)
    B factors (between bonded atoms)
    U factor restraints (similarity, sphericity, rigid bond)
    NCS (position or conformation)

Data / parameter ratio
Estimate as: (no. reflections + no. restraints) / no. parameters
Not really true ... assumes all data independent:
• bond lengths and angles and planar restraints in ring system
• bond length restraint vs. high resolution diffraction data
Restraints have associated weights:
• Overall w.r.t. reflection data
• Individual weights e.g. WB
Restraints may be more necessary in poorly determined parts of the structure.

calmodulin at 1.8 Å (1clm)
1132 protein atoms, 4 Ca atoms, 71 waters
• 4828 x, y, z, B factors
No. of unique reflections 10610 (deposited 1993 → no test set!)
→ data/parameter = 2.2
Bond restraints: 1144
Angle restraints: 1536
Torsion restraints: 429
Chiral restraints: 170
Planar restraints: 874
Non-bonded restraints: 1391
B factor restraints: 2680
(no NCS)
total restraints = 8224
→ data/parameter = 3.9

calmodulin at 1.0 Å (1exr)
1467 protein atoms (inc. alt. conf.), 5 Ca atoms, 178 waters
• 4950 x, y, z + 9900 anisotropic U factors + 316 occupancy parameters
• total parameter count = 15166
No. of unique reflections 77150
No. in test set 7782 (10%)
Data for refinement 69368
→ data/parameter = 4.6
No. of restraints (PDB header) 22732
→ data/parameter = 6.1

GCPII at 1.75 Å (3d7g)
5724 protein atoms (inc. alt. conf.), 211 ligand atoms, 617 waters
• 26046 x, y, z, B factors + 162 anisotropic U factors (S, Zn, Ca, Cl only) + 225 occupancy parameters
• total parameter count = 26433
No. of unique reflections 105077
No. in test set 1550 (1.5%)
Data for refinement 103527
→ data/parameter = 3.9
No. of restraints (PDB header) 44652
→ data/parameter = 5.6

Thioredoxin reductase at 3.0 Å (1h6v)
22514 protein atoms, 552 ligand atoms, 9 waters
• 92300 x, y, z, residual B factors
6 TLS groups
• 120 TLS parameters
No. of unique reflections 69328
No. in test set 3441 (5%)
Data for refinement 65887
→ data/parameter = 0.7
No. of restraints 209378 (inc. 44484 NCS restraints)
→ data/parameter = 3.0
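The ratios quoted for the four example structures can be checked with a short script. The counts are copied from the slides; the function name `data_parameter_ratio` is an illustrative choice, and the estimate carries the caveat already made: it treats all reflections and restraints as independent data.

```python
def data_parameter_ratio(n_reflections, n_parameters, n_restraints=0):
    """(no. reflections + no. restraints) / no. parameters."""
    return (n_reflections + n_restraints) / n_parameters


# Counts taken from the four example slides (working-set reflections only).
examples = {
    "1clm": dict(n_reflections=10610, n_parameters=4828, n_restraints=8224),
    "1exr": dict(n_reflections=69368, n_parameters=15166, n_restraints=22732),
    "3d7g": dict(n_reflections=103527, n_parameters=26433, n_restraints=44652),
    # 92300 atomic parameters + 120 TLS parameters
    "1h6v": dict(n_reflections=65887, n_parameters=92300 + 120, n_restraints=209378),
}

for name, ex in examples.items():
    without = data_parameter_ratio(ex["n_reflections"], ex["n_parameters"])
    with_restraints = data_parameter_ratio(**ex)
    print(f"{name}: {without:.1f} without restraints, {with_restraints:.1f} with restraints")
```

Only the 3.0 Å structure starts below 1.0, which is where the restraints (including NCS) do most of the work.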

Avoiding overfitting: Rfree
What's wrong?
• Can add any old parameters to improve R-factor, when data/parameter ratio is low
• May not be physically correct – "overfitting"
Solution:
• Calculate R-factor on a set of reflections not used in refinement = "Rfree"
• If changes to model improve Rfree as well as R, then they are good.
• Note: Rfree is a global number - useful for refinement strategies, not useful for assessing changes to a few atoms
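As a minimal sketch of the idea, the conventional R-factor, R = Σ| |Fobs| - |Fcalc| | / Σ|Fobs|, can be evaluated separately over working and free reflections. The function names and the toy amplitudes below are invented for illustration, and the sketch assumes Fcalc has already been scaled to Fobs, which real refinement programs handle internally.

```python
def r_factor(f_obs, f_calc):
    """Conventional R-factor over one set of reflections."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def r_work_and_free(f_obs, f_calc, free_flags):
    """Split reflections by free flag and return (R_work, R_free)."""
    work_obs = [o for o, f in zip(f_obs, free_flags) if not f]
    work_calc = [c for c, f in zip(f_calc, free_flags) if not f]
    free_obs = [o for o, f in zip(f_obs, free_flags) if f]
    free_calc = [c for c, f in zip(f_calc, free_flags) if f]
    if not work_obs or not free_obs:
        raise ValueError("need at least one working and one free reflection")
    return r_factor(work_obs, work_calc), r_factor(free_obs, free_calc)


# Toy amplitudes (made up): the last reflection is in the free set.
f_obs = [100.0, 50.0, 80.0, 20.0]
f_calc = [90.0, 55.0, 80.0, 30.0]
free_flags = [False, False, False, True]

r_work, r_free = r_work_and_free(f_obs, f_calc, free_flags)
print(f"R = {r_work:.3f}, Rfree = {r_free:.3f}")
```

Because both numbers are sums over many reflections, a change affecting only a few atoms barely moves either of them, which is why Rfree is a guide to strategy rather than to local model changes.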

Choosing your free reflections
• Usually a randomly chosen subset.
• Typically 5-10% (CCP4 default is 5%)
• If you have enough reflections, impose maximum number (2000 in phenix.refine)
• Free set also used in maximum likelihood to estimate σA parameters
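A random selection with an optional cap might look like the sketch below. It is not the algorithm of any particular program: `assign_free_flags`, its arguments and the fixed seed are assumptions made for the example, and real packages store the flags in their own formats.

```python
import random


def assign_free_flags(n_reflections, fraction=0.05, max_free=None, seed=0):
    """Return a list of booleans, True for reflections placed in the free set.

    fraction : target fraction of free reflections (e.g. 0.05 for 5%)
    max_free : optional upper limit on the size of the free set
    seed     : fixed seed so the same selection can be regenerated
    """
    n_free = round(fraction * n_reflections)
    if max_free is not None:
        n_free = min(n_free, max_free)
    rng = random.Random(seed)
    free_indices = set(rng.sample(range(n_reflections), n_free))
    return [i in free_indices for i in range(n_reflections)]


# 5% of a small dataset, and a capped free set for a large one.
small = assign_free_flags(1000)
large = assign_free_flags(105077, fraction=0.05, max_free=2000)
print(sum(small), sum(large))
```

Whatever tool generates the flags, the selection should be made once and then carried through every later refinement run of the same data.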

Rfree and NCS
• NCS operators map different regions of the reciprocal asymmetric unit onto each other. Reflections in these regions are correlated.
[Figure: working reflections and free reflections in reciprocal space; gaps = free set]

Rfree and NCS
• Solution: choose free set from thin shells in reciprocal space
Pros:
• NCS operators link regions of the same resolution, which should both be in a shell or both outside it
Cons:
• Large number of shells → thin shells → most free reflections close to an edge and correlated to non-free reflections
• Small number of shells → significant gaps in resolution range, poor determination of σA
SFTOOLS: RFREE 0.05 SHELL 0.001
• 3rd argument = width of shells in Å^-1
• Also DATAMAN.
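The thin-shell idea can be illustrated with a small sketch that flags reflections purely by their resolution, s = 1/d in Å^-1. This is not how SFTOOLS or DATAMAN place their shells (that is not stated here); the evenly spaced shells, the function name `shell_free_flags` and the synthetic s values are all assumptions of the example.

```python
import math


def shell_free_flags(s_values, fraction=0.05, width=0.001):
    """Flag reflections lying in evenly spaced thin resolution shells.

    s_values : 1/d for each reflection, in reciprocal Angstroms
    fraction : approximate fraction of reflections wanted in the free set
    width    : width of each free shell, in reciprocal Angstroms

    Shells repeat with period width / fraction; the free shell is the
    last `width` of each period, so reflections of equal resolution
    always receive the same flag.
    """
    period = width / fraction
    flags = []
    for s in s_values:
        position = s / period
        fractional_part = position - math.floor(position)
        flags.append(fractional_part >= 1.0 - fraction)
    return flags


# Synthetic, evenly spaced s values out to 0.5 (i.e. 2 Angstrom resolution).
s_values = [i / 10000 for i in range(1, 5001)]
flags = shell_free_flags(s_values, fraction=0.05, width=0.001)
print(f"{sum(flags)} of {len(flags)} reflections flagged free")
```

The trade-off on the slide is visible in the two arguments: a smaller width gives more, thinner shells for the same fraction, so more free reflections sit next to working reflections.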

[Figure: free-set shells for 1xmp (1.8 Å) and XXX (3.8 Å). Panels: width 0.01, 3 shells; width 0.005, 3 shells; width 0.0013, 20 shells (default); width 0.0005, 20 shells (default)]

Rfree and NCS
• Can increase size of free set to mitigate edge effects
• Or use NCS-related free set islands
• Reflections also correlated to immediate neighbours in reciprocal space - can exclude these from working and free sets
  Fabiola, Korostelev & Chapman, Acta Cryst D62, 227 (2006)
• Rapidly run out of working reflections!
Be aware that correlations can artificially reduce your Rfree

Rfree and twinning
Twinning operator might relate e.g. reflection (1,2,3) to (2,1,-3)
These two reflections should both be in the working set or both in the free set.
• Select free set in thin shells (as NCS)
• Select free reflections in higher lattice symmetry
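One way to keep twin-related reflections together is to draw the free flag once per twin-related pair rather than once per reflection. The sketch below uses the example operator (h,k,l) → (k,h,-l) from the slide; the function names are hypothetical, and it assumes the operator is a two-fold (applying it twice returns the original indices) and that twin mates appear in the list under exactly those indices, with no re-indexing by crystallographic symmetry.

```python
import random


def twin_mate(hkl):
    """Example twin operator from the slide: (h, k, l) -> (k, h, -l)."""
    h, k, l = hkl
    return (k, h, -l)


def twin_aware_free_flags(reflections, fraction=0.05, seed=0, twin_op=twin_mate):
    """Assign free flags so that twin-related reflections share a flag."""
    rng = random.Random(seed)
    pair_flag = {}   # one random draw per twin-related pair
    flags = {}
    for hkl in reflections:
        # Canonical key for the pair: the smaller of the two index triples.
        key = min(hkl, twin_op(hkl))
        if key not in pair_flag:
            pair_flag[key] = rng.random() < fraction
        flags[hkl] = pair_flag[key]
    return flags


reflections = [(h, k, l) for h in range(-3, 4) for k in range(-3, 4) for l in range(-3, 4)]
flags = twin_aware_free_flags(reflections, fraction=0.1)
print(flags[(1, 2, 3)] == flags[(2, 1, -3)])
```

Selecting the free set in the higher lattice symmetry achieves the same thing without an explicit operator, since the twin mates are then symmetry equivalents.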

Transferring free R sets
Use the same free set for:
• additional datasets for same protein
• datasets from isomorphous proteins (derivatives, complexes, etc.)
  (how isomorphous is not clear, but play safe ...)
Otherwise:
• initial R & Rfree will be similar and low for second structure - it has been refined against most of your free reflections
• Further refinement may lead to divergence of R & Rfree, masking the bias.
• Harder to detect over-fitting. Although may eventually reset Rfree.
How: Use "CAD" / "Merge MTZ files (CAD)" in CCP4.

Useful resources
• http://ccp4wiki.org/ - CCP4 Wiki
• http://strucbio.biologie.uni-konstanz.de/ccp4wiki/ - CCP4 community wiki
• Proceedings of Study Weekend 2004 (Acta Cryst D, Dec 2004)
