Native-Source Structural Proteomics

Native-Source Structural Proteomics Nathaniel Echols*, Monica Totir*, Andrew May#, Chloe Zubieta*, Alisa Moskaleva*, Tom Alber* * UC Berkeley # Fluidigm Corporation Protein Structure Initiative Bottlenecks Workshop April 15th, 2008

Native-source structural proteomics • Native sources provide access to samples that may be difficult to obtain by recombinant methods • Project goal: obtain structures of complexes and low-abundance proteins 1. Scale up purification (>100 g protein) 2. Scale down crystallization (picoliter reactions) • No cloning, no overexpression.

Experimental approach • Use E. coli as a model system to develop the purification protocol necessary to go from grams of starting material to 100 μg fractions • Screen the final samples at a concentration of >10 mg/ml in Fluidigm Topaz chips and identify the crystallizable fractions • Identify samples by mass spectrometry • Set up the selected samples in diffraction-capable chips or nanodrop crystallization trays for X-ray data collection

Proof-of-concept: the E. coli proteome • Small, well-studied proteome, but still some novelty: • 4243 predicted proteins (manageable number of molecular species) • 860 membrane proteins • 1000 proteins with > 90% sequence identity to known structures • 1250 with > 50% sequence identity • 2000 with > 30% sequence identity • Nearly 1400 uncharacterized non-membrane proteins • Existing structures allow us to validate approach • Easy to grow in massive quantities • Lysis and clarification are relatively simple
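The counts on this slide are easier to compare as fractions of the predicted proteome. A minimal sketch using only the numbers quoted above; the variable and function names are illustrative, and "nearly 1400" is entered as 1400.

```python
# Counts quoted on the slide (E. coli proteome).
TOTAL_PROTEINS = 4243

counts = {
    "membrane proteins": 860,
    ">90% identity to known structure": 1000,
    ">50% identity to known structure": 1250,
    ">30% identity to known structure": 2000,
    "uncharacterized non-membrane": 1400,  # slide says "nearly 1400"
}

def fraction_of_proteome(count, total=TOTAL_PROTEINS):
    """Return count as a percentage of the predicted proteome."""
    return 100.0 * count / total

for label, count in counts.items():
    print(f"{label}: {count} ({fraction_of_proteome(count):.1f}%)")
```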

Proteome component sizes Cellular protein content is dominated by large assemblies

Purification scheme A new philosophy ("keep everything") required new strategies Lyse at pH 7-8 Cross flow size fractionation – 500 kDa TFF Proteins/complexes bigger than 500 kDa Proteins/complexes smaller than 500 kDa Sucrose gradients SP Sepharose Capto Q Steps Size exclusion chromatography Superdex 200 Phenyl MonoQ Phenyl MonoQ/MonoS MonoQ/MonoS Scalable, gentle purification scheme

Purification scheme (continued) Proteins/complexes smaller than 500 kDa • Column size (approx. protein quantity): 1-2 L (50 g); 300 mL (10 g); 20-50 mL (1 g); 1-8 mL (10-100 mg) • Steps: Blue Heparin, Capto MMC, Superdex 200, Phenyl, MonoQ/MonoS • Figure: typical anion exchange chromatogram of the final samples

The first large-scale prep • 200 g of E. coli cells grown in M9 minimal medium and lysed • Purification scheme: Capto Q, Phenyl, MonoQ/MonoS • 272 fractions analyzed in 96-well Caliper electrophoresis robot and selected for crystallization • Figure: Caliper “gel”

Crystallization pipeline • Purity checked by Caliper gel • Microfluidic crystallization with the Fluidigm TOPAZ system (8.96 chips) • Promising chip crystals: MS identification, then diffraction-capable chips, then X-ray data collection • Sub-optimal chip crystals: MS identification, then 96-well sitting drop for further optimization

Microfluidic crystallization • 272 samples set in Fluidigm TOPAZ 8.96 chips with Index screen • Automated inspection and scoring required to find crystals efficiently • 190/272 (70%) produced crystals or microcrystals in chips (high redundancy in crystal forms) • 50 unique crystal forms by visual inspection • High-quality crystals possible even in very impure samples http://www.fluidigm.com/topaz.htm
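The hit rate and the redundancy in crystal forms follow directly from the counts on this slide. A small sketch of that arithmetic, with illustrative variable names:

```python
# Counts quoted on the slide.
samples_screened = 272
samples_with_crystals = 190   # crystals or microcrystals in chips
unique_crystal_forms = 50     # judged by visual inspection

hit_rate = samples_with_crystals / samples_screened
redundancy = samples_with_crystals / unique_crystal_forms

print(f"Hit rate: {hit_rate:.0%}")
print(f"Crystallizing samples per unique form: {redundancy:.1f}")
```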

Crystal optimization • 66 samples picked for optimization in nanodrop vapor diffusion trays (using Mosquito robot) • Protocol: sample 40%-100% precipitant concentration with different protein:well ratios (1:3, 1:1, 3:1) • 50 of the 66 hits (76%) were reproducible by this method
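The optimization protocol amounts to a small grid of precipitant concentration against protein:well ratio. A hedged sketch of how such a grid could be enumerated: the three ratios come from the slide, but the number of concentration steps (`n_steps`) is an assumption, since the slide gives only the 40%-100% range.

```python
from itertools import product

# Protein:well ratios quoted on the slide.
RATIOS = [(1, 3), (1, 1), (3, 1)]

def optimization_grid(n_steps=4, low=40.0, high=100.0, ratios=RATIOS):
    """Enumerate (precipitant %, protein parts, well parts) conditions.

    n_steps is an assumption: the slide gives only the 40%-100% range.
    """
    if n_steps < 2:
        raise ValueError("need at least two concentration steps")
    step = (high - low) / (n_steps - 1)
    concentrations = [low + i * step for i in range(n_steps)]
    return [(c, p, w) for c, (p, w) in product(concentrations, ratios)]

grid = optimization_grid()
for conc, protein, well in grid:
    print(f"{conc:5.1f}% precipitant, protein:well = {protein}:{well}")
```

With the assumed four concentration steps this gives 12 conditions per sample; changing `n_steps` rescales the grid without touching the ratios.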

Diffraction-capable microfluidic chips “Hands-Free” data collection Reagents Samples 10 nL sample chambers ALS Beamline 8.3.1

Structure determination • MS identification of unique crystals should be the first step • 25 unique native datasets collected at ALS 8.3.1/12.3.1 • 15 already published structures identified • 3 structures novel in E. coli, phased by MR • Robotics and automation software used for data collection and processing whenever possible

Rapid structure identification by MR • Concept: identify protein from “anonymous” diffraction data (no mass spec info) • Search set of every PDB structure homologous to an E. coli protein (~10,000 models) • Molecular replacement rotation function run using each model • Identical structures are usually high-scoring • Homologous proteins may still score better than average • Potential solutions can be verified by full MR
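The ranking step of this brute-force search can be sketched as follows. This is an illustration only: the rotation-function scores would come from a molecular replacement program that is not invoked here, the scores below are made up, and the Z-score cutoff is an assumed way of expressing "scores better than average", not the authors' stated criterion.

```python
from statistics import mean, pstdev

def rank_models(scores, z_cutoff=3.0):
    """Rank search models by rotation-function score.

    scores: dict mapping model ID -> rotation-function score
    Returns (model, score, z) tuples, best first, for models whose
    Z-score against the whole search set is at least z_cutoff.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all models score the same: nothing stands out
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(m, s, (s - mu) / sigma) for m, s in ranked
            if (s - mu) / sigma >= z_cutoff]

# Made-up scores: a flat background plus one outlier.
example = {f"model_{i:02d}": 4.0 + 0.1 * (i % 5) for i in range(20)}
example["model_hit"] = 9.5

candidates = rank_models(example)
for model, score, z in candidates:
    print(f"{model}: score={score:.2f}, Z={z:.1f}")
```

Models that pass the cutoff are only candidates; as the slide notes, each would still need to be verified by full MR.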

Experimental phasing • The largest bottleneck: much more manual labor required • Cryoprotectants contain heavy monovalent ions (Br-, Rb+) • Metal quick-soaks (0.5 - 5 mM): • Ethyl mercury phosphate/thimerosal • HgCl2 or PCMBS (p-chloromercuribenzenesulfonic acid) • SmCl3 • PtCl4, PtCl6

Current structures, new and old (Structures labelled in red were identified by brute-force search.) New (% identity to PDB): • Methylglyoxal reductase (37%) • Phosphoglucose isomerase (65%) • β-glucosidase (?) (bglA) (33%) Old: • ycaC • Argininosuccinate lyase • Catalase HPII (also in truncated form) • Lysyl-tRNA synthetase • Dihydrodipicolinate synthase • Citrate synthase • Cystathionine -synthase • Transhydrogenase domain I • Phosphoserine aminotransferase • Pyruvate kinase • Hsp31 chaperone • 5-keto-4-deoxyuronate isomerase • PPIase • Molybdopterin biosynthesis prot. B

Purity of crystallized samples

Summary • Macro-to-micro strategy tested with E. coli • Large-scale fractionation pipeline: • New approaches and equipment (TFF, larger columns, Caliper CE robot) needed to scale up and keep everything • Currently 464 fractions isolated for crystallization • Small-scale crystallization: • >50% of fractions crystallized in Topaz microfluidic format • Many impure fractions yielded starting crystals • Optimization in sitting drops and new diffraction chips was efficient • Structure determination: • 25 data sets collected, 18 structures phased, all oligomeric • 3 structures novel to E. coli • Brute-force molecular replacement was used in most cases

Future directions • Continue improvements to purification methods • Pathogenic organisms (e.g. Mycobacteria) • Plant/mammalian proteomes: diploid, much larger and more complex • Smaller sets of related proteins: • Protease-resistant domains • Serum proteins • ATP-binding proteins • Metalloproteins • Large complexes

Acknowledgements • Tom Alber, Monica Totir, Chloe Zubieta, Alisa Moskaleva • Andy May (Fluidigm) • Scott Gradia, James Berger (UCB) • James Holton (ALS) • George Meigs, Jane Tanamatchi (ALS) • ALS beamlines 8.3.1, 12.3.1 • Tony Iavarone (QB3 MS facility) • Scripps Center for Mass Spectrometry • W.M. Keck Foundation • Millipore Corporation • Funded in part by UC Discovery/Fluidigm Corporation and NIGMS grant GM71326-02

Second large-scale prep – a better purification scheme • 1000 g of E. coli cells grown in M9 minimal medium and lysed • 192 unique final samples to be screened in 8.96 chips and subsequently set up in diffraction-capable chips • Flowchart labels: Lysate at pH 7; Cross flow size fractionation – 500 kDa TFF; Proteins/complexes bigger than 500 kDa; Proteins/complexes smaller than 500 kDa; SP Sepharose; Blue Heparin; Capto MMC; Sucrose gradients; Size exclusion chromatography; Superdex 200; MonoQ; Phenyl; MonoQ/MonoS

Apparently rare proteins accessible (Figure: # genes vs. abundance (# transcripts))
