Comprehensive strategy for integrated target selection in structural genomics

Burkhard Rost, CUBIC, Columbia University http://cubic.bioc.columbia.edu/mis/talks/ http://cubic.bioc.columbia.edu

Comprehensive strategy for integrated target selection • Our research goal and current reality • Unit: sequence-structure families • Goals: cover entire families with good models • STAGE 1: CHOP + CLUP + filtering -> novel automatic organization of sequence-structure space • STAGE 2: Refined, manual selection -> model all family members? stop-work/hold-work? • STAGE 3: Explore experimental structure • Answers and perspectives • How many structures needed for completion? • Euka-proka-archae: overlap? • Why collaborate on targets? • Multiplexing helpful? • High-throughput protein production in eukaryotes?

Computational biology & bioinformatics

Sequence-structure family [Figure: sequence-structure families U and U’]

EVA: comparative modelling [Figure: cumulative distribution of accuracy and coverage; PSI-BLAST 10^-3] Marc Marti-Renom & Andrej Sali (UCSF) http://eva.compbio.ucsf.edu/~eva/cm/ http://cubic.bioc.columbia.edu/eva V Eyrich, MA Marti-Renom, D Przybylski, A Fiser, F Pazos, A Valencia, A Sali & B Rost (2001) Bioinformatics 17, 1242-1243; MA Marti-Renom, MS Madhusudhan, A Fiser, B Rost, A Sali (2002) Structure 10, 435-440

How to decide when we exclude/include? C Sander & R Schneider 1991 Proteins, 9, 56-68 B Rost 1999 Prot Engng, 12, 85-94
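One way to make the include/exclude decision concrete is the HSSP curve from the two papers cited on this slide. The sketch below uses the length-dependent threshold as recalled from Rost (1999); the constants should be checked against that paper before use, and the function names are hypothetical.

```python
import math

def hssp_threshold(length: int) -> float:
    """Percent identity on the HSSP curve for an alignment of `length` residues.

    Constants as recalled from Rost (1999); verify against the paper.
    """
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5

def hssp_distance(percent_identity: float, length: int) -> float:
    """Distance (in percentage points) of an alignment from the curve."""
    return percent_identity - hssp_threshold(length)

def is_family_member(percent_identity: float, length: int,
                     min_distance: float = 0.0) -> bool:
    """Include the pair if it lies at least `min_distance` above the curve."""
    return hssp_distance(percent_identity, length) >= min_distance
```

A negative `min_distance` (the deck later quotes HSSP-distance -3) loosens the cut-off, so more pairs count as already covered by a known structure.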

Scooping families from proteomes, in practice Problems: • domains • overlaps

Choose targets: single-linkage clustering ~100,000 eukaryotic proteins (yeast, fly, worm, weed, human) -> 22 112 clusters, 46 318 in largest cluster: NONSENSE! Conclusions: • NO clustering of full-length proteins • have to chop into structural-domain-like fragments (single-linkage DOES work on PrISM) Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press; Liu & Rost 2003 Proteins, submitted
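Why single-linkage clustering of full-length proteins collapses into one giant cluster can be illustrated with a toy example. This is a minimal sketch, not the pipeline used in the talk: the protein names, the domain labels and the `single_linkage` helper are all hypothetical.

```python
def single_linkage(items, links):
    """Cluster `items`: two items share a cluster if a chain of links joins them."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())

# Full-length proteins: B carries domains x and y, so it links A (x only)
# to C (y only) although A and C share no domain.
full_length = single_linkage(["A", "B", "C"], [("A", "B"), ("B", "C")])

# After chopping into domain-like fragments, the two domains separate.
chopped = single_linkage(
    ["A:x", "B:x", "B:y", "C:y"],
    [("A:x", "B:x"), ("B:y", "C:y")],
)

print(full_length)  # [['A', 'B', 'C']]
print(chopped)      # [['A:x', 'B:x'], ['B:y', 'C:y']]
```

Multi-domain proteins act as bridges between otherwise unrelated families, which is consistent with the slide's conclusion that proteins have to be chopped before clustering.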

CHOP proteins into structural domains Liu & Rost 2003 Proteins, submitted

CHOP: dissection of proteins into domains Average domain length • in proteins with ≥ 2 domains: ~100 residues • in proteins with 1 domain: 1.7-3 times longer Single-domain proteins: 61% in PDB, 28% in 62 proteomes Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press; Liu & Rost 2003 Proteins, submitted

To take or not to take Take if > 50 globular residues and no known 3D
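The take/skip rule on this slide can be written as a single predicate. The sketch below assumes each fragment already carries a count of globular residues and a flag for a known 3D structure; the `Fragment` type and its field names are hypothetical.

```python
from dataclasses import dataclass

MIN_GLOBULAR_RESIDUES = 50  # threshold quoted on the slide

@dataclass
class Fragment:
    name: str
    globular_residues: int  # residues predicted to be globular
    has_known_3d: bool      # True if a structure is already known

def take(fragment: Fragment) -> bool:
    """Take a fragment if it has > 50 globular residues and no known 3D structure."""
    return (fragment.globular_residues > MIN_GLOBULAR_RESIDUES
            and not fragment.has_known_3d)

candidates = [
    Fragment("frag1", 120, False),  # long enough, no structure: take
    Fragment("frag2", 120, True),   # already has a structure: skip
    Fragment("frag3", 50, False),   # not strictly more than 50: skip
]
targets = [f.name for f in candidates if take(f)]
print(targets)  # ['frag1']
```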

Structural residue coverage in reality (any) 53% of residues to do ! ~28% ~19% J Liu & B Rost 2002 Bioinformatics, 18, 922-933

If you believe 53% is pessimistic ... 53% residue coverage today based on E-value 1!!

Clustering after CHOP: 21,000 fragment clusters (Jinfeng) • 103 796 eukaryotic proteins (Yeast, Fly, Worm, Arabidopsis, Human/30) • 247 222 domain-like fragments • 167 717 no PDB (E-value 10^-1, HSSP-distance -3) • 44 718 not good 4 us (membrane, coil, SEG, NORS, signal peptide) • 122 999 2 go • 95 330 non-singleton Liu, Montelione & Rost 2003 Proteins, in press
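The counts on this slide form a simple filtering funnel, and the arithmetic can be checked directly. The numbers below are copied from the slide; the variable names are illustrative, and treating the 95 330 non-singletons as a subset of the 122 999 remaining fragments is an assumption about how the slide should be read.

```python
# Counts copied from the slide.
domain_like_fragments = 247_222
no_pdb = 167_717         # fragments without a PDB match
excluded = 44_718        # membrane, coil, SEG, NORS, signal peptide
non_singleton = 95_330   # assumed to be a subset of the remaining fragments

remaining = no_pdb - excluded
singletons = remaining - non_singleton
fraction_without_structure = no_pdb / domain_like_fragments

print(f"fragments left to go: {remaining}")
print(f"of which singletons:  {singletons}")
print(f"share of fragments without PDB match: {fraction_without_structure:.1%}")
```

The difference 167 717 - 44 718 reproduces the 122 999 quoted on the slide.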

Main goal of Stage 2 analysis (Diana Murray, Cornell) • Refine Stage 1 automatic target selection through manual sequence analysis • Concept: USE comparative modeling and structural features directly for refined target selection • For each sequence-structure family from Stage 1: predict the minimal set of experimental structures needed to build high-quality models for the entire family.

Refinement protocol 4 new 3D: target re-prioritization based on weekly PDB updates (Diana Murray, Cornell) Input: PDB + NESG cluster Toolbox: 1. Fold recognition and sequence-to-structure profiles 2. Comparative modeling (PrISM, Nest) 3. Structure evaluation tools (e.g. Verify3d) 4. Calculate biophysical properties Recommend 2 do additional structure if: 1) NESG-cluster members poorly modeled 2) Biophysical properties of models incompatible with known function 3) Models suggest novel functionality
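The three recommendation criteria above amount to a logical OR over per-family findings. The sketch below only encodes that decision rule; how each finding is derived (modeling, evaluation, biophysical calculations) is outside its scope, and the `FamilyAssessment` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FamilyAssessment:
    members_poorly_modeled: bool
    properties_incompatible_with_function: bool
    models_suggest_novel_function: bool

def reasons_for_additional_structure(assessment: FamilyAssessment) -> List[str]:
    """Return the criteria from the slide that apply; empty means no recommendation."""
    reasons = []
    if assessment.members_poorly_modeled:
        reasons.append("NESG-cluster members poorly modeled")
    if assessment.properties_incompatible_with_function:
        reasons.append("biophysical properties incompatible with known function")
    if assessment.models_suggest_novel_function:
        reasons.append("models suggest novel functionality")
    return reasons

def recommend_additional_structure(assessment: FamilyAssessment) -> bool:
    """Recommend another structure if at least one criterion applies."""
    return bool(reasons_for_additional_structure(assessment))
```

Returning the list of reasons rather than a bare flag keeps the recommendation traceable when priorities are revisited after a PDB update.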

Example of stop-work recommendation (Diana Murray, Cornell; SPINE/ZebaView)
Target   Status
IR21     solved, PDB: 1MOS
ET28     Purified
JR15     Expressed
TT777    Expressed
GR7      Expressed
AR12     Cloned
WR204    Selected
XR4      Expressed
Experimental structure of IR21 yielded high-quality models for all members of this NESG sequence/structure family -> Stop work

Two structures required to cover family: predicted by Stage 2 analysis and verified by Stage 3 analysis (Diana Murray, Cornell) [Figure: panel A: HR291, AR1731, HR2295; panel B: KR12, DR11] NESG family: HR291 (99% identical to 1P9O), AR1731, HR2295, KR12, DR11 breaks into two clusters: A = (HR291, AR1731, HR2295) and B = (KR12, DR11) Recommendation: Solve structure of KR12 (purified)
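Choosing which structures to solve so that every family member can be modeled is a set-cover problem. The sketch below uses a greedy heuristic, which is a standard approximation and not necessarily the procedure used in the Stage 2 analysis. The `covers` mapping (which members a structure of a given target would model well) is a hypothetical input, filled here from the two clusters named on the slide.

```python
def greedy_structure_set(family, covers, solved=()):
    """Greedily pick targets until every family member is covered.

    family: ordered list of target IDs (earlier entries win ties)
    covers: target ID -> set of members modeled well from its structure
    solved: targets whose structure is already available
    """
    uncovered = set(family)
    for target in solved:
        uncovered -= covers[target]

    chosen = []
    while uncovered:
        best = max(family, key=lambda t: len(covers[t] & uncovered))
        if not covers[best] & uncovered:
            raise ValueError("some members cannot be covered by any target")
        chosen.append(best)
        uncovered -= covers[best]
    return chosen

cluster_a = {"HR291", "AR1731", "HR2295"}
cluster_b = {"KR12", "DR11"}
family = ["HR291", "AR1731", "HR2295", "KR12", "DR11"]
# Assumption for illustration: a structure of any member models its whole cluster.
covers = {t: (cluster_a if t in cluster_a else cluster_b) for t in family}

# HR291 is treated as solved because it is 99% identical to 1P9O.
print(greedy_structure_set(family, covers, solved=["HR291"]))  # ['KR12']
```

With cluster A already covered, one structure from cluster B completes the family, which matches the slide's recommendation to solve KR12. The greedy choice is not guaranteed to be minimal in general.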

Model suggests novel function: 30S ribosomal protein S27 (Diana Murray, Cornell) Archaeal structure (Archaeoglobus fulgidus): NESG ID: GR2; PDB ID: 1QXF. S27e protein has only archaeal and eukaryotic members. Archaea and eukaryotes share a conserved hydrophobic motif (yellow). Only eukaryotes have an N-terminal extension, and their models have strikingly different electrostatic properties. Model for human homologue: human protein recommended for structure determination!

Summary Stage 2 refinement (Diana Murray, Cornell) • Statistics: many families currently under investigation • Hold-work recommendation: family member at advanced experimental stage and predicted to yield good models for entire family -> hold-work for members at early experimental stages; re-assess once structure done!

Exploit structure to speculate about function (Sharon Goldsmith & Barry Honig) • 43 no previous annotation about function, defined by ‘no publication in biological journal’; 39 analyzed • 31 result in some predictions about function • 8 clear success: functional annotation achieved, e.g. predicted active site based on structure; typically: confirmation of annotation transfer • 23 some hints (16 ‘hypothetical proteins’), e.g. some clue about active site; mostly completely new! • 8 no clue

Answers • How many structures needed for completion? • Euka-proka-archae: overlap? • Why collaborate on targets? • Multiplexing helpful? • High-throughput protein production in eukaryotes?

How many targets for prokaryotes + archae? 16,000 min 8,000 give: 72% fragments 72% proteins 67% residues

How many targets for euka-proka-archae? 8,000 8,000 give: 67% fragments 67% proteins 59% residues BUT: 50% of residues remaining

Overlap between euka-proka-archae? • ~60% of fragments from eukaryotes have no sequence-structure family member from prokaryotes or archae • much higher for ‘largest 8,000’: • 2,690 (34%) proka+archae only • 4,277 (53%) euka only • 1,033 (13%) mix • surprisingly small overlap overall • even lower for largest families • most big families are eukaryotic!
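The percentages for the ‘largest 8,000’ families follow directly from the three counts on the slide. The short check below recomputes them; the dictionary keys are illustrative labels.

```python
# Counts copied from the slide for the 8,000 largest families.
largest_families = {
    "proka+archae only": 2_690,
    "euka only": 4_277,
    "mix": 1_033,
}

total = sum(largest_families.values())
shares = {label: 100.0 * count / total for label, count in largest_families.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```

The three groups add up to exactly 8,000, and only the ‘mix’ group (about 13%) contains members from both eukaryotes and prokaryotes/archaea.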

Why collaborate on target list? 32% overlap: competition between consortia has already hampered the success rate considerably!

Does multiplexing help? Date: 2003-07-28 ~4% Multiplex DOUBLES success rate!

Integrated strategy • NESG unique, comprehensive, integrated strategy optimized to organize sequence space in structural terms: • Stage 1: CHOP+CLUP+filter yields high success in focusing on sequence-structure families • Stage 2: detailed refinement embeds comparative models into selection and optimizes structural coverage for family • Stage 3: use experimental structure to increase structural family coverage and to allow functional exploitation • Needed to do ‘em all: • ~38,000 non-singletons • 8,000 largest -> 50% of the residues that remain! • Genomics: Surprises + our structural perspective changed the ‘world’! The revolutions continue ...

Thanksgiving Data: Jinfeng Liu (CUBIC), Hedi Hegyi & Phil Carter (CUBIC), Marc Marti-Renom (UCSF) NESG: Guy Montelione (Rutgers), Barry Honig (Columbia), Diana Murray (Cornell, NYC), Tom Acton (Rutgers), Liang Tong & John Hunt (Columbia), George DeTitta (Buffalo), Cheryl Arrowsmith (Toronto), Wayne Hendrickson (Columbia) EVA: Andrej Sali & Marc Marti-Renom (UCSF), Alfonso Valencia (Madrid), Volker Eyrich, Ingrid Koh & Dariusz Przybylski (CUBIC) $$: NIH/NSF
