Proteogenomics: Refining and Improving Genome Annotation

Samuel H Payne, J Craig Venter Institute

State of Genome Annotation
• Most prokaryotic genomes are auto-annotated.
• Sequence and function are inferred with comparative genomics; validation is sparse.
• Difficulties with novel or horizontally transferred (HGT) genes
• Mature protein features
  • Localization
  • Post-translational modification (PTM), cleavage
(Salzberg 2007)

Diversity or Confusion

Proteomics
• Input: protein sample
• Output: list of peptides

Proteogenomics
• Definition: using proteomics data to annotate the genome
• Goals:
  • Find all coding regions of the genome, annotated and unannotated
  • Submit improved annotation to NCBI
  • Identify "mature protein" features
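The idea of finding coding regions "annotated and unannotated" can be sketched as mapping identified peptides onto a six-frame translation of the genome. This is a minimal illustration only: the function names and the toy genome are assumptions made for the example, not the pipeline used in this work, and real pipelines typically search spectra against the translated genome rather than string-matching peptides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frame_translations(genome):
    """Yield (strand, frame, protein) for all six reading frames.

    Stop codons are kept as '*'; unknown codons become 'X'.
    """
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
            yield strand, frame, "".join(CODON_TABLE.get(c, "X") for c in codons)


def map_peptide(peptide, genome):
    """Return every (strand, frame, residue_offset) where the peptide occurs."""
    hits = []
    for strand, frame, protein in six_frame_translations(genome):
        pos = protein.find(peptide)
        while pos != -1:
            hits.append((strand, frame, pos))
            pos = protein.find(peptide, pos + 1)
    return hits


# Toy genome (hypothetical): frame +0 translates to M A K R W *.
toy_genome = "ATGGCTAAACGTTGGTAA"
print(map_peptide("AKRW", toy_genome))
```

A peptide that maps outside every annotated gene is a candidate unannotated coding region; a peptide on the reverse strand is reported with strand "-".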

Proteogenomics Protocol
• Data sources
  • Yersinia pestis: Pieper et al., 2008, 2009
  • Bacillus anthracis: PRC/NIAID

Correcting Errors
• Unannotated genes
  • Both known and totally novel

Correcting Errors
• Start site assignment
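One way to picture start site correction: if a peptide maps in-frame upstream of the annotated start, the gene must begin earlier. The sketch below walks upstream codon by codon from the peptide until it hits an in-frame stop, collecting possible start codons on the way. The function names, the choice of ATG/GTG/TTG as start codons, and the toy sequence are assumptions made for illustration; it handles the forward strand only.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed common prokaryotic starts
STOP_CODONS = {"TAA", "TAG", "TGA"}


def needs_extension(annotated_start, peptide_start):
    """True if a peptide begins upstream of the annotated start codon."""
    return peptide_start < annotated_start


def candidate_starts(seq, peptide_start):
    """List in-frame start codon positions at or upstream of a peptide.

    Walks upstream from the peptide's first codon, one codon at a time,
    and stops at the first in-frame stop codon. Positions are 0-based and
    returned nearest-first.
    """
    starts = []
    pos = peptide_start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            starts.append(pos)
        pos -= 3
    return starts


# Toy example (hypothetical): TAA GTG AAA ATG CCC GGG TAA
seq = "TAAGTGAAAATGCCCGGGTAA"
annotated_start = 9          # the ATG
peptide_start = 6            # a peptide observed starting at AAA
if needs_extension(annotated_start, peptide_start):
    print(candidate_starts(seq, peptide_start))
```

Which candidate to accept when several exist is a separate decision that this sketch does not make.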

Exceptions to Rules
• Multi-ORF genes: self-splicing, frameshift

Exceptions to Rules
• Non-canonical start codons
  • infC: ATT in enterobacteria (Sacerdot 1982, Payne 2010); ATA in Shewanella (Gupta 2007)
  • Deinococcus (Baudet 2009) suggests new non-standard starts

Overlaps/Wrong Frames

Pseudo?genes
• 5 peptides (with splicing) map to a transposable element gene. The sequence aligns to an Arabidopsis Ulp1 (Castellana 2008).
• Expression of an ABC transporter N-terminus that is missing critical motif elements.

Signal Peptide
• N-terminal motif that targets a protein for export
• Perlman & Halvorson 1983
• Early basic residue, hydrophobic patch, AxB motif
  • A = [I,V,L,A,G,S], B = [A,G,S]
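The three features above (early basic residue, hydrophobic patch, AxB motif) can be turned into a rough screen. This is a sketch under stated assumptions: the residue classes for A and B come from the slide, but the window sizes, the hydrophobic residue set, the minimum patch length, and the toy sequence are all made up for illustration and are not the thresholds used in this work.

```python
import re

# AxB motif from the slide: A in [IVLAGS], any residue, B in [AGS].
# The lookahead lets overlapping motifs be reported.
AXB = re.compile(r"(?=([IVLAGS].[AGS]))")
BASIC = set("KR")
HYDROPHOBIC = set("AILMFVW")   # assumed hydrophobic set


def axb_sites(protein, search_limit=40):
    """Return 0-based start positions of AxB motifs near the N-terminus."""
    return [m.start() for m in AXB.finditer(protein[:search_limit])]


def longest_hydrophobic_run(segment):
    """Length of the longest run of consecutive hydrophobic residues."""
    best = run = 0
    for aa in segment:
        run = run + 1 if aa in HYDROPHOBIC else 0
        best = max(best, run)
    return best


def candidate_cleavage_sites(protein, basic_window=8, min_patch=7,
                             search_limit=40):
    """Return positions of the first mature residue for each plausible site.

    Requires a basic residue within the first `basic_window` residues
    (excluding the initial Met) and a hydrophobic patch of at least
    `min_patch` residues before the AxB motif. Cleavage is assumed to
    fall immediately after B.
    """
    if not any(aa in BASIC for aa in protein[1:basic_window]):
        return []
    sites = []
    for start in axb_sites(protein, search_limit):
        if longest_hydrophobic_run(protein[:start]) >= min_patch:
            sites.append(start + 3)
    return sites


# Toy sequence (hypothetical, not a real protein).
toy = "MKR" + "L" * 10 + "ASA" + "QDEDEDE"
print(candidate_cleavage_sites(toy))
```

Observed N-terminal peptides from proteomics can then be compared against these predicted cleavage positions.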

Profile of an Exported Protein
• Early basic residue, hydrophobic patch, motif

Future
• Rinse and repeat
• 30 proteomes in 3 years
• Stable, robust pipeline for general use
• Hosted at TeraGrid

When Gene Predictors Fail
• Are GC extremes difficult?
  • 50% GC (Y. pestis): 4 missed
  • GC in the 30s (B. anthracis, L. interrogans): 4 and 20 missed
  • GC in the 60s (D. vulgaris, D. radiodurans): 55 and 225 missed

Are They Strange?
• Relative GC: does prediction fail on genes whose GC content differs from that of other genes?
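Relative GC can be read as the difference between a gene's GC fraction and the genome-wide GC fraction. A minimal sketch, assuming that definition (the slide does not spell out the exact formula) and using hypothetical function names:

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def relative_gc(gene_seq, genome_seq):
    """Gene GC fraction minus genome GC fraction.

    Positive values mean the gene is more GC-rich than the genome overall.
    """
    return gc_content(gene_seq) - gc_content(genome_seq)


# Toy sequences (hypothetical).
print(relative_gc("GGCC", "ATGCATGC"))
```

Comparing this value between missed genes and correctly predicted genes is one way to ask whether the missed genes are compositionally unusual.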

Are They All Short?

We See What We Know
• Proximity to Model Organism
  • Yersinia/Bacillus errors: 4/4
  • 'Remote species' errors: 20, 55, >200

We See What We Know
• Hypothetical vs. Named
• Compare novel genes to observed proteome
• Hypergeometric test in which the null probability comes from the observed proteome
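The hypergeometric comparison can be sketched as follows: treat the observed proteome as the population, count how many of its proteins are "hypothetical", and ask how surprising the number of hypothetical proteins among the novel genes is. The function name and all counts below are placeholders invented for the example, not results from this work.

```python
from math import comb


def hypergeom_tail(k, N, K, n):
    """P(X >= k) for a hypergeometric draw.

    N: population size (proteins in the observed proteome)
    K: 'successes' in the population (hypothetical proteins)
    n: sample size (novel genes)
    k: successes seen in the sample (novel genes that are hypothetical)
    """
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / total


# Placeholder counts (hypothetical): 1000 observed proteins, 200 of them
# hypothetical; 20 novel genes, 12 of which have no named homolog.
p_value = hypergeom_tail(k=12, N=1000, K=200, n=20)
print(p_value)
```

A small tail probability would indicate that novel genes are enriched for hypothetical proteins relative to the observed proteome.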

Expressed Protein Resource
• Protein Sequences
  • >30 M sequences
  • nr, UniProt
  • JCVI metagenomics
  • JGI genomes
• 40,000 clusters
• Cross-referenced with proteomics, for validated proteins

Acknowledgements
• Eli Venter
• Shih-Ting Huang, Rembert Pieper
• Granger Sutton
• Dick Smith, PNNL
• NSF
