Improving the Sensitivity of Peptide Identification

Improving the Sensitivity of Peptide Identification by Meta-Search, Grid-Computing, and Machine-Learning Nathan Edwards Georgetown University Medical Center

Searching under the street-light… • Tandem mass spectrometry doesn’t discriminate against novel peptides... ...but protein sequence databases do! • Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!

Lost peptide identifications • Missing from the sequence database • Search engine strengths, weaknesses, quirks • Poor score or statistical significance • Thorough search takes too long

Lost peptide identifications • Missing from the sequence database • Build exhaustive peptide sequence databases • Build evidence for unannotated proteins and protein isoforms • Search engine strengths, weaknesses, quirks • Use multiple search engines and combine results • Poor score or statistical significance • Use search-engine consensus to boost confidence • Use machine-learning to distinguish true from false • Thorough search takes too long • Harness the power of heterogeneous computational grids

Unannotated Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • Peptide Atlas raftflow, raftapr, raftaug • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications

Unannotated Splice Isoform

Splice Isoform Anomaly • Human erythroleukemia K562 cell-line • Depth of coverage study • Resing et al. Anal. Chem. 2004. • Peptide Atlas A8_IP • SULT1A2 gene: • Sulfotransferase family, cytosolic, 1A • 2 ESTs, 1 mRNA • mRNA from lung, small-cell carcinoma sample • Single (significant) peptide identification • Five agreeing search engines • PepArML FDR < 1%. • All source engines have non-significant E-values
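An FDR figure like the one quoted above is commonly estimated from decoy searches. The following is a minimal sketch of that general target-decoy idea, not the PepArML procedure itself. It assumes higher scores are better and that each decoy database is the same size as the target database; the function name and parameters are illustrative.

```python
def decoy_fdr(target_scores, decoy_scores, threshold, n_decoy_dbs=1):
    """Estimate the FDR among target hits scoring at or above threshold.

    Assumption: higher scores are better, and each decoy database has the
    same size as the target, so decoy hits are averaged over n_decoy_dbs.
    """
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    if targets == 0:
        return 0.0
    # Expected false target hits divided by accepted target hits, capped at 1.
    return min(1.0, (decoys / n_decoy_dbs) / targets)


# Made-up scores: 4 target hits and 1 decoy hit pass a threshold of 5,
# with the decoy hits pooled from 2 decoy databases.
fdr = decoy_fdr([10, 9, 8, 7, 3, 2], [4, 3, 8, 1], threshold=5, n_decoy_dbs=2)
```

With more than one decoy database, the decoy count is averaged, which reduces the variance of the estimate when few decoys pass the threshold.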

Splice Isoform Anomaly

Peptide Sequence Databases All amino-acid sequences of at most 30 amino-acids from: • IPI and all IPI constituent protein sequences • IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank • SwissProt variants, conflicts, splices, and annotated signal peptide truncations. • GenBank and RefSeq mRNA sequence: 3-frame translation • GenBank EST and HTC sequences: 6-frame translation, and found in at least 2 sequences Grouped by Gene/UniGene cluster and compressed.
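The EST rule above (6-frame translation, peptides kept only when found in at least 2 sequences) can be sketched as follows. This is an illustrative sketch, not the actual database build: the function names are hypothetical, the standard genetic code is assumed, and grouping by gene and compression are not shown.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    return [translate(s[f:]) for s in (dna, rev) for f in range(3)]


def peptides(protein, max_len):
    """All stop-free substrings of at most max_len residues."""
    found = set()
    for orf in protein.split("*"):
        for start in range(len(orf)):
            for end in range(start + 1, min(start + max_len, len(orf)) + 1):
                found.add(orf[start:end])
    return found


def supported_peptides(sequences, max_len=30, min_support=2):
    """Peptides seen in at least min_support distinct nucleotide sequences."""
    support = defaultdict(int)
    for dna in sequences:
        seen = set()
        for frame in six_frames(dna):
            seen |= peptides(frame, max_len)
        for pep in seen:  # count each sequence once per peptide
            support[pep] += 1
    return {pep for pep, n in support.items() if n >= min_support}
```

Counting each peptide once per source sequence, rather than once per occurrence, is what makes the "at least 2 sequences" filter act as independent corroboration.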

Peptide Sequence Databases • Formatted as a FASTA sequence database • Easy integration with search engines. • One entry per gene/cluster. • Automated rebuild every few months.

Peptide evidence, in context • Statistically significant identified peptides can be misleading… • Isobaric amino-acid/PTM substitutions • Unsubstantiated peptide termini • Few b-ions or y-ions suggest “random” mass match • Single amino-acids on upstream or downstream exons • Peptides in 5’ UTR with no upstream Met • Need tools to quickly check the corroborating (genomic, transcript, SNP) evidence

PeptideMapper Web Service • Counts: by gene and evidence (EST, mRNA, Protein) • Sequences: accessions by gene, UniProt variants, nucleotide sequence & link to BLAT alignment • Genomic Loci: one-click projection onto the UCSC genome browser, peptides with cSNPs too!

PeptideMapper Web Service I’m Feeling Lucky

Combining search engine results – harder than it looks! • Consensus boosts confidence, but... • How to assess statistical significance? • Gain specificity, but lose sensitivity! • Incorrect identifications are correlated too! • How to handle weak identifications? • Consensus vs disagreement vs abstention • Threshold at some significance? • We apply unsupervised machine-learning.... • Lots of related work unified in a single framework.
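The "consensus vs disagreement vs abstention" distinction above can be made concrete with a small tally per spectrum. This is a hedged sketch of the bookkeeping only, not the PepArML method: the function name, the dictionary layout, and the peptide strings are assumptions for illustration.

```python
from collections import Counter


def tally(votes):
    """Summarize one spectrum's results across search engines.

    votes maps an engine name to the peptide it reported, or to None when
    the engine reported nothing for this spectrum (an abstention).
    """
    reported = Counter(p for p in votes.values() if p is not None)
    abstain = sum(1 for p in votes.values() if p is None)
    if not reported:
        return {"peptide": None, "agree": 0, "disagree": 0, "abstain": abstain}
    # The most frequently reported peptide is the consensus candidate.
    peptide, agree = reported.most_common(1)[0]
    disagree = sum(reported.values()) - agree
    return {"peptide": peptide, "agree": agree,
            "disagree": disagree, "abstain": abstain}


summary = tally({
    "Mascot": "PEPTIDEK",
    "X!Tandem": "PEPTIDEK",
    "OMSSA": "OTHERK",
    "MyriMatch": None,
})
```

Counts like these could serve as features for a classifier, but simple thresholding on them runs into the problems listed above: correlated errors inflate agreement, and requiring agreement discards weak but correct identifications.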

PepArML – Peptide identification Arbiter by Machine-Learning

Peptide Atlas A8_IP LTQ Dataset

Peptide Atlas Halobacterium Dataset

Running many search engines Search engine configuration can be difficult: • Correct spectral format • Search parameter files and command-line • Pre-processed sequence databases. • Tracking spectrum identifiers • Extracting peptide identifications, especially modifications and protein identifiers

Peptide Identification Meta-Search Parameters • Instrument: Precursor Tolerance, Fragment Tolerance, Max. Charge • Sequence Database: Target and # of Decoys • Modification: Fixed/Variable, Amino-Acids, Position, Delta • Proteolytic Agent: Motif • Peptide Candidates: Termini Specificity, Precursor Tolerance, Missed cleavages, Charge State Handling, # 13C Peaks • Search Engines: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch

Peptide Identification Meta-Search • Simple unified search interface for: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch • Automatic decoy searches • Automatic spectrum file "chunking" • Automatic scheduling: Serial, Multi-Processor, Cluster, Grid
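Chunking and automatic decoy searches together determine how many independent jobs a single request expands into. The sketch below shows that expansion under simple assumptions: fixed-size chunks and one job per (chunk, engine, database) combination. The function names and the database labels are hypothetical, not the meta-search tool's interface.

```python
from itertools import product


def chunk(spectra, size):
    """Split a list of spectra into consecutive chunks of at most size."""
    return [spectra[i:i + size] for i in range(0, len(spectra), size)]


def search_jobs(n_chunks, engines, n_decoys):
    """Enumerate one job per (chunk index, engine, database) combination."""
    databases = ["target"] + [f"decoy{i + 1}" for i in range(n_decoys)]
    return list(product(range(n_chunks), engines, databases))


chunks = chunk(list(range(7)), size=3)  # 7 placeholder spectra
jobs = search_jobs(len(chunks), ["X!Tandem", "OMSSA"], n_decoys=2)
```

Because every job is independent of the others, the same list can be handed to a serial loop, a multi-processor pool, a cluster queue, or a grid scheduler without changing how the jobs are defined.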

PepArML Meta-Search Engine • Single, simple search request • Secure communication • Heterogeneous compute resources: Edwards Lab Scheduler & 48+ CPUs, NSF TeraGrid 1000+ CPUs, UMIACS 250+ CPUs • Scales easily to 250+ simultaneous searches • Search engine sets shown: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch; X!Tandem, KScore, OMSSA

PepArML Meta-Search Engine • Simple search request • Secure communication • Heterogeneous compute resources: Edwards Lab Scheduler & 48+ CPUs, NSF TeraGrid 1000+ CPUs, UMIACS 250+ CPUs

Peptide Atlas A8_IP LTQ Dataset • Tryptic search of Human ESTs using PepSeqDB • 107084 spectra (145 files) searched ~ 26 times: • Target + 2 decoys, 5 engines, 1+ vs 2+/3+ charge • 8685 search jobs • 25.7 days of CPU time. • 5211 TeraGrid TKO jobs < 2 hours • Using 143 different machines • Total elapsed time < 26 hours • Bottleneck: Mascot license (1 core, 4 CPUs)
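The figures on this slide imply an average job length and a lower bound on the speedup from distributing the work. The arithmetic below uses only the numbers quoted above; the variable names are illustrative, and the speedup is a lower bound because the elapsed time is given only as "< 26 hours".

```python
cpu_days = 25.7          # total CPU time reported
n_jobs = 8685            # search jobs reported
elapsed_hours_max = 26   # elapsed time was reported as under 26 hours

cpu_hours = cpu_days * 24                    # about 616.8 CPU hours
mean_job_minutes = cpu_hours * 60 / n_jobs   # about 4.26 minutes per job
min_speedup = cpu_hours / elapsed_hours_max  # at least about 23.7x

print(f"{cpu_hours:.1f} CPU hours, {mean_job_minutes:.2f} min/job, "
      f"speedup > {min_speedup:.1f}x")
```

A speedup of roughly 24x on 143 machines is far from linear, which is consistent with the bottleneck the slide names: the Mascot searches are limited by the license rather than by available CPUs.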

PepArML Meta-Search Engine • Access to high-performance computing resources for the proteomics community • NSF TeraGrid Community Portal • University/Institute HPC clusters • Individual lab compute resources • Contribute cycles to the community and get access to others’ cycles in return. • Centralized scheduler • Compute capacity can still be exclusive, or prioritized. • Compute client plays well with HPC grid schedulers.

Conclusions Improve sensitivity of peptide identification, using • Exhaustive peptide sequence databases • Machine-learning for combining search engine results • Meta-search tools to maximize consensus • Grid-computing for thorough search Tools & cycles available to the community... http://edwardslab.bmcb.georgetown.edu

Acknowledgements • Dr. Catherine Fenselau • University of Maryland Biochemistry • Dr. Rado Goldman • Georgetown University Medical Center • Dr. Chau-Wen Tseng & Dr. Xue Wu • University of Maryland Computer Science • Funding: NIH/NCI
