Meta-Search and Result Combining
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Peptide Identifications
• Search engines provide an answer for every spectrum...
• Can we figure out which ones to believe?
• Why is this hard?
  • Hard to determine "good" scores
  • Significance estimates are unreliable
  • Need more IDs from weak spectra
• Each search engine has its strengths... and weaknesses
• Search engines give different answers
Translation start-site correction
• Halobacterium sp. NRC-1
  • Extremely halophilic Archaeon; insoluble membrane and soluble cytoplasmic proteins
  • Goo, et al. MCP 2003.
• GdhA1 gene:
  • Glutamate dehydrogenase A1
  • Multiple significant peptide identifications
  • Observed start is consistent with Glimmer 3.0 prediction(s)
Halobacterium sp. NRC-1 ORF: GdhA1
• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651
Search engine scores are inconsistent!
[Scatter plot: X!Tandem vs. Mascot peptide-spectrum match scores]
Common Algorithmic Framework – Different Results
• Pre-process experimental spectra
  • Charge state, cleaning, binning
• Filter peptide candidates
  • Decide which PSMs to evaluate
• Score peptide-spectrum match
  • Fragmentation modeling, dot product
• Rank peptides per spectrum
  • Retain statistics per spectrum
• Estimate E-values
  • Apply empirical or theoretical model
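To make the shared framework concrete, here is a small self-contained toy in Python. The residue masses, binning, shared-peak scoring, and "E-value" are deliberately simplified illustrations of each stage, not any real engine's implementation.

```python
# Toy version of the generic pipeline: preprocess -> filter candidates ->
# score PSMs -> rank -> crude empirical significance. Illustration only.

AA = {"G": 57.021, "A": 71.037, "S": 87.032, "P": 97.053, "V": 99.068,
      "T": 101.048, "L": 113.084, "N": 114.043, "D": 115.027, "K": 128.095,
      "E": 129.043, "R": 156.101}
WATER, PROTON = 18.011, 1.007

def peptide_mass(pep):
    return sum(AA[a] for a in pep) + WATER

def preprocess(peaks, bin_width=1.0):
    """Bin the spectrum: keep the most intense peak per m/z bin."""
    binned = {}
    for mz, inten in peaks:
        b = int(mz / bin_width)
        binned[b] = max(binned.get(b, 0.0), inten)
    return binned

def b_y_ions(pep):
    """Singly charged b- and y-ion m/z values."""
    ions = []
    for i in range(1, len(pep)):
        ions.append(sum(AA[a] for a in pep[:i]) + PROTON)           # b-ion
        ions.append(sum(AA[a] for a in pep[i:]) + WATER + PROTON)   # y-ion
    return ions

def score_psm(binned, pep, bin_width=1.0):
    """Intensity-weighted shared-peak count (a crude dot product)."""
    return sum(binned.get(int(mz / bin_width), 0.0) for mz in b_y_ions(pep))

def identify(spectrum_peaks, precursor_mass, candidates, tol=0.5):
    binned = preprocess(spectrum_peaks)
    # Filter peptide candidates by precursor mass.
    pool = [p for p in candidates if abs(peptide_mass(p) - precursor_mass) <= tol]
    scored = sorted(((score_psm(binned, p), p) for p in pool), reverse=True)
    # Crude empirical "E-value": how many candidates score at least as well as the best.
    evalue = sum(1 for s, _ in scored if s >= scored[0][0]) if scored else None
    return scored[:1], evalue

spectrum = [(147.11, 30.0), (261.16, 12.0), (376.19, 8.0)]
print(identify(spectrum, peptide_mass("PVK"), ["PVK", "GAR", "LVGK"]))
```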
Comparison of search engines
• No single score is comprehensive
• Search engines disagree
• Many spectra lack confident peptide assignment
[Venn diagram of confident identifications: OMSSA, Mascot, X!Tandem; region percentages 10%, 4%, 2%, 69%, 9%, 5%, 2%]
Simple approaches (Union)
• Different search engines confidently identify different spectra:
  • Due to search space, spectral processing, scoring, significance estimation
• Filter each search engine's results and take the union
  • Union of results must be more complete
• But how to estimate significance for the union?
  • What if the results for the same spectra disagree?
  • Need to compensate for reduced specificity
  • How much?
Union of filtered peptide IDs
[Figure series: X!Tandem vs. Mascot identifications]
Simple approaches (Intersection)
• Different search engines agree on many spectra
  • Agreement is unexpected given the differences
• Filter each search engine's results and take the intersection
  • Intersection of results must be more significant
• But how to estimate significance for the intersection?
  • What about the borderline spectra?
  • Need to compensate for reduced sensitivity
  • How, and how much?
(A minimal sketch of the union and intersection combinations follows below.)
Intersection of filtered peptide IDs
[Figure series: X!Tandem vs. Mascot identifications]
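A minimal sketch of the union and intersection combinations discussed on the last two slides. It assumes each engine's filtered results are a dict mapping spectrum ID to (peptide, score); the input format and tie-handling are illustrative assumptions, not a published procedure.

```python
# Combine per-spectrum peptide IDs from several engines' already-filtered results.
def combine(filtered_results, mode="union"):
    spectra = set()
    for res in filtered_results:
        spectra |= set(res)
    combined = {}
    for spec in spectra:
        peptides = {res[spec][0] for res in filtered_results if spec in res}
        if len(peptides) != 1:
            continue  # engines disagree; a real union would need a tie-break rule here
        n_engines = sum(spec in res for res in filtered_results)
        if mode == "union" or n_engines == len(filtered_results):
            combined[spec] = peptides.pop()
    return combined

# Example with two hypothetical engines:
tandem = {"scan1": ("PEPTIDEK", 0.01), "scan2": ("LVGK", 0.02)}
mascot = {"scan1": ("PEPTIDEK", 30.0), "scan3": ("AGR", 25.0)}
print(combine([tandem, mascot], "union"))         # scan1, scan2, scan3
print(combine([tandem, mascot], "intersection"))  # scan1 only
```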
Combine / Merge Results
• Threshold peptide-spectrum matches from each of two search engines
  • PSMs agree → boost specificity
  • PSMs from one engine only → boost sensitivity
  • PSMs disagree → ?????
• Sometimes agreement is "lost" due to thresholding...
• How much should agreement increase our confidence?
• Scores are easy to "understand"
  • Difficult to establish statistical significance
• How to generalize to more engines?
Consensus and Multi-Search
• Multiple witnesses increase confidence
  • As long as they are independent
  • Example: getting the story straight
• Independent "random" hits are unlikely to agree
  • Agreement is an indication of biased sampling
  • Example: loaded dice
• Meta-search is relatively easy
  • Merging and re-ranking is hard
  • Example: booking a flight to Boston!
• Scores and E-values are not comparable
  • How to choose the best answer?
  • Example: the best E-value favors Tandem!
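To make the "independent random hits are unlikely to agree" point concrete, here is a back-of-the-envelope calculation; the candidate-pool size below is an illustrative assumption, not a number from the talk.

```python
# If two engines each pick an incorrect peptide essentially at random from the
# same candidate pool, the chance they pick the SAME wrong peptide is roughly 1/N.
n_candidates = 5000          # hypothetical peptides within precursor tolerance
p_chance_agreement = 1.0 / n_candidates
print(f"P(two independent wrong answers agree) ~= {p_chance_agreement:.1e}")
# ~2e-04, so agreement between engines is strong evidence the PSM is correct,
# provided the engines' errors really are (approximately) independent.
```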
Search for Consensus
• Running many search engines is hard!
• Identifications must have every opportunity to agree:
  • No failed searches; matched search parameters, sequence databases, spectra
• But the search engines all use:
  • Varying spectral file formats
  • Different parameter specifications for mass tolerance and modifications
  • Different pre-processing for sequence databases
  • Different charge-state handling and termini rules
• Decoy searches must also use identical parameters
Searching for Consensus
• Initial methionine loss as a tryptic peptide?
• Missing charge-state handling?
• X!Tandem's refinement mode
• Pyro-Gln, Pyro-Glu modifications?
• Precursor mass tolerance (Da vs ppm)
• Semi-tryptic only (no fully-tryptic mode)
Configuring for Consensus
• Search engine configuration can be difficult:
  • Correct spectral format
  • Search parameter files and command-line
  • Pre-processed sequence databases
• Must strive to ensure that each search engine is presented with the same search criteria, despite different formats, syntax, and quirks.
• Search engine configuration must be automated.
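A hypothetical sketch of what automated configuration looks like: one abstract search request is translated into each engine's parameter vocabulary. The per-engine key names below are invented placeholders for illustration; they are not the real parameter names of Mascot, X!Tandem, or any other engine.

```python
# One abstract search request, translated into two made-up engine vocabularies.
request = {
    "precursor_tol": (10.0, "ppm"),
    "fragment_tol": (0.5, "Da"),
    "enzyme": "trypsin",
    "missed_cleavages": 2,
    "fixed_mods": ["Carbamidomethyl (C)"],
    "variable_mods": ["Oxidation (M)"],
}

def to_engine_a(req):
    # Placeholder mapping for "engine A"; real engines differ in names, units, syntax.
    value, unit = req["precursor_tol"]
    return {"parent_error": value, "parent_error_units": unit,
            "cleavage": req["enzyme"], "max_missed": req["missed_cleavages"]}

def to_engine_b(req):
    # Another placeholder mapping; note the different key names and structure.
    value, unit = req["precursor_tol"]
    return {"PRECURSOR_TOL": f"{value}{unit}",
            "ENZYME": req["enzyme"].upper(),
            "MODS": req["fixed_mods"] + req["variable_mods"]}

print(to_engine_a(request))
print(to_engine_b(request))
```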
Results Extraction for Consensus
• Must be able to unambiguously extract peptide identifications from results
  • Spectrum identifiers / scan numbers
  • Modification identifiers
  • Protein accessions
• How should we handle E-values vs. probabilities vs. FDR (partitioned)?
  • Cannot rely on these to be comparable
  • Must use consistent, external significance calibration
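A minimal sketch of a unified PSM record that each engine's output could be parsed into before combining. The field names are illustrative assumptions, not a published schema.

```python
from dataclasses import dataclass

@dataclass
class PSM:
    spectrum_id: str            # scan number / spectrum identifier
    peptide: str                # plain peptide sequence
    mods: tuple = ()            # (position, modification-name) pairs
    protein_accessions: tuple = ()
    engine: str = ""
    raw_score: float = 0.0      # engine-native score; NOT comparable across engines
    decoy: bool = False         # needed for external FDR calibration

psm = PSM("controllerType=0 scan=1204", "PEPTIDEK",
          mods=((3, "Oxidation"),), protein_accessions=("NP_279651",),
          engine="X!Tandem", raw_score=42.7)
print(psm)
```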
Search Engine Independent FDR Estimation
• Comparing search engines is difficult due to different FDR estimation techniques
• Implicit assumption: spectrum scores can be thresholded
• Competitive vs. global
  • Competitive controls for some spectral variation
• Reversed vs. shuffled decoy sequences
  • Reversed decoys model target redundancy accurately
• Charge-state partitioned or unified
  • Mitigates the effect of peptide-length-dependent scores
• What about peptide-property partitions?
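A minimal sketch of the simple global target-decoy FDR estimate, assuming each PSM carries a score (higher is better) and a flag marking whether its peptide comes from the decoy database. This is the textbook estimate, not PepArML's exact procedure or any particular partitioning scheme.

```python
def fdr_threshold(psms, max_fdr=0.10):
    """Return the score threshold giving an estimated FDR <= max_fdr.

    psms: iterable of (score, is_decoy) pairs.
    FDR at a cutoff is estimated as (#decoys passing) / (#targets passing).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        decoys += is_decoy
        targets += not is_decoy
        if targets and decoys / targets <= max_fdr:
            best = score  # lowest score so far that still satisfies the FDR bound
    return best

psms = [(9.1, False), (8.7, False), (7.2, True), (6.9, False), (5.0, True)]
print(fdr_threshold(psms, max_fdr=0.5))   # 6.9 for this tiny example
```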
Search Execution for Consensus
• Running many search engines takes time
  • 7 × 3 searches of the same spectra!
• Some search engines require licenses or specific operating systems
• How to use grid/cloud computing effectively?
  • Cannot assume a shared file-system
  • Search engines may crash or be preempted
  • A machine may "disappear"
  • A machine may consistently fail searches
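A hypothetical sketch of fault-tolerant dispatch for remote searches: retry failed jobs elsewhere and stop sending work to machines that fail repeatedly. The names (Worker, dispatch) and the simulated failure rates are illustrations, not PepArML's actual scheduler API.

```python
import random

class Worker:
    def __init__(self, name, fail_rate):
        self.name, self.fail_rate, self.failures = name, fail_rate, 0
    def run(self, job):
        return random.random() > self.fail_rate   # True = search completed

def dispatch(jobs, workers, max_failures=3, max_attempts=5):
    results = {}
    for job in jobs:
        for attempt in range(max_attempts):
            usable = [w for w in workers if w.failures < max_failures]
            if not usable:
                raise RuntimeError("no healthy workers left")
            w = random.choice(usable)
            if w.run(job):
                results[job] = w.name
                break
            w.failures += 1                        # machine consistently failing?
        else:
            results[job] = None                    # give up on this search chunk
    return results

workers = [Worker("cluster-node", 0.1), Worker("cloud-node", 0.3), Worker("flaky-node", 0.9)]
print(dispatch([f"chunk{i}" for i in range(5)], workers))
```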
Combining Multi-Search Results
• Treat search engines as black boxes
  • Generate PSMs + scores, features
• Apply machine learning / statistical modeling to results
  • Use multiple match metrics
  • Combine/refine using multiple search engines
  • Agreement suggests correctness
Machine Learning / Statistical Modeling
• Use of multiple metrics of PSM quality:
  • Precursor delta, trypsin digest features, etc.
• Often requires "training" with examples
  • Different examples will change the result
  • Generalization is always the question
• Scores can be hard to "understand"
  • Difficult to establish statistical significance
• e.g. PeptideProphet / iProphet
  • Weighted linear combination of features
  • Number of sibling searches
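A minimal sketch of a "weighted linear combination of features" discriminant in the spirit of PeptideProphet-style modeling. The feature names, weights, and logistic squashing are made up for illustration; they are not PeptideProphet's trained parameters or model.

```python
import math

# Hypothetical learned weights over PSM-quality features (illustration only).
WEIGHTS = {"engine_score": 1.2, "precursor_delta_ppm": -0.08,
           "missed_cleavages": -0.5, "n_agreeing_engines": 0.9}
BIAS = -3.0

def discriminant(features):
    return BIAS + sum(WEIGHTS[k] * v for k, v in features.items())

def prob_correct(features):
    """Squash the linear discriminant through a logistic to get a pseudo-probability."""
    return 1.0 / (1.0 + math.exp(-discriminant(features)))

psm = {"engine_score": 4.5, "precursor_delta_ppm": 2.0,
       "missed_cleavages": 0, "n_agreeing_engines": 2}
print(round(prob_correct(psm), 3))   # ~0.98 for this made-up PSM
```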
Available Tools
• PeptideProphet / iProphet
  • Part of the Trans-Proteomic Pipeline suite
• Scaffold
  • Commercial reimplementation of PeptideProphet/iProphet
• PepArML
  • Publicly available from the Edwards lab
• Lots of in-house stuff...
  • Result combining is mentioned in talks and lots of papers, etc., but with no public tools
Agreement score (Brian Searle)
[Diagram: for each spectrum, get the SEQUEST, Mascot, and X!Tandem identifications (e.g. p = 76%, 81%, 56%); using the probabilities given by each search engine and the probability of them agreeing, a better peptide ID is made]
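As an illustration only, and not Searle's published model, here is one naive way agreement could raise a combined probability if the engines' error modes were treated as independent; the probabilities are the example values from the diagram.

```python
# Illustration only -- NOT Scaffold's actual agreement-score formula. If several
# engines independently assign the SAME peptide with probabilities p_i of being
# correct, a naive combination says they are all wrong only if each is wrong:
# P(correct) = 1 - prod(1 - p_i). Real models must also account for how likely
# independent wrong answers are to coincide.
def naive_agreement_probability(probs):
    wrong = 1.0
    for p in probs:
        wrong *= (1.0 - p)
    return 1.0 - wrong

print(round(naive_agreement_probability([0.76, 0.81, 0.56]), 3))  # ~0.98
```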
PepArML Strategy
• Meta-search for multi-search:
  • Automatic configuration of searches
  • Automatic preprocessing of sequence databases
  • Automatic spectral reformatting
  • Automatic execution of searches on local or remote computing resources (AWS/grid/NFS)
• Result combining:
  • Decoy-based FDR significance estimation
  • Unsupervised, model-free machine learning
Peptide Identification Meta-Search
• Simple unified search interface for: Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT+MSGF
• Automatic decoy searches
• Automatic spectrum file "chunking"
• Automatic scheduling
  • Serial, multi-processor, cluster, grid, cloud
Grid-Enabled Peptide Identification Meta-Search
[Architecture diagram: a single, simple search request goes to the Edwards Lab scheduler & 80+ CPUs; heterogeneous compute resources (Amazon Web Services, university cluster) with secure communication; scales easily to 250+ simultaneous searches]
PepArML Combiner
• Peptide identification arbiter by machine learning
• Unifies these ideas within a model-free, combining machine-learning framework
• Unsupervised training procedure
PepArML Overview
[Diagram: X!Tandem, Mascot, OMSSA, and other engines' results feed feature extraction, which feeds the PepArML combiner]
Dataset Construction
[Diagram: X!Tandem, Mascot, and OMSSA results labeled true (T) / false (F) to build the training dataset]
Voting Heuristic Combiner
• Choose the PSM with the most votes
• Break ties using FDR
  • Select the PSM with the minimum FDR among tied votes
• How to apply this to a decoy database?
  • Lots of possibilities – all imperfect
  • Now using: 100 × #votes – min. decoy hits
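A minimal sketch of the voting heuristic above, using the slide's "100 × #votes − min. decoy hits" ranking. The input format, and the reading of "decoy hits" as the number of decoy PSMs scoring at least as well for that engine, are assumptions for illustration.

```python
from collections import defaultdict

def vote(per_engine_psms):
    """per_engine_psms: one spectrum's (peptide, decoy_hits) pair from each engine."""
    votes = defaultdict(int)
    min_decoy = defaultdict(lambda: float("inf"))
    for peptide, decoy_hits in per_engine_psms:
        votes[peptide] += 1
        min_decoy[peptide] = min(min_decoy[peptide], decoy_hits)
    # Higher is better: each vote is worth 100, minus the best (smallest) decoy count.
    return max(votes, key=lambda p: 100 * votes[p] - min_decoy[p])

per_engine = [("PEPTIDEK", 3), ("PEPTIDEK", 1), ("LVGK", 0)]
print(vote(per_engine))   # PEPTIDEK: two votes beat LVGK's lower decoy count
```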
Application to Real Data
• How well do these models generalize?
  • Different instruments
    • Spectral characteristics change scores
  • Search parameters
    • Different parameters change score values
• Supervised learning requires:
  • (Synthetic) experimental data from every instrument
  • Search results from available search engines
  • Training/models for all parameters × search-engine sets × instruments
PepArML Performance
[Plots: LCQ, QSTAR, LTQ-FT instruments; Standard Protein Mix Database, 18 standard proteins – Mix 1]
Conclusions
• Combining search results from multiple engines can be very powerful
  • Boosts both sensitivity and specificity
• Running multiple search engines is hard
• Statistical significance is hard
  • Use empirical FDR estimates... but be careful... lots of subtleties
• Consensus is powerful, but fragile
  • Search engine quirks can destroy it
  • "Witnesses" are not independent