
Meta-Search and Result Combining



  1. Meta-Search and Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

  2. Peptide Identifications • Search engines provide an answer for every spectrum... • Can we figure out which ones to believe? • Why is this hard? • Hard to determine “good” scores • Significance estimates are unreliable • Need more IDs from weak spectra • Each search engine has its strengths... and weaknesses • Search engines give different answers

  3. Mascot Search Results

  4. Translation start-site correction • Halobacterium sp. NRC-1 • Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins • Goo et al., MCP 2003 • GdhA1 gene: Glutamate dehydrogenase A1 • Multiple significant peptide identifications • Observed start is consistent with Glimmer 3.0 prediction(s)

  5. Halobacterium sp. NRC-1 ORF: GdhA1 • K-score E-value vs PepArML @ 10% FDR • Many peptides inconsistent with annotated translation start site of NP_279651

  6. Translation start-site correction

  7. Search engine scores are inconsistent! [Figure: X!Tandem vs. Mascot score comparison]

  8. Common Algorithmic Framework – Different Results • Pre-process experimental spectra • Charge state, cleaning, binning • Filter peptide candidates • Decide which PSMs to evaluate • Score peptide-spectrum match • Fragmentation modeling, dot product • Rank peptides per spectrum • Retain statistics per spectrum • Estimate E-values • Apply empirical or theoretical model
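
The framework above can be summarized in code. A minimal, hypothetical sketch follows; the Spectrum container, peptide dictionaries, dot-product score, and E-value formula are toy stand-ins, not any engine's actual model.

```python
from dataclasses import dataclass
from math import exp

@dataclass
class Spectrum:
    scan: int
    precursor_mass: float
    peaks: dict                     # {fragment mass bin: intensity}, already cleaned/binned

def candidate_peptides(db, precursor_mass, tol=0.5):
    # filter step: keep peptides whose mass fits the precursor tolerance
    return [p for p in db if abs(p["mass"] - precursor_mass) <= tol]

def match_score(spectrum, peptide):
    # score step: toy dot-product over matched fragment bins
    return sum(spectrum.peaks.get(b, 0.0) for b in peptide["fragment_bins"])

def search_spectrum(spectrum, db):
    scored = sorted(((match_score(spectrum, p), p) for p in
                     candidate_peptides(db, spectrum.precursor_mass)),
                    key=lambda t: t[0], reverse=True)        # rank peptides per spectrum
    if not scored:
        return None
    best, second = scored[0][0], (scored[1][0] if len(scored) > 1 else 0.0)
    evalue = len(scored) * exp(-(best - second))             # toy empirical E-value model
    return {"scan": spectrum.scan, "peptide": scored[0][1]["sequence"],
            "score": best, "evalue": evalue}
```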

  9. Comparison of search engines • No single score is comprehensive • Search engines disagree • Many spectra lack confident peptide assignment [Figure: three-way Venn diagram of OMSSA, Mascot, and X!Tandem identifications; region percentages 69%, 10%, 9%, 5%, 4%, 2%, 2%]

  10. Simple approaches (Union) • Different search engines confidently identify different spectra: • Due to search space, spectral processing, scoring, significance estimation • Filter each search engine's results and union • Union of results must be more complete • But how to estimate significance for the union? • What if the results for same spectra disagree? • Need to compensate for reduced specificity • How much?
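
A small sketch of the filter-then-union idea, assuming each engine's results have already been reduced to a hypothetical {scan: (peptide, q_value)} mapping:

```python
def union_ids(results_by_engine, q_cutoff=0.01):
    union, conflicts = {}, {}
    for engine, psms in results_by_engine.items():
        for scan, (peptide, q) in psms.items():
            if q > q_cutoff:
                continue                      # keep only each engine's confident PSMs
            if scan in union and union[scan][0] != peptide:
                conflicts.setdefault(scan, set()).update({union[scan][0], peptide})
            # keep the better (lower) q-value when the same spectrum appears twice
            if scan not in union or q < union[scan][1]:
                union[scan] = (peptide, q)
    # the union is more complete, but its overall FDR is no longer the per-engine cutoff
    return union, conflicts
```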

  11. Union of filtered peptide ids [Figure: X!Tandem vs. Mascot comparison]

  12. Union of filtered peptide ids [Figure: X!Tandem vs. Mascot comparison]

  13. Union of filtered peptide ids [Figure: X!Tandem vs. Mascot comparison]

  14. Simple approaches (Intersection) • Different search engines agree on many spectra • Agreement is unexpected given differences • Filter each search engine's results and take the intersection • Intersection of results must be more significant • But how to estimate significance for the intersection? • What about the borderline spectra? • Need to compensate for reduced sensitivity • How and how much?
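
A matching sketch for filter-then-intersect, using the same hypothetical {scan: (peptide, q_value)} convention; only spectra where both engines make the same confident peptide call survive.

```python
def intersect_ids(a, b, q_cutoff=0.01):
    agreed = {}
    for scan, (pep_a, q_a) in a.items():
        if q_a > q_cutoff or scan not in b:
            continue
        pep_b, q_b = b[scan]
        if q_b <= q_cutoff and pep_a == pep_b:   # require the same peptide call
            agreed[scan] = (pep_a, max(q_a, q_b))
    return agreed    # more specific, but borderline spectra are lost
```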

  15. Intersection of filtered peptide ids [Figure: X!Tandem vs. Mascot comparison]

  16. Intersection of filtered peptide ids [Figure: X!Tandem vs. Mascot comparison]

  17. Intersection of filtered peptide ids [Figure: X!Tandem vs. Mascot comparison]

  18. Combine / Merge Results • Threshold peptide-spectrum matches from each of two search engines • PSMs agree → boost specificity • PSMs from one → boost sensitivity • PSMs disagree → ????? • Sometimes agreement is "lost" due to threshold... • How much should agreement increase our confidence? • Scores easy to "understand" • Difficult to establish statistical significance • How to generalize to more engines?
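
A sketch of the three-way bookkeeping described above, assuming each engine's thresholded PSMs are given as a hypothetical {scan: peptide} mapping:

```python
def merge_two_engines(a, b):
    merged = {"agree": {}, "single": {}, "disagree": {}}
    for scan in set(a) | set(b):
        pep_a, pep_b = a.get(scan), b.get(scan)
        if pep_a and pep_b:
            bucket = "agree" if pep_a == pep_b else "disagree"   # agree boosts specificity
            merged[bucket][scan] = (pep_a, pep_b)
        else:
            merged["single"][scan] = pep_a or pep_b              # one engine boosts sensitivity
    return merged
```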

  19. Consensus and Multi-Search • Multiple witnesses increase confidence • As long as they are independent • Example: Getting the story straight • Independent "random" hits unlikely to agree • Agreement is indication of biased sampling • Example: loaded dice • Meta-search is relatively easy • Merging and re-ranking is hard • Example: Booking a flight to Boston! • Scores and E-values are not comparable • How to choose the best answer? • Example: Best E-value favors Tandem!

  20. Search for Consensus • Running many search engines is hard! • Identifications must have every opportunity to agree: • No failed searches, matched search parameters, sequence databases, spectra • But the search engines all use: • Varying spectral file formats, different parameter specifications for mass tolerance, modifications, pre-processing for sequence databases, different charge-state handling, termini rules • Decoy searches must also use identical parameters

  21. Searching for Consensus • Initial methionine loss as tryptic peptide? • Missing charge state handling? • X!Tandem's refinement mode • Pyro-Gln, Pyro-Glu modifications? • Precursor mass tolerance (Da vs ppm) • Semi-tryptic only (no fully-tryptic mode).

  22. Configuring for Consensus • Search engine configuration can be difficult: • Correct spectral format • Search parameter files and command-line • Pre-processed sequence databases. • Must strive to ensure that each search engine is presented with the same search criteria, despite different formats, syntax, and quirks. • Search engine configuration must be automated.
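
A sketch of what automated configuration might look like: one unified parameter set translated into engine-specific settings. The engine-specific keys below are written from memory as illustrations and should be checked against each engine's documentation.

```python
UNIFIED = {
    "precursor_tol": (10.0, "ppm"),
    "fragment_tol": (0.4, "Daltons"),
    "enzyme": "Trypsin",
    "missed_cleavages": 2,
}

def tandem_params(p):
    # X!Tandem-style key/value parameter file (keys approximate)
    tol, unit = p["precursor_tol"]
    return {
        "spectrum, parent monoisotopic mass error plus": tol,
        "spectrum, parent monoisotopic mass error minus": tol,
        "spectrum, parent monoisotopic mass error units": unit,
        "scoring, maximum missed cleavage sites": p["missed_cleavages"],
    }

def mascot_params(p):
    # Mascot-style form parameters (keys approximate)
    tol, unit = p["precursor_tol"]
    return {"TOL": tol, "TOLU": unit, "CLE": p["enzyme"], "PFA": p["missed_cleavages"]}
```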

  23. Results Extraction for Consensus • Must be able to unambiguously extract peptide identifications from results • Spectrum identifiers / scan numbers • Modification identifiers • Protein accessions • How should we handle E-values vs. probabilities vs. FDR (partitioned)? • Cannot rely on these to be comparable • Must use consistent, external significance calibration
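
One way to make extraction concrete is a shared PSM record that every engine's output is normalized into; the fields below are illustrative, not a published schema.

```python
from dataclasses import dataclass

@dataclass
class PSM:
    scan: int                  # normalized spectrum identifier / scan number
    peptide: str               # bare sequence, e.g. "ELVISLIVESK"
    mods: tuple = ()           # ((position, "Oxidation"), ...) in one shared vocabulary
    proteins: tuple = ()       # accessions mapped into one namespace
    engine: str = ""
    raw_score: float = 0.0     # engine-native score: NOT comparable across engines
    qvalue: float = 1.0        # filled in later by consistent, external significance calibration
```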

  24. Search Engine Independent FDR Estimation • Comparing search engines is difficult due to different FDR estimation techniques • Implicit assumption: Spectra scores can be thresholded • Competitive vs Global • Competitive controls some spectral variation • Reversed vs Shuffled Decoy Sequence • Reversed models target redundancy accurately • Charge-state partition or Unified • Mitigates effect of peptide length dependent scores • What about peptide property partitions?
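
A minimal global target-decoy sketch in the spirit of this slide; competitive scoring, reversed vs. shuffled decoy construction, and charge-state or peptide-property partitions would layer on top of this basic calculation.

```python
def q_values(psms):
    """psms: list of (score, is_decoy) pairs; returns (score, is_decoy, q) sorted by score."""
    ranked = sorted(psms, key=lambda t: t[0], reverse=True)
    rows, decoys = [], 0
    for i, (score, is_decoy) in enumerate(ranked, start=1):
        decoys += int(is_decoy)
        targets = i - decoys
        rows.append([score, is_decoy, decoys / max(targets, 1)])  # decoys estimate false targets
    best = 1.0
    for row in reversed(rows):            # q-value: best achievable FDR at or below this score
        best = min(best, row[2])
        row[2] = best
    return [tuple(r) for r in rows]
```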

  25. Search Execution for Consensus • Running many search engines takes time • 7 × 3 searches of the same spectra! • Some search engines require licenses or specific operating systems • How to use grid/cloud computing effectively? • Cannot assume a shared file-system • Search engines may crash or be preempted • Machine may "disappear" • Machine may consistently fail searches
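
A sketch of fault-tolerant dispatch under these constraints: jobs are requeued a bounded number of times when a worker crashes, is preempted, or disappears. The run_one callable is a placeholder for the real upload/search/retrieve step.

```python
import queue

def run_all(search_jobs, run_one, max_attempts=3):
    pending = queue.Queue()
    for job in search_jobs:
        pending.put((job, 1))
    finished, failed = [], []
    while not pending.empty():
        job, attempt = pending.get()
        try:
            finished.append(run_one(job))          # submit spectra, run engine, fetch results
        except Exception:
            if attempt < max_attempts:
                pending.put((job, attempt + 1))    # requeue on crash / preemption
            else:
                failed.append(job)                 # a machine may consistently fail this search
    return finished, failed
```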

  26. Combining Multi-Search Results • Treat search engines as black-boxes • Generate PSMs + scores, features • Apply machine learning / statistical modeling to results • Use multiple match metrics • Combine/refine using multiple search engines • Agreement suggests correctness

  27. Machine Learning / Statistical Modeling • Use of multiple metrics of PSM quality: • Precursor delta, trypsin digest features, etc. • Often requires "training" with examples • Different examples will change the result • Generalization is always the question • Scores can be hard to "understand" • Difficult to establish statistical significance • e.g. PeptideProphet/iProphet • Weighted linear combination of features • Number of sibling searches
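
A sketch of the weighted-linear-combination idea (in the spirit of a PeptideProphet-style discriminant, not its actual code); the feature names and weights are illustrative and would normally be fit on the data.

```python
FEATURES = ["engine_score", "precursor_delta_ppm", "missed_cleavages", "sibling_engines_agreeing"]

def discriminant(psm_features, weights, intercept=0.0):
    # weighted linear combination of PSM quality metrics
    return intercept + sum(w * psm_features[name] for name, w in zip(FEATURES, weights))

# Example PSM; in practice the weights are trained, e.g. by an EM procedure.
example = {"engine_score": 3.2, "precursor_delta_ppm": -1.5,
           "missed_cleavages": 0, "sibling_engines_agreeing": 2}
print(discriminant(example, weights=[1.0, -0.05, -0.3, 0.8]))
```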

  28. Available Tools • PeptideProphet/iProphet • Part of the trans-proteomic-pipeline suite • Scaffold • Commercial reimplementation of PP/iP • PepArML • Publicly available from the Edwards lab • Lots of in-house stuff… • Result combining is mentioned in talks and many papers, but without public tools

  29. Agreement score (Brian Searle) [Diagram: for each spectrum, get the SEQUEST, Mascot, and X!Tandem identifications, each with its own probability (e.g., p=76%, p=81%, p=56%)] Using the probabilities given by each search engine and the probability of them agreeing, a better peptide ID is made
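
A sketch of the flavor of this combination, assuming (naively) that the engines err independently, which later slides point out is not really true. The probabilities are the ones shown on the slide, used purely as an illustration; this is not Scaffold's actual formula.

```python
def combined_probability(probs):
    """probs: per-engine probabilities that the SAME peptide is correct."""
    p_all_wrong = 1.0
    for p in probs:
        p_all_wrong *= (1.0 - p)          # all engines would have to be wrong together
    return 1.0 - p_all_wrong

print(combined_probability([0.76, 0.81, 0.56]))   # ~0.98, higher than any single engine
```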

  30. PepArML Strategy • Meta-Search for Multi-Search: • Automatic configuration of searches • Automatic preprocessing of sequence databases • Automatic spectral reformatting • Automatic execution of search on local or remote computing resources (AWS/grid/NFS). • Result Combining: • Decoy-based FDR significance estimation • Unsupervised, model-free, machine-learning

  31. Peptide Identification Meta-Search • Simple unified search interface for: Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT+MSGF • Automatic decoy searches • Automatic spectrum file "chunking" • Automatic scheduling: Serial, Multi-Processor, Cluster, Grid, Cloud

  32. Grid-Enabled Peptide Identification Meta-Search [Diagram: a single, simple search request goes to the Edwards Lab scheduler (80+ CPUs), which dispatches work over secure communication to heterogeneous compute resources (Amazon Web Services, university cluster); scales easily to 250+ simultaneous searches]

  33. PepArML Combiner • Peptide identification arbiter by machine learning • Unifies these ideas within a model-free, combining machine learning framework • Unsupervised training procedure

  34. PepArML Overview [Diagram: PSMs from X!Tandem, Mascot, OMSSA, and other engines pass through feature extraction into the PepArML combiner]

  35. Dataset Construction [Diagram: per-spectrum feature vectors from X!Tandem, Mascot, and OMSSA assembled into a training matrix with true/false (T/F) labels]

  36. Voting Heuristic Combiner • Choose PSM with most votes • Break ties using FDR • Select PSM with min. FDR of tied votes • How to apply this to a decoy database? • Lots of possibilities – all imperfect • Now using: 100*#votes – min. decoy hits
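
A sketch of this voting heuristic for a single spectrum; the input format is hypothetical.

```python
from collections import defaultdict

def vote(psms_by_engine):
    """psms_by_engine: {engine: (peptide, fdr, decoy_hits)} for one spectrum."""
    votes = defaultdict(list)
    for engine, (peptide, fdr, decoy_hits) in psms_by_engine.items():
        votes[peptide].append((fdr, decoy_hits))
    def rank(peptide):
        entries = votes[peptide]
        return 100 * len(entries) - min(d for _, d in entries)   # 100*#votes - min. decoy hits
    # highest rank wins; ties broken by the minimum FDR among the tied votes
    return max(votes, key=lambda pep: (rank(pep), -min(f for f, _ in votes[pep])))
```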

  37. Supervised Learning

  38. Search Engine Info. Gain

  39. Precursor & Digest Info. Gain

  40. Retention Time & Proteotypic Peptide Properties Info. Gain

  41. Application to Real Data • How well do these models generalize? • Different instruments • Spectral characteristics change scores • Search parameters • Different parameters change score values • Supervised learning requires • (Synthetic) experimental data from every instrument • Search results from available search engines • Training/models for all parameters × search-engine sets × instruments

  42. Model Generalization

  43. Unsupervised Learning

  44. Unsupervised Learning Performance

  45. Unsupervised Learning Convergence

  46. PepArML Performance [Figure: results on LCQ, QSTAR, and LTQ-FT spectra from the Standard Protein Mix Database (18 standard proteins, Mix 1)]

  47. Conclusions • Combining search results from multiple engines can be very powerful • Boost both sensitivity and specificity • Running multiple search engines is hard • Statistical significance is hard • Use empirical FDR estimates... but be careful, there are lots of subtleties • Consensus is powerful, but fragile • Search engine quirks can destroy it • "Witnesses" are not independent
