1 / 23

Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data. Gabriel Ilie 1 , Alex Zelikovsky 2 and Ion M ă ndoiu 1 1 CSE Department, University of Connecticut 2 CS Department, Georgia State University. MassCLEAVE assay for MS-based nucleic acid sequence analysis.

Télécharger la présentation

Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data Gabriel Ilie1, Alex Zelikovsky2 and Ion Măndoiu1 1CSE Department, University of Connecticut 2CS Department, Georgia State University

  2. MassCLEAVE assay for MS-based nucleic acid sequence analysis

  3. Error model • Signed relative errors assumed to follow a normal distribution with mean 0, standard deviation σ for masses and σ’ for intensities • Two types of error incurred when matching compomer c to peak of mass m and intensity i(m): • Relative mass error • Relative intensity error:

  4. Problem formulation Given: • Mass spectra MS • Reference sequence r including position of PCR primers • Maximum edit distance D • Standard deviations σ and σ’, tolerance parameter  Find: • Target sequence t flanked by PCR primers that • is within edit distance D of r, and • yields a matching of compomers of CS(t) to masses of MS with minimum total relative error

  5. Naïve Algorithm • Exhaustive search • Generate all sequences within an edit distance of D of the reference, and • Compute the minimum total relative error for matching the compomers of each of these sequences to the masses in MS. • The number of candidate sequences grows exponentially with D

  6. 3-Stage Algorithm • Identify regions of the reference sequence that are unambiguously supported by MS data • High probability to be present in the unknown target sequence • Branch-and-bound approach to fill in remaining gaps • Generates set of candidate sequences with compomers supported by MS data • Compute candidate sequences with minimum total relative error • Min-cost flow problem currently solved as linear program • With or without intensities

  7. First stage: finding strongly supported regions of the reference • Chebyshev’s inequality: • A detectable compomer c ϵ CSσ(s) is strongly matched to mass m ϵ MSσ(s) if: where ε = σ / 0.5 is set based on a user specified tolerance

  8. First stage: finding strongly supported regions of the reference • A strong match between compomer c and mass m is unambiguous if: • c has multiplicity of 1 in reference • c can be strongly matched only to m • m can be strongly matched only to c • The set M of unambiguous matches can be found efficiently by binary search

  9. First stage: finding strongly supported regions of the reference • (c1, m1), . . . , (cn, mn) = unambiguous matches for cut base σ, indexed in non-decreasing order of relative errors • We iteratively apply Chebyshev’s inequality with tolerance  to the running means of signed relative errors, • which are normally distributed with mean 0 and standard deviation σ /i0.5 • If Chebyshev’sinequality fails for index i, match(ci, mi) is removed from M

  10. First stage: finding strongly supported regions of the reference • A position in the reference sequence has strong support if • All detectable compomers overlapping it can be strongly matched, and • At least one of these matches is in M (unambiguous + not removed) • Positions in PCR primers automatically marked as having strong support

  11. Second stage: generating candidate targets by branch-and-bound • Reference regions with strong support assumed to be present in target • Gaps filled one base at a time, in left-to-right order, using branch-and-bound • Choice order: reference base, substitutions, deletion, insertions • Chebyshev test with tolerance  applied torunning means of signed relative errors of closest matches • Search pruned when test failsor more than D mutations

  12. Third stage: scoring candidates by linear programming • Objective: • Minimize total relative error • Variables: • For each c ϵ CSσ and m ϵ MSσ, xc,m is set to 1 if c is matched to m, 0 otherwise (integrality follows from total unimodularity) • Constraints: • No missing peaks: each detectable compomer c ϵ CSσ(t) must be matched to one mass in MSσ • No extraneous peaks: each mass m ϵ MSσ must be matched to at least one detectable compomer c ϵ CSσ(t)

  13. LP w/o intensities

  14. LP with intensities

  15. Simulation setup • Reference length: 100-500 bp • Reference sequences/targets • D=1: 10 random references, all sequences at edit distance 1 used as targets • D=2,3: 100 random reference-target pairs • Error free MS data: σ= σ’ = 0 • Noisy MS data: σ = 0.0001, σ’ =0-1 • Tolerance parameter:  = 0.01

  16. Precision and Recall

  17. Branch-and-bound vs. Naïve(F-measure for D=1, error free data, w/o intensities)

  18. Branch-and-bound speed-up(D=1, error free data, w/o intensities)

  19. Results on noisy data (F-measure, D=1, σ= 0.0001, w/o intensities)

  20. Effect of the number of mutations (F-measure, σ= 0.0001, w/o intensities)

  21. Do intensities help?(F-measure, σ = 0.0001, 1 substitution)

  22. Do intensities help?(F-measure, σ = 0.0001)

  23. Ongoing Work • Experiments on EPLD clone data • Branch-and-bound relaxation + penalty in LP objective to handle missing/extraneous peaks • Intensity data normalization: correct for mass and base composition effects

More Related