
Model-based species identification using DNA barcodes

Model-based species identification using DNA barcodes. Bogdan Paşaniuc. CSE Department, University of Connecticut. Joint work with Ion Măndoiu and Sotirios Kentros. Outline. Existing approaches to species identification Proposed statistical model based methods Experimental Results


Presentation Transcript


  1. Model-based species identification using DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Ion Măndoiu and Sotirios Kentros

  2. Outline • Existing approaches to species identification • Proposed statistical model based methods • Experimental Results • Ongoing Work and Conclusions

  3. Background on DNA barcoding • Recently proposed tool for species identification • Use short DNA region as “fingerprint” for the species • Region of choice: cytochrome c oxidase subunit 1 mitochondrial gene ("COI", 648 base pairs long). • Key assumption: inter-species variability higher than intra-species variability

  4. Species identification problem • Given: • Database DB containing barcodes from known species • New barcode x • Find: • a high-confidence assignment to a species in the DB • UNKNOWN, if confidence is not high enough • Use additional evidence/methods to resolve UNKNOWN assignments and possibly discover new species

  5. Existing approaches and limitations • Neighbor Joining tree for new + known barcodes [Meyer&Paulay05] • One barcode per species • Runtime does not scale well with #species (quadratic or worse) • Likelihood ratio test for species membership using MCMC [Matz&Nielsen06] • Impractical runtime even for moderate #species • Distance-based [BOLD-IDS, TaxI (Steinke et al.05)] • Unclear statistical significance

  6. BOLD • BOLD: The Barcode of Life Data Systems [Ratnasingham&Hebert07] • http://www.barcodinglife.org • Currently: 28,129 species, 251,429 barcodes • Identification System: BOLD-IDS • Distance-based (NJ tree for visualization) • Employs a threshold (less than 1% divergence) to get a tight match to a barcode in the DB

  7. BOLD-IDS • [Ekrem et al.07]: “…identifications by the BOLD facility must be cautiously evaluated as the system at present may return high probabilities of placements that obviously are erroneous”

  8. Outline • Existing approaches to species identification • Proposed statistical model based methods • Experimental Results • Ongoing Work and Conclusions

  9. Bayesian approach to species identification • Assign barcode x=x1x2x3…xn to species SPi that maximizes P(SPi|x) over all species SPi • P(SPi|x) computed using Bayes’ theorem: P(SP|x) = P(x|SP)*P(SP)/P(x) • Uniform prior P(SP) • P(x) constant for fixed x • Need model for P(x|SP) • We explored three scalable models: position weight matrices, Markov chains, hidden Markov models • Similar to models used successfully in other sequence analysis problems such as DNA motif finding and protein families
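
The assignment rule on this slide can be sketched in a few lines. This is a minimal illustration, assuming the per-species log-likelihoods log P(x|SP) have already been computed by one of the models; the function names `posterior` and `assign` are hypothetical and do not come from the presentation.

```python
import math

def posterior(log_likelihoods):
    """Turn {species: log P(x|SP)} into {species: P(SP|x)} under a uniform prior."""
    # With a uniform prior, P(SP) cancels and P(SP|x) is the normalized likelihood.
    # Subtracting the maximum first avoids underflow when exponentiating.
    top = max(log_likelihoods.values())
    weights = {sp: math.exp(ll - top) for sp, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {sp: w / total for sp, w in weights.items()}

def assign(log_likelihoods):
    """Return the species maximizing P(SP|x), together with that posterior."""
    post = posterior(log_likelihoods)
    best = max(post, key=post.get)
    return best, post[best]
```

Because P(x) is constant for a fixed x, the species that maximizes the posterior is also the one that maximizes P(x|SP); the normalization only matters when the posterior value itself is reported.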

  10. Positional weight matrix (PWM) • Assumption: independence of loci • P(x|SP) = P(x1|SP)*P(x2|SP)*…*P(xn|SP) • For each locus, P(xi|SP) is estimated as the probability of seeing each nucleotide at that locus in DB sequences from species SP
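
A minimal sketch of the PWM estimate and likelihood described on this slide. It assumes aligned barcodes of equal length over the alphabet A, C, G, T, and it adds a pseudocount so that unseen nucleotides do not get probability zero; the pseudocount and the function names are assumptions of this sketch, not details given in the presentation.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_pwm(barcodes, pseudocount=1.0):
    """Estimate P(x_i | SP) for every locus i from one species' barcodes."""
    pwm = []
    for i in range(len(barcodes[0])):
        counts = Counter(seq[i] for seq in barcodes)
        # Only A/C/G/T are counted; the pseudocount is added once per base.
        total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm

def pwm_log_likelihood(pwm, x):
    """log P(x | SP) as a sum of independent per-locus log-probabilities."""
    return sum(math.log(column[base]) for column, base in zip(pwm, x))
```

Working in log space turns the product over loci into a sum, which avoids numerical underflow on barcodes that are hundreds of bases long.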

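  11. Inhomogeneous Markov Chain (IMC) • Takes into account dependencies between consecutive loci • [Diagram: a start state followed by one state per nucleotide (A, C, G, T) at each locus (locus 1, locus 2, locus 3, locus 4, …), with transitions between the states of consecutive loci]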
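
One way to realize the chain on this slide is a separate transition table for every pair of consecutive loci. The sketch below assumes aligned, equal-length barcodes over A, C, G, T and uses pseudocounts for smoothing; the smoothing and all names are assumptions of this sketch rather than the authors' implementation.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def _normalize(counts, pseudocount):
    """Convert base counts into a smoothed probability distribution."""
    total = sum(counts.get(b, 0) for b in ALPHABET) + pseudocount * len(ALPHABET)
    return {b: (counts.get(b, 0) + pseudocount) / total for b in ALPHABET}

def train_imc(barcodes, pseudocount=1.0):
    """Return (start, trans): start[b] = P(x_1 = b) and
    trans[i-1][a][b] = P(x_{i+1} = b | x_i = a) for loci i = 1..n-1."""
    start = _normalize(Counter(s[0] for s in barcodes), pseudocount)
    trans = []
    for i in range(1, len(barcodes[0])):
        pairs = Counter((s[i - 1], s[i]) for s in barcodes)
        trans.append({a: _normalize({b: pairs[(a, b)] for b in ALPHABET}, pseudocount)
                      for a in ALPHABET})
    return start, trans

def imc_log_likelihood(model, x):
    """log P(x | SP): start probability times locus-specific transitions."""
    start, trans = model
    ll = math.log(start[x[0]])
    for i in range(1, len(x)):
        ll += math.log(trans[i - 1][x[i - 1]][x[i]])
    return ll
```

The chain is "inhomogeneous" because `trans` holds a different table for each locus instead of one table shared across the whole sequence.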

  12. Hidden Markov Model (HMM) • Same structure as the IMC • Each state emits the associated DNA base with high probability, but can also emit the other bases with probability equal to the mutation rate • Barcode x generated along path p with probability equal to the product of emissions & transitions along p • P(x|HMM) = sum of probabilities over all paths • Efficiently computed by the forward algorithm
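
The forward algorithm mentioned on this slide can be sketched as follows. The sketch assumes that `start` and `trans` have the same layout as a locus-specific Markov chain (`start[a]`, and `trans[i][a][b]` for the step from locus i+1 to i+2, 1-based), that a state emits its own base with probability 1 - mu and each other base with probability mu/3, and that mu > 0. The even split of the mutation rate and all names are assumptions of this sketch.

```python
import math

ALPHABET = "ACGT"

def hmm_log_likelihood(start, trans, x, mu=0.01):
    """log P(x | HMM), summed over all state paths with the scaled forward algorithm."""
    def emit(state, base):
        # Assumption: total mutation rate mu is split evenly over the 3 other bases.
        return 1.0 - mu if state == base else mu / 3.0

    # f[a] = P(prefix observed so far, current state = a), rescaled at every step.
    f = {a: start[a] * emit(a, x[0]) for a in ALPHABET}
    log_like = 0.0
    for i in range(1, len(x)):
        scale = sum(f.values())
        log_like += math.log(scale)  # accumulate the factor removed by rescaling
        f = {b: emit(b, x[i]) * sum(f[a] / scale * trans[i - 1][a][b] for a in ALPHABET)
             for b in ALPHABET}
    return log_like + math.log(sum(f.values()))
```

Rescaling the forward vector at each locus keeps the values in floating-point range; without it, the raw probability of a barcode several hundred bases long would underflow to zero.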

  13. Accuracy on BOLD dataset • 37 species with at least 100 barcodes from BOLD • 10-50% barcodes removed and used for test • IMC yields better accuracy in all cases

  14. Score normalization • DB barcodes have non-uniform lengths and cover different regions of the COI gene • Membership probabilities not always comparable • Normalization scheme: • Species models constructed only over positions covered in DB • Scores normalized using background IMC constructed from all sequences in DB

  15. Computing the confidence of assignment • x assigned to species SP with score s • p-value: probability that a barcode generated under background model Ḿ has a score s’ ≥ s • Methods for p-value estimation: • Random sampling • Generate random sequences and count how many exceed the score • Exact computation (for PWMs): • Dynamic programming [Rahmann03] • Branch and bound [Zhang et al. 07] • Shifted FFTs [Nagarajan et al. 05]
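
The random-sampling estimate can be sketched generically. Here `score_fn` and `sample_fn` are hypothetical callables standing in for the species model's scoring function and a sampler for the background model; the add-one correction, which keeps the estimate from being exactly zero, is also an assumption of this sketch.

```python
import random

def sampled_p_value(score_fn, sample_fn, observed, trials=10000, seed=0):
    """Estimate P(score of a background sequence >= observed) by simulation."""
    rng = random.Random(seed)
    # Count sampled background sequences scoring at least as high as the observation.
    hits = sum(1 for _ in range(trials) if score_fn(sample_fn(rng)) >= observed)
    # Add-one correction (an assumption of this sketch): never report exactly zero.
    return (hits + 1) / (trials + 1)
```

Sampling cannot resolve p-values much smaller than 1/trials, which is one motivation for the exact methods listed on the slide.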

  16. Exact computation for PWMs [Rahmann03] • Computes the entire score distribution • Scores rounded by a granularity factor • Score is a sum of n independent variables (the score contribution of each position) • Probability of a random sequence of length i having a given score is computed from the contribution of the first i-1 positions and the current position
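
The dynamic program outlined on this slide can be sketched as below. It assumes an i.i.d. background distribution over bases and rounds each per-position score to an integer multiple of the granularity; the data layout and names are assumptions of this sketch and are not taken from [Rahmann03].

```python
from collections import defaultdict

def pwm_score_distribution(score_matrix, background, granularity=0.01):
    """score_matrix: list of {base: score}; background: {base: probability}.
    Returns {rounded integer score: probability} for random background sequences."""
    dist = {0: 1.0}  # the empty prefix has score 0 with probability 1
    for column in score_matrix:
        nxt = defaultdict(float)
        for s, p in dist.items():
            for base, q in background.items():
                # Extend every prefix score by the rounded contribution of this position.
                nxt[s + round(column[base] / granularity)] += p * q
        dist = dict(nxt)
    return dist

def p_value(dist, observed, granularity=0.01):
    """P(score >= observed), read off the exact (rounded) distribution."""
    threshold = round(observed / granularity)
    return sum(p for s, p in dist.items() if s >= threshold)
```

Rounding bounds the number of distinct scores that have to be tracked at each position, so the table stays small even though the number of possible sequences grows exponentially with length.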

  17. Exact computation for IMCs • Define a table entry as the probability of a random sequence of length i having a given score and a given last letter • Basic recurrence: [equation not preserved in the transcript]

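  18. IMC exact p-value computation • Initially: [initial condition not preserved in the transcript] • The probability of a random barcode having a given score: [formula not preserved in the transcript] • Runtime: [expression not preserved in the transcript], where R is the difference between max and min score for any i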
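
The slide's formulas did not survive in the transcript, so the following is only a hedged reconstruction of the idea stated in words: track, for each prefix length, the probability of every (rounded score, last letter) pair. It assumes the score is a start term plus one term per transition, and that the background is itself a first-order chain given by `bg_start` and a list `bg_trans` of per-locus transition tables. All names and the data layout are assumptions of this sketch.

```python
from collections import defaultdict

def imc_score_distribution(start_score, trans_score, bg_start, bg_trans,
                           granularity=0.01):
    """Return {rounded score: probability} for sequences drawn from the background.
    start_score[a] and trans_score[i][a][b] are the score terms of the species model;
    bg_start[a] and bg_trans[i][a][b] are the background probabilities."""
    # table[(s, a)] = P(prefix has rounded score s and ends in letter a)
    table = defaultdict(float)
    for a, p in bg_start.items():
        table[(round(start_score[a] / granularity), a)] += p
    for scores, bg in zip(trans_score, bg_trans):
        nxt = defaultdict(float)
        for (s, a), p in table.items():
            for b, q in bg[a].items():
                # Append letter b: add the transition score, multiply by P_bg(b | a).
                nxt[(s + round(scores[a][b] / granularity), b)] += p * q
        table = nxt
    dist = defaultdict(float)
    for (s, _last), p in table.items():
        dist[s] += p  # marginalize out the last letter
    return dict(dist)
```

In this sketch the work per locus is proportional to the number of reachable (score, letter) pairs times the alphabet size, so a finer granularity or a wider score range makes the computation more expensive.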

  19. Outline • Existing approaches to species identification • Proposed statistical model based methods • Experimental Results • Ongoing Work and Conclusions

  20. Experimental setup (1) • Compared methods • IMC • Species with highest score • If score < species-specific threshold → UNKNOWN • Distance-based (BOLD-IDS like) • Species containing the barcode with the least divergence • If divergence > threshold (default 1%) → UNKNOWN • Basic questions • What is the effect of training set size (#barcodes per species) on accuracy? • What is the effect of the #species on accuracy?
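
The distance-based baseline can be sketched as a nearest-barcode search with a divergence cutoff. The sketch assumes aligned, equal-length barcodes and uses the plain fraction of mismatching positions as the divergence; the actual BOLD-IDS distance measure may differ, and the function names are hypothetical.

```python
def divergence(a, b):
    """Fraction of mismatching positions between two aligned, equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_assign(db, x, threshold=0.01):
    """db: {species: [barcodes]}. Return the species holding the closest barcode,
    or "UNKNOWN" if even the closest barcode diverges by more than the threshold."""
    best_species, best_div = min(
        ((sp, divergence(x, bc)) for sp, barcodes in db.items() for bc in barcodes),
        key=lambda pair: pair[1],
    )
    return best_species if best_div <= threshold else "UNKNOWN"
```

The IMC method follows the same pattern with the comparison reversed: the best-scoring species is reported unless its score falls below that species' threshold.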

  21. Experimental setup (2) • Two scenarios: • Complete DB: all new barcodes belong to species in DB • Incomplete DB: some new barcodes belong to species not in DB

  22. Accuracy measures • True positive rate = TP/(TP+FP) • Barcodes belonging to species present in the DB • TP = #barcodes assigned to correct species • FP = #barcodes assigned to incorrect species • Barcodes belonging to species not present in DB • TP = #barcodes assigned to unknowns • FP = #barcodes assigned to species in the DB
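
The counting rules on this slide can be sketched as follows. The slide does not say whether the two groups of barcodes are pooled or reported separately, so this sketch reports them separately; that choice, the "UNKNOWN" label string, and the function name are assumptions.

```python
def tp_rates(records, db_species):
    """records: iterable of (true_species, assigned_label) pairs.
    Returns TP/(TP+FP) for barcodes whose species is in the DB and for those not in it."""
    counts = {"in_db": [0, 0], "not_in_db": [0, 0]}  # [TP, FP] per group
    for truth, assigned in records:
        if truth in db_species:
            if assigned == truth:
                counts["in_db"][0] += 1          # assigned to the correct species
            elif assigned != "UNKNOWN":
                counts["in_db"][1] += 1          # assigned to an incorrect species
        else:
            if assigned == "UNKNOWN":
                counts["not_in_db"][0] += 1      # correctly left unassigned
            else:
                counts["not_in_db"][1] += 1      # wrongly assigned to a DB species
    return {group: (tp / (tp + fp) if tp + fp else None)
            for group, (tp, fp) in counts.items()}
```

Under these definitions, a barcode from a DB species that is labeled UNKNOWN counts as neither TP nor FP, so the rate has to be read together with the fraction of unknown assignments.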

  23. Effect of #barcodes/species • Datasets containing all BOLD species with at least 5/25 barcodes • BOLD5: 1508 sp, 28600 barcodes • BOLD25: 270 sp, 17197 barcodes • DB composed of randomly picked 5-20 barcodes from all species in BOLD25 • Test barcodes • Complete database scenario • All remaining barcodes from BOLD25 • Incomplete database scenario • All barcodes from BOLD5 not in DB

  24. Effect of #barcodes/species, complete DB

  25. Effect of #barcodes/species, incomplete DB

  26. Effect of #species • Datasets containing all BOLD species with at least 5/10 barcodes • BOLD5: 1508 sp, 28600 barcodes • BOLD10: 690 sp, 23558 barcodes • DB composed of randomly picked 100 to 690 species from BOLD10 • 10 barcodes per species • Test barcodes • Complete database scenario • All remaining barcodes from picked species • Incomplete database scenario • All barcodes from BOLD5 not in DB

  27. Effect of #species, complete DB

  28. Effect of #species, incomplete DB

  29. Outline • Existing approaches to species identification • Proposed statistical model based methods • Experimental Results • Ongoing Work and Conclusions

  30. Conclusions & Ongoing work • IMC provides a scalable method for species identification • High accuracy, with useful tradeoff between TP rate and unknown rate • Efficiently computable p-values • Comprehensive comparison of identification algorithms to be submitted to 2nd International Barcode Conference • Broad coverage of methods • tree-based, distance-based, character-based, model-based • Assessment of further effects besides #species and #barcodes/species • Barcode length • Barcode quality • Number of regions • Runtime scalability (up to millions of species) • Diverse datasets (BOLD, cowries, flu viruses, simulated data, etc.)
