
Freek T. Bakker Nationaal Herbarium Nederland, Wageningen University branch,

Optimising DNA barcode regions. Freek T. Bakker, Nationaal Herbarium Nederland, Wageningen University branch, Biosystematics Group, Wageningen UR, The Netherlands. DNA barcoding, CBOL, GenBank. Non-COI protocol. Models in DNA barcode matching. Structure of this talk. DNA barcoding.


Presentation Transcript


  1. Optimising DNA barcode regions. Freek T. Bakker, Nationaal Herbarium Nederland, Wageningen University branch, Biosystematics Group, Wageningen UR, The Netherlands

  2. Structure of this talk • DNA barcoding, CBOL, GenBank • Non-COI protocol • Models in DNA barcode matching

  3. DNA barcoding “Using molecular data as species diagnostics isn’t new, but global standardization and scale of implementation are”

  4. http://www.barcoding.si.edu/

  5. CBOL Structure • Member Organizations • Executive Committee • Secretariat Office • Working Groups • Scientific Advisory Board

  6. Uses of DNA Barcodes • Establish a reference library of barcodes from identified voucher specimens • If necessary, revise species limits • Then: identify unknowns by searching against reference sequences; look for matches (mismatches) against a ‘library on a chip’ • Before long: analyze relative abundance in multi-species samples

  7. Reference versus micro-Barcodes • BARCODE reference records: adhere to data standards; bidirectional reads, 500+ bp long; linked to voucher and species name • Query barcode records: used in BLAST or other searches; often single-pass reads; often very short (100+ bp for good IDs); can cost less than $2 and take less than 6 hours

  8. DNA barcode default • The Consortium for the Barcode of Life (CBOL) has so far accepted the 648 base-pair ‘Folmer region’ of COI (mitochondrially encoded cytochrome c oxidase subunit 1) as the default DNA barcode region for vertebrates and insects, and promotes its use in as many other clades as possible. • The International Nucleotide Sequence Database Collaboration (INSDC, consisting of GenBank, the European Molecular Biology Laboratory and the DNA Data Bank of Japan) has adopted the data standards proposed by CBOL for BARCODE data records, and has empowered CBOL to decide which gene regions can be given BARCODE status.

  9. CBOL → GenBank [Diagram labels: New CO1 barcode, Data standards, CBOL, GenBank]

  10. How many DNA barcodes do we need, or, what’s ahead? • 1.7 × 10^6 described species • 10 barcodes per species • ~20 × 10^6 barcodes of 650 bp each • 10 × 10^6 more eukaryote species to go • ~100 × 10^6 more barcodes of 650 bp each • In total this would be ~65,000,000,000 bp • This is twice the total amount of bp currently in GenBank! • To be completed within the decade (Hajibabaei & al., 2005)
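The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the input figures are the slide's own, and the variable and function names are illustrative. It suggests that the ~65,000,000,000 bp figure corresponds to the ~100 × 10^6 barcodes for species still to be described; on these same assumptions, the barcodes for the already described species would add roughly 11 × 10^9 bp on top.

```python
# Input figures are taken from the slide; all names are illustrative.
BARCODE_LENGTH_BP = 650
BARCODES_PER_SPECIES = 10
DESCRIBED_SPECIES = 1_700_000
REMAINING_SPECIES = 10_000_000


def total_bp(n_species, per_species=BARCODES_PER_SPECIES, length=BARCODE_LENGTH_BP):
    """Base pairs needed to barcode n_species at the given sampling depth."""
    return n_species * per_species * length


described_barcodes = DESCRIBED_SPECIES * BARCODES_PER_SPECIES  # slide rounds this to ~20 x 10^6
remaining_barcodes = REMAINING_SPECIES * BARCODES_PER_SPECIES
described_bp = total_bp(DESCRIBED_SPECIES)
remaining_bp = total_bp(REMAINING_SPECIES)

print(f"barcodes, described species: {described_barcodes:,}")
print(f"barcodes, remaining species: {remaining_barcodes:,}")
print(f"bp, described species:       {described_bp:,}")
print(f"bp, remaining species:       {remaining_bp:,}")
print(f"bp, both together:           {described_bp + remaining_bp:,}")
```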

  11. Optimal DNA barcodes • Barcoding gap: high inter-specific, low intra-specific sequence divergence • Universal amplification/sequencing with standard primers • Technically simple to sequence • Short enough to sequence in one reaction • Easily alignable (few insertions/deletions) • Readily recoverable from museum or herbarium samples and other degraded samples

  12. CO1 divergence in eukaryotes

  13. CBOL → GenBank [Diagram labels: Non-CO1 barcode, Non-COI protocol, Data standards, CBOL, GenBank]

  14. non-COI barcode regions • COI alone will not do • mtDNA evolution too variable across major clades • NUMTs • Other faults (e.g. heteroplasmy, introgression, COI not present – e.g. Rubinoff & al. 2007) • rDNA: ITS, D3/D4, cpDNA: rpoC1, rpoB, matK • Multiple barcodes

  15. CBoL’s non-CO1 protocol • Protocol, to be used as guideline, available now • Reject CO1 as suitable region for clade of interest • Propose alternative region based on required evidence as documented • Barcode gap? • NJ tree • Multiple regions?

  16. The DNA barcode gap [Figures from Hebert & al., PLoS Biology 2004, and from Meyer & al., PLoS Biology 2004]

  17. DNA barcode gap From Van Velzen & al. NEV 2007

  18. DNA barcode gap • Discontinuity between minimum inter- and maximum intraspecific divergence • However, in ‘paraphyletically’ clustered barcodes: intra > inter divergence! [Figure: minimum divergence between species pairs (%) plotted against maximum intraspecific divergence (%), with axis ticks at 10% and 20%; a 1:1 line separates the ‘Barcode Gap’ region from the ‘No Barcode Gap’ region; shapes indicate species pair comparisons, colors indicate gene regions]
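The gap test on this slide can be sketched in Python as a comparison of the largest within-species distance against the smallest between-species distance. This is a sketch under stated assumptions, not the method behind the figure: it uses uncorrected p-distances on pre-aligned sequences of equal length, and the species labels and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(records):
    """records: list of (species, sequence) pairs.

    Returns (max_intra, min_inter, has_gap), where has_gap is True when the
    smallest between-species distance exceeds the largest within-species one.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    if not inter:
        raise ValueError("need at least two species")
    max_intra = max(intra, default=0.0)
    min_inter = min(inter)
    return max_intra, min_inter, min_inter > max_intra


# Hypothetical toy data: two species, two barcodes each.
records = [
    ("sp_A", "ACGTACGTAC"),
    ("sp_A", "ACGTACGTAT"),
    ("sp_B", "ACGAACCTAG"),
    ("sp_B", "ACGAACCTAG"),
]
max_intra, min_inter, has_gap = barcode_gap(records)
print(max_intra, min_inter, has_gap)
```

A point with min_inter above max_intra lies above the 1:1 line on the plot; the paraphyletic case flagged on the slide is the one where has_gap comes out False.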

  19. [Figure: two trees. Tortella (Bryophyta), rpoC1, NJ, Hedderson & al., with tips from Tortella, Pleurochaete and Hampeella. Agave (Agavaceae), rpoB, Cowan & al., with tips from Agave and Furcraea. Scale bars: 0.005 substitutions/site and 0.1 changes]

  20. Rejection of CO1 • Reject CO1 as suitable region for clade of interest • Propose alternative region based on required evidence, i.e. • Pattern of intra- and interspecific variation • Resolving power • Universality • Document the number of primer pairs needed to successfully PCR amplify & identify species throughout the clade of interest

  21. Implementation • Protocols will be adopted for a period of 6 months during which CBOL is open to suggestions for their improvement from the community. • CBOL will normally expect publication of evidence for effectiveness of proposed non-COI barcode region(s) in a peer-reviewed publication prior to submission of a proposal • Prior peer review and publication will support the proposal’s claims and will inform the community of the proposed barcode region(s) • Upon approval by CBOL’s Executive Committee, INSDC will be informed immediately and BARCODE status can be given

  22. Challenges • Is effectiveness of DNA barcode jeopardized by using parameter-poor models? • Is NJ too crude to provide correct matches between closely related barcodes? • How will non-coding DNA sequences perform when matching ‘unknowns’? • Do we need Bayesian matching for critical species? (PP’s on match, Priors to express uncertainty on population parameters) • Is matching of multiple barcodes a special case?

  23. DNA barcode matching • Character-based for closely related barcodes? • Phylogenetic clustering • Distance-based matching: what models? • Low divergence → few parameters (JC, K2P) • Codon models? • Composite barcodes → composite models? • Non-coding regions: length-variation • Pragmatism: large reference libraries, speed
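The two few-parameter models named here have closed-form distances, which makes distance-based matching easy to sketch. With p the proportion of differing sites, the Jukes-Cantor distance is d = -(3/4) ln(1 - 4p/3); with P and Q the proportions of transitions and transversions, the Kimura two-parameter (K2P) distance is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The Python below is a minimal sketch assuming pre-aligned sequences; the function names, the reference library and the query are hypothetical, and matching by smallest distance is just one simple way to pick a match.

```python
import math

PURINES = {"A", "G"}


def ts_tv_proportions(a, b):
    """Return (P, Q): transition and transversion proportions over A/C/G/T sites."""
    n = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        n += 1
        if x != y:
            if (x in PURINES) == (y in PURINES):
                ts += 1  # purine<->purine or pyrimidine<->pyrimidine
            else:
                tv += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return ts / n, tv / n


def jc69(a, b):
    """Jukes-Cantor distance; raises ValueError when sequences are saturated."""
    P, Q = ts_tv_proportions(a, b)
    return -0.75 * math.log(1 - 4 * (P + Q) / 3)


def k2p(a, b):
    """Kimura two-parameter distance; raises ValueError when saturated."""
    P, Q = ts_tv_proportions(a, b)
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


def best_match(query, library, dist=k2p):
    """Name of the reference barcode with the smallest distance to the query."""
    return min(library, key=lambda name: dist(query, library[name]))


# Hypothetical reference library and query, already aligned.
reference = {
    "sp_A": "ACGTACGTACGTACGTACGT",
    "sp_B": "ACGTACGAACGTTCGTACGA",
}
query = "ACGTACGTACGTACGTACGC"
print(best_match(query, reference), k2p(query, reference["sp_A"]))
```

Passing dist=jc69 to best_match swaps the model without touching the matching logic, which is one way to compare simple and complex models on the same library.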

  24. DNA barcode models • Simulate DNA barcodes using parameter-rich models, derived from insect COI and from cpDNA atpB data (GTR, c113) • 100 replicates of simulated data sets: 60 barcode sequences of 654 nt • Distance: models simple → complex • NJ clustering of resulting distances • Semistrict consensus of 100 NJ trees
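The NJ clustering step can be illustrated with a compact neighbour-joining routine in Python. This is a teaching sketch, not the software used for the talk: it recovers only the grouping order (no branch lengths), omits the consensus step, and runs on a small made-up distance matrix with hypothetical taxon names.

```python
def neighbor_joining(names, matrix):
    """Neighbour joining on a symmetric distance matrix.

    Returns the clades (frozensets of tip names) formed by successive joins.
    """
    n = len(names)
    members = {i: frozenset([names[i]]) for i in range(n)}
    d = {}
    for i in range(n):
        for j in range(i + 1, n):
            d[frozenset((i, j))] = float(matrix[i][j])

    active = list(range(n))
    clades = []
    next_id = n
    while len(active) > 2:
        m = len(active)
        # Net divergence of each active node from all others.
        r = {i: sum(d[frozenset((i, k))] for k in active if k != i) for i in active}
        best = None
        for idx, i in enumerate(active):
            for j in active[idx + 1:]:
                q = (m - 2) * d[frozenset((i, j))] - r[i] - r[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        u = next_id
        next_id += 1
        dij = d[frozenset((i, j))]
        # Distance from the new internal node u to every other active node.
        for k in active:
            if k != i and k != j:
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        members[u] = members[i] | members[j]
        clades.append(members[u])
        active = [k for k in active if k != i and k != j] + [u]
    return clades


# Made-up additive distances for five hypothetical taxa.
taxa = ["a", "b", "c", "d", "e"]
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
clades = neighbor_joining(taxa, distances)
for clade in clades:
    print(sorted(clade))
```

In the design on this slide, a routine of this kind would be run once per simulated replicate and per distance model, and the resulting trees then summarised in a consensus.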

  25. DNA barcode models [Flow diagram: simulation (rich model) of cpDNA atpB and mtDNA COI sequence sets, × 100 replicates each; NJ (rich model) gives 100 NJ trees and a semistrict consensus (NJatpB r/r, NJCOI r/r); NJ (poor model) gives 100 NJ trees and a semistrict consensus (NJatpB r/p, NJCOI r/p)]

  26. DNA barcode models • Findmodel (Los Alamos National Lab.): best-fitting model for lepidopteran COI data set • MrBayes/Tracer: model parameter values • Simulation tree: angiosperm species-level phylogenetic tree topology (not ultrametric) • Seq-Gen: simulate 100 replicates of 60 sequences × 654 nt • PAUP*: NJ and consensus analysis • TreeView: tree interpretation

  27. cpDNA atpB model Relative subst. rates Base composition

  28. mtDNA COI model Relative subst. rates Base composition

  29. atpB vs. COI models Relative subst. rates Base composition atpB COI

  30. atpB [Figure: semistrict consensus trees; panels: Model tree (atpB/GTR), Reconstructed NJ tree (K2P), Reconstructed NJ tree (GTR)]

  31. COI [Figure panels: Model tree (atpB/GTR), Reconstructed NJ tree (K2P), Reconstructed NJ tree (GTR)]

  32. Over-parametrization? • Parameter-rich models not efficient in reconstructing ‘parameter-rich patterns’? • Parameter-poor models do better • Artefact of pairwise comparison? • Various shapes & branch lengths • Different base composition across tree • Different omega rates across tree • Codon models?

  33. Conclusions • Non-COI barcode regions will be needed and are proposed through CBOL protocol • CBOL approval → adoption by INSDC • NJ/K2P sufficient for performance testing of proposed region • Character-based DNA barcode matching needed for closely related barcodes • Multiple barcodes: matched simultaneously
