Optimising DNA barcode regions Freek T. Bakker Nationaal Herbarium Nederland, Wageningen University branch, Biosystematics Group, Wageningen UR The Netherlands
Structure of this talk • DNA barcoding, CBOL, GenBank • Non-COI protocol • Models in DNA barcode matching
DNA barcoding “Using molecular data as species diagnostics isn’t new, but global standardization and scale of implementation are”
CBOL Structure • Member Organizations • Executive Committee • Secretariat Office • Working Groups • Scientific Advisory Board
Uses of DNA Barcodes • Establish reference library of barcodes from identified voucher specimens • If necessary, revise species limits Then: • Identify unknowns by searching against reference sequences • Look for matches (mismatches) against ‘library on a chip’ Before long: • Analyze relative abundance in multi-species samples
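The "identify unknowns by searching against reference sequences" step can be illustrated with a toy nearest-match search. This is only a sketch under stated assumptions: it is not BLAST and not a CBOL tool, and the helper name `best_match`, the ungapped sliding-window scoring and the two invented reference sequences are all hypothetical choices made for the example.

```python
def best_match(query, library):
    """Return (name, identity, offset) of the best ungapped hit for query.

    Hypothetical helper: slides the query along every reference sequence
    and scores the fraction of identical bases at each offset.
    """
    best = None
    for name, ref in library.items():
        for offset in range(len(ref) - len(query) + 1):
            window = ref[offset:offset + len(query)]
            identity = sum(q == r for q, r in zip(query, window)) / len(query)
            # Keep the first hit with the highest identity seen so far.
            if best is None or identity > best[1]:
                best = (name, identity, offset)
    return best


# Toy reference library (invented sequences, not real barcodes).
library = {
    "species_A": "ACGTTGCAAGCTTGACCGTA",
    "species_B": "ACGTAGCTAGGTTCACCGAA",
}
query = "AGCTTGAC"  # short query in the spirit of a micro-barcode
name, identity, offset = best_match(query, library)
print(name, identity, offset)
```

A real pipeline would need gapped alignment and a rule for ties and near-ties between species; the sketch only shows the shape of the lookup.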
Reference versus micro-Barcodes BARCODE reference records: • Adhere to data standards • Bidirectional reads, 500+ bp long • Linked to voucher, species name Query barcode records: • Used in BLAST or other searches • Often single pass reads • Often very short – 100+ bp for good IDs • Can cost less than $2, take less than 6 hours
DNA barcode default • The Consortium for the Barcode of Life (CBOL) has so far accepted the 648 base-pair ‘Folmer region’ of COI (mitochondrially encoded cytochrome c oxidase subunit 1) as the default DNA barcode region for vertebrates and insects and promotes its use in as many other clades as possible. • The International Nucleotide Sequence Database Collaboration (INSDC, consisting of GenBank, the European Molecular Biology Laboratory and the DNA Data Bank of Japan) has adopted the data standards proposed by CBOL for BARCODE data records, and has empowered CBOL to decide which gene regions can be given BARCODE status.
[Diagram: a new CO1 barcode and the CBOL data standards, linking CBOL and GenBank]
How many DNA barcodes do we need, or, what’s ahead? • 1.7 × 10^6 described species • 10 barcodes per species • ~20 × 10^6 barcodes of 650 bp each • 10 × 10^6 more eukaryote species to go • ~100 × 10^6 more barcodes of 650 bp each • In total this would be ~65,000,000,000 bp • This is twice the total amount of bp currently in GenBank! • To be completed within the decade (Hajibabaei & al., 2005)
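The arithmetic behind these estimates can be checked in a few lines. The variable names are chosen for the example; the inputs are the slide's own figures. Note that the slide's ~65 × 10^9 bp corresponds to the 100 × 10^6 additional barcodes; adding the barcodes for the already described species brings the total to roughly 76 × 10^9 bp.

```python
BARCODE_LENGTH_BP = 650
BARCODES_PER_SPECIES = 10

described_species = 1_700_000
undescribed_species = 10_000_000  # the slide's estimate of eukaryote species still to go

# 17 million barcodes; the slide rounds this up to ~20 million.
described_barcodes = described_species * BARCODES_PER_SPECIES
# 100 million further barcodes for the species not yet described.
additional_barcodes = undescribed_species * BARCODES_PER_SPECIES

additional_bp = additional_barcodes * BARCODE_LENGTH_BP
all_bp = (described_barcodes + additional_barcodes) * BARCODE_LENGTH_BP

print(f"{additional_bp:,} bp for the additional barcodes")
print(f"{all_bp:,} bp including the described species")
```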
Optimal DNA barcodes • Barcoding gap: high inter-specific, low intra-specific sequence divergence • Universal amplification/sequencing with standard primers • Technically simple to sequence • Short enough to sequence in one reaction • Easily alignable (few insertions/deletions) • Readily recoverable from museum or herbarium samples and other degraded samples
Non-COI protocol [Diagram: a non-CO1 barcode and the CBOL data standards, linking CBOL and GenBank]
non-COI barcode regions • COI alone will not do • mtDNA evolution too variable across major clades • NUMTs • Other faults (e.g. heteroplasmy, introgression, COI not present – e.g. Rubinoff & al. 2007) • rDNA: ITS, D3/D4, cpDNA: rpoC1, rpoB, matK • Multiple barcodes
CBoL’s non-CO1 protocol • Protocol, to be used as guideline, available now • Reject CO1 as suitable region for clade of interest • Propose alternative region based on required evidence as documented • Barcode gap? • NJ tree • Multiple regions?
The DNA barcode gap [Figures from Hebert & al., PLoS Biology 2004, and from Meyer & al., PLoS Biology 2004]
DNA barcode gap From Van Velzen & al. NEV 2007
DNA barcode gap [Plot: minimum divergence between species pairs (%) against maximum intraspecific divergence (%), with a 1:1 line separating the ‘Barcode Gap’ region from the ‘No Barcode Gap’ region; shapes indicate species pair comparisons, colors indicate gene regions] • Discontinuity between minimum inter- and maximum intraspecific divergence • However, in ‘paraphyletically’ clustered barcodes: intra > inter divergence!
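The barcode gap criterion (minimum interspecific divergence larger than maximum intraspecific divergence) can be sketched as follows. This is an illustration, not the method used for the plot: it assumes uncorrected p-distances on pre-aligned sequences, and the function names and toy alignment are invented for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(alignment):
    """Per species: (max intraspecific, min interspecific, gap present?).

    `alignment` maps a species name to a list of aligned sequences.
    """
    if len(alignment) < 2:
        raise ValueError("need at least two species to compare")
    result = {}
    for species, seqs in alignment.items():
        intra = [p_distance(a, b) for a, b in combinations(seqs, 2)]
        inter = [
            p_distance(a, b)
            for other, other_seqs in alignment.items()
            if other != species
            for a in seqs
            for b in other_seqs
        ]
        max_intra = max(intra, default=0.0)  # a single sequence has no intraspecific pairs
        min_inter = min(inter)
        result[species] = (max_intra, min_inter, min_inter > max_intra)
    return result


# Toy alignment (invented sequences, not real barcodes).
toy = {
    "sp_A": ["ACGTACGTAC", "ACGTACGTAT"],
    "sp_B": ["TCGAACGGAC", "TCGAACGGAC"],
}
gaps = barcode_gap(toy)
for species, (max_intra, min_inter, has_gap) in gaps.items():
    print(species, max_intra, min_inter, has_gap)
```

A species whose barcodes cluster paraphyletically would show up here with `min_inter` at or below `max_intra`, i.e. no gap.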
NJ trees [Figure; scale bars: 0.005 substitutions/site, 0.1 changes] • Tortella (Bryophyta) rpoC1, Hedderson & al. Tips: Hampeella pallens; Tortella flavovirens SL6; Tortella flavovirens Netherlands; Pleurochaete squarrosa SL1; Tortella nitens SL11; Tortella densa SL10; Pleurochaete luteola SL2; Tortella tortuosa Austria; Tortella fragilis SL7; Tortella fragilis SL8; Tortella bambergii SL3; Tortella inclinata SL9; Tortella arctica SL4; Tortella arctica SL4; Tortella arctica SL5; Tortella humilis France; Pleurochaete luteola Peru; Pleurochaete squarrosa Arabia; Pleurochaete squarrosa France • Agave (Agavaceae) rpoB, Cowan & al. Tips: Agave angustiarum AG5649; Agave rhodacantha AG8333; Agave salmiana AGSN; Agave convallis AG7306; Agave petrophila AG5811; Agave stricta AGSN; Furcraea longeva AG6055; Furcraea aff guatemalensis AG; Agave striata spp striata AGSN; Agave aff kerchovei JIC24709; Agave datylio AG8274; Agave titanota JICSN; Agave triangularis AG5818; Agave victoria reginae AGSN
Rejection of CO1 • Reject CO1 as suitable region for clade of interest • Propose alternative region based on required evidence, i.e. • Pattern of intra- and interspecific variation • Resolving power • Universality • Document the number of primer pairs needed to successfully PCR amplify & identify species throughout the clade of interest
Implementation • Protocols will be adopted for a period of 6 months during which CBOL is open to suggestions for their improvement from the community. • CBOL will normally expect publication of evidence for effectiveness of proposed non-COI barcode region(s) in a peer-reviewed publication prior to submission of a proposal • Prior peer review and publication will support the proposal’s claims and will inform the community of the proposed barcode region(s) • Upon approval by CBOL’s Executive Committee, INSDC will be informed immediately and BARCODE status can be given
Challenges • Is effectiveness of DNA barcode jeopardized by using parameter-poor models? • Is NJ too crude to provide correct matches between closely related barcodes? • How will non-coding DNA sequences perform when matching ‘unknowns’? • Do we need Bayesian matching for critical species? (PP’s on match, Priors to express uncertainty on population parameters) • Is matching of multiple barcodes a special case?
DNA barcode matching • Character-based for closely related barcodes? • Phylogenetic clustering • Distance-based matching: what models? • Low divergence: few parameters (JC, K2P) • Codon models? • Composite barcodes: composite models? • Non-coding regions: length-variation • Pragmatism: large reference libraries, speed
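For reference, the two few-parameter distances named above, Jukes-Cantor (JC69) and Kimura two-parameter (K2P), can be computed from an aligned pair as sketched below. The function names are chosen for the example, and skipping gapped or ambiguous sites is an assumption of this sketch rather than a rule from the talk. With p the proportion of differing sites, P the proportion of transitions and Q the proportion of transversions, JC69 is d = -3/4 ln(1 - 4p/3) and K2P is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)].

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_changes(a, b):
    """Count transitions and transversions between two aligned sequences.

    Sites with gaps or ambiguity codes are skipped (an assumption of this sketch).
    Returns (transitions, transversions, compared_sites).
    """
    transitions = transversions = sites = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions, sites


def jc69_distance(a, b):
    """Jukes-Cantor distance: one rate for all substitutions."""
    ts, tv, n = count_changes(a, b)
    p = (ts + tv) / n
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(a, b):
    """Kimura two-parameter distance: separate transition and transversion rates."""
    ts, tv, n = count_changes(a, b)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# Two short toy strings, used purely as an illustration.
seq1 = "AGCTGACGTGGACGTA"
seq2 = "AGCTGAGCTGGACGTA"
print(count_changes(seq1, seq2))
print(jc69_distance(seq1, seq2), k2p_distance(seq1, seq2))
```

Both functions raise a math domain error when the sequences are too divergent for the correction to be defined, which is one practical reason these models suit low-divergence comparisons.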
DNA barcode models • Simulate DNA barcodes using parameter-rich models, derived from insect COI and from cpDNA atpB data (GTR, c113) • 100 replicates of simulated data sets: 60 barcode sequences of 654 nt • Distance models: from simple to complex • NJ clustering of resulting distances • Semistrict consensus of 100 NJ trees
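The rich-versus-poor comparison can be illustrated on a much smaller scale. The sketch below is not the experiment described in the talk, which simulated 60 sequences along a tree under GTR-type models and compared NJ trees. It swaps in K2P as the "richer" generating model and JC69 as the "poorer" estimating model, for a single pair of 654-nt sequences over 100 replicates. The true distance, the transition:transversion rate ratio `kappa`, the seed and all function names are hypothetical values chosen for the example.

```python
import math
import random


def simulate_pair_k2p(d, kappa, n_sites, rng):
    """Simulate transition/transversion proportions for one pair under K2P.

    d is the true distance (substitutions/site); kappa is the
    transition:transversion rate ratio (alpha/beta).
    """
    beta_t = d / (kappa + 2)   # since d = (alpha + 2*beta) * t
    alpha_t = kappa * beta_t
    # Expected proportions of transversions (Q) and transitions (P) under K2P.
    Q = 0.5 - 0.5 * math.exp(-4 * beta_t)
    P = 0.25 + 0.25 * math.exp(-4 * beta_t) - 0.5 * math.exp(-2 * (alpha_t + beta_t))
    ts = tv = 0
    for _ in range(n_sites):
        u = rng.random()
        if u < P:
            ts += 1
        elif u < P + Q:
            tv += 1
    return ts / n_sites, tv / n_sites


def jc69(P, Q):
    """Parameter-poor estimate: ignores the transition/transversion difference."""
    return -0.75 * math.log(1 - 4 * (P + Q) / 3)


def k2p(P, Q):
    """Estimate under the generating model."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


rng = random.Random(1)          # fixed seed so the run is repeatable
true_d, kappa = 0.10, 10.0      # hypothetical values
props = [simulate_pair_k2p(true_d, kappa, 654, rng) for _ in range(100)]
mean_jc = sum(jc69(P, Q) for P, Q in props) / len(props)
mean_k2p = sum(k2p(P, Q) for P, Q in props) / len(props)
print(f"true d = {true_d}, mean JC69 = {mean_jc:.4f}, mean K2P = {mean_k2p:.4f}")
```

On the same data the JC69 estimate can never exceed the K2P estimate, but at divergences this low the difference is small, which is consistent with the slide's point that few parameters may suffice for closely related barcodes.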
DNA barcode models [Workflow diagram: for each of cpDNA atpB and mtDNA COI, simulation under the rich model yields 100 replicate alignments; each replicate is analysed by NJ under the rich model and by NJ under a poor model, giving 100 NJ trees per analysis; a semistrict consensus of each set gives NJatpB r/r, NJCOI r/r, NJatpB r/p and NJCOI r/p]
DNA barcode models • Findmodel (Los Alamos National Lab.): best-fitting model for Lepidopteran COI data set • MrBayes/Tracer: model parameter values • Simulation tree: angiosperm species-level phylogenetic tree topology (not ultrametric) • Seq-Gen: simulate 100 reps., 654 nt × 60 seqs. • PAUP*: NJ and consensus analysis • TreeView: tree interpretation
cpDNA atpB model Relative subst. rates Base composition
mtDNA COI model Relative subst. rates Base composition
atpB vs. COI models Relative subst. rates Base composition atpB COI
atpB [Figure: model tree (atpB/GTR) compared with the reconstructed NJ tree under K2P and the reconstructed NJ tree under GTR; semistrict consensus]
COI [Figure: model tree (atpB/GTR) compared with the reconstructed NJ tree under K2P and the reconstructed NJ tree under GTR]
Over-parametrization? • Parameter rich models not efficient in reconstructing ‘parameter-rich patterns’? • Parameter-poor models do better • Artefact of pairwise comparison? • Various shapes & branch lengths • Different base-composition across tree • Different omega rates across tree • Codon models?
Conclusions • Non-COI barcode regions will be needed and are proposed through CBOL protocol • CBOL approval leads to adoption by INSDC • NJ/K2P sufficient for performance testing of proposed region • Character-based DNA barcode matching needed for closely related barcodes • Multiple barcodes: matched simultaneously