SUPERVISED DNA BARCODES SPECIES CLASSIFICATION: ANALYSIS, COMPARISON, AND RESULTS


Outline
• How we approach the DNA Barcodes species classification problem
• Supervised machine learning
• Supervised machine learning DNA Barcodes classification methods: BLOG 2.0
• Supervised machine learning DNA Barcodes classification methods: WEKA
• Consolidated DNA Barcodes classification methods
• Methods comparison: the data sets
• Methods comparison: Weka
• Methods comparison: Weka and consolidated DNA Barcodes classification methods
• Conclusions

How we approach the DNA Barcodes species classification problem
• Goal: assign an unknown specimen to a known species starting from its DNA Barcode sequence
• The classification problem may be formulated in the following way [Weitschek et al. 2013]:
  • given a reference library composed of DNA Barcode specimen sequences of known species, and
  • a collection of unknown DNA Barcode sequences (query set),
  • assign each query sequence to one of the species present in the library
• To obtain reliable results:
  • the query set has to contain only specimens of species that are present in the reference library
  • the reference library has to contain a sufficient number of specimen sequences for each species (at least 4 specimens per species)
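The two reliability conditions above can be checked before any classifier is run. The following is a minimal sketch, not part of BLOG or Weka: the function name `check_library` and the toy labels are hypothetical, and the second check is only possible when the query species are known a priori.

```python
from collections import Counter

MIN_SPECIMENS = 4  # threshold stated on the slide


def check_library(reference_labels, query_labels=None, min_specimens=MIN_SPECIMENS):
    """Return (under_sampled, unknown_in_query).

    under_sampled: species with fewer than `min_specimens` reference specimens.
    unknown_in_query: query species absent from the reference library
    (only checkable when the query labels are known a priori).
    """
    counts = Counter(reference_labels)
    under_sampled = sorted(s for s, n in counts.items() if n < min_specimens)
    unknown = sorted(set(query_labels or []) - set(counts))
    return under_sampled, unknown


# Toy labels for illustration only (not taken from the data sets in the talk).
reference = ["Inga_alba"] * 4 + ["Inga_chartacea"] * 2
query = ["Inga_alba", "Inga_edulis"]
print(check_library(reference, query))
```

Both returned lists should be empty before the results of a classification run are trusted.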

Supervised machine learning
• The user has to provide as input a training set (reference library) containing specimens with a priori known species membership
• Based on this training set, the software computes the classification model
• Subsequently, the classification model can be applied to a test set (query set) which contains specimens that require classification
• The test set can contain query specimens with unknown species membership or, alternatively, specimens that also have a priori known species membership, allowing verification of the specimen classifications

Supervised machine learning DNA Barcodes classification methods: BLOG 2.0
BLOG 2.0: a software system for character-based species classification with DNA Barcode sequences. What it does, how to use it. E. Weitschek, R. van Velzen, G. Felici and P. Bertolazzi. Molecular Ecology Resources 13(6):1043-1046, 2013 (doi: 10.1111/1755-0998.12073)

Supervised machine learning DNA Barcodes classification methods: BLOG 2.0
• Input: a reference library in FASTA format
  • sequences have to be of the same region or pre-aligned to the same region
• BLOG computes for each species the distinctive nucleotide positions of the DNA Barcode sequences and the logic classification formulas (small "if-then" rules that characterize a species in a compact way)
• The classification formulas can be applied to a query set
• General form: If pos3 = A and pos458 = C then the specimen is a …
• Example formulas:
  1. IF BASE IN POSITION 466 IS C AND BASE IN POSITION 595 IS T THEN SPECIES IS …
  2. IF BASE IN POSITION 340 IS G AND BASE IN POSITION 451 IS A AND BASE IN POSITION 493 IS C THEN SPECIES IS …
  3. IF BASE IN POSITION 340 IS T AND BASE IN POSITION 466 IS A AND BASE IN POSITION 625 IS G THEN SPECIES IS …
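To illustrate how such if-then formulas can be applied to a query sequence, here is a minimal sketch. It is not BLOG's code: the positions and bases come from the three formulas on the slide, while the species labels (`species_1` etc.), the helper `make_sequence`, and the "first matching rule wins" policy are assumptions made for the example.

```python
# Each rule: (conditions, label); a condition is (1-based position, base).
# Positions and bases are taken from the slide; the species labels are
# placeholders because the slide elides the real names.
RULES = [
    ([(466, "C"), (595, "T")], "species_1"),
    ([(340, "G"), (451, "A"), (493, "C")], "species_2"),
    ([(340, "T"), (466, "A"), (625, "G")], "species_3"),
]


def classify(sequence, rules=RULES):
    """Return the label of the first rule whose conditions all hold, else None."""
    for conditions, label in rules:
        if all(len(sequence) >= pos and sequence[pos - 1] == base
               for pos, base in conditions):
            return label
    return None


def make_sequence(length, assignments, filler="N"):
    """Build a toy aligned sequence with given bases at 1-based positions."""
    seq = [filler] * length
    for pos, base in assignments.items():
        seq[pos - 1] = base
    return "".join(seq)


toy = make_sequence(650, {340: "G", 451: "A", 493: "C"})
print(classify(toy))  # species_2
```

A query that satisfies no rule returns `None`, which corresponds to leaving the specimen unclassified.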

Supervised machine learning DNA Barcodes classification methods: BLOG 2.0
• http://bol.uvm.edu
• http://dmb.iasi.cnr.it/blog.php

Supervised machine learning DNA Barcodes classification methods: WEKA
• The WEKA (Waikato Environment for Knowledge Analysis) machine learning software is adopted for DNA Barcoding
• WEKA contains several methods to perform supervised classification of general problems
• Input: a reference library in ARFF format
  • FASTA sequences have to be converted
  • sequences have to be of the same region or pre-aligned to the same region
• Weka computes the classification model
• The classification model can be applied to a query set
• To use Weka for DNA Barcodes classification, both the reference and the query set have to be converted to ARFF format

Supervised machine learning DNA Barcodes classification methods: WEKA
• Example: FASTA to WEKA
DNA barcode in FASTA format (header line: specimen ID | species name; following line: nucleotide sequence):
  > CC_1c_ID115 | Inga_alba
  ATT
  > CC.MZ_9_ID316 | Inga_chartacea
  AAC
Weka ARFF format (one numeric attribute per nucleotide position, a class attribute listing the species names, one data row per nucleotide sequence):
  @relation Inga_test
  @attribute pos1 numeric
  @attribute pos2 numeric
  @attribute pos3 numeric
  @attribute class {Inga_alba,Inga_chartacea}
  @data
  1,4,4,Inga_alba
  1,1,2,Inga_chartacea
• Available upon request (emanuel@dia.uniroma3.it), soon online at http://dmb.iasi.cnr.it
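The conversion shown in the example can be sketched in a few lines. This is not the authors' converter: the function names are hypothetical, the mapping A=1, C=2, T=4 is read off the example rows, G=3 is an assumption, and any other symbol (gap, ambiguity code) is written as the ARFF missing value `?`.

```python
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}  # G=3 is an assumption


def parse_fasta(text):
    """Yield (specimen_id, species, sequence) from '>id | species' records."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield (*header, "".join(chunks))
            specimen_id, species = (p.strip() for p in line[1:].split("|", 1))
            header, chunks = (specimen_id, species), []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield (*header, "".join(chunks))


def fasta_to_arff(text, relation):
    """Convert aligned FASTA text into an ARFF string, one attribute per site."""
    records = list(parse_fasta(text))
    length = len(records[0][2])
    if any(len(seq) != length for _, _, seq in records):
        raise ValueError("sequences must be aligned to the same length")
    species = sorted({sp for _, sp, _ in records})
    lines = [f"@relation {relation}"]
    lines += [f"@attribute pos{i} numeric" for i in range(1, length + 1)]
    lines.append("@attribute class {" + ",".join(species) + "}")
    lines.append("@data")
    for _, sp, seq in records:
        values = [str(ENCODING.get(base, "?")) for base in seq]
        lines.append(",".join(values + [sp]))
    return "\n".join(lines)


fasta = "> CC_1c_ID115 | Inga_alba\nATT\n> CC.MZ_9_ID316 | Inga_chartacea\nAAC\n"
print(fasta_to_arff(fasta, "Inga_test"))
```

The reference and the query set would each be passed through the same function so that both files share the same attribute layout.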

Supervised machine learning DNA Barcodes classification methods: WEKA
WEKA contains several supervised machine learning methods to perform classification, all of which can be used for DNA Barcoding

Supervised machine learning DNA Barcodes classification methods: WEKA
• Function – Support Vector Machines (SMO):
  • Two-class distinction problem (one-vs-all approach)
  • Transforms the data into n-dimensional vectors and builds the best separating hyperplane between the two classes
  • Performs well, but provides no human-interpretable classification model
• Rule based – RIPPER (JRip):
  • Extracts for every species in the reference library a characterizing "if-then" rule
  • The classification model is compact and human interpretable
• Classification tree – C4.5 (J48):
  • Mathematical structures composed of nodes (nucleotide assignments) and edges (decisions); the species labels are on the leaves of the tree
  • A path from the root to a leaf is a set of decisions on the attribute values that leads to the classification of a specimen (it can be transformed into an "if-then" rule)
• Bayesian – Naïve Bayes:
  • Models the joint probability distribution of a set of variables
  • Bayesian networks use the states of the observable variables and the a priori probabilities, with edges representing the relations between variables, to evaluate the a posteriori probabilities of the unknown states
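As an illustration of the Bayesian approach, the sketch below implements a small naïve Bayes classifier that treats each aligned site as a categorical variable. It is not Weka's implementation: the function names, the toy sequences, and the choice of Laplace smoothing over the four bases are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_naive_bayes(sequences, labels):
    """Count class frequencies and per-position base frequencies per class."""
    class_counts = Counter(labels)
    base_counts = defaultdict(Counter)  # (label, position) -> Counter of bases
    for seq, label in zip(sequences, labels):
        for pos, base in enumerate(seq):
            base_counts[(label, pos)][base] += 1
    return class_counts, base_counts


def predict(model, seq):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_counts, base_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior of the species
        for pos, base in enumerate(seq):
            count = base_counts[(label, pos)][base]
            score += math.log((count + 1) / (n + len(ALPHABET)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy reference library for illustration only.
sequences = ["ATT", "ATT", "ATC", "AAC", "AAC", "AAT"]
labels = ["Inga_alba"] * 3 + ["Inga_chartacea"] * 3
model = train_naive_bayes(sequences, labels)
print(predict(model, "ATT"), predict(model, "AAC"))
```

The per-site independence assumption is what makes the model "naïve"; it keeps training to simple counting but, as noted on the slide, the resulting model is not a compact set of rules.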

DNA Barcodes classification methods
• Tree-based methods assign unidentified (query) barcodes to species based on their membership of clusters (or clades) in a DNA barcode tree
• Similarity-based methods assign query barcodes to species based on how many DNA barcode characters they have in common
• Diagnostic methods (character-based methods) rely on the presence/absence of particular characters in DNA barcode sequences for identification, instead of using all of them
DNA barcoding of recently diverged species: Relative Performance of Matching Methods. R. Van Velzen, E. Weitschek, G. Felici and F.T. Bakker. PLoS One 7(1):e30490, 2012

DNA Barcodes classification methods
• Tree based:
  • Neighbour Joining [Saitou and Nei 1987]: the most used method in DNA Barcode data analysis; a bottom-up clustering method used for the construction of phylogenetic trees based on sequence distance
  • Parsimony [Edwards et al. 1963]: the preferred tree is the one that requires the least evolutionary change to explain the data; outperformed other tree-based methods
• Similarity based:
  • Nearest Neighbour [Meier et al. 2006]: a distance-based method, which gave very high recognition rates
  • BLAST [Altschul et al. 1997]: the most commonly used method for classifying DNA sequences in practice; an algorithm for comparing query sequences with an unaligned reference database, calculating pairwise alignments in the process
• Diagnostic methods:
  • DNA-BAR [DasGupta et al. 2005]: showed higher levels of accurate species identification in previous studies; alignment-free method; it first selects sequence substrings (distinguishers) that differentiate the sequences in the reference data set, and then records the presence/absence of these distinguishers
  • BLOG [Bertolazzi, Felici, Weitschek 2009]: character-based method; tested for the first time
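The idea behind the similarity-based nearest-neighbour approach can be sketched as follows. This is an illustration only, not the procedure of Meier et al.: it assumes pre-aligned sequences of equal length, uses the uncorrected proportion of differing sites as the distance, and breaks ties by taking the first closest reference specimen. The function names and toy sequences are hypothetical.

```python
def p_distance(a, b):
    """Fraction of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_neighbour(query, reference):
    """reference: list of (species, sequence) pairs.

    Return (species, distance) of the closest reference specimen.
    """
    return min(((sp, p_distance(query, seq)) for sp, seq in reference),
               key=lambda item: item[1])


# Toy reference library for illustration only.
reference = [("Inga_alba", "ATTG"), ("Inga_chartacea", "AACG")]
print(nearest_neighbour("ATTA", reference))  # ('Inga_alba', 0.25)
```

Unlike the diagnostic methods, every site contributes to the distance, so the result does not point to specific characters that identify a species.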

Methods comparison: The data sets
• Publicly available data sets are used to perform a comparative analysis of the methods
• Empirical data sets [Weitschek et al., 2013; Van Velzen et al., 2012]
• Synthetic data sets [Van Velzen et al., 2012]

Methods comparison: Weka
• Weka supervised machine learning methods (SVM, RIPPER, C4.5, and Naïve Bayes) were tested on the empirical and simulated data sets
• Reference and query sets were chosen as in the previous references (80% – 20% in empirical data; 80% – 20% replicated 100-fold in simulated data)
• Results are reported as the average accuracies reached on the query sets
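An 80% – 20% reference/query split and the accuracy measure can be sketched as below. The exact splitting procedure of the cited studies is not described on the slide, so splitting each species separately, the fixed random seed, and the function names are assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_species(specimens, train_fraction=0.8, seed=0):
    """specimens: list of (species, sequence). Split each species separately."""
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for species, seq in specimens:
        by_species[species].append((species, seq))
    reference, query = [], []
    for species in sorted(by_species):
        group = by_species[species]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        reference += group[:cut]
        query += group[cut:]
    return reference, query


def accuracy(predicted, actual):
    """Fraction of query specimens assigned to their true species."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy data for illustration only: 10 specimens for each of two species.
specimens = [("Inga_alba", "ATT")] * 10 + [("Inga_chartacea", "AAC")] * 10
reference, query = split_by_species(specimens)
print(len(reference), len(query))  # 16 4
```

Averaging `accuracy` over repeated splits or replicate data sets (for instance by varying `seed`) gives an average accuracy of the kind reported in the comparison.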

Methods comparison: Weka
• Comparison of the Weka supervised machine learning methods (SVM, RIPPER, C4.5, and Naïve Bayes)
• SVM and Naïve Bayes have the highest correct classification rate (accuracy), but provide no human-interpretable or compact model of the data set
• JRip and C4.5 have slightly inferior results, but provide a human-interpretable classification model

Methods comparison: Weka and consolidated DNA Barcodes classification methods
• SVM and Naïve Bayes have the highest correct classification rate (accuracy)
• BLOG is at a comparable level and provides a classification model in terms of logic formulas

Methods comparison: Weka and consolidated DNA Barcodes classification methods
• Very high accuracy for the supervised machine learning methods in Weka
• Consolidated DNA Barcodes methods are challenged by these data sets

Conclusions
• The classification analysis shows that:
  • supervised machine learning methods are promising candidates for successfully handling the DNA Barcode species classification problem
  • all methods obtained very good classification performance
  • SVM and Naïve Bayes: excellent accuracy, but no human-interpretable model
  • BLOG, C4.5, and RIPPER: very good results and a human-interpretable classification model (if-then rules) that can be used outside the realm of DNA barcoding, for instance in species description or molecular detection assays
• Finally, the DNA Barcoding community is provided with a powerful tool to perform species classification

Main references
• BLOG 2.0: a software system for character-based species classification with DNA Barcode sequences. What it does, how to use it. E. Weitschek, R. van Velzen, G. Felici and P. Bertolazzi. Molecular Ecology Resources 13(6):1043-1046, 2013 (doi: 10.1111/1755-0998.12073)
• DNA barcoding of recently diverged species: Relative Performance of Matching Methods. R. Van Velzen, E. Weitschek, G. Felici and F.T. Bakker. PLoS One 7(1):e30490, 2012. www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0030490
• Learning to classify species with barcodes. P. Bertolazzi, G. Felici and E. Weitschek. BMC Bioinformatics 10(Suppl 14):S7, 2009. www.biomedcentral.com/1471-2105/10/S14/S7
• Supervised DNA Barcodes species classification: analysis, comparison, and results. E. Weitschek, G. Fiscon and G. Felici. BMC BioData Mining (under review)

References
• Sarkar IN, Trizna M; The Barcode of Life Data Portal: Bridging the Biodiversity Informatics Divide for DNA Barcoding; PLoS One; 2011
• Saitou N, Nei M; The neighbor-joining method: a new method for reconstructing phylogenetic trees; Mol Biol Evol; 1987, 4: 406-425
• Edwards AWF, Cavalli-Sforza LL; The reconstruction of evolution; Annals of Human Genetics; 1963, 27: 105-106
• Meier R, Shiyang K, Vaidya G, Ng PKL; DNA Barcoding and Taxonomy in Diptera: A Tale of High Intraspecific Variability and Low Identification Success; Systematic Biology; 2006, 55(5): 715-728
• Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al.; Gapped BLAST and PSI-BLAST: a new generation of protein database search programs; Nucleic Acids Research; 1997, 25: 3389-3402
• DasGupta B, Konwar KM, Măndoiu II, Shvartsman AA; DNA-BAR: distinguisher selection for DNA barcoding; Bioinformatics; 2005, 21: 3424-3426
• Lou M, Golding GB; Assigning sequences to species in the absence of large interspecific differences; Molecular Phylogenetics and Evolution; 2010, 56: 187-194
• Dexter KG, Pennington TD, Cunningham CW; Using DNA to assess errors in tropical tree identifications: How often are ecologists wrong and when does it matter?; Ecological Monographs; 2010, 80: 267-286

References
• Meyer CP, Paulay G; DNA barcoding: error rates based on comprehensive sampling; PLoS Biology; 2005, 3: 2229-2238
• Felici G, Truemper K; The Lsquare System for Mining Logic Data; Encyclopedia of Data Warehousing and Mining; 2005
• Bertolazzi P, Felici G, Festa P, Lancia G; Logic Classification and Feature Selection for Biomedical Data; Computers & Mathematics with Applications; 2008
• Van Velzen R, Weitschek E, Felici G, Bakker FT; DNA barcoding of recently diverged species: Relative Performance of Matching Methods; PLoS One; 2012, 7(1): e30490
• Weitschek E, Van Velzen R, Felici G; Species classification using DNA Barcode sequences: A comparative analysis; IASI CNR Technical Report; 2011
• Bertolazzi P, Felici G, Weitschek E; Learning to classify species with barcodes; BMC Bioinformatics; 2009
• Arisi, D'Onofrio, Brandi, Di Mambro, Felsani, Capsoni, Drovandi, Felici, Weitschek, Bertolazzi, Cattaneo; Gene expression biomarkers in the brain of a mouse model for Alzheimer's disease: mining of microarray data by logic classification and feature selection; Journal of Alzheimer's Disease; 2011
• Bertolazzi, Felici, Weitschek, Drovandi, Ciccozzi, Ciotti, Lopresti; Polyomaviruses genome analysis by logic mining techniques; BMC Virology Journal; 2012
• Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH; The WEKA Data Mining Software: An Update; SIGKDD Explorations; 2009, 11(1)
• DMB project website: http://dmb.iasi.cnr.it

Contacts
• Emanuel Weitschek, University Roma Tre, Department of Computer Science and Automation, Rome, Italy, emanuel@dia.uniroma3.it
• Giulia Fiscon, University La Sapienza, Department of Computer, Control and Management Engineering, Rome, Italy, giulia.fiscon@dis.uniroma1.it
• Giovanni Felici, National Research Council, Institute of System Analysis and Computer Science A. Ruberti, Rome, Italy, giovanni.felici@iasi.cnr.it
Thanks for your attention!
