
Integrating Machine Learning for Species Prediction and Discovery

Explore the Barcode of Life project, a fast-growing area of research, and discuss open questions regarding species discovery, data structure, visualization tools, and confidence measures. Analyze sequencing and base calling techniques and explore clustering and classification methods.



Presentation Transcript


  1. The Barcode of Life: Integrating machine learning techniques for species prediction and discovery. www.barcodinglife.com

  2. Welcome to the Meeting • Barcode of Life • A great opportunity to contribute to a fast-growing area of research. • Some questions to keep in mind for the afternoon…

  3. Open Questions • Species discovery vs. prediction • Data structure, missing data, sample sizes • New visualization tools for both discovery and prediction • Confidence measures for species discovery and individual specimen assignments – controlling the number of false discoveries and the power of detection

  4. Barcoding Data – A first look at the data. DIMACS BOL Data Analysis Working Group meeting, September 26, 2005. Rebecka Jornsten, Department of Statistics, Rutgers University. http://www.stat.rutgers.edu/~rebecka/DIMACSBOL/DimacsMeetingDATA/ Thanks to Kerri-Ann Norton.

  5. Outline • Data Structure and Data Retrieval • Sequencing and Base Calling • Distance metrics – Sequence information • Clustering • Classification • Open Questions – Discussion • Questions to think about are highlighted in red.

  6. What do the data look like?

  7. What do the data look like?

  8. What do the data look like?

  9. Sequencing

  10. Sequencing • Peak finding • Deconvolution • Denoising • Normalization • Base calling • Quality assessment • (ABI base caller, Phred)

  11. Sample Data • www.barcodinglife.com • Leptasterias data – six-rayed sea stars • Astraptes data – skipper butterflies • Collembola data – springtails

  12. Sample Data • Leptasterias (six-rayed sea stars): 5 species, 21 specimens; sample sizes 3–7; sequence length 1644 • Astraptes (skipper butterflies): 12 species, 451 specimens; sample sizes 3–96, 8 species with more than 20 specimens; sequence length 594 • Collembola (springtails): 18 species, 54 specimens; sample sizes 1–5; sequence length 635

  13. Sequence Information • We can compute the information content for each nucleotide position. • Is there a lot of variability between species at locus j? • Is there a lot of variability within a species at locus j? • Are the same loci discriminating between multiple species?
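The per-locus variability questions above can be made concrete with a small sketch. This is an illustration added here, not part of the original slides: it assumes the barcodes are aligned, equal-length strings in a list `seqs` with matching species names in `labels` (both hypothetical names).

```python
# Minimal sketch: per-locus (column-wise) Shannon entropy of the alignment,
# overall and within each species. `seqs` and `labels` are assumed inputs.
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (bits) of the base frequencies in one alignment column."""
    counts = Counter(column)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def locus_entropies(seqs):
    """Entropy at each locus j across all sequences."""
    return [column_entropy(col) for col in zip(*seqs)]

def within_species_entropies(seqs, labels):
    """Per-locus entropy computed separately inside each species."""
    return {sp: locus_entropies([s for s, l in zip(seqs, labels) if l == sp])
            for sp in set(labels)}
```

Loci with low within-species entropy but high overall entropy are the ones that discriminate between species.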

  14. Astraptes: Within-species entropy for the 9 species with 20+ specimens “pure”

  15. Mutual information of each locus for the 9 species
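One way to read "mutual information of each locus" is the mutual information between the base observed at locus j and the species label, I(X_j; Y) = H(X_j) + H(Y) - H(X_j, Y). The sketch below is an added illustration under that interpretation, again assuming `seqs` and `labels` as above.

```python
# Sketch: mutual information (bits) between the base at each locus and the
# species label. High-MI loci are the most informative about species identity.
from collections import Counter
from math import log2

def entropy(items):
    counts = Counter(items)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def locus_species_mi(seqs, labels):
    mi = []
    for col in zip(*seqs):                 # one alignment column at a time
        joint = list(zip(col, labels))     # (base, species) pairs
        mi.append(entropy(col) + entropy(labels) - entropy(joint))
    return mi
```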

  16. Pairwise mutual information: 10 vs. 11, 10 vs. 12, 10 vs. 12, 2 vs. 10, 2 vs. 12

  17. Distance Metric • To group the specimens in an unsupervised fashion we need to come up with a distance metric. • Without prior information about which loci are informative, we compute distances using the entire sequence (for Astraptes, 594 bases). • The 0-1 distance metric is the most commonly used. • However, some bases are ‘uncalled’ – usually denoted by a letter other than A, C, G, T. • How should we take this into account?
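As a concrete (added) illustration of the 0-1 metric and the uncalled-base question, the sketch below skips positions where either base is not A/C/G/T and rescales by the number of comparable positions; skipping is only one of several defensible choices.

```python
# Sketch of a 0-1 (mismatch) distance over the full sequence. Positions with
# an uncalled base (anything other than A, C, G, T) are skipped, and the
# mismatch count is rescaled to the number of comparable positions.
CALLED = set("ACGT")

def zero_one_distance(a, b):
    mismatches = comparable = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in CALLED and y in CALLED:
            comparable += 1
            mismatches += (x != y)
    return mismatches / comparable if comparable else float("nan")
```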

  18. Hierarchical Clustering: Astraptes

  19. PAM Clustering: Astraptes
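The two clusterings shown on these slides can be reproduced in outline as follows. This added sketch reuses `zero_one_distance` from the earlier snippet and assumes `seqs` holds the aligned Astraptes sequences; hierarchical clustering uses SciPy, and PAM uses KMedoids from the scikit-learn-extra package (a tooling assumption, not necessarily what was used for the slides).

```python
# Sketch: average-linkage hierarchical clustering and PAM on a precomputed
# 0-1 distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn_extra.cluster import KMedoids

def distance_matrix(seqs, dist=zero_one_distance):
    n = len(seqs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(seqs[i], seqs[j])
    return D

D = distance_matrix(seqs)

# Hierarchical clustering, cut into 12 groups (the number of Astraptes species)
Z = linkage(squareform(D), method="average")
h_labels = fcluster(Z, t=12, criterion="maxclust")

# PAM clustering with the same precomputed distances
pam_labels = KMedoids(n_clusters=12, metric="precomputed", method="pam").fit_predict(D)
```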

  20. Hierarchical Clustering: Collembola

  21. Another example: Leptasterias – 5 species, 21 specimens. Panels: all groups; groups 3 vs. 4.

  22. Hierarchical Clustering: Leptasterias – Group 1, Group 2. Problem…

  23. PAM Clustering: Leptasterias. Selecting the number of clusters via silhouette width, CV, etc. leads to the combining of species 3 and 4 – these data do not support a separate species.
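For completeness, here is an added sketch of the cluster-number selection step described above: average silhouette width over a range of k, computed from the same precomputed distance matrix `D`. With the Leptasterias distances this type of criterion is what merges species 3 and 4.

```python
# Sketch: pick the number of PAM clusters by average silhouette width on a
# precomputed distance matrix D.
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

def best_k_by_silhouette(D, k_range=range(2, 8)):
    scores = {}
    for k in k_range:
        labels = KMedoids(n_clusters=k, metric="precomputed", method="pam").fit_predict(D)
        scores[k] = silhouette_score(D, labels, metric="precomputed")
    return max(scores, key=scores.get), scores
```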

  24. Classification • A classifier that in principle closely resembles the hierarchical clustering approach is kNN. • Leave-one-out cross-validation: on the Leptasterias data, 1–2 specimens are misallocated with this classifier. • Both of these specimens are in group 4 (and mislabeled as 3). • Via cross-validation we see that one observation is labeled as 4 only if it is in the training set, otherwise as 3. • The other mislabeled observation fails in 15 out of 20 training scenarios. • Could both of these specimens have been mislabeled?
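The leave-one-out kNN experiment can be written directly against the distance matrix. The sketch below is an added illustration (1-NN by default, majority vote for larger k); `D` and the species labels `y` are assumed inputs.

```python
# Sketch: leave-one-out cross-validation of a k-nearest-neighbour classifier
# on a precomputed distance matrix D with species labels y.
import numpy as np

def loo_knn_errors(D, y, k=1):
    y = np.asarray(y)
    errors = []
    for i in range(len(y)):
        d = D[i].copy()
        d[i] = np.inf                       # hold specimen i out
        votes = list(y[np.argsort(d)[:k]])  # labels of the k nearest specimens
        pred = max(set(votes), key=votes.count)
        if pred != y[i]:
            errors.append(i)                # index of a misallocated specimen
    return errors
```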

  25. Classification • A simple alternative is to use a centroid-based classifier. • Assign each new specimen to the species whose consensus sequence it is closest to. • We can match specimens to a consensus sequence based on the 0-1 distance, or • use the position weights of each letter base in the consensus sequence.
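A minimal added sketch of this centroid idea, assuming the same `seqs`/`labels` and a query sequence `q`: each species gets a position-weight profile (base frequencies at every locus), and the specimen is assigned to the species with the highest weighted similarity. Matching against a hard consensus with the 0-1 distance is the special case where each profile is collapsed to its most frequent base.

```python
# Sketch: consensus / position-weight classifier.
from collections import Counter

def position_weights(seqs):
    """Per-locus base frequencies for one species."""
    profiles = []
    for col in zip(*seqs):
        counts = Counter(col)
        n = sum(counts.values())
        profiles.append({b: c / n for b, c in counts.items()})
    return profiles

def profile_score(q, profiles):
    """Weighted similarity: sum over loci of the weight of the observed base."""
    return sum(prof.get(b, 0.0) for b, prof in zip(q, profiles))

def classify_by_consensus(q, seqs, labels):
    scores = {sp: profile_score(q, position_weights([s for s, l in zip(seqs, labels) if l == sp]))
              for sp in set(labels)}
    return max(scores, key=scores.get), scores
```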

  26. Classification • On the Leptasterias data, the consensus-sequence (CS) based classifier makes 2 errors (LOO CV) • “Vote of confidence” = weighted 0/1 distance to CS

  27. Classification • Relative voting (RV) strength illustrates that species 3 and 4 are difficult to separate, and the misallocated specimens are associated with low relative votes • RV = [max(weighted similarity) - (max-1)(weighted similarity)] / (max-1)(weighted similarity), where (max-1) denotes the second-largest weighted similarity
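Reading (max-1) as the second-largest weighted similarity, the relative vote is the gap between the best and runner-up species, scaled by the runner-up. A small added sketch on top of the `scores` dictionary returned by the consensus classifier above:

```python
# Sketch: relative voting (RV) strength from per-species weighted similarities.
# Low RV flags specimens whose assignment is uncertain (e.g. species 3 vs. 4).
def relative_vote(scores):
    ranked = sorted(scores.values(), reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return (best - runner_up) / runner_up
```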

  28. Base calling is not perfect – errors are made and there are programs (e.g. phred) that can analyze the ABI traces and assign confidence measures to each base. An interesting question is – can we obtain similar error rates for species prediction and discovery with smaller sample sizes if quality measures are incorporated into the analysis?
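One hedged way to make this concrete: convert Phred scores to error probabilities (p = 10^(-Q/10)) and down-weight unreliable positions in the 0-1 distance. This is an added illustration of the idea, not a method from the talk; `qa` and `qb` are assumed per-base quality vectors.

```python
# Sketch: quality-weighted mismatch distance. Each position is weighted by the
# probability that both base calls are correct, derived from Phred scores.
def quality_weighted_distance(a, qa, b, qb):
    num = den = 0.0
    for x, qx, y, qy in zip(a, qa, b, qb):
        w = (1 - 10 ** (-qx / 10)) * (1 - 10 ** (-qy / 10))  # P(both calls correct)
        den += w
        num += w * (x != y)
    return num / den if den else float("nan")
```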

  29. Before the Discussion Session • Try out some clustering techniques on the sample data • Number of uncalled bases? • Length of sequences? • Sample sizes – effect on clustering? • Sample sizes – effect on classification?
