1 / 35

Classification of Mitochondrial DNA SNPs into Haplogroups

Classification of Mitochondrial DNA SNPs into Haplogroups. Yuran Li Department of Chemistry and Biochemistry University of Delaware Newark, DE 19717. Carol Wong Department of Bioengineering University of Pennsylvania Philadelphia, PA 19104.

madison
Télécharger la présentation

Classification of Mitochondrial DNA SNPs into Haplogroups

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Classification of Mitochondrial DNA SNPs into Haplogroups Yuran Li Department of Chemistry and Biochemistry University of Delaware Newark, DE 19717 Carol Wong Department of Bioengineering University of Pennsylvania Philadelphia, PA 19104 National Science Foundation – BioGRID REU Fellows Department of Computer Science and Engineering University of Connecticut Storrs, CT 06269

  2. Mitochondrial DNA & Haplogroups • The Genographic Project • 1 Nearest Neighbor • Support Vector Machines • Random Forest • RF PCA • Results • Discussions • Extensions

  3. Mitochondrial DNA • Found in Mitochondria • 2 to 10 copies per Mitochondrion • Hundreds to thousands per cell • Circular • Bacterial origin • Uniparental • Non-Combining • High mutation rate • Maternal Inheritence • Egg vs Sperm & Ubiquitin Marker

  4. Haplogroups | Sequencing • Coding is done at the D-loop • Mutation Hotspots • Hypervariable Region I (HVR-I) • Nucleotides 16024 - 16569 • Each sample is tagged with a haplogroup label representing its genetic content. • Contains similar haplotyes that share a common ancestor based on SNPs. www.ncbi.nlm.nih.gov/bookshelf/br.fcg

  5. SNPs • SNPs (Single nucleotide polymorphisms) • Insertion • Deletion • Transversion • Transition • Heteroplasmy • Variables = SNPs • 545 HVS I SNPs

  6. Haplotypes/Haplogroups • Haplotype – combination of SNPs • Haplotype: HVS-I variants • 21164 samples • Dataset - Coarse- Hg labels • Coding region SNPs/ HVS – I motifs • Considered the ‘gold – standard’

  7. http://learn.genetics.utah.edu/content/health/pharma/snips/

  8. Cladistics • Classification based on shared ancestry • Back Mutations • Homoplasy http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.0030104

  9. Cladistics cont. http://upload.wikimedia.org/wikipedia/en/d/dd/Migration_map4.png

  10. The Genographic Project • The National Geographic Society • Anthropological and Forensic Questions • 78,590 Samples • 21,141 consented database • Hg labeling is done with both HVR-I motifs and the 22-SNP panel results • Utilizes Nearest Neighbor Algorithm (1-NN)

  11. Nearest Neighbor • Pattern recognition | Instance Based Learning • Simple and Power • High accuracy rate with large data sets • Data point is classified by a vote of its k nearest neighbors • Training data is separated in space into regions • Data is classified to the highest number of votes amongst its neighbors.

  12. Support Vector Machines -Training and Testing -Data  Vectors -Model Production -Mapping into higher dimensional plane -Maximum separating margin

  13. Data Processing (SVM) • -Numbering of detailed data • -{x,y,z}  (0,0,1), (0,1,0), (1,0,0) • -Radial Basis Function (RBF) Kernel • -Higher plane mapping • -Simplicity • -Opitmal Parameters: • - Grid Search:

  14. Random Forest • Tree-based classification algorithm • Fortran (computationally oriented programming language) original package • Leo Breiman and Adele Cutler • Ensemble learning algorithm • Implementation through R environment

  15. RF • Voting for classification • Random inputs • Variables • Samples • Ntree single decision trees • Mtry variables • Random sampling • Training set obtained through bootstrap sampling • OOB data/error estimate • Inputted dataset excludes certain cases • 1/3 of input cases left out

  16. Random Forest http://images.google.com/imgres?imgurl=http://proteomics.bioengr.uic.edu/malibu/docs/images/random_forest_thumb.png&imgrefurl=http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html&usg=__oCugzsEOtYKwtBLo2Mi11kcgkcE=&h=240&w=420&sz=101&hl=en&start=1&um=1&tbnid=bGAVW705VPSR9M:&tbnh=71&tbnw=125&prev=/images%3Fq%3DrandomForest%26hl%3Den%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26sa%3DN%26um%3D1 http://images.google.com/imgres?imgurl=http://cg.scs.carleton.ca/~luc/bst.gif&imgrefurl=http://cg.scs.carleton.ca/~luc/trees.html&usg=__gYANVMgGa_H8CUhJZApOczZD5Xs=&h=447&w=548&sz=21&hl=en&start=3&um=1&tbnid=lsdiIpjqYFENXM:&tbnh=108&tbnw=133&prev=/images%3Fq%3Drandom%2BForest%26hl%3Den%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26sa%3DN%26um%3D1

  17. 5F - Cross Validation • Test out predictive model • Divide into 5 subsets (5-fold) • Training set • Test set • ‘unseen’ data • Five fold cross validation • Training set/ testing set • Random Forest with training set • Testing set

  18. 5F CV https://esus.genome.tugraz.at/ProClassify/help/contents/pages/images/xv_folds.gif

  19. RF model • Genotyped mtDNA • Dataset • 545 SNPs in HVS – I • HVS-I haplotypes • 21164 samples • Hg classification from similar haplotypes • SNPs dictate Hg classifications • SNPs = variables • Coarse – Hg classifications in dataset • ‘gold standard’

  20. Model - Optimal mtryand ntree values for entire dataset • Pair of parameters with lowest OOB error (training set) • Mtry SNPs used to construct each tree • Ntree decision trees constructed • 5 fold Cross validation • Random forest on training set • Training set : random sampling with replacement • Bootstrap sampling • random sampling with replacement of cases • OOB data • Model : random forest object outputted • Apply random forest model on test set • Output = predicted Hg classifications • Compare back to ‘observed’ Hg classifications

  21. R environment • Bill Venables and David M. Smith • Primary programming language: ‘S’ (statistical) • Coherent system integrating data manipulation, calculation and graphical display

  22. R environment

  23. PCA • PCA = Principal Component Analysis • Feature Selection tool • Which variables more informative than others? • Confusing dataset • Too many variables – 545

  24. PCA • Reexpress dataset in another basis, the principal components (PCs) • Change of basis • Possible dimensional reduction • Reveal hidden structure, underlying relationships • Which basis best represents the dynamics of interest? • Maximize variance • Minimize covariance (redundancy) • Find PCs – new basis vectors

  25. PCA on dataset • Eigendecomposition on Cx • Original dataset = X • CX. = 1/(number of samples) *XXT • Transformed dataset = Y • Y = PX • P = orthonormal matrix • Rows = principal components of X • Rows = eigenvectors of Cx • CY = diagonal covariance matrix of transformed X, Y. • Diagonal entries = eigenvalues = variances

  26. PCA • Eigenvalues represent variance • Rank order PCs= eigenvectors of original covariance matrix(new variables) by corresponding eigenvalues • Subselection of k new variables (PCs) from available pool • K = 64, 100, 200, 300, 400, 545 • Select the first k rank ordered PCs for input into RF • Transformed dataset = 21164 by k dimensions

  27. Results

  28. SVM Findings Macro: 88.06% Micro: 96.59%

  29. Comparison

  30. Discussion • Unbalanced dataset • Underrepresented haplogroups • Overrepresented haplogroups • Possibility: change weights/ coarser Hgs • Bootstrap sampling in RF • Cross validation • RF vs. SVM

  31. Graph

  32. Coarse Hg accuracy rates

  33. Conclusions • RF: ? • Random sampling of variables • Random sampling of training cases (samples) • Repeated trials • SVM vs. 1 – NN • Deterministic models • SVM (5FCV) > 1-NN (5FCV)

  34. Acknowledgements • Advisor: Chih Lee • Dr. Chun-Hsi Huang • National Science Foundation REU grant CCF-0755373 • Univ. of Connecticut

  35. Thank you! • Any questions?

More Related