Inferring Ethnicity from Mitochondrial DNA Sequence

Chih Lee (1), Ion Mandoiu (1) and Craig E. Nelson (2). chih.lee@uconn.edu, ion@engr.uconn.edu, craig.nelson@uconn.edu. (1) Department of Computer Science and Engineering; (2) Department of Molecular and Cell Biology, University of Connecticut.

Presentation Transcript


  1. Inferring Ethnicity from Mitochondrial DNA Sequence • Chih Lee (1), Ion Mandoiu (1) and Craig E. Nelson (2) • chih.lee@uconn.edu, ion@engr.uconn.edu, craig.nelson@uconn.edu • (1) Department of Computer Science and Engineering • (2) Department of Molecular and Cell Biology • University of Connecticut

  2. Outline • Introduction • Methods • Results and Discussions • Conclusions

  3. Outline • Introduction • Methods • Results and Discussions • Conclusions

  4. Ethnicity in Forensics • Ethnicity information assists forensic investigators. • Investigator-assigned ethnicity: based on genetic and non-genetic markers. • Genetic information improves inference accuracy when access to the most informative markers (e.g. skin/hair) is limited. • Autosomal markers: • Excellent accuracy assigning samples to clades [Phi07, Shr97] • May not survive degradation

  5. Mitochondrial DNA • Circular • 16,569 bp • Maternally inherited • High copy number: recoverable from degraded samples • Coding region • SNPs define haplogroups [Beh07] • Hypervariable region

  6. Hypervariable Region • High mutation rate compared to the coding region • Haplogroup inference [Beh07] • 23 groups • 96.7% accuracy rate with 1NN • Geographic origin inference [Ege04] • SE Africa, Germany and Iceland • 66.8% accuracy rate with PCA-QDA

  7. Ethnicity Inference from HVR • The problem: • Given a set of HVR sequences tagged with ethnicities • Predict the ethnicities of new HVR sequences • A classification problem • Our contribution: • Assess the performance of 4 classification algorithms: SVM, LDA, QDA and 1NN.

  8. Outline • Introduction • Methods • Results and Discussions • Conclusions

  9. Encoding HVR • Align to rCRS (revised Cambridge Reference Sequence) → SNP profile • A SNP → a binary variable • Missing data (untyped regions): • Assume rCRS • Use mutation probability • Common region
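The encoding step on this slide can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `encode_snp_profile`, the `(position, base)` input format and the example variants are all hypothetical. Each difference from rCRS becomes one binary feature.

```python
def encode_snp_profile(profile, features):
    """Return a 0/1 vector: 1 if the sample carries the variant, else 0.

    profile  -- set of (position, base) differences from rCRS
                (hypothetical input format, assumed for illustration)
    features -- ordered list of all variants seen in the training data
    """
    return [1 if variant in profile else 0 for variant in features]


# Hypothetical variants pooled from a training set (HVR1 positions).
features = [(16126, "C"), (16223, "T"), (16311, "C"), (16362, "C")]

# Hypothetical sample carrying two of the four variants.
sample = {(16223, "T"), (16362, "C")}

vector = encode_snp_profile(sample, features)
print(vector)
```

The resulting binary vectors are the inputs assumed by the classifier sketches elsewhere in this transcript.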

  10. Support Vector Machines • Binary classification algorithm • Map instances to a high-dimensional space (the feature space) • Optimal separating hyperplane with maximum margin • Kernel function k(x1, x2): similarity between x1 and x2 in the feature space • Radial basis kernel: exp(-γ ||x1 - x2||^2) • Software: LIBSVM [Cha01]
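As a small worked example of the radial basis kernel named on this slide, the sketch below evaluates exp(-γ ||x1 - x2||^2) on two binary SNP vectors. The function name `rbf_kernel`, the vectors and the value of γ are illustrative assumptions. This is not LIBSVM code.

```python
import math


def rbf_kernel(x1, x2, gamma):
    """Radial basis kernel: exp(-gamma * ||x1 - x2||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


# Hypothetical binary SNP vectors that differ at two positions.
a = [0, 1, 0, 1]
b = [1, 1, 0, 0]

same = rbf_kernel(a, a, gamma=0.5)       # identical profiles give 1.0
different = rbf_kernel(a, b, gamma=0.5)  # squared distance 2, so exp(-1)
print(same, different)
```

For 0/1 vectors the squared Euclidean distance equals the number of differing positions, so the kernel decays with the Hamming distance between two SNP profiles.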

  11. Linear/Quadratic Discriminant Analysis • Find argmax_g P(G=g|X=x) • Assumptions: • X|G=g ~ N_p(μ_g, Σ_g) • P(G=g) is equal for all g • Hence P(G=g|X=x) is proportional to P(X=x|G=g) • μ_g and Σ_g are estimated from the training data • LDA: common dispersion matrix Σ_g = Σ for all g
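Under the assumptions listed on this slide (Gaussian class-conditional densities and equal priors), the decision rule can be written out by taking the log of the density and dropping terms that do not depend on g. The symbol δ_g for the discriminant score is notation introduced here, not taken from the slides.

```latex
% QDA: class-specific covariance \Sigma_g
\hat{g}(x) = \arg\max_g \delta_g(x), \qquad
\delta_g(x) = -\tfrac{1}{2}\log\lvert\Sigma_g\rvert
              - \tfrac{1}{2}(x-\mu_g)^{\top}\Sigma_g^{-1}(x-\mu_g)

% LDA: common covariance \Sigma_g = \Sigma, so the quadratic term in x cancels
\delta_g(x) = x^{\top}\Sigma^{-1}\mu_g - \tfrac{1}{2}\mu_g^{\top}\Sigma^{-1}\mu_g
```

The LDA score is linear in x, while the QDA score keeps the quadratic term because each group has its own Σ_g.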

  12. 1-Nearest Neighbor • Assign a new sample to the dominating ethnicity among the nearest samples in the training data • Distance measure: the Hamming distance • Used by Behar et al. (2007) for haplogroup assignment
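A minimal sketch of the rule on this slide, assuming binary SNP vectors as input: find the smallest Hamming distance to any training sample, then take the most common label among the training samples at that distance. The function names, the placeholder labels "A" and "B" and the data are hypothetical.

```python
from collections import Counter


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def predict_1nn(train, query):
    """train is a list of (vector, label) pairs; returns the predicted label."""
    distances = [(hamming(vector, query), label) for vector, label in train]
    nearest = min(d for d, _ in distances)
    # Several training samples can share the minimum distance, so vote.
    votes = Counter(label for d, label in distances if d == nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data with placeholder labels.
train = [
    ([0, 1, 0, 1], "A"),
    ([0, 1, 1, 1], "A"),
    ([1, 0, 0, 0], "B"),
    ([1, 1, 0, 0], "B"),
]
prediction = predict_1nn(train, [0, 1, 1, 0])
print(prediction)
```

If the vote itself is tied, `most_common` returns the label encountered first, which is an arbitrary choice made for this sketch.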

  13. Principal Component Analysis • A dimension reduction technique • Used in conjunction with SVM, LDA and QDA • Denoted as: PCA-SVM, PCA-LDA and PCA-QDA

  14. Outline • Introduction • Methods • Results and Discussions • Conclusions

  15. The FBI mtDNA Population Database • Two tables: • forensic: typed by FBI • published: collected from literature • Retain only Caucasian, African, Asian and Hispanic samples

  16. Data Coverage and Subsets • Variable sequence lengths • Trimmed forensic dataset (4,426): 16024-16365 • Trimmed published dataset (1,904): 16024-16365 • Full-length forensic dataset (2,540): 16024-16569, 1-576

  17. 5-fold Cross-Validation (trimmed forensic) • Macro-accuracy: average of ethnicity-wise accuracy rates • Micro-accuracy: ethnicity-wise accuracy rates weighted by number of samples • More accurate than Egeland et al. (2004) • Matches human experts relying on skull and large-bone measurements [Dib83, isc83]
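The two accuracy measures defined on this slide can be computed as in the sketch below. The function name `macro_micro_accuracy` and the toy labels are hypothetical. The example is unbalanced on purpose, to show how the two measures diverge.

```python
from collections import defaultdict


def macro_micro_accuracy(true_labels, predicted_labels):
    """Return (macro, micro) accuracy over the classes in true_labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    per_class = {g: correct[g] / total[g] for g in total}
    # Macro: every class counts equally, whatever its size.
    macro = sum(per_class.values()) / len(per_class)
    # Micro: per-class rates weighted by sample count, i.e. overall accuracy.
    micro = sum(correct.values()) / sum(total.values())
    return macro, micro


# Hypothetical unbalanced test set: 8 samples of class "A", 2 of class "B".
true = ["A"] * 8 + ["B"] * 2
pred = ["A"] * 8 + ["A", "B"]
macro, micro = macro_micro_accuracy(true, pred)
print(macro, micro)
```

Here class "A" is classified perfectly and class "B" half the time, so the macro-accuracy is 0.75 while the micro-accuracy is 0.9. The minority class pulls the macro figure down.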

  18. Seq. Region Effect on Accuracy [Figure: full-length forensic dataset; panel labels 100%, 90%, 80%] • Different primers result in different coverage. • PCA-LDA outperforms 1NN on long sequences. • PCA-SVM is consistently the best.

  19. Seq. Region Effect on Accuracy [Figure: full-length forensic dataset; panel labels 80%, 100%, 90%] • HVR 2 contains less information. • PCA-SVM is consistently the best.

  20. Twenty 10% Windows [Figure: accuracy by 10% window] • Accuracy varies with region. • PCA-SVM remains the best. • 1NN is as good as PCA-SVM for short regions.

  21. Independent Validation (1/2) • Training data: trimmed forensic dataset • Test data: trimmed published dataset • PCA-SVM • No Hispanic samples in the test data but samples can be mis-classified as Hispanic • Asian: ~17% lower than CV

  22. Independent Validation (2/2) • Composition of the Asian samples in the training data: • China (356 profiles), Japan (163), Korea (182), Pakistan (8), and Thailand (52) • Strong bias towards East Asia • 145 misclassified Asian samples in the test data: • 10 samples of unknown country of origin • 90 samples from Kazakhstan and Kyrgyzstan • Both countries have significant Russian populations. • Evidence of admixture with Caucasians.

  23. Handling Missing Data • Mimic real-world scenario • Training: forensic dataset • Test: published dataset • rCRS and Probability are biased toward Caucasian. • Common Region is the best overall.
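The three options compared on this slide can be sketched as follows. The slides do not spell out the details, so this is one plausible reading and not the authors' implementation: "rCRS" fills untyped features with 0 (no variant), "Probability" fills them with a per-feature variant frequency from the training data, and "Common Region" keeps only positions covered by both datasets. All names and numbers are hypothetical, and ranges are treated as single contiguous intervals for simplicity.

```python
def fill_untyped(vector, typed, strategy, variant_freq=None):
    """Fill features that fall in regions that were not sequenced.

    vector       -- 0/1 value per feature (values at untyped features are ignored)
    typed        -- True where the feature's position was sequenced
    strategy     -- "rcrs" or "probability"
    variant_freq -- per-feature variant frequency in the training data
                    (assumed stand-in for the slide's "mutation probability")
    """
    if strategy == "rcrs":
        # Assume the untyped position matches the reference: no variant.
        return [v if t else 0 for v, t in zip(vector, typed)]
    if strategy == "probability":
        return [v if t else f for v, t, f in zip(vector, typed, variant_freq)]
    raise ValueError(f"unknown strategy: {strategy}")


def common_region(positions, train_range, test_range):
    """Keep positions covered by both inclusive (start, end) ranges."""
    start = max(train_range[0], test_range[0])
    end = min(train_range[1], test_range[1])
    return [p for p in positions if start <= p <= end]


vector = [1, 0, 0]
typed = [True, True, False]
as_rcrs = fill_untyped(vector, typed, "rcrs")
as_prob = fill_untyped(vector, typed, "probability", variant_freq=[0.3, 0.1, 0.2])
shared = common_region([16050, 16223, 16400], (16024, 16569), (16024, 16365))
print(as_rcrs, as_prob, shared)
```

Under this reading, the first two strategies invent values for untyped positions, whereas the common-region strategy discards features instead. That is consistent with the slide's remark that the first two are biased.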

  24. Posterior Probability Calibration • PCA-SVM on published dataset with “Common Region” • Accuracy rates are slightly higher than the estimated posterior probabilities.
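One common way to check calibration of this kind is to bin predictions by their estimated posterior probability and compare each bin's mean posterior with its observed accuracy. The sketch below assumes that approach, since the slides do not state how the comparison was made. The function name `calibration_table`, the bin count and the data are hypothetical.

```python
def calibration_table(posteriors, correct, n_bins=5):
    """Return (mean posterior, observed accuracy, count) for each non-empty bin.

    posteriors -- estimated probability of the predicted class, per sample
    correct    -- 1 if the prediction was right, 0 otherwise
    """
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(posteriors, correct):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, c))
    rows = []
    for items in bins:
        if items:
            mean_posterior = sum(p for p, _ in items) / len(items)
            accuracy = sum(c for _, c in items) / len(items)
            rows.append((mean_posterior, accuracy, len(items)))
    return rows


# Hypothetical predictions: two low-confidence, four high-confidence.
posteriors = [0.55, 0.58, 0.90, 0.95, 0.92, 0.97]
correct = [1, 0, 1, 1, 1, 1]
rows = calibration_table(posteriors, correct)
for mean_posterior, accuracy, count in rows:
    print(f"{mean_posterior:.3f} {accuracy:.3f} {count}")
```

A bin whose accuracy exceeds its mean posterior indicates under-confident estimates, which is the direction reported on the slide.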

  25. Conclusions • SVM is the most accurate algorithm among those investigated, outperforming: • Discriminant analysis as employed by Egeland et al. (2004) • 1NN, similar to that used by Behar et al. (2007) • Overall accuracy of 80%-90% in CV and independent testing • Matches the accuracy of human experts relying on measurements of skull and large bones [Dib83, isc83] • Approaches the accuracy obtained using ~60 autosomal loci [Bam04]

  26. Questions? • Thank you for your attention.
