Dr. Robertas Damaševičius Software Engineering Department, Kaunas University of Technology

Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel. Dr. Robertas Damaševičius, Software Engineering Department, Kaunas University of Technology, Studentų 50-415, Kaunas, Lithuania, robertas.damasevicius@ktu.lt.

Presentation Transcript


  1. Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel
     Dr. Robertas Damaševičius, Software Engineering Department, Kaunas University of Technology, Studentų 50-415, Kaunas, Lithuania, robertas.damasevicius@ktu.lt

  2. What is splicing?
     • Splicing: modification of genetic information after transcription, in which introns are removed and exons are joined
     • Splice junctions: boundary points between exons and introns where splicing occurs
     • Donor: upstream (5') end of the intron, with the conserved dinucleotide GT
     • Acceptor: downstream (3') end of the intron, with the conserved dinucleotide AG
     • Pseudo splice sites
     Int. Workshop on Intelligent Informatics in Biology and Medicine (IIBM’2008), March 4-7, 2008, Barcelona, Spain

  3. Problem
     • Splice-junction site recognition
       • Important for successful gene prediction
       • Study of genetic diseases
       • Understanding of genetic mechanisms
     • Difficulties
       • Noisy data
       • Pseudo splice sites
       • Non-canonical splice sites (intron does not follow the GT...AG pattern)
       • Alternative splicing
       • Multitude of consensus sequences
     • Machine learning: Support Vector Machine (SVM)
     • Feature space mapping for SVM
       • Which frequency-based feature mapping is the best?

  4. Support Vector Machine (SVM)
     The decision function (formula not preserved in this transcript) is defined in terms of the training data vectors, the unknown data vectors, the target space, and the kernel function.
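
For reference, the standard kernel SVM decision function is sketched below. This is the textbook form, not necessarily the notation used on the slide: the symbols $\mathbf{x}_i$ (training vectors), $\mathbf{x}$ (unknown vector), $y_i \in \{-1, +1\}$ (target labels), $\alpha_i$ (learned coefficients), $b$ (bias) and $K$ (kernel function) are assumptions introduced here.

```latex
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right)
```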

  5. What factors influence the quality of classification?
     • Training data
       • Size of dataset, generation of negative examples, imbalanced datasets
     • Mapping of data into feature space
       • Orthogonal, single nucleotide, nucleotide grouping, ...
     • Selection of an optimal kernel function
       • Linear, polynomial, RBF, sigmoid
     • Kernel parameters
     • SVM learning parameters
       • Regularization parameter, cost factor

  6. SVM feature space
     • Feature space: multidimensional space of vectors representing data instances
     • Mapping of data into features: aims at better classification accuracy
     • Feature space construction:
       • nucleotide position-dependent
       • nucleotide position-independent
       • both nucleotide position-dependent and -independent information
     • Feature mapping rule: maps a DNA sequence of length N to a feature vector of length M

  7. K-mers
     • K-mer: a k-base long sequence (k-tuple) of DNA
     • K-mer feature vector: constructed using the frequency (or probability) of each k-mer in a DNA sequence
     • Notation: Σ – alphabet, N – length of a DNA sequence, k – length of k-mer, n_j – number of occurrences of the j-th k-mer in a DNA sequence
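
The k-mer feature vector described above can be sketched in Python as follows. The slide's exact formula is not preserved, so two choices here are assumptions: k-mers are counted over overlapping windows, and each count n_j is normalised by the number of windows, N - k + 1. The function name `kmer_frequencies` is hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return the relative frequency of every possible k-mer over `alphabet`.

    The result has len(alphabet) ** k entries, ordered lexicographically
    with respect to the order of symbols in `alphabet`.
    Assumption: overlapping windows, normalised by N - k + 1.
    """
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than k")

    # Enumerate all |alphabet|^k k-mers so the vector length is fixed.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    # Slide a window of length k over the sequence; windows containing
    # symbols outside the alphabet (e.g. 'N') are skipped.
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1

    return [counts[kmer] / windows for kmer in kmers]


freqs = kmer_frequencies("ACGT", 2)
```

For the 4-letter alphabet and k = 2 this yields a 16-element vector; for the sequence `"ACGT"` only the entries for AC, CG and GT are non-zero.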

  8. K-mer frequency mapping rules
     • 4-letter (ACGT): Σ = {A, C, G, T}, ||Σ|| = 4
       • Disadvantage: feature space growth ~ 4^k
     • Nucleotide grouping based: SW, KM & RY
     • SW: Σ = {S, W}, ||Σ|| = 2
       • Strong (C, G) nucleotides – 3 H bonds
       • Weak (A, T) nucleotides – 2 H bonds
     • RY: Σ = {R, Y}, ||Σ|| = 2
       • A and G – purines (R)
       • C and T – pyrimidines (Y)
     • KM: Σ = {K, M}, ||Σ|| = 2
       • A and C – amines (M)
       • G and T – ketones (K)
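
The three nucleotide groupings above amount to recoding each base into a 2-letter alphabet before counting k-mers. A minimal sketch, assuming upper- or lower-case A/C/G/T input; the names `GROUPINGS` and `recode` are hypothetical:

```python
# Grouping tables taken from the slide: each rule maps the four bases
# onto a two-symbol alphabet.
GROUPINGS = {
    "SW": {"C": "S", "G": "S", "A": "W", "T": "W"},  # strong / weak
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},  # purine / pyrimidine
    "KM": {"G": "K", "T": "K", "A": "M", "C": "M"},  # keto / amino
}


def recode(seq, rule):
    """Recode a DNA sequence into the 2-letter alphabet of `rule`."""
    table = str.maketrans(GROUPINGS[rule])
    # Characters not in the table (e.g. 'N') are left unchanged.
    return seq.upper().translate(table)


recoded = {rule: recode("ACGT", rule) for rule in GROUPINGS}
```

K-mer frequencies can then be computed on the recoded string over the reduced alphabet, which gives 2^k features instead of 4^k.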

  9. Example: 2-mer frequency mapping
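
The slide's own example is not preserved in this transcript. As an illustrative stand-in, the sketch below counts overlapping 2-mers for a made-up 8-base sequence (`AGGTAAGT`, chosen here only for illustration) under the ACGT and RY alphabets. Dividing each count by the number of windows (7) gives the frequencies.

```python
from collections import Counter

# RY recoding table: purines (A, G) -> R, pyrimidines (C, T) -> Y.
RY = str.maketrans("AGCT", "RRYY")


def two_mer_counts(seq):
    """Count overlapping 2-mers in `seq`."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


seq = "AGGTAAGT"                     # hypothetical example sequence
ry_seq = seq.translate(RY)           # recoded into the RY alphabet

acgt_counts = two_mer_counts(seq)    # up to 16 distinct 2-mers
ry_counts = two_mer_counts(ry_seq)   # up to 4 distinct 2-mers

for name, counts in (("ACGT", acgt_counts), ("RY", ry_counts)):
    total = sum(counts.values())
    print(name, {kmer: f"{n}/{total}" for kmer, n in sorted(counts.items())})
```

The ACGT mapping spreads the seven windows over 16 possible features, while the RY mapping concentrates them in 4.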

  10. Case study
      • Dataset: UCI repository, GenBank 64.1 primate data
        • 3175 sequences, each spanning (-30 bp, +30 bp) around the splice site
      • Three splice site recognition sub-problems:
        • Exon/Intron (EI) vs. Negative (N)
        • Intron/Exon (IE) vs. Negative (N)
        • Exon/Intron (EI) vs. Intron/Exon (IE)
      • Three datasets:
        • EI vs. N: 767 EI and 1655 N
        • IE vs. N: 768 IE and 1655 N
        • EI vs. IE: 767 EI and 768 IE
      • Power series kernel
      • Accuracy evaluation metric: F-measure

  11. Classification results: Exon/Intron vs. Negative

  12. Classification results: Intron/Exon vs. Negative

  13. Classification results: Intron/Exon vs. Exon/Intron

  14. Classification time

  15. Feature vector size
      Intron/exon splice sites, 2422 sequences

  16. Evaluation of results
      • Classification accuracy:
        • Exon/Intron vs. N – 4-mer ACGT frequency mapping (78.05%)
        • Intron/Exon vs. N – 6-mer ACGT frequency mapping (70.75%)
        • E/I vs. I/E – 6-mer ACGT frequency mapping (90.59%)
        • 4-mers and 6-mers better than 5-mers
        • RY always better than SW or KM
      • Feature space size:
        • ACGT k-mer: 4^k
        • SW, RY, KM k-mer: 2^k
      • Classification speed:
        • SW/KM/RY k-mer frequency based classification can be ~2 times faster than ACGT k-mer classification
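
To make the feature space sizes concrete, the short sketch below tabulates 4^k against 2^k for k = 1 to 6. The helper name `feature_space_size` is hypothetical, and the range of k is chosen only to cover the 4-, 5- and 6-mers mentioned above.

```python
def feature_space_size(alphabet_size, k):
    """Number of distinct k-mers (feature vector length) for an alphabet."""
    return alphabet_size ** k


# Map k -> (size for the 4-letter ACGT alphabet, size for a 2-letter grouping).
sizes = {k: (feature_space_size(4, k), feature_space_size(2, k))
         for k in range(1, 7)}

for k, (acgt, grouped) in sizes.items():
    print(f"k={k}: ACGT={acgt:5d}  SW/RY/KM={grouped:3d}")
```

At k = 6 the ACGT mapping needs 4096 features per sequence, whereas a 2-letter grouping needs 64.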

  17. Why is RY better than SW or KM?
      • The acceptor consensus sequence has long runs of pyrimidines (Y)
      • The RY mapping encodes such a run as a single repeated symbol (Y), whereas SW and KM place C and T in different groups

  18. Conclusions
      • Selection of the appropriate feature mapping rule can greatly influence DNA sequence classification results
      • Anomalies in consensus sequences (such as long runs) can be exploited for better classification results when selecting mapping rules
      • For a trade-off between classification accuracy and speed, RY k-mer frequency based mapping can be used instead of 4-letter k-mer frequency mapping
      • Open research problem: “forbidden” k-mers

  19. Questions?

  20. SVM kernel function optimization
      • Introduction of additional kernel parameters
      • Introduction of new kernels
      • Power series kernel function
      • Advantage:
        • more parameters for optimization
        • better separation of classes in feature space
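
The transcript does not preserve the formula of the power series kernel, so the author's definition is unknown here. Purely as an illustration of the idea of "more parameters for optimization", the sketch below assumes one plausible form: a truncated power series in the dot product, K(x, y) = a_0 + a_1 (x·y) + a_2 (x·y)^2 + ..., with non-negative coefficients a_n as the tunable parameters. Non-negative coefficients keep such a polynomial of the linear kernel positive semidefinite. The function name and signature are hypothetical.

```python
def power_series_kernel(x, y, coeffs):
    """Truncated power series in the dot product of x and y.

    ASSUMPTION: this form is illustrative only and is not taken from the
    slides. coeffs[n] is the coefficient a_n of (x . y) ** n; all
    coefficients should be non-negative for a valid kernel.
    """
    if any(a < 0 for a in coeffs):
        raise ValueError("coefficients must be non-negative")

    dot = sum(xi * yi for xi, yi in zip(x, y))

    # Horner's scheme: evaluate a_0 + a_1*dot + a_2*dot^2 + ...
    value = 0.0
    for a_n in reversed(coeffs):
        value = value * dot + a_n
    return value


k_value = power_series_kernel([1, 2], [3, 4], [1, 1, 1])
```

With `coeffs = [0, 1]` this reduces to the linear kernel, which shows how the extra coefficients generalise the familiar linear and polynomial kernels.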

  21. SW k-mer frequency mapping rule
      • SW ({A,T} vs. {C,G}) mapping rule
        • reflects the difference in the number of hydrogen bonds in the DNA molecule
          • Strong (C, G) nucleotides – 3 H bonds
          • Weak (A, T) nucleotides – 2 H bonds
        • related to physical-chemical properties of DNA
          • transport of electrons
          • mechanical waves along the DNA helix

  22. RY k-mer frequency mapping rule
      • The RY mapping rule ({A, G} vs. {C, T})
        • describes how purines (R) and pyrimidines (Y) are distributed along the DNA sequence
          • A and G – purines (R)
          • C and T – pyrimidines (Y)
        • corresponds to the chemical composition bias in the DNA strand

  23. KM k-mer mapping rule
      • The KM mapping rule ({A,C} vs. {G,T})
        • describes how ketones (K) and amines (M) are distributed along the DNA sequence
          • A and C – amines (M)
          • G and T – ketones (K)

  24. Classification metric
      • F-measure
      • Advantage:
        • One measure that takes into account both recall and precision: a spectacular score in one does not compensate for a bad score in the other
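
The F-measure's formula is not preserved in the transcript. The sketch below uses the standard definition, F = (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall. The default β = 1 (the harmonic mean of precision and recall) is an assumption, since the slides do not state which β was used, and the function name is hypothetical.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives.

    ASSUMPTION: beta defaults to 1 (balanced F1); the slides do not say
    which weighting was used.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


balanced = f_measure(tp=8, fp=2, fn=2)     # precision 0.8, recall 0.8
lopsided = f_measure(tp=10, fp=0, fn=90)   # precision 1.0, recall 0.1
```

In the second call a perfect precision of 1.0 cannot offset a recall of 0.1: the F-measure stays close to the weaker of the two scores, which is the property the slide highlights.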
