Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel
Dr. Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų 50-415, Kaunas, Lithuania
robertas.damasevicius@ktu.lt

What is splicing?
• Splicing: modification of genetic information after transcription, in which introns are removed and exons are joined
• Splice junctions: boundary points between exons and introns where splicing occurs
• Donor: upstream part of intron, conserved dinucleotide GT
• Acceptor: downstream part of intron, conserved dinucleotide AG
• Pseudo splice sites
Int. Workshop on Intelligent Informatics in Biology and Medicine (IIBM’2008), March 4-7, 2008, Barcelona, Spain

Problem
• Splice-junction site recognition
  • Important for successful gene prediction
  • Study of genetic diseases
  • Understanding of genetic mechanisms
• Difficulties
  • Noisy data
  • Pseudo splice sites
  • Non-canonical splice sites (intron is not GT...AG)
  • Alternative splicing
  • Multitude of consensus sequences
• Machine Learning: Support Vector Machine (SVM)
  • Feature space mapping for SVM
  • Which frequency-based feature mapping is the best?

Support Vector Machine (SVM)
An SVM is defined over training data vectors, unknown data vectors to be classified, a target space of class labels, and a kernel function.
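For orientation, the textbook form of the kernel SVM decision function is given below. The symbols are assumed standard notation rather than taken from the slide: x_i are the training data vectors, x is an unknown data vector, y_i ∈ {-1, +1} are the targets, K is the kernel function, and the weights α_i and bias b are found during training.

```latex
f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right)
```

Only training vectors with α_i > 0 (the support vectors) contribute to the sum.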

What factors influence the quality of classification?
• Training data
  • size of dataset, generation of negative examples, imbalanced datasets
• Mapping of data into feature space
  • Orthogonal, single nucleotide, nucleotide grouping, ...
• Selection of an optimal kernel function
  • linear, polynomial, RBF, sigmoid
• Kernel parameters
• SVM learning parameters
  • Regularization parameter, Cost factor

SVM feature space
• Feature space: multidimensional vector space representing data instances
• Mapping of data into features: achieving better classification accuracy
• Feature space construction:
  • nucleotide position-dependent
  • nucleotide position-independent
  • both nucleotide position-dependent and -independent information
• Feature mapping rule:
  • N – the length of a DNA sequence, M – the length of the feature vector

K-mers
• K-mer: a k-base long sequence (k-tuple) of DNA
• K-mer feature vector: constructed using a frequency (or probability) of each k-mer in a DNA sequence
• Σ – alphabet, N – length of a DNA sequence, k – length of k-mer, n_j – number of occurrences of the j-th k-mer in a DNA sequence
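A minimal sketch of a k-mer frequency vector follows. It assumes overlapping windows and normalization by the number of windows, N - k + 1, which is one common convention and not necessarily the one used in this work. The function name `kmer_frequencies` is hypothetical.

```python
from itertools import product


def kmer_frequencies(sequence, k, alphabet="ACGT"):
    """Return the k-mer frequency vector of `sequence` over `alphabet`.

    The vector has len(alphabet)**k entries, ordered like
    itertools.product (AA, AC, AG, AT, CA, ... for k = 2).
    Assumption: each count n_j is divided by the number of
    overlapping windows, N - k + 1.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(sequence) - k + 1
    if windows <= 0:
        return [0.0] * len(kmers)
    for i in range(windows):
        kmer = sequence[i:i + k]
        if kmer in counts:  # ignore windows containing other symbols, e.g. N
            counts[kmer] += 1
    return [counts[kmer] / windows for kmer in kmers]


vector = kmer_frequencies("ACGT", 2)
```

For the sequence `ACGT` and k = 2 there are three windows (AC, CG, GT), so three of the 16 entries are 1/3 and the rest are zero.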

K-mer frequency mapping rules
• 4-letter (ACGT): Σ = {A, C, G, T}, ||Σ|| = 4
  • Disadvantage: feature space growth ~ 4^k
• Nucleotide grouping based: SW, KM & RY
  • SW: Σ = {S, W}, ||Σ|| = 2
    • Strong (C, G) nucleotides – 3 H bonds
    • Weak (A, T) nucleotides – 2 H bonds
  • RY: Σ = {R, Y}, ||Σ|| = 2
    • A and G – purines (R)
    • C and T – pyrimidines (Y)
  • KM: Σ = {K, M}, ||Σ|| = 2
    • A and C – amines (M)
    • G and T – ketones (K)
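The grouping rules above amount to a per-nucleotide translation into a two-letter alphabet, after which k-mers are counted as usual. A small sketch is shown below. The names `GROUPINGS`, `regroup` and `feature_space_size` are hypothetical helpers introduced for illustration.

```python
# Translation tables for the three two-letter groupings.
GROUPINGS = {
    "SW": str.maketrans("ACGT", "WSSW"),  # C, G -> S (strong); A, T -> W (weak)
    "RY": str.maketrans("ACGT", "RYRY"),  # A, G -> R (purine); C, T -> Y (pyrimidine)
    "KM": str.maketrans("ACGT", "MMKK"),  # A, C -> M (amine); G, T -> K (ketone)
}


def regroup(sequence, rule):
    """Rewrite an ACGT sequence in the two-letter alphabet of `rule`."""
    return sequence.translate(GROUPINGS[rule])


def feature_space_size(alphabet_size, k):
    """Number of distinct k-mers, i.e. the k-mer feature vector length."""
    return alphabet_size ** k


print(regroup("GATTACA", "RY"))
print(feature_space_size(4, 6), feature_space_size(2, 6))
```

For k = 6 the four-letter alphabet yields 4096 features while any two-letter grouping yields 64, which is the growth disadvantage noted for the ACGT mapping.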

Example: 2-mer frequency mapping

Case study
• Dataset: UCI repository, Genbank 64.1 primate data
  • 3175 sequences, each (-30 bp, +30 bp) with regard to splice site
• Three splice site recognition sub-problems:
  • Exon/Intron (EI) vs. Negative (N)
  • Intron/Exon (IE) vs. Negative (N)
  • Exon/Intron (EI) vs. Intron/Exon (IE)
• Three datasets:
  • EI vs. N: 767 EI and 1655 N
  • IE vs. N: 768 IE and 1655 N
  • EI vs. IE: 767 EI and 768 IE
• Power series kernel
• Accuracy evaluation metric: F-measure

Classification results: Exon/Intron vs. Negative

Classification results: Intron/Exon vs. Negative

Classification results: Intron/Exon vs. Exon/Intron

Classification time

Feature vector size
Intron/exon splice sites, 2422 sequences

Evaluation of results
• Classification accuracy:
  • Exon/Intron vs. N. – 4-mer ACGT frequency mapping (78.05%)
  • Intron/Exon vs. N. – 6-mer ACGT frequency mapping (70.75%)
  • E/I vs. I/E – 6-mer ACGT frequency mapping (90.59%)
  • 4-mers and 6-mers better than 5-mers
  • RY always better than SW or KM
• Feature space size:
  • ACGT k-mer: 4^k
  • SW, RY, KM k-mer: 2^k
• Classification speed:
  • SW/KM/RY k-mer frequency based classification can be ~2 times faster than ACGT k-mer classification

Why is RY better than SW or KM?
• Acceptor consensus sequence has long runs of pyrimidines (Y)
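The effect of a pyrimidine run can be illustrated on a toy sequence. In the sketch below, `tract` is a hypothetical acceptor-like fragment made up for this example, not a sequence from the dataset. Under RY the run collapses into a single dominant 2-mer (YY), whereas under SW the same fragment spreads over several 2-mers.

```python
from collections import Counter

RY = str.maketrans("ACGT", "RYRY")  # purines -> R, pyrimidines -> Y
SW = str.maketrans("ACGT", "WSSW")  # strong (C, G) -> S, weak (A, T) -> W


def two_mer_frequencies(sequence):
    """Frequencies of overlapping 2-mers, normalized by the window count."""
    windows = len(sequence) - 1
    counts = Counter(sequence[i:i + 2] for i in range(windows))
    return {kmer: n / windows for kmer, n in counts.items()}


# Hypothetical fragment: a run of C/T followed by the conserved AG.
tract = "TTTCTCTTCCTTTCAG"

ry = two_mer_frequencies(tract.translate(RY))
sw = two_mer_frequencies(tract.translate(SW))

print("RY:", sorted(ry.items()))
print("SW:", sorted(sw.items()))
```

In this toy case YY accounts for 13 of the 15 windows under RY, while no SW 2-mer exceeds 5 of 15, so the RY feature vector carries a much stronger signal for the run.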

Conclusions
• Selection of the appropriate feature mapping rule can greatly influence the DNA sequence classification results
• Anomalies in consensus sequences (such as long runs) can be exploited for better classification results when selecting mapping rules
• For a trade-off between classification accuracy and speed, RY k-mer frequency based mapping can be used instead of 4-letter k-mer frequency mapping
• Open research problem: “forbidden” k-mers

Questions?

SVM kernel function optimization
• Introduction of additional kernel parameters
• Introduction of new kernels
• Power series kernel function
• Advantage:
  • more parameters for optimization
  • better separation of classes in feature space
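The exact definition of the power series kernel is not stated in the text. One plausible reading, assumed here purely for illustration, is a truncated power series in the dot product, K(x, y) = a_1 (x·y) + a_2 (x·y)^2 + ... + a_n (x·y)^n, where the coefficients a_i are the extra parameters available for optimization. The function name and parameter layout below are hypothetical.

```python
def power_series_kernel(x, y, coefficients):
    """Assumed form: sum over i = 1..n of a_i * (x . y) ** i.

    `coefficients` holds a_1 ... a_n. This is an illustrative guess
    at the kernel's shape, not a confirmed definition.
    """
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return sum(a * dot ** i for i, a in enumerate(coefficients, start=1))


x = [1.0, 2.0]
y = [3.0, 4.0]
print(power_series_kernel(x, y, [1.0]))       # reduces to the linear kernel
print(power_series_kernel(x, y, [1.0, 0.5]))  # adds a quadratic term
```

With non-negative coefficients this form is a valid (positive semidefinite) kernel, since it is a non-negative combination of powers of the linear kernel. A single coefficient a_1 = 1 recovers the linear kernel.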

SW k-mer frequency mapping rule
• SW ({A,T} vs. {C,G}) mapping rule
  • reflects the difference in the number of hydrogen bonds in the DNA molecule
    • Strong (C, G) nucleotides – 3 H bonds
    • Weak (A, T) nucleotides – 2 H bonds
  • related to physical-chemical properties of DNA
    • transport of electrons
    • mechanical waves along the DNA helix

RY k-mer frequency mapping rule
• The RY mapping rule ({A, G} vs. {C, T})
  • describes how purines (R) and pyrimidines (Y) are distributed along the DNA sequence
    • A and G – purines (R)
    • C and T – pyrimidines (Y)
  • corresponds to the chemical composition bias in the DNA strand

KM k-mer mapping rule
• The KM mapping rule ({A,C} vs. {G,T})
  • describes how ketones (K) and amines (M) are distributed along the DNA sequence
    • A and C – amines (M)
    • G and T – ketones (K)

Classification metric
• F-measure
• Advantage:
  • One measure that takes into account both recall and precision: a spectacular score in one does not compensate for a bad score in the other
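As a sketch of this property, the function below computes the balanced F-measure (F1, the harmonic mean of precision and recall). Treating "F-measure" as the balanced F1 variant is an assumption, and the function name and argument names are hypothetical.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure: 2 * P * R / (P + R).

    Returns 0.0 when precision or recall is undefined or both are zero.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f_measure(8, 2, 2)    # precision 0.8, recall 0.8
lopsided = f_measure(10, 90, 0)  # precision 0.1, recall 1.0
print(balanced, lopsided)
```

The second call shows the point made above: perfect recall cannot rescue a precision of 0.1, and the score stays below 0.2.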
