This presentation examines the application of Support Vector Machines (SVMs) to distinguishing non-coding neutral sequences from regulatory modules. It covers data collection and encoding, classification of sequences with SVMs, and the challenges and current trends in predicting regulatory elements. It also introduces machine learning, its main types, and its impact on bioinformatics, surveying the confluence of the two fields in protein folding, genetic networks, and microarray data mining through sample publications, and presents SVMs as a powerful statistical learning technique for identifying regulatory regions in DNA sequences.
SVM: Non-coding Neutral Sequences Vs Regulatory Modules Ying Zhang, BMB, Penn State Ritendra Datta, CSE, Penn State Bioinformatics – I Fall 2005
Outline • Background: Machine Learning & Bioinformatics • Data Collection and Encoding • Distinguishing Sequences Using SVM • Results • Discussion
Regulation: A Recurring Challenge • The expression of genes is under regulation. • Right protein, right time, right amount, right location… • Regulation: cis-elements vs trans-elements • Cis-element: non-coding functional sequence • Trans-element: protein that interacts with a cis-element • Predicting cis-regulatory elements remains a challenge: • Significant effort has been invested in the past • Current trends: TFBS clusters, pattern analysis
Alignments and Sequences: The Data • Information: Sequence • Genetic information is encoded in the DNA sequence • Typical information: codons, binding sites, … Codons: ATG (Met), CGT (Arg), … Binding sites: A/TGATAA/G (Gata1), … • Evolutionary Information: Aligned Sequences • Similarity between species • Conservation ~ Function Human: TCCTTATCAGCCATTACC Mouse: TCCTTATCAGCCACCACC
Problem • Given the genome sequence information, is it possible to automatically distinguish Regulatory Regions from other genomic non-coding Neutral sequences using machine learning?
Machine Learning: The Tool • Sub-field of A.I. • Computer programs “learn” from experience, i.e. by analyzing data and the corresponding behavior • Confluence of Statistics, Mathematical Logic, and Numerical Optimization • Applied in Information Retrieval, Financial Analysis, Computer Vision, Speech Recognition, Robotics, Bioinformatics, etc. [Diagram: M.L. at the intersection of Statistics, Optimization, and Logic, with applications such as analyzing stocks, predicting genes, and personalized WWW search]
Machine Learning: Types of Learning • Supervised Learning • Learning statistical models from past sample-label data pairs, e.g. Classification • Unsupervised Learning • Building models to capture the inherent organization in data, e.g. Clustering • Reinforcement Learning • Building models from interactive feedback on how well the current model is doing, e.g. Robotic learning
Machine Learning and Bioinformatics: The Confluence • Learning problems in Bioinformatics [ICML ’03] • Protein folding and protein structure prediction • Inference of genetic and molecular networks • Gene–protein interactions • Data mining from microarrays • Functional and comparative genomics, etc.
Machine Learning and Bioinformatics: Sample Publications • Identification of DNaseI hypersensitive sites in the human genome (may disclose the location of cis-regulatory sequences) • W. S. Noble et al., “Predicting the in vivo signature of human gene regulatory sequences,” Bioinformatics, 2005. • Functional classification of genes based on gene expression data from DNA microarray hybridization experiments using SVMs • M. P. S. Brown et al., “Knowledge-based analysis of microarray gene expression data by using support vector machines,” PNAS, 2000. • Using log-odds ratios from Markov models to identify regulatory regions in DNA sequences • L. Elnitski et al., “Distinguishing Regulatory DNA From Neutral Sites,” Genome Research, 2003. • Selection of informative genes using an SVM-based feature selection algorithm • I. Guyon et al., “Gene selection for cancer classification using support vector machines,” Machine Learning, 2002.
Support Vector Machines: A Powerful Statistical Learning Technique Which of the linear separators is optimal?
Support Vector Machines: A Powerful Statistical Learning Technique Choose the one that maximizes the margin between the classes [Figure: maximum-margin separator, with slack variables ξi marking samples that fall inside the margin]
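For reference, the standard soft-margin SVM formulation behind this picture; the slack variables ξi measure how far a sample falls on the wrong side of its margin, and C (tuned later in the parameter-selection slide) trades margin width against training errors:

```latex
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{subject to}\quad
y_{i}\left(\mathbf{w}^{\top}\mathbf{x}_{i} + b\right) \ge 1 - \xi_{i},\qquad \xi_{i} \ge 0 .
```

The resulting margin has width 2/‖w‖, so minimizing ‖w‖² is exactly “maximizing the margin.”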
Support Vector Machines: A Powerful Statistical Learning Technique [Figures: example datasets plotted along an x-axis] The classes in these datasets separate linearly with ease. What about these datasets?
Support Vector Machines: A Powerful Statistical Learning Technique Solution: the Kernel Trick! [Figure: the same 1-D data mapped into (x, x²) space, where it becomes linearly separable]
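The trick in one line: every inner product in the SVM dual can be replaced by a kernel k(x, z) = ⟨φ(x), φ(z)⟩, so the separator is linear in a richer feature space without φ ever being computed explicitly. For the one-dimensional example above, a quadratic map already suffices:

```latex
\phi(x) = (x,\ x^{2}), \qquad
k(x, z) = \langle \phi(x), \phi(z) \rangle = xz + x^{2}z^{2}.
```

The RBF kernel used in the experiments that follow, k(x1, x2) = exp(−δ‖x1 − x2‖²), corresponds to an infinite-dimensional φ.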
Experiments: Overview • Classification in question: • Regulatory regions (REG) vs Ancestral Repeats (AR) • Two types of experiments: • Nucleotide sequences – ATCG • Alignments (reduced 5-symbol alphabet) – SWVIG (S: match involving G & C; W: match involving A & T; G: gap; V: transversion; I: transition) • Two datasets: • Elnitski et al. dataset • Dataset from Penn State CCGB • Mapping sequences/alignments → real numbers (see the sketch below): • Frequencies of short K-mers (K = 1, 2, 3) • Normalizing factor: sequence length (ambiguous for K > 1, since only length − K + 1 windows exist) • Stability of variance: equal-length sequences (whenever possible)
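A minimal sketch of the K-mer frequency encoding described above (the function name is illustrative; normalizing by the number of K-mer windows, length − K + 1, is one way to resolve the ambiguity noted for K > 1):

```python
from itertools import product

def kmer_features(seq, alphabet="ATCG", max_k=3):
    """Encode a sequence as normalized k-mer frequencies for k = 1..max_k."""
    features = []
    for k in range(1, max_k + 1):
        windows = max(len(seq) - k + 1, 1)   # number of k-mer positions
        counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
        for i in range(len(seq) - k + 1):
            if seq[i:i + k] in counts:       # skip windows with ambiguous symbols
                counts[seq[i:i + k]] += 1
        features.extend(counts[kmer] / windows for kmer in sorted(counts))
    return features

# 4 + 16 + 64 = 84 features per nucleotide sequence, as on the next slide
print(len(kmer_features("TCCTTATCAGCCATTACC")))  # -> 84
```

The same function covers the alignment experiments with alphabet="GVWSI", giving 5 + 25 + 125 = 155 features.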
Experiments: Feature Selection • Total number of features: • Sequences: 4 + 4² + 4³ = 84 • Alignments: 5 + 5² + 5³ = 155 • Relatively high dimensionality: • Curse of dimensionality: convergence of estimators is very slow • Over-fitting: poor generalization performance • Solutions: • Dimension reduction – e.g., PCA • Feature selection – e.g., forward selection, backward elimination (sketched below)
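Forward selection is a wrapper method: it scores candidate feature subsets by the cross-validated accuracy of the classifier itself. A hypothetical sketch (the authors' exact greedy procedure, shown two slides ahead, also includes backward elimination):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_keep, cv=5):
    """Greedily add the feature whose inclusion best improves CV accuracy."""
    chosen = []
    while len(chosen) < n_keep:
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        scores = {j: cross_val_score(SVC(kernel="rbf"),
                                     X[:, chosen + [j]], y, cv=cv).mean()
                  for j in candidates}
        chosen.append(max(scores, key=scores.get))   # best single addition
    return chosen
```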
Experiments: Training and Validation • Training set: • Elnitski et al. dataset • Sequences: 300 samples of 100 bp from each class (REG and AR) • Alignments: 300 samples of length 100 from each class • SVM setup (sketched below): • RBF kernel: k(x1, x2) = exp(−δ ‖x1 − x2‖²) • Implementation: LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) • Validation: • N-fold cross-validation • Used in feature selection, parameter tuning, and testing
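A sketch of the training and validation loop. The original work used LibSVM directly; scikit-learn's SVC is a thin wrapper around LibSVM, and its gamma parameter plays the role of the slides' δ. The data here are random stand-ins with the dimensions quoted above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Stand-in data: rows are k-mer frequency vectors (see the encoding sketch),
# labels are +1 for REG and -1 for AR, 300 samples per class.
rng = np.random.default_rng(0)
X = rng.random((600, 84))
y = np.repeat([1, -1], 300)

# RBF kernel k(x1, x2) = exp(-delta * ||x1 - x2||^2); delta is "gamma" here
clf = SVC(kernel="rbf", gamma=1.6, C=1.5)

# N-fold cross-validation (N = 10 here), as used for tuning and testing
print(cross_val_score(clf, X, y, cv=10).mean())
```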
Results: The Elnitski et al. dataset • Parameter selection • SVM Parameters: δ and C • Feature Selection • Assessing Feature Importance • G-C Normalization • Sequences: 10 out of 84 • Symbols: 10 out of 155 • Accuracy scores • Overall • Ancestral Repeats (AR) • Regulatory Regions (Reg)
Results: SVM Parameter Selection • Iterative selection procedure (sketched below): • Coarse selection – find an initial neighborhood • Fine-grained selection – brute force within it • Validation set drawn from the data • Within-loop cross-validation • Chosen parameters: • δ = 1.6 • C = 1.5
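A minimal sketch of the coarse-then-fine search, reusing X and y from the previous sketch; the grid ranges are illustrative, not the ones actually searched:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def best_params(X, y, deltas, Cs, cv=5):
    """Return the (delta, C) pair with the highest cross-validated accuracy."""
    scored = {(d, c): cross_val_score(SVC(kernel="rbf", gamma=d, C=c),
                                      X, y, cv=cv).mean()
              for d in deltas for c in Cs}
    return max(scored, key=scored.get)

# Coarse pass over a wide log-spaced grid to find the initial neighborhood...
d0, c0 = best_params(X, y, np.logspace(-3, 3, 7), np.logspace(-3, 3, 7))
# ...then a brute-force fine-grained pass around the coarse winner
delta, C = best_params(X, y, np.linspace(d0 / 4, d0 * 4, 9),
                       np.linspace(c0 / 4, c0 * 4, 9))
```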
Results: Feature Selection – Sequence • Features chosen by one-dimensional SVMs • [Figure: distribution of nucleotide frequencies for the nine most significant k-mers]
Results: Feature Selection – Symbol • Features chosen by one-dimensional SVMs • [Figure: distribution of 5-symbol frequencies for the nine most significant k-mers]
Results: Feature Selection • Procedure: • Greedy • Forward selection + backward elimination • Chosen features (index encoding decoded in the sketch below): • Sequence: [5 68 3 20 63 4 16 10 1 22] • (0 = A, 1 = T, 2 = G, 3 = C, 4 = AA, 5 = AT, etc.) • [AT, CAA, C, AAA, GGC, AA, CA, TG, T, AAG] • Symbol: [3 5 4 18 24 124 17 143 19 95 103] • (0 = G, 1 = V, 2 = W, 3 = S, 4 = I, 5 = GG, 6 = GV, etc.) • [S, GG, I, WS, SI, SIG, WW, IWI, WI, WSV, WII]
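The index lists follow a positional code: indices 0–3 are 1-mers, 4–19 are 2-mers, and 20–83 are 3-mers, with the offset inside each block read as base-4 digits over A, T, G, C. A small decoder (names illustrative) reproduces the sequence feature list above:

```python
def decode_index(idx, alphabet="ATGC"):
    """Map a feature index to its k-mer (contiguous blocks of 1-, 2-, 3-mers)."""
    n = len(alphabet)
    k, start = 1, 0
    while idx >= start + n ** k:      # find the k-mer block containing idx
        start += n ** k
        k += 1
    offset, kmer = idx - start, ""
    for _ in range(k):                # read the offset as k base-n digits
        offset, digit = divmod(offset, n)
        kmer = alphabet[digit] + kmer
    return kmer

print([decode_index(i) for i in [5, 68, 3, 20, 63, 4, 16, 10, 1, 22]])
# -> ['AT', 'CAA', 'C', 'AAA', 'GGC', 'AA', 'CA', 'TG', 'T', 'AAG']
```

(The symbol indices appear to follow the analogous base-5 scheme over G, V, W, S, I, though with a small block offset that this decoder does not reproduce exactly.)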
Results: Laboratory Data • Training: • SVM models built using the Elnitski et al. data • Same parameters; same selected features • Data: • 9 candidate cis-regulatory regions predicted by RP score • 1 negative control (by definition) • 5 of the 9 candidates passed current biological testing (positive) • Accuracy: • Classification result for sequences (1-, 2-, 3-mers): • 1 negative control • 4 out of 5 positive elements + 3 out of 4 “negative” elements • Classification result for alignments (1-, 2-, 3-mers): • 1 negative control • 9 original candidates
Discussion • High validation rate for ancestral repeats • The selected training set is not very diverse in structure • Ancestral repeats tend to be AT-rich • AR: LINEs, SINEs, etc. • SVM performs slightly better than RP scores on the training set • Statistically more powerful • RP: Markov model for pattern recognition • SVM: hyperplane in a high-dimensional feature space • Feature selection using a wrapper method is possible
Discussion (cont’d) • Performance degradation in lab-data classification • No improvement in SVM classification compared to the RP score • Features identified from the Elnitski et al. data may carry some bias – other features may be more informative on the lab data • Sequence classification vs alignment classification (accuracy table) • SVM yields higher overall cross-validation accuracy on aligned symbol sequences than on nucleotide sequences • The gained accuracy is driven by ancestral repeats • No improvement for aligned symbol sequences • On the lab data, sequence classification outperforms aligned-symbol classification • No information gained from the evolutionary history! • The alphabet reduction is not optimal • The underlying assumption was wrong!
Summary • Generally, SVM is a powerful tool for classification • Performance is better than RP in distinguishing the AR training set from the Reg training set • SVM: answers a “yes or no” question • RP: a probabilistic method that can generate quantitative measurements genome-wide • SVM: results can be extended using probabilistic forms of SVM (see the sketch below) • SVM can reveal potentially interesting biological features • e.g. the transcription regulation scheme
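The “probabilistic forms of SVM” mentioned above usually refers to Platt scaling, which fits a sigmoid to the SVM decision values; LibSVM supports this natively (train with -b 1), and scikit-learn exposes the same machinery. A minimal sketch, reusing X and y from the earlier training sketch:

```python
from sklearn.svm import SVC

# probability=True fits Platt's sigmoid on top of the decision function,
# turning the "yes or no" answer into class probabilities (here P(AR), P(REG))
clf = SVC(kernel="rbf", gamma=1.6, C=1.5, probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))
```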
Future Directions: Possible Extensions • Explore more complex features • Refine models for neutral non-coding genomic segments • Utilize multi-species alignments for classification • Combine sequence and alignment information to build more robust multi-classifiers – a “Committee of Experts” • Pattern recognition for more accurate prediction
Questions and recommendations? • Use the original alignment features (20 columns). • Test SVM performance on other lab data (avoiding the possible bias of RP pre-selection).
References • L. Elnitski et al., “Distinguishing Regulatory DNA From Neutral Sites,” Genome Research, 2003. • Machine Learning Group, University of Texas at Austin, “Support Vector Machines,” http://www.cs.utexas.edu/~ml/ . • N. Cristianini, “Support Vector and Kernel Methods for Pattern Recognition,” http://www.support-vector.net/tutorial.html.
Acknowledgement • Dr. Webb Miller • Dr. Francesca Chiaromonte • David King Thank You!!!