An HV-SVM Classifier to Infer TF-TF Interactions Using Protein Domains and GO Annotations Xiao-Li Li Institute for Infocomm Research, Singapore Nanyang Technological University, Singapore Joint work with Junxiang Lee (National University of Singapore) Bharadwaj Veeravalli (National University of Singapore) See-Kiong Ng (Institute for Infocomm Research)
Outlines • 1. Introduction • 2. Existing work • 3. The proposed techniques • 4. Experiments • 5. Conclusions
1. Introduction: 3 Key Biomolecules in a Cell • DNA: The blueprint. Encodes all the information needed to produce an organism. • RNA: The messenger. Transmits DNA information to the protein synthesis machinery. • Proteins: The workers. The targets of most drugs.
The Central Dogma: DNA → RNA → Protein • Transcription: genes are transcribed to RNA, transferring genetic information from DNA into RNA. • Translation: RNA is translated to protein; cells build proteins through protein biosynthesis. • Protein is the machinery of life: proteins function by interacting with other proteins and biomolecules (PPI).
Genetic Information Flow in a cell Transcription: DNA to RNA Translation: RNA to Protein
Transcriptional Regulation This slide is from Li Shen: CSB Tutorial, UCSD
Combinatorial Nature of TFs • Recruitment of RNA Polymerase requires the cooperation of several different transcription factors. • TFIID is first positioned at the TATA box of the DNA • TFIIA, TFIIB, etc. then attach to the DNA • Next, RNA Polymerase binds • Other TFs complete the maturation of the transcription complex
2. Existing Work: Computationally Inferring TF-TF Interactions • Exploits the abundance of gene expression data to infer synergistic relationships between TFs • A pair of TFs (A, B) is considered to be interacting if the expression correlation score of the genes bound by both A and B is significantly greater than that of gene sets bound by either TF A or TF B alone: ECS(A, B, G) >> max(ECS(A, G), ECS(B, G)) Pilpel (2001), Nature Genet; Bussemaker (2001), Nature Genet; Yu (2003), Trends Genet; Banerjee (2003), NAR
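The expression-correlation test above can be sketched in code. This is a minimal illustration, not the cited papers' exact procedure: the names (`ecs`, `is_interacting`), the use of mean pairwise Pearson correlation over a gene set, and the fixed significance margin are all assumptions.

```python
# Sketch of the ECS(A,B,G) >> max(ECS(A,G), ECS(B,G)) criterion.
# expr maps each gene to its expression profile across conditions.
import itertools
import statistics

def pearson(x, y):
    """Pearson correlation of two expression profiles."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ecs(expr, genes):
    """Expression correlation score: mean pairwise correlation in a gene set."""
    return statistics.mean(
        pearson(expr[g], expr[h]) for g, h in itertools.combinations(genes, 2))

def is_interacting(expr, genes_a, genes_b, margin=0.2):
    """TFs A, B deemed interacting if genes bound by BOTH correlate much
    more strongly than genes bound by either TF alone ('margin' stands in
    for a proper significance test)."""
    both = genes_a & genes_b
    if len(both) < 2:
        return False
    return ecs(expr, both) > max(ecs(expr, genes_a), ecs(expr, genes_b)) + margin
```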
Existing Work (Cont.) • Exploit the growing availability of whole-genome sequences. Sophisticated sequence analysis methods are employed to discover so-called interacting motif pairs in the DNA sequences. • Two motifs are deemed interacting if their co-occurrences in the input DNA promoters are over-represented and the distances between the two motifs are significantly different from random expectation. • The TFs binding to these motifs are then predicted to interact with each other. Conlon (2003), PNAS; Das (2004), PNAS; Yu (2006), NAR
Existing Work (Cont.) • Exploit the growing availability of large-scale protein-protein interaction networks. • The working hypothesis here is that proteins that are close to each other in the PPI networks are more likely to be co-regulated by the same set of TFs. • TFs A and B are considered to be interacting when the median distance between protein pairs controlled by both A and B is significantly shorter than for protein pairs controlled by A only or B only. Nagamine (2005), NAR
Existing Work (Cont.) • Machine learning algorithms have been used to exploit the abundance of available PPI data to build models that classify novel protein interactions. • Feature representation • SVM, NB, etc. • Most of these works have focused on predicting generic PPIs. It would be interesting if we could also employ a machine learning approach for predicting the more biologically specific TF-TF interactions. Bock (2001), Bioinformatics; Wu (2005), NAR; Ben-Hur (2005), Bioinformatics
3. The Proposed Technique • 3.1 TF characterization • 3.2 SVM classification • 3.3 Kernel design • Vertical Pairwise Kernel • Horizontal Pairwise Kernel • Combined kernel • 3.4 Genetic algorithm (GA)
3.1 TF characterization • We characterize TFs according to their protein domains and biological attributes, which include molecular functions, biological processes, as well as cellular components.
Protein Domains • They are evolutionarily conserved modules of amino acid subsequence, i.e. "building blocks" for constructing different proteins. • The existence of certain domains in the TFs could indicate the propensity for the TFs to interact via domain-domain interactions (DDIs).
GO Annotations • GO provides a common vocabulary that describes the biological processes, molecular functions and cellular components for many biomolecules. • Physical interactions between TFs require that they exist in close proximity in a cell. • Biologically, TFs that have the same molecular functions, are involved in the same biological processes, and are located in the same cellular components are more likely to interact.
3.2 Support Vector Machines • SVM is grounded in statistical learning theory and was first introduced in 1992. • Consider a binary classification problem (Class 1 vs. Class 2) • Many decision boundaries are possible! • Are all decision boundaries equally good? V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
Large-margin Decision Boundary • The decision boundary should be as far away as possible from the data of both classes • We should maximize the margin m
Finding the Decision Boundary • Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi • The decision boundary should classify all points correctly • The decision boundary can be found by solving the following constrained optimization problem
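The optimization problem referenced above is not reproduced in this extraction; the standard hard-margin formulation it refers to is:

```latex
\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n
```

Maximizing the margin m = 2/||w|| is equivalent to minimizing ||w||²/2 under these correct-classification constraints.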
The Dual Problem • This is a quadratic programming (QP) problem • Many solvers, such as LOQO and CPLEX, have been proposed, and a global maximum over the αi can always be found • see http://www.numerical.rl.ac.uk/qp/qp.html • Inner product: a similarity measure between the objects. • Different kernels can be designed, e.g. using domain knowledge to better address the problem!
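The dual QP alluded to above is the standard SVM dual; replacing the inner product with a kernel K(xi, xj) is what makes the custom kernel designs in the next slides possible:

```latex
\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_i
 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
   \alpha_i\,\alpha_j\, y_i\, y_j\,\mathbf{x}_i^\top\mathbf{x}_j
\quad \text{s.t.}\quad
\alpha_i \ge 0,\qquad \sum_{i=1}^{n}\alpha_i\, y_i = 0
```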
A Geometrical Interpretation (figure: only the support vectors carry nonzero multipliers, e.g. α1 = 0.8, α6 = 1.4, α8 = 0.6; all other αi = 0)
3.3 Kernel Design: SVM Default Linear Kernel Kd • Need to define a pairwise kernel (over TF pairs, not single TFs) • A simplistic approach is to arrange the TFs in an alphabetically ordered list. Hence, given TF pairs (A, B) and (C, D), where A < B and C < D, the pairwise kernel is defined as:
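The slide's formula is lost in this extraction. Assuming the default kernel simply concatenates the single-TF feature vectors in alphabetic order (a sketch, not necessarily the paper's exact definition), the linear kernel over the concatenation would be:

```latex
K_d\big((A,B),(C,D)\big)
 = \big\langle\, \phi(A)\oplus\phi(B),\ \phi(C)\oplus\phi(D) \,\big\rangle
 = \langle \phi(A),\phi(C)\rangle + \langle \phi(B),\phi(D)\rangle
```

Note the weakness this creates: similarity is only measured between the first TFs of each pair and between the second TFs, as dictated by alphabetic order rather than biology.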
Vertical Pairwise Kernel Kv • What is wrong with the default kernel?
Proposed Horizontal Pairwise Kernel Kh • A pairwise kernel represents a pair explicitly as one object and measures the similarity directly between both pairs. (Figure: TF pairs AB and CD are compared via the domain pairs in AB and in CD, and the common processes, common functions, and common components in each pair.)
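The two kernels can be sketched as follows, under stated assumptions: the vertical kernel is taken as the standard symmetrized product of single-TF kernels over both cross-pair alignments, and the horizontal kernel as a dot product of explicit pair-level feature vectors; element-wise products stand in for the real domain-pair and shared-GO-term features.

```python
# Illustrative sketch, not the paper's exact kernel definitions.
def linear_kernel(u, v):
    """Similarity between two single-TF feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def vertical_kernel(A, B, C, D):
    """Match individual TFs across the two pairs, in both alignments
    (A-C with B-D, and A-D with B-C), so ordering does not matter."""
    return (linear_kernel(A, C) * linear_kernel(B, D)
            + linear_kernel(A, D) * linear_kernel(B, C))

def pair_features(A, B):
    """Represent a TF pair as ONE object; the element-wise product is a
    stand-in for domain-pair / common-annotation features."""
    return [a * b for a, b in zip(A, B)]

def horizontal_kernel(A, B, C, D):
    """Compare the two pairs directly via their pair-level features."""
    return linear_kernel(pair_features(A, B), pair_features(C, D))
```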
Combining to Build the HV Kernel Khv and Learning Feature Weights • Combine the vertical and horizontal kernels into the HV kernel, where Kh and Kv refer to the horizontal and vertical kernels while wh and wv are their respective kernel weights. • How do we decide the kernel weights wh and wv and the feature weights?
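Written out, the weighted combination described above (the slide's equation is not reproduced in this extraction) is:

```latex
K_{hv} = w_h\, K_h + w_v\, K_v
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so Khv can be plugged directly into the SVM dual.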
3.4 Genetic algorithm optimizes the kernel and feature weights 1) Randomly generate an initial population of k chromosomes; 2) Evaluate the fitness, f, of each individual; 3) Form k/2 random pairs from the population and select the fitter individual of each pair as parents to breed the next generation. Repeat to obtain a total of k parents; 4) Breed a new generation of offspring through crossover of the k parents; 5) Perform random mutation on the newly created offspring with a mutation rate r; 6) Repeat Steps 2-5 for n generations; 7) Select fittest chromosome from the n generations as solution to the search problem.
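The seven GA steps above can be sketched in code, specialized here to learning the two kernel weights (wh, wv). This is a minimal illustration: the fitness function is a caller-supplied stand-in for the real objective (cross-validated F-measure), and the population size k, mutation rate r and generation count n are illustrative defaults.

```python
# Minimal GA following the listed steps; chromosomes are (w_h, w_v) tuples.
import random

def run_ga(fitness, k=20, n=50, r=0.1, seed=0):
    rng = random.Random(seed)
    # 1) randomly generate an initial population of k chromosomes
    pop = [(rng.random(), rng.random()) for _ in range(k)]
    best = max(pop, key=fitness)          # 2) evaluate fitness
    for _ in range(n):                    # 6) repeat for n generations
        # 3) form random pairs; the fitter of each pair becomes a parent
        parents = [max(rng.sample(pop, 2), key=fitness) for _ in range(k)]
        # 4) breed offspring by crossover (swap genes between parents)
        pop = [child
               for p1, p2 in zip(parents[::2], parents[1::2])
               for child in ((p1[0], p2[1]), (p2[0], p1[1]))]
        # 5) random mutation with rate r (small Gaussian perturbation)
        pop = [tuple(g + rng.gauss(0, 0.1) if rng.random() < r else g
                     for g in c)
               for c in pop]
        # 7) track the fittest chromosome seen across all generations
        best = max(pop + [best], key=fitness)
    return best
```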
4. Experimental Results • We collected TF-TF interaction data for Homo sapiens (human) and Mus musculus (mouse) from various databases, including • IntAct (http://www.ebi.ac.uk/intact/index.html), • GRID (http://biodata.mshri.on.ca/grid/servlet/Index/), • MINT (http://mint.bio.uniroma2.it/mint/), • BIND (http://bind.ca/), • DIP (http://dip.doe-mbi.ucla.edu/). • In all, a total of 3224 TF-TF interaction pairs among 619 TFs were extracted from these public databases.
Positive and Negative Datasets • The positive dataset: the 3224 extracted TF-TF interaction pairs. • The negative dataset: a set of similar size was constructed by randomly pairing TFs, keeping only pairs that do not already exist in the positive dataset. • Five-fold cross-validation was performed to evaluate the experimental results
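The negative-set construction described above can be sketched as follows; the function name and the unordered-pair normalization via `frozenset` are illustrative assumptions.

```python
# Draw random TF pairs, keeping only those absent from the positive set.
import random

def build_negatives(tfs, positives, size, seed=0):
    rng = random.Random(seed)
    pos = {frozenset(p) for p in positives}   # treat pairs as unordered
    negatives = set()
    while len(negatives) < size:
        a, b = rng.sample(tfs, 2)             # two distinct TFs
        pair = frozenset((a, b))
        if pair not in pos:
            negatives.add(pair)
    return negatives
```

A caveat worth noting: such randomized negatives may contain undiscovered true interactions, which is an accepted source of label noise in this kind of evaluation.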
Comparison between Feature Combinations • We conducted an experiment to determine the effectiveness of the different features in predicting TF-TF interactions. • All 15 combinations of the 4 features (D: domains, P: biological processes, F: molecular functions, C: cellular components) were tested over different kernels.
Performance of default kernel Kd for different feature combinations
Performance of different kernels using all features with equal weights.
Optimization of Kernel and Feature Weights Applying our optimized HV-SVM classifier with kernel Khvo on test set achieved 85.7% F-measure, which is 1.0% higher than kernel Khv and 2.5% higher than randomly assigned weights for the initial population.
Conclusions • To understand the complex mechanisms behind transcriptional regulation, it is imperative to unravel the coordinated interactions of the TFs. • We characterized the TFs using Pfam domains and GO annotations. Our results show that integrating multiple sources of biological evidence improves the prediction of TF-TF interactions.
Conclusions (Cont.) • We specifically designed two novel pairwise kernels for predicting TF-TF interactions • The vertical pairwise kernel measures similarity across individual TFs between two TF pairs • The horizontal pairwise kernel considers similarity between two TF pairs by measuring the similarity between the feature pairs of the two sets of TFs. • A genetic algorithm was then employed to learn the kernel and feature weights of the kernel combination to give the best results.
Future Work • A future area of work would be to incorporate the three current classes of TF-TF interaction prediction techniques, namely gene-expression-correlation-based, interacting-motif-based and PPI-network-based techniques, into our HV-SVM machine learning approach for even better predictions.
Thank you for your attention!