Abstract • Background: This work describes a candidate gene prioritization method based on protein-protein interaction network (PPIN) analysis and compares it with ToppGene (a functional prioritization method). • Results: For the first time, the PageRank and HITS algorithms and the K-Step Markov method, all used in Web and social network analysis, are applied to a PPIN to prioritize disease candidate genes. • Conclusion: PPIN-based candidate gene prioritization performs better than any other individual gene feature or annotation. It can be used successfully for disease candidate gene prioritization.
Background-1 • Most current disease candidate gene identification and prioritization methods rely on functional annotations from different data sources: GO, pathways, domains, expression, etc. • In their recent work, the authors used a functional prioritization method named ToppGene, which integrates functional data with mouse phenotype data. ToppGene performs better than the other published functional prioritization methods. • These methods are limited by the coverage of gene functional annotation: - only a fraction of the human genome is annotated with pathways and phenotypes - 2/3 of all genes have at least one functional annotation - 1/3 have yet to be annotated
Background-2 Different approach • In this study, for the first time, the authors applied social- and Web-network analysis algorithms to a PPIN to prioritize disease candidate genes. • The PPIN is represented as an unweighted, undirected, simple graph G(V, E): genes are nodes and interactions are edges, so V is the set of all genes and E the set of all interactions. The set of known disease genes (seeds) is denoted R. • The prioritization approaches are based on the methods of White and Smyth, whose framework of four successive problem formulations defines how to rank nodes in the unweighted graph G(V, E).
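The graph representation described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name build_simple_graph and the gene labels G1 to G3 are hypothetical, and the interaction pairs are toy data rather than real interactions.

```python
def build_simple_graph(interactions):
    """Build an unweighted, undirected, simple graph as an adjacency dict.

    Duplicate pairs collapse into one edge and self-interactions are dropped,
    which is what "simple graph" means here.
    """
    adj = {}
    for a, b in interactions:
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if a != b:  # no self-loops in a simple graph
            adj[a].add(b)
            adj[b].add(a)
    return adj


# Toy interaction list (hypothetical labels, not real data).
interactions = [("G1", "G2"), ("G2", "G1"), ("G1", "G3"), ("G3", "G3")]
graph = build_simple_graph(interactions)

V = set(graph)                                             # all genes
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2  # |E|
R = {"G1"}                                                 # seed (known disease) genes
```

Storing neighbours in sets keeps the graph simple by construction, since repeated or reversed pairs cannot create parallel edges.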
Methods-1 White and Smyth problem formulations: 1. Given G, where t and r are both nodes in G, compute the importance I(t|r) of node t with respect to the root r. 2. Given G and a root node r in G, rank all vertices in T, a subset of the vertices in G, by computing I(t|r) for each node t in T. 3. Given G and a set of root nodes R in G, rank all vertices in T. I(t|R) is the average of the importance with respect to each node in R: I(t|R) = (1/|R|) Σ_{r∈R} I(t|r). 4. Given G, rank all nodes, where R = T = V. • Formulation 3 is what this study needs: the problem is to prioritize a set of genes in the network based on their importance to a set of root genes (genes known to be associated with a disease). • The importance of a gene to the set of root genes is simply the average of its importance with respect to each individual root gene.
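Formulation 3 can be illustrated with a short sketch. The per-root importance used here, 1 / (1 + hop distance), is only a placeholder assumption chosen to keep the example small; it is not one of the measures used in the study. The function names and the toy graph are hypothetical.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first hop counts from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def importance_to_roots(adj, roots, targets):
    """I(t|R) = (1/|R|) * sum over r in R of I(t|r).

    Placeholder I(t|r): 1 / (1 + hops from r to t), or 0 if t is unreachable.
    """
    scores = {t: 0.0 for t in targets}
    for r in roots:
        dist = hop_distances(adj, r)
        for t in targets:
            if t in dist:
                scores[t] += 1.0 / (1 + dist[t])
    return {t: s / len(roots) for t, s in scores.items()}


# Toy graph: path C - A - B - D plus an isolated node E.
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}, "E": set()}
scores = importance_to_roots(adj, roots=["A", "C"], targets=["B", "D", "E"])
ranking = sorted(scores, key=scores.get, reverse=True)
```

Any per-root measure I(t|r) can be substituted for the placeholder; the averaging step over R stays the same.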
Methods-2 • The remaining task is to compute I(t|r), the importance of node t with respect to a root node r. • Three algorithms from White and Smyth's methods were used: 1. PageRank 2. HITS 3. K-Step Markov
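Two of the three algorithms can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' implementation: PageRank is written in its root-biased ("with priors") form, where a walker returns to the root set with probability beta at each step, and K-Step Markov sums the probabilities of being at each node over walks of length 1 to K started from the roots. The parameter values (beta=0.3, k=3), function names, and toy graph are hypothetical. HITS is omitted to keep the sketch short.

```python
def transition_probs(adj):
    """p(v|u) = 1/deg(u) for each neighbour v of u (unweighted random walk)."""
    return {u: {v: 1.0 / len(nbrs) for v in nbrs}
            for u, nbrs in adj.items() if nbrs}


def root_prior(adj, roots):
    """Uniform probability over the root set, zero elsewhere."""
    return {v: (1.0 / len(roots) if v in roots else 0.0) for v in adj}


def step(probs, current, nodes):
    """One random-walk step: redistribute each node's mass to its neighbours."""
    nxt = {v: 0.0 for v in nodes}
    for u, out in probs.items():
        for v, p in out.items():
            nxt[v] += current[u] * p
    return nxt


def pagerank_with_priors(adj, roots, beta=0.3, iterations=100):
    """Root-biased PageRank: with probability beta, jump back to the roots.

    Note: mass sitting on an isolated node is dropped in this sketch.
    """
    prior = root_prior(adj, roots)
    probs = transition_probs(adj)
    rank = dict(prior)
    for _ in range(iterations):
        walked = step(probs, rank, adj)
        rank = {v: beta * prior[v] + (1.0 - beta) * walked[v] for v in adj}
    return rank


def k_step_markov(adj, roots, k=3):
    """Sum of visit probabilities over walks of length 1..k from the roots."""
    probs = transition_probs(adj)
    current = root_prior(adj, roots)
    total = {v: 0.0 for v in adj}
    for _ in range(k):
        current = step(probs, current, adj)
        for v in adj:
            total[v] += current[v]
    return total


# Toy path graph A - B - C - D with a single root gene A.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
pr = pagerank_with_priors(adj, roots={"A"})
ksm = k_step_markov(adj, roots={"A"}, k=3)
```

On this toy graph both scores fall off with distance from the root, which is the behaviour that makes them usable for ranking candidates relative to seed genes.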
Methods-3 Human protein interaction network • The human protein-protein interactions were extracted from the NCBI Entrez Gene FTP site (BIND, BioGRID, HPRD), giving a network with 8340 nodes and 27250 edges. Evaluation of PPIN for gene prioritization • The same training data as in the authors' previous study were used, comprising 19 diseases from the OMIM (Online Mendelian Inheritance in Man) and GAD (Genetic Association Database) databases, with a total of 693 associated genes. 589 genes were used in the cross validation. Cardiac septal defect candidate gene prioritization • From NCBI’s OMIM database, 166 OMIM records labeled “atrial septal defect” were extracted. 81 genes were mapped to these records and used as the training set. 431 genes (from interactions) were used for ranking (test set).
Results-1 Cross validation • Rank-based ROC curves were plotted, and AUC values were used to quantitatively measure performance. • 13 conditions (the 3 algorithms with different parameter settings), repeated 5 times.
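The AUC of a rank-based ROC curve equals the probability that a randomly chosen positive (a held-out disease gene) is scored above a randomly chosen negative, so it can be computed directly from the scores. The sketch below shows that general equivalence only; it does not reproduce the study's cross-validation protocol, and the function name, gene labels, and scores are hypothetical.

```python
def rank_auc(scores, positives):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy prioritization scores; g1 and g3 play the role of held-out disease genes.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "g4": 0.1}
auc = rank_auc(scores, positives={"g1", "g3"})
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75; a random ranking would give about 0.5.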
Combined functional annotation- and PPIN-based methods are more effective in identifying and ranking disease candidate genes. Results-3 Top 20 ranked genes • Mice with deletion of Erbb2 show ventricular septal defects (VSD), suggesting that the human ortholog ERBB2 could be a potential candidate gene for VSD. • Mouse embryos lacking p300 protein (EP300 gene) show ventricular septal defects. • Truncated CBP protein (CREBBP gene) leads, in mice, to atrial and ventricular septal defects. * Genes associated with cardiac development or malformation: 15 with ToppGene, 14 with the PPIN-based method. # Genes associated with septal defects: 6 with ToppGene, 3 with the PPIN-based method.
Results-4 Prioritized candidate genes of cardiac septal defects using both functional annotation-based and PPIN-based methods.
Results-5 AUC of different feature sets. Red bars indicate the AUC scores based on each feature set, and blue bars are the corresponding random controls.
Conclusions-1 • The PageRank, HITS, and K-Step Markov algorithms were applied to a literature-based, manually curated protein interaction network. • Goal: to prioritize disease candidate genes. Known disease-related genes were used as a training set ("seeds"), and the candidate genes were ranked. • Network-based methods are generally not as effective as the integrated functional annotation-based methods. • Compared with the individual functional annotation features, however, network-based methods perform better than each of them. • Therefore, PPINs can be a good feature for disease candidate gene prioritization, especially when the genes lack all other functional annotations or are sparsely annotated.
Conclusions-2 • Limitations: as with functional annotation-based methods, performance depends on the quality of the interaction data (missing interactions and false positives). Solutions: • a better fit with biological networks (e.g., using weighted nodes (genes or proteins) or weighted edges (interactions)). • integrating the method with other methods (e.g., combining results from functional annotation-based methods and expression profiles with network-based approaches). • Using both functional annotations and PPIN-based topological parameters is expected to better facilitate the discovery and prioritization of disease genes.