Identification of Cancer-Causing Mutations in the Human Genome with Machine Learning Techniques
U, Man Chon (Kevin)
Computer Science Department, The University of Georgia
manchonu@cs.uga.edu

Introduction
Cancer is a leading cause of death worldwide, and the total number of cases globally is increasing. The number of global cancer deaths is projected to increase by 45% from 2007 to 2030 (from 7.9 million to 11.5 million deaths). In most developed countries, cancer is the second largest cause of death after cardiovascular disease. Research to improve our understanding of the causes of cancer and its most promising therapies is therefore urgently needed.

Background
• Single nucleotide polymorphisms (SNPs): DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered, as Figure 1 illustrates.
• Mutations: Driver mutations are responsible for oncogenicity. Passenger mutations are harmless.
• Hypothesis: Subsets of the non-synonymous SNPs (nsSNPs) will help identify the multiple genes associated with complex ailments such as cancer.
• Motivation: Finding those nsSNPs experimentally is extremely expensive and time-consuming.
• Solution: The aim of this research is to use machine learning techniques to identify probable cancer-causing nsSNPs.

Ultimate Goal
To identify suspicious mutations that we can assert with a high degree of certainty to be driver mutations, and to build a sophisticated model for this process.

Discussion
Our experimental results demonstrate that, by utilizing machine learning techniques, we can identify the cancer-causing mutations in the human genome with high accuracy. The small variance in accuracy across the different machine learning algorithms suggests that our new features play a significant role in the identification process.
Furthermore, when we limited our experiments to the kinase domain, the classification accuracy reached 90.1757%. When the list of experimentally confirmed driver (cancer-causing) mutations was supplied, we identified the mutations with 98.549% accuracy, without applying any attribute selection or instance selection methods.

Contributions
• Applied different machine learning techniques to identify cancer-causing mutations.
• Introduced new features.
• Provided evidence that biologists can use to develop new therapies for cancer treatment.

Acknowledgments
I would like to thank Dr. Khaled Rasheed and Dr. Natarajan Kannan for their guidance in this project. I would also like to thank Eric Talevich for help in collecting the data.

Figure 1. Single Nucleotide Polymorphisms
Figure 2. Visualization of Classification Results
Table I. Classification Results
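
To make the Background definitions concrete, the following minimal Python sketch finds single-nucleotide differences between two aligned coding sequences and labels each one as synonymous or non-synonymous using the standard genetic code. It is an illustration only, not the method used in this work: the function name `classify_snps` and the toy sequences are hypothetical, and the sketch assumes both sequences are in frame, have equal length, and contain only A, T, C, and G.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snps(reference, variant):
    """Label each single-base difference between two in-frame coding sequences.

    Returns a list of (position, ref_base, alt_base, ref_aa, alt_aa, kind)
    tuples, where position is a 0-based index into the sequence.
    Assumption (hypothetical helper): inputs are aligned, equal-length,
    upper-case DNA strings whose length is a multiple of three.
    """
    if len(reference) != len(variant) or len(reference) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal length")

    results = []
    for i, (ref_base, alt_base) in enumerate(zip(reference, variant)):
        if ref_base == alt_base:
            continue
        start = i - i % 3  # first base of the codon containing position i
        ref_aa = CODON_TABLE[reference[start:start + 3]]
        # Note: if a codon carries several differences, the variant codon
        # reflects all of them, not just the one at position i.
        alt_aa = CODON_TABLE[variant[start:start + 3]]
        # Any amino acid change, including a gained stop codon ("*"),
        # is labeled non-synonymous in this simplified sketch.
        kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
        results.append((i, ref_base, alt_base, ref_aa, alt_aa, kind))
    return results


if __name__ == "__main__":
    # Toy sequences (made up for illustration): Met-Glu-Val vs. Met-Val-Val.
    for row in classify_snps("ATGGAAGTT", "ATGGTAGTC"):
        print(row)
```

In the toy example, the A-to-T change at position 4 turns GAA (Glu) into GTA (Val) and is therefore non-synonymous, while the T-to-C change at position 8 leaves the codon encoding Val and is synonymous. Only variants of the first kind would be candidates for the driver versus passenger classification described above.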