
Gene family classification using a semi-supervised learning method

Nan Song. Advisors: John Lafferty, Dannie Durand. Outline: Introduction; A motivating application: genome annotation; A graphical model of sequence relatedness; Gene classification using machine learning.


Presentation Transcript


  1. Nan Song Advisors: John Lafferty, Dannie Durand Gene family classification using a semi-supervised learning method

  2. Outline • Introduction • A motivating application: genome annotation • A graphical model of sequence relatedness • Gene classification using machine learning • Empirical evaluation • Conclusion

  3. The Genome The complete genetic material of an organism or species

  4. Key genomic component: genes A gene is a DNA subsequence ACCCTTAGCTAGACCTTTAGGAGG...

  5. Key genomic component: genes • A gene is a DNA subsequence: ACCCTTAGCTAGACCTTTAGGAGG... • A protein is an amino acid sequence: VHLT P E... • Genes encode proteins, the building blocks of the cell

  6. Whole Genome Sequencing • 413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria • In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes • Source: www.genomesonline.org

  7. atgcaccttg

  8. Gene prediction and annotation • 14,882 known genes • 16,896 predicted genes • 31,778 total • Source: International Human Genome Consortium, Nature 2001

  9. Gene annotation • We are given a new genome sequence with predicted genes. • A few genes are well studied. • Identify other genes in the same family to predict function. • Verify predictions experimentally. • Two contexts: individual scientist, high throughput

  10. Outline • Introduction • Molecular biology • A motivating application: genome annotation • A graphical model of sequence relatedness • Gene classification using machine learning • Empirical evaluation • Conclusion

  11. Evolutionarily related genes have related functions • Ancestral gene: atgccaggactcccagtga… • Two duplications produce three genes: atgcgccgtctggcatgt…, atgcgaggtctcccatgt…, atgcaaggagtcccagagc… • Globin family example: β-globin (adult), γ-globin (fetal), ε-globin (embryonic)

  12. Evolutionarily related genes have related functions • Gene family classification is a powerful source of information for inferring evolutionary, functional and structural properties of genes • Ancestral gene: atgccaggactcccagtga… • Two duplications produce three genes: atgcgccgtctggcatgt…, atgcaaggagtcccagagc…, atgcgaggtctcccatgt… • Family members: β-globin, γ-globin, ε-globin

  13. Outline • Introduction • A graphical model of sequence relatedness • Gene classification using machine learning • Empirical evaluation • Conclusion

  14. A graphical model of sequence relatedness • G = (V,E) • V: vertices represent sequences • E: the weight of an edge is proportional to the similarity between the two sequences it connects • Example: vertices xi and xj for …atgcaaggagtcccagagcc… and …atgcgaggtctcccagtgtc…

  15. A graphical model of sequence relatedness • G = (V,E) • V: vertices represent sequences • E: the weight of an edge is proportional to the similarity between the two sequences it connects
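The graph described on slides 14 and 15 can be sketched as a symmetric weight matrix. This is only an illustration: the function name `build_similarity_graph`, the `threshold` parameter, and the toy similarity values are assumptions and do not come from the talk.

```python
import numpy as np

def build_similarity_graph(similarity, threshold=0.0):
    """Turn pairwise sequence similarities into a symmetric edge-weight matrix.

    `similarity[i][j]` is assumed to be a similarity score between sequences
    i and j. `threshold` is a hypothetical cutoff: similarities at or below
    it are treated as "no edge".
    """
    s = np.asarray(similarity, dtype=float)
    if s.ndim != 2 or s.shape[0] != s.shape[1]:
        raise ValueError("similarity must be a square matrix")
    w = (s + s.T) / 2.0        # enforce symmetry of the undirected graph
    w[w <= threshold] = 0.0    # drop edges at or below the cutoff
    np.fill_diagonal(w, 0.0)   # no self-loops
    return w

# Toy similarities for three sequences (made-up numbers).
toy_similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.0],
    [0.1, 0.0, 1.0],
])
toy_graph = build_similarity_graph(toy_similarity, threshold=0.05)
```

A dense matrix is enough for a sketch; for a database of several thousand sequences a sparse representation would likely be preferable.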

  16. Gene family classification • Biological scenario: • small number of known genes • large number of unknown genes • Goal: given known genes, identify genes in the same family.

  17. Outline • Introduction • A graphical model of sequence relatedness • Gene classification using machine learning • Empirical evaluation • Conclusion

  18. Framework: binary classification • Machine learning scenario: • small number of labeled data • genes known to be in family • genes clearly not in family • large number of unlabeled data Determine which unlabeled genes belong to the family.

  19. Several challenging problems of gene family classification • Traditionally, similarity is represented by sequence comparison • Mutations: atgcgccccccggcatgt… • DNA shuffling: atgcgccgtctggcatgt…ggctcgta • Figure: an ancestral gene gives rise, through two duplications, to atgcgccgtctggcatgt…, atgcgaggtctcccatgt… and atgcaaggagtcccagagc…


  21. Several challenging problems of gene family classification Families • do not form a clique • do not form a connected component • have edges to sequences outside the family.

  22. Outline • Introduction • A graphical model of sequence relatedness • Gene classification using machine learning • Semi-supervised learning algorithm • Supervised learning algorithm • Empirical evaluation • Conclusion

  23. Gene family classification • Machine learning scenario: • large number of unlabeled data • small number of labeled data • Goal: binary classification • Semi-supervised learning: • Exploits information from both labeled and unlabeled data • Has performed well in many applications

  24. Graphical semi-supervised learning (Binary classification) • Notation: • V: the whole data set • L: labeled data set • U: unlabeled data set • Each vertex: (xi,yi) or (xk, f(k)) • Figure labels: (xi,yi = 1), (xj,yj = 0), (xk,f(k)) • Reference: Xiaojin Zhu, Zoubin Ghahramani, John Lafferty, The Twentieth International Conference on Machine Learning (ICML-2003)

  25. Graphical semi-supervised learning (Binary classification) • Input: • family members: (xi,yi = 1) • nonfamily members: (xj, yj = 0) • Output: • Assign a real value to every vertex in the graph • Find a cutoff to separate the two classes • Figure labels: (xi,yi = 1), (xj,yj = 0), (xk,f(k)) • Reference: Xiaojin Zhu, Zoubin Ghahramani, John Lafferty, The Twentieth International Conference on Machine Learning (ICML-2003)

  26. Graphical semi-supervised learning (Binary classification) • Assign real values to all vertices in the graph, to minimize E(f) • G = (V,E) • L: labeled data set • U: unlabeled data set • Figure: labeled vertices with y = 1 and y = 0, an unlabeled vertex (xk,f(k)), and an edge weight Sij
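The expression for E(f) is not preserved in this transcript. For orientation, the quadratic energy used in the ICML-2003 paper by Zhu, Ghahramani and Lafferty cited on the preceding slides is given below, together with its closed-form minimizer. Whether the slide used exactly this form is an assumption; the symbols w_ij (edge weight), D (diagonal degree matrix) and W (weight matrix) follow that paper rather than the slide.

```latex
\[
E(f) = \frac{1}{2} \sum_{i,j \in V} w_{ij}\,\bigl(f(i) - f(j)\bigr)^{2},
\qquad f(i) = y_i \ \text{for all } i \in L
\]
\[
f_U = \left(D_{UU} - W_{UU}\right)^{-1} W_{UL}\, f_L,
\qquad D_{ii} = \sum_{j} w_{ij}
\]
```

Minimizing E(f) with the labeled values held fixed makes each unlabeled value the weighted average of its neighbors' values, which is why strongly connected sequences receive similar scores.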

  27. Graph-based semi-supervised learning f(xk) Works well http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html

  28. Graph-based semi-supervised learning f(xk) Works well Works well ? http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html

  29. Outline • Introduction • A graphical model of sequence relatedness • Gene classification using machine learning • Semi-supervised learning • Supervised learning • Empirical evaluation • Conclusion

  30. Semi-supervised vs kernel-based supervised learning • Semi-supervised learning: • Supervised learning: where L is the labeled data set and U is the unlabeled data set
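The two formulas on this slide are not preserved in the transcript. The sketch below contrasts two stand-ins on a toy graph: the harmonic solution of Zhu, Ghahramani and Lafferty (ICML-2003) for the semi-supervised case, and a simple weighted average over labeled neighbors only for the supervised case. The second function is an assumed baseline chosen for illustration, not necessarily the kernel-based method used in the talk, and all names are hypothetical.

```python
import numpy as np

def split_indices(n, labeled_idx):
    """Return (labeled, unlabeled) index arrays for a graph with n vertices."""
    labeled_idx = np.asarray(labeled_idx, dtype=int)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    return labeled_idx, unlabeled_idx

def harmonic_scores(w, labeled_idx, labels):
    """Semi-supervised: solve (D_UU - W_UU) f_U = W_UL f_L.

    Uses edges among unlabeled vertices as well as edges to labeled ones.
    The system is singular if some unlabeled component has no labeled vertex.
    """
    w = np.asarray(w, dtype=float)
    labeled_idx, unlabeled_idx = split_indices(w.shape[0], labeled_idx)
    y = np.asarray(labels, dtype=float)
    laplacian = np.diag(w.sum(axis=1)) - w
    l_uu = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    w_ul = w[np.ix_(unlabeled_idx, labeled_idx)]
    return unlabeled_idx, np.linalg.solve(l_uu, w_ul @ y)

def labeled_only_scores(w, labeled_idx, labels, fallback=0.5):
    """Assumed supervised baseline: weighted mean of labeled neighbors' labels.

    Vertices with no labeled neighbor get `fallback` (a hypothetical default).
    """
    w = np.asarray(w, dtype=float)
    labeled_idx, unlabeled_idx = split_indices(w.shape[0], labeled_idx)
    y = np.asarray(labels, dtype=float)
    w_ul = w[np.ix_(unlabeled_idx, labeled_idx)]
    totals = w_ul.sum(axis=1)
    scores = np.full(len(unlabeled_idx), fallback, dtype=float)
    has_neighbor = totals > 0
    scores[has_neighbor] = (w_ul @ y)[has_neighbor] / totals[has_neighbor]
    return unlabeled_idx, scores

# Toy chain 0-1-2-3-4 with unit weights; vertex 0 is family, vertex 4 is not.
chain = np.zeros((5, 5))
for i in range(4):
    chain[i, i + 1] = chain[i + 1, i] = 1.0

unlabeled, f_semi = harmonic_scores(chain, labeled_idx=[0, 4], labels=[1, 0])
_, f_sup = labeled_only_scores(chain, labeled_idx=[0, 4], labels=[1, 0])
```

On this chain the harmonic scores fall off smoothly from the family end to the nonfamily end, while the labeled-only baseline has no information about the middle vertex, which has no labeled neighbor.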

  31. Outline • Introduction • A graphical model of sequence relatedness • Gene classification using machine learning • Empirical evaluation • Methodology • Results • Conclusion

  32. Graph construction • G = (V,E) • V: all mouse sequences from SwissProt (n = 7439) • E: edges weighted by a newly designed sequence similarity measure S, with 0 < S(i, j) < 1

  33. Methodology • Graph construction • Test set construction • Experiments performed • Basis for evaluation

  34. Test set construction • 18 well studied protein families: ACSL, ADAM, DVL, FGF, FOX, GATA, Kinase, Kinesin, Laminin, Myosin, Notch, PDE, SEMA, T-box, TNFR, TRAF, USP, WNT • Receptors, enzymes, transcription factors, motor proteins, structural proteins, and extracellular matrix proteins.

  35. Test set construction • Retrieved all complete mouse sequences from the SwissProt database (7,439) • Identified sequences for each test family based on • Nomenclature committee reports • Structural properties • Literature surveys

  36. Methodology • Graph construction • Test set construction • Experiments performed • Basis for evaluation

  37. Experiments performed • Compare semi-supervised with supervised learning algorithm • Tested parameters: • Scaling parameter, σ, in the kernel function • Number of Labeled Family members (LF) • Number of Labeled Nonfamily members (LN)

  38. Tested parameters • σ • Number of Labeled Family members • Number of Non-labeled Family members • For each set of parameters, 20 tests were performed

  39. Tested parameters (1) • Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100 • Figure: edge weight W (0 to 1) plotted against similarity S (0 to 1), one curve per σ value (curves labeled σ=100, 10, 1, 0.5, 0.2, 0.1)
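The kernel function that maps similarity S to edge weight W is not preserved in this transcript. The sketch below assumes a Gaussian kernel on the distance 1 - S, which is one common choice and reproduces the qualitative behavior of the figure (W close to 1 everywhere for large σ, W close to 0 except near S = 1 for small σ). The formula and the name `similarity_to_weight` are assumptions, not the talk's definition.

```python
import numpy as np

def similarity_to_weight(s, sigma):
    """Hypothetical kernel: W = exp(-(1 - S)^2 / sigma^2), with S in [0, 1]."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    s = np.asarray(s, dtype=float)
    return np.exp(-((1.0 - s) ** 2) / sigma ** 2)

# The sigma values listed on the slide, evaluated on a coarse grid of S.
tested_sigmas = [0.05, 0.1, 0.5, 1, 2, 10, 100]
s_values = np.linspace(0.0, 1.0, 6)
weights_by_sigma = {
    sigma: similarity_to_weight(s_values, sigma) for sigma in tested_sigmas
}
```

Under this assumed form, σ controls how quickly weak similarities are suppressed: small σ keeps only near-identical pairs connected, while large σ treats almost all pairs as equally related.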

  40. Tested parameters (2) • Labeled Family members (LF): 10-70% of family size • Labeled Nonfamily members (LN): 100, 500, 1000 (about 1-10% of nonfamily size) • Database size: 7439
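The labeling setup on slides 38 and 40 (a fraction of the family labeled as positives, a fixed count of nonfamily sequences labeled as negatives, 20 tests per setting) could be drawn as in the following sketch. Random sampling, the function name, the seed handling, and the toy family of 40 sequences are all assumptions; the transcript does not say how labeled sets were chosen.

```python
import random

def sample_labeled_sets(family_ids, nonfamily_ids, lf_fraction, ln_count, seed=None):
    """Draw one random labeled set: a fraction of the family plus ln_count nonfamily ids."""
    rng = random.Random(seed)
    lf_count = max(1, round(lf_fraction * len(family_ids)))
    labeled_family = rng.sample(sorted(family_ids), lf_count)
    labeled_nonfamily = rng.sample(sorted(nonfamily_ids), ln_count)
    return labeled_family, labeled_nonfamily

# Toy setup: a hypothetical family of 40 sequences in a database of 7439.
family_ids = list(range(40))
nonfamily_ids = list(range(40, 7439))

# 20 repeated draws for one parameter setting (LF = 10% of family, LN = 100).
trials = [
    sample_labeled_sets(family_ids, nonfamily_ids, lf_fraction=0.10, ln_count=100, seed=t)
    for t in range(20)
]
```

Seeding each trial separately keeps the draws reproducible, so the same labeled sets can be reused when comparing two algorithms.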

  41. Methodology • Graph construction • Test set construction • Experiments performed • Basis for evaluation

  42. Semi-supervised learning Goal: f(i) > f(j) when xi is a family member and xj is not. Evaluation criteria: • Visualization • AUC score • False negatives

  43. Visualization • Sort all unlabeled data by f(x) • Rank plot: f(x) against rank, with family members and nonfamily members marked

  44. Rank plot and AUC (Area Under ROC Curve) • Rank plot: f(x) against rank, with family members and nonfamily members marked • ROC curve: sensitivity against 1 - specificity
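The AUC can be computed directly from the scores f(x) as the fraction of (family, nonfamily) pairs that are ranked in the right order, which matches the goal f(i) > f(j) stated on slide 42. The sketch below uses this pairwise definition; the function name and the toy scores are illustrative only.

```python
def auc_from_scores(family_scores, nonfamily_scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    if not family_scores or not nonfamily_scores:
        raise ValueError("both classes need at least one score")
    correct = 0.0
    for fs in family_scores:
        for ns in nonfamily_scores:
            if fs > ns:
                correct += 1.0   # family member ranked above nonfamily member
            elif fs == ns:
                correct += 0.5   # tie: counted as half a correct pair
    return correct / (len(family_scores) * len(nonfamily_scores))

# Toy scores: one family member (0.4) is ranked below one nonfamily member (0.7).
toy_auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
```

Here 11 of the 12 pairs are ordered correctly, so `toy_auc` is 11/12. The double loop is quadratic; a rank-based formula would be preferable for thousands of sequences.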

  45. Advantages of rank plot AUC = 0.9382

  46. AUC scores do not reflect all information we need • False negatives after the first false positive • The number of missed data after the first false positive
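The second criterion can be sketched as a count over the ranked list: walk down the scores from highest to lowest and count family members that appear after the first nonfamily member. The function name and the tie handling (ties keep their input order) are assumptions.

```python
def false_negatives_after_first_false_positive(scores, is_family):
    """Count family members ranked below the highest-scoring nonfamily member."""
    if len(scores) != len(is_family):
        raise ValueError("scores and is_family must have the same length")
    # Sort indices by descending score; Python's sort is stable, so ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    seen_false_positive = False
    missed = 0
    for i in order:
        if not is_family[i]:
            seen_false_positive = True
        elif seen_false_positive:
            missed += 1
    return missed

# Toy ranking: the family member scored 0.4 falls below the nonfamily member at 0.7.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
toy_is_family = [True, True, False, True, False, False, False]
toy_missed = false_negatives_after_first_false_positive(toy_scores, toy_is_family)
```

For the toy ranking `toy_missed` is 1. Two rankings with the same AUC can differ on this count, which is the point the slide makes about AUC alone.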

  47. Outline • Introduction • A graphical model of sequence relatedness • Gene classification using machine learning • Empirical evaluation • Methodology • Results • Conclusion

  48. Several challenging problems of gene family classification Families • do not form a clique • do not form a connected component • have edges to sequences outside the family. Edges to sequences outside the family are mainly a problem if they have strong edge weights

  49. Test families have different graph properties • W: Edges to sequences outside the family have weak edge weights • S: Edges to sequences outside the family have strong edge weights

  50. Results • Compare semi-supervised with supervised learning algorithm • Tested parameters: • Scaling parameter, σ, in the kernel function • Number of Labeled Family members (LF) • Number of Labeled Nonfamily members (LN)
