Mining the Genome

Presentation Transcript


  1. Mining the Genome Filip Železný ČVUT FEL, Prague Dept. of Cybernetics Gerstner Laboratory

  2. Intro • Research at ČVUT FEL Dept. of Cybernetics • Nature Inspired Technologies • machine learning • evolutionary computation • Agent Computing • Robotics • Computer Vision • EU Projects (6 FP) • 14 running in 2005, 9 new starting 2006

  3. Machine Learning basics

  4. Machine Learning & Data Mining • Supervised learning • given examples and their class labels • find a model for predicting class labels of new examples • also: “concept learning”, “predictive classification”, ... • Example • Given: • Discover: size=small & luxury=low → affordable
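
The supervised-learning setting on this slide can be sketched in a few lines of Python. The slide's example table did not survive in the transcript, so the data below is a hypothetical stand-in, and the exhaustive search over conjunctions is only one simple way to find a rule, not necessarily the method the talk had in mind.

```python
from itertools import combinations

# Hypothetical toy data (the slide's own table is missing from the transcript).
# Each example is (attributes, class label "affordable").
EXAMPLES = [
    ({"size": "small", "luxury": "low"}, True),
    ({"size": "small", "luxury": "high"}, False),
    ({"size": "big", "luxury": "low"}, False),
    ({"size": "big", "luxury": "high"}, False),
    ({"size": "small", "luxury": "low"}, True),
]


def covers(rule, example):
    """A rule is a tuple of (attribute, value) conditions, read as a conjunction."""
    return all(example[attr] == value for attr, value in rule)


def learn_rule(examples):
    """Return the shortest conjunction covering no negatives and the most positives."""
    conditions = sorted({(a, v) for ex, _ in examples for a, v in ex.items()})
    best, best_coverage = None, 0
    for size in range(1, len(conditions) + 1):
        for rule in combinations(conditions, size):
            pos = sum(covers(rule, ex) for ex, label in examples if label)
            neg = sum(covers(rule, ex) for ex, label in examples if not label)
            # Strict ">" keeps the earliest (shortest) rule among equally good ones.
            if neg == 0 and pos > best_coverage:
                best, best_coverage = rule, pos
    return best


rule = learn_rule(EXAMPLES)
print(" & ".join(f"{a}={v}" for a, v in rule), "-> affordable")
```

On this toy data the search recovers the same two conditions as the rule shown on the slide.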

  5. Machine Learning • Plethora of paradigms • Decision trees (“symbolic”) • Artificial neural networks (“subsymbolic”) • Support vector machines (“statistical”) • Learning = optimization in structure / parameter space • Learning = search • AI techniques employed (gradient descent, heuristic search)

  6. Relational Learning • What if examples have a structure? • Not an attribute tuple! • Description spread across multiple tables of a relational database

  7. Relational Learning • Relational learning • Representing data and rules in relational logic (Prolog) • Exploits background knowledge (e.g. “charge”) • Inductive Logic Programming • Example rule: carcinogenic(Compound) IF has_atom(Compound, Atom) & type(Atom, carbon) & charge(Atom, Charge) & Charge > 0.0133 & has_atom(Compound, Atom2) & double_bond(Atom, Atom2)
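
To make the relational rule concrete, here is a minimal Python sketch that evaluates its body against a handful of facts. All compound and atom identifiers and their values are hypothetical, and the sketch assumes that the first argument of `double_bond` refers to the same carbon atom bound earlier in the rule.

```python
# Hypothetical relational facts, one Python structure per database table.
HAS_ATOM = {("c1", "a1"), ("c1", "a2"), ("c2", "a3"), ("c2", "a4")}
TYPE = {"a1": "carbon", "a2": "oxygen", "a3": "carbon", "a4": "hydrogen"}
CHARGE = {"a1": 0.02, "a2": -0.3, "a3": 0.001, "a4": 0.05}
DOUBLE_BOND = {("a1", "a2")}


def carcinogenic(compound):
    """True if some carbon atom of the compound has charge > 0.0133
    and is double-bonded to another atom of the same compound."""
    atoms = [atom for comp, atom in HAS_ATOM if comp == compound]
    return any(
        TYPE[atom] == "carbon"
        and CHARGE[atom] > 0.0133
        and (atom, atom2) in DOUBLE_BOND
        for atom in atoms
        for atom2 in atoms
    )


for compound in ("c1", "c2"):
    print(compound, carcinogenic(compound))
```

The nested loop over `atom` and `atom2` plays the role of the existentially quantified variables in the logic rule, which is exactly what a flat attribute tuple cannot express.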

  8. Applications of Interest • Intersection of 3 hot fields • BIOtechnologies (genomics) • INFORMATION technologies (machine learning) • NANOtechnologies (microarray chips)

  9. A quick intro into computational genomics

  10. Background: GENETICS How does a cell know what to do?

  11. Chromosomes • Chromosomes get copied during mitosis • They carry the assembly instructions? How? • Chromosomes = proteins + DNA • Where is the information?

  12. DNA • 1953: James Watson & Francis Crick discover the DNA structure. That is where the information is. • 4-symbol alphabet: guanine, adenine, cytosine, thymine • Double-helix pairing: C-G, A-T
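
The pairing rule on this slide (C with G, A with T) is enough to compute one strand from the other. A minimal sketch, with function names chosen here for illustration:

```python
# Watson-Crick base pairing: C-G and A-T.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand):
    """Complement read in the opposite direction, since the two strands are antiparallel."""
    return complement(strand)[::-1]


print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```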

  13. The CENTRAL DOGMA of Molecular Biology • Gene = DNA subsequence • Genes code for proteins • Gene expression • DNA piece transcribes to RNA • RNA translates into a protein • Proteins “do the job” • enzymes • building blocks • ...

  14. Protein Coding • DNA strand • Codon (3 bases) • Amino acid • Protein
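
The path from DNA to protein described on the last two slides can be sketched as follows. The codon table is a small excerpt of the standard genetic code rather than the full 64 entries, and the sketch makes the simplifying assumption that the input is the coding strand, so transcription reduces to replacing T with U.

```python
# Excerpt of the standard genetic code (RNA codon -> amino acid); not the full table.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "AAA": "Lys",
    "UAA": None,   # stop codon
}


def transcribe(coding_strand):
    """DNA -> RNA, assuming the coding strand is given (T becomes U)."""
    return coding_strand.replace("T", "U")


def translate(rna):
    """Read the RNA three bases at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid is None:
            break
        protein.append(amino_acid)
    return protein


print(translate(transcribe("ATGTTTGGCTAA")))  # ['Met', 'Phe', 'Gly']
```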

  15. Protein structures “resolution”

  16. Secondary structure prediction • Two common secondary structures: β-sheet, α-helix • Primary structure determines secondary structure. • Computational problem: given primary structure, predict if β-sheet or α-helix • NOBODY CAN DO THAT!

  17. Secondary structure prediction • Secondary structure prediction with ILP [Muggleton 1992] • Using ILP, obtained rules such as alpha0(A,B) ← ... position(A,D,O) & not_aromatic(O) & small_or_polar(O) & position(A,B,C) & very_hydrophobic(C) & not_aromatic(C) ... etc. (22 literals) • Note the incorporation of background knowledge • Accuracy 81%, best at the time • Published in the journal Protein Engineering

  18. Sequencing the Human Genome

  19. The Genome project • 1993 – 2003 • All human genes sequenced • Celera vs. NIH race • Challenge NOW: annotate the genes • discover functions • interactions • dynamic pathways

  20. Genomics research • Traditional functional genomics research • Hypothesis-driven • e.g. a gene is suspected to be responsible for ... • then tracing its expression in relevant tissues • “First hypothesize, then measure” • Diagram labels: Verification (targeted assay), Human intuition, Hypotheses

  21. Gene Expression Microarrays • Microarray chip: • Measures expression of tens of thousands of genes simultaneously: “high-throughput” • pioneering technology (mid to late 90’s) • A grid carrying synthesized DNA probes • → Breakthrough in genomics research?

  22. Genomics Research • High-throughput approach to functional genomics? • Data-driven, unbiased, “First measure, then hypothesize” • Might reveal never-thought-of relationships • Diagram labels: Microarray data, Human analysis, Hypotheses • IMPOSSIBLE (TOO MUCH DATA): expression of almost the entire genome (tens of thousands of genes)

  23. Genomics Research through Machine Learning • AI-based high-throughput functional genomics? • High-throughput screening • High-performance computing • Microarray data • Machine Learning • Hypotheses • Interpretation

  24. Genomics Research with AI • This concept has recently been proven to work • Golub et al., Science 286:531-537, 1999 • leukemia classification model (AML vs. ALL) • voting of informative attributes (genes) • Discovery of new classes (clustering) • Ramaswamy et al., PNAS 98:15149-54, 2001 • Tumor classification • 14 classes of cancer • used Support Vector Machines
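
The “voting of informative attributes” idea can be illustrated with a short sketch in the spirit of a signal-to-noise weighted vote. This is a simplified illustration, not a reproduction of the published procedure: the expression values and gene names are hypothetical, the step of selecting the most informative genes is omitted, and the votes are summed with a sign instead of being tallied per class.

```python
from statistics import mean, stdev

# Hypothetical training data: expression levels per sample, with class labels.
SAMPLES = [
    {"g1": 5.0, "g2": 1.0},
    {"g1": 6.0, "g2": 1.5},
    {"g1": 1.0, "g2": 4.0},
    {"g1": 2.0, "g2": 5.0},
]
LABELS = ["AML", "AML", "ALL", "ALL"]


def train(samples, labels, class_a, class_b):
    """Per gene: a signal-to-noise weight and the midpoint between class means."""
    model = {}
    for gene in samples[0]:
        a = [s[gene] for s, y in zip(samples, labels) if y == class_a]
        b = [s[gene] for s, y in zip(samples, labels) if y == class_b]
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
        midpoint = (mean(a) + mean(b)) / 2
        model[gene] = (weight, midpoint)
    return model


def predict(model, sample, class_a, class_b):
    """Each gene votes weight * (expression - midpoint); a positive total favours class_a."""
    total = sum(w * (sample[g] - m) for g, (w, m) in model.items())
    return class_a if total > 0 else class_b


model = train(SAMPLES, LABELS, "AML", "ALL")
print(predict(model, {"g1": 5.5, "g2": 1.2}, "AML", "ALL"))
print(predict(model, {"g1": 1.5, "g2": 4.5}, "AML", "ALL"))
```

A gene whose class means are far apart relative to its within-class spread gets a large weight, so it dominates the vote.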

  25. Interpretable classifiers • Comprehensibility Pursuit: Rule Based Models • Models interpretable by biologists • Our work • D. Gamberger, N. Lavrač, F. Železný, J. Tolar, Journal of Biomedical Informatics 37(5):269-284, 2004 • Example rule: IF gene_20056 EXPRESSED AND gene_23984 NOT_EXPRESSED THEN cancer_class = AML
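
A rule of this form is straightforward to apply to an expression profile. In the sketch below, the threshold that turns a numeric expression level into EXPRESSED / NOT_EXPRESSED and the sample values are assumptions made for illustration; the transcript does not say how expression was discretized.

```python
# Assumed cut-off for calling a gene "expressed"; not taken from the slides.
EXPRESSION_THRESHOLD = 1.0


def is_expressed(level, threshold=EXPRESSION_THRESHOLD):
    return level > threshold


def classify(sample):
    """Apply the slide's rule; return None when the rule does not fire."""
    if is_expressed(sample["gene_20056"]) and not is_expressed(sample["gene_23984"]):
        return "AML"
    return None


# Hypothetical expression profiles.
print(classify({"gene_20056": 3.2, "gene_23984": 0.1}))  # AML
print(classify({"gene_20056": 3.2, "gene_23984": 2.7}))  # None
```

Because the model is a readable condition on two named genes, a biologist can inspect and challenge it directly, which is the point of the slide.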

  26. Exploiting Background knowledge • Tons of genomic background knowledge available • Relational learning would make it possible to exploit it!

  27. Relational Genomic Data Mining • Our current work • Combining expression & gene annotation data • Rule Based Model

  28. Relational Genomic Data Mining • Example rule algorithmically discovered: expressed_in_all(Gene) IF has_location(Gene, integral_to_membrane) & has_function(Gene, receptor_activity) • Expression of genes coding for proteins located in the “integral to membrane” cell component, whose functions include receptor activity, has a high correlation with the BCR class of acute lymphoblastic leukemia (ALL) and a low correlation with other classes of ALL. • ... open end, no conclusions
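
The body of this rule selects genes by their annotations rather than by their identity. A minimal sketch of that selection step follows; the gene names and annotation sets are hypothetical, and the correlation with leukemia classes mentioned on the slide is not computed here.

```python
# Hypothetical gene annotation data (background knowledge).
ANNOTATIONS = {
    "gene_a": {"location": {"integral_to_membrane"},
               "function": {"receptor_activity", "signal_transducer_activity"}},
    "gene_b": {"location": {"nucleus"},
               "function": {"receptor_activity"}},
    "gene_c": {"location": {"integral_to_membrane"},
               "function": {"transporter_activity"}},
}


def has_location(gene, component):
    return component in ANNOTATIONS[gene]["location"]


def has_function(gene, function):
    return function in ANNOTATIONS[gene]["function"]


def rule_body(gene):
    """has_location(Gene, integral_to_membrane) & has_function(Gene, receptor_activity)"""
    return (has_location(gene, "integral_to_membrane")
            and has_function(gene, "receptor_activity"))


matching = sorted(gene for gene in ANNOTATIONS if rule_body(gene))
print(matching)  # ['gene_a']
```

The set of matching genes is what would then be compared against the expression data for each class.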
