
EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan, Fall 2006. Administrative: register for 3 hours of credit. Instructor details: Luke Huan, assistant professor in Electrical Engineering & Computer Science. Homepage: http://people.eecs.ku.edu/~jhuan/ Office: 2304 Eaton Hall


Presentation Transcript


  1. EECS 800 Research Seminar: Mining Biological Data • Instructor: Luke Huan • Fall 2006

  2. Administrative • Register for 3 hours of credit

  3. Me • Luke Huan, assistant professor in Electrical Engineering & Computer Science • Homepage: http://people.eecs.ku.edu/~jhuan/ • Office: 2304 Eaton Hall • Email: jhuan@eecs.ku.edu • Office hours: 10:00 – 11:00 am, Monday and Wednesday

  4. My Lecture Style • I tend to talk fast, especially when excited • Class materials are highly interdisciplinary • Use your questions to slow me down • Ask for clarification, or for repetition of an unfamiliar phrase or piece of jargon • “If in doubt, speak it out”

  5. You • Introduction: • Who you are • What department you are in • Why you are taking the course

  6. Outline for Today • What is mining biological data? • What is this course about? • Course home page • Course references • Paper presentation • Final project • Grading • Forward class reviewing

  7. What is Mining Biological Data? • Goal: understanding the structure of biological data • Patterns • Descriptive models • Predictive models • Challenges: • What is the nature of the data? • What are the computational tasks? • How to break a task into a group of computational components? • How to evaluate the computational results? • Applications • Experimental design and hypothesis generation • Synthesis of novel proteins • Drug design • …

  8. What is this Course About? • Learning… • Problems in mining biological data • Available techniques, their pros and cons • How to combine techniques together • Enough perception to avoid pitfalls • Practicing… • To present recent papers on a selected topic • To work on a project that may involve • A domain expert, • A driving biological problem, and • The development of new data mining techniques

  9. Class Information • Class Homepage: http://people.eecs.ku.edu/~jhuan/fall06.html • Meeting time: 9:00 – 9:45 Monday, Wednesday, Friday • Meeting place: Eaton Hall 2001 • Prerequisite: none

  10. Textbook & References • Textbook: none • References • Data Mining: Concepts and Techniques, by Han and Kamber, Morgan Kaufmann, 2001. (ISBN: 1-55860-489-8) • The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman, Springer, 2001. (ISBN: 0-387-95284-5) • Bioinformatics: Genes, Proteins, and Computers, edited by Christine Orengo, David Jones, and Janet Thornton, Bios Scientific Publishers, 2003. (ISBN: 1-85996-054-5)

  11. Paper Presentation • One per student • Research paper(s) • A list of recommendations will be posted on the class webpage a week from now • Your own pick (upon approval) • Three parts • Review the goal of the paper(s) • Discuss the research challenges • Present the techniques and comment on their pros and cons • Questions and comments from the audience • Extra credit for active participants in class discussions • Order of presentation: first come, first pick • Please send in your choice of paper by September 1st.

  12. Final Project • Project (due Nov. 27th) • One project • I will post some suggestions on the class website. • I am soliciting projects from researchers on campus • You are welcome to propose your own • Discuss with me before you start • Checkpoints • Proposal: title and goal (due Sep. 8th) • Background and related work (due Sep. 29th) • Outline of approach (due Oct. 20th) • Implementation & evaluation (due Nov. 10th) • Class demo (due Nov. 27th)

  13. Grading • Grading scheme • No homework • No exam

  14. Forward Class Reviewing • This is for overview, not content • Don’t worry if you do not understand some of the words; that’s why you want to take this class. • Gives an idea of what is coming • Order of presentation might be shuffled to accommodate everyone’s schedule • Topics may be adjusted as the class progresses

  15. Week 1: Pattern Mining • Frequent patterns: finding regularities in data • A frequent pattern (a set of items) is one that occurs frequently in a data set • Can we automatically profile customers? • What products are often purchased together? • [Figure: table of customers and their shopping baskets] • One hypothesis: {a, c} ⇒ {m}
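A minimal sketch of how the frequency of an item set, and the strength of a rule of the form "{a, c} implies {m}", might be measured. The baskets below are hypothetical (they are not the data from the slide's figure), and the brute-force enumeration is only meant to illustrate the definitions of support and confidence, not an efficient miner such as Apriori.

```python
from itertools import combinations

# Hypothetical shopping baskets; not taken from the slide.
baskets = [
    {"a", "c", "m"},
    {"a", "c", "m", "p"},
    {"a", "b", "c"},
    {"b", "m"},
    {"a", "c", "m", "f"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force enumeration of item sets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

frequent = frequent_itemsets(baskets, min_support=0.6)
rule_support = support({"a", "c", "m"}, baskets)
rule_confidence = confidence({"a", "c"}, {"m"}, baskets)
print(sorted("".join(sorted(p)) for p in frequent))
print(rule_support, rule_confidence)
```

With these toy baskets, three of the five baskets contain a, c, and m together, and three of the four baskets that contain both a and c also contain m.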

  16. Week 2: Advanced Pattern Mining • Reducing number of patterns • Maximal patterns and closed patterns • Constraint-based mining • Patterns with concept hierarchy • Patterns in quantitative data • Correlation vs. association
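As a rough illustration of how maximal and closed patterns reduce the number of reported patterns, the sketch below filters a hypothetical table of frequent item sets and their support counts. It uses the usual definitions: a frequent pattern is closed if no proper superset has the same support, and maximal if no proper superset is frequent at all. The input table is invented for the example.

```python
# Hypothetical frequent item sets with support counts (out of 5 transactions).
frequent = {
    frozenset("a"): 4,
    frozenset("c"): 4,
    frozenset("m"): 4,
    frozenset("ac"): 4,
    frozenset("am"): 3,
    frozenset("cm"): 3,
    frozenset("acm"): 3,
}

def closed_patterns(freq):
    """Keep patterns that have no proper superset with the same support.

    A superset with equal support is itself frequent, so it is enough
    to look for it among the frequent patterns.
    """
    return {
        p for p, s in freq.items()
        if not any(p < q and s == freq[q] for q in freq)
    }

def maximal_patterns(freq):
    """Keep patterns that have no frequent proper superset."""
    return {p for p in freq if not any(p < q for q in freq)}

closed = closed_patterns(frequent)
maximal = maximal_patterns(frequent)
print(sorted("".join(sorted(p)) for p in closed))
print(sorted("".join(sorted(p)) for p in maximal))
```

In this toy table seven frequent patterns shrink to three closed patterns and one maximal pattern. The closed set still allows every support count to be recovered, while the maximal set only records which patterns are frequent.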

  17. Week 3: Mining Microarray Data • Figure source: Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization”, Molecular Biology of the Cell, 9, 3273-3297.

  18. Week 4: Patterns in Sequences, Trees, and Graphs • [Figure: labeled example structures (G1–G3, q1–q3, s1–s4) and candidate patterns P1–P6, each annotated with its frequency f (2/3 or 3/3); support threshold σ = 2/3]

  19. Week 5: Pattern Discovery in Biomolecules • Protein • A sequence built from 20 amino acids • Adopts a stable 3D structure that can be measured experimentally • [Figure: a protein shown as an amino acid sequence (Lys, Gly, Leu, Val, Ala, His) and in cartoon, space-filling, ribbon, and surface renderings; atom legend: oxygen, nitrogen, carbon, sulfur]

  20. Week 6: Descriptive Models • Group objects into clusters • Objects in the same cluster are similar • Objects in different clusters are dissimilar • Unsupervised learning: no predefined classes • [Figure: scatter plot showing Cluster 1, Cluster 2, and outliers]
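The slide does not name a clustering algorithm. As one common example of grouping objects without predefined classes, here is a small sketch of k-means (Lloyd's algorithm) in plain Python. The points and the choice of initial centers are hypothetical, and the function names are introduced only for this illustration.

```python
import math

def assign(points, centers):
    """Assignment step: put each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, initial_centers, max_iterations=50):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        clusters = assign(points, centers)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical 2D data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, clusters = kmeans(points, initial_centers=[points[0], points[3]])
print(centers)
print(clusters)
```

No class labels are supplied anywhere; the grouping comes only from distances between points, which is what makes this an unsupervised method. Plain k-means has no notion of outliers, so handling them would need an extra step.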

  21. Week 7: Subspace Clustering

  22. Week 7: Subspace Clustering

  23. Week 8: Mining Microarray (II) • Apply subspace clustering to microarray analysis • Find groups of genes that are co-regulated • May integrate data from protein sequences and functional descriptions of genes • Apply subgraph mining to microarray analysis

  24. Week 9: Predictive Models • Two-class version: • Using “training data” from Class +1 and Class -1 • Develop a “rule” for assigning new data to a class • (Slides from J.S. Marron, Statistics, UNC)

  25. Week 10: Classification Algorithms and Applications • Decision tree • Fisher's linear discriminant method • Kernel methods
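To make the idea of learning a two-class "rule" from training data concrete, the following sketch computes Fisher's linear discriminant for 2D points in plain Python. The direction is w = S_w^{-1} (m_pos - m_neg), where S_w is the within-class scatter matrix. The training points, the midpoint threshold, and the helper names are assumptions made for this illustration only.

```python
def mean(points):
    """Component-wise mean of a list of 2D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """2x2 scatter matrix: sum over points of (x - m)(x - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_rule(pos, neg):
    """Return (w, threshold) with w = S_w^{-1} (m_pos - m_neg)."""
    m_pos, m_neg = mean(pos), mean(neg)
    s_pos, s_neg = scatter(pos, m_pos), scatter(neg, m_neg)
    sw = [[s_pos[i][j] + s_neg[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumed non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m_pos[0] - m_neg[0], m_pos[1] - m_neg[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Assumed threshold: projection of the midpoint between the class means.
    mid = [(m_pos[0] + m_neg[0]) / 2, (m_pos[1] + m_neg[1]) / 2]
    return w, w[0] * mid[0] + w[1] * mid[1]

def classify(x, w, threshold):
    """Assign +1 if the projection of x onto w exceeds the threshold, else -1."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else -1

# Hypothetical training data for Class +1 and Class -1.
pos = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.5), (3.5, 3.0)]
neg = [(0.0, 0.5), (1.0, 0.0), (0.5, 1.0), (-0.5, 0.0)]
w, threshold = fisher_rule(pos, neg)
print(classify((3.0, 3.0), w, threshold), classify((0.0, 0.0), w, threshold))
```

The resulting rule is a single linear projection followed by a cutoff. Decision trees and kernel methods, the other items on the slide, produce non-linear rules and are not shown here.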

  26. Week 11: Text Mining, Gene Ontology, Data Management • Ontology seeks to describe or posit the basic categories and relationships of being or existence to define entities and types of entities within its framework. Ontology can be said to study conceptions of reality (Wikipedia). • GO is a database of terms for genes • Terms are connected as a directed acyclic graph • Levels represent specificity of the terms (not normalized) • GO contains three different sub-ontologies: • Molecular function • Biological process • Cellular component
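A small sketch of what it means for terms to be connected as a directed acyclic graph: a term may have more than one parent, so it can reach the root along paths of different lengths. The graph below is a toy example; the "toy_term_*" names and all edges are hypothetical and are not taken from a GO release. The differing path lengths show one possible reading of "levels are not normalized".

```python
# Toy term graph: each term maps to its list of parent terms.
# The toy_term_* entries and all edges are hypothetical.
parents = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic_activity": ["molecular_function"],
    "toy_term_x": ["binding", "catalytic_activity"],  # two parents: a DAG, not a tree
    "toy_term_y": ["toy_term_x"],
    "toy_term_z": ["toy_term_y", "binding"],
}

def ancestors(term, parents):
    """All terms reachable from term by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def depth_range(term, parents):
    """(shortest, longest) number of edges from term up to a root."""
    if not parents[term]:
        return (0, 0)
    ranges = [depth_range(p, parents) for p in parents[term]]
    return (1 + min(lo for lo, _ in ranges), 1 + max(hi for _, hi in ranges))

print(sorted(ancestors("toy_term_z", parents)))
print(depth_range("toy_term_z", parents))
```

Here toy_term_z sits two edges below the root along one path and four edges below it along another, so a single "level" number does not fully describe how specific the term is.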

  27. Week 12: Systems Biology & Proteomics • [Figure: part of the biological system in a cell at the molecular level] • A proteome is the set of all proteins in an organism • Source: http://www.ircs.upenn.edu/modeling2001/

  28. Week 13: Analyzing Biological Networks • Biological networks pose serious challenges and opportunities for data mining research in computer science • Large volume of data • Heterogeneous data types • [Figure: protein-protein interaction in yeast] • [Figure: growth of known structures in the Protein Data Bank (PDB); number of structures (axis up to 35,000) by year] • Source cited on slide: Gary D. Bader & Christopher W.V. Hogue, Nature Biotechnology 20, 991 - 997 (2002)

  29. Week 14: Bio-Data Integration • Data are collected from many different sources • Each piece of data describes part of a complicated (and not directly observable) biological process • Combine the data to achieve better understanding and better prediction

  30. Week 15, 16: Project Presentation • Check what you have learned from the class • Celebrate the hard work!

  31. Further References • Data mining • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc. • Journals: Data Mining and Knowledge Discovery, ACM-TKDD • Bioinformatics • Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc. • Journals: Bioinformatics, J. of Computational Biology, etc.

  32. Further References • AI & Machine Learning • Conferences: ICML, AAAI, IJCAI, etc. • Journals: Machine Learning, Artificial Intelligence, etc. • Statistics • Conferences: Joint Statistical Meetings, etc. • Journals: Annals of Statistics, etc. • Database systems • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, etc. • Journals: ACM-TODS, IEEE-TKDE, etc. • Visualization • Conference proceedings: IEEE Visualization, ACM-SIGGRAPH, etc. • Journals: IEEE Trans. on Visualization and Computer Graphics, etc.
