Data Mining: Knowledge Discovery in Databases. Peter van der Putten, ALP Group, LIACS. Pre-University College Bio Informatics, January 31 2005

Presentation Transcript


  1. Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College Bio Informatics January 31 2005

  2. Topics • Lecture • Demo Data Mining tool • Exercises Data Mining tool • Breaks TBD

  3. Genomic Microarrays – Case Study • Problem: • Leukemia (different types of Leukemia cells look very similar) • Given data for a number of samples (patients), can we • Accurately diagnose the disease? • Predict outcome for given treatment? • Recommend best treatment? • Solution • Data mining on micro-array data

  4. Example: ALL/AML data • 38 training patients, 34 test patients, ~7,000 patient attributes (microarray gene data) • 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML) • Use the training data to build a diagnostic model that separates ALL from AML • Results on test data: • 33/34 correct; the 1 error may be a mislabeled sample

  5. Sources of (artificial) intelligence • Reasoning versus learning • Learning from data • Patient data • Customer records • Stock prices • Piano music • Criminal mugshots • Websites • Robot perceptions • Etc.

  6. Biomedical applications & data • General population survey data • Clinical studies • Patient characteristics • Imaging • Lab tests • Proteomics / genomics • Relating protein / gene structure to biological function • Medical research literature • …

  7. Some working definitions…. • ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably • Data mining = • The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data • Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ….

  8. Some working definitions… • Concepts: kinds of things that can be learned • Aim: intelligible and operational concept description • Example: the relation between patient characteristics and the probability of being diabetic • Instances: the individual, independent examples of a concept • Example: a patient, a candidate drug etc. • Attributes: measured aspects of an instance • Example: age, weight, lab tests, microarray data etc. • Pattern or attribute space

  9. Data mining tasks • Predictive data mining • Classification: classify an instance into a category • Regression: estimate some continuous value • Descriptive data mining • Matching & search: finding instances similar to x • Clustering: discovering groups of similar instances • Association rule extraction: if a & b then c • Summarization: summarizing group descriptions • Link detection: finding relationships • …

  10. Data Mining Tasks: Search • Finding the best matching instances • Every instance is a point in pattern space • Attributes are the dimensions of an instance, e.g. age, weight, gender • Pattern spaces may be high dimensional (10 to thousands of dimensions) • (Figure: pattern space with axes e.g. age and weight)

  11. Data Mining Tasks: Classification • The goal of a classifier is to separate classes on the basis of known attributes • The classifier can then be applied to an instance with unknown class • For instance, classes are healthy (circle) and sick (square); attributes are age and weight • (Figure: pattern space with axes age and weight)

  12. Data Mining Tasks: Clustering • Clustering is the discovery of groups in a set of instances • Groups are different from each other; instances within a group are similar • In a 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user • (Figure: pattern space with axes e.g. age and weight)

  13. Data Mining Tasks: Clustering • Clustering is the discovery of groups in a set of instances • Groups are different from each other; instances within a group are similar • In a 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user • In more than 3 dimensions this is not possible • (Figure: pattern space with axes e.g. age and weight)

  14. Examples of Classification Techniques • Majority class vote • Machine learning & AI • Decision trees • Nearest neighbor • Neural networks • Genetic algorithms / evolutionary computing • Artificial Immune Systems • Good old statistics • …

  15. Example Classification Algorithm 1: Decision Trees
      20,000 patients: age > 67?
        yes: 1,200 patients: weight > 85 kg?
          yes: 400 patients, diabetic (50%)
          no: 800 patients, diabetic (10%)
        no: 18,800 patients: gender = male?
          no: etc.
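The tree on this slide can be read as nested if/else tests. The following is a minimal Python sketch, not the lecture's own code. The function name `diabetic_rate` is hypothetical, and the placement of the 50% and 10% leaves under the weight split is an assumption based on the order in which they appear on the slide. The slide leaves the branch for patients aged 67 or younger open ("etc."), so the sketch returns `None` there.

```python
from typing import Optional


def diabetic_rate(age: float, weight_kg: float) -> Optional[float]:
    """Return the fraction of diabetic patients in the leaf this patient reaches."""
    if age > 67:                 # root split: 1,200 of 20,000 patients
        if weight_kg > 85:       # second split on weight
            return 0.50          # leaf with 400 patients (assumed placement)
        return 0.10              # leaf with 800 patients (assumed placement)
    # The slide continues with "gender = male?" and then "etc.";
    # those leaves are not specified, so nothing is returned for them.
    return None


print(diabetic_rate(age=70, weight_kg=90))
print(diabetic_rate(age=70, weight_kg=80))
```

Each `if` corresponds to one question in the tree, and each `return` to one leaf.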

  16. Decision Trees in Pattern Space • The goal of the classifier is to separate classes (circle, square) on the basis of the attributes age and weight • Each line corresponds to a split in the tree • Decision areas are ‘tiles’ in pattern space • (Figure: pattern space with axes age and weight)

  18. Special Cases of Decision Trees • Depth = 0 • Majority class classifier (ZeroR) • Depth = 1 • One question only • Also known as a decision stump • Depth = n • Any number of branches • Various algorithms exist to learn the tree from data • The major difference between them is the criterion used to decide which attribute value to split on
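The two smallest cases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WEKA's implementation: the function names `zero_r` and `fit_stump` are hypothetical, the stump handles a single numeric attribute only, and the split criterion is plain training error, chosen for simplicity (real tree learners typically use other criteria). The toy ages and labels are made up.

```python
from collections import Counter


def zero_r(labels):
    """Depth 0: always predict the majority class of the training labels."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(values, labels):
    """Depth 1: find the threshold on one numeric attribute with the fewest
    training errors. Returns (threshold, class_if_le, class_if_gt) or None."""
    best = None
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must send instances to both sides
        left_cls, right_cls = zero_r(left), zero_r(right)
        errors = (sum(l != left_cls for l in left)
                  + sum(l != right_cls for l in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_cls, right_cls)
    return None if best is None else best[1:]


# Made-up toy data: ages and class labels.
ages = [30, 45, 50, 60, 75, 80]
labels = ["healthy", "healthy", "healthy", "healthy", "sick", "sick"]

print(zero_r(labels))
print(fit_stump(ages, labels))
```

On this toy data the stump asks one question ("age <= 60?"), while ZeroR ignores the attributes entirely.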

  19. Example classification algorithm 2: Nearest Neighbour • The data itself is the classification model, so there is no abstraction such as a tree • For a given instance x, search for the k instances that are most similar to x • Classify x as the most frequent class among those k most similar instances
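The procedure above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance as the similarity measure (the slide does not name one) and Python 3.8+ for `math.dist`. The function name `knn_classify` and the toy (age, weight) data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, x, k=3):
    """Classify x by majority vote among its k nearest training instances.

    train: list of (attributes, label) pairs; x: tuple of attributes.
    """
    # The training data itself is the model: sort it by distance to x.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Made-up toy data: (age, weight in kg) -> class.
train = [
    ((25, 60), "healthy"), ((30, 65), "healthy"), ((35, 70), "healthy"),
    ((65, 95), "sick"), ((70, 100), "sick"), ((75, 90), "sick"),
]

print(knn_classify(train, (68, 92)))
print(knn_classify(train, (28, 62)))
```

In practice attributes measured on different scales are usually rescaled first, otherwise the attribute with the largest range dominates the distance.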

  20. Nearest Neighbor in Pattern Space • Classification • Any decision area is possible • Condition: enough data is available • (Figure: pattern space with axes e.g. age and weight; legend: marker = new instance)

  21. Nearest Neighbor in Pattern Space • Prediction • Any decision area is possible • Condition: enough data is available • (Figure: pattern space with axes e.g. age and weight)

  22. Example classification algorithm 3: Neural Networks • Inspired by neuronal computation in the brain (McCulloch & Pitts 1943 (!)) • Input (attributes) is coded as activation on the input layer neurons; activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons (for instance diabetic yes/no) • The algorithm learns to find the optimal weights using the training instances and a general learning rule

  23. Neural Networks • Example of a simple network (2 layers) • Probability of being diabetic = f(age × weight_age + body_mass_index × weight_body_mass_index) • (Figure: inputs age and body_mass_index connect through the links weight_age and weight_body_mass_index to the output, probability of being diabetic)
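The formula on this slide can be evaluated directly. The sketch below makes several assumptions that are not on the slide: f is taken to be the logistic function, a `bias` term is added as an optional extra (default 0, which reduces to the slide's formula), and the weight values are made up rather than learned. `w_age` and `w_bmi` stand for the two link weights in the formula.

```python
import math


def logistic(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def p_diabetic(age, bmi, w_age, w_bmi, bias=0.0):
    """Two-layer network: weighted sum of the inputs passed through f."""
    return logistic(age * w_age + bmi * w_bmi + bias)


# Hypothetical weights; a learning rule would normally set these from data.
w_age, w_bmi, bias = 0.02, 0.05, -3.0

print(p_diabetic(70, 30, w_age, w_bmi, bias))
print(p_diabetic(30, 22, w_age, w_bmi, bias))
```

With positive weights, a higher age or body mass index pushes the output closer to 1.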

  24. Neural Networks in Pattern Space • Classification • Simple network: only a line is available (why?) to separate classes • Multilayer network: any classification boundary is possible • (Figure: pattern space with axes e.g. age and weight)

  25. Decision Tree Demo in WEKA, an open source data mining tool

  26. Descriptive data mining: association rules • Discovery of interesting patterns • Rule format: if A (and B and C etc.) then Z • Example: • If customer buys potatoes (A) and sauerkraut (B) then customer buys sausage (Z) • Important measures • Support of the condition: how often do potatoes and sauerkraut occur together (A, B) • Confidence of the rule: how often does sausage then also occur, divided by the support of the condition (is A, B → Z always true?)
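Both measures are simple counts over shopping baskets. A minimal Python sketch, assuming each basket is a set of item names and expressing support as a fraction of all baskets rather than a raw count. The function names and the five toy baskets are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, condition, consequence):
    """Support of condition and consequence together, divided by the
    support of the condition alone."""
    cond = support(transactions, condition)
    both = support(transactions, set(condition) | set(consequence))
    return both / cond if cond else 0.0


# Made-up toy baskets.
baskets = [
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut"},
    {"potatoes", "bread"},
    {"sausage", "bread"},
]

print(support(baskets, {"potatoes", "sauerkraut"}))
print(confidence(baskets, {"potatoes", "sauerkraut"}, {"sausage"}))
```

Here potatoes and sauerkraut occur together in 3 of 5 baskets, and sausage is also present in 2 of those 3, so the rule is not always true.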

  27. Association rule demo in WEKA

  28. Some examples of my research • Using data mining for bio-medical applications • Predicting Survival Rate for Throat Cancer Patients • … • Using bio-medical concepts for data mining • Artificial immune systems: learning computers based on the metaphor of the natural immune system

  29. What have we learned so far? • Learning versus reasoning • Data mining definitions • Data mining tasks • Example data mining techniques for classification • Example data mining techniques for association rules • WEKA Demos • And now: lab sessions