
DATA MINING Extracting Knowledge From Data


Presentation Transcript


1. DATA MINING: Extracting Knowledge From Data
Petr Olmer, CERN
petr.olmer@cern.ch

2. Motivation
Computers are useless, they can only give you answers.
• What if we do not know what to ask?
• How to discover knowledge in databases without a specific query?

3. Many terms, one meaning
• Data mining
• Knowledge discovery in databases
• Data exploration
• A non-trivial extraction of novel, implicit, and actionable knowledge from large databases
  • without a specific hypothesis in mind!
• Techniques for discovering structural patterns in data.

4. What is inside?
• Databases
  • data warehousing
• Statistics
  • methods
  • but a different data source!
• Machine learning
  • output representations
  • algorithms

5. CRISP-DM: Cross Industry Standard Process for Data Mining
Phases: task specification → data understanding → data preparation → modeling → evaluation → deployment
http://www.crisp-dm.org

  6. Input data: Instances, attributes

  7. example: input data
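As an illustration of instances and attributes, here is a minimal Python sketch of how such a table could be represented, using a few rows in the style of the weather data that the later 1R, Naïve Bayes, ID3, and PRISM examples count over (attribute names come from those slides; the rows are only an illustrative excerpt, not the full table):

```python
# A dataset is a list of instances; each instance assigns a value to every
# attribute. The last attribute ("play") is the class to be predicted.
attributes = ["outlook", "temp", "humidity", "windy", "play"]

instances = [
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "overcast", "temp": "hot",  "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "sunny",    "temp": "cool", "humidity": "normal", "windy": False, "play": "yes"},
]
```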

8. Output data: Concepts
• Concept description = what is to be learned
• Classification learning
• Association learning
• Clustering
• Numeric prediction

9. Task classes
• Predictive tasks
  • Predict an unknown value of the output attribute for a new instance.
• Descriptive tasks
  • Describe structures or relations of attributes.
• Instances are not related!

10. Models and algorithms
• Decision trees
• Classification rules
• Association rules
• k-nearest neighbors
• Cluster analysis

11. Decision trees
• Inner nodes test a particular attribute against a constant.
• Leaf nodes classify all instances that reach the leaf.
Example:
  a >= 5 → class C1
  a < 5  → test b
    b = blue → test a:  a > 0 → class C1,  a <= 0 → class C2
    b = red  → test c:  c = hot → class C2,  c = cold → class C1

12. Classification rules
• If precondition then conclusion
• An alternative to decision trees
• Rules can be read off a decision tree
  • one rule for each leaf
  • unambiguous, not ordered
  • more complex than necessary
If (a >= 5) then class C1
If (a < 5) and (b = "blue") and (a > 0) then class C1
If (a < 5) and (b = "red") and (c = "hot") then class C2

13. Classification rules: ordered or not ordered execution?
• Ordered
  • rules out of context can be incorrect
  • widely used
• Not ordered
  • different rules can lead to different conclusions
  • mostly used in boolean closed worlds
    • only "yes" rules are given
    • one rule in DNF

14. Decision trees / Classification rules: the 1R algorithm
for each attribute:
    for each value of that attribute:
        count how often each class appears
        find the most frequent class
        rule = assign the class to this attribute-value
    calculate the error rate of the rules
choose the rules with the smallest error rate
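A minimal, runnable sketch of 1R in Python, assuming the list-of-dictionaries dataset shape from the input-data sketch above (function and variable names are illustrative):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """Return (attribute, rules, error_rate) for the best single-attribute classifier."""
    best = None
    for attr in attributes:
        # For each value of the attribute, count how often each class appears.
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        # Rule: assign the most frequent class to each attribute value.
        rules = {value: classes.most_common(1)[0][0] for value, classes in counts.items()}
        # Error rate of this rule set over the whole dataset.
        errors = sum(1 for inst in instances if rules[inst[attr]] != inst[class_attr])
        rate = errors / len(instances)
        if best is None or rate < best[2]:
            best = (attr, rules, rate)
    return best
```

For example, `one_r(instances, ["outlook", "temp", "humidity", "windy"], "play")` would pick outlook on the full weather data (humidity ties at 4/14 errors, as the next slide shows; ties go to the attribute listed first).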

15. example: 1R
outlook:  sunny → no (errors 2/5), overcast → yes (0/4), rainy → yes (2/5); total 4/14
temp.:    hot → no* (2/4), mild → yes (2/6), cool → yes (1/4); total 5/14
humidity: high → no (3/7), normal → yes (1/7); total 4/14
windy:    false → yes (2/8), true → no* (3/6); total 5/14
(* arbitrary choice between two equally frequent classes)

16. Decision trees / Classification rules: Naïve Bayes algorithm
• Attributes are assumed to be
  • equally important
  • independent
• For a new instance, we compute the probability of each class.
• Assign the most probable class.
• We use the Laplace estimator in case of zero probability.
• Attribute dependencies reduce the power of NB.

17. example: Naïve Bayes
New instance: outlook = sunny, temp. = cool, humidity = high, windy = true
yes: sunny 2/9 · cool 3/9 · high 3/9 · true 3/9 · overall 9/14 → 0.0053 (20.5 %)
no:  sunny 3/5 · cool 1/5 · high 4/5 · true 3/5 · overall 5/14 → 0.0206 (79.5 %)
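A sketch of the counting behind Naïve Bayes on the same dataset shape, with a Laplace estimator so zero counts do not zero out the product (which shifts the raw fractions slightly compared with the example above; names are illustrative):

```python
from collections import Counter

def naive_bayes(instances, attributes, class_attr, new_instance, k=1):
    """Return {class: probability} for new_instance, using a Laplace estimator k."""
    class_counts = Counter(inst[class_attr] for inst in instances)
    scores = {}
    for cls, n_cls in class_counts.items():
        # Prior probability of the class.
        score = n_cls / len(instances)
        for attr in attributes:
            n_values = len({inst[attr] for inst in instances})
            n_match = sum(1 for inst in instances
                          if inst[class_attr] == cls and inst[attr] == new_instance[attr])
            # Laplace estimator: add k to the count, k * n_values to the total.
            score *= (n_match + k) / (n_cls + k * n_values)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}  # normalize to percentages
```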

18. Decision trees: ID3, a recursive algorithm
• Select the attribute with the biggest information gain to place at the root node.
• Make one branch for each possible value.
• Build the subtrees.
• Information required to specify the class:
  • when a branch is empty: zero
  • when the branches are equal: a maximum
  • f(a, b, c) = f(a, b + c) + g(b, c)
• Entropy: entropy(p1, …, pn) = −p1 log2 p1 − p2 log2 p2 − … − pn log2 pn
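A short sketch of the entropy and information-gain computation ID3 uses to pick the splitting attribute (same dataset shape as before; names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(class_counts):
    """Entropy of a class distribution, e.g. entropy([9, 5]) ≈ 0.940 bits."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def information_gain(instances, attr, class_attr):
    """Reduction in entropy obtained by splitting on attr."""
    before = entropy(Counter(i[class_attr] for i in instances).values())
    after = 0.0
    for value in {i[attr] for i in instances}:
        subset = [i for i in instances if i[attr] == value]
        counts = Counter(i[class_attr] for i in subset).values()
        after += len(subset) / len(instances) * entropy(counts)
    return before - after
```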

19. example: ID3
Class counts (yes : no) over the whole dataset:
  outlook:  sunny 2:3, overcast 4:0, rainy 3:2
  temp.:    hot 2:2, mild 4:2, cool 3:1
  humidity: high 3:4, normal 6:1
  windy:    false 6:2, true 3:3
Within the outlook = sunny branch:
  temp.:    hot 0:2, mild 1:1, cool 1:0
  humidity: high 0:3, normal 2:0
  windy:    false 1:2, true 1:1

20. example: ID3
Resulting tree:
  outlook
    sunny    → humidity:  high → no,  normal → yes
    overcast → yes
    rainy    → windy:  false → yes,  true → no

21. Classification rules: PRISM, a covering algorithm
• Produces only correct, unordered rules.
• For each class, seek a way of covering all instances in it.
• Start with: If ? then class C1.
• Choose an attribute-value pair to maximize the probability of the desired classification:
  • include as many positive instances as possible
  • exclude as many negative instances as possible
• Improve the precondition.
• There can be more rules for a class!
• Delete the covered instances and try again.
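A sketch of PRISM's core selection step: among candidate attribute-value pairs, pick the one with the highest proportion of positive instances among those it covers, breaking ties by coverage (names are illustrative):

```python
def best_pair(instances, attributes, class_attr, target_class):
    """Pick the attribute-value pair maximizing p/t, where t instances are covered
    and p of them belong to target_class; ties are broken by larger p."""
    best, best_key = None, (-1.0, 0)
    for attr in attributes:
        for value in {inst[attr] for inst in instances}:
            covered = [i for i in instances if i[attr] == value]
            p = sum(1 for i in covered if i[class_attr] == target_class)
            key = (p / len(covered), p)
            if key > best_key:
                best, best_key = (attr, value), key
    return best, best_key
```

The chosen pair is added to the rule's precondition, the instances are restricted to those the rule covers, and the step repeats until the rule is perfect; covered instances are then deleted and the next rule is started, as the following example slides show.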

22. example: PRISM
Rule under construction: If ? then P = yes
  O = sunny 2/5, O = overcast 4/4, O = rainy 3/5
  T = hot 2/4, T = mild 4/6, T = cool 3/4
  H = high 3/7, H = normal 6/7
  W = false 6/8, W = true 3/6
Best pair: O = overcast (4/4)  →  If O = overcast then P = yes

23. example: PRISM
Rule under construction: If ? then P = yes (covered overcast instances removed)
  O = sunny 2/5, O = rainy 3/5
  T = hot 0/2, T = mild 3/5, T = cool 2/3
  H = high 1/5, H = normal 4/5
  W = false 4/6, W = true 1/4
Best pair: H = normal (4/5)  →  If H = normal then P = yes

24. example: PRISM
Rule under construction: If H = normal and ? then P = yes
  O = sunny 2/2, O = rainy 2/3
  T = mild 2/2, T = cool 2/3
  W = false 3/3, W = true 1/2
Best pair: W = false (3/3)  →  If H = normal and W = false then P = yes

25. example: PRISM
Rules found for P = yes:
  If O = overcast then P = yes
  If H = normal and W = false then P = yes
  If T = mild and H = normal then P = yes
  If O = rainy and W = false then P = yes

26. Association rules
• Structurally the same as classification rules: if-then
• Can predict any attribute or a combination of attributes
• Not intended to be used together
• Characteristics (a = instances where condition and conclusion both hold, b = instances where only the condition holds):
  • Support = a
  • Accuracy = a / (a + b)

27. Association rules: multiple consequences
The rule
  If A and B then C and D
also gives rise to:
  If A and B then C
  If A and B then D
  If A and B and C then D
  If A and B and D then C

28. Association rules: algorithm
• Algorithms for classification rules can be used
  • very inefficient
• Instead, we seek rules with a given minimum support, and test their accuracy.
• Item sets: combinations of attribute-value pairs
  • Generate item sets with the given support.
  • From them, generate rules with the given accuracy.
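A brute-force sketch of the item-set approach for small datasets: enumerate attribute-value combinations that reach the minimum support, then keep the condition ⇒ conclusion splits that reach the minimum accuracy (thresholds and names are illustrative; real implementations prune the search by growing item sets incrementally):

```python
from itertools import combinations

def covers(instance, item_set):
    """True if the instance has every attribute-value pair in item_set."""
    return all(instance[attr] == value for attr, value in item_set)

def frequent_item_sets(instances, attributes, min_support, max_size=3):
    """All attribute-value combinations covered by at least min_support instances."""
    pairs = [(a, v) for a in attributes for v in {i[a] for i in instances}]
    result = []
    for size in range(1, max_size + 1):
        for item_set in combinations(pairs, size):
            if len({a for a, _ in item_set}) < size:
                continue  # skip sets that use the same attribute twice
            support = sum(1 for inst in instances if covers(inst, item_set))
            if support >= min_support:
                result.append((item_set, support))
    return result

def rules_from(item_set, support, instances, min_accuracy):
    """Split an item set into condition => conclusion rules with enough accuracy."""
    rules = []
    for n in range(1, len(item_set)):
        for conclusion in combinations(item_set, n):
            condition = tuple(i for i in item_set if i not in conclusion)
            cond_count = sum(1 for inst in instances if covers(inst, condition))
            if cond_count and support / cond_count >= min_accuracy:
                rules.append((condition, conclusion, support / cond_count))
    return rules
```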

29. k-nearest neighbors
• Instance-based representation
  • no explicit structure
  • lazy learning
• A new instance is compared with existing ones
  • distance metric:
    • a = b: d(a, b) = 0
    • a ≠ b: d(a, b) = 1
  • the closest k instances are used for classification
    • majority
    • average
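A minimal sketch of k-nearest-neighbor classification with the 0/1 per-attribute distance from this slide, using a majority vote among the closest stored instances (the value of k and the names are illustrative):

```python
from collections import Counter

def distance(a, b, attributes):
    """Sum of per-attribute distances: 0 if the values match, 1 otherwise."""
    return sum(0 if a[attr] == b[attr] else 1 for attr in attributes)

def knn_classify(instances, attributes, class_attr, new_instance, k=3):
    """Majority class among the k stored instances closest to new_instance."""
    neighbors = sorted(instances,
                       key=lambda inst: distance(inst, new_instance, attributes))[:k]
    votes = Counter(inst[class_attr] for inst in neighbors)
    return votes.most_common(1)[0][0]
```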

  30. example: kNN

31. Cluster analysis
• Diagram: how the instances fall into clusters.
• One instance can belong to more than one cluster.
• Belonging can be probabilistic or fuzzy.
• Clusters can be hierarchical.

32. Data mining: conclusion
• Different algorithms discover different knowledge in different formats.
• Simple ideas often work very well.
• There's no magic!

33. Text mining
• Data mining discovers knowledge in structured data.
• Text mining works with unstructured text:
  • groups similar documents
  • classifies documents into a taxonomy
  • finds out the probable author of a document
  • …
• Is it a different task?

34. How do mathematicians work
Settings 1: empty kettle, fire, source of cold water, tea bag
How to prepare tea:
  put water into the kettle
  put the kettle on the fire
  when the water boils, put the tea bag in the kettle
Settings 2: kettle with boiling water, fire, source of cold water, tea bag
How to prepare tea:
  empty the kettle
  follow the previous case

35. Text mining: is it different?
• Maybe it is, but we do not care.
• We convert free text to structured data…
• … and "follow the previous case".

36. Google News: how does it work?
http://news.google.com
• Search the web for news.
  • Parse the content of given web sites.
• Convert news (documents) to structured data.
  • Documents become vectors.
• Cluster analysis.
  • Similar documents are grouped together.
• Importance analysis.
  • Important documents are at the top.

37. From documents to vectors
• We match documents with terms
  • can be given (ontology)
  • can be derived from the documents
• Documents are described as vectors of weights
  • d = (1, 0, 0, 1, 1)
    • t1, t4, t5 are in d
    • t2, t3 are not in d

38. TFIDF: Term Frequency / Inverse Document Frequency
• TF(t, d) = how many times t occurs in d
• DF(t) = in how many documents t occurs at least once
• IDF(t) = the inverse of DF(t), typically log(N / DF(t)) for N documents
• A term is important if its
  • TF is high
  • IDF is high
• Weight(d, t) = TF(t, d) · IDF(t)
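A small sketch of the TF/DF counting and the resulting term weights; the slide leaves the exact IDF formula open, so the common log(N / DF(t)) form is assumed here:

```python
from collections import Counter
from math import log

def tfidf_vectors(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n = len(documents)
    # DF(t): number of documents in which t occurs at least once.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # TF(t, d): how many times t occurs in d
        vectors.append({term: count * log(n / df[term]) for term, count in tf.items()})
    return vectors
```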

39. Cluster analysis
• Vectors
• Cosine similarity
• On-line analysis
  • A new document arrives.
  • Try k-nearest neighbors.
  • If the neighbors are too far, leave it alone.
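A sketch of cosine similarity on the sparse term-weight vectors from the previous slide, plus the on-line step described here: find the k most similar stored documents and leave the new one alone when even they are too far (the 0.2 threshold and the names are illustrative assumptions):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def assign(new_vec, stored, k=3, threshold=0.2):
    """Indices of the k most similar stored vectors, or [] if all are too far."""
    ranked = sorted(range(len(stored)),
                    key=lambda i: cosine(new_vec, stored[i]), reverse=True)[:k]
    return [i for i in ranked if cosine(new_vec, stored[i]) >= threshold]
```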

40. Text mining: conclusion
• Text mining is very young.
  • Research is heavily ongoing.
• We convert text to data.
  • Documents to vectors
  • Term weights: TFIDF
• We can use data mining methods.
  • Classification
  • Cluster analysis
  • …

41. References
• Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
• Michael W. Berry: Survey of Text Mining: Clustering, Classification, and Retrieval
• http://kdnuggets.com/
• http://www.cern.ch/Petr.Olmer/dm.html

42. Questions?
Computers are useless, they can only give you answers.
Petr Olmer
petr.olmer@cern.ch
