






Presentation Transcript

  1. INTRODUCTION TO DATA MINING Pinakpani Pal Electronics & Communication Sciences Unit Indian Statistical Institute pinak@isical.ac.in

  2. Main Sources
• Data Mining: Concepts and Techniques – Jiawei Han and Micheline Kamber, 2007
• Handbook of Data Mining and Knowledge Discovery – Willi Klosgen and Jan M. Zytkow, 2002
• Fast Algorithms for Mining Association Rules and Sequential Patterns – R. Srikant, Ph.D. thesis, University of Wisconsin-Madison, 1996
• "Parallel & distributed association mining: a survey" – M. J. Zaki, IEEE Concurrency, 7(4), pp. 14-25, 1999

  3. Prelude
• Data Mining is a method of finding interesting trends or patterns in large datasets.
• The collected data may be incomplete, heterogeneous and historical.
• Since the data volume is very large, efficiency and scalability are two very important criteria for data mining algorithms.
• Data Mining tools are expected to involve minimal user intervention.

  4. Prelude
• Data mining deals with finding patterns in data that are either
• user-defined (pre-specified by the user),
• interesting (judged with the help of an interestingness measure), or
• valid (meeting a pre-defined validity criterion).
• Discovered patterns help and guide the appropriate authority in making future decisions, so Data Mining is regarded as a tool for Decision Support.

  5. Data Mining Communities
• Statistics: provides the background for the algorithms.
• Artificial Intelligence: provides the required heuristics for machine learning / conceptual clustering.
• Database: provides the platform for storage and retrieval of raw and summary data.

  6. Data Mining
Mining knowledge from large amounts of data. Evolution:
• Data collection
• Database creation
• Data management
• Data storage
• Retrieval
• Transaction processing

  7. Data Mining
• Advanced data analysis: data warehousing and data mining

  8. Data Mining Components
• Information Repository: single or multiple heterogeneous data sources
• Data Server: stores and retrieves the relevant data
• Knowledgebase: concept hierarchies, constraints, thresholds, metadata
• Pattern Extraction: characterization, discrimination, association, classification, prediction, clustering, various statistical analyses
• Pattern Evaluation: interestingness measures

  9. Stages of the Data Mining Process
Misconception: data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.
Steps:
• [Data Collection]
• web crawling / warehousing

  10. Stages of the Data Mining Process
Steps (contd.):
• Data Preprocessing & Feature Extraction
• Data cleaning: elimination of erroneous and irrelevant data
• Data integration: combining data from multiple sources
• Data selection / reduction: accepting only the attributes of the data that are interesting for the problem domain
• Data transformation: normalization, aggregation

  11. Stages of the Data Mining Process
Steps (contd.):
• Pattern Extraction & Evaluation
• Identification of data mining primitives and interestingness measures is done at this stage.
• Visualization of data
• Making the results easily understandable
• Evaluation of results
• Not every software-discovered fact is useful to human beings!

  12. Data Preprocessing
Data Cleaning: Data may be incomplete, noisy and inconsistent. Attempts are made to identify outliers, smooth out noise, fill in missing values and correct inconsistencies.

  13. Data Preprocessing
Data Integration: Data analysis may involve integrating data from different sources, as in a Data Warehouse. The sources may include databases, data cubes or flat files.

  14. Data Preprocessing
Data Reduction: Since both the data volume and the attribute set may be too large, data reduction becomes necessary. It includes activities such as removal of irrelevant and redundant attributes, data compression, and aggregation or generation of summary data.

  15. Data Preprocessing
Transformation: Data need to be transformed or consolidated into forms suitable for mining. This may include activities such as generalization, normalization (e.g. attribute values converted from absolute values to ranges) and construction of new attributes.
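The normalization step mentioned above can be illustrated with a small sketch. Min-max scaling is one common choice (an assumption here, since the slide does not name a specific method):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values from [min(values), max(values)] into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

# Absolute attribute values mapped into the range [0, 1]
print(min_max_normalize([10, 20, 30, 40]))
```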

  16. Patterns
• Descriptive – characterizing general properties of the data
• Predictive – performing inference on the current data in order to make predictions
• Discover:
• multiple kinds of patterns, to accommodate different user expectations / applications (users may specify hints to guide the search)
• patterns at various granularities

  17. Frequent Patterns
Patterns that occur frequently in the data. Types:
• Itemsets
• Subsequences
• Substructures (sub-graphs, sub-trees, sub-lattices)

  18. Discovery of Association Rules
To identify the features or items in a problem domain that tend to appear together. These features or items are said to be associated. The process is to find the set of all subsets of items or attributes that frequently occur in many database records or transactions, and additionally, to extract rules on how a subset of items influences the presence of another subset.

  19. Association Rule: Example
A user studying the buying habits of customers may choose to mine association rules of the form:
P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z) [support = n%, confidence = m%]
Meta-rules such as the following can be specified:
occupation(X, "student") ∧ age(X, "20...29") ⇒ buys(X, "mobile") [1.4%, 70%]

  20. Association Rule: Single/Multi
Single-dimensional association rule:
buys(X, "computer") ⇒ buys(X, "antivirus") [1.1%, 55%]
OR "computer" ⇒ "antivirus" (A ⇒ B) [1.1%, 55%]
Multi-dimensional association rule:
occupation(X, "student") ∧ age(X, "20...29") ⇒ buys(X, "mobile") [1.4%, 70%]

  21. Metrics for Interestingness measures
Interestingness measures in knowledge discovery help to identify the relevance of the patterns discovered during the mining process.

  22. Interestingness measures
• Used to confine the number of uninteresting patterns returned by the process.
• Based on the structure of patterns and the statistics underlying them.
• Associated with a threshold which can be controlled by the user:
• patterns not meeting the threshold are not presented to the user.

  23. Interestingness measures: objective
Objective measures of pattern interestingness:
• simplicity
• utility (support)
• certainty (confidence)
• novelty

  24. Interestingness measures: simplicity
Simplicity: a pattern's interestingness is based on its overall simplicity for human comprehension.
e.g. rule length is a simplicity measure.

  25. Interestingness measures: support
Utility (support): usefulness of a pattern.
support(A ⇒ B) = P(A ∪ B)
The support for an association rule {A} ⇒ {B} is the % of all the transactions under analysis that contain this itemset.

  26. Interestingness measures: confidence
Certainty (confidence): assesses the validity or trustworthiness of a pattern. Confidence is a certainty measure:
confidence(A ⇒ B) = P(B | A)
The confidence for an association rule {A} ⇒ {B} is the % of transactions containing A that also contain B. Association rules that satisfy both the confidence and support thresholds are referred to as strong association rules.
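The two measures can be sketched directly from their definitions; a minimal illustration (function names are assumptions, and the example transactions are the Pen/Ink/Diary baskets used later in the deck):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """P(rhs | lhs) = support(lhs U rhs) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

transactions = [
    {"Pen", "Ink", "Diary", "WritingPad"},
    {"Pen", "Ink", "Diary"},
    {"Pen", "Diary"},
    {"Pen", "Ink", "WritingPad"},
]
print(support(transactions, {"Pen", "Ink"}))        # 0.75 (3 of 4 baskets)
print(confidence(transactions, {"Pen"}, {"Ink"}))   # 0.75
```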

  27. Interestingness measures: novelty
Novelty: Patterns contributing new information to the given pattern set are called novel patterns, e.g. data exceptions. Removing redundant patterns is a strategy for detecting novelty.

  28. Market Basket data analysis
Let a transaction be defined as the variety of items purchased by a customer in one visit, irrespective of the quantity of each item purchased. The problem is to find the items that a customer tends to buy together.

  29. Market Basket data analysis
An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of the expression is that transactions containing X tend to contain Y as well. The inverse may not be true. Since only the presence or absence of items is considered, and not the quantity purchased, such rules are called Binary Association Rules.

  30. Market Basket data analysis
The purpose is to study consumers' purchase patterns in departmental stores. Consider four possible transactions:
1 - {Pen, Ink, Diary, Writing Pad}
2 - {Pen, Ink, Diary}
3 - {Pen, Diary}
4 - {Pen, Ink, Writing Pad}

  31. Market Basket data analysis
A possible association rule: "Purchase of Pen implies the purchase of Ink or Diary"
{Pen} ⇒ {Ink} or {Pen} ⇒ {Diary}
Basically, the rule is of the form {LHS} ⇒ {RHS} where both {LHS} and {RHS} are sets of items, called itemsets, and {LHS} ∩ {RHS} = ∅.
• {Pen, Ink} is a 2-itemset.

  32. Binary Association Rule Mining
Two-step process:
• Find all frequent itemsets
• An itemset will be considered for mining rules if its support is above a threshold called minsup.
• Generate strong association rules from the frequent itemsets
• Acceptance of a rule is once again through a threshold, called minconf.

  33. Finding Frequent Itemsets
If there are N items in a market basket and the association is studied for all possible item combinations, a total of 2^N combinations are to be checked.

  34. Finding Frequent Itemsets
All nonempty subsets of a frequent itemset must also be frequent (the anti-monotone property).
Apriori Algorithm
An itemset is frequent when its occurrence in the total dataset exceeds minsup. If there exist N items, the algorithm attempts to compute frequent itemsets from 1-itemsets to N-itemsets.

  35. Apriori Algorithm
The algorithm has two steps:
• Join step: candidate k-itemsets are generated by joining the frequent (k-1)-itemsets.
• Prune step: if a k-itemset fails to cross the minsup threshold, all supersets of that k-itemset are no longer considered for association rule discovery.

  36. Apriori Algorithm
• Let Lk be the set of frequent k-itemsets.
• Let Ck be the set of candidate k-itemsets. Each member of this set has two fields – itemset and support count.

  37. Apriori Algorithm
1. k ← 1
2. Generate L1, the frequent itemsets of length 1
3. If (Lk = ∅) OR (k = N), goto Step 7
4. k ← k + 1
5. Generate Lk, the frequent itemsets of length k, by Join and Prune
6. Goto Step 3
7. Stop
Output: ∪k Lk

  38. Apriori Algorithm
Join()
forall (i, j) where i ∈ Lk-1 and j ∈ Lk-1, i ≠ j
  select all possible k-itemsets and insert into Ck
endfor
If L3 = {{{1 2 3}, s123}, {{1 2 4}, s124}, {{1 3 4}, s134}, {{1 3 5}, s135}, {{2 3 4}, s234}}
then C4 = {{{1 2 3 4}, s1234}, {{1 3 4 5}, s1345}}

  39. Apriori Algorithm
Prune()
forall itemsets c ∈ Ck do
  forall (k-1)-subsets s of c do
    if (s ∉ Lk-1) then delete c from Ck endif
  endfor
endfor
Lk ← Ck
L4 = {{{1 2 3 4}, s1234}}
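The join, prune and support-counting steps above can be put together in a short sketch. This is a minimal illustration, not the exact pseudocode from the slides: the function name is an assumption, and minsup is taken here as an absolute transaction count rather than a percentage.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= minsup}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = set()
        prev = list(Lk)
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    candidates.add(union)
        # Prune step: drop candidates having an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count supports with one scan over the transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {s: c for s, c in counts.items() if c >= minsup}
        frequent.update(Lk)
        k += 1
    return frequent

# The four Pen/Ink/Diary transactions shown earlier in the deck
transactions = [
    {"Pen", "Ink", "Diary", "WritingPad"},
    {"Pen", "Ink", "Diary"},
    {"Pen", "Diary"},
    {"Pen", "Ink", "WritingPad"},
]
freq = apriori(transactions, minsup=3)
```

With minsup = 3, {Pen, Ink, Diary} is pruned because its subset {Ink, Diary} occurs in only two transactions.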

  40. Rule Generation
Rule generation should ensure production of rules that satisfy only the minimum confidence threshold:
• because rules are generated from frequent itemsets, they automatically satisfy the minimum support threshold.
Given a frequent itemset li, find all non-empty subsets f ⊂ li such that f ⇒ (li − f) satisfies the minimum confidence requirement.
• If |li| = k, then there are 2^k − 2 candidate association rules.

  41. Rule Generation
Algorithm:
forall li, |li| ≥ 2 do
  call genrule(li, li)
endfor

  42. Rule Generation
genrule(lk, fm)
F ← {(m−1)-itemsets fm-1 | fm-1 ⊂ fm}
forall fm-1 ∈ F do
  conf ← sup(lk) / sup(fm-1)
  if (conf ≥ minconf) then
    print rule "fm-1 ⇒ (lk − fm-1)", conf, sup(lk)
    if (m − 1 > 1) then
      call genrule(lk, fm-1)
    endif
  endif
endfor

  43. Rule Generation
If {A, B, C, D} is a frequent itemset, the candidate rules are:
{ABC} ⇒ {D}, {ABD} ⇒ {C}, {ACD} ⇒ {B}, {BCD} ⇒ {A},
{AB} ⇒ {CD}, {AC} ⇒ {BD}, {AD} ⇒ {BC}, {BC} ⇒ {AD}, {BD} ⇒ {AC}, {CD} ⇒ {AB},
{A} ⇒ {BCD}, {B} ⇒ {ACD}, {C} ⇒ {ABD}, {D} ⇒ {ABC}

  44. Rule Generation
In general, confidence does not have an anti-monotone property:
c({ABC} ⇒ {D}) can be larger or smaller than c({AB} ⇒ {D}).
But the confidence of rules generated from the same itemset does have an anti-monotone property:
• confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
e.g., L = {A, B, C, D}: c({ABC} ⇒ {D}) ≥ c({AB} ⇒ {CD}) ≥ c({A} ⇒ {BCD})
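Rule generation can be sketched as a brute-force enumeration of all 2^k − 2 candidates per frequent itemset; note this is a simplification of the recursive genrule procedure above, which uses the anti-monotone property to prune. Function name and example support counts (from the Pen/Ink/Diary baskets) are illustrative assumptions.

```python
from itertools import combinations

def gen_rules(frequent, minconf):
    """Emit (lhs, rhs, conf) for every rule lhs -> rhs meeting minconf.

    `frequent` maps frozenset -> support count; every subset of a frequent
    itemset is guaranteed to be present there (anti-monotone property)."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):          # 2^k - 2 candidates in all
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]      # conf = sup(itemset) / sup(lhs)
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Support counts over the four Pen/Ink/Diary transactions
frequent = {
    frozenset({"Pen"}): 4, frozenset({"Ink"}): 3, frozenset({"Diary"}): 3,
    frozenset({"Pen", "Ink"}): 3, frozenset({"Pen", "Diary"}): 3,
}
rules = gen_rules(frequent, minconf=0.9)  # e.g. {Ink} -> {Pen} with conf 1.0
```

At minconf = 0.9 only {Ink} ⇒ {Pen} and {Diary} ⇒ {Pen} survive; {Pen} ⇒ {Ink} has confidence 3/4 and is rejected.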

  45. Case Study
To find the association among the species of trees present in a forest. The problem is to find a set of association rules which would indicate the species of trees that usually appear together, and also whether a set of species ensures the presence of another set of species with a minimum degree of confidence specified a priori.

  46. Data Collection
A forest area is divided into a number of transects. A group of surveyors walks through each such transect to identify the different species of trees and their numbers of occurrences.

  47. Data

  48. Converting the Data

  49. Drawbacks
Support and confidence as used by Apriori allow a lot of rules which are not necessarily interesting. Two options to extract interesting rules:
• using subjective knowledge
• using objective measures (measures better than confidence)

  50. Subjective approaches
• Visualization – users are allowed to interactively verify the discovered rules
• Template-based approach – filter out rules that do not fit user-specified templates
• Subjective interestingness measure – filter out rules that are obvious (e.g. bread ⇒ butter) and rules that are non-actionable (do not lead to profits)
