## INTRODUCTION TO DATA MINING


Pinakpani Pal
Electronics & Communication Sciences Unit
Indian Statistical Institute
pinak@isical.ac.in

**Main Sources**

• Data Mining: Concepts and Techniques – Jiawei Han and Micheline Kamber, 2007
• Handbook of Data Mining and Knowledge Discovery – Willi Klosgen and Jan M. Zytkow, 2002
• Fast Algorithms for Mining Association Rules and Sequential Patterns – R. Srikant, Ph.D. Thesis, University of Wisconsin-Madison, 1996
• "Parallel & distributed association mining: a survey" – M. J. Zaki, IEEE Concurrency, 7(4), pp. 14–25, 1999

**Prelude**

• Data Mining is a method of finding interesting trends or patterns in large datasets.
• The collected data may be incomplete, heterogeneous and historical.
• Since data volumes are very large, efficiency and scalability are two very important criteria for data mining algorithms.
• Data Mining tools are expected to involve minimal user intervention.

**Prelude**

• Data mining deals with finding patterns in data that are
  • user-defined (pre-specified by the user),
  • interesting (identified with the help of an interestingness measure), or
  • valid (validity criteria pre-defined).
• Discovered patterns help and guide the appropriate authority in taking future decisions, so Data Mining is regarded as a tool for Decision Support.

**Data Mining Communities**

• Statistics: provides the background for the algorithms.
• Artificial Intelligence: provides the required heuristics for machine learning / conceptual clustering.
• Database: provides the platform for storage and retrieval of raw and summary data.

**Data Mining**

Mining knowledge from large amounts of data.
Evolution:
• Data collection
• Database creation
• Data management
• Data storage
• Retrieval
• Transaction processing

**Data Mining**

• Advanced data analysis: data warehousing and data mining

**Data Mining Components**

• Information Repository: single or multiple heterogeneous data sources
• Data Server: storing and retrieving the relevant data
• Knowledge Base: concept hierarchies, constraints, thresholds, metadata
• Pattern Extraction: characterization, discrimination, association, classification, prediction, clustering, various statistical analyses
• Pattern Evaluation: interestingness measures

**Stages of the Data Mining Process**

Misconception: data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.

Steps:
• Data Collection
  • web crawling / warehousing

**Stages of the Data Mining Process**

Steps (contd.):
• Data Preprocessing & Feature Extraction
  • Data cleaning: elimination of erroneous and irrelevant data
  • Data integration: combining data from multiple sources
  • Data selection / reduction: accept only the attributes of the data that are interesting for the problem domain
  • Data transformation: normalization, aggregation

**Stages of the Data Mining Process**

Steps (contd.):
• Pattern Extraction & Evaluation
  • Identification of data mining primitives and interestingness measures is done at this stage.
• Visualization of data
  • Making the results easily understandable
• Evaluation of results
  • Not every software-discovered fact is useful to human beings!

**Data Preprocessing**

Data Cleaning: data may be incomplete, noisy and inconsistent. Attempts are made to identify outliers, smooth out noise, fill in missing values and correct inconsistencies.
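The cleaning activities above (filling missing values, smoothing noisy outliers) can be illustrated with a minimal sketch. The values are hypothetical, and mean imputation plus 2-standard-deviation clipping are just one illustrative choice of technique, not something prescribed by these slides:

```python
import statistics

# Hypothetical raw attribute values with missing entries (None) and an outlier
raw = [12.0, 14.5, None, 13.2, 980.0, None, 12.8]

# Fill missing values with the mean of the observed values
observed = [v for v in raw if v is not None]
mean = statistics.mean(observed)
filled = [v if v is not None else mean for v in raw]

# Smooth noise: clip values falling outside mean +/- 2 standard deviations
mu, sigma = statistics.mean(filled), statistics.stdev(filled)
lo, hi = mu - 2 * sigma, mu + 2 * sigma
cleaned = [min(max(v, lo), hi) for v in filled]
```

Real cleaning would normally rely on domain-specific rules and a data-frame library; this only shows the shape of the step.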
**Data Preprocessing**

Data Integration: data analysis may involve integrating data from different sources, as in a Data Warehouse. The sources may include databases, data cubes or flat files.

**Data Preprocessing**

Data Reduction: since both the data volume and the attribute set may be too large, data reduction becomes necessary. It includes activities such as removal of irrelevant and redundant attributes, data compression, and aggregation or generation of summary data.

**Data Preprocessing**

Transformation: data may need to be transformed or consolidated into forms suitable for mining. This may include activities such as generalization, normalization (e.g. attribute values converted from absolute values to ranges), construction of new attributes, etc.

**Patterns**

• Descriptive – characterizing general properties of the data
• Predictive – performing inference on the current data in order to make predictions
• Discover:
  • multiple kinds of patterns to accommodate different user expectations / applications (users may specify hints to guide the search)
  • patterns at various granularities

**Frequent Patterns**

Patterns that occur frequently in the data. Types:
• Itemsets
• Subsequences
• Substructures (sub-graphs, sub-trees, sub-lattices)

**Discovery of Association Rules**

To identify the features or items in a problem domain that tend to appear together; such features or items are said to be associated. The process is to find the set of all subsets of items or attributes that frequently occur in many database records or transactions and, additionally, to extract rules on how a subset of items influences the presence of another subset.
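The Transformation step above (normalization, converting absolute values to ranges) can be sketched as follows. The age values and the decade-bucket scheme (matching the "20...29" style ranges used in the rule examples that follow) are assumed for illustration:

```python
# Hypothetical attribute values to be rescaled
ages = [23, 35, 52, 29, 61]

lo, hi = min(ages), max(ages)
# Min-max normalization: map each value into [0, 1]
normalized = [(a - lo) / (hi - lo) for a in ages]

# Generalization into coarse ranges, e.g. decade buckets like "20...29"
buckets = [f"{(a // 10) * 10}...{(a // 10) * 10 + 9}" for a in ages]
```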
**Association Rule: Example**

A user studying the buying habits of customers may choose to mine association rules of the form:

P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z) [support = n%, confidence = m%]

Meta-rules such as the following can be specified:

occupation(X, "student") ∧ age(X, "20...29") ⇒ buys(X, "mobile") [1.4%, 70%]

**Association Rule: Single/Multi**

Single-dimensional association rule:

buys(X, "computer") ⇒ buys(X, "antivirus") [1.1%, 55%]
or "computer" ⇒ "antivirus" (A ⇒ B) [1.1%, 55%]

Multi-dimensional association rule:

occupation(X, "student") ∧ age(X, "20...29") ⇒ buys(X, "mobile") [1.4%, 70%]

**Metrics for Interestingness measures**

Interestingness measures in knowledge discovery help to identify the relevance of the patterns discovered during the mining process.

**Interestingness measures**

• Used to confine the number of uninteresting patterns returned by the process.
• Based on the structure of patterns and the statistics underlying them.
• Associated with a threshold which can be controlled by the user
  • patterns not meeting the threshold are not presented to the user.

**Interestingness measures: objective**

Objective measures of pattern interestingness:
• simplicity
• utility (support)
• certainty (confidence)
• novelty

**Interestingness measures: simplicity**

Simplicity: a pattern's interestingness is based on its overall simplicity for human comprehension, e.g. rule length is a simplicity measure.

**Interestingness measures: support**

Utility (support): usefulness of a pattern.

support(A ⇒ B) = P(A ∪ B)

The support for an association rule {A} ⇒ {B} is the percentage of all the transactions under analysis that contain this itemset.

**Interestingness measures: confidence**

Certainty (confidence): assesses the validity or trustworthiness of a pattern.
Confidence is a certainty measure:

confidence(A ⇒ B) = P(B | A)

The confidence for an association rule {A} ⇒ {B} is the percentage of cases that follow the rule. Association rules that satisfy both the confidence and support thresholds are referred to as strong association rules.

**Interestingness measures: novelty**

Novelty: patterns contributing new information to the given pattern set are called novel patterns, e.g. data exceptions. Removing redundant patterns is one strategy for detecting novelty.

**Market Basket data analysis**

Let a transaction be defined as the variety of items purchased by a customer in one visit, irrespective of the quantity of each item purchased. The problem is to find the items that a customer tends to buy together.

**Market Basket data analysis**

An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of the expression is that transactions containing X tend to contain Y as well. The inverse may not be true. Since only the presence or absence of items is considered, and not the quantity purchased, rules of this type are called Binary Association Rules.

**Market Basket data analysis**

The purpose is to study consumers' purchase patterns in departmental stores. Consider four possible transactions:

1 – {Pen, Ink, Diary, Writing Pad}
2 – {Pen, Ink, Diary}
3 – {Pen, Diary}
4 – {Pen, Ink, Writing Pad}

**Market Basket data analysis**

A possible association rule: "purchase of Pen implies the purchase of Ink or Diary"

{Pen} ⇒ {Ink} or {Pen} ⇒ {Diary}

Basically, the rule is of the form {LHS} ⇒ {RHS} where both {LHS} and {RHS} are sets of items, called itemsets, and {LHS} ∩ {RHS} = ∅.
• {Pen, Ink} is a 2-itemset.
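Using the four transactions above, support and confidence can be computed directly. A small sketch (`WritingPad` is written as a single token):

```python
# The four market-basket transactions from the slides
transactions = [
    {"Pen", "Ink", "Diary", "WritingPad"},
    {"Pen", "Ink", "Diary"},
    {"Pen", "Diary"},
    {"Pen", "Ink", "WritingPad"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs): support of the union over support of the antecedent."""
    return support(lhs | rhs) / support(lhs)

print(support({"Pen", "Ink"}))       # 0.75: {Pen, Ink} occurs in 3 of 4 baskets
print(confidence({"Pen"}, {"Ink"}))  # 0.75: Pen is in all 4, Pen and Ink in 3
print(confidence({"Ink"}, {"Pen"}))  # 1.0: every basket containing Ink has Pen
```

Note the asymmetry in the last two lines: {Ink} ⇒ {Pen} holds with full confidence, while {Pen} ⇒ {Ink} does not, which is exactly the "the inverse may not be true" remark above.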
**Binary Association Rule Mining**

Two-step process:
• Find all frequent itemsets
  • An itemset will be considered for mining rules if its support is above a threshold called minsup.
• Generate strong association rules from the frequent itemsets
  • Acceptance of a rule is once again through a threshold, called minconf.

**Finding Frequent Itemsets**

If there are N items in a market basket and the association is studied for all possible item combinations, a total of 2^N combinations are to be checked.

**Finding Frequent Itemsets**

All nonempty subsets of a frequent itemset must also be frequent (the anti-monotone property).

Apriori Algorithm: an itemset is frequent when its occurrence in the total dataset exceeds minsup. If there exist N items, the algorithm attempts to compute frequent itemsets from 1-itemsets up to N-itemsets.

**Apriori Algorithm**

The algorithm has two steps:
• Join step: frequent k-itemsets are computed by joining the (k-1)-itemsets.
• Prune step: if a k-itemset fails to cross the minsup threshold, all supersets of that k-itemset are no longer considered for association rule discovery.

**Apriori Algorithm**

• Let Lk be the set of frequent k-itemsets.
• Let Ck be the set of candidate k-itemsets. Each member of this set has two fields – itemset and support count.

**Apriori Algorithm**

1. Let k ← 1
2. Generate L1, the frequent itemsets of length 1
3. If (Lk = ∅) or (k = N), goto Step 7
4. k ← k + 1
5. Generate Lk, the frequent itemsets of length k, by Join and Prune
6. Goto Step 3
7. Stop. Output: ⋃k Lk

**Apriori Algorithm**

Join():
forall (i, j) where i ∈ Lk-1 and j ∈ Lk-1, i ≠ j
    select all possible k-itemsets and insert them into Ck
endfor

If L3 = {{{1 2 3}, s123}, {{1 2 4}, s124}, {{1 3 4}, s134}, {{1 3 5}, s135}, {{2 3 4}, s234}}
then C4 = {{{1 2 3 4}, s1234}, {{1 3 4 5}, s1345}}

**Apriori Algorithm**

Prune():
forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if (s ∉ Lk-1) then delete c from Ck endif
    endfor
endfor
Lk ← Ck

L4 = {{{1 2 3 4}, s1234}}

**Rule Generation**

Rule generation only needs to ensure that the produced rules satisfy the minimum confidence threshold.
• Because rules are generated from frequent itemsets, they automatically satisfy the minimum support threshold.

Given a frequent itemset li, find all non-empty subsets f ⊂ li such that f ⇒ (li - f) satisfies the minimum confidence requirement.
• If |li| = k, then there are 2^k - 2 candidate association rules.

**Rule Generation**

Algorithm:
forall frequent itemsets li with |li| ≥ 2 do
    call genrule(li, li)
endfor

**Rule Generation**

genrule(lk, fm):
F ← {(m-1)-itemsets fm-1 | fm-1 ⊂ fm}
forall fm-1 ∈ F do
    conf ← sup(lk) / sup(fm-1)
    if (conf ≥ minconf) then
        print rule "fm-1 ⇒ (lk - fm-1)" with conf and sup(lk)
        if (m-1 > 1) then call genrule(lk, fm-1) endif
    endif
endfor

**Rule Generation**

If {A, B, C, D} is a frequent itemset, the candidate rules are:
{ABC} ⇒ {D}, {ABD} ⇒ {C}, {ACD} ⇒ {B}, {BCD} ⇒ {A}, {AB} ⇒ {CD}, {AC} ⇒ {BD}, {AD} ⇒ {BC}, {BC} ⇒ {AD}, {BD} ⇒ {AC}, {CD} ⇒ {AB}, {A} ⇒ {BCD}, {B} ⇒ {ACD}, {C} ⇒ {ABD}, {D} ⇒ {ABC}

**Rule Generation**

In general, confidence does not have an anti-monotone property: c({ABC} ⇒ {D}) can be larger or smaller than c({AB} ⇒ {D}). But the confidence of rules generated from the same itemset does have an anti-monotone property.
• Confidence is anti-monotone w.r.t.
the number of items on the RHS of the rule. e.g., for L = {A, B, C, D}:
c({ABC} ⇒ {D}) ≥ c({AB} ⇒ {CD}) ≥ c({A} ⇒ {BCD})

**Case Study**

To find the association among the species of trees present in a forest: the problem is to find a set of association rules that would indicate the species of trees that usually appear together, and also whether a set of species ensures the presence of another set of species with a minimum degree of confidence specified a priori.

**Data Collection**

A forest area is divided into a number of transects. A group of surveyors walks through each transect to identify the different species of trees and their numbers of occurrences.

**Data**

**Converting the Data**

**Drawbacks**

The support and confidence measures used by Apriori allow a lot of rules which are not necessarily interesting. There are two options for extracting interesting rules:
• using subjective knowledge
• using objective measures (measures better than confidence)

**Subjective approaches**

• Visualization – users are allowed to interactively verify the discovered rules
• Template-based approach – filter out rules that do not fit user-specified templates
• Subjective interestingness measures – filter out rules that are obvious (bread ⇒ butter) and rules that are non-actionable (do not lead to profits)
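The two-step mining process described earlier (Apriori itemset discovery with join and prune, followed by rule generation from the frequent itemsets) can be sketched end-to-end on the market-basket transactions. This is a minimal, unoptimized sketch: it rescans the data for each support query, and for brevity it enumerates all non-empty proper subsets of each frequent itemset rather than using the recursive genrule with its anti-monotone early termination:

```python
from itertools import combinations

# The four market-basket transactions from the slides
transactions = [
    {"Pen", "Ink", "Diary", "WritingPad"},
    {"Pen", "Ink", "Diary"},
    {"Pen", "Diary"},
    {"Pen", "Ink", "WritingPad"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori(minsup):
    """Return all frequent itemsets, computed level by level."""
    items = set().union(*transactions)
    Lk = {frozenset([i]) for i in items if support({i}) >= minsup}  # L1
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset
        # (the anti-monotone property of support)
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c for c in Ck if support(c) >= minsup}
        frequent |= Lk
        k += 1
    return frequent

def gen_rules(frequent, minconf):
    """Generate strong rules (lhs, rhs, conf) from each frequent itemset."""
    rules = []
    for l in (f for f in frequent if len(f) >= 2):
        for r in range(1, len(l)):            # all non-empty proper subsets
            for lhs in map(frozenset, combinations(l, r)):
                conf = support(l) / support(lhs)
                if conf >= minconf:
                    rules.append((set(lhs), set(l - lhs), conf))
    return rules

freq = apriori(minsup=0.5)
rules = gen_rules(freq, minconf=0.7)
```

With minsup = 50%, {Pen, Ink} and {Pen, Ink, Diary} come out frequent while {Diary, Writing Pad} does not, and {Pen} ⇒ {Ink} survives the 70% confidence threshold, matching the hand computations in the earlier slides.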