## Chapter 2 Data Mining


**Chapter 2 Data Mining**

Faculty of Computer Science and Engineering, HCM City University of Technology, October 2010

**Outline**

• Overview of data mining
• Association rules
• Classification
• Regression
• Clustering
• Other data mining problems
• Applications of data mining

**DATA MINING**

• Data mining refers to the mining or discovery of new information, in terms of patterns or rules, from vast amounts of data.
• To be practically useful, data mining must be carried out efficiently on large files and databases.
• This chapter briefly reviews the state of the art of this extensive field.
• Data mining uses techniques from areas such as:
  • machine learning,
  • statistics,
  • neural networks,
  • genetic algorithms.

**OVERVIEW OF DATA MINING: Data Mining as a Part of the Knowledge Discovery Process**

• Knowledge Discovery in Databases, abbreviated as KDD, encompasses more than data mining.
• The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.

**Example**

• Consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount.
• A variety of new knowledge can be discovered by KDD processing on this client database.
• During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected.
• The data cleansing process may then correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record.
• Data transformation and encoding may be done to reduce the amount of data.

**Example (cont.)**

The result of mining may be to discover the following types of “new” information:

• Association rules: e.g., whenever a customer buys video equipment, he or she also buys another electronic gadget.
• Sequential patterns: e.g., suppose a customer buys a camera, and within three months he or she buys photographic supplies; then within six months he or she is likely to buy an accessory item. This defines a sequential pattern of transactions. A customer who buys more than twice in regular periods may be likely to buy at least once during the Christmas period.
• Classification trees: e.g., customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes.

We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, and place of residence to what and how much the customers purchase.

• This information can then be utilized
  • to plan additional store locations based on demographics,
  • to run store promotions,
  • to combine items in advertisements, or
  • to plan seasonal marketing strategies.
• As this retail store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions.
• The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualizations.

**Goals of Data Mining and Knowledge Discovery**

• Data mining is carried out with some end goals. These goals fall into the following classes:
  • Prediction: data mining can show how certain attributes within the data will behave in the future.
  • Identification: data patterns can be used to identify the existence of an item, an event, or an activity.
  • Classification: data mining can partition the data so that different classes or categories can be identified based on combinations of parameters.
  • Optimization: one eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints.

**Data Mining: On What Kind of Data?**

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  • Object-oriented and object-relational databases
  • Spatial databases
  • Time-series data and temporal data
  • Text databases and multimedia databases
  • Heterogeneous and legacy databases
  • World Wide Web

**Types of Knowledge Discovered During Data Mining**

• Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied data.
• Knowledge can be represented in many forms. In an unstructured sense, it can be represented by rules. In a structured form, it may be represented in decision trees, semantic networks, or hierarchies of classes or frames.
• It is common to describe the knowledge discovered during data mining in five ways:
  • Association rules: these rules correlate the presence of a set of items with another range of values for another set of variables.

**Types of Knowledge Discovered (cont.)**

  • Classification hierarchies: the goal is to work from an existing set of events or transactions to create a hierarchy of classes.
  • Patterns within time series.
  • Sequential patterns: a sequence of actions or events is sought. Detecting sequential patterns is equivalent to detecting associations among events with certain temporal relationships.
  • Clustering: a given population of events can be partitioned into sets of “similar” elements.

**Main function phases of the KD process**

• Learning the application domain: relevant prior knowledge and goals of the application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
• Choosing the functions of data mining: summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

**Main phases of data mining**

[Figure: KDD pipeline. Labels, from data to knowledge: Data Sources → Data Cleaning / Data Integration → Data Warehouse → Selection/Transformation → Task-relevant Data → Data Mining → Patterns → Pattern Evaluation/Presentation → Knowledge]

**2. ASSOCIATION RULES**

**What Is Association Rule Mining?**

• Association rule mining is finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
  • basket data analysis,
  • cross-marketing,
  • catalog design,
  • clustering, classification, etc.
• Rule form: “Body ⇒ Head [support, confidence]”.

**Association rule mining**

• Examples:
  buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]
  major(x, “CS”) ∧ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]
• Association rule mining problem:
  • Given: (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in a visit).
  • Find: all rules that correlate the presence of one set of items with that of another set of items.
  • E.g., 98% of people who purchase tires and auto accessories also get automotive services done.

**Rule Measures: Support and Confidence**

• Let $J = \{i_1, i_2, \dots, i_m\}$ be a set of items.
Let $D$, the task-relevant data, be a set of database transactions where each transaction $T$ is a set of items such that $T \subseteq J$. Each transaction $T$ is said to contain $A$ if and only if $A \subseteq T$.
• An association rule is an implication of the form $A \Rightarrow B$, where $A \subset J$, $B \subset J$, and $A \cap B = \emptyset$.
• The rule $A \Rightarrow B$ holds in the transaction set $D$ with support $s$, where $s$ is the percentage of transactions in $D$ that contain $A \cup B$ (i.e., both $A$ and $B$). This is taken to be the probability $P(A \cup B)$.
• The rule $A \Rightarrow B$ has confidence $c$ in the transaction set $D$ if $c$ is the percentage of transactions in $D$ containing $A$ that also contain $B$.

**Support and confidence**

That is:

• support, $s$: the probability that a transaction contains $A \cup B$, i.e., $s = P(A \cup B)$;
• confidence, $c$: the conditional probability that a transaction containing $A$ also contains $B$, i.e., $c = P(B \mid A)$.
• Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.

**Frequent itemset**

• A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset.
• An itemset satisfies minimum support if its occurrence frequency is greater than or equal to the product of min_sup and the total number of transactions in $D$. The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count.
• If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by $L_k$.

**Example 2.1**

| Transaction-ID | Items_bought |
|---|---|
| 2000 | A, B, C |
| 1000 | A, C |
| 4000 | A, D |
| 5000 | B, E, F |

With a minimum support of 50% and a minimum confidence of 50%, we have:

A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)

**Types of Association Rules**

• Boolean vs.
quantitative associations (based on the types of values handled):
  buys(x, “SQLServer”) ∧ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
  age(x, “30..39”) ∧ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multidimensional associations: a rule that references two or more dimensions, such as the dimensions buys, income, and age, is a multidimensional association rule.
• Single-level vs. multiple-level analysis: some methods for association rule mining can find rules at different levels of abstraction. For example, suppose that a set of mined association rules includes the following:
  age(x, “30..39”) ⇒ buys(x, “laptop computer”)
  age(x, “30..39”) ⇒ buys(x, “computer”)
  in which “computer” is a higher-level abstraction of “laptop computer”.

**How to mine association rules from large databases?**

• Association rule mining is a two-step process:
  1. Find all frequent itemsets (the sets of items that have minimum support).
     • A subset of a frequent itemset must also be a frequent itemset (the Apriori principle); i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
     • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
  2. Generate strong association rules from the frequent itemsets.
• The overall performance of mining association rules is determined by the first step.

**The Apriori Algorithm**

• Apriori is an important algorithm for mining frequent itemsets for Boolean association rules.
• The Apriori algorithm employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
• First, the set of frequent 1-itemsets is found. This set is denoted $L_1$. $L_1$ is used to find $L_2$, the set of frequent 2-itemsets, which is used to find $L_3$, and so on, until no more frequent k-itemsets can be found. Finding each $L_k$ requires one full scan of the database.
• To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.

**Apriori property**

• Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
• The Apriori property is based on the following observation. By definition, if an itemset $I$ does not satisfy the minimum support threshold, min_sup, then $I$ is not frequent, that is, $P(I) <$ min_sup. If an item $A$ is added to the itemset $I$, then the resulting itemset, $I \cup A$, cannot occur more frequently than $I$. Therefore, $I \cup A$ is not frequent either, i.e., $P(I \cup A) <$ min_sup.
• This property belongs to a special category of properties called anti-monotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well.

**Finding $L_k$ using $L_{k-1}$**

• A two-step process is used to find $L_k$ from $L_{k-1}$:
  • Join step: the candidate set $C_k$ is generated by joining $L_{k-1}$ with itself.
  • Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, so candidates containing such a subset are removed.

**Pseudo code**

    C_k: candidate itemsets of size k
    L_k: frequent itemsets of size k

    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = apriori_gen(L_k, min_sup);
        for each transaction t in database do    // scan D for counts
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    end
    return ∪_k L_k;

    procedure apriori_gen(L_k: frequent k-itemsets, min_sup: minimum support threshold)
        for each itemset l1 ∈ L_k
            for each itemset l2 ∈ L_k
                if (l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ … ∧ l1[k-1] = l2[k-1] ∧ l1[k] < l2[k]) then {
                    c = l1 ⋈ l2;             // join step: generate a candidate
                    if some k-subset s of c ∉ L_k then
                        delete c;            // prune step: remove unfruitful candidate
                    else add c to C_{k+1};
                }
        return C_{k+1};
    end procedure

**Example 2.2**

| TID | List of item_IDs |
|---|---|
| T100 | I1, I2, I5 |
| T200 | I2, I4 |
| T300 | I2, I3 |
| T400 | I1, I2, I4 |
| T500 | I1, I3 |
| T600 | I2, I3 |
| T700 | I1, I3 |
| T800 | I1, I2, I3, I5 |
| T900 | I1, I2, I3 |

• Assume that the
minimum transaction support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).

**$C_1$ and $L_1$** (every candidate 1-itemset is frequent, so $L_1 = C_1$)

| Itemset | Sup. count |
|---|---|
| {I1} | 6 |
| {I2} | 7 |
| {I3} | 6 |
| {I4} | 2 |
| {I5} | 2 |

**$C_2$ and $L_2$**

| Itemset in $C_2$ | Sup. count | In $L_2$ |
|---|---|---|
| {I1, I2} | 4 | yes |
| {I1, I3} | 4 | yes |
| {I1, I4} | 1 | no |
| {I1, I5} | 2 | yes |
| {I2, I3} | 4 | yes |
| {I2, I4} | 2 | yes |
| {I2, I5} | 2 | yes |
| {I3, I4} | 0 | no |
| {I3, I5} | 1 | no |
| {I4, I5} | 0 | no |

**$C_3$ and $L_3$** (X marks a candidate removed in the prune step because one of its 2-subsets is not frequent)

| Itemset in $C_3$ | Sup. count | In $L_3$ |
|---|---|---|
| {I1, I2, I3} | 2 | yes |
| {I1, I2, I5} | 2 | yes |
| {I1, I3, I5} | X | no |
| {I2, I3, I4} | X | no |
| {I2, I3, I5} | X | no |
| {I2, I4, I5} | X | no |

Joining $L_3$ with itself gives $C_4$ = {{I1, I2, I3, I5}}. This candidate is pruned because its subset {I2, I3, I5} is not frequent, so $L_4 = \emptyset$.

**Generating Association Rules from Frequent Itemsets**

• Once the frequent itemsets from transactions in a database $D$ have been found, it is straightforward to generate strong association rules from them.
• This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support counts:

$$\text{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\text{support\_count}(A \cup B)}{\text{support\_count}(A)}$$

  where support_count(X) is the number of transactions containing the itemset X.

Based on this equation, association rules can be generated as follows:

• For each frequent itemset $l$, generate all nonempty proper subsets of $l$.
• For every such subset $s$ of $l$, output the rule “$s \Rightarrow (l - s)$” if support_count($l$)/support_count($s$) ≥ min_conf, where min_conf is the minimum confidence threshold.
• Since the rules are generated from frequent itemsets, each one automatically satisfies minimum support.

**Example 2.3.** From Example 2.2, suppose the data contain the frequent itemset $l$ = {I1, I2, I5}. The nonempty proper subsets of $l$ are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
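The level-wise search and the rule-generation step described above can be sketched in Python on the transactions of Example 2.2. This is a minimal illustration under stated assumptions, not the slides' exact procedure: the function names (`support_count`, `apriori`, `rules_from`) are hypothetical, the join step takes unions of frequent k-itemsets instead of the prefix-based join of `apriori_gen`, and minimum support is given as a count rather than a fraction.

```python
from itertools import combinations

# Transactions T100..T900 of Example 2.2.
TRANSACTIONS = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    level = {}
    for item in items:  # L1: frequent 1-itemsets
        c = frozenset([item])
        n = support_count(c, transactions)
        if n >= min_count:
            level[c] = n
    frequent = dict(level)
    k = 1
    while level:
        # Join step (simplified): unions of two frequent k-itemsets
        # that contain exactly k + 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        # One scan of the data to count the surviving candidates.
        level = {}
        for c in candidates:
            n = support_count(c, transactions)
            if n >= min_count:
                level[c] = n
        frequent.update(level)
        k += 1
    return frequent


def rules_from(itemset, frequent, min_conf):
    """Rules s => (itemset - s) whose confidence is at least min_conf."""
    rules = []
    for size in range(1, len(itemset)):
        for subset in combinations(sorted(itemset), size):
            s = frozenset(subset)
            conf = frequent[itemset] / frequent[s]
            if conf >= min_conf:
                rules.append((sorted(s), sorted(itemset - s), conf))
    return rules


frequent = apriori(TRANSACTIONS, min_count=2)
for lhs, rhs, conf in rules_from(frozenset({"I1", "I2", "I5"}), frequent, min_conf=0.7):
    print(" ∧ ".join(lhs), "⇒", " ∧ ".join(rhs), f"(confidence {conf:.0%})")
```

Because every subset of a frequent itemset is itself frequent, `rules_from` can look up both support counts in the dictionary returned by `apriori` without rescanning the transactions.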
The resulting association rules are as shown below:

• I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%
• I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%
• I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%
• I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%
• I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%
• I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output.

**Properties of the Apriori algorithm**

• It generates a large number of candidate itemsets.
  • $10^4$ frequent 1-itemsets lead to more than $10^7$ (≈ $10^4(10^4-1)/2$) candidate 2-itemsets.
  • Discovering a frequent k-itemset requires generating at least $2^k - 1$ candidate itemsets.
• It examines the dataset several times.
  • The cost is high when the sizes of the itemsets increase.
  • If k-itemsets are identified, then the algorithm examines the dataset k+1 times.

**Improving the efficiency of Apriori**

• Hash-based technique: hashing itemsets into corresponding buckets.
• Transaction reduction: reducing the number of transactions scanned in future iterations.
• Partitioning: partitioning the data to find candidate itemsets.
• Sampling: mining on a subset of the given data.
• Dynamic itemset counting: adding candidate itemsets at different points during a scan.

**3. CLASSIFICATION**

• Classification is the process of learning a model that describes different classes of data. The classes are predetermined; hence, this type of activity is also called supervised learning.
• Example: in a banking application, customers who apply for a credit card may be classified as a “good risk”, a “fair risk”, or a “poor risk”.
• Once the model is built, it can be used to classify new data.
• The first step, learning the model, is accomplished by using a training set of data that has already been classified. Each record in the training data contains an attribute, called the class label, that indicates which class the record belongs to.
• The model that is produced is usually in the form of a decision tree or a set of rules.
• Some of the important issues with regard to the model and the algorithm that produces the model include:
  • the model’s ability to predict the correct class of new data,
  • the computational cost associated with the algorithm,
  • the scalability of the algorithm.
• Let us examine the approach where the model is in the form of a decision tree.
• A decision tree is simply a graphical representation of the description of each class or, in other words, a representation of the classification rules.

**Example 3.1**

• Suppose that we have a database of customers on the AllElectronics mailing list. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics.
• Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. Sending promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose.
• Figure 2 shows a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. Each internal node represents a test on an attribute. Each leaf node represents a class.
[Figure 2: A decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer.]

**Algorithm for decision tree induction**

Input: a set of training data records R1, R2, …, Rm and a set of attributes A1, A2, …, An
Output: a decision tree

Basic algorithm (a greedy algorithm):

• The tree is constructed in a top-down, recursive, divide-and-conquer manner.
• At the start, all the training examples are at the root.
• Attributes are categorical (continuous-valued attributes are discretized in advance).
• Examples are partitioned recursively based on selected attributes.
• Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

**Conditions for stopping partitioning**

• All samples for a given node belong to the same class.
• There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
• There are no samples left.

**Procedure Build_tree(Records, Attributes)**

    Begin
        Create a node N;
        If all Records belong to the same class C then
            Return N as a leaf node with the class label C;
        If Attributes is empty then
            Return N as a leaf node with the class label C, such that the majority of Records belong to C;
        Select the attribute Ai with the highest information gain from Attributes;
        Label node N with Ai;
        For each known value aj of Ai do begin
            Add a branch from node N for the condition Ai = aj;
            Sj = subset of Records where Ai = aj;
            If Sj is empty then
                Add a leaf L with the class label C, such that the majority of Records belong to C
            else
                Add the node returned by Build_tree(Sj, Attributes – Ai);
        end
        Return N;
    End

**Attribute Selection Measure**

• The expected information needed to classify training data of $s$ samples, where the class attribute has $m$ values $(a_1, \dots, a_m)$ and $s_i$ is the number of samples belonging to class label $a_i$, is given by:

$$I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (1)$$

where $p_i$ is the probability that a random sample
belongs to the class with label $a_i$. An estimate of $p_i$ is $s_i/s$.

Consider an attribute $A$ with values $\{a_1, \dots, a_v\}$ used as the test attribute for splitting in the decision tree. Attribute $A$ partitions the samples into the subsets $S_1, \dots, S_v$, where the samples in each $S_j$ have the value $a_j$ for attribute $A$. Each $S_j$ may contain samples that belong to any of the classes. The number of samples in $S_j$ that belong to class $i$ is denoted $s_{ij}$. The entropy of $A$ is given by:

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \dots, s_{mj}) \qquad (2)$$

$I(s_{1j}, \dots, s_{mj})$ can be defined using the formulation for $I(s_1, \dots, s_m)$ with $p_i$ replaced by $p_{ij} = s_{ij}/|S_j|$, where $|S_j| = s_{1j} + \cdots + s_{mj}$. The information gain obtained by partitioning on attribute $A$ is then defined as:

$$\text{Gain}(A) = I(s_1, s_2, \dots, s_m) - E(A)$$

• Example 3.1 (cont.): Table 1 presents a training set of data tuples taken from the AllElectronics customer database. The class label attribute, buys_computer, has two distinct values; therefore there are two distinct classes ($m = 2$). Let class $C_1$ correspond to yes and class $C_2$ correspond to no. There are 9 samples of class yes and 5 samples of class no.
• To compute the information gain of each attribute, we first use Equation (1) to compute the expected information needed to classify a given sample:

$$I(s_1, s_2) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

Table 1. Training data tuples from the AllElectronics customer database. [Only the Class column of the table survives here: No, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No.]

Next, we need to compute the entropy of each attribute. Let’s start with the attribute age. We need to look at the distribution of yes and no samples for each value of age, and we compute the expected information for each of these distributions.
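The measures defined above can be sketched in a few lines of Python before working through the numbers by hand. This is an illustrative sketch only: the function names `info`, `expected_info`, and `gain` are hypothetical, and the per-attribute (yes, no) counts are assumptions tallied from the 14 AllElectronics training samples as they appear in the partitioned tables of this example.

```python
from math import log2


def info(counts):
    """Expected information I(s1, ..., sm) for a list of class counts.

    Zero counts are skipped, following the convention 0 * log2(0) = 0.
    """
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def expected_info(partitions):
    """E(A): weighted average of info() over the subsets induced by A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)


def gain(class_counts, partitions):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info(class_counts) - expected_info(partitions)


# (yes, no) counts: 9 yes and 5 no samples overall.
CLASS_COUNTS = [9, 5]

# (yes, no) counts per attribute value (tallied from the training samples).
PARTITIONS = {
    "age": [[2, 3], [4, 0], [3, 2]],            # <=30, 31…40, >40
    "income": [[2, 2], [4, 2], [3, 1]],         # high, medium, low
    "student": [[6, 1], [3, 4]],                # yes, no
    "credit_rating": [[6, 2], [3, 3]],          # fair, excellent
}

gains = {name: gain(CLASS_COUNTS, parts) for name, parts in PARTITIONS.items()}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"Gain({name}) = {value:.3f}")
print("selected test attribute:", best)
```

Computed at full precision, the gains can differ from the hand-rounded values in the last digit, because the hand calculation subtracts numbers that were already rounded to three decimals.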
For age = “<= 30”: $s_{11} = 2$, $s_{21} = 3$

$$I(s_{11}, s_{21}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$$

For age = “31…40”: $s_{12} = 4$, $s_{22} = 0$

$$I(s_{12}, s_{22}) = -\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4} = 0 \quad (\text{taking } 0 \log_2 0 = 0)$$

For age = “>40”: $s_{13} = 3$, $s_{23} = 2$

$$I(s_{13}, s_{23}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$$

Using Equation (2), the expected information needed to classify a given sample if the samples are partitioned according to age is

$$E(\text{age}) = \frac{5}{14} I(s_{11}, s_{21}) + \frac{4}{14} I(s_{12}, s_{22}) + \frac{5}{14} I(s_{13}, s_{23}) = \frac{10}{14} \times 0.971 = 0.694$$

Hence, the gain in information from such a partitioning would be

$$\text{Gain}(\text{age}) = I(s_1, s_2) - E(\text{age}) = 0.940 - 0.694 = 0.246$$

• Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute’s values.
• The samples are then partitioned accordingly, as shown in Figure 3.

[Figure 3: the root node “age?” with one branch per value, each leading to the samples in that partition.]

Branch age = “<= 30”:

| income | student | credit_rating | class |
|---|---|---|---|
| high | no | fair | no |
| high | no | excellent | no |
| medium | no | fair | no |
| low | yes | fair | yes |
| medium | yes | excellent | yes |

Branch age = “>40”:

| income | student | credit_rating | class |
|---|---|---|---|
| medium | no | fair | yes |
| low | yes | fair | yes |
| low | yes | excellent | no |
| medium | yes | fair | yes |
| medium | no | excellent | no |

Branch age = “31…40”:

| income | student | credit_rating | class |
|---|---|---|---|
| high | no | fair | yes |
| low | yes | excellent | yes |
| medium | no | excellent | yes |
| high | yes | fair | yes |

**Extracting Classification Rules from Trees**

• Represent the knowledge in the form of IF-THEN rules.
• One rule is created for each path from the root to a leaf.
• Each attribute-value pair along a path forms a conjunction.
• The leaf node holds the class prediction.
• Rules are easier for humans to understand.
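The path-to-rule procedure in the bullets above can be sketched as a short recursive walk. The nested-dictionary encoding of the tree and the function name `extract_rules` are assumptions made for illustration; the tree itself is a hypothetical encoding of the buys_computer tree whose branches match the rules in the example that follows.

```python
# An internal node is {"attribute": name, "branches": {value: subtree}};
# a leaf is simply the predicted class label (a string).
TREE = {
    "attribute": "age",
    "branches": {
        "<=30": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}


def extract_rules(node, target="buys_computer", conditions=()):
    """Return one IF-THEN rule string per root-to-leaf path."""
    if isinstance(node, str):
        # Leaf: the conditions collected along the path form the conjunction.
        body = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {body} THEN {target} = "{node}"']
    rules = []
    for value, child in node["branches"].items():
        path = conditions + ((node["attribute"], value),)
        rules.extend(extract_rules(child, target, path))
    return rules


for rule in extract_rules(TREE):
    print(rule)
```

Each recursive call extends the tuple of (attribute, value) conditions by one test, so the rule emitted at a leaf contains exactly the tests on the path from the root.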
Example:

IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”

**Neural Networks and Classification**

• A neural network is a technique derived from AI that uses generalized approximation and provides an iterative method to carry it out. Artificial neural networks (ANNs) use the curve-fitting approach to infer a function from a set of samples.
• This technique provides a “learning approach”; it is driven by a test sample that is used for the initial inference and learning. With this kind of learning method, responses to new inputs may be interpolated from the known samples. This interpolation depends on the model developed by the learning method.

**ANN and classification**

• ANNs can be classified into two categories: supervised and unsupervised networks. Adaptive methods that attempt to reduce the output error are supervised learning methods, whereas those that develop internal representations without sample outputs are called unsupervised learning methods.
• ANNs can learn from information on a specific problem. They perform well on classification tasks and are therefore useful in data mining.
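As a minimal sketch of a supervised method that “attempts to reduce the output error”, the following trains a single perceptron on a toy sample. The perceptron learning rule, the AND-function data, and every name below are assumptions chosen for illustration; the slides do not name a specific network or training rule.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    total = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Adjust weights after each sample in proportion to the output error.

    samples: list of ((x1, x2), target) pairs with target 0 or 1.
    """
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)  # the output error
            if error != 0:
                weights[0] += learning_rate * error * x[0]
                weights[1] += learning_rate * error * x[1]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:  # every training sample is classified correctly
            break
    return weights, bias


# Toy training set with known outputs (supervised): the logical AND function.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
for x, target in AND_SAMPLES:
    print(x, "->", predict(weights, bias, x), "(target", target, ")")
```

The known target for each sample is what makes this supervised: the update is driven entirely by the difference between the target and the unit's current output, and training stops once that difference is zero on every sample.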