Building Enterprise Business Intelligence
Presentation Transcript

  1. Building Enterprise Business Intelligence

    Hendro Subagyo, M.Eng
  2. BI architecture and components: Data warehouse, business analytics, automated decision tools, data mining, business performance management, dashboards, visualization tools
  3. Why Business Intelligence Systems? Knowledge Management Problems (Drowning in data, starving for knowledge) Can’t access data (easily) E.g., data from different branches, years, functional areas, etc. Give me only what’s important (knowledge) E.g., which products do customers tend to buy together? I need to reduce data to what’s important by slicing and dicing. E.g., by branch, product, year, etc.
  4. Why Business Intelligence Systems? Data inconsistency and poor data quality E.g., the 2001 PC sales amount in SLC from the CFO and the SLC Account Manager are not the same. Need to improve the practices of making informed decisions. E.g., Did the VP for Marketing decide on the advertising budgets for branches in the SW region based on their sales performances over the last five years? Hard and slow to query the database? E.g., VP for Marketing, CFO and Account Manager had to wait for the MIS Department to generate sales performance reports and analyses.
  5. Why Business Intelligence Systems? ROI Problems Can I get more value out of my data? Ans: Make informed, potent decisions using knowledge extracted from integrated and consistent data over a long period of time. Can I do this cost-effectively? Can I easily scale up or change how I get knowledge out of my data? Options: manually versus automatically identifying knowledge
  6. Business Intelligence We are drowning in data, but starving for knowledge Business intelligence (BI) is knowledge extracted from data to support better business decision making.
  7. Building BI Systems Data Warehouse: A huge and integrated database DW vs DBMS: In a data warehouse, organize data in a subject-oriented way rather than a process-oriented way – dimensional modeling Data mining: Techniques to extract hidden patterns from data.
  8. Data Warehouse Logical design of data warehouse Physical design of data warehouse Data preparation and staging Data analysis (OLAP)
  9. Data Mining Association rules – Cross Selling Clustering – Target Marketing Classification – Credit Card Approval Advanced issues – web mining, personalization
  10. Data Mining Applications Finance and Insurance Marketing: Target Marketing, Cross Selling E-commerce: personalization, recommendation, web site design Crime Detecting
  11. Data Warehouse
  12. Why Data Warehouse Problems with current database practices: Problem 1: Isolated databases distributed in an enterprise Sub-problems: Data Inconsistency No comprehensive view of enterprise’s data sources – information island Sales CRM Inventory
  13. Why Data Warehouse Problem 1: Isolated databases distributed in an enterprise Sub-problems: Data Inconsistency Performance Sales CRM Inventory
  14. Why Data Warehouse Problem 2: Historical data is archived in offline storage systems Sub-problems: Historical data is always needed to support business decisions Historical Sales Data Sales Archive
  15. Why Data Warehouse A marketing manager wants to know sales amount distribution by product category and customer state in July? Query???
  16. Why Data Warehouse Problem 3: Database is designed to process transactions but not to answer decision support queries Complex queries Bad query performance
  17. What is Data Warehouse Data Warehouse is designed to solve problems associated with current database practices: Problem 1: Isolated databases distributed in an enterprise Sales CRM Data Warehouse Extract, Integrate and Replicate Inventory
  18. Why Data Warehouse Problem 2: Historical data is archived in offline storage systems Data Warehouse Historical Sales Data Sales Archive Integrate Historical Data with Current Data
  19. What is Data Warehouse Problem 3: Database is designed to process transactions but not to answer decision support queries Solution: In a data warehouse, organize data in a subject-oriented way rather than a process-oriented way – dimensional modeling.
  20. What is Data Warehouse Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision making process.
  21. What is Data Warehouse 1. Subject-oriented means the data warehouse focuses on the high-level entities of business such as sales, products, and customers. This is in contrast to database systems, which deal with processes such as placing an order. 2. Integrated means the data is integrated from distributed data sources and historical data sources and stored in a consistent format.
  22. What is Data Warehouse 3. Time-variant means the data is associated with a point in time (e.g., semester, fiscal year, and pay period) 4. Non-volatile means the data doesn’t change once it gets into the warehouse.
  23. What is Data Warehouse
  24. Data Warehouse Development Lifecycle Data Warehouse – Enterprise Data Warehouse Data Mart – Departmental Data Warehouse A data warehouse may contain multiple data marts
  25. Data Warehouse Development Lifecycle Project Planning Requirements Analysis Logical Design Physical Design Data Staging Data Analysis (OLAP)
  26. Data Warehouse Development Lifecycle Logical Design (Tool: Oracle Data Mart Designer) ER Modeling → Dimensional Modeling Design appropriate table structures and primary key/foreign key relationships
  27. Data Warehouse Development Lifecycle Physical Design Database selection Storage selection Web based? Performance
  28. Data Warehouse Development Lifecycle Data Staging Extraction Cleansing and Transformation Transportation
  29. Data Warehouse Development Lifecycle Extraction Transformation Transportation
  30. Data Warehouse Development Lifecycle Data Analysis (OLAP) Reporting Ad-hoc query Graphical Analysis
  31. Data Warehouse Development Lifecycle Analytical Report
  32. Data Warehouse Development Lifecycle Drill-up&Drill-down Query
  33. Data Warehouse Development Lifecycle Graphical Analysis
  34. Data Mining
  35. Why data mining? OLAP can only provide shallow data analysis -- what Ex: sales distribution by product
  36. Why data mining? Shallow data analysis is not sufficient to support business decisions -- how Ex: how to boost sales of other products Ex: when people buy product 6, what other products are they likely to buy? – cross selling
  37. Why data mining? OLAP can only do shallow data analysis OLAP is based on SQL:
    SELECT PRODUCTS.PNAME, SUM(SALESFACTS.SALES_AMT)
    FROM DBSR.PRODUCTS PRODUCTS, DBSR.SALESFACTS SALESFACTS
    WHERE PRODUCTS.PRODUCT_KEY = SALESFACTS.PRODUCT_KEY
    GROUP BY PRODUCTS.PNAME;
    The nature of SQL means that complicated algorithms cannot be implemented with SQL alone. Complicated algorithms need to be developed to support deep data analysis – data mining
  38. Why Data Mining? Walmart (!?) Diaper + Beer = ? $$$
  39. Market Basket (Association Rule) Analysis A market basket is a collection of items purchased by a customer in an individual customer transaction, which is a well-defined business activity Ex: a customer’s visit to a grocery store, or an online purchase from a virtual store such as ‘Amazon.com’
  40. Market Basket (Association Rule) Analysis Market basket analysis is a common analysis run against a transaction database to find sets of items, or itemsets, that appear together in many transactions. Each pattern extracted through the analysis consists of an itemset and the number of transactions that contain it. Applications: improve the placement of items in a store the layout of mail-order catalog pages the layout of Web pages others?
  41. Why data mining? OLAP results generated from data sets with a large number of attributes are difficult to interpret Ex: cluster the customers of my company --- target marketing Pick two attributes related to a customer: income level and sales amount
  42. Why data mining? Ex: cluster customers of my company --- target marketing Pick three attributes related to a customer: income level, education level and sales amount
  43. What is data mining? Data mining is a process to extract hidden and interesting patterns from data. Data mining is a step in the process of Knowledge Discovery in Database (KDD).
  44. Steps of the KDD Process: Step 1: Selection (data → target data); Step 2: Cleaning (target data → preprocessed data); Step 3: Transformation (preprocessed data → transformed data); Step 4: Data Mining (transformed data → patterns); Step 5: Interpretation & Evaluation (patterns → knowledge)
  45. [Diagram: two parallel pipelines starting from raw data. Data warehouse pipeline: Step 1: Acquisition; Step 2: Selection; Step 3: Cleaning & preprocessing; Step 4: Transformation; ending in a data warehouse used for OLAP & reporting. Data mining pipeline: Step 1: Selection; Step 2: Cleaning & preprocessing; Step 3: Transformation; Step 4: Data mining; Step 5: Interpretation & evaluation with a domain expert; producing patterns and discovered knowledge.]
  46. Steps of the KDD Process Step 1: select the columns (attributes) and rows (records) of interest to be mined. Step 2: clean errors from the selected data Step 3: transform the data to be suitable for high-performance data mining Step 4: data mining Step 5: filter out non-interesting patterns from the data mining results
  47. Data mining – on what kind of data Transactional Database Data warehouse Flat file Web data Web content Web structure Web log
  48. Major data mining tasks Association rule mining – cross selling Clustering – target marketing Classification – potential customer identification, fraud detection
  49. Association Rule Mining
  50. Problem Cross Selling --- promote sales of other products as one product is purchased Brick-and-Mortar stores: merchandise placement Click-and-Mortar stores: web site design Telemarketing Market Basket Analysis
  51. Preliminary Set Theory A set is a collection of objects. Ex: {1,3,5} The objects collected in a set are called its elements. Set X is a subset of set Y if every element in X can be found in Y
  52. Preliminary Two properties of sets: An element in a set is counted only once Ex: {1,3,5} is the same as {1,3,3,5} There is no order among elements in a set Ex: {3,1,5} is the same as {1,3,5}
  53. Association Rules Given: A database of transactions Example of transactions: a customer’s visit to a grocery store, or an online purchase from a virtual store such as ‘Amazon.com’ Format of transactions:
    date | transaction ID | customer ID | item
    1/1/99 | 001 | 001 | egg
    1/1/99 | 001 | 001 | milk
  54. Association Rules Find: patterns in the form of association rules Association rules: correlate the presence of one set of items (X) with the presence of another set of items (Y), denoted as X → Y Example: {egg, milk} → {bread} How to measure correlations in association rules?
  55. Association Rules Two important metrics for association rules: If there are two itemsets X and Y in a transaction database, we say the association rule X → Y holds in the transaction database with support s, which is the ratio of the number of transactions purchasing both X and Y to the total number of transactions, and confidence c, which is the ratio of the number of transactions purchasing both X and Y to the number of transactions purchasing X.
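The two metrics above can be computed directly over a list of transactions. A minimal Python sketch (the transactions below are made-up illustrations, not the slides' data set):

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"egg", "milk", "bread"},
    {"egg", "milk"},
    {"milk", "bread"},
    {"egg", "bread"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """supp(X and Y) / supp(X): how often Y is bought given X is bought."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"egg", "milk"}, transactions))               # 0.5
print(confidence({"egg", "milk"}, {"bread"}, transactions))  # 0.5
```

Here {egg, milk} appears in 2 of 4 transactions (support 50%), and of those 2, one also contains bread (confidence 50%).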
  56. Association Rules Example:
    TID | CID | Item | Price | Date
    101 | 201 | Computer | 1500 | 1/4/99
    101 | 201 | MS Office | 300 | 1/4/99
    101 | 201 | MCSE Book | 100 | 1/4/99
    102 | 201 | Hard disk | 500 | 1/8/99
    102 | 201 | MCSE Book | 100 | 1/8/99
    103 | 202 | Computer | 1500 | 1/21/99
    103 | 202 | Hard disk | 500 | 1/21/99
    103 | 202 | MCSE Book | 100 | 1/21/99
  57. Association Rules In this example: For the association rule {Computer} → {Hard disk}: Its support is 1/3 = 33.3% Its confidence is 1/2 = 50% How about {Computer} → {MCSE book}? {Computer, MCSE book} → {Hard disk}??? Confidence > Support???
  58. Association Rule Mining Association rule mining: find all association rules with support larger than or equal to a user-specified minimum support and confidence larger than or equal to a user-specified minimum confidence from a transaction database For the example in slide 8 (3 transactions and 4 items), the process of mining association rules is not that complex. How about a transaction database with 1G transactions and 1M different items? An efficient algorithm is needed.
  59. Association Rule Mining Itemset: a set of items, ex. {egg, milk} Size of Itemset: number of items in that itemset. The ratio of the number of transactions that purchases all items in an itemset to the total number of transactions is called the support of the itemset.
  60. Association Rules Example:
    TID | CID | Item | Price | Date
    101 | 201 | Computer | 1500 | 1/4/99
    101 | 201 | MS Office | 300 | 1/4/99
    101 | 201 | MCSE Book | 100 | 1/4/99
    102 | 201 | Hard disk | 500 | 1/8/99
    102 | 201 | MCSE Book | 100 | 1/8/99
    103 | 202 | Computer | 1500 | 1/21/99
    103 | 202 | Hard disk | 500 | 1/21/99
    103 | 202 | MCSE Book | 100 | 1/21/99
  61. Association Rules In this example: The support of the 2-itemset {Computer, Hard disk} is 1/3 = 33.3%. What is the support of the 1-itemset {Computer}? What is the support of {Computer} → {Hard disk} and of {Hard disk} → {Computer}??
  62. Association Rules Two steps in association rule mining: Find all itemsets that have support above the user-specified minimum support. We call these itemsets large itemsets. For each large itemset L, find all association rules in the form a → (L − a) where a and (L − a) are non-empty subsets of L. Example: find all association rules in the example given in slide 8 with minimum support 60% and minimum confidence 80%.
  63. Association Rule Mining Step 2 is trivial compared to step 1: Exponential search space Size of the transaction database Readings: Data mining book pp. 225-230
  64. Apriori Algorithm Apriori is an efficient algorithm to discover all large itemsets from a huge database with a large number of items. Apriori was developed by two researchers from the IBM Almaden Research Lab.
  65. Apriori Algorithm The Apriori algorithm is based on the Apriori property: any subset of a large itemset must also be large.
  66. Apriori Algorithm Step 1: Scan the DB once to find all large 1-itemsets. Step 2: Generate candidate k-itemsets from large (k-1)-itemsets. Step 3: Find all large k-itemsets from candidate k-itemsets by scanning the DB once. Repeat from step 2 until no candidate itemsets can be generated.
  67. Apriori Algorithm Step 2 Candidate k-itemsets are k-itemsets that could be large. Why generate candidate k-itemsets only from large (k-1)-itemsets? How to generate? Step 2-1: Join: Two large (k-1)-itemsets, L1 and L2, that are joinable must satisfy the following conditions: L1(1)=L2(1) and L1(2)=L2(2) and … L1(K-2)=L2(K-2) and L1(K-1)<L2(K-1) Step 2-2: Prune: prune itemsets generated in step 2-1 that have a subset that is not large.
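The scan/join/prune loop can be sketched in Python. This is a minimal illustration, not the original IBM implementation; for simplicity the join step unions any two large (k-1)-itemsets rather than using the sorted prefix-based join described above (after the prune step the surviving candidates are the same). The demo database is the 5-transaction example from the next slide:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: return all large (frequent) itemsets with supports."""
    n = len(transactions)
    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: one DB scan for large 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    large = [frozenset([i]) for i in items if supp(frozenset([i])) >= min_support]
    result = {l: supp(l) for l in large}
    k = 2
    while large:
        # Step 2-1 (join): union pairs of large (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(large, 2) if len(a | b) == k}
        # Step 2-2 (prune): drop candidates with a non-large (k-1)-subset.
        prev = set(large)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Step 3: one more DB scan keeps candidates meeting minimum support.
        large = [c for c in candidates if supp(c) >= min_support]
        result.update({l: supp(l) for l in large})
        k += 1
    return result

# The 5-transaction example from the next slide, minimum support 40%:
db = [frozenset(t) for t in ({1, 3, 4, 6}, {2, 3, 5, 7},
                             {1, 2, 3, 5, 8}, {2, 5, 9, 10}, {1, 4})]
print(apriori(db, 0.4))
```

On this data the only large 3-itemset is {2, 3, 5}, with support 2/5 = 40%; the candidate {1, 3, 4} is pruned because its subset {3, 4} is not large.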
  68. Apriori Algorithm Minimum support = 40%, minimum confidence = 70%
    Transaction ID | Items
    100 | 1, 3, 4, 6
    200 | 2, 3, 5, 7
    300 | 1, 2, 3, 5, 8
    400 | 2, 5, 9, 10
    500 | 1, 4
  69. Limitation of Confidence and Support Minimum support = 20%, minimum confidence = 50%
    TID | Items
    1 | Game, VCR
    2 | Game, VCR
    3 | Game, VCR
    4 | Game, VCR
    5 | Game
    6 | VCR
    7 | VCR
    8 | VCR
    9 | VCR
    10 | PC
    Support (Game → VCR) = 4/10 = 40% Confidence (Game → VCR) = 4/5 = 80% Is the rule interesting???
  70. Independent Minimum support = 20%, minimum confidence = 50%
    TID | Items
    1 | Game, VCR
    2 | Game, VCR
    3 | Game, VCR
    4 | Game, VCR
    5 | Game
    6 | VCR
    7 | VCR
    8 | VCR
    9 | VCR
    10 | PC
    Support (Game → VCR) = 4/10 = 40% Confidence (Game → VCR) = 4/5 = 80% Is the rule interesting??? Support({VCR}) = 8/10 = 80% Confidence ((NOT Game) → VCR) = 4/5 = 80% Game and VCR are independent!! Rule Game → VCR is misleading!!!
  71. Negative Correlation Minimum support = 20%, minimum confidence = 50%
    TID | Items
    1 | Game, VCR
    2 | Game, VCR
    3 | Game, VCR
    4 | VCR
    5 | Game
    6 | VCR
    7 | VCR
    8 | VCR
    9 | VCR
    10 | Game
    Support (Game → VCR) = 3/10 = 30% Confidence (Game → VCR) = 3/5 = 60% Is the rule interesting??? Support({VCR}) = 8/10 = 80% Confidence ((NOT Game) → VCR) = 5/5 = 100% Game and VCR are negatively correlated!! Rule Game → VCR is misleading!!!
  72. Positive Correlation Minimum support = 20%, minimum confidence = 50%
    TID | Items
    1 | Game, VCR
    2 | Game, VCR
    3 | Game, VCR
    4 | Game, VCR
    5 | PC
    6 | PC
    7 | VCR
    8 | VCR
    9 | VCR
    10 | Game, VCR
    Support (Game → VCR) = 5/10 = 50% Confidence (Game → VCR) = 5/5 = 100% Is the rule interesting??? Support({VCR}) = 8/10 = 80% Confidence ((NOT Game) → VCR) = 3/5 = 60% Game and VCR are positively correlated!! Rule Game → VCR is interesting!!!
  73. Another Measurement: LIFT The lift of an association rule X → Y is defined as Lift(X → Y) = conf(X → Y) / supp(Y) If Lift(X → Y) = 1 then X and Y are independent If Lift(X → Y) < 1, then X and Y are negatively correlated If Lift(X → Y) > 1, then X and Y are positively correlated Interesting association rules have lift larger than 1.
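A lift calculation follows directly from the definition. The demo data below reproduces the "independent" Game/VCR example from the earlier slide, where lift comes out as exactly 1:

```python
def lift(X, Y, transactions):
    """lift(X -> Y) = conf(X -> Y) / supp(Y); > 1 means positive correlation."""
    n = len(transactions)
    supp = lambda s: sum(1 for t in transactions if s <= t) / n
    conf = supp(X | Y) / supp(X)
    return conf / supp(Y)

# The "independent" Game/VCR example: conf = 80%, supp(VCR) = 80%.
db = [{"Game", "VCR"}] * 4 + [{"Game"}] + [{"VCR"}] * 4 + [{"PC"}]
print(lift({"Game"}, {"VCR"}, db))  # 1.0
```

Even though the rule Game → VCR passes 20% support and 50% confidence, its lift of 1 reveals the two items are independent, so the rule is not interesting.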
  74. Sequential Pattern Mining
  75. Sequential Patterns Given: A Transaction Database { cid, tid, date, item } Find: inter-transaction patterns among customers Example: customers typically rent “ Star Wars”, then “Empire Strikes Back” and then “Return of the Jedi”
  76. Sequential Patterns
    cid | tid | date | item
    1 | 1 | 01/01/2000 | 30
    1 | 2 | 01/02/2000 | 90
    2 | 3 | 01/01/2000 | 40, 70
    2 | 4 | 01/02/2000 | 30
    2 | 5 | 01/03/2000 | 40, 60, 70
    3 | 6 | 01/01/2000 | 30, 50, 70
    4 | 7 | 01/01/2000 | 30
    4 | 8 | 01/02/2000 | 40, 70
    4 | 9 | 01/03/2000 | 90
    5 | 10 | 01/01/2000 | 90
  77. Sequential Patterns Itemset: a non-empty set of items, e.g., {30}, {40, 70}. Sequence: an ordered list of itemsets, e.g., <{30} {40,70}>, <{40,70} {30}>. The size of a sequence is the number of itemsets in that sequence.
  78. Sequential Patterns
    cid | tid | date | item
    1 | 1 | 01/01/2000 | 30
    1 | 2 | 01/02/2000 | 90
    2 | 3 | 01/01/2000 | 40, 70
    2 | 4 | 01/02/2000 | 30
    2 | 5 | 01/03/2000 | 40, 60, 70
    3 | 6 | 01/01/2000 | 30, 50, 70
    4 | 7 | 01/01/2000 | 30
    4 | 8 | 01/02/2000 | 40, 70
    4 | 9 | 01/03/2000 | 90
    5 | 10 | 01/01/2000 | 90
    Each transaction of a customer can be viewed as an itemset. All transactions of a customer can together be viewed as the sequence of that customer Ex: customer 1 has two itemsets, {30} and {90}, so the sequence of customer 1 is <{30} {90}>
  79. Sequential Patterns
    cid | customer sequence
    1 | <{30} {90}>
    2 | <{40,70} {30} {40,60,70}>
    3 | <{30,50,70}>
    4 | <{30} {40,70} {90}>
    5 | <{90}>
  80. Sequential Patterns A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> if there exist integers i1 < i2 < … < in such that a1 is a subset of bi1, a2 is a subset of bi2, …, and an is a subset of bin. Ex: <{3} {4,5} {8}> is contained in <{3,8} {4,5,6} {8}> Is <{3} {4,5} {8}> contained in <{7} {3,8} {9} {4,5,6} {8}>? Is <{3} {4,5} {8}> contained in <{7} {9} {4,5,6} {3,8} {8}>? Is <{3} {4,5} {8}> contained in <{7} {9} {3,8} {4,5,6}>?
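The containment test above can be implemented with a greedy scan: match each itemset of the smaller sequence against the earliest remaining itemset of the larger sequence that contains it. A minimal Python sketch, using the examples from this slide:

```python
def contains(small, big):
    """True if sequence `small` (a list of itemsets) is contained in `big`:
    each itemset of `small` is a subset of a distinct itemset of `big`,
    with the order preserved."""
    i = 0
    for b in big:
        if i < len(small) and set(small[i]) <= set(b):
            i += 1  # matched small[i]; move on to the next itemset
    return i == len(small)

s = [{3}, {4, 5}, {8}]
print(contains(s, [{3, 8}, {4, 5, 6}, {8}]))             # True
print(contains(s, [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))   # True
print(contains(s, [{7}, {9}, {4, 5, 6}, {3, 8}, {8}]))   # False: {3} only matches after {4,5,6}
print(contains(s, [{7}, {9}, {3, 8}, {4, 5, 6}]))        # False: no itemset containing {8} at the end
```

The greedy match is safe here: taking the earliest possible match for each itemset never rules out a valid later match.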
  81. Sequential Patterns
    cid | customer sequence
    1 | <{30} {90}>
    2 | <{40,70} {30} {40,60,70}>
    3 | <{30,50,70}>
    4 | <{30} {40,70} {90}>
    5 | <{90}>
    A customer supports sequence s if s is contained in the sequence for this customer. Ex: customers 1 and 4 support sequence <{30} {90}>
  82. Sequential Patterns
    cid | customer sequence
    1 | <{30} {90}>
    2 | <{40,70} {30} {40,60,70}>
    3 | <{30,50,70}>
    4 | <{30} {40,70} {90}>
    5 | <{90}>
    The support for a sequence s is defined as the fraction of total customers who support s. Ex: customers 1 and 4 support sequence <{30} {90}> Supp(<{30} {90}>) = 2/5 = 40%
  83. Sequential Patterns
    cid | customer sequence
    1 | <{30} {90}>
    2 | <{40,70} {30} {40,60,70}>
    3 | <{30,50,70}>
    4 | <{30} {40,70} {90}>
    5 | <{90}>
    Supp(<{40,70}>) = 2/5 = 40% (as a sequence, over customers) Supp({40,70}) = 3/10 = 30% (as an itemset, over transactions)
  84. Sequential Patterns Mining Given: A Transaction Database { cid, tid, date, item } Find: All sequences that have support larger than user-specified minimum support Apriori property: if a sequence is large then all sequences contained in that sequence should be large.
  85. Sequential Patterns Mining Identify all large 1-sequences Repeat until there are no more candidate k-sequences: Identify all candidate k-sequences using large (k-1)-sequences Join: Two large (k-1)-sequences, L1 and L2, that are joinable must satisfy the following conditions: L1(1)=L2(1) and L1(2)=L2(2) and … L1(K-2)=L2(K-2) and L1(K-1)≠L2(K-1) Prune: prune candidate k-sequences generated in the join step that have sub-sequences that are not large Determine large k-sequences from candidate k-sequences
  86. Sequential Patterns Mining Minimum Support: 40%
    cid | customer sequence
    1 | <{30} {90}>
    2 | <{40,70} {30} {40,60,70}>
    3 | <{30,50,70}>
    4 | <{30} {40,70} {90}>
    5 | <{90}>
  87. Sequential Patterns Mining Minimum Support: 40%
    cid | customer sequence
    1 | <{30} {90}>
    2 | <{40,70} {30} {40,60,70}>
    3 | <{30,50,70}>
    4 | <{30} {40,70} {90}>
    5 | <{90}>
    Large 1-Sequence: <{30}> support=4/5=80% <{40}> support=2/5=40% <{70}> support=3/5=60% <{90}> support=3/5=60% <{40,70}> support=2/5=40%
  88. Sequential Patterns Mining Large 1-Sequence: <{30}> support=4/5=80% <{40}> support=2/5=40% <{70}> support=3/5=60% <{90}> support=3/5=60% <{40,70}> support=2/5=40% Candidate 2-Sequence: <{30} {40}> <{30} {70}> <{30} {90}> <{30} {40,70}> <{40} {30}> <{40} {70}> <{40} {90}> <{40} {40,70}> <{70} {30}> <{70} {40}> <{70} {90}> <{70} {40,70}> <{90} {30}> <{90} {40}> <{90} {70}> <{90} {40,70}> <{40,70} {30}> <{40,70} {40}> <{40,70} {70}> <{40,70} {90}>
  89. Sequential Patterns Mining Candidate 2-Sequence: <{30} {40}> <{30} {70}> <{30} {90}> <{30} {40,70}> <{40} {30}> <{40} {70}> <{40} {90}> <{40} {40,70}> <{70} {30}> <{70} {40}> <{70} {90}> <{70} {40,70}> <{90} {30}> <{90} {40}> <{90} {70}> <{90} {40,70}> <{40,70} {30}> <{40,70} {40}> <{40,70} {70}> <{40,70} {90}> Large 2-Sequence: <{30} {40}> support=2/5=40% <{30} {70}> support=2/5=40% <{30} {40,70}> support=2/5=40%
  90. Sequential Patterns Mining Large 2-Sequence: <{30} {40}> support=2/5=40% <{30} {70}> support=2/5=40% <{30} {40,70}> support=2/5=40% Candidate 3-Sequence: <{30} {40} {70}> <{30} {40} {40,70}> <{30} {70} {40}> <{30} {70} {40,70}> <{30} {40,70} {40}> <{30} {40,70} {70}> Prune: All sub-sequences of a candidate k-sequence should be large. After pruning, no candidate 3-sequence remains. Stop.
  91. Clustering
  92. Problem Target Marketing Swiss Cheese and Belgian Chocolate Diaper Baby food Toys French Wine
  93. Clustering Clustering is a data mining method for grouping data points such that data points within the same cluster are similar and data points in different clusters are dissimilar. How to calculate similarity between data points??
  94. Clustering Why clustering SQL based OLAP is not suitable for clustering objects whose attributes have a large number of possible values SQL based OLAP is not suitable for clustering objects with a large number of attributes
  95. Introduction Clustering groups objects without pre-specified class labels into a set of non-predetermined classes of similar objects [Diagram: objects O1–O6, each with relevant attribute values but no class labels, are grouped by clustering into non-predetermined classes X, Y, and Z]
  96. An example We can cluster customers based on their purchase behavior.
  97. Applications For discovery Customers by shopping behavior, credit rating and/or demographics Insurance policy holders Plants, animals, genes, protein structures Hand writing Images Drawings Land uses Documents Web pages For pre-processing – data segmentation and outlier analysis For conceptual clustering – traditional clustering + classification/characterization to describe each cluster
  98. Basic Terminology Cluster – a collection of objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. Distance measure – how dissimilar (similar) objects are Non-negative Distance from an object to itself = 0 Symmetric The distance between two objects, A & B, is no greater than the sum of the distance from A to a third object C and the distance from C to B (triangle inequality)
  99. Clustering Process Compute similarity between objects/clusters Clustering based on similarity between objects/clusters
  100. Similarity/Dissimilarity An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, gender etc.) When measuring similarity between objects we measure similarity between variables of objects. Instead of measuring similarity between variables, we use distance to measure dissimilarity between variables.
  101. Clustering Steps in clustering objects Compute similarity between objects Clustering based on similarity between objects
  102. Similarity An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, gender etc.) When measuring similarity between objects we measure similarity between variables of objects. Instead of measuring similarity between variables, we use distance to measure dissimilarity between variables.
  103. Measuring Similarity Continuous variable Use distance to measure dissimilarity between data points For two data points, distance between them can be measured in two ways Manhattan distance Euclidean distance
  104. Dissimilarity For two objects X and Y with continuous variables 1, 2, …, n, the Manhattan distance is defined as: d(X, Y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
  105. Measuring Dissimilarity (similarity) Example of Manhattan distance
    NAME | AGE | SPENDING($)
    Sue | 21 | 2300
    Carl | 27 | 2600
    TOM | 45 | 5400
    JACK | 52 | 6000
  106. Measuring Dissimilarity (similarity) For two objects X and Y with continuous variables 1, 2, …, n, the Euclidean distance is defined as: d(X, Y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2)
  107. Measuring Similarity Example of Euclidean distance
    NAME | AGE | SPENDING($)
    Sue | 21 | 2300
    Carl | 27 | 2600
    TOM | 45 | 5400
    JACK | 52 | 6000
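Both distance measures can be sketched in a few lines of Python, using the Sue/Carl pair from the table above (age and spending only):

```python
import math

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

sue, carl = (21, 2300), (27, 2600)  # (age, spending) from the table above
print(manhattan(sue, carl))   # 306
print(euclidean(sue, carl))   # ~300.06
```

Note how both results are dominated by the spending difference (300) over the age difference (6), which motivates the normalization discussed a few slides later.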
  108. Similarity/Dissimilarity Binary variable Normalized Manhattan distance = number of un-matched variables / total number of variables
    NAME | Married | Gender | Home Internet
    Sue | Y | F | Y
    Carl | Y | M | Y
    TOM | N | M | N
    JACK | N | M | N
  109. Similarity/Dissimilarity Nominal/ordinal variables
    NAME | AGE | BALANCE($) | INCOME | EYES | GENDER
    Karen | 21 | 2300 | high | Blue | F
    Sue | 21 | 2300 | high | Blue | F
    Carl | 27 | 5400 | high | Brown | M
    We assign 0/1 based on exact-match criteria: Same gender = 0, different gender = 1 Same eye color = 0, different eye color = 1 We can also “rank” an ordinal attribute such as income: high = 3, medium = 2, low = 1 E.g. distance(high, low) = 2
  110. Distance Calculation
    NAME | AGE | BALANCE($) | INCOME | EYES | GENDER
    Sue | 21 | 2300 | high | Blue | F
    Carl | 27 | 5400 | high | Brown | M
    Manhattan distance: 6 + 3100 + 0 + 1 + 1 = 3108 Euclidean distance: square root(6^2 + 3100^2 + 0 + 1 + 1) Is there a problem?
  111. Normalization Normalization of dimension values: In the previous example, “balance” is dominant Set the minimum and maximum distance values for each dimension to be the same (e.g., 0 - 100)
    NAME | AGE | BALANCE($) | INCOME | EYES | GENDER
    Sue | 21 | 2300 | high | Blue | F
    Carl | 27 | 5400 | high | Brown | M
    Don | 18 | 0 | low | Black | M
    Amy | 62 | 16,543 | low | Blue | F
    Assume that ages range from 0 - 100 Manhattan distance (Sue, Carl): 6 + 100 * ((5400 − 2300) / 16543) + 0 + 100 + 100
  112. Standardization Calculate the mean value Calculate the mean absolute deviation Standardize each variable value as: standardized value = (original value − mean value) / mean absolute deviation
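The standardization formula above can be sketched as follows; the age values reused here come from the earlier Sue/Carl/Tom/Jack example:

```python
def standardize(values):
    """(value - mean) / mean absolute deviation, per the formula above."""
    mean = sum(values) / len(values)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad for v in values]

# Ages from the earlier Sue, Carl, Tom, Jack example:
print(standardize([21, 27, 45, 52]))
```

For these ages the mean is 36.25 and the mean absolute deviation is 12.25, so the standardized values are roughly -1.24, -0.76, 0.71, and 1.29; the standardized values always sum to zero.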
  113. Hierarchical Algorithms Output: a tree of clusters where a parent node (cluster) consists of objects in its child nodes (clusters) Input: Objects and distance measure only. No need for a pre-specified number of clusters. Agglomerative hierarchical clustering: Bottom-up Leaf nodes are individual objects Merge lower level clusters by optimizing a clustering criterion until the termination conditions are satisfied. More popular
  114. Hierarchical Algorithms Output: a tree of clusters where a parent node (cluster) consists of objects in its child nodes (clusters) Input: Objects and distance measure only. No need for a pre-specified number of clusters. Divisive hierarchical clustering: Top-down The root node corresponds to the whole set of the objects Subdivides a cluster into smaller clusters by optimizing a clustering criterion until the termination conditions are met.
  115. Clustering based on dissimilarity After calculating dissimilarity between objects, a dissimilarity matrix can be created with objects as indexes and dissimilarities between objects as elements. Distance between clusters Min, Max, Mean and Average
  116. Clustering based on dissimilarity
    | Sue | Tom | Carl | Jack | Mary
    Sue | 0 | 6 | 8 | 2 | 7
    Tom | 6 | 0 | 1 | 5 | 3
    Carl | 8 | 1 | 0 | 10 | 9
    Jack | 2 | 5 | 10 | 0 | 4
    Mary | 7 | 3 | 9 | 4 | 0
  117. Bottom-up Hierarchical Clustering Step 1: Initially, place each object in a unique cluster Step 2: Calculate dissimilarity between clusters Dissimilarity between clusters is the minimum dissimilarity between two objects, one from each cluster Step 3: Merge the two clusters with the least dissimilarity Step 4: Repeat steps 2 and 3 until all objects are in one cluster
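Steps 1-4, with the minimum (single-link) cluster dissimilarity of Step 2, can be sketched in Python against the dissimilarity matrix from the previous slide:

```python
# Single-link (minimum-dissimilarity) bottom-up clustering over the
# dissimilarity matrix from the slide above.
names = ["Sue", "Tom", "Carl", "Jack", "Mary"]
D = {("Sue", "Tom"): 6, ("Sue", "Carl"): 8, ("Sue", "Jack"): 2,
     ("Sue", "Mary"): 7, ("Tom", "Carl"): 1, ("Tom", "Jack"): 5,
     ("Tom", "Mary"): 3, ("Carl", "Jack"): 10, ("Carl", "Mary"): 9,
     ("Jack", "Mary"): 4}

def d(a, b):
    return 0 if a == b else D.get((a, b), D.get((b, a)))

clusters = [{n} for n in names]  # Step 1: each object in its own cluster
merges = []
while len(clusters) > 1:
    # Steps 2-3: find and merge the pair of clusters with the least
    # single-link dissimilarity (minimum over cross-cluster object pairs).
    best = min(((ci, cj) for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]),
               key=lambda p: min(d(a, b) for a in p[0] for b in p[1]))
    clusters.remove(best[0])
    clusters.remove(best[1])
    clusters.append(best[0] | best[1])
    merges.append(best)
print(merges)
```

On this matrix the first merge is {Tom, Carl} (dissimilarity 1), the second is {Sue, Jack} (dissimilarity 2), then Mary joins {Tom, Carl}, and the final merge produces one cluster of all five objects.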
  118. Nearest Neighbor Clustering (Demographic Clustering) Dissimilarity by votes Merge an object into a cluster with the lowest avg dissimilarity If the avg dissimilarity with each cluster exceeds a threshold, the object forms its own cluster Stop after a max # of passes, a max # of clusters or no significant changes in the avg dissimilarities in each cluster
  119. Comparative Criteria for Clustering Algorithms Performance Scalability Ability to deal with different attribute types Clusters with arbitrary shape Need K or not Noise handling Sensitivity to the order of input records High dimensionality (# of attributes) Constraint-based clustering Interpretability and usability
  120. Summary of Clustering Problem definition Input: objects without class labels Output: clusters for discovery and conceptual clustering for prediction Similarity/dissimilarity measures and calculations Hierarchical Clustering Criteria for comparing algorithms
  121. Classification
  122. Problem Credit rating Credit card approval → Credit rating → Rules + applicant’s profile → Rules are learned from old data → How to learn these rules? (classification) Product purchasing prediction
  123. Introduction Classification classifies objects into a set of pre-specified object classes based on the values of relevant object attributes and the objects’ class labels [Diagram: objects O1–O6, each with relevant attribute values and class labels, are assigned by a classifier into pre-determined classes X, Y, and Z]
  124. Introduction When to use it? Discovery (descriptive, explanatory) Prediction (prescriptive, decision support) When the relevant object data can be decided and is available Real World Applications Profiling/predicting customer purchases Loan/credit approval Fraud/intrusion detection Diagnosis decision support
  125. Example [Scatter plot of customers by Age and Income; points are labeled Churn or Not Churn]
  126. Notations Prediction object, classification samples, classification attributes, class label attribute, problem space [Same Age/Income scatter plot annotated with these terms; class labels are Churn and Not Churn]
  127. Object Data Required Class Label Attribute: Dependent variable, output attribute, prediction variable, … Variable whose values label objects’ classes Classification Attributes: Independent variables, input attributes, or predictor variables Object variables whose values affect objects’ class labels Three types: numerical (age, income), categorical (hair color, sex), ordinal (severity of an injury)
  128. Data Two types of attributes: Description attribute: attribute that describes an object, such as age, income level of a customer Class label attribute: attribute that identifies the class an object belongs to.
  129. Data [Table with the class label attribute and the description attributes highlighted]
  130. Classification Vs. Prediction View 1 Classification: discovery Prediction: predictive utilizing classification results (rules) View 2 Either discovery or predictive Classification: categorical or ordinal class labels Prediction: numerical (continuous) class labels Class lectures, assignment and exam: View 1 Text: View 2
  131. Classification & Prediction Main Function Mappings from input attribute values to output attribute values Methods affect how the mappings are derived and represented Process Training (supervised): derives the mappings Testing: evaluate accuracy of the mappings
  132. Classification & Prediction Classification samples: divided into training and testing sets Often processed in batch mode Include class labels Prediction objects Often processed in online mode No class labels
  133. Classification Methods Comparative Criteria Accuracy Speed Robustness Scalability Interpretability Data types Classic methods Decision Tree Neural Network Bayesian Network
  134. Model [Decision tree: the root tests Age; if Age < 30 → Low; if Age >= 30, test Income: Low → Low, High → High]
  135. Rules derived from a model If age < 30 then credit rating = Low If age >= 30 and income level = low then credit rating = Low If age >= 30 and income level = high then credit rating = High
  136. Data
  137. Classification Entropy: a measurement of the diversity of a data set E = −Σi pi log2 pi, where pi is the fraction of records in class i When there are only two classes in a data set: E = −(p1 log2 p1 + p2 log2 p2) The bigger E is, the more diverse the data set is
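A minimal entropy function matching the formula above. The 9/14 and 5/14 class fractions used in the demo are an assumption (the data slide is an image), chosen because they reproduce the E(BD) = 0.940 figure quoted two slides later:

```python
import math

def entropy(probs):
    """E = -sum(p * log2 p) over class probabilities p (0 log 0 taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed class split: 9 "yes" / 5 "no" out of 14 records.
print(f"{entropy([9 / 14, 5 / 14]):.3f}")  # 0.940
print(entropy([0.5, 0.5]))                 # 1.0 (maximum diversity for two classes)
```

A pure node (all records in one class) has entropy 0, and a two-class node is most diverse at a 50/50 split, which is why the divide-and-conquer step below picks the attribute that reduces entropy the most.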
  138. Classification Divide and Conquer Pick the attribute to divide the data set that gives the most entropy reduction Stop when there is no attribute left to pick or the data in all leaf nodes are pure (i.e., belong to one class)
  139. Classification Step 1: there are four attributes to pick: student, income, age, and credit rating E(BD) = 0.940 E(D|student) = 0.789 E(D|age) = 0.694 E(D|income) = 0.911 E(D|credit) = 0.892
  140. Classification Step 2: Divide the original data set by age into subset 1 (<=30), subset 2 (31-40) and subset 3 (>40) Step 3-1: For subset 1, there are three attributes to pick: income, student, and credit E(BD) = 1.17 E(D|student) = 0 E(D|income) = ?? E(D|credit) = ??
  141. Classification Step 3-2: Divide subset 1 by student into subset1-1 (yes) and subset 1-2 (no) Step 4-1: For subset 3, there are three attributes to pick: income, student, and credit E(BD) = 1.17 E(D|credit) = 0 E(D|income) = ?? E(D|student) = ?? Step 4-2: Divide subset 3 by credit into subset3-1 (fair) and subset 3-2 (excellent)
  142. Extract rules from the model Each path from the root to a leaf node forms an IF-THEN rule. In such a rule, the tests at the root and internal nodes are conjoined to form the IF part. The leaf node denotes the THEN part of the rule.
  143. Example of Clustering & Classification
  144. Iris Irises are wonderful garden plants that can grow in deserts, swamps, cold weather, and temperate climates
  145. Iris – Vincent Van Gogh’s Iris Paintings
  146. Iris --- Iris Setosa Grows in Alaska, Japan, China and Northern Asia
  147. Iris --- Iris Versicolor Grows in the northern USA
  148. Iris --- Iris Virginica Grows in the southeastern USA
  149. Classification – Iris data
    # | Sepal Length | Sepal Width | Petal Length | Petal Width | Iris class
    1 | 5.10 | 3.50 | 1.40 | 0.20 | Iris-setosa
  150. Clustering – Bank data Example record: female 18.0 2 449 blue 5 Fixed-width fields: 1-6 gender, 10-15 age, 24-25 number of siblings, 30-36 income, 37-43 education, 45-45 product type (CD, Saving, Checking, etc.)