An Excel-based Data Mining Tool: iDA
ESX: A Multipurpose Tool for Data Mining
The Algorithmic Logic Behind ESX • Given • A set of existing concept-level nodes C1, ..., Cn • An average class resemblance score S • A new instance I to be classified • Classify I into the concept class that improves S the most, or hurts S the least (see the sketch below). • If learning is unsupervised, create a new concept node containing I alone whenever doing so yields a better S score.
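In pseudocode terms, the decision rule amounts to scoring each candidate placement of I and keeping the best. Below is a minimal Python sketch of that logic; the similarity function and the list-of-lists node representation are illustrative assumptions, not ESX's actual internals.

```python
# A minimal sketch of the ESX decision rule described above; the
# similarity function and node representation are assumptions.

def resemblance(instances, similarity):
    """Average pairwise similarity; a lone instance scores a perfect 1.0."""
    pairs = [(a, b) for i, a in enumerate(instances) for b in instances[i + 1:]]
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def average_class_resemblance(classes, similarity):
    """S: the mean of the per-class resemblance scores."""
    return sum(resemblance(c, similarity) for c in classes) / len(classes)

def classify(classes, instance, similarity, unsupervised=False):
    """Place `instance` where it improves S the most, or hurts S the least."""
    scores = []
    for i in range(len(classes)):
        trial = [c + [instance] if j == i else c for j, c in enumerate(classes)]
        scores.append(average_class_resemblance(trial, similarity))
    best = max(range(len(classes)), key=scores.__getitem__)
    if unsupervised:
        # A brand-new concept node holding I alone may yield a better S.
        new_node = average_class_resemblance(classes + [[instance]], similarity)
        if new_node > scores[best]:
            classes.append([instance])
            return len(classes) - 1
    classes[best].append(instance)
    return best
```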
iDAV Format for Data Mining • iDA attribute/value format • First row: attribute names • Second row: attribute type identifier • C: categorical, R: real (covers any numeric field) • Third row: attribute usage identifier • I: input, O: output, U: unused, D: display only • Fourth row onward: the data instances (see the sketch below)
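For illustration, here is the iDAV layout written out with Python's csv module; the attribute names and values are invented, and a real iDA sheet would live in Excel rather than a CSV file.

```python
# A sketch of the iDAV row layout; the attributes are made up.
import csv

rows = [
    ["Age", "Income", "Churn"],   # row 1: attribute names
    ["R",   "R",      "C"],       # row 2: type (C = categorical, R = real)
    ["I",   "I",      "O"],       # row 3: usage (I/O/U/D)
    [35,    52000,    "no"],      # row 4 onward: the data instances
    [62,    31000,    "yes"],
]

with open("idav_example.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```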
A Five-step Approach for Unsupervised Clustering • Step 1: Enter the Data to be Mined • Step 2: Perform a Data Mining Session • Step 3: Read and Interpret Summary Results • Step 4: Read and Interpret Individual Class Results • Step 5: Visualize Individual Class Rules
Step 2: Perform a Data Mining Session • iDA -> begin mining session • Select the instance similarity and real-valued tolerance settings
Step 3: Read and Interpret Summary Results • Class Resemblance Scores • Similarity of the instances within each class • Domain Resemblance Score • Similarity of the instances across the entire set • Cluster Quality • Class resemblance measured against domain resemblance; a cluster should be at least as internally similar as the domain as a whole (see the sketch below)
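One plausible reading of how these scores relate, as a Python sketch; the exact ESX formulas may differ. resemblance() is the same helper as in the earlier sketch: average pairwise similarity within a set of instances.

```python
# A hedged sketch of how the three summary scores relate.

def resemblance(instances, similarity):
    pairs = [(a, b) for i, a in enumerate(instances) for b in instances[i + 1:]]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

def cluster_quality(class_instances, domain_instances, similarity):
    # A cluster should resemble itself at least as strongly as the whole
    # domain resembles itself, i.e. this ratio should be >= 1.
    class_res = resemblance(class_instances, similarity)
    domain_res = resemblance(domain_instances, similarity)
    return class_res / domain_res
```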
Step 3: Results about Attributes • Categorical • Domain Predictability • Given categorical attribute A with possible values v1, ..., vn, domain predictability gives, for each value vi, the percentage of instances with A = vi (if a score is close to 100%, most instances share the same value, so the attribute is of little use for learning purposes) • Numeric • Attribute Significance • Given numeric attribute A, take the range of the class means and divide it by the domain standard deviation (higher values are better for differentiation purposes; see the sketch below)
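Both scores follow directly from the definitions above. The sketch below uses only the standard library; the example data is invented.

```python
# A sketch of the two attribute scores as defined above.
from statistics import mean, pstdev

def domain_predictability(values, v):
    """Percentage of all instances whose attribute value equals v."""
    return 100 * values.count(v) / len(values)

def attribute_significance(values_by_class):
    """Range of the class means divided by the domain standard deviation."""
    class_means = [mean(vals) for vals in values_by_class.values()]
    domain = [v for vals in values_by_class.values() for v in vals]
    return (max(class_means) - min(class_means)) / pstdev(domain)

print(domain_predictability(["red", "red", "red", "blue"], "red"))  # 75.0
print(attribute_significance({"A": [1.0, 2.0, 1.5],
                              "B": [8.0, 9.0, 8.5]}))               # ~1.99
```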
Step 4: Read and Interpret Individual Class Results • Class Predictability is a within-class measure. • Given class C and categorical attribute A with possible values v1, ..., vn, class predictability for vi gives the percentage of instances in C that have A = vi. • Class Predictiveness is a between-class measure. • Given class C and categorical attribute A with possible values v1, ..., vn, class predictiveness for vi is the probability that an instance belongs to C given that it has value vi for A (see the sketch below).
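A small sketch of both measures from their definitions, with invented data chosen so that predictability is 1.0 while predictiveness is not, which connects to the next slide's discussion of necessary and sufficient conditions.

```python
# A sketch of the two measures; `data` maps each class label to the list
# of values its instances take for one categorical attribute.

def class_predictability(data, cls, value):
    """Within-class: fraction of instances in `cls` having this value."""
    return data[cls].count(value) / len(data[cls])

def class_predictiveness(data, cls, value):
    """Between-class: P(instance belongs to `cls` | attribute == value)."""
    return data[cls].count(value) / sum(v.count(value) for v in data.values())

data = {"sick": ["fever", "fever", "fever"], "healthy": ["none", "fever"]}
print(class_predictability(data, "sick", "fever"))   # 1.0  -> necessary
print(class_predictiveness(data, "sick", "fever"))   # 0.75 -> not sufficient
```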
Necessary and Sufficient Conditions • A predictiveness score of 1.0 tells us that all instances with the particular attribute value belong to this particular class. => Attribute = v is a sufficient condition for membership in this class. • A predictability score of 1.0 tells us that all the instances in this class have Attribute = v. => Attribute = v is a necessary condition for membership in this class.
Necessary and/or Sufficient Conditions • If both the predictability and predictiveness scores are 1.0, the particular attribute value is necessary and sufficient for class membership. • ESX reports attribute values whose scores meet a cut-off (0.80) as highly necessary and highly sufficient.
RuleMaker Settings • Recall that we used the setting to ask RuleMaker to generate all rules. This is a good way to learn about the nature of the problem at hand.
A Six-Step Approach for Supervised Learning • Step 1: Choose an Output Attribute • Step 2: Perform the Mining Session • Step 3: Read and Interpret Summary Results • Step 4: Read and Interpret Test Set Results • Step 5: Read and Interpret Class Results • Step 6: Visualize and Interpret Class Rules
Perform the Mining Session • Decide on the size of the training set. • The remaining instances are used by the software to test the developed model, and the evaluation results are reported (see the sketch below).
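A sketch of the hold-out idea; the 2/3 training fraction is a common rule of thumb, not an iDA requirement.

```python
# A sketch of splitting off a test set before mining.
import random

def split(instances, train_fraction=2 / 3, seed=42):
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # training set, test set

train, test = split(list(range(100)))
print(len(train), len(test))   # 66 34
```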
Read and Interpret Summary Results • The worksheet +RES SUM contains summary information. • Class resemblance scores, attribute summary information (categorical and numeric), and the most commonly occurring attribute values for each class are given.
Read and Interpret Test Set Results • Worksheets +RES TST and +RES MTX report performance on the test set, which was not part of model training. • +RES MTX reports the confusion matrix. • +RES TST reports, for each instance in the test set, the model's classification and whether it is correct (see the sketch below).
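The two reports can be mimicked in a few lines; the (actual, predicted) pairs below are invented.

```python
# A sketch of the per-instance report and the confusion matrix.
from collections import Counter

results = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "no")]

# RES TST style: each instance's classification and whether it is correct.
for actual, predicted in results:
    print(predicted, "correct" if predicted == actual else "incorrect")

# RES MTX style: counts of (actual, predicted) pairs.
matrix = Counter(results)
print(matrix[("yes", "no")])   # 1 'yes' instance misclassified as 'no'
```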
Read and Interpret Class Results • Just as individual clusters are of interest in unsupervised clustering, information about individual classes is relevant in supervised learning. • Worksheet +RES CLS contains this information. • The most and least typical instances are also given here. • Worksheet +RUL TYP gives typicality scores for all of the instances in the test set.
Visualize and Interpret Class Rules • All rules or a covering set of rules? • Worksheet RES RUL contains the rules generated by RuleMaker. • If all rules are generated, there may be overlapping coverage. • The covering set algorithm works iteratively, identifying the best covering rule and then removing the instances it covers from the set still to be covered (see the sketch below). • RuleMaker can be rerun without rerunning the mining algorithm; this menu item can be used to change the RuleMaker settings and generate alternative rule sets.
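A sketch of the covering-set loop described above, with rules modeled as named predicates; this is the general greedy idea, not RuleMaker's actual implementation.

```python
# Greedy covering: repeatedly take the rule covering the most
# still-uncovered instances, then remove those instances.

def covering_set(rules, instances):
    uncovered, chosen = set(instances), []
    while uncovered:
        name, pred = max(rules, key=lambda r: sum(r[1](i) for i in uncovered))
        covered = {i for i in uncovered if pred(i)}
        if not covered:          # no remaining rule covers anything further
            break
        chosen.append(name)
        uncovered -= covered
    return chosen

rules = [("x < 5", lambda x: x < 5), ("x >= 5", lambda x: x >= 5)]
print(covering_set(rules, range(10)))   # ['x < 5', 'x >= 5']
```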
Generating Rules: The General Idea • Step 1: Choose the attribute that best differentiates the instances of the domain or subclass. • Step 2: Use that attribute to subdivide the instances into subclasses. • Step 3: For each subclass: • If the instances meet a predefined criterion, generate a defining rule for the subclass. • If the criterion is not met, return to Step 1 with the subclass instances. • A sketch of this recursion follows below.
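Here is a sketch of that recursion, using class purity as the predefined criterion and a toy attribute scorer; real rule generators use more careful criteria.

```python
# Recursive rule generation: subdivide on the best attribute until each
# subclass is pure, then emit a defining rule (conditions -> label).

def best_attribute(instances, attributes):
    """Toy scorer: prefer the attribute yielding the fewest mixed subsets."""
    def mixed(attr):
        groups = {}
        for features, label in instances:
            groups.setdefault(features[attr], set()).add(label)
        return sum(len(labels) > 1 for labels in groups.values())
    return min(attributes, key=mixed)

def generate_rules(instances, attributes, conditions=()):
    labels = {label for _, label in instances}
    if len(labels) <= 1 or not attributes:   # purity reached, or nothing left
        return [(conditions, labels.pop() if labels else None)]
    attr = best_attribute(instances, attributes)
    rules = []
    for value in {f[attr] for f, _ in instances}:
        subset = [(f, l) for f, l in instances if f[attr] == value]
        rules += generate_rules(subset, attributes - {attr},
                                conditions + ((attr, value),))
    return rules

data = [({"color": "red", "size": "s"}, "buy"),
        ({"color": "red", "size": "l"}, "buy"),
        ({"color": "blue", "size": "s"}, "skip")]
print(generate_rules(data, {"color", "size"}))
```

Because attribute values are iterated from a set, the order of the returned rules may vary between runs.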
Techniques for Generating Rules • Define the scope of the rules. • Choose the instances. • Set the minimum rule correctness. • Define the minimum rule coverage. • Choose an attribute significance value.
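For illustration only, the five settings could be grouped as below; the field names and defaults are hypothetical, not RuleMaker's actual interface.

```python
# A hypothetical grouping of the five rule-generation settings.
from dataclasses import dataclass

@dataclass
class RuleSettings:
    scope: str = "all classes"        # which classes rules are generated for
    instance_set: str = "training"    # which instances the rules must cover
    min_correctness: float = 0.80     # fraction of covered instances that
                                      # must actually belong to the class
    min_coverage: float = 0.10        # fraction of class instances a rule
                                      # must cover to be reported
    attribute_significance: float = 0.50  # cutoff for attributes considered
```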
Instance Typicality • Typicality scores can be used to: • Identify prototypical and outlier instances • Select a best set of training instances • Compute classification confidence scores for individual instances (see the sketch below)
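A sketch of typicality as the average similarity of an instance to the other members of its class; the similarity function is assumed.

```python
# Typicality: how strongly an instance resembles the rest of its class.

def typicality(instance, class_members, similarity):
    others = [m for m in class_members if m is not instance]
    return sum(similarity(instance, m) for m in others) / len(others)

members = [1.0, 1.2, 0.9, 5.0]
sim = lambda a, b: 1 / (1 + abs(a - b))
print(sorted(members, key=lambda m: typicality(m, members, sim),
             reverse=True))   # prototypes first, the outlier 5.0 last
```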
Special Considerations and Features • Avoid Mining Delays • The Quick Mine Feature • Erroneous and Missing Data