An Excel-based Data Mining Tool: iDA
ESX: A Multipurpose Tool for Data Mining
The Algorithmic Logic Behind ESX • Given • A set of existing concept-level nodes C1, ..., Cn • An average class resemblance score S • A new instance I to be classified • Classify I into the concept class that improves S the most, or hurts S the least (see the sketch below). • If learning is unsupervised, create a new concept node containing I alone whenever doing so yields a better S score.
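In pseudocode terms, the decision rule amounts to scoring each candidate placement of I and keeping the best. Below is a minimal Python sketch of that logic; the similarity function and the list-of-lists node representation are illustrative assumptions, not ESX's actual internals.

```python
# A minimal sketch of the ESX decision rule described above; the
# similarity function and node representation are assumptions.

def resemblance(instances, similarity):
    """Average pairwise similarity; a lone instance scores a perfect 1.0."""
    pairs = [(a, b) for i, a in enumerate(instances) for b in instances[i + 1:]]
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def average_class_resemblance(classes, similarity):
    """S: the mean of the per-class resemblance scores."""
    return sum(resemblance(c, similarity) for c in classes) / len(classes)

def classify(classes, instance, similarity, unsupervised=False):
    """Place `instance` where it improves S the most, or hurts S the least."""
    scores = []
    for i in range(len(classes)):
        trial = [c + [instance] if j == i else c for j, c in enumerate(classes)]
        scores.append(average_class_resemblance(trial, similarity))
    best = max(range(len(classes)), key=scores.__getitem__)
    if unsupervised:
        # A brand-new concept node holding I alone may yield a better S.
        new_node = average_class_resemblance(classes + [[instance]], similarity)
        if new_node > scores[best]:
            classes.append([instance])
            return len(classes) - 1
    classes[best].append(instance)
    return best
```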
iDAV Format for Data Mining • iDA attribute/value format • First row: attribute names • Second row: attribute type identifier • C: categorical, R: real (covers any numeric field) • Third row: attribute usage identifier • I: input, O: output, U: unused, D: display only • Fourth row onward: the data instances (see the sketch below)
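For illustration, here is the iDAV layout written out with Python's csv module; the attribute names and values are invented, and a real iDA sheet would live in Excel rather than a CSV file.

```python
# A sketch of the iDAV row layout; the attributes are made up.
import csv

rows = [
    ["Age", "Income", "Churn"],   # row 1: attribute names
    ["R",   "R",      "C"],       # row 2: type (C = categorical, R = real)
    ["I",   "I",      "O"],       # row 3: usage (I/O/U/D)
    [35,    52000,    "no"],      # row 4 onward: the data instances
    [62,    31000,    "yes"],
]

with open("idav_example.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```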
A Five-step Approach for Unsupervised Clustering • Step 1: Enter the Data to be Mined • Step 2: Perform a Data Mining Session • Step 3: Read and Interpret Summary Results • Step 4: Read and Interpret Individual Class Results • Step 5: Visualize Individual Class Rules
Step 2: Perform a Data Mining Session • iDA -> begin mining session • Select the instance similarity and real-valued tolerance settings
Step 3: Read and Interpret Summary Results • Class Resemblance Scores • Similarity of the instances within each class • Domain Resemblance Score • Similarity of the instances across the entire set • Cluster Quality • Class resemblance measured against domain resemblance; a cluster should be at least as internally similar as the domain as a whole (see the sketch below)
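One plausible reading of how these scores relate, as a Python sketch; the exact ESX formulas may differ. resemblance() is the same helper as in the earlier sketch: average pairwise similarity within a set of instances.

```python
# A hedged sketch of how the three summary scores relate.

def resemblance(instances, similarity):
    pairs = [(a, b) for i, a in enumerate(instances) for b in instances[i + 1:]]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

def cluster_quality(class_instances, domain_instances, similarity):
    # A cluster should resemble itself at least as strongly as the whole
    # domain resembles itself, i.e. this ratio should be >= 1.
    class_res = resemblance(class_instances, similarity)
    domain_res = resemblance(domain_instances, similarity)
    return class_res / domain_res
```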
Step 3: Results about Attributes • Categorical • Domain Predictability • Given categorical attribute A with possible values v1, ..., vn, domain predictability gives, for each value vi, the percentage of instances with A = vi (if a score is close to 100%, most instances share the same value, so the attribute is of little use for learning purposes) • Numeric • Attribute Significance • Given numeric attribute A, take the range of the class means and divide it by the domain standard deviation (higher values are better for differentiation purposes; see the sketch below)
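Both scores follow directly from the definitions above. The sketch below uses only the standard library; the example data is invented.

```python
# A sketch of the two attribute scores as defined above.
from statistics import mean, pstdev

def domain_predictability(values, v):
    """Percentage of all instances whose attribute value equals v."""
    return 100 * values.count(v) / len(values)

def attribute_significance(values_by_class):
    """Range of the class means divided by the domain standard deviation."""
    class_means = [mean(vals) for vals in values_by_class.values()]
    domain = [v for vals in values_by_class.values() for v in vals]
    return (max(class_means) - min(class_means)) / pstdev(domain)

print(domain_predictability(["red", "red", "red", "blue"], "red"))  # 75.0
print(attribute_significance({"A": [1.0, 2.0, 1.5],
                              "B": [8.0, 9.0, 8.5]}))               # ~1.99
```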
Step 4: Read and Interpret Individual Class Results • Class Predictability is a within-class measure. • Given class C and categorical attribute A with possible values v1, ..., vn, class predictability for vi gives the percentage of instances in C that have A = vi. • Class Predictiveness is a between-class measure. • Given class C and categorical attribute A with possible values v1, ..., vn, class predictiveness for vi is the probability that an instance belongs to C given that it has value vi for A (see the sketch below).
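A small sketch of both measures from their definitions, with invented data chosen so that predictability is 1.0 while predictiveness is not, which connects to the next slide's discussion of necessary and sufficient conditions.

```python
# A sketch of the two measures; `data` maps each class label to the list
# of values its instances take for one categorical attribute.

def class_predictability(data, cls, value):
    """Within-class: fraction of instances in `cls` having this value."""
    return data[cls].count(value) / len(data[cls])

def class_predictiveness(data, cls, value):
    """Between-class: P(instance belongs to `cls` | attribute == value)."""
    return data[cls].count(value) / sum(v.count(value) for v in data.values())

data = {"sick": ["fever", "fever", "fever"], "healthy": ["none", "fever"]}
print(class_predictability(data, "sick", "fever"))   # 1.0  -> necessary
print(class_predictiveness(data, "sick", "fever"))   # 0.75 -> not sufficient
```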
Necessary and Sufficient Conditions • A predictiveness score of 1.0 tells us that all instances with the particular attribute value belong to this particular class. => Attribute = v is a sufficient condition for membership in this class. • A predictability score of 1.0 tells us that all the instances in this class have Attribute = v. => Attribute = v is a necessary condition for membership in this class.
Necessary and/or Sufficient Conditions • If both the predictability and predictiveness scores are 1.0, the particular attribute value is necessary and sufficient for class membership. • ESX reports attribute values whose scores meet a cut-off (0.80) as highly necessary and highly sufficient.
RuleMaker Settings • Recall that we used the setting to ask RuleMaker to generate all rules. This is a good way to learn about the nature of the problem at hand.
A Six-Step Approach for Supervised Learning • Step 1: Choose an Output Attribute • Step 2: Perform the Mining Session • Step 3: Read and Interpret Summary Results • Step 4: Read and Interpret Test Set Results • Step 5: Read and Interpret Class Results • Step 6: Visualize and Interpret Class Rules
Perform the Mining Session • Decide on the size of the training set. • The remaining instances are used by the software to test the developed model, and the evaluation results are reported (see the sketch below).
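A sketch of the hold-out idea; the 2/3 training fraction is a common rule of thumb, not an iDA requirement.

```python
# A sketch of splitting off a test set before mining.
import random

def split(instances, train_fraction=2 / 3, seed=42):
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # training set, test set

train, test = split(list(range(100)))
print(len(train), len(test))   # 66 34
```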
Read and Interpret Summary Results • The worksheet +RES SUM contains summary information. • Class resemblance scores, attribute summary information (categorical and numeric), and the most commonly occurring attribute values for each class are given.
Read and Interpret Test Set Results • Worksheets +RES TST and +RES MTX report performance on the test set, which was not part of model training. • +RES MTX reports the confusion matrix. • +RES TST reports, for each instance in the test set, the model's classification and whether it is correct (see the sketch below).
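The two reports can be mimicked in a few lines; the (actual, predicted) pairs below are invented.

```python
# A sketch of the per-instance report and the confusion matrix.
from collections import Counter

results = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "no")]

# RES TST style: each instance's classification and whether it is correct.
for actual, predicted in results:
    print(predicted, "correct" if predicted == actual else "incorrect")

# RES MTX style: counts of (actual, predicted) pairs.
matrix = Counter(results)
print(matrix[("yes", "no")])   # 1 'yes' instance misclassified as 'no'
```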
Read and Interpret Class Results • Just as individual clusters are of interest in unsupervised clustering, information about individual classes is relevant in supervised learning. • Worksheet +RES CLS contains this information. • The most and least typical instances are also given here. • Worksheet +RUL TYP gives typicality scores for all of the instances in the test set.
Visualize and Interpret Class Rules • All rules or a covering set of rules? • Worksheet RES RUL contains the rules generated by RuleMaker. • If all rules are generated, there may be overlapping coverage. • The covering set algorithm works iteratively, identifying the best covering rule and then removing the instances it covers from the set still to be covered (see the sketch below). • RuleMaker can be rerun without rerunning the mining algorithm; this menu item can be used to change the RuleMaker settings and generate alternative rule sets.
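A sketch of the covering-set loop described above, with rules modeled as named predicates; this is the general greedy idea, not RuleMaker's actual implementation.

```python
# Greedy covering: repeatedly take the rule covering the most
# still-uncovered instances, then remove those instances.

def covering_set(rules, instances):
    uncovered, chosen = set(instances), []
    while uncovered:
        name, pred = max(rules, key=lambda r: sum(r[1](i) for i in uncovered))
        covered = {i for i in uncovered if pred(i)}
        if not covered:          # no remaining rule covers anything further
            break
        chosen.append(name)
        uncovered -= covered
    return chosen

rules = [("x < 5", lambda x: x < 5), ("x >= 5", lambda x: x >= 5)]
print(covering_set(rules, range(10)))   # ['x < 5', 'x >= 5']
```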
Generating Rules: The General Idea • Step 1: Choose the attribute that best differentiates the instances of the domain or subclass. • Step 2: Use that attribute to subdivide the instances into subclasses. • Step 3: For each subclass: • If the instances meet a predefined criterion, generate a defining rule for the subclass. • If the criterion is not met, return to Step 1 with the subclass instances. • A sketch of this recursion follows below.
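Here is a sketch of that recursion, using class purity as the predefined criterion and a toy attribute scorer; real rule generators use more careful criteria.

```python
# Recursive rule generation: subdivide on the best attribute until each
# subclass is pure, then emit a defining rule (conditions -> label).

def best_attribute(instances, attributes):
    """Toy scorer: prefer the attribute yielding the fewest mixed subsets."""
    def mixed(attr):
        groups = {}
        for features, label in instances:
            groups.setdefault(features[attr], set()).add(label)
        return sum(len(labels) > 1 for labels in groups.values())
    return min(attributes, key=mixed)

def generate_rules(instances, attributes, conditions=()):
    labels = {label for _, label in instances}
    if len(labels) <= 1 or not attributes:   # purity reached, or nothing left
        return [(conditions, labels.pop() if labels else None)]
    attr = best_attribute(instances, attributes)
    rules = []
    for value in {f[attr] for f, _ in instances}:
        subset = [(f, l) for f, l in instances if f[attr] == value]
        rules += generate_rules(subset, attributes - {attr},
                                conditions + ((attr, value),))
    return rules

data = [({"color": "red", "size": "s"}, "buy"),
        ({"color": "red", "size": "l"}, "buy"),
        ({"color": "blue", "size": "s"}, "skip")]
print(generate_rules(data, {"color", "size"}))
```

Because attribute values are iterated from a set, the order of the returned rules may vary between runs.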
Techniques for Generating Rules • Define the scope of the rules. • Choose the instances. • Set the minimum rule correctness. • Define the minimum rule coverage. • Choose an attribute significance value.
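For illustration only, the five settings could be grouped as below; the field names and defaults are hypothetical, not RuleMaker's actual interface.

```python
# A hypothetical grouping of the five rule-generation settings.
from dataclasses import dataclass

@dataclass
class RuleSettings:
    scope: str = "all classes"        # which classes rules are generated for
    instance_set: str = "training"    # which instances the rules must cover
    min_correctness: float = 0.80     # fraction of covered instances that
                                      # must actually belong to the class
    min_coverage: float = 0.10        # fraction of class instances a rule
                                      # must cover to be reported
    attribute_significance: float = 0.50  # cutoff for attributes considered
```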
Instance Typicality • Typicality scores can be used to: • Identify prototypical and outlier instances • Select a best set of training instances • Compute classification confidence scores for individual instances (see the sketch below)
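A sketch of typicality as the average similarity of an instance to the other members of its class; the similarity function is assumed.

```python
# Typicality: how strongly an instance resembles the rest of its class.

def typicality(instance, class_members, similarity):
    others = [m for m in class_members if m is not instance]
    return sum(similarity(instance, m) for m in others) / len(others)

members = [1.0, 1.2, 0.9, 5.0]
sim = lambda a, b: 1 / (1 + abs(a - b))
print(sorted(members, key=lambda m: typicality(m, members, sim),
             reverse=True))   # prototypes first, the outlier 5.0 last
```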
Special Considerations and Features • Avoid Mining Delays • The Quick Mine Feature • Erroneous and Missing Data