
An Excel-based Data Mining Tool iDA



  1. An Excel-based Data Mining Tool iDA

  2. ESX: A Multipurpose Tool for Data Mining

  3. The Algorithmic Logic Behind ESX • Given • A set of existing concept-level nodes C1, ..., Cn • An average class resemblance score S • A new instance I to be classified • Assign I to the concept class that improves S the most, or hurts S the least. • If learning is unsupervised, also create a new concept node containing I alone whenever doing so yields a better S score.
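
The placement rule can be made concrete with a short sketch. The Python below is an illustrative reading of the slide, not ESX's actual implementation; the `similarity(a, b)` function (pairwise instance similarity) is an assumed input, and resemblance is taken to be average pairwise similarity, matching the summary-results slide later on.

```python
# Illustrative sketch of ESX's placement rule (not the tool's actual code).
# `similarity(a, b)` is an assumed pairwise instance-similarity function.

def resemblance(instances, similarity):
    """Average pairwise similarity of the instances in one concept node."""
    if len(instances) < 2:
        return 1.0
    pairs = [(a, b) for i, a in enumerate(instances) for b in instances[i + 1:]]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def average_class_resemblance(nodes, similarity):
    """S: the mean of the per-node class resemblance scores."""
    return sum(resemblance(n, similarity) for n in nodes) / len(nodes)

def place_instance(nodes, instance, similarity, unsupervised=False):
    """Add `instance` where it improves S the most (or hurts it the least)."""
    candidates = []
    for target in range(len(nodes)):
        trial = [n + [instance] if j == target else n
                 for j, n in enumerate(nodes)]
        candidates.append((average_class_resemblance(trial, similarity), trial))
    if unsupervised:
        # Also consider a brand-new concept node holding the instance alone.
        trial = nodes + [[instance]]
        candidates.append((average_class_resemblance(trial, similarity), trial))
    return max(candidates, key=lambda c: c[0])[1]
```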

  4. iDAV Format for Data Mining • iDA attribute/value format • First row: attribute names • Second row: attribute type identifier • C: categorical, R: real (R covers any numeric field) • Third row: attribute usage identifier • I: input, O: output, U: unused, D: display only • Fourth row and beyond: the instance data
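
As a concrete illustration, a hypothetical iDAV worksheet might look like the following; the attribute names and values are invented for the example.

```python
# Hypothetical iDAV-format worksheet, one list per spreadsheet row.
rows = [
    ["Income", "Age", "Risk"],   # row 1: attribute names
    ["R",      "R",   "C"],      # row 2: types (R = numeric, C = categorical)
    ["I",      "I",   "O"],      # row 3: usage (I = input, O = output)
    [42000,    35,    "Good"],   # row 4 and beyond: the instance data
    [18500,    22,    "Poor"],
]
```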

  5. A Five-step Approach for Unsupervised Clustering • Step 1: Enter the Data to be Mined • Step 2: Perform a Data Mining Session • Step 3: Read and Interpret Summary Results • Step 4: Read and Interpret Individual Class Results • Step 5: Visualize Individual Class Rules

  6. Step 1: Enter The Data To Be Mined

  7. Step 2: Perform A Data Mining Session • iDA -> begin mining session • Select the instance similarity and real-valued tolerance settings

  8. RuleMaker Settings

  9. Step 3: Read and Interpret Summary Results • Class Resemblance Scores • Similarity of the instances within a class • Domain Resemblance Score • Similarity of the instances across the entire data set • Cluster Quality • Class resemblance relative to domain resemblance (a cluster should be at least as cohesive as the domain as a whole)
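
Using the `resemblance` helper from the earlier sketch, the quality comparison reads roughly as follows; expressing it as a ratio is an assumption about how the two scores are compared.

```python
def cluster_quality(class_instances, all_instances, similarity):
    """Class resemblance relative to domain resemblance.

    A ratio of at least 1.0 means the cluster is no less cohesive
    than the data set as a whole.
    """
    return (resemblance(class_instances, similarity)
            / resemblance(all_instances, similarity))
```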

  10. Step 3: Results about Attributes • Categorical • Domain Predictability • Given a categorical attribute A with possible values v1, ..., vn, domain predictability gives the percent of instances that have A equal to vi (if a domain predictability score is close to 100%, most instances share the same value, so the attribute is of little use for learning purposes) • Numeric • Attribute Significance • Given attribute A, find the range of the class means and divide by the domain standard deviation (higher values differentiate the classes better)
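
Both attribute measures are easy to state in code. The sketch below mirrors the definitions on the slide rather than ESX's internals, and assumes instances are dicts keyed by attribute name, with `classes` mapping each class name to its member instances.

```python
from collections import Counter
from statistics import mean, pstdev

def domain_predictability(instances, attribute):
    """Percent of all instances sharing each value of a categorical attribute."""
    counts = Counter(inst[attribute] for inst in instances)
    return {v: 100.0 * n / len(instances) for v, n in counts.items()}

def attribute_significance(classes, attribute):
    """Range of the class means divided by the domain standard deviation.

    `classes` maps class name -> list of instances; the attribute is numeric.
    """
    class_means = [mean(inst[attribute] for inst in members)
                   for members in classes.values()]
    domain_values = [inst[attribute]
                     for members in classes.values() for inst in members]
    return (max(class_means) - min(class_means)) / pstdev(domain_values)
```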

  11. Step 4: Read and Interpret Individual Class Results • Class Predictability is a within-class measure. • Given class C and categorical attribute A with possible values v1, ..., vn, class predictability gives the percent of instances in C that have A equal to vi. • Class Predictiveness is a between-class measure. • Given class C and categorical attribute A with possible values v1, ..., vn, class predictiveness for vi is the probability that an instance belongs to C given that it has value vi for A.
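
The two measures differ only in which total they divide by, which a short sketch makes plain (instances as dicts and `classes` mapping class name to members, as assumed above):

```python
def class_predictability(class_instances, attribute, value):
    """Within-class: fraction of instances in C whose attribute equals value."""
    hits = sum(1 for inst in class_instances if inst[attribute] == value)
    return hits / len(class_instances)

def class_predictiveness(classes, target, attribute, value):
    """Between-class: P(instance belongs to `target` | attribute == value)."""
    holders = [name for name, members in classes.items()
               for inst in members if inst[attribute] == value]
    return holders.count(target) / len(holders)
```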

  12. Necessary and Sufficient Conditions • A predictiveness score of 1.0 tells us that all instances with the particular attribute value belong to this particular class. => Attribute = v is a sufficient condition for membership in this class. • A predictability score of 1.0 tells us that all the instances in this class have Attribute = v. => Attribute = v is a necessary condition for membership in this class.

  13. Necessary and/or Sufficient Conditions • If both the predictability and predictiveness scores are 1.0, the particular attribute value is necessary and sufficient for class membership. • ESX reports attribute values whose scores meet a particular cut-off (0.80) as highly necessary and highly sufficient.
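
Combining the two scores with the 0.80 cut-off quoted above gives a simple flagging routine. It reuses the predictability and predictiveness sketches and is, again, only an illustration of the idea.

```python
def flag_conditions(classes, attribute, cutoff=0.80):
    """Return (class, value) pairs that are highly necessary and/or sufficient."""
    values = {inst[attribute] for members in classes.values() for inst in members}
    flags = {}
    for name, members in classes.items():
        for value in values:
            necessary = class_predictability(members, attribute, value) >= cutoff
            sufficient = class_predictiveness(classes, name, attribute, value) >= cutoff
            if necessary or sufficient:
                flags[(name, value)] = {"necessary": necessary,
                                        "sufficient": sufficient}
    return flags
```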

  14. Step 5: Visualize Individual Class Rules

  15. RuleMaker Settings • Recall that we used the setting to ask RuleMaker to generate all rules. This is a good way to learn about the nature of the problem at hand.

  16. A Six-Step Approach for Supervised Learning • Step 1: Choose an Output Attribute • Step 2: Perform the Mining Session • Step 3: Read and Interpret Summary Results • Step 4: Read and Interpret Test Set Results • Step 5: Read and Interpret Class Results • Step 6: Visualize and Interpret Class Rules

  17. Perform the Mining Session • Decide on the size of the training set. • The remaining items will be used by the software to test the model that is developed (and evaluation results will be reported).
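
A minimal hold-out split of this kind might look as follows; the 70/30 default and the shuffling are assumptions for the sketch, not the tool's documented behavior.

```python
import random

def split_train_test(instances, training_fraction=0.7, seed=0):
    """Shuffle, then hold out the remaining instances for evaluation."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]
```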

  18. Read and Interpret Summary Results • The worksheet +RES SUM contains summary information. • Class resemblance scores, attribute summary information (categorical and numeric), and the most commonly occurring attribute values for each class are given.

  19. Read and Interpret Test Set Results

  20. Read and Interpret Test Set Results • Worksheets +RES TST, +RES MTX • These report performance on the test set (which was not part of model training). • +RES MTX reports the confusion matrix. • +RES TST reports, for each instance in the test set, the model's classification and whether it is correct.
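
A confusion matrix of the kind reported in +RES MTX can be built in a few lines; this is a generic sketch, not the worksheet's exact layout.

```python
from collections import defaultdict

def confusion_matrix(actual, predicted):
    """Rows are the actual classes; columns are the model's classifications."""
    matrix = defaultdict(lambda: defaultdict(int))
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return {a: dict(cols) for a, cols in matrix.items()}

# One misclassification: a "Good" instance labeled "Poor" by the model.
print(confusion_matrix(["Good", "Good", "Poor"], ["Good", "Poor", "Poor"]))
# {'Good': {'Good': 1, 'Poor': 1}, 'Poor': {'Poor': 1}}
```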

  21. Read and Interpret Class Results • Just as individual clusters are of interest in unsupervised clustering, information about the individual classes is relevant in supervised learning. • Worksheet +RES CLS contains this information. • The most and least typical instances are also given here. • The worksheet +RUL TYP gives typicality scores for all of the instances in the test set.

  22. Visualize and Interpret Class Rules • All rules, or a covering set of rules? • Worksheet +RES RUL contains the rules generated by RuleMaker. • If all rules are generated, there may be overlapping coverage. • The covering-set algorithm works iteratively, identifying the best covering rule and then updating the set of instances still to be covered (see the sketch below). • RuleMaker can be rerun without mining again; this menu item can be used to change the RuleMaker settings and generate alternative rule sets.
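
The iterative covering idea can be sketched as a greedy loop. The `covers(rule, instance)` predicate is an assumed input, and ESX's own rule scoring is richer than the raw coverage count used here.

```python
def covering_rule_set(rules, instances, covers):
    """Repeatedly pick the rule covering the most uncovered instances."""
    remaining = list(instances)
    chosen = []
    while remaining:
        best = max(rules, key=lambda r: sum(covers(r, i) for i in remaining))
        if not any(covers(best, i) for i in remaining):
            break  # no rule covers what is left
        chosen.append(best)
        remaining = [i for i in remaining if not covers(best, i)]
    return chosen
```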

  23. Generating Rules: The General Idea • Choose the attribute that best differentiates the instances of the domain or subclass. • Use that attribute to subdivide the instances into subclasses. • For each subclass: • If the instances meet a predefined criterion, generate a defining rule for the subclass. • Otherwise, repeat the process on the subclass, starting from the first step (sketched below).
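
The loop on this slide is essentially a recursive subdivision. In the sketch below, the `score`, `criterion`, and `rule_for` callables are assumptions standing in for the details the slide leaves abstract.

```python
def generate_rules(instances, attributes, score, criterion, rule_for):
    """Split on the most differentiating attribute; recurse until the
    predefined criterion is met (or no attributes remain to split on)."""
    best = max(attributes, key=lambda a: score(a, instances))
    rest = [a for a in attributes if a != best]
    subclasses = {}
    for inst in instances:
        subclasses.setdefault(inst[best], []).append(inst)
    rules = []
    for value, members in subclasses.items():
        if criterion(members) or not rest:
            rules.append(rule_for(best, value, members))
        else:
            rules.extend(generate_rules(members, rest, score,
                                        criterion, rule_for))
    return rules
```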

  24. Techniques for Generating Rules • Define the scope of the rules. • Choose the instances. • Set the minimum rule correctness. • Define the minimum rule coverage. • Choose an attribute significance value.

  25. Instance Typicality • Typicality Scores • Identify prototypical and outlier instances. • Select a best set of training instances. • Used to compute individual instance classification confidence scores.
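
A common way to realize such a score, and likely close to what is meant here, is the average similarity of an instance to the other members of its class; the sketch below reuses the assumed `similarity` function from the earlier sketches.

```python
def typicality(instance, class_instances, similarity):
    """Average similarity to every other class member: high scores mark
    prototypical instances, low scores mark outliers."""
    others = [m for m in class_instances if m is not instance]
    return sum(similarity(instance, m) for m in others) / len(others)
```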

  26. Special Considerations and Features • Avoid Mining Delays • The Quick Mine Feature • Erroneous and Missing Data
