Modeling and Testing a Knowledge Base for Instructing Users to Choose the Classification Task in Relational Data Mining

Modeling and Testing a Knowledge Base for Instructing Users to Choose the Classification Task in Relational Data Mining
Lidia Martins da Silva, Centro Universitário Cândido Rondon (UNIRONDON), Cuiabá, Mato Grosso
Ana Estela Antunes da Silva, Universidade Metodista de Piracicaba (UNIMEP), Piracicaba, São Paulo

SUMMARY • This presentation will discuss: • Introduction; • Data Mining Domain Knowledge; • The Classification Task Domain; • Knowledge Base of the Classification Task; • Execution and Test of the Knowledge Base; • Conclusion.

INTRODUCTION • Expert systems are computer programs that execute rules over a knowledge base, making it possible to solve specific problems. • Data mining consists of a set of tasks that, through the use of specific algorithms, explore a large data set and derive knowledge from it in the form of assumptions and rules.

DATA MINING DOMAIN KNOWLEDGE • The Knowledge Discovery in Databases (KDD) process comprises the following phases: • data cleaning, • data integration, • data selection, • data transformation, • data mining, • pattern evaluation, and • knowledge presentation.

DATA MINING DOMAIN KNOWLEDGE • This work focuses on the data mining phase. • The data mining process is the application of a set of techniques that explore data in order to discover new patterns and relations in it.

DATA MINING DOMAIN KNOWLEDGE • The main tasks used to perform data mining are association, classification and clustering. • Association is most often applied to problems that can be modeled as transactions. • Classification is the task that separates all tuples of a table into classes. • Clustering separates data into groups without the help of a label.
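
The difference between classification (labels are available) and clustering (no labels) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the tuples are invented, nearest-neighbor labeling stands in for a classification algorithm, and gap-based grouping stands in for a clustering algorithm. Neither is claimed to be the technique used in this work.

```python
# Hypothetical tuples: (age, income in thousands). Labels exist only here.
training = [((25, 20), "rejects"), ((30, 25), "rejects"),
            ((45, 80), "accepts"), ((50, 90), "accepts")]


def classify(new_tuple, labeled):
    """Classification: assign the class of the nearest labeled tuple (1-NN)."""
    def distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(labeled, key=lambda item: distance(item[0], new_tuple))[1]


def cluster(values, gap):
    """Clustering: group sorted 1-D values, opening a new group at large gaps.

    No label is used; groups emerge from the data alone.
    """
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= gap:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


print(classify((48, 85), training))        # accepts
print(cluster([20, 25, 80, 90], gap=15))   # [[20, 25], [80, 90]]
```

The classifier needs the class attribute in its training tuples, whereas the grouping function never sees one.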

THE CLASSIFICATION TASK DOMAIN • To construct a knowledge base that guides the choice of the classification task in data mining, the problem domain was studied to identify the main characteristics that lead to choosing or rejecting the classification task, depending on the type of problem presented by the user.

KNOWLEDGE BASE OF THE CLASSIFICATION TASK • A knowledge base is a set of representations of actions and events in the world. Each representation is called a sentence. • It is considered the main part of a knowledge-based system and can hold knowledge expressed with one or more knowledge representation techniques. In this work, production rules are used to represent the knowledge base.

EXECUTION AND TEST OF THE KNOWLEDGE BASE • The knowledge base was created specifically for the classification task and uses the following levels of adequacy for applying it: low, medium, high, very_high and not_possible. • To obtain values for the attributes in the knowledge base, specific questions are asked of users.

EXECUTION AND TEST OF THE KNOWLEDGE BASE
Figure 1 presents the production rules that represent the knowledge base of the classification task. Each predicate represents a part of the domain problem.
R01. If various_attributes=yes Then classification=medium
R02. If various_attributes=no Then classification=low
R03. If all_numerical=yes and classification=medium Then classification=low
R04. If all_numerical=no and various_attributes=yes Then classification=medium
R05. If large_data_volume=yes and classification=medium Then classification=high
R06. If large_data_volume=no Then classification=low
R07. If categorical_attribute=yes and classification=medium Then classification=high
R08. If categorical_attribute=no and (classification=medium or classification=high) Then task_not_identified=yes
R09. If transaction=yes Then classification=low
R10. If transaction=no and classification=medium Then classification=high
R11. If single_target=yes and classification=medium Then classification=high
R12. If single_target=no and classification=low Then classification=low
R13. If identify=yes and classification=medium and single_target=yes Then classification=high
R14. If identify=no and single_target=no Then task_not_identified=yes
R15. If training_data=yes and classification=medium Then classification=high
R16. If training_data=no Then classification=low
R17. If training_data=yes and single_target=yes and identify=yes and classification=high Then classification=very_high
R18. If different_sets=yes and classification=medium Then classification=high
R19. If different_sets=no Then classification=low
R20. If different_sets=no and training_data=no and classification=high Then task_not_identified=yes
R21. If applied_results=yes and classification=medium Then classification=high
R22. If applied_results=no Then classification=low
R23. If decision_results=yes and applied_attribute=yes and classification=high Then classification=very_high
R24. If decision_results=no and classification=low Then classification=low
R25. If task_not_identified=yes Then classification=not_possible
Figure 1. Knowledge base representing the degree of adequacy of the classification task.
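
Production rules of this kind can be stored as data and evaluated by a small forward-chaining loop. The following is a minimal Python sketch, not the authors' implementation. It encodes only rules R01, R02, R05, R06 and R17, and it assumes that each rule is tried once in numerical order and that a later conclusion overwrites an earlier one. The names `RULES` and `forward_chain` and the set of answers are hypothetical.

```python
# Each rule: (name, conditions that must all hold, conclusion to assert).
# Only a subset of Figure 1 is encoded, for illustration.
RULES = [
    ("R01", {"various_attributes": "yes"}, ("classification", "medium")),
    ("R02", {"various_attributes": "no"}, ("classification", "low")),
    ("R05", {"large_data_volume": "yes", "classification": "medium"},
     ("classification", "high")),
    ("R06", {"large_data_volume": "no"}, ("classification", "low")),
    ("R17", {"training_data": "yes", "single_target": "yes",
             "identify": "yes", "classification": "high"},
     ("classification", "very_high")),
]


def forward_chain(facts, rules=RULES):
    """Try each rule once, in order; return final facts and fired rule names.

    Assumption: single pass, later conclusions overwrite earlier ones.
    """
    memory = dict(facts)  # working memory; the caller's dict is not modified
    fired = []
    for name, conditions, (attribute, value) in rules:
        if all(memory.get(key) == expected
               for key, expected in conditions.items()):
            memory[attribute] = value
            fired.append(name)
    return memory, fired


# Hypothetical answers chosen for illustration.
answers = {
    "various_attributes": "yes",
    "large_data_volume": "yes",
    "training_data": "yes",
    "single_target": "yes",
    "identify": "yes",
}
memory, fired = forward_chain(answers)
print(fired, memory["classification"])  # ['R01', 'R05', 'R17'] very_high
```

Because conclusions such as `classification=medium` feed the conditions of later rules, the order in which rules are tried affects the result; a different control strategy could reach a different level of adequacy from the same answers.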

EXECUTION AND TEST OF THE KNOWLEDGE BASE
• The answers to these questions provide the values of eleven of the thirteen predicates in the knowledge base.
• Do you wish to identify several attributes for the mining process?
• Are the chosen attributes all numeric?
• Is there a large volume of data in your database?
• Is the classification attribute a category?
• Are the chosen attributes part of a transactional database?
• For your mining process, do you need to choose only one attribute to characterize the whole process?
• Can you identify such an attribute among the chosen attributes, or create it?
• Is there a data set to be used for a training process?
• Is there a different data set to be used for a testing process?
• Do you wish to use the mining results in order to apply them to other available data?
• Do you wish to use the mining results in order to make an immediate decision about your organization?
Figure 2. Questions asked to users.
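
One way to connect the questions in Figure 2 to the predicates is a lookup table that pairs each predicate with its question, in the order the questions are asked. This is a sketch under assumptions: the pairing follows the order of the attribute values listed in the execution example, the question texts are shortened paraphrases, and `QUESTIONS` and `answers_to_facts` are hypothetical names.

```python
# Predicate names in the order the eleven questions are asked.
# Question texts are shortened paraphrases of Figure 2.
QUESTIONS = [
    ("various_attributes", "Identify several attributes for mining?"),
    ("all_numerical", "Are the chosen attributes all numeric?"),
    ("large_data_volume", "Is there a large volume of data?"),
    ("categorical_attribute", "Is the classification attribute a category?"),
    ("transaction", "Are the attributes part of a transactional database?"),
    ("single_target", "Is only one attribute needed for the whole process?"),
    ("identify", "Can that attribute be identified or created?"),
    ("training_data", "Is there a data set for training?"),
    ("different_sets", "Is there a different data set for testing?"),
    ("applied_results", "Will results be applied to other data?"),
    ("decision_results", "Will results support an immediate decision?"),
]


def answers_to_facts(answers):
    """Turn a sequence of 'yes'/'no' answers into a predicate -> value dict."""
    if len(answers) != len(QUESTIONS):
        raise ValueError(f"expected {len(QUESTIONS)} answers, got {len(answers)}")
    facts = {}
    for (predicate, _question), answer in zip(QUESTIONS, answers):
        normalized = answer.strip().lower()
        if normalized not in ("yes", "no"):
            raise ValueError(f"answer for {predicate} must be yes or no")
        facts[predicate] = normalized
    return facts


facts = answers_to_facts(["yes", "no", "no", "yes", "no", "yes",
                          "yes", "yes", "yes", "yes", "yes"])
print(facts["categorical_attribute"])  # yes
```

The two remaining predicates, `classification` and `task_not_identified`, are not asked about; they are produced by the rules.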

EXECUTION AND TEST OF THE KNOWLEDGE BASE
• After the questions are answered, the knowledge base can be executed. One example of execution is presented in Figure 3.
Attribute values: 1 - various_attributes: yes; 2 - all_numerical: no; 3 - large_data_volume: no; 4 - categorical_attribute: yes; 5 - transaction: no; 6 - single_target: yes; 7 - identify: yes; 8 - training_data: yes; 9 - different_sets: yes; 10 - applied_results: yes; 11 - decision_results: yes.
Successful rules: R01, R04, R05, R07 and R17
Figure 3. Example of forward chaining reasoning.
• In the example in Figure 3, the execution of the knowledge base yields a very_high level of adequacy for applying the classification task to the problem domain that the user described through the answers to the questions in Figure 2.

CONCLUSION • This work presents the modeling and testing of a knowledge base that instructs users in choosing the classification task, using questions that help them find out whether the classification task is suitable for their problem domain.

CONCLUSION • Among the contributions of this work, the following stand out: • creation of questions to instruct users about the choice of the classification task when applying mining techniques to their problems; • acquisition of knowledge about the classification task; • knowledge modeling to determine the adequacy of the classification task in data mining for general problem domains; • initial tests of the knowledge base.

CONCLUSION • As future work, the authors propose: • performing more tests to validate the knowledge base; • creating membership functions for the predicates low, medium, high and very_high, turning them into fuzzy sets; • expanding the knowledge domain by including the association task; • integrating the knowledge base into the KIRA tool, an instructional tool for the data mining process that covers the phases: problem description; data cleaning; data selection; application of mining tasks; and data analysis.
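
The proposal to turn the adequacy levels into fuzzy sets could look like the Python sketch below. It is purely illustrative: the 0 to 1 adequacy score, the triangular shape and the breakpoints are assumptions rather than values from this work, and `triangular`, `LEVELS` and `memberships` are hypothetical names.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside [left, right], 1 at the peak."""
    if x < left or x > right:
        return 0.0
    if x == peak:
        return 1.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical breakpoints (left, peak, right) on an assumed 0..1 score.
LEVELS = {
    "low": (0.0, 0.0, 0.4),
    "medium": (0.1, 0.4, 0.7),
    "high": (0.4, 0.7, 1.0),
    "very_high": (0.7, 1.0, 1.0),
}


def memberships(score):
    """Degree to which a score belongs to each adequacy level."""
    return {name: triangular(score, *points)
            for name, points in LEVELS.items()}


print(memberships(0.55))
```

With these assumed breakpoints, a score of 0.55 belongs partly to medium and partly to high, instead of being forced into a single crisp level.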

