This paper presents advancements in Multi-Relational Data Mining (MRDM) with a focus on optimizing the Multi-Relational Decision Tree Learning (MRDTL) algorithm. As relational database systems become more prevalent, the challenge of efficiently learning from multi-relational databases grows. We detail a framework that accelerates MRDM by minimizing redundant calculations and refining selection graphs within relational databases. Our experimental results demonstrate that the optimized MRDTL algorithm offers competitive performance in terms of accuracy and speed. Future directions include addressing missing values and enhancing evaluation techniques.
Speeding Up Multi-Relational Data Mining
Anna Atramentov and Vasant Honavar*
Artificial Intelligence Laboratory, Department of Computer Science, Iowa State University, Ames, IA 50011, USA
www.cs.iastate.edu/~honavar/aigroup.html
* Support provided in part by National Science Foundation, Carver Foundation, and Pioneer Hi-Bred, Inc.
Motivation
Importance of relational learning:
• Growth of data stored in multi-relational databases (MRDB)
• Techniques for learning from unstructured data often extract the data into an MRDB
One of the promising approaches to relational learning:
• MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
• MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva et al. (2002)
Goal:
• Speed up the MRDM framework, and in particular the MRDTL algorithm
Problem Formulation
Given: Data stored in a relational database
Goal: Learn a predictive model for the instances in the target table
[Figure: example of a multi-relational database, showing its schema and instances]
MRDM overview. Selection graphs
• Nodes correspond to the tables from the database
• Edges correspond to the associations between tables
• A selection graph corresponds to the subset of the instances from the target table having some property
• It is a way of specifying attributes in the relational setting
[Figure: selection graph with nodes Staff, Grad.Student, and Department, and conditions GPA > 3.9 and Specialization=math]
MRDM overview. Transforming selection graphs into SQL queries
Generic query:
    select distinct T0.primary_key
    from table_list
    where join_list
    and condition_list
Selection graph over Staff and Grad.Student:
    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor
Selection graph over Staff and Grad.Student:
    select distinct T0.id
    from Staff T0
    where T0.id not in (
        select T1.id
        from Graduate_Student T1)
Selection graph over Staff, Grad.Student, and Grad.Student with GPA > 3.9:
    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor
    and T0.id not in (
        select T1.id
        from Graduate_Student T1
        where T1.GPA > 3.9)
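The generic query above can be sketched as a small string-building helper. This is a minimal illustration, not the authors' implementation: the function name `selection_graph_to_sql`, the `demo` helper, and the toy rows are assumptions made for this example, and only the simple case (no `not in` subqueries) is handled. The table and column names (Staff, Graduate_Student, id, Advisor, GPA) follow the queries on the slide.

```python
import sqlite3


def selection_graph_to_sql(primary_key, table_list, join_list, condition_list=()):
    """Assemble: select distinct T0.primary_key from table_list
    where join_list and condition_list (hypothetical helper)."""
    sql = f"select distinct T0.{primary_key} from {', '.join(table_list)}"
    predicates = list(join_list) + list(condition_list)
    if predicates:
        sql += " where " + " and ".join(predicates)
    return sql


def demo(condition_list=()):
    """Run the generated query on a toy in-memory database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table Staff (id integer primary key, Salary integer);
        create table Graduate_Student (id integer primary key, Advisor integer, GPA real);
        insert into Staff values (1, 50000), (2, 60000), (3, 60000);
        insert into Graduate_Student values (10, 1, 3.95), (11, 1, 3.2), (12, 2, 3.5);
    """)
    sql = selection_graph_to_sql(
        "id",
        ["Staff T0", "Graduate_Student T1"],
        ["T0.id = T1.Advisor"],
        condition_list,
    )
    rows = sorted(row[0] for row in conn.execute(sql))
    conn.close()
    return sql, rows
```

Calling `demo()` selects the staff members who advise at least one graduate student; passing `["T1.GPA > 3.9"]` as `condition_list` narrows the selection to advisors of a student with GPA above 3.9.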
MRDM overview. Refinements of selection graphs
[Figure: a selection graph over Staff, Grad.Student, and Department with conditions GPA > 3.9 and Specialization=math, shown together with a refinement that adds the condition GPA > 2.0 and with its complement refinement]
The most time-consuming operations of MRDTL
Query associated with the selection graph:
    select distinct Staff.Salary, count(distinct Staff.ID)
    from Staff, Grad.Student, Department
    where join_list
    and condition_list
    group by Staff.Salary
[Figure: selection graph with nodes Staff, Grad.Student, and Department, and conditions GPA > 3.9 and Specialization=math]
A way to speed up - eliminate redundant calculations
Problem: For a selection graph with 160 nodes, the time to execute a query is more than 3 minutes!
Redundancy in calculation: Tables Staff and Grad.Student will be joined for all the children refinements
A way to fix: make the join only once and save the necessary information for all further calculations
[Figure: selection graph with nodes Staff, Grad.Student, and Department, and conditions GPA > 3.9 and Specialization=math]
Speed Up Method. Sufficient tables
Query associated with the selection graph:
    select S.Salary, count(distinct S.Staff_ID)
    from S
    group by S.Salary
[Figure: selection graph with nodes Staff, Grad.Student, and Department, and conditions GPA > 3.9 and Specialization=math]
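The idea of joining once and reusing the result can be sketched as follows. This is only an illustration under stated assumptions: the slides do not spell out which columns the sufficient table S holds, so the layout below (Staff_ID, Salary, Student_ID, GPA) is a guess, the helper names `build_db`, `build_sufficient_table`, and `class_counts` are hypothetical, and the rows are toy data.

```python
import sqlite3


def build_db():
    """Toy in-memory database; all rows are made up for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table Staff (id integer primary key, Salary integer);
        create table Graduate_Student (id integer primary key, Advisor integer, GPA real);
        insert into Staff values (1, 50000), (2, 60000), (3, 60000);
        insert into Graduate_Student values
            (10, 1, 3.95), (11, 1, 3.2), (12, 2, 3.5), (13, 3, 3.8);
    """)
    return conn


def build_sufficient_table(conn):
    """Join Staff and Graduate_Student once and materialize the result as S.
    The column layout of S is an assumption, not taken from the slides."""
    conn.execute("""
        create table S as
        select T0.id as Staff_ID, T0.Salary as Salary,
               T1.id as Student_ID, T1.GPA as GPA
        from Staff T0, Graduate_Student T1
        where T0.id = T1.Advisor
    """)


def class_counts(conn, condition="1 = 1"):
    """Count distinct staff per Salary value using only S, with no re-join.
    `condition` stands in for the condition added by a candidate refinement."""
    sql = (
        "select S.Salary, count(distinct S.Staff_ID) "
        f"from S where {condition} group by S.Salary"
    )
    return dict(conn.execute(sql).fetchall())
```

After a single call to `build_sufficient_table`, each candidate refinement such as `S.GPA > 3.6` is evaluated by `class_counts` against S alone, so the Staff and Graduate_Student join is not repeated per refinement.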
Summary
• A general approach for speeding up the MRDM framework
• MRDTL is a competitive algorithm for learning from relational databases (RDB) in terms of both accuracy and time
Future work
• Techniques for handling missing values
• Pruning techniques or complexity regularization
• Use of aggregates for the attribute values
• More extensive evaluation of MRDTL on real-world data sets