ADBIS 2007
Discretization Numbers for Multiple-Instances Problem in Relational Database
Rayner Alfred, Dimitar Kazakov
Artificial Intelligence Group, Computer Science Department, York University
(30th September, 2007)
Overview • Introduction • Objectives • Experimental Design • Data Pre-processing: Discretization • Data Summarization (DARA) • Experimental Evaluation • Experimental Results • Conclusions ADBIS 2007, Varna, Bulgaria
Introduction • Handling numerical data stored in a relational database is unique • due to the multiple occurrences of an individual record in the non-target table and • non-determinate relations between tables. • Most traditional data mining methods deal with a single table, and the discretization process is therefore based on a single table. • In a relational database, multiple records from one table with numerical attributes are associated with a single structured individual stored in the target table. • Numbers in multi-relational data mining (MRDM) are often discretized after considering the schema of the relational database.
Introduction • This paper considers different alternatives for dealing with continuous attributes in MRDM • The discretization procedures considered in this paper include algorithms • that do not depend on the multi-relational structure and also • that are sensitive to this structure. • A few discretization methods are implemented, including the proposed entropy-instance-based discretization, which is embedded in the DARA algorithm.
Objectives • To study the effects of taking the one-to-many association issue into consideration in the process of discretizing continuous numbers • To propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm • In the DARA algorithm, we employ several methods of discretization in conjunction with the C4.5 classifier as an induction algorithm • We demonstrate, based on the empirical results obtained, that discretization can be improved by taking the multiple-instance problem into consideration.
Experimental Design • Data Pre-processing • Discretization of continuous attributes in a multi-relational setting using the Entropy-Instance-Based algorithm • Data Aggregation • Data summarization using DARA, based on cluster dispersion and impurity • Evaluation of the discretization methods using the C4.5 classifier [Pipeline diagram: Relational Data → Discretization of Continuous Attributes using the Entropy-Instance-Based Algorithm → Data Summarization using DARA based on Cluster Dispersion and Impurity → Summarized Data (Categorical Data); learning can then be done using any traditional attribute-value (AV) data mining method]
Data Pre-processing: Discretization • To study the effects of the one-to-many association issue in the process of discretizing continuous numbers • We propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm • In the DARA algorithm, we employ several methods of discretization in conjunction with the C4.5 classifier as an induction algorithm • Equal Height – each bin has the same number of samples • Equal Weight – considers the distribution of numeric values present and the groups they appear in • Entropy-Based – uses the class information entropy • Entropy-Instance-Based – uses the class information entropy and the individual information entropy • We demonstrate that discretization can be improved by considering the one-to-many problem.
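The simplest of the methods above, equal-height discretization, can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; the function names `equal_height_bins` and `discretize` are my own.

```python
def equal_height_bins(values, b):
    """Equal-height (equal-frequency) discretization: choose b-1 cut
    points so that each of the b bins receives roughly the same
    number of samples."""
    xs = sorted(values)
    n = len(xs)
    # take cut points at the quantile boundaries between bins
    return [xs[(i * n) // b] for i in range(1, b)]

def discretize(value, cuts):
    """Map a continuous value to its bin index given sorted cut points."""
    for i, c in enumerate(cuts):
        if value < c:
            return i
    return len(cuts)

# toy attribute values (illustrative, not from the Mutagenesis data)
vals = [1.5, 2.0, 2.5, 7.0, 8.0, 9.0, 15.0, 16.0, 20.5]
cuts = equal_height_bins(vals, 3)  # two cut points -> three bins
```

Equal-weight and entropy-based discretization differ only in how the cut points are chosen, not in how values are mapped to bins.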
Entropy-Instance-Based (EIB) Discretization • Background • Based on the entropy-based multi-interval discretization method (Fayyad and Irani 1993) • Given a set of instances S, two samples of S, S1 and S2, a feature A, and a partition boundary T, the class information entropy is E(A,T;S) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2) • So, for k bins, the class information entropy for multi-interval entropy-based discretization is I(A,T,S,k) = Σi=1..k (|Si|/|S|)·Ent(Si)
Entropy-Instance-Based (EIB) Discretization • In EIB, besides the class information entropy, another measure that uses the individual information entropy is added to select multi-interval boundaries for discretization • Given n individuals, the individual information entropy of a subset S is IndEnt(S) = −Σi=1..n p(Ii,S)·log2 p(Ii,S), where p(Ii,S) is the probability of the i-th individual in the subset S • The total individual information entropy over all partitions is Ind(A,T,S,k) = Σi=1..k (|Si|/|S|)·IndEnt(Si)
Entropy-Instance-Based (EIB) Discretization • As a result, by minimizing the function Ind_I(A,T,S,k), which consists of two sub-functions, I(A,T,S,k) and Ind(A,T,S,k), we discretize the attribute's values based on both the class and the individual information entropy: Ind_I(A,T,S,k) = I(A,T,S,k) + Ind(A,T,S,k) = Σi=1..k (|Si|/|S|)·Ent(Si) + Σi=1..k (|Si|/|S|)·IndEnt(Si)
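The combined objective can be sketched as follows, assuming each sample is represented as a (class label, individual ID) pair and a candidate discretization is given as a list of k partitions. This is a minimal sketch of the criterion, not the authors' code; the function names are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def weighted_entropy(partitions, key):
    """Size-weighted entropy over the k partitions:
    sum_i (|S_i|/|S|) * Ent(S_i)."""
    total = sum(len(p) for p in partitions)
    return sum((len(p) / total) * entropy([key(s) for s in p])
               for p in partitions if p)

def ind_i(partitions):
    """Combined objective Ind_I = class information entropy
    + individual information entropy."""
    return (weighted_entropy(partitions, key=lambda s: s[0])    # class label
            + weighted_entropy(partitions, key=lambda s: s[1]))  # individual ID
```

A partition that mixes classes or splits an individual's records across bins is penalized, which is exactly what minimizing Ind_I avoids.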
Entropy-Instance-Based (EIB) Discretization • One of the main problems with this discretization criterion is that it is relatively expensive • We use a GA-based discretization to obtain a multi-interval discretization for continuous attributes, consisting of • an initialization step and • the iterative generations of • the reproduction phase, • the crossover phase and • the mutation phase.
Entropy-Instance-Based (EIB) Discretization • An initialization step • a set of strings (chromosomes), where each string consists of b-1 continuous values representing the b partitions, is randomly generated within the attribute's min and max values • For instance, given minimum and maximum values of 1.5 and 20.5 for a continuous field, we have (2.5,5.5,9.3,12.6,15.5,20.5) • The fitness function for genetic entropy-instance-based discretization is defined as f = 1/Ind_I(A,T,S,k)
Entropy-Instance-Based (EIB) Discretization • The iterative generations of • the reproduction phase • roulette-wheel selection is used • the crossover phase • a crossover probability pc of 0.50 is used • the mutation phase • a fixed mutation probability pm of 0.10 is used.
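The GA loop described on the last three slides can be sketched as below. This is an illustrative sketch under the stated settings (roulette-wheel selection, pc = 0.50, pm = 0.10), not the authors' implementation; the population size, generation count, and function name `ga_discretize` are my assumptions.

```python
import random

def ga_discretize(b, lo, hi, fitness, pop_size=20, gens=50, pc=0.50, pm=0.10):
    """GA search for b-1 cut points; fitness(cuts) should return
    1 / Ind_I(A,T,S,k), so maximizing fitness minimizes Ind_I."""
    # initialization: chromosomes of b-1 random sorted cuts in [lo, hi]
    pop = [sorted(random.uniform(lo, hi) for _ in range(b - 1))
           for _ in range(pop_size)]
    for _ in range(gens):
        scores = [fitness(c) for c in pop]
        total = sum(scores)

        def pick():  # reproduction: roulette-wheel selection
            r, acc = random.uniform(0, total), 0.0
            for c, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return c
            return pop[-1]

        nxt = []
        while len(nxt) < pop_size:
            child = list(pick())
            if random.random() < pc:          # one-point crossover
                other = pick()
                pt = random.randrange(1, b - 1) if b > 2 else 0
                child = child[:pt] + list(other[pt:])
            if random.random() < pm:          # mutation: perturb one cut point
                child[random.randrange(b - 1)] = random.uniform(lo, hi)
            nxt.append(sorted(child))
        pop = nxt
    return max(pop, key=fitness)
```

Re-sorting each child keeps every chromosome a valid increasing sequence of partition boundaries.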
Data Summarization (DARA) • Data summarization based on Information Retrieval (IR) theory • Dynamic Aggregation of Relational Attributes (DARA) – categorizes objects with similar patterns based on TF-IDF weights, borrowed from IR theory • Scalable and produces interpretable rules [Diagram: a target table T with one-to-many links to several non-target tables NT; data summarization collapses the non-target records into the target table]
Data Summarization (DARA) • Data summarization based on Information Retrieval (IR) theory • TF-IDF (term frequency-inverse document frequency) – a weight often used in information retrieval and text mining • A statistical measure used to evaluate how important a word is to a document in a corpus • The importance of a term increases proportionally to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus.
Data Summarization (DARA) • In a multi-relational setting, • an object (a single record) is considered as a document • all corresponding values of attributes stored in multiple tables are considered as terms that describe the characteristics of the object (the record) • DARA transforms the data representation from a relational model into a vector space model and employs the TF-IDF weighting scheme to cluster and summarize the records.
Data Summarization (DARA) • tfi·idfi (term frequency-inverse document frequency) • The term frequency is tfi = ni / Σk nk, where ni is the number of occurrences of the considered term and the denominator is the number of occurrences of all terms. • The inverse document frequency is a measure of the general importance of the term: idfi = log(|D| / |{d : ti ∈ d}|), with |D| the total number of documents in the corpus and |{d : ti ∈ d}| the number of documents in which the term ti appears.
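The tf-idf weighting above can be sketched directly from its definition. This is a generic illustrative sketch, not DARA's code; in DARA each "document" would be one record and each "term" one attribute value.

```python
import math

def tf_idf(doc_term_counts):
    """Compute tf-idf weights for a small corpus.
    doc_term_counts: list of {term: count} dicts, one per document."""
    n_docs = len(doc_term_counts)
    # document frequency: number of documents containing each term
    df = {}
    for counts in doc_term_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    weights = []
    for counts in doc_term_counts:
        total = sum(counts.values())  # occurrences of all terms in this doc
        weights.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights
```

A term that appears in every document gets idf = log(1) = 0, so it carries no weight, matching the "offset by the frequency in the corpus" intuition.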
Data Summarization (DARA) Data Summarization Stages • Information Propagation Stage • Propagates the record ID and classes from the target concepts to the non-target tables • Data Aggregation Stage • Summarizes each record into a single tuple • Uses a clustering technique based on the TF-IDF weight, in which each record is represented as (tf1 log(n/df1), tf2 log(n/df2), . . . , tfm log(n/dfm)) • The cosine similarity method is used to compute the similarity between two records Ri and Rj: cos(Ri,Rj) = Ri·Rj/(||Ri||·||Rj||)
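The cosine similarity used in the clustering step is a one-liner over the weight vectors; a minimal sketch (function name mine):

```python
import math

def cosine_similarity(ri, rj):
    """cos(Ri, Rj) = Ri . Rj / (||Ri|| * ||Rj||), where ri and rj
    are the TF-IDF weight vectors of two records."""
    dot = sum(a * b for a, b in zip(ri, rj))
    norm_i = math.sqrt(sum(a * a for a in ri))
    norm_j = math.sqrt(sum(b * b for b in rj))
    return dot / (norm_i * norm_j)
```

Two records with proportional weight vectors score 1.0; records sharing no weighted terms score 0.0.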
Experimental Evaluation • Implement the discretization methods in the DARA algorithm, in conjunction with the C4.5 classifier, as an induction algorithm that is run on DARA's discretized and transformed data representation • We chose three varieties of a well-known dataset, the Mutagenesis relational database • The data describes 188 molecules falling into two classes, mutagenic (active) and non-mutagenic (inactive); 125 of these molecules are mutagenic.
Experimental Evaluation • Three different sets of background knowledge (referred to as experiments B1, B2 and B3): • B1: The atoms in the molecule are given, as well as the bonds between them, the type of each bond, and the element and type of each atom. • B2: Besides B1, the charges of the atoms are added • B3: Besides B2, the log of the compound's octanol/water partition coefficient (logP) and the energy of the compound's lowest unoccupied molecular orbital (ЄLUMO) are added • Perform leave-one-out cross validation using C4.5 for different numbers of bins, b, tested on B1, B2 and B3.
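The leave-one-out protocol itself is simple to state in code: train on all but one record, predict the held-out record, and repeat. A minimal sketch with a toy stand-in learner (the paper uses C4.5; the names `leave_one_out_accuracy` and `nn1` are mine):

```python
def leave_one_out_accuracy(data, train_and_predict):
    """Leave-one-out cross validation over a list of
    (features, label) pairs. train_and_predict(train, x) is any
    induction algorithm that fits on `train` and classifies `x`."""
    correct = 0
    for i in range(len(data)):
        train = data[:i] + data[i + 1:]   # hold out record i
        x, y = data[i]
        if train_and_predict(train, x) == y:
            correct += 1
    return correct / len(data)

# toy stand-in learner: 1-nearest-neighbour on one numeric feature
def nn1(train, x):
    return min(train, key=lambda p: abs(p[0] - x))[1]
```

With 188 molecules this means 188 train/test runs per configuration, which is why the number of bins b is varied only over a small set of values.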
Experimental Results • Performance (%) of leave-one-out cross validation of C4.5 on the Mutagenesis dataset • The predictive accuracy of EqualHeight and EqualWeight is lower on datasets B1 and B2 when the number of bins is smaller • The accuracy of entropy-based and entropy-instance-based discretization is lower when the number of bins is smaller on dataset B3 • The results of entropy-based and entropy-instance-based discretization on B1, B2 and B3 are virtually identical (in five out of nine tests EIB performs better than EB)
Conclusions • Presented a method called Dynamic Aggregation of Relational Attributes (DARA) with entropy-instance-based discretization to propositionalise a multi-relational database • The DARA method has shown good performance on three well-known datasets in terms of predictive accuracy • The entropy-instance-based and entropy-based discretization methods are recommended for discretizing attribute values in multi-relational datasets • Disadvantage – computation is expensive when the number of bins is large.
Thank You
Discretization Numbers for Multiple-Instances Problem in Relational Database