This study presents an approach to one-to-many data linkage based on a One-Class Clustering Tree (OCCT). We define the key concepts and the framework of the method, focusing on how splitting attributes are chosen with the Coarse-Grained and Fine-Grained Jaccard measures. Using request location, day of the week, and part of the day as candidate attributes, we show how each candidate split is scored at the root of the tree. The goal is to improve linkage accuracy in data integration tasks across engineering applications.
Ben-Gurion University of the Negev, Faculty of Engineering Sciences, Department of Information Systems Engineering
OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage
Ma'ayan Gafny, Asaf Shabtai, Lior Rokach, Yuval Elovici
Definitions
TA: A = {a1, a2, a3, …, an}, |A| = n, |TA| = number of records in TA, r(a) = a record from TA
TB: B = {b1, b2, b3, …, bm}, |B| = m, |TB| = number of records in TB, r(b) = a record from TB
Definitions
TA x TB: the set of record pairs r = (r(a), r(b)), with r(a) taken from TA and r(b) taken from TB
Definitions
TA x TB: TAB
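The pairing defined on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy records, the attribute names `custCity` and `custType`, and the matching rule used to build `t_ab` are all hypothetical, and the sketch assumes that TAB denotes the subset of TA x TB whose pairs are considered linked.

```python
from itertools import product

# Hypothetical toy tables: each record is a dict of attribute -> value.
table_a = [
    {"reqLocation": "Berlin", "dayOfWeek": "Monday"},
    {"reqLocation": "Bonn", "dayOfWeek": "Friday"},
]
table_b = [
    {"custCity": "Berlin", "custType": "private"},
    {"custCity": "Hamburg", "custType": "business"},
    {"custCity": "Bonn", "custType": "private"},
]


def cartesian_product(ta, tb):
    """Return every pair r = (r(a), r(b)) as one merged record."""
    return [{**ra, **rb} for ra, rb in product(ta, tb)]


# TA x TB: all |TA| * |TB| candidate pairs.
pairs = cartesian_product(table_a, table_b)

# Assumption: T_AB is the subset of linked pairs. The rule below
# (request location equals customer city) is purely illustrative.
t_ab = [r for r in pairs if r["reqLocation"] == r["custCity"]]
```

With these toy tables, `pairs` holds six merged records and `t_ab` keeps the two that satisfy the illustrative rule.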
Definitions
Ad ⊆ A: the subset of attributes of TA that were already selected as splitting attributes on the path from the root of the tree to node d.
Examples: Ad4 = {a1, a2}, Ad2 = {a1}
Coarse Grained Jaccard – Splitting the root of the tree
Three candidates for split:
• Request location
• Request day of week
• Request part of day
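The scores on the following slides build on the Jaccard coefficient between two sets, |X ∩ Y| / |X ∪ Y|. The sketch below shows only that basic coefficient. It assumes, as the name "coarse grained" suggests, that two records count as the same element only when they are identical; the record tuples are hypothetical.

```python
def jaccard(set_x, set_y):
    """Jaccard coefficient |X & Y| / |X | Y|; 0.0 when both sets are empty."""
    union = set_x | set_y
    if not union:
        return 0.0
    return len(set_x & set_y) / len(union)


# Hypothetical sets of T_B records (as tuples) reached by the two sides
# of a candidate split. Only identical tuples count as shared.
left = {("Berlin", "private"), ("Bonn", "private")}
right = {("Bonn", "private"), ("Hamburg", "business"), ("Berlin", "business")}

similarity = jaccard(left, right)  # 1 shared record out of 4 distinct ones
```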
CGJ – Splitting the root of the tree
Candidate split: reqLocation. Each branch is compared against its complement:
• reqLocation = Bonn vs. reqLocation != Bonn
• reqLocation = Berlin vs. reqLocation != Berlin
• reqLocation = Hamburg vs. reqLocation != Hamburg
W1 = 16/31, Score1 = 1/23
W2 = 9/31, Score2 = 2/23
W3 = 6/31, Score3 = 1/23
Score(Split reqLocation) = W1 * Score1 + W2 * Score2 + W3 * Score3 = 0.0561
CGJ – Splitting the root of the tree
Candidate split: dayOfWeek. Each branch is compared against its complement (branches as listed on the slide):
• dayOfWeek = Wednesday vs. dayOfWeek != Wednesday
• dayOfWeek = Friday vs. dayOfWeek != Friday
• dayOfWeek = Thursday vs. dayOfWeek != Thursday
• dayOfWeek = Friday vs. dayOfWeek != Friday
• dayOfWeek = Monday vs. dayOfWeek != Monday
W1 = 7/31, Score1 = 3/15
W2 = 5/31, Score2 = 5/15
W3 = 3/31, Score3 = 3/15
W4 = 9/31, Score4 = 5/15
W5 = 7/31, Score5 = 3/15
Score(Split dayOfWeek) = W1 * Score1 + W2 * Score2 + W3 * Score3 + W4 * Score4 + W5 * Score5 = 0.260
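The split scores on the reqLocation and dayOfWeek slides can be reproduced as a weighted sum of the per-branch scores. The sketch below uses only the weights and scores shown on the slides; the function name `split_score` is a hypothetical helper, and exact fractions are used so the rounding is visible.

```python
from fractions import Fraction as F


def split_score(weights, scores):
    """Weighted sum of per-branch scores: sum of W_i * Score_i."""
    return sum(w * s for w, s in zip(weights, scores))


# Weights and branch scores as shown on the reqLocation slide.
req_location = split_score(
    [F(16, 31), F(9, 31), F(6, 31)],
    [F(1, 23), F(2, 23), F(1, 23)],
)

# Weights and branch scores as shown on the dayOfWeek slide.
day_of_week = split_score(
    [F(7, 31), F(5, 31), F(3, 31), F(9, 31), F(7, 31)],
    [F(3, 15), F(5, 15), F(3, 15), F(5, 15), F(3, 15)],
)

print(float(req_location))  # 40/713, about 0.0561
print(float(day_of_week))   # 121/465, about 0.260
```

In both cases the weights sum to 31/31, which is consistent with each weight being the share of records that falls into the corresponding branch.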
CGJ – Splitting the root of the tree
Candidate split: partOfDay (partOfDay = Morning vs. partOfDay = Afternoon)
Score1 = 4/23
Score(Split partOfDay) = 0.173
Coarse Grained Jaccard – Splitting the root of the tree
Three candidates for split:
• Request location: 0.0561
• Request day of week: 0.260
• Request part of day: 0.173
These scores determine the split in the root.
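Choosing among the three candidates can be sketched as a simple comparison of their scores. The slide does not state the selection rule, so the sketch assumes that the candidate with the lowest Jaccard-based score is preferred, meaning the split whose subsets are least similar. That assumption is consistent with the later slides, which split the root on request location. The helper `choose_split` is hypothetical.

```python
# Scores taken from the slide above.
candidate_scores = {
    "reqLocation": 0.0561,
    "dayOfWeek": 0.260,
    "partOfDay": 0.173,
}


def choose_split(scores):
    """Assumed rule: pick the attribute with the lowest similarity score."""
    return min(scores, key=scores.get)


chosen = choose_split(candidate_scores)
```

If the intended rule were the opposite, replacing `min` with `max` would be the only change needed.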
Fine Grained Jaccard – Splitting the root of the tree
Node d is split into Req. Location = Berlin and Req. Location != Berlin.
LPI – Splitting the root of the tree
Node d is split into Req. Location = Berlin and Req. Location != Berlin.
MLE – Splitting the root of the tree
Attributes: Cust. City, Cust. Type
Estimated probabilities: p(Cust. City | Cust. Type), p(Cust. Type | Cust. City)
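The two conditional probabilities named on this slide can be estimated from counts. The sketch below is a minimal illustration under assumptions: the records and the attribute names `custCity` and `custType` are hypothetical, and it uses the plain relative-frequency estimate (maximum likelihood for a categorical distribution) without smoothing. How OCCT combines these estimates into a split score is not shown on the slide and is not modeled here.

```python
from collections import Counter


def conditional_mle(records, target, given):
    """Estimate p(target | given) as count(given, target) / count(given)."""
    joint = Counter((r[given], r[target]) for r in records)
    marginal = Counter(r[given] for r in records)
    return {(g, t): n / marginal[g] for (g, t), n in joint.items()}


# Hypothetical customer records.
customers = [
    {"custCity": "Berlin", "custType": "private"},
    {"custCity": "Berlin", "custType": "business"},
    {"custCity": "Bonn", "custType": "private"},
    {"custCity": "Berlin", "custType": "private"},
]

# Keys are (given value, target value) pairs.
p_city_given_type = conditional_mle(customers, target="custCity", given="custType")
p_type_given_city = conditional_mle(customers, target="custType", given="custCity")
```

For example, `p_city_given_type[("private", "Berlin")]` is the share of private customers located in Berlin in the toy data.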