This research presents a novel approach to hierarchical classification using class proximity principles. We develop classifiers at each level of a class hierarchy, guiding the classification process through a path of specialized classifiers from the root. Emphasizing local feature determination, our methodology addresses issues such as misclassification biases across sibling and nephew classes while balancing specialization and generalization. By implementing generalized association rules to determine class rankings, we enhance document classification effectiveness within complex class structures, validated through extensive experiments.
Building Hierarchical Classifiers Using Class Proximity • Ke Wang, Senqiang Zhou, Shiang Chen Liew • National University of Singapore
Hierarchical classification • Given: a class hierarchy and a collection of pre-classified documents, where a document is a set of terms • Build: a classifier that assigns a relevant class to a new document • Key: extract features of classes
Yahoo classes • Yahoo: recreation, science • recreation: sports, automotive • sports: cycling, skating
ACM classes • Level 1: Hardware • Level 2: General, Memory_structure • Level 3: Design_style, General • Level 4: Cache_memories
Existing local approaches • build one classifier at each split of the class hierarchy • determine features locally at each node • classify a document by going through a path of classifiers starting from the root
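The local, top-down scheme described above can be sketched in a few lines. This is only a minimal illustration, not any particular published classifier: the toy hierarchy follows the Yahoo example, while the feature sets in `LOCAL_FEATURES`, the overlap-count scoring, and the function name `classify_top_down` are all assumptions made for the example.

```python
# Toy class hierarchy (internal node -> children), after the Yahoo example.
CHILDREN = {
    "Yahoo": ["recreation", "science"],
    "recreation": ["sports", "automotive"],
    "sports": ["cycling", "skating"],
}

# Hypothetical local features, chosen by hand for illustration only.
LOCAL_FEATURES = {
    "recreation": {"game", "hobby", "race"},
    "science": {"physics", "biology", "experiment"},
    "sports": {"race", "team", "bike"},
    "automotive": {"car", "engine"},
    "cycling": {"bike", "mountain", "tour"},
    "skating": {"skate", "ice"},
}


def classify_top_down(doc_terms, root="Yahoo"):
    """Follow one path of local classifiers from the root to a leaf.

    At each split, the stand-in "classifier" picks the child whose local
    feature set overlaps most with the document's terms.
    """
    path = [root]
    node = root
    while node in CHILDREN:
        node = max(CHILDREN[node],
                   key=lambda child: len(doc_terms & LOCAL_FEATURES[child]))
        path.append(node)
    return path


print(classify_top_down({"race", "bike", "mountain"}))
# ['Yahoo', 'recreation', 'sports', 'cycling']
```

Because each decision is final, a wrong choice at the root cannot be corrected further down the path: in this toy setup the document {"car", "experiment"} is sent to science and never reaches automotive.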
Diminishing of high-level structure • local approaches rely on classification at high levels • but high-level structures are usually weak because their topics diverge • e.g., “car” is a feature at Recreation:Automotive, but not at Recreation
Bias of misclassification • sibling classes vs. nephew classes • misclassification at high levels vs. at low levels • specialisation vs. generalisation
Features should be • determined with respect to the target class • determined at all concept levels • correlated • The solution: generalised association rules (SA95, HF95) • {sql, IO} → DB • {language, performance} → CS
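To make the idea of a generalised association rule concrete, the sketch below counts support and plain confidence for a rule of the form {terms} → class, where a document also matches every ancestor of its terms and of its class. The hierarchies in `TERM_PARENT` and `CLASS_PARENT`, the four toy documents, and the helper names are all assumptions for illustration. The measure shown is ordinary confidence, not the biased measures used later in the slides.

```python
# Hypothetical hierarchies (child -> parent), invented for this example.
TERM_PARENT = {"sql": "language", "IO": "performance"}
CLASS_PARENT = {"DB": "CS", "AI": "CS"}

# Toy pre-classified documents: (set of terms, class).
DOCS = [
    ({"sql", "IO"}, "DB"),
    ({"sql"}, "DB"),
    ({"language", "performance"}, "AI"),
    ({"IO"}, "AI"),
]


def extend(items, parent):
    """Return the items together with all of their ancestors."""
    result = set()
    for item in items:
        while item is not None:
            result.add(item)
            item = parent.get(item)
    return result


def rule_stats(body, head, docs):
    """Support and plain confidence of the generalised rule body -> head."""
    n_body = n_both = 0
    for terms, cls in docs:
        if body <= extend(terms, TERM_PARENT):
            n_body += 1
            if head in extend({cls}, CLASS_PARENT):
                n_both += 1
    support = n_both / len(docs)
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


print(rule_stats({"sql", "IO"}, "DB", DOCS))                # (0.25, 1.0)
print(rule_stats({"language", "performance"}, "CS", DOCS))  # (0.5, 1.0)
```

The second rule matches the first document only through generalisation: "sql" counts as "language", "IO" counts as "performance", and class DB counts as CS.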
Our approach • class proximity • global classifier • term hierarchy • use the “best” generalised association rule T → C to determine the class
Rank association rules • Biased confidence • Biased J-measure
An example • Class hierarchy: Arts, Music, Literature, ..., A_Music, A_Literature • Term hierarchy: author, story, editor, writer, poem, fiction
Term hierarchy (T) = Yes, Class proximity (B) = Yes • R0: author, story → Literature (ConfB=1, Clist=d6,d7) • R1: author → Literature (ConfB=1) • R2: story → Literature (ConfB=0.67, Wlist=d5(1)) • R4: hall → Music (ConfB=0.4, Clist=d1,d2, Wlist=d3(1)) • R3: States → A_Literature (ConfB=0.33, Clist=d4,d5)
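One way to read the ranked list above is as a first-match classifier: sort the rules by ConfB and let the highest-ranked rule whose body is covered by the document decide the class. The sketch below uses the rule bodies and ConfB values from the slide. The term hierarchy in `TERM_PARENT` (editor and writer under author, poem and fiction under story) is an assumed reading of the example diagram, and keeping the slide's order for ties and returning `None` when nothing matches are likewise assumptions.

```python
# Assumed term hierarchy (child -> parent); one reading of the example.
TERM_PARENT = {
    "editor": "author", "writer": "author",
    "poem": "story", "fiction": "story",
}

# (name, body, class, ConfB) as listed on the slide.
RULES = [
    ("R0", {"author", "story"}, "Literature", 1.0),
    ("R1", {"author"}, "Literature", 1.0),
    ("R2", {"story"}, "Literature", 0.67),
    ("R4", {"hall"}, "Music", 0.4),
    ("R3", {"States"}, "A_Literature", 0.33),
]


def generalise(terms):
    """Add every ancestor term so that general rule bodies can match."""
    expanded = set()
    for term in terms:
        while term is not None:
            expanded.add(term)
            term = TERM_PARENT.get(term)
    return expanded


def classify(doc_terms, rules=RULES):
    """Return (rule name, class) of the best-ranked matching rule, or None."""
    expanded = generalise(doc_terms)
    # sorted() is stable, so rules with equal ConfB keep the slide's order.
    for name, body, cls, _conf in sorted(rules, key=lambda r: -r[3]):
        if body <= expanded:
            return name, cls
    return None


print(classify({"writer", "poem"}))   # ('R0', 'Literature')
print(classify({"fiction", "hall"}))  # ('R2', 'Literature')
```

In the first call neither "author" nor "story" appears in the document, but both are reached through the assumed term hierarchy, so R0 fires.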
Experiment I • http://www.acm.org/dl/toc.html/ • 26,515 papers, 78 classes, 14,754 terms • class hierarchy = level-1 and level-2 categories • term hierarchy = level-3 and level-4 categories • document = title and level-4 categories
Best rules found by (B,T) • CSO: • vector, stream, processor, parallel → Processor_Architectures • multiple_instruction_stream → Processor_Architectures • data_flow, architectur → Processor_Architectures • internet, architectur → Computer_Communication_Networks • mode, atm → Computer_Communication_Networks • network, circuit_switching → Computer_Communication_Networks • tecniqu, model, attribut → Performance_of_Systems • Software: • program, function, application → Programming_Techniques • object_oriented_programming → Programming_Techniques • reusable_software → Software_Engineering • software, methodologie → Software_Engineering • organization, distributed_system → Operating_Systems
[Figure: results plot; legend entries (), (T), (B), (B,T), (CDAR97,T), (CDAR97)]
Experiment II • http://dir.yahoo.com/recreation/sports • 7,550 documents • 367 classes, 7 levels • 10,747 terms • 90% of the terms occur in no more than 10 documents and many documents contain only such terms
Best rules found by (B,T) • Sports:Cycling: • page, mountain → Mountain_Biking • product, bike → Mountain_Biking • mtb, mountain → Mountain_Biking • held, bicycl → Races • classic, bicycl → Races • trip, tour → Travelogues • trip, canada → Travelogues • bicycl, alaska → Travelogues • Sports:Auto_Racing: • team, result, driver → Formula_one • model, featur → Tracks_and_Speedways • oval → Tracks_and_Speedways • raceway → Tracks_and_Speedways