Building Hierarchical Classifiers Using Class Proximity
Ke Wang, Senqiang Zhou, Shiang Chen Liew
National University of Singapore
Hierarchical classification
• Given
  • a class hierarchy
  • a collection of pre-classified documents
  • a document is a set of terms
• Build
  • a classifier that assigns a relevant class to a new document
• Key
  • extract features of classes
Yahoo classes
[Figure: part of the Yahoo class hierarchy, showing the nodes Yahoo, recreation, science, sports, automotive, cycling and skating]
ACM classes
[Figure: part of the ACM class hierarchy, by level]
• Level 1: Hardware
• Level 2: General, Memory_structure
• Level 3: Design_style, General
• Level 4: Cache_memories
Existing local approaches
• build one classifier at each split of the class hierarchy
• determine features locally at each node
• classify a document by going through a path of classifiers, starting from the root
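The top-down procedure described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the toy hierarchy `CHILDREN`, the per-class term sets in `LOCAL_FEATURES`, and the overlap-counting `local_classifier` are all assumptions made up for this example.

```python
from typing import Dict, FrozenSet, List, Set

Document = FrozenSet[str]

# Hypothetical toy hierarchy: each internal class maps to its child classes.
CHILDREN: Dict[str, List[str]] = {
    "Yahoo": ["Recreation", "Science"],
    "Recreation": ["Sports", "Automotive"],
}

# Hypothetical local features: terms used to recognise a class at its parent's split.
LOCAL_FEATURES: Dict[str, Set[str]] = {
    "Recreation": {"game", "outdoor"},
    "Science": {"physics", "theory"},
    "Sports": {"game", "race"},
    "Automotive": {"car", "engine"},
}


def local_classifier(node: str, doc: Document) -> str:
    """Pick the child of `node` whose feature set overlaps the document most."""
    return max(CHILDREN[node], key=lambda child: len(LOCAL_FEATURES[child] & doc))


def classify_top_down(doc: Document, root: str = "Yahoo") -> List[str]:
    """Follow one local decision per level until a leaf class is reached."""
    path = [root]
    while path[-1] in CHILDREN:
        path.append(local_classifier(path[-1], doc))
    return path


print(classify_top_down(frozenset({"game", "race"})))
print(classify_top_down(frozenset({"car", "physics"})))
```

In this toy setup the second document is routed to Science at the root because "car" is not a feature at that level, and no later classifier gets a chance to correct the decision.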
Diminishing high-level structure
• local approaches rely on correct classification at high levels
• but high-level classes usually have weak structure, i.e., their topics diverge
• e.g., “car” is a feature at Recreation:Automotive, but not at Recreation
Bias of misclassification
• sibling classes vs. nephew classes
• misclassification at high levels vs. at low levels
• specialisation vs. generalisation
Features should be
• determined with respect to the target class
• determined at all concept levels
• correlated
The solution: generalised association rules (SA95, HF95)
• {sql, IO} → DB
• {language, performance} → CS
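To make the idea of a rule holding "at all concept levels" concrete, the sketch below measures support and confidence of a rule after extending each document with the ancestors of its terms and each class with its ancestors. The hierarchies `TERM_PARENT` and `CLASS_PARENT`, the toy documents in `DOCS`, and the helper names are assumptions for illustration; the slides do not give these details.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical hierarchies (child -> parent), invented for this example.
TERM_PARENT: Dict[str, str] = {"sql": "query_language", "oql": "query_language"}
CLASS_PARENT: Dict[str, str] = {"DB": "CS"}

# Hypothetical pre-classified documents: (set of terms, class).
DOCS: List[Tuple[Set[str], str]] = [
    ({"sql", "IO"}, "DB"),
    ({"oql", "IO"}, "DB"),
    ({"language", "performance"}, "CS"),
    ({"sql", "language"}, "CS"),
]


def with_ancestors(items: Set[str], parent: Dict[str, str]) -> Set[str]:
    """Return the items together with all of their ancestors."""
    out = set(items)
    for item in items:
        while item in parent:
            item = parent[item]
            out.add(item)
    return out


def support_and_confidence(
    body: Set[str], head: str, docs: List[Tuple[Set[str], str]]
) -> Tuple[float, float]:
    """Support and confidence of body -> head over generalised documents."""
    # A document is covered if its terms, plus their ancestors, contain the body.
    covered = [cls for terms, cls in docs if body <= with_ancestors(terms, TERM_PARENT)]
    # A covered document counts as a hit if its class is the head or a descendant of it.
    hits = sum(1 for cls in covered if head in with_ancestors({cls}, CLASS_PARENT))
    support = hits / len(docs)
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence


print(support_and_confidence({"query_language", "IO"}, "DB", DOCS))
print(support_and_confidence({"sql"}, "DB", DOCS))
```

The first rule uses the generalised term `query_language` and covers both DB documents, while the more specific body {sql} also covers a document outside DB and so has lower confidence.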
Our approach
• class proximity
• global classifier
• term hierarchy
• use the “best” generalised association rule T → C to determine the class
Rank association rules
• Biased confidence
• Biased J-measure
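The slides name the two ranking criteria but do not define their biased forms. As a point of reference only, the sketch below computes the standard, unbiased confidence and the standard J-measure of a rule T → C; how class proximity enters the biased variants is not shown here, and the function names are assumptions for this example.

```python
import math


def confidence(n_body: int, n_body_and_head: int) -> float:
    """Standard confidence: fraction of documents matching T that belong to C."""
    return n_body_and_head / n_body


def j_measure(p_body: float, p_head: float, p_head_given_body: float) -> float:
    """Standard J-measure of T -> C, in bits. Requires 0 < p_head < 1."""

    def term(p: float, q: float) -> float:
        # Contribution p * log2(p / q), taken as 0 when p is 0.
        return 0.0 if p == 0 else p * math.log2(p / q)

    return p_body * (
        term(p_head_given_body, p_head)
        + term(1.0 - p_head_given_body, 1.0 - p_head)
    )


print(confidence(n_body=3, n_body_and_head=2))
print(j_measure(p_body=0.5, p_head=0.5, p_head_given_body=1.0))
```

A rule whose body tells nothing about the class (P(C|T) equal to P(C)) gets a J-measure of zero, so ranking by J-measure favours rules that are both frequent and informative.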
An example
[Figure: a class hierarchy with the nodes Arts, Music, Literature, A_Music and A_Literature, and a term hierarchy with the nodes author, story, editor, writer, poem and fiction]
Term hierarchy (T) = Yes, Class proximity (B) = Yes
• R0: author, story → Literature (ConfB=1, Clist=d6,d7)
• R1: author → Literature (ConfB=1)
• R2: story → Literature (ConfB=0.67, Wlist=d5(1))
• R4: hall → Music (ConfB=0.4, Clist=d1,d2, Wlist=d3(1))
• R3: States → A_Literature (ConfB=0.33, Clist=d4,d5)
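One way to read the ranked rule list is as a first-match classifier: a new document is assigned the class of the highest-ranked rule whose body it satisfies. The sketch below applies the five rules in the order listed. It assumes that editor and writer are children of author, and poem and fiction are children of story in the term hierarchy, and that a document matches a term if it contains that term or one of its descendants; both are assumptions about the example, not statements from the slides.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Assumed reading of the example term hierarchy (child -> parent).
TERM_PARENT: Dict[str, str] = {
    "editor": "author",
    "writer": "author",
    "poem": "story",
    "fiction": "story",
}

# Rules in the ranked order given in the list: (name, body, class).
RANKED_RULES: List[Tuple[str, FrozenSet[str], str]] = [
    ("R0", frozenset({"author", "story"}), "Literature"),
    ("R1", frozenset({"author"}), "Literature"),
    ("R2", frozenset({"story"}), "Literature"),
    ("R4", frozenset({"hall"}), "Music"),
    ("R3", frozenset({"States"}), "A_Literature"),
]


def generalise(terms: Set[str]) -> Set[str]:
    """Return the terms together with all of their ancestors."""
    out = set(terms)
    for term in terms:
        while term in TERM_PARENT:
            term = TERM_PARENT[term]
            out.add(term)
    return out


def classify(terms: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (rule name, class) of the first ranked rule the document satisfies."""
    expanded = generalise(terms)
    for name, body, cls in RANKED_RULES:
        if body <= expanded:
            return name, cls
    return None  # no rule fires; a default class would be needed here


print(classify({"writer", "poem"}))
print(classify({"hall", "States"}))
```

Under these assumptions a document containing only "writer" and "poem" still fires R0, because both terms generalise to the rule's body.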
Experiment I
• http://www.acm.org/dl/toc.html/
• 26,515 papers, 78 classes, 14,754 terms
• class hierarchy = level-1 and level-2 categories
• term hierarchy = level-3 and level-4 categories
• document = title and level-4 categories
Best rules found by (B,T)
• CSO:
  • vector, stream, processor, parallel → Processor_Architectures
  • multiple_instruction_stream → Processor_Architectures
  • data_flow, architectur → Processor_Architectures
  • internet, architectur → Computer_Communication_Networks
  • mode, atm → Computer_Communication_Networks
  • network, circuit_switching → Computer_Communication_Networks
  • tecniqu, model, attribut → Performance_of_Systems
• Software:
  • program, function, application → Programming_Techniques
  • object_oriented_programming → Programming_Techniques
  • reusable_software → Software_Engineering
  • software, methodologie → Software_Engineering
  • organization, distributed_system → Operating_Systems
[Figure: results plot; legend entries (), (T), (B), (B,T), (CDAR97,T) and (CDAR97)]
Experiment II
• http://dir.yahoo.com/recreation/sports
• 7,550 documents
• 367 classes, 7 levels
• 10,747 terms
• 90% of the terms occur in no more than 10 documents, and many documents contain only such terms
Best rules found by (B,T)
• Sports:Cycling:
  • page, mountain → Mountain_Biking
  • product, bike → Mountain_Biking
  • mtb, mountain → Mountain_Biking
  • held, bicycl → Races
  • classic, bicycl → Races
  • trip, tour → Travelogues
  • trip, canada → Travelogues
  • bicycl, alaska → Travelogues
• Sports:Auto_Racing:
  • team, result, driver → Formula_one
  • model, featur → Tracks_and_Speedways
  • oval → Tracks_and_Speedways
  • raceway → Tracks_and_Speedways