Building Hierarchical Classifiers Using Class Proximity


Presentation Transcript


  1. Building Hierarchical Classifiers Using Class Proximity
     Ke Wang, Senqiang Zhou, Shiang Chen Liew
     National University of Singapore

  2. Hierarchical classification
     • Given
       • a class hierarchy
       • a collection of pre-classified documents
       • a document is a set of terms
     • Build
       • a classifier that assigns a relevant class to a new document
     • Key
       • extract features of classes

  3. Yahoo classes (class tree)
     • Yahoo
       • recreation
         • sports
           • cycling
           • skating
         • automotive
       • science

  4. ACM classes (class tree)
     • Level 1: Hardware
     • Level 2: General, Memory_structure
     • Level 3: Design_style, General
     • Level 4: Cache_memories

  5. Existing local approaches
     • build one classifier at each split of the class hierarchy
     • determine features locally at each node
     • classify a document by going through a path of classifiers starting from the root
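The local, top-down scheme described on this slide can be sketched roughly as follows. This is only an illustration, not the authors' code: the `Node` class, the keyword-overlap "classifier" at each split, and every feature term are hypothetical, and the tree shape is one reading of the Yahoo classes slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    """A class in the hierarchy with hypothetical, locally chosen feature terms."""
    name: str
    features: Set[str] = field(default_factory=set)
    children: Dict[str, "Node"] = field(default_factory=dict)

    def add(self, child: "Node") -> "Node":
        self.children[child.name] = child
        return child


def classify_top_down(root: Node, doc: Set[str]) -> List[str]:
    """Follow one path from the root, picking the best child at every split.

    The local "classifier" here is a stand-in: it picks the child whose
    local features overlap most with the document (ties go to the first child).
    """
    path = [root.name]
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda c: len(c.features & doc))
        path.append(node.name)
    return path


# Assumed tree and made-up features, for illustration only.
root = Node("Yahoo")
recreation = root.add(Node("recreation", {"hobby", "fun"}))
root.add(Node("science", {"research", "experiment"}))
sports = recreation.add(Node("sports", {"team", "race"}))
recreation.add(Node("automotive", {"car", "engine"}))
sports.add(Node("cycling", {"bike"}))
sports.add(Node("skating", {"skate"}))

print(classify_top_down(root, {"hobby", "car", "engine"}))
print(classify_top_down(root, {"research", "car"}))
```

In the second call the document is routed to science at the root, so the term "car" never gets a chance to vote for automotive: a wrong decision at a high level cannot be undone further down the path.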

  6. Diminishing of high-level structure
     • local approaches rely on classification at high levels
     • but high-level structures are usually weak, i.e., topics diverge
     • e.g., “car” is a feature at Recreation:Automotive, but not at Recreation

  7. Bias of misclassification
     • sibling classes vs. nephew classes
     • misclassification at high levels vs. at low levels
     • specialisation vs. generalisation

  8. Features should be
     • determined with respect to the target class
     • determined at all concept levels
     • correlated
     The solution: generalised association rules (SA95, HF95)
     • {sql, IO} → DB
     • {language, performance} → CS
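To make the two example rules concrete, here is a minimal sketch of how support and confidence of a rule T → C might be counted when the target class can sit at any level of the class hierarchy. The toy corpus is made up, and the assumption that DB is a subclass of CS is ours, not stated on the slide; the function names are hypothetical.

```python
from typing import FrozenSet, List, Optional, Tuple

Doc = Tuple[FrozenSet[str], str]  # (terms of the document, its class label)

# Assumption for illustration: DB is a child of CS in the class hierarchy.
CLASS_PARENT = {"DB": "CS"}

# Made-up, pre-classified toy documents.
DOCS: List[Doc] = [
    (frozenset({"sql", "io", "index"}), "DB"),
    (frozenset({"sql", "io"}), "DB"),
    (frozenset({"sql", "language"}), "CS"),
    (frozenset({"language", "performance"}), "CS"),
]


def is_a(label: str, target: str) -> bool:
    """True if label equals target or is a descendant of it."""
    current: Optional[str] = label
    while current is not None:
        if current == target:
            return True
        current = CLASS_PARENT.get(current)
    return False


def support_and_confidence(body: FrozenSet[str], target: str,
                           docs: List[Doc]) -> Tuple[float, float]:
    """Plain (unbiased) support and confidence of the rule body -> target."""
    covered = [label for terms, label in docs if body <= terms]
    hits = sum(1 for label in covered if is_a(label, target))
    support = hits / len(docs)
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence


print(support_and_confidence(frozenset({"sql", "io"}), "DB", DOCS))
print(support_and_confidence(frozenset({"language", "performance"}), "CS", DOCS))
print(support_and_confidence(frozenset({"sql"}), "CS", DOCS))
```

Because a DB document also counts as a CS document, the rule {sql} → CS covers more documents than {sql} → DB, which is the sense in which a rule is "generalised" in this sketch.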

  9. Our approach
     • class proximity
     • global classifier
     • term hierarchy
     • use the “best” generalised association rule T → C to determine the class

  10. Rank association rules
     • Biased confidence
     • Biased J-measure
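The transcript names the two ranking measures but does not define them. For reference, the standard (unbiased) confidence and J-measure of a rule T → C are given below; how the biased variants modify these using class proximity is not stated on the slide, so it is left open here. The probabilities are assumed to be relative frequencies over the pre-classified documents.

```latex
\mathrm{conf}(T \rightarrow C) = P(C \mid T)
  = \frac{|\{d : T \subseteq d,\ \mathrm{class}(d) = C\}|}{|\{d : T \subseteq d\}|}

J(T \rightarrow C) = P(T)\left[
    P(C \mid T)\,\log\frac{P(C \mid T)}{P(C)}
  + \bigl(1 - P(C \mid T)\bigr)\,\log\frac{1 - P(C \mid T)}{1 - P(C)}
\right]
```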

  11. An example
     [Figure: class hierarchy with nodes Arts, Music, Literature, ..., A_Music, A_Literature; term hierarchy with nodes author, story, editor, writer, poem, fiction]

  12. Term hierarchy (T) = Yes, Class proximity (B) = Yes
     • R0: author, story → Literature (ConfB=1, Clist=d6,d7)
     • R1: author → Literature (ConfB=1)
     • R2: story → Literature (ConfB=0.67, Wlist=d5(1))
     • R4: hall → Music (ConfB=0.4, Clist=d1,d2, Wlist=d3(1))
     • R3: States → A_Literature (ConfB=0.33, Clist=d4,d5)
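As a rough sketch of how a ranked rule list like the one on this slide could be used to classify a new document: generalise the document's terms through the term hierarchy, then take the first matching rule in ConfB order. The rules and ConfB values are copied from the slide; the term hierarchy (editor and writer under author, poem and fiction under story) is our reading of the example diagram, and the function names and the "no match returns None" behaviour are assumptions.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

# Assumed reading of the example term hierarchy: child term -> parent term.
TERM_PARENT = {
    "editor": "author",
    "writer": "author",
    "poem": "story",
    "fiction": "story",
}

# (rule name, body terms, class, ConfB), in the ranked order shown on the slide.
RULES: List[Tuple[str, FrozenSet[str], str, float]] = [
    ("R0", frozenset({"author", "story"}), "Literature", 1.0),
    ("R1", frozenset({"author"}), "Literature", 1.0),
    ("R2", frozenset({"story"}), "Literature", 0.67),
    ("R4", frozenset({"hall"}), "Music", 0.4),
    ("R3", frozenset({"States"}), "A_Literature", 0.33),
]


def generalise(terms: Set[str]) -> Set[str]:
    """Add every ancestor of every term, so rules at any level can match."""
    expanded = set(terms)
    for term in terms:
        parent = TERM_PARENT.get(term)
        while parent is not None:
            expanded.add(parent)
            parent = TERM_PARENT.get(parent)
    return expanded


def classify(terms: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (rule name, class) of the highest-ranked matching rule, if any."""
    expanded = generalise(terms)
    for name, body, cls, _conf in RULES:  # list is already in ranked order
        if body <= expanded:
            return name, cls
    return None


print(classify({"writer", "poem"}))
print(classify({"fiction"}))
print(classify({"hall", "States"}))
```

A document containing only "writer" and "poem" shares no literal term with any rule, yet it still fires R0 once both terms are generalised, which is the role the term hierarchy plays in this sketch.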

  13. Experiment I
     • http://www.acm.org/dl/toc.html/
     • 26,515 papers, 78 classes, 14,754 terms
     • class hierarchy = level-1 and level-2 categories
     • term hierarchy = level-3 and level-4 categories
     • document = title and level-4 categories

  14. Best rules found by (B,T)
     • CSO:
       • vector, stream, processor, parallel → Processor_Architectures
       • multiple_instruction_stream → Processor_Architectures
       • data_flow, architectur → Processor_Architectures
       • internet, architectur → Computer_Communication_Networks
       • mode, atm → Computer_Communication_Networks
       • network, circuit_switching → Computer_Communication_Networks
       • tecniqu, model, attribut → Performance_of_Systems
     • Software:
       • program, function, application → Programming_Techniques
       • object_oriented_programming → Programming_Techniques
       • reusable_software → Software_Engineering
       • software, methodologie → Software_Engineering
       • organization, distributed_system → Operating_Systems

  15. [Figure: legend entries (), (T), (B), (B,T), (CDAR97,T), (CDAR97); plot not recoverable]

  16. Experiment II
     • http://dir.yahoo.com/recreation/sports
     • 7,550 documents
     • 367 classes, 7 levels
     • 10,747 terms
     • 90% of the terms occur in no more than 10 documents, and many documents contain only such terms

  17. Best rules found by (B,T)
     • Sports:Cycling:
       • page, mountain → Mountain_Biking
       • product, bike → Mountain_Biking
       • mtb, mountain → Mountain_Biking
       • held, bicycl → Races
       • classic, bicycl → Races
       • trip, tour → Travelogues
       • trip, canada → Travelogues
       • bicycl, alaska → Travelogues
     • Sports:Auto_Racing:
       • team, result, driver → Formula_one
       • model, featur → Tracks_and_Speedways
       • oval → Tracks_and_Speedways
       • raceway → Tracks_and_Speedways
