Automating Creation of Hierarchical Faceted Metadata Structures

Presentation Transcript

  1. Automating Creation of Hierarchical Faceted Metadata Structures • Emilia Stoica, Marti Hearst, and Megan Richardson* • School of Information, Berkeley • *Dept. of Mathematical Sciences, NMSU

  2. Focus: Browse Large Datasets • Standard search interface - query box + retrieved results – not suited for browsing and navigation • User interfaces need to group and organize the results

  3. How do we Create Faceted Hierarchies? • Goals: • Help an information architect to create the hierarchy • Currently they do it all by hand! • Balance depth and breadth • Avoid “skinny” paths • Don’t go too deep or too broad • Choose understandable labels • Disambiguate between word senses

  4. Related Work • Automated text categorization • LOTS of work on this • Assumes that a set of categories is already created • Little if any work on building facet hierarchies

  5. Castanet • Carves out a structure from the hypernym (IS-A) relations within WordNet • Semi-automatic algorithm for creating hierarchical faceted metadata • Produces surprisingly good results for a wide range of subjects • e.g., recipes, medicine, math, news, fine arts image description

  6. WordNet Challenges • A word may have more than one sense • Fine granularity of word sense distinctions, e.g., newspaper (#1): daily publication on folded sheets; newspaper (#3): physical object • Ambiguity for the same sense [Diagram: senses of "tuna", with labels #1, cactus, food fish, #2, fish, bony fish]

  7. WordNet Challenges (cont.) • The hypernym path may be quite long (e.g., sense #3 of tuna has 14 nodes) • Sparse coverage of proper names and noun phrases (not addressed)

  8. Our Approach [Pipeline diagram: Documents and WordNet as inputs; stages Select terms, Build core tree, Augment core tree, Compress tree, Remove top level categories, Divide into facets]

  9. Step 1: Select Terms • Select well-distributed terms from the collection • Eliminate stopwords • Retain only those terms with a distribution higher than a threshold (default: top 10%)
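The term-selection step can be sketched in a few lines. This is an illustration only, assuming that "distribution" means the number of documents containing a term; the tokenizer, the tiny stopword list, and the name `select_terms` are hypothetical and not taken from the talk.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "with", "in", "to"}

def select_terms(docs, top_fraction=0.10):
    """Return {term: document frequency} for the top `top_fraction` of terms."""
    doc_freq = Counter()
    for text in docs:
        # Count each term at most once per document.
        terms = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        doc_freq.update(terms)
    keep = max(1, math.ceil(len(doc_freq) * top_fraction))
    # Sort by frequency (descending), then alphabetically for a stable result.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep])

docs = [
    "Ice cream sundae with cherry",
    "Lemon sherbet and ice",
    "Vanilla ice cream",
]
selected = select_terms(docs, top_fraction=0.25)
print(selected)
```

With so few documents the example uses a 25% cutoff instead of the default 10% so that more than one term survives.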

  10. Step 2: Build Core Tree • Build a “backbone” • Create paths from unambiguous terms only • Bias the structure towards appropriate senses of words • Get the hypernym path if the term has only one sense or matches a pre-selected WordNet domain • Adding a new term increases the count at each node on its path by the number of docs with the term [Diagram: two hypernym paths, entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae, and entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet]

  11. Step 2: Build Core Tree (cont.) • Merge hypernym paths to build a tree [Diagram: the sundae and sherbet paths share entity > substance, matter > nutriment > dessert > frozen dessert, which then branches into ice cream sundae > sundae and sherbet, sorbet > sherbet]
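One way to picture the merge is as insertion into a nested dictionary, where shared prefixes collapse into shared nodes and each node accumulates document counts. The sketch below is not the authors' implementation; the paths are hard-coded from the slide's dessert example rather than looked up in WordNet, and the document counts (12 and 5) are made up.

```python
def add_path(tree, path, doc_count):
    """Merge one root-to-term hypernym path into `tree`, adding doc_count at each node."""
    children = tree
    for label in path:
        node = children.setdefault(label, {"count": 0, "children": {}})
        node["count"] += doc_count
        children = node["children"]

# Paths follow the slide's example; the document counts are hypothetical.
paths = {
    "sundae": (["entity", "substance, matter", "nutriment", "dessert",
                "frozen dessert", "ice cream sundae", "sundae"], 12),
    "sherbet": (["entity", "substance, matter", "nutriment", "dessert",
                 "frozen dessert", "sherbet, sorbet", "sherbet"], 5),
}

core_tree = {}
for term, (path, n_docs) in paths.items():
    add_path(core_tree, path, n_docs)

# Walk down the shared prefix to the node where the two paths diverge.
node = core_tree["entity"]
for label in ["substance, matter", "nutriment", "dessert", "frozen dessert"]:
    node = node["children"][label]
print(node["count"], sorted(node["children"]))
```

The shared prefix down to "frozen dessert" is stored once, and its count is the sum of the counts of the terms below it.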

  12. Step 3: Augment Core Tree • Attach to the core tree the terms with more than one sense • Favor the more common path over other alternatives

  13. Step 3: Augment Core Tree (cont.) • Example: “date” has two candidate paths • Date (p1): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78), where “date” would sit next to “berries” • Date (p2): abstraction > measure, quantity > fundamental quantity > time period > calendar day (18), where “date” would sit next to “Sunday” • Choose p1 since it has more items assigned (78 vs. 18)
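The choice between candidate paths can be sketched as a simple maximum over the counts already stored in the core tree. The counts 78 and 18 come from the slide; the function name `choose_path` and the flat `node_counts` lookup are assumptions made for this illustration.

```python
def choose_path(candidate_paths, node_counts):
    """Pick the candidate hypernym path whose attachment point has the most items."""
    def items_at_parent(path):
        # The last node of each path is where the ambiguous term would attach.
        return node_counts.get(tuple(path), 0)
    return max(candidate_paths, key=items_at_parent)

# Two candidate paths for "date"; counts are the ones shown on the slide.
p1 = ["entity", "substance, matter", "food, nutrient", "nutriment", "food", "edible fruit"]
p2 = ["abstraction", "measure, quantity", "fundamental quantity", "time period", "calendar day"]
node_counts = {tuple(p1): 78, tuple(p2): 18}

best = choose_path([p1, p2], node_counts)
print(best[-1])
```

A path that is not in the core tree at all gets a count of zero, so it is only chosen when no alternative has any items assigned.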

  14. Optional Step: Domains • To disambiguate, use Domains • WordNet has 212 Domains • medicine, mathematics, biology, chemistry, linguistics, soccer, etc. • A better collection has been developed by Magnini (2000) • Assigns a domain to every noun synset • Automatically scan the collection to see which domains apply • The user selects which of the suggested domains to use or may add their own • Paths for terms that match the selected domains are added to the core tree

  15. Using Domains • dip glosses: Sense 1: a depression in an otherwise level surface; Sense 2: the angle that a magnetic needle makes with the horizon; Sense 3: tasty mixture into which bite-size foods are dipped • dip hypernyms: Sense 1: solid => concave shape => depression; Sense 2: shape, form => space => angle; Sense 3: food => ingredient, fixings => flavorer • Given domain “food”, choose sense 3
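The domain filter can be sketched as a lookup over per-sense domain labels. The glosses are paraphrased from the slide, but the data layout, the function name `pick_sense`, and the labels for senses 1 and 2 are placeholders invented here; the slide only implies that sense 3 belongs to "food". Requiring exactly one match is also an assumption.

```python
def pick_sense(senses, selected_domains):
    """Return the senses whose domain label is among the user-selected domains."""
    return [s for s in senses if s["domain"] in selected_domains]

# Domain labels for senses 1 and 2 are placeholders, not real WordNet Domains labels.
dip_senses = [
    {"sense": 1, "gloss": "a depression in an otherwise level surface",
     "domain": "placeholder_a"},
    {"sense": 2, "gloss": "the angle that a magnetic needle makes with the horizon",
     "domain": "placeholder_b"},
    {"sense": 3, "gloss": "tasty mixture into which bite-size foods are dipped",
     "domain": "food"},
]

matches = pick_sense(dip_senses, {"food"})
# Assumption: only an unambiguous match contributes a path to the core tree.
chosen = matches[0] if len(matches) == 1 else None
print(chosen["sense"] if chosen else None)
```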

  16. Step 4: Compress Tree • Rule 1: Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1*maxdist [Diagram: tree with nodes dessert, frozen dessert, ice cream sundae, parfait, sherbet,sorbet, sundae, sherbet; compressed tree with nodes abstraction, dessert, frozen dessert, sundae, parfait, sherbet]
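Rule 1 can be sketched as a bottom-up pass that replaces a thin parent with its children. The `Node` class, the `dist` values, and the choice to leave leaf terms alone are assumptions for illustration; `max_dist` stands in for the slide's maxdist and is passed in rather than computed.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    dist: int = 0                      # hypothetical: documents at or below this node
    children: list = field(default_factory=list)

def compress(node, k, max_dist, is_root=True):
    """Return the list of nodes that replaces `node` after applying Rule 1 bottom-up."""
    new_children = []
    for child in node.children:
        new_children.extend(compress(child, k, max_dist, is_root=False))
    node.children = new_children
    is_parent = len(node.children) > 0          # leaves (terms) are never removed here
    too_few = len(node.children) < k
    if is_parent and too_few and not is_root and node.dist <= 0.1 * max_dist:
        return node.children                    # drop the node, promote its children
    return [node]

# Tree shaped like the slide's dessert example; all dist values are made up.
frozen = Node("frozen dessert", 30, [
    Node("ice cream sundae", 5, [Node("sundae", 5)]),
    Node("parfait", 4),
    Node("sherbet, sorbet", 3, [Node("sherbet", 3)]),
])
root = Node("dessert", 100, [frozen])

compress(root, k=2, max_dist=100)
print([c.name for c in frozen.children])
```

With k = 2, the single-child nodes "ice cream sundae" and "sherbet, sorbet" are removed and their terms move up under "frozen dessert", while the root survives even though it has only one child.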

  17. Step 4: Compress Tree (cont.) • Rule 2: Eliminate a child whose name appears within the parent’s name [Diagram: tree with nodes dessert, frozen dessert, sundae, parfait, sherbet; compressed tree with nodes abstraction, dessert, sundae, parfait, sherbet]
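Rule 2 can be sketched in the same style. The slide's diagram appears to remove "frozen dessert" under "dessert", so this sketch interprets the rule as matching when either name contains the other; that reading, the plain substring test, the `Node` class, and the decision to skip leaf terms are all assumptions rather than the authors' stated method.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

def names_overlap(parent_name, child_name):
    # Interpretation (assumption): one name contained in the other counts as a match.
    p, c = parent_name.lower(), child_name.lower()
    return p in c or c in p

def drop_redundant_children(node):
    """Apply Rule 2 top-down: replace a redundant child with its own children."""
    changed = True
    while changed:                     # repeat in case promoted nodes also match
        changed = False
        new_children = []
        for child in node.children:
            if child.children and names_overlap(node.name, child.name):
                new_children.extend(child.children)   # promote grandchildren
                changed = True
            else:
                new_children.append(child)
        node.children = new_children
    for child in node.children:
        drop_redundant_children(child)
    return node

root = Node("dessert", [
    Node("frozen dessert", [Node("sundae"), Node("parfait"), Node("sherbet")]),
])
drop_redundant_children(root)
print([c.name for c in root.children])
```

A substring test is crude (it would also match "ice" inside "rice"), so a word-level comparison may be a better fit in practice.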

  18. Step 5: Divide into Facets

  19. Step 5: Divide into Facets (Remove top levels) • Rule 1: Eliminate the top t levels (t = 4 for the recipe collection) • Rule 2: For each resulting tree, test whether it has at least n children (n = 2). If yes, stop. If not, delete the root and repeat. • Manual cleaning: remove facets that don’t make sense [Diagram: entity > substance, matter > food, nutriment > food stuff, food product > ingredient, fixings > flavorer, with flavorer branching into herb (parsley, oregano) and sweetening (sugar, syrup); resulting facet: flavorer, with herb (parsley, oregano) and sweetening (sugar, syrup)]
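The two facet rules can be sketched as two small functions applied in sequence. The tree follows the slide's flavorer example with t = 4 and n = 2; the `Node` class and function names are hypothetical, and dropping a lone leaf that fails the n-children test is a consequence of this reading of Rule 2 rather than something the slide states.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

def remove_top_levels(roots, t):
    """Rule 1: strip the top t levels, returning the subtrees rooted at depth t."""
    for _ in range(t):
        roots = [child for root in roots for child in root.children]
    return roots

def split_until_broad(root, n):
    """Rule 2: keep a tree whose root has >= n children, else delete the root and recurse."""
    if len(root.children) >= n:
        return [root]
    facets = []
    for child in root.children:
        facets.extend(split_until_broad(child, n))
    return facets

flavorer = Node("flavorer", [
    Node("herb", [Node("parsley"), Node("oregano")]),
    Node("sweetening", [Node("sugar"), Node("syrup")]),
])
tree = Node("entity", [Node("substance, matter", [Node("food, nutriment", [
    Node("food stuff, food product", [Node("ingredient, fixings", [flavorer])]),
])])])

facets = []
for subtree in remove_top_levels([tree], t=4):
    facets.extend(split_until_broad(subtree, n=2))
print([f.name for f in facets])
```

Removing four levels leaves "ingredient, fixings", which has a single child and is therefore deleted, so the facet that remains is rooted at "flavorer".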

  20. Example: Recipes (13,500 docs)

  21. Castanet Output (shown in Flamenco)

  22. Castanet Output

  23. Castanet Evaluation • This is a tool for information architects (IA), so people of this type did the evaluation • Each IA compared Castanet to other state-of-the-art algorithms • LDA (Blei et al. 04) • Subsumption (Sanderson & Croft ’99) • Baseline: most frequent terms in the collection • Datasets • 13,000 recipes from Southwestcooking.com

  24. Subsumption Output

  25. Subsumption Output

  26. LDA Output

  27. LDA Output

  28. Evaluation Method • For each of 2 systems’ output: • Examined and commented on top-level • Examined and commented on two sub-levels • Then comment on overall properties • Meaningful? • Systematic? • Likely to use in your work? [Diagram residue: system labels C, C, L, S with the numbers 16 and 18]

  29. Evaluation (cont.) • Sample questions for top level categories: - Would you add/remove/rename any category? - Did this category match your expectations? • Sample questions for a specific category: - Would you add/move/remove any sub-categories? - Would you promote any sub-category to top level? • General questions: - Would you use Castanet? - Would you use LDA? - Would you use Subsumption? - Would you use list of most frequent terms?

  30. Evaluation Results • “Would you use this system in your work?” Answers “yes definitely” or “yes, in some cases”: Castanet 85%, LDA 0%, Subsumption 37%, Baseline 74% • Average response to questions about quality (4 = “strongly agree”, 3 = “agree somewhat”, 2 = “disagree somewhat”, 1 = “strongly disagree”)

  31. Evaluation Results • Average responses for top-level categories • (4= “no changes”, 3 = “one or two”, 2 = “a few”, 1 = “many”) • Average responses for 2 subcategories

  32. Needed Improvements • Take spelling variations and morphological variants into account • Use verbs and adjectives, not just nouns • Normalize noun phrases • Allow terms to have more than one sense • Improve algorithm for assigning documents to categories.

  33. Conclusions • Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet. • The method has been tested on various domains: • medicine, recipes, math, news, description of images • Usability study shows: • Castanet is preferred to other state-of-the-art solutions. • Information architects want to use the tool in their work. • Future work • Apply to tags (Flickr, Delicious)

  34. Learn More • Funding • This work was supported in part by NSF (IIS-9984741) • For more information: • Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007 • See http://flamenco.berkeley.edu