
University of Sheffield; CIIR, University of Massachusetts

Deriving concept hierarchies from text. Mark Sanderson, Bruce Croft. University of Sheffield; CIIR, University of Massachusetts. The question is: what paper already presented at this SIGIR is most like the one you’re about to see? We’ll have the answer right after this!


Presentation Transcript


  1. Deriving concept hierarchies from text • Mark Sanderson, Bruce Croft • University of Sheffield; CIIR, University of Massachusetts

  2. The question is... • What paper already presented at this SIGIR is most like the one you’re about to see? • We’ll have the answer, right after this!

  3. Concept hierarchies from documents? • Hierarchy of concepts, Yahoo • General down to specific • Child under one or more parents • No training data • Why? • Understandable

  4. Current methods • Polythetic clustering

  5. An alternative? • Monothetic clustering • Clusters based on a single feature • More ‘Yahoo/Dewey decimal’ like? • Easier to understand? • Preferable to users? • What about hierarchies of clusters?

  6. How to arrange cluster terms? • Existing techniques • WordNet • earthquake, volcano (eruption?) • Key phrases (Hearst 1998) • “such as”, “especially” • Phrase classification (Grefenstette 1997) • NP head or modifier “types of research” from “research things” • Hierarchical phrase analysis (Woods 1997) • Head modifier again, “car washing” under “washing”, not “car”

  7. WordNet (aside) • 1 sense of earthquake, sense 1 • earthquake, quake, temblor, seism -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity) • geological phenomenon -- (a natural phenomenon involving the structure or composition of the earth) • natural phenomenon, nature -- (all non-artificial phenomena) • phenomenon -- (any state or process known through the senses rather than by intuition or reasoning)

  8. WordNet (aside) • 5 senses of eruption, sense 1 • volcanic eruption, eruption -- (the sudden occurrence of a violent discharge of steam and volcanic material) • discharge -- (the sudden giving off of energy) • happening, occurrence, natural event -- (an event that happens) • event -- (something that happens at a given place and time)

  9. Start with something simpler? • Term clustering? • simple monothetic clusters • No ordering.

  10. Use subsumption • Initially using subsumption. • Finds related terms • Decides which is more general, which is more specific (idf?) • Strict interpretation • X s Y (X subsumes Y) iff P(x|y) = 1, P(y|x) < 1 • In practice • X s Y iff P(x|y) > 0.8, P(y|x) < 1 • P(x|y) > 0.8, P(y|x) < P(x|y)
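
A minimal sketch of the "in practice" subsumption test from this slide. It assumes each term is represented by the set of documents it occurs in, so that P(x|y) is estimated as the share of y's documents that also contain x. The function names, the toy document IDs, and this co-occurrence estimate are illustrative assumptions, not taken from the slides; only the 0.8 threshold and the P(y|x) < 1 condition come from the slide.

```python
def cond_prob(docs_a, docs_b):
    """Estimate P(a|b): the fraction of documents containing b that also contain a."""
    if not docs_b:
        return 0.0
    return len(docs_a & docs_b) / len(docs_b)


def subsumes(docs_x, docs_y, threshold=0.8):
    """True if x subsumes y under the relaxed rule: P(x|y) > threshold and P(y|x) < 1."""
    return cond_prob(docs_x, docs_y) > threshold and cond_prob(docs_y, docs_x) < 1.0


# Toy data (hypothetical): IDs of the documents in which each term occurs.
docs = {
    "polio": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    "vaccine": {1, 2, 3, 4, 5, 11},
}

# P(polio|vaccine) = 5/6 and P(vaccine|polio) = 5/10, so "polio" subsumes "vaccine".
print(subsumes(docs["polio"], docs["vaccine"]))
# The reverse fails because P(vaccine|polio) = 0.5 is below the threshold.
print(subsumes(docs["vaccine"], docs["polio"]))
```

With these toy sets the more frequent term ends up as the parent, which matches the slide's point that subsumption also decides which term is more general.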

  11. How to build a “hierarchy” • X s Y • X s Z • X s M • X s N • Y s Z • A s B • A s Z • B s Z • Really it’s a DAG (diagram nodes: X, A, M, N, Y, B, Z)
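
To see why the result is a DAG rather than a tree, the eight subsumption pairs listed on this slide can be loaded into parent and child maps. This is only a sketch: the variable names and the choice of dictionaries of sets are assumptions, and the pairs are read as (parent, child).

```python
from collections import defaultdict

# Subsumption pairs from the slide, written as (parent, child).
pairs = [
    ("X", "Y"), ("X", "Z"), ("X", "M"), ("X", "N"),
    ("Y", "Z"), ("A", "B"), ("A", "Z"), ("B", "Z"),
]

children = defaultdict(set)
parents = defaultdict(set)
for parent, child in pairs:
    children[parent].add(child)
    parents[child].add(parent)

nodes = {name for pair in pairs for name in pair}

# Roots are terms that no other term subsumes.
roots = sorted(n for n in nodes if not parents[n])

# Any node with more than one parent rules out a strict tree.
multi_parent = {n: sorted(parents[n]) for n in sorted(nodes) if len(parents[n]) > 1}

print(roots)         # ['A', 'X']
print(multi_parent)  # {'Z': ['A', 'B', 'X', 'Y']}
```

Z is subsumed by four different terms, so it cannot sit under a single parent.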

  12. How to display it? • DAGs were big • Unlikely to get all on screen • Only want to see current focus plus route taken to get there? • Use a method users are familiar with • Hierarchical menus (menu diagram labels: X, A, M, N, Y, B, Z, Z)
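
One way to turn the DAG into hierarchical menus is sketched below. It assumes that a direct parent-to-child link is dropped whenever the child is also reachable through another child of the same parent; the slides do not state this step, but it is consistent with the menu labels on this slide, where Z appears twice. All names are hypothetical.

```python
from collections import defaultdict

# Subsumption pairs from the previous slide, written as (parent, child).
pairs = [
    ("X", "Y"), ("X", "Z"), ("X", "M"), ("X", "N"),
    ("Y", "Z"), ("A", "B"), ("A", "Z"), ("B", "Z"),
]

children = defaultdict(set)
for parent, child in pairs:
    children[parent].add(child)


def descendants(node):
    """All nodes reachable from `node` by following one or more links."""
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


# Assumed pruning step: keep a child only if no sibling already leads to it.
reduced = {
    parent: sorted(
        c for c in kids
        if not any(c in descendants(other) for other in kids if other != c)
    )
    for parent, kids in children.items()
}


def menu_lines(node, depth=0):
    """Indented menu entries for `node` and everything below it."""
    lines = ["  " * depth + node]
    for child in reduced.get(node, []):
        lines.extend(menu_lines(child, depth + 1))
    return lines


all_children = {child for _, child in pairs}
roots = sorted(p for p in children if p not in all_children)
menu = [line for root in roots for line in menu_lines(root)]
print("\n".join(menu))
```

Because a menu is a tree, a term with several parents is simply repeated under each of them; here Z is listed once under B and once under Y.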

  13. What about ambiguity? • Monothetic clusters of ambiguous terms? • Derive hierarchy from retrieved documents • Take a query and retrieve on it, • take top 500 documents, • build hierarchy from them. • Topics/concepts are words/phrases taken from • Query • Retrieved documents • Comparison of frequencies
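
The slide only says that terms are chosen by a "comparison of frequencies". One plausible reading, sketched here purely as an assumption, is to keep terms that occur in a much larger share of the retrieved documents than of the whole collection. The function name, the `min_ratio` threshold, and all counts are hypothetical.

```python
def select_terms(retrieved_df, n_retrieved, collection_df, n_collection, min_ratio=2.0):
    """Keep terms whose document-frequency rate in the retrieved set is at least
    `min_ratio` times their rate in the whole collection (threshold is an assumption)."""
    selected = []
    for term, df_retrieved in retrieved_df.items():
        rate_retrieved = df_retrieved / n_retrieved
        rate_collection = collection_df.get(term, 0) / n_collection
        if rate_collection > 0 and rate_retrieved / rate_collection >= min_ratio:
            selected.append(term)
    return sorted(selected)


# Hypothetical document frequencies for the top 500 documents and for the collection.
retrieved_df = {"polio": 400, "vaccine": 150, "year": 300}
collection_df = {"polio": 600, "vaccine": 2000, "year": 300000}

print(select_terms(retrieved_df, 500, collection_df, 500000))  # ['polio', 'vaccine']
```

A term such as "year", which is equally common inside and outside the retrieved set, is filtered out, while topic-specific terms survive.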

  14-34. Poliomyelitis and Post-Polio, TREC topic 302 (21 slides sharing this title; the slide images are not included in the transcript)

  35. Did you guess the paper? • Bit like Peter Anick’s work?

  36. Experiment • Test properties of hierarchy • Does it mimic (in some way) Yahoo-like categories? • Parent related to child? • Parent more general than child?

  37. Experimental set-up • Gathered eight subjects • Presented subsumption categories and ‘random’ categories. • Asked if each parent/child pair was ‘interesting’. • If yes, asked what type of relationship it was, (roughly) from WordNet • Aspect of • Type of • Same as • Opposite of • Don’t know

  38. Results • Question of parent/child pairing ‘interesting’ or not • Random, 51% • Subsumption, 67% • Difference significant from t-test, p<0.002 • If interesting, what is parent/child type? Odd?

  39. Yahoo categories?

  40. Results and conclusions • Interesting AND (aspect of OR type of) • Random, 28% (51% * (47% + 8%)) • Subsumption, 48% (67% * (49% + 23%)) • Appears that subsumption and an ordering based on document frequency do a reasonable job. • For term frequency work, see: • Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval, in Journal of Documentation, 28(1): 11-21 • Caraballo, S.A., Charniak, E. (1999) Determining the specificity of nouns from text, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP):
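
The combined figures on slide 40 can be checked with a few lines of arithmetic. The sketch below uses only the percentages printed on the slide; the slide does not say which of the two bracketed shares is "aspect of" and which is "type of", so they are kept as an unlabeled pair.

```python
# (interesting %, (first share %, second share %)) as printed on the slide.
reported = {
    "random": (51, (47, 8)),
    "subsumption": (67, (49, 23)),
}

# Interesting AND (aspect of OR type of) = interesting share * summed type shares.
combined = {
    name: interesting * sum(shares) / 100
    for name, (interesting, shares) in reported.items()
}

for name, value in combined.items():
    print(f"{name}: {value:.2f}% (rounds to {round(value)}%)")
```

Both products round to the slide's figures of 28% and 48%.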

  41. Future work? • More user studies. • Incorporate other term relationship techniques • Other visualisations • Application of techniques to whole document collections. • Presentation of Cross Language IR results?
