Concept Hierarchy Induction by Philipp Cimiano

Presentation Transcript


  1. Concept Hierarchy Induction by Philipp Cimiano

  2. Objective • Structure information into categories • Provide a level of generalization to define relationships between data • Application: backbone of any ontology

  3. Overview • Different approaches to acquiring concept hierarchies from text corpora • Various clustering techniques • Evaluation • Related work • Conclusion

  4. Machine-Readable Dictionaries • Entries such as 'a tiger is a mammal' or 'mammals such as tigers, lions or elephants' • Exploit the regularity of dictionary entries: the head of the first NP of the definition is taken as the hypernym
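
A minimal sketch of the head-of-first-NP heuristic, assuming spaCy with an English model is available; the list of "empty heads" is illustrative and anticipates the Exception slides below.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # "Empty heads" such as 'part' or 'member' yield invalid is-a relations
    # (see the Exception slides); this list is illustrative, not from the talk.
    EMPTY_HEADS = {"part", "member", "kind", "type", "set", "group"}

    def hypernym_from_definition(term, definition):
        """Return (term, hypernym) from the head of the first NP, or None."""
        for chunk in nlp(definition).noun_chunks:
            head = chunk.root.lemma_.lower()
            if head == term:          # skip the defined term itself
                continue
            if head in EMPTY_HEADS:   # the plain heuristic fails here
                return None
            return (term, head)
        return None

    print(hypernym_from_definition("tiger", "a tiger is a mammal"))
    # -> ('tiger', 'mammal')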

  5. Example

  6. Exception

  7. Exception • is-a(corolla, part): NOT VALID • is-a(republican, member): NOT VALID • is-a(corolla, flower): NOT VALID • is-a(republican, political party): NOT VALID

  8. Exception

  9. Alshawi's solution

  10. Results using MRDs • Dolan et al.: 87% of the extracted hypernym relations are correct • Calzolari cites a precision of > 90% • Alshawi: precision of 77%

  11. Strengths And Weaknesses • Correct, explicit knowledge • Robust basis for ontology learning Weakness: • Dictionary knowledge is domain-independent, so it is of limited use for building domain-specific ontologies

  12. Lexico-Syntactic Patterns Task: automatically learn hyponym relations from corpora. 'Such injuries as bruises, wounds and broken bones' yields: hyponym(bruise, injury), hyponym(wound, injury), hyponym(broken bone, injury)

  13. Hearst patterns 'Such injuries as bruises, wounds and broken bones'
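
For illustration, a minimal matcher for the 'such NP as NP, NP and NP' Hearst pattern, sketched as a word-level regular expression; real systems match over NP chunks and lemmatize the results (bruises -> bruise), which this sketch omits.

    import re

    # 'such NP as NP, NP (and|or) NP' -- a crude word-level approximation.
    SUCH_AS = re.compile(
        r"\bsuch (?P<hyper>[\w-]+(?: [\w-]+)*?) as (?P<hypos>.+)", re.I)

    def hearst_such_as(sentence):
        """Extract hyponym(x, y) pairs from one 'such y as x1, x2...' sentence."""
        m = SUCH_AS.search(sentence)
        if not m:
            return []
        hyper = m.group("hyper").lower()
        hypos = re.split(r", |,? and |,? or ",
                         m.group("hypos").lower().rstrip(". "))
        return [(h, hyper) for h in hypos]

    print(hearst_such_as("Such injuries as bruises, wounds and broken bones"))
    # -> [('bruises', 'injuries'), ('wounds', 'injuries'),
    #     ('broken bones', 'injuries')]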

  14. Requirements • Occur frequently in many text genres • Accurately indicate the relation of interest • Be recognizable with little or no pre-encoded knowledge

  15. Strengths And Weaknesses • Patterns are easily identified and accurate Weaknesses: • The patterns appear rarely • Many is-a relations never appear in a Hearst-style pattern

  16. Distributional Similarity 'You shall know a word by the company it keeps' [Firth, 1957]. The semantic similarity of words is approximated by the similarity of the contexts in which they occur.
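
A minimal sketch of the idea: represent each word by a bag-of-words vector of the tokens around it and compare the vectors with cosine similarity. The toy corpus and window size are illustrative.

    import math
    from collections import Counter

    def context_vectors(sentences, window=2):
        """Bag-of-words context vectors within a +/- window token span."""
        vectors = {}
        for sent in sentences:
            tokens = sent.lower().split()
            for i, tok in enumerate(tokens):
                ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                vectors.setdefault(tok, Counter()).update(ctx)
        return vectors

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        norm = math.sqrt(sum(x * x for x in u.values())) \
             * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    vecs = context_vectors([
        "the tourist booked a hotel in rome",
        "the tourist booked an apartment in rome",
    ])
    print(cosine(vecs["hotel"], vecs["apartment"]))  # high: shared contexts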

  17. Using distributional similarity

  18. Strengths And Weaknesses • Produces a reasonable concept hierarchy Weaknesses: • The cluster tree lacks a clear, formal interpretation • Does not provide any intensional description of concepts • Similarities may be accidental (sparse data)

  19. Formal Concept Analysis (FCA)
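
Since the transcript omits the slide body, here is a minimal, self-contained FCA sketch over a toy formal context (terms as objects, verbs they occur with as attributes, loosely modeled on Cimiano's tourism examples; the context itself is made up). It uses the fact that the concept intents of a context are exactly the closures of the object intents under intersection.

    # Toy formal context: terms (objects) x verbs (attributes). Illustrative only.
    context = {
        "hotel":     {"bookable"},
        "apartment": {"bookable", "rentable"},
        "car":       {"bookable", "rentable", "driveable"},
        "bike":      {"bookable", "rentable", "rideable"},
        "excursion": {"bookable", "joinable"},
        "trip":      {"bookable", "joinable"},
    }

    all_attrs = frozenset(a for attrs in context.values() for a in attrs)

    # The intents of a context are the closure of its object intents under
    # intersection (plus the full attribute set for the bottom concept).
    intents = {all_attrs}
    changed = True
    while changed:
        changed = False
        for attrs in context.values():
            for intent in list(intents):
                new = frozenset(attrs) & intent
                if new not in intents:
                    intents.add(new)
                    changed = True

    # Each intent determines its extent: the objects having all its attributes.
    for intent in sorted(intents, key=len):
        extent = sorted(g for g, attrs in context.items() if intent <= attrs)
        print(extent, "<->", sorted(intent))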

  20. FCA output

  21. Similarity measures

  22. Smoothing

  23. Evaluation • Semantic cotopy (SC) • Taxonomic overlap (TO)
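
For reference (the slide body is not in the transcript, so these are the standard definitions from Maedche and Staab as used in Cimiano's evaluations): the semantic cotopy of a concept is the set of its super- and subconcepts, and the taxonomic overlap averages the cotopy agreement between the two taxonomies. In LaTeX:

    SC(c, T) := \{ c' \in C \mid c' \le_T c \;\vee\; c \le_T c' \}

    TO(T_1, T_2) := \frac{1}{|C_1|} \sum_{c \in C_1}
                    \frac{|SC(c, T_1) \cap SC(c, T_2)|}{|SC(c, T_1) \cup SC(c, T_2)|}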

  24. Evaluation Measure

  25. 100% Precision Recall

  26. Low Recall

  27. Low Precision

  28. Results

  29. Results

  30. Results

  31. Results

  32. Strengths And Weaknesses • FCA generates formal concepts • Provides an intensional description Weaknesses: • The size of the lattice can get exponential in the size of the context • Spurious clusters • Finding appropriate labels for the clusters is hard

  33. Problems with Unsupervised Approaches to Clustering • Data sparseness leads to spurious syntactic similarities • Produced clusters can’t be appropriately labeled

  34. Guided Clustering • Hypernyms directly used to guide clustering • WordNet • Hearst • Agglomerative clustering

  35. Similarity Computation The ten most similar terms for each term of the tourism reference taxonomy (table not reproduced in the transcript)

  36. The Hypernym Oracle • Three sources: • WordNet • Hearst patterns matched in a corpus • Hearst patterns matched in the World Wide Web • Record hypernyms and the amount of evidence found in support of each hypernym

  37. WordNet • Collect as hypernyms the lemmas of any synset dominating a synset that contains the term t • Record the number of times each hypernym appears in a dominating synset
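
A minimal sketch of this step using NLTK's WordNet interface (the use of NLTK is my assumption; counting lemmas over all hypernym paths is one reading of "dominating synset"):

    from collections import Counter
    from nltk.corpus import wordnet as wn

    def wordnet_hypernym_evidence(term):
        """Count how often each lemma occurs in a synset dominating `term`."""
        counts = Counter()
        for synset in wn.synsets(term, pos=wn.NOUN):
            for path in synset.hypernym_paths():   # root ... synset, inclusive
                for ancestor in path[:-1]:         # dominating synsets only
                    for lemma in ancestor.lemma_names():
                        counts[lemma.replace("_", " ")] += 1
        return counts

    # e.g. wordnet_hypernym_evidence("hotel").most_common(5)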

  38. Hearst Patterns (Corpus) • Record the number of is-a relations found between the two terms

  39. Hearst Patterns (WWW) • Download 100 Google abstracts for each concept and clue pair

  40. Evidence • Total Evidence for Hypernyms: • time: 4 • vacation: 2 • period: 2
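
Totals like these are just the sum of the per-source counts; with collections.Counter the merge is one expression. The per-source split below is invented for illustration; only the totals match the slide.

    from collections import Counter

    wordnet = Counter({"time": 2, "vacation": 1})
    hearst_corpus = Counter({"time": 1, "period": 2})
    hearst_www = Counter({"time": 1, "vacation": 1})

    print(wordnet + hearst_corpus + hearst_www)
    # Counter({'time': 4, 'vacation': 2, 'period': 2})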

  41. Clustering Algorithm • Input: a list of terms • Calculate the similarity between each pair of terms and sort the pairs from highest to lowest similarity • For each pair to be clustered, consult the oracle (cases 1-4 below; a code sketch follows slide 47)

  42. Consulting the Oracle case 1 • If term 1 is a hypernym of term 2 or vice-versa: • Create appropriate subconcept relationship.

  43. Consulting the Oracle case 2 • Find the common hypernym h of both terms with the greatest evidence • If one term has already been classified as t', distinguish three subcases: t' = h, h is a hypernym of t', or t' is a hypernym of h

  44. Consulting the Oracle case 3 • Neither term has been classified: • Each term becomes a subconcept of the common hypernym.

  45. Consulting the Oracle case 4 • The terms do not share a common hypernym: • Set aside the terms for further processing.

  46. r-matches • For all unprocessed terms, check for r-matches (e.g. 'credit card' matches 'international credit card')

  47. Further Processing • If either term in a pair is already classified as t’, the other term is classified under t’ as well. • Otherwise place both terms under the hypernym of either term with the most evidence. • Any unclassified terms are added under the root concept.
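
A minimal sketch of the whole loop from slides 41-47, under stated assumptions: similarity and oracle are supplied by the caller, the oracle maps a term to a dict of hypernyms with evidence counts, cases 2 and 3 are collapsed into one branch, and r-matching is omitted. All names are illustrative, not from the paper.

    from itertools import combinations

    def guided_clustering(terms, similarity, oracle, root="root"):
        """Sketch of oracle-guided agglomerative clustering (slides 41-47).

        similarity(t1, t2) -> float; oracle(t) -> {hypernym: evidence}.
        Returns a dict mapping each term to its parent concept.
        """
        parent, deferred = {}, []
        # Sort all pairs from highest to lowest similarity (slide 41).
        pairs = sorted(combinations(terms, 2),
                       key=lambda p: similarity(*p), reverse=True)
        for t1, t2 in pairs:
            h1, h2 = oracle(t1), oracle(t2)
            if t2 in h1:                      # case 1: t2 is a hypernym of t1
                parent.setdefault(t1, t2)
            elif t1 in h2:                    # ... or vice versa
                parent.setdefault(t2, t1)
            else:
                common = set(h1) & set(h2)
                if common:                    # cases 2/3 (collapsed here)
                    h = max(common, key=lambda c: h1[c] + h2[c])
                    parent.setdefault(t1, h)  # attach under the common hypernym
                    parent.setdefault(t2, h)  # with the greatest total evidence
                else:                         # case 4: set aside for later
                    deferred.append((t1, t2))
        # Further processing (slide 47), simplified: inherit a partner's parent.
        for t1, t2 in deferred:
            if t1 in parent and t2 not in parent:
                parent[t2] = parent[t1]
            elif t2 in parent and t1 not in parent:
                parent[t1] = parent[t2]
        for t in terms:                       # unclassified terms go under root
            parent.setdefault(t, root)
        return parent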
