This paper explores the extraction of taxonomy from various dictionaries using a language-independent method. By analyzing machine-readable dictionaries and parsing definitional phrases, we identify and combine high-quality resources to generate new data. The study emphasizes the importance of comparing entries from multiple sources, addressing issues such as differing lexicons, hyperonyms, and the need for systematic cleanup. Our approach facilitates enhanced understanding and representation of word meanings and relationships, as illustrated through examples from various dictionaries.
Combining Resources: Taxonomy Extraction from Multiple Dictionaries Rogelio Nazar & Maarten Janssen IULA, Universitat Pompeu Fabra, Barcelona
Information from Dictionaries • Dictionaries are a good source of information • Long tradition of taxonomy extraction • Calzolari (1977), Amsler (1981), Chodorow et al. (1985), Fox et al. (1988), Alshawi (1989), Boguraev (1991), Barrière & Popowich (1996), Chang (1998), Renau & Battaner (2008) • Exploiting Machine Readable Dictionaries • Parsing definitional phrases • Pattern extraction, shallow parsing • Full treatment of a single dictionary
Combining Resources • There is a lot of information available • Hand-crafted, high-quality resources • Combining yields new data • Taxonomy from multiple dictionaries • Language-independent shallow method • Combining definitions of the same word • Various dictionaries, online versions • DRAE, DGLE, Clave, DEM • Frequency-based
Consolidated Genus Terms • Dictionaries differ • Different lexicon and definitions • Even if only for legal reasons • Hyperonym should be the same • A cat is an animal • Unless there is uncertainty in the hyperonym • Most dictionaries should use the same genus • Statistically relevant
Example 3x ablandabrevas persona 2x com. inútil 1x substantivo común fig.
Raw HTML input • Directly from harvested text • With begin/end tags • No textual analysis • More than definitions • Examples, multiple senses, etc. • Sense matching impossible • Entries unsystematic • Dictionaries do not match in senses
Cleanup • Minimum number of dictionaries • Raw frequency count • Hyperonym tends to be repeated • Candidates have to be words • Of the same word-class • Use of a stop-list • Dictionary generated • Words that occur in more than 10% of entries
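The cleanup steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the tokenizer, the `min_dicts` and `min_count` parameters, and the toy definitions are all assumptions made for the example, and the word-class filter is left out because it would need part-of-speech information the slides do not describe.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters only (hypothetical tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_stoplist(all_definitions, threshold=0.10):
    """Collect words that occur in more than `threshold` of all entries.

    `all_definitions` holds one definition text per dictionary entry.
    """
    entry_freq = Counter()
    for definition in all_definitions:
        entry_freq.update(set(tokenize(definition)))  # count each word once per entry
    total = len(all_definitions)
    return {word for word, n in entry_freq.items() if n / total > threshold}


def genus_candidates(definitions, stoplist, min_dicts=2, min_count=2):
    """Rank hyperonym candidates for one headword by raw frequency.

    `definitions` holds one definition text per dictionary for the same headword.
    Assumption: the headword is skipped when fewer than `min_dicts` dictionaries
    define it, and a candidate must occur at least `min_count` times overall.
    """
    if len(definitions) < min_dicts:
        return []
    counts = Counter()
    for definition in definitions:
        counts.update(w for w in tokenize(definition) if w not in stoplist)
    return [(w, n) for w, n in counts.most_common() if n >= min_count]


# Toy data invented for illustration (not taken from DRAE, DGLE, Clave or DEM).
toy_entries = ["a small animal", "a large plant", "a tool for cutting", "a kind of dance"]
toy_stoplist = build_stoplist(toy_entries, threshold=0.5)
toy_cat = ["a small animal that purrs", "animal with whiskers", "a domestic animal"]
print(genus_candidates(toy_cat, toy_stoplist))
```

With only a handful of entries the 10% threshold would flag almost every word, so the toy call raises it to 0.5; on a full dictionary the slide's 10% figure would be the natural setting.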
# deconstrucción (3 dictionaries)
teoría 2 1
EWN: 0.desconstrucción; 0.deconstrucción; 1.teoría filosófica; 1.doctrina filosófica; 2.filosofía; 3.creencia; 4.contenido mental; 5.conocimiento; 5.cognición; 6.rasgo psicológico;
# descubrimiento (5 dictionaries)
acción 3 3
cosa 3 5
efecto 2 -
EWN: 0.descubrimiento; 1.logro; 1.presentación; 1.revelación; 2.realización; 2.información; 2.exposición; 3.acción; 3.hecho; 3.acto de habla; 3.comunicación visual; 4.acto; 4.actividad humana; 4.comunicación; 5.relación social; 6.relación; 7.abstracción;
# cumbia (5 dictionaries)
danza 2 -
EWN: 0.cumbiamba; 0.cumbia; 1.baile regional; 1.danza popular; 2.baile social; 3.baile; 4.recreación; 4.diversión; 5.actividad; 6.acto; 6.actividad humana;
# asta (5 dictionaries)
mar 6 -
lanza 6 -
media 5 -
toro 5 -
cuerno 5 -
bandera 4 -
EWN: 0.cuerno; 0.asta; 1.tomadero; 1.materia animal; 1.cogedero; 1.bastón; 1.agarradera; 1.asimiento; 1.asidero; 1.asa; 2.materia; 2.apéndice; 2.vara; 2.palo; 3.porción; 3.sustancia; 3.parte; 3.herramienta; 4.utillaje; 5.artefacto; 6.objeto físico; 6.cosa; 6.objeto; 6.objeto inanimado; 7.competente; 7.respirar; 7.capaz; 7.entidad;
WordNet Verification • WordNet (still) best available taxonomy • Not the best resource for evaluation • Automatic verification • 100 random nouns • Best 5 hyperonymy candidates • Match when candidate in chain • Only about 50% accuracy
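The verification step can be illustrated with a short sketch. It assumes that a candidate matches only when it is identical to a term in the noun's hyperonym chain, and that a noun counts as a hit when any of its best five candidates matches; the slides do not spell out either rule, so both are assumptions, as are the function names. The two chains are abridged from the EWN output for descubrimiento and cumbia.

```python
def chain_level(candidate, chain):
    """Return the lowest chain level where `candidate` appears exactly, else None."""
    levels = [level for level, term in chain if term == candidate]
    return min(levels) if levels else None


def accuracy(candidates_by_noun, chain_by_noun, top_n=5):
    """Share of nouns for which one of the top `top_n` candidates is in the chain."""
    if not candidates_by_noun:
        return 0.0
    hits = 0
    for noun, candidates in candidates_by_noun.items():
        chain = chain_by_noun.get(noun, [])
        if any(chain_level(c, chain) is not None for c in candidates[:top_n]):
            hits += 1
    return hits / len(candidates_by_noun)


# Hyperonym chains as (level, term) pairs, abridged from the EWN output.
chains = {
    "descubrimiento": [(0, "descubrimiento"), (1, "logro"), (1, "revelación"),
                       (2, "realización"), (3, "acción"), (3, "hecho"), (4, "acto")],
    "cumbia": [(0, "cumbiamba"), (0, "cumbia"), (1, "baile regional"),
               (1, "danza popular"), (2, "baile social"), (3, "baile")],
}
# Extracted candidates, ordered by frequency.
candidates = {
    "descubrimiento": ["acción", "cosa", "efecto"],
    "cumbia": ["danza"],
}

print(chain_level("acción", chains["descubrimiento"]))
print(accuracy(candidates, chains))
```

Under exact matching, "danza" does not match the multiword term "danza popular", which is one way a reasonable dictionary hyperonym can be scored as a miss against WordNet.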
WordNet vs. Dictionary • WordNet • Many intermediate/artificial levels • Compulsory hyperonym • Contains proper names • Dictionaries • More word-senses • Alternative definitions (synonymy, paraphrase, …) • Differences • Different choice of hyperonym • Different lexicon
Thank you • Questions?