This paper addresses the challenge of automatically extracting terminological units from specialized texts, specifically utilizing Wikipedia as a resource. We present our methodology for identifying and filtering relevant Wikipedia categories and pages that pertain to specific domains. The evaluation highlights our approach's effectiveness, particularly in the medical domain, while also acknowledging areas for improvement, such as the elimination of proper names from the term list. Future work will focus on enhancing filtering processes and applying this method across diverse languages and domains.
Finding Domain Terms using Wikipedia Jorge Vivaldi Palatresi Applied Linguistics Institute Universitat Pompeu Fabra jorge.vivaldi@upf.edu Horacio Rodríguez Hontoria TALP Research Center Universitat Politécnica de Catalunya horacio@lsi.upc.es
Outline • Introduction • Related approaches • Methodology • Evaluation • Conclusions and future work
Introduction • Problem: to automatically extract terminological units from specialized texts • Result: a list of all the WP categories and page titles that our system considers to belong to the domain of interest.
Related approaches • Magnini et al., 2000 • Montoyo et al., 2001 • Missikoff et al., 2002 • Vivaldi, Rodríguez, 2002 • Vivaldi, Rodríguez, 2004 • Bernardini et al., 2006 • Cui et al., 2008
Graph structure of Wikipedia [Figure: WP categories (A to G) linked to WP pages (P1 to P3), together with the redirection table, disambiguation pages, interwiki links, external links and InfoBox.]
Methodology: overview [Figure: pipeline with labels domain, WP top categories, Categories, Pages, domain categories filtering, domain pages filtering, bootstrapping, final domain term set.] Main steps: 1) Find the domain name in WP as a category. 2) Look for all the subcategories/pages related to the domain. 3) Extract all descendants of the domain name, avoiding loops. 4) Remove proper names and service classes. 5) Filter categories and pages.
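Step 3 (extracting all descendants while avoiding loops) can be illustrated with a breadth-first traversal that remembers the categories it has already visited. This is only a minimal sketch, not the authors' implementation: the function name `collect_descendants`, the dictionary-based graph and the toy category names are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def collect_descendants(top: str, subcats: Dict[str, List[str]]) -> Set[str]:
    """Return every category reachable from `top`, visiting each one once.

    `subcats` is a hypothetical in-memory map from a category name to its
    direct subcategories; a real system would read this from a WP dump.
    """
    seen = {top}            # the visited set is what prevents infinite loops
    queue = deque([top])
    while queue:
        current = queue.popleft()
        for child in subcats.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(top)       # report descendants only, not the start category
    return seen


# Toy graph (invented): "Pharmacology" links back to "Medicine", forming a cycle.
toy_graph = {
    "Medicine": ["Surgery", "Pharmacology"],
    "Surgery": ["Anesthesia"],
    "Pharmacology": ["Medicine", "Drugs"],
}
descendants = collect_descendants("Medicine", toy_graph)
```

Because each category enters the queue at most once, the traversal terminates even though the WP category graph is not a strict tree.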
Methodology: filtering • Category level • Page level
Methodology: filtering (category level) [Figure labels: Top Category of the Domain, category C, direct super-categories (CatSet1), direct neutral super-categories, Category Score.]
Methodology: filtering (page level) [Figure labels: Top Category of the Domain, neutral categories, category C, pages P, categories (CatSet2), Page Score.]
Methodology: page filtering • Additional category filtering using page scores: • catTerm: set of pages associated with a category • MicroStrict: accept cat if the number of elements of catTerm with positive scoring is greater than the number of elements with negative scoring • MicroLoose: same, with a greater-or-equal test • Macro: instead of counting the pages with positive/negative scoring, we use the components of such scores.
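The three acceptance tests can be sketched as follows. The slide does not define how page scores are built, so this sketch makes two loud assumptions: a page's overall score is the sum of its components, and Macro accepts a category when the components of all its pages sum to a positive value. The names `page_score` and `accept_category` and the sample numbers are hypothetical.

```python
from typing import Dict, List


def page_score(components: List[float]) -> float:
    # Assumption: the overall page score is the sum of its components.
    return sum(components)


def accept_category(cat_term: Dict[str, List[float]], mode: str) -> bool:
    """Decide whether to keep a category from the scores of its pages.

    `cat_term` maps each page of the category to its score components.
    """
    scores = [page_score(c) for c in cat_term.values()]
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    if mode == "micro_strict":
        return positive > negative       # strictly more positive pages
    if mode == "micro_loose":
        return positive >= negative      # ties are accepted
    if mode == "macro":
        # Assumption: aggregate the score components instead of counting pages.
        return sum(scores) > 0
    raise ValueError(f"unknown mode: {mode}")


# Invented scores: two mildly positive pages, one strongly negative, one mildly negative.
cat_term = {"p1": [0.1, 0.1], "p2": [0.2], "p3": [-0.9], "p4": [-0.1]}
decisions = {m: accept_category(cat_term, m)
             for m in ("micro_strict", "micro_loose", "macro")}
```

In this toy case the page counts tie, so only the loose test accepts the category, while the macro test rejects it because one strongly negative page outweighs the two mildly positive ones.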
Page filtering example: “semantics” (in Computing domain) [Figure labels: Computing, theoretical computer science, software, software engineering, formal methods, semantics.] Categories of semantics: {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic} WPCD(semantics) = 0.25
Category filtering example using pages score: “chemistry”
Evaluation • Partial evaluation: “chemistry” and “astronomy” • Test against Magnini et al., 2000 (WordNet 1.6) • Low coverage: 25% for Chemistry and 15% for Astronomy • Full evaluation: “Medicine” • Test against SNOMED-CT Spanish Edition (2009) • Wide coverage of the clinical domain: 800K terms
Conclusions • Good results when evaluated against a specialised resource • Term list filtering must be improved (e.g., eliminate proper names)
Future work • Apply this method to other languages/domains • Improve filtering using in/out links of selected pages • Improve filtering using also the page content • Use this WP knowledge to improve a term extractor