1 / 83

Semi-automatic methods for WordNet construction

Semi-automatic methods for WordNet construction. German Rigau i Claramunt http://www.lsi.upc.es/~rigau TALP Research Center Universitat Politècnica de Catalunya Eneko Agirre http://www.ji.si.upc.es/users/eneko IxA NLP Group University of the Basque Country.

efrat
Télécharger la présentation

Semi-automatic methods for WordNet construction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Semi-automatic methods for WordNet construction German Rigau i Claramunt http://www.lsi.upc.es/~rigau TALP Research Center Universitat Politècnica de Catalunya Eneko Agirre http://www.ji.si.upc.es/users/eneko IxA NLP Group University of the Basque Country 2002 International WordNet Conference

  2. Setting • NLP and the Lexicon • Theoretical: WG, GPSG, HPSG. • Practical: realistic complexity and coverage • Lexical bottleneck (Briscoe 91) • Even worse for languages other than English Semi-automatic methods for WN construction 2002 International WordNet Conference - 2

  3. Setting • Which LK is needed by a concrete NLP system? • Where is this LK located? • Which procedures can be applied? Semi-automatic methods for WN construction 2002 International WordNet Conference - 3

  4. Setting • Which LK is needed by a concrete NLP system? • Phonology: phonemes, stress, etc. • Morphology: POS, etc. • Syntactic: category, subcat., etc. • Semantic: class, SRs, etc. • Pragmatic: usage, registers, TDs, etc. • Translations: translation links Semi-automatic methods for WN construction 2002 International WordNet Conference - 4

  5. Setting • Where is this LK located? • Human brain • Structured Lexical Resources: • Monolingual and bilingual MRDs • Thesauri • Unstructured Lexical Resources: • Monolingual and bilingual Corpora • Mixing resources Semi-automatic methods for WN construction 2002 International WordNet Conference - 5

  6. Setting • Which procedures can be applied? • Prescriptive approach • Machine-aided manual construction • Descriptive approach • Automatic acquisition from pre-existing Lexical Resources • Mixed approach Semi-automatic methods for WN construction 2002 International WordNet Conference - 6

  7. Outline • Setting • Words and Works • Merge approach • Taxonomy construction: monolingual MRDs • Mapping taxonomies: bilingual MRDs • Expand approach • Translation of synsets: bilingual MRDs • Interface for manual revision • Conclusions Semi-automatic methods for WN construction 2002 International WordNet Conference - 7

  8. Words and WorksWhere is this Lexical Knowledge located? • Human brain: • Linguistic String Project (Fox et al. 88) • Lexical Information for 10,000 entries • WordNet (Miller et al. 90) • Semantic Information v1.6 with 99,642 synsets • Comlex (Grishman et al. 94) • Syntactic information 38,000 English words • CYC Ontology (Lenat 95) • a person-century of effort to produce 100,000 terms • LDOCE3-NLP • dictionary with 80,000 senses Semi-automatic methods for WN construction 2002 International WordNet Conference - 8

  9. Words and WorksWhere is this Lexical Knowledge located? • Structured Lexical Resources • Monolingual MRDs: • LDOCE • learner’s dictionary • 35,956 entries and 76,059 definitions • 86% semantic and 44% pragmatic codes • controlled vocabulary of 2,000 words • (Boguraev & Briscoe 89) • (Vossen & Serail 90) • (Bruce & Guthrie 92), (Wilks et al. 93) • (Dolan et al. 93), (Richardson 97) Semi-automatic methods for WN construction 2002 International WordNet Conference - 9

  10. Words and WorksWhere is this Lexical Knowledge located? • Structured Lexical Resources • Other Monolingual MRDs: • Webster’s (Jensen & Ravin 87) • LPPL (Artola 93) • DGILE (Castellón 93), (Taulé 95), (Rigau 98) • CIDE (Harley & Glennon 97) • AHD (Richardson 97) • WordNet (Harabagiu 98) • Bilingual MRDs • Collins Spanish/English (Knigth & Luk 94) • Vox/Harrap’s Spanish/English (Rigau 98) Semi-automatic methods for WN construction 2002 International WordNet Conference - 10

  11. Words and WorksWhere is this Lexical Knowledge located? • Structured Lexical Resources • Thesauri: • Roget’s Thesaurus • 60,071 words in 1,000 categories • (Yarowsky 92), (Grefenstette 93), (Resnik 95) • Roget’s II and The New Collins Thesaurus • (Byrd 89) • Macquarie’s thesaurus • (Grefenstette 93) • Bunrui Goi Hyou Japanese thesaurus • (Utsuro et al. 93) Semi-automatic methods for WN construction 2002 International WordNet Conference - 11

  12. Words and WorksWhere is this Lexical Knowledge located? • Structured Lexical Resources • Encyclopaedia • Grolier’s Encyclopaedia (Yarowsky 92) • Encarta (Richardson et al. 98) • Others • Telephonic Guides • Mixing structured lexical resources • Roget’s Thesaurus and Grolier’s (Yarowsky 92) • LDOCE, WN, Collins, ONTOS, UM (Knight & Luk 94) • Japanese MRD to WN (Okumura & Hovy 94) • LLOCE, LDOCE (Chen & Chang 98) Semi-automatic methods for WN construction 2002 International WordNet Conference - 12

  13. Words and WorksWhere is this Lexical Knowledge located? • Unstructured Lexical Resources • Corpora: • WSJ, Brown Corpus (SemCor), Hansard • Proper Nouns (Hearst & Schütze 95) • Idiosyncratic Collocations (Church et al. 91) • Preposition preferences (Resnik and Hearst 93) • Subcategorization structures (Briscoe and Carroll 97) • Selectional restrictions (Resnik 93), (Ribas 95) • Thematic structure (Basili et al. 92) • Word semantic classes (Dagan et al. 94) • Bilingual Lexicons for MT (Fung 95) Semi-automatic methods for WN construction 2002 International WordNet Conference - 13

  14. Words and WorksWhere is this Lexical Knowledge located? • Using both structured and non-structured Lexical Resources • MRDs and Corpora • (Liddy & Paik 92) • (Klavans & Tzoukermann 96) • WordNet and Corpora • (Resnik 93), (Ribas 95), (Li & Abe 95), (McCarthy 01) Semi-automatic methods for WN construction 2002 International WordNet Conference - 14

  15. Words and WorksInternational Projects on Lexical Acquisition • Japanese Projects • EDR (Yokoi 95) • Nine years project oriented to MT • Bilingual Corpora with 250,000 words • Monolingual, bilingual and coocurrence dictionaries • 200,000 general vocabulary • 100,000 technical terminology • 400,000 concepts Semi-automatic methods for WN construction 2002 International WordNet Conference - 15

  16. Words and WorksInternational Projects on Lexical Acquisition • American Projects • Comlex (Grishman et al. 94) • Syntactic information for 38,000 words • WordNet (Miller 90) • Semantic Information • more than 123,000 words organised in 99,000 synsets • more than 116,000 relations between synsets • Pangloss (Knight & Luk 94) • PUM, ONTOS, LDOCE semantic categories, WordNet • Cyc (Lenat 95) • common-sense knowledge • 100,000 concepts and 1,000,000 axioms Semi-automatic methods for WN construction 2002 International WordNet Conference - 16

  17. Words and WorksInternational Projects on Lexical Acquisition • European Projects • Acquilex I and II • LA from monolingual and bilingual MRDs and corpora • LE-Parole • Large-scale harmonised set of corpora and lexicons for all the EU languages • EuroWordNet • Multilingual WordNet for several European Languages • Meaning • Large-scale of LK from the web • Large-scale WSD Semi-automatic methods for WN construction 2002 International WordNet Conference - 17

  18. Words and WorksLexical Acquisition from MRDs • Syntactic Disambiguation (Dolan et al. 93) • Semantic Processing (Vanderwende 95) • WSD (Lesk 86), (Wilks & Stevenson 97), (Rigau 98) • IR (Krovetz & Croft 92) • MT (Knight and Luk 94), (Tanaka & Umemura 94) • Semantically enriching MRDs • (Yarowsky 92), (Knight 93), (Chen & Chan 98) • Building LKBs • (Bruce & Guthrie 92) • (Dolan et al. 93) • (Artola 93) • (Castellón 93), (Taulé 95), (Rigau 98) Semi-automatic methods for WN construction 2002 International WordNet Conference - 18

  19. Words and WorksAcquisition of LK from MRDs • This tutorial focus on: • the massive acquisition of LK • from MRDs (conventional, in any language) • using (semi) automatic methodologies • Why MRDs? The conventional dictionaries for human use usually “contain spelling, pronunciation, hyphenation, capitalization, usage notes for semantic domains, geographic regions, and propiety; ethimological, syntactic and semantic information about the most basic units of the language” (Amsler 81) Semi-automatic methods for WN construction 2002 International WordNet Conference - 19

  20. Words and WorksMain Problems of MRDs • Conventional dictionaries are not systematic • Dictionaries are built for human use • Implicit Knowledge • words are described/translated in terms of words Semi-automatic methods for WN construction 2002 International WordNet Conference - 20

  21. Words and WorksMRDs and Semantic Knowledge jardín_1_1 Terreno donde se cultivan plantas y flores ornamentales. florero_1_4 Maceta con flores. ramo_1_3 Conjunto natural o artificial de flores, ramas o hierbas. pétalo_1_1 Hoja que forma la corola de la flor. tálamo_1_3 Receptáculo de la flor. miel_1_1 Substancia viscosa y muy dulce que elaboran las abejas, en una distensión del esófago, con el jugo de las flores y luego depositan en las celdillas de sus panales. florería_1_1 Floristería; tienda o puesto donde se venden flores. florista_1_1 Persona que tiene por oficio hacer o vender flores. camelia_1_1 Arbusto cameliáceo de jardín, originario de Oriente, de hojas perennes y lustrosas, y flores grandes, blancas, rojas o rosadas (Camellia japonica). camelia_1_2Flor de este arbusto. rosa_1_1Flor del rosal. Semi-automatic methods for WN construction 2002 International WordNet Conference - 21

  22. Outline • Setting • Words and Works • Merge approach • Taxonomy construction: monolingual MRDs • Mapping taxonomies: bilingual MRDs • Expand approach • Translation of synsets: bilingual MRDs • Interface for manual revision • Conclusions Semi-automatic methods for WN construction 2002 International WordNet Conference - 22

  23. MRD1 LKB1 LDB1 Tax1 MLKB ... ... ... ... MRDn Taxn LKBn LDBn Merge approachMain Methodology Semi-automatic methods for WN construction 2002 International WordNet Conference - 23

  24. Merge approachMain Methodology • Taxonomy construction: (Rigau et al. 98, 97) • monolingual MRDs • Step 1: Selection of the main top beginners for a semantic primitive • Step 2: Exploiting genus, construction of taxonomies for each semantic primitive • Mapping taxonomies: (Daudé et al. 99) • bilingual MRDs • Step 3: Creation of translation links Semi-automatic methods for WN construction 2002 International WordNet Conference - 24

  25. Merge approach: Taxonomy ConstructionMethodology • Problems following a pure descriptive approach • Circularity • Errors and inconsistencies • Definitions with omitted genus • Top dictionary senses do not usually represent useful knowledge for the LKB • Too general • Too specific Semi-automatic methods for WN construction 2002 International WordNet Conference - 25

  26. Merge approach: Taxonomy ConstructionMethodology Mixed Methodology Prescriptive approach Manual construction of the Top Structure Semi-automatic methods for WN construction 2002 International WordNet Conference - 26

  27. Merge approach: Taxonomy Construction Methodology Mixed Methodology Prescriptive approach Manual construction of the Top Structure Descriptive approach Acquiring implicit information from MRDs Semi-automatic methods for WN construction 2002 International WordNet Conference - 27

  28. Merge approach: Taxonomy Construction Methodology Mixed Methodology Prescriptive approach Manual construction of the Top Structure Descriptive approach Acquiring implicit information from MRDs Semi-automatic methods for WN construction 2002 International WordNet Conference - 28

  29. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners • Word sense: zumo_1_1 • Attached-to: c_art_subst type. • Definition: líquido que se extrae de las flores, • hierbas, frutos, etc. • (liquid extracted from flowers, herbs, • fruits, etc). Semi-automatic methods for WN construction 2002 International WordNet Conference - 29

  30. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners • A) Attaching DGILE senses to semantic primitives • 1) First labelling: • Conceptual Distance (Rigau 94) • 2) Second labelling: • Salient Words (Yarowsky 92) • B) Filtering Process Semi-automatic methods for WN construction 2002 International WordNet Conference - 30

  31. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners • A.1) First labelling: • Conceptual Distance (Agirre et al. 94) • length of the shortest path • specificity of the concepts • using WordNet • Bilingual dictionary Semi-automatic methods for WN construction 2002 International WordNet Conference - 31

  32. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners <entity> <object, ...> <artifact, artefact> <structure, construction> <house, lodging> <building, edifice> <place of worship, ...> <religious residence, cloiser> <church, church building> <convent> <monastery> <abbey> <abbey> <abbey> abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess) Semi-automatic methods for WN construction 2002 International WordNet Conference - 32

  33. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners <entity> <object, ...> <artifact, artefact> <structure, construction> <house, lodging> <building, edifice> <place of worship, ...> <religious residence, cloiser> <church, church building> <convent> <monastery> <abbey> 06 ARTIFACT <abbey> <abbey> abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess) Semi-automatic methods for WN construction 2002 International WordNet Conference - 33

  34. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners • A.1) First labelling (Results) • 29,205 labelled definitions (31%) • 61% accuracy at a sense level • 64% accuracy at a file level Semi-automatic methods for WN construction 2002 International WordNet Conference - 34

  35. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners • A.2) Second labelling: • Salient Words (Yarowsky 92) • Importance • local frequency • appears more significantly more often in the corpus of a semantic category than at other points in the whole corpus Semi-automatic methods for WN construction 2002 International WordNet Conference - 35

  36. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners • A.2) Second labelling (Results): • 86,759 labelled definitions (93%) • 80% accuracy at a file level biberón_1_1 ARTIFACT 4.8399 Frasco de cristal ... (glass flask ...) biberón_1_2 FOOD 7.4443 Leche que contiene este frasco ... (milk contained in that flask ...) Semi-automatic methods for WN construction 2002 International WordNet Conference - 36

  37. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners • B) Filtering process (FOODs) • removes all genus terms • FILTER 1: not FOODs by the bilingual mapping • FILTER 2: appear more often as genus in other Semantic Primitive • FILTER 3: with a low frequency Semi-automatic methods for WN construction 2002 International WordNet Conference - 37

  38. Merge approach: Taxonomy Construction Step 1: Selection of the main top beginners • B) Filtering process (FOOD Results) Semi-automatic methods for WN construction 2002 International WordNet Conference - 38

  39. Merge approach: Taxonomy Construction Step 2: Exploiting Genus • Word sense: vino_1_1 • Hypernym: zumo_1_1. • Definition: zumo de uvas fermentado. • (fermented juice of grapes). • Word sense: rueda_2_1 • Hypernym: vino_1_1. • Definition: vino procedente de la región de Rueda (Valladolid). • (wine from the region of Rueda). Semi-automatic methods for WN construction 2002 International WordNet Conference - 39

  40. Merge approach: Taxonomy Construction Step 2: Exploiting Genus • Genus Sense Identification • 97% accuracy for nouns • Genus Sense Disambiguation • Unrestricted WSD (coverage 100%) • Knowledge-based WSD (not supervised) • Eight Heuristics (McRoy 92) • Combining several lexical resources • Combining several methods Semi-automatic methods for WN construction 2002 International WordNet Conference - 40

  41. Merge approach: Taxonomy Construction Step 2: Exploiting Genus Results: Semi-automatic methods for WN construction 2002 International WordNet Conference - 41

  42. Merge approach: Taxonomy Construction Step 2: Exploiting Genus Knowledge provided by each heuristic: Semi-automatic methods for WN construction 2002 International WordNet Conference - 42

  43. Merge approach: Taxonomy Construction Step 2: Exploiting Genus F2+F3>9: 35,099 definitions F2+F3>4: 40,754 definitions No filters: 111,624 definitions Semi-automatic methods for WN construction 2002 International WordNet Conference - 43

  44. Merge approach: Taxonomy Construction Step 2: Exploiting Genus • ... • zumo_1_1 vino_1_1 quianti_1_1 • zumo_1_1 vino_1_1 raya_1_8 • zumo_1_1 vino_1_1 requena_1_1 • zumo_1_1 vino_1_1 reserva_1_12 • zumo_1_1 vino_1_1 ribeiro_1_1 • zumo_1_1 vino_1_1 rioja_1_1 • zumo_1_1 vino_1_1 roete_1_1 • zumo_1_1 vino_1_1 rosado_1_3 • zumo_1_1 vino_1_1 rueda_2_1 • zumo_1_1 vino_1_1 sherry_1_1 • zumo_1_1 vino_1_1 tarragona_1_1 • zumo_1_1 vino_1_1 tintilla_1_1 • zumo_1_1 vino_1_1 tintorro_1_1 • zumo_1_1 vino_1_1 toro_3_1 • ... Semi-automatic methods for WN construction 2002 International WordNet Conference - 44

  45. Merge approach: Mapping Taxonomies Step 3: Creation of translation links C1 C2 C3 C4 C5 C6 Semi-automatic methods for WN construction 2002 International WordNet Conference - 45

  46. Merge approach: Mapping Taxonomies Step 3: Creation of translation links C1 C2 C3 C4 C5 C6 Semi-automatic methods for WN construction 2002 International WordNet Conference - 46

  47. Merge approach: Mapping Taxonomies Step 3: Creation of translation links • Connecting already existing Hierarchies • Relaxation labelling Algorithm • Constraints • Between • Spanish taxonomy automatically derived from an MRD (Rigau et al. 98) • WordNet • using a bilingual MRD Semi-automatic methods for WN construction 2002 International WordNet Conference - 47

  48. Merge approach: Mapping Taxonomies Step 3: Creation of translation links animal (Tops <animal, animate_being, ...>) (person <beast, brute, ...>) (person <dunce, blockhead, ...>) ave (animal <bird>) (artifact <bird, shuttle, ...>) (food <fowl, poultry, ...>) (person <dame, doll, ...>) faisán (animal <pheasant>) (food <pheasant>) rapaz (animal <bird>) (artifact <bird, shuttle, ...>) (food <fowl, poultry, ...>) (person <dame, doll, ...>) Semi-automatic methods for WN construction 2002 International WordNet Conference - 48

  49. Merge approach: Mapping Taxonomies Step 3: Relaxation Labelling algorithm • Iterative algorithm for function optimisation based on local information • it can deal with any kind of constraints • variables (senses of the taxonomy) • labels (synsets) • Finds a weight assignment for each possible label for each variable • weights for the labels of the same variable add up to one • weight assignation satisfies -to the maximum possible extent- the set of constraints Semi-automatic methods for WN construction 2002 International WordNet Conference - 49

  50. Merge approach: Mapping Taxonomies Step 3: Relaxation Labelling algorithm 1) Start with a random weight assignment 2) Compute the support value for each label of each variable (according to the constraints) 3) Increase the weights of the labels more compatible with context and decrease those and decrease those of the less compatible labels. 4) If a stopping/convergence is satisfied, stop, otherwise go to step 2. Semi-automatic methods for WN construction 2002 International WordNet Conference - 50

More Related