
Automatic Lexicon Generation through WordNet

Automatic Lexicon Generation through WordNet, by Nitin Verma and Pushpak Bhattacharyya, Jan 21, 2004. Introduction: a lexicon is the heart of any natural language processing system, but it is difficult to construct and requires an enormous amount of time and manpower.


Presentation Transcript


  1. Automatic Lexicon Generation through WordNet by Nitin Verma and Pushpak Bhattacharyya Jan 21, 2004

  2. Introduction • A lexicon is the heart of any natural language processing system. • It is difficult to construct, requiring an enormous amount of time and manpower. • Document-specific dictionary generation: • Given a document D and a word W in it, which sense S of W should be picked for that document? • Can one construct a document-specific dictionary in which a single sense of each word is stored?

  3. Introduction: UW Dictionary • An important machine-readable lexical resource used by the enconverter and deconverter software. • Diagram: the Enconverter takes natural language as input and, using the UW Dictionary and the Analysis Rules, produces UNL.

  4. Introduction (UW dictionary) • Format of dictionary entries: [HW] “UW” (Attributes);
     • Example: [crane] “crane (icl>bird)” (N, ANIMT, FAUNA, BIRD);
     • HW is the headword (crane), UW is the universal word with its restriction (icl>bird), and the attributes are both syntactic and semantic.
     • Semantic attributes (derived from the ontology).
     • Syntactic attributes (POS, person, number, tense).
     • The attributes are used for the firing of appropriate analysis rules.
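The entry format on this slide can be sketched as a small parser and renderer. This is only an illustration: the class `UWEntry`, the function `parse_entry`, and the regular expression are hypothetical names, and the sketch assumes straight double quotes around the UW rather than the typographic quotes shown on the slide.

```python
import re
from dataclasses import dataclass


@dataclass
class UWEntry:
    """One dictionary entry: headword, universal word, attribute list."""
    headword: str
    uw: str
    attributes: list

    def render(self) -> str:
        # Rebuild the textual form: [HW] "UW" (ATTR1, ATTR2, ...);
        return f'[{self.headword}] "{self.uw}" ({", ".join(self.attributes)});'


# Assumed textual layout (hypothetical, modelled on the slide's example).
ENTRY_RE = re.compile(
    r'^\[(?P<hw>[^\]]+)\]\s*"(?P<uw>[^"]+)"\s*\((?P<attrs>[^)]*)\);$'
)


def parse_entry(line: str) -> UWEntry:
    """Split an entry line into headword, UW and attributes."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UW dictionary entry: {line!r}")
    attrs = [a.strip() for a in match.group("attrs").split(",") if a.strip()]
    return UWEntry(match.group("hw"), match.group("uw"), attrs)


entry = parse_entry('[crane] "crane (icl>bird)" (N, ANIMT, FAUNA, BIRD);')
print(entry.headword, entry.uw, entry.attributes)
```

Keeping the attributes as an ordered list preserves the order in which they appear in the entry, so a parsed entry can be rendered back unchanged.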

  5. Introduction: Ontology*
     • Animate (ANIMT)
       • Flora (FLORA)
         • Shrubs (ANIMT, FLORA, SHRB), e.g. jasmine
         • Aquatic plants (ANIMT, FLORA, AQTC), e.g. lotus
         • …
       • Fauna (FAUNA)
         • Mammals (MML)
         • Reptiles (ANIMT, FAUNA, RPTL), e.g. lizard
         • Birds (ANIMT, FAUNA, BIRD)
         • Fish (ANIMT, FAUNA, FISH)
         • Insects (ANIMT, FAUNA, INSCT), e.g. butterfly
         • …
     *Dictionary group, CFILT, IIT Bombay.

  6. English-UW dictionary generation

  7. English-UW dictionary generation • Resources used: English WordNet, a WSD* system (soft word sense disambiguation method), the UNLKB, and an inferencer. • Knowledge-based approach. * G. Ramakrishnan and P. Bhattacharyya, Soft Word Sense Disambiguation, GWN 2004.

  8. English-UW dictionary generation (Method) • Stage 1 – • Stage 2 – • Diagram: the input document (Word1 Word2 ...) is passed through the WSD* system, which produces a POS- and sense-tagged document (Word1:N:1 Word2:N:3 ...).
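The tagged-document notation shown here (for example Word1:N:1) can be read with a few lines of code. The sketch below assumes whitespace-separated tokens of the form word:POS:sense; the names `TaggedWord`, `parse_tagged`, and `parse_document` are hypothetical and not part of the original system.

```python
from typing import List, NamedTuple


class TaggedWord(NamedTuple):
    word: str
    pos: str     # part-of-speech tag, e.g. "N"
    sense: int   # sense number assigned by the WSD step


def parse_tagged(token: str) -> TaggedWord:
    """Split 'word:POS:sense' from the right, so a colon inside the word survives."""
    word, pos, sense = token.rsplit(":", 2)
    return TaggedWord(word, pos, int(sense))


def parse_document(text: str) -> List[TaggedWord]:
    """Parse a whitespace-separated, POS- and sense-tagged document."""
    return [parse_tagged(token) for token in text.split()]


print(parse_document("Word1:N:1 Word2:N:3"))
```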

  9. English-UW dictionary generation (Method) • Diagram: the tagged document (Word1:pos1:sense1 Word2:pos2:sense2 ...) is fed to the Inference Engine, which consults the KB (WordNet, the UNL KB, and the database of rules) and produces the UW Dictionary together with an Explanation.

  10. UW generation • UW generation for nouns

  11. UW generation for nouns • Diagram, step 1: the entry crane:N:4 from the tagged document is passed to the Inference Engine, which has access to the KB (WordNet and the UNL KB).

  12. UW generation for nouns • Diagram, step 2 added: the Inference Engine issues a query to collect semantic information.

  13. UW generation for nouns • Diagram, step 3 added: WordNet supplies the hypernym chain crane → bird → fauna, animal → organism.

  14. UW generation for nouns • Diagram, step 4 added: the Inference Engine issues a query to collect relevant rules.

  15. UW generation for nouns • Diagram, step 5 added: the response comes back from the UNL KB.

  16. UW generation for nouns • Diagram, step 6 added: the UW Crane(icl>bird) is generated for crane:N:4.

  17. UW generation for nouns • Diagram, step 7 added: an Explanation is produced alongside the UW. • Full sequence: (1) crane:N:4 enters the Inference Engine; (2) query to collect semantic information; (3) hypernym chain crane → bird → fauna, animal → organism; (4) query to collect relevant rules; (5) response from the UNL KB; (6) Crane(icl>bird) is generated; (7) Explanation.
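The noun walkthrough above can be approximated in code. The slides do not state how the restriction is chosen, so this sketch makes a loud assumption: it picks the nearest hypernym that also appears in the UNL KB. `HYPERNYMS` and `UNL_KB` are toy stand-ins for WordNet and the UNL KB, and all function names are hypothetical.

```python
from typing import List, Optional

# Toy stand-in for WordNet: each word maps to its immediate hypernym.
HYPERNYMS = {"crane": "bird", "bird": "animal", "animal": "organism"}

# Toy stand-in for the UNL KB: concepts that may be used in a restriction.
UNL_KB = {"bird", "animal", "organism"}


def hypernym_chain(word: str) -> List[str]:
    """Follow hypernym links upward, nearest hypernym first."""
    chain = []
    seen = {word}
    while word in HYPERNYMS:
        word = HYPERNYMS[word]
        if word in seen:  # guard against accidental cycles in the toy data
            break
        seen.add(word)
        chain.append(word)
    return chain


def noun_uw(word: str) -> Optional[str]:
    """Assumed policy: restrict by the nearest hypernym known to the UNL KB."""
    for hypernym in hypernym_chain(word):
        if hypernym in UNL_KB:
            return f"{word}(icl>{hypernym})"
    return None  # no usable hypernym found


print(noun_uw("crane"))
```

With the toy data, crane resolves to bird first, which matches the example on the slides; a real system would query WordNet for the sense chosen by the WSD step instead of a fixed table.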

  18. UW generation • UW generation for verbs

  19. UW generation for verbs • Decision procedure for an input word:
     • If hypernyms(word) ∩ {‘be’, ‘continue’, etc.} ≠ ∅, generate (icl > be), e.g. exist (icl > be).
     • Otherwise, if hypernyms(nominal word) ∩ {‘phenomenon’, ‘natural event’, etc.} ≠ ∅, generate (icl > occur), e.g. rain (icl > occur).
     • Otherwise, generate (icl > do), e.g. make (icl > do).
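The verb decision procedure can be written as a short function. The sketch assumes the test is whether the hypernym set shares at least one member with a trigger set; the trigger sets contain only the members named on the slide (the slide's "etc." is not filled in), and `verb_uw` with its parameters is a hypothetical name. The caller supplies the hypernym sets, which in the original system would come from WordNet.

```python
from typing import Iterable

# Only the members named on the slide; the real lists are longer ("etc.").
BE_TRIGGERS = {"be", "continue"}
OCCUR_TRIGGERS = {"phenomenon", "natural event"}


def verb_uw(word: str,
            verb_hypernyms: Iterable[str],
            nominal_hypernyms: Iterable[str]) -> str:
    """Choose the restriction for a verb from its hypernym sets.

    verb_hypernyms:    hypernyms of the verb sense itself
    nominal_hypernyms: hypernyms of the corresponding nominal word
    """
    if set(verb_hypernyms) & BE_TRIGGERS:
        return f"{word}(icl>be)"
    if set(nominal_hypernyms) & OCCUR_TRIGGERS:
        return f"{word}(icl>occur)"
    return f"{word}(icl>do)"  # default: an action verb


print(verb_uw("exist", {"be"}, set()))
print(verb_uw("rain", set(), {"phenomenon"}))
print(verb_uw("make", {"create"}, set()))
```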

  20. UW generation for adjectives • Decision procedure for an input word:
     • If a UW for the word is present in the UNL KB, pick that UW, e.g. broad (aoj > thing).
     • Otherwise, if IS_DEFINED (is_a_value_of relation) holds for the input word, generate (aoj > thing), e.g. good (aoj > thing).
     • Otherwise, generate (mod > thing), e.g. green (mod > thing).
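The adjective procedure is a three-way fallback, sketched below. The UNL KB is modelled as a plain dictionary and the is_a_value_of check as a caller-supplied predicate; both are assumptions made for illustration, and `adjective_uw` is a hypothetical name.

```python
from typing import Callable, Dict


def adjective_uw(word: str,
                 unl_kb: Dict[str, str],
                 has_value_of_relation: Callable[[str], bool]) -> str:
    """Pick a UW for an adjective: KB lookup, then aoj, then mod."""
    if word in unl_kb:
        return unl_kb[word]              # UW already present in the UNL KB
    if has_value_of_relation(word):
        return f"{word}(aoj>thing)"      # is_a_value_of relation is defined
    return f"{word}(mod>thing)"          # fallback: plain modifier


# Toy data standing in for the UNL KB and for WordNet's relation lookup.
toy_kb = {"broad": "broad(aoj>thing)"}
toy_value_adjectives = {"good"}

for adjective in ("broad", "good", "green"):
    print(adjective_uw(adjective, toy_kb, toy_value_adjectives.__contains__))
```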

  21. English-UW dictionary generation (Method) • Semantic attribute generation

  22. Semantic attribute generation • Diagram, step 1: the entry crane:N:4 from the tagged document is passed to the Inference Engine, which has access to the KB (WordNet and the database of rules).

  23. Semantic attribute generation • Diagram, step 2 added: the Inference Engine issues a query to collect semantic information.

  24. Semantic attribute generation • Diagram, step 3 added: WordNet supplies the hypernym chain crane → bird → fauna, animal → organism.

  25. Semantic attribute generation • Diagram, step 4 added: the Inference Engine issues a query to collect relevant rules.

  26. Semantic attribute generation • Diagram, step 5 added: the database of rules returns the relevant rules:
     • IF hypernym=‘organism’ THEN generate ‘ANIMT’ ELSE generate ‘INANI’;
     • IF hypernym=‘fauna’ THEN generate ‘FAUNA’;
     • IF hypernym=‘bird’ THEN generate ‘BIRD’;
     • ...

  27. Semantic attribute generation • Diagram, step 6 added: applying the rules to crane:N:4 yields the attribute list (N,ANIMT,FAUNA,BIRD). • Full sequence: (1) crane:N:4 enters the Inference Engine; (2) query to collect semantic information; (3) hypernym chain crane → bird → fauna, animal → organism; (4) query to collect relevant rules; (5) rules returned from the database of rules; (6) (N,ANIMT,FAUNA,BIRD) is generated.
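The three rules shown on these slides can be applied mechanically to a hypernym set. The sketch below hard-codes only those three rules (the real rule base has thousands) and assumes the POS tag is simply placed first in the attribute list; `noun_attributes` and `TRIGGER_RULES` are hypothetical names.

```python
from typing import Iterable, List

# (trigger hypernym, attribute) pairs taken from the rules on the slide.
TRIGGER_RULES = [("fauna", "FAUNA"), ("bird", "BIRD")]


def noun_attributes(hypernyms: Iterable[str]) -> List[str]:
    """Derive semantic attributes for a noun from its hypernyms."""
    hypernym_set = set(hypernyms)
    attributes = ["N"]  # assumed: the POS tag leads the attribute list
    # IF hypernym='organism' THEN 'ANIMT' ELSE 'INANI'
    attributes.append("ANIMT" if "organism" in hypernym_set else "INANI")
    # Remaining rules have no ELSE branch: add the attribute only on a match.
    for trigger, attribute in TRIGGER_RULES:
        if trigger in hypernym_set:
            attributes.append(attribute)
    return attributes


crane_hypernyms = ["bird", "fauna", "animal", "organism"]
print("(" + ",".join(noun_attributes(crane_hypernyms)) + ")")
```

Storing the rules as data rather than as code mirrors the slides' description of a rule database that the inference engine queries.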

  28. Semantic attribute generation • Database of rules:
     • Table 1. Rules for nouns (96)
     • Table 2. Rules for verbs (405)
     • Table 3.1. Rules for adjectives (29)
     • Table 3.2. Rules for adjectives (3258)
     • Table 4. Rules for adverbs (556)
     • Total number of rules: 4344

  29. Experiments and results • Precision = (No. of correct entries in the dictionary) / (Total no. of entries in the dictionary) • Precision for nouns: 93.9% • Precision for verbs: 84.4% • Charts omitted (axis label: Document No).

  30. Experiments and results • Precision = (No. of correct entries in the dictionary) / (Total no. of entries in the dictionary) • Precision for adjectives: 90.06% • Precision for adverbs: 86% • Charts omitted (axis label: Document No).
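The precision measure used in the experiments is a simple ratio. The sketch below computes it for made-up counts; the numbers 9 and 10 are purely illustrative and are not taken from the experiments, and `precision` is a hypothetical helper name.

```python
def precision(correct_entries: int, total_entries: int) -> float:
    """Precision = correct dictionary entries / total dictionary entries."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    if not 0 <= correct_entries <= total_entries:
        raise ValueError("correct_entries must lie between 0 and total_entries")
    return correct_entries / total_entries


# Illustrative counts only (not results from the experiments).
print(f"{precision(9, 10):.1%}")
```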

  31. Implementation details • Subtasks identified: • A MySQL database is used for storing the rules and the UNL KB. • 7540 entries in the UNL KB. • 4344 entries in the rule base. • Inference engine in C++. • Web interface of the DDG in CGI and PHP. • Other utilities, such as the UNL KB organizer, the rule entry interface, and the WSD integrator, are implemented in Perl. • Lines of code (LOC): 4761

  32. Demo

  33. Method • Hindi-UW dictionary generation

  34. Hindi-UW dictionary generation • The WordNet API is used to obtain all possible parts of speech and all possible senses for every word. • The Hindi WN is queried (using the Hindi WN API) to obtain the semantic attributes.

  35. Hindi-UW dictionary generation • The Hindi WN is queried (using the Hindi WN API) to obtain the semantic attributes. • The Hindi UW dictionary database is queried (on the basis of the input word and its POS) to obtain an appropriate UW. • In this step, the irrelevant entries are disabled and the incorrect ones are corrected manually by the lexicographer.

  36. Demo

  37. Conclusion and future work • The burden of lexicography has been reduced considerably. • The system is being routinely used in our work on machine translation in a tri-language setting (English, Hindi and Marathi). • Future work will be directed towards implementing a part-of-speech tagger and a word sense disambiguator for Hindi and Marathi.
