Automatic Lexicon Generation through WordNet, by Nitin Verma and Pushpak Bhattacharyya, Jan 21, 2004
Introduction
• A lexicon is the heart of any natural language processing system.
• It is difficult to construct and requires an enormous amount of time and manpower.
• Document-specific dictionary generation:
  • Given a document D and a word W in it, which sense S of W should be picked for that document?
  • Can one construct a document-specific dictionary that stores a single sense for each word?
Introduction: UW dictionary
• An important machine-readable lexical resource used by the enconverter and deconverter software.
[Figure: the enconverter, using the UW dictionary and the analysis rules, converts natural language into UNL.]
Introduction (UW dictionary)
• Format of dictionary entries:
  [crane] "crane(icl>bird)" (N, ANIMT, FAUNA, BIRD);
  Here [crane] is the headword (HW), "crane(icl>bird)" is the UW with the restriction (icl>bird), and (N, ANIMT, FAUNA, BIRD) are the attributes (both syntactic and semantic).
• Semantic attributes (derived from the ontology).
• Syntactic attributes (POS, person, number, tense).
• The attributes are used for the firing of appropriate analysis rules.
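The entry format above can be split into its three parts mechanically. The following is a minimal sketch, assuming entries use straight double quotes and end with a semicolon; the names `DictEntry`, `ENTRY_RE` and `parse_entry` are hypothetical and are not part of the UNL tools.

```python
import re
from typing import List, NamedTuple


class DictEntry(NamedTuple):
    headword: str          # text inside [...]
    uw: str                # universal word without its restriction
    restriction: str       # e.g. "icl>bird", empty if none
    attributes: List[str]  # syntactic and semantic attributes


# Assumed layout: [HW] "UW(restriction)" (ATTR, ATTR, ...);
ENTRY_RE = re.compile(
    r'\[(?P<hw>[^\]]+)\]\s*'
    r'"(?P<uw>[^"(]+?)\s*(?:\((?P<restr>[^)]*)\))?"\s*'
    r'\((?P<attrs>[^)]*)\)\s*;'
)


def parse_entry(line: str) -> DictEntry:
    """Split one dictionary line into headword, UW, restriction, attributes."""
    match = ENTRY_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    attrs = [a.strip() for a in match.group("attrs").split(",") if a.strip()]
    return DictEntry(
        headword=match.group("hw").strip(),
        uw=match.group("uw").strip(),
        restriction=match.group("restr") or "",
        attributes=attrs,
    )


entry = parse_entry('[crane] "crane(icl>bird)" (N, ANIMT, FAUNA, BIRD);')
print(entry)
```

The first attribute is the part of speech in the example entry; the sketch keeps all attributes in one list rather than guessing which are syntactic and which are semantic.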
Introduction: Ontology*
• Animate (ANIMT)
  • Flora (FLORA)
    • Shrubs (ANIMT, FLORA, SHRB), e.g. jasmine
    • Aquatic plants (ANIMT, FLORA, AQTC), e.g. lotus
    • ...
  • Fauna (FAUNA)
    • Mammals (MML)
    • Reptiles (ANIMT, FAUNA, RPTL), e.g. lizard
    • Birds (ANIMT, FAUNA, BIRD)
    • Fish (ANIMT, FAUNA, FISH)
    • Insects (ANIMT, FAUNA, INSCT), e.g. butterfly
    • ...
* Dictionary group, CFILT, IIT Bombay.
English-UW dictionary generation
• Resources used:
  • English WordNet
  • A WSD* system (soft word sense disambiguation method)
  • The UNL KB
  • An inferencer
• Knowledge-based approach.
* G. Ramakrishnan and P. Bhattacharyya. Soft Word Sense Disambiguation, GWN 2004.
English-UW dictionary generation: Method
• Stage 1 –
• Stage 2 –
[Figure: an input document (Word1 word2 ...) is passed through the WSD* system, which produces a POS- and sense-tagged document (Word1:N:1 Word2:N:3 ...).]
English-UW dictionary generation (Method)
[Figure: the tagged document (Word1:pos1:sense1 Word2:pos2:sense2 ...) is fed to the inference engine. The inference engine consults the KB (WordNet, the UNL KB and the database of rules) and produces the UW dictionary together with an explanation.]
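The tagged document uses tokens of the form word:pos:sense. As a rough illustration of how such input could be read before it reaches the inference engine, here is a small sketch; it assumes whitespace-separated tokens and an integer sense number, and the names `TaggedWord` and `parse_tagged_document` are hypothetical.

```python
from typing import List, NamedTuple


class TaggedWord(NamedTuple):
    word: str
    pos: str    # part-of-speech tag, e.g. "N" or "V"
    sense: int  # WordNet sense number chosen by the WSD system


def parse_tagged_token(token: str) -> TaggedWord:
    """Parse one 'word:pos:sense' token, e.g. 'crane:N:4'."""
    # Split from the right so a colon inside the word itself is tolerated.
    word, pos, sense = token.rsplit(":", 2)
    return TaggedWord(word, pos, int(sense))


def parse_tagged_document(text: str) -> List[TaggedWord]:
    """Parse a whitespace-separated sequence of tagged tokens."""
    return [parse_tagged_token(tok) for tok in text.split()]


tagged = parse_tagged_document("crane:N:4 fly:V:1")
print(tagged)
```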
UW generation UW generation for nouns
UW generation for nouns
[Figure, built up over several slides: processing of the tagged entry crane:N:4. The numbered steps in the figure are:]
1. The inference engine reads crane:N:4 from the tagged document.
2. It issues a query to collect semantic information.
3. WordNet (in the KB) returns the hypernyms of crane: bird; fauna, animal; organism.
4. It issues a query to collect relevant rules.
5. The UNL KB (in the KB) answers.
6. The UW crane(icl>bird) is generated.
7. An explanation is produced.
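The slides do not spell out how the restriction is chosen from the hypernyms. One plausible reading, shown here purely as an assumption, is that the nearest hypernym that is also known to the UNL KB becomes the icl restriction. The hypernym list and KB contents below are toy data taken from the crane example, and `noun_uw` is a hypothetical helper.

```python
from typing import Iterable, List


def noun_uw(word: str, hypernyms: List[str], kb_concepts: Iterable[str]) -> str:
    """Return word(icl>h) for the nearest hypernym h present in the KB.

    `hypernyms` must be ordered from most specific to most general.
    Falls back to the bare word when no hypernym is found in the KB.
    """
    known = set(kb_concepts)
    for hypernym in hypernyms:
        if hypernym in known:
            return f"{word}(icl>{hypernym})"
    return word


# Toy data: hypernyms of sense 4 of "crane" as listed in the figure.
crane_hypernyms = ["bird", "fauna", "animal", "organism"]
toy_unl_kb = {"bird", "animal", "organism"}

print(noun_uw("crane", crane_hypernyms, toy_unl_kb))
```

With this toy KB the sketch yields crane(icl>bird), which matches the UW shown in the figure; a KB without "bird" would fall through to a more general hypernym.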
UW generation UW generation for verbs
UW generation for verbs
• Input: a verb.
• If {hypernyms(word)} ∩ {'be', 'continue', etc.} ≠ ∅, generate (icl>be), e.g. exist(icl>be).
• Otherwise, if {hypernyms(nominal word)} ∩ {'phenomenon', 'natural event', etc.} ≠ ∅, generate (icl>occur), e.g. rain(icl>occur).
• Otherwise, generate (icl>do), e.g. make(icl>do).
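The verb decision procedure can be sketched as a short function. This sketch reads each test as "the intersection is non-empty", which is what the examples (exist, rain, make) require. The marker sets contain only the members named on the slide (the "etc." is not filled in), and the hypernym lists in the usage lines are toy data, not WordNet output.

```python
from typing import Iterable

# Only the members named on the slide; the full sets are not given there.
BE_MARKERS = {"be", "continue"}
OCCUR_MARKERS = {"phenomenon", "natural event"}


def verb_uw(word: str,
            verb_hypernyms: Iterable[str],
            nominal_hypernyms: Iterable[str]) -> str:
    """Pick the icl restriction for a verb.

    verb_hypernyms: hypernyms of the verb sense.
    nominal_hypernyms: hypernyms of the verb's nominal form (e.g. the noun "rain").
    """
    if set(verb_hypernyms) & BE_MARKERS:
        return f"{word}(icl>be)"
    if set(nominal_hypernyms) & OCCUR_MARKERS:
        return f"{word}(icl>occur)"
    return f"{word}(icl>do)"


# Toy hypernym lists, invented for illustration.
print(verb_uw("exist", ["be"], []))
print(verb_uw("rain", ["precipitate"], ["precipitation", "phenomenon"]))
print(verb_uw("make", ["create"], []))
```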
UW generation for adjectives
• Input: an adjective.
• If a UW for the word is present in the UNL KB, pick that UW, e.g. broad(aoj>thing).
• Otherwise, if the is_a_value_of relation is defined on the input word (IS_DEFINED), generate (aoj>thing), e.g. good(aoj>thing).
• Otherwise, generate (mod>thing), e.g. green(mod>thing).
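The adjective procedure is a three-way fallback, sketched below. The UNL KB is modelled as a plain dictionary and the is_a_value_of check as a set lookup; both are stand-ins chosen for illustration, and `adjective_uw` is a hypothetical name.

```python
from typing import Callable, Mapping


def adjective_uw(word: str,
                 unl_kb: Mapping[str, str],
                 has_is_a_value_of: Callable[[str], bool]) -> str:
    """Pick a UW for an adjective following the three-step fallback."""
    if word in unl_kb:
        # Step 1: reuse the UW already stored in the UNL KB.
        return unl_kb[word]
    if has_is_a_value_of(word):
        # Step 2: the is_a_value_of relation is defined on the word.
        return f"{word}(aoj>thing)"
    # Step 3: default restriction.
    return f"{word}(mod>thing)"


# Toy stand-ins for the UNL KB and the WordNet relation check.
toy_kb = {"broad": "broad(aoj>thing)"}
value_adjectives = {"good"}


def toy_check(w: str) -> bool:
    return w in value_adjectives


for adj in ("broad", "good", "green"):
    print(adjective_uw(adj, toy_kb, toy_check))
```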
English-UW dictionary generation (Method) Semantic attribute generation
Semantic attribute generation
[Figure, built up over several slides: processing of the tagged entry crane:N:4. The numbered steps in the figure are:]
1. The inference engine reads crane:N:4 from the tagged document.
2. It issues a query to collect semantic information.
3. WordNet (in the KB) returns the hypernyms of crane: bird; fauna, animal; organism.
4. It issues a query to collect relevant rules.
5. The database of rules (in the KB) returns the matching rules:
   IF hypernym='organism' THEN generate 'ANIMT' ELSE generate 'INANI';
   IF hypernym='fauna' THEN generate 'FAUNA';
   IF hypernym='bird' THEN generate 'BIRD';
   ...
6. The attribute list (N, ANIMT, FAUNA, BIRD) is generated.
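The three example rules can be applied to a hypernym list as in the sketch below. The rule table is a small in-memory stand-in for the rule database (the real rules live in MySQL according to the implementation slide), it contains only the three rules quoted in the figure, and the representation as tuples is an assumption made for illustration.

```python
from typing import Iterable, List, Optional, Tuple

# (hypernym to test, attribute if present, attribute if absent or None)
Rule = Tuple[str, str, Optional[str]]

NOUN_RULES: List[Rule] = [
    ("organism", "ANIMT", "INANI"),
    ("fauna", "FAUNA", None),
    ("bird", "BIRD", None),
]


def semantic_attributes(pos: str,
                        hypernyms: Iterable[str],
                        rules: List[Rule] = NOUN_RULES) -> List[str]:
    """Return the POS tag followed by the attributes the rules generate."""
    present = set(hypernyms)
    attrs = [pos]
    for hypernym, if_present, if_absent in rules:
        if hypernym in present:
            attrs.append(if_present)
        elif if_absent is not None:
            attrs.append(if_absent)
    return attrs


crane_attrs = semantic_attributes("N", ["bird", "fauna", "animal", "organism"])
print(crane_attrs)  # ['N', 'ANIMT', 'FAUNA', 'BIRD']
```

Keeping the rules as data rather than as hard-coded conditions mirrors the idea of a rule database: new attributes can be added without touching the function.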
Semantic attribute generation: Database of rules
• Table 1. Rules for nouns (96)
• Table 2. Rules for verbs (405)
• Table 3.1. Rules for adjectives (29)
• Table 3.2. Rules for adjectives (3258)
• Table 4. Rules for adverbs (556)
• Total number of rules: 4344
Experiments and results
Precision = (number of correct entries in the dictionary) / (total number of entries in the dictionary)
• Precision for nouns: 93.9%
• Precision for verbs: 84.4%
• Precision for adjectives: 90.06%
• Precision for adverbs: 86%
[Charts: one per part of speech, with "Document No" on the horizontal axis.]
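The precision measure defined above is a simple ratio. The sketch below computes it from a list of per-entry judgements; the judgements in the usage lines are invented for illustration and are not the experimental data.

```python
from typing import Sequence


def precision(judgements: Sequence[bool]) -> float:
    """Fraction of generated dictionary entries judged correct.

    judgements[i] is True when entry i was judged correct.
    """
    if not judgements:
        raise ValueError("precision is undefined for an empty dictionary")
    return sum(judgements) / len(judgements)


# Invented example: 9 correct entries out of 10.
toy_judgements = [True] * 9 + [False]
print(f"{precision(toy_judgements):.1%}")
```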
Implementation details
• Subtasks identified –
• A MySQL database is used for storing the rules and the UNL KB.
• 7540 entries in the UNL KB.
• 4344 entries in the rule base.
• Inference engine in C++.
• Web interface of the DDG in CGI and PHP.
• Other utilities, such as the UNL KB organizer, the rule entry interface and the WSD integrator, are implemented in Perl.
• Lines of code (LOC): 4761
Method Hindi-UW dictionary generation
Hindi-UW dictionary generation
• The WordNet API is used to obtain all possible parts of speech and all possible senses for every word.
• The Hindi WN is queried (using the Hindi WN API) to obtain the semantic attributes.
• The Hindi UW dictionary database is queried (on the basis of the input word and its POS) to obtain an appropriate UW.
• The lexicographer then manually disables the irrelevant entries and corrects the incorrect ones.
Conclusion and future work
• The burden of lexicography has been reduced considerably.
• The system is being routinely used in our work on machine translation in a tri-language setting (English, Hindi and Marathi).
• Future work will be directed towards implementing a part-of-speech tagger and a word sense disambiguator for Hindi and Marathi.