Building a Large-Scale Knowledge Base for Machine Translation

Kevin Knight and Steve K. Luk
Presenter: Cristina Nicolae

Linguistic resources combined into PANGLOSS
• PENMAN Upper Model (Bateman 1990)
  • top-level network of 200 nodes implemented in the LOOM KR language
  • makes extensive use of syntactic-semantic correspondences (between taxonomy and grammar)
• ONTOS (Carlson & Nirenburg 1990)
  • top-level ontology designed to support machine translation
• Longman Dictionary of Contemporary English (LDOCE)
  • words with definitions, usage, syntactic codes ([B3] for adj+to), semantic codes ([H] for human), pragmatic codes ([ECZB] for economics/business)
• WordNet (Miller 1990)
  • semantic word database
• Collins Bilingual Dictionary
  • Spanish-English dictionary

Merging resources

Merging resources – contributions
• LDOCE: syntax and subject area
• WordNet: synonyms and hierarchical structuring
• the upper structures: organize the knowledge for NLP in general and for English generation in particular
• the bilingual dictionary: lets us index the ontology from a second language

Definition Match Algorithm
• two word senses should be matched if their definitions share words
• also looks at related words and senses (e.g. synonyms)
LDOCE
• (batter_2_0) “mixture of flour, eggs and milk, beaten together and used in cooking”
• (batter_3_0) “a person who bats, esp. in baseball – compare BATSMAN”
WordNet
• (BATTER-1) “ballplayer who bats”
• (BATTER-2) “a flour mixture thin enough to pour or drop from a spoon”
Match:
• (batter_2_0) with (BATTER-2)
• (batter_3_0) with (BATTER-1)
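
The word-overlap idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the stopword list, the function names (`content_words`, `overlap`, `match_senses`) and the "highest overlap wins" rule are assumptions, and the sketch ignores the related words and synonyms that the slide says the real algorithm also consults.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "from",
             "or", "who", "used", "esp"}


def content_words(definition):
    """Lowercase the definition, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    return {t for t in tokens if t not in STOPWORDS}


def overlap(def_a, def_b):
    """Number of content words the two definitions have in common."""
    return len(content_words(def_a) & content_words(def_b))


def match_senses(ldoce, wordnet):
    """Pair each LDOCE sense with the WordNet sense sharing the most words.

    Senses with no shared word at all are left unmatched.
    """
    matches = {}
    for l_id, l_def in ldoce.items():
        scores = {w_id: overlap(l_def, w_def) for w_id, w_def in wordnet.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            matches[l_id] = best
    return matches


# Definitions taken from the slide.
LDOCE = {
    "batter_2_0": "mixture of flour, eggs and milk, beaten together and used in cooking",
    "batter_3_0": "a person who bats, esp. in baseball - compare BATSMAN",
}
WORDNET = {
    "BATTER-1": "ballplayer who bats",
    "BATTER-2": "a flour mixture thin enough to pour or drop from a spoon",
}

print(match_senses(LDOCE, WORDNET))
```

On the slide's example, batter_2_0 and BATTER-2 share "flour" and "mixture", while batter_3_0 and BATTER-1 share "bats", so the sketch reproduces the two matches listed above.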

Definition Match Algorithm – Results
• ran the algorithm on all nouns from LDOCE and WordNet

Hierarchy Match Algorithm
• uses sense hierarchies inside LDOCE and WordNet
• once two senses are matched, it is a good idea to look at their respective ancestors and descendants for further matches
• Match:
  • animal_1_2 with ANIMAL-1
  • and their respective animal subhierarchies
• start with unambiguous words and match them, then look downward and upward in the hierarchies rooted at them and match those too
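
A minimal sketch of the propagation step, under stated assumptions: each hierarchy is given as a child-to-parent map, the headword of a sense is whatever precedes the first "_" or "-" in its identifier, and a new match is accepted only when exactly one neighbouring sense on the WordNet side carries the same headword. Apart from animal_1_2 and ANIMAL-1, every sense identifier below is hypothetical, as are the function names; this is not the authors' code.

```python
import re
from collections import deque


def word_of(sense_id):
    """Hypothetical convention: headword = text before the first '_' or '-'."""
    return re.split(r"[_-]", sense_id)[0].lower()


def neighbours(sense, parent_of):
    """Children and parent of a sense in a child -> parent map."""
    near = {child for child, parent in parent_of.items() if parent == sense}
    if sense in parent_of:
        near.add(parent_of[sense])
    return near


def hierarchy_match(seeds, ldoce_parent, wordnet_parent):
    """Grow a set of sense matches downward and upward from seed matches."""
    matches = dict(seeds)
    queue = deque(seeds.items())
    while queue:
        l_sense, w_sense = queue.popleft()
        w_near = neighbours(w_sense, wordnet_parent)
        for l_cand in sorted(neighbours(l_sense, ldoce_parent)):
            if l_cand in matches:
                continue
            same_word = [w for w in w_near if word_of(w) == word_of(l_cand)]
            # Accept only an unambiguous, not yet used candidate.
            if len(same_word) == 1 and same_word[0] not in matches.values():
                matches[l_cand] = same_word[0]
                queue.append((l_cand, same_word[0]))
    return matches


# Toy hierarchies (hypothetical except for the animal senses).
LDOCE_PARENT = {
    "dog_1_1": "animal_1_2",
    "cat_1_1": "animal_1_2",
    "puppy_1_1": "dog_1_1",
}
WORDNET_PARENT = {
    "DOG-1": "ANIMAL-1",
    "CAT-1": "ANIMAL-1",
    "PUPPY-1": "DOG-1",
    "DOG-2": "PERSON-1",  # a second sense of "dog", outside the animal subtree
}

print(hierarchy_match({"animal_1_2": "ANIMAL-1"}, LDOCE_PARENT, WORDNET_PARENT))
```

In the toy data the word "dog" is ambiguous in WordNet, but only DOG-1 sits below ANIMAL-1, so the seed match is enough to pick the right sense; DOG-2 is never considered.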

Hierarchy Match Algorithm – Results
• in the end, the algorithm produced 11,128 noun sense matches at 96% accuracy

Bilingual Match Algorithm
• goal is to annotate the ontology with a large Spanish lexicon
• from:
  • mappings between Spanish and English words (from Collins)
  • mappings between English words and ontological entities (from WordNet)
  • conceptual relations between ontological entities
• we obtain:
  • direct links between Spanish words and ontological entities
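
One way to picture how the first two mappings compose is the sketch below. It is an assumption-laden illustration rather than the algorithm from the paper: it simply lets each English translation of a Spanish word vote for its candidate concepts and keeps the concepts with the most votes, and it does not use the conceptual relations between entities at all. The dictionary entries, concept names and the function name `link_spanish` are all hypothetical.

```python
from collections import Counter


def link_spanish(spanish_to_english, english_to_concepts):
    """Link each Spanish word to the concepts best supported by its translations."""
    links = {}
    for es_word, translations in spanish_to_english.items():
        votes = Counter()
        for en_word in translations:
            for concept in english_to_concepts.get(en_word, ()):
                votes[concept] += 1
        if votes:
            top = max(votes.values())
            # Ties are kept, so a human verifier can choose among them.
            links[es_word] = sorted(c for c, n in votes.items() if n == top)
    return links


# Hypothetical bilingual dictionary entries (Spanish -> English).
SPANISH_TO_ENGLISH = {
    "orilla": ["bank", "shore"],
    "banco": ["bank", "bench"],
}
# Hypothetical English word -> ontology concept mappings.
ENGLISH_TO_CONCEPTS = {
    "bank": ["BANK-INSTITUTION", "RIVER-BANK"],
    "shore": ["RIVER-BANK"],
    "bench": ["BENCH-SEAT"],
}

print(link_spanish(SPANISH_TO_ENGLISH, ENGLISH_TO_CONCEPTS))
```

Here "orilla" resolves to a single concept because both of its translations point to RIVER-BANK, whereas "banco" keeps three tied candidates.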

Discussion
• the output of each merge algorithm presented above is verified by humans afterwards (humans are faster at verifying information than at generating it from scratch)
• semi-automatic merging brings together complementary sources of information
• it also allows us to detect errors and omissions where the resources are redundant
