Eurovoc Indexing Workshop: Hungarian Experience

Eurovoc does not yet exist for your language? The Hungarian experience. Tamás Váradi varadi@nytud.hu

Overview of the project
• Objectives
• Partners
• Resources
• Methods
• Results
• Conclusions
EUROVOC Indexing Workshop

Project objectives
• Hungarian EUROVOC version
  • only a draft version planned at first
  • an authoritative full-scale system
• Automatic indexing of documents
  • using the technology developed at JRC
  • prototype system for one domain

Partners
• Project consortium:
  • HAS RIL (coordinator)
  • MorphoLogic Kft. (partner)
• Collaborators:
  • JRC, Ispra
  • Hungarian Parliament
  • Ministry of Justice

Resources
• NLP toolset (RIL)
• Digital dictionaries, software technology (MorphoLogic)
• Indexing technology (JRC Ispra)
• Terminology database, translation, supervision expertise (Justice Ministry)
• Coordination funding of Hungarian EUROVOC (Hungarian Parliament)

EUROVOC translation
• Done by the Translation Coordination Unit of the Ministry of Justice
• Team coordinating the massive effort of preparing the Hungarian translation of the Acquis Communautaire
• Maintaining an online Terminological Database

Terminological Database

Translation process
• English, French, German & Spanish EUROVOC versions in XML files
• Automatic lookup in the Terminological Database (approx. 20% coverage)
• Notepad2 XML-aware editor used
• micro-thesauri translated first, corresponding descriptors second
• pool of experts consulted when needed

Indexing strategies
• Corpus: Hungarian translation of the Acquis Communautaire
• Two approaches
  • To translate English associate terms (possible short-cut?)
  • To reconstruct the generation of associate terms by running the JRC technology on the Hungarian data

Translation of associate terms
• Hypothesis:
  • the relation between an English associate term and its EUROVOC descriptor is language independent
  • hence the Hungarian equivalent of the English term will also serve as an appropriate associate term in Hungarian texts

Online dictionary lookup
• MorphoLogic Online English-Hungarian dictionaries applied
• 24.7% direct match
<LIBELLE_EN>suspension of payments</LIBELLE_EN>
<LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE>
<LIBELLE_FR>cessation de paiement</LIBELLE_FR>
<LIBELLE_ES>suspensión de pagos</LIBELLE_ES>
<LIBELLE_HU>kifizetések felfüggesztése</LIBELLE_HU>
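
The direct-match lookup can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the `THESAURUS` and `RECORD` wrapper elements, the `EN_HU` toy dictionary and the function name are all assumptions, since the slides show only the `LIBELLE_*` elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper elements: only the LIBELLE_* tags come from the slides.
SAMPLE = """<THESAURUS>
  <RECORD>
    <LIBELLE_EN>suspension of payments</LIBELLE_EN>
    <LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE>
  </RECORD>
  <RECORD>
    <LIBELLE_EN>sales promotion</LIBELLE_EN>
    <LIBELLE_DE>Absatzförderung</LIBELLE_DE>
  </RECORD>
</THESAURUS>"""

# Toy stand-in for the bilingual dictionary (assumption).
EN_HU = {"suspension of payments": "kifizetések felfüggesztése"}


def add_hungarian_labels(xml_text, dictionary):
    """Add LIBELLE_HU where the English label has an exact dictionary match.

    Returns the parsed root element and the share of records matched.
    """
    root = ET.fromstring(xml_text)
    records = root.findall("RECORD")
    matched = 0
    for record in records:
        english = record.findtext("LIBELLE_EN", default="").strip().lower()
        hungarian = dictionary.get(english)
        if hungarian is not None:
            ET.SubElement(record, "LIBELLE_HU").text = hungarian
            matched += 1
    rate = matched / len(records) if records else 0.0
    return root, rate


root, rate = add_hungarian_labels(SAMPLE, EN_HU)
print(f"direct match rate: {rate:.1%}")
```

Records without a direct match are left without a `LIBELLE_HU` element, so they remain easy to find for manual translation.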

Manual check of automatic assignments
• Equivalence cannot be judged on its own merits: the Hungarian equivalent must be the one occurring in the texts
• the Hungarian terms must be looked up in the translation corpus as well
• a parallel corpus aligned at least on the document level must be compiled

Manual check
• Even frequency lists are useful:
<LIBELLE_EN>sales promotion</LIBELLE_EN>
<LIBELLE_DE>Absatzförderung</LIBELLE_DE>
<LIBELLE_FR>promotion commerciale</LIBELLE_FR>
<LIBELLE_ES>promoción comercial</LIBELLE_ES>
<LIBELLE_HU>eladásösztönzés</LIBELLE_HU>
Reklám           149
Promóció          60
Eladásösztönzés    1
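
A frequency check of this kind could look like the sketch below. It is only an illustration under stated assumptions: the function name and the toy texts are invented, and prefix matching is a crude stand-in for the lemmatisation that a real run over the Hungarian corpus would need.

```python
import re
from collections import Counter


def candidate_frequencies(texts, candidates):
    """Count case-insensitive occurrences of each candidate term.

    Matching is on word prefixes so that inflected forms such as
    "reklámok" are counted too (an assumption replacing lemmatisation).
    """
    counts = Counter({cand: 0 for cand in candidates})
    for text in texts:
        lowered = text.lower()
        for cand in candidates:
            pattern = r"\b" + re.escape(cand.lower()) + r"\w*"
            counts[cand] += len(re.findall(pattern, lowered))
    return counts.most_common()


# Toy corpus (assumption), not taken from the Acquis.
toy_texts = [
    "A reklám és a promóció szabályai.",
    "A reklámok korlátozása.",
    "Promóció és reklám.",
]
ranking = candidate_frequencies(
    toy_texts, ["reklám", "promóció", "eladásösztönzés"]
)
print(ranking)
```

The candidate at the top of the ranking is the form most likely to work as an associate term, whatever the dictionary proposes.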

Manual check
• Even frequency lists are useful:
<LIBELLE_EN>toxic substance</LIBELLE_EN>
<LIBELLE_DE>Giftstoff</LIBELLE_DE>
<LIBELLE_FR>substance toxique</LIBELLE_FR>
<LIBELLE_ES>sustancia tóxica</LIBELLE_ES>
<LIBELLE_HU>toxikus anyagok</LIBELLE_HU>
<LIBELLE_HU>mérgező anyagok</LIBELLE_HU>
Equally frequent

Generation of Hungarian associate-lists
• Tasks
  • Compile corpus of the Hungarian translation of the Acquis Communautaire
  • Tag and lemmatize words
  • Compile list of stop words
  • Run automatic indexing tools (JRC)

Hungarian Acquis Communautaire corpus
• 8308 files
<!ELEMENT document (title+, text, lemmatised, descriptors, description)>
HUN tokens  21,899,924
EN tokens   20,394,088

English stop-word list
• English stop-word list: 1720 items
• function words
• "EUspeak"
  • objective, arrangements, committee
• Some strange multiword strings
  • necessary_to_comply_with_this_directive
  • forward_this_resolution_to_the_commission

Hungarian stop-word list
• translated English items
• checked their occurrence in HU CELEX
• generated unigram, bigram and trigram frequency lists from the HU CELEX corpus
• checked the first 3000 items on each list and added them to the stop-word list if needed
• double-checked infrequent items on the English translation list and replaced translations with synonyms
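
The n-gram frequency lists in the third step could be produced along these lines. This is a sketch under assumptions: the function names and the English toy tokens are illustrative only, and the underscore joining merely mirrors the multiword format of the English stop-word list.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in one tokenised sentence."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def stopword_candidates(sentences, max_n=3, top_k=3000):
    """Return, for each n up to max_n, the top_k most frequent n-grams.

    N-grams are joined with underscores. top_k=3000 follows the number
    of items checked per list on the slide.
    """
    result = {}
    for n in range(1, max_n + 1):
        counts = Counter()
        for tokens in sentences:
            counts.update(ngram_counts(tokens, n))
        result[n] = [
            ("_".join(gram), freq) for gram, freq in counts.most_common(top_k)
        ]
    return result


# Toy tokenised sentences (assumption), not taken from HU CELEX.
toy_sentences = [
    ["having", "regard", "to", "the", "treaty"],
    ["having", "regard", "to", "the", "proposal"],
]
lists = stopword_candidates(toy_sentences)
print(lists[3][:2])
```

The resulting lists are only candidates: as the slide notes, each item still needs a manual decision before it enters the stop-word list.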

Hungarian stop-word list
single-word entries  1265
multi-word entries    752
Total                2017

Automatic indexing run 1
7971 texts (total length of 65,702,474 chars) divided into 3 sets:
• 202: optimisation (evaluation set)
• 179: final evaluation (test set)
• 7590: training set
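
A three-way split with these set sizes can be sketched as below. The slides do not say how the documents were assigned to the sets, so the seeded random shuffle and the function name are assumptions made purely for illustration.

```python
import random


def split_corpus(doc_ids, n_optimisation=202, n_test=179, seed=0):
    """Split document ids into optimisation, test and training sets.

    Random assignment is an assumption: the actual selection procedure
    is not described in the slides.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    optimisation = ids[:n_optimisation]
    test_set = ids[n_optimisation:n_optimisation + n_test]
    training = ids[n_optimisation + n_test:]
    return optimisation, test_set, training


optimisation, test_set, training = split_corpus(range(7971))
print(len(optimisation), len(test_set), len(training))
```

With 7971 documents the remainder after removing the two evaluation sets is 7590, which matches the training set size given on the slide.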

Precision/recall in terms of number of Eurovoc descriptors
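
For reference, precision and recall at a given number of assigned descriptors can be computed as in the sketch below. It assumes the standard set-based definitions; the slides do not spell out the exact measure used in the evaluation, and the function name and example data are illustrative.

```python
def precision_recall_at_k(ranked, gold, k):
    """Precision and recall when the top k ranked descriptors are assigned.

    ranked: descriptors ordered by the indexer's score, best first.
    gold:   set of manually assigned descriptors for the same document.
    """
    top = ranked[:k]
    hits = sum(1 for descriptor in top if descriptor in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data (assumption): labels borrowed from the example slides.
ranked = ["toxic substance", "sales promotion", "suspension of payments"]
gold = {"toxic substance", "suspension of payments"}
for k in (1, 2, 3):
    print(k, precision_recall_at_k(ranked, gold, k))
```

Increasing k typically raises recall while lowering precision, which is the trade-off a precision/recall curve over the number of descriptors makes visible.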

Evaluation in terms of rank

Precision/Recall graph

Conclusions
• First run already yields results comparable to other languages
• scope for fine-tuning the filtering process
• interesting to compare the results gained from the two approaches
