This project outlines the objectives, partners, resources, methods, results, and conclusions of the Hungarian Eurovoc indexing workshop. It details the translation process, terminological database, indexing strategies, online dictionary lookup, and manual checks carried out in developing the Hungarian Eurovoc version. The project involves collaboration with various partners and stakeholders to create a full-scale authoritative Eurovoc system for Hungary.
Eurovoc does not yet exist for your language? The Hungarian experience. Tamás Váradi varadi@nytud.hu
Overview of the project • Objectives • Partners • Resources • Methods • Results • Conclusions EUROVOC Indexing Workshop
Project objectives • Hungarian EUROVOC version • only a draft version planned at first • an authoritative full-scale system • Automatic indexing of documents • using the technology developed at JRC • prototype system for one domain
Partners • Project consortium: • HAS RIL (coordinator) • MorphoLogic Kft. (partner) • Collaborators: • JRC, Ispra • Hungarian Parliament • Ministry of Justice
Resources • NLP toolset (RIL) • Digital dictionaries, software technology (MorphoLogic) • Indexing technology (JRC Ispra) • Terminology database, translation, supervision expertise (Justice Ministry) • Coordination funding of Hungarian EUROVOC (Hungarian Parliament)
EUROVOC translation • Done by the Translation Coordination Unit of the Ministry of Justice • The team coordinates the massive effort of preparing the Hungarian translation of the Acquis Communautaire • The team maintains an online Terminological Database
Terminological Database
Translation process • English, French, German & Spanish EUROVOC versions in XML files • Automatic lookup in the Terminological Database (approx. 20% coverage) • Notepad2 XML-aware editor used • micro-thesauri translated first, corresponding descriptors second • pool of experts consulted when needed
Indexing strategies • Corpus: Hungarian translation of the Acquis Communautaire • Two approaches • To translate the English associate terms (a possible shortcut?) • To reconstruct the generation of associate terms by running the JRC technology on the Hungarian data
Translation of associate terms • Hypothesis: • relation between English associate term and EUROVOC descriptor is language independent • hence Hungarian equivalent of English term will also serve as appropriate associate term in Hungarian texts
Online dictionary lookup • MorphoLogic Online English-Hungarian dictionaries applied • 24.7% direct match <LIBELLE_EN>suspension of payments</LIBELLE_EN> <LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE> <LIBELLE_FR>cessation de paiement</LIBELLE_FR> <LIBELLE_ES>suspensión de pagos</LIBELLE_ES> <LIBELLE_HU>kifizetések felfüggesztése</LIBELLE_HU>
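A minimal sketch of the direct-match lookup, assuming each descriptor is stored as a small XML record with `LIBELLE_XX` children as in the example above. The `<RECORD>` wrapper, the toy `DICTIONARY` and the function `add_hungarian` are hypothetical names for illustration; they are not the MorphoLogic interface, which the slides do not describe.

```python
import xml.etree.ElementTree as ET

# Hypothetical record wrapper; only the LIBELLE_* element names come from the slide.
RECORD = """<RECORD>
  <LIBELLE_EN>suspension of payments</LIBELLE_EN>
  <LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE>
</RECORD>"""

# Toy stand-in for an English-Hungarian dictionary (assumption, one entry only).
DICTIONARY = {"suspension of payments": "kifizetések felfüggesztése"}


def add_hungarian(record_xml, dictionary):
    """Try a direct match on the English label; return (record, matched)."""
    record = ET.fromstring(record_xml)
    english = record.findtext("LIBELLE_EN", default="").strip().lower()
    hungarian = dictionary.get(english)
    if hungarian is None:
        # No direct match: leave the record for manual translation.
        return record, False
    ET.SubElement(record, "LIBELLE_HU").text = hungarian
    return record, True


record, matched = add_hungarian(RECORD, DICTIONARY)
```

Records that return `matched == False` would go to the translators; the share of `True` results corresponds to the direct-match rate quoted on the slide.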
Manual check of automatic assignments • Equivalence cannot be judged on its own merits: the Hungarian equivalent must be the one occurring in the texts • the Hungarian terms must be looked up in the translation corpus as well • a parallel corpus aligned at least at the document level must be compiled
Manual check • Even frequency lists are useful: <LIBELLE_EN>sales promotion</LIBELLE_EN> <LIBELLE_DE>Absatzförderung</LIBELLE_DE> <LIBELLE_FR>promotion commerciale</LIBELLE_FR> <LIBELLE_ES>promoción comercial</LIBELLE_ES> <LIBELLE_HU>eladásösztönzés</LIBELLE_HU> • Frequencies: Reklám 149, Promóció 60, Eladásösztönzés 1
Manual check • Even frequency lists are useful: <LIBELLE_EN>toxic substance</LIBELLE_EN> <LIBELLE_DE>Giftstoff</LIBELLE_DE> <LIBELLE_FR>substance toxique</LIBELLE_FR> <LIBELLE_ES>sustancia tóxica</LIBELLE_ES> <LIBELLE_HU>toxikus anyagok</LIBELLE_HU> <LIBELLE_HU>mérgező anyagok</LIBELLE_HU> • Both Hungarian variants are equally frequent
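The frequency check on the two "Manual check" slides could be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: `candidate_frequencies` is a hypothetical helper, the sample text is invented, and the match is a crude word-initial prefix match so that inflected Hungarian forms are counted too. A lemmatised corpus would give cleaner counts.

```python
import re
from collections import Counter


def candidate_frequencies(text, candidates):
    """Count case-insensitive, word-initial occurrences of each candidate term."""
    lowered = text.lower()
    counts = Counter()
    for term in candidates:
        # \b only at the start, so suffixed forms of the term are also matched.
        pattern = r"\b" + re.escape(term.lower())
        counts[term] = len(re.findall(pattern, lowered))
    return counts


# Invented sample text, for illustration only.
sample = "A reklám és a promóció szabályai. Reklám célú promóció. Reklám."
freq = candidate_frequencies(sample, ["reklám", "promóció", "eladásösztönzés"])
best = freq.most_common(1)[0][0]  # the candidate most used in the texts
```

A candidate with a near-zero count, like the dictionary equivalent on the first slide, is a signal that the translators should prefer a term the corpus actually uses.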
Generation of Hungarian associate-lists • Tasks • Compile corpus of the Hungarian translation of the Acquis Communautaire • Tag and lemmatize words • Compile list of stop words • Run automatic indexing tools (JRC)
Hungarian Acquis Communautaire corpus • 8308 files • Document structure: <!ELEMENT document (title+,text,lemmatised, descriptors,description) > • HUN tokens: 21,899,924 • EN tokens: 20,394,088
English stop-word list • 1720 items • function words • "EUspeak", e.g. objective, arrangements, committee • some strange multiword strings, e.g. necessary_to_comply_with_this_directive, forward_this_resolution_to_the_commission
Hungarian stop-word list • translated the English items • checked their occurrence in HU CELEX • generated unigram, bigram and trigram frequency lists from the HU CELEX corpus • checked the first 3000 items on each list and added them to the stop-word list if needed • double-checked infrequent items on the translated English list and replaced the translations with synonyms
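The n-gram step above can be sketched in a few lines. This is only an illustration: `ngram_counts` and `stopword_candidates` are hypothetical helpers, the token list is a toy example, and joining multiword items with underscores simply mirrors the notation of the English list. The selection itself was a manual review, which the code does not replace.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token list; words are joined with '_'."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter("_".join(gram) for gram in grams)


def stopword_candidates(tokens, top=3000):
    """Return the `top` most frequent uni-, bi- and trigrams for manual review."""
    return {
        n: [gram for gram, _ in ngram_counts(tokens, n).most_common(top)]
        for n in (1, 2, 3)
    }


# Toy token stream standing in for the lemmatised corpus.
tokens = "a bizottság a tanács a bizottság".split()
cands = stopword_candidates(tokens, top=2)
```

Each of the three lists would then be read from the top by a person, who decides which items belong on the stop-word list.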
Hungarian stop-word list • single-word entries: 1265 • multi-word entries: 752 • total: 2017
Automatic indexing, run 1 • 7971 texts (total length 65,702,474 characters) divided into 3 sets: • 202 for optimisation (evaluation set) • 179 for final evaluation (test set) • 7590 for training (training set)
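A split with these sizes could be produced as sketched below. The slides do not say how documents were assigned to the three sets, so the seeded random shuffle, the function `split_corpus` and the integer document ids are all assumptions made for illustration.

```python
import random


def split_corpus(doc_ids, n_dev=202, n_test=179, seed=0):
    """Shuffle document ids and split them into optimisation, test and training sets."""
    ids = list(doc_ids)
    if len(ids) < n_dev + n_test:
        raise ValueError("corpus too small for the requested split")
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(ids)
    dev = ids[:n_dev]
    test = ids[n_dev:n_dev + n_test]
    train = ids[n_dev + n_test:]
    return dev, test, train


# Hypothetical ids 0..7970 standing in for the 7971 texts.
dev, test, train = split_corpus(range(7971))
```

Keeping the final test set untouched during optimisation is what makes the reported figures a fair estimate of performance on unseen documents.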
Precision/recall in terms of number of Eurovoc descriptors
Evaluation in terms of rank
Precision/Recall graph
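For readers without the original charts, the following sketch shows how precision and recall are typically computed as a function of the number of descriptors assigned. It uses the standard definitions; the descriptor ids, the ranked list and the manually assigned set are invented, and `precision_recall_at_k` is a hypothetical helper, not part of the JRC tools.

```python
def precision_recall_at_k(ranked, gold, k):
    """Precision and recall of the top-k ranked descriptors against the manual set."""
    top = ranked[:k]
    hits = sum(1 for descriptor in top if descriptor in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


ranked = ["d1", "d7", "d3", "d9"]  # invented system output, best first
gold = {"d1", "d3", "d5"}          # invented manually assigned descriptors

# One (precision, recall) point per cut-off, i.e. the raw data for a P/R graph.
curve = [precision_recall_at_k(ranked, gold, k) for k in range(1, len(ranked) + 1)]
```

Averaging such points over all documents in the evaluation set gives one curve per run, which is the kind of data a precision/recall graph plots.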
Conclusions • The first run already yields results comparable to other languages • there is scope for fine-tuning the filtering process • it will be interesting to compare the results of the two approaches