The use of machine translation tools for cross-lingual text-mining

Blaz Fortuna, Jozef Stefan Institute, Ljubljana
John Shawe-Taylor, Southampton University

Outline
• Cross-lingual text mining
• Kernel CCA
• Machine translation
• Information retrieval experiment
• Classification experiment
• Conclusions

Cross-lingual text mining
When text mining is applied to a multilingual text corpus, language-specific issues appear:
• Information retrieval: retrieved documents should depend only on the meaning of the query, not on its language.
• Classification: a single classifier should be learned, not a separate classifier for each language.
• Clustering: documents should be grouped into clusters based on their content, not on the language they are written in.

KCCA (Kernel Canonical Correlation Analysis)
KCCA learns a semantic representation of text from a corpus of unlabeled paired documents.
• Input: a set of paired documents (each document has a version in each language).
• Output: a set of mappings from each native language space into a "language-independent space", a subspace with semantic dimensions [Vinokourov et al., 2002].
Semantic dimensions (example word groups from the figure):
• English: loss, income, company, quarter / German: verlust, einkommen, firma, viertel
• English: wage, payment, negotiations, union / German: zahlung, volle, gewerkschaft, verhandlungsrunde
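The slides do not state the optimisation problem behind KCCA. One common regularised formulation is sketched below as an assumption, not necessarily the exact setup used in these experiments. The symbols K_x, K_y, κ, A and k_x(d) are notation introduced here for illustration.

```latex
% Assumed notation (not from the slides): K_x and K_y are the n x n kernel
% (Gram) matrices of the n paired training documents in the two languages;
% \kappa > 0 is a regularisation parameter.
\[
\rho = \max_{\alpha,\beta}
\frac{\alpha^{\top} K_x K_y \beta}
{\sqrt{\left(\alpha^{\top} K_x^{2} \alpha + \kappa\, \alpha^{\top} K_x \alpha\right)
       \left(\beta^{\top} K_y^{2} \beta + \kappa\, \beta^{\top} K_y \beta\right)}}
\]
% A new document d in language x is mapped into the semantic space using the
% matrix A whose columns are the leading solutions \alpha, applied to the
% vector of kernel values between d and the training documents d_1, ..., d_n.
\[
\phi_x(d) = A^{\top} k_x(d), \qquad
k_x(d) = \bigl(k(d, d_1), \dots, k(d, d_n)\bigr)^{\top}
\]
```

Each pair (α, β) defines one semantic dimension. Documents from either language are compared after projection with their own language's mapping.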

Paired training set and machine translation
KCCA needs a paired dataset for training. When no paired dataset is available, we have two options:
• Use a human-made dataset from some other domain.
  • This could be unreliable because of a large semantic and vocabulary gap.
• Use machine translation tools to generate a paired dataset.
  • In our experiments we used Google Language Tools for translating documents.

Experiments
• We investigated how a training set generated by machine translation compares in quality with a true human-generated paired corpus.
• Two questions are addressed. How much do we win or lose by using machine translation when a human-generated corpus is available
  • for the target domain?
  • only for a different domain?

Experiment #1 – Information retrieval
We compared two paired corpora:
• Hansard corpus: aligned pairs of text chunks from the official records of the 36th Canadian Parliament Proceedings [Germann, 2001].
• Artificial corpus: half of the English and half of the French translations from the Hansard corpus were replaced by machine translation.
Queries were generated from each test document by extracting the 5 words with the highest TFIDF weights. The goal was to retrieve the paired document.
Experimental procedure (for each corpus):
(1) KCCA was trained on 1500 paired documents.
(2) All 896 test documents (in both languages) were projected into the KCCA semantic space.
(3) Each query was projected into the KCCA semantic space, and documents were retrieved by nearest neighbour search based on cosine distance to the query.
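The query generation and retrieval steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The function names and the smoothed IDF variant are assumptions, and the vectors passed to `rank_documents` are assumed to be already projected into the KCCA semantic space.

```python
import math
from collections import Counter


def top_tfidf_words(doc_tokens, corpus_tokens, k=5):
    """Return the k words of doc_tokens with the highest TF-IDF weight."""
    n_docs = len(corpus_tokens)
    # Document frequency: number of corpus documents containing each word.
    df = Counter(word for tokens in corpus_tokens for word in set(tokens))
    tf = Counter(doc_tokens)
    # Smoothed IDF (an assumed variant; the slides do not specify one).
    weights = {w: tf[w] * math.log((1 + n_docs) / (1 + df[w])) for w in tf}
    # Sort by descending weight, breaking ties alphabetically for determinism.
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    sims = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: -sims[i])
```

Ranking by descending cosine similarity is equivalent to nearest neighbour search under cosine distance, since the distance is one minus the similarity.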

Results
• For 65% of queries, the correct document appeared in first place.
• For 95% of queries, the correct document appeared among the first 10 results.
• There is no difference when query and document are in the same language.
• When query and document are in different languages, there is a drop of around 5-10% in retrieval accuracy.
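Figures such as "first place" and "among the first 10 results" correspond to a top-k accuracy measure. A minimal sketch of how it could be computed follows. The function name and data layout are assumptions made for illustration.

```python
def top_k_accuracy(rankings, correct, k):
    """Fraction of queries whose correct document is among the first k results.

    rankings[i] is the ranked list of document ids returned for query i;
    correct[i] is the id of the paired document for query i.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(rankings, correct) if target in ranked[:k]
    )
    return hits / len(rankings)
```

With k=1 this gives the share of queries answered in first place, and with k=10 the share answered within the first ten results.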

Experiment #2 – Classification
The Reuters multilingual corpus (English and French) was used as the dataset [Reuters, 2004].
• The first paired training set, Hansard, was taken from the previous experiment; it comes from a different domain than news articles.
• The second paired training set was generated from the Reuters dataset using machine translation (Google).
Experimental procedure (for each corpus):
(1) KCCA was trained on 1500 paired documents.
(2) The whole Reuters corpus was projected into the KCCA semantic space.
(3) A linear SVM classifier was learned in the KCCA semantic space on a subset of 3000 documents and tested on a subset of 50,000 (results are averaged over 5 random splits).
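The averaging over random splits in step (3) can be sketched as below. This is only an illustration of the evaluation loop. The function names are hypothetical, accuracy is an assumed measure (the slides do not name one), and a nearest-centroid classifier stands in for the linear SVM so that the sketch needs no external library.

```python
import random


def average_over_splits(vectors, labels, train_fn, n_train, n_test,
                        n_splits=5, seed=0):
    """Average test accuracy of train_fn over n_splits random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(vectors)))
    accuracies = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        train = indices[:n_train]
        test = indices[n_train:n_train + n_test]
        # train_fn returns a predict(vector) -> label function.
        predict = train_fn([vectors[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(vectors[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


def train_nearest_centroid(train_vectors, train_labels):
    """Stand-in classifier (NOT the linear SVM used in the slides)."""
    sums, counts = {}, {}
    for vec, label in zip(train_vectors, train_labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, x in enumerate(vec):
            acc[j] += x
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def predict(vec):
        # Assign the label of the closest centroid (squared Euclidean distance).
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
        )

    return predict
```

In the setting described above, `vectors` would hold the documents projected into the KCCA semantic space, with `n_train=3000`, `n_test=50000` and `n_splits=5`. Any function that returns a prediction function can be passed as `train_fn`.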

Results
Number of KCCA dimensions: 800
FE: French training set, English testing set.
An artificial paired training set generates a significantly better semantic space than a training set taken from a different domain!

Conclusions
• We have shown that machine translation can be used to generate a training set for Kernel CCA that gives almost as good performance as a training set made by human translators.
• When no hand-made translations are available, this can significantly decrease the cost of multilingual text mining.
We would also like to thank Miha Grcar for making an automated interface to Google Language Tools!

Questions?
