Finding Translations for Low-Frequency Words in Comparable Corpora Viktor Pekar, Ruslan Mitkov, Dimitar Blagoev, Andrea Mulloni ILP, University of Wolverhampton, UK Contact email: v.pekar@wlv.ac.uk
Overview • Distributional Hypothesis and bilingual lexicon acquisition • The effect of data sparseness • Methods to model co-occurrence vectors of low-frequency words • Experimental evaluation • Conclusions
Distributional Hypothesis in the bilingual context • Words of different languages that appear in similar contexts are translationally equivalent • Acquisition of bilingual lexicons from comparable, rather than parallel, corpora • Bilingual comparable corpora: not translated texts, but texts with the same topic, size, and style of presentation • Advantages over parallel corpora: • Broad coverage • Easy domain portability • Virtually unlimited number of language pairs • Parallel corpora largely restore existing dictionaries
General approach • Comparable corpora in languages L1 and L2 • Words to be aligned: N1 and N2 • Extract co-occurrence data on N1 and N2 from the respective corpora: V1 and V2 • Create co-occurrence matrices N1×V1, each cell containing f(v,n) or p(v|n) • Create a translation matrix V1×V2 from a bilingual lexicon • Covers equivalences between the core vocabularies only • Each cell encodes a translation probability • Used to map a vector from L1 into the vector space of L2 • Words with the most similar vectors are taken to be equivalent (see the sketch below)
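To make the mapping step concrete, here is a minimal sketch of the approach in Python with numpy; the toy matrices, example words, and the cosine comparison are illustrative assumptions (the experiments below use Jensen-Shannon divergence instead).

```python
import numpy as np

# Co-occurrence matrix for L1 nouns: rows = nouns in N1, columns = context words in V1.
# Each row holds p(v|n), i.e. it sums to 1.
N1_V1 = np.array([
    [0.6, 0.4, 0.0],   # e.g. "maison"
    [0.1, 0.2, 0.7],   # e.g. "guerre"
])

# Translation matrix V1 x V2: cell (i, j) = probability that L1 context word i
# translates as L2 context word j, taken from a seed bilingual lexicon.
V1_V2 = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Map the L1 vectors into the L2 context space.
mapped = N1_V1 @ V1_V2                      # shape: |N1| x |V2|

# Co-occurrence matrix for candidate L2 nouns (rows = N2, columns = V2).
N2_V2 = np.array([
    [0.55, 0.45, 0.0, 0.0],  # e.g. "house"
    [0.05, 0.15, 0.7, 0.1],  # e.g. "war"
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Rank L2 candidates for each mapped L1 vector; the most similar is taken as equivalent.
for i, src in enumerate(mapped):
    sims = [cosine(src, tgt) for tgt in N2_V2]
    print(f"L1 noun {i}: best L2 candidate = {int(np.argmax(sims))}")
```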
Data sparseness • The approach is unreliable on all but very frequent words (e.g., Gaussier et al. 2004) • Polysemy and synonymy: many-to-many correspondences between the two vocabularies • Noise introduced during the translation between vector spaces
Dealing with data sparseness • How can one deal with data sparseness? • Various smoothing techniques exist: Good-Turing, Kneser-Ney, Katz's back-off • Previous comparative studies: • Class-based smoothing (Resnik 1993) • Web-based smoothing (Keller & Lapata 2003) • Distance-based averaging (Pereira et al. 1993; Dagan et al. 1999)
Distance-based averaging • The probability of an unseen co-occurrence, p*(v|n), is estimated from the known probabilities of N', the set of nearest neighbours of n: • p*(v|n) = (1/norm) Σ_{n'∈N'} w(n, n') p(v|n') • where w(n, n') is the weight with which n' influences the average of the known probabilities of N'; w is computed from the distance/similarity between n and n' • norm is a normalisation factor, norm = Σ_{n'∈N'} w(n, n')
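A minimal sketch of the DBA estimate above, assuming co-occurrence data is stored as dictionaries of p(v|n) and that a similarity-derived weight function is supplied; the function and variable names are illustrative, not from the paper.

```python
def dba_estimate(verb, noun, neighbours, cooc, weight):
    """Estimate p*(verb|noun) from the nearest neighbours of `noun`.

    neighbours -- the set N' of nouns most similar to `noun`
    cooc       -- dict: noun -> {verb: p(v|n)}
    weight     -- function (n, n') -> w, derived from distance/similarity
    """
    norm = sum(weight(noun, nb) for nb in neighbours)          # normalisation factor
    if norm == 0.0:
        return 0.0
    return sum(weight(noun, nb) * cooc[nb].get(verb, 0.0)
               for nb in neighbours) / norm
```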
Adjusting probabilities for rare co-occurrences • DBA was originally used to predict unseen probabilities • We would like to predict unseen probabilities and also adjust seen but unreliable ones: • p*(v|n) = (1 − γ) p(v|n) + γ p_DBA(v|n) • 0 ≤ γ ≤ 1: the degree to which the seen probability is smoothed with data on the neighbours • Problem: how does one estimate γ?
Heuristic estimation of γ • The less frequent n is, the more its probability gets smoothed • Corpus counts are log-transformed to downplay the differences between frequent words
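The slide does not give the exact formula, so the sketch below is an assumption consistent with its description: γ grows as the log-transformed frequency of n falls, and the seen probability is interpolated with the DBA estimate to the degree γ.

```python
import math

def heuristic_gamma(freq_n, max_freq):
    """Hypothetical heuristic: gamma in [0, 1], larger for low-frequency nouns;
    log counts downplay differences between frequent words."""
    return 1.0 - math.log(freq_n + 1) / math.log(max_freq + 1)

def smooth_seen(p_seen, p_dba, gamma):
    """Interpolate the corpus-attested probability with the DBA estimate."""
    return (1.0 - gamma) * p_seen + gamma * p_dba
```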
Performance-based estimation of γ • The exact relationship between the corpus frequency of n and γ is determined on held-out pairs • The held-out data are split into frequency ranges • The mean rank of the correct equivalent in each range is computed • A function g(x) is interpolated along the mean-rank points • g(n): the predicted rank for n • RR: the random rank, the mean rank expected by chance
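A sketch of the performance-based estimate under stated assumptions: held-out pairs are binned by source-noun frequency, the mean rank of the correct equivalent is computed per bin, and g(x) is interpolated over those points with numpy. How g(n) and RR are then combined into γ is described only informally on the slide, so that final mapping is left out here.

```python
import numpy as np

def fit_rank_curve(heldout_freqs, heldout_ranks, n_bins=10):
    """Return g: corpus frequency -> predicted mean rank, via piecewise-linear interpolation."""
    freqs = np.asarray(heldout_freqs, dtype=float)
    ranks = np.asarray(heldout_ranks, dtype=float)
    edges = np.quantile(freqs, np.linspace(0.0, 1.0, n_bins + 1))
    centres, means = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs <= hi)   # one frequency range
        if mask.any():
            centres.append(freqs[mask].mean())
            means.append(ranks[mask].mean())   # mean rank of the correct equivalent
    return lambda f: np.interp(f, centres, means)
```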
Less frequent neighbours • Neighbours that are less frequent than n are removed, to avoid "diluting" corpus-attested probabilities
Experimental setup • 6 language pairs: all combinations with English, French, German, and Spanish • Corpora: • EN: WSJ (87-89), Connexor FDG • FR: Le Monde (94-96), Xerox Xelda • GE: die Tageszeitung (87-89, 94-98), Versley • SP: EFE (94-95), Connexor FDG • Extracted verb-direct object pairs from each corpus
Experimental setup • Translation matrices: • Equivalents between verb synsets in EuroWordNet • Translation probabilities distributed equally among the different translations of a source word (sketched below) • Evaluation samples of noun pairs: • 1000 pairs from EWN for each language pair • Sampled from equidistant positions in a sorted frequency list • Divided into 10 frequency ranges • Each noun may have several translations in the sample (on average 1.06 to 1.15 translations)
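A minimal sketch of spreading translation probability equally among the translations of a source verb; the toy lexicon is illustrative, not EuroWordNet data.

```python
# Toy seed lexicon: source verb -> list of target translations (hypothetical entries).
lexicon = {"acheter": ["buy", "purchase"], "manger": ["eat"]}

# Each translation of a source word receives an equal share of the probability mass.
translation_prob = {
    src: {tgt: 1.0 / len(tgts) for tgt in tgts}
    for src, tgts in lexicon.items()
}
# translation_prob["acheter"] == {"buy": 0.5, "purchase": 0.5}
```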
Experimental setup • Assignment algorithm • Pairs each source noun with a correct target noun • Similarity measured using Jensen-Shannon divergence • Kuhn-Munkres algorithm determines the optimal assignment over the entire set (see the sketch below) • Evaluation measure • Mean rank of the correct equivalent
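A sketch of the assignment step under the assumption that scipy is available: Jensen-Shannon divergence gives the cost matrix and scipy's linear_sum_assignment implements the Kuhn-Munkres (Hungarian) algorithm; matrix names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import jensenshannon

def assign_translations(mapped_src, tgt):
    """Pair each source noun with a target noun over the entire set.

    mapped_src -- |N1| x |V2| matrix of source vectors mapped into the L2 space
    tgt        -- |N2| x |V2| matrix of target co-occurrence vectors
    """
    # jensenshannon returns the JS distance (square root of the divergence),
    # which preserves the ordering needed for the assignment.
    cost = np.array([[jensenshannon(s, t) for t in tgt] for s in mapped_src])
    cost = np.nan_to_num(cost, nan=1.0)        # guard against all-zero vectors
    rows, cols = linear_sum_assignment(cost)   # Kuhn-Munkres / Hungarian algorithm
    return list(zip(rows, cols))
```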
Discarding less frequent neighbours: significant reduction in Mean Rank for Fr-Ge, Fr-Sp, Ge-Sp
Heuristic estimation of γ: significant reduction in Mean Rank for all language pairs
Performance-based estimation of γ: significant reduction in Mean Rank for all language pairs
Conclusions • Smoothing co-occurrence data on rare words with intra-language similarities improves retrieval of their translational equivalents • Two extensions of DBA for smoothing rare co-occurrences: • Heuristic (the amount of smoothing is a linear function of frequency) • Performance-based (the smoothing function is estimated on held-out data) • Both lead to considerable improvements: • a reduction of up to 48 ranks (from 146 to 99, 32%) in the low-frequency ranges • a reduction of up to 27 ranks (from 81 to 54, 33%) overall