Multilingual Information Retrieval using GHSOM

Hsin-Chang Yang, Associate Professor, Department of Information Management, National University of Kaohsiung

Outline • Introduction • Document Processing and Clustering by GHSOM • Association Discovery and MLIR • Experimental Result • Conclusions

Introduction • Most search engines provide only a monolingual search interface. • It would be convenient for users to express their queries in a familiar language and search documents in other languages. • Cross-lingual or multilingual information retrieval • How to achieve this?

Introduction • Translate the queries or the documents into another language • Easy and convenient • Some kind of machine translation engine must be used • Imprecise, even with modern machine translation systems • Match queries and documents directly • Direct match of semantics • Matching semantics is difficult; schemes for discovering semantic relatedness between languages are needed

Introduction • Multilingual text mining (MLTM) • Discovering semantic relationships between linguistic entities of different languages • In this work, we develop an MLTM scheme based on GHSOM and apply it to the multilingual information retrieval (MLIR) task.

System Architecture (figure) • MLTM process: Chinese documents and English documents from parallel corpora are preprocessed into Chinese and English document vectors, trained by GHSOM into a hierarchy of Chinese documents and a hierarchy of English documents, and passed to association discovery, which yields document associations, keyword associations, and document/keyword associations. • MLIR process: a query is matched against the document/keyword associations to produce the retrieval result.

Document Processing and Clustering by GHSOM • GHSOM was proposed by Rauber et al. to provide the SOM with capabilities of dynamic map expansion and hierarchy construction. • We used GHSOM to organize multilingual documents into hierarchies.

Document Processing and Clustering by GHSOM • A typical structure of GHSOM (figure: Layer 0 through Layer 3)

Document Processing and Clustering by GHSOM • Document preprocessing • word segmentation • stemming • stopword elimination • keyword selection • Document encoding • A document Dj is encoded into a vector Dj = {tf-idf_ij}, 1 ≤ i ≤ |V|, where V denotes the vocabulary.
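The encoding step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slides do not say which tf-idf variant was used, so raw term counts and a natural-log idf are assumptions, and the function name `tfidf_vectors` and the toy tokens are hypothetical. Tokens are assumed to be already segmented, stemmed, and stopword-filtered.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Encode each tokenized document as a tf-idf vector over the vocabulary V.

    Assumption: tf is the raw term count and idf is ln(N / df); the slides
    do not specify the exact weighting.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors


# Toy corpus (hypothetical tokens).
docs = [["trade", "island", "trade"], ["island", "culture"], ["culture", "music"]]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` has |V| entries, one per vocabulary term, which is the form of input a SOM-style trainer expects.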

Document Processing and Clustering by GHSOM • Document clustering • Document vectors were trained by GHSOM. • Two hierarchies were constructed, one for English documents and one for Chinese documents. • Document labelling (figure): a Chinese hierarchy with clusters C1 to C5 and Ck, and an English hierarchy with clusters E1 to E5, Ep, and Eq; each neuron holds a document cluster and a keyword cluster (k1, k2, k5).

Association Discovery • The constructed hierarchies reveal document and keyword associations for individual languages. • However, associations between documents or keywords of different languages are much more difficult to find because there is no direct mapping between these hierarchies.

Association Discovery • Finding associations • to associate a Chinese keyword cluster with an English keyword cluster • an instance of the general ontology alignment problem • A Chinese keyword cluster is considered to be related to an English one if they represent the same theme. • the theme of a keyword cluster can be determined by the documents labelled to the same neuron as it

Association Discovery • Thus we could associate two clusters according to their corresponding document clusters. • parallel corpora were used • the correspondence between documents of different languages is known a priori • To associate a Chinese cluster Ck with some English cluster El, we use a voting scheme to calculate the likelihood of such association.

Association Discovery • Vote for best-matched cluster • 1. For each pair of Chinese documents Ci and Cj in Ck, find the clusters to which their English counterparts Ei and Ej are labelled in the English hierarchy. Let these clusters be Ep and Eq. • 2. Find the shortest path between Ep and Eq in the English hierarchy. • 3. Add 1 to both Ep and Eq. Add 1/(dist(Ci, Cj)-1) to all other clusters in the path. • 4. Repeat steps 1-3 for all pairs of documents in Ck.
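The voting steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the English hierarchy is treated as a tree stored as child-to-parent links, dist is read as the number of edges on the shortest path between Ep and Eq (so the intermediate clusters share a total vote of 1), and a pair whose counterparts fall in the same cluster adds 1 twice to that cluster. The names `path_between`, `vote`, and the toy cluster labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def path_between(parent, a, b):
    """Shortest path from a to b in a tree given child -> parent links."""
    chain_a = [a]
    while chain_a[-1] in parent:
        chain_a.append(parent[chain_a[-1]])
    position = {node: i for i, node in enumerate(chain_a)}
    # Climb from b until we reach a node on a's chain to the root.
    tail = [b]
    while tail[-1] not in position:
        tail.append(parent[tail[-1]])
    meet = tail[-1]
    return chain_a[:position[meet]] + tail[::-1]


def vote(chinese_cluster, counterpart_cluster, parent):
    """Score English clusters as candidates for association with C_k.

    chinese_cluster: ids of the Chinese documents labelled to C_k.
    counterpart_cluster: Chinese doc id -> English cluster that its
        translation is labelled to (known from the parallel corpus).
    """
    scores = defaultdict(float)
    for ci, cj in combinations(chinese_cluster, 2):
        ep, eq = counterpart_cluster[ci], counterpart_cluster[cj]
        path = path_between(parent, ep, eq)
        scores[ep] += 1.0
        scores[eq] += 1.0
        dist = len(path) - 1  # assumption: dist counts edges on the path
        for node in path[1:-1]:  # non-empty only when dist >= 2
            scores[node] += 1.0 / (dist - 1)
    return dict(scores)


# Toy English hierarchy (hypothetical cluster names).
parent = {"A": "root", "B": "root", "A1": "A", "A2": "A", "B1": "B"}
counterparts = {"c1": "A1", "c2": "A1", "c3": "A2"}
scores = vote(["c1", "c2", "c3"], counterparts, parent)
best = max(scores, key=scores.get)
```

In this toy run the cluster `A1` collects the most votes, so it would be the one associated with the Chinese cluster.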

Association Discovery • We associate Ck with El when El has the highest score. • An example (figure: English hierarchy with cluster scores 0.83, 2, 1.33, 0.83, 2, 2, 0)

Association Discovery • Document associations • Chinese document Ci is associated with English document Ej if their corresponding clusters are associated. • Chinese document ↔ English document • Keyword associations • A Chinese keyword labelled to neuron k in the Chinese hierarchy will be associated with an English keyword labelled to neuron l in the English hierarchy if Ck and El are associated. • Chinese keyword ↔ English keyword

Association Discovery • Document-keyword associations • When Ck is associated with El, all documents and keywords labelled to these two neurons are associated. • Chinese document ↔ English keyword • English keyword ↔ Chinese document
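Once cluster associations are known, the three kinds of association above follow mechanically from the neuron labels. The sketch below assumes each hierarchy is summarized as dictionaries mapping a cluster id to the documents and keywords labelled to it; the function name `expand_associations` and all ids are hypothetical.

```python
from itertools import product


def expand_associations(cluster_pairs, zh_docs, zh_kws, en_docs, en_kws):
    """Derive cross-lingual associations from associated cluster pairs.

    cluster_pairs: Chinese cluster id -> associated English cluster id.
    zh_docs / zh_kws / en_docs / en_kws: cluster id -> labelled items.
    Returns three sets of pairs: (zh_doc, en_doc), (zh_kw, en_kw),
    and (document, keyword) across languages.
    """
    doc_pairs, kw_pairs, doc_kw_pairs = set(), set(), set()
    for ck, el in cluster_pairs.items():
        doc_pairs.update(product(zh_docs.get(ck, []), en_docs.get(el, [])))
        kw_pairs.update(product(zh_kws.get(ck, []), en_kws.get(el, [])))
        # Documents of one language paired with keywords of the other.
        doc_kw_pairs.update(product(zh_docs.get(ck, []), en_kws.get(el, [])))
        doc_kw_pairs.update(product(en_docs.get(el, []), zh_kws.get(ck, [])))
    return doc_pairs, kw_pairs, doc_kw_pairs


# Toy labels (hypothetical).
cluster_pairs = {"C1": "E3"}
zh_docs, zh_kws = {"C1": ["zh_d1"]}, {"C1": ["zh_k1"]}
en_docs, en_kws = {"E3": ["en_d1", "en_d2"]}, {"E3": ["en_k1"]}
doc_pairs, kw_pairs, doc_kw_pairs = expand_associations(
    cluster_pairs, zh_docs, zh_kws, en_docs, en_kws
)
```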

MLIR application • The documents associated with a query keyword q ∈ Q are retrieved according to the document-keyword associations. • Ranking: SR(q,Dj) = SC(q,Dj)SK(q,Dj) • takes into account the importance of q in a cluster as well as in a document

The Ranking • SC(q,Dj): cluster score, measures the importance of the cluster that Dj belongs to • Eq is the English cluster associated with Cq, the Chinese cluster that q is associated with. • EDj is the document cluster associated with Dj in the English hierarchy • The distance term over (Eq, EDj) measures the shortest path length between Eq and EDj

The Ranking • SK(q,Dj): document score, measures the importance of q in Dj • the value of the element corresponding to q in the document vector of Dj, i.e. its tf-idf weight in Dj • The ranking score of a Chinese document in response to an English query keyword is calculated in the same way by exchanging the languages of the query and document.
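A small sketch of the ranking step follows. Two points are assumptions: the ranking score is read as the product of the cluster score and the document score, and the exact form of the cluster score does not survive in this transcript, so it is passed in as a function of the shortest path length. The `1 / (1 + d)` form used in the example is purely illustrative, and the names `rank_documents` and `illustrative_sc` are hypothetical.

```python
def rank_documents(candidates, cluster_score):
    """Rank candidate documents for one query keyword q.

    candidates: list of (doc_id, path_length, keyword_weight), where
        path_length is the shortest path length between E_q and E_Dj and
        keyword_weight is the element for q in the document vector of Dj.
    cluster_score: function mapping path_length to SC (form assumed).
    """
    scored = [
        (doc_id, cluster_score(path_length) * keyword_weight)
        for doc_id, path_length, keyword_weight in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def illustrative_sc(path_length):
    """Hypothetical cluster score: closer clusters score higher."""
    return 1.0 / (1.0 + path_length)


candidates = [("D1", 0, 0.4), ("D2", 2, 0.9), ("D3", 1, 0.5)]
ranked = rank_documents(candidates, illustrative_sc)
```

Passing the cluster score as a parameter keeps the sketch usable with whatever distance-based form the method actually defines.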

Experimental Result • Sinorama parallel corpora were used • Each Chinese article was faithfully translated into English • Our corpus contains 10672 parallel documents. • We have a Chinese vocabulary of size 12941 and an English vocabulary of size 13723. • Each document is transformed into a vector. • We used the GHSOM program developed by Rauber's team to train the bilingual vectors. • http://www.ifs.tuwien.ac.at/~andi/ghsom/

Experimental Result • An example Sinorama document

Experimental Result

Experimental Result • We developed a simple search engine to evaluate the performance of our method in MLIR. • Performance evaluation is based on the classic recall and precision measures. • 31 query words: 19 Chinese and 12 English • Relevant documents for query word q • documents labelled to either Cq or Eq
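For reference, the classic precision and recall measures for a single query word can be computed as below. The document ids are made up for illustration and do not come from the experiment; the function name `precision_recall` is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Classic set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance set for one query word.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```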

Experimental Result

Conclusions • We proposed a text mining method to extract associations between multilingual texts and keywords. • GHSOM performs well in clustering and organizing documents. • The discovered associations seem plausible for MLIR and other MLTM applications.

Thanks for your attention.
