
Language Translation and Media Transformation in Cross-Language Image Retrieval



  1. Natural Language Processing Lab (自然語言處理實驗室), Graduate Institute of Computer Science and Information Engineering, National Taiwan University. Language Translation and Media Transformation in Cross-Language Image Retrieval. Hsin-Hsi Chen (陳信希), Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan. E-mail: hhchen@csie.ntu.edu.tw

  2. Outline • What is Cross-Language Image Retrieval? • A Trans-Media Dictionary Approach • A Media-Mapping Approach • Summary

  3. What is Cross-Language Image Retrieval? • Allows users to employ textual queries (in one language) and example images (in one medium) to access an image database with text descriptions (in another language/medium) • Two languages (the query language and the document language) and two media (text and image) are used to express the information need and the data collection

  4. Why is Cross-Language Image Retrieval important? • Large numbers of images associated with captions, metadata, Web page links, and so on are available on the Web • Users often use the languages they are familiar with to annotate images and to express their information needs

  5. Cross-Language Image Retrieval • A form of cross-language, cross-medium information retrieval • Queries and documents are in different languages and media • Requires language translation + medium transformation

  6. A Trans-Media Dictionary Approach

  7. Construction of a Trans-Media Dictionary • First alternative • translating the text description through a bilingual dictionary • then mining the relationship between textual terms and visual terms • Second alternative • mining the relationship between textual terms and visual terms directly without translation

  8. Mine relationships between text and images • Divide an image into several smaller parts • Link the words in the caption to the corresponding parts • Analogous to word alignment in a sentence-aligned parallel corpus • Build a trans-media dictionary

  9. How to represent an image - Blobs 1. Use Blobworld to segment an image into regions. 2. Partition the regions of all images into 2,000 clusters by the K-means clustering algorithm. 3. Each cluster is assigned a unique number, i.e., a blob token. 4. Each image is represented by the blob tokens of the clusters that its regions belong to. 5. Treat blobs as a language in which each blob token is a word, and use a text retrieval system to index and retrieve images using the blob language.
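The steps above can be sketched in code. This is a toy illustration, not the actual pipeline: it skips Blobworld segmentation (region feature vectors are assumed to be given), uses a minimal NumPy k-means with 3 clusters instead of 2,000, and the function names (`kmeans`, `blob_tokens`) are hypothetical.

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Minimal k-means: cluster region feature vectors into k blob clusters."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # assign each region to its nearest centroid
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster is empty)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return centroids, labels

def blob_tokens(image_region_features, centroids):
    """Represent one image as 'blob words' B01, B02, ... by nearest cluster."""
    dists = np.linalg.norm(
        image_region_features[:, None, :] - centroids[None, :, :], axis=2)
    return ["B%02d" % (j + 1) for j in dists.argmin(axis=1)]

# toy demo: 3 well-separated region clusters instead of 2,000
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(c, 0.1, size=(20, 4)) for c in (0.0, 5.0, 10.0)])
centroids, _ = kmeans(feats, k=3)
# an "image" whose three regions are identical maps to one repeated blob word
tokens = blob_tokens(np.tile(feats[0], (3, 1)), centroids)
```

Once every image is a bag of blob tokens, a standard text retrieval engine can index them exactly as it indexes words, which is the point of step 5.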

  10. An Example This image is represented as 9 Blob words: B01 B02 B02 B03 B03 B04 B04 B04 B04

  11. Learning relations between textual and visual information - Mutual Information • p(x) is the occurrence probability of word x in the text descriptions • p(y) is the occurrence probability of blob y in the image blobs • p(x,y) is the probability that x and y occur in the same image • MI(x,y) = log [ p(x,y) / (p(x) p(y)) ] • For a word wj, we generate the related blobs whose MI values exceed a threshold • The generated blobs are regarded as the visual representation of wj, giving a trans-media (word-blob) dictionary
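The MI computation above can be sketched over a toy training set of (caption words, blob tokens) pairs. The probabilities are estimated from per-image co-occurrence counts, as the slide describes; the function names and the tiny example data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Pointwise MI between caption words and blob tokens.
    pairs: list of (words, blobs), one entry per training image."""
    n = len(pairs)
    word_df, blob_df, joint = Counter(), Counter(), Counter()
    for words, blobs in pairs:
        for w in set(words):
            word_df[w] += 1
        for b in set(blobs):
            blob_df[b] += 1
        for w in set(words):
            for b in set(blobs):
                joint[(w, b)] += 1          # x and y occur in the same image
    mi = {}
    for (w, b), c in joint.items():
        p_xy = c / n
        p_x, p_y = word_df[w] / n, blob_df[b] / n
        mi[(w, b)] = math.log(p_xy / (p_x * p_y))
    return mi

def visual_representation(word, mi, threshold=0.0):
    """Blobs whose MI with `word` exceeds the threshold: the dictionary entry."""
    return sorted(b for (w, b), v in mi.items() if w == word and v > threshold)

# toy trans-media training data: three captioned images
pairs = [(["mare", "field"], ["B01", "B02"]),
         (["hill"], ["B03"]),
         (["mare"], ["B01"])]
mi = mutual_information(pairs)
mare_blobs = visual_representation("mare", mi)
```

Blobs that co-occur with a word more often than chance get positive MI and end up in that word's trans-media dictionary entry.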

  12. Learning Correlation • Example image: "Mare and foal in field, slopes of Clatto Hill, Fife" • [Figure: the image is segmented into regions represented by blob tokens B01-B04; the caption words (hill, mare, foal, field, slope) are aligned with those blobs]

  13. Flow of cross-language image retrieval • [Flow diagram] A source-language textual query is processed along two paths: • Query translation (using language resources) produces a target-language textual query, which searches the textual index of the target collection's image captions (text-based image retrieval) • Query transformation (using text-image correlations learned from a training collection of images and captions, stored in the trans-media dictionary) produces a visual query, which searches the visual index (content-based image retrieval) • The two result lists are merged to produce the retrieved images
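The final "result merging" step of the flow can be illustrated with a simple linear score fusion. The slide does not specify the merging scheme, so this weighted sum, the `alpha` parameter, and the function name are all assumptions for illustration.

```python
def merge_results(text_scores, visual_scores, alpha=0.7):
    """Linear fusion of text-based and content-based retrieval scores.
    alpha weights the textual run; images missing from a run score 0."""
    images = set(text_scores) | set(visual_scores)
    merged = {img: alpha * text_scores.get(img, 0.0)
                   + (1 - alpha) * visual_scores.get(img, 0.0)
              for img in images}
    # return image ids ranked by fused score, best first
    return sorted(merged, key=merged.get, reverse=True)

# toy runs: img1 only in the textual run, img3 only in the visual run
ranking = merge_results({"img1": 1.0, "img2": 0.5},
                        {"img2": 1.0, "img3": 0.9})
```

With `alpha=0.7` the textual evidence dominates, so a strong textual match outranks a strong visual-only match.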

  14. A Media-Mapping Approach

  15. Test Data from ImageCLEF - St. Andrews Image Collection • Total 28,133 photographs from St. Andrews University Library's photographic collection • All images are accompanied by a textual description written in English • The <HEADLINE> and <CATEGORIES> fields of the English captions are used for indexing • The Okapi IR system is used to build both the textual and visual indices • The test sets contain 25 and 28 topics in the 2004 and 2005 sets, respectively

  16. Example - An image and Its Description

  17. Example - A Topic in ImageCLEF2004 <top> <num> Number: 2 </num> <title> Photos of Rome taken in April 1908 </title> <narr> Any view of Rome including buildings and specific locations (e.g. the coliseum) taken in April 1908 is relevant. Pictures by any photographer are relevant, but taken at any other time are not relevant. </narr> </top> <top> <num> Number: 2 </num> <title> 1908年四月拍攝的羅馬照片 </title> </top> (the second topic is the Chinese version of the same title)

  18. Experiments on Trans-Media Dictionary Approach

  19. Analysis - Monolingual Image Retrieval • Using nouns only, with a higher threshold and more blobs, gives better performance • Using verbs or adjectives only performs similarly across different settings of n and t • Performance drops when named entities are added

  20. Experiment - Cross-Language Image Retrieval • The query language is Chinese and the target language is English • Chinese queries are segmented and tagged, and named entities are identified by Chinese NER tools • For each Chinese query term, its translation equivalents are found in a Chinese-English bilingual dictionary • For named entities not included in the dictionary, a similarity-based backward transliteration scheme is adopted • To learn the correlations between Chinese words and blob tokens, the image captions are translated into Chinese by the SYSTRAN system
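The dictionary-lookup step above can be sketched as follows. This is a minimal illustration: the tiny bilingual dictionary is invented, and the backward-transliteration step is represented only by a hypothetical `transliterate` callback, since the actual similarity-based scheme is not detailed on the slide.

```python
def translate_query(terms, bilingual_dict, transliterate=None):
    """Dictionary-based query translation. Terms found in the dictionary
    expand to all their translation equivalents; unknown named entities
    fall back to a (hypothetical) backward-transliteration function."""
    translated = []
    for term in terms:
        if term in bilingual_dict:
            # keep every translation equivalent, as in dictionary-based CLIR
            translated.extend(bilingual_dict[term])
        elif transliterate is not None:
            translated.append(transliterate(term))
        else:
            translated.append(term)  # pass through untranslated
    return translated

# toy dictionary for the example topic "1908年四月拍攝的羅馬照片"
toy_dict = {"羅馬": ["Rome"], "照片": ["photo", "photograph"]}
query = translate_query(["羅馬", "照片"], toy_dict)
```

Keeping all translation equivalents (rather than picking one) is a common dictionary-based CLIR choice; ambiguity is then left to the retrieval model to resolve.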

  21. Analysis - Cross-Language Image Retrieval • Using nouns only to generate the visual query performs better than using verbs or adjectives only • The improvement in retrieval performance is not as large as in the monolingual experiment; one reason is translation errors in the training data

  22. Upper bound - Ideal Visual Query • The performance of image retrieval depends on image segmentation and blob clustering • An ideal-visual-query experiment is set up to see how far our visual query generation could be improved • Use the χ² score to select blobs from the relevant images • For each query, choose 10 blobs whose χ² scores are larger than 7.88 • The selected blobs form a visual query to retrieve images • The retrieval result is combined with that of the textual query
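The χ² selection above can be sketched with the standard 2x2 contingency-table formula (blob present/absent vs. image relevant/not). The cutoff 7.88 is the χ² critical value for p < 0.005 with 1 degree of freedom. The function names and the toy counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square for a 2x2 table:
    a = relevant images with the blob,    b = non-relevant with the blob,
    c = relevant images without the blob, d = non-relevant without the blob."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def select_blobs(blob_counts, k=10, cutoff=7.88):
    """Pick up to k blobs whose chi-square score exceeds the cutoff."""
    scored = {blob: chi_square(*cells) for blob, cells in blob_counts.items()}
    keep = [b for b, s in sorted(scored.items(), key=lambda x: -x[1])
            if s > cutoff]
    return keep[:k]

# toy counts: B01 is concentrated in relevant images, B02 is uninformative
selected = select_blobs({"B01": (20, 1, 5, 50), "B02": (5, 5, 5, 5)})
```

A blob spread evenly across relevant and non-relevant images scores 0 and is dropped, so the ideal visual query keeps only blobs that discriminate the relevant set.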

  23. Experiments on Media-Mapping Approach

  24. Experiment Results at ImageCLEF2005 • [Table: combined runs improve over their components by +15.78%, +25.96%, and +11.01%] • Performance of EE+EX > CE+EX • Ranking: EE > EX > CE > visual run

  25. Lessons Learned • Combining textual and visual information can improve performance • Compared to the initial visual retrieval, average precision increases from 8.29% to 34.25% after the feedback cycle

  26. Summary • An approach combining textual and image features is proposed for Chinese-English image retrieval → a corpus-based feedback cycle from CBIR • Compared with the performance of monolingual IR (0.3952), integrating visual and textual queries achieves better performance in cross-language image retrieval (0.3977) → resolves part of the translation errors • The integration of visual and textual queries also improves monolingual IR performance from 0.3952 to 0.5053 → provides more information • The improvement is the best among all the groups → 78.2% of the best monolingual text retrieval

  27. Thanks!!
