XRCE at ImageCLEF 07
Stephane Clinchant, Jean-Michel Renders and Gabriela Csurka
Xerox Research Centre Europe, France
Outline • Problem statement • Image Similarity • Text Similarity • Fusion between text and image • Cross-Media Similarities • Experimental results • Conclusion
Problem Statement [Diagram omitted: query image(s) + query text are matched against a cross-media database of images with text via (1) image similarity, (2) text similarity and (3) cross-media similarity, yielding ranked documents] • Problem: • Retrieve relevant images from a cross-media database (images with text), given a set of query images and a query text • Proposed solution: • Rank the images in the database based on image similarity (1), text similarity (2) and cross-media similarities (3)
Image Similarity • The goal is to define an image similarity measure that “best” reflects the “semantic” similarity between images. • [Example image pairs omitted: sim(·, ·) should be larger for a semantically similar pair than for a dissimilar one] • Our proposed solution (detailed in the next slides) is to: • consider both local color and local texture features • build a generative model (GMM) in the low-level feature space • represent the image based on the Fisher kernel principle • define a similarity measure between Fisher Vectors
Fisher Vector • Given a generative model $p(X|\lambda)$ with parameters $\lambda$ (GMM), • the gradient vector $\nabla_\lambda \log p(X|\lambda)$, • normalized by the Fisher information matrix $F_\lambda = E_X\!\left[\nabla_\lambda \log p(X|\lambda)\, \nabla_\lambda \log p(X|\lambda)^T\right]$, • leads to a unique “model-dependent” representation of the image, called the Fisher Vector: $f_X = F_\lambda^{-1/2}\, \nabla_\lambda \log p(X|\lambda)$ • As similarity between Fisher vectors, the L1 norm was used: $sim(f_X, f_Y) = -\|f_X - f_Y\|_1$ (see the sketch below) Fisher Kernels on Visual Vocabularies for Image Categorization, F. Perronnin and C. Dance, CVPR 2007.
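A minimal sketch of this representation, under stated assumptions: scikit-learn's GaussianMixture stands in for the paper's visual vocabulary, the gradient is taken with respect to the GMM means only (a simplification; Perronnin and Dance also use mixture weights and variances), and the Fisher information is approximated by its usual diagonal closed form. All names and sizes are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(gmm, descriptors):
    """Gradient of the image log-likelihood w.r.t. the GMM means,
    normalized by a diagonal approximation of the Fisher information."""
    X = np.atleast_2d(descriptors)                 # (N, D) local descriptors
    gamma = gmm.predict_proba(X)                   # (N, K) soft assignments
    diff = X[:, None, :] - gmm.means_[None, :, :]  # (N, K, D)
    grad = np.einsum('nk,nkd->kd', gamma, diff / gmm.covariances_)
    # Diagonal Fisher information for the means: N * w_k / sigma_k^2
    fisher_diag = X.shape[0] * gmm.weights_[:, None] / gmm.covariances_
    return (grad / np.sqrt(fisher_diag)).ravel()

def l1_similarity(fv1, fv2):
    # Negative L1 distance: higher means more similar.
    return -np.abs(fv1 - fv2).sum()

# Usage: fit a small visual vocabulary, then compare two images.
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=8, covariance_type='diag',
                      random_state=0).fit(rng.normal(size=(1000, 16)))
fv_a = fisher_vector_means(gmm, rng.normal(size=(200, 16)))
fv_b = fisher_vector_means(gmm, rng.normal(size=(150, 16)))
print(l1_similarity(fv_a, fv_b))
```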
Text Similarity • The text is first pre-processed, including: • tokenization, lemmatization, word decompounding and stop-word removal • The text is modeled by a multinomial language model and smoothed via the Jelinek-Mercer method: $p(w|d) = (1-\lambda)\, p_{ML}(w|d) + \lambda\, p_{ML}(w|C)$ • where $p_{ML}(w|d) \propto \#(w,d)$ and $p_{ML}(w|C) \propto \sum_d \#(w,d)$ • The textual similarity between two documents is defined by the cross-entropy function: $sim_{TXT}(q,d) = \sum_w p_{ML}(w|q)\, \log p(w|d)$ (a sketch follows below)
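A minimal sketch of the Jelinek-Mercer smoothed language model and the cross-entropy similarity above, assuming pre-processing has already produced token lists; the weight lam = 0.5 and the toy texts are illustrative, not the settings of the actual runs.

```python
import math
from collections import Counter

def lm(tokens):
    """Maximum-likelihood multinomial language model."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jm_smooth(p_doc, p_corpus, lam=0.5):
    """p(w|d) = (1 - lam) * p_ML(w|d) + lam * p_ML(w|C)."""
    return lambda w: (1 - lam) * p_doc.get(w, 0.0) + lam * p_corpus.get(w, 0.0)

def cross_entropy_sim(query_tokens, p_doc):
    """sim(q, d) = sum_w p_ML(w|q) * log p(w|d); higher is more similar."""
    return sum(p * math.log(p_doc(w)) for w, p in lm(query_tokens).items()
               if p_doc(w) > 0.0)

docs = [["mountain", "lake", "snow"], ["city", "night", "lights"]]
p_corpus = lm([w for d in docs for w in d])
query = ["snow", "mountain"]
print([cross_entropy_sim(query, jm_smooth(lm(d), p_corpus)) for d in docs])
```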
Enriching the text using an external corpus • Reason: the texts related to the images in the corpus are poor (title only). • How: each “text” in the corpus was enriched as follows: • For each term in the document, we add related terms based on a clustered usage analysis of an external corpus • The external corpus was the Flickr image database • The relationship between terms was based on the frequency of their co-occurrence as “tags” for the same image in Flickr (top-5 examples omitted; see the sketch below)
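A minimal sketch of the co-occurrence-based enrichment, assuming the external corpus is available as a list of per-image tag sets; the cutoff k and the toy tags are illustrative.

```python
from collections import Counter
from itertools import permutations

def tag_cooccurrence(tag_sets):
    """Count, for each tag, how often every other tag appears on the same image."""
    co = {}
    for tags in tag_sets:
        for a, b in permutations(set(tags), 2):
            co.setdefault(a, Counter())[b] += 1
    return co

def enrich(doc_terms, co, k=5):
    """Append the top-k co-occurring tags for each document term."""
    related = [w for t in doc_terms for w, _ in co.get(t, Counter()).most_common(k)]
    return doc_terms + related

flickr = [["beach", "sea", "sand"], ["sea", "boat", "harbour"], ["beach", "sea", "sun"]]
print(enrich(["sea"], tag_cooccurrence(flickr), k=2))
```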
Fusion between image and text • Early fusion: • simple concatenation of image and text features (e.g. bag-of-words and bag-of-visual-words) • estimating their co-occurrences or joint probabilities (Mori et al., Vinokourov et al., Duygulu et al., Blei et al., Jeon et al., etc.) • Late fusion: • simply combining the scores of mono-media searches (Maillot et al., Clinchant et al.) • Intermediate-level fusion: • relevance models (Jeon et al.) • trans-media (or inter-media) feedback (Maillot et al., Chang et al.)
Intermediate-level fusion • Compute mono-media similarities between an aggregate of objects coming from a first retrieval step and a multimodal object. • Use the duality of the data to switch media during the feedback process: • Pseudo-feedback: top-N ranked images based on image similarity • Aggregate the textual information of these images • Final rank: documents re-ranked based on textual similarity (see the sketch below)
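A minimal sketch of this media-switching loop; sim_img and sim_txt are placeholders for the Fisher-vector and language-model similarities defined earlier, and the corpus is assumed to be a list of records with "image" and "text" fields.

```python
def transmedia_retrieve(query_img, query_txt, corpus, sim_img, sim_txt, n=10):
    """Image-based pseudo-feedback followed by text-based re-ranking."""
    # Step 1: first retrieval -- top-N documents by visual similarity.
    by_image = sorted(corpus, key=lambda d: sim_img(query_img, d["image"]),
                      reverse=True)
    # Step 2: switch media -- aggregate the textual side of the feedback set.
    aggregate = " ".join([d["text"] for d in by_image[:n]] + [query_txt])
    # Step 3: final ranking by textual similarity to the aggregate.
    return sorted(corpus, key=lambda d: sim_txt(aggregate, d["text"]),
                  reverse=True)
```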
Aggregating information from pseudo-feedback • Aim: • Compute similarities between an aggregate of objects $N_{img}(q)$, corresponding to a first retrieval for query $q$, and a new multimodal object $u$ in the corpus • where $N_{img}(q) = \{T(I_1), T(I_2), \ldots, T(I_N)\}$ and $T(I_k)$ is the textual part of the $k$-th image $I_k$ in the (pseudo-)feedback group based on image similarity • Possible solutions: • Direct concatenation: aggregate (concatenate) the $T(I_k)$, $k = 1, \ldots, N$, into a single object and compute the text similarity between it and $T(u)$. • Trans-media document re-ranking: aggregate all similarity measures between pairs of objects. • Complementary (or inter-media) feedback: use a pseudo-feedback algorithm to extract the relevant features of $N_{img}(q)$ and use them to compute the similarity with $T(u)$.
Trans-media document re-ranking • We define the following similarity measure between an aggregate of objects $N_{img}(q)$ and a multimodal object $u$: $sim_{TM}(q, u) = \sum_{v \in N_{img}(q)} sim_{IMG}(q, I(v)) \cdot sim_{TXT}(T(v), T(u))$ (see the sketch below) • Notes: • This approach can be seen as a document re-ranking method rather than a query expansion mechanism. • The values $sim_{TXT}(T(u), T(v))$ can be pre-computed offline if the corpus is of reasonable size. • By duality, we can invert the roles of images and text: $sim_{TM}'(q, u) = \sum_{v \in N_{txt}(q)} sim_{TXT}(q, T(v)) \cdot sim_{IMG}(I(v), I(u))$
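A minimal sketch of the score as reconstructed above: each feedback image votes for document u with its textual similarity, weighted by its visual similarity to the query. The similarity functions and the feedback pairs are placeholders.

```python
def transmedia_score(query_img, u_text, feedback, sim_img, sim_txt):
    """feedback: (image, text) pairs of the top-N image-based results.
    Each pair contributes its text similarity to u, weighted by its
    image similarity to the query."""
    return sum(sim_img(query_img, img) * sim_txt(txt, u_text)
               for img, txt in feedback)
```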
Complementary Feedback • We derive a language model $\theta_F$ for the “relevance concept” from the text set $F = N_{img}(q)$: • $\theta_F$ is assumed to be multinomial (peaked at relevant terms), estimated by EM from: $\log p(F|\theta_F) = \sum_w \#(w, F)\, \log\big( (1-\alpha)\, p(w|\theta_F) + \alpha\, p(w|C) \big)$  (1) • where $p(w|C)$ is the word probability built upon the corpus, and $\alpha$ (= 0.5) is a fixed parameter (see the sketch below). • The similarity between $N_{img}(q)$ and $T(u)$ is given by the cross-entropy similarity between $\theta_F$ and $T(u)$; alternatively, we can first interpolate $\theta_F$ with the query text: $\theta = \beta\, \theta_F + (1-\beta)\, \theta_{q_{txt}}$ • Notes: • $\beta$ (= 0.5 in our experiments) can be seen as a mixing weight between image and text. • Unlike the trans-media re-ranking method, it needs a second retrieval step. • We can invert the roles of images and text if we use Rocchio's method instead of (1). A Study of Smoothing Methods for Language Models Applied to Information Retrieval, Zhai and Lafferty, SIGIR 2001.
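A minimal sketch of the EM estimation of theta_F under the mixture model (1), with alpha = 0.5; the uniform initialization and the fixed iteration count are assumptions.

```python
from collections import Counter

def feedback_model(feedback_tokens, p_corpus, alpha=0.5, iters=20):
    """Estimate the relevance model theta_F by EM under the mixture
    (1 - alpha) * p(w|theta_F) + alpha * p(w|C)."""
    counts = Counter(feedback_tokens)
    theta = {w: 1.0 / len(counts) for w in counts}   # uniform initialization
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from theta_F.
        t = {w: (1 - alpha) * theta[w]
                / ((1 - alpha) * theta[w] + alpha * p_corpus.get(w, 1e-9))
             for w in counts}
        # M-step: re-estimate theta_F from the fractional counts.
        z = sum(counts[w] * t[w] for w in counts)
        theta = {w: counts[w] * t[w] / z for w in counts}
    return theta  # peaked at terms frequent in F but rare in the corpus
```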
XRCE’s ImageCLEF Runs • LM: language model with cross-entropy • FV+L1: Fisher Vector with L1 norm • FLR: text enriched with Flickr tags • TR: trans-media re-ranking • CF: complementary feedback • Ri: run number i • QT: query translation
Conclusion • Our image similarity measure (L1 norm on Fisher Vectors) seems to be quite suitable for CBIR. • It was the second-best “visual only” system, and unlike the first system it did not use any query expansion (nor feedback). • Combining it with text similarity within an “intermediate-level fusion” allowed for a significant improvement. • Mixing the modalities increased performance by about 50% (relative) over mono-media (pure text or pure image) systems. • Three of our six proposed cross-media systems were the best three “Automatic Mixed Runs”. • The system performed well even when the query and the corpus were in different languages (English versus German).