Supervised Multimodal Image Retrieval with SVMs for Enhanced Ranking
This research presents a novel approach to improving image ranking by combining visual, semantic, and geographic information. The proposed method uses SVMs to learn a multimodal similarity measure that integrates visual features, annotated concepts, and GPS coordinates. Experimental results demonstrate that this supervised machine learning approach enhances retrieval accuracy.
Presentation Transcript
Supervised Models for Multimodal Image Retrieval based on Visual, Semantic and Geographic Information Presented by Ivan Chiou
Authors • Duc-Tien Dang-Nguyen • Giulia Boato • Alessandro Moschitti • Francesco G.B. De Natale • Department of Information and Computer Science, University of Trento, Italy
Abstract • Background: approaches for improving image ranking • Takes user annotations, time, and location into account • Proposal • Define a novel multimodal similarity measure combining visual features, annotated concepts, and geo-tags • Propose a learning approach based on SVMs (Support Vector Machines)
Introduction • Image-graph based techniques • Vertices carry both visual and semantic information • Probabilistic models • PLSA (Probabilistic Latent Semantic Analysis) methodology applied to • Visual features • Annotations • GPS coordinates • SVMs are able to learn from the data the weights to be assigned to each modality • Training procedure: take a random set of image queries • Retrieve the set of images having the highest similarity • Have them judged relevant or irrelevant by human annotators • Train the SVMs with these examples
Combining Visual, Concept and GPS Signals (1/2) • PLSA applied to user-generated multimedia content • Visual content • Image tags • Geo-location • Produces corresponding topic spaces with reduced dimensions • Fitted via Expectation Maximization • Enables fast on-line retrieval on very large datasets
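The Expectation Maximization fitting mentioned above can be sketched as a minimal toy pLSA on a document-term count matrix. This is an illustrative sketch only, not the paper's implementation; the function name `plsa` and all parameters are assumptions:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit a toy pLSA model to a document-term count matrix via EM."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_wz = rng.random((n_topics, n_words))   # P(w|z), topic-word dists
    p_wz /= p_wz.sum(axis=1, keepdims=True)
    p_zd = rng.random((n_docs, n_topics))    # P(z|d), doc-topic dists
    p_zd /= p_zd.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) P(w|z)
        joint = p_zd[:, :, None] * p_wz[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight by the observed counts n(d,w) and renormalize
        weighted = counts[:, None, :] * joint
        p_wz = weighted.sum(axis=0)
        p_wz /= p_wz.sum(axis=1, keepdims=True) + 1e-12
        p_zd = weighted.sum(axis=2)
        p_zd /= p_zd.sum(axis=1, keepdims=True) + 1e-12
    return p_zd, p_wz
```

The returned `p_zd` rows are the reduced-dimension topic-space representations of the images (one per modality in the paper's setup), which is what makes on-line retrieval fast.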
Combining Visual, Concept and GPS Signals (2/2) • PLSA with 100 topics • Visual features • SIFT (Scale Invariant Feature Transform): 128-element descriptors • Codebook of 2,500 salient points (K-Means on a training set of 5,000 images) • Bag-of-words associates a feature vector with each image • Image annotations • Vocabulary consists of all the tags in the dataset, except words used just once or by a single user • Total: 5,500 words • GPS coordinates • Similarity calculated from the distance between the GPS coordinates of the query and of the retrieved images
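Two of the steps above can be sketched in a few lines: quantizing SIFT descriptors against the K-Means codebook into a bag-of-words histogram, and computing the great-circle distance between two GPS coordinates. Both function names are illustrative assumptions, not the paper's code:

```python
import math

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword (Euclidean
    distance) and return an L1-normalized bag-of-words histogram."""
    hist = [0] * len(codebook)
    for d in descriptors:
        best = min(range(len(codebook)),
                   key=lambda k: sum((a - b) ** 2
                                     for a, b in zip(d, codebook[k])))
        hist[best] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]

def gps_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two GPS points, in km."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

In the paper's setting the codebook would hold the 2,500 K-Means centroids and the descriptors would be 128-element SIFT vectors; the haversine formula is one common way to turn GPS coordinates into a distance, though the paper does not specify which distance it uses.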
Supervised Multimodal Approach (1/2) • Goal: improve retrieval accuracy • Relies on a Development Set (DS) • Images annotated by users as relevant or irrelevant • Proposes SVMs, which have two important properties • They are robust to overfitting, offering the possibility to trade off generalization against empirical error and tune the model to a more general setting • Additional features can be included in the parameter vector
Supervised Multimodal Approach (2/2) • SVMs trained on the DS yield the proposed supervised model: Multimodal 2 (MM2)
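The idea of learning modality weights from relevance judgments can be sketched with a tiny sub-gradient linear SVM (Pegasos-style). The paper's actual solver, kernel, and feature vector are not given in the slides, so everything below, including the three-similarity feature layout, is an illustrative assumption:

```python
def train_linear_svm(xs, ys, lam=0.01, epochs=200):
    """Train a linear SVM by stochastic sub-gradient descent on the
    hinge loss. Each x is a vector of per-modality similarities
    (e.g. [visual_sim, tag_sim, gps_sim]); each y is +1 (relevant)
    or -1 (irrelevant)."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    t = 0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - eta * lam) for wi in w]  # regularization step
            if margin < 1:                          # hinge sub-gradient
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def score(w, b, x):
    """Learned multimodal similarity: weighted sum of modality scores."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

Ranking the retrieved images by `score` is what turns the learned weights into an improved multimodal similarity measure.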
Experimental Results (1/5) • 100,000 images of Paris from Flickr • 2,500 SIFT visual words / 50,000 images • 5,500 tags / 50,000 images • At most two images per user, to avoid similar images taken by the same photographer • 100 query images; the top-ranked 9 images retrieved for each • Relevance criterion: an image is judged relevant when at least half of the 72 annotators consider it relevant
Experimental Results (2/5) • Results over the 900 retrieved images • VS (visual): 305 relevant images • TS (tags): 218 relevant images • VS+TS: 308 relevant images • MM1 (with GPS coordinates): 641 relevant images • MM2: 72% accuracy and a MAP of 0.78
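For reference, the MAP figure reported above is the mean over queries of average precision on each ranked list. A minimal sketch (function names are illustrative, and the data below is a toy example, not the paper's):

```python
def average_precision(relevance):
    """Average precision for one ranked list; `relevance` holds 1/0
    flags per rank position, best-ranked first."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """MAP: mean of average precision over all queries' ranked lists."""
    return sum(average_precision(r) for r in runs) / len(runs)
```

With 100 queries and 9 retrieved images each, `runs` would hold 100 lists of 9 relevance flags derived from the annotators' judgments.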
Experimental Results (4/5) • Figures 4-8 • MM2 improves on the basic model when the tag annotation is not reliable • Improves diversification of the retrieval results (fewer near-duplicates of the same scene at night or day, or from different perspectives and points of view)
Conclusion • Presented a novel way to combine visual information with tags and GPS • Proposed a supervised machine learning approach (MM2) based on Support Vector Machines • Results confirm that the approach improves retrieval accuracy