JIGSAW (Joint search with ImaGe, Speech And Words)

Interactive Mobile Visual Search with Multimodal Queries
Yang Wang, University of Science and Technology of China
Yang Wang, Jingdong Wang, Houqiang Li, Shipeng Li, Tao Mei

Search for images On The Go ACM Ubicomp 2011

Search for images On The Go ? ACM Multimedia 2011

Existing image search methods on mobile phones
Formulate an intent
Present the results
Drawbacks
• Clear goals
• Explicit intent
• See and search
• Product, landmark, …

Motivations – partial object
Mind -> Description -> Image -> Information
• The user has no exact entity name or instant photo.
• The user has only visual descriptions:
  • Can only describe an oil painting without its name. (“Find an oil painting with a man with a straw hat.”)
  • Wants to find a local business with only a description. (“Find a restaurant with a red gate and two stone lions in front and two red lanterns on the top.”)
  • ……

Our approach
• Multi-modal + Multi-touch = Visual intent
• Benefits:
  • Explicate the search intent
  • Express visual intent
  • Natural interface
Describe a visual scenario with speech (“Find a picture with sky and grass …”) -> Extract entity words from the speech (“sky”, “grass”, …) -> Exemplary images -> Composite a visual query -> Search results

Existing research -> A solution for mobile -> Using exemplar images -> Using region-based matching
Hao Xu, et al. Image Search by Concept Map. SIGIR ’10.
Changhu Wang, et al. MindFinder: Image Search by Interactive Sketching and Tagging. WWW ’10.

Flow chart (figure with steps ①, ②, ③, ④)

Speech recognition & entity extraction ①
Voice -> commercial speech recognition engine -> natural sentence (“Find an iron tower under grass”) -> entities (“tower”, “grass”)
• Nouns from WordNet* (117,798 nouns)
• Can be represented by images in ImageNet*
• Ignoring prepositions, verbs, adjectives, etc.
• 22,117 entities
*http://wordnet.princeton.edu
*http://www.image-net.org
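
The entity-extraction step can be sketched as a simple vocabulary filter over the recognized sentence. This is only an illustration: `ENTITY_VOCABULARY` is a tiny hypothetical stand-in for the 22,117-entity list, `extract_entities` is a made-up helper name, and the real system may tokenize or normalize words differently (for example, this sketch does not handle plurals).

```python
import re

# Hypothetical miniature vocabulary; the slides describe 22,117 entities
# (WordNet nouns that can be represented by ImageNet images).
ENTITY_VOCABULARY = {"tower", "grass", "sky", "man", "hat",
                     "restaurant", "gate", "lion", "lantern"}


def extract_entities(sentence, vocabulary=ENTITY_VOCABULARY):
    """Return vocabulary words found in the sentence, in order, without repeats."""
    words = re.findall(r"[a-z]+", sentence.lower())
    entities = []
    for word in words:
        # Everything outside the vocabulary (verbs, prepositions, ...) is ignored.
        if word in vocabulary and word not in entities:
            entities.append(word)
    return entities


print(extract_entities("Find an iron tower under grass"))
```

With this toy vocabulary, the example sentence from the slide yields the two entities shown there, "tower" and "grass".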

Exemplary image generation ②
• Obtain images for each text query (top 500) and extract features
• Cluster the images and keep the cluster centers
• The user can choose one exemplary image for each entity (such as I1 & I2)
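
A minimal sketch of the "cluster and keep cluster centers" step follows. The slides do not name the clustering algorithm, so plain k-means is an assumption here, as are the helper names `kmeans` and `pick_exemplars` and the deterministic initialization. Each exemplar is taken to be the image whose feature vector lies closest to a cluster centroid.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(features, k, iterations=10):
    """Plain k-means (an assumed choice); the first k points seed the centroids."""
    centroids = [list(f) for f in features[:k]]
    assignment = [0] * len(features)
    for _ in range(iterations):
        # Assign every feature vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [f for f, a in zip(features, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def pick_exemplars(features, k):
    """Return, per cluster, the index of the image closest to the centroid."""
    centroids, assignment = kmeans(features, k)
    exemplars = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            exemplars.append(
                min(members, key=lambda i: squared_distance(features[i], centroids[c]))
            )
    return exemplars


toy_features = [[0, 0], [10, 10], [1, 1], [2, 2], [11, 11], [12, 12]]
print(pick_exemplars(toy_features, k=2))
```

In the described system the input would be the features of the top 500 images per entity; the toy vectors above only show the mechanics.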

Composite visual query ③
• Component: C_k = {T_k, I_k, R_k}
• Visual query: {C_k}
• Component: C_k = {T_k, I_k}
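
One way to represent the composite query in code is sketched below. The slide does not spell out the symbols, so reading T_k as the entity text, I_k as the chosen exemplary image, and R_k as the region where the user placed it is an assumption, as are the field names, the normalized `(x, y, width, height)` region format, and the image identifiers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Component:
    text: str                # T_k: entity word (assumed meaning)
    image_id: str            # I_k: chosen exemplary image (assumed meaning)
    # R_k: region on the canvas as (x, y, width, height) in [0, 1] (assumed format).
    region: Optional[Tuple[float, float, float, float]] = None


# A visual query is the set of components {C_k}.
VisualQuery = List[Component]


def positioned(query: VisualQuery) -> List[Component]:
    """Components that the user has already placed on the canvas."""
    return [c for c in query if c.region is not None]


query = [
    Component("sky", "img_17", (0.0, 0.0, 1.0, 0.5)),  # hypothetical image ids
    Component("grass", "img_42"),                      # not yet positioned
]
print(len(positioned(query)))
```

Making the region optional lets the same structure cover both forms on the slide, with and without R_k.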

Search ④

Visual matching

Implementation
• Local pattern: SIFT
  • 6000-D bag of words
• Color information: color histogram
  • 192-D HSV
• Shape information: gradient histogram
  • 64-D
• Normalized and combined into a single vector.
• Similarity is calculated with an IDF weight.
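
The feature combination can be sketched as follows. The slides only say that the three descriptors are normalized, concatenated, and compared with an IDF weight; L2 normalization, the weighted dot product, and all function names here are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def combine_features(bow, color_hist, grad_hist):
    """Normalize each descriptor separately, then concatenate them.

    In the described system the parts would be 6000-D, 192-D and 64-D.
    """
    return l2_normalize(bow) + l2_normalize(color_hist) + l2_normalize(grad_hist)


def idf_weights(doc_freq, num_images):
    """Standard inverse document frequency, log(N / df), per dimension."""
    return [math.log(num_images / df) if df > 0 else 0.0 for df in doc_freq]


def weighted_similarity(a, b, weights):
    """Weighted dot product (an assumed form of the IDF-weighted similarity)."""
    return sum(w * x * y for w, x, y in zip(weights, a, b))


# Toy 2-D descriptors instead of the real dimensions.
vector = combine_features([3, 4], [1, 0], [0, 2])
print(vector)
```

Normalizing each part before concatenation keeps the 6000-D bag of words from dominating the much shorter color and gradient histograms.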

Application UI design
• Windows Phone 7
• Two-step interaction
• User interface

Experiments
• Settings
  • One million images from a commercial search engine
• Objective evaluations
  • 100 test queries
  • Normalized Discounted Cumulative Gain (NDCG)
  • Response time
• User study
  • Usability
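
For reference, NDCG can be computed as in the sketch below. Several NDCG variants exist and the slides do not say which one was used; this version uses the common gain of 2^rel - 1 with a log2 rank discount, which is an assumption.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k results."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance labels of the returned images, in ranked order (toy values).
print(ndcg([3, 2, 1], k=3))
print(ndcg([1, 2, 3], k=3))
```

A perfectly ordered result list scores 1, and any other ordering of the same labels scores lower.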

NDCG
• Compared JIGSAW, Concept Map*, and text search
• More efficient than text search
• Better performance than Concept Map
*Hao Xu, et al. Image Search by Concept Map. SIGIR ’10.

System response time
System response time in searching (on the phone)
• Given a number of keywords n
• (500n candidate images) x (n scores are calculated) = O(n^2)
• Pruned by early abortion
  • ~O(n)
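
The effect of early abortion can be illustrated with the sketch below. The slides do not describe the actual pruning rule, so the one used here (drop a candidate as soon as one component scores below `min_score`) is an assumption, as are the function name, the averaging of scores, and the toy score table.

```python
def rank_with_early_abort(candidates, components, score_fn, min_score):
    """Score each candidate against every component, aborting weak candidates early.

    Returns the ranked (score, candidate) pairs and the number of score
    evaluations actually performed.
    """
    if not components:
        return [], 0
    ranked = []
    evaluations = 0
    for image in candidates:
        total = 0.0
        for component in components:
            score = score_fn(image, component)
            evaluations += 1
            if score < min_score:
                break  # Assumed pruning rule: stop scoring this candidate.
            total += score
        else:
            # Reached only when no component triggered the abort.
            ranked.append((total / len(components), image))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked, evaluations


# Hypothetical per-component scores for two candidate images.
toy_scores = {("a", "sky"): 0.9, ("a", "grass"): 0.8,
              ("b", "sky"): 0.1, ("b", "grass"): 0.9}
ranked, evaluations = rank_with_early_abort(
    ["a", "b"], ["sky", "grass"], lambda img, comp: toy_scores[(img, comp)], 0.5
)
print(ranked, evaluations)
```

Without pruning, every one of the 500n candidates needs n score evaluations; when most candidates are dropped after their first component, the work per candidate becomes roughly constant, which matches the ~O(n) behaviour reported on the slide.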

User’s time
The time distribution for different users to complete a task
• Single component: 30 sec
• Failed trial: +20 sec
• Extra component: +20 sec

Demo

Number of interactions (tap & drag)
• Multi-touch takes up only 5% of all operations because the exemplary images are always too small on the screen.

Visual results

Discussions
• Contributions
  • Introduced a new interactive visual search system on mobile
  • Proposed a visual search method for this application
  • Deployed the system on a WP7 mobile phone
• Future work
  • Improve the efficiency of visual search
  • Handle relative positions between objects

Thanks!

Backup slides

Similarity between image and exemplar

Similarity between image and exemplar
We index the features in 9x9 cells
Multiple cells are combined to approximate the region to be compared

Similarity between image and exemplar
Save features in M x M cells
• Combine features covered by the region
• Slide the window around
• Calculate similarity e(i, j)
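
The sliding-window matching over the cell grid can be sketched as below. The slides do not state how cell features are combined or which similarity is used, so summing the cell features and comparing with cosine similarity are assumptions, and the function names are illustrative. Here `e[i][j]` is the similarity when the window's top-left corner sits at cell (i, j).

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def window_feature(cells, top, left, height, width):
    """Sum the per-cell feature vectors covered by the window (assumed pooling)."""
    pooled = [0.0] * len(cells[0][0])
    for r in range(top, top + height):
        for c in range(left, left + width):
            for d, value in enumerate(cells[r][c]):
                pooled[d] += value
    return pooled


def similarity_map(cells, exemplar, height, width):
    """Slide a height x width window over the M x M grid and compute e(i, j)."""
    m = len(cells)
    return [
        [cosine(window_feature(cells, top, left, height, width), exemplar)
         for left in range(m - width + 1)]
        for top in range(m - height + 1)
    ]


# Toy 3 x 3 grid of 2-D cell features; only the top-left cell looks like the exemplar.
grid = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
grid[0][0] = [1.0, 0.0]
e = similarity_map(grid, exemplar=[1.0, 0.0], height=1, width=1)
print(e[0][0], e[1][1])
```

Because features are stored per cell, a region of any size is approximated by combining the cells it covers rather than by re-extracting features from the image.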

Consider more positions
• Draw the desired distribution from a Gaussian shape centered at the desired position
• Compare e_k^J(i, j) with the desired distribution of the k-th component, D_k(i, j)
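
A Gaussian-shaped desired distribution over the cell grid can be generated as in this sketch. The unnormalized isotropic Gaussian, the `sigma` parameter, and the function name are assumptions; the slides give neither the spread nor the exact formula used to compare D_k with e_k^J, so the comparison itself is not sketched here.

```python
import math


def desired_distribution(m, center_row, center_col, sigma):
    """D(i, j) on an m x m grid: a Gaussian bump centered at the desired cell."""
    return [
        [
            math.exp(-((i - center_row) ** 2 + (j - center_col) ** 2)
                     / (2.0 * sigma ** 2))
            for j in range(m)
        ]
        for i in range(m)
    ]


# 9 x 9 grid as on the earlier slide; center and sigma are illustrative values.
d = desired_distribution(9, center_row=4, center_col=4, sigma=2.0)
print(d[4][4], d[4][3], d[0][0])
```

The map peaks at the position where the user placed the component and decays smoothly, so candidate images whose matching region is slightly offset are still credited.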

Penalty
• For cells outside R(k), penalty is pooled by
• Instead of e, the feature of a single cell o(i, j) is used to speed up matching.
• The spatial relevance score between the candidate image and the k-th component:

Fusion
scores count the average score
Divergence penalty:

Ranking
• -1 < score(J) < 1
• Rank the candidate images by their scores in descending order
