JIGSAW (Joint search with ImaGe, Speech And Words)

Interactive Mobile Visual Search with Multimodal Queries. Yang Wang (University of Science and Technology of China), Jingdong Wang, Houqiang Li, Shipeng Li, Tao Mei. Search for images on the go.

Presentation Transcript


  1. JIGSAW (Joint search with ImaGe, Speech And Words): Interactive Mobile Visual Search with Multimodal Queries. Yang Wang, University of Science and Technology of China. Authors: Yang Wang, Jingdong Wang, Houqiang Li, Shipeng Li, Tao Mei

  2. Search for images On The Go ACM Ubicomp 2011

  3. Search for images On The Go ?

  4. Search for images On The Go ? ACM Multimedia 2011

  5. Existing image search methods on mobile phones: formulate an intent, then present the results. Drawbacks: • Clear goals • Explicit intent • See and search • Product, landmark, …

  6. Motivations: partial object. Mind → Description → Image → Information • The user has no exact entity name or instant photo. • The user has only visual descriptions: • Can only describe an oil painting without its name (“Find an oil painting of a man with a straw hat.”) • Wants to find a local business with only a description (“Find a restaurant with a red gate, two stone lions in front, and two red lanterns on top.”) • …

  7. Our approach • Multi-modal + Multi-touch = Visual intent • Benefits: • Make the search intent explicit • Express visual intent • Natural interface. Pipeline (from the slide diagram): describe a visual scenario with speech (“Find a picture with sky and grass”) → extract entity words from the speech (“sky”, “grass”) → exemplary images → composite a visual query → search results

  8. Existing research -> A solution for mobile -> Using exemplar images -> Using region-based matching. References: Hao Xu, et al. Image Search by Concept Map. SIGIR ’10. Changhu Wang, et al. MindFinder: Image Search by Interactive Sketching and Tagging. WWW ’10.

  9. Flow chart (figure): ① speech recognition and entity extraction, ② exemplary image generation, ③ composite visual query, ④ search

  10. Speech recognition & entity extraction (step ①): voice → commercial speech recognition engine → natural sentence (“Find an iron tower under grass”) → entities (“tower”, “grass”). Entity vocabulary: • Nouns from WordNet* (117,798 nouns) • Can be represented by images in ImageNet* • Ignoring prepositions, verbs, adjectives, etc. • 22,117 entities. *http://wordnet.princeton.edu *http://www.image-net.org
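
The entity-extraction step can be sketched as a vocabulary filter over the recognized sentence. This is a minimal illustration, not the authors' implementation: `ENTITY_VOCAB` is a hypothetical toy stand-in for the 22,117-entity list, and `extract_entities` is an assumed helper name. The speech recognizer itself is out of scope here, so the input is already text.

```python
import re

# Hypothetical toy vocabulary (assumption). The slides describe 22,117 WordNet
# nouns that can be represented by ImageNet images; "iron" is left out here
# only so that the toy output matches the slide's example.
ENTITY_VOCAB = {"tower", "grass", "sky", "restaurant", "gate", "lion", "lantern"}

def extract_entities(sentence, vocab=ENTITY_VOCAB):
    """Return vocabulary words in order of first appearance, without duplicates."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    entities = []
    for tok in tokens:
        # Words outside the vocabulary (verbs, prepositions, ...) are ignored.
        if tok in vocab and tok not in entities:
            entities.append(tok)
    return entities

print(extract_entities("Find an iron tower under grass"))
```

A real system would also need lemmatization (for example "lions" to "lion"), which this sketch omits.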

  11. Exemplary image generation (step ②) • Obtain images with each text query (top 500) & extract features • Cluster images and keep cluster centers • User can choose one exemplary image for each entity (such as I1 & I2)
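
The "cluster and keep cluster centers" step can be sketched as follows. The slides do not name the clustering algorithm, so plain k-means is swapped in here as an assumption, and the image nearest to each center is returned as the exemplar. `kmeans_exemplars` and the toy 2-D features are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_exemplars(features, k, iters=10):
    """Return the index of the image closest to each of k cluster centers."""
    # Deterministic initialisation: spread the initial centers evenly over the list.
    centers = [list(features[i * len(features) // k]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda c: sq_dist(f, centers[c]))
            groups[nearest].append(f)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # An exemplar must be a real image, so pick the image nearest to each center.
    return [min(range(len(features)), key=lambda i: sq_dist(features[i], ctr))
            for ctr in centers]

# Toy features standing in for the descriptors of the top-500 results of one query.
feats = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0), (5.05, 5.05)]
print(kmeans_exemplars(feats, k=2))
```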

  12. Composite visual query (step ③) • Component: C_k = {T_k, I_k, R_k} • Visual query: {C_k} • Component: C_k = {T_k, I_k}
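
One way to hold the composite query in code is a pair of small record types. The slide does not define the symbols, so the readings below are assumptions: T_k as the entity word, I_k as the chosen exemplary image, and R_k as the region the user assigns on the canvas (absent until the exemplar has been placed). The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Component:
    text: str                      # T_k: entity word (assumed meaning)
    image_id: str                  # I_k: chosen exemplary image (assumed meaning)
    # R_k: region on the query canvas as (x, y, width, height) in [0, 1]
    # coordinates (assumed layout); None means "not placed yet".
    region: Optional[Tuple[float, float, float, float]] = None

@dataclass
class VisualQuery:
    components: List[Component]    # the set {C_k}

    def placed(self) -> List[Component]:
        """Components that already carry a region."""
        return [c for c in self.components if c.region is not None]

query = VisualQuery([
    Component("sky", "sky_017", region=(0.0, 0.0, 1.0, 0.5)),
    Component("grass", "grass_003"),  # C_k = {T_k, I_k}: no region yet
])
print([c.text for c in query.placed()])
```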

  13. Search (step ④)

  14. Visual matching

  15. Implementation • Local pattern: SIFT • 6000-D bag of words • Color information: color histogram • 192-D HSV • Shape information: gradient histogram • 64-D • Normalized and combined into a single vector. • Calculate similarity with an IDF weight.
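
The feature combination above can be sketched as below. Only the three dimensionalities come from the slide. The L2 normalization, the weighted dot product, and applying IDF to the visual-word bins only (with weight 1.0 for color and gradient bins) are assumptions made for illustration, since the slide does not give the exact formula.

```python
import math

BOW_DIM, COLOR_DIM, GRADIENT_DIM = 6000, 192, 64  # dimensions from the slide

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def combine(bow, color, gradient):
    """Normalize each descriptor and concatenate into one 6256-D vector."""
    assert (len(bow), len(color), len(gradient)) == (BOW_DIM, COLOR_DIM, GRADIENT_DIM)
    return l2_normalize(bow) + l2_normalize(color) + l2_normalize(gradient)

def idf_weights(doc_freq, n_docs):
    """IDF per visual word; color and gradient bins get weight 1.0 (assumption)."""
    idf = [math.log(n_docs / df) if df > 0 else 0.0 for df in doc_freq]
    return idf + [1.0] * (COLOR_DIM + GRADIENT_DIM)

def weighted_similarity(a, b, weights):
    """IDF-weighted dot product between two combined vectors (assumed form)."""
    return sum(w * x * y for w, x, y in zip(weights, a, b))

vec = combine([1.0] * BOW_DIM, [1.0] * COLOR_DIM, [1.0] * GRADIENT_DIM)
weights = idf_weights([10] * BOW_DIM, n_docs=1000)
print(len(vec), round(weighted_similarity(vec, vec, weights), 3))
```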

  16. Application UI design • Windows Phone 7 • Two-step interaction • User interface

  17. Experiments • Settings • One million images from a commercial search engine • Objective evaluations • 100 test queries • Normalized Discounted Cumulative Gain (NDCG) • Response time • User study • Usability

  18. NDCG • Compared JIGSAW, Concept Map*, and text search • More efficient than text search • Better performance than Concept Map. *Hao Xu, et al. Image Search by Concept Map. SIGIR ’10.
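
For readers unfamiliar with the metric, NDCG at depth k can be computed as sketched below. The slides do not say which gain and discount variant was used. This sketch assumes the common form with gain 2^rel - 1 and discount log2(rank + 1), and the relevance labels are made up.

```python
import math

def dcg(rels, k):
    """Discounted cumulative gain of the first k relevance labels."""
    # enumerate() is 0-based, so rank = i + 1 and the discount is log2(rank + 1).
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(rels, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of a returned ranking (higher is better).
print(ndcg([3, 2, 1], k=3))   # already in ideal order
print(ndcg([1, 3, 2], k=3))   # best result is not ranked first
```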

  19. System response time in searching (on the phone) • Given n keywords • (500n candidate images) × (n scores calculated per image) = O(n^2) • Pruned by early abortion • ~O(n)
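
The slide does not say how early abortion is done. One plausible reading, sketched here as an assumption, is to stop scoring a candidate as soon as even perfect scores on its remaining components could not lift it into the current top results. `rank_with_early_abort`, `score_fn`, and the toy scores are hypothetical.

```python
import heapq

def rank_with_early_abort(candidates, n_components, score_fn, top_k=10, max_score=1.0):
    """Average per-component scores, abandoning candidates that cannot reach the top_k."""
    heap = []          # min-heap of (score, candidate) holding the current top_k
    evaluations = 0    # number of component scores actually computed
    for cand in candidates:
        total, aborted = 0.0, False
        for k in range(n_components):
            total += score_fn(cand, k)
            evaluations += 1
            remaining = n_components - (k + 1)
            best_possible = (total + remaining * max_score) / n_components
            if len(heap) == top_k and best_possible <= heap[0][0]:
                aborted = True  # cannot beat the weakest kept result
                break
        if not aborted:
            item = (total / n_components, cand)
            if len(heap) < top_k:
                heapq.heappush(heap, item)
            else:
                heapq.heappushpop(heap, item)
    return sorted(heap, reverse=True), evaluations

toy_scores = {"a": [0.9, 0.8], "b": [0.1, 0.9], "c": [0.0, 0.2]}
ranked, evaluations = rank_with_early_abort(
    ["a", "b", "c"], 2, lambda cand, k: toy_scores[cand][k], top_k=1)
print(ranked[0][1], evaluations)
```

In the toy run, candidates "b" and "c" are dropped after their first component, so fewer than the full 3 × 2 scores are computed.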

  20. User’s time: the time distribution for different users to complete a task • Single component: 30 sec • Failed trial: +20 sec • Extra component: +20 sec
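
Read as a rough additive model, the three figures above give a back-of-the-envelope estimate of task time. Treating them as additive is an assumption, not a claim from the slides, and `estimated_task_seconds` is a hypothetical helper.

```python
def estimated_task_seconds(components, failed_trials):
    """Rough estimate: 30 s base, +20 s per failed trial, +20 s per extra component."""
    base = 30
    per_failed = 20
    per_extra = 20
    extra_components = max(components - 1, 0)
    return base + per_failed * failed_trials + per_extra * extra_components

# Two components with one failed trial: 30 + 20 + 20 seconds.
print(estimated_task_seconds(components=2, failed_trials=1))
```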

  21. Demo

  22. Number of interactions (tap & drag) • Multi-touch takes up only 5% of all operations because the exemplary images are always too small on the screen.

  23. Visual results

  24. Visual results

  25. Visual results

  26. Discussion • Contributions • Introduced a new interactive visual search system on mobile • Proposed a visual search method for this application • Deployed the system on a WP7 mobile phone • Future work • Improve the efficiency of visual search • Handle relative positions between objects

  27. Thanks!

  28. Backup slides

  29. Similarity between image and exemplar

  30. Similarity between image and exemplar • We index the features in 9x9 cells • Multiple cells are combined to approximate the region to be compared

  31. Similarity between image and exemplar • Save features in M x M cells • Combine features covered by the region • Slide the window around • Calculate similarity e(i, j)
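
The cell-based matching above can be sketched as a sliding window over a grid of per-cell histograms. Summing the cell histograms inside the window and comparing with cosine similarity are assumptions; the slides only say that features are combined and a similarity e(i, j) is calculated. All function names and the toy grid are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def window_feature(cells, top, left, h, w):
    """Sum the histograms of the h x w block of cells whose corner is (top, left)."""
    total = [0.0] * len(cells[0][0])
    for r in range(top, top + h):
        for c in range(left, left + w):
            for d, value in enumerate(cells[r][c]):
                total[d] += value
    return total

def similarity_map(cells, exemplar, h, w):
    """e[i][j]: similarity of the exemplar to the window placed at cell (i, j)."""
    m = len(cells)
    return [[cosine(window_feature(cells, i, j, h, w), exemplar)
             for j in range(m - w + 1)]
            for i in range(m - h + 1)]

# Toy 3 x 3 grid of 2-bin histograms: the lower-right 2 x 2 block looks like the exemplar.
grid = [[[0.0, 1.0] if r >= 1 and c >= 1 else [1.0, 0.0] for c in range(3)]
        for r in range(3)]
e = similarity_map(grid, exemplar=[0.0, 1.0], h=2, w=2)
print(e)
```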

  32. Consider more positions • Draw the desired distribution from a Gaussian shape centered at the desired position • Compare e_k^J(i, j) with the desired distribution of the k-th component, D_k(i, j)
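
A Gaussian-shaped desired distribution over the cell grid can be generated as sketched below. The isotropic form, the peak value of 1, and the default `sigma` are assumptions, since the slide gives only the shape and its center. How D_k is then compared with e_k^J is not preserved in the transcript, so the sketch stops at building D_k.

```python
import math

def desired_distribution(m, center, sigma=1.5):
    """D[i][j] over an m x m grid: Gaussian bump peaking at the desired cell."""
    ci, cj = center
    return [[math.exp(-((i - ci) ** 2 + (j - cj) ** 2) / (2 * sigma ** 2))
             for j in range(m)]
            for i in range(m)]

# 9 x 9 grid (as in the indexing slide) with the component placed at the centre cell.
D = desired_distribution(9, center=(4, 4))
print(D[4][4], round(D[4][5], 3), round(D[0][0], 6))
```

Cells near the desired position keep weights close to 1, while distant cells fall toward 0, so small placement errors are tolerated.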

  33. Penalty • For cells outside R(k), the penalty is pooled by: [formula on slide] • Instead of e, the feature of a single cell o(i, j) is used to accelerate the matching speed. • The spatial relevance score between the candidate image and the k-th component: [formula on slide]

  34. Fusion scores • Count the average score • Divergence penalty: [formula on slide]

  35. Ranking • -1 < score(J) < 1 • Rank the candidate images by their scores in descending order
