
Using Large-Scale Web Data to Facilitate Textual Query Based Retrieval of Consumer Photos






Presentation Transcript


  1. Using Large-Scale Web Data to Facilitate Textual Query Based Retrieval of Consumer Photos. Yiming Liu, Dong Xu, Ivor W. Tsang, Jiebo Luo. Nanyang Technological University & Kodak Research Lab

  2. Motivation • Digital cameras and mobile phone cameras are spreading rapidly: • more and more personal photos are being taken; • retrieving images from these enormous personal photo collections becomes an important topic. How can we retrieve them?

  3. Previous Work • Content-Based Image Retrieval (CBIR): • users provide images as queries to retrieve personal photos. • The paramount challenge is the semantic gap: • the gap between the low-level visual features and the high-level semantic concepts. (Figure: a query image carrying a high-level concept is compared against low-level feature vectors in the database; the semantic gap lies between the two representations.)

  4. A More Natural Way for Consumer Applications • Image annotation is used to classify images w.r.t. high-level semantic concepts. • Semantic concepts are analogous to the textual terms describing document contents. • Annotation serves as an intermediate stage for textual query based image retrieval. • It lets the user retrieve the desired personal photos using textual queries. (Figure: database photos are annotated with high-level concepts such as "Sunset"; a textual query is compared against these annotations to rank the results.)

  5. Our Goal • Leverage information from web images to retrieve consumer photos in a personal photo collection. • A real-time textual query based consumer photo retrieval system without any intermediate annotation stage. • Web images are accompanied by contextual information such as tags, categories and titles, so no intermediate image annotation process is needed. (Figure: web images with contextual information such as "building", "people, family", "people, wedding" and "sunset" are matched against consumer photos.)

  6. System Framework • When the user provides a textual query, relevant and irrelevant images are first found in the web image collection, with the help of WordNet. • Then a classifier is trained based on these web images. • Consumer photos are ranked based on the classifier's decision values. • The user can also give relevance feedback to refine the top-ranked photos. (Figure: large collection of web images with descriptive words → automatic web image retrieval → relevant/irrelevant images → classifier → consumer photo retrieval → top-ranked consumer photos → relevance feedback → refined top-ranked photos.)

  7. Automatic Web Image Retrieval • Given the user's textual query (e.g., "boat"), first search for it in the semantic word trees built from WordNet. • The web images whose descriptions contain the query word are considered "relevant web images"; they are found with an inverted file. • The web images whose descriptions contain neither the query word nor any of its two-level descendants in the word tree (e.g., "barge", "ark", "dredger", "houseboat" for "boat") are considered "irrelevant web images".
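
A minimal sketch of this relevant/irrelevant split, assuming NLTK's WordNet interface and a hypothetical `inverted_file` dict that maps each vocabulary word to the set of IDs of web images whose descriptions contain it:

```python
# Sketch only: requires the NLTK WordNet corpus to be downloaded.
from nltk.corpus import wordnet as wn

def two_level_descendants(word):
    """Collect the query word plus its WordNet hyponyms up to two levels."""
    words = {word}
    for syn in wn.synsets(word, pos=wn.NOUN):
        for child in syn.hyponyms():                 # level 1
            words.update(l.name() for l in child.lemmas())
            for grandchild in child.hyponyms():      # level 2
                words.update(l.name() for l in grandchild.lemmas())
    return words

def split_web_images(query, inverted_file, all_image_ids):
    """Relevant: images containing the query word. Irrelevant: images
    containing neither the query word nor any two-level descendant."""
    relevant = set(inverted_file.get(query, ()))
    excluded = set()
    for w in two_level_descendants(query):
        excluded.update(inverted_file.get(w, ()))
    irrelevant = all_image_ids - excluded
    return relevant, irrelevant
```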

  8. Decision Stump Ensemble • Train a decision stump on each feature dimension. • Combine the stumps, weighting them by their training error rates.
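
A minimal NumPy sketch of the ensemble, assuming labels in {-1, +1}; the exact weighting scheme is not given on the slide, so weighting each stump by one minus its training error is an assumption:

```python
import numpy as np

def train_stump(feature, labels):
    """Pick the best threshold/polarity for one feature dimension.
    Unoptimized: scans every unique value as a candidate threshold."""
    best = (np.inf, 0.0, 1)                       # (error, threshold, polarity)
    for thr in np.unique(feature):
        for pol in (1, -1):
            pred = np.where(pol * (feature - thr) > 0, 1, -1)
            err = np.mean(pred != labels)
            if err < best[0]:
                best = (err, thr, pol)
    return best

def train_ensemble(X, y):
    """One stump per dimension; the weight 1 - error is an assumption."""
    stumps = [train_stump(X[:, d], y) for d in range(X.shape[1])]
    return [(1.0 - err, d, thr, pol) for d, (err, thr, pol) in enumerate(stumps)]

def decision_value(model, x):
    """Weighted vote of all stumps on a single example x."""
    return sum(w * pol * np.sign(x[d] - thr) for w, d, thr, pol in model)
```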

  9. Why Decision Stump Ensemble? • Main reason: low time cost. • Our goal is a (quasi) real-time retrieval system. • As base classifiers, SVMs are much slower; for the combination, boosting is also much slower. • The advantages of the decision stump ensemble: • low training cost; • low testing cost; • very easy to parallelize.

  10. Asymmetric Bagging • Imbalance: count(irrelevant) >> count(relevant). • This causes side effects such as overfitting. • Solution: asymmetric bagging. • Repeat 100 times with different randomly sampled irrelevant web images, producing 100 training sets that each combine all relevant images with a different irrelevant sample.
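
A minimal sketch of asymmetric bagging on top of the stump ensemble above; sampling as many irrelevant images as there are relevant ones is an assumption, since the slide does not give the bag size:

```python
import numpy as np

def asymmetric_bagging(X_rel, X_irr, n_bags=100, seed=0):
    """Each bag: all relevant images plus a fresh random sample of
    irrelevant images (assumed here to match the relevant count)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_bags):
        idx = rng.choice(len(X_irr), size=len(X_rel), replace=False)
        X = np.vstack([X_rel, X_irr[idx]])
        y = np.concatenate([np.ones(len(X_rel)), -np.ones(len(X_rel))])
        models.append(train_ensemble(X, y))   # from the previous sketch
    return models

def bagged_score(models, x):
    """Average the decision values of all bagged ensembles."""
    return np.mean([decision_value(m, x) for m in models])
```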

  11. Relevance Feedback • The user labels n_l relevant or irrelevant consumer photos. • We use this information to further refine the retrieval results. • Challenge 1: usually n_l is small. • Challenge 2: cross-domain learning: • the source classifier is trained on the web image domain, • while the user labels photos from the consumer photo domain.

  12. Method 1: Cross-Domain Combination of Classifiers • Re-train classifiers with data from both domains? Neither effective nor efficient. • A simple but effective method: • train an SVM on the consumer photo domain with the user-labeled photos; • convert the responses of the source classifier and the SVM classifier to probabilities, and add them up; • rank consumer photos based on this sum. • Referred to as DS_S+SVM_T.
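
A minimal sketch of the fusion, assuming a sigmoid turns both decision values into probabilities (the slide does not specify the calibration) and an RBF-kernel SVM:

```python
import numpy as np
from sklearn.svm import SVC

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def ds_s_plus_svm_t(source_scores, X_labeled, y_labeled, X_photos):
    """Fuse the web-trained source classifier with a target-domain SVM.
    source_scores: decision values of the stump ensemble on X_photos."""
    svm = SVC(kernel='rbf').fit(X_labeled, y_labeled)   # kernel is an assumption
    fused = sigmoid(source_scores) + sigmoid(svm.decision_function(X_photos))
    return np.argsort(-fused)          # photo indices, best first
```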

  13. Method 2: Cross-Domain Regularized Regression (CDRR) • Construct a linear regression function f_T(x): • for labeled photos, f_T(x_i) ≈ y_i; • for unlabeled photos, f_T(x_i) ≈ f_s(x_i), where f_s is the source classifier trained on the web images.
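
These two conditions can be written as one least-squares objective; this is a hedged reconstruction, with λ an assumed trade-off weight (the next slide adds a regularizer on the target classifier):

```latex
\min_{f_T}\;
\sum_{i=1}^{n_l} \bigl(f_T(x_i) - y_i\bigr)^2
\;+\;
\lambda \sum_{j=1}^{n_u} \bigl(f_T(x_j) - f_s(x_j)\bigr)^2
```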

  14. CDRR (continued) • Design a target linear classifier f_T(x) = w^T x. • For the user-labeled images x_1, …, x_l, f_T(x) should match the user's label y(x); for the other images, f_T(x) should match f_s(x). • A regularizer controls the complexity of the target classifier f_T(x). • This problem can be solved with a least-squares solver.
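
A minimal sketch of the closed-form least-squares solution, assuming a ridge-style regularizer lam2*||w||^2 on top of the two squared-loss terms above; the parameter values are assumptions, not the paper's settings:

```python
import numpy as np

def cdrr_solve(X_l, y_l, X_u, f_s_u, lam1=1.0, lam2=0.1):
    """Solve  min_w ||X_l w - y_l||^2 + lam1*||X_u w - f_s_u||^2 + lam2*||w||^2.
    X_l: labeled photo features; y_l: user labels; X_u: unlabeled photo
    features; f_s_u: source-classifier outputs on the unlabeled photos."""
    d = X_l.shape[1]
    A = X_l.T @ X_l + lam1 * (X_u.T @ X_u) + lam2 * np.eye(d)
    b = X_l.T @ y_l + lam1 * (X_u.T @ f_s_u)
    w = np.linalg.solve(A, b)
    return w                           # rank photos by X_photos @ w
```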

  15. Hybrid Method • A combination of the two methods. • For the labeled consumer photos: • measure the average distance d_avg to their 30 nearest unlabeled neighbors in feature space; • if d_avg < ε, use DS_S+SVM_T; • otherwise, use CDRR. • Reason: consumer photos that are visually similar to the user-labeled images should be influenced more by the user-labeled images.
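
A minimal sketch of the switch as the slide states it (a single average distance picks one of the two methods); the threshold value eps is an assumption:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hybrid_select(X_labeled, X_unlabeled, k=30, eps=1.0):
    """Average distance from the labeled photos to their k nearest
    unlabeled neighbors decides which method to apply (eps is assumed)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_unlabeled)
    dists, _ = nn.kneighbors(X_labeled)
    return 'DS_S+SVM_T' if dists.mean() < eps else 'CDRR'
```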

  16. Experimental Results

  17. Dataset and Experimental Setup • Web image database: • 1.3 million photos from photoSIG; • relatively professional photos. • Text descriptions for web images: • title, portfolio, and categories accompanying the web images; • remove the common high-frequency words; • remove the rarely-used words. • Finally, 21,377 words in our vocabulary.
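
A minimal sketch of such a vocabulary filter; the cut-off values are assumptions, since the slide does not give the thresholds:

```python
from collections import Counter

def build_vocabulary(documents, max_df=0.2, min_count=5):
    """Keep words that are neither too common (document frequency above
    max_df) nor too rare (total count below min_count)."""
    df, tf = Counter(), Counter()
    for doc in documents:
        words = doc.lower().split()
        tf.update(words)
        df.update(set(words))
    n_docs = len(documents)
    return {w for w in tf if df[w] / n_docs <= max_df and tf[w] >= min_count}
```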

  18. Dataset and Experimental Setup • Testing dataset #1: Kodak dataset. • Collected by Eastman Kodak Company from about 100 real users over a period of one year. • 1358 images: the first keyframe of each video. • 21 concepts: we merge "group_of_two" and "group_of_three_or_more" into one concept.

  19. Dataset and Experimental Setup • Testing dataset #2: Corel dataset. • 4999 images, each 192×128 or 128×192 pixels. • 43 concepts: we remove all concepts with fewer than 100 images.

  20. Visual Features • Grid-based color moments (225D): • three moments of three color channels from each block of a 5×5 grid. • Edge direction histogram (73D): • 72 edge direction bins plus one non-edge bin. • Wavelet texture (128D). • Concatenate all three kinds of features: • normalize each dimension to mean 0 and standard deviation 1; • use the first 103 principal components.
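
A minimal sketch of the concatenation step with scikit-learn; the per-descriptor extraction is assumed to be done elsewhere:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def build_features(color_moments, edge_hist, wavelet, n_components=103):
    """Concatenate the 225D + 73D + 128D descriptors into 426D,
    standardize each dimension, then keep 103 principal components."""
    X = np.hstack([color_moments, edge_hist, wavelet])
    X = StandardScaler().fit_transform(X)
    return PCA(n_components=n_components).fit_transform(X)
```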

  21. Retrieval without Relevance Feedback • Averaged over all concepts, the number of relevant images per query is 3703.5.

  22. Retrieval without Relevance Feedback • kNN: rank consumer photos by their average distance to the 300 nearest neighbors among the relevant web images. • DS_S: the decision stump ensemble trained on the source (web) images.
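
A minimal sketch of the kNN baseline; the distance metric (Euclidean) is an assumption:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_baseline(X_relevant_web, X_photos, k=300):
    """Rank consumer photos by average distance to their k nearest
    relevant web images; smaller distance ranks higher."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_relevant_web)
    dists, _ = nn.kneighbors(X_photos)
    return np.argsort(dists.mean(axis=1))   # photo indices, best first
```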

  23. Retrieval without Relevance Feedback • Time cost: • we use OpenMP to parallelize our method; • with 8 threads, both methods reach interactive speed; • but kNN is expected to become much more expensive on large-scale datasets.

  24. Retrieval with Relevance Feedback • In each round, the user labels at most 1 positive and 1 negative image among the top 40. • Methods for comparison: • kNN_RF: add the user-labeled photos to the relevant image set and re-apply kNN; • SVM_T: train an SVM on the user-labeled images in the target domain; • A-SVM: Adaptive SVM; • MR: Manifold Ranking based relevance feedback.

  25. Retrieval with Relevance Feedback • Setting of y(x) for CDRR: • positive: +1.0; • negative: -0.1. • Reason: • the top-ranked negative images are not extremely negative; • positive feedback says what the concept is, while negative feedback only says what it is not.

  26. Retrieval with Relevance Feedback • On Corel dataset:

  27. Retrieval with Relevance Feedback • On Kodak dataset:

  28. Retrieval with Relevance Feedback • Time cost: • All methods except A-SVM can achieve real-time speed.

  29. System Demonstration

  30. Query: Sunset

  31. Query: Plane

  32. The user is providing relevance feedback …

  33. After 2 positive and 2 negative feedback labels …

  34. Summary • Our goal: (quasi) real-time textual query based consumer photo retrieval. • Our method: • Use web images and their surrounding text descriptions as an auxiliary database; • Asymmetric bagging with decision stumps; • Several simple but effective cross-domain learning methods to help relevance feedback.

  35. Future Work • How to efficiently use more powerful source classifiers? • How to further improve the speed: • keep training time within 1 second; • control testing time when the consumer photo set is very large.

  36. Thank you! • Any questions?
