
Probabilistic Approaches to Video Retrieval The Lowlands Team at TRECVID 2004






Presentation Transcript


  1. Probabilistic Approaches to Video Retrieval: The Lowlands Team at TRECVID 2004. Tzvetanka (‘Tzveta’) I. Ianeva, Lioudmila (‘Mila’) Boldareva, Thijs Westerveld, Roberto Cornacchia, Djoerd Hiemstra (the 1 and only), Arjen P. de Vries. TRECVID 2004

  2. Generative Models… • A statistical model for generating data • Probability distribution over samples in a given ‘language’, aka ‘Language Modelling’ • [Slide equation: the probability of an example given model M factors into a product of per-part probabilities P(·|M); the example images did not survive transcription] © Victor Lavrenko, Aug. 2002

  3. … in Information Retrieval • Basic question: what is the likelihood that this document is relevant to this query? • P(rel|I,Q) = P(I,Q|rel) P(rel) / P(I,Q) • P(I,Q|rel) = P(Q|I,rel) P(I|rel)

  4. Retrieval (Query Generation) • Estimate a model for each document, then rank the documents by the probability that their model generates the query • [Slide diagram: Docs → Models M1…M4, each scored against the Query as P(Q|M1)…P(Q|M4)]
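The query-generation ranking above can be sketched with a toy unigram language model. The documents, the query, and the Jelinek-Mercer smoothing weight `lam` below are all illustrative assumptions, not the team's actual setup.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(Q | M_doc): unigram document model smoothed with collection
    statistics (Jelinek-Mercer smoothing; lam is an illustrative setting)."""
    doc_tf, doc_len = Counter(doc), len(doc)
    col_tf, col_len = Counter(collection), len(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len   # term frequency in the document
        p_col = col_tf[term] / col_len   # background collection model
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

docs = {
    "d1": "probabilistic models for video retrieval".split(),
    "d2": "round robin merging of ranked lists".split(),
}
collection = [t for d in docs.values() for t in d]
query = "video retrieval".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d], collection),
                 reverse=True)
```

The collection term smooths away zero probabilities, so a document missing a query term is penalized but not eliminated.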

  5. ‘Language Modelling’ • Not just ‘English’, but also the language of: author, newspaper, text document, image • Hiemstra or Robertson? ‘Parsimonious language models explicitly address the relation between levels of language models that are typically used for smoothing.’

  6. ‘Language Modelling’ • Not just ‘English’, but also the language of: author, newspaper, text document, image • Guardian or Times?

  7. ‘Language Modelling’ • Not just English! But also the language of: author, newspaper, text document, image • … or …? [the two example images did not survive transcription]

  8. Application to Video Retrieval • Matching against multiple modalities gives robustness • GMM (Gaussian mixture model) of shot (‘dynamic’) or key-frame (‘static’) • MNM (multinomial model) of associated text (ASR) • Assume scores for both modalities are independent • Merge multiple examples’ results in round-robin fashion • Interactive search is much more successful than manual search: the role of the user is very important

  9. TRECVID 2004: Research Questions Pursued • Modelling video content: how to best model the visual content? How to best model the textual content? • Does audio-visual content modelling contribute to better retrieval results, both in manual and interactive search? • How to translate the topic into a query?

  10. Experimental Set-up • Build models for each shot: static, dynamic, language • Build queries from topics: automatic as well as manually constructed simple keyword text queries • Select a visual example

  11. Modelling Visual Content

  12. Static Model • Indexing: estimate Gaussian mixture models from images using EM • Based on feature vectors with colour, texture and position information from pixel blocks • Fixed number of components • [Slide diagram: Docs → Models]

  13. Static Model • Indexing: estimate a Gaussian mixture model from each key-frame (using EM) • Fixed number of components (C=8) • Feature vectors contain colour, texture, and position information from pixel blocks: <x, y, DCT>
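The EM estimation above can be sketched as follows. This is a minimal diagonal-covariance version run on toy 2-D data, not the team's implementation, which used C=8 components and the full per-block <x, y, DCT> feature vectors.

```python
import numpy as np

def fit_gmm_em(X, C, iters=50):
    """Fit a C-component Gaussian mixture (diagonal covariances) to feature
    vectors X with EM, as in the static-model indexing step."""
    n, d = X.shape
    # Deterministic init: C points spread along the first feature dimension
    means = X[np.argsort(X[:, 0])[np.linspace(0, n - 1, C).astype(int)]]
    variances = np.tile(X.var(axis=0), (C, 1))
    weights = np.full(C, 1.0 / C)
    for _ in range(iters):
        # E-step: responsibilities r[i, c] ∝ w_c * N(x_i; mean_c, var_c)
        log_p = (-0.5 * (((X[:, None, :] - means) ** 2) / variances
                         + np.log(2 * np.pi * variances)).sum(axis=2)
                 + np.log(weights))
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, and variances
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        variances = (r.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances

# Toy data: two well-separated clusters stand in for pixel-block features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
w, m, v = fit_gmm_em(X, C=2)
```

On a real key-frame the rows of X would be the <x, y, DCT> block vectors and C would be 8.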

  14. Dynamic Model • Indexing: GMM of multiple frames around the key-frame • Feature vectors extended with a time-stamp normalized to [0,1]: <x, y, t, DCT> • [Slide figure: timeline with ticks at 0, .5, 1]
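The feature extension can be sketched as below; the function name and the placeholder block features are illustrative, not the team's code.

```python
import numpy as np

def dynamic_features(frames_blocks):
    """Pool per-block <x, y, DCT...> vectors from several frames into one set
    of <x, y, t, DCT...> vectors, with the time stamp t normalized to [0, 1]."""
    T = len(frames_blocks)
    pooled = []
    for i, blocks in enumerate(frames_blocks):
        t = i / (T - 1) if T > 1 else 0.0   # first frame -> 0, last frame -> 1
        for x, y, *dct in blocks:
            pooled.append([x, y, t, *dct])
    return np.array(pooled)

# One block per frame, three frames around the key-frame
feats = dynamic_features([[(0, 0, 1.0, 2.0)],
                          [(1, 1, 3.0, 4.0)],
                          [(2, 2, 5.0, 6.0)]])
```

A single GMM is then fit to the pooled vectors, so the mixture components can capture how block appearance varies over the shot.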

  15. Examples

  16. Examples

  17. Examples

  18. Dynamic vs. Static • The dynamic model retrieves more relevant shots (227 vs. 212) and places them higher in the result lists (MAP 0.0124 vs. 0.0089) • Topic 142 (has an example from the collection): dynamic finds 15 relevant shots vs. 3 for static

  19. Example: Topic 136 • Dynamic, rank 1-4 (8 found) • Static, rank 1-4 (4 found) • [the result key-frames did not survive transcription]

  20. Dynamic Model Advantages • More training data for models • Less sensitive to random initialization • Reduced dependency upon selecting an appropriate keyframe • Spatio-temporal aspects of the shot are captured

  21. Modelling Textual Content

  22. Hierarchical Language Model • MNM smoothed over multiple levels: P(T) = α·P(T|Shot) + β·P(T|‘Scene’) + γ·P(T|Video) + (1 − α − β − γ)·P(T|Collection) • The additional video level is beneficial: on 2003 data, 0.148 vs. 0.134
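The interpolation above can be sketched directly; the level language models are plain term-probability dictionaries here, and the weights α, β, γ below are illustrative, not the tuned values.

```python
def smoothed_prob(term, shot_lm, scene_lm, video_lm, coll_lm,
                  alpha=0.4, beta=0.2, gamma=0.2):
    """P(T|shot) interpolated over shot, scene, video, and collection levels;
    the interpolation weights are illustrative assumptions."""
    coll_weight = 1.0 - alpha - beta - gamma
    return (alpha * shot_lm.get(term, 0.0)
            + beta * scene_lm.get(term, 0.0)
            + gamma * video_lm.get(term, 0.0)
            + coll_weight * coll_lm.get(term, 0.0))

# Toy level models: "water" never occurs in this shot's ASR text
shot = {"boat": 0.5}
scene = {"boat": 0.2, "water": 0.1}
video = {"water": 0.05}
coll = {"boat": 0.01, "water": 0.02, "sky": 0.01}
```

A term unseen at the shot level still gets a nonzero probability from the scene, video, and collection levels, which is exactly what the hierarchy is for.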

  23. Using Video-OCR • ASR: MAP 0.0680 • ASR+OCR: MAP 0.0691 • Higher initial precision, more relevant shots found • The difference is not statistically significant • Further improvements possible? Pre-process the OCR data? Add captions?

  24. MULTI: modalities, examples

  25. Multi-modal Retrieval • Combining visual and text scores (using the independence assumption) gives better results than either modality on its own • Dynamic+ASR (manual) finds 18 additional relevant shots over ASR only (565 vs. 547) • Consistent with the TRECVID 2003 finding!
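Under the independence assumption, the joint probability of the visual and textual query representations factorizes, so per-modality log-scores simply add. The per-shot probabilities below are made up for illustration.

```python
import math

def combined_score(p_visual, p_text):
    """log P(Q_vis, Q_txt | shot) = log P(Q_vis|shot) + log P(Q_txt|shot)
    under the independence assumption."""
    return math.log(p_visual) + math.log(p_text)

# Illustrative (visual, text) probabilities for three shots
shots = {"s1": (0.08, 0.20), "s2": (0.30, 0.01), "s3": (0.10, 0.10)}
ranking = sorted(shots, key=lambda s: combined_score(*shots[s]), reverse=True)
```

Note how s2, strong in one modality only, loses to shots that score reasonably in both: the robustness effect the slide describes.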

  26. Query by Multiple Examples • Rank-based vs. score-based combination: round-robin (min{rank}) vs. CMS (mean{score}) • Results: RR gives better MAP (0.0124 vs. 0.0089); CMS finds more relevant shots (239 vs. 227)
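The two combination strategies can be sketched as follows; the item names are illustrative, and "CMS" is implemented here simply as the mean of an item's scores over the examples.

```python
def round_robin(rankings):
    """Merge ranked lists by min{rank}: take every list's rank-1 item first,
    then every rank-2 item, skipping duplicates (list order breaks ties)."""
    merged, seen = [], set()
    for depth in range(max(len(r) for r in rankings)):
        for ranking in rankings:
            if depth < len(ranking) and ranking[depth] not in seen:
                seen.add(ranking[depth])
                merged.append(ranking[depth])
    return merged

def mean_score(score_lists):
    """Merge by mean{score}: average each item's scores over the examples."""
    sums, counts = {}, {}
    for scores in score_lists:
        for item, s in scores.items():
            sums[item] = sums.get(item, 0.0) + s
            counts[item] = counts.get(item, 0) + 1
    return sorted(sums, key=lambda i: sums[i] / counts[i], reverse=True)
```

With round-robin, the order of the input lists matters for ties, which is the order effect discussed on the next slide.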

  27. Query by Multiple Examples • A manually made selection of examples gave better results than using all of them • Order effect with RR: dynamic works best with video examples first, static with image examples first • The difference results from the initial precision

  28. Interactive Search

  29. Interactive System • Based on a pre-computed similarity matrix: ASR language model and static key-frame model (using ALA) • Probability scores are updated from the searcher’s feedback (see Boldareva & Hiemstra, CIVR 2004) • The most informative modality is selected automatically • Marginal entropy is monitored as an indicator of user-system performance and applied to choosing the update strategy (text/visual/combined) for the next iteration
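The feedback loop can be sketched with a multiplicative re-weighting over a precomputed similarity matrix. This is an illustrative sketch only, not the exact update rule of Boldareva & Hiemstra (CIVR 2004); the learning rate `eta` and the toy matrix are assumptions.

```python
import numpy as np

def update_scores(scores, sim, positives, negatives, eta=1.0):
    """Re-weight shot relevance scores from one round of feedback: shots
    similar to positively judged shots go up, shots similar to negatively
    judged shots go down, and the result is renormalized."""
    scores = scores.copy()
    for p in positives:
        scores *= np.exp(eta * sim[p])
    for n in negatives:
        scores *= np.exp(-eta * sim[n])
    return scores / scores.sum()

# Toy run: shot 1 is similar to the positively judged shot 0
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
scores = update_scores(np.full(3, 1 / 3), sim, positives=[0], negatives=[2])
```

After one round, shot 1 is promoted purely through its similarity to the judged shot 0, which is how the similarity matrix lets feedback generalize beyond the judged shots.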

  30. Marginal Entropy ~ MAP
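The monitored quantity can be sketched as the entropy of the normalized relevance-score distribution over shots: as the search converges on relevant material the distribution peaks and the entropy drops. A minimal illustration (the distributions are made up):

```python
import math

def marginal_entropy(probs):
    """Entropy of a normalized relevance-score distribution over shots."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # no idea yet which shots are relevant
peaked = [0.85, 0.05, 0.05, 0.05]    # search has zoomed in on one shot
```

A falling entropy is the signal the slide correlates with MAP.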

  31. Interactive Results • An interactive strategy combining multiple modalities is in general beneficial (MAP = 0.1900), even when one modality does not perform well • Monitoring marginal entropy to decide between modalities for the update strategy is not yet successful (but still promising)

  32. Surprise, Surprise…

  33. Under the Hood: Work in Progress • Back to the future: DB+IR!!! • All static-model processing has been moved from customised Matlab scripts to MonetDB query plans (CWI’s open-source main-memory DBMS) • Parallel training process on a Linux cluster • Next steps: integration with MonetDB’s XQuery front-end (Pathfinder) and the Cirquid project’s XML-IR system (TIJAH)

  34. Conclusions • For most topics, neither the static nor the dynamic visual model captures the user’s information need sufficiently… • …averaged over all topics, however, it is better to use both modalities than ASR only • Working hypothesis: matching against both modalities gives robustness

  35. Conclusions • Visual aspects of an information need are best captured by using multiple examples • Combining results for multiple (good) examples in round-robin fashion, each ranked on both modalities, gives near-best performance for almost all topics

  36. Unfinished Research! • Analysis of TRECVID 2004 results • Q: Why is the dynamic model better? More training data, spatio-temporal aspects in the model, varying number of components, less dependence on the keyframe, … • Q: Why does the audio not help? • Q: Why does the entropy-based monitoring of user-system performance not help?

  37. Unfinished Research! • Comparison to TRECVID 2003 results: apply the 2004 training procedure to 2003 data; apply an anchor-person detector; apply 2003 topic processing (& vice versa) • Static model: full covariance matrices; varying number of components

  38. Future Research • Retrieval model: apply the document-generation approach; how to properly model multiple modalities? How to handle multiple query examples? • System aspects: integration of the INEX and TRECVID systems; top-k query processing

  39. Thanks !!!
