
TRECVID: Promoting Research Via Community Technology Evaluations






Presentation Transcript


  1. TRECVID: Promoting Research Via Community Technology Evaluations Paul Over TRECVID Project Leader Information Access Division National Institute of Standards and Technology Gaithersburg, MD, USA http://trecvid.nist.gov DMASM 2011

  2. What is TRECVID? A workshop series (2001 – present), http://trecvid.nist.gov, to promote research and progress in content-based video analysis/exploitation.
  • Foundation for large-scale laboratory testing
  • Forum for the exchange of research ideas and the discussion of research methodology – what works, what doesn't, and why
  • Focus: content-based approaches to retrieval/detection/summarization/segmentation/…
  • Aims for realistic system tasks and test collections: unfiltered data; relatively high-level functionality (e.g., interactive search); measurement against human abilities
  • Provides data, tasks, and uniform, appropriate scoring procedures

  3. TRECVID Philosophy
  • TRECVID is a modern example of the Cranfield tradition: laboratory system evaluation based on test collections
  • Emphasis on advancing the state of the art from evaluation results – TRECVID's primary aim is not competitive product benchmarking; it is an experimental workshop, and sometimes experiments fail!
  • Laboratory experiments (vs., e.g., observational studies) sacrifice operational realism and broad scope of conclusions for control and information about causality – what works and why
  • Results tend to be narrow and at best indicative, not final; evidence grows as approaches prove themselves repeatedly, as part of various systems, against various test data, over years

  4. TRECVID Yearly Cycle (stages of the cycle, roughly in order):
  • Call for Participation
  • Data procurement
  • Task definitions complete
  • Search topic and ground truth development
  • System building & experimentation; community contributions (shots, training data, ASR, MT, etc.)
  • Results evaluation
  • Results analysis and workshop paper/presentation preparation (~400 authors/year)
  • TRECVID Workshop
  • Post-workshop experiments, final papers

  5. TRECVID's Evolution, …2003–2011 [timeline chart: bars mark the years each dataset and task was active and when new development or test data was added; the number of participating teams per year was also shown]
  Data (hours): English TV news; BBC rushes; Sound & Vision (S&V)
  Tasks: shot boundaries; ad hoc search; features/semantic indexing; stories; camera motion; BBC rushes summaries; copy detection; surveillance events; known-item search; instance search pilot; multimedia event detection (MED) pilot

  6. TRECVID 2010 Tasks and Data

  7. TV2010 Finishers

  8. Acknowledgments: Brewster Kahle (Internet Archive's founder) and R. Manmatha (U. Mass, Amherst) suggested in December 2008 that TRECVID take another look at the resources of the Archive. Cara Binder and Raj Kumar at archive.org helped explain how to query and download automatically from the Internet Archive. Georges Quénot, with Franck Thollard, Andy Tseng, and Bahjat Safadi from LIG and Stéphane Ayache from LIF, shared coordination of the semantic indexing task and organized additional judging with support from the Quaero program. Georges Quénot and Stéphane Ayache again organized a collaborative annotation of 130 features. Shin'ichi Satoh at NII, along with Alan Smeaton and Brian Boyle at DCU, arranged for the mirroring of the video data.
  Support: National Institute of Standards and Technology (NIST); Intelligence Advanced Research Projects Activity (IARPA); Department of Homeland Security (DHS)
  Contributors:
  • Colum Foley and Kevin McGuinness (DCU) helped segment the instance search topic examples and set up the oracle at DCU for interactive systems in the known-item search task.
  • The LIMSI Spoken Language Processing Group and VexSys Research provided ASR for the IACC.1 videos.
  • Laurent Joyeux (INRIA-Rocquencourt) updated the copy detection query generation code.
  • Matthijs Douze from INRIA-LEAR volunteered a camcorder simulator to automate the camcording transformation for the copy detection task.
  • Emine Yilmaz (Microsoft Research) and Evangelos Kanoulas (U. Sheffield) updated their xinfAP code (sample_eval.pl) to estimate additional values and made it available.

  9. Some impacts…
  • Continuing improvement in feature detection (automatic tagging), e.g., in the University of Amsterdam's MediaMill system: performance on 36 features doubled from 2006 to 2009 – within-domain (train and test) MAP 0.22 -> 0.41; cross-domain MAP 0.13 -> 0.27
  • Bibliometric study of TRECVID's scholarly impact, 2003–2009 (Dublin City University & University College Dublin): 2073 peer-reviewed journal/conference papers
  • 2010 RTI International economic impact study of TREC/TRECVID: "…for every $1 that NIST and its partners invested in TREC[/TRECVID], at least $3.35 to $5.07 in benefits accrued to IR [Information Retrieval] researchers"

  10. TRECVID search types so far. TRECVID search has modeled a user looking for video shots for reuse:
  • of people, objects, locations, events
  • not just information (e.g., video of X, not video of someone talking about X)
  • independent of original intent, saliency, etc.
  • in video of various sorts (without metadata other than file names): multilingual broadcast news (Arabic, Chinese, English); Dutch "edutainment", cultural, news magazine, and historical shows
  using queries containing text only, text + image/video examples, or image/video examples only, in two modes: fully automatic and human-in-the-loop search

  11. Panofsky/Shatford mode/facet matrix** [the matrix itself is not reproduced in the transcript]
  ** From Enser, Peter G. B. and Sandom, Christine J. Retrieval of Archival Moving Imagery – CBIR Outside the Frame. CIVR 2002, LNCS 2383, pp. 206–214.

  12. 24 Topics from TRECVID 2009:
  • Find shots of a road taken from a moving vehicle through the front window.
  • Find shots of a crowd of people, outdoors, filling more than half of the frame area.
  • Find shots with a view of one or more tall buildings (more than 4 stories) and the top story visible.
  • Find shots of a person talking on a telephone.
  • Find shots of a close-up of a hand, writing, drawing, coloring, or painting.
  • Find shots of exactly two people sitting at a table.
  • Find shots of one or more people, each walking up one or more steps.
  • Find shots of one or more dogs, walking, running, or jumping.
  • Find shots of a person talking behind a microphone.
  • Find shots of a building entrance.
  • Find shots of people shaking hands.
  • Find shots of a microscope.
  • Find shots of two or more people, each singing and/or playing a musical instrument.
  • Find shots of a person pointing.
  • Find shots of a person playing a piano.
  • Find shots of a street scene at night.
  • Find shots of printed, typed, or handwritten text, filling more than half of the frame area.
  • Find shots of something burning with flames visible.
  • Find shots of one or more people, each at a table or desk with a computer visible.
  • Find shots of an airplane or helicopter on the ground, seen from outside.
  • Find shots of one or more people, each sitting in a chair, talking.
  • Find shots of one or more ships or boats, in the water.
  • Find shots of a train in motion, seen from outside.
  • Find shots with the camera zooming in on a person's face.

  13. Drilling down in the search landscape [diagram: example search scenarios spread across two axes – data types, and searcher abilities, needs, preferences, history – with TRECVID occupying part of that space]:
  • You want something to make you laugh
  • Fan searches for favorite TV show episode
  • Voter looks for video of candidate X at recent town hall meeting
  • Doctor searches echocardiogram videos for instances like an example
  • Your mother searches home videos for shots of daughter playing with family pet
  • Security personnel search surveillance video archive for suspicious behavior
  • Student searches Web for new music video
  • 10-yr old looks for video of tigers for school report
  • Intelligence analyst searches multilingual open source video for background info on location X
  • Documentary producer searches TV archive for reusable shots of Berlin in 1920's

  14. Finding meaning in text (words) versus images (pixels). Hurricane Andrew which hit the Florida coast south of Miami in late August 1992 was at the time the most expensive disaster in US history. Andrew's damage in Florida cost the insurance industry about $8 billion. There were fifteen deaths, severe property damage, 1.2 million homes were left without electricity, and in Dade county alone 250,000 were left homeless.

  15. One image/video – many different (changing) views of content. Possible content keywords/tags: women, pigeons, plaza, buildings, outdoors, daytime, running, falling, clapping, … Creator's keywords: "stupid sister" www.archive.org/details/StupidSister

  16. One person/thing/location – many different (changing) appearances

  17. Can multimedia features serve as "words"?
  • Low-level features: color, texture, shape
  • High-level features: 449 annotated LSCOM features; 39 LSCOM-Lite; the 20 evaluated in TRECVID 2009: Classroom, Chair, Infant, Traffic intersection, Doorway, Airplane-flying, Person-playing-a-musical-instrument, Bus, Person-playing-soccer, Cityscape, Person-riding-a-bicycle, Telephone, Person-eating, Demonstration-Or-Protest, Hand, People-dancing, Nighttime, Boat-Ship, Female-human-face-closeup, Singing
  • Text from speech and from video OCR
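To make the low-level side concrete, here is a minimal sketch (illustrative only, not TRECVID code) of one classic low-level color feature: a quantized joint color histogram over a keyframe. The function name and the synthetic frame are assumptions for the example.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Normalized joint RGB histogram of one keyframe (HxWx3 uint8 array).

    Crude low-level "color words" of the kind fed to concept detectors.
    """
    # Quantize each channel into `bins` levels, then combine into one code.
    q = (frame.astype(np.uint32) * bins) // 256          # values in [0, bins)
    codes = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()                             # sums to 1

# Toy usage on a synthetic 120x160 "keyframe":
rng = np.random.default_rng(0)
keyframe = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
print(color_histogram(keyframe).shape)                   # (512,) for bins=8
```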

  18. LSCOM feature sample: 000 - Parade, 001 - Exiting_Car, 002 - Handshaking, 003 - Running, 004 - Airplane_Crash, 005 - Earthquake, 006 - Demonstration_Or_Protest, 007 - People_Crying, 008 - Airplane_Takeoff, 009 - Airplane_Landing, 010 - Helicopter_Hovering, 011 - Golf, 012 - Walking, 013 - Singing, 014 - Baseball, 015 - Basketball, 016 - Football, 017 - Soccer, 018 - Tennis, 019 - Speaking_To_Camera, 020 - Riot, 021 - Natural_Disasters, 022 - Tornado, 023 - Ice_Skating, 024 - Snow, 025 - Flood, 026 - Skiing, 027 - Talking, 028 - Dancing, 029 - Car_Crash, 030 - Funeral, 031 - Gymnastics, 032 - Rocket_Launching, 033 - Cheering, 034 - Greeting, 035 - Throwing, 036 - Shooting, 037 - Address_Or_Speech, 038 - Bomber_Bombing, 039 - Celebration_Or_Party, 040 - Airport, 041 - Barn, 042 - Castle, 043 - College, 044 - Courthouse, 045 - Fire_Station, 046 - Gas_Station, 047 - Grain_Elevator, 048 - Greenhouse, 049 - Hangar, 050 - Hospital, 051 - Hotel, 052 - House_Of_Worship, 053 - Police_Station, 054 - Power_Plant, 055 - Processing_Plant, 056 - School, 057 - Shopping_Mall, 058 - Stadium, 059 - Supermarket, 060 - Airport_Or_Airfield, 061 - Aqueduct, 062 - Avalanche, 063 - River_Bank, 064 - Aircraft_Cabin, … 810 - Still_Image_Composition_May_Include_Text, 811 - Stock_Exchange, 812 - Stockyard, 813 - Storage_Tanks, 814 - Store_Outside, 815 - Street_Signs, 816 - Street_Vendor, 817 - Students_Schoolkids, 818 - Suitcases, 819 - Surgeons, 820 - Sword, 821 - Synagogue, 822 - Tailor, 823 - Tanneries, 824 - Taxi_Driver, 825 - Teacher, 826 - Team_Organized_Group, 827 - Technicians, 828 - Teenagers, 829 - Temples, 830 - Terrorist, 831 - Text_Only_Artificial_Bkgd, 832 - Thatched_Roof_Buildings, 833 - Theater, 834 - Toddlers, 835 - Town_Halls, 836 - Town_Squares, 837 - Townhouse, 838 - Tractor, 839 - Traffic_Cop, 840 - Train_Station, 841 - Tribal_Chief, 842 - Twilight, 843 - Uav, 844 - Vacationer_Tourist, 845 - Vandal, 846 - Veterinarian, 847 - Viaducts, 848 - Vineyards, 849 - Voter, 850 - Waiter_Waitress, 851 - Water_Mains, 852 - Windmill, 853 - Wooden_Buildings, 854 - Worker_Laborer. http://www.lscom.org

  19. Simulation study suggests… "… 'concept-based' video retrieval with fewer than 5000 concepts, detected with minimal accuracy of 10% mean average precision is likely to provide high accuracy results, comparable to text retrieval on the web, in a typical broadcast news collection."*
  * Alexander Hauptmann, Rong Yan, Wei-Hao Lin, Michael Christel, and Howard Wactlar. Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News. IEEE Transactions on Multimedia, Vol. 9, No. 5, August 2007, pp. 958-966.
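A minimal sketch (assumed names and toy numbers, not the cited paper's code) of the concept-based ranking idea the study simulates: each shot carries per-concept detector scores, and a query is answered by combining the scores of the concepts it implicates.

```python
import numpy as np

# Hypothetical detector scores in [0, 1]: rows are shots, columns concepts.
concepts = ["road", "vehicle", "building", "crowd"]
scores = np.array([
    [0.9, 0.8, 0.1, 0.2],
    [0.2, 0.1, 0.7, 0.6],
    [0.8, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.9, 0.8],
    [0.7, 0.9, 0.3, 0.2],
])

def rank_shots(query_concepts, weights=None):
    """Rank shots by a weighted sum of the query's concept scores."""
    idx = [concepts.index(c) for c in query_concepts]
    w = np.ones(len(idx)) if weights is None else np.asarray(weights, float)
    combined = scores[:, idx] @ w        # one combined score per shot
    return np.argsort(-combined)         # best-scoring shots first

# A query resembling "a road taken from a moving vehicle":
print(rank_shots(["road", "vehicle"]))   # e.g., [0 4 2 1 3]
```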

  20. A generic TRECVID search system (based on Snoek and Worring 2008**) [block diagram; components: shot-segmented video feeding basic concept detection (feature fusion, classifier fusion, best-of selection, modeling relations) into a database; the searcher's information need driving query methods (query requests, query prediction), query results combination, visualization, and learning from the searcher]
  ** Cees G. M. Snoek and Marcel Worring. Concept-Based Video Retrieval. Foundations and Trends in Information Retrieval, Vol. 2, No. 4 (2008), pp. 215-322.
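The "classifier fusion" component is commonly realized as late fusion: separate classifiers are trained per feature or modality and their scores are combined, e.g., by a weighted average. A minimal illustrative sketch (the modality names and weights are assumptions):

```python
import numpy as np

def late_fusion(score_lists, weights):
    """Weighted average of per-modality classifier scores for the same shots."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize the modality weights
    return w @ np.vstack(score_lists)    # (modalities,) @ (modalities, shots)

# Toy scores from three modality-specific detectors over 5 shots;
# the text (ASR) channel is trusted twice as much, say from validation runs.
color = np.array([0.2, 0.9, 0.4, 0.1, 0.7])
texture = np.array([0.3, 0.8, 0.5, 0.2, 0.6])
asr_text = np.array([0.0, 0.7, 0.9, 0.1, 0.5])
print(late_fusion([color, texture, asr_text], weights=[1, 1, 2]))
```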

  21. Innovative search interfaces… U. Amsterdam MediaMill http://www-nlpir.nist.gov/projects/tvpubs/tv9.slides/mediamill1.slides.pdf

  22. Some results: keyframes from the top 20 clips returned by a system for the query "shots of person seated at computer" [keyframe grid not reproduced in the transcript]

  23. Variation in Average Precision by topic [per-topic chart; labeled topics include "Close-up of hand writing…", "Dogs walking…", and "Printed, typed… text…"]. Crowds of people (270), Building entrance (278), and People at desk with computer (287) each had an automatic maximum better than the interactive maximum.
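For reference, the measure plotted per topic, average precision, rewards systems that place relevant shots early in the ranked result list. A minimal sketch of the standard computation (illustrative, not NIST's sample_eval.pl):

```python
def average_precision(ranked_relevance, num_relevant):
    """Average precision for one topic.

    ranked_relevance: 0/1 flags, one per returned shot, in rank order.
    num_relevant: total relevant shots in the collection for this topic.
    """
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at each relevant shot
    return precision_sum / num_relevant if num_relevant else 0.0

# Relevant shots returned at ranks 1, 3, and 6; 4 relevant exist in all.
print(average_precision([1, 0, 1, 0, 0, 1], num_relevant=4))  # ~0.542
```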

  24. Observations, questions…
  • One solution will not fit all. Investigations and discussion of video search must be related to the searcher's specific needs, capabilities, and history, and to the kinds of data being searched.
  • The enormous and growing amounts of video require extremely large-scale approaches to video exploitation. Much of it has little or no metadata describing the content in any detail.
  • TRECVID participants have explored some automatic approaches to tagging, and the use of those tags in automatic and interactive search systems, on a couple of sorts of video. Much has been learned, and some results may already be useful, but most of the territory is still unexplored.

  25. Observations, questions… Within the focus of TRECVID experiments…
  • Multiple information sources (text, audio, video), each errorful, can yield better results when combined than when used alone.
  • A human in the loop in search still makes an enormous difference.
  • Text from speech via automatic speech recognition (ASR) is a powerful source of information, but: its usefulness varies by video genre; not everything or everyone in a video is talked about, "in the news"; audible mentions are often offset in time from visibility (see the sketch after this list); and not all languages have good ASR.
  • Machine learning approaches to tagging yield seemingly useful results against large amounts of data when training data is sufficient and similar to the test data, but will they work well enough to be useful on highly heterogeneous video?
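On the time-offset point: one common workaround is to credit each ASR word not only to the shot containing its timestamp but to all shots within a window around it. A minimal sketch under assumed data structures (not any particular team's system):

```python
# Shots are (start, end) times in seconds; ASR words are (word, time) pairs.

def index_asr_to_shots(shots, asr_words, window=10.0):
    """Map each shot index to the ASR words near it in time."""
    shot_text = {i: [] for i in range(len(shots))}
    for word, t in asr_words:
        for i, (start, end) in enumerate(shots):
            # Widen the shot's boundaries by `window` seconds on each side.
            if start - window <= t <= end + window:
                shot_text[i].append(word)
    return shot_text

shots = [(0.0, 4.0), (4.0, 9.5), (9.5, 15.0)]
asr_words = [("hurricane", 3.2), ("florida", 8.0), ("damage", 14.1)]
print(index_asr_to_shots(shots, asr_words, window=5.0))
# "florida" (t=8.0) is credited to all three shots with a 5 s window.
```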

  26. Observations, questions… Within the focus of TRECVID experiments…
  • A hierarchy of automatically derived features can help bridge the gap between pixels and meaning and can assist search, but problems abound: What is the right set of features for a given application? Given a query, how do you automatically decide which specific features to use? Creating quality training data, even with active learning, is very expensive.
  • Searchers (experts and non-experts) will use more than text queries if available: concepts, visual similarity, temporal browsing, positive and negative relevance feedback, … http://www.videolympics.org
  • Processing video using a sample of more than one frame per shot yields better results but quickly pushes common hardware configurations to their limits (a sketch follows below).
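A minimal sketch (assumed timings and scores) of the multi-frame idea in the last bullet: sample several frames uniformly across a shot, score each with a concept detector, and pool the per-frame scores, here with max pooling so a concept visible in only part of the shot is not averaged away.

```python
import numpy as np

def sample_frames(shot_start, shot_end, fps=25.0, frames_per_shot=5):
    """Frame indices spread uniformly across one shot."""
    first = int(shot_start * fps)
    last = max(first, int(shot_end * fps) - 1)
    return np.linspace(first, last, frames_per_shot).astype(int)

def shot_score(per_frame_scores):
    """Max-pool per-frame detector scores into one shot-level score."""
    return float(np.max(per_frame_scores))

frames = sample_frames(12.0, 16.0)                 # e.g., [300 324 349 374 399]
per_frame = np.array([0.1, 0.2, 0.8, 0.3, 0.1])    # hypothetical detector output
print(frames, shot_score(per_frame))               # ... 0.8
```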

  27. Observations, questions… Within the focus of TRECVID experiments…
  • TRECVID has only just started looking at combining automatically derived and manually provided evidence in search.
  • Systems have been using externally annotated video (e.g., Flickr), but results are not conclusive.
  • Internet Archive video will provide titles, keywords, and descriptions. Where in the Panofsky hierarchy are the donors' descriptions? If very personal, does that mean they are less useful for other people?
  • Observational studies of real searching of various sorts are needed, using current functionality and identifying unmet needs.
  • Researchers need access to much more multimedia data of varying kinds and mixtures, with and without human annotation.

  28. Observations, questions… Is it time to take some of the ideas developed in the laboratory out for small-scale testing with real users, real needs, and real video collections?
