
Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video


Presentation Transcript


  1. Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video TRECVID 2003 Carnegie Mellon University A. Hauptmann, R.V. Baron, M.-Y. Chen, M. Christel, P. Duygulu, C. Huang, R. Jin, W.-H. Lin, T. Ng, N. Moraveji, N. Papernick, C.G.M. Snoek, G. Tzanetakis, J. Yang, R. Yang, and H.D. Wactlar

  2. Overview (1/3) • TRECVID 2003 tasks • Shot boundary determination • identify the shot boundaries in the given video clip(s) • Story segmentation • identify story boundaries and types (miscellaneous or news) • High-level feature extraction • Outdoors, News subject face, People, Building, Road, Animal, ... • Search • given the search test collection and a multimedia statement of information need (topic), return a ranked list of common reference shots from the test collection

  3. Overview (2/3) • Search • Interactive Search • Manual Search

  4. Overview (3/3) • Semantic Classifiers • most are trained on keyframes • Interactive Search • allows more effective browsing and visualization of the results of text queries using a variety of filter strategies • Manual Search • uses multiple retrieval agents (color, texture, ASR, OCR, and some of the classifiers, e.g. anchor, Person X) • Negative pseudo-relevance feedback • Co-retrieval • Even the text-based baseline using the OKAPI formula performed better than the systems of other groups

  5. Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (1/3) • Audio Features • These features assist the extraction of the following medium-level audio-based features: music, male speech, female speech, and noise. • Based on the magnitude spectrum calculated using a Short-Time Fourier Transform (STFT). • They summarize the overall spectral characteristics: Spectral Centroid, Rolloff, relative subband energies, and the Mel-Frequency Cepstral Coefficients (a centroid/rolloff sketch follows this slide) • male/female discrimination: uses the Average Magnitude Difference Function (AMDF)
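
The centroid and rolloff summaries are textbook quantities; here is a minimal NumPy sketch, assuming mono audio frames at sample rate sr (the Hann window and the 0.85 rolloff fraction are illustrative choices, not the authors' settings):

```python
import numpy as np

def spectral_centroid_rolloff(frame, sr, rolloff_pct=0.85):
    """Spectral centroid (Hz) and rolloff (Hz) of one windowed audio frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # centroid: magnitude-weighted mean frequency
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    # rolloff: frequency below which rolloff_pct of the energy lies
    cumulative = np.cumsum(mag)
    rolloff_bin = np.searchsorted(cumulative, rolloff_pct * cumulative[-1])
    return centroid, freqs[min(rolloff_bin, len(freqs) - 1)]
```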

  6. Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (2/3) • Low-level Image Features • The color feature is the mean and variance of each color channel in HSV (Hue-Saturation-Value) color space over a 5*5 image tessellation (sketched below). • Another low-level feature is the Canny edge direction histogram. • Face Features • Schneiderman's face detector algorithm • The size and position of the largest face are used as additional face features
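
A rough sketch of the 5×5 HSV tessellation feature using OpenCV; the per-cell mean and variance of the three channels give 5·5·3·2 = 150 dimensions. The authors' exact implementation may differ:

```python
import cv2
import numpy as np

def hsv_grid_feature(bgr_image, grid=5):
    """Mean and variance of H, S, V per cell of a grid x grid tessellation."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV).astype(np.float32)
    h, w = hsv.shape[:2]
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            cell = hsv[gy * h // grid:(gy + 1) * h // grid,
                       gx * w // grid:(gx + 1) * w // grid]
            feats.extend(cell.mean(axis=(0, 1)))  # per-channel mean
            feats.extend(cell.var(axis=(0, 1)))   # per-channel variance
    return np.array(feats)                        # 150-dim for grid=5
```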

  7. Extracted Features and Non-TRECVID Metadata Classifiers for Anchors and Commercials (3/3) • Text-based features • the most reliable high-level feature • Automatic Speech Recognition transcripts (ASR), Video Optical Character Recognition (VOCR) • Video OCR (VOCR) • Manber and Wu's approximate string matching technique lets a query such as "Clinton" retrieve OCR variants like "Cllnton", "Ciintonfi", "Cltnton", and "Clinton" (see the sketch below) • However, badly garbled recognitions such as "EIICKINSON" (for "DICKINSON") and "Cincintoli" (for "Cincinnati") remain problematic
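
Manber and Wu's technique is the bitap algorithm behind agrep; for brevity, the sketch below substitutes a plain edit-distance check, which captures the same idea of tolerating a few character errors (the max_edits threshold is an assumption, not the paper's setting):

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def fuzzy_match(query, vocr_token, max_edits=2):
    """True if the VOCR token is within max_edits errors of the query."""
    return edit_distance(query.lower(), vocr_token.lower()) <= max_edits
```

With max_edits=2, "Clinton" matches "Cllnton" and "Cltnton" but not more heavily garbled strings.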

  8. Fisher Linear Discriminant for Anchors and Commercials (1/2) • Multimodal combination approach: apply FLD to every feature set and synthesize new feature vectors • These synthesized feature vectors represent the content; standard feature-vector classification approaches are then applied (sketched below). • Two different SVM-based classifiers: • anchor: color histogram, face info., and speaker info. • commercial: color histogram and audio features
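
A hedged sketch of the combination idea using scikit-learn: project each modality's feature set with Fisher LDA, concatenate the projections into one synthesized vector, and train an SVM on top. The modality inputs are placeholders for the slide's color, face, and speaker features:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def train_fld_svm(feature_sets, labels):
    """feature_sets: list of (n_samples, d_i) arrays, one per modality."""
    # one FLD projection per feature set
    projectors = [LinearDiscriminantAnalysis().fit(X, labels)
                  for X in feature_sets]
    # concatenate the projections into the synthesized feature vector
    synthesized = np.hstack([p.transform(X)
                             for p, X in zip(projectors, feature_sets)])
    return projectors, SVC(probability=True).fit(synthesized, labels)
```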

  9. Fisher Linear Discriminant for Anchors and Commercials (2/2) • FLD weights for anchor detection • Anchor and Commercial classifier result

  10. Feature Classifiers (1/7) • Baseline SVM Classifier with Common Annotation Data • SVM with a degree-2 polynomial kernel • uses only image features (no face) • performs video-based cross-validation on portions of the common annotation data

  11. Feature Classifiers (2/7) • Building Detection • explores a classifier adapted from the man-made structure detection method of Kumar and Hebert • this method produces a binary detection output for each cell of a 22*16 grid; 5 features are extracted from these binary outputs (sketched below): • number of positive grid cells; • area of the bounding box that includes all positive cells; • x and y coordinates of the center of mass of the positive cells; • ratio of the bounding-box width to height; • compactness • 462 images are used as positive examples and 495 images as negative examples, classified by FLD and SVM • MAP 0.042 (man-made structures) vs. 0.071 (baseline SVM)
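
An illustrative computation of the features derived from the 22×16 binary grid; the exact definitions (e.g. of compactness) are assumptions, since the slide does not spell them out, and the x/y center coordinates here count together as one of the slide's five features:

```python
import numpy as np

def grid_features(binary_grid):
    """binary_grid: (16, 22) array of 0/1 man-made-structure detections."""
    ys, xs = np.nonzero(binary_grid)
    if len(xs) == 0:
        return np.zeros(6)                # no structure detected
    w = xs.max() - xs.min() + 1
    h = ys.max() - ys.min() + 1
    n_pos = len(xs)                       # number of positive cells
    bbox_area = w * h                     # bounding-box area
    cx, cy = xs.mean(), ys.mean()         # center of mass
    aspect = w / h                        # width/height ratio
    compactness = n_pos / bbox_area       # assumed: fill ratio of the bbox
    return np.array([n_pos, bbox_area, cx, cy, aspect, compactness])
```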

  12. Feature Classifiers (3/7) • Plane Detection using additional still image data • uses the image features described above • 3368 plane examples selected from the web, the Corel data set, and the University of Oxford data set as positive examples • 3516 negative examples • By FLD and SVM, MAP 0.008 vs. 0.059 (baseline) • Car Detection • modifies the Schneiderman face detector algorithm • outperforms the baseline with MAP 0.114 vs. 0.040

  13. Feature Classifiers (4/7) • Zoom Detection • uses MPEG motion vectors to estimate the probability of a zoom pattern (a toy version is sketched below) • MAP 0.632 • Female Speech • uses an SVM trained on the LIMSI-provided speech features, together with the face characteristics • MAP 0.465
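
A toy rendering of the zoom test, not the authors' estimator: during a zoom, motion vectors point radially toward or away from the frame center, so the coherence of their radial components yields a score near 1 for a clean zoom and near 0 for incoherent motion:

```python
import numpy as np

def zoom_score(positions, vectors):
    """positions, vectors: (n, 2) arrays of macroblock centers and MPEG
    motion vectors. Returns a zoom-coherence score in roughly [0, 1]."""
    center = positions.mean(axis=0)               # ~ frame center for a full grid
    offsets = positions - center
    offsets /= np.linalg.norm(offsets, axis=1, keepdims=True) + 1e-9
    radial = np.sum(offsets * vectors, axis=1)    # signed radial component
    mags = np.linalg.norm(vectors, axis=1)
    return abs(radial.mean()) / (mags.mean() + 1e-9)
```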

  14. Feature Classifiers (5/7) • Text and Timing for Weather News, Outdoors, Sporting Event, Physical Violence, and Person X Classifiers • Models based only on text information perform better than random baselines on the development data

  15. Feature Classifiers (6/7) • Timing information exploits the implicit temporal structure of broadcast news, which is especially strong for weather reports and sports.

  16. Feature Classifiers (7/7) • For each shot, the predictions from both the text-based and timing-based classifiers have to be considered • Except for weather news, the results suggest that the text information of the broadcast news in the shot may not be enough to detect these high-level features.

  17. News Subject Monologues (1/2) • Based on the LIMSI speech annotations, a voice-over detector and a frequent-speaker detector were developed • VOCR is applied to extract overlaid text in the hope of finding people's names

  18. News Subject Monologues (2/2) • Another feature measures the average amount of motion in a camera shot, based on frame differencing • commercial and anchor detectors are also used • individual detectors and features are combined using two well-known classifier combination schemes, namely stacking and bagging (illustrated below) • MAP 0.616
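
Stacking and bagging are stock ensemble methods; a minimal illustration with scikit-learn's implementations, where the base learners stand in for the slide's individual detectors:

```python
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# stacking: a meta-learner combines the base detectors' outputs
stacker = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression())

# bagging: vote over base learners trained on bootstrap resamples
bagger = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25)

# Both expose fit(X, y) / predict(X) like any sklearn classifier.
```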

  19. Finding Person X in Broadcast News (1/3) • Uses text info. from the transcript and face info. • Relationship between the name of person X and time • S: one shot; TS: time of the key frame; TO: time of occurrence of the person's name

  20. Finding Person X in Broadcast News (2/3) • More limited face recognition based on video shots • collect sample faces {F1, F2, …, Fn} for person X • and all faces {f1, f2, …, fm} from i-frames of the news shots for which Ptext is larger than zero • build the eigenspace for the faces {f1, f2, …, fm, F1, F2, …, Fn} and represent them by the eigenfaces {eigf1, eigf2, …, eigfm, eigF1, …, eigFn} • combine rank scores and estimate which shots have a high probability of containing that face (a loose sketch follows)
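
A loose sketch of the eigenface step with scikit-learn's PCA: build the joint eigenspace over the shot faces plus the sample faces of person X, then score each shot face by its distance to the nearest sample face. The linear combination with Ptext at the end is an assumed form, not the paper's exact formula:

```python
import numpy as np
from sklearn.decomposition import PCA

def face_scores(shot_faces, sample_faces, n_components=20):
    """Both inputs: (n, d) arrays of flattened, aligned face crops."""
    all_faces = np.vstack([shot_faces, sample_faces])
    pca = PCA(n_components=min(n_components, len(all_faces))).fit(all_faces)
    sf = pca.transform(shot_faces)
    qf = pca.transform(sample_faces)
    # distance from each shot face to the nearest sample face of person X
    dists = np.linalg.norm(sf[:, None, :] - qf[None, :, :], axis=2)
    return 1.0 / (1.0 + dists.min(axis=1))     # higher = more similar

def person_x_score(p_text, face_sim, alpha=0.5):
    # assumed linear combination of text evidence and face similarity
    return alpha * p_text + (1 - alpha) * face_sim
```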

  21. Finding Person X in Broadcast News (3/3) • Using "Madeleine Albright" as person X, 20 faces obtained from a Google image search served as sample query faces.

  22. Learning Combination Weights in Manual Retrieval (1/5) • Shot-based video retrieval: • a set of features is extracted • each shot is associated with a vector of individual retrieval scores from different media search modules • finally, these retrieval scores are fused into a final ordered list via some aggregation algorithm

  23. Learning Combination Weights in Manual Retrieval (2/5) • The weighted Borda fuse model is the basic combination approach for multiple search modules, i.e. each shot's final score is a weighted sum of rank-based points from each module (sketched below) • Similarity Measures • For video frames, the harmonic mean of the Euclidean distances from each query image (over color, texture, and edge features) is the distance between the query and a video frame • For text, matching against closed-caption (CC) and OCR transcripts is done using the OKAPI BM-25 formula
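
Weighted Borda fusion is simple enough to show in full; the module names, shot IDs, and weights below are placeholders:

```python
def weighted_borda(ranked_lists, weights):
    """ranked_lists: {module: [shot_id, ...], best first};
    weights: {module: weight}. Returns shot IDs, best first."""
    scores = {}
    for module, ranking in ranked_lists.items():
        n = len(ranking)
        for rank, shot in enumerate(ranking):
            # top-ranked shot earns n points, scaled by the module weight
            scores[shot] = scores.get(shot, 0.0) + weights[module] * (n - rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. text module weighted twice as heavily as color
fused = weighted_borda(
    {"text": ["s3", "s1", "s7"], "color": ["s1", "s3", "s9"]},
    {"text": 2.0, "color": 1.0})
```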

  24. Learning Combination Weights in Manual Retrieval (3/5) • Negative Pseudo-Relevance Feedback (NPRF) • NPRF is effective at providing a more adaptive similarity measure for image retrieval • A better strategy for sampling negative examples is proposed, inspired by Maximal Marginal Relevance: • Maximal Marginal Irrelevance (MMIR), sketched below
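
The MMIR details are not on the slide, so the following is speculative: greedily pick pseudo-negatives that are far from the query and far from the negatives already picked, mirroring MMR's relevance/diversity tradeoff. The lambda weighting is an assumed form:

```python
import numpy as np

def mmir_negatives(candidates, query_vec, k=10, lam=0.5):
    """candidates: (n, d) feature vectors of low-ranked shots.
    Returns indices of k diverse pseudo-negative examples."""
    chosen, remaining = [], list(range(len(candidates)))
    while remaining and len(chosen) < k:
        def score(i):
            d_query = np.linalg.norm(candidates[i] - query_vec)
            d_chosen = min((np.linalg.norm(candidates[i] - candidates[j])
                            for j in chosen), default=0.0)
            # far from the query AND far from already-chosen negatives
            return lam * d_query + (1 - lam) * d_chosen
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```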

  25. Learning Combination Weights in Manual Retrieval (4/5) • The Value of Intermediate-level Detectors • The text-based feature is good at global ranking; the other features are useful for refining the ranking afterwards • Learning Weights for each Modality in Video Retrieval • Baseline: set weights based on query type (rendered as code below) • Person query: w = (text 2, face 1, color 1, anchor 0) • Non-person query: w = (text 2, face -1, color 1, anchor -1) • Aircraft and animal: w = (text 2, face -1, edge 1, anchor -1)
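
The baseline weight table transcribes directly into code; the query-type detection itself is out of scope here:

```python
# per-query-type fusion weights, copied from the slide
QUERY_TYPE_WEIGHTS = {
    "person":          {"text": 2, "face": 1,  "color": 1, "anchor": 0},
    "non_person":      {"text": 2, "face": -1, "color": 1, "anchor": -1},
    "aircraft_animal": {"text": 2, "face": -1, "edge": 1,  "anchor": -1},
}
```

Note the design choice: the negative face and anchor weights actively demote anchor-desk shots and close-up faces for query types where they are unlikely to be relevant.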

  26. Learning Combination Weights in Manual Retrieval (5/5) • Learning weights using a labeled training set • Supervised learning algorithm on the development set • Co-Retrieval • a set of video shots is first labeled as relevant using text-based features, and the results are then augmented by learning with the other visual and intermediate-level features • Experimental results

  27. Interactive TREC Video Retrieval Evaluation for 2003 (1/2) • The interface has the following features: • Storyboards of images spanning video story segments • Emphasis on shots matching the user's query, to reduce the image count • Resolution and layout under user control • Additional filtering provided through shot classifiers • Display of filter counts and distributions to guide manipulation of storyboard views

  28. Interactive TREC Video Retrieval Evaluation for 2003 (2/2)

  29. Conclusions • We believe the browsing interfaces and image-based search improvements made for 2003 led to the increase in performance of the new system, as these strategies allowed relevant content to be found even when it had no associated narrative or text metadata.
