Associating Video Frames with Text
Pinar Duygulu and Howard D. Wactlar
Informedia Project, Carnegie Mellon University
ACM SIGIR 2003
Abstract • Integration of visual and textual data in order to annotate video frames with more reliable labels and descriptions • The correspondence problem between video frames and associated text is solved using joint statistics • Better annotations can improve the performance of text-based queries
Introduction (1/2) • Video retrieval • visual vs. textual features • a system that combines the two kinds of features is more powerful • images/videos with descriptive text • the Corel data set, some museum collections, and captioned news photographs on the web • Correspondence problem • proposed methods model the joint statistics of words and image regions
Introduction (2/2) • Correspondence problems in video data • transcript words and frames may not co-occur at the same time • e.g., the query "president" in the Informedia system • Goal • determine the correspondence between video frames and the associated text, in order to annotate the frames with more reliable descriptions
Multimedia Translation (1/3) • Analogy • learning a lexicon for machine translation vs. learning a correspondence model for associating words with image regions • Missing data problem • assuming an unknown one-to-one correspondence between words, the correspondences are the missing data when fitting a joint probability distribution linking words in two languages • handled with the EM algorithm (see the formulation below)
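A sketch of the underlying formulation in IBM Model 1 style; the notation is illustrative rather than quoted from the paper:

```latex
% Illustrative Model-1 style likelihood for blobs b_1..b_N and words w_1..w_M,
% with a uniform prior over which blob each word aligns to:
\[
  p(w_1,\dots,w_M \mid b_1,\dots,b_N)
    \;=\; \prod_{j=1}^{M} \sum_{i=1}^{N} \tfrac{1}{N}\, t(w_j \mid b_i)
\]
% EM alternates: the E-step computes alignment probabilities p(a_j = i)
% from the current table t; the M-step re-estimates t from the
% expected alignment counts.
```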
Multimedia Translation (2/3) • Method • a set of images and a set of associated words • each image is segmented into regions, and from each region a set of features (color, texture, shape, position, and size) is extracted • the set of features representing the image regions is vector-quantized using k-means • each region then gets a single label (blob token) • then a probability table linking blob tokens with word tokens is constructed
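A minimal sketch of the blob-token step, assuming region features are already stacked in a NumPy array; scikit-learn's KMeans stands in for the paper's quantizer, and the file name is hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per image region (color, texture, shape, position, size).
region_features = np.load("region_features.npy")  # hypothetical file, shape (n_regions, n_dims)

# Vector-quantize the feature space; each cluster id becomes a "blob token".
kmeans = KMeans(n_clusters=500, random_state=0).fit(region_features)
blob_tokens = kmeans.labels_                       # one blob token per region
new_tokens = kmeans.predict(region_features[:5])   # quantize further regions the same way
```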
Multimedia Translation (3/3) • Method (cont.) • the table is initialized to the co-occurrence counts of blobs and words • the final translation probability table is constructed with the EM algorithm, which iterates between two steps: • use the current estimate of the probability table to predict correspondences • then use the correspondences to refine the estimate of the probability table • once learned, the table is used to predict the words corresponding to a particular image
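A minimal sketch of this EM loop, in IBM Model 1 style; the data layout and function name are illustrative, not the authors' code:

```python
import numpy as np

def train_translation_table(pairs, n_blobs, n_words, n_iters=20):
    """EM for a blob-to-word translation table (Model 1 style sketch).

    pairs: list of (blob_tokens, word_tokens) per image, each a list of int ids.
    Returns t where t[b, w] approximates p(word w | blob b).
    """
    # Initialize from co-occurrence counts, as the slides describe.
    t = np.full((n_blobs, n_words), 1e-6)
    for blobs, words in pairs:
        for w in words:
            np.add.at(t, (blobs, w), 1.0)
    t /= t.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        counts = np.full((n_blobs, n_words), 1e-12)
        for blobs, words in pairs:
            blobs = np.asarray(blobs)
            for w in words:
                # E-step: current table predicts soft correspondences
                p = t[blobs, w]
                p /= p.sum()
                # Accumulate expected counts (handles repeated blob ids)
                np.add.at(counts, (blobs, w), p)
        # M-step: re-estimate the table from the expected counts
        t = counts / counts.sum(axis=1, keepdims=True)
    return t
```

Once trained, annotating a frame amounts to ranking words by their translation probabilities from that frame's blob tokens.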
Correspondences on Video (1/3) • Broadcast news is very challenging data • due to its nature it is mostly about people and requires person detection/recognition • Data set • subsets of the Chinese culture and TREC 2001 data sets, which are relatively simpler • consists of video frames and the associated transcript extracted from the audio (Sphinx-III speech recognizer) • frames and transcripts are associated on a per-shot basis
Correspondences on Video (2/3) • Keyframe • segmented into regions by a fixed-size grid • a feature vector of size 46 represents each region • Position: (x, y) of the region center • Color: mean and variance of the HSV and RGB channels • Texture: mean and variance of 16 filter responses • four difference-of-Gaussian filters with different sigmas and twelve oriented filters, at 30-degree increments
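A rough sketch of how such a 46-dimensional vector could be assembled for one grid region; the difference-of-Gaussian sigmas and the Gabor filters below are stand-ins, since the paper's exact filter bank is not given here:

```python
import numpy as np
from scipy import ndimage
from skimage.color import rgb2hsv
from skimage.filters import gabor

def region_feature(rgb_region, cx, cy):
    """46-dim vector: 2 position + 12 color + 32 texture (mean/var of 16 filters)."""
    feats = [cx, cy]                                   # position of region center

    hsv = rgb2hsv(rgb_region)
    for channels in (hsv, rgb_region):                 # mean and variance of HSV and RGB
        for c in range(3):
            feats += [channels[..., c].mean(), channels[..., c].var()]

    gray = rgb_region.mean(axis=2)
    responses = []
    for s in (1, 2, 4, 8):                             # 4 difference-of-Gaussian filters (assumed sigmas)
        responses.append(ndimage.gaussian_filter(gray, s) -
                         ndimage.gaussian_filter(gray, 2 * s))
    for k in range(12):                                # 12 oriented filters, 30-degree steps
        real, _ = gabor(gray, frequency=0.3, theta=k * np.pi / 6)
        responses.append(real)
    for r in responses:                                # mean and variance of 16 responses
        feats += [r.mean(), r.var()]

    return np.asarray(feats)                           # length 46
```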
Correspondences on Video (3/3) • Vocabulary • consists only of nouns, extracted by applying Brill's tagger to the transcript • the vocabulary still contains noisy words
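The slides name Brill's tagger; as a stand-in, NLTK's default POS tagger extracts nouns in the same spirit (a sketch, not the authors' pipeline):

```python
import nltk  # requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")

def extract_nouns(transcript):
    tokens = nltk.word_tokenize(transcript.lower())
    tagged = nltk.pos_tag(tokens)
    # Keep noun tags (NN, NNS, NNP, NNPS); ASR errors still leave noisy words in.
    return [word for word, tag in tagged if tag.startswith("NN")]

print(extract_nouns("the great wall stretches across northern china"))
# ['wall', 'china']  (approximate; tagger output may vary)
```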
TREC 2001 Data (1/3) • TREC 2001 data set • 2232 keyframes and 1938 nouns • Difference from still images: text from the surrounding frames is also considered, by setting the window size to five (see the windowing sketch below) • Process • each image is divided into 7 × 7 blocks (49 regions) • the feature space is vector-quantized using k-means (k = 500) • EM is applied to obtain the final translation probabilities between 500 blob tokens and 1938 word tokens
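A minimal sketch of the windowing idea, assuming time-ordered shots each carrying their own transcript nouns; all names are illustrative:

```python
def words_in_window(shot_words, i, window=5):
    """Words associated with shot i: its own plus those of surrounding shots.

    shot_words: list of word lists, one per time-ordered shot.
    window: total number of shots contributing text (5, as in the slides).
    """
    half = window // 2
    lo, hi = max(0, i - half), min(len(shot_words), i + half + 1)
    words = []
    for j in range(lo, hi):
        words.extend(shot_words[j])
    return words
```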
TREC 2001 Data (2/3) • Example annotation results for TREC 2001 data
TREC 2001 Data (3/3) • Experimental results (Statue of Liberty): before vs. after correspondence learning
Chinese Culture Data • Example of "great wall" • 3785 shots and 2597 words • after pruning, 2785 shots and 626 words
Chinese Culture Data • Experimental results (panda, wall, emperor)
Chinese Culture Data • Evaluation of the results on a larger scale: 189 images for the word panda
Chinese Culture Data • The rank of the word panda as the predicted word for the corresponding frames • Red: test set • Green: training set • Problem: frames showing a woman highly co-occur with the word panda
Chinese Culture Data • The effect of window size • a single shot, with the window size set to 1, 2, or 3 • Recall: # of correct predictions over the # of times the word occurs in the data • Precision: # of correct predictions over all predictions (a scoring sketch follows)
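Using the definitions on this slide, a minimal scoring sketch for a single word; the per-frame list layout is an assumption:

```python
def recall_precision(word, predictions, ground_truth):
    """predictions / ground_truth: per-frame lists of predicted / true words."""
    correct = sum(word in pred and word in truth
                  for pred, truth in zip(predictions, ground_truth))
    occurrences = sum(word in truth for truth in ground_truth)
    predicted = sum(word in pred for pred in predictions)
    recall = correct / occurrences if occurrences else 0.0
    precision = correct / predicted if predicted else 0.0
    return recall, precision
```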
Chinese Culture Data • Experimental results of the effect of window size: a single shot vs. window size = 3
Chinese Culture Data • Experimental results of the effect of window size (cont.) • For some selected words
Discussion and Future work • Discussion • solved the correspondence problem between video frames and associated text • relatively simple and small data sets were used • broadcast news is a harder data set, since there are terabytes of video and it requires focusing on people
Discussion and Future work • Better visual features • detectors (e.g., a face detector) • motion information • segmenter: use temporal information to segment moving objects • Text • noun phrases or compound words • more lexical analysis