
Improved Video Categorization from Text Metadata and User Comments


Presentation Transcript


  1. Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011: Research and Development in Information Retrieval • Katja Filippova, Keith B. Hall • Presenter: Viraja Sameera Bandhakavi

  2. Contributions • Analyze text sources such as the title, description, and comments, and show that they provide valuable indications of a video's topic • Show that a text-based classifier trained on the imperfect predictions of a weakly supervised, video content-based classifier is not redundant • Demonstrate that a simple model combining the predictions of the two classifiers outperforms each of them taken independently

  3. Research questions not answered by related work • Can a classifier learn from the imperfect predictions of a weakly supervised classifier? Is its accuracy comparable to the original one? Can a combination of the two classifiers outperform either one? • Do the video-based and text-based classifiers capture different semantics? • How useful is user-provided text metadata? Which source is the most helpful? • Can reliable predictions be made from user comments? Can they improve the performance of the classifier?

  4. Methodology • Builds on top of the predictions of Video2Text • Uses Video2Text because it: • Requires no labeled data other than video metadata • Clusters similar videos and generates a text label for each cluster • Produces a label set that is larger and better suited for categorizing video content on YouTube

  5. Video2Text • Starts from a set of weak labels based on the video metadata • Creates a vocabulary of concepts (unigrams or bigrams from the video metadata) • Every concept is associated with a binary classifier trained on a large set of audio and video signals • Positive instances: videos that mention the concept in their metadata • Negative instances: videos that do not mention the concept in their metadata
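
A minimal sketch (not the authors' code) of this weak-labeling step: a video is a positive instance for a concept exactly when the concept string occurs in its metadata. The "metadata" field name and the substring test are assumptions for illustration, since the slides do not show the exact matching rule.

```python
def split_weak_instances(videos, concept):
    """Partition videos into weak positive/negative instances for one concept.

    `videos` is assumed to be an iterable of dicts with a "metadata" field
    holding the concatenated title/description/keyword text.
    """
    positives, negatives = [], []
    for video in videos:
        if concept.lower() in video["metadata"].lower():
            positives.append(video)   # concept mentioned -> weak positive
        else:
            negatives.append(video)   # concept absent -> weak negative
    return positives, negatives
```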

  6. Procedure • A binary classifier is trained for every concept in the vocabulary • Its accuracy is assessed on a portion of a validation dataset; each iteration uses a subset of unseen videos from the validation set • The classifier and its concept are retained if precision and recall are above a threshold (0.7 in this paper) • The remaining classifiers are used to update the feature vectors of all videos • This is repeated until the vocabulary size no longer changes much or the maximum number of iterations is reached • Finer-grained concepts are learned from concepts added in the previous iteration • Labels related to news, sports, film, etc. are grouped together, resulting in a final set of 75 two-level categories
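
A hedged control-flow sketch of this loop. The training, evaluation, feature-update, and sampling steps are passed in as callables because the audio/video pipeline behind them is not detailed on the slides; the sketch also omits the addition of finer-grained concepts between rounds.

```python
def grow_vocabulary(vocabulary, train, evaluate, update_features,
                    draw_validation_subset, threshold=0.7, max_iters=10):
    """Control-flow sketch of the Video2Text iteration (slide 6).

    `train(concept)` returns a binary classifier, `evaluate(clf, concept,
    subset)` returns (precision, recall), and `update_features(classifiers)`
    extends every video's feature vector -- all hypothetical stand-ins.
    """
    for _ in range(max_iters):
        subset = draw_validation_subset()        # unseen videos each pass
        kept = {}
        for concept in vocabulary:
            clf = train(concept)
            precision, recall = evaluate(clf, concept, subset)
            if precision > threshold and recall > threshold:
                kept[concept] = clf              # retain concept + classifier
        update_features(kept.values())           # survivors feed the next,
                                                 # finer-grained iteration
        if len(kept) == len(vocabulary):         # vocabulary stabilized
            break
        vocabulary = set(kept)
    return vocabulary
```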

  7. Categorization with Video2Text • Video2Text is used to assign two-level categories to videos • The total number of binary classifiers (and hence labels) is limited to 75 • The output of Video2Text is represented as a list of triples (v_i, c_j, s_ij): video, assigned category, and confidence score
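
One plausible in-memory encoding of these triples (the names are illustrative, not from the paper):

```python
from typing import NamedTuple

class Prediction(NamedTuple):
    video_id: str    # v_i
    category: str    # c_j
    score: float     # s_ij, the classifier's confidence

predictions = [
    Prediction("vid42", "Gaming/Console", 0.91),
    Prediction("vid42", "Music/Rock", 0.12),
]
```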

  8. Distributed MaxEnt • The approach automatically generates training examples for the category classifier • Classifiers are trained with a conditional maximum-entropy optimization criterion • The result is a conditional probability model over the classes given the YouTube videos
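
A conditional maximum-entropy model over classes is equivalent to multinomial logistic regression, so a single-machine stand-in for the distributed trainer might look like the scikit-learn sketch below. The toy data is invented; the real system trains in a distributed fashion over far more videos.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: prefixed text features per video (see slide 9) and the
# Video2Text-assigned category used as the training label.
texts = ["T:xbox D:xbox D:gameplay C:love C:this C:game",
         "T:guitar D:live D:solo C:great C:tone"]
labels = ["Gaming", "Music"]

# Logistic regression == conditional MaxEnt: the fitted pipeline yields a
# conditional probability model P(class | video features).
model = make_pipeline(
    CountVectorizer(token_pattern=r"\S+", lowercase=False),  # keep prefixes
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict_proba(["T:xbox C:what C:a C:game"]))  # P(class | video)
```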

  9. Data and Models • The text models differ in the text sources from which features are extracted: title, description, comments, etc. • All features are token based • Infrequent tokens are filtered out to reduce the feature space • Token frequencies are calculated over 150K videos • Every unique token is counted once per video • A token-frequency threshold of 10 is used • Tokens are prefixed with the first letter of the source in which they were found • e.g. T:xbox, D:xbox, U:xbox, C:xbox
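
A sketch of this feature-extraction scheme, assuming one dict per video with a text field per source. The prefix-to-source mapping mirrors the slide's examples, though what exactly "U" denotes is not spelled out here, so that field name is a guess.

```python
from collections import Counter

# Assumed prefix -> source-field mapping, mirroring T:/D:/U:/C: above.
SOURCES = {"T": "title", "D": "description", "U": "user_metadata",
           "C": "comments"}

def video_tokens(video):
    """Set of prefixed tokens for one video; being a set, every unique
    token is counted once per video."""
    tokens = set()
    for prefix, field in SOURCES.items():
        for tok in video.get(field, "").lower().split():
            tokens.add(f"{prefix}:{tok}")
    return tokens

def build_feature_vocab(videos, min_count=10):
    """Drop infrequent tokens: keep those occurring in at least `min_count`
    videos (frequencies computed over the 150K-video sample above)."""
    counts = Counter()
    for video in videos:
        counts.update(video_tokens(video))
    return {tok for tok, n in counts.items() if n >= min_count}
```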

  10. Combined Classifier • Used to test whether combining the two views, video based and text based, is beneficial • A simple meta-classifier ranks the video categories based on the predictions of the two classifiers • The video-based predictions are converted to a probability distribution • This distribution and the one from MaxEnt (the maximum-entropy classifier) are multiplied • This approach proved to be effective • Idea: each classifier has veto power • The final prediction for each video is the category with the highest product score
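
The multiplication rule is easy to state in code. In this sketch the two models' outputs are plain category-to-probability dicts; a zero (or near-zero) probability from either model vetoes the category.

```python
def combine(video_probs, text_probs):
    """Rank categories by the product of the two distributions."""
    categories = set(video_probs) | set(text_probs)
    product = {c: video_probs.get(c, 0.0) * text_probs.get(c, 0.0)
               for c in categories}
    return max(product, key=product.get)   # highest product wins

# The text model effectively vetoes "Music", so "Gaming" is predicted:
# 0.6 * 0.7 = 0.42 beats 0.4 * 0.01 = 0.004.
print(combine({"Gaming": 0.6, "Music": 0.4},
              {"Gaming": 0.7, "Music": 0.01}))   # -> Gaming
```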

  11. Experiments: Evaluation of Text Models • The training set contains 100K videos that receive a high-scoring prediction (a score of at least 0.85 from Video2Text) • A text-based prediction counts as correct if it is in the set of video-assigned categories • Evaluation was done on two sets of videos: • Videos with at least one comment • Videos with at least 10 comments
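
A hedged sketch of the training-data selection, reusing the (v_i, c_j, s_ij) triples from slide 7; the 0.85 cutoff is the one quoted above, while the function and variable names are illustrative.

```python
def select_training_labels(predictions, min_score=0.85):
    """Keep only high-confidence Video2Text predictions as pseudo-labels.

    `predictions` is an iterable of (video_id, category, score) triples.
    """
    labels = {}
    for video_id, category, score in predictions:
        if score >= min_score:                 # trust high scores only
            labels.setdefault(video_id, set()).add(category)
    return labels   # video_id -> categories used to train the text model
```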

  12. Experiments: Evaluation of Text Models (contd.) • The best model is TDU+YT+C for both sets • This model is used for the comparison against the Video2Text model with human raters • It is also used in the combination model

  13. Experiments with Human Raters • A total of 750 videos is drawn equally from the 15 YouTube categories (50 per category) • A human rater labels each (video, category) pair as fully correct (3), partially correct (2), somewhat related (1), or off topic (0) • Every pair is rated by 3 human raters • The three ratings are summed, normalized (divided by 9), and rounded to obtain the final score
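
A worked example of the aggregation: three raters scoring 3, 2, and 3 give (3 + 2 + 3) / 9 ≈ 0.89, which clears the 0.5 correctness bar used on the next slide.

```python
ratings = [3, 2, 3]            # fully, partially, fully correct
score = sum(ratings) / 9       # normalize by the max possible total 3 * 3 = 9
print(round(score, 2))         # 0.89 -> counted as correct (>= 0.5)
```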

  14. Experiments with Human Raters (contd.) • A score of at least 0.5 counts as a correct category • The text-based model performs significantly better than the video model • The combination model improves accuracy further • The accuracy of all models increases with the number of comments

  15. Conclusion • A text-based approach for assigning categories to videos • A competitive classifier can be trained on the high-scoring predictions made by a weakly supervised classifier over video features • The text and video models provide complementary views of the data • A simple combination model outperforms each model on its own • Accurate predictions can be made from user comments • Why comments help: • They substitute for a poor title • They disambiguate the category • They help correct wrong predictions • Future work: investigate the usefulness of user comments for other tasks
