
Summarization of Egocentric Video - Object Driven vs. Story Driven


Presentation Transcript


  1. Summarization of Egocentric Video - Object Driven vs. Story Driven. Presented by: Elad Osherov, Jan 2013

  2. Today’s talk • Motivation • Related Work • Object driven summarization • Story driven summarization • Results • Future Development

  3. What is Egocentric Video Anyway? http://xkcd.com/1235/

  4. What is Egocentric Video Anyway?

  5. Motivation • Goal: generate a visual summary of an unedited egocentric video • Input: egocentric video of the camera wearer’s day • Output: storyboard (or skim video) summary

  6. Potential Applications of Egocentric Video Summarization • Mobile robot discovery • Law enforcement • Memory aid

  7. Egocentric Video Properties • Long unedited video • Constant head motion – blur • Moving camera – unstable background • Frequent changes in people and objects • Hand occlusion

  8. Today’s talk • Motivation • Related Work • Object driven summarization • Story driven summarization • Results • Future Development

  9. Related Work • Object recognition in egocentric video [Egocentric Recognition of Handled Objects: Benchmark and Analysis. X. Ren, M. Philipose - CVPR 2009] • Detection and recognition of first-person actions [Detecting Activities of Daily Living in First-Person Camera Views. H. Pirsiavash, D. Ramanan - CVPR 2012] • Video summarization - today! [A. Rav-Acha, Y. Pritch, and S. Peleg. Making a Long Video Short: Dynamic Video Synopsis. CVPR 2006]

  10. Related Work [A. Rav-Acha, Y. Pritch, and S. Peleg. Making a Long Video Short: Dynamic Video Synopsis. CVPR 2006] http://www.vision.huji.ac.il/video-synopsis/

  11. A Few Words About the Authors • Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012 • Story-Driven Summarization for Egocentric Video. Zheng Lu and Kristen Grauman. CVPR 2013 • Dr. Yong Jae Lee, UC Berkeley (Departments of EE & CS) • Prof. Joydeep Ghosh, University of Texas at Austin, Director of IDEAL (Intelligent Data Exploration and Analysis Lab) • Prof. Zheng Lu, City University of Hong Kong (Department of CS) • Prof. Kristen Grauman, University of Texas at Austin (Department of CS)

  12. Today’s talk • Motivation • Related Work • Object driven summarization • Story driven summarization • Results • Future Development

  13. Object Driven Video Summarization • Goal: create a storyboard summary of a person’s day that is driven by the important people and objects • Important things = things the camera wearer interacts with significantly • Several problems arise: • Importance is subjective! • What does significant interaction really mean? • No priors on people and objects

  14. Algorithm Overview • Train a category-independent important person/object detector (diagram: train/test split over videos) [Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]

  15. Annotating Important Regions in Training Video • Data collection: • 10 videos, each 3-5 hours long - a total of 37 hrs • 4 subjects • Crowdsource annotations using MTurk • An object’s degree of importance depends highly on what the camera wearer is doing before, while, and after the object/person appears • The object must be seen in the context of the camera wearer’s activity to properly gauge its importance www.looxcie.com www.mturk.com/mturk/

  16. Annotating Important Regions in Training Video • Example annotations: man wearing a blue shirt in a café; yellow notepad on a table; coffee mug that the camera wearer drinks from; smartphone the camera wearer holds • For about 3-5 hours of video they get ~700 object segmentations

  17. Training a Regression Model • General-purpose category-independent model predicts important regions in any egocentric video: • Segment each frame into regions • For each region, compute a set of candidate features that could describe its importance • Egocentric, object & region features • Train a regressor to predict region importance

  18. Egocentric Features • Interaction feature: • Euclidean distance of the region’s centroid to the closest detected hand • Classify a region as a hand according to color likelihoods and a naïve Bayes classifier trained on ground-truth hand annotations (figure: distance to hand)

  19. Egocentric Features • Gaze feature: • A coarse estimate of how likely the region is being focused upon • Euclidean distance of the region’s centroid to the frame center (figure: distance to frame center)
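A minimal sketch (my own illustration, not the authors' code) of the two distance-based egocentric features on these slides, assuming region centroids and hand detections are already available as pixel coordinates:

```python
import numpy as np

def interaction_feature(region_centroid, hand_centroids):
    """Euclidean distance from the region centroid to the closest detected hand.
    Returning infinity when no hand is detected is my assumption, not from the paper."""
    if len(hand_centroids) == 0:
        return np.inf
    dists = np.linalg.norm(np.asarray(hand_centroids) - np.asarray(region_centroid), axis=1)
    return dists.min()

def gaze_feature(region_centroid, frame_shape):
    """Euclidean distance from the region centroid to the frame center,
    a coarse proxy for how likely the region is being focused upon."""
    h, w = frame_shape[:2]
    center = np.array([w / 2.0, h / 2.0])
    return np.linalg.norm(np.asarray(region_centroid) - center)

# Example with made-up coordinates
print(interaction_feature((320, 240), [(300, 250), (600, 100)]))  # ~22.4 pixels
print(gaze_feature((320, 240), (480, 640)))                       # 0.0 at the frame center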

  20. Egocentric Features • Frequency feature: • Region matching - color dissimilarity between the region and each region in surrounding frames • Point matching - match SIFT features between the region and each surrounding frame (figure: frequency)
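A rough sketch of the region-matching half of the frequency feature, assuming regions are given as arrays of RGB pixels; the histogram binning, chi-square distance, and 0.5 match threshold are my own choices for illustration:

```python
import numpy as np

def color_hist(pixels, bins=8):
    """Per-channel color histogram of an (N, 3) array of RGB pixels, L1-normalized."""
    hists = [np.histogram(pixels[:, c], bins=bins, range=(0, 256))[0] for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / max(h.sum(), 1e-8)

def chi_square(h1, h2):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-8))

def frequency_feature(region_pixels, surrounding_region_pixels):
    """Count how many regions in surrounding frames look similar in color to the query region."""
    h = color_hist(region_pixels)
    return sum(chi_square(h, color_hist(p)) < 0.5 for p in surrounding_region_pixels)

# Example with random pixel data
rng = np.random.default_rng(0)
region = rng.integers(0, 256, size=(200, 3))
neighbours = [rng.integers(0, 256, size=(150, 3)) for _ in range(5)]
print(frequency_feature(region, neighbours))
```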

  21. Object Features • Object-like appearance • Using region ranking function that ranks each region according to Gestalt cues: [J. Carreira and C. Sminchisescu. Constrained Parametric Min-Cuts for Automatic Object Segmentation. In CVPR, 2010.]

  22. Object Features • Object-like motion • Rank each region according to the difference of its motion pattern in comparison to nearby regions • High scores to regions that “stand out” from their surroundings during motion (figure: object-like motion) [Key-Segments for Video Object Segmentation. Yong Jae Lee, Jaechul Kim, and Kristen Grauman. ICCV 2011]

  23. Object Features • Likelihood of a person’s face • Compute the maximum overlap score between the region r and any detected face q in the frame (figure: overlap with face detection)
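A small sketch of the face-overlap cue, assuming both the region and face detections are axis-aligned boxes; using intersection-over-union as the "overlap score" is my assumption:

```python
def overlap_score(region_box, face_boxes):
    """Maximum intersection-over-union between a region box and any detected face box.
    Boxes are (x1, y1, x2, y2) tuples."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0
    return max((iou(region_box, f) for f in face_boxes), default=0.0)

# Example: region mostly covered by one of two detected faces
print(overlap_score((10, 10, 50, 50), [(12, 12, 48, 60), (200, 200, 240, 240)]))
```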

  24. Train a Regressor to Predict Region Importance • Region features: size, centroid, bounding box centroid, bounding box width, bounding box height • Solve using least squares
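A minimal sketch of the least-squares fit over the stacked egocentric, object, and region features; the feature count and the random placeholder data are illustrative only:

```python
import numpy as np

# X: one row per training region, columns = egocentric, object, and region features
# y: crowd-sourced importance score for each region
rng = np.random.default_rng(0)
X = rng.normal(size=(700, 14))          # ~700 annotated regions, 14 features (illustrative numbers)
y = rng.uniform(size=700)

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)      # least-squares fit of the importance regressor

def predict_importance(features):
    """Predicted importance I(r) for a new region's feature vector."""
    return np.append(features, 1.0) @ w

print(predict_importance(X[0]))
```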

  25. Algorithm Overview • Train a category-independent important person/object detector (diagram: train/test split over videos) [Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]

  26. Segmenting the Video into Temporal Events (figure: pair-wise distance matrix) • Events allow the summary to include multiple instances of a person or object that is central in multiple contexts in the video • Group frames until the smallest maximum inter-frame distance is larger than two standard deviations beyond the mean
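A simplified sketch of this grouping rule, assuming one appearance descriptor per frame; the greedy merging of temporally adjacent groups under a complete-link criterion is my reading of the slide, not the authors' exact procedure:

```python
import numpy as np

def segment_events(frame_feats):
    """Greedily merge temporally adjacent groups of frames into events until the
    cheapest merge (complete-link distance) exceeds mean + 2*std of all pairwise
    frame distances. Returns a list of frame-index groups."""
    F = np.asarray(frame_feats)
    D = np.linalg.norm(F[:, None] - F[None, :], axis=2)     # pair-wise distance matrix
    thresh = D.mean() + 2 * D.std()
    groups = [[i] for i in range(len(F))]
    while len(groups) > 1:
        # complete-link distance between each pair of adjacent groups
        costs = [D[np.ix_(groups[i], groups[i + 1])].max() for i in range(len(groups) - 1)]
        k = int(np.argmin(costs))
        if costs[k] > thresh:
            break
        groups[k:k + 2] = [groups[k] + groups[k + 1]]        # merge the cheapest adjacent pair
    return groups
```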

  27. Algorithm Overview • Train a category-independent important person/object detector (diagram: train/test split over videos) [Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]

  28. Discovering an Event’s Key People and Objects • Score each frame region using the regressor • Group instances of the same object/person together • Select a pool of high-scoring clusters • Remove clusters with high affinity to a higher-I(r) cluster • For each remaining cluster, select the region with the highest importance as its representative
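A sketch of this selection step; the clustering algorithm, cosine-similarity affinity test, and thresholds are my own stand-ins, since the slide does not specify them:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def key_regions(region_feats, importance, n_clusters=10, top_k=5, affinity_thresh=0.8):
    """1) cluster region descriptors, 2) rank clusters by their best importance I(r),
    3) drop clusters too similar to a higher-ranked one, 4) keep each survivor's
    highest-importance region as its representative."""
    X = np.asarray(region_feats)
    imp = np.asarray(importance, dtype=float)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    clusters = [np.where(labels == c)[0] for c in range(n_clusters)]
    clusters.sort(key=lambda idx: imp[idx].max(), reverse=True)

    kept, centroids = [], []
    for idx in clusters:
        c = X[idx].mean(axis=0)
        # skip clusters with high affinity (cosine similarity) to an already-kept cluster
        if any(c @ k / (np.linalg.norm(c) * np.linalg.norm(k) + 1e-8) > affinity_thresh
               for k in centroids):
            continue
        kept.append(int(idx[np.argmax(imp[idx])]))
        centroids.append(c)
        if len(kept) == top_k:
            break
    return kept
```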

  29. Generating a Storyboard Summary • Each event can display a different number of frames, depending on how many unique important things the method discovers

  30. Results Important region prediction accuracy

  31. Results Important region prediction accuracy

  32. Results • Which cues matter most for predicting importance? Top 28 features with highest learned weights • Low scores on: • the interaction and frequency pair • object-like regions that are frequent

  33. Results Egocentric video summarization accuracy

  34. Results • User studies to evaluate summaries • Let the camera wearer answer two quality questions: • Are important objects/people captured? • Overall summary quality • Better results in ~69% of the summaries

  35. Today’s talk • Motivation • Related Work • Object driven summarization • Story driven summarization • Results • Future Development

  36. Story Driven Video Summarization • A good summary captures the progress of the story! • Segment the video temporally into subshots • Select a chain of k subshots that maximizes both the weakest link’s influence and object importance • Each subshot “leads to” the next through some subset of influential objects [Story-Driven Summarization for Egocentric Video. Zheng Lu and Kristen Grauman. CVPR 2013]

  37. Document-Document Influence [Shahaf & Guestrin, KDD 2010] • Connecting the Dots Between News Articles. D. Shahaf and C. Guestrin. KDD 2010

  38. Egocentric Subshot Detection • Define 3 generic ego-activities: • Static • In transit • Head moving • Train classifiers to predict these activity types • Features based on blur and optical flow • Classify using an SVM
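A minimal sketch of the ego-activity classifier; the exact feature definitions (mean blur score, mean/std optical-flow magnitude), the RBF kernel, and the random placeholder data are my assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 3))          # per-window motion cues (placeholder values)
y_train = rng.integers(0, 3, size=300)       # 0 = static, 1 = in transit, 2 = head moving

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)

# Predict the ego-activity type of new windows, which then delimits subshots
print(clf.predict(rng.normal(size=(5, 3))))
```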

  39. Temporal Subshot Segmentation • Tailored to egocentric video - detects ego-activities • Provides an over-segmentation - a typical subshot lasts ~15 sec

  40. Subshot Selection Objective • Given a series of subshots segmented from the input video, our goal is to select the optimal K-node chain of subshots

  41. Story Progress Between Subshots • A good story – a coherent chain of subshots, where each strongly influences the next one
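A simplified illustration of chain selection under a "weakest link" criterion: a dynamic program that picks a temporally ordered K-node chain maximizing the smallest influence value along it. The paper's actual objective also weighs object importance and diversity and is solved differently; this is only a sketch of the weakest-link idea.

```python
import numpy as np

def best_chain(influence, K):
    """Select K subshots (K >= 2, temporal order preserved) maximizing the weakest link.
    influence[i, j] = estimated influence of subshot i on a later subshot j."""
    n = influence.shape[0]
    best = np.full((K, n), -np.inf)      # best[k, j]: max weakest link of a (k+1)-node chain ending at j
    parent = np.full((K, n), -1, dtype=int)
    best[0] = np.inf                     # a single node has no links constraining it yet
    for k in range(1, K):
        for j in range(n):
            for i in range(j):           # only temporally forward links i -> j
                cand = min(best[k - 1, i], influence[i, j])
                if cand > best[k, j]:
                    best[k, j], parent[k, j] = cand, i
    j = int(np.argmax(best[K - 1]))
    chain = [j]
    for k in range(K - 1, 0, -1):        # backtrack through parents
        j = int(parent[k, j])
        chain.append(j)
    return chain[::-1], float(best[K - 1].max())

# Toy example: 5 subshots with random pairwise influence scores
rng = np.random.default_rng(0)
print(best_chain(rng.uniform(size=(5, 5)), K=3))
```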

  42. Predicting Influence Between Subshots (figure: graph of example influence weights between subshots)

  43. Predicting Influence Between Subshots (figure: graph with a sink node) • Captures how reachable subshot j is from subshot i, via object o
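A toy reachability construction, not the paper's actual estimator: a one-step walk from subshot i through object o to subshot j on a bipartite subshot/object graph, with a sink absorbing a fixed fraction of probability at the object layer. All transition definitions and the `absorb` parameter are my assumptions.

```python
import numpy as np

def influence_via_objects(appears, absorb=0.1):
    """appears[i, o] = 1 if object o appears in subshot i.
    Returns infl[i, j, o]: probability mass flowing subshot i -> object o -> subshot j."""
    A = np.asarray(appears, dtype=float)
    n_sub, n_obj = A.shape
    P_so = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-8)                      # subshot -> object
    P_os = (1 - absorb) * A.T / np.maximum(A.sum(axis=0, keepdims=True).T, 1e-8)   # object -> subshot (rest goes to the sink)
    infl = np.zeros((n_sub, n_sub, n_obj))
    for o in range(n_obj):
        infl[:, :, o] = np.outer(P_so[:, o], P_os[o, :])
    return infl

# Example: 3 subshots, 2 objects
appears = np.array([[1, 0],
                    [1, 1],
                    [0, 1]])
print(influence_via_objects(appears)[0, 1, 0])   # influence of subshot 0 on subshot 1 via object 0
```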

  44. Subshot Selection Objective • Given a series of subshots segmented from the input video, our goal is to select the optimal K-node chain of subshots

  45. Predicting Diversity Among Transitions • Compute GIST and color histograms for each frame in each subshot, quantize them into 55 scene types • Compute a diversity score for each pair of adjacent subshots in the chain
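A sketch of one way to realize this step, assuming per-frame descriptors (e.g. GIST concatenated with a color histogram) are already computed; the k-means quantizer and the "1 minus histogram intersection" diversity score are my own choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def scene_type_histograms(frame_descriptors, subshot_bounds, n_types=55):
    """Quantize per-frame descriptors into n_types scene types and build one
    scene-type histogram per subshot (subshot_bounds = list of (start, end) frame ranges).
    Requires at least n_types frames in total."""
    F = np.asarray(frame_descriptors)
    types = KMeans(n_clusters=n_types, n_init=10, random_state=0).fit_predict(F)
    hists = []
    for start, end in subshot_bounds:
        h = np.bincount(types[start:end], minlength=n_types).astype(float)
        hists.append(h / max(h.sum(), 1e-8))
    return hists

def diversity(h1, h2):
    """Diversity between two adjacent subshots = 1 - histogram intersection of scene types."""
    return 1.0 - np.minimum(h1, h2).sum()
```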

  46. Coherent Object Activation Patterns • Prefer activating few objects at once, and coherent (smooth) entrance/exit patterns • Solve with linear programming and a priority queue (figure: story driven vs. uniform sampling)

  47. Today’s talk • Motivation • Related Work • Object driven summarization • Story driven summarization • Results • Future Development

  48. Results • ADL dataset: 20 videos, each 20-60 minutes, daily activities in a house • UTE dataset: 4 videos, each 3-5 hours long, uncontrolled setting

  49. Results • Baselines: • Uniform sampling of K subshots • Shortest path - K subshots with minimal bag-of-objects distance between each other • Object driven - only for the UTE set • Parameters: • K = 4...8 • Simultaneous active objects: 80 (UTE), 15 (ADL)

  50. Results • Test methodology: • 34 human subjects, ages 18-60 • 12 hours of original video • Each comparison done by 5 subjects • In total 535 tasks and 45 hours of subject time • Probably the most comprehensive egocentric summarization test ever established!
