Advanced Methods for Verb Frames Extraction in Holistic Scene Understanding

Extracting Simple Verb Frames from Images: Toward Holistic Scene Understanding. Prof. Daphne Koller Research Group, Stanford University. Geremy Heitz. DARPA CLLR Workshop, December 2, 2008

Grand Goal: Scene Understanding. Image labels: Cigarette, Backpack, Man, Dog. Example descriptions: “A cow walking through the grass on a pasture by the sea”; “man wearing a backpack, smoking a cigarette, walking a dog on a sidewalk”

Understanding Verb Frames • Primitives • Objects • Parts • Surfaces • Regions • Interactions • Context • Actions. Methods exist to extract these, but we need to both do a better job and get them all at once. Frame: to walk. “a man is walking on a sidewalk”; “a dog is walking on a sidewalk”. Image labels: Man, Building, Cigarette, Backpack, Dog, Sidewalk. Modeling verb frames requires understanding the interactions between primitives, which fit well into the framework of graphical models.

Outline • Extracting the Primitives • Qualitative 3D Scene Layout • Modeling Relationships • Learning Frames • Refined Characterization of Objects

Computer View of a “Scene”. Labels: BUILDING, ROAD, STREET SCENE

Object Detection. Detection window W. Legend: Car, Person, Motorcycle, Boat, Sheep, Cow. A window is kept when Score(W) > 0.5
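The thresholding step on this slide can be illustrated with a minimal sketch. It assumes that a base detector has already scored each candidate window; the `Detection` type, the `keep_confident` helper, and the example boxes and scores are hypothetical and are not taken from the talk. Only the `Score(W) > 0.5` rule comes from the slide.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """One scored detection window (hypothetical container, not from the talk)."""
    label: str
    box: tuple  # (x, y, width, height) in pixels
    score: float


def keep_confident(detections, threshold=0.5):
    """Keep windows whose score strictly exceeds the threshold, as in Score(W) > 0.5."""
    return [d for d in detections if d.score > threshold]


# Made-up candidate windows for illustration only.
candidates = [
    Detection("cow", (40, 60, 120, 80), 0.91),
    Detection("boat", (300, 20, 50, 30), 0.48),
    Detection("sheep", (10, 90, 60, 40), 0.50),
]
kept = keep_confident(candidates)
print([d.label for d in kept])
```

Because the comparison is strict, a window scoring exactly 0.5 is dropped.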

Finding the Primitives Jointly. Regions: SKY, GRASS. Scene: SEASIDE PASTURE. Geometry: Grass = Flat, Sky = Far, FG = Vertical. Composition: 40% Grass, 30% Sky… Objects: 1 cow, 2 boats… [Heitz et al., NIPS 2008a]

Results – TAS Model Contextual Detector Base Detector [Heitz et al., ECCV 2008]

Qualitative 3D Scene Layout. Primitives imply a certain 3D layout of the scene, although absolute depth may not be preserved. For example: • Sky is a far, vertical plane • Water and road are horizontal planes • Objects “pop up” from the image

Modeling Relationships • We have explored how to model 2D relationships • We should be able to extend this to 3D relationships [Heitz et al., ECCV 2008] [Gould et al., IJCV 2008] Example relationships: Beside, In front of, On
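As a rough illustration of 2D relationships, the sketch below derives "on" and "beside" from bounding boxes alone. The geometric rules, the `tol` parameter, and the example boxes are assumptions made for illustration; they are not the models from the cited papers. "In front of" is left out because it needs depth information that 2D boxes do not carry.

```python
def overlap_1d(a_min, a_max, b_min, b_max):
    """Length of the overlap between two intervals (0 if they are disjoint)."""
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


def relation(a, b, tol=10.0):
    """Classify how box a relates to box b.

    Boxes are (x_min, y_min, x_max, y_max) in image coordinates,
    with y increasing downward. Rules are hypothetical heuristics.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    x_overlap = overlap_1d(ax0, ax1, bx0, bx1)
    y_overlap = overlap_1d(ay0, ay1, by0, by1)
    # "on": a's bottom edge sits within tol pixels of b's top edge,
    # and the two boxes share some horizontal extent.
    if x_overlap > 0 and abs(ay1 - by0) <= tol:
        return "on"
    # "beside": no horizontal overlap, but some shared vertical extent.
    if x_overlap == 0 and y_overlap > 0:
        return "beside"
    return "other"


# Made-up boxes for illustration only.
man = (200, 120, 250, 260)
dog = (100, 200, 180, 260)
sidewalk = (0, 255, 640, 300)
print(relation(dog, sidewalk), relation(man, dog))
```

Extending this to 3D would mean replacing the boxes with positions on the qualitative scene layout, which is the direction the slide proposes.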

Outline • Extracting the Primitives • Qualitative 3D Scene Layout • Modeling Relationships • Learning Frames • Refined Characterization of Objects

Learning Semantics: Verb Frames. Given primitives, rough layout, and relationships, let’s learn subjects, verbs, and objects for frames of the form: The [S] [V] the [O]. [S], [O]: CAR, ROAD, COW, GRASS, PERSON, APPLE, … [V]: WALKS ON, EATS, DRIVES ON, JUMPS OVER, THROWS, …

The CAR DRIVES ON the ROAD
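One simple way to picture learning "The [S] [V] the [O]" frames is to count which verb co-occurs most often with each subject/object pair. The sketch below uses plain frequency counting as a stand-in; it is not the learning method from the talk, and the training triples and function names are made up for illustration.

```python
from collections import Counter


def learn_verb_counts(examples):
    """Count how often each verb appears for each (subject, object) pair."""
    counts = {}
    for subject, verb, obj in examples:
        counts.setdefault((subject, obj), Counter())[verb] += 1
    return counts


def predict_frame(counts, subject, obj):
    """Fill the template with the most frequent verb, or return None if the pair is unseen."""
    verbs = counts.get((subject, obj))
    if not verbs:
        return None
    verb, _ = verbs.most_common(1)[0]
    return f"The {subject} {verb} the {obj}."


# Hypothetical labeled triples (subject, verb, object).
examples = [
    ("CAR", "DRIVES ON", "ROAD"),
    ("CAR", "DRIVES ON", "ROAD"),
    ("COW", "WALKS ON", "GRASS"),
    ("COW", "EATS", "GRASS"),
    ("COW", "EATS", "GRASS"),
    ("PERSON", "EATS", "APPLE"),
]
counts = learn_verb_counts(examples)
print(predict_frame(counts, "CAR", "ROAD"))
```

A count table like this ignores layout and relationships entirely; in the setting described on the slide, those would be additional evidence for choosing the verb.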

Refined Characterization We need to know that the white stick is a cigarette… and where the man’s mouth is… in order to determine that he’s smoking.

Refined Object Characterization Set of “keypoint” landmarks Outline shape defined by connecting contour [Heitz et al., NIPS 2008b, IJCV in submission]

Results Rhino Giraffe Llama

Mammals Running Standing Eating Standing [Heitz et al., NIPS 2008b, IJCV in submission]

Activity Recognition. Activities: Drinking, Eating. 1) Localize the landmarks of the cow, including the head. 2) Extract a histogram of “stuff” in a window around the head landmark. 3) Make a decision. Figure labels: Grass, Eating, Cow
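Steps 2 and 3 above can be sketched as follows, assuming the head landmark has already been localized and that a per-pixel "stuff" label map is available. The label map, the window radius, the `min_fraction` cutoff, and the mapping from stuff to activity (grass means eating, water means drinking) are all assumptions made for illustration, not details from the talk.

```python
from collections import Counter


def stuff_histogram(label_map, center, radius):
    """Normalized histogram of labels in a square window around center (row, col).

    Assumes center lies inside the label map, so the window is never empty.
    """
    rows, cols = len(label_map), len(label_map[0])
    r0, c0 = center
    hist = Counter()
    for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1)):
            hist[label_map[r][c]] += 1
    total = sum(hist.values())
    return {label: n / total for label, n in hist.items()}


# Hypothetical mapping from dominant "stuff" to activity.
ACTIVITY_FOR_STUFF = {"grass": "eating", "water": "drinking"}


def decide_activity(hist, min_fraction=0.5):
    """Pick the activity tied to the dominant stuff label, if it is dominant enough."""
    label, fraction = max(hist.items(), key=lambda kv: kv[1])
    if fraction >= min_fraction:
        return ACTIVITY_FOR_STUFF.get(label, "unknown")
    return "unknown"


# Tiny made-up label map; the head landmark is at row 2, column 1.
label_map = [
    ["sky", "sky", "sky", "sky"],
    ["grass", "grass", "cow", "cow"],
    ["grass", "grass", "grass", "cow"],
    ["grass", "grass", "grass", "grass"],
]
hist = stuff_histogram(label_map, center=(2, 1), radius=1)
print(decide_activity(hist))
```

A learned classifier over the histogram would be the natural replacement for the fixed lookup table once labeled examples are available.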

Activity Recognition with People. Activities: Running, Walking, Standing, Hitting • The pose of the person is one of the important factors • We also need to recognize the objects the person interacts with

How far can we take this? • Front legs off ground = Jumping • Apple near mouth = Eating • Ball near hands = Throwing
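The "object near landmark" rules on this slide can be sketched as a small proximity check. This assumes landmark positions and detected object centers are already available as 2D points; the rule table format, the `max_dist` threshold, and the coordinates are hypothetical. The "front legs off ground" rule is omitted because it needs a ground estimate rather than a distance between two points.

```python
import math

# (object label, landmark name, inferred action), following the slide's examples.
RULES = [
    ("apple", "mouth", "eating"),
    ("ball", "hands", "throwing"),
]


def infer_actions(landmarks, objects, max_dist=30.0):
    """Return actions whose object lies within max_dist pixels of its landmark."""
    actions = []
    for obj_label, landmark_name, action in RULES:
        if obj_label in objects and landmark_name in landmarks:
            distance = math.dist(objects[obj_label], landmarks[landmark_name])
            if distance <= max_dist:
                actions.append(action)
    return actions


# Made-up positions in pixels, for illustration only.
landmarks = {"mouth": (120, 80), "hands": (150, 160)}
objects = {"apple": (130, 90), "ball": (400, 300)}
print(infer_actions(landmarks, objects))
```

A fixed pixel threshold does not account for object scale; normalizing the distance by the size of the detected person would be one way to make the rule less brittle.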

Does phased learning help? • Cartoon/Caricature: Exaggerates the most salient features of the object class. • Simple BG: Real object with no confusing clutter. • Cluttered BG: Object in standard pose on natural background. • Articulated: Once we have built a strong appearance model, can we learn complicated articulations?

Our Related Papers • G. Elidan, B. Packer, G. Heitz, and D. Koller. Convex Point Estimation using Undirected Bayesian Transfer Hierarchies. UAI, 2008. • S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-Class Segmentation with Relative Location Prior. IJCV, 2008. • S. Gould, P. Baumstarck, M. Quigley, A. Ng, and D. Koller. Integrating Visual and Range Data for Robotic Object Detection. ECCV Workshop M2SFA2, 2008. • G. Heitz and D. Koller. Learning Spatial Context: Using Stuff to Find Things. ECCV, 2008. • G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded Classification Models: Combining Models for Holistic Scene Understanding. NIPS, 2008. • G. Heitz, G. Elidan, B. Packer, and D. Koller. Shape-based Object Localization for Descriptive Classification. NIPS, 2008.
