
Stochastic Grammars: Overview






Presentation Transcript


1. Stochastic Grammars: Overview
• Representation: stochastic grammar
• Terminals: object interactions
• Context-sensitive due to internal scene models
• Domain: Towers of Hanoi
• Requires activities with strong temporal constraints
• Contributions
  • Showed recognition & decomposition with very weak appearance models
  • Demonstrated usefulness of feedback from high- to low-level reasoning components
  • Extended SCFG: parameters and abstract scene models

2. Expectation Grammars (CVPR 2003)
• Analyze video of a person physically solving the Towers of Hanoi task
  • Recognize valid activity
  • Identify each move
  • Segment objects
  • Detect distracters / noise

  3. System Overview

4. Low-Level Vision
• Foreground/background segmentation
• Automatic shadow removal
• Classification based on chromaticity and brightness differences
• Background model
  • Per-pixel RGB means
  • Fixed mapping from chromaticity difference (CD) and brightness difference (BD) to foreground probability (sketched below)
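
The slide names the ingredients but not the mapping itself. Below is a minimal sketch of one common per-pixel formulation of brightness and chromaticity differences against an RGB-mean background model; the threshold values (cd_thresh, shadow_alpha) and function names are illustrative assumptions, not the system's actual parameters.

    import numpy as np

    def distortions(frame, bg_mean, eps=1e-6):
        """Per-pixel brightness difference (BD) and chromaticity difference (CD)
        of an RGB frame against a per-pixel RGB-mean background model."""
        f = frame.astype(np.float64)
        b = bg_mean.astype(np.float64)
        # BD: the scale alpha that best aligns the pixel with the
        # background color along its RGB direction (alpha < 1 means darker).
        alpha = (f * b).sum(axis=-1) / ((b * b).sum(axis=-1) + eps)
        # CD: residual color difference after that brightness scaling.
        cd = np.linalg.norm(f - alpha[..., None] * b, axis=-1)
        return alpha, cd

    def foreground_probability(frame, bg_mean, cd_thresh=10.0, shadow_alpha=0.6):
        """Fixed mapping from CD and BD to a foreground probability
        (illustrative thresholds)."""
        alpha, cd = distortions(frame, bg_mean)
        fg = np.clip(cd / cd_thresh, 0.0, 1.0)  # large color change -> foreground
        # Darker pixels with near-background chromaticity are shadow, not object.
        shadow = (cd < cd_thresh) & (alpha < 1.0) & (alpha > shadow_alpha)
        fg[shadow] = 0.0                         # automatic shadow removal
        return fg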

5. ToH: Low-Level Vision
[Figure panels: raw video; background model; foreground and shadow detection; foreground components]

6. Low-Level Features
[Figure: "merge" and "enter" blob events]
• Explanation-based symbols
• Blob interaction events
  • merge, split, enter, exit, tracked, noise
  • Future work: hidden, revealed, blob-part, coalesce
• All possible explanations generated
• Inconsistent explanations heuristically pruned (see sketch below)
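
A minimal sketch of the generate-then-prune idea: enumerate every labeling of the observed blob interactions, then discard labelings a consistency check rejects. The event set comes from the slide; the consistency rules here are toy assumptions standing in for the system's richer heuristics.

    from itertools import product

    EVENTS = ["merge", "split", "enter", "exit", "tracked", "noise"]

    def candidate_explanations(n_observations, labels=EVENTS):
        """Generate all possible labelings of n observed blob interactions."""
        return product(labels, repeat=n_observations)

    def consistent(explanation, tracked_blobs):
        """Toy heuristic pruning: an 'exit' or 'split' cannot occur when
        nothing is currently tracked in the scene."""
        tracked = tracked_blobs
        for event in explanation:
            if event == "enter":
                tracked += 1
            elif event == "exit":
                if tracked == 0:
                    return False
                tracked -= 1
            elif event == "split" and tracked == 0:
                return False
        return True

    # Keep only the explanations the scene state does not rule out.
    explanations = [e for e in candidate_explanations(3)
                    if consistent(e, tracked_blobs=1)]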

7. Expectation Grammars
• Representation:
  • Stochastic grammar
  • Parser augmented with parameters and internal scene model (encoding sketched below)

    ToH -> Setup, enter(hand), Solve, exit(hand);
    Setup -> TowerPlaced, exit(hand);
    TowerPlaced -> enter(hand, red, green, blue), Put_1(red, green, blue);
    Solve -> state(InitialTower), MakeMoves, state(FinalTower);
    MakeMoves -> Move(block) [0.1] | Move(block), MakeMoves [0.9];
    Move -> Move_1-2 | Move_1-3 | Move_2-1 | Move_2-3 | Move_3-1 | Move_3-2;
    Move_1-2 -> Grab_1, Put_2;
    Move_1-3 -> Grab_1, Put_3;
    Move_2-1 -> Grab_2, Put_1;
    Move_2-3 -> Grab_2, Put_3;
    Move_3-1 -> Grab_3, Put_1;
    Move_3-2 -> Grab_3, Put_2;
    Grab_1 -> touch_1, remove_1(hand,~) | touch_1(~), remove_last_1(~);
    Grab_2 -> touch_2, remove_2(hand,~) | touch_2(~), remove_last_2(~);
    Grab_3 -> touch_3, remove_3(hand,~) | touch_3(~), remove_last_3(~);
    Put_1 -> release_1(~) | touch_1, release_1;
    Put_2 -> release_2(~) | touch_2, release_2;
    Put_3 -> release_3(~) | touch_3, release_3;
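
One way the stochastic productions above could be encoded; the Production class and GRAMMAR list are illustrative assumptions, not the paper's actual data structures. Note how the two MakeMoves alternatives carry the [0.1] / [0.9] probabilities from the listing.

    from dataclasses import dataclass

    @dataclass
    class Production:
        lhs: str          # nonterminal being expanded
        rhs: tuple        # sequence of terminals / nonterminals
        prob: float       # rule probability; alternatives for one lhs sum to 1

    GRAMMAR = [
        Production("MakeMoves", ("Move",), 0.1),
        Production("MakeMoves", ("Move", "MakeMoves"), 0.9),
        Production("Move_1-2", ("Grab_1", "Put_2"), 1.0),
        # ... remaining productions follow the listing above
    ]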

8. Forming the Symbol Stream
• Domain-independent blob interactions converted to terminals of the grammar via heuristic domain knowledge (sketch below)
• Examples:
  • merge + (x ≈ 0.33) → touch_1
  • split + (x ≈ 0.50) → remove_2
• Grammar rule can only fire if internal scene model is consistent with terminal
• Examples:
  • Can't remove_2 if no discs on peg 2 (B)
  • Can't move disc to be on top of smaller disc (C)
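
A minimal sketch of the event-to-terminal conversion and the scene-model consistency check. The x-positions for pegs 1 and 2 follow the slide's examples; the peg-3 position, the tolerance, and all names are assumptions for illustration.

    # Normalized x-positions of the three pegs (peg 3 is assumed).
    PEG_X = {1: 0.33, 2: 0.50, 3: 0.67}

    def nearest_peg(x, tol=0.05):
        peg = min(PEG_X, key=lambda p: abs(PEG_X[p] - x))
        return peg if abs(PEG_X[peg] - x) < tol else None

    def event_to_terminal(event, x):
        """Convert a domain-independent blob event into a grammar terminal
        using heuristic domain knowledge (peg locations)."""
        peg = nearest_peg(x)
        if peg is None:
            return "noise"
        if event == "merge":
            return f"touch_{peg}"   # e.g., merge at x ~ 0.33 -> touch_1
        if event == "split":
            return f"remove_{peg}"  # e.g., split at x ~ 0.50 -> remove_2
        return "noise"

    def terminal_allowed(terminal, scene):
        """Scene-model check: remove_N can only fire if peg N holds a disc.
        `scene` maps peg number -> list of discs, bottom to top."""
        if terminal.startswith("remove_"):
            peg = int(terminal.split("_")[1])
            return len(scene[peg]) > 0
        return True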

9. ToH: Example Frames
• Explicit noise detection
• Objects recognized by behavior, not appearance

10. ToH: Example Frames
• Detection of distracter objects
• Grammar can fill in for occluded observations

11. Finding the Most Likely Parse
• Terminals and rules are probabilistic
• Each parse has a total probability
  • Computed by the Earley-Stolcke algorithm
• Probabilistic penalty for insertion and deletion errors
• Highest-probability parse chosen as best interpretation of the video (scoring sketch below)
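
The Earley-Stolcke parser accumulates these probabilities incrementally during parsing; as a simpler illustration of the scoring idea alone, the sketch below totals rule log-probabilities and applies fixed penalties for hypothesized insertion and deletion errors. The penalty values are assumptions.

    import math

    # Illustrative penalties for explaining away spurious (inserted) or
    # missed (deleted) terminals in the symbol stream.
    LOG_INSERTION_PENALTY = math.log(0.05)
    LOG_DELETION_PENALTY = math.log(0.05)

    def parse_log_prob(rule_probs, n_insertions, n_deletions):
        """Total log-probability of one candidate parse."""
        score = sum(math.log(p) for p in rule_probs)
        score += n_insertions * LOG_INSERTION_PENALTY
        score += n_deletions * LOG_DELETION_PENALTY
        return score

    def best_parse(candidates):
        """candidates: iterable of (parse, rule_probs, n_ins, n_del) tuples.
        Returns the highest-probability parse as the best interpretation."""
        return max(candidates,
                   key=lambda c: parse_log_prob(c[1], c[2], c[3]))[0]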

12. Contributions
• Showed activity recognition and decomposition without appearance models
• Demonstrated usefulness of feedback from high-level, long-term interpretations to low-level, short-term decisions
• Extended SCFG representational power with parameters and abstract scene models

13. Lessons
• Efficient error recovery is important for realistic domains
• All sources of information should be included (e.g., appearance models)
• Concurrency and partial ordering are common, thus should be easily representable
• Temporal constraints are not the only kind of action relationship (e.g., causal, statistical)

14. Representational Issues
• Extend temporal relations
  • Concurrency
  • Partial ordering
  • Quantitative relationships
• Causal (not just temporal) relationships
• Parameterized activities
