
Visual Search of Dance Archives




Presentation Transcript


  1. Oxford Robotics Group. May 2013. Visual Search of Dance Archives. John Collomosse, J.Collomosse@surrey.ac.uk. Centre for Vision, Speech and Signal Processing, University of Surrey.

  2. Motivation: Add value to curated Dance collections. Dance archives are currently searchable by text (curated metadata). What if you want to search on the content, e.g. the choreography itself?

  3. Research Landscape: Visual Search of Dance. Strands: pose driven visual search; iWeave: 3D costume archive; sketch driven choreography.

  4. Research Landscape: Visual Search of Dance. Strands: pose driven visual search; iWeave: 3D costume archive; sketch driven choreography. Visual sentences for pose retrieval over low-resolution cross-media dance collections. R. Ren and J. Collomosse. IEEE Trans. Multimedia 14(6). Dec 2012.

  5. UK-NRCD Archival Dance Footage. Digital Dance Archive (DDA) spanning ~100 years of UK dance history. Videos were transferred between several analogue formats prior to digitisation, giving: featureless backgrounds; illumination artifacts; contrast bleaching; blur / poor definition; grainy/noisy footage; small performers (e.g. 100 px); inter- and intra-occlusion. http://www.dance-archives.ac.uk

  6. Characterizing HPE on Archival Footage. Explicit Human Pose Estimation (HPE) fails on typical NRCD archival footage. [Examples: NRCD footage vs. Eichner et al. [CVPR'09] and Andriluka et al. [CVPR'09].]

  7. Contributions. Cross-media pose retrieval on archival data (contact sheets / photos and performance videos). Match pose implicitly rather than explicitly. New representation, "Visual Sentences", using self-similarity (SSIM) and LDA, built into a Bag of Words framework with tweaks, e.g. stop-word removal. Fuses Vision and Information Retrieval concepts: diversity re-ranking.

  8. Performer Detection. Dalal/Triggs-like pedestrian detection [CVPR 2005], trained across six videos (~5 hrs): 5k positive annotations, 5k negatives sampled randomly outside BBs. Horizontal poses included but rotated (twice). Output BBs rescaled to 64x128 pixels for retrieval. (Sketch below.)
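By way of illustration, a minimal sketch of a Dalal/Triggs-style detector, assuming 64x128 grayscale crops have already been harvested from the annotations; OpenCV's default HOG parameters match that window size, while the SVM regularisation value and stride are assumptions.

```python
import cv2
import numpy as np
from sklearn.svm import LinearSVC

hog = cv2.HOGDescriptor()  # defaults: 64x128 window, Dalal/Triggs parameters

def hog_features(crops):
    """HOG descriptors for a list of 64x128 grayscale uint8 crops."""
    return np.array([hog.compute(c).ravel() for c in crops])

def train_detector(pos_crops, neg_crops):
    """Linear SVM over positive/negative 64x128 crops."""
    X = np.vstack([hog_features(pos_crops), hog_features(neg_crops)])
    y = np.hstack([np.ones(len(pos_crops)), np.zeros(len(neg_crops))])
    return LinearSVC(C=0.01).fit(X, y)

def detect(frame, clf, stride=8):
    """Slide a 64x128 window over the frame; return scored bounding boxes."""
    boxes = []
    for y in range(0, frame.shape[0] - 128, stride):
        for x in range(0, frame.shape[1] - 64, stride):
            f = hog.compute(frame[y:y + 128, x:x + 64]).reshape(1, -1)
            score = clf.decision_function(f)[0]
            if score > 0:
                boxes.append((x, y, 64, 128, score))
    return boxes
```

In practice one would scan an image pyramid and apply non-maximum suppression; the single-scale loop here only shows the shape of the approach.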

  9. Visual Sentence Representation. Based on the Self-similarity (SSIM) descriptor*: 1) Computes a correlation surface Sq local to (x,y) using SSD. 2) Bins Sq into a log-polar representation using local maxima. 3) Discards invalid (v. low/high variance) features. (Sketch below.) * "Matching Local Self-similarities across Images and Video". E. Shechtman and M. Irani. CVPR 2007.
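A compact sketch of those three steps (minus the validity check), assuming grayscale input and the paper's 5x5 patch / 40x40 region defaults; the noise-variance constant and the bin counts are assumptions.

```python
import numpy as np

def ssim_descriptor(img, x, y, patch=5, region=40, n_rad=3, n_ang=12):
    """Log-polar-binned correlation surface of a patch with its surroundings.
    Assumes (x, y) lies at least region/2 + patch/2 pixels from any border."""
    r, p = region // 2, patch // 2
    centre = img[y - p:y + p + 1, x - p:x + p + 1].astype(np.float64)
    surface = np.zeros((region, region))
    for dy in range(-r, r):
        for dx in range(-r, r):
            cand = img[y + dy - p:y + dy + p + 1,
                       x + dx - p:x + dx + p + 1].astype(np.float64)
            ssd = np.sum((centre - cand) ** 2)
            surface[dy + r, dx + r] = np.exp(-ssd / 2500.0)  # var ~ 50^2
    # bin the correlation surface into log-polar cells, keeping cell maxima
    desc = np.zeros((n_rad, n_ang))
    for dy in range(-r, r):
        for dx in range(-r, r):
            rad = np.hypot(dx, dy)
            if rad < 1 or rad > r:
                continue
            ri = min(int(np.log(rad) / np.log(r) * n_rad), n_rad - 1)
            ai = int((np.arctan2(dy, dx) + np.pi) / (2 * np.pi) * n_ang) % n_ang
            desc[ri, ai] = max(desc[ri, ai], surface[dy + r, dx + r])
    return desc.ravel()
```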

  10. SSIM for Dance. Using a star ensemble, SSIM showcased* results including dance pose detection. 1) The ensemble approach scales at best O(n), and we need to search >>100k BBs. 2) SSIM is not well characterized for our data (cross-domain, cross-performance). 3) However, it was the most promising approach tested, vs. SIFT/SURF, HOG, Shape Context. * "Matching Local Self-similarities across Images and Video". E. Shechtman and M. Irani. CVPR 2007.

  11. Implicit Pose Representation. Self-similarity (SSIM) features are codebooked (hierarchical k-means, hard assignment) and aggregated over scale. (Sketch below.)
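A minimal two-level stand-in for the hierarchical k-means (HKM) codebook with hard assignment, aggregating word counts over scales; the branch factor and tree depth here are assumptions, not the paper's values.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_hkm(descriptors, branch=10):
    """Two-level vocabulary tree: root clustering, one child KMeans per cell.
    Assumes each root cell attracts at least `branch` descriptors."""
    root = KMeans(n_clusters=branch, n_init=4).fit(descriptors)
    leaves = [KMeans(n_clusters=branch, n_init=4)
              .fit(descriptors[root.labels_ == k]) for k in range(branch)]
    return root, leaves

def quantise(desc, root, leaves, branch=10):
    """Hard-assign one descriptor to a leaf word id in [0, branch^2)."""
    k = root.predict(desc.reshape(1, -1))[0]
    return k * branch + leaves[k].predict(desc.reshape(1, -1))[0]

def bovw_histogram(descs_per_scale, root, leaves, branch=10):
    """Aggregate hard assignments over all scales into one count histogram."""
    hist = np.zeros(branch * branch)
    for descs in descs_per_scale:          # one descriptor set per scale
        for d in descs:
            hist[quantise(d, root, leaves, branch)] += 1
    return hist
```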

  12. Representations and strategies (PSF1 of 4). Given a dataset of ROIs $D$ and a query ROI $q$, evaluate $\mathrm{PSF}(q,d)$ for all $d \in D$ and rank. Pose similarity function (PSF) 1 serves as baseline: multi-scale BoVW, $\mathrm{PSF}_1(q,d) = \frac{1}{|d|}\sum_{i=1}^{n} \mathrm{tf}(w_i, d)$, where $w_i$ is the $i$th of $n$ visual words present in $q$, $\mathrm{tf}(\cdot)$ yields word frequency within $d$, and $|d|$ is its word count.
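Read as code, that baseline (using my reconstruction of the formula from the definitions above) amounts to:

```python
from collections import Counter

def psf1(q_words, d_words):
    """q_words, d_words: visual-word id lists for the query and dataset ROI.
    Sums, over the distinct words in q, their frequency in d, normalised
    by d's word count."""
    tf_d = Counter(d_words)
    return sum(tf_d[w] for w in set(q_words)) / max(len(d_words), 1)
```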

  13. Representations and strategies (PSF2 of 4). Pose similarity function (PSF) 2 is a variant of MS-BoVW that individually weights the importance of each layer (up to 5): $\mathrm{PSF}_2(q,d) = \sum_{l} \alpha_l \, \mathrm{PSF}_1(q_l, d_l)$, where $\alpha_l$ are normalised weights bootstrapped via SVM over a small training set of 50. In practice the learned weights indicate a ~linear increase with finer scales.
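PSF2 then follows as a weighted sum of per-scale PSF1 terms, reusing psf1 from the previous sketch; the alpha values would come from the SVM bootstrap described above (the ones here are placeholders).

```python
def psf2(q_layers, d_layers, alphas=(0.1, 0.15, 0.2, 0.25, 0.3)):
    """q_layers, d_layers: per-scale word-id lists (coarse to fine);
    alphas: normalised per-layer weights (placeholder values)."""
    return sum(a * psf1(q, d) for a, q, d in zip(alphas, q_layers, d_layers))
```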

  14. Representations and strategies (PSF3 of 4). The visual sentence (VS) representation encodes fine-scale features plus structural context. Semantic body zones are unlikely to map explicitly to regions in the structural hierarchy; a set of VS captures membership implicitly over latent variables. Topic discovery via LDA: the topic set is learned via Gibbs sampling over 1k training samples using 48 topics (c.f. Choreutics). Variable-length sentences are padded; spatial relationships are implicitly encoded via context. (Sketch below.)
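A sketch of the topic-discovery step, mirroring the 48-topic setup above; scikit-learn's variational LDA stands in here for the Gibbs sampler used in the paper.

```python
from sklearn.decomposition import LatentDirichletAllocation

def sentence_topics(sentence_histograms, n_topics=48):
    """sentence_histograms: (n_sentences, vocab) matrix of visual-word counts,
    one row per (padded) visual sentence. Returns the model and the
    per-sentence topic mixtures used for retrieval."""
    lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50)
    theta = lda.fit_transform(sentence_histograms)
    return lda, theta
```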

  15. Representations and strategies (PSF4 of 4). Explicitly encodes spatial relationships via a sliding window approach over $q$ and $d$: a 2 x 2 window (at the coarsest level, i.e. 4 x 8 pixels) slides over both ROIs, comparing all VS within its footprint. Reminiscent of text passage retrieval. Similarity between a window pair (50-100 VS): randomly sample 1/3 of the VS in the window pair and search for the pair of sentences minimising $\|s_q - s_d\|$, where $\|\cdot\|$ is a count of in-place differences between VS. (Sketch below.)
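A sketch of that window comparison, treating the in-place difference count as a Hamming distance after padding; the padding symbol is an assumption.

```python
import random

def vs_distance(s1, s2, pad=0):
    """Count of in-place differences between two visual sentences,
    padded to equal length with an assumed pad symbol."""
    n = max(len(s1), len(s2))
    s1 = list(s1) + [pad] * (n - len(s1))
    s2 = list(s2) + [pad] * (n - len(s2))
    return sum(a != b for a, b in zip(s1, s2))

def window_distance(vs_q, vs_d, frac=1 / 3):
    """vs_q, vs_d: the visual sentences falling inside the two 2x2 windows.
    Samples a third of each side and returns the best-matching pair's cost."""
    sample_q = random.sample(vs_q, max(1, int(len(vs_q) * frac)))
    sample_d = random.sample(vs_d, max(1, int(len(vs_d) * frac)))
    return min(vs_distance(a, b) for a in sample_q for b in sample_d)
```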

  16. Comparative Results (PSF1-4). Initial evaluation over 32 works across 4 cross-media collections. Video subsampled @ 5s = 6.3k video stills + 1.7k photos ≈ 8k BBs. No stop-word identification at this stage. Independent treatment of scales is significantly better; VS outperforms the Layered strategy by ~10%; PSF3 is best (+4%) but drops sharply after rank 1k.

  17. Query set and Ground Truth. Mark-up task distributed over 3 professional archivists in UK-NRCD. 65 queries, each a single BB (2/3 contact-sheet photos, 1/3 video frames). 8k BBs marked up as relevant/non-relevant with respect to each query.

  18. Comparative Results (PSF1-4). Effect of stop-word removal on the BoVW codebook, comparing the best performing VS (PSF3) and Layered (PSF2) strategies. Stop words are identified via the frequency distribution under a Bernoulli or Poisson model. Results indicate PSF3 (LDA) over k=1000, with Bernoulli stop-word removal at 0.85, performs best. (Sketch below.)
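One plausible reading of the Bernoulli criterion, as a sketch: flag visual words whose document frequency exceeds the 0.85 threshold, i.e. words so near-ubiquitous that a Bernoulli occurrence model renders them uninformative. The exact statistical test in the paper may differ.

```python
def stop_words_bernoulli(bovw_histograms, threshold=0.85):
    """bovw_histograms: (n_docs, vocab) count matrix.
    Returns a boolean mask over the vocabulary marking stop words."""
    import numpy as np
    present = bovw_histograms > 0        # Bernoulli occurrence events
    doc_freq = present.mean(axis=0)      # fraction of docs containing word
    return doc_freq > threshold          # assumed reading of the 0.85 cut
```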

  19. Diversity Re-ranking (PSF5 = PSF3 re-ranked). Direct presentation of results can lead to unsatisfactory visual repetition (e.g. temporally adjacent video frames), which is not ideal for archive discovery; a run of poor results can also reduce precision. Re-rank via Kruskal clustering of the affinity graph A of the top n results (the scope of DR). A is computed pairwise using PSF4 (the sliding window approach). Spanning trees are iteratively identified in the graph to form a cluster set; each cluster is ranked independently under the PSF3 score, and the ranks merged. (Sketch below.)
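A sketch of the re-rank, standing in a cut maximum-affinity spanning tree for the iterative Kruskal clustering; the cluster count and round-robin merge policy are assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def diversity_rerank(affinity, psf3_scores, n_clusters=5):
    """affinity: (n, n) pairwise PSF4 similarities over the top-n results;
    psf3_scores: per-result PSF3 scores. Returns a diversified ordering."""
    # An MST on (1 - similarity) is a maximum-affinity spanning tree.
    mst = minimum_spanning_tree(1.0 - affinity).toarray()
    edges = [(mst[i, j], i, j) for i, j in zip(*np.nonzero(mst))]
    for _, i, j in sorted(edges, reverse=True)[:n_clusters - 1]:
        mst[i, j] = 0.0                  # cut the weakest-affinity links
    _, labels = connected_components(mst, directed=False)
    clusters = [np.where(labels == c)[0] for c in range(labels.max() + 1)]
    ranked = [sorted(c.tolist(), key=lambda i: -psf3_scores[i])
              for c in clusters]
    order = []                           # merge: best of each cluster in turn
    while any(ranked):
        for c in ranked:
            if c:
                order.append(c.pop(0))
    return order
```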

  20. Results - Qualitative

  21. Results - Qualitative. Serendipitous recovery from failed BB isolation!

  22. Results - Quantitative. Comparison vs. BoVW (single and multiple scales) and variants including SPK (spatial pyramid kernel).

  23. Scaling the dataset. Initial dataset plus the Siobhan Davies archive (200 videos, 562 contact sheets) ≈ 68k BBs. An inverted index is used for PSF1, 2, 3, 5 (sketch below). Comparison to explicit HPE: Pictorial Structures Revisited [Andriluka '09]; Pose Search [Eichner '09].
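A minimal inverted index of the kind mentioned above, assuming each ROI is reduced to a list of visual-word ids; only ROIs sharing at least one word with the query then need to be scored by the PSF.

```python
from collections import defaultdict

def build_inverted_index(roi_words):
    """roi_words: {roi_id: [word ids]} -> {word id: set of roi ids}."""
    index = defaultdict(set)
    for roi, words in roi_words.items():
        for w in words:
            index[w].add(roi)
    return index

def candidates(query_words, index):
    """Union of posting lists: the only ROIs worth scoring with a PSF."""
    hits = set()
    for w in set(query_words):
        hits |= index.get(w, set())
    return hits
```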

  24. Conclusions on Pose Search. SoA pose search relies on explicit HPE, which is impractical on low-resolution, cross-domain footage. Visual sentences + LDA (PSF3) reach ~32% MAP, well above SoA: they encode local appearance with a spatial context, at a level of abstraction sufficient to match diverse footage. Diversity re-ranking improves results by ~4%. Query time <2s for 68k records; given the dataset size, results could be pre-computed at this scale.

  25. Research Landscape: Visual Search of Dance. Strands: pose driven visual search; iWeave: 3D costume archive; sketch driven choreography.

  26. Research Landscape: Visual Search of Dance. Strands: pose driven visual search; iWeave: 3D costume archive; sketch driven choreography.

  27. ReEnact: Contributions. A major driver for the use of dance archives is the development of new choreography; ReEnact is a sketch-based interface to the NRCD archives enabling this. Visual Narrative: a set of key-frame poses linked with gestures that describe a movement, a conceptual extension of 'storyboard' sketches [Collomosse et al. ICCV 09].

  28. Related Work. No prior work on sketch-based pose retrieval. There are several works on sketch-based shape retrieval, but these are aimed at inter- not intra-class variation. A Performance Evaluation of the Gradient Field HOG Descriptor for Sketch based Image Retrieval. R. Hu and J. Collomosse. Computer Vision and Image Understanding (CVIU). February 2013.

  29. ReEnact: Pose retrieval pipeline. [Pipeline diagram. Training: sketch parse and video parse over training pairs; learn manifold mapping (geodesic k-NN); map all video. Query: sketch parse, then map.]

  30. ReEnact: Sketch Parsing. Sketches are converted into stick figures (joint angle representation). Ellipse detection for the head. Torso detection via proximity to extreme points of other strokes and centre of mass. Strokes intersecting the torso are potential limbs; heuristics select limb pairs for arms/legs. The user may manipulate left/right labellings, as these are ambiguous in a sketch. Skeletons from Sketches of Dancing Poses. M. Fonseca, S. James and J. Collomosse. Proc. VL/HCC. Nov 2012.

  31. ReEnact: Performer Extraction. Extracting a silhouette of the performer within the bounding box. [Pipeline: saliency, FG/BG texton, and motion-difference fields; MRF solve.] Unary: weighted sum of the three fields. Pairwise: standard Boykov '01 contrast term. (Sketch below.)
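A rough stand-in for the MRF solve: fuse the three unary fields into a seed mask and let OpenCV's grabCut, which carries a Boykov-style pairwise term, refine it. The field weights and thresholds are assumptions, not the paper's values.

```python
import cv2
import numpy as np

def extract_silhouette(bgr_crop, saliency, fgbg, motion, w=(1.0, 1.0, 1.0)):
    """bgr_crop: 8-bit colour crop of the detected BB; the three fields are
    float maps in [0, 1] over the same crop. Returns a binary silhouette."""
    unary = (w[0] * saliency + w[1] * fgbg + w[2] * motion) / sum(w)
    mask = np.full(unary.shape, cv2.GC_PR_BGD, np.uint8)
    mask[unary > 0.5] = cv2.GC_PR_FGD    # probable foreground
    mask[unary > 0.8] = cv2.GC_FGD       # confident foreground seeds
    mask[unary < 0.2] = cv2.GC_BGD       # confident background seeds
    bgm = np.zeros((1, 65), np.float64)
    fgm = np.zeros((1, 65), np.float64)
    cv2.grabCut(bgr_crop, mask, None, bgm, fgm, 5, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```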

  32. ReEnact: Descriptor Formation. Skeleton -> joint angle representation; silhouette -> gridded Zernike moments; match. Concatenate 22-D moments over a 2x2 grid, affine invariant within each cell. (Sketch below.)
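A sketch of the gridded Zernike descriptor, using mahotas for the moments. The 2x2 grid and per-cell magnitudes follow the slide; the moment degree is an assumption (mahotas' degree-8 default yields 25 magnitudes per cell, not the 22 quoted above).

```python
import numpy as np
import mahotas

def gridded_zernike(silhouette, degree=8):
    """silhouette: binary mask of the performer.
    Returns concatenated Zernike moment magnitudes over a 2x2 grid;
    magnitudes are rotation invariant within each cell."""
    h, w = silhouette.shape
    feats = []
    for r, c in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        cell = silhouette[r * h // 2:(r + 1) * h // 2,
                          c * w // 2:(c + 1) * w // 2]
        radius = min(cell.shape) // 2
        feats.append(mahotas.features.zernike_moments(cell, radius,
                                                      degree=degree))
    return np.concatenate(feats)
```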

  33. Learning a mapping between the manifolds. Geodesic distance as a shortest path over the graph. (Sketch below.)
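A minimal sketch of that computation: build a k-NN graph over the pose descriptors and take Dijkstra shortest paths as geodesic distances. The neighbourhood size is an assumption.

```python
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import dijkstra

def geodesic_distances(descriptors, k=5):
    """All-pairs shortest-path (geodesic) distances over a k-NN graph
    built from the descriptor set."""
    knn = kneighbors_graph(descriptors, k, mode='distance')
    knn = knn.maximum(knn.T)      # symmetrise so every edge runs both ways
    return dijkstra(knn, directed=False)
```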

  34. Constructing the Graph (G) in space D. Around 150 training pairs of sketches and video frames are gathered to seed G. Test video frames are subsequently indexed by extending G with new poses, attached to the nearest N training nodes: N=1 for unconfident frames, N>1 for confident ones. Confidence is determined by temporal coherence (covariance) of the descriptors.

  35. Domain Transfer S -> D. Given a training sketch s we can now infer similarity to any video pose in D. So given an arbitrary query q, and assuming local linearity in S: $d(q, n_x) = \min_{n_d} \big[ d_S(q, n_d) + \sum_{(a,b)} d_D(a, b) \big]$, where $n_x$ is a candidate video pose, $n_d$ a connection into D from S, and $(a,b)$ the pairs of consecutive nodes on the shortest path through D.
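The reconstructed distance above, written out as a sketch; names follow the slide, and the minimum over connection nodes is my reading of the formula.

```python
import numpy as np

def transfer_distance(q_desc, connections, geo_D, sketch_dist):
    """connections: list of (sketch descriptor of n_d, node id of n_d in D);
    geo_D: precomputed geodesic distance matrix over D (previous sketch);
    sketch_dist: distance function in sketch space S.
    Returns the inferred distance from query q to every pose in D."""
    best = np.full(geo_D.shape[0], np.inf)
    for s_desc, nd in connections:
        # d_S(q, n_d) plus the geodesic continuation through D to every pose
        best = np.minimum(best, sketch_dist(q_desc, s_desc) + geo_D[nd])
    return best
```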

  36. Retrieval Results. Trained on 150 frames, tested over ~6k. AP @ [1,80] averaged over 6 queries. Training (Blueprint): MAP 60%. Test (ThreeD): MAP 47%.

  37. Choreography Synthesis. Web UI for generating visual narratives via sketch / semantic label annotation. For free: inference can be run backward from D -> S to produce stick men from video, useful for visualizing / exploring alternative retrieval results.

  38. Video Synthesis. Inspired by Video Textures [Schödl '00], a video form of the Motion Graph [Kovar '02].

  39. Video Path Optimization. The motion graph is formed by identifying transitions between frame pairs: pose similarity via our geodesic distance, down-weighted by poor optical flow correspondence [Brox '04], and low-pass filtered to encourage motion coherence.

  40. Video Path Optimization. The motion graph is duplicated and linked via "virtual" nodes (sketched poses).

  41. Video Path Optimization. The shortest path across the graph is a function of three costs: pose similarity; gesture similarity (mean gesture similarity over the path, from a sliding-window SVM trained for gesture recognition, treated as a black box); and duration of sequence (count of frames along the path, penalising deviation from an idealised duration specified by the user with action labels). Together these balance fidelity to the visual narrative against visual smoothness. (Sketch below.)
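A sketch of how the three costs might combine, with networkx's Dijkstra standing in for the solver; the weighting constants are placeholders, not the paper's values.

```python
import networkx as nx

def build_motion_graph(pose_cost, gesture_score, w_pose=1.0, w_gest=1.0):
    """pose_cost: {(i, j): transition cost between frames i and j};
    gesture_score: per-frame similarity in [0, 1] to the requested gesture."""
    G = nx.DiGraph()
    for (i, j), c in pose_cost.items():
        # penalise bad transitions and frames unlike the requested gesture
        G.add_edge(i, j,
                   weight=w_pose * c + w_gest * (1.0 - gesture_score[j]))
    return G

def best_path(G, start, goal, ideal_len, w_dur=0.1):
    """Shortest path under the combined cost, nudged toward an
    idealised duration by a per-frame deviation penalty."""
    path = nx.dijkstra_path(G, start, goal)
    cost = nx.path_weight(G, path, weight='weight')
    return path, cost + w_dur * abs(len(path) - ideal_len)
```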

  42. Video Synthesis: Results. Representative run for a 3-stage visual narrative over Three-D. Gradient domain compositing is used against an "infinite background".

  43. Video composited

  44. Video composited

  45. ReEnact: Conclusion. Sketch-based pose search using a learnable piecewise-linear manifold mapping. Temporally coherent pose descriptor based on gridded Zernike moments; 47% MAP on unseen video. Visual narratives to generate archival choreography, via motion graph optimization fusing pose/action costs. Future work: improve compositing of the performer; fix unwanted scale changes due to BB detection; alternative ways to specify intermediate gestures.

  46. Research Landscape: Visual Search of Dance. Strands: pose driven visual search; iWeave: 3D costume archive; sketch driven choreography.

  47. Research Landscape: Visual Search of Dance. Strands: pose driven visual search; iWeave: 3D costume archive; sketch driven choreography.

  48. iWeave – Interactive Wearable Archive. Ongoing project enabling users to experience costume and choreography from circa the 1920s. Dance performance captured in a 3D studio; an animated character is created that is interactively controlled by a human using Microsoft Kinect.

  49. iWeave – Performance Capture (Raw) Daffodil dress from Natural Movement collection (1920s)

  50. iWeave – Performance Capture (4D Video) Daffodil dress from Natural Movement collection (1920s)
