
A Scalable Tree-based Approach for Joint Object and Pose Recognition

Kevin Lai¹, Liefeng Bo¹, Xiaofeng Ren², Dieter Fox¹,². ¹ University of Washington, Seattle WA, USA; ² Intel Labs, Seattle WA, USA.


Presentation Transcript


  1. A Scalable Tree-based Approach for Joint Object and Pose Recognition. Kevin Lai¹, Liefeng Bo¹, Xiaofeng Ren², Dieter Fox¹,². ¹ University of Washington, Seattle WA, USA; ² Intel Labs, Seattle WA, USA

  2. Motivation • A unified framework for robust object recognition and pose estimation in real time • Scale efficiently with the number of objects and poses • Incremental learning of new objects (Figure: object recognition + pose estimation. Image sources: Deng et al., CVPR 2009; Muja 2009)

  3. Object and Pose Recognition (figure labels: Query, Category, Object Instance, Pose)

  4. LEGO Augmented Reality • L. Bo, J. Fogarty, D. Fox, S. Grampurohit, B. Harrison, K. Lai, N. Landes, J. Lei, P. Powledge, X. Ren, R. Ziola

  5. Scalable Recognition • Standard object recognition: • k-nearest neighbor classifier: scales linearly with the amount of training data • One-versus-all classifiers, e.g. SVMs: scale linearly with the number of classes • Scalable recognition: • Datasets keep growing, e.g. ImageNet [Deng et al., CVPR 2009] • A tree-based approach could scale sublinearly with the number of classes [Bengio et al., NIPS 2010]

  6. Object-Pose Tree (figure: a tree with four levels. Category: Apple, Stapler, Bowl, Cereal, ... Instance: Chex, Bran Flakes, Striped Bowl, Blue Bowl, ... View: ... Pose: ... Example query result: Category: Cereal, Instance: Bran Flakes, Pose: 18°)
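
The tree on this slide can be read as a classifier that is evaluated from the root down. The following is a minimal sketch of that idea, assuming each node holds a linear weight vector and that inference greedily follows the highest-scoring child; the slides do not spell out the inference rule, so this is an illustration rather than the authors' implementation. The `Node` class, the `classify` function, the three-dimensional features, the hand-picked weights, and the instance names `apple_a` and `apple_b` are all hypothetical. The toy tree stops at the instance level, whereas the tree on the slide continues to view and pose.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Node:
    """One tree node: a label, a linear weight vector, and child nodes."""
    label: str
    w: np.ndarray
    children: list = field(default_factory=list)


def classify(root, x):
    """Greedy top-down descent.

    Returns the list of labels along the chosen path and the number of
    dot products that were evaluated on the way down.
    """
    path, evaluations = [], 0
    node = root
    while node.children:
        scores = [float(child.w @ x) for child in node.children]
        evaluations += len(scores)
        node = node.children[int(np.argmax(scores))]
        path.append(node.label)
    return path, evaluations


def leaf(label, w):
    """Helper that builds a childless node from a plain list of weights."""
    return Node(label, np.array(w, dtype=float))


# Toy two-level tree (category -> instance) with hand-picked weights.
root = Node("root", np.zeros(3), [
    Node("cereal", np.array([1.0, 0.0, 0.0]),
         [leaf("chex", [0, 1, 0]), leaf("bran_flakes", [0, -1, 0])]),
    Node("bowl", np.array([-1.0, 0.0, 0.0]),
         [leaf("striped_bowl", [0, 0, 1]), leaf("blue_bowl", [0, 0, -1])]),
    Node("apple", np.array([0.0, 1.0, 1.0]),
         [leaf("apple_a", [1, 0, 0]), leaf("apple_b", [-1, 0, 0])]),
])

path, evaluations = classify(root, np.array([2.0, -1.0, 0.5]))
print(path, evaluations)
```

In this toy tree the query touches only the three category nodes and the two instance nodes under the winning category, so the work depends on the branching along one path rather than on the total number of leaves.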

  7. Object-Pose Tree Learning • Learn the parameters W of all nodes in the tree, given a set of N training samples (features and labels) • Extract features from both RGB and depth images: gradient and shape kernel descriptors over RGB and depth images [Bo, Ren, and Fox, NIPS 2010] • The overall objective function combines the category, instance, and pose label losses

  8. Object-Pose Tree Learning • Category label loss R_C is the standard multi-class SVM loss with slack variables and hinge loss constraints • Instance label loss R_I is the max over the loss at the category level and the loss at the instance level [Bengio et al., NIPS 2010] • Pose label loss R_P is the max over the losses at the category, instance, view, and pose levels
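
For reference, the "standard multi-class SVM loss with slack variables" mentioned above is usually written in the following generic textbook form. The equations from the slides did not survive in this transcript, so this is a sketch under stated assumptions, not the authors' exact objective: the symbols x_i (features), y_i (label), w_y (weight vector of class y), ξ_i (slack), and λ (regularization weight) are illustrative, and the last line only restates in symbols the "max over levels" description of the instance loss.

```latex
% Generic multi-class SVM with slack variables (textbook form).
% All symbols are illustrative assumptions, not taken from the slides.
\begin{aligned}
\min_{W,\,\xi}\quad & \frac{\lambda}{2}\lVert W\rVert^{2}
                      + \frac{1}{N}\sum_{i=1}^{N}\xi_i \\
\text{s.t.}\quad & w_{y_i}^{\top}x_i - w_{y}^{\top}x_i \;\ge\; 1 - \xi_i
                   \quad \forall\, y \ne y_i, \\
                 & \xi_i \ge 0, \quad i = 1,\dots,N .
\end{aligned}

% Eliminating the slack gives the per-sample hinge loss at one tree level:
\ell_i \;=\; \max\Bigl(0,\; \max_{y \ne y_i}\,
             1 + w_{y}^{\top}x_i - w_{y_i}^{\top}x_i \Bigr)

% "Max over levels" for the instance label loss, written per sample:
\ell_i^{I} \;=\; \max\bigl(\ell_i^{\text{category}},\; \ell_i^{\text{instance}}\bigr)
```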

  9. Object-Pose Tree Learning • Category and instance constraints (in R_C, R_I) are the standard multi-class SVM hinge loss constraints • View and pose constraints (in R_P): • View and pose labels need to be normalized • Δ(y_i, y) maps the angle difference from [0, π] to [0, 1] • The constraints are modified accordingly • The overall objective is minimized using stochastic gradient descent (SGD)
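
The normalization of pose labels and the SGD update can be sketched as follows. This is a minimal illustration, assuming angles in radians and a margin-rescaled hinge loss in which the required margin for a wrong pose equals Δ; margin rescaling is one common way to use such a Δ, and the slides' exact modified constraints are not preserved here. The function names `pose_delta` and `sgd_step`, the learning rate `lr`, and the regularization weight `lam` are hypothetical.

```python
import numpy as np


def pose_delta(angle_true, angle_pred):
    """Normalized angular difference in [0, 1]; both angles in radians."""
    diff = abs(angle_true - angle_pred) % (2 * np.pi)
    diff = min(diff, 2 * np.pi - diff)  # shortest way around the circle, in [0, pi]
    return diff / np.pi


def sgd_step(W, x, y_true, margins, lr=0.1, lam=1e-3):
    """One SGD step on a margin-rescaled multi-class hinge loss.

    W        : (num_classes, dim) weight matrix, one row per candidate label
    x        : (dim,) feature vector of the training sample
    y_true   : index of the correct label
    margins  : (num_classes,) required margin per label, e.g. pose_delta values
    Returns a new weight matrix; the input W is not modified.
    """
    scores = W @ x
    violations = margins + scores - scores[y_true]
    violations[y_true] = 0.0
    y_hat = int(np.argmax(violations))  # most violated constraint
    W = (1.0 - lr * lam) * W            # shrinkage from the L2 regularizer
    if violations[y_hat] > 0:
        W[y_hat] -= lr * x
        W[y_true] += lr * x
    return W


# Three candidate poses at 0, 90, and 180 degrees; the true pose is the first.
candidate_angles = np.array([0.0, np.pi / 2, np.pi])
margins = np.array([pose_delta(0.0, a) for a in candidate_angles])
W_new = sgd_step(np.zeros((3, 2)), np.array([1.0, 0.0]), 0, margins)
print(margins, W_new)
```

With this choice of margins, a prediction that is off by 180 degrees must be beaten by a larger margin than one that is off by 90 degrees, which is the intuition behind normalizing the pose labels.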

  10. Incremental Learning • Stochastic gradient descent (SGD) makes it possible to efficiently update the tree when adding new objects. • Setup: • Train a tree with 290 objects offline • Add 10 objects and update the tree using SGD from scratch (SGD), or initialized with previous weights (warm SGD) • Result: • 5-10x speedup while yielding the same accuracy
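
The difference between the two setups lies only in how the weights are initialized before SGD runs. The sketch below illustrates this with a flat weight matrix standing in for the tree's node parameters, which is a simplification. Initializing the rows of new objects to zero is an assumption, as are the function names `cold_start` and `warm_start` and the feature dimension of 8.

```python
import numpy as np


def cold_start(num_objects, dim):
    """SGD from scratch: every row starts at zero."""
    return np.zeros((num_objects, dim))


def warm_start(W_old, num_new):
    """Warm SGD: keep the learned rows and append zero rows for new objects."""
    dim = W_old.shape[1]
    return np.vstack([W_old, np.zeros((num_new, dim))])


# Stand-in for weights learned offline on 290 objects (random values here).
rng = np.random.default_rng(0)
W_290 = rng.normal(size=(290, 8))

W_warm = warm_start(W_290, num_new=10)        # 300 rows, first 290 preserved
W_cold = cold_start(num_objects=300, dim=8)   # 300 rows, all zero
print(W_warm.shape, W_cold.shape)
```

Both matrices would then be passed to the same SGD loop; the warm-started one begins close to a good solution for the 290 known objects, which is the plausible source of the reported speedup.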

  11. RGB-D Object Dataset • 250,000 RGB-Depth image pairs (640x480) • 300 objects in 51 categories annotated with ground truth poses A Large-Scale Hierarchical Multi-View RGB-D Object Dataset (Lai et al. ICRA 2011) Available at: http://www.cs.washington.edu/rgbd-dataset

  12. Evaluation • Object-Pose Tree • Joint learning of tree parameters • Learn parameters of each node as independent binary SVMs • K-Nearest Neighbor classifier • Exact and approximate • One-versus-all Support Vector Machine • SVM for category and instance recognition • Infeasible to train a binary SVM for every pose for every object • K-NN within each instance for pose estimation
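
The last baseline, nearest neighbor within each instance for pose estimation, can be sketched as below. It assumes the instance label has already been predicted by another classifier, uses Euclidean distance with k = 1, and returns the pose of the closest training sample of that instance. The function name `nn_pose_within_instance` and the toy data are hypothetical.

```python
import numpy as np


def nn_pose_within_instance(x, instance, train_X, train_instances, train_poses):
    """Return the pose of the nearest training sample that has the given instance label."""
    mask = train_instances == instance
    dists = np.linalg.norm(train_X[mask] - x, axis=1)
    return train_poses[mask][int(np.argmin(dists))]


# Toy training set: two instances (0 and 1), two samples each, poses in degrees.
train_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
train_instances = np.array([0, 0, 1, 1])
train_poses = np.array([10.0, 20.0, 30.0, 40.0])

query = np.array([0.9, 0.1])
print(nn_pose_within_instance(query, 0, train_X, train_instances, train_poses))
print(nn_pose_within_instance(query, 1, train_X, train_instances, train_poses))
```

Restricting the search to one instance keeps the pose lookup small, but the result is only meaningful when the instance label from the earlier stage is correct.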

  13. Results • OPTree: Object-Pose Tree (our approach) • NN: k-nearest neighbors • FLANN: approximate k-nearest neighbors [Muja and Lowe, VISAPP 2009] • 1vsA+NN: one-versus-all SVM for category and instance, k-nearest neighbor within instance for pose • IndepTree: Object-Pose Tree where each level of the tree is learned with a separate multi-class SVM optimization

  14. Example Results (figure labels: Query; Matches 1 to 5)

  15. Summary • Tree-based learning and classification framework • Jointly perform object category, instance, and pose recognition • Scales sub-linearly with the number of objects and poses • Online updating of parameters using stochastic gradient descent when adding new objects • Outperforms existing object recognition approaches in both accuracy and running time on the RGB-D Object Dataset, containing 300 everyday objects • Available at: http://www.cs.washington.edu/rgbd-dataset

  16. Generalization to Novel Objects

  17. Incremental Learning • Additional Experiment 1: • Train OPTree with 250 objects offline • Add 10 objects at a time for 5 rounds, updating the tree using 5000 iterations of SGD each round • Result: Warm SGD obtains within 1% test accuracy of SGD from scratch • Additional Experiment 2: • Train OPTree from scratch, adding 10 objects at a time for 30 rounds • Result: Warm SGD with 5000 iterations per round obtains within

  18. Evaluation • Category and Instance Recognition • Proportion of correctly labeled test samples • Pose Recognition • Difference between predicted and true poses, [0, π] → [0%, 100%] • 1. For all test images (0% pose accuracy if category or instance incorrect) • 2. Only for test images that were assigned the correct instance label • Running time per test sample • Feature extraction: same for all approaches (1 s) • Compare running time (in seconds) for each approach
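
The two pose metrics can be sketched as follows. The slide gives only the ranges [0, π] and [0%, 100%]; the sketch assumes a linear mapping in which a difference of 0 scores 100% and a difference of π scores 0%, which is an assumption about the direction of the mapping. The function names `pose_accuracy` and `mean_pose_accuracy` and the sample tuples are hypothetical.

```python
import math


def pose_accuracy(true_angle, pred_angle, instance_correct=True):
    """Percent pose accuracy for one test sample; angles in radians.

    Returns 0.0 when the category or instance label was wrong.
    """
    if not instance_correct:
        return 0.0
    diff = abs(true_angle - pred_angle) % (2 * math.pi)
    diff = min(diff, 2 * math.pi - diff)  # shortest angular distance, in [0, pi]
    return 100.0 * (1.0 - diff / math.pi)


def mean_pose_accuracy(samples, only_correct_instance=False):
    """Average over (true_angle, pred_angle, instance_correct) tuples.

    only_correct_instance=False: metric 1, all test images.
    only_correct_instance=True : metric 2, only images with the correct instance label.
    """
    if only_correct_instance:
        samples = [s for s in samples if s[2]]
    return sum(pose_accuracy(t, p, ok) for t, p, ok in samples) / len(samples)


samples = [
    (0.0, 0.0, True),           # exact pose, correct instance
    (0.0, math.pi / 2, True),   # 90 degrees off, correct instance
    (0.0, 0.0, False),          # wrong instance, counted as 0% in metric 1
]
print(mean_pose_accuracy(samples))
print(mean_pose_accuracy(samples, only_correct_instance=True))
```

Metric 1 penalizes recognition errors and pose errors together, while metric 2 isolates the quality of the pose estimate once the instance is known.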

  19. Pose recognition breakdown

  20. Recognition Results • Data: 42,000 cropped RGB+depth image pairs of 300 objects • Feature extraction: 1 second per test image • NN: nearest neighbor • FLANN: approximate nearest neighbor [Muja and Lowe, VISAPP 2009] • 1vsA: one-versus-all SVM [Shalev-Shwartz, ICML 2007] • +NN: nearest neighbor within instance for pose estimation • +RR: ridge regression within instance for pose estimation • OPTree: our approach

  21. Object-Pose Tree Learning • As in standard SVM optimization, we replace the Dirac delta function with the hinge loss and introduce slack variables

  22. Object Recognition Pipeline (figure labels: visual and depth features; image patch features; bag of words; image feature; distance classifier; object recognition)

  23. RGB-D Object Dataset • 300 objects in 51 categories • 250,000 640x480 RGB-Depth frames total • 8 video sequences containing these objects (home and office environments) A Large-Scale Hierarchical Multi-View RGB-D Object Dataset (Lai et al. ICRA 2011) http://www.cs.washington.edu/rgbd-dataset

  24. Object Hierarchy (WordNet/ImageNet)

  25. RGB-D Scenes A Large-Scale Hierarchical Multi-View RGB-D Object Dataset (Lai et al. ICRA 2011) http://www.cs.washington.edu/rgbd-dataset

  26. Recognition Results • Data: 42,000 cropped RGB+depth image pairs of 300 objects • Feature extraction: 1 second per test image • OPTree: Object-Pose Tree (our approach) • NN: k-nearest neighbor • FLANN: approximate k-nearest neighbor [Muja and Lowe, VISAPP 2009]

  27. Recognition Results • OPTree: Object-Pose Tree (our approach) • 1vsA+NN: one-versus-all SVM for category and instance, nearest neighbor within instance for pose

  28. Recognition Results • OPTree: Object-Pose Tree (our approach) • OPTree+NN: OPTree for category and instance, nearest neighbor within instance for pose • IndepTree: Object-Pose Tree where each level of the tree is learned with a separate multi-class SVM optimization

  29. Generalization to Novel Objects • Leave-sequence-out evaluation • 94.3% - Category recognition accuracy • 53.5% - penalize incorrect category and instance • 56.8% - category correct, penalize incorrect instance • 67.1% - category correct, estimate pose even if instance incorrect • 68.3% - instance correct • Leave-object-out evaluation (train on 249 objects, test on 51) • 84.4% - Category recognition accuracy • 52.8% - penalize incorrect category and instance • 62.5% - category correct, estimate pose even if instance incorrect

  30. Motivation • A unified framework for robust object recognition and pose estimation in real-time • Existing approaches have considered these tasks in isolation • Scale efficiently with the number of objects and poses • Incremental learning of new objects
