Recognizing Indoor Scenes

Ariadna Quattoni, Antonio Torralba • CSAIL, MIT; UC Berkeley EECS & ICSI • 32 Vassar St., Cambridge, MA 02139 • ariadna@csail.mit.edu, torralba@csail.mit.edu

OUTLINE • Introduction • Indoor Database • Scene Prototypes and ROIs (Regions of Interest) • Model & Learning • Experiments & Results • Conclusion

Introduction • Indoor scene recognition is a challenging open problem in high-level vision • Indoor scenes vary widely and are hard to characterize • Some are best characterized by global spatial properties (e.g. corridors) • Others are best characterized by the objects they contain (e.g. bookstores) • This paper proposes a prototype-based model that can successfully combine both sources of information

Indoor vs. Outdoor • Indoor categories (RAW: 26.5%, Gist: 62.9%, Sift: 61.9%) • Outdoor categories (RAW: 32.6%, Gist: 78.1%, Sift: 79.1%) • Results obtained on a dataset of 15 scene categories [9,3,7]

Indoor Scenes • Why has progress in this area been slow? • There is no large testbed of indoor scenes on which to train and test different approaches • This work creates a new indoor scene dataset (67 scene categories) • Improving indoor scene recognition performance requires image representations specifically tailored to this task • Indoor scenes vary widely, which makes their characteristics hard to judge • The work is related to work on learning distance functions [4,6,8] for visual recognition

Average images for a sample of the indoor scene categories

Indoor Database • Image sources: • Online image search tools (Google and Altavista) • Online photo sharing sites (Flickr) • The LabelMe dataset • The database contains 15,620 images with a minimum resolution of 200x200 pixels

Image Database • Category count: 12 + 14 + 14 + 11 + 15 = 66, plus Closet = 67

Prototype & ROI • Each prototype T_k is composed of m_k ROIs • To define a set of candidate ROIs for a given prototype, a human annotator was asked to segment the objects contained in it • Segmenting the prototype images gave a total of 2000 manually segmented images • 10 candidate ROIs were selected for each prototype, each occupying at least 1% of the image size • Candidate ROIs were also produced from a segmentation obtained using graph-cuts [13]

Image descriptors • Representing a prototype image: • Use the Gist descriptor (code available online [9]) • This results in a vector of 384 dimensions • Two Gist descriptors are compared using Euclidean distance • Representing an ROI: • Use a pyramid of visual words [14] • Create vector-quantized Sift descriptors by applying K-means to a random subset of images (following [7], 200 clusters are used) • Each ROI is decomposed into a 2x2 grid, and histograms of visual words are computed for each window [7,1,12] • Distances between two regions are computed using histogram intersection, as in [7]
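The two comparisons described on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, the input is assumed to be a 2-D map of visual-word indices (one per sampled Sift location), and turning histogram intersection into a distance as `1 - intersection` is an assumption made here.

```python
import numpy as np

def gist_distance(gist_a, gist_b):
    """Euclidean (L2) distance between two Gist vectors (e.g. 384-dimensional)."""
    a = np.asarray(gist_a, dtype=float)
    b = np.asarray(gist_b, dtype=float)
    return float(np.linalg.norm(a - b))

def grid_histograms(word_map, n_words, grid=2):
    """Concatenate visual-word histograms over a grid x grid split of a region.

    word_map: 2-D integer array holding one visual-word index per location.
    """
    word_map = np.asarray(word_map)
    row_blocks = np.array_split(np.arange(word_map.shape[0]), grid)
    col_blocks = np.array_split(np.arange(word_map.shape[1]), grid)
    hists = []
    for rows in row_blocks:
        for cols in col_blocks:
            cell = word_map[np.ix_(rows, cols)]
            hists.append(np.bincount(cell.ravel(), minlength=n_words))
    h = np.concatenate(hists).astype(float)
    # L1-normalise so that regions of different sizes are comparable.
    return h / max(h.sum(), 1.0)

def intersection_distance(h_a, h_b):
    """1 - histogram intersection for L1-normalised histograms (0 = identical)."""
    return 1.0 - float(np.minimum(h_a, h_b).sum())
```

With 200 visual words and a 2x2 grid, `grid_histograms` returns an 800-dimensional vector per ROI.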

ROI Search and Matches • Histograms of visual words can be computed efficiently using integral images • We assume that if two images are similar, their respective ROIs will be roughly aligned (i.e. in similar spatial locations)
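The integral-image idea and the rough-alignment assumption can be sketched as follows. This is an illustration under stated assumptions: the function names and the `max_shift` search radius are hypothetical, a single histogram per window is used instead of the 2x2 grid to keep the example short, and `1 - intersection` is again an assumed way of turning histogram intersection into a distance.

```python
import numpy as np

def integral_histogram(word_map, n_words):
    """Return ii with ii[r, c, w] = number of occurrences of word w in word_map[:r, :c]."""
    word_map = np.asarray(word_map)
    one_hot = np.eye(n_words, dtype=np.int64)[word_map]  # shape (H, W, n_words)
    ii = np.zeros((word_map.shape[0] + 1, word_map.shape[1] + 1, n_words), dtype=np.int64)
    ii[1:, 1:] = one_hot.cumsum(axis=0).cumsum(axis=1)
    return ii

def window_histogram(ii, top, left, bottom, right):
    """Word counts in rows top..bottom-1 and columns left..right-1, from four lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def roi_feature(ii, roi_hist, box, max_shift=2):
    """Smallest distance between roi_hist (L1-normalised) and same-sized windows near box."""
    top, left, bottom, right = box
    height, width = ii.shape[0] - 1, ii.shape[1] - 1
    best = np.inf
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            t, l, b, r = top + dr, left + dc, bottom + dr, right + dc
            if t < 0 or l < 0 or b > height or r > width:
                continue  # shifted window falls outside the image
            h = window_histogram(ii, t, l, b, r).astype(float)
            h /= max(h.sum(), 1.0)
            best = min(best, 1.0 - np.minimum(h, roi_hist).sum())
    return float(best)
```

Because each window histogram costs a fixed number of lookups, restricting the search to a small neighbourhood of the prototype ROI's location keeps matching cheap.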

Model Formulation • Goal: learn a mapping from images x to scene labels y • For simplicity, assume a binary setting: each label y is binary, indicating whether an image belongs to a given scene category • For the multiclass case, train a one-versus-all classifier for each scene • For each prototype T_k, define a set of feature functions, one for each of its m_k ROIs • Also include a global feature g_k(x), computed as the L2 norm of the difference between the Gist representation of image x and the Gist representation of prototype k • All these features are combined to define a global mapping
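One way to write the feature functions and the global mapping described on this slide is given below. This is a hedged reconstruction from the surrounding definitions, not a copy of the slide: the symbols f_{kj}, t_{kj}, x_s, d, h, p and \lambda_{kG} are notation assumed here, and \beta_k and \lambda_{kj} are the prototype and ROI weights of the model.

```latex
% Assumed notation: t_{kj} is the j-th ROI of prototype T_k, x_s is a candidate
% region of image x, d is the ROI distance, p is the number of prototypes.
f_{kj}(x) = \min_{s} d(t_{kj}, x_s), \qquad j = 1, \dots, m_k

g_k(x) = \lVert \mathrm{Gist}(x) - \mathrm{Gist}(T_k) \rVert_2

h(x) = \sum_{k=1}^{p} \beta_k \exp\Big( -\sum_{j=1}^{m_k} \lambda_{kj} f_{kj}(x) - \lambda_{kG}\, g_k(x) \Big)
```

Under this form, an image that is close to prototype k in both its ROIs and its Gist descriptor makes the exponential term close to 1, so \beta_k contributes almost fully to the score.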

Model Formulation • β: how relevant the similarity to a prototype k is • λ: captures the importance of a particular ROI inside a given prototype • The parameters are learned with a standard regularized classification objective • The loss is the hinge loss function
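A standard regularized hinge-loss objective of the kind named on this slide can be written as follows. This is a sketch under assumptions: h(x) denotes the global mapping (a name assumed here), the labels are taken to be y_i in {-1, +1}, and the squared-norm regularizers with constants C_\beta and C_\lambda are one common choice rather than a form confirmed by the slide.

```latex
% Assumed: n training pairs (x_i, y_i) with y_i \in \{-1, +1\};
% C_\beta and C_\lambda are regularization constants.
L(\beta, \lambda) = \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big)
                  + C_\beta \lVert \beta \rVert^2 + C_\lambda \lVert \lambda \rVert^2

\ell\big(h(x), y\big) = \max\big(0,\; 1 - y\, h(x)\big)
```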

Learning • L(β, λ) is minimized over a training set D • A gradient-based method is used for each optimization step • Each step uses the subgradient of L with respect to β or with respect to λ
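The learning step can be sketched with NumPy as below. This is a minimal illustration, not the authors' implementation. It assumes a score of the form h(x) = sum_k beta_k * exp(-sum_j lam_kj * f_kj(x) - lam_g_k * g_k(x)), a hinge loss with squared-norm regularizers, the same number of ROIs for every prototype, and a projection that keeps the λ weights non-negative. All function names, array layouts and default hyperparameters are hypothetical.

```python
import numpy as np

def activations(F, G, lam, lam_g):
    """e[i, k] = exp(-sum_j lam[k, j] * F[i, k, j] - lam_g[k] * G[i, k])."""
    return np.exp(-(F * lam).sum(axis=2) - G * lam_g)

def objective(F, G, y, beta, lam, lam_g, c_beta, c_lam):
    """Hinge loss over all examples plus squared-norm regularizers."""
    h = activations(F, G, lam, lam_g) @ beta
    hinge = np.maximum(0.0, 1.0 - y * h).sum()
    return hinge + c_beta * (beta @ beta) + c_lam * ((lam ** 2).sum() + lam_g @ lam_g)

def subgradients(F, G, y, beta, lam, lam_g, c_beta, c_lam):
    """Subgradients of the objective with respect to beta, lam and lam_g."""
    e = activations(F, G, lam, lam_g)            # shape (n, p)
    active = y * (e @ beta) < 1.0                # examples with non-zero hinge loss
    ya = np.where(active, y, 0.0)                # y_i on active examples, 0 elsewhere
    g_beta = -(ya[:, None] * e).sum(axis=0) + 2.0 * c_beta * beta
    w = ya[:, None] * e * beta                   # y_i * beta_k * e[i, k]
    g_lam = (w[:, :, None] * F).sum(axis=0) + 2.0 * c_lam * lam
    g_lam_g = (w * G).sum(axis=0) + 2.0 * c_lam * lam_g
    return g_beta, g_lam, g_lam_g

def fit(F, G, y, c_beta=0.01, c_lam=0.01, step=0.01, n_iter=200, seed=0):
    """Alternate a subgradient step on beta with a projected step on the lambdas."""
    rng = np.random.default_rng(seed)
    n, p, m = F.shape
    beta = rng.normal(scale=0.1, size=p)
    lam = rng.uniform(0.0, 1.0, size=(p, m))
    lam_g = rng.uniform(0.0, 1.0, size=p)
    for _ in range(n_iter):
        g_beta, _, _ = subgradients(F, G, y, beta, lam, lam_g, c_beta, c_lam)
        beta = beta - step * g_beta
        _, g_lam, g_lam_g = subgradients(F, G, y, beta, lam, lam_g, c_beta, c_lam)
        lam = np.maximum(lam - step * g_lam, 0.0)      # assumption: weights stay >= 0
        lam_g = np.maximum(lam_g - step * g_lam_g, 0.0)
    return beta, lam, lam_g
```

Here `F[i, k, j]` holds the ROI feature of image i for ROI j of prototype k, `G[i, k]` holds the Gist distance of image i to prototype k, and `y` holds labels in {-1, +1}. Only examples with non-zero hinge loss contribute to the data term of each subgradient.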

Experiments • All of these methods are trained with 331 prototypes • A one-versus-all classifier is trained for each category d, and the classifier scores are combined into a single prediction by taking the label with the maximum confidence score • Each classifier is trained on n sampled positive examples and 3n sampled negative examples
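The sampling scheme and the max-confidence rule above can be sketched as follows. The function names are hypothetical, and sampling uniformly without replacement is an assumption, since the slide does not say how the examples are drawn.

```python
import numpy as np

def sample_training_set(labels, target, n, rng):
    """Indices of n positives for `target` and 3n negatives, drawn without replacement."""
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == target)
    neg = np.flatnonzero(labels != target)
    pos_idx = rng.choice(pos, size=min(n, pos.size), replace=False)
    neg_idx = rng.choice(neg, size=min(3 * n, neg.size), replace=False)
    return pos_idx, neg_idx

def predict_multiclass(scores):
    """scores[i, d] is the confidence of the classifier for category d on image i."""
    return np.argmax(np.asarray(scores), axis=1)
```

For example, `sample_training_set(labels, target=0, n=5, rng=np.random.default_rng(0))` returns 5 positive and 15 negative indices whenever enough images of each kind are available.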

Results • The 67 indoor scene categories, sorted by multiclass average precision (training uses 80 images per class and testing uses 20 images per class)


Prototype

Conclusion • Draw the attention of the computer vision community working on scene recognition to indoor scenes, on which current algorithms seem to perform poorly • Propose a prototype-based model that can successfully combine both sources of information (local and global) • However, the performance reported in this paper is close to that of the first attempts on Caltech 101 [2]
