This study details the development of a corpus tailored for enhancing conversational assistants in everyday tasks. It focuses on fine-grained activity recognition through the integration of speech input and RGB-D vision, allowing multi-step activities to be learned and recognized from one-shot instruction. Using overhead microphones, video monitoring, and a range of sensors, the project tracks user activities such as making tea and making sandwiches. It also explores labeling methodologies for supervised learning and how to improve models with human-in-the-loop feedback.
Creating a Corpus for a Conversational Assistant for Everyday Tasks
Henry Kautz, Young Song, Ian Pereira, Mary Swift, Walter Lasecki, Jeff Bigham, James Allen
University of Rochester
Goals • Fine-grained activity recognition combining speech and RGBD vision • Learning and recognizing multi-step activities from (one-shot) instruction • Learning names and properties of objects from instruction • Tracking and assistance using task model
Recording setup: overhead mics, power meter, lapel mic, video, Kinect, open/close sensors, RFID sensors
Language → Logical Form • Example utterance: “I’m going to make a cup of tea.”
Extracted Events • “I put it on the stove.” → :event ont::put :agent user :theme v123 :start 0 :end 32 :utt 2 :speechtime/eventtime reln: overlap
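The logical-form roles shown above map naturally onto a simple structured record. Below is a minimal Python sketch of how one extracted event could be stored; the field names and example values follow the slide’s example, but the class itself and its types are illustrative assumptions, not the project’s actual representation.

```python
from dataclasses import dataclass

@dataclass
class ExtractedEvent:
    """One event extracted from an utterance (illustrative sketch only)."""
    event: str                       # ontology type, e.g. "ont::put"
    agent: str                       # e.g. "user"
    theme: str                       # discourse variable, e.g. "v123"
    start: int                       # event start time
    end: int                         # event end time
    utt: int                         # index of the source utterance
    speechtime_eventtime_reln: str   # temporal relation, e.g. "overlap"

# The event for "I put it on the stove." from the slide:
put_event = ExtractedEvent(event="ont::put", agent="user", theme="v123",
                           start=0, end=32, utt=2,
                           speechtime_eventtime_reln="overlap")
```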
Domains • Making tea - 12 subjects x 3 episodes • Making sandwiches • Building things with blocks • Coarse-grained home activities • Snack bar surveillance
Labeling Corpus • Data must be labeled for supervised learning methods and for evaluating supervised or unsupervised methods • “Gold standard” procedure: define an event ontology, hand-label the data, then review / correct by a second investigator • Cost: about 1 hour of labeling per 2 minutes of video • Alternative?
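As an illustration of the gold-standard procedure, here is a minimal sketch of what one hand-labeled event span might look like once an event ontology is defined and a second investigator reviews the labels. The class, field names, and the example event types (“fill-kettle”, “boil-water”) are hypothetical placeholders, not the corpus’s actual annotation schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GoldLabel:
    """One hand-labeled event span for a gold-standard corpus (illustrative)."""
    event_type: str                  # placeholder type from an event ontology
    start_frame: int
    end_frame: int
    annotator: str                   # first labeler
    reviewer: Optional[str] = None   # second investigator who reviews/corrects
    corrected: bool = False

labels = [
    GoldLabel("fill-kettle", 0, 210, annotator="A1"),
    GoldLabel("boil-water", 211, 900, annotator="A1", reviewer="A2", corrected=True),
]
```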
Crowd AR • Idea • Try to recognize activities using current model • When confidence is low, ask human workers to label video segment • Mediate response • Update model with new labels
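A minimal sketch of the Crowd AR loop described above, assuming a hypothetical model object with recognize/update methods, a crowd interface with a request_labels method, and an arbitrary confidence threshold; none of these names come from the project. The mediate() helper is sketched under the Mediator slide below.

```python
CONFIDENCE_THRESHOLD = 0.6  # assumed value; the slides do not specify one

def crowd_ar_step(model, video_segment, crowd):
    """One step of the Crowd AR loop: recognize, query workers on low
    confidence, mediate their answers, and update the model.

    `model`, `crowd`, and their methods are hypothetical placeholders,
    not the project's actual interfaces.
    """
    label, confidence = model.recognize(video_segment)
    if confidence < CONFIDENCE_THRESHOLD:
        worker_labels = crowd.request_labels(video_segment)  # ask human workers
        label = mediate(worker_labels)                       # see Mediator sketch below
        model.update(video_segment, label)                   # update model with new label
    return label
```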
Worker Interface • Workers watch a live video stream of an activity and enter open-ended text labels into the bottom text field • They can see the responses of other workers and of the learning model (HMM) to the right of the video, and can agree with a response by clicking on it.
Mediator • An example of the graph created by the input mediator • Green nodes represent sufficient agreement between multiple workers (here N = 2) • The final sequence matches the baseline despite incorrect (over-specific) submissions by 2 of the 3 workers and a spelling error by one worker on the word “walk”.
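A minimal sketch of agreement-based mediation consistent with the description above: worker submissions are normalized (lower-cased, with near-duplicates such as a misspelling of “walk” merged) and a label is accepted once at least N = 2 workers agree. The function name and the fuzzy-matching cutoff are assumptions for illustration, not the project’s actual mediator.

```python
from collections import Counter
from difflib import get_close_matches

AGREEMENT_N = 2  # as in the slide's example

def mediate(worker_labels, agreement_n=AGREEMENT_N):
    """Return the first label that at least `agreement_n` workers agree on."""
    normalized = []
    for raw in worker_labels:
        label = raw.strip().lower()
        # Map near-duplicates (e.g. typos) onto an already-seen label.
        match = get_close_matches(label, normalized, n=1, cutoff=0.8)
        normalized.append(match[0] if match else label)
    counts = Counter(normalized)
    for label, count in counts.most_common():
        if count >= agreement_n:
            return label
    return None  # no sufficient agreement yet; wait for more workers

# Example: two of three workers agree once the typo is normalized.
print(mediate(["walk", "wallk", "walk to the fridge"]))  # -> "walk"
```

In the spirit of the slide’s example, the two agreeing submissions outvote the over-specific one, so the accepted label still matches the baseline.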
Interactive Recognition and Labeling Experiments • Domain: coarse-grained activities • Model: HMM
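For an HMM over coarse-grained activities, a generic Viterbi decoder recovers the most likely activity sequence from observed features. The sketch below is a textbook implementation with made-up toy parameters (two activities, three observation symbols); the real model’s states, features, and probabilities are not given in the slides.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete-observation HMM."""
    n_states = len(start_p)
    T = len(obs)
    delta = np.zeros((T, n_states))            # best log-prob ending in each state
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] + np.log(trans_p[:, j])
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores[back[t, j]] + np.log(emit_p[j, obs[t]])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: two activities ("idle", "make-tea") and three observation symbols.
states = ["idle", "make-tea"]
start_p = np.array([0.7, 0.3])
trans_p = np.array([[0.8, 0.2], [0.3, 0.7]])
emit_p = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
print([states[s] for s in viterbi([0, 2, 2, 1], start_p, trans_p, emit_p)])
```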
Monitoring Multi-Agent Scenarios • Surveillance of department honor snack bar • 85% correct on 11 trials
Parameterized & Complex Activities • Average number of objects and actions correctly labeled by worker groups of different sizes over two different activity sequences. • As the group size increases, more objects and actions are labeled.