Systematically Grounding Language through Vision in a Deep, Recurrent Neural Network

Systematically Grounding Language throughVision in a Deep, Recurrent Neural Network Paper authored by Derek D. Monner and James A. Reggia published in AGI-2011 Introduction What is Systematically Grounded Language ? The system should “understand” the language rather than just give correct output. Ways like symbol manipulation do not require the machine to “understand” the language. “Understanding” a language – Mapping between internal model of world and input The machine should relate all its senses to each other and its state ( already learned knowledge). A step towards general machine intelligence Systematicity Weak - The system can label known objects whose attributes have also been previously encountered Categorical - The system can label a new object with attributes it already earned before. Descriptive - The system can learn new attributes for a familiar object. Strong - The system can learn to label a new object with new attributes. Task Description The network operates over a micro –language and in a micro-universe. It receives as input two streams – visual and auditory. From these it should output the learn the intended meaning. The language used for the sentences is mildly context sensitive – 5 shapes, 3 colours, 3 different sizes, and a few prepositions . Description Input to the system – Visual scene and Auditory sentence. Visual scene Collection of objects and adjectives/attributes describing them Denoted using words , with fixed width fonts Ex : [large blue cylinder 4] The number is to differentiate between identical objects Auditory Stream Each visual scene is followed by a sentence describing partially or fully the scene. Ex : “ Large blue cylinder “ Word -> phonemes - Sentence is converted into an uninterrupting sequence of phonemes. We would want the system to learn to combine these phonemes into morphemes and words and to learn relation between various words - Intention Stream The network should integrate both the visual steam and the auditory stream and must generate a sequence of predicates. Ex : (cylinder 4), (blue 4) The system would consider only those attributes which are present in both the streams Example Network Architecture We use a generalized Long short-term memory( LTSM-g) which is deep recurrent network. Memory cells - LTSM-g’s neural units which input, output and forget gates. Can store data between large time gaps. Trained by gradient descent, back propagated error signals With all these features these memory cells are more close to biological neurons than the stateless neurons we know. The last internal layer has a reccurent connection, from previously learned memory( output ). Evaluation Evaluation done for all levels of systematicity Weak - Only 10% of the data is used for testing while the rest is used to train the network Categorical - One object is excluded from training. Check if the network can correctly categorize it ( Its description is familiar). Descriptive - An object is not described fully in the training data. Check if the network can include the added attributes of the object in its memory. Strong Condition - A completely new object with a altogether new description is given to the network while testing.

Systematically Grounding Language throughVision in a Deep, Recurrent Neural Network Paper authored by Derek D. Monner and James A. Reggia published in AGI-2011 Results A network is trained over 3 million randomly generated scene-sentence pairs. For each such pair the network must produce a predicate sequence. The accuracy for weak, categorical, descriptive, strong conditions are 93%, 95%, 95%, 97% respectively. Accuracy Conclusions and Questions The network is using the grounded language systematically. Further work – What are the internal representations of the network ? Are they same as the ones we know ? Or are they different ?

Systematically Grounding Language through Vision in a Deep, Recurrent Neural Network

Systematically Grounding Language through Vision in a Deep, Recurrent Neural Network

Presentation Transcript

Recurrent neural networks (I)

III. Recurrent Neural Networks

Neural Network

Viewing Vision-Language Integration as a Double-Grounding case

Neural Network Language NNL

EXTENSIONS OF RECURRENT NEURAL NETWORK LANGUAGE MODEL

RECURRENT NEURAL NETWORKS

Deep Neural Network Language Models

Neural Network Models in Vision

DEEP Vision

A Dynamic System Analysis of Simultaneous Recurrent Neural Network

NEURAL NETWORK

Deep Learning and Neural Network

Deep Convolutional Neural Network for Hyperspectral Image Classification

Deep Neural Network and Its Elements

Recurrent Neural Networks

Self-Organized Recurrent Neural Learning for Language Processing reservoir-computing

Audio-Based Multimedia Event Detection Using Deep Recurrent Neural Networks