30 likes | 206 Vues
Systematically Grounding Language through Vision in a Deep, Recurrent Neural Network. Paper authored by Derek D. Monner and James A. Reggia published in AGI-2011. Introduction. What is Systematically Grounded Language ?
E N D
Systematically Grounding Language throughVision in a Deep, Recurrent Neural Network Paper authored by Derek D. Monner and James A. Reggia published in AGI-2011 Introduction What is Systematically Grounded Language ? The system should “understand” the language rather than just give correct output. Ways like symbol manipulation do not require the machine to “understand” the language. “Understanding” a language – Mapping between internal model of world and input The machine should relate all its senses to each other and its state ( already learned knowledge). A step towards general machine intelligence Systematicity Weak - The system can label known objects whose attributes have also been previously encountered Categorical - The system can label a new object with attributes it already earned before. Descriptive - The system can learn new attributes for a familiar object. Strong - The system can learn to label a new object with new attributes. Task Description The network operates over a micro –language and in a micro-universe. It receives as input two streams – visual and auditory. From these it should output the learn the intended meaning. The language used for the sentences is mildly context sensitive – 5 shapes, 3 colours, 3 different sizes, and a few prepositions . Description Input to the system – Visual scene and Auditory sentence. Visual scene Collection of objects and adjectives/attributes describing them Denoted using words , with fixed width fonts Ex : [large blue cylinder 4] The number is to differentiate between identical objects Auditory Stream Each visual scene is followed by a sentence describing partially or fully the scene. Ex : “ Large blue cylinder “ Word -> phonemes - Sentence is converted into an uninterrupting sequence of phonemes. We would want the system to learn to combine these phonemes into morphemes and words and to learn relation between various words - Intention Stream The network should integrate both the visual steam and the auditory stream and must generate a sequence of predicates. Ex : (cylinder 4), (blue 4) The system would consider only those attributes which are present in both the streams Example Network Architecture We use a generalized Long short-term memory( LTSM-g) which is deep recurrent network. Memory cells - LTSM-g’s neural units which input, output and forget gates. Can store data between large time gaps. Trained by gradient descent, back propagated error signals With all these features these memory cells are more close to biological neurons than the stateless neurons we know. The last internal layer has a reccurent connection, from previously learned memory( output ). Evaluation Evaluation done for all levels of systematicity Weak - Only 10% of the data is used for testing while the rest is used to train the network Categorical - One object is excluded from training. Check if the network can correctly categorize it ( Its description is familiar). Descriptive - An object is not described fully in the training data. Check if the network can include the added attributes of the object in its memory. Strong Condition - A completely new object with a altogether new description is given to the network while testing.
Systematically Grounding Language throughVision in a Deep, Recurrent Neural Network Paper authored by Derek D. Monner and James A. Reggia published in AGI-2011 Results A network is trained over 3 million randomly generated scene-sentence pairs. For each such pair the network must produce a predicate sequence. The accuracy for weak, categorical, descriptive, strong conditions are 93%, 95%, 95%, 97% respectively. Accuracy Conclusions and Questions The network is using the grounded language systematically. Further work – What are the internal representations of the network ? Are they same as the ones we know ? Or are they different ?