140 likes | 257 Vues
Viewing Vision-Language Integration as a Double-Grounding case. Katerina Pastra. Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K. Language Technology Group, Institute for Language and Speech Processing (ILSP), Athens, Greece.
E N D
Viewing Vision-Language Integration as a Double-Grounding case Katerina Pastra Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K. Language Technology Group, Institute for Language and Speech Processing (ILSP), Athens, Greece
still, lack of an AI study of V-L integration Vision-Language Integration Content integration vs. technical integration • Small part within cognitive architectures • Small part of the integration story • What is computational V-L integration? (definition) • How is it achieved? (state of the art, trends, needs) • Why is it needed in AI agents? (explanatory/theoretical ground) • How far can we go? (implementation suggestions, the VLEMA prototype)
The notion of integration Descriptive Definition • Computational V-L integration is a process of associating visual and corresponding linguistic representations • It may take the form of one of four integration processes according to the integration purpose to be served State of the art V-L integration prototypes: Deal with blocksworlds or miniworlds (scalability issues) Work with already abstracted/analysed visual input Rely on integration resources for V-L association
Why do agents need V-L integration ? • Inherent characteristics of integrated media: Does each one of them lack something that the other can compensate for? • Gains for an agent in communication: Are agents with V-L integration abilities more intelligent? Why do we need to know ? • to decide on the significance of such a mechanism for an artificial agent • to get the theoretical ground needed for research that is currently done mostly ad hoc and in isolation within different AI sub-areas
Inherent Characteristics • Images: - reference object: physical or mental - lack inherent means of indicating focus/salience (cf. indexical-deictic mechanisms in vision theories) - lack inherent means of indicating type:token distinctions (i.e. level of abstraction) • Language: - reference object: mental - has subtle mechanisms for indicating level of abstraction - has mechanisms for controlling attendance to details, focus etc. - lacks direct access to the physical world (cf. indexicals)
From Symbol Grounding… From the Symbol Grounding debate we get the following : • Language lacks direct access to the physical world • Language needs such access to express intentionality • Symbol grounding is a process of associating • symbols/language with percepts (visual percepts) • - Symbol grounding provides language direct access to the • physical world • An agent must perform symbol grounding on its own to be • intrinsically intentional (must go beyond instantiation of • associations to inference)
Mental World Visual Perception Representations Linguistic Representations Physical World Association Direct Access Grounding
Shifting the focus from symbols… Relying on the inherent characteristics of images, one may argue that : - Images lack controlled access to mental aspects of the world • - Images need such access to express intentionality • Image grounding is a process of associating images with • language • - Image grounding provides images controlled access to the • mental world • An agent must perform image-grounding on its own to be • intrinsically intentional
Mental World Linguistic Representations Visual Representations Association Direct Access Uncontrolled Access Grounding Physical World
From Symbol Grounding to Double Grounding • The Double-Grounding Theory: • Double-grounding is a process of associating symbolic with • iconic representations • Double-grounding provides language a direct access to the • physical world, and at the same time it provides vision a • controlled access to mental aspects of the world • - Vision-language integration is a case of double-grounding • V-L integration compensates for features images and • language inherently lack on their own – it is necessary for • expressing and understanding intentionality in V-L MM • situations
Viewing integration as a double-grounding case • V-L integration abilities are needed for an agent to be intentional in MM situations • Exploring how V-L integration can be achieved computationally, one realises that this research issue involves not only the perceptual and linguistic modules of a cognitive architecture, but also the learning and reasoning ones. • The corresponding AI communities need, therefore, to join forces for addressing the challenges in endowing agents with their own V-L integration abilities
Challenging current practices means that a prototype should: • work with real visual scenes • analyse its visual data automatically • associate images and language automatically Is it feasible to develop such a prototype ??? The AI quest for V-L Integration In relying on human created data, state of the art V-L integration systems avoid core integration challenges and therefore fail to perform real integration Can we do better? How far can we go ???
An optimistic answer VLEMA: A Vision-Language intEgration MechAnism • Input: automatically re-constructed static scenes in 3D (VRML format) from RESOLV (robot-surveyor) • Integration task: Medium Translation from images (3D sitting rooms) to text (what and where in EN) • Domain: estates surveillance • Horizontal prototype • Implemented in shell programming and ProLog