A neglected problem in the computational theory of mind Object Tracking and the Mind-World gap

Zenon Pylyshyn, Rutgers Center for Cognitive Science

Before I begin I would like you to see a ‘video game’ that will figure in the last part of my talk • The demonstration shows a task called “Multiple Object Tracking” • Track the initially-distinct (flashing) items through the trial (here 10 secs) and indicate at the end which items are the “targets” • After each example I’d like you to ask yourself, “How do I do it?” • If you are like most of our subjects you will have no idea, or a false idea…

Keep track of the objects that flash

How do we do it? What properties of individual objects do we use?

Going behind occluding surfaces does not disrupt tracking Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290.

Not all well-defined features can be tracked: Track endpoints of these lines. Endpoints move exactly as the squares did!

The basic problem of cognitive science • What determines our behavior is not how the world is, but how we represent it as being • As Chomsky pointed out in his review of Skinner, if we describe behavior in relation to the objective properties of the world, we would have to conclude that behavior is essentially stimulus-independent • Every naturally-occurring behavioral regularity is cognitively penetrable • Any information that changes beliefs can systematically and rationally change behavior

Representation and Mind Why representations are essential • Do representations only come into play in “higher level” mental activities, such as reasoning? • Even at early stages of perception many of the states that must be postulated are representations (i.e. what they are about plays a role in explanations).

Examples from vision (1): Intrapercept constraints Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83.

Examples from vision (2): The Poggendorff illusion depends on perceived contours – they need not be physical edges

The rules of color mixing apply to perceived color • “Red light and yellow light mix to produce orange light” • This “law” holds regardless of how the red light and yellow light are produced. • The yellow may be light of 580 nanometer wavelength, or it may be a mixture of light of 530 nm and 650 nm wavelengths. • So long as one light looks yellow and the other looks red the “law” will hold – the mixture will look orange.

Another example of a classical representation

Other forms of representation…. • Lines FG, BC are parallel and equal. • Lines EH, AD are parallel and equal. • Lines FB, GC are parallel and equal. • Lines EA, HD are parallel and equal. • Vertices EF, HG, DC and AB are joined.... • Part-Of{Cube, Top-Face(EFGH), Bottom-Face(ABCD), Front-Face(FGCB), Back-Face(EHDA)} • Part-Of{Top-Face(Front-Edge(FG), Back-Edge(EH), Left-Edge(EF), Right-Edge(HG)},…

What’s wrong with this picture? What’s wrong is that the CTM is incomplete — it does not address a number of fundamental questions • It fails to specify how representations connect with what they represent – it’s not enough to use English words in the representation (that’s been a common confusion in AI) or to draw pictures (a common confusion in theories of mental imagery) • English labels and pictures may help the theorist recall which objects are being referred to … • But what makes it the case that a particular mental symbol refers to one thing rather than another? • How are concepts grounded? (Symbol Grounding Problem)

Another way to look at what the Computational Theory of Mind lacks • The missing function in the CTM is a mechanism that allows perception to refer to individual things in the visual field directly and nonconceptually: • Not as “whatever has properties P1, P2, P3, ...”, but as a singular term that refers directly to an individual and does not appeal to a representation of the individual’s properties. • Such a reference is like a proper name or a pointer in a computer data structure, or like a demonstrative term (like this or that) in natural language. • Note that in a computer a pointer does not refer via a location, despite what the term “pointer” suggests
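The contrast between picking something out by description and picking it out by direct reference can be illustrated with a minimal Python sketch. It only illustrates the pointer analogy used above and is not a model of visual indexing; the `Item` class and the `pick_by_description` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: two items with identical properties remain distinct individuals
class Item:
    color: str
    x: float
    y: float


def pick_by_description(scene, color, x, y):
    """Descriptive reference: 'whatever has properties P1, P2, P3'."""
    matches = [it for it in scene if (it.color, it.x, it.y) == (color, x, y)]
    # A description only succeeds if exactly one thing currently satisfies it.
    return matches[0] if len(matches) == 1 else None


scene = [Item("red", 0.0, 0.0), Item("red", 5.0, 5.0)]

index = scene[0]                  # direct reference, like a pointer or 'this'
description = ("red", 0.0, 0.0)   # the same individual, picked out by its properties

# The individual moves and changes colour.
scene[0].x, scene[0].y, scene[0].color = 3.0, 4.0, "green"

still_same = index is scene[0]                     # the reference still reaches the individual
found = pick_by_description(scene, *description)   # the stored description no longer fits anything
```

After the change, `still_same` is true while `found` is `None`: the reference survives changes in every encoded property, whereas the description has to be rewritten each time a property changes.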

An example from personal history: Why we need to pick out individual things without referring to their properties • We wanted to develop a computer system that would reason about geometry by actually drawing a diagram and noticing adventitious properties of the diagram from which it would conjecture lemmas to prove • We wanted the system to be as psychologically realistic as possible so we assumed that it had a narrow field of view and noticed only limited, spatially-restricted information as it examined the drawing • This immediately raised the problem of coordinating noticings and led us to the idea of visual indexes to keep track of previously encoded parts of the diagram.

Begin by drawing a line…. L1

Now draw a second line…. L2

And draw a third line…. L3

Notice what you have so far… (noticings are local – you encode what you attend to) [Diagram labels: L1, V6, L2] There is an intersection of two lines… But which of the two lines you drew are they? There is no way to indicate which individual things are seen again without a way to refer to individual (token) things

Look around some more to see what is there… [Diagram labels: L5, L2, V12] Here is another intersection of two lines… Is it the same intersection as the one seen earlier? Without a special way to keep track of individuals the only way to tell would be to encode unique properties of each of the lines. Which properties should you encode?

In examining a geometrical figure one only gets to see a sequence of local glimpses

The incremental construction of visual representations requires solving a correspondence problem over time • We have to determine whether a particular individual element seen at time t is identical to another individual element seen at some earlier time. This is one manifestation of the correspondence problem. • Solving the correspondence problem is equivalent to picking out and tracking the identity of token individuals as they change their appearance, their location or the way they are encoded or conceptualized • To do that we need the capacity to refer to token individuals (I will call them objects) without doing so by appealing to their properties. This requires a special form of demonstrative reference I call a Visual Index.

A note about the use of labels in this example • There are two purposes for figure labels. One is to specify what type of individual it is (line, vertex,..). The other is to specify which individual it is so it is individuated and thus can be selected or bound to the argument of a predicate. • The second of these is what I am concerned with because indicating which individual it is is essential in vision. • Many people (e.g., Marr, Yantis) have suggested that individuals may be marked by tags, but that won’t do since one cannot literally place a tag on an object, and even if we could it would not obviate the need to individuate and index, just as labels don’t help. • Labeling things in the world is not enough because to refer to the line labeled L1 you would have to be able to think “this is line L1” and you could not think that unless you had a way of first picking out the referent of this.

The difference between a direct (demonstrative) and a descriptive way of picking something out has produced many “You are here” cartoons. It is also illustrated in this recent New Yorker cartoon…

The difference between descriptive and demonstrative ways of picking something out (illustrated in this New Yorker cartoon by Sipress)

‘Picking out’ • Picking out entails individuating, in the sense of separating something from a background (what Gestalt psychologists called a figure-ground distinction) • This sort of picking out has been studied in psychology under the heading of focal or selective attention. • Focal attention appears to pick out and adhere to objects rather than places • In addition to a unitary focal attention there is also evidence for a mechanism of multiple references (about 4 or 5), that I have called a visual index or a FINST • Indexes are different from focal attention in many ways that we have studied in our laboratory (I will mention a few later) • A visual index is like a pointer in a computer data structure – it allows access but does not itself tell you anything about what is being pointed to

The requirements for picking out and keeping track of several individual things reminded me of an early comic book character called Plastic Man

Imagine being able to place several of your fingers on things in the world without recognizing their properties while doing so. You could then refer to those things (e.g. ‘what finger #2 is touching’) and could move your attention to them. You would then be said to possess FINgers of INSTantiation (FINSTs)

FINST Theory postulates a limited number of pointers in early vision that are elicited by certain events in the visual field and that enable vision to refer to those things without doing so under a concept or a description

[Diagram labels: Information (causal) link; FINST; Demonstrative reference link] FINSTs and Object Files form the link between the world and its conceptualization. The only nonconceptual contents in this picture are FINST indexes! Object File contents are conceptual!

A note on terminology • A FINST provides a reference to an individual visible ‘thing’ • I sometimes call this referent a FING by analogy with FINST and sometimes an object to conform with usage in psych, but FINGs are nonconceptual so they do not pick out something as an object, because OBJECT is a concept. Maybe “proto object”? • I have also called it a pointer, but that erroneously suggests that it “points to” the location of an object, as opposed to the object itself. In a computer, a pointer is the name of a stored datum. • I have said that a FINST is a visual demonstrative like ‘this’ or ‘that’, but that too is misleading because the reference of a demonstrative depends on the intentions of the speaker • I have also noted that a FINST is like a proper name but that won’t do since a name can pick out something not in sensory contact whereas a FINST can only refer to a visible item (or one that is briefly out of sight).

A quick tour of some evidence for FINSTs • The correspondence problem • The binding problem • Evaluating multi-place visual predicates (recognizing multi-element patterns) • Operating over several visual elements at once without having to search for them first • Subitizing • Subset search • Multiple-Object Tracking • Cognizing space without requiring a spatial display in the head

Individual objects and the binding problem • We can distinguish scenes that differ by conjunctions of properties, so early vision must somehow keep track of how properties co-occur – conjunction must not be obscured. This is called the binding problem • The most common proposal is that vision keeps track of properties according to their location and binds together co-located properties.

The proposal of binding conjunctions by the location of conjuncts does not work when feature location is not punctate and becomes even more problematic if they are co-located – e.g., if their relation is “inside”

Binding as object-based • The proposal that properties are conjoined by virtue of their common location has many problems • In order to assign a location to a property you need to know its boundaries, which requires distinguishing the object that has those properties from its background (figure-ground individuation) • Properties are properties of objects, not of locations – which is why properties move when objects move. Empty locations have no causal properties. • The alternative to conjoining-by-location is conjoining by object. According to this view, solving the binding problem requires first selecting individual objects and then keeping track of each object’s properties (in its object file) • If only properties of selected objects are encoded and if those properties are recorded in object files specific to each object, then all conjoined properties will be recorded in the same object file, thus solving the binding problem
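The point about object files can be made concrete with a small Python sketch. It is an illustration under simplifying assumptions, not a model of early vision: scenes are lists of dictionaries, and the `index` key, `feature_inventory` and `object_files` are hypothetical names introduced here.

```python
def feature_inventory(scene):
    """Record which features are present, but not which object has which."""
    return {value for obj in scene for key, value in obj.items() if key != "index"}


def object_files(scene):
    """Keep one file per indexed object; all of its properties go into that file."""
    return {
        obj["index"]: {key: value for key, value in obj.items() if key != "index"}
        for obj in scene
    }


# Two scenes containing the same features in different conjunctions.
scene_a = [
    {"index": 1, "color": "red", "shape": "square"},
    {"index": 2, "color": "green", "shape": "circle"},
]
scene_b = [
    {"index": 1, "color": "red", "shape": "circle"},
    {"index": 2, "color": "green", "shape": "square"},
]

same_features = feature_inventory(scene_a) == feature_inventory(scene_b)
same_files = object_files(scene_a) == object_files(scene_b)
```

Here `same_features` is true and `same_files` is false: a bare list of detected features cannot tell the two scenes apart, while per-object files preserve the conjunctions without recording where anything is.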

Attention spreads over perceived objects [Figure panel labels: “Spreads to B and not C”; “Spreads to C and not B”] Using a priming method, Egly, Driver & Rafal (1994) showed that the effect of a prime spreads to other parts of the same visual object more than to equally distant parts of different objects.

Being able to pick out and refer to individual distal elements is essential for encoding patterns • Encoding relational predicates; e.g., Collinear (x,y,z,..); Inside (x, C); Above (x,y); Square (w,x,y,z), requires simultaneously binding the arguments of n-place predicates to n elements in the visual scene • Evaluating such visual predicates requires individuating and referring to the objects over which the predicate is evaluated: i.e., the arguments in the predicate must be bound to individual elements in the scene.

Several objects must be picked out at once in making relational judgments When we judge that certain objects are collinear, we must first pick out the relevant objects while ignoring their properties

Several objects must be picked out at once in making relational judgments • The same is true for other relational judgments like inside or on-the-same-contour… etc. We must pick out the relevant individual objects first. Are dots Inside-same contour? On-same contour?
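A rough Python sketch of what binding the arguments of a predicate involves: the predicate below cannot be evaluated until each of its arguments has been bound to a particular element of the scene. The scene layout, the element names and the `collinear` helper are assumptions made for illustration only.

```python
def collinear(p, q, r, tol=1e-9):
    """Collinear(x, y, z): true when the three bound points lie on one line."""
    # The cross product of (q - p) and (r - p) vanishes for collinear points.
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return abs(cross) <= tol


# Elements of the scene; the keys stand in for indexes to individual elements.
scene = {"a": (0.0, 0.0), "b": (1.0, 1.0), "c": (2.0, 2.0), "d": (2.0, 0.0)}


def evaluate(predicate, scene, indexes):
    """Bind each argument of the predicate to the element picked out by an index."""
    return predicate(*(scene[i] for i in indexes))


abc = evaluate(collinear, scene, ("a", "b", "c"))
abd = evaluate(collinear, scene, ("a", "b", "d"))
```

With these coordinates `abc` is true and `abd` is false. The relevant step is `evaluate`: the same predicate yields different answers depending on which individuals its arguments are bound to, so the elements must be selected before the relation can be judged.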

A quick tour of some evidence for FINSTs • The correspondence problem • The binding problem • Evaluating multi-place visual predicates (recognizing multi-element patterns) • Operating over several visual elements at once without first having to search for them • Subitizing • Subset selection • Multiple-Object Tracking • Cognizing space without requiring a spatial display in the head

More functions of FINSTs: Further experimental explorations using different paradigms • Recognizing the cardinality of small sets of things: Subitizing vs counting (Trick, 1994) • Searching through subsets – selecting items to search through (Burkell, 1997) • Selecting subsets and maintaining the selection during a saccade (Currie, 2002) • Application of FINST index theory to infant cardinality studies (Carey, Spelke, Leslie, Uller, etc) • Indexes explain how children are able to acquire words for objects by ostension without suffering Quine’s Gavagai problem.

Another example of MOT: With self occlusion

Self occlusion does not seriously impair tracking

Some findings of Multiple Object Tracking • Basic finding: Most people can track at least 4 targets that move randomly among identical non-target objects (even 5-year-old children can track 3 objects) • Object properties do not appear to be recorded during tracking and tracking is not improved if all objects are visually distinct (no two objects have the same color, shape or size) • How is it done? • We showed that it is unlikely that the tracking is done by keeping a record of the targets’ locations and updating them by serially visiting the objects (Pylyshyn & Storm, 1988) • Other strategies may be employed (e.g., tracking a single deforming pattern), but they do not explain tracking • Hypothesis: FINST Indexes get assigned to targets. At the end of the trial these pointers can be used to move attention to the targets and hence to select them

What role do visual properties play in MOT? • Certain properties may have to be present in order for an object to be indexed, and certain properties (probably different properties) may be required in order for the index to keep track of the object, but this does not mean that such properties are encoded, stored, or used in tracking. • Compare this with Kripke’s distinction between properties that fix the referent of a proper name and the property that the name refers to. The former only plays a role at the name’s initial “baptism.” • Is there something special about location? Do we record and track properties-at-locations? • Location in time & space may be essential for individuating objects, but locations need not be encoded or made cognitively available • The fact that an object is actually at some location or other does not mean that it is represented as such. Representing property ‘P’ (where P happens to be at location L) ≠ Representing property ‘P-is-at-L’.
