Salmon-Alt & Romary on Reference Annotation

Fourth Workshop on Multimodal Semantic Representation, Tilburg, 10-11 Jan 2005

J.L. Borges quoting a (spurious) Chinese encyclopaedia
Animals are divided into (slightly abbreviated)
• those that belong to the emperor
• embalmed ones
• those that are trained
• stray dogs
• those that tremble as if they were mad
• those drawn with a very fine camelhair brush
• those that have just broken a flower vase
• those that from a long way off look like flies

How I understand Salmon-Alt & Romary
• Goal: establish a set of attributes + values for reference-related annotations that can help NLP
• Without such a goal, any classification would do:
  • classify NPs according to their number of characters
  • classify animals according to their number of legs
• But surely, these distinctions would be irrelevant to NLP …

… or would they?
• “The man was chasing Fido. His hind legs were hurting” [man has no hind legs]
• Johnny: `shit’. Mom: `That’s a four-letter word!’ [number of characters does count]
• Perhaps, ultimately, reference annotation is AI-complete
• The proposal is about which properties of markables matter systematically for NLP
• Similarly for links (= relations) between an NP and the surrounding text

We discuss
• Markables
• Links between markables

I. Markables
• What is a markable?
• Not necessarily a referring NP. E.g., `un village … son église’
• Does a markable even have to be an NP? No: an NP can refer back to something that’s not an NP:
  - `John hit Bill. He’ll regret having done this’
  - `Do not swallow. This is important / dangerous’
• MUC: annotators find it difficult to decide what’s a markable, e.g. in compound nouns (`corrugated steel plating’)

Markables (ctd.)
I’ll focus on one set of concepts, called “Referential Descriptors” (section 3.1.2). NB: These are properties of the referent!
• Cardinality (0,1,2,3,…?)
• Natural gender (male, female, …)
• Definiteness (identifiable, indefinite/generic term/nonspecific term)
• Information status (old, mediated, new)

Markables (ctd.)
• Cardinality (0,1,2,3,…): “specifies exact quantity”
  How about `Some people were sneezing’, `Most people were sneezing’, `The people were sneezing’?
• Natural gender (male, female, …)
• Definiteness (identifiable, indefinite/generic term/nonspecific term)
  Is this a property of the referent?
• Information status (old, mediated, new)
  Is this a property of the referent?
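The cardinality worry can be made concrete with a small sketch. The class and field names below are hypothetical (they are not taken from the paper), and reading the definiteness values as a flat list of four is an assumption. The point is only that an "exact quantity" slot has nothing to hold for `some people’.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gender(Enum):
    MALE = "male"
    FEMALE = "female"
    OTHER = "other"  # placeholder for the slide's "…"


class Definiteness(Enum):
    # Assumption: the slide's value list is read as four flat values.
    IDENTIFIABLE = "identifiable"
    INDEFINITE = "indefinite"
    GENERIC_TERM = "generic term"
    NONSPECIFIC_TERM = "nonspecific term"


class InformationStatus(Enum):
    OLD = "old"
    MEDIATED = "mediated"
    NEW = "new"


@dataclass
class Markable:
    """Hypothetical record: a text span plus the four referential descriptors."""
    text: str
    cardinality: Optional[int] = None  # exact quantity only; None if unknown
    gender: Optional[Gender] = None
    definiteness: Optional[Definiteness] = None
    information_status: Optional[InformationStatus] = None


def unfilled(markable: Markable) -> list:
    """Names of the descriptors the annotator could not fill in."""
    return [name for name, value in vars(markable).items() if value is None]


village = Markable("un village", cardinality=1,
                   definiteness=Definiteness.INDEFINITE,
                   information_status=InformationStatus.NEW)
some_people = Markable("some people",
                       definiteness=Definiteness.INDEFINITE,
                       information_status=InformationStatus.NEW)

print(unfilled(village))
print(unfilled(some_people))
```

With an integer-valued cardinality, `some people’ simply stays unannotated for that attribute; whether the scheme should allow ranges or vague quantities is exactly the open question.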

Markables (ctd.)
Why not …
• Animacy (treated as lexical, but `the dog’ is inanimate when referring to a sculpture)
• Count/mass (treated as lexical, but `the wine’ can refer to an object or a mass)
• Collective/distributive (treated as lexical, but consider `three men lifted a piano’)
• Abstract/concrete. Consider `The Decline and Fall of the Roman Empire’
• Human/nonhuman (`Fido bit Ben. He said ..’)
• Quantifier scope (`Every man loves a woman; she …’)

These are just examples
• The issue is: do we have a principled way of deciding which concepts belong in a categorisation of this kind?
• Presumably, `usefulness for NLP’ should somehow play a role

II. Links
• Van Deemter & Kibble 2000: existing annotation schemes confuse coreference, anaphora, and a function taking a value
• `The temperature is 90 degrees; it is going up’ ⇒ 90 is going up (MUC)
• Salmon-Alt & Romary: distinguish between Object Relations and Linguistic Relations
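The temperature example can be replayed in a few lines. This is a minimal sketch, assuming coreference chains are built as equivalence classes over links; the link labels `"value"` and `"identity"` are hypothetical names chosen for illustration, not MUC or Salmon-Alt & Romary terminology.

```python
def coreference_classes(links, merge_kinds):
    """Group markables into equivalence classes (union-find).

    links: iterable of (markable_a, markable_b, kind) triples.
    merge_kinds: set of link kinds that are treated as coreference.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, kind in links:
        root_a, root_b = find(a), find(b)  # registers both markables
        if kind in merge_kinds and root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for markable in list(parent):
        classes.setdefault(find(markable), set()).add(markable)
    return list(classes.values())


links = [
    ("the temperature", "90 degrees", "value"),  # function taking a value
    ("the temperature", "it", "identity"),       # anaphoric identity
]

# Every link treated as coreference: "90 degrees" and "it" end up in one
# class, which licenses the unwanted reading "90 is going up".
untyped = coreference_classes(links, {"value", "identity"})

# Only identity links merged: "90 degrees" stays in a class of its own.
typed = coreference_classes(links, {"identity"})

print(untyped)
print(typed)
```

Keeping the link kind is what blocks the bad inference; which kinds a scheme should distinguish is the substantive question.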

II. Links (ctd.)
Distinction looks sensible. But consider:
• `Tony Blair is the PM’
  Linguistic relation: predication
  Object relation: identity
• `Tony Blair was the PM’
  Linguistic relation: predication?
  Object relation: past identity??
• `Gordon Brown may be the (next) PM’
  Linguistic relation: predication??
  Object relation: possible (future) identity???

Under modality/tense/attitudes, simple concepts like identity start bifurcating.
• Should this be reflected in reference annotation schemes? (And if so, how?)
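One conceivable way to picture the "how" is sketched below. It is not a proposal from the paper: the `Link` class and its `qualifier` field are hypothetical, and the sketch deliberately keeps `identity` as a single object relation while moving tense and modality into a separate slot. The alternative would be to multiply relation values (`past identity`, `possible identity`, …).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    """Hypothetical link record with separate linguistic and object relations."""
    source: str
    target: str
    linguistic_relation: Optional[str]
    object_relation: str
    qualifier: Optional[str] = None  # hypothetical tense/modality slot


links = [
    Link("Tony Blair", "the PM", "predication", "identity"),
    Link("Tony Blair", "the PM", "predication", "identity",
         qualifier="past"),
    Link("Gordon Brown", "the (next) PM", "predication", "identity",
         qualifier="possible-future"),
]


def plain_identity(all_links):
    """Links that assert identity with no tense or modal qualification."""
    return [link for link in all_links
            if link.object_relation == "identity" and link.qualifier is None]


for link in plain_identity(links):
    print(link.source, "=", link.target)
```

Under this encoding only the first sentence licenses substituting one expression for the other; the qualified links have to be handled separately by whatever consumes the annotation.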

II. Links (ctd.)
• `The PM is not highly regarded in his party these days: Tony is George’s pony.’
  Linguistic relation: none (?)
  Object relation: identity
  (How wrong would it be to say that the linguistic relation is anaphora?)
Is a Linguistic Relation one that can be observed at string level?

Summing up
1. Is the notion of a markable clear enough to make sense to an annotator? Is the notion of a referring NP relevant here, and do we know what that is? E.g., in attitude contexts, it is not clear: `Jaime claims that the monster of Loch Ness is restless, and that it …’

Concluding questions (ctd.)
2. How about anaphora to VPs or sentences?
  - `John hit Bill. He’ll regret having done this!’
  - `Do not swallow. This is important / dangerous’
3. How about reference to text? E.g., `the concluding section of this paper’, or simply `this word’. (See the PhD thesis of I. Paraboni)
4. Is the notion of a `referential descriptor’ coherent? Do we have a principled way of deciding which referential descriptors are legitimate? (Similarly for other descriptors)

Concluding questions (ctd.)
5. Links: Identity is a key `object relation’. But how about past identity, possible identity, etc.? Compare the earlier remark about reference: everyday concepts become muddy in modal contexts!

Concluding questions (ctd.)
6. I take it that this is not an annotation scheme (cf. MUC, MATE), but a set of concepts that such a scheme could use (cf. Bunt & Romary 2004). What adaptations can a particular scheme make? E.g., are additions allowed?

Time to let others have their say!
