Evaluating Algorithms for GRE Kees van Deemter (work with Albert Gatt, Ielka van der Sluis, and Richard Power) University of Aberdeen, Scotland, UK
Outline • GRE: Generation of Referring Expressions • TUNA project: Corpus and Annotation • Evaluation of Algorithms • Furniture Domain • People Domain • [ Evaluation in the real world: STEC ]
TUNA project (ended Feb. 2007) • TUNA: Towards a UNified Algorithm for Generating Referring Expressions. • Extend coverage of GRE algorithms (plurals, negation, gradable properties,…) • Improve empirical foundations of GRE • Focus on • Content Determination • “First mention” NPs (no anaphora!)
Background • Dale and Reiter hypothesised that the Incremental Algorithm (IA) led to “better” output than other algorithms • “better”: more human-like • other algorithms: see below
Other GRE Algorithms • Full Brevity (FB; Dale 1989) • Generation of minimal descriptions • For example, by first trying all descriptions of length 1, then length 2, and so on. • Greedy Algorithm (GR; Dale 1989) • Always add property that removes the most distractors
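To make the contrast with the Incremental Algorithm concrete, here is a minimal Python sketch of content determination in GR and the IA. The example domain, the property representation, and the attribute names are invented for illustration; the published algorithms (Dale 1989; Dale & Reiter 1995) operate over richer knowledge bases, and the published IA also adds TYPE unconditionally (see the TYPE extension later in this talk).

```python
# Minimal sketch of content determination in the Greedy Algorithm (GR)
# and the Incremental Algorithm (IA). The domain and attribute names
# below are invented for illustration only.

# Each entity is described by a set of (attribute, value) properties.
DOMAIN = {
    "e1": {("type", "desk"), ("colour", "red"), ("size", "large")},
    "e2": {("type", "desk"), ("colour", "green"), ("size", "large")},
    "e3": {("type", "sofa"), ("colour", "red"), ("size", "small")},
}

def ruled_out(prop, entities):
    """Return the entities that do NOT have the property."""
    return {e for e in entities if prop not in DOMAIN[e]}

def greedy(referent, distractors):
    """GR: repeatedly add the property that removes the most distractors."""
    description, remaining = set(), set(distractors)
    candidates = set(DOMAIN[referent])
    while remaining and candidates:
        best = max(candidates, key=lambda p: len(ruled_out(p, remaining)))
        if not ruled_out(best, remaining):
            break  # no remaining property removes any distractor
        description.add(best)
        remaining -= ruled_out(best, remaining)
        candidates.discard(best)
    return description

def incremental(referent, distractors, preference_order):
    """IA: walk through attributes in a fixed preference order (PO) and
    add the referent's value for an attribute as soon as it removes at
    least one remaining distractor. (The published IA also adds TYPE
    even when it removes nothing; that step is omitted here.)"""
    description, remaining = set(), set(distractors)
    values = dict(DOMAIN[referent])  # attribute -> the referent's value
    for attr in preference_order:
        if attr not in values or not remaining:
            continue
        prop = (attr, values[attr])
        removed = ruled_out(prop, remaining)
        if removed:
            description.add(prop)
            remaining -= removed
    return description

# Referring to e2 with e1 and e3 as distractors:
print(greedy("e2", {"e1", "e3"}))                                   # {('colour', 'green')}
print(incremental("e2", {"e1", "e3"}, ["colour", "orientation", "size", "type"]))
```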
Elicitation experiment • Participants were told that we wanted to test an AI program that interprets referring expressions • Participants were shown a series of domains • Each domain included 1 or 2 target objects • Participants entered their descriptions, then the referents were removed • To make the interaction seem real, we sometimes removed the wrong object! (25% of trials) • The experiment was later repeated without this feature • Essentially the same outcomes were found • For generality: two types of domains (furniture, people)
Method (overview) • Experiment leads to transparent corpus of referring expressions: • referent and distractors are known • Domain attributes are known • Transparent corpora can be used for many purposes This talk: Compare some classic algorithms • giving each algorithm the same input as subjects • computing how similar algorithm’s output is to subjects’ output • We count semantic content only
Elicitation Experiment • Furniture (simple domain) • TYPE, COLOUR, SIZE, ORIENTATION • e.g. "the green desk facing backwards", "the sofa and the desk which are red", "the chair in the top right" • People (complex domain) • Nine annotated properties in total • e.g. "the young man with a white shirt", "the man with the funny haircut", "the man on the left" • Location: • Vertical location (Y-DIMENSION) • Horizontal location (X-DIMENSION)
Corpus setup • Each corpus was carefully balanced, e.g. between singulars and plurals. • Between-subjects design: -Location: Subjects discouraged from using locative expressions. +Location: Subjects not discouraged. -FaultCritical: Subjects could correct their utterances +FaultCritical: Subjects could not correct their utterances • After discounting outliers and (self-reported) non-fluent speakers, 45 subjects were left
Experiment design: Furniture (-Location) • 18 trials (C=colour, O=orientation, S=size) • 1 referent: minimal identification uses {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials] • 2 "similar" referents: {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials] • 2 "dissimilar" referents: {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]
Other evaluation studies • Limitations of earlier evaluation studies (listed on the next slide): • Limited numbers of subjects/referents • Few attempts at balancing the corpus • IA: no teasing apart of preference orders • NB: Some of these studies were more ambitious in some respects, looking at context and going beyond identification
Other evaluation studies • Jordan 2000, Jordan & Walker 2005 • More than just identification (Jordan 2000) • Siddharthan & Copestake 2004 • References in linguistic context • Gupta & Stent 2005 • Realisation mixed with Content Determination • Viethen & Dale 2006 • Only Colour and Location
Extensions to the classics • Plurality: (van Deemter 2002) • Extend each algorithm to search through disjunctions of increasing length • Location: (van Deemter 2006) • Locatives treated as gradable: “the leftmost table/person” • E.g., suppose the referent x is located in column 3 => “x is left of column 4”, “x is left of column 5” … => “x is right of column 2”, “x is right of column 1”… • Type: • People tend to use TYPE (Dale & Reiter 1995) • Here: All algorithms added TYPE.
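As a rough, hypothetical illustration of the locative treatment sketched above, the snippet below expands an object's column coordinate into the gradable "left of / right of" properties; the grid width is an assumption made purely for this example, not part of the TUNA setup.

```python
# Rough sketch of the gradable treatment of Location: an object's column
# coordinate is expanded into "left of / right of" properties relative to
# every other column. The grid width (5 columns) is an assumption made
# purely for this example.

NUM_COLUMNS = 5

def x_dimension_properties(column):
    """Return the x-dimension properties of an object in the given column."""
    props = set()
    for other in range(1, NUM_COLUMNS + 1):
        if column < other:
            props.add(("x-dimension", f"left of column {other}"))
        elif column > other:
            props.add(("x-dimension", f"right of column {other}"))
    return props

# An object in column 3 is left of columns 4 and 5, and right of columns 1 and 2:
print(sorted(x_dimension_properties(3)))
```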
Evaluation aims • Hypothesis in Dale & Reiter 1995: • IA resembles human output most • Our main questions: • Is this true? • How important are the IA's parameters (preference orders, POs)? • More generally: • assess 'quality' of classic GRE algorithms: • calculate the average match between the description generated by an algorithm and the descriptions produced by people (for the same referent)
Evaluation metric • Dice Coefficient: Dice = 2 × |common properties| / |total properties in the two descriptions| • corpus: {A,B,C}, algorithm: {B,C} → Dice = (2×2)/5 = 4/5 • corpus: {A,B,C}, algorithm: {A,B,C,D} → Dice = (2×3)/7 = 6/7
Evaluation metric • A Dice coefficient of 1 indicates identical sets; 0 means the two sets have no properties in common • We also used Dice to measure agreement between the annotators of the corpus
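A direct implementation of the metric, checked against the two worked examples above (the property labels A-D are placeholders):

```python
def dice(set1, set2):
    """Dice coefficient: 2 * |common properties| / (|set1| + |set2|)."""
    if not set1 and not set2:
        return 1.0  # treat two empty descriptions as identical
    return 2 * len(set1 & set2) / (len(set1) + len(set2))

# The two worked examples from the previous slide:
assert dice({"A", "B", "C"}, {"B", "C"}) == 4 / 5
assert dice({"A", "B", "C"}, {"A", "B", "C", "D"}) == 6 / 7

# In the evaluation, each algorithm's description is scored against every
# human description of the same referent, and the scores are averaged.
```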
Assumptions behind DICE • The discriminatory power of a description does not matter • All properties are equidistant See Gatt & Van Deemter 2007, “Content Determination in GRE: evaluating the evaluator”
Evaluation (I): Furniture • Which preference orders for the IA? • Psycholinguistic evidence: • COLOUR >> {ORIENTATION, SIZE} (Pechmann 89; Eikmeyer & Ahlsen 96; Belke & Meyer 02) • Y-DIMENSION >> X-DIMENSION (Bryant et al. 1992; Arts 2004) • Split data: +LOCATION vs -LOCATION • This talk: focus on -LOCATION (approx. 800 descriptions) • Compare algorithms to a randomized IA (RAND)
[Results chart: Furniture domain (-LOCATION); FB/GR shown alongside the IA variants; significant differences marked]
Beyond Toy Domains • More on the Furniture corpus: Gatt et al. (ENLG-2007) • With complex real-world objects: • Many different attributes can be used • The number of POs explodes • Few psycholinguistic precedents • People domain attributes: • { hasBeard, hasGlasses, age, hasTie, hasSuit, hasShirt, hasHair, hairColour, orientation } • 9 attributes, so 9! = 362880 possible POs
IA: Preference Orders for the People Domain • Little psycholinguistic evidence for choosing between all 362880 possible POs • Focus on the most frequent attributes: G=hasGlasses, B=hasBeard, H=hasHair, C=hairColour • Assumption: H and B must precede C • This leaves us with eight POs: { GBHC, GHBC, HBGC, HBCG, HGBC, BHGC, BHCG, BGHC }
Preference Orders and frequency • For attributes other than {G,C,H,B}, we let corpus frequency determine the order • E.g., IA-GBHC uses <type, G, B, H, C, age, hasTie, hasSuit, hasShirt> as its PO
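A small sketch of how the eight candidate POs can be enumerated under the "H and B before C" constraint and then completed with frequency-ranked attributes. The frequency counts below are invented placeholders, not the actual TUNA counts.

```python
from itertools import permutations

# Enumerate the candidate orderings of the four most frequent attributes,
# requiring that hasHair (H) and hasBeard (B) precede hairColour (C).
core = ["G", "B", "H", "C"]  # G=hasGlasses, B=hasBeard, H=hasHair, C=hairColour
candidate_pos = [p for p in permutations(core)
                 if p.index("B") < p.index("C") and p.index("H") < p.index("C")]
assert len(candidate_pos) == 8  # the eight POs listed on the previous slide

# The remaining attributes are ordered by corpus frequency; the counts
# here are invented placeholders.
frequency = {"age": 120, "hasTie": 90, "hasSuit": 60, "hasShirt": 40}
tail = sorted(frequency, key=frequency.get, reverse=True)

# e.g. the full PO used by IA-GBHC:
ia_gbhc = ["type", "G", "B", "H", "C"] + tail
print(ia_gbhc)  # ['type', 'G', 'B', 'H', 'C', 'age', 'hasTie', 'hasSuit', 'hasShirt']
```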
[Results chart: People domain; IA-BASE and GR shown alongside the IA variants; differences marked as significant by subjects]
Results: People domain • IA-BASE performs very badly here • So much for the best IAs, i.e. those whose POs start with {B,H,G,C} and end with <age, hasTie, hasSuit, hasShirt> • Some of these eight POs did much worse: • IA-BHCG had DICE = 0.6, making it significantly worse (by subjects) than GR!
Summary • People domain gives much lower DICE scores than Furniture domain • Difference between “good” and “bad” POs was • small (but significant) in the Furniture domain, • big (and significant) in the People domain
Summary • The “Incremental Algorithm” (IA): • not an algorithm but a class of algorithms • The best IA beats all other algorithms, but the worst is very bad ... • GR performs remarkably well. • How to choose a suitable PO? • Furniture: few attributes; psycholinguistic precedent • Still, there is variation. • People: more attributes; no precedents • Variation even greater!
Discussion • Suppose you want to build a GRE algorithm for a new and complex domain, for which no transparent corpus is available • Psycholinguistic principles are unlikely to help you much • If the corpus is also not balanced, then frequency may not tell you much either…
Other uses of this method: STEC • Summer 2007: first NLG Shared Task Evaluation Challenge (STEC) • This STEC involved GRE only, focussing on Content Determination • 22 GRE algorithms were submitted and evaluated (from 6 teams) • Reported at the UCNLG+MT workshop, Copenhagen, Sept 2007
Other uses of this corpus: STEC • An even bigger STEC followed one year later • Each algorithm was compared with the TUNA corpus (minus a 40% training set) • Both the Furniture and the People domain • DICE measured "humanlikeness" • Singulars only • Each algorithm was also tested in terms of identification time (by human readers)
Some STEC results • 1. The more minimal the descriptions generated by these 22 systems were, the worse their DICE scores were
2. No relation between humanlikeness and identification time • Best system in terms of DICE was worst-but-one in terms of identification time • More research needed on the different criteria for judging NLG output
Annotator agreement • Semantic markup was applied manually to all descriptions in the corpus. • 2 annotators were given a stratified random sample • Comparison used Dice.