The GDEX project introduces an innovative method for generating high-quality dictionary examples from a corpus, allowing lexicographers to streamline their work. By automating the selection of examples through a scoring system based on readability and contextual relevance, GDEX significantly reduces the time spent on manual editing. The approach leverages large, modern corpora such as UKWaC to ensure that examples are current and representative, ultimately enhancing the learning experience for EFL users. This project aims to bridge the gap between dictionaries and corpora, facilitating better language education.
GDEX: Automatically finding good dictionary examples in a corpus Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý • Lexical Computing Ltd, UK • Masaryk University, Czech Rep • A&C Black Publishers Ltd., UK • Macmillan Education, UK • Lexicography MasterClass Ltd., UK
Users appreciate examples • Paper: space constraints • Electronic: no space constraints • Give lots of examples • Constraint: cost of selection, editing
Project • Macmillan English dictionary • Licensing arrangement with A&C Black • Already had 1000 collocation boxes • See collocationality paper, ELX 2006 • Average 8 per box • New electronic version • All 8000 collocations need examples • Authentic; from corpus
Old method • Lexicographer • Gets concordance for collocation • Reads through until they find a good example • Cut, paste, edit
New method • Lexicographer • Gets sorted concordance • 20 best examples in spreadsheet • Less reading through • Tick the first good one, edit
What makes a good example? • Readable • EFL users • Informative • Typical, for the collocation • Gives context which helps user understand the target word/phrase
Readability • 70 years research • Not just (or mainly) EFL • Educational theory • Teaching children to read • Instruction manuals • Publishing
Readability tests • Flesch Reading Ease test (1948) • Average sentence length, average word length • In some word processing software • Many similar measures • Recent work • Language modelling from training data • Target levels • US grades • Common European Framework
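The Flesch Reading Ease score mentioned above combines average sentence length (words per sentence) with average word length (syllables per word): 206.835 - 1.015 * ASL - 84.6 * ASW. A minimal sketch in Python; the syllable counter is a rough approximation added for illustration, not part of the original work:

import re

def flesch_reading_ease(text):
    """Flesch Reading Ease (1948): higher scores mean easier text."""
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    def count_syllables(word):
        # crude approximation: count vowel groups, minimum of one per word
        return max(1, len(re.findall(r'[aeiouy]+', word.lower())))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))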
GDEX • Get concordance for collocation • For each sentence • Score it • Sort • Show best ones
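The pipeline on this slide is simple to express in code. A minimal sketch, assuming a score_sentence() helper like the one outlined after the weighting slide below; the function names are illustrative, not the actual GDEX implementation:

def best_examples(concordance_sentences, n=20):
    """Score every candidate sentence and return the n highest-scoring ones."""
    scored = [(score_sentence(s), s) for s in concordance_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:n]]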
GDEX heuristics • Sentence length (10-26 words) • Mostly common words: good • Rare words: bad • Sentences • Start with capital, end with one of .!? • No [, ], <, >, http, \ • Penalise: • Other punctuation, numbers • More than 2 or 3 capitals • Typicality: third collocate is a plus
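The heuristics above can be sketched as a set of per-sentence feature scores. This is an illustrative reconstruction from the slide, not the published GDEX code; the common-word list, the individual score values and any threshold beyond the 10-26 word range are assumptions:

import re

COMMON_WORDS = set()  # in practice, loaded from a frequency list for the corpus

def heuristic_scores(sentence, collocates=()):
    """Return a dict of per-heuristic scores (positive = good, negative = penalty)."""
    words = sentence.split()
    scores = {}
    # Sentence length: 10-26 words is ideal
    scores['length'] = 1 if 10 <= len(words) <= 26 else 0
    # Mostly common words: good; rare words: bad
    common = sum(1 for w in words if w.lower().strip('.,!?') in COMMON_WORDS)
    scores['common_words'] = common / len(words) if words else 0
    # Whole sentence: starts with a capital, ends with . ! or ?
    scores['whole_sentence'] = 1 if (sentence[:1].isupper() and sentence.rstrip()[-1:] in '.!?') else 0
    # Blacklisted material: [, ], <, >, http, backslash
    scores['blacklist'] = 0 if re.search(r'[\[\]<>\\]|http', sentence) else 1
    # Penalties: other punctuation, numbers, more than a few capitalised tokens
    scores['punct_penalty'] = -len(re.findall(r'[;:()"#@*]', sentence))
    scores['number_penalty'] = -len(re.findall(r'\d+', sentence))
    capitals = sum(1 for w in words if w[:1].isupper())
    scores['capital_penalty'] = -max(0, capitals - 3)
    # Typicality: presence of a further collocate is a plus
    scores['third_collocate'] = 1 if any(c in words for c in collocates) else 0
    return scores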
Weighting • For each sentence • Score on each heuristic • Weight the scores • Add the weighted scores together • How to set weights?
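Combining the heuristic scores into a single sentence score is then a weighted sum. A minimal sketch continuing the previous one; the weight values shown are placeholders, not the weights actually used in GDEX:

# Placeholder weights; in GDEX these were tuned to mimic human judgements
WEIGHTS = {
    'length': 1.0,
    'common_words': 2.0,
    'whole_sentence': 1.0,
    'blacklist': 2.0,
    'punct_penalty': 0.5,
    'number_penalty': 0.5,
    'capital_penalty': 0.5,
    'third_collocate': 1.5,
}

def score_sentence(sentence, collocates=()):
    """Weighted sum of the per-heuristic scores."""
    scores = heuristic_scores(sentence, collocates)
    return sum(WEIGHTS[name] * value for name, value in scores.items())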
Machine learning • Two students: • Manually judged 1000 “good examples” • Weights set to mimic the students’ choices
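The slide does not say which method was used to fit the weights; one simple way to mimic binary human judgements would be logistic regression over the heuristic scores, sketched below purely for illustration (scikit-learn is an assumption, not something named in the original):

from sklearn.linear_model import LogisticRegression

def fit_weights(sentences, labels, collocates=()):
    """Fit weights so the combined score mimics human good/bad judgements.
    labels: 1 for sentences judged good examples, 0 otherwise."""
    feature_names = sorted(heuristic_scores(sentences[0], collocates))
    X = [[heuristic_scores(s, collocates)[f] for f in feature_names] for s in sentences]
    model = LogisticRegression().fit(X, labels)
    return dict(zip(feature_names, model.coef_[0]))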
Was it successful? • Did it save lexicographer time? • Definitely (says project manager) • Corpus choice • Started with BNC but • Too old • Not enough examples • If no good examples in corpus, GDEX can’t help • Changed to UKWaC • 20 times bigger; from web; contemporary • Better • Most web junk filtered out • Usually a good example in top twenty
GDEX and TALC • TALC • Teaching and Language Corpora • Goal: bring corpora into language teaching • Usual problem • Concordances are tough for learners to read • Way forward • GDEX examples • Halfway between dictionary and corpus
GDEX: Models for use • More examples for dictionaries • Speed up, as with MED or • Fully automatic “more examples” • Corpus query tool • Sort concordances, best first • Now an option in the Sketch Engine • Automatic collocations dictionary • http://forbetterenglish.com