The GDEX project introduces an innovative method for generating high-quality dictionary examples from a corpus, allowing lexicographers to streamline their work. By automating the selection of examples through a scoring system based on readability and contextual relevance, GDEX significantly reduces the time spent on manual editing. The approach leverages large, modern corpora such as UKWaC to ensure that examples are current and representative, ultimately enhancing the learning experience for EFL users. This project aims to bridge the gap between dictionaries and corpora, facilitating better language education.
GDEX: Automatically finding good dictionary examples in a corpus Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý • Lexical Computing Ltd, UK • Masaryk University, Czech Rep • A&C Black Publishers Ltd., UK • Macmillan Education, UK • Lexicography MasterClass Ltd., UK
Users appreciate examples • Paper: space constraints • Electronic: no space constraints • Give lots of examples • Constraint: cost of selection, editing
Project • Macmillan English dictionary • Licensing arrangement with A&C Black • Already had 1000 collocation boxes • See collocationality paper, ELX 2006 • Average 8 per box • New electronic version • All 8000 collocations need examples • Authentic; from corpus
Old method • Lexicographer • Gets concordance for collocation • Reads through until they find a good example • Cut, paste, edit
New method • Lexicographer • Gets sorted concordance • 20 best examples in spreadsheet • Less reading through • Tick the first good one, edit
What makes a good example? • Readable • EFL users • Informative • Typical, for the collocation • Gives context which helps user understand the target word/phrase
Readability • 70 years research • Not just (or mainly) EFL • Educational theory • Teaching children to read • Instruction manuals • Publishing
Readability tests • Flesch Reading Ease test (1948) • Average sentence length, average word length • In some word processing software • Many similar measures • Recent work • Language modelling from training data • Target levels • US grades • Common European Framework
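The Flesch Reading Ease score mentioned above combines average sentence length (words per sentence) with average word length (syllables per word): 206.835 - 1.015 * ASL - 84.6 * ASW. A minimal sketch in Python; the syllable counter is a rough approximation added for illustration, not part of the original work:

import re

def flesch_reading_ease(text):
    """Flesch Reading Ease (1948): higher scores mean easier text."""
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    def count_syllables(word):
        # crude approximation: count vowel groups, minimum of one per word
        return max(1, len(re.findall(r'[aeiouy]+', word.lower())))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))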
GDEX • Get concordance for collocation • For each sentence • Score it • Sort • Show best ones
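The pipeline on this slide is simple to express in code. A minimal sketch, assuming a score_sentence() helper like the one outlined after the weighting slide below; the function names are illustrative, not the actual GDEX implementation:

def best_examples(concordance_sentences, n=20):
    """Score every candidate sentence and return the n highest-scoring ones."""
    scored = [(score_sentence(s), s) for s in concordance_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:n]]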
GDEX heuristics • Sentence length (10-26 words) • Mostly common words: good • Rare words: bad • Sentences • Start with capital, end with one of .!? • No [, ], <, >, http, \ • Penalise: • Other punctuation, numbers • More than 2 or 3 capitals • Typicality: third collocate is a plus
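The heuristics above can be sketched as a set of per-sentence feature scores. This is an illustrative reconstruction from the slide, not the published GDEX code; the common-word list, the individual score values and any threshold beyond the 10-26 word range are assumptions:

import re

COMMON_WORDS = set()  # in practice, loaded from a frequency list for the corpus

def heuristic_scores(sentence, collocates=()):
    """Return a dict of per-heuristic scores (positive = good, negative = penalty)."""
    words = sentence.split()
    scores = {}
    # Sentence length: 10-26 words is ideal
    scores['length'] = 1 if 10 <= len(words) <= 26 else 0
    # Mostly common words: good; rare words: bad
    common = sum(1 for w in words if w.lower().strip('.,!?') in COMMON_WORDS)
    scores['common_words'] = common / len(words) if words else 0
    # Whole sentence: starts with a capital, ends with . ! or ?
    scores['whole_sentence'] = 1 if (sentence[:1].isupper() and sentence.rstrip()[-1:] in '.!?') else 0
    # Blacklisted material: [, ], <, >, http, backslash
    scores['blacklist'] = 0 if re.search(r'[\[\]<>\\]|http', sentence) else 1
    # Penalties: other punctuation, numbers, more than a few capitalised tokens
    scores['punct_penalty'] = -len(re.findall(r'[;:()"#@*]', sentence))
    scores['number_penalty'] = -len(re.findall(r'\d+', sentence))
    capitals = sum(1 for w in words if w[:1].isupper())
    scores['capital_penalty'] = -max(0, capitals - 3)
    # Typicality: presence of a further collocate is a plus
    scores['third_collocate'] = 1 if any(c in words for c in collocates) else 0
    return scores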
Weighting • For each sentence • Score on each heuristic • Weight the scores • Add the weighted scores together • How to set weights?
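Combining the heuristic scores into a single sentence score is then a weighted sum. A minimal sketch continuing the previous one; the weight values shown are placeholders, not the weights actually used in GDEX:

# Placeholder weights; in GDEX these were tuned to mimic human judgements
WEIGHTS = {
    'length': 1.0,
    'common_words': 2.0,
    'whole_sentence': 1.0,
    'blacklist': 2.0,
    'punct_penalty': 0.5,
    'number_penalty': 0.5,
    'capital_penalty': 0.5,
    'third_collocate': 1.5,
}

def score_sentence(sentence, collocates=()):
    """Weighted sum of the per-heuristic scores."""
    scores = heuristic_scores(sentence, collocates)
    return sum(WEIGHTS[name] * value for name, value in scores.items())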
Machine learning • Two students: • Manually judged 1000 “good examples” • Weights set to mimic the students’ choices
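The slide does not say which method was used to fit the weights; one simple way to mimic binary human judgements would be logistic regression over the heuristic scores, sketched below purely for illustration (scikit-learn is an assumption, not something named in the original):

from sklearn.linear_model import LogisticRegression

def fit_weights(sentences, labels, collocates=()):
    """Fit weights so the combined score mimics human good/bad judgements.
    labels: 1 for sentences judged good examples, 0 otherwise."""
    feature_names = sorted(heuristic_scores(sentences[0], collocates))
    X = [[heuristic_scores(s, collocates)[f] for f in feature_names] for s in sentences]
    model = LogisticRegression().fit(X, labels)
    return dict(zip(feature_names, model.coef_[0]))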
Was it successful? • Did it save lexicographer time? • Definitely (says project manager) • Corpus choice • Started with BNC but • Too old • Not enough examples • If no good examples in corpus, GDEX can’t help • Changed to UKWaC • 20 times bigger; from web; contemporary • Better • Most web junk filtered out • Usually a good example in top twenty
GDEX and TALC • TALC • Teaching and Language Corpora • Goal: bring corpora into language teaching • Usual problem • Concordances are tough for learners to read • Way forward • GDEX examples • Halfway between dictionary and corpus
GDEX: Models for use • More examples for dictionaries • Speed up, as with MED or • Fully automatic “more examples” • Corpus query tool • Sort concordances, best first • Now an option in the Sketch Engine • Automatic collocations dictionary • http://forbetterenglish.com