
Corpus Evaluation

Adam Kilgarriff, Lexical Computing Ltd.


Presentation Transcript


  1. Corpus Evaluation
     Adam Kilgarriff, Lexical Computing Ltd

  2. Corpus evaluation
     • Now
       • Corpora to spec
       • Choice
       • Need to evaluate
     • Then
       • Very few corpora
       • Use what’s there

  3. Corpus evaluation
     • Intrinsic
       • See what it looks like
     • Extrinsic
       • Embed in a task
       • How well do you do at the task?
     • Better
       • It all depends what you want it for

  4. It all depends what you want it for, but:
     • ‘General English (/French/Chinese/…)’
       • Many purposes
       • Not specialist sublanguage
     • A decent construct?
       • Not sure, but it has form
     • General language dictionaries
       • “How good is a corpus for making them?”

  5. General truths
     • Duplicates: bad
     • Noise: bad
     • Big: good
     • Diverse (good coverage of varieties within research scope, not dominated by any one variety): good

  6. Word sketch: a corpus-derived one-page summary of a word’s grammatical and collocational behaviour

  7. Macmillan English Dictionary for Advanced Learners (ed. Rundell, 2002)

  8. Corpus evaluation
     • 11 years (1999-2010)
     • Feedback
       • Good but anecdotal
     • Formal evaluation

  9. Goal
     • Collocations dictionary
       • Model: Oxford Collocations Dictionary
       • Publication-quality
     • Ask a lexicographer
       • For 42 headwords
       • For the 20 best collocates per headword
       • “Should we include this collocation in a published dictionary?”

  10. Sample of headwords (nouns, verbs, adjectives; random)
      • High (top 3000)
        • N: space, solution, opinion, mass, corporation, leader
        • V: serve, incorporate, mix, desire
        • Adj: high, detailed, open, academic
      • Mid (3000-9999)
        • N: cattle, repayment, fundraising, elder, biologist, sanitation
        • V: grieve, classify, ascertain, implant
        • Adj: adjacent, eldest, prolific, ill
      • Low (10,000-30,000)
        • N: predicament, adulterer, bake, bombshell, candy, shellfish
        • V: slap, outgrow, plow, traipse
        • Adj: neoclassical, votive, adulterous, expandable

  11. Precision and recall
      • We tested precision
      • Recall is harder
        • How do we find all the collocations that the system should have found?
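
To make the distinction on this slide concrete, here is a minimal Python sketch of the two measures for a single headword. The function names and the toy collocate lists are illustrative assumptions, not taken from the talk. Precision only needs judgements on what the system proposed; recall additionally needs a gold list of everything it should have proposed, which is why it is the harder one to measure.

```python
def precision(candidates, accepted):
    """Share of the system's proposed collocates that a lexicographer accepted."""
    if not candidates:
        return 0.0
    return sum(1 for c in candidates if c in accepted) / len(candidates)


def recall(candidates, gold):
    """Share of the gold-standard collocates that the system proposed."""
    if not gold:
        return 0.0
    return len(set(candidates) & set(gold)) / len(gold)


# Hypothetical data for one headword (not from the study).
proposed = ["strong", "public", "expert", "blue"]
accepted = {"strong", "public", "expert"}
gold = {"strong", "public", "expert", "popular", "dissenting", "legal"}

print(precision(proposed, accepted))  # 0.75
print(recall(proposed, gold))         # 0.5
```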

  12. Four languages, three families
      • Dutch: ANW, 102m-word lexicographic corpus
      • English: UKWaC, 1.5b-word web corpus
      • Japanese: JpWaC, 400m-word web corpus
      • Slovene: FidaPlus, 620m-word lexicographic corpus

  13. User evaluation
      • Evaluate whole system
      • Will it help with my task?
        • E.g. preparing a collocations dictionary
      • Contrast: developer evaluation
        • Can I make the system better?
        • Evaluate each module separately
        • Current work

  14. Components
      • Corpus
      • NLP tools
        • Segmenter, lemmatiser, POS-tagger
      • Sketch grammar
      • Statistics

  15. Practicalities
      • Interface
        • Good, Good-but: merge to good
        • Maybe, Maybe-specialised, Bad: merge to bad
      • For each language
        • Two or three linguists/lexicographers
        • If they disagree, don't use for computing performance
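
The scoring procedure on this slide could be sketched roughly as follows. The five labels and the merge rule come from the slide; the data layout, the function name, and the choice to test for disagreement after merging (rather than on the raw labels) are assumptions made for illustration.

```python
# Merge rule from the slide: five interface labels collapse to two classes.
MERGE = {
    "good": "good",
    "good-but": "good",
    "maybe": "bad",
    "maybe-specialised": "bad",
    "bad": "bad",
}


def precision_from_judgements(judgements):
    """judgements maps (headword, collocate) to a list of raw labels, one per judge.

    Items on which the judges disagree after merging are left out of the
    calculation (assumption: disagreement is checked on the merged labels).
    Returns None when no item has an agreed judgement.
    """
    agreed = 0
    good = 0
    for labels in judgements.values():
        merged = {MERGE[label.lower()] for label in labels}
        if len(merged) != 1:
            continue  # judges disagree: not used for computing performance
        agreed += 1
        if merged == {"good"}:
            good += 1
    return good / agreed if agreed else None


# Hypothetical judgements from two judges.
judgements = {
    ("opinion", "public"): ["Good", "Good-but"],        # agreed: good
    ("opinion", "strong"): ["Good", "Good"],            # agreed: good
    ("opinion", "blue"): ["Bad", "Maybe"],              # agreed: bad
    ("opinion", "poll"): ["Good", "Maybe-specialised"], # disagreement: skipped
}
score = precision_from_judgements(judgements)
print(score)
```

With these toy judgements, two of the three agreed items are good, so the score is two thirds.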

  16. Results
      • Dutch: 66%
      • English: 71%
      • Japanese: 87%
      • Slovene: 71%
      • Two thirds of a collocations dictionary can be gathered automatically

  17. <world, final> problem
      • Is it good?
        • Superficially no
        • Look at concordances: “World Cup finals”
      • Solution
        • ‘Commonest string’
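
The slide names the ‘commonest string’ solution without spelling it out. One plausible reading, sketched below under that assumption, is to show the judge the most frequent surface string that spans the headword and the collocate in the concordance lines, so that <world, final> is presented as “world cup finals”. The input format (token list plus the two token positions) and the function name are hypothetical.

```python
from collections import Counter


def commonest_string(hits):
    """Return (string, count) for the most frequent span covering both words.

    hits is an iterable of (tokens, i, j): the tokens of one concordance line
    and the positions of the headword and the collocate within it.
    Returns None when there are no hits.
    """
    spans = Counter()
    for tokens, i, j in hits:
        lo, hi = sorted((i, j))
        spans[" ".join(tokens[lo:hi + 1]).lower()] += 1
    if not spans:
        return None
    return spans.most_common(1)[0]


# Hypothetical concordance lines for the pair <world, final>.
hits = [
    ("the World Cup finals in France".split(), 1, 3),
    ("reached the World Cup finals".split(), 2, 4),
    ("the world snooker final".split(), 1, 3),
]
print(commonest_string(hits))  # ('world cup finals', 2)
```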

  18. Next step
      • Recall
        • 200 collocates per headword
        • Selected from
          • All the corpora we have
          • Various parameter settings
        • Plus just-in-time evaluation for ‘new’ collocates
      • Then
        • For a sample of headwords
        • These are the collocations we should get
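
A rough sketch of how such a pool might be assembled for one headword. The slide does not say how the 200 collocates are selected, so the round-robin strategy here is an assumption, as are the function names and the toy data. `unjudged` illustrates the just-in-time step: anything a later run proposes that is not yet in the judged set has to be sent to a judge before that run can be scored.

```python
def build_pool(runs, size=200):
    """Pool candidates from several ranked lists (one per corpus/parameter setting).

    Takes the rank-1 item of every run, then the rank-2 items, and so on,
    skipping repeats, until the pool holds `size` distinct collocates.
    """
    pool = []
    seen = set()
    longest = max((len(ranked) for ranked in runs), default=0)
    depth = 0
    while len(pool) < size and depth < longest:
        for ranked in runs:
            if depth < len(ranked) and ranked[depth] not in seen:
                seen.add(ranked[depth])
                pool.append(ranked[depth])
                if len(pool) == size:
                    break
        depth += 1
    return pool


def unjudged(candidates, judged):
    """Collocates proposed by a new run that no judge has seen yet."""
    return [c for c in candidates if c not in judged]


# Hypothetical ranked output from two runs for one headword.
runs = [["public", "strong", "expert"], ["public", "legal", "poll"]]
pool = build_pool(runs, size=4)
print(pool)  # ['public', 'strong', 'legal', 'expert']
print(unjudged(["public", "dissenting"], set(pool)))  # ['dissenting']
```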

  19. From sketches to corpora
      • Hold other inputs constant
        • Just one varies
        • Evaluate that one
      • Hold tools, stats, grammar constant
        • Evaluate corpora
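
The design on this slide, one fixed pipeline with only the corpus varying, could look like this in miniature. Everything here is a toy stand-in: `top_collocates` (most frequent left neighbours of the headword) replaces the real tools, sketch grammar and statistics, and the two tiny corpora and the accepted set are invented for illustration.

```python
from collections import Counter


def top_collocates(sentences, headword, k=3):
    """Fixed toy pipeline: the k most frequent words directly before the headword."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, cur in zip(tokens, tokens[1:]):
            if cur == headword:
                counts[prev] += 1
    return [word for word, _ in counts.most_common(k)]


def precision(candidates, accepted):
    """Share of proposed collocates that were accepted."""
    if not candidates:
        return 0.0
    return sum(1 for c in candidates if c in accepted) / len(candidates)


def compare_corpora(corpora, headword, accepted, k=3):
    """Run the same pipeline over each corpus; only the corpus changes."""
    return {
        name: precision(top_collocates(sentences, headword, k), accepted)
        for name, sentences in corpora.items()
    }


# Hypothetical corpora: A is clean text, B is dominated by web noise.
corpora = {
    "A": ["public opinion matters", "a strong opinion", "public opinion shifted"],
    "B": ["click opinion here", "click opinion now", "my opinion"],
}
results = compare_corpora(corpora, "opinion", accepted={"public", "strong"})
print(results)  # {'A': 1.0, 'B': 0.0}
```

Because the pipeline and the judgements are identical across runs, any difference in score can be attributed to the corpus.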

  20. Criteria
      • Duplicates: bad
      • Noise: bad
      • Big: good
      • Diverse (good coverage of varieties within research scope, not dominated by any one variety): good
      • We think so

  21. Over next year
      • Build test sets
      • Textbook cases
        • English
          • BNC vs UKWaC vs OEC vs Gigaword
        • Dutch
          • ANW corpus vs web corpus
      • Web crawling, deduplication
        • Which parameters give best results?
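
As one example of a corpus property that crawling and deduplication parameters affect, the sketch below measures the share of paragraphs that exactly repeat an earlier paragraph. This is an illustrative assumption rather than the procedure used in the project: it catches only exact repeats after lowercasing and whitespace normalisation, not near-duplicates.

```python
import hashlib


def duplicate_rate(paragraphs):
    """Fraction of non-empty paragraphs that repeat an earlier paragraph.

    Paragraphs are compared after lowercasing and collapsing whitespace.
    """
    seen = set()
    total = 0
    duplicates = 0
    for paragraph in paragraphs:
        normalised = " ".join(paragraph.lower().split())
        if not normalised:
            continue  # ignore empty paragraphs
        total += 1
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / total if total else 0.0


# Hypothetical crawl output: the second paragraph repeats the first.
paragraphs = ["Hello  world", "hello world", "Other text", ""]
rate = duplicate_rate(paragraphs)
print(rate)
```

Here one of the three non-empty paragraphs is a repeat, so the rate is one third.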

  22. Thank you
      http://www.sketchengine.co.uk
