
Infrastructures and Evaluation

Infrastructures and Evaluation. Donna Harman National Institute of Standards and Technology Gaithersburg, Maryland http://trec.nist.gov. TREC Tasks. Workshop on Cross-Linguistic Information Retrieval, SIGIR 1996.


Presentation Transcript


  1. Infrastructures and Evaluation Donna Harman National Institute of Standards and Technology Gaithersburg, Maryland http://trec.nist.gov

  2. TREC Tasks

  3. Workshop on Cross-Linguistic Information Retrieval, SIGIR 1996 • Paper “Building a Large Multilingual Test Collection from Comparable News Documents” by Páraic Sheridan, Jean Paul Ballerini and Peter Schäuble • Used Swiss news agency (SDA) data in French, German and Italian

  4. TREC-6 Cross-Language Track • In cooperation with the Swiss Federal Institute of Technology (ETH) • Task Summary: retrieval of English, French, and German documents, both in a monolingual and a cross-lingual mode • Guidelines: ad hoc task guidelines, plus all groups had to submit a monolingual baseline • Documents: • Neue Zürcher Zeitung (1994): German (200 MB) • SDA (1988-1990): French (250 MB), German (330 MB) • AP (1988-1990): English (759 MB) • Topics and relevance assessments all done at NIST

  5. TREC-6 Cross-Language Results - revised 01/20/98

  6. Major issues with language resources • No public domain stopword lists, stemmers, etc. for German and French • Jacques Savoy contributed a Porter-like stemmer for French and a stopword list • Martin Braschler and Paul Over from NIST built a simple German stemmer and decompounder • Questions from participants about how much of the final result was based on having access to “better” resources

  7. Major issues in CLIR resources • Major lack of machine-readable bilingual dictionaries • Resulted in the use of limited dictionaries • Resulted in the use of assorted mapped word lists that were found on the web • Major lack of parallel corpora • Resulted in the use of comparable corpora • (Later) resulted in the mining of the web for parallel text • Heavy use of SYSTRAN in query translation

  8. Lessons learned from TREC-6 • Importance of basic corpora • Difficulty in locating public domain tools • Problems of building multilingual testing data in the U.S.; this led to European cooperation in later TRECs

  9. Importance of Basic Corpora • The public availability of corpora, including text, speech and other multimedia data, is the most critical infrastructure • Newspapers (and their multimedia counterparts) are particularly valuable • Large volume readily available • Available in most languages • General purpose domain • Other genres are also important

  10. Uses of These Corpora • The basic building block for IR test collections • A rich source of vocabulary and language structure information for many tasks • Use of comparable corpora, e.g. corpora from the same time period, allows statistical mining of cross-language, cross-media “word” pairs

  11. Importance of Basic Tools • For IR – stopword lists, stemmers, decompounders, segmenters, etc. • For other NLP tasks, add parsers, part-of-speech taggers, noun phrase detectors, named entity recognizers, etc. • For MT, add sentence aligners, etc. • These need to be readily available for all languages

  12. Other Basic Infrastructures • Parallel text • WordNets • Treebanks • Thesauri (often domain specific) • Machine-readable dictionaries • Knowledge bases such as CYC • Gazetteers, etc.

  13. Critical Issues for Infrastructure • Widespread availability of what already exists; this is both an issue of good dissemination and reasonable costs • Serious examination of the cost/benefit ratio of building any new infrastructure by the funding agencies • A clearer relationship between infrastructure, tools, and evaluation

  14. Proposal: Widespread availability • Set up a central worldwide site with links to a site in each country that catalogs publicly available corpora and tools • Be realistic about the costs of corpora; the costs of building corpora should be paid by funding agencies, and the corpora should therefore be available at a TRULY minimal cost

  15. Proposal: Cost/Benefit Model • Look at basic corpora first • Prime target – a worldwide newspaper collection with at least 250 MB per language; look for publishing locations with multiple languages • Look at simple infrastructures also • Examples: lists of proper nouns, “crude” bilingual dictionaries, stemmers • Continue support of basic infrastructures like WordNets

  16. Proposal: Role of Evaluation • Evaluation forums are critical to making progress in language technology • Encourage “friendly” competition; provide a common task focal point for research groups worldwide • Enable identification of good tools for broader dissemination • Identify what the real issues are and which types of new infrastructure are most needed
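
Slide 10 mentions statistical mining of cross-language word pairs from comparable corpora. The slides do not say how this was done, so the following is only a minimal sketch under stated assumptions: documents in two languages are grouped by a shared date key, words that appear under the same key are counted as co-occurring, and candidate pairs are scored with the Dice coefficient. The function name `mine_word_pairs`, the `min_count` threshold, and the choice of Dice are all illustrative, not taken from the talk.

```python
from collections import Counter
from itertools import product


def mine_word_pairs(docs_a, docs_b, min_count=2):
    """Score candidate cross-language word pairs from comparable corpora.

    docs_a, docs_b: dicts mapping a shared key (assumed here to be a date)
    to the list of tokens published in each language under that key.
    Returns a dict {(word_a, word_b): Dice score}.
    """
    count_a, count_b, joint = Counter(), Counter(), Counter()

    # Only keys present in both corpora are comparable.
    for key in docs_a.keys() & docs_b.keys():
        words_a, words_b = set(docs_a[key]), set(docs_b[key])
        count_a.update(words_a)
        count_b.update(words_b)
        # Every cross-language pair seen under the same key co-occurs once.
        joint.update(product(words_a, words_b))

    scores = {}
    for (word_a, word_b), n in joint.items():
        if n >= min_count:  # drop pairs seen together too rarely
            scores[(word_a, word_b)] = 2 * n / (count_a[word_a] + count_b[word_b])
    return scores
```

On real newswire the input would first pass through the stopword lists and stemmers discussed in slides 6 and 11; without them, frequent function words dominate the candidate pairs.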
