
Reference Collections: Task Characteristics


Presentation Transcript


  1. Reference Collections: Task Characteristics

  2. TREC Collection • Text REtrieval Conference (TREC) • Sponsored by NIST and DARPA (1992–present) • Compares approaches for information retrieval from large text collections: • Uniform scoring procedures • Large corpus of news and technical texts • Texts tagged in SGML (includes some metadata and document structure) • Specified tasks
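
The slide above notes that TREC texts are tagged in SGML. As a minimal sketch, the snippet below pulls the document id and body out of a TREC-style entry; the <DOC>, <DOCNO>, and <TEXT> tags follow the usual TREC layout, but the sample document id and text are made up, and other fields vary by sub-collection.

```python
import re

# Illustrative TREC-style SGML fragment (document id and text are made up).
SAMPLE = """
<DOC>
<DOCNO> AP890101-0001 </DOCNO>
<TEXT>
Federal subsidies for rail service were debated again this year ...
</TEXT>
</DOC>
"""

def parse_trec_docs(sgml_text):
    """Yield (docno, text) pairs from concatenated <DOC> entries."""
    for doc in re.findall(r"<DOC>(.*?)</DOC>", sgml_text, re.S):
        docno = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", doc, re.S)
        text = re.search(r"<TEXT>\s*(.*?)\s*</TEXT>", doc, re.S)
        yield (docno.group(1) if docno else None,
               text.group(1) if text else "")

for docno, text in parse_trec_docs(SAMPLE):
    print(docno, "->", text[:40], "...")
```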

  3. Example Task • <top> • <num> Number: 168 • <title> Topic: Financing AMTRAK • <desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK). • <narr> Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuous government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant. • </top>

  4. Deciding What is Relevant • Pooling method • A set (pool) of potentially relevant documents is obtained by combining the top N results from various retrieval systems. • Humans then examine these to determine which are truly relevant. • Assumes relevant documents will be in the pool and that documents not in the pool are not relevant. • These assumptions have been verified (at least for evaluation purposes)
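
As a minimal sketch of the pooling method described above, the code below merges the top-N results from several runs into a single pool for human assessment; the system names, document ids, and cutoff are illustrative.

```python
def build_pool(runs, n=100):
    """Pooling sketch: take the union of the top-n documents from each
    system's ranked run; human assessors then judge only this pool."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:n])
    return pool

# Illustrative (made-up) runs from three hypothetical systems.
runs = {
    "system_a": ["d3", "d7", "d1", "d9"],
    "system_b": ["d7", "d2", "d3", "d5"],
    "system_c": ["d8", "d3", "d7", "d4"],
}
print(sorted(build_pool(runs, n=3)))  # documents any system ranked in its top 3
```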

  5. Types of TREC Tasks • Ad hoc tasks: • New queries against a static collection • IR systems return ranked results • Systems get the task and the collection • Routing tasks: • Standing queries against a changing collection • Basically a batch-mode filtering task • Example: identifying documents on a topic from the AP newswire • Results must be ranked • Systems get the task and two collections, one for training and one for evaluation
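
The routing task above amounts to scoring each new document against a fixed, standing query. The sketch below uses simple term overlap as a stand-in for a real retrieval model; the query terms and newswire snippets are made up.

```python
def route_documents(standing_query_terms, incoming_docs):
    """Routing/filtering sketch: score each incoming document against a
    standing query and return matching documents ranked by term overlap."""
    query = set(standing_query_terms)
    scored = []
    for doc_id, text in incoming_docs:
        overlap = len(query & set(text.lower().split()))
        scored.append((overlap, doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

# Illustrative (made-up) newswire snippets.
stream = [
    ("ap-001", "Congress debates federal subsidies for rail service"),
    ("ap-002", "Local weather forecast calls for rain"),
    ("ap-003", "Rail operator seeks new federal subsidies"),
]
print(route_documents(["federal", "subsidies", "rail"], stream))
```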

  6. Secondary Tasks at TREC • Chinese • Documents and queries in Chinese • Filtering • Determine whether each new document is relevant (no rank order) • Interactive • Human searcher interacts with the system to identify relevant documents (no rank order) • NLP • Examining the value of NLP in IR

  7. Secondary Tasks at TREC • Cross-Language • Documents in one language, topics in another • High Precision • Retrieve 10 documents that answer a given information request within 5 minutes • Spoken Document Retrieval • Documents are transcripts of radio broadcasts • Very Large Corpus • > 20 GB collection

  8. Evaluation Measures • Summary Table Statistics • # of requests in the task, # of documents retrieved, # of relevant docs retrieved, total # of relevant docs • Recall-Precision Averages • 11 standard recall levels • Document Level Averages • Avg. precision for a specified # of retrieved docs (R) • Average Precision Histogram • Graph showing how an algorithm did on each request compared to the average of all algorithms
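
A minimal sketch of the recall-precision average over the 11 standard recall levels mentioned above: interpolated precision at recall level r is taken as the maximum precision observed at any recall of r or higher. The ranked run and relevance judgments are illustrative.

```python
def eleven_point_precision(ranked_docs, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one
    request, given a ranked result list and the set of relevant doc ids."""
    relevant = set(relevant)
    hits = 0
    points = []  # (recall, precision) after each relevant document retrieved
    for i, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    levels = [r / 10 for r in range(11)]
    interp = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        interp.append(max(candidates) if candidates else 0.0)
    return dict(zip(levels, interp))

# Illustrative (made-up) run: 3 relevant documents, 6 retrieved.
print(eleven_point_precision(["d1", "d4", "d2", "d9", "d3", "d8"],
                             relevant={"d1", "d2", "d3"}))
```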

  9. Reference Collections: Collection Characteristics

  10. CACM Collection • 3204 Communications of the ACM articles • Focus of collection: computer science • Structured subfields: • Author names • Date information • Word stems from title and abstract • Categories from hierarchical classification • Direct references between articles • Bibliographic coupling connections • Number of co-citations for each pair of articles

  11. CACM Collection • 3204 Communications of the ACM articles • Test information requests: • 52 information requests in natural language with two Boolean query expressions • Average of 11.4 terms per query • Requests are rather specific, with an average of about 15 relevant documents each • Results in relatively low precision and recall

  12. ISI Collection • 1460 documents from the Institute of Scientific Information • Focus of collection: information science • Structured subfields: • Author names • Word stems from title and abstract • Number of co-citations for each pair of articles

  13. ISI Collection • 1460 documents from the Institute of Scientific Information • Test information requests: • 35 information requests in natural language with Boolean query expressions • Average of 8.1 terms per query • 41 information requests in natural language without Boolean query expressions • Requests are fairly general, with an average of about 50 relevant documents each • Higher precision and recall

  14. Observation • Number of terms increases slowly with the number of documents
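
This sublinear vocabulary growth is the behavior commonly modeled by Heaps' law, V = K * n^beta with beta < 1 (a connection not stated on the slide). Treating collection size in running words as roughly proportional to the number of documents, the sketch below shows how slowly the estimated vocabulary grows; the constants are illustrative.

```python
def heaps_vocabulary(n_words, k=40.0, beta=0.5):
    """Heaps'-law sketch: estimated number of distinct terms after seeing
    n_words running words. k and beta are illustrative; in practice they
    are fit to the collection, with beta well below 1."""
    return k * (n_words ** beta)

# Vocabulary grows much more slowly than the text itself.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} running words -> ~{heaps_vocabulary(n):,.0f} distinct terms")
```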

  15. Cystic Fibrosis Collection • 1239 MEDLINE articles indexed under “Cystic Fibrosis” • Structured subfields: • MEDLINE accession number • Author • Title • Source • Major subjects • Minor subjects • Abstract (or extract) • References in the document • Citations to the document

  16. Cystic Fibrosis Collection • 1239 MEDLINE articles indexed under “Cystic Fibrosis” • Test information requests: • 100 information requests • Relevance assessed by four experts on a scale of 0 (not relevant), 1 (marginally relevant), and 2 (highly relevant) • Overall relevance is the sum of the four judgments (range 0–8)
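
A minimal sketch of the scoring scheme above: each of the four experts assigns 0, 1, or 2, and the overall relevance of a query-document pair is their sum. The judgments shown are made up.

```python
def overall_relevance(judgments):
    """Cystic Fibrosis scoring sketch: four experts each assign
    0 (not relevant), 1 (marginal), or 2 (high); overall score is the sum."""
    assert len(judgments) == 4 and all(j in (0, 1, 2) for j in judgments)
    return sum(judgments)

# Illustrative (made-up) judgments for one query-document pair.
print(overall_relevance([2, 1, 0, 2]))  # -> 5
```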

  17. Discussion Questions • In developing a search engine: • How would you use metadata (e.g. author, title, abstract)? • How would you use document structure? • How would you use references, citations, co-citations? • How would you use hyperlinks?
