Progress Report on SMART Information Retrieval System Acquisition and Setup - March 2000

This report details the acquisition and setup of the SMART information retrieval system as of March 8, 2000. Developed initially at Harvard and maintained at Cornell University, SMART emphasizes automatic (rather than interactive) document retrieval using vector-based analysis and tf x idf weighting. The system is extensive, comprising 350 source files and 45,000 lines of code. Challenges include minimal documentation and the complexity of customization, which makes advice from someone who has already worked with the system especially valuable. Next steps are to complete a query using default settings and to determine how to adjust each customizable feature.


Presentation Transcript


  1. The SMART System: Progress Report on System Acquisition and Set-Up • March 8, 2000 • IS 240: Principles of Information Retrieval • Danyel Fisher, Jonathan Henke, Jason Hong, Jonathan Huang, Jeane Stetson

  2. Background • Developed 1961-64 at Harvard • Maintained at Cornell University • Tested at every TREC conference • Emphasis: automatic retrieval (rather than interactive) • Vector-based analysis, tf x idf weighting • Current version: 13.3 (we have 11.0)

  3. Bibliography • Salton, Gerard. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, N.J.: Prentice-Hall, 1971. • Salton, Gerard. “Developments in Automatic Text Retrieval.” Science, 1991 Aug 30, v253 n5023:974-980. • TREC Proceedings. • SMART Staff. “User's Manual for the SMART Information Retrieval System.” Technical Report 71-95, revised April 1974. Cornell University, 1974. • Buckley, C. “Implementation of the SMART Information Retrieval System.” Technical Report 85-686. Cornell University, 1985.

  4. Indexing (Creating a Collection) • Document pre-parsing: recognize document structure and convert to a standard format • Finding & handling indexable information: parsing, stopword removal, stemming, term clustering, synonym dictionaries, etc. • Query handling: parsing, stopword removal, stemming, etc. (parallel to document handling)
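The document and query handling steps on this slide can be sketched as one shared function, so that both sides produce comparable terms. This is only an illustrative sketch in Python, not SMART's own C code: the stopword list, the `crude_stem` suffix stripper, and the function names are all assumptions made for the example, and term clustering and synonym dictionaries are left out.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# it is not the list that ships with SMART.
STOPWORDS = {"a", "an", "and", "at", "by", "for", "in", "is", "of", "on", "the", "to"}


def crude_stem(token):
    """Strip a few common English suffixes.

    A deliberately simple stand-in for a real stemmer, kept short
    so the overall pipeline stays visible.
    """
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the stem.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Parse text into tokens, drop stopwords, and stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


# The same function handles documents and queries, which is the
# "parallel" handling the slide describes.
doc_terms = index_terms("Indexing the documents of a collection")
query_terms = index_terms("indexed document collections")
```

Running documents and queries through the same steps matters because a query term can only match a document term if both were reduced to the same form.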

  5. Indexing (Creating a Collection) • Retrieval methods: term weighting and similarity evaluation (default: standard tf x idf weighting, vector inner product) • Output format & display
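As a rough illustration of the default retrieval method named on this slide, the sketch below weights terms by tf x idf and scores each document by the inner product of its vector with the query vector. It is written in Python rather than SMART's C, and it assumes raw term counts for tf and log(N/df) for idf with no length normalization; SMART's actual weighting variants may differ, and every function name here is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf x idf vectors for a list of term lists.

    Assumes tf = raw count and idf = log(N / df); this is one common
    choice, not necessarily the scheme SMART uses by default.
    """
    n = len(docs)
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each term once per document
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(terms).items()} for terms in docs
    ]
    return vectors, idf


def weight_query(terms, idf):
    """Weight query terms with the collection's idf; unseen terms are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(terms).items() if t in idf}


def inner_product(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def rank(query_terms, docs):
    """Return (score, doc_index) pairs, highest score first."""
    vectors, idf = tfidf_vectors(docs)
    q = weight_query(query_terms, idf)
    scores = [(inner_product(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Toy collection of already-indexed documents (lists of terms).
docs = [
    ["vector", "retrieval", "system"],
    ["vector", "space", "model"],
    ["library", "catalog", "system"],
]
ranking = rank(["retrieval", "system"], docs)
```

In this toy collection the first document shares both query terms, the third shares only the common term "system", and the second shares none, so they rank in that order. Because idf shrinks the weight of terms that appear in many documents, the rarer term "retrieval" contributes most of the top score.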

  6. Indexing: Customizable Elements • Document location & format • Indexable information & index format • Query format • Retrieval method (document/query comparison) • Output/display format

  7. System Architecture • 350 source files • 45,000 lines of code • Can include user-programmed modules

  8. Set-up Procedure • Download source code (ftp://ftp.cs.cornell.edu/pub/smart) • Compile • Look for documentation • Indexing completed using default settings • Unable to complete query yet • Unable to examine index • Cannot verify success of indexing!

  9. System Documentation • Minimal • Poorly explained • Cryptic • Uses its own specific terminology

  10. Problems Faced • Virtually every feature is customizable • Somewhere there are people who know how to do the customization... • “SMART suffers from the advantages and disadvantages of most academic research software. It's designed to be extremely flexible (as long as you know what you're doing!)” - SMART manual • Documentation is too high level.

  11. Further Steps • Complete a query using default settings. • Identify specific files for adjusting each customizable feature. • Determine how to modify each feature.

  12. Recommendations & Advice • Find someone who has actually worked with the system before. • Understanding operation requires examination of C source code. • Customization requires modifying / creating C code.
