This report details the acquisition and setup of the SMART information retrieval system as of March 8, 2000. Developed initially at Harvard and maintained at Cornell University, SMART emphasizes automatic document retrieval with vector-based analysis and tf.idf weighting. The system architecture is extensive, comprising 350 source files and 45,000 lines of code. Challenges faced include minimal documentation and the complexity of customization, highlighting the importance of previous users' insights for effective implementation. Further steps are outlined for query completion and feature adjustment.
The SMART System: Progress Report on System Acquisition and Set-Up • March 8, 2000 • IS 240: Principles of Information Retrieval • Danyel Fisher, Jonathan Henke, Jason Hong, Jonathan Huang, Jeane Stetson
Background • Developed 1961-64 at Harvard • Maintained at Cornell University • Tested at every TREC conference • Emphasis: automatic retrieval (rather than interactive) • Vector-based analysis, tf x idf weighting • Current version: 13.3 (we have 11.0)
Bibliography • Salton, Gerard. The SMART retrieval system; experiments in automatic document processing. Englewood Cliffs, N.J.: Prentice-Hall, 1971. • Salton, Gerard. “Developments in Automatic Text Retrieval.” Science, 1991 Aug 30, v253 n5023:974-980. • TREC Proceedings • SMART Staff. “User's Manual for the SMART Information Retrieval System.” Technical Report 71-95, revised April 1974. Cornell University, 1974. • Buckley, C. “Implementation of the SMART Information Retrieval System.” Technical Report 85-686, Cornell University, 1985.
Indexing (Creating a Collection) • Document pre-parsing: recognize document structure and convert to a standard format • Finding & handling indexable information: parsing, stopword removal, stemming, term clustering, synonym dictionaries, etc. • Query handling: parsing, stopword removal, stemming, etc. (parallel to document handling)
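The parsing, stopword removal, and stemming steps listed above can be illustrated with a minimal Python sketch. This is not SMART's own code (SMART is written in C): the stopword list, the suffix-stripping `naive_stem` helper, and the `index_terms` function are all hypothetical stand-ins chosen for illustration. The same function is applied to documents and queries, mirroring the "parallel to document handling" point.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}

def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Parse text into lowercase tokens, drop stopwords, stem, and count."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOPWORDS)

# Documents and queries go through the same pipeline.
doc_vector = index_terms("Indexing the documents and parsing the queries")
query_vector = index_terms("parsing of documents")
print(doc_vector)
print(query_vector)
```

The result for each text is a bag of stemmed terms with raw counts, which is the kind of term-frequency vector that a weighting step would then operate on.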
Indexing (Creating a Collection), continued • Retrieval methods: term weighting and similarity evaluation (default: standard tf x idf weighting, vector inner product) • Output format & display
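To make the tf x idf weighting and vector inner product concrete, here is a minimal Python sketch. It assumes one common formulation, weight = tf * log(N / df), which is not necessarily the exact variant SMART applies by default; the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by tf * log(N / df) over a list of token lists."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors, df

def inner_product(query_vec, doc_vec):
    """Similarity as the sum of weight products over shared terms."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

# Toy collection (already tokenized); contents are illustrative only.
docs = [
    ["vector", "retrieval", "system"],
    ["automatic", "retrieval", "retrieval"],
    ["document", "parsing", "system"],
]
doc_vecs, df = tfidf_vectors(docs)

# Weight the query with the collection's idf values, ignoring unseen terms.
query = ["automatic", "retrieval"]
query_vec = {
    t: c * math.log(len(docs) / df[t])
    for t, c in Counter(query).items()
    if t in df
}

scores = [inner_product(query_vec, dv) for dv in doc_vecs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Documents that share rare query terms score highest, and a document with no query terms scores zero. The inner product is left unnormalized here, so longer documents are not penalized as they would be under cosine similarity.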
Indexing: Customizable Elements • Document location & format • Indexable information & index format • Query format • Retrieval method (document/query comparison) • Output/display format
System Architecture • 350 source files • 45,000 lines of code • Can include user-programmed modules
Set-up Procedure • Download source code from ftp://ftp.cs.cornell.edu/pub/smart • Compile • Look for documentation • Indexing completed using default settings • Unable to complete a query yet • Unable to examine the index, so we cannot verify that indexing succeeded!
System Documentation • Minimal • Poorly explained • Cryptic • Uses its own specialized terminology
Problems Faced • Virtually every feature is customizable • Somewhere there are people who know how to do the customization... • “SMART suffers from the advantages and disadvantages of most academic research software. It's designed to be extremely flexible (as long as you know what you're doing!)” - SMART manual • Documentation is too high-level.
Further Steps • Complete a query using default settings. • Identify specific files for adjusting each customizable feature. • Determine how to modify each feature.
Recommendations & Advice • Find someone who has actually worked with the system before. • Understanding how the system operates requires reading the C source code. • Customization requires modifying or writing C code.