The SMART System: Progress Report on System Acquisition and Set-Up
March 8, 2000
IS 240: Principles of Information Retrieval
Danyel Fisher, Jonathan Henke, Jason Hong, Jonathan Huang, Jeane Stetson
Background
• Developed 1961-64 at Harvard
• Maintained at Cornell University
• Tested at every TREC conference
• Emphasis: automatic retrieval (rather than interactive)
• Vector-based analysis, tf x idf weighting
• Current version: 13.3 (we have 11.0)
Bibliography
• Salton, Gerard. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, N.J.: Prentice-Hall, 1971.
• Salton, Gerard. "Developments in Automatic Text Retrieval." Science, 1991 Aug 30, v253 n5023: 974-980.
• TREC Proceedings
• SMART Staff. "User's Manual for the SMART Information Retrieval System." Technical Report 71-95, revised April 1974. Cornell University, 1974.
• Buckley, C. "Implementation of the SMART Information Retrieval System." Technical Report 85-686. Cornell University, 1985.
Indexing (Creating a Collection)
• Document pre-parsing: recognize document structure and convert to a standard format
• Finding & handling indexable information: parsing, stopword removal, stemming, term clustering, synonym dictionaries, etc.
• Query handling: parsing, stopword removal, stemming, etc. (parallel to document handling)
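The document and query handling steps above (parsing, stopword removal, stemming) can be sketched in a few lines of Python. This is only an illustration of the general technique, not SMART's code: the stopword list, the suffix-stripping rule, and the function names `stem` and `index_terms` are all assumptions made for the example.

```python
import re

# Hypothetical stopword list for illustration; not SMART's actual list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for"}


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Parse text into lowercase tokens, drop stopwords, stem the rest."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]
```

Because queries are handled in parallel to documents, the same `index_terms` function would be applied to both, so that query terms and document terms end up in the same normalized form.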
Indexing (Creating a Collection)
• Retrieval methods: term weighting and similarity evaluation (default: standard tf x idf weighting, vector inner product)
• Output format & display
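As a rough illustration of the default retrieval method named above, the sketch below weights terms by tf x idf and scores each document by the inner product of its vector with the query vector. It assumes one common textbook variant (raw term frequency times log(N/df)); SMART's exact default formula may differ, and all function names here are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of term lists. Returns (sparse tf x idf vectors, idf table)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def weight_query(terms, idf):
    """Weight query terms the same way; terms unseen in the collection are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(terms).items() if t in idf}


def inner_product(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def rank(query_terms, docs):
    """Return (score, document index) pairs, highest score first."""
    vectors, idf = tfidf_vectors(docs)
    q = weight_query(query_terms, idf)
    scores = [(inner_product(q, d), i) for i, d in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))
```

For example, with `docs = [["vector", "retrieval"], ["vector", "index"], ["stem", "index"]]`, calling `rank(["retrieval"], docs)` puts document 0 first, since it is the only one containing the query term.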
Indexing: Customizable Elements
• Document location & format
• Indexable information & index format
• Query format
• Retrieval method (document/query comparison)
• Output/display format
System Architecture
• 350 source files
• 45,000 lines of code
• Can include user-programmed modules
Set-up Procedure
• Download source code: ftp://ftp.cs.cornell.edu/pub/smart
• Compile
• Look for documentation
• Indexing completed using default settings
• Unable to complete query yet
• Unable to examine index
• Cannot verify success of indexing!
System Documentation
• Minimal
• Poorly explained
• Cryptic
• Uses its own specific terminology
Problems Faced
• Virtually every feature is customizable
• Somewhere there are people who know how to do the customization…
• "SMART suffers from the advantages and disadvantages of most academic research software. It's designed to be extremely flexible (as long as you know what you're doing!)" (SMART manual)
• Documentation is too high-level.
Further Steps
• Complete a query using default settings.
• Identify specific files for adjusting each customizable feature.
• Determine how to modify each feature.
Recommendations & Advice
• Find someone who has actually worked with the system before.
• Understanding operation requires examination of C source code.
• Customization requires modifying / creating C code.