Enhancements in Harvest System Architecture and Functionality
This presentation outlines the goals and planned improvements for the Harvest system. Harvest's basic design, its use of SOIF for inter-component communication, and its development model stay as they are. The general goals are faster searches, a shift of focus to HTTP and HTML, internationalisation, better scalability and availability, improved access control and ranking, and integration of other search systems into Harvest. Planned work includes creating multiple Gatherers on the fly, improving the HTML summarizers, adding Indexdata's Zebra as a full-text indexer for the Broker, and switching the documentation to DocBook.
Presentation Transcript

What won’t change
• Harvest’s basic design
• SOIF for inter-component communication
• Development model
http://harvest.sourceforge.net/
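
The slide above names SOIF (Summary Object Interchange Format) as the format Harvest components use to talk to each other. As a rough illustration only, the sketch below parses a single SOIF object in Python, assuming the commonly documented layout: a header `@TYPE { URL`, then attributes written as `Name{size}:<tab>value`, then a closing `}`. The function name `parse_soif` and the sample object are made up for this example and are not part of Harvest.

```python
import re

# Header line, e.g. b"@FILE { http://www.example.com/\n"
HEADER = re.compile(rb"@(\S+)\s*\{\s*(\S+)\s*\n")
# Attribute prefix, e.g. b"Title{11}:\t" (name, byte count, colon, tab)
ATTR = re.compile(rb"([^\s{}]+)\{(\d+)\}:\t")


def parse_soif(data: bytes):
    """Parse one SOIF object and return (template_type, url, attributes).

    Hypothetical helper, not Harvest code. Attribute values stay as bytes
    because the declared size counts bytes, not characters.
    """
    header = HEADER.match(data)
    if header is None:
        raise ValueError("missing SOIF header")
    template_type = header.group(1).decode()
    url = header.group(2).decode()
    pos = header.end()
    attributes = {}
    while True:
        # Skip the line breaks that separate attributes.
        while data[pos:pos + 1] in (b"\n", b"\r"):
            pos += 1
        if data[pos:pos + 1] == b"}":
            break  # end of the object
        attr = ATTR.match(data, pos)
        if attr is None:
            raise ValueError(f"malformed attribute at byte {pos}")
        size = int(attr.group(2))
        start = attr.end()
        # Read exactly `size` bytes, so a value may contain newlines or '}'.
        attributes[attr.group(1).decode()] = data[start:start + size]
        pos = start + size
    return template_type, url, attributes


sample = b"@FILE { http://www.example.com/\nTitle{11}:\tHello World\nType{4}:\tHTML\n}\n"
template_type, url, attributes = parse_soif(sample)
print(template_type, url, attributes["Title"])
```

Reading values by their declared byte count, rather than splitting on line breaks, is what lets an attribute carry arbitrary content such as a multi-line abstract.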

General Goals
• Increase search speed
• Shift focus to HTTP and HTML
• Internationalisation
• Improve scalability
• Increase availability
• Improve access control

General Goals
• Integrate other search systems into the Harvest system
• Remove all non-GPLed components
• Improve ranking
• Promote Harvest to attract more users and developers

Gatherer
• Shift focus to HTTP
• Improve gathering over slow connections
• Improve HTTP gatherer
• Create multiple Gatherers “on the fly” where possible
• Evaluate larbin and curl
• Migrate from GDBM to Sleepycat’s DB for local disc cache management

Gatherer
• Remove local disc cache
• Implement candidate selection filter for HTTP enumerator based on MIME type
• Trust MIME type sent by HTTP servers
• Add HTTPS support
• Evaluate improvements of HTTP/1.1 over HTTP/1.0
• Replace unnesters with exploders
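
The two MIME type items above could work together roughly as follows. This is a minimal sketch, assuming the enumerator sees the `Content-Type` header of each HTTP response and trusts it as sent. The allow-list `ALLOWED_TYPES` and the function names are hypothetical; the slides do not say which types Harvest would accept.

```python
from fnmatch import fnmatch

# Hypothetical allow-list; shell-style patterns such as "text/*" also work.
ALLOWED_TYPES = ("text/html", "text/plain", "application/pdf")


def media_type(content_type_header: str) -> str:
    """Drop parameters such as '; charset=utf-8' and lower-case the type."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_candidate(content_type_header: str, allowed=ALLOWED_TYPES) -> bool:
    """Return True if the server-reported type matches the allow-list."""
    reported = media_type(content_type_header)
    return any(fnmatch(reported, pattern) for pattern in allowed)


for header in ("text/HTML; charset=ISO-8859-1", "image/png"):
    print(header, "->", is_candidate(header))
```

Filtering on the reported type lets the enumerator skip unwanted objects without inspecting their content, at the cost of relying on servers to report types correctly.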

Gatherer
• Improve object storage system
• Improve expiry of objects
• Evaluate viability of an expire daemon
• Split file: and news: rootnodes into leafnodes
• Remove All-Templates
• Make SOIF objects shareable between Gatherer and Broker if possible

Summarizer
• Shift focus to HTML
• Improve existing HTML summarizers
• Create HTML summarizer which “understands” HTML
• Improve support for Microsoft Office documents

Broker
• Add Indexdata’s Zebra as full-text indexer
• Implement method to retrieve an SOIF object by URL
• Improve handling of temporary files and directories used for paging search results
• Improve SOIF object storage
• Extend “shell indexer” functionality

Broker
• Implement a user interface in PHP
• Separate data from metadata when storing SOIF objects
• Minimise size of Registry
• Use cookies to save user preferences of the search interface
• Evaluate and write SOIF filter for Namazu
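
The cookie item above can be illustrated with a small round trip. The slides plan the user interface in PHP; the sketch below uses Python's standard `http.cookies` module only to show the idea. The cookie name `harvest_prefs`, the preference keys, and the 30-day lifetime are assumptions made for this example.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, quote, unquote, urlencode

COOKIE_NAME = "harvest_prefs"  # hypothetical cookie name


def make_set_cookie(prefs: dict, max_age: int = 30 * 24 * 3600) -> str:
    """Build the value of a Set-Cookie header that stores the preferences."""
    cookie = SimpleCookie()
    # Percent-encode the whole key=value string so it is a safe cookie value.
    cookie[COOKIE_NAME] = quote(urlencode(sorted(prefs.items())), safe="")
    cookie[COOKIE_NAME]["path"] = "/"
    cookie[COOKIE_NAME]["max-age"] = max_age
    return cookie[COOKIE_NAME].OutputString()


def read_prefs(cookie_header: str) -> dict:
    """Recover the preferences from the Cookie header sent by the browser."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    if COOKIE_NAME not in cookie:
        return {}
    return dict(parse_qsl(unquote(cookie[COOKIE_NAME].value)))


set_cookie = make_set_cookie({"results": "20", "lang": "de"})
# A browser sends back only the name=value pair, without the attributes.
print(read_prefs(set_cookie.split(";", 1)[0]))
```

Keeping the preferences in the cookie itself means the Broker does not have to store per-user state on the server.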

Broker
• Evaluate RDBMS (PostgreSQL, MySQL)
• Evaluate XQuery and SOAP

Documentation
• Switch from LinuxDoc to DocBook for manual and FAQ

Problems
• PostScript and PDF summarizers
• Apache’s MultiViews
• IMS Gathering
• Stemming and Soundex are language dependent
• Language recognition
• No free thesauri available