Explore the innovative possibilities of the GOAT Search Solution powered by Apache Lucene. This session covers the importance of effective search engines, comparing GOAT with traditional solutions like Verity, and its advantages in handling large datasets efficiently. With over a decade of development experience, learn about indexing processes, query syntax, and the benefits of utilizing Lucene's high-performance capabilities. Discover alternative search solutions and understand how GOAT can revolutionize your organization's data retrieval strategies seamlessly.
Revorg GOAT Search Solution (Powered by Lucene)
About Me Grover Fields • Revorg, LLC (Owner) • M.S. Information System (Troy University) • B.S. Industrial Engineering (Florida A&M University) • Stanford Project Management Courses
About Me • 10+ years of development, analysis, and implementation • 10+ years of ColdFusion experience • 2+ years of Java experience • Commonspot, Strongmail, ClickFix (Developer) • Email: grover_fields@yahoo.com • Web site: http://www.groverfields.com
Agenda • What? • What can we do with GOAT? • Why? • Why do we want to use GOAT and not Verity? • How? • How do we do that? • Conclusion and alternative solutions
What • What is a Search Engine? • Builds an index on text • Answers queries using that index, a la Verity • Existing database already • What does a search engine offer? • Scalability • Relevance Ranking • Tweaking • Integrates different sources (email, web pages, files, DATABASES)
What is a search engine? (cont.) • Works on words, not on substrings • Auto != automatic, automobile • Indexing process: • Convert document • Extract text and meta data • Normalize text • Write (inverted) index
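The indexing steps above (extract text, normalize it, write an inverted index) can be sketched with nothing but the Java standard library. This is a minimal illustration, not Lucene's implementation: the class `TinyInvertedIndex` and its methods are hypothetical names, and normalization here is just lower-casing and splitting on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene or GOAT.
class TinyInvertedIndex {
    // term -> sorted set of ids of the documents that contain that term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Normalize text: lower-case it, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty()) {
                terms.add(raw);
            }
        }
        return terms;
    }

    // Write step: record the document id under every term found in its text.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up one whole word; substrings of longer words do not match.
    Set<Integer> search(String word) {
        List<String> terms = tokenize(word);
        if (terms.isEmpty()) {
            return new TreeSet<>();
        }
        return postings.getOrDefault(terms.get(0), new TreeSet<>());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Auto repair shop");
        index.add(2, "Automatic transmission service");
        index.add(3, "Automobile dealers");
        System.out.println(index.search("auto"));      // prints [1]
        System.out.println(index.search("AUTOMATIC")); // prints [2]
    }
}
```

Searching for "auto" returns only document 1, which is the "Auto != automatic, automobile" point: the lookup is by whole term, not by substring.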
Apache Lucene Overview • Lucene Java 2.4 • A high-performance, full-featured text search engine library written entirely in Java. • It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. • No GUI • http://lucene.apache.org
Apache Lucene Overview • Java library for indexing and searching • No dependencies • Works with Java 1.4 or later • Input for indexing: Document objects • Each document: set of Fields, field name, field content • Stores its index as files on disk or memory • No document converters • No web crawler
Lucene Java users • HBCU.info • LinkedIn • IBM OmniFind Yahoo! Edition • Technorati.com • Eclipse • Monster.com • …
Lucene Java Summary • Java Library for indexing and searching • Lightweight /no dependencies • Powerful and fast and tested! • No document conversion • No GUI
Why? • Cost of Enterprise Search Solution • Need for search speed • Java projects to work on • Things to do
Verity Limitations • 10,000 documents for ColdFusion Developer Edition • 125,000 documents for ColdFusion Standard Edition • 250,000 documents for ColdFusion Enterprise Edition • What do developers do in a shared hosting environment? • Is it possible for the hosting company to limit the number of documents per Web site?
T-SQL Limitations? • Search for "Yahoo" on my blog • SELECT entry.id FROM tbl_mango_entry AS entry INNER JOIN tbl_mango_post AS post ON entry.id = post.id WHERE entry.blog_id = 'default' AND (entry.title LIKE '%yahoo%' OR entry.content LIKE '%yahoo%' OR entry.excerpt LIKE '%yahoo%') AND post.posted_on <= getdate() AND entry.status = 'published' ORDER BY post.posted_on DESC • Multiply that by 10, 100, 500, or 1000 users/hr?
T-SQL Limitations? • Full table scan = 1 THING • PERFORMANCE KILLER!!! • No search sorting • RDBMS isn’t designed to do this but allows it • Use the right tools!
How? • GOAT Search Solution • Lucene 2.4.0 • ColdFusion MX 8 • MX is fine but GUI needs to be rolled back • Commons IO 1.4 • Simply packaged .jar files • Simple Web-based GUI
How? • Macromedia JDBC Drivers • Same drivers that ColdFusion uses • No additional drivers to install • Supports RDBMS ONLY • MSSQL • MySQL • Oracle • No File system support (Yet)
Basics? • Indexing extracts both meaning and structure from unstructured information by indexing each document • Contains a complete list of all the words used in a given document along with metadata about that document • Lucene creates a collection that normalizes both the structured and unstructured data. • Search requests then check these collections rather than scanning the actual documents and database fields. • This provides a faster search of information, regardless of the file type and whether the source is structured or unstructured.
Basics? • Collection • A special database created by Lucene that contains metadata that describes the documents • Documents • A sequence of fields • Similar to a row in a database table • Row 1 • Row 2, etc • Fields • A named sequence of terms • Similar to a column in a table • Primary Key • Column 1 • Terms • A term is a string
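The row/column analogy above can be made concrete with a small sketch that turns one database row into a "document": an ordered set of named fields, each holding a sequence of string terms. The class `RowAsDocument` and the sample column names are hypothetical, and every field is tokenized the same way here, which is a simplification (a real setup would likely keep a primary key as a single untokenized value).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not part of Lucene or GOAT.
class RowAsDocument {
    // Turn one database row (column name -> value) into a document:
    // field name -> sequence of lower-cased terms, in column order.
    static Map<String, List<String>> toDocument(Map<String, String> row) {
        Map<String, List<String>> document = new LinkedHashMap<>();
        for (Map.Entry<String, String> column : row.entrySet()) {
            List<String> terms = new ArrayList<>();
            for (String term : column.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    terms.add(term);
                }
            }
            document.put(column.getKey(), terms);
        }
        return document;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "42");                             // primary key column
        row.put("title", "Senior ColdFusion Developer"); // text column
        System.out.println(toDocument(row));
        // prints {id=[42], title=[senior, coldfusion, developer]}
    }
}
```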
Knowledge? • Index • A special database created by Lucene that contains metadata that describes the documents • Query Syntax • Similar to Google's advanced search: • field:value • e.g. resume:coldfusion • http://lucene.apache.org/java/2_4_0/queryparsersyntax.html • Results • Primary Key list of values • XML based on the document • CFX Tag integration
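As a rough sketch of how a `field:value` query can resolve to a list of primary keys, the standard-library example below indexes terms per field and answers a single `field:value` pair. It is an assumption-laden toy, not GOAT's code or Lucene's query parser: `FieldQueryDemo`, `index`, and `search` are hypothetical names, and only the simplest form of the syntax is handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; the class and method names are not GOAT's or Lucene's API.
class FieldQueryDemo {
    // "field:term" -> primary keys of the rows whose field contains that term
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Index the content of one field of one row under that row's primary key.
    void index(int primaryKey, String field, String content) {
        for (String term : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> keys = postings.computeIfAbsent(field + ":" + term, k -> new ArrayList<>());
            if (!keys.contains(primaryKey)) {
                keys.add(primaryKey);
            }
        }
    }

    // Accepts only a single field:value pair and returns the matching primary keys.
    List<Integer> search(String query) {
        int colon = query.indexOf(':');
        if (colon <= 0 || colon == query.length() - 1) {
            throw new IllegalArgumentException("expected field:value, got: " + query);
        }
        String field = query.substring(0, colon).trim();
        String value = query.substring(colon + 1).trim().toLowerCase(Locale.ROOT);
        return postings.getOrDefault(field + ":" + value, Collections.emptyList());
    }

    public static void main(String[] args) {
        FieldQueryDemo demo = new FieldQueryDemo();
        demo.index(101, "resume", "ColdFusion and Java developer");
        demo.index(102, "resume", "Java developer");
        demo.index(103, "title", "ColdFusion administrator");
        System.out.println(demo.search("resume:coldfusion")); // prints [101]
        System.out.println(demo.search("resume:java"));       // prints [101, 102]
    }
}
```

Row 103 mentions ColdFusion only in its `title` field, so it is not returned for `resume:coldfusion`; the returned keys could then be used to fetch the full rows from the database.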
Alternative Solutions for Search • Commercial vendors: • FAST, $100k • Autonomy, $80k • Google, $50k • Commercial search engines based on Lucene • IBM OmniFind Yahoo Edition • RDBMS with Integrated Search • Oracle • MySQL • MSSQL • PERFORMANCE KILLERS
Road Map • A set of guidelines, instructions, or explanations: "wrote an ethics code as a road map for the behavior of elected officials." • Overhaul Java programming (still novice) • Integrate with other products • Aperture • Nutch • Solr • File system integration • .txt, .pdf, .doc, .ppt, etc. • Geospatial based searches • e.g. all jobs within a 50 mile radius
References • Apache.org • Adobe.com • Ben Forta’s Blog • Slideshare.net • Multiple authors • Other references