Advanced Search Solutions with GOAT: Leveraging Apache Lucene for Optimized Performance

Revorg GOAT Search Solution (Powered by Lucene)

About Me Grover Fields • Revorg, LLC (Owner) • M.S. Information Systems (Troy University) • B.S. Industrial Engineering (Florida A&M University) • Stanford Project Management Courses

About Me • 10+ years of development, analysis, and implementation • 10+ years of ColdFusion experience • 2+ years of Java experience • Commonspot, Strongmail, ClickFix (Developer) • Email: grover_fields@yahoo.com • Web site: http://www.groverfields.com

Agenda • What? • What can we do with GOAT? • Why? • Why do we want to use GOAT and not Verity? • How? • How do we do that? • Conclusion and alternative solutions

What • What is a search engine? • Builds an index on text • Answers queries using that index, a la Verity • We already have a database, so what does a search engine offer? • Scalability • Relevance ranking • Tweaking • Integration of different sources (email, web pages, files, DATABASES)
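To make "relevance ranking" concrete, here is a minimal sketch of scoring documents against a query with a simplified tf-idf weight. It is an illustration only, not Lucene's actual scoring formula; the function names (`tokenize`, `rank`) and the sample documents are assumptions made up for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()

def rank(query, docs):
    """Return document ids ordered by a simplified tf-idf score, best first."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]  # how often the term occurs in this document
            if tf == 0:
                continue
            # Number of documents that contain the term at least once.
            df = sum(1 for other in tokenized.values() if term in other)
            idf = 1.0 + math.log(n_docs / df)  # rarer terms weigh more
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical sample data.
docs = {
    1: "coldfusion developer resume",
    2: "java developer resume with coldfusion and more coldfusion",
    3: "industrial engineering coursework",
}
print(rank("coldfusion", docs))
```

Document 2 mentions the term twice, so it ranks ahead of document 1, and document 3 is not returned at all. A plain SQL `LIKE` filter can say which rows match, but not which match best.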

What is a search engine? (cont.) • Works on words, not on substrings • A search for "auto" does not match "automatic" or "automobile" • Indexing process: • Convert document • Extract text and metadata • Normalize text • Write (inverted) index
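The indexing steps above can be sketched in a few lines. This is a toy inverted index, assuming documents are already converted to plain text; the names `normalize`, `build_index`, and `search` and the sample documents are hypothetical, and Lucene's real analyzers and on-disk index format are far more involved.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each normalized word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in normalize(text):
            index[word].add(doc_id)
    return index

def search(index, word):
    """Look up a single whole word; returns a sorted list of document ids."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical sample data.
docs = {
    1: "Auto parts for sale",
    2: "Automatic transmission repair",
    3: "Automobile insurance quotes",
}
index = build_index(docs)
print(search(index, "auto"))       # only document 1 contains the word "auto"
print(search(index, "automatic"))  # only document 2
```

Because the lookup is by whole word, "auto" finds document 1 only, which is the "words, not substrings" behavior described on the slide.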

Apache Lucene Overview • Lucene Java 2.4 • A high-performance, full-featured text search engine library written entirely in Java. • It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. • No GUI • http://lucene.apache.org

Apache Lucene Overview • Java library for indexing and searching • No dependencies • Works with Java 1.4 or later • Input for indexing: Document objects • Each document: a set of Fields, each with a field name and field content • Stores its index as files on disk or in memory • No document converters • No web crawler

Lucene Java users • HBCU.info • LinkedIn • IBM OmniFind Yahoo! Edition • Technorati.com • Eclipse • Monster.com • …

Lucene Java Summary • Java library for indexing and searching • Lightweight / no dependencies • Powerful and fast and tested! • No document conversion • No GUI

Why? • Cost of Enterprise Search Solution • Need for search speed • Java projects to work on • Things to do

Verity Limitations • 10,000 documents for ColdFusion Developer Edition • 125,000 documents for ColdFusion Standard Edition • 250,000 documents for ColdFusion Enterprise Edition • What do developers do in a shared hosting environment? • Is it possible for the hosting company to limit the number of documents per Web site?

T-SQL Limitations? • Search for "Yahoo" on my blog • SELECT entry.id FROM tbl_mango_entry AS entry INNER JOIN tbl_mango_post AS post ON entry.id = post.id WHERE entry.blog_id = 'default' AND (entry.title LIKE '%yahoo%' OR entry.content LIKE '%yahoo%' OR entry.excerpt LIKE '%yahoo%') AND post.posted_on <= getdate() AND entry.status = 'published' ORDER BY post.posted_on DESC • Multiply that by 10, 100, 500, or 1,000 users/hr?

T-SQL Limitations? • Full table scan = 1 thing: PERFORMANCE KILLER!!! • No relevance sorting of search results • An RDBMS isn't designed to do this, but allows it • Use the right tools!
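The full-scan behavior of a leading-wildcard `LIKE` can be observed directly. The sketch below uses SQLite from Python's standard library as a stand-in, which is an assumption: the slides target SQL Server, whose planner and plan output differ, and the `entry` table here is a made-up, cut-down version of the blog table. The point is only that an index on `title` cannot be used to seek into `'%yahoo%'`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX idx_entry_title ON entry (title)")
conn.executemany(
    "INSERT INTO entry (title) VALUES (?)",
    [("Yahoo search tips",), ("Lucene basics",)],
)

def plan(sql, params=()):
    """Return the human-readable steps of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column holds the description

# A leading wildcard forces SQLite to visit every row (a SCAN step).
wildcard_plan = plan("SELECT id FROM entry WHERE title LIKE ?", ("%yahoo%",))

# An exact match can seek through the index (a SEARCH step).
exact_plan = plan("SELECT id FROM entry WHERE title = ?", ("Lucene basics",))

print(wildcard_plan)
print(exact_plan)
```

The exact wording of the plan varies between SQLite versions, but the wildcard query is reported as a scan and the equality query as an index search.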

How? • GOAT Search Solution • Lucene 2.4.0 • ColdFusion 8 • MX is fine, but the GUI needs to be rolled back • Commons IO 1.4 • Simply packaged .jar files • Simple Web-based GUI

How? • Macromedia JDBC Drivers • Same drivers that ColdFusion uses • No additional drivers to install • Supports RDBMS ONLY • MSSQL • MySQL • Oracle • No File system support (Yet)

Basics? • Indexing extracts both meaning and structure from unstructured information by indexing each document • The index contains a complete list of all the words used in a given document, along with metadata about that document • Lucene creates a collection that normalizes both the structured and unstructured data • Search requests then check these collections rather than scanning the actual documents and database fields • This provides a faster search of information, regardless of the file type and whether the source is structured or unstructured

Basics? • Collection • A special database created by Lucene that contains metadata that describes the documents • Documents • A sequence of fields • Similar to a row in a database table • Row 1 • Row 2, etc. • Fields • A named sequence of terms • Similar to a column in a table • Primary Key • Column 1 • Terms • A term is a string
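The row/column analogy above can be sketched as a small data model. This is plain Python, not the Lucene API: the `Field` and `Document` classes, the `row_to_document` helper, and the sample row are hypothetical names chosen to mirror the slide's vocabulary.

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    """A named sequence of terms, similar to one column value."""
    name: str
    terms: list

@dataclass
class Document:
    """A sequence of fields, similar to one row in a table."""
    fields: list = field(default_factory=list)

    def get(self, name):
        """Return the first field with the given name, or None."""
        for f in self.fields:
            if f.name == name:
                return f
        return None

def row_to_document(row):
    """Turn one database row (column -> value) into a Document."""
    return Document(
        [Field(column, str(value).lower().split()) for column, value in row.items()]
    )

# Hypothetical row: the primary key becomes a field like any other column.
doc = row_to_document({"id": 7, "title": "ColdFusion Developer"})
print(doc.get("title").terms)
print(doc.get("id").terms)
```

Each term is just a string, and keeping the primary key as a field is what lets a search result point back to the original database row.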

Knowledge? • Index • A special database created by Lucene that contains metadata that describes the documents • Query Syntax • Similar to Google's advanced search: • field:value • e.g. resume:coldfusion • http://lucene.apache.org/java/2_4_0/queryparsersyntax.html • Results • Primary Key list of values • XML based on the document • CFX Tag integration
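A rough sketch of how a `field:value` query can resolve to a list of primary keys. It handles a single term only and is not Lucene's query parser or GOAT's implementation; `build_field_index`, `search`, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

def build_field_index(rows, key):
    """Index every (column, term) pair to the primary keys of matching rows."""
    index = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            if column == key:
                continue  # the key identifies the row; it is not searched here
            for term in str(value).lower().split():
                index[(column, term)].add(row[key])
    return index

def search(index, query):
    """Resolve a single 'field:value' query to a sorted list of primary keys."""
    field_name, _, value = query.partition(":")
    return sorted(index.get((field_name.strip(), value.strip().lower()), set()))

# Hypothetical rows pulled from a database table.
rows = [
    {"id": 101, "resume": "ColdFusion and Java developer"},
    {"id": 102, "resume": "Oracle DBA"},
    {"id": 103, "resume": "Senior ColdFusion architect"},
]
index = build_field_index(rows, key="id")
print(search(index, "resume:coldfusion"))
```

The returned keys (101 and 103 here) can then be fed into an ordinary `WHERE id IN (...)` query to fetch the full rows.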

Alternative Solutions for Search • Commercial vendors: • FAST, $100k • Autonomy, $80k • Google, $50k • Commercial search engines based on Lucene • IBM OmniFind Yahoo Edition • RDBMS with Integrated Search • Oracle • MySQL • MSSQL • PERFORMANCE KILLERS

Road Map • Definition: a set of guidelines, instructions, or explanations ("wrote an ethics code as a road map for the behavior of elected officials") • Overhaul Java programming (still novice) • Integrate with other products • Aperture • Nutch • Solr • File system integration • .txt, .pdf, .doc, .ppt, etc. • Geospatial-based searches • e.g. all jobs within a 50-mile radius
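As an illustration of the "all jobs within a 50-mile radius" idea, here is a minimal sketch that filters records by great-circle distance using the haversine formula. It is a brute-force filter over every record, not a spatial index and not a planned GOAT feature as specified; the function names, job records, and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def jobs_within(center, jobs, radius_miles=50):
    """Return the ids of jobs whose location lies within the radius."""
    lat, lon = center
    return [
        job["id"]
        for job in jobs
        if haversine_miles(lat, lon, job["lat"], job["lon"]) <= radius_miles
    ]

# Hypothetical job postings with approximate coordinates.
jobs = [
    {"id": 1, "lat": 30.50, "lon": -84.30},  # a few miles from the center
    {"id": 2, "lat": 31.81, "lon": -85.97},  # well over 100 miles away
]
print(jobs_within((30.44, -84.28), jobs))
```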

