
Goat search

Revorg GOAT Search Solution (Powered by Lucene). About Me: Grover Fields, Revorg, LLC (Owner); M.S. Information System (Troy University); B.S. Industrial Engineering (Florida A&M University); Stanford Project Management Courses.


Presentation Transcript


  1. Goat search Revorg GOAT Search Solution (Powered by Lucene)

  2. About Me Grover Fields • Revorg, LLC (Owner) • M.S. Information System (Troy University) • B.S. Industrial Engineering (Florida A&M University) • Stanford Project Management Courses

  3. About Me • 10+ years of development, analysis, and implementation • 10+ years of ColdFusion experience • 2+ years of Java experience • Commonspot, Strongmail, ClickFix (Developer) • Email: grover_fields@yahoo.com • Web site: http://www.groverfields.com

  4. Agenda • What? • What can we do with GOAT? • Why? • Why do we want to use GOAT and not Verity? • How? • How do we do that? • Conclusion and alternative solutions

  5. What • What is a Search Engine? • Builds an index on text • Answers queries using that index, a la Verity • Existing database already • A search engine offers? • Scalability • Relevance Ranking • Tweaking • Integrates different sources (email, web pages, files, DATABASES)

  6. What is a search engine? (cont.) • Works on words, not on substrings • Auto != automatic, automobile • Indexing process: • Convert document • Extract text and meta data • Normalize text • Write (inverted) index
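The "normalize text" and "write (inverted) index" steps above can be sketched in plain Java. This is a minimal illustration using only the standard library, not Lucene's implementation; the class name TinyInvertedIndex and the sample documents are hypothetical. It shows why a search for "auto" does not return documents that only contain "automatic" or "automobile": lookups match whole normalized words, not substrings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy index: term -> ids of the documents containing that term.
final class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Normalize text: lower-case it, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Write the inverted index: record the document id under every term it contains.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up a single word; only exact whole-word matches are returned.
    SortedSet<Integer> search(String word) {
        List<String> terms = tokenize(word);
        if (terms.size() != 1) {
            return new TreeSet<>();
        }
        return postings.getOrDefault(terms.get(0), new TreeSet<>());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Auto repair shop");
        index.add(2, "Automatic transmission service");
        index.add(3, "Automobile dealers");
        System.out.println(index.search("auto"));      // prints [1]
        System.out.println(index.search("automatic")); // prints [2]
    }
}
```

A real engine adds more normalization (stemming, stop words) and stores positions and frequencies alongside the document ids; this sketch keeps only the word-to-document mapping.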

  7. Apache Lucene Overview • Lucene Java 2.4 • A high-performance, full-featured text search engine library written entirely in Java. • It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. • No GUI • http://lucene.apache.org

  8. Apache Lucene Overview • Java library for indexing and searching • No dependencies • Works with Java 1.4 or later • Input for indexing: Document objects • Each document: set of Fields, field name, field content • Stores its index as files on disk or memory • No document converters • No web crawler

  9. Lucene Java users • HBCU.info • LinkedIn • IBM OmniFind Yahoo! Edition • Technorati.com • Eclipse • Monster.com • …

  10. Lucene Java Summary • Java Library for indexing and searching • Lightweight /no dependencies • Powerful and fast and tested! • No document conversion • No GUI

  11. Why? • Cost of Enterprise Search Solution • Need for search speed • Java projects to work on • Things to do

  12. Verity Limitations • 10,000 documents for ColdFusion Developer Edition • 125,000 documents for ColdFusion Standard Edition • 250,000 documents for ColdFusion Enterprise Edition • What do developers do in a shared hosting environment? • Is it possible for the hosting company to limit the number of documents per Web site?

  13. T-SQL Limitations? • Search for "Yahoo" on my blog • SELECT entry.id FROM tbl_mango_entry AS entry INNER JOIN tbl_mango_post AS post ON entry.id = post.id WHERE entry.blog_id = 'default' AND (entry.title LIKE '%yahoo%' OR entry.content LIKE '%yahoo%' OR entry.excerpt LIKE '%yahoo%') AND post.posted_on <= getdate() AND entry.status = 'published' ORDER BY post.posted_on DESC • Multiply that by 10, 100, 500, or 1000 users/hr?

  14. T-SQL Limitations? • Full table scan = 1 THING • PERFORMANCE KILLER!!! • No search sorting • RDBMS isn’t designed to do this but allows it • Use the right tools!
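To illustrate the "no search sorting" point: a LIKE query can only order results by a column such as the post date, while a search engine orders them by how well they match. The sketch below ranks entries by a plain term-frequency count. This is an assumption made for illustration, not Lucene's actual scoring formula, and the class name TermFrequencyRanker and the sample entries are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical ranker: orders matching entries by how often the term occurs in them.
final class TermFrequencyRanker {

    // Count whole-word, case-insensitive occurrences of term in text.
    static int termFrequency(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Return ids of entries containing the term, highest frequency first, ties broken by id.
    static List<Integer> rank(Map<Integer, String> entries, String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> e : entries.entrySet()) {
            if (termFrequency(e.getValue(), term) > 0) {
                hits.add(e.getKey());
            }
        }
        hits.sort((a, b) -> {
            int byScore = Integer.compare(
                    termFrequency(entries.get(b), term),
                    termFrequency(entries.get(a), term));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return hits;
    }

    public static void main(String[] args) {
        Map<Integer, String> entries = new LinkedHashMap<>();
        entries.put(1, "Yahoo buys a startup");
        entries.put(2, "Yahoo, Yahoo, Yahoo: a review of Yahoo Mail");
        entries.put(3, "Nothing to see here");
        System.out.println(rank(entries, "yahoo")); // prints [2, 1]
    }
}
```

This sketch still reads every entry at query time, so it shows only the ranking idea; the speed gain comes from consulting a prebuilt index instead of scanning the rows.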

  15. How? • GOAT Search Solution • Lucene 2.4.0 • ColdFusion MX 8 • MX is fine but GUI needs to be rolled back • Commons IO 1.4 • Simply packaged .jar files • Simple Web-based GUI

  16. How? • Macromedia JDBC Drivers • Same drivers that ColdFusion uses • No additional drivers to install • Supports RDBMS ONLY • MSSQL • MySQL • Oracle • No File system support (Yet)

  17. Basics? • Indexing extracts both meaning and structure from unstructured information by indexing each document • Contains a complete list of all the words used in a given document along with metadata about that document • Lucene creates a collection that normalizes both the structured and unstructured data. • Search requests then check these collections rather than scanning the actual documents and database fields. • This provides a faster search of information, regardless of the file type and whether the source is structured or unstructured.

  18. Basics? • Collection • A special database created by Lucene that contains metadata that describes the documents • Documents • A sequence of fields • Similar to a row in a database table • Row 1 • Row 2, etc • Fields • A named sequence of terms • Similar to a column in a table • Primary Key • Column 1 • Terms • Is a string
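The row-to-document analogy above can be made concrete with a small sketch. It models a document as an ordered map of field names to text and derives each field's terms by simple normalization. It uses only the Java standard library rather than Lucene's own Document and Field classes, and the class name RowToDocument, the field names, and the sample row are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model: one database row becomes one document made of named fields.
final class RowToDocument {

    // A document is a sequence of fields; here, field name -> field text, in column order.
    static Map<String, String> toDocument(String primaryKey, String title, String content) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", primaryKey);   // the primary key is kept so results can point back to the row
        doc.put("title", title);
        doc.put("content", content);
        return doc;
    }

    // A field is a named sequence of terms; each term is a string.
    static List<String> terms(Map<String, String> doc, String field) {
        List<String> out = new ArrayList<>();
        String text = doc.getOrDefault(field, "");
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc =
                toDocument("42", "ColdFusion Developer", "Ten years of ColdFusion and Java");
        System.out.println(doc.keySet());        // prints [id, title, content]
        System.out.println(terms(doc, "title")); // prints [coldfusion, developer]
    }
}
```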

  19. Knowledge? • Index • A special database created by Lucene that contains metadata that describes the documents • Query Syntax • Similar to Google's advanced search: • field:value • e.g. resume:coldfusion • http://lucene.apache.org/java/2_4_0/queryparsersyntax.html • Results • Primary Key list of values • XML based on the document • CFX Tag integration
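A field:value query that returns a list of primary keys can be sketched as follows. This is a simplified stand-in that handles a single clause only; it is not Lucene's query parser, which supports much more (phrases, boolean operators, wildcards). The class name FieldQueryDemo, the field names, and the key values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical field-scoped index: "field:term" -> primary keys of matching rows.
final class FieldQueryDemo {
    private final Map<String, List<String>> index = new HashMap<>();

    // Index one field of one row under its primary key.
    void add(String primaryKey, String field, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<String> keys = index.computeIfAbsent(field + ":" + term, k -> new ArrayList<>());
            if (!keys.contains(primaryKey)) {
                keys.add(primaryKey);
            }
        }
    }

    // Accept a single "field:value" clause and return the matching primary keys.
    List<String> search(String query) {
        int colon = query.indexOf(':');
        if (colon <= 0 || colon == query.length() - 1) {
            throw new IllegalArgumentException("expected field:value, got: " + query);
        }
        String field = query.substring(0, colon).trim();
        String value = query.substring(colon + 1).trim().toLowerCase(Locale.ROOT);
        return index.getOrDefault(field + ":" + value, Collections.emptyList());
    }

    public static void main(String[] args) {
        FieldQueryDemo demo = new FieldQueryDemo();
        demo.add("101", "resume", "ColdFusion and Java developer");
        demo.add("102", "resume", "Java architect");
        demo.add("103", "title", "ColdFusion lead");
        System.out.println(demo.search("resume:coldfusion")); // prints [101]
        System.out.println(demo.search("resume:java"));       // prints [101, 102]
    }
}
```

The caller can then use the returned keys in an ordinary WHERE id IN (...) query to fetch the full rows from the database.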

  20. Alternative Solutions for Search • Commercial vendors: • FAST, $100k • Autonomy, $80k • Google, $50k • Commercial search engines based on Lucene • IBM OmniFind Yahoo Edition • RDBMS with Integrated Search • Oracle • MySQL • MSSQL • PERFORMANCE KILLERS

  21. RoadMap • Road map: a set of guidelines, instructions, or explanations ("wrote an ethics code as a road map for the behavior of elected officials") • Overhaul Java programming (still novice) • Integrate with other products • Aperture • Nutch • Solr • File system integration • .txt, .pdf, .doc, .ppt, etc. • Geospatial based searches • e.g. all jobs within a 50 mile radius
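The "all jobs within a 50 mile radius" item is a roadmap goal, so the deck does not say how it would be built. One common approach, assumed here, is to compute the great-circle distance with the haversine formula and keep the jobs that fall inside the radius. The class name RadiusSearch, the job ids, and the coordinates are hypothetical, and the Earth radius is the usual mean value of about 3958.8 miles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical radius filter based on the haversine great-circle distance.
final class RadiusSearch {
    static final double EARTH_RADIUS_MILES = 3958.8; // mean Earth radius (approximation)

    // Distance in miles between two latitude/longitude points given in degrees.
    static double distanceMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    // Return the ids of jobs whose {lat, lon} lies within radiusMiles of the given center.
    static List<String> within(Map<String, double[]> jobs, double lat, double lon, double radiusMiles) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, double[]> job : jobs.entrySet()) {
            double[] p = job.getValue();
            if (distanceMiles(lat, lon, p[0], p[1]) <= radiusMiles) {
                hits.add(job.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, double[]> jobs = new LinkedHashMap<>();
        jobs.put("job-near", new double[] {30.5, -84.0}); // about 35 miles north of the center
        jobs.put("job-far", new double[] {32.0, -84.0});  // about 138 miles north of the center
        System.out.println(within(jobs, 30.0, -84.0, 50.0)); // prints [job-near]
    }
}
```

Checking every job this way is a linear scan; a production system would normally narrow the candidates first, for example with a bounding box on indexed latitude and longitude values.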

  22. References • Apache.org • Adobe.com • Ben Forta’s Blog • Slideshare.net • Multiple authors • Other references
