
Nutch Search Engine Tool



  1. Nutch Search Engine Tool

  2. Nutch overview • A full-fledged web search engine • Functionalities of Nutch: • Internet and intranet crawling • Parsing different document formats (PDF, HTML, XML, JS, DOC, PPT, etc.) • Web interface for querying the index • Management of recrawls

  3. Nutch Architecture • 4 main components: • Crawler • Web database (WebDB, LinkDB, segments) • Indexer • Searcher • Crawler and Searcher are highly decoupled, enabling independent scaling • Highly modular, plugin-based architecture
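
For orientation, the crawl directory tied together by these components looks roughly like this in 0.7-era Nutch (layout as produced by the one-step crawl tool; "crawl-tinysite" is an illustrative name):

    crawl-tinysite/
      db/        # the WebDB: page and link entities (slide 7)
      segments/  # one timestamped directory per fetch cycle (slide 12)
      index/     # merged Lucene index served by the Searcher

This split is part of what decouples Crawler and Searcher: the Searcher only needs the index and the segments (for cached copies and snippets), not the WebDB.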

  4. Nutch Architecture Doug Cutting, "Nutch: Open Source Web Search", 22 May 2004, WWW2004, New York

  5. Steps in a Crawl+Index cycle • 1. Create a new WebDB (admin db -create). • 2. Inject root URLs into the WebDB (inject). • 3. Generate a fetchlist from the WebDB in a new segment (generate). • 4. Fetch content from URLs in the fetchlist (fetch). • 5. Update the WebDB with links from fetched pages (updatedb). • 6. Repeat steps 3-5 until the required depth is reached. • 7. Update segments with scores and links from the WebDB (updatesegs). • 8. Index the fetched pages (index). • 9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup). • 10. Merge the indexes into a single index for searching (merge).
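
A hedged sketch of the same cycle on the 0.7-era command line (the command names come from the slide; exact argument syntax varies between Nutch releases, and all paths here are illustrative):

    bin/nutch admin db -create              # 1. create a new WebDB
    bin/nutch inject db -urlfile seeds.txt  # 2. inject root URLs (flag form is illustrative)
    bin/nutch generate db segments          # 3. generate a fetchlist in a new segment
    s=`ls -d segments/2* | tail -1`         #    pick up the newly created segment
    bin/nutch fetch $s                      # 4. fetch content from the fetchlist
    bin/nutch updatedb db $s                # 5. update the WebDB with links from fetched pages
    # ... repeat steps 3-5 until the required depth is reached ...
    bin/nutch updatesegs db segments        # 7. push scores and links from the WebDB into segments
    bin/nutch index $s                      # 8. index each fetched segment
    bin/nutch dedup segments                # 9. eliminate duplicate content/URLs (args illustrative)
    bin/nutch merge index segments/*        # 10. merge per-segment indexes into one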

  6. Crawling (cont.) • Can effectively crawl up to ~100M pages • Crawl statistics for the KReSIT site (it.iitb): • Took 153 minutes for a deep crawl (depth = 10) • Crawled 4171 documents • Size of crawl on disk: 168 MB • Size of index: ~25 MB
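
For a single site, the whole cycle above can also be driven by the one-step crawl tool, as in the Tom White introduction cited at the end; a depth-10 crawl like the one measured here would be launched along these lines (urls is a file of seed URLs, and the -dir name is arbitrary):

    bin/nutch crawl urls -dir crawl.kresit -depth 10 >& crawl.log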

  7. Web Database (WebDB) • Persistent data structure for mirroring the structure and properties of the web graph being crawled • The WebDB stores two types of entities: • Pages • Links • Optimised for frequent updates
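
Assuming the 0.7-era readdb tool, aggregate WebDB statistics can be inspected without touching the index (directory name illustrative, continuing the layout sketched earlier):

    bin/nutch readdb crawl-tinysite/db -stats
    # prints totals such as the number of pages and number of links in the WebDB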

  8. Crawl Structure of it.iitb

  9. Page DB • Page database • Used for fetch scheduling • Contains: • pages indexed and sorted by MD5 and URL • outlinks, fetch information, page score • A set of APIs is provided to perform the various operations

  10. Sample data of PageDB:

  Page 1:
  Version: 4
  URL: http://keaton/tinysite/A.html
  ID: fb8b9f0792e449cda72a9670b4ce833a
  Next fetch: Thu Nov 24 11:13:35 GMT 2005
  Retries since fetch: 0
  Retry interval: 30 days
  Num outlinks: 1
  Score: 1.0
  NextScore: 1.0

  Page 2:
  Version: 4
  URL: http://keaton/tinysite/B.html
  ID: 404db2bd139307b0e1b696d3a1a772b4
  Next fetch: Thu Nov 24 11:13:37 GMT 2005
  Retries since fetch: 0
  Retry interval: 30 days
  Num outlinks: 3
  Score: 1.0
  NextScore: 1.0
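
A page dump like the one above is what the readdb page-dump switch produces (this usage follows the Tom White introduction cited in the references):

    bin/nutch readdb crawl-tinysite/db -dumppageurl
    # prints each page entity: URL, MD5 ID, next-fetch time, retries, score, ...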

  11. Link DB • Link database • Contains: • links sorted by MD5 • links sorted by URL • Represents the full link graph • Stores the anchor text associated with each link • Used for: • link analysis • anchor-text indexing
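
The same tool can dump the link side of the WebDB; a hedged sketch (the exact output format varies by version, but each line identifies a link's source and target URL):

    bin/nutch readdb crawl-tinysite/db -dumplinks
    # e.g. from http://keaton/tinysite/B.html to http://keaton/tinysite/A.html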

  12. Segments • Collection of pages fetched and indexed by the crawler in a single run • One segment directory for each crawl-fetch-update cycle at a particular depth • Contains the raw text and parsed data of the files crawled • Used to return the cached copy of a page and for snippet generation in the results page
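
In the 0.7-era layout, each timestamped segment directory holds this per-cycle data in subdirectories roughly as follows (names per the cited introduction; exact contents vary by version):

    segments/20051025121334/
      fetchlist/   # URLs scheduled for this cycle
      fetcher/     # fetch status for each URL
      content/     # raw fetched content (backs the cached-copy view)
      parse_data/  # parsed metadata and outlinks
      parse_text/  # parsed plain text (used for indexing and snippets)
      index/       # per-segment index, before dedup and merge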

  13. Segments • The segread tool gives a useful summary of all segments (Parsed, Started, Finished, Dir) • It can also be used to dump the segment data in raw text format; the dump switch gives the following details: • Fetcher Output: the <url, hash, fetch-date, ...> entries that go into the WebDB • Content: the raw content, including HTTP headers and other metadata; this is the stored cached copy of a page • ParseData & ParseText: generated from the raw content by the appropriate parser plugin
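
Hedged usage of segread (flags per the 0.7-era CLI; the timestamped path is illustrative):

    bin/nutch segread -list -dir crawl-tinysite/segments/
    # one summary row per segment: parsed?, started, finished, entry count, dir
    bin/nutch segread -dump crawl-tinysite/segments/20051025121334
    # dumps fetcher output, raw content, and parse data/text as plain text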

  14. Nutch API

  15. Plugins • Provide extensions to extension points • Each extension point defines an interface that must be implemented by the extension • Some core extension points: • IndexingFilter: add metadata to indexed fields • Parser: to parse a new type of document • NutchAnalyzer: language-specific analyzers
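
A plugin declares its extensions in a plugin.xml descriptor. The sketch below is illustrative only: the ids, jar name, and class are made up, and descriptor details vary by Nutch version; only the IndexingFilter extension-point name comes from the slide. (A plugin must also be activated via the plugin.includes property in the Nutch configuration before it is loaded.)

    <plugin id="index-example" name="Example Indexing Filter"
            version="1.0.0" provider-name="example.org">
      <runtime>
        <library name="index-example.jar">
          <export name="*"/>
        </library>
      </runtime>
      <!-- hypothetical filter class implementing the IndexingFilter extension point -->
      <extension id="org.example.index.example"
                 name="Example Indexing Filter"
                 point="org.apache.nutch.indexer.IndexingFilter">
        <implementation id="ExampleFilter" class="org.example.index.ExampleFilter"/>
      </extension>
    </plugin>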

  16. References • Nutch Docs: http://lucene.apache.org/nutch/ • Nutch Wiki: http://wiki.apache.org/nutch/ • Prasad Pingali, CLIA consortium, Nutch Workshop, 2007 • Tom White, Introduction to Nutch, java.net website (http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html)
