Nutch in a Nutshell (part I)


Presentation Transcript


  1. Nutch in a Nutshell (part I) Presented by Liew Guo Min Zhao Jin

  2. Outline • Overview • Nutch as a web crawler • Nutch as a complete web search engine • Special features • Installation/Usage (with Demo) • Exercises

  3. Overview • Complete web search engine • Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop) • Java based, open source • Features: • Customizable • Extensible (Next meeting) • Distributed (Next meeting)

  4. Nutch as a crawler (architecture diagram): Initial URLs are fed to the Injector, which writes them into the CrawlDB; the Generator reads the CrawlDB and generates a fetch list in a Segment; the Fetcher gets webpages/files from the Web and writes them to the Segment; the Parser reads/writes the Segment; and the CrawlDBTool updates the CrawlDB from the Segment so the cycle can repeat.

  5. Nutch as a complete web search engine (architecture diagram): the Indexer builds a Lucene index from the CrawlDB, LinkDB and Segments; the Searcher queries that index; the GUI runs on Tomcat.

  6. Special Features • Customizable • Configuration files (XML) • Required user parameters • http.agent.name • http.agent.description • http.agent.url • http.agent.email • Adjustable parameters for every component • E.g. for fetcher: • Threads-per-host • Threads-per-ip
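The required user parameters listed above go into conf/nutch-site.xml (named on slide 9), which overrides conf/nutch-default.xml. A minimal sketch of that file; the values are placeholders, not anything prescribed by the slides:

    <?xml version="1.0"?>
    <!-- conf/nutch-site.xml: site-specific overrides of conf/nutch-default.xml -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>MyNutchCrawler</value>               <!-- placeholder agent name -->
      </property>
      <property>
        <name>http.agent.description</name>
        <value>Course exercise crawler</value>      <!-- placeholder description -->
      </property>
      <property>
        <name>http.agent.url</name>
        <value>http://example.org/crawler</value>   <!-- placeholder URL -->
      </property>
      <property>
        <name>http.agent.email</name>
        <value>crawler-admin@example.org</value>    <!-- placeholder contact -->
      </property>
    </configuration>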

  7. Special Features • URL Filters (Text file) • Regular expressions to filter URLs during crawling (see the filter-file sketch after this slide) • E.g. • To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$ • To accept hosts in a certain domain: +^http://([a-z0-9]*\.)*apache.org/ • Plugin information (XML) • The metadata of the plugins (More details next week)
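A sketch of how such rules sit together in conf/crawl-urlfilter.txt; the two patterns are the ones from the slide, while the top-down, first-match-wins ordering and the trailing catch-all are assumptions based on the stock filter file shipped with Nutch:

    # conf/crawl-urlfilter.txt -- each line is +pattern (accept) or -pattern (reject);
    # rules are assumed to be evaluated top-down, first match wins
    -\.(gif|exe|zip|ico)$
    +^http://([a-z0-9]*\.)*apache.org/
    # reject everything that matched nothing above (assumed catch-all)
    -.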

  8. Installation & Usage • Installation • Software needed • Nutch release • Java • Apache Tomcat (for GUI) • Cgywin (for windows)

  9. Installation & Usage • Usage • Crawling • Initial URLs (text file or DMOZ file) • Required parameters (conf/nutch-site.xml) • URL filters (conf/crawl-urlfilter.txt) • Indexing • Automatic • Searching • Location of files (WAR file, index) • The tomcat server

  10. Demo time!

  11. Exercises • Questions: • What are the things that need to be done before starting a crawl job with Nutch? • What are the ways to tell Nutch what to crawl and what not to crawl? What can you do if you are the owner of a website? • Starting from v0.8, Nutch won’t run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this? • What do you think are good crawling behaviors? • Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate the search index ranking? • What are the advantages of using Nutch instead of commercial search engines?

  12. Answers • What are the things that need to be done before starting a crawl job with Nutch? • Set the CLASSPATH to the Lucene Core • Set the JAVA_HOME path • Create a folder containing the URLs to be crawled • Amend the crawl-urlfilter file • Amend the nutch-site.xml file to include the user parameters

  13. What are the ways to tell Nutch what to crawl and what not to crawl? • URL filters • Depth in crawling • Scoring function for URLs • What can you do if you are the owner of a website? • Web Server Administrators • Use the Robot Exclusion Protocol by adding rules to /robots.txt (see the sketch after this slide) • HTML Authors • Add the Robots META tag (also sketched below)
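The robots.txt rules and META tag referred to above look roughly like this; the agent name and paths are placeholders:

    # /robots.txt -- Robot Exclusion Protocol
    User-agent: MyNutchCrawler     # placeholder agent name
    Disallow: /private/            # placeholder path to keep this crawler out of

    User-agent: *                  # applies to all robots
    Disallow: /                    # block the whole site

and the META tag, placed in a page's <head>:

    <meta name="robots" content="noindex,nofollow">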

  14. Starting from v0.8, Nutch won’t run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this? • To ensure accountability (although tracing is still possible without them) • What do you think are good crawling behaviors? • Be Accountable • Test Locally • Don't hog resources • Stay with it • Share results

  15. Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate the search index ranking? • True, but one can always make changes in Nutch to minimize the effect. • What are the advantages of using Nutch instead of commercial search engines? • Open source • Transparent • Able to define what is returned in searches and how the index ranking works

  16. Exercises • Hands-on exercises • Install Nutch, crawl a few webpages using the crawl command and perform a search on them using the GUI • Repeat the crawling process without using the crawl command (see the command sketch after this slide) • Modify your configuration to perform each of the following crawl jobs and think about when they would be useful: • To crawl only webpages and PDFs but nothing else • To crawl the files on your hard disk • To crawl but not to parse • (Challenging) Modify Nutch such that you can unpack the crawled files in the segments back into their original state
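For the second exercise (repeating the crawl without the one-shot crawl command), one round of the cycle looks roughly like this; the tool names follow the 0.8/0.9-era command line, but treat the sketch as an assumption to check against bin/nutch's own usage output:

    # one round of the crawl cycle, step by step
    bin/nutch inject crawl/crawldb urls                 # seed the CrawlDB from the URL list
    bin/nutch generate crawl/crawldb crawl/segments     # write a fetch list into a new segment
    segment=`ls -d crawl/segments/* | tail -1`          # pick up the segment just created
    bin/nutch fetch $segment                            # fetch (and, by default, parse) the pages
    bin/nutch updatedb crawl/crawldb $segment           # fold newly discovered links into the CrawlDB
    # repeat generate/fetch/updatedb for more depth, then build the link DB and the index
    bin/nutch invertlinks crawl/linkdb crawl/segments/*
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*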

  17. Q&A?

  18. Next Meeting • Special Features • Extensible • Distributed • Feedback and discussion

  19. References • http://lucene.apache.org/nutch/ -- Official website • http://wiki.apache.org/nutch/ -- Nutch wiki (seriously outdated; take it with a grain of salt) • http://lucene.apache.org/nutch/release/ -- Nutch source code • www.nutchinstall.blogspot.com -- Installation guide • http://www.robotstxt.org/wc/robots.html -- The Web Robots Pages

  20. Thank you!
