190 likes | 323 Vues
This document provides a comprehensive introduction to Nutch, an open-source application written in Java designed for web crawling and search engine capabilities. It explains Nutch's architecture, which utilizes components such as the fetcher, indexer, and searcher, as well as its integration with Lucene for text indexing and search functions. The document also outlines how to set up and use Nutch, including configuration tips, URL filtering, and crawling processes. It emphasizes the technical goals of Nutch, such as fetching billions of pages per month and maintaining high-quality search results.
E N D
Zhao Dongsheng 2008.9.29 Introduction to Nutch
Summary • What's Nutch • Nutch's architecture • How to use Nutch • About the first homework
What's Nutch • Written in java • Open-source project • An Application that can build SE • Behind a lot of web sites
What's Nutch • Lucene and Nutch • Nutch grow out of Lucene • Both open-source project • Both written in java • But Lucene is a Java library for text indexing and search • Nutch is an Application • Nutch uses lucene for indexing
Nutch's core components • Fecher • Requests web pages • Parses and extracts links • Web DB • Page DB • Used for fetch sheduling • Link DB • Store link gragh • Store anchor text with each link • Link-analysis and Anchor text indexing
Nutch's core components (cont.) • Indexer • Creates inverted index • Uses Lucene • Searcher • Finds relelant docs quickly • Ranks the docs • Summarizing
Functions Nutch supports • Politeness when crawling • Duplicates removing • PageRank analysis • Distributed searching • Summarizing • ......
Nutch's Technical Goals • Fetch several billion pages per month • Maintain an index of these pages • Search that index up to 1000 times per second • Provide very high quality search results • Operate at minimal cost
Source code & API • Source Dirs • analysis crawl html plugin scoring segment tools fetcher indexer net parse protocol searcher ... • crawl/Crawl.java • fetcher/Fetcher.java
How to use Nutch • Download & unpack • Nutch required JVM • Set environment variables • Configure • Specify root URLs • Specify URLs filters • Optionally specify • Number of threads • Levels to crawl • Fetch delay
How to use Nutch (cont.) • Root URLs Example • http://www.pku.edu.cn • URL Filter Example • crawl-urlfilter.txt • -^(file|ftp|mailto): • -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$ • -[?*!@=] • +^http://([a-z0-9]*\.)*pku.edu.cn/
How to use Nutch (cont.) • Run Nutch • Just a command line • bin/nutch crawl myurl.txt -dir mycrawl -depth 4 >& crawl.log • Use Tomcat to experience!
About the first Homework • About web crawling • Familiar with Nutch & java • Fetch blog/bbs etc ? • Need your advice!
Q & A thanks!