Introduction to Nutch: Architecture, Features, and Usage Guide

Zhao Dongsheng 2008.9.29 Introduction to Nutch

Summary • What's Nutch • Nutch's architecture • How to use Nutch • About the first homework

What's Nutch • Written in Java • Open-source project • An application for building a search engine (SE) • Powers search behind many web sites

What's Nutch • Lucene and Nutch • Nutch grew out of Lucene • Both are open-source projects • Both are written in Java • But Lucene is a Java library for text indexing and search • Nutch is a complete application • Nutch uses Lucene for indexing

Nutch's architecture

Nutch's core components • Fetcher • Requests web pages • Parses pages and extracts links • Web DB • Page DB • Used for fetch scheduling • Link DB • Stores the link graph • Stores anchor text with each link • Supports link analysis and anchor-text indexing
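To make the Link DB idea concrete, here is a minimal in-memory sketch of a store that keeps the link graph together with the anchor text of each link. It is an illustration only, not Nutch's actual on-disk format; the class `LinkDbSketch` and its methods are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory link store (hypothetical, not Nutch's real Link DB).
class LinkDbSketch {
    // One inbound link: the page it comes from and its anchor text.
    static final class Inlink {
        final String fromUrl;
        final String anchor;

        Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    // Target URL -> all links that point at it.
    private final Map<String, List<Inlink>> inlinks = new HashMap<>();

    void addLink(String fromUrl, String toUrl, String anchor) {
        inlinks.computeIfAbsent(toUrl, k -> new ArrayList<>())
               .add(new Inlink(fromUrl, anchor));
    }

    // Number of inbound links, a crude input for link analysis.
    int inlinkCount(String url) {
        List<Inlink> list = inlinks.get(url);
        return list == null ? 0 : list.size();
    }

    // Anchor texts of inbound links, which can be indexed with the target page.
    List<String> anchorsFor(String url) {
        List<String> result = new ArrayList<>();
        for (Inlink link : inlinks.getOrDefault(url, new ArrayList<>())) {
            result.add(link.anchor);
        }
        return result;
    }

    public static void main(String[] args) {
        LinkDbSketch db = new LinkDbSketch();
        db.addLink("http://a.example/", "http://www.pku.edu.cn/", "Peking University");
        db.addLink("http://b.example/", "http://www.pku.edu.cn/", "PKU home");
        System.out.println(db.anchorsFor("http://www.pku.edu.cn/"));
    }
}
```

Keying the map by target URL is a design choice for this sketch: both link counting and anchor-text lookup ask "who points at this page?", so inbound links are the natural unit to store.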

Nutch's core components (cont.) • Indexer • Creates the inverted index • Uses Lucene • Searcher • Finds relevant docs quickly • Ranks the docs • Generates result summaries
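The inverted index that the Indexer builds maps each term to the documents containing it, which is what lets the Searcher find matching docs quickly. The sketch below shows the idea in plain Java. It is a toy, not Lucene's implementation or API; `InvertedIndexSketch`, `addDocument`, and `search` are hypothetical names, and tokenization is simply lowercasing and splitting on non-word characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical, far simpler than Lucene).
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void addDocument(int docId, String text) {
        // Naive tokenizer: lowercase, split on runs of non-word characters.
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Single-term lookup: one map access instead of scanning every document.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Nutch uses Lucene for indexing");
        index.addDocument(2, "Lucene is a Java library");
        System.out.println(index.search("lucene"));
    }
}
```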

Functions Nutch supports • Politeness when crawling • Duplicate removal • PageRank analysis • Distributed searching • Summarizing • ...
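As an illustration of duplicate removal, one simple approach is to compute a signature of each page's content and drop any page whose signature has already been seen. The sketch below assumes exact-content duplicates and an MD5 digest as the signature; the slides do not say which method Nutch uses, and `DedupSketch` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Toy exact-duplicate detector (assumption: MD5 of the page content).
class DedupSketch {
    private final Set<String> seen = new HashSet<>();

    // Hex-encoded MD5 digest of the content, used as its signature.
    static String signature(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if the content is new, false if it duplicates an earlier page.
    boolean addIfNew(String content) {
        return seen.add(signature(content));
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        System.out.println(dedup.addIfNew("same page body"));
        System.out.println(dedup.addIfNew("same page body"));
    }
}
```

A digest only catches byte-identical content; near-duplicates (pages differing in a timestamp or an ad) would need a fuzzier signature.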

Nutch's Technical Goals • Fetch several billion pages per month • Maintain an index of these pages • Search that index up to 1000 times per second • Provide very high quality search results • Operate at minimal cost
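To get a feel for the fetch goal, it helps to convert pages per month into a sustained rate per second. The sketch below does that arithmetic; the figure of one billion pages and the 30-day month are assumptions chosen for illustration (the slide only says "several billion"), and `CrawlRateSketch` is a hypothetical name.

```java
// Back-of-the-envelope conversion from a monthly page budget to a per-second rate.
class CrawlRateSketch {
    static double pagesPerSecond(double pagesPerMonth, int daysPerMonth) {
        double secondsPerMonth = daysPerMonth * 24.0 * 60.0 * 60.0;
        return pagesPerMonth / secondsPerMonth;
    }

    public static void main(String[] args) {
        // Assumed inputs: 1 billion pages, 30-day month.
        System.out.printf("%.1f pages/second%n", pagesPerSecond(1e9, 30));
    }
}
```

Each billion pages per month works out to roughly 386 pages fetched every second, around the clock, so "several billion" implies a sustained rate in the low thousands of pages per second.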

Source code & API • Source Dirs • analysis crawl html plugin scoring segment tools fetcher indexer net parse protocol searcher ... • crawl/Crawl.java • fetcher/Fetcher.java

How to use Nutch • Download & unpack • Nutch requires a JVM • Set environment variables • Configure • Specify root URLs • Specify URL filters • Optionally specify • Number of threads • Levels to crawl • Fetch delay

How to use Nutch (cont.) • Root URLs Example • http://www.pku.edu.cn • URL Filter Example • crawl-urlfilter.txt • -^(file|ftp|mailto): • -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$ • -[?*!@=] • +^http://([a-z0-9]*\.)*pku.edu.cn/
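The filter rules above can be read as an ordered list: a leading "-" rejects and a leading "+" accepts any URL that the regular expression matches. The sketch below applies them under three assumptions that are not stated on the slide: rules are tried top to bottom, the first match decides, and a URL that matches no rule is rejected. `UrlFilterSketch` is a hypothetical class written for this example, not Nutch's own filter code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Toy evaluator for crawl-urlfilter.txt style rules (assumed first-match-wins).
class UrlFilterSketch {
    static final class Rule {
        final boolean accept;   // true for "+", false for "-"
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    UrlFilterSketch(String... lines) {
        for (String line : lines) {
            rules.add(new Rule(line));
        }
    }

    boolean accepts(String url) {
        for (Rule rule : rules) {
            // find() allows a partial match, so ^ and $ in the rules do the anchoring.
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // assumption: no matching rule means the URL is skipped
    }

    // The four rules from the slide, with backslashes escaped for Java strings.
    static UrlFilterSketch slideExample() {
        return new UrlFilterSketch(
            "-^(file|ftp|mailto):",
            "-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
                + "|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
            "-[?*!@=]",
            "+^http://([a-z0-9]*\\.)*pku.edu.cn/");
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = slideExample();
        System.out.println(filter.accepts("http://www.pku.edu.cn/"));
        System.out.println(filter.accepts("http://www.pku.edu.cn/logo.gif"));
    }
}
```

Under these assumptions the order of the rules matters: the reject rules come first so that images, archives, and query-string URLs are dropped even when they live on pku.edu.cn.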

How to use Nutch (cont.) • Run Nutch • Just one command line • bin/nutch crawl myurl.txt -dir mycrawl -depth 4 >& crawl.log • Use Tomcat to try out the search interface!

Home page

Search result

Score Explanation

Anchor texts with a link

About the first homework • About web crawling • Get familiar with Nutch & Java • Fetch blogs, BBS forums, etc.? • Need your advice!

Q & A • Thanks!
