Web Crawler

Workshop, 3rd Semester Web Crawler

What is a Web Crawler? • A crawler could be part of: • a search engine, where every page is searched for a search string • a spam agent that visits all sites and collects email addresses • a site map creation application

Goal for the workshop • Develop a web crawler that takes a URI as input and visits all pages at the identified site. • The web crawler must search HTML documents for links and build a graph containing documents as vertices (nodes) and links as edges.
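
The goal above can be sketched as a breadth-first traversal. This is only an illustration in Python, not the .NET solution the workshop asks for. The names `crawl`, `fetch_page`, and `fake_site`, and the `example.test` URLs, are made up for the example. The page fetcher is passed in as a function so the loop can be tried without any network access.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit

# Naive link pattern, good enough for a sketch (assumption: quoted href values).
HREF = re.compile(r"""href\s*=\s*["'](.*?)["']""", re.IGNORECASE | re.DOTALL)


def crawl(start_url, fetch_page):
    """Breadth-first crawl of one site.

    fetch_page(url) must return the page's HTML as a string, or None on failure.
    Returns a dict mapping each visited URL to the set of URLs it links to.
    """
    site = urlsplit(start_url).netloc
    graph = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in graph:
            continue  # already visited
        graph[url] = set()
        html = fetch_page(url)
        if html is None:
            continue
        for href in HREF.findall(html):
            # Resolve relative links and drop "#fragment" parts.
            target, _fragment = urldefrag(urljoin(url, href))
            graph[url].add(target)
            # Only follow links that stay on the starting site.
            if urlsplit(target).netloc == site and target not in graph:
                queue.append(target)
    return graph


# Hypothetical three-page site held in memory instead of on a server.
fake_site = {
    "http://example.test/": '<a href="a.html">A</a> <a href="http://other.test/">out</a>',
    "http://example.test/a.html": '<a href="/">home</a> <a href="b.html#top">B</a>',
    "http://example.test/b.html": "no links here",
}
result = crawl("http://example.test/", fake_site.get)
```

The link to `other.test` is recorded as an edge but never visited, which is one possible reading of "all pages at the identified site".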

Subjects • Graphs • HTML • Simple parsing • HTTP and TCP • Design patterns

Graphs • To represent the site you can use a directed graph: • HTML documents being represented by vertices (nodes) • links being represented by edges
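
One possible representation of such a directed graph is an adjacency structure that maps each document to the set of documents it links to. The sketch below is in Python rather than .NET, and the class name `SiteGraph`, its methods, and the page names are assumptions made for the illustration.

```python
class SiteGraph:
    """Directed graph: vertices are document URLs, edges are links."""

    def __init__(self):
        # Maps each vertex to the set of vertices it links to.
        self._successors = {}

    def add_vertex(self, url):
        self._successors.setdefault(url, set())

    def add_edge(self, source, target):
        # Adding an edge also registers both end points as vertices.
        self.add_vertex(source)
        self.add_vertex(target)
        self._successors[source].add(target)

    def vertices(self):
        return set(self._successors)

    def links_from(self, url):
        return set(self._successors.get(url, ()))


graph = SiteGraph()
graph.add_edge("/index.html", "/about.html")
graph.add_edge("/about.html", "/index.html")  # a link back: direction matters
graph.add_edge("/index.html", "/contact.html")
```

Storing successors in a set means that several links from one page to the same target collapse into a single edge. Whether that is wanted is a design decision for each team.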

HTML • Every page the crawler visits is an HTML document. • The HTML document should be searched for link (anchor) tags • <a href="http://www.prenhall.com">Prentice Hall</a> • Frames (framesets) are to be treated as individual, independent HTML pages
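
To make the two kinds of link concrete, here is a small Python sketch that collects `href` values from `<a>` tags and `src` values from `<frame>` tags using the standard library's `html.parser`. The workshop itself expects a hand-written parser in .NET, so this only shows what the result should look like. `LinkCollector`, `extract_links`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags and src values of <frame> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lowercase.
        wanted = {"a": "href", "frame": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Made-up page: one anchor plus a frameset with two frames.
page = """
<html><body>
<a href = "http://www.prenhall.com">Prentice Hall</a>
<frameset><frame src="menu.html"><frame src="main.html"></frameset>
</body></html>
"""
links = extract_links(page)
```

Each frame source is returned like an ordinary link, so the crawler can fetch it as an independent page.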

Simple parsing • The simple version searches the HTML documents for links by using "brute force" • Other techniques are worth investigating • state machines • regular expressions • recursive descent
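
The difference between the brute-force approach and one of the alternatives can be sketched as follows, in Python rather than .NET. The function names `links_brute_force` and `links_regex` and the sample string are assumptions. Both versions are deliberately naive: they also match `href` attributes on tags other than `<a>`, and text inside comments.

```python
import re


def links_brute_force(html_text):
    """Scan for 'href', then take whatever sits between the next pair of quotes."""
    links = []
    # Assumes lowercasing keeps the string length, which holds for ASCII markup.
    lowered = html_text.lower()
    pos = lowered.find("href")
    while pos != -1:
        start = pos + len("href")
        # Skip optional whitespace and the '=' sign.
        while start < len(html_text) and html_text[start] in " \t\r\n=":
            start += 1
        if start < len(html_text) and html_text[start] in "\"'":
            quote = html_text[start]
            end = html_text.find(quote, start + 1)
            if end != -1:
                links.append(html_text[start + 1:end])
                start = end
        pos = lowered.find("href", start)
    return links


# The same idea as a regular expression: href, '=', a quote, the value,
# and the same quote character again.
HREF_PATTERN = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)


def links_regex(html_text):
    return [match.group(2) for match in HREF_PATTERN.finditer(html_text)]


sample = (
    '<a href = "http://www.prenhall.com">Prentice Hall</a> '
    "<A HREF='page2.html'>Next</A>"
)
brute_result = links_brute_force(sample)
regex_result = links_regex(sample)
```

The regular expression is shorter, but the brute-force loop shows the character-by-character decisions that a state machine would make explicit.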

Communication • For communication, use the .NET Socket class • HTTP and TCP • URI, URL etc.
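
How HTTP sits on top of a TCP socket can be illustrated with a minimal GET request. The sketch uses Python's `socket` module as a stand-in for the .NET Socket class. The function names `build_request`, `split_response`, and `fetch` are made up for the example. It assumes plain HTTP (no TLS) and speaks HTTP/1.0, so the server closes the connection when the response is complete and does not use chunked transfer encoding.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Return (host, port, request_bytes) for a plain HTTP/1.0 GET."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_header = host if port == 80 else f"{host}:{port}"
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request.encode("ascii")


def split_response(raw):
    """Split raw response bytes into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(lines[0].split()[1])  # status line, e.g. "HTTP/1.0 200 OK"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def fetch(url, timeout=10.0):
    """Open a TCP connection, send the request, read until the server closes."""
    host, port, request = build_request(url)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Only `fetch` touches the network. `build_request` and `split_response` can be exercised on their own, which also makes them easy to unit test.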

Design patterns • Design patterns should be applied wherever appropriate

The program • Overall structure • Graph • HTML • HTTP • Crawler

The Crawler

The Graph Package

The Communication Package

The HTML Package

Test site • http://public.noea.dk/fen/testsite/

Schedule for the Week
• Day 1 – Monday: .NET Socket class and graph implementation. (PQC)
• Day 2 – Tuesday: Working with practical HTML parsing and coupling it with the HTTP class. Parsing techniques. (CHJO)
• Day 3 – Wednesday: Continue working on your own.
• Day 4 – Thursday: Putting it all together: the graph, HTML and HTTP parts. The crawler algorithm. (PQC)
• Day 5 – Friday: Finishing up and preparing the presentation. (CHJO)
• Presentation and evaluation: Thursday, April 29, 8.30 – 10.00

Evaluation • Nothing in writing, but every team gives a presentation: • Results are presented in a PowerPoint presentation: • class model • design issues etc. • A demo • Review and discussion of interesting code fragments • Approximately 20 minutes per team • Participation in the evaluation is mandatory. • If you do not attend the evaluation, you must hand in a report.
