A Web Crawler Design for Data Mining • Mike Thelwall, University of Wolverhampton, Wolverhampton, UK • Journal of Information Science, 2001 • 27 April 2011, Presentation @ IDB Lab Seminar • Presented by Jee-bum Park
Outline • Introduction • Architecture • Implementation • System Testing • Conclusion
Introduction - Motive • The importance of the web has guaranteed academic interest in it, not only for its associated technologies, but also for its content
Introduction - Motive • Information scientists and others wish to perform data mining on large numbers of web pages • They will require the services of a web crawler • To extract patterns from the web • To extract meaning from the link structure of the web • Hence the need for an effective paradigm for a web mining crawler
Introduction - Web Crawler • A web crawler, robot or spider • A program that is capable of iteratively and automatically • Downloading web pages • Extracting URLs from their HTML • Fetching the extracted URLs in turn (a minimal sketch follows)
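To make the download / extract / fetch cycle concrete, here is a minimal illustrative sketch in Python. It is not the paper's implementation (the paper gives no code); the names `crawl`, `LinkExtractor`, and `max_pages`, and the single-threaded structure, are assumptions made for the example.

```python
# Minimal sketch of the crawl cycle: download a page, extract URLs from its
# HTML, and queue those URLs to be fetched in turn.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=50):
    queue, seen, fetched = deque([seed_url]), {seed_url}, 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip pages that fail to download
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```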
Introduction - Web Crawler: Workflow • [Workflow diagram: the crawler fetches a seed page such as http://idb.snu.ac.kr/, extracts its links, and fetches those pages in turn]
Introduction - Web Crawler: Roles • A sophisticated web crawler may also perform tasks such as • Identifying pages judged relevant to the crawl • Rejecting pages as duplicates of ones previously visited • Supporting the action of search engines • For example, constructing the searchable index
Introduction - Web Crawler: Issue • In the normal course of operation, a simple crawler will spend most of its time awaiting data • Requesting a web page • Receiving a web page • For this reason, crawlers are normally multi-threaded (see the sketch below) • If the crawling task requires more complex processing, the speed of the crawler will be reduced • A distributed approach for crawlers is needed
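As an illustration of why multi-threading hides network latency, here is a hedged sketch of a worker-pool downloader. The paper does not specify its threading code; the queue-plus-workers structure and all names (`worker`, `NUM_THREADS`) are assumptions for this example.

```python
# Sketch of a multi-threaded fetcher: several workers share one URL queue,
# so the crawler keeps working while any single request waits on the network.
import queue
import threading
from urllib.request import urlopen

NUM_THREADS = 8
url_queue = queue.Queue()
results = {}
results_lock = threading.Lock()

def worker():
    while True:
        url = url_queue.get()
        if url is None:            # sentinel: no more work for this thread
            url_queue.task_done()
            break
        try:
            page = urlopen(url, timeout=10).read()
        except Exception:
            page = b""             # record failures as empty pages
        with results_lock:
            results[url] = page
        url_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for url in ["http://idb.snu.ac.kr/"]:    # URLs to fetch
    url_queue.put(url)
for _ in threads:
    url_queue.put(None)                   # one sentinel per worker
url_queue.join()
for t in threads:
    t.join()
```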
Introduction - Distributed Systems • Using idle computers connected to the internet • To gain extra processing power • To distribute processing power • For personal site-specific crawlers, a single personal computer solution may be fast enough • An alternative is a distributed model • A central control unit • Many crawlers operating on individual personal computers
Outline • Introduction • Architecture • Implementation • System Testing • Conclusion
Architecture • The crawler/analyzer units • The control unit • Four constraints • Almost all processing should be conducted on idle computers • The distributed architecture should not increase network traffic • The system must be able to operate through a firewall • The components must be easy to install and remove
Architecture - The Crawler/Analyzer Units • The program will • Crawl a site or set of sites • Analyze the pages • Report its results • It can execute on the type of computers on which there will be spare time, normally personal computers
Architecture - The Crawler/Analyzer Units: Data Management • Accessing permanent storage space to save the web pages • Linking to a database • Using the normal file storage system • Pages must be saved on each host computer, in order to minimize network traffic • If the system is capable of handling enough data, a large-scale server-based database can be used • It must provide a facility for the user to delete all saved data (see the sketch below)
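A minimal sketch of the per-host file storage described above, assuming a local directory and a hash-derived file name; the paper does not prescribe this layout, so the directory name and helper functions are illustrative only.

```python
# Sketch of local page storage on a crawler host: pages are written to a
# directory on the host's own disk (minimizing network traffic), and one call
# removes everything, giving the user a way to delete all saved data.
import hashlib
import pathlib
import shutil

STORE_DIR = pathlib.Path("crawler_store")   # hypothetical local directory

def save_page(url: str, html: bytes) -> pathlib.Path:
    STORE_DIR.mkdir(exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = STORE_DIR / name
    path.write_bytes(html)
    return path

def clear_all_data() -> None:
    """Delete every page this host has saved."""
    shutil.rmtree(STORE_DIR, ignore_errors=True)
```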
Architecture - The Crawler/Analyzer Units: Interface • Immediate stop • Clear all data from the computer
Architecture - The Control Unit • The control unit will live on a web server • It is triggered when a crawler unit requests a job or sends some data • It will need to store the commands the owner wishes to be executed (see the sketch below), indicating their status • Completed • In progress • Unallocated
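A hedged sketch of the control unit's job store, assuming a simple in-memory list of commands; the paper describes the behaviour (hand out unallocated jobs, record progress) but not this code, and the class and status names are illustrative.

```python
# Sketch of the control unit's job bookkeeping: commands move from
# "unallocated" to "in progress" when a crawler unit asks for work, and to
# "completed" when that unit reports its results back.
UNALLOCATED, IN_PROGRESS, COMPLETED = "unallocated", "in progress", "completed"

class ControlUnit:
    def __init__(self, commands):
        # each command could be a site to crawl plus the analysis to run on it
        self.jobs = [{"command": c, "status": UNALLOCATED} for c in commands]

    def request_job(self):
        """Called when a crawler unit polls the web server for work."""
        for job in self.jobs:
            if job["status"] == UNALLOCATED:
                job["status"] = IN_PROGRESS
                return job["command"]
        return None  # nothing left to allocate

    def report_result(self, command, summary):
        """Called when a crawler unit sends its results back."""
        for job in self.jobs:
            if job["command"] == command:
                job["status"] = COMPLETED
                job["summary"] = summary
                return

# Example usage:
control = ControlUnit(["crawl http://idb.snu.ac.kr/"])
job = control.request_job()
control.report_result(job, summary={"pages": 0})
```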
Outline • Introduction • Architecture • Implementation • System Testing • Conclusion
Implementation - The Crawler/Analyzer Units • The architecture was employed to create a system for analyzing the link structure of university web sites
Implementation - The Crawler/Analyzer Units • Previous system • Running a single crawler/analyzer program • Issues • It did not run quickly enough • It had to be individually set up and run on a number of computers • Inefficient in terms of both human time and processor use! • New system • The existing stand-alone crawler was used as the basis • Communication and easy installation features were added • Buttons to instantly close the program and remove any saved data • Processed by a compressor for easy distribution
Implementation - The Crawler/Analyzer Units • Choice of the types of checking for duplicate pages • No page checking • HTML page checking • Weak HTML page checking • Comparison methods • Comparing each page against all of the others (naive) • Calculating various numbers from the text of each page and comparing those • For example, the length of the page, an MD5 or SHA-1 hash, etc. (a sketch follows)
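A hedged sketch of the number-per-page idea: compute a digest of each page's text once and compare digests instead of whole pages. The paper does not define "weak" HTML checking precisely, so the whitespace normalization shown here is only one possible interpretation; function names are illustrative.

```python
# Sketch of hash-based duplicate checking: pages with the same digest are
# treated as duplicates, avoiding a naive page-against-every-page comparison.
import hashlib
import re

def page_digest(html: str, weak: bool = False) -> str:
    if weak:
        # "weak" checking here: ignore differences in whitespace only
        html = re.sub(r"\s+", " ", html).strip()
    return hashlib.md5(html.encode("utf-8")).hexdigest()

seen_digests = set()

def is_duplicate(html: str, weak: bool = False) -> bool:
    digest = page_digest(html, weak)
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```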
Implementation - The Control Unit • Entirely new! • It was given a reporting facility • Statistics • To deliver a summary of the crawlers' activity
Outline • Introduction • Architecture • Implementation • System Testing • Conclusion
System Testing • Testing took place in June and July of 2000 • Each task consisted of • A set of sites or web pages to download • An analysis to perform on the downloaded sites
System Testing - Result • The total number of crawler units • Peaked at just over 100 with three rooms of computers • 9112 tasks completed by the system • Over 100,000 pages downloaded • Each crawler used approximately 1 GB of hard disk space • The system had become a virtual computer with over 100 GB of disk space and over 100 processors
System Testing - Limitations • The system was not able to run fully automatically • The problem was randomly generated web pages • For example, a huge set of web pages containing usage statistics for electronic equipment, with one page per device per day • The solution was • To manually check the root cause of the problem • To add the offending URLs to a banned list operated by the control unit • There is the alternative of designing a heuristic to avoid such problems • For example, a maximum crawl depth (see the sketch below)
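A minimal sketch of the two safeguards mentioned above, a banned-URL list and a maximum crawl depth; the depth limit and the example prefix are purely illustrative, not values from the paper.

```python
# Sketch: skip URLs that the control unit has banned, and stop descending
# once a configurable crawl depth is exceeded, so that endlessly generated
# pages (e.g. one page per device per day) cannot trap the crawler.
MAX_DEPTH = 10                        # illustrative limit, not from the paper
banned_prefixes = [
    "http://example.edu/stats/",      # hypothetical auto-generated statistics pages
]

def should_crawl(url: str, depth: int) -> bool:
    if depth > MAX_DEPTH:
        return False
    return not any(url.startswith(prefix) for prefix in banned_prefixes)
```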
Outline • Introduction • Architecture • Implementation • System Testing • Conclusion
Conclusion • The distributed architecture has shown itself • Capable of crawling a large collection of web sites • By using idle processing power and disk space • The testing of the system has shown that • It cannot operate fully automatically • Without an effective heuristic for identifying duplicate pages
Conclusion • The architecture is particularly suited to situations • Where a task can be decomposed into a collection of crawling-based tasks • It would be unsuitable if • The crawls had to cross-reference each other • The data mining had to be performed in an integrated way • The architecture is an effective way to use idle computing resources in order to perform large-scale web data mining tasks
Thank You! Any Questions or Comments?