Lecture 16: Web Search Engines
CS 502: Computing Methods for Digital Libraries
Administration
• Modem cards for laptops: collect from Upson 311
• Assignment 3: due April 4 at 10 p.m.
Web Crawlers
A crawler builds an index of web pages by repeating a few basic steps (a minimal sketch follows the list):
• Maintain a list of known URLs, whether or not the corresponding pages have been indexed yet.
• Select from the list the URL of an HTML page that has not been indexed.
• Retrieve the page and bring it back to a central computer.
• Run an automatic indexing program to create an index record, and add the record to the overall index.
• Add the hyperlinks from the page to other pages to the list of URLs for future exploration.
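A minimal sketch of this loop in Python, using only the standard library. The seed URL and the trivial index_page routine are hypothetical placeholders; a production crawler would add politeness delays, robots.txt checks, and parallel fetching.

```python
# Minimal single-threaded crawler sketch (hypothetical seed URL and indexer).
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def index_page(url, html):
    # Placeholder for the automatic indexing program.
    print(f"indexed {url} ({len(html)} bytes)")

def crawl(seed_url, max_pages=10):
    known = [seed_url]                  # the list of known URLs
    indexed = set()
    while known and len(indexed) < max_pages:
        url = known.pop(0)              # select an unindexed URL
        if url in indexed:
            continue
        try:
            with urllib.request.urlopen(url) as resp:  # retrieve the page
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                    # skip pages that fail to load
        index_page(url, html)           # create an index record
        indexed.add(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        known.extend(extractor.links)   # queue hyperlinks for future exploration

crawl("http://www.example.com/")
```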
Web Crawlers
Design questions:
• What to collect: complex web sites, dynamic pages
• How fast to collect: frequency of sweep, how often to try
• How to manage parallel crawlers
Robots Exclusion
Example file: /robots.txt

# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/        # these will soon disappear
Disallow: /foo.html

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
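A well-behaved crawler checks this file before fetching pages. A minimal sketch using Python's standard urllib.robotparser, applied to the example file above:

```python
# Check the robots exclusion rules before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()   # fetch and parse the robots.txt file

# The rules for "User-agent: *" block /tmp/ for an unnamed crawler...
print(rp.can_fetch("*", "http://www.example.com/tmp/page.html"))            # False
# ...but cybermapper has an empty Disallow:, so it may fetch anything.
print(rp.can_fetch("cybermapper", "http://www.example.com/tmp/page.html"))  # True
```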
Automatic Indexing
• Automatic indexing at its most basic: millions of pages, created by thousands of people with different concepts of how information should be structured.
• Typical web pages provide meager clues for automatic indexing.
• Some creators and publishers are even deliberately misleading: they fill their pages with terms that are likely to be requested by users.
An Example: AltaVista 1997
Results for the query digital library concepts:

Key Concepts in the Architecture of the Digital Library
William Y. Arms, Corporation for National Research Initiatives, Reston, Virginia...
http://www.dlib.org/dlib/July95/07arms.html - size 16K - 7-Oct-96 - English

Repository References
Notice: HyperNews at union.ncsa.uiuc.edu will be moving to a new machine and domain very soon. Expect interruptions. Repository References. This is a page.
http://union.ncsa.uiuc.edu/HyperNews/get/www/repo/references.html - size 5K - 12-May-95 - English
Meta Tags
Elements within the HTML <head>:

<meta name="publisher" content="OCLC">
<meta name="creator" content="Weibel, Stuart L.">
<meta name="creator" content="Miller, Eric J.">
<meta name="title" content="Dublin Core Reference Page">
<meta name="date" content="1996-05-28">
<meta name="form" content="text/html">
<meta name="language" content="en">
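An indexing program can harvest these tags with a short parser. A minimal sketch using Python's standard html.parser; the sample page string is a fragment of the example above:

```python
# Extract <meta name="..." content="..."> pairs from a page's <head>.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = []   # (name, content) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if "name" in attr and "content" in attr:
                self.fields.append((attr["name"], attr["content"]))

page = '''<head>
<meta name="creator" content="Weibel, Stuart L.">
<meta name="title" content="Dublin Core Reference Page">
</head>'''

parser = MetaExtractor()
parser.feed(page)
print(parser.fields)
# [('creator', 'Weibel, Stuart L.'), ('title', 'Dublin Core Reference Page')]
```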
Searching the Web Index
Web search programs use standard methods of information retrieval, adapted to the web:
• Index records are of low quality.
• Users are untrained.
• Therefore search programs identify all records that even vaguely match the query, and supply them to the user in ranked order (see the sketch below).
• Indexes are organized for efficient searching by large numbers of simultaneous users.
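A toy illustration of "match vaguely, then rank": any record sharing a term with the query matches, and matches are ranked by how many query terms they contain. The three documents and the scoring rule are hypothetical; real engines use far richer ranking functions.

```python
# Toy ranked retrieval over an in-memory collection (hypothetical documents).
docs = {
    "d1": "digital library concepts and architecture",
    "d2": "library opening hours",
    "d3": "digital photography tips",
}

def search(query):
    terms = set(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        hits = terms & set(text.lower().split())
        if hits:                        # vague match: any shared term qualifies
            scores[doc_id] = len(hits)
    # Supply the matches in ranked order: most query terms first.
    return sorted(scores, key=scores.get, reverse=True)

print(search("digital library"))   # ['d1', 'd2', 'd3']; d1 matches both terms
```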
Searching the Web Index • Difficulties: • User interface • Duplicate elimination • Ranking algorithms
Page Ranks (Google)
Write the links among six pages P1 ... P6 as a matrix: one column per citing page, one row per cited page, with an entry of 1 where the citing page links to the cited page. Totalling the entries of the example matrix:

Page               P1  P2  P3  P4  P5  P6
Links from page     2   1   4   1   2   2
Links to page       3   1   1   4   1   2
Normalize by Number of Links from Page
Divide each entry by the total number of links from its citing page, i.e. by its column total. Call the resulting matrix B. Each citing page now distributes a total weight of 1 across the pages it cites; a page with a single outgoing link (here P2 and P4) passes a full 1 to the one page it cites.
Weighting of Pages
Initially all pages have weight 1:
    w1 = (1, 1, 1, 1, 1, 1)ᵀ
Recalculate the weights by multiplying by B:
    w2 = B w1
Iterate until the weights converge:
    w = B w
Google Ranks
• w is the principal eigenvector of B (the eigenvector belonging to the largest eigenvalue).
• It ranks the pages by the links to them, normalized by the number of links from each citing page and weighted by the rank of the citing pages.
• Google:
• calculates the ranks for all pages (about 450 million at the time)
• lists hits in rank order
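A power-iteration sketch in Python with NumPy. Only the link counts survive in these notes, so the 0/1 link pattern below is invented to be consistent with them (2, 1, 4, 1, 2, 2 links from pages P1 ... P6); it is a hypothetical stand-in, not the original example.

```python
# Power iteration for the weight vector satisfying w = Bw.
import numpy as np

# links[i][j] = 1 if citing page Pj links to cited page Pi
# (made-up pattern, consistent with the slide's link counts).
links = np.array([
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
], dtype=float)

# Normalize each column by the number of links from that page;
# every column of B then sums to 1, so multiplying by B preserves total weight.
B = links / links.sum(axis=0)

w = np.ones(6)                  # initially all pages have weight 1
for _ in range(100):
    w_next = B @ w              # recalculate weights
    if np.allclose(w, w_next):  # stop when w = Bw (to within rounding)
        break
    w = w_next

# Pages listed in rank order, highest weight first.
print([f"P{i + 1}" for i in np.argsort(-w)])
```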
Computer Science Research
• Academic research
• Industrial R&D
• Entrepreneurs
Example: Web Search Engines • Lycos (Mauldin, Carnegie Mellon) • Technical basis: • Research in text-skimming (Ph.D. thesis) • Pursuit free text retrieval engine (TREC) • Robot exclusion research (private interest) • Organizational basis: • Center for Machine Translation • Grant flexibility (DARPA)
Example: Web Search Engines • Google (Page and Brin, Stanford) • Technical basis: • Research in ranking hyperlinks (Ph.D. research) • Organizational basis: • Grant flexibility (NSF Digital Libraries Initiative) • Equipment grant (Hewlett Packard)
The Internet Graph • Theoretical research in graph theory • Six degrees of separation • Pareto distributions • Algorithms • Hubs and authorities (Kleinberg, Cornell) • Empirical data • Commercial (Yahoo!, Google, Alexa, AltaVista, Lycos) • Not-for-profit (Internet Archive)
Google Statistics
• The central system handles 5.5 million searches daily, increasing 20% per month.
• 2,500 PCs running Linux; 80 terabytes of spinning disk; an average of 30 new machines added per day.
• The cache holds about 200 million HTML pages.
• The aim is to crawl the web once per month.
• 85 people; half are technical; 14 have a Ph.D. in computer science.
• Comparison: Yahoo! has 100,000,000 registered users and dispatches half a billion pages to users per day.