How to Crawl the Web
Hector Garcia-Molina, Stanford University
Joint work with Junghoo Cho
Stanford InterLib Technologies • Physical Barriers → Mobile Access • Economic Concerns → IP Infrastructure • Information Loss → Archival Repository • Information Overload → Value Filtering • Service Heterogeneity → Interoperability
WebBase Architecture [diagram: web crawlers retrieve pages from the WWW into the repository; indexes and a feature repository are built over it; clients access the collection through the WebBase API and a multicast engine]
What is a Crawler? [diagram: init seeds a “to visit” list with initial URLs; the loop repeatedly gets the next URL, gets the page from the web, and extracts its URLs; new URLs join “to visit”, fetched URLs move to “visited”, and pages are stored]
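The loop in the diagram can be sketched in a few lines. This is a minimal illustration, not the WebBase implementation: the in-memory WEB table stands in for real HTTP fetching and link extraction, and all names here are invented.

```python
# Minimal sketch of the crawl loop: to-visit queue, visited set, page store.
from collections import deque

WEB = {  # url -> (page text, outgoing links); stand-in for the real web
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["a", "c"]),
    "c": ("page c", []),
}

def crawl(initial_urls):
    to_visit = deque(initial_urls)   # init: seed the frontier
    visited = set()                  # visited urls
    pages = {}                       # stored web pages
    while to_visit:
        url = to_visit.popleft()     # get next url
        if url in visited or url not in WEB:
            continue
        text, links = WEB[url]       # get page
        visited.add(url)
        pages[url] = text
        for link in links:           # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages

pages = crawl(["a"])
```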
Crawling at Stanford • WebBase Project • BackRub Search Engine, Page Rank • Google • New WebBase Crawler
Crawling Issues (1) • Load at visited web sites • robots.txt files • space out requests to a site • limit number of requests to a site per day • limit depth of crawl • multicast
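The per-site politeness rules above can be sketched as simple bookkeeping. The class and method names are invented; the thresholds echo the talk’s experimental setup (spaced-out requests, a daily per-site page budget) but are purely illustrative.

```python
# Sketch of per-site politeness: space out requests to a site and
# limit the number of requests to a site per day.
class PolitenessPolicy:
    def __init__(self, min_gap=10.0, max_per_day=3000):
        self.min_gap = min_gap          # seconds between hits on one site
        self.max_per_day = max_per_day  # per-site daily request budget
        self.last_hit = {}              # site -> time of last request
        self.count = {}                 # site -> requests so far today

    def may_fetch(self, site, now):
        if self.count.get(site, 0) >= self.max_per_day:
            return False                # daily budget exhausted
        last = self.last_hit.get(site)
        if last is not None and now - last < self.min_gap:
            return False                # too soon after the last request
        self.last_hit[site] = now
        self.count[site] = self.count.get(site, 0) + 1
        return True

p = PolitenessPolicy(min_gap=10.0, max_per_day=2)
```

A real crawler would also honor each site’s robots.txt (e.g. via Python’s `urllib.robotparser`) before fetching.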
Crawling Issues (2) • Load at crawler • parallelize [diagram: two crawl loops running in parallel, sharing the “to visit” and visited-URL structures]
Crawling Issues (3) • Scope of crawl • not enough space for “all” pages • not enough time to visit “all” pages • Solution: visit “important” pages
Crawling Issues (4) • Incremental crawling • how do we keep pages “fresh”? • how do we avoid crawling from scratch? • Focus of this talk...
Web Evolution Experiment • How often does a web page change? • What is the lifespan of a page? • How long does it take for 50% of the web to change? • How do we model web changes?
Experimental Setup • February 17 to June 24, 1999 • 270 sites visited (with permission) • identified 400 sites with highest “page rank” • contacted administrators • 720,000 pages collected • 3,000 pages from each site daily • start at root, visit breadth first (get new & old pages) • ran only 9pm - 6am, 10 seconds between site requests
How Often Does a Page Change? • Example: 50 visits to a page, 5 changes • average change interval = 50/5 = 10 days • Is this correct? [diagram: with one visit per day, several changes between visits are observed as a single change]
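A small simulation illustrates why the naive visits/changes estimate can be biased: a daily visit only observes whether the page differs from yesterday’s copy, so multiple changes within a day count as one. The rates below are illustrative, assuming the Poisson change model introduced later in the talk.

```python
# A page that truly changes every 2 days on average, visited once a day.
import random

random.seed(0)
TRUE_INTERVAL = 2.0            # days between changes, on average
lam = 1.0 / TRUE_INTERVAL      # change rate per day
days = 100_000

changes_seen = 0
for _ in range(days):
    # count Poisson changes during one day by summing exponential gaps
    k = 0
    t = random.expovariate(lam)
    while t < 1.0:
        k += 1
        t += random.expovariate(lam)
    if k > 0:                  # the visit sees at most one change per day
        changes_seen += 1

naive_estimate = days / changes_seen   # "average change interval" in days
```

The naive estimate comes out noticeably larger than the true 2-day interval (analytically, 1/(1 − e^(−1/2)) ≈ 2.54 days).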
Average Change Interval [chart: fraction of pages vs. average change interval]
Average Change Interval by Domain [chart: fraction of pages vs. average change interval, per domain]
How Long Does a Page Live? [diagram: a page’s lifetime may lie inside the experiment duration, overlap either end of it, or span it entirely, so observed lifetimes are truncated]
Page Lifespans [chart: fraction of pages vs. lifespan]
Page Lifespans (Method 1 used) [chart: fraction of pages vs. lifespan]
Time for a 50% Change [chart: fraction of unchanged pages vs. days]
Modeling Web Evolution • Poisson process with rate λ • T is the time to the next change event • fT(t) = λe^(−λt) (t > 0)
Change Interval of Pages • for pages that change every 10 days on average [chart: fraction of changes with a given interval vs. interval in days; the Poisson model curve matches the measured data]
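The Poisson model on the previous slide can be checked numerically: for pages that change every 10 days on average, intervals between changes should be exponentially distributed with rate λ = 1/10. A small sanity-check simulation (illustrative, not the talk’s actual fit):

```python
# Draw exponential change intervals and compare against the model CDF.
import math
import random

random.seed(1)
lam = 1 / 10                          # one change per 10 days on average
intervals = [random.expovariate(lam) for _ in range(100_000)]

mean = sum(intervals) / len(intervals)
frac_within_10 = sum(t <= 10 for t in intervals) / len(intervals)
model = 1 - math.exp(-lam * 10)       # model CDF at 10 days ≈ 0.632
```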
Change Metrics • Freshness • Freshness of element ei at time t: F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise • Freshness of the database S at time t: F(S; t) = (1/N) ∑ i=1..N F(ei; t)
Change Metrics • Age • Age of element ei at time t: A(ei; t) = 0 if ei is up-to-date at time t, t − (modification time of ei) otherwise • Age of the database S at time t: A(S; t) = (1/N) ∑ i=1..N A(ei; t)
Change Metrics • Time averages: F̄(ei) = lim t→∞ (1/t) ∫0..t F(ei; s) ds and F̄(S) = lim t→∞ (1/t) ∫0..t F(S; s) ds • similar for age... [diagram: F(ei) drops from 1 to 0 at each update and returns to 1 at each refresh; A(ei) grows linearly after an update until the next refresh]
Crawler Types • In-place vs. shadow (a shadow crawler builds a new collection beside the current one, then swaps it in) • Steady vs. batch (a steady crawler runs continuously; a batch crawler alternates crawler-on and crawler-off periods)
Comparison: Batch vs. Steady [chart: freshness over time for a batch-mode in-place crawler, with its crawler-running periods marked, vs. a steady in-place crawler]
Shadowing Steady Crawler [chart: freshness of the crawler’s collection and of the current collection over time, compared with the same crawler without shadowing]
Shadowing Batch Crawler [chart: freshness of the crawler’s collection and of the current collection over time, compared with the same crawler without shadowing]
Experimental Data: Freshness • Pages change on average every 4 months • Batch crawler works one week out of 4 [chart: freshness ≈ 0.63 for one crawler type vs. ≈ 0.50 for the other]
Refresh Order • Fixed order • Example: explicit list of URLs to visit • Random order • Example: start from seed URLs & follow links • Purely random • Example: refresh pages on demand, as requested by users
Freshness vs. Order [chart: freshness vs. r = λ/f = average change frequency / average visit frequency, for a steady, in-place crawler with pages changing at the same average rate]
Age vs. Order [chart: age divided by the time to refresh all N elements, vs. r = λ/f = average change frequency / average visit frequency]
Non-Uniform Change Rates [chart: fraction of elements vs. change rate λ, distribution g(λ)]
Trick Question • Two-page database • e1 changes daily • e2 changes once a week • Can visit pages once a week • How should we visit pages? • e1 e1 e1 e1 e1 e1 ... • e2 e2 e2 e2 e2 e2 ... • e1 e2 e1 e2 e1 e2 ... [uniform] • e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional] • ?
Proportional Often Not Good! • Visiting fast-changing e1 gains about 1/2 day of freshness • Visiting slow-changing e2 gains about 1/2 week of freshness • Visiting e2 is a better deal!
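The point can be made quantitative under the Poisson change model: a page with change rate λ refreshed every I days has expected freshness (1 − e^(−λI))/(λI). The sketch below assumes an illustrative budget of one visit per day split between the two pages (uniform: each page every 2 days; proportional 7:1: e1 every 8/7 days, e2 every 8 days); the budget and function names are assumptions, not the talk’s exact setup.

```python
# Compare uniform vs. proportional refresh for e1 (daily changes)
# and e2 (weekly changes) under the Poisson change model.
import math

def avg_freshness(lam, interval):
    # time-averaged freshness of one page refreshed every `interval` days
    x = lam * interval
    return (1 - math.exp(-x)) / x

lam1, lam2 = 1.0, 1.0 / 7          # e1 changes daily, e2 weekly
uniform = (avg_freshness(lam1, 2) + avg_freshness(lam2, 2)) / 2
proportional = (avg_freshness(lam1, 8 / 7) + avg_freshness(lam2, 8)) / 2
```

Uniform comes out ahead (≈ 0.65 vs. ≈ 0.60), matching the slide’s claim that proportional allocation is often not the best policy.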
Selecting Optimal Refresh Frequency • Analysis is complex • Shape of the curve is the same in all cases • Holds for any change-rate distribution g(λ)
Optimal Refresh Frequency for Age • Analysis is also complex • Shape of the curve is the same in all cases • Holds for any change-rate distribution g(λ)
Comparing Policies • In-place, steady crawler • Based on our experimental data [pages change at different frequencies, as measured in the experiment]
Building an Incremental Crawler • Steady • In-place • Variable visit frequencies • Fixed order improves freshness! • Also need: • page change frequency estimator • page replacement policy • page ranking policy (what are “important” pages?) • See papers...
The End • Thank you for your attention... • For more information, visit http://www-db.stanford.edu, follow the “Searchable Publications” link, and search for author = “Cho”