How to Crawl the Web
Hector Garcia-Molina, Stanford University
Joint work with Junghoo Cho
Stanford InterLib Technologies • Physical Barriers → Mobile Access • Economic Concerns → IP Infrastructure • Information Loss → Archival Repository • Information Overload → Value Filtering • Service Heterogeneity → Interoperability
WebBase Architecture [diagram: web crawlers retrieve pages from the WWW into the repository; indexes and a feature repository are built over it; clients access the collection through the WebBase API and a multicast engine]
What is a Crawler? [diagram: init seeds a “to visit” list with initial URLs; the loop repeatedly gets the next URL, gets the page from the web, and extracts its URLs; new URLs join “to visit”, fetched URLs move to “visited”, and pages are stored]
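The loop in the diagram can be sketched in a few lines. This is a minimal illustration, not the WebBase implementation: the in-memory WEB table stands in for real HTTP fetching and link extraction, and all names here are invented.

```python
# Minimal sketch of the crawl loop: to-visit queue, visited set, page store.
from collections import deque

WEB = {  # url -> (page text, outgoing links); stand-in for the real web
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["a", "c"]),
    "c": ("page c", []),
}

def crawl(initial_urls):
    to_visit = deque(initial_urls)   # init: seed the frontier
    visited = set()                  # visited urls
    pages = {}                       # stored web pages
    while to_visit:
        url = to_visit.popleft()     # get next url
        if url in visited or url not in WEB:
            continue
        text, links = WEB[url]       # get page
        visited.add(url)
        pages[url] = text
        for link in links:           # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages

pages = crawl(["a"])
```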
Crawling at Stanford • WebBase Project • BackRub Search Engine, Page Rank • Google • New WebBase Crawler
Crawling Issues (1) • Load at visited web sites • robots.txt files • space out requests to a site • limit number of requests to a site per day • limit depth of crawl • multicast
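The per-site politeness rules above can be sketched as simple bookkeeping. The class and method names are invented; the thresholds echo the talk’s experimental setup (spaced-out requests, a daily per-site page budget) but are purely illustrative.

```python
# Sketch of per-site politeness: space out requests to a site and
# limit the number of requests to a site per day.
class PolitenessPolicy:
    def __init__(self, min_gap=10.0, max_per_day=3000):
        self.min_gap = min_gap          # seconds between hits on one site
        self.max_per_day = max_per_day  # per-site daily request budget
        self.last_hit = {}              # site -> time of last request
        self.count = {}                 # site -> requests so far today

    def may_fetch(self, site, now):
        if self.count.get(site, 0) >= self.max_per_day:
            return False                # daily budget exhausted
        last = self.last_hit.get(site)
        if last is not None and now - last < self.min_gap:
            return False                # too soon after the last request
        self.last_hit[site] = now
        self.count[site] = self.count.get(site, 0) + 1
        return True

p = PolitenessPolicy(min_gap=10.0, max_per_day=2)
```

A real crawler would also honor each site’s robots.txt (e.g. via Python’s `urllib.robotparser`) before fetching.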
Crawling Issues (2) • Load at crawler • parallelize [diagram: two crawl loops running in parallel, sharing the “to visit” and visited-URL structures]
Crawling Issues (3) • Scope of crawl • not enough space for “all” pages • not enough time to visit “all” pages • Solution: visit “important” pages
Crawling Issues (4) • Incremental crawling • how do we keep pages “fresh”? • how do we avoid crawling from scratch? • Focus of this talk...
Web Evolution Experiment • How often does a web page change? • What is the lifespan of a page? • How long does it take for 50% of the web to change? • How do we model web changes?
Experimental Setup • February 17 to June 24, 1999 • 270 sites visited (with permission) • identified 400 sites with highest “page rank” • contacted administrators • 720,000 pages collected • 3,000 pages from each site daily • start at root, visit breadth first (get new & old pages) • ran only 9pm - 6am, 10 seconds between site requests
How Often Does a Page Change? • Example: 50 visits to a page, 5 changes • average change interval = 50/5 = 10 days • Is this correct? [diagram: with one visit per day, several changes between visits are observed as a single change]
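A small simulation illustrates why the naive visits/changes estimate can be biased: a daily visit only observes whether the page differs from yesterday’s copy, so multiple changes within a day count as one. The rates below are illustrative, assuming the Poisson change model introduced later in the talk.

```python
# A page that truly changes every 2 days on average, visited once a day.
import random

random.seed(0)
TRUE_INTERVAL = 2.0            # days between changes, on average
lam = 1.0 / TRUE_INTERVAL      # change rate per day
days = 100_000

changes_seen = 0
for _ in range(days):
    # count Poisson changes during one day by summing exponential gaps
    k = 0
    t = random.expovariate(lam)
    while t < 1.0:
        k += 1
        t += random.expovariate(lam)
    if k > 0:                  # the visit sees at most one change per day
        changes_seen += 1

naive_estimate = days / changes_seen   # "average change interval" in days
```

The naive estimate comes out noticeably larger than the true 2-day interval (analytically, 1/(1 − e^(−1/2)) ≈ 2.54 days).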
Average Change Interval [chart: fraction of pages vs. average change interval]
Average Change Interval by Domain [chart: fraction of pages vs. average change interval, per domain]
How Long Does a Page Live? [diagram: a page’s lifetime may lie inside the experiment duration, overlap either end of it, or span it entirely, so observed lifetimes are truncated]
Page Lifespans [chart: fraction of pages vs. lifespan]
Page Lifespans (Method 1 used) [chart: fraction of pages vs. lifespan]
Time for a 50% Change [chart: fraction of unchanged pages vs. days]
Modeling Web Evolution • Poisson process with rate λ • T is the time to the next change event • fT(t) = λe^(−λt) (t > 0)
Change Interval of Pages • for pages that change every 10 days on average [chart: fraction of changes with a given interval vs. interval in days; the Poisson model curve matches the measured data]
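The Poisson model on the previous slide can be checked numerically: for pages that change every 10 days on average, intervals between changes should be exponentially distributed with rate λ = 1/10. A small sanity-check simulation (illustrative, not the talk’s actual fit):

```python
# Draw exponential change intervals and compare against the model CDF.
import math
import random

random.seed(1)
lam = 1 / 10                          # one change per 10 days on average
intervals = [random.expovariate(lam) for _ in range(100_000)]

mean = sum(intervals) / len(intervals)
frac_within_10 = sum(t <= 10 for t in intervals) / len(intervals)
model = 1 - math.exp(-lam * 10)       # model CDF at 10 days ≈ 0.632
```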
Change Metrics • Freshness • Freshness of element ei at time t: F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise • Freshness of the database S at time t: F(S; t) = (1/N) ∑ i=1..N F(ei; t)
Change Metrics • Age • Age of element ei at time t: A(ei; t) = 0 if ei is up-to-date at time t, t − (modification time of ei) otherwise • Age of the database S at time t: A(S; t) = (1/N) ∑ i=1..N A(ei; t)
Change Metrics • Time averages: F̄(ei) = lim t→∞ (1/t) ∫0..t F(ei; s) ds and F̄(S) = lim t→∞ (1/t) ∫0..t F(S; s) ds • similar for age... [diagram: F(ei) drops from 1 to 0 at each update and returns to 1 at each refresh; A(ei) grows linearly after an update until the next refresh]
Crawler Types • In-place vs. shadow (a shadow crawler builds a new collection beside the current one, then swaps it in) • Steady vs. batch (a steady crawler runs continuously; a batch crawler alternates crawler-on and crawler-off periods)
Comparison: Batch vs. Steady [chart: freshness over time for a batch-mode in-place crawler, with its crawler-running periods marked, vs. a steady in-place crawler]
Shadowing Steady Crawler [chart: freshness of the crawler’s collection and of the current collection over time, compared with the same crawler without shadowing]
Shadowing Batch Crawler [chart: freshness of the crawler’s collection and of the current collection over time, compared with the same crawler without shadowing]
Experimental Data: Freshness • Pages change on average every 4 months • Batch crawler works one week out of 4 [chart: freshness ≈ 0.63 for one crawler type vs. ≈ 0.50 for the other]
Refresh Order • Fixed order • Example: explicit list of URLs to visit • Random order • Example: start from seed URLs & follow links • Purely random • Example: refresh pages on demand, as requested by users
Freshness vs. Order [chart: freshness vs. r = λ/f = average change frequency / average visit frequency, for a steady, in-place crawler with pages changing at the same average rate]
Age vs. Order [chart: age divided by the time to refresh all N elements, vs. r = λ/f = average change frequency / average visit frequency]
Non-Uniform Change Rates [chart: fraction of elements vs. change rate λ, distribution g(λ)]
Trick Question • Two-page database • e1 changes daily • e2 changes once a week • Can visit pages once a week • How should we visit pages? • e1 e1 e1 e1 e1 e1 ... • e2 e2 e2 e2 e2 e2 ... • e1 e2 e1 e2 e1 e2 ... [uniform] • e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional] • ?
Proportional Often Not Good! • Visiting fast-changing e1 gains about 1/2 day of freshness • Visiting slow-changing e2 gains about 1/2 week of freshness • Visiting e2 is a better deal!
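The point can be made quantitative under the Poisson change model: a page with change rate λ refreshed every I days has expected freshness (1 − e^(−λI))/(λI). The sketch below assumes an illustrative budget of one visit per day split between the two pages (uniform: each page every 2 days; proportional 7:1: e1 every 8/7 days, e2 every 8 days); the budget and function names are assumptions, not the talk’s exact setup.

```python
# Compare uniform vs. proportional refresh for e1 (daily changes)
# and e2 (weekly changes) under the Poisson change model.
import math

def avg_freshness(lam, interval):
    # time-averaged freshness of one page refreshed every `interval` days
    x = lam * interval
    return (1 - math.exp(-x)) / x

lam1, lam2 = 1.0, 1.0 / 7          # e1 changes daily, e2 weekly
uniform = (avg_freshness(lam1, 2) + avg_freshness(lam2, 2)) / 2
proportional = (avg_freshness(lam1, 8 / 7) + avg_freshness(lam2, 8)) / 2
```

Uniform comes out ahead (≈ 0.65 vs. ≈ 0.60), matching the slide’s claim that proportional allocation is often not the best policy.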
Selecting Optimal Refresh Frequency • Analysis is complex • Shape of the curve is the same in all cases • Holds for any change-rate distribution g(λ)
Optimal Refresh Frequency for Age • Analysis is also complex • Shape of the curve is the same in all cases • Holds for any change-rate distribution g(λ)
Comparing Policies • In-place, steady crawler • Based on our experimental data [pages change at different frequencies, as measured in the experiment]
Building an Incremental Crawler • Steady • In-place • Variable visit frequencies • Fixed order improves freshness! • Also need: • page change frequency estimator • page replacement policy • page ranking policy (what are “important” pages?) • See papers...
The End • Thank you for your attention... • For more information, visit http://www-db.stanford.edu, follow the “Searchable Publications” link, and search for author = “Cho”