
How to Crawl the Web. Hector Garcia-Molina, Stanford University. Joint work with Junghoo Cho.



Presentation Transcript


  1. How to Crawl the Web. Hector Garcia-Molina, Stanford University. Joint work with Junghoo Cho.

  2. Stanford InterLib Technologies
  • Physical Barriers: Mobile Access
  • Economic Concerns: IP Infrastructure
  • Information Loss: Archival Repository
  • Information Overload: Value Filtering
  • Service Heterogeneity: Interoperability

  3. WebBase Architecture [diagram: Web Crawlers retrieve pages from the WWW into a Repository; Indexes and a Feature Repository are built from it; a WebBase API and a Multicast Engine serve Clients]

  4. What is a Crawler? [diagram of the crawl loop: initialize the "to visit" list with the initial URLs; get the next URL; get the page from the web; extract URLs from it into "to visit"; record visited URLs and store web pages] (a sketch of this loop follows)
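
The loop on slide 4 maps directly to a few lines of code. A minimal sketch in Python, using only the standard library; the seed URL, the page limit, and the crude href regex are illustrative assumptions, not part of the talk:

```python
import re
import urllib.request
from collections import deque

def crawl(initial_urls, max_pages=10):
    to_visit = deque(initial_urls)   # init "to visit" with the initial urls
    visited = set()                  # "visited urls"
    pages = {}                       # stored "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # get next url
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:  # get page
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                 # unreachable page: skip it
        pages[url] = html
        # extract urls (a crude regex stands in for a real HTML parser)
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in visited:
                to_visit.append(link)
    return pages

if __name__ == "__main__":
    crawl(["https://example.com/"])
```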

  5. Crawling at Stanford
  • WebBase Project
  • BackRub Search Engine, PageRank
  • Google
  • New WebBase Crawler

  6. Crawling Issues (1)
  • Load at visited web sites
    • honor robots.txt files
    • space out requests to a site (see the sketch below)
    • limit number of requests to a site per day
    • limit depth of crawl
    • multicast
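
A minimal politeness sketch for the first two bullets, assuming a fixed 10-second per-site spacing (the spacing used in the experiment on slide 11) and the hypothetical agent name webbase-crawler; urllib.robotparser is the standard-library robots.txt reader:

```python
import time
import urllib.robotparser
from urllib.parse import urljoin, urlparse

DELAY_SECONDS = 10.0    # space out requests to a site
last_request = {}       # site -> time of our last request to it
robots = {}             # site -> parsed robots.txt

def allowed(url, agent="webbase-crawler"):
    """Check the site's robots.txt before fetching."""
    site = urlparse(url).netloc
    if site not in robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(url, "/robots.txt"))
        try:
            rp.read()
        except OSError:
            pass   # unreachable robots.txt: rp.can_fetch() then errs toward "no"
        robots[site] = rp
    return robots[site].can_fetch(agent, url)

def wait_for_turn(url):
    """Sleep until DELAY_SECONDS have passed since our last request to the site."""
    site = urlparse(url).netloc
    elapsed = time.time() - last_request.get(site, 0.0)
    if elapsed < DELAY_SECONDS:
        time.sleep(DELAY_SECONDS - elapsed)
    last_request[site] = time.time()
```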

  7. Crawling Issues (2)
  • Load at crawler
    • parallelize
  [diagram: two crawl loops (get next url, get page, extract urls) running in parallel over shared "to visit" and "visited" URL lists]

  8. Crawling Issues (3)
  • Scope of crawl
    • not enough space for "all" pages
    • not enough time to visit "all" pages
  • Solution: visit "important" pages (a frontier sketch follows)
  [diagram: the crawl covers only the visited subset of the web, e.g. the microsoft pages]
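
One way to realize "visit important pages first" is to key the crawl frontier by an importance score. A minimal sketch; the URLs and precomputed scores are hypothetical stand-ins for something like the PageRank values mentioned on slide 5:

```python
import heapq

# hypothetical precomputed importance scores (e.g. PageRank-style)
importance = {"https://a.example/": 0.9, "https://b.example/": 0.2}

frontier = []  # heapq is a min-heap, so negate scores to pop the max first

def push(url):
    heapq.heappush(frontier, (-importance.get(url, 0.0), url))

def pop():
    return heapq.heappop(frontier)[1]  # most important URL first

push("https://b.example/")
push("https://a.example/")
print(pop())  # https://a.example/
```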

  9. Crawling Issues (4)
  • Incremental crawling (the focus of this talk)
    • how do we keep pages "fresh"?
    • how do we avoid crawling from scratch?

  10. Web Evolution Experiment
  • How often does a web page change?
  • What is the lifespan of a page?
  • How long does it take for 50% of the web to change?
  • How do we model web changes?

  11. Experimental Setup
  • February 17 to June 24, 1999
  • 270 sites visited (with permission)
    • identified 400 sites with highest "page rank"
    • contacted administrators
  • 720,000 pages collected
    • 3,000 pages from each site daily
    • start at root, visit breadth first (get new & old pages)
    • ran only 9pm - 6am, 10 seconds between site requests

  12. How Often Does a Page Change?
  • Example: 50 visits to a page, 5 changes detected
    • average change interval = 50/5 = 10 days
  • Is this correct? [timeline: the page is visited once a day; a visit only reveals whether the page changed since the last visit, so multiple changes within one interval are counted once] (see the simulation below)
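
The "Is this correct?" caveat can be made concrete. A small simulation, assuming the Poisson change model introduced on slide 19: a daily visit only reveals whether the page changed since yesterday, so changes that land twice between visits are counted once, and the naive interval estimate is biased upward:

```python
import math
import random

def naive_change_interval(days_visited, changes_detected):
    return days_visited / changes_detected       # e.g. 50 / 5 = 10 days

def simulate_daily_visits(true_interval_days, days=50_000):
    """Pages change as a Poisson process; a daily visit only reveals
    whether the page changed since yesterday, not how many times."""
    rate = 1.0 / true_interval_days
    p_changed = 1.0 - math.exp(-rate)            # P(at least one change per day)
    detected = sum(random.random() < p_changed for _ in range(days))
    return naive_change_interval(days, detected)

print(simulate_daily_visits(10))   # ~10.5: mild overestimate of a 10-day interval
print(simulate_daily_visits(1))    # ~1.58: well above the true 1-day interval
```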

  13. Average Change Interval [histogram: fraction of pages by average change interval]

  14. Average Change Interval, by Domain [histogram: fraction of pages by average change interval, broken down by domain]

  15. How Long Does a Page Live? [diagram: four ways a page lifetime can overlap the experiment duration; lifetimes extending past either end of the experiment are only partially observed]

  16. Page Lifespans [histogram: fraction of pages by lifespan]

  17. Page Lifespans [histogram: fraction of pages by lifespan, Method 1 used]

  18. Time for a 50% Change [plot: fraction of unchanged pages vs. days]

  19. Modeling Web Evolution
  • Poisson process with rate λ
  • T is the time to the next event (change)
  • fT(t) = λe^(-λt)  (t > 0)
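
A quick numerical check of this model: with Poisson changes at rate λ, the inter-change times T are exponential, so the mean interval is 1/λ and the fraction of intervals shorter than d days is 1 - e^(-λd). The rate value below is an assumption for illustration:

```python
import math
import random

lam = 0.1   # e.g. a page that changes every 10 days on average
samples = [random.expovariate(lam) for _ in range(100_000)]

print(sum(samples) / len(samples))            # ~10.0 = 1/lam
d = 10.0
empirical = sum(t < d for t in samples) / len(samples)
print(empirical, 1 - math.exp(-lam * d))      # both ~0.632
```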

  20. Change Interval of Pages [plot: fraction of changes with a given interval vs. interval in days, for pages that change every 10 days on average, with the Poisson model overlaid]

  21. Change Metrics: Freshness
  • Freshness of element ei at time t is
      F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise
  • Freshness of the database S at time t is
      F(S; t) = (1/N) Σi=1..N F(ei; t)
  [diagram: database copies of elements ei mirroring the web]

  22. Change Metrics: Age
  • Age of element ei at time t is
      A(ei; t) = 0 if ei is up-to-date at time t, otherwise t - (modification time of ei)
  • Age of the database S at time t is
      A(S; t) = (1/N) Σi=1..N A(ei; t)
  [diagram: database copies of elements ei mirroring the web]
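
Both metrics on slides 21-22 translate directly into code. A minimal sketch over an in-memory database of crawled copies; the element records and timestamps are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Element:
    crawled_copy: str       # what our database S holds
    live_copy: str          # what the web holds now
    last_web_change: float  # time the live copy last changed

def freshness(db, t):
    # F(S; t): fraction of elements whose copy is up-to-date at time t
    return sum(e.crawled_copy == e.live_copy for e in db) / len(db)

def age(db, t):
    # A(ei; t) = 0 if up-to-date, else t - (modification time of ei)
    return sum(0.0 if e.crawled_copy == e.live_copy else t - e.last_web_change
               for e in db) / len(db)

db = [Element("v1", "v1", 0.0), Element("v1", "v2", 3.0)]
print(freshness(db, t=5.0))   # 0.5: one of two elements is up-to-date
print(age(db, t=5.0))         # 1.0: (0 + (5 - 3)) / 2
```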

  23. Change Metrics: Time Averages
      F̄(ei) = lim t→∞ (1/t) ∫0..t F(ei; u) du
      F̄(S) = lim t→∞ (1/t) ∫0..t F(S; u) du
  • similar for age...
  [plot: F(ei) drops from 1 to 0 at each web update and returns to 1 at each refresh; A(ei) grows linearly after an update and resets to 0 at each refresh]

  24. Crawler Types
  • In-place vs. shadow: an in-place crawler updates the current database directly; a shadow crawler builds a separate copy that replaces the database when complete
  • Steady vs. batch: a steady crawler runs continuously; a batch crawler alternates between running and being off
  [diagrams: shadow database holding copies of web elements ei; timeline of crawler-on and crawler-off periods]

  25. Comparison: Batch vs. Steady [plot: freshness over time; the batch-mode in-place crawler's freshness rises while the crawler is running and decays while it is off; the steady in-place crawler stays near a constant level]

  26. Shadowing Steady Crawler [plot: freshness of the current collection vs. the crawler's collection without shadowing]

  27. Shadowing Batch Crawler [plot: freshness of the current collection vs. the crawler's collection without shadowing]

  28. Experimental Data: Freshness
  • Pages change on average every 4 months
  • Batch crawler works one week out of 4
  [table: resulting freshness of the two crawler types, 0.63 vs. 0.50]

  29. Refresh Order
  • Fixed order
    • Example: explicit list of URLs to visit
  • Random order
    • Example: start from seed URLs & follow links
  • Purely random
    • Example: refresh pages on demand, as requested by users
  [diagram: database copies of elements ei synchronized against the web]

  30. Freshness vs. Order [plot: freshness of a steady, in-place crawler when all pages change at the same average rate, as a function of r = λ / f = average change frequency / average visit frequency]
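
The fixed-order case can be checked against a closed form. Under the Poisson model, a page refreshed every I = 1/f days is still fresh t days after a refresh with probability e^(-λt), so its time-averaged freshness is (1/I) ∫0..I e^(-λt) dt = (1 - e^(-r))/r with r = λ/f. This is derived from the slides' model rather than quoted from the talk; the rates below are assumptions:

```python
import math
import random

def freshness_fixed_order(r):
    # time-averaged freshness for refresh every I = 1/f, Poisson changes at rate lam
    return (1.0 - math.exp(-r)) / r

def freshness_simulated(lam, interval, cycles=100_000):
    fresh_time = 0.0
    for _ in range(cycles):
        first_change = random.expovariate(lam)     # time of first change after a refresh
        fresh_time += min(first_change, interval)  # fresh until change or next refresh
    return fresh_time / (cycles * interval)

lam, f = 0.25, 0.5            # assumed: changes every 4 days, visited every 2 days
r = lam / f
print(freshness_fixed_order(r))         # ~0.787
print(freshness_simulated(lam, 1 / f))  # agrees
```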

  31. Age vs. Order [plot: normalized age (age divided by the time to refresh all N elements) as a function of r = λ / f = average change frequency / average visit frequency]

  32. Non-Uniform Change Rates [plot: g(λ), the fraction of elements having change rate λ]

  33. Trick Question
  • Two-page database: e1 changes daily, e2 changes once a week
  • Can visit pages once a week
  • How should we visit pages? (compare the simulation below)
    • e1 e1 e1 e1 e1 e1 ...
    • e2 e2 e2 e2 e2 e2 ...
    • e1 e2 e1 e2 e1 e2 ... [uniform]
    • e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
    • ?
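
The candidate schedules can be compared by simulation under the Poisson model, assuming "can visit pages once a week" means one page refresh per week. This sketch plays out each repeating weekly schedule and measures the time-averaged freshness of the two-page database; it reproduces the ordering slide 34 explains (e2-only beats uniform, which beats proportional):

```python
import random

DAY, WEEK = 1.0, 7.0

def page_fresh_time(rate, refresh_times, horizon):
    """Total time a page's copy is fresh: after each refresh it stays fresh
    until the page's next Poisson change (memorylessness makes one
    independent exponential draw per gap exact)."""
    fresh = 0.0
    times = [0.0] + list(refresh_times) + [horizon]
    for start, end in zip(times, times[1:]):
        fresh += min(random.expovariate(rate), end - start)
    return fresh

def avg_freshness(schedule, weeks=50_000):
    """Average freshness under a repeating weekly schedule, e.g.
    ["e1", "e2"] refreshes e1 one week and e2 the next."""
    horizon = weeks * WEEK
    refreshes = {"e1": [], "e2": []}
    for w in range(weeks):
        refreshes[schedule[w % len(schedule)]].append((w + 1) * WEEK)
    rates = {"e1": 1.0 / DAY, "e2": 1.0 / WEEK}  # e1 daily, e2 weekly
    return sum(page_fresh_time(rates[p], refreshes[p], horizon)
               for p in rates) / (2 * horizon)

print(avg_freshness(["e1"]))               # e1 only:      ~0.07
print(avg_freshness(["e2"]))               # e2 only:      ~0.32
print(avg_freshness(["e1", "e2"]))         # uniform:      ~0.25
print(avg_freshness(["e1"] * 7 + ["e2"]))  # proportional: ~0.13
```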

  34. Proportional Often Not Good!
  • Visit fast-changing e1: gain about 1/2 day of freshness
  • Visit slow-changing e2: gain about 1/2 week of freshness
  • Visiting e2 is a better deal!

  35. Selecting Optimal Refresh Frequency
  • Analysis is complex
  • Shape of the curve is the same in all cases
  • Holds for any distribution g(λ)

  36. Optimal Refresh Frequency for Age
  • Analysis is also complex
  • Shape of the curve is the same in all cases
  • Holds for any distribution g(λ)

  37. Comparing Policies [plot: in-place, steady crawler, based on our experimental data; pages change at different frequencies, as measured in the experiment]

  38. Building an Incremental Crawler
  • Steady
  • In-place
  • Variable visit frequencies
  • Fixed order improves freshness!
  • Also need:
    • page change frequency estimator (a sketch follows this list)
    • page replacement policy
    • page ranking policy (what are "important" pages?)
  • See papers...
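
For the change frequency estimator bullet, a minimal sketch under the Poisson model (a standard correction, not necessarily the estimator the talk has in mind): if a page is checked n times at interval I and a change is detected in X of those intervals, then P(no detected change in an interval) = e^(-λI), giving λ̂ = -ln((n - X)/n) / I. Unlike the naive X/(nI) from slide 12, this accounts for multiple changes falling into one interval:

```python
import math

def estimate_change_rate(n_checks, n_changed, interval_days):
    """lam-hat = -ln((n - X) / n) / I under the Poisson change model."""
    if n_changed == n_checks:        # a change in every interval: rate unidentifiable
        return float("inf")
    return -math.log((n_checks - n_changed) / n_checks) / interval_days

# The page from slide 12: checked daily for 50 days, changes seen on 5 visits.
lam = estimate_change_rate(50, 5, 1.0)
print(1.0 / lam)   # ~9.5 days between changes, vs. the naive estimate of 10
```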

  39. The End
  • Thank you for your attention...
  • For more information, visit http://www-db.stanford.edu, follow the "Searchable Publications" link, and search for author = "Cho"
