WebBase: Building a Web Warehouse

Hector Garcia-Molina, Stanford University. Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley.

Presentation Transcript


  1. WebBase: Building a Web Warehouse • Hector Garcia-Molina, Stanford University • Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley

  2. The Web • A universal information resource • Model weak, strong agreement • How to exploit it? [Figure: the web]

  3. WebBase [Figure: web page]

  4. WebBase Goals • Manage very large collections of Web pages • Today: 1,500 GB of HTML, 200M pages • Enable large-scale Web-related research • Locally provide a significant portion of the Web • Efficient wide-area Web data distribution

  5. WebBase Architecture

  6. WebBase Remote Users • Berkeley • Columbia • U. Washington • Harvey Mudd • Università degli Studi di Milano • U. of Arizona • California Digital Library • Cornell • U. of Houston • Learning Lab Lower Saxony (L3S) • France Telecom • U. Texas

  7. Outline • Technical Challenges • WebBase Use • The Future

  8. Challenges • Archiving: “units”, coordination • IP Management: copy access, link access, access control • Hidden Web • Topic-Specific Collection Building • Scalability: crawling, archive distribution, index construction, storage • Consistency: freshness, versions • Dissemination

  9. What is a Crawler? [Diagram: init seeds the to-visit URLs from the initial URLs; loop: get next URL, get page from the web, extract URLs; data stores: to-visit URLs, visited URLs, web pages]
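
The loop in the crawler diagram can be sketched in a few lines of Python. This is only an illustration, not WebBase's crawler: the names `crawl`, `get_page`, and `FAKE_WEB` are hypothetical, link extraction is a naive regular expression, and the "web" is an in-memory dictionary so the example runs without network access.

```python
import re
from collections import deque

# Naive link extractor (assumption: links appear as href="..." attributes).
HREF = re.compile(r'href="([^"]+)"')

def crawl(initial_urls, get_page, max_pages=1000):
    """Basic crawler loop: to-visit queue, visited set, stored pages."""
    to_visit = deque(initial_urls)   # "to visit urls", seeded by "init"
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages" collected so far
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # "get next url"
        if url in visited:
            continue
        visited.add(url)
        html = get_page(url)         # "get page"
        if html is None:             # page could not be fetched
            continue
        pages[url] = html
        for link in HREF.findall(html):   # "extract urls"
            if link not in visited:
                to_visit.append(link)
    return pages

# Hypothetical stand-in for the web: URL -> HTML.
FAKE_WEB = {
    "a": '<a href="b">b</a> <a href="c">c</a>',
    "b": '<a href="a">a</a>',
    "c": '<a href="d">d</a>',
}

crawled = crawl(["a"], FAKE_WEB.get)
```

A real crawler would also need politeness delays, robots.txt handling, and URL normalization, none of which are shown here.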

  10. Parallel Crawling [Figure: several crawler processes (C) crawling the web]

  11. Independent Crawlers [Figure: two crawlers (C) over the web; site 1 contains pages a to e, site 2 contains pages f to i]

  12. Partition: Firewall partition • URL hash • Site hash • Hierarchical [Figure: two crawlers (C); site 1 contains pages a to e, site 2 contains pages f to i]
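
The three partitioning schemes listed on this slide can be sketched as functions that map a URL to a crawler index. This is a minimal sketch under assumptions: the function names are hypothetical, SHA-256 is an arbitrary choice of stable hash, and "hierarchical" is interpreted here as partitioning by top-level domain, which the slide does not spell out.

```python
import hashlib
from urllib.parse import urlparse

def _bucket(key, n_crawlers):
    """Map a string to a crawler index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_crawlers

def url_hash_partition(url, n_crawlers):
    """URL hash: pages of one site may be spread over many crawlers."""
    return _bucket(url, n_crawlers)

def site_hash_partition(url, n_crawlers):
    """Site hash: all pages of one host go to the same crawler."""
    return _bucket(urlparse(url).hostname or "", n_crawlers)

def hierarchical_partition(url, tld_to_crawler, default=0):
    """Hierarchical (assumed meaning): assign by top-level domain."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return tld_to_crawler.get(tld, default)
```

With site hashing, links that stay inside a site never cross a partition boundary, which is one reason to prefer it over plain URL hashing.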

  13. Partition: Cross-Over partition [Figure: two crawlers (C); site 1 contains pages a to e, site 2 contains pages f to i]

  14. Partition: Cross-Over partition [Figure: two crawlers (C); site 1 contains pages a to e, site 2 contains pages f to i]

  15. Partition: Exchange partition [Figure: two crawlers (C); site 1 contains pages a to e, site 2 contains pages f to i]

  16. Partition: Exchange partition [Figure: two crawlers (C); site 1 contains pages a to e, site 2 contains pages f to i]
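
The slides name the firewall, cross-over, and exchange modes without describing them. The sketch below follows the usual reading of these terms in the parallel-crawling literature, and that reading is an assumption here: firewall drops links that leave the crawler's partition, cross-over follows them only when its own partition is exhausted, and exchange forwards them to the owning crawler. The class and attribute names are hypothetical.

```python
from collections import deque

class CrawlerProcess:
    """One crawler (C) that owns a partition and handles discovered links."""

    MODES = ("firewall", "cross-over", "exchange")

    def __init__(self, my_id, owner_of, mode):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.my_id = my_id
        self.owner_of = owner_of        # function: url -> owning crawler id
        self.mode = mode
        self.own_queue = deque()        # URLs inside this crawler's partition
        self.foreign_queue = deque()    # cross-over only: fallback URLs
        self.outbox = {}                # exchange only: owner id -> URLs to send

    def handle_link(self, url):
        owner = self.owner_of(url)
        if owner == self.my_id:
            self.own_queue.append(url)
        elif self.mode == "firewall":
            pass                        # inter-partition link is ignored
        elif self.mode == "cross-over":
            self.foreign_queue.append(url)
        else:                           # exchange: hand the URL to its owner
            self.outbox.setdefault(owner, []).append(url)

    def next_url(self):
        if self.own_queue:
            return self.own_queue.popleft()
        if self.mode == "cross-over" and self.foreign_queue:
            return self.foreign_queue.popleft()
        return None
```

Under this reading, firewall mode risks missing pages reachable only through another partition, cross-over risks downloading the same page twice, and exchange avoids both at the cost of communication between crawlers.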

  17. Coverage vs Overlap [Plot: cross-over crawler; 5 random seeds per C-proc]

  18. WebBase Parallel Crawling [Diagram labels: computer, process, site queues, coordinator, web, other computers]

  19. WebBase Parallel Crawling 2 [Plot: CPU utilization (0% to 200%) vs. number of processes]

  20. Challenges • Archiving: “units”, coordination • IP Management: copy access, link access, access control • Hidden Web • Topic-Specific Collection Building • Scalability: crawling, archive distribution, index construction, storage • Consistency: freshness, versions • Dissemination [Slide annotations: “done”, “next”]

  21. How to Refresh? [Figure: on the web, page a changes daily and page b changes once a week; the repository holds copies of a and b; can visit one page per week] • How should we visit pages? • a a a a a a a ... • b b b b b b b ... • a b a b a b a b ... [uniform] • a a a a a a b a a a ... [proportional] • ?
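
One way to compare the visiting policies on this slide is to estimate the long-run fraction of time each local copy is up to date. The sketch below rests on assumptions the slide does not state: each page changes as a Poisson process, visits to a page are evenly spaced, and the score is the time-averaged freshness (1 - e^(-r)) / r with r = change rate / visit rate. All names are hypothetical.

```python
import math

def freshness(change_rate, visit_rate):
    """Time-averaged probability that a copy is fresh (Poisson change model)."""
    if change_rate == 0:
        return 1.0          # page never changes, copy is always fresh
    if visit_rate == 0:
        return 0.0          # page is never refreshed, copy goes stale
    r = change_rate / visit_rate
    return (1.0 - math.exp(-r)) / r

def average_freshness(change_rates, visit_rates):
    pairs = list(zip(change_rates, visit_rates))
    return sum(freshness(c, v) for c, v in pairs) / len(pairs)

# Page a changes daily (7 per week), page b weekly; budget: 1 visit per week.
CHANGES = [7.0, 1.0]
POLICIES = {
    "only a":       [1.0, 0.0],
    "only b":       [0.0, 1.0],
    "uniform":      [0.5, 0.5],
    "proportional": [7 / 8, 1 / 8],   # visits split in proportion to change rate
}
SCORES = {name: average_freshness(CHANGES, rates)
          for name, rates in POLICIES.items()}

for name, score in sorted(SCORES.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} {score:.3f}")
```

Under these assumptions the uniform schedule scores higher than the proportional one, because visits spent on the fast-changing page a buy very little freshness.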

  22. Using WebBase • Fast Page Rank • Complex Queries

  23. Structure of the Web • Color the nodes by their domain: red = stanford.edu, green = berkeley.edu, blue = mit.edu

  24. Structure of the Web [Figure labels: berkeley.edu, stanford.edu, mit.edu]

  25. Nested Block Structure of the Web [Figure: axes “from” and “to”; blocks labeled Berkeley and Stanford]

  26. Personalized Page Rank [Figure: nodes a and b]

  27. Complex Queries • Stanford WebBase Repository • Bulk/Streaming access: large-scale mining & indexing (e.g., compute PageRank, extract communities) • Text search (e.g., search for “SARS Symptoms”) • Complex queries: declarative analysis interface

  28. Example of a Complex Query: find universities collaborating with Stanford on mobile networking • Compute S = stanford.edu pages containing phrase “Mobile networking” • Rank pages in S by PageRank • Compute R = set of all “.edu” domains pointed to by pages in S • Rank domains in R by sum (incoming ranks) • List top 10 domains in R [Figure: Mobile networking pages (S) inside stanford.edu, inside the entire Web; links from S to R]
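
The steps of this query can be sketched over a tiny in-memory page collection. This is an illustration only: the data layout (URL mapped to text and out-links), the function names, and the sample pages and PageRank values are all hypothetical, and excluding stanford.edu itself from R is an assumption, not something the slide states.

```python
from collections import defaultdict
from urllib.parse import urlparse

def edu_domain(url):
    """Return the second-level .edu domain of a URL (e.g. 'mit.edu'), else None."""
    labels = (urlparse(url).hostname or "").split(".")
    if len(labels) >= 2 and labels[-1] == "edu":
        return ".".join(labels[-2:])
    return None

def top_collaborators(pages, pagerank, phrase, home="stanford.edu", k=10):
    """pages: url -> (text, outlinks); pagerank: url -> precomputed rank."""
    phrase = phrase.lower()
    # S: pages in the home domain that contain the phrase.
    S = [u for u, (text, _) in pages.items()
         if edu_domain(u) == home and phrase in text.lower()]
    # R: .edu domains pointed to by S, scored by the sum of incoming ranks.
    score = defaultdict(float)
    for u in S:
        targets = {edu_domain(v) for v in pages[u][1]}
        for d in targets - {None, home}:   # assumption: skip the home domain
            score[d] += pagerank.get(u, 0.0)
    # Highest score first; ties broken alphabetically.
    return sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical sample data.
PAGES = {
    "http://cs.stanford.edu/mobile": (
        "Mobile networking group",
        ["http://www.berkeley.edu/net", "http://www.mit.edu/lab"]),
    "http://ee.stanford.edu/wireless": (
        "Research on mobile networking",
        ["http://www.mit.edu/lab", "http://example.com/"]),
    "http://www.stanford.edu/sports": ("Football", ["http://www.ucla.edu/"]),
}
PAGERANK = {
    "http://cs.stanford.edu/mobile": 0.5,
    "http://ee.stanford.edu/wireless": 0.25,
    "http://www.stanford.edu/sports": 0.125,
}

result = top_collaborators(PAGES, PAGERANK, "Mobile networking")
```

On this sample, mit.edu is pointed to by both pages in S and berkeley.edu by one, so mit.edu ranks first.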

  29. Supernode graph [Figure: a Web graph over pages P1 to P5 is partitioned into supernodes {N1, N2, N3}; the supernode graph has superedges E1,2, E1,3, E3,1, E3,2; stored alongside it are intranode graphs (IntraNode1, IntraNode3) and superedge graphs (SEdgePos1,3, SEdgeNeg3,2)]
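
The decomposition in this figure can be sketched as follows: given page-level links and a partition of pages into supernodes, links inside a supernode go to that supernode's intranode graph, and links between supernodes become superedges. This is a simplified sketch with hypothetical names and sample links; it stores only the page links that are present (as in a positive superedge graph) and does not model the negative superedge graphs or any compression.

```python
from collections import defaultdict

def build_supernode_graph(links, partition):
    """links: iterable of (src_page, dst_page); partition: supernode -> pages."""
    owner = {page: node for node, members in partition.items()
             for page in members}
    intranode = defaultdict(set)    # supernode -> links inside it
    superedge = defaultdict(set)    # (from_node, to_node) -> page links across
    for src, dst in links:
        a, b = owner[src], owner[dst]
        if a == b:
            intranode[a].add((src, dst))
        else:
            superedge[(a, b)].add((src, dst))
    supergraph = set(superedge)     # top-level graph: one edge per node pair
    return supergraph, dict(intranode), dict(superedge)

# Hypothetical example with five pages and three supernodes.
PARTITION = {"N1": {"P1", "P2"}, "N2": {"P3"}, "N3": {"P4", "P5"}}
LINKS = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"),
         ("P4", "P5"), ("P5", "P1"), ("P4", "P3")]

supergraph, intranode, superedge = build_supernode_graph(LINKS, PARTITION)
```

The small top-level graph can be kept in memory for navigation, while the per-supernode and per-superedge pieces are loaded only when a query touches them.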

  30. Growth of Supernode Graph [Plot: size of supernode graph (MB) vs. number of pages (millions); annotated point: 82 MB at 115M pages (830 GB of raw HTML)]

  31. Query Execution Times [Chart: time for navigation operation (secs) for Query 1 through Query 6; series: S-Node representation, Relational DB, Connectivity Server, Files of adjacency lists]

  32. Query Optimization [Figure]

  33. Impact of cluster-based optimization • 35-million-page dataset • 600 million links • 300 GB of HTML • 40-45% reduction in query execution times

  34. The Future for WebBase (and clones)?? Conclusion (So Far) • Web is universal information resource • WebBase exploits this resource • WebBase Challenges: scalability, consistency, complex queries...

  35. Will WebBase Scale? [Plot over time, with “today” marked: web content (indexable), WebBase capacity (optimistic), WebBase capacity (pessimistic)]

  36. Pessimistic Scenario • Specialized WebBases: sports, shopping, ... [Plot over time, with “today” marked: web content (indexable), WebBase capacity (pessimistic)]

  37. Optimistic Scenario • Web in a Box • web delivered on “CD” monthly • search engine handles updates [Plot over time, with “today” marked: web content (indexable), WebBase capacity (optimistic)]

  38. Legal Issues? • Is WebBase legal? • copies • links, deep linking • International regulations

  39. Biasing Results • How long will Google, Altavista, etc. resist “temptations”? • Biasing Crawler • Link and Content Spam

  40. Access Data • WebBase does not capture access patterns [Figure: web, WebBase, “?”]

  41. Semantic Web? • Will tags be generated? • By whom? • Agreement? [Figure: web with semantic tags, WebBase, “?”]

  42. Future Technical Challenges • Incremental Updates • Query Optimization • Crawling Deep Web

  43. Final Conclusion • Many challenges ahead... • Additional information: Google “Stanford WebBase” [Figure: web page]
