Searching the Web Junghoo Cho UCLA Computer Science
Information Galore • Biblio server • Legacy database • Plain text files • → Information Overload Problem
Solution • Indexing approach • Google, Excite, AltaVista • Integration approach • MySimon, BizRate
Indexing Approach Central Index
Challenges • Page selection and download • Which pages should we download? • Page and index update • How do we keep pages up to date? • Page ranking • Which pages are “important” or “relevant”? • Scalability
Integration Approach [Diagram: a Mediator sits above one Wrapper per source, Source 1 … Source n]
Challenges • Heterogeneous sources • Different data models: relational, object-oriented • Different schemas and representations: “Keanu Reeves” or “Reeves, K.” etc. • Limited query capabilities • Mediator caching
Focus of the Talk • Indexing approach • How do we keep pages up to date?
Outline of This Talk How can we keep pages fresh? • How does the Web change? • What do we mean by “fresh” pages? • How should we refresh pages?
Web Evolution Experiment • How often does a Web page change? • How long does a page stay on the Web? • How long does it take for 50% of the Web to change? • How do we model Web changes?
Experimental Setup • February 17 to June 24, 1999 • 270 sites visited (with permission) • identified the 400 sites with the highest “PageRank” • contacted their administrators • 720,000 pages collected • 3,000 pages from each site daily • start at the root, visit breadth first (gets new & old pages) • crawler ran only 9pm - 6am, with 10 seconds between requests to a site
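The collection strategy above (breadth-first from the root, a per-site page cap, a fixed politeness delay) can be sketched as follows. This is a minimal illustration, not the crawler used in the experiment; the function name, the `requests`/`BeautifulSoup` dependencies, and the defaults are ours.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                    # assumed available
from bs4 import BeautifulSoup      # assumed available

def crawl_site(root_url, max_pages=3000, delay=10.0):
    """Breadth-first crawl of a single site: start at the root, stay on
    the same host, and pause `delay` seconds between requests."""
    host = urlparse(root_url).netloc
    queue = deque([root_url])
    seen = {root_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            continue
        pages[url] = resp.text
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)          # politeness: 10 s between requests
    return pages
```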
Average Change Interval [Chart: fraction of pages vs. average change interval]
Change Interval – By Domain [Chart: fraction of pages vs. average change interval, broken down by domain]
Modeling Web Evolution • Poisson process with rate λ • T is the time to the next change event • fT(t) = λe^(−λt)  (t > 0)
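A quick way to see the model in action: under a Poisson process with rate λ, the gaps between changes are exponentially distributed, so a page's change history can be simulated by summing exponential draws. A minimal sketch (the helper name and parameters are ours):

```python
import random

def change_times(rate, horizon):
    """Simulate a page's change history over [0, horizon] days under a
    Poisson model: gaps between changes are Exponential(rate)."""
    t, times = 0.0, []
    while True:
        t += random.expovariate(rate)   # draws from fT(t) = rate * e^(-rate*t)
        if t > horizon:
            return times
        times.append(t)

# A page that changes every 10 days on average, watched for a year:
history = change_times(rate=1 / 10, horizon=365)
print(f"{len(history)} changes, mean gap ≈ {365 / max(len(history), 1):.1f} days")
```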
Change Interval of Pages [Chart: fraction of changes with a given interval vs. interval in days, for pages that change every 10 days on average, with the Poisson model overlaid]
Change Metrics • Freshness • Freshness of element ei at time t is F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise • Freshness of the database S at time t is F(S; t) = (1/N) Σ(i=1..N) F(ei; t) • (Assume “equal importance” of pages)
Change Metrics • Age • Age of element ei at time t is A(ei; t) = 0 if ei is up-to-date at time t, t − (time ei was last modified) otherwise • Age of the database S at time t is A(S; t) = (1/N) Σ(i=1..N) A(ei; t) • (Assume “equal importance” of pages)
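Both metrics are straightforward to compute for a snapshot of the database. A minimal sketch, assuming each element is represented by the time we last copied it and the time it last changed at the source (the tuple layout is ours):

```python
def freshness(db, t):
    """F(S; t): fraction of elements up to date at time t; an element is
    (time we last copied it, time it last changed at the source)."""
    return sum(1 for copied, changed in db if copied >= changed) / len(db)

def age(db, t):
    """A(S; t): average age; 0 for an up-to-date element, otherwise the
    time elapsed since the source last changed."""
    return sum(0.0 if copied >= changed else t - changed
               for copied, changed in db) / len(db)

# Hypothetical 3-element database at t = 10 (times in days):
db = [(9.0, 4.0), (3.0, 7.0), (8.0, 8.0)]
print(freshness(db, 10))   # 2/3: the second element is stale
print(age(db, 10))         # (0 + 3 + 0) / 3 = 1.0 day
```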
Time Averages: Change Metrics [Chart: F(ei) over time drops from 1 to 0 when the source page is updated and returns to 1 at the next refresh; A(ei) grows linearly after an update and resets to 0 at the next refresh]
Trick Question • Two-page database • e1 changes daily • e2 changes once a week • We can visit one page per week • How should we visit pages? • e1 e2 e1 e2 e1 e2 e1 e2 … [uniform] • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional] • e1 e1 e1 e1 e1 e1 … • e2 e2 e2 e2 e2 e2 … • ?
Proportional Often Not Good! • Visit fast-changing e1 → gain about 1/2 day of freshness • Visit slow-changing e2 → gain about 1/2 week of freshness • Visiting e2 is a better deal!
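This can be checked numerically. Under the Poisson model, the time-average freshness of a page with change rate λ refreshed periodically f times per day has the standard closed form F̄(λ, f) = (f/λ)(1 − e^(−λ/f)); that formula is not stated on the slide, so treat this as a sketch under that assumption. Plugging in the two-page example above (one visit per week in total) shows proportional losing to uniform, and both losing to spending the whole budget on e2:

```python
import math

def avg_freshness(rate, freq):
    """Time-average freshness of a page with Poisson change rate `rate`
    (per day) refreshed periodically `freq` times per day."""
    if freq == 0:
        return 0.0
    return (freq / rate) * (1 - math.exp(-rate / freq))

l1, l2 = 1.0, 1 / 7        # e1 changes daily, e2 once a week
budget = 1 / 7             # one page visit per week, shared by both pages

policies = {
    "uniform":      (budget / 2, budget / 2),
    "proportional": (budget * l1 / (l1 + l2), budget * l2 / (l1 + l2)),
    "e2 only":      (0.0, budget),
}
for name, (f1, f2) in policies.items():
    score = (avg_freshness(l1, f1) + avg_freshness(l2, f2)) / 2
    print(f"{name:12s} -> average freshness {score:.3f}")
# uniform ≈ 0.25, proportional ≈ 0.13, e2-only ≈ 0.32:
# proportional is worst, and ignoring fast-changing e1 wins here
```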
Optimal Refresh Frequency Problem • Given change rates λ1, …, λN and an average refresh rate f, find refresh frequencies f1, …, fN that maximize the freshness F(S), subject to (1/N) Σ(i=1..N) fi = f
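For a concrete feel of the optimization, the problem can be solved numerically: maximize the average of F̄(λi, fi) under the frequency budget. A sketch using scipy; the change rates and budget are illustrative, and the closed-form freshness is the same periodic-refresh assumption as above:

```python
import numpy as np
from scipy.optimize import minimize

rates = np.array([2.0, 1.0, 0.5, 0.1])   # hypothetical change rates (per day)
budget = 1.0                              # average refreshes per page per day

def neg_avg_freshness(f):
    """Negative of F(S) = (1/N) * sum_i (f_i/l_i) * (1 - exp(-l_i/f_i))."""
    f = np.maximum(f, 1e-9)               # avoid division by zero at f_i = 0
    return -np.mean((f / rates) * (1 - np.exp(-rates / f)))

res = minimize(
    neg_avg_freshness,
    x0=np.full(len(rates), budget),
    bounds=[(0.0, None)] * len(rates),
    constraints=[{"type": "eq", "fun": lambda f: f.mean() - budget}],
)
print("optimal refresh frequencies:", res.x.round(3))
# Note the allocation is NOT proportional to the change rates: the
# fastest-changing pages get relatively few visits.
```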
Optimal Refresh Frequency • The shape of the curve is the same in all cases • This holds for any change frequency distribution
Optimal Refresh for Age • The shape of the curve is the same in all cases • This holds for any change frequency distribution
Comparing Policies • Based on statistics from the experiment and a revisit frequency of once a month
In General, Not Every Page Is Equal! • Some pages are “more important” than others • e1: accessed by users 10 times/day • e2: accessed by users 20 times/day
Weighted Freshness [Chart: freshness F vs. change rate λ, for page weights w = 2 and w = 1]
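Dropping the equal-importance assumption just turns the average into a weighted average: element ei counts wi times, e.g. with wi proportional to its access rate. A minimal sketch (the tuple layout is ours):

```python
def weighted_freshness(db):
    """F(S; t) without the equal-importance assumption: element e_i
    counts w_i times (e.g. w_i ~ how often users access the page)."""
    total = sum(w for w, copied, changed in db)
    fresh = sum(w for w, copied, changed in db if copied >= changed)
    return fresh / total

# e1 accessed 10 times/day, e2 accessed 20 times/day; at t = 10 only
# e1 is up to date (tuple layout: weight, time copied, time changed):
db = [(10, 9.0, 4.0), (20, 3.0, 7.0)]
print(weighted_freshness(db))   # 10 / 30 ≈ 0.33
```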
Change Frequency Estimation [Timeline: page visited once a day; a change is detected only when the page differs from the previous visit] • How do we estimate the change frequency? • Naïve estimator: X/T • X: number of detected changes • T: monitoring period • Example: 2 detected changes in 10 days → 0.2 times/day • Problem: the change history is incomplete (several changes between two visits show up as one)
Improved Estimator • Based on the Poisson model: λ̂ = −f · log(1 − X/N) • X: number of detected changes • N: number of accesses • f: access frequency • Example: 3 detected changes in 10 daily visits → 0.36 times/day • Accounts for “missed” changes
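Both estimators are one-liners. The closed form above is reconstructed from the Poisson model rather than taken from the slide, but it reproduces the slide's 0.36 figure; running both on the same data shows how the naïve estimate understates the rate:

```python
import math

def naive_estimate(X, T):
    """X detected changes over a T-day window: X / T changes per day.
    Biased low, since at most one change can be detected per visit."""
    return X / T

def improved_estimate(X, N, f):
    """Poisson-model estimator: lambda ≈ -f * log(1 - X/N), where N is
    the number of accesses and f the access frequency (per day)."""
    return -f * math.log(1 - X / N)

# 10 daily visits, 3 of which found the page changed:
print(f"naive:    {naive_estimate(3, 10):.2f} changes/day")          # 0.30
print(f"improved: {improved_estimate(3, 10, 1.0):.2f} changes/day")  # 0.36
```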
Improvement Significant? • Application to a Web crawler • Visit pages once every week for 5 weeks • Estimate change frequency • Adjust revisit frequency based on the estimate • Uniform: do not adjust • Naïve: based on the naïve estimator • Ours: based on our improved estimator
Improvement from Our Estimator [Chart comparing the uniform, naïve, and improved policies; 9,200,000 visits in total]
Summary • Information overload problem • Indexing approach • Integration approach • Page update • Web evolution experiment • Change metric • Refresh policy • Frequency estimator
Research Opportunity • Efficient query processing? • Automatic source discovery? • Automatic data extraction?
Web Archive Project • Can we store the history of the Web? • The Web is ephemeral • Enables studying the evolution of the Web • Challenges • Update policy? • Compression? • New storage structure? • New index structure?
The End • Thank you for your attention • For more information visit http://www.cs.ucla.edu/~cho/