## Modeling Web Content Dynamics


**Modeling Web Content Dynamics**
Brian Brewington (brew@dartmouth.edu), George Cybenko (gvc@dartmouth.edu)
IMA, February 2001

**Observing changing information sources**
• An index of changing information sources must re-index items periodically to keep the index from going out of date.
• What does it mean for an observer or index to be “up-to-date” or “current”?
• Our work on the web has two parts:
  • Estimation of change rates for a large sample of web pages
  • Re-indexing speed requirements with respect to a formal definition of “up-to-date”

**Your brain is good at this**
Where is your visual attention directed when driving a car? Why? You form state estimates and re-observe when uncertainty becomes too large.

**Ingredients**
• A formal definition of “up-to-dateness”
• Data
• Scheduling to optimize “up-to-dateness”

**A meaning for “up to date”**
An index entry is (α, β)-current if it is correct to within a grace period of time β, with probability at least α.
To be “β-current”: no alteration is allowed in the gray region of the timeline.
[Figure: timeline with labels t_0 (last observed), t_n − β, t_n (now), and t_0 + T (next observed); β is the grace period.]

**(α, β)-currency has meaning in many contexts**
Any source has a spectrum of possibilities; here are some possible values (guesses):
• Newspaper: (0.9, 1 day)
• Television news: (0.95, 1 hour)
• Broker watching stocks: (0.95, 30 min)
• Air traffic controller: (0.95, 20 sec)
• Web search engine: (0.6, 1 day)
• An old web page’s links: (0.4, 70 days)

**Collecting web page data**
• Our web page data comes from a web monitoring service.
• The Informant runs periodic standing user queries against four search engines and monitors user-selected URLs. When new or updated results appear, users are notified via email.
• We download ~100,000 pages per day for ~30,000 users.
See http://informant.dartmouth.edu

**Sampling issues**
• Biased towards search engine results in the top 10 for users’ queries
• No more than one observation of a page per day; pages are usually observed once every three days.
• Queries and page checks are run only at night, so sample times are correlated.
• Filesystem timestamps are available for about 65% of our observations.

**Data in our collection**
• As of March 2000, we had observations of about 3 million web pages. The data in the paper spans 7 months.
• Each page is observed an average of 12 times, and the average time span of observation is 38 days.
• Each observation includes:
  • “Last-Modified” timestamps, when available
  • Observation time (using the remote server’s time if possible)
  • Document summary information
    • Number of bytes (“Content-Length”)
    • Number of images, tables, forms, lists, banner ads
    • 16-bit hash of text, hyperlinks, and image references

**“Lifetimes” vs. “ages”**
• We can model objects as having independent, identically distributed time periods between modifications. We call these “lifetimes.”
• The “age” is the time since the present lifetime began.
By analogy, think of replacement parts, each with an independent lifetime length.
[Figure: age versus time (0 to 4) for a sequence of lifetimes L1, L2, ... with example lengths 0.62, 1.14, 0.84, and 1.53; each marker is a change.]

**Determining dynamics from the time data**
Two ways to find the distribution of change rates:
1. Observe the time between successive modifications (lifetimes).
  • Good: direct measurement of time between changes
  • Bad: aliasing possible; needs repeat observations
2. Observe the time since the most recent modification (ages).
  • Good: no aliasing problems; works without repeat observations
  • Bad: requires that we accurately account for growth
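The aliasing problem noted for lifetime sampling can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the slides: changes follow a Poisson process at a hypothetical rate of one per day, and the page is checked every three days (the typical spacing mentioned under sampling issues). The function names `simulate_change_times` and `observed_lifetimes` are made up for this example. Each check sees only the most recent modification timestamp, so several real changes between two checks collapse into one observed lifetime.

```python
import bisect
import random


def simulate_change_times(rate, horizon, rng):
    """Change times of a Poisson process (exponential gaps) up to `horizon`."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return times
        times.append(t)


def observed_lifetimes(change_times, period, horizon):
    """Lifetimes inferred from the last-modified stamp seen at each periodic check."""
    seen = []
    t = period
    while t <= horizon:
        # Index of the first change strictly after the check time t.
        i = bisect.bisect_right(change_times, t)
        if i > 0:
            stamp = change_times[i - 1]  # most recent change visible at time t
            if not seen or stamp != seen[-1]:
                seen.append(stamp)
        t += period
    # Differences between successive distinct stamps are the observed lifetimes.
    return [b - a for a, b in zip(seen, seen[1:])]


rng = random.Random(0)
# Hypothetical parameters: one change per day on average, checks every 3 days.
changes = simulate_change_times(rate=1.0, horizon=10_000.0, rng=rng)
actual = [b - a for a, b in zip(changes, changes[1:])]
observed = observed_lifetimes(changes, period=3.0, horizon=10_000.0)

mean_actual = sum(actual) / len(actual)
mean_observed = sum(observed) / len(observed)
print(f"mean actual lifetime:   {mean_actual:.2f} days")
print(f"mean observed lifetime: {mean_observed:.2f} days")
```

With these parameters most changes fall between checks and go unseen, so the observed lifetimes are fewer and longer than the actual ones. Age-based estimates avoid this because a single observation of the time since the last modification suffices.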
**Sampling the lifetime distribution**
There are two problems with trying to sample the difference of successive change times:
1. The second observation (o) will miss two changes (x).
2. The observation window is not big enough to see any changes (x).
[Figure: two timelines with x = modification and o = observation; labels: observed lifetime, actual lifetime, observation timespan.]

**Web page age CDF**
• Median age 120 days
• Upper 25% > 1 year
• Lowest 25% < 1 month
[Figure: cumulative probability versus age in days (log scale), with 1 day, 10 days, and 100 days marked.]

**Empirical lifetime distribution**
[Figure: two panels, lifetime PDF and lifetime CDF.]

**When do changes happen?**
Change times, mod 24×7 hours (one week), show that more changes happen during the span of US working hours (8 AM to 8 PM, EST).
[Figure: relative frequency (×10⁻³) versus time since Thursday 12:00 GMT in hours (0 to 150), with each day of the week labeled, plus Wednesday morning and Wednesday afternoon.]

**Distribution of mean change times**
• The Weibull distribution, a generalized exponential, models mean lifetimes fairly well:
[Equation: Weibull distribution with shape parameter σ and scale parameter δ.]
This can be used to find an age or lifetime CDF for any shape parameter σ and scale parameter δ. But for the age CDF, a growth model is needed, so age-based estimates can be inaccurate.

**Lifetime CDF: F (σ = 1.4, δ = 152.2)**
[Figure: cumulative probability versus lifetime in days (log scale, 10⁰ to 10³), comparing trial and reference curves.]

**(α, β)-currency for Poisson source**
A single source has Poisson changes at rate λ. If re-indexed every T time units, the expected probability α of the index entry being β-current is:
[Equation: α as a function of λ, T, and β.]
[Figure: probability α versus expected changes per check period, λT (log scale, 10⁻² to 10²), with curves for ν = 0.0, 0.25, 0.6, and 0.9.]

**Probability α of β-currency over a collection**
Expected probability α of a random index entry being β-current (given distribution f(t) of mean change times t):
[Equation: integral with two annotated factors, “distribution of avg. lifetimes” and “probability of being β-current given avg. lifetime”.]

**Index performance surface: α as a function of T, ν = β/T**
• Surface formed by integrating out the rate dependence
• Large period T implies α = ν
• Plane shown for α = 0.95, intersects at a level set (ν, T)

**95% level set: (T, β) pairs**
[Figure: grace period β in days versus re-indexing period T in days (log scales), with age-based and lifetime-based curves. Marked values: β = 1 day at T = 8.5 and 11.5 days; β = 1 week at T = 18 and 23 days; β = 1 month at T = 50 and 59 days; β = 1 year is also marked.]

**Bandwidth needed for (0.95, 1-week) currency**
For fixed-period checks, we can estimate processing speed requirements.
• For (0.95, 1-week) currency of this collection:
  • Must re-index with a period of around 18 days.
• A (0.95, 1-week) index of the whole web (~800 million pages) processes about 50 megabits/sec.
• A more “modest” (0.95, 1-week) index of 150 million pages will process 9 megabits/sec.

**Empirical search engine (α, β)-currency**
[Figure: α versus β in days (log scale, 10⁰ to 10³) for Google, Infoseek, AltaVista, and Northern Light.]

**A calculus for (α, β)-currency**
If x is (α, β)-current and y is (δ, γ)-current, then (x, y) is (αδ, max(β, γ))-current. Extend this to other atomic operations on information, e.g. composition.

**Summary**
• About one in five pages has been modified within the last 12 days.
• (0.95, 1-week) currency on our collection: must observe every 18 days
• Ideas: More specialty search engines? Distributed monitoring/remote update?
• Other work: algorithms for scheduling observation based on source change rate and importance

**Problem**
[Figure: attack types ordered from hard to detect to easy to detect: semantic attacks (information), system attacks (systems), denial-of-service attacks (infrastructure).]

**Distribution of information**
[Figure: distribution with annotations “Outliers”, “‘Gaussian’ is expected”, and “Collusion?”.]

**What makes a good mystery/thriller?**
[Figure: decision paths leading to a “wrong” conclusion and a “correct” conclusion.]
A wrong conclusion can be reached by one large, detectable bad decision or a sequence of small, undetectably perturbed decisions.
Understand the whole sequence of decisions, not just one in isolation.

**Ongoing research**
• Develop a model of such “semantic attacks”.
• Develop a way to quantify such things.
• Develop some tools for detecting/managing complex decision sequences.
• Make information/decision systems more robust.

**Acknowledgements**
• NSF KDI Grant 9873138
• DARPA contract F30602-98-2-0107
• DoD MURI (AFOSR contract F49620-97-1-03821)
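As a worked illustration of the Poisson-source slide, the sketch below computes the expected probability of currency for a single source and checks it by simulation. It is derived here from the stated assumptions, not copied from the slides, whose formula did not survive in this text. Write α for the probability, β for the grace period, λ for the Poisson change rate, and T for the re-indexing period. At a time t after the last observation, the entry is β-current if no change occurred in the first t − β time units, which has probability e^(−λ(t−β)) for t > β and 1 otherwise. Averaging t uniformly over one period gives α = β/T + (1 − e^(−λ(T−β)))/(λT) for β < T. The function names and the example parameters are hypothetical.

```python
import math
import random


def alpha_poisson(lam, T, beta):
    """Closed form (derived in the lead-in, assuming a Poisson source):
    probability that an entry is beta-current, averaged over one period T."""
    if beta >= T:
        return 1.0  # the grace period covers the whole re-indexing period
    return beta / T + (1.0 - math.exp(-lam * (T - beta))) / (lam * T)


def alpha_monte_carlo(lam, T, beta, trials, rng):
    """Estimate the same probability directly from the currency definition."""
    hits = 0
    for _ in range(trials):
        t = rng.uniform(0.0, T)              # time since the last observation
        first_change = rng.expovariate(lam)  # time of first change after it
        # beta-current: no change during the first (t - beta) time units.
        if t <= beta or first_change > t - beta:
            hits += 1
    return hits / trials


# Hypothetical example: one change per 10 days on average, re-indexed
# every 18 days, with a 1-week grace period.
lam, T, beta = 0.1, 18.0, 7.0
exact = alpha_poisson(lam, T, beta)
estimate = alpha_monte_carlo(lam, T, beta, trials=100_000, rng=random.Random(1))
print(f"closed form: {exact:.4f}   simulation: {estimate:.4f}")

# For a rapidly changing source (large lam * T), alpha approaches nu = beta / T.
print(f"large-rate limit: {alpha_poisson(1000.0, 10.0, 2.5):.4f}")
```

The large-rate limit agrees with the ν = β/T floor described for the index performance surface: when a source changes many times per check period, the entry is current only during the grace period that follows each observation.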