Modeling Web Content Dynamics

Brian Brewington (brew@dartmouth.edu), George Cybenko (gvc@dartmouth.edu)
IMA, February 2001

Observing changing information sources
• An index of changing information sources must re-index items periodically to keep the index from becoming out-of-date.
• What does it mean for an observer or index to be “up-to-date” or “current”?
• Our work on the web has two parts:
  • Estimation of change rates for a large sample of web pages
  • Re-indexing speed requirements with respect to a formal definition of “up-to-date”

Your brain is good at this
Where is your visual attention directed when driving a car? Why?
Form state estimates; re-observe when uncertainty becomes too large.

Ingredients
• A formal definition of “up-to-dateness”
• Data
• Scheduling to optimize “up-to-dateness”

A meaning for “up to date”
An index entry is (a,b)-current if it is correct to within a grace period of time b, with probability at least a.
To be “b-current”: no alteration is allowed in the gray region, which runs from the last observation at t_0 up to t_n - b, one grace period b before now.
[Figure: timeline marking t_0 (last observed), t_n - b, t_n (now), and t_0 + T (next observed); the gray region spans t_0 to t_n - b.]
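The definition can be checked directly when the true change times of a source are known. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and treating the gray region as the half-open interval (last_observed, now - grace] is an assumption about the boundary convention.

```python
def is_b_current(change_times, last_observed, now, grace):
    """True if the source did not change between the last observation
    and `grace` time units before `now` (the gray region)."""
    cutoff = now - grace
    # Any change inside (last_observed, cutoff] makes the entry stale.
    return not any(last_observed < t <= cutoff for t in change_times)


def fraction_b_current(change_times, last_observed, query_times, grace):
    """Empirical estimate of a: the share of query times at which
    the index entry is b-current."""
    hits = sum(
        is_b_current(change_times, last_observed, q, grace)
        for q in query_times
    )
    return hits / len(query_times)


# One change at time 4, entry last refreshed at time 0, grace period 2.
example = fraction_b_current([4.0], 0.0, range(1, 11), 2.0)
```

Averaging the check over many query times gives an empirical value for a, which is how the probabilistic half of the definition would be measured in practice.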

(a,b)-currency has meaning in many contexts
Any source has a spectrum of possibilities; here are some possible values (guesses):
• Newspaper: (0.9, 1 day)
• Television news: (0.95, 1 hour)
• Broker watching stocks: (0.95, 30 min)
• Air traffic controller: (0.95, 20 sec)
• Web search engine: (0.6, 1 day)
• An old web page’s links: (0.4, 70 days)

Collecting web page data
• Our web page data comes from a web monitoring service.
• The Informant runs periodic standing user queries against four search engines and monitors user-selected URLs. When new or updated results appear, users are notified via email.
• We download ~100,000 pages per day for ~30,000 users. See http://informant.dartmouth.edu

Sampling issues
• The sample is biased towards search engine results in the top 10 for users’ queries.
• A page is observed no more than once per day; pages are usually observed once every three days.
• Queries and page checks are run only at night, so sample times are correlated.
• Filesystem timestamps are available for about 65% of our observations.

Data in our collection
• As of March 2000, we had observations of about 3 million web pages. The data in the paper spans 7 months.
• Each page is observed an average of 12 times, and the average time span of observation is 38 days.
• Each observation includes:
  • “Last-Modified” timestamps, when available
  • Observation time (using the remote server’s time if possible)
  • Document summary information
  • Number of bytes (“Content-Length”)
  • Number of images, tables, forms, lists, banner ads
  • 16-bit hash of text, hyperlinks, and image references

“Lifetimes” vs. “ages”
• We can model objects as having independent, identically distributed time periods between modifications. We call these “lifetimes.”
• The “age” is the time since the present lifetime began.
By analogy, think of replacement parts, each with an independent lifetime length.
[Figure: age plotted against time from 0 to 4, with successive lifetimes of 0.62, 1.14, 0.84, and 1.53; each marker on the time axis is a change.]
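The relationship between lifetimes and ages can be made concrete with the four lifetimes shown in the figure. This is an illustrative sketch only: the helper names are hypothetical, and it assumes the first lifetime begins at time 0.

```python
from itertools import accumulate


def change_times_from_lifetimes(lifetimes, start=0.0):
    """Cumulative change times, assuming the first lifetime begins at `start`."""
    return [start + total for total in accumulate(lifetimes)]


def age_at(t, change_times, start=0.0):
    """Time elapsed at t since the most recent change (or since `start`)."""
    past = [c for c in change_times if c <= t]
    return t - (max(past) if past else start)


lifetimes = [0.62, 1.14, 0.84, 1.53]  # example values from the figure
changes = change_times_from_lifetimes(lifetimes)
ages = [age_at(t, changes) for t in (0.5, 1.0, 2.0, 3.0)]
```

The age resets to zero at every change and then grows linearly, which produces the sawtooth shape of the figure.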

Determining dynamics from the time data
Two ways to find the distribution of change rates:
1. Observe the time between successive modifications (lifetimes).
  • Good: direct measurement of time between changes
  • Bad: aliasing possible; needs repeat observations
2. Observe the time since the most recent modification (ages).
  • Good: doesn’t have aliasing problems; works without having to make repeat observations
  • Bad: requires that we accurately account for growth

Sampling the lifetime distribution
There are two problems with trying to sample the difference of successive change times:
1. An observation (o) can miss changes (x) that fall between it and the previous observation, so the observed lifetime is longer than the actual lifetime.
2. The observation window may not be big enough to see any changes (x).
[Figure: two timelines of modifications (x) and observations (o); one compares the observed lifetime with the actual lifetime, the other shows an observation timespan that contains no changes.]
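Both sampling problems can be reproduced with a small deterministic example. The sketch below assumes each observation reveals only the most recent modification time (as a “Last-Modified” timestamp would); the function name and the sample data are hypothetical.

```python
def observed_lifetimes(change_times, observation_times):
    """Lifetimes inferred from the last-modified time seen at each observation."""
    changes = sorted(change_times)
    seen = []  # distinct last-modified timestamps, in the order observed
    for obs in sorted(observation_times):
        visible = [c for c in changes if c <= obs]
        if visible and (not seen or visible[-1] != seen[-1]):
            seen.append(visible[-1])
    return [later - earlier for earlier, later in zip(seen, seen[1:])]


changes = [1.0, 2.0, 2.5, 6.0]
actual = [b - a for a, b in zip(changes, changes[1:])]

# Problem 1: the change at 2.0 is never seen, so two lifetimes merge into one.
aliased = observed_lifetimes(changes, [1.5, 3.0, 4.5])

# Problem 2: the window sees a single timestamp, so no lifetime is recorded.
too_short = observed_lifetimes(changes, [3.0, 4.5])
```

Here the actual lifetimes are 1.0, 0.5, and 3.5, but the first observation schedule reports a single lifetime of 1.5, and the second reports none at all.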

Web page age CDF
• Median age 120 days
• Upper 25% > 1 year
• Lowest 25% < 1 month
[Figure: web page age CDF; cumulative probability (0 to 1) versus age in days on a log scale, with 1 day, 10 days, and 100 days marked.]

Empirical lifetime distribution
[Figure: two panels, lifetime PDF and lifetime CDF.]

When do changes happen?
Change times, mod 24×7 hours, show that more changes happen during US working hours (8 AM to 8 PM, EST).
[Figure: relative frequency (×10^-3) of change times versus time since Thursday 12:00 GMT in hours, with each day of the week, Wednesday morning, and Wednesday afternoon labeled.]

Distribution of mean change times
• The Weibull distribution, a generalization of the exponential, models mean lifetimes fairly well, with shape parameter s and scale parameter d.
• The fitted distribution can be used to find an age or lifetime CDF for any s and d. For the age CDF, however, a growth model is also needed, so age-based estimates can be inaccurate.
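For reference, the standard two-parameter Weibull density and CDF are given below, with s as the shape and d as the scale. That this matches the parameterization used on the slide is an assumption.

```latex
w(t) = \frac{s}{d}\left(\frac{t}{d}\right)^{s-1} e^{-(t/d)^{s}},
\qquad
W(t) = 1 - e^{-(t/d)^{s}},
\qquad t \ge 0 .
```

Setting s = 1 recovers the exponential distribution with mean d, which is the sense in which the Weibull generalizes the exponential.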

Lifetime CDF: F (s=1.4, d=152.2)
[Figure: trial versus reference lifetime CDF; cumulative probability (0 to 1) versus lifetime in days on a log scale from 10^0 to 10^3.]

(a,b)-currency for Poisson source
A single source has Poisson changes at rate λ. If it is re-indexed every T time units, the expected probability a of the index entry being b-current is:
$a = n + \frac{1 - e^{-(1-n)\lambda T}}{\lambda T}$, where $n = b/T$ and $b < T$.
[Figure: probability a versus expected changes per check period, λT (log scale from 10^-2 to 10^2), with curves for n = 0.0, 0.25, 0.6, and 0.9.]
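The single-source probability follows from averaging over one check period. The derivation below is a sketch under assumptions not stated on the slide: the query time t is uniformly distributed over a period of length T that starts at the last observation, b < T, and \lambda denotes the Poisson change rate.

```latex
\begin{aligned}
\Pr[\text{$b$-current at } t] &=
  \begin{cases}
    1, & 0 \le t < b,\\
    e^{-\lambda (t-b)}, & b \le t \le T,
  \end{cases}\\[4pt]
a &= \frac{1}{T}\left[\, b + \int_{b}^{T} e^{-\lambda (t-b)}\,dt \right]
   = \frac{b}{T} + \frac{1 - e^{-\lambda (T-b)}}{\lambda T}
   = n + \frac{1 - e^{-(1-n)\lambda T}}{\lambda T},
   \qquad n = \frac{b}{T}.
\end{aligned}
```

As \lambda T grows, the second term vanishes and a approaches n; as \lambda T goes to zero, a approaches 1.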

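Probability a of b-currency over a collection
Expected probability a of a random index entry being b-current, given the distribution f(t) of mean change times t:
$a = \int_0^\infty f(t)\left[\frac{b}{T} + \frac{t}{T}\left(1 - e^{-(T-b)/t}\right)\right]dt$
Here f(t) is the distribution of average lifetimes, and the bracketed term is the probability of being b-current given average lifetime t (change rate 1/t), for b < T.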

Index performance surface: a as a function of T, n=b/T
• Surface formed by integrating out the rate dependence
• Large period T implies a=n
• The plane shown for a=0.95 intersects the surface at a level set of (n,T) pairs
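One way to trace a point on such a level set numerically is to average the single-source probability over a distribution of mean lifetimes and then bisect on T. The sketch below is not the authors' procedure. It assumes Weibull-distributed mean lifetimes, uses the values s=1.4 and d=152.2 from the lifetime CDF slide as example inputs, and integrates by an inverse-CDF substitution with the midpoint rule. All function names are hypothetical, and the result is not claimed to match the plotted level sets.

```python
import math


def source_currency(mean_lifetime, period, grace):
    """Probability that one Poisson source with the given mean lifetime
    is b-current when re-indexed every `period` time units."""
    if grace >= period:
        return 1.0
    x = (period - grace) / mean_lifetime
    # -expm1(-x) equals 1 - exp(-x), computed accurately for small x.
    return grace / period + (mean_lifetime / period) * (-math.expm1(-x))


def collection_currency(period, grace, shape, scale, steps=2000):
    """Expectation of source_currency over Weibull mean lifetimes.
    Substituting t = scale * (-ln(1 - u))**(1/shape) turns the integral
    over t into an integral over u in (0, 1), done by the midpoint rule."""
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) / steps
        t = scale * (-math.log1p(-u)) ** (1.0 / shape)
        total += source_currency(t, period, grace)
    return total / steps


def reindex_period(target, grace, shape, scale, upper=1e4, tol=1e-3):
    """Largest period whose collection currency still meets `target`.
    Bisection works because currency falls from 1 as the period grows."""
    lo, hi = grace, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if collection_currency(mid, grace, shape, scale) >= target:
            lo = mid
        else:
            hi = mid
    return lo


period_days = reindex_period(0.95, grace=7.0, shape=1.4, scale=152.2)
```

Repeating the call for several grace periods gives one (T,b) pair per call, which together outline a level set of the surface.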

95% level set: (T,b) pairs
[Figure: age-based and lifetime-based level sets; grace period b [days] versus re-indexing period T [days] on log-log axes. Labeled grace periods: b = 1 day, 1 week, 1 month, 1 year. Labeled periods: T = 8.5, 11.5, 18, 23, 50, and 59 days.]

Bandwidth needed for (0.95, 1-week) currency
For fixed-period checks, we can estimate processing speed requirements.
• For (0.95, 1-week) currency of this collection, we must re-index with a period of around 18 days.
• A (0.95, 1-week) index of the whole web (~800 million pages) must process about 50 megabits/sec.
• A more “modest” (0.95, 1-week) index of 150 million pages must process about 9 megabits/sec.
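The bandwidth figures can be approximated with back-of-the-envelope arithmetic: every page is fetched once per re-indexing period. The slide does not give an average page size, so the 12,000 bytes used below is an assumption, chosen because it lands near the quoted numbers. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def required_megabits_per_second(pages, period_days, bytes_per_page):
    """Average bandwidth needed to fetch every page once per period."""
    bits = pages * bytes_per_page * 8
    return bits / (period_days * SECONDS_PER_DAY) / 1e6


ASSUMED_BYTES_PER_PAGE = 12_000  # assumption, not stated on the slide

whole_web = required_megabits_per_second(800e6, 18, ASSUMED_BYTES_PER_PAGE)
modest = required_megabits_per_second(150e6, 18, ASSUMED_BYTES_PER_PAGE)
print(f"{whole_web:.1f} Mbit/s, {modest:.1f} Mbit/s")
```

With that page size the two cases come out at roughly 49 and 9 megabits/sec, in line with the slide's estimates.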

Empirical search engine (a,b)-currency
[Figure: probability a (0.4 to 1) versus grace period b in days (log scale, 10^0 to 10^3) for Google, Infoseek, AltaVista, and Northern Light.]

A calculus for (a,b)-currency
If x is (a,b)-current and y is (d,g)-current, then (x,y) is (ad, max(b,g))-current.
Extend this to other atomic operations on information, e.g. composition.
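The pairing rule is simple to express in code. In the sketch below the type and function names are hypothetical. Multiplying the two probabilities presumes that the items go stale independently, which the slide does not state and should be treated as an assumption.

```python
from functools import reduce
from typing import NamedTuple


class Currency(NamedTuple):
    probability: float  # a: probability of being current
    grace: float        # b: grace period, in any consistent time unit


def pair(x: Currency, y: Currency) -> Currency:
    """Currency of the pair (x, y), assuming independent staleness."""
    return Currency(x.probability * y.probability, max(x.grace, y.grace))


def combine(items):
    """Currency of a tuple of several items, by repeated pairing."""
    return reduce(pair, items)


newspaper = Currency(0.9, 1.0)       # (0.9, 1 day), from the earlier examples
tv_news = Currency(0.95, 1.0 / 24)   # (0.95, 1 hour), expressed in days
both = pair(newspaper, tv_news)
```

The combined probability can only shrink and the grace period can only grow, so a pair is never more current than its weakest component.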

Summary
• About one in five pages has been modified within the last 12 days.
• (0.95, 1-week) currency on our collection: must observe every 18 days
• Ideas: More specialty search engines? Distributed monitoring/remote update?
• Other work: algorithms for scheduling observation based on source change rate and importance

Mathematics of “Semantic Hacking”

Problem
[Figure: attack types ranked from hard to detect to easy to detect: semantic attacks (information), system attacks (systems), denial-of-service attacks (infrastructure).]

Distribution of information
“Gaussian” is expected. Collusion?
[Figure: distribution of information with outliers labeled.]

What makes a good mystery/thriller?
A wrong conclusion can be reached by one large, detectable bad decision or a sequence of small, undetectably perturbed decisions. Understand the whole sequence of decisions, not just one in isolation.
[Figure: decision sequences leading to a “wrong” conclusion and a “correct” conclusion.]

Ongoing research
• Develop a model of such “semantic attacks”.
• Develop a way to quantify such attacks.
• Develop some tools for detecting/managing complex decision sequences.
• Make information/decision systems more robust.

Acknowledgements
• NSF KDI Grant 9873138
• DARPA contract F30602-98-2-0107
• DoD MURI (AFOSR contract F49620-97-1-03821)
