Bursty and Hierarchical Structure in Streams

Bursty and Hierarchical Structure in Streams. Jon Kleinberg, ACM SIGKDD’02. Presented by Deng Cai, 10/20/2004. Outline: the problem and main idea; a weighted automaton model (two-state, infinite-state); experiments (email, paper titles); thoughts.

Presentation Transcript


  1. Bursty and Hierarchical Structure in Streams • Jon Kleinberg, ACM SIGKDD’02 • Presented by Deng Cai, 10/20/2004

  2. Outline • The problem and main idea • A Weighted Automaton Model • Two-state • Infinite-state • Experiments • Email • Paper titles • Thoughts

  3. Main Idea • Extract meaningful structure from a document stream • Burst of activity: certain features rise sharply in frequency as a topic emerges • A formal approach for modeling such “bursts” • An infinite-state automaton • Bursts appear as state transitions • A nested representation of the set of bursts imposes a hierarchical structure on the overall stream

  4. Two Cases • Email • Messages arrive continuously over time • Try to find hierarchical structure • Paper titles • Papers arrive in batches • Try to enumerate all the bursts (ranking bursts)

  5. A Weighted Automaton Model: One-State Model • Generating model: $f(x) = \alpha e^{-\alpha x}$ • $x$: the gap in time between two consecutive messages • Expectation: $E[x] = \alpha^{-1}$ • $\alpha$: rate of message arrivals • Why this model?
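The one-state model can be illustrated with a minimal sketch, assuming the exponential gap density $f(x) = \alpha e^{-\alpha x}$ from Kleinberg's paper. The function names `exp_density` and `sample_gaps`, the seed, and the rate value are hypothetical choices for illustration only.

```python
import math
import random

def exp_density(x, alpha):
    """Exponential density f(x) = alpha * exp(-alpha * x) for a gap x >= 0."""
    return alpha * math.exp(-alpha * x)

def sample_gaps(alpha, n, seed=0):
    """Draw n inter-arrival gaps from the one-state model with rate alpha."""
    rng = random.Random(seed)
    return [rng.expovariate(alpha) for _ in range(n)]

# With rate alpha = 2 messages per unit time, the mean gap should be near 1/2.
gaps = sample_gaps(alpha=2.0, n=20000)
mean_gap = sum(gaps) / len(gaps)
```

The sample mean approaching $\alpha^{-1}$ is what the "Expectation" bullet refers to.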

  6. A Weighted Automaton Model: Two-State Model • Two-state automaton A with states q0 and q1, each with its own gap distribution • A changes state with probability p and remains in its current state with probability 1-p, independently of previous emissions and state changes • A begins in state q0. Before each message is emitted, A changes state with probability p. A message is then emitted, and the gap in time until the next message is determined by the distribution associated with A's current state.
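The generative process on this slide could be simulated roughly as follows. This is a sketch under the assumption that each state emits exponentially distributed gaps (rate `alpha0` in q0 and `alpha1` in q1); the function name and parameter values are hypothetical.

```python
import random

def simulate_two_state(n, p, alpha0, alpha1, seed=0):
    """Simulate n messages from the two-state automaton.

    Returns (states, gaps): the state in force when each message is emitted
    and the gap that follows it. Exponential gaps are an assumption here.
    """
    rng = random.Random(seed)
    rates = (alpha0, alpha1)
    state = 0                      # A begins in state q0
    states, gaps = [], []
    for _ in range(n):
        if rng.random() < p:       # change state with probability p
            state = 1 - state
        states.append(state)
        gaps.append(rng.expovariate(rates[state]))
    return states, gaps

states, gaps = simulate_two_state(n=100, p=0.2, alpha0=1.0, alpha1=10.0)
```

Runs of state 1 in `states` correspond to stretches where gaps are drawn at the higher rate, which is what a burst looks like in this model.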

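  7. A Weighted Automaton Model: Two-State Model • Estimate a state sequence from a given set of messages • Maximum likelihood • n inter-arrival gaps: $x = (x_1, \ldots, x_n)$ • A state sequence: $q = (q_{i_1}, \ldots, q_{i_n})$ • b denotes the number of state transitions in the sequence q • $\Pr[q \mid x] \propto p^{b} (1-p)^{n-b} \prod_{t=1}^{n} f_{i_t}(x_t)$, where $f_{i_t}$ is the gap density of state $q_{i_t}$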

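  8. A Weighted Automaton Model: Two-State Model • Finding a state sequence q that maximizes the previous probability is equivalent to finding one that minimizes $-\ln \Pr[q \mid x]$ • Dropping the terms that do not depend on q, this is equivalent to minimizing the following cost function: $c(q \mid x) = b \ln\left(\frac{1-p}{p}\right) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$, where $f_{i_t}$ is the gap density of state $q_{i_t}$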
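A minimal sketch of this minimization, assuming exponential gap densities and the cost $b \ln((1-p)/p) + \sum_t -\ln f_{i_t}(x_t)$ from Kleinberg's paper. It uses a standard Viterbi-style dynamic program; the function name and the example rates are hypothetical.

```python
import math

def two_state_path(gaps, p, alpha0, alpha1):
    """Return (min_cost, states) for the two-state cost function."""
    rates = (alpha0, alpha1)
    switch = math.log((1 - p) / p)          # cost of one state transition

    def emit(i, x):
        # -ln(alpha_i * exp(-alpha_i * x))
        return -(math.log(rates[i]) - rates[i] * x)

    cost = [0.0, math.inf]                  # the automaton begins in q0
    back = []
    for x in gaps:
        new_cost, ptr = [], []
        for j in (0, 1):
            cands = [cost[i] + (switch if i != j else 0.0) for i in (0, 1)]
            best_i = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[best_i] + emit(j, x))
            ptr.append(best_i)
        cost = new_cost
        back.append(ptr)

    j = 0 if cost[0] <= cost[1] else 1
    best = cost[j]
    path = []
    for ptr in reversed(back):              # follow back-pointers to the start
        path.append(j)
        j = ptr[j]
    path.reverse()
    return best, path

# Long gaps, then a run of short gaps, then long gaps again.
gaps = [10.0] * 5 + [0.1] * 10 + [10.0] * 5
best, path = two_state_path(gaps, p=0.1, alpha0=0.1, alpha1=10.0)
```

On this input the recovered path stays in q0 for the long gaps and switches to q1 only for the run of short gaps, since the saving on ten short gaps outweighs the cost of two transitions.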

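  9. An Infinite-State Model • Base state q0 • Exponential density function with rate $\alpha_0 = \hat{g}^{-1} = n/T$, where the n gaps span a total time T • Consistent with completely uniform message arrivals • State qi • Exponential density function with rate $\alpha_i = \hat{g}^{-1} s^{i}$ • $s > 1$: scaling parameter • The infinite sequence of states models inter-arrival gaps that decrease geometrically from $\hat{g}$ • For every i and j, there is a cost $\tau(i, j)$ associated with a state transition from qi to qj • In this paper: $\tau(i, j) = (j - i)\,\gamma \ln n$ for $j > i$, and $\tau(i, j) = 0$ for $j < i$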
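The two ingredients on this slide can be written down directly. This sketch assumes the definitions from Kleinberg's paper, $\alpha_i = (n/T)\, s^i$ and $\tau(i, j) = (j - i)\,\gamma \ln n$ for upward moves (zero otherwise); the function names are hypothetical.

```python
import math

def state_rate(i, total_time, n, s):
    """Rate of state q_i: alpha_i = (n / T) * s**i."""
    return (n / total_time) * s ** i

def transition_cost(i, j, n, gamma):
    """tau(i, j): moving up costs (j - i) * gamma * ln n; moving down is free."""
    return (j - i) * gamma * math.log(n) if j > i else 0.0

# 50 gaps over 100 time units: base rate 0.5, doubling with every level.
base = state_rate(0, total_time=100.0, n=50, s=2.0)
third = state_rate(3, total_time=100.0, n=50, s=2.0)
```

Making downward transitions free means the automaton pays to enter a burst but not to leave it.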

  10. An Infinite-State Model • This automaton, with its associated parameters s and $\gamma$, will be denoted as $A^{*}_{s,\gamma}$ • Given gaps $x = (x_1, \ldots, x_n)$, find a state sequence $q = (q_{i_1}, \ldots, q_{i_n})$ that minimizes the cost function: $c(q \mid x) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$, with $i_0 = 0$ • As before, minimizing the first term is consistent with having few state transitions and transitions that span only a few distinct states, while minimizing the second term is consistent with passing through states whose rates agree closely with the inter-arrival gaps. Thus, the combined goal is to track the sequence of gaps as well as possible without changing state too much. • Observe that the scaling parameter s controls the “resolution” with which the discrete rate values of the states are able to track the real-valued gaps; the parameter $\gamma$ controls the ease with which the automaton can change states. • How to choose these two parameters?
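To make the two terms of the cost concrete, here is a small sketch that evaluates the cost of a given state sequence. It assumes the definitions from Kleinberg's paper (rates $\alpha_i = s^i / \hat{g}$ with $\hat{g}$ the mean gap, upward transition cost $(j - i)\,\gamma \ln n$, start in q0); `sequence_cost` is a hypothetical name.

```python
import math

def sequence_cost(states, gaps, s, gamma):
    """Cost of a state sequence: transition costs plus -ln f_i(x) per gap."""
    n = len(gaps)
    g_hat = sum(gaps) / n                 # mean gap, T / n
    cost, prev = 0.0, 0                   # the automaton starts in q0
    for i, x in zip(states, gaps):
        if i > prev:                      # only upward moves are charged
            cost += (i - prev) * gamma * math.log(n)
        alpha = s ** i / g_hat
        cost += -(math.log(alpha) - alpha * x)
        prev = i
    return cost

uniform = [1.0] * 10
stay_low = sequence_cost([0] * 10, uniform, s=2.0, gamma=1.0)
go_high = sequence_cost([1] * 10, uniform, s=2.0, gamma=1.0)
```

For perfectly uniform gaps, staying in q0 is cheaper than moving to q1, which matches the remark that the base state is consistent with uniform arrivals.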

  11. An Infinite-State Model • An optimal state sequence in $A^{*}_{s,\gamma}$ can be found by restricting to a finite number of states k; in practice k is a very small constant, always at most 25 • This can be done by adapting the standard forward dynamic programming algorithm used for hidden Markov models to the model and cost function defined here
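The dynamic program mentioned here might look like the following sketch. It assumes the definitions from Kleinberg's paper (rates $s^i / \hat{g}$, upward transition cost $(j - i)\,\gamma \ln n$, start in q0) and a fixed cap of k states; the function name and default values are hypothetical, and the quadratic inner loop favors clarity over speed.

```python
import math

def optimal_states(gaps, s=2.0, gamma=1.0, k=25):
    """Viterbi-style search for the minimum-cost state sequence over k states."""
    n = len(gaps)
    g_hat = sum(gaps) / n
    rates = [s ** i / g_hat for i in range(k)]
    up = gamma * math.log(n)                   # cost per level climbed
    cost = [0.0] + [math.inf] * (k - 1)        # start in q0
    back = []
    for x in gaps:
        new_cost, ptr = [], []
        for j in range(k):
            best_i = min(range(k), key=lambda i: cost[i] + max(j - i, 0) * up)
            emit = -(math.log(rates[j]) - rates[j] * x)
            new_cost.append(cost[best_i] + max(j - best_i, 0) * up + emit)
            ptr.append(best_i)
        cost = new_cost
        back.append(ptr)
    j = min(range(k), key=lambda i: cost[i])
    path = []
    for ptr in reversed(back):                 # trace back-pointers
        path.append(j)
        j = ptr[j]
    return path[::-1]

gaps = [10.0] * 10 + [0.1] * 20 + [10.0] * 10
path = optimal_states(gaps)
```

The returned list gives one state index per gap; indices above zero mark the bursty stretch in the middle of this example.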

  12. An Infinite-State Model • We can formally define a burst of intensity j to be a maximal interval over which q is in a state of index j or higher. • It follows that bursts exhibit a natural nested structure: a burst of intensity j may contain one or more sub-intervals that are bursts of intensity j+1.
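The burst definition translates directly into code. This is a minimal sketch that takes a list of state indices (one per gap) and returns every maximal interval at each intensity level; the function name `bursts` and the tuple layout are hypothetical.

```python
def bursts(states):
    """Return (intensity, start, end) for each maximal interval in which the
    state index is at least `intensity`. Indices are inclusive positions."""
    found = []
    for level in range(1, max(states, default=0) + 1):
        start = None
        for t, i in enumerate(states):
            if i >= level and start is None:
                start = t                          # a burst opens
            elif i < level and start is not None:
                found.append((level, start, t - 1))  # the burst closes
                start = None
        if start is not None:                      # burst runs to the end
            found.append((level, start, len(states) - 1))
    return found

result = bursts([0, 1, 1, 2, 2, 1, 0, 1, 0])
# → [(1, 1, 5), (1, 7, 7), (2, 3, 4)]
```

The intensity-2 burst (3, 4) lies inside the intensity-1 burst (1, 5), which is the nesting the slide describes.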

  13. An Infinite-state Model

  14. Experiments: Email (Hierarchical Structure) • Saved email of the author (June 9, 1997 to Aug. 23, 2001) • 34,344 messages in total • Subsets of the collection can be chosen by selecting all messages that contain a particular string or set of strings • ITR: the name of a large National Science Foundation program for which the author and his colleagues wrote two proposals in 1999-2000 • Prelim: the term used at Cornell for (non-final) exams in undergraduate courses • Questions examined: • First, is it in fact the case that the appearance of messages containing particular words exhibits a “spike,” in some informal sense, in the (temporal) vicinity of significant times such as deadlines, scheduled events, or unexpected developments? • Second, do the algorithms developed here provide a means for identifying this phenomenon?

  15. ITR

  16. ITR

  17. prelim

  18. prelim

  19. Experiments: Paper Title (Enumerating Bursts) • For every word w that appears in the collection, one computes all the bursts in the stream of messages containing w. This is combined with a method for computing a weight for each burst and then ranking the bursts by weight. • This essentially provides a way to find the terms that exhibit the most prominent rising and falling pattern over a limited period of time. • Extracting bursts in term usage from the titles of conference papers. • Two distinct sources of data are used here: • The titles of all papers from the database conferences SIGMOD and VLDB for the years 1975-2001 • The titles of all papers from the theory conferences STOC and FOCS for the years 1969-2001

  20. The Automaton • $A^{*}_{s,\gamma}$ is not suitable in this case, since it is fundamentally based on analyzing the distribution of inter-arrival gaps • Documents arrive in discrete batches; in each new batch of documents, some are relevant and some are irrelevant. • The idea is thus to find an automaton model that generates batched arrivals, with particular fractions of relevant documents. • A sequence of batched arrivals could be considered bursty if the fraction of relevant documents alternates between reasonably long periods in which the fraction is large and other periods in which it is small.

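  21. The Automaton • The t-th batch contains $d_t$ documents, of which $r_t$ are relevant; $R = \sum_t r_t$ and $D = \sum_t d_t$ • Base state q0: $p_0 = R/D$ • State qi: $p_i = p_0 s^{i}$ • State qi will only be defined for i such that $p_i \le 1$ • State qi produces a mixture of relevant and irrelevant documents according to a binomial distribution with probability $p_i$.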

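  22. The Automaton • Cost function over m batches: $c(q \mid r, d) = \sum_{t=0}^{m-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{m} \sigma(i_t, r_t, d_t)$ • If the automaton is in state qi when the t-th batch arrives, the cost incurred is $\sigma(i, r_t, d_t) = -\ln\left[\binom{d_t}{r_t} p_i^{r_t} (1 - p_i)^{d_t - r_t}\right]$, where the batch holds $d_t$ documents of which $r_t$ are relevant and $p_i$ is the relevant fraction expected in state qi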
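The per-batch cost can be sketched as the negative log of a binomial probability, assuming the definition from Kleinberg's paper with $p_i = p_0 s^i$. The function name `sigma` and its argument order are hypothetical, and this sketch requires $0 < p_i < 1$ so that both logarithms are defined.

```python
import math

def sigma(i, r, d, p0, s):
    """Cost of seeing r relevant documents out of d while in state q_i:
    -ln[ C(d, r) * p_i**r * (1 - p_i)**(d - r) ] with p_i = p0 * s**i."""
    p = p0 * s ** i
    if not 0.0 < p < 1.0:
        raise ValueError("state q_i is not usable: p_i must lie in (0, 1)")
    log_prob = (math.log(math.comb(d, r))
                + r * math.log(p)
                + (d - r) * math.log(1.0 - p))
    return -log_prob

# A batch with 8 relevant documents out of 10 fits the higher state better.
low = sigma(0, r=8, d=10, p0=0.4, s=2.0)    # p_0 = 0.4
high = sigma(1, r=8, d=10, p0=0.4, s=2.0)   # p_1 = 0.8
```

A smaller value means the observed batch is more probable under that state, so `high < low` here.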

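  23. Experiment (Paper Title) • The main goal is to enumerate bursts of positive intensity, so the two-state automaton is used. • Given an optimal state sequence, bursts of positive intensity correspond to intervals in which the state is q1 rather than q0 • Weight of the burst over $[t_1, t_2]$: $\sum_{t=t_1}^{t_2} \left(\sigma(0, r_t, d_t) - \sigma(1, r_t, d_t)\right)$, the improvement in cost from using q1 rather than q0 over the interval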
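A sketch of the burst weight, assuming the definition from Kleinberg's paper: the cost saved by using q1 instead of q0 over the burst interval, with binomial batch costs. The names `batch_cost` and `burst_weight`, the `(r, d)` tuple layout, and the example numbers are hypothetical.

```python
import math

def batch_cost(p, r, d):
    """-ln of the binomial probability of r relevant documents out of d."""
    return -(math.log(math.comb(d, r))
             + r * math.log(p)
             + (d - r) * math.log(1.0 - p))

def burst_weight(batches, t1, t2, p0, s):
    """Sum over batches t1..t2 (inclusive) of cost in q0 minus cost in q1."""
    p1 = p0 * s
    return sum(batch_cost(p0, r, d) - batch_cost(p1, r, d)
               for r, d in batches[t1:t2 + 1])

# (relevant, total) per batch; batches 1 and 2 form the candidate burst.
batches = [(1, 10), (8, 10), (9, 10), (1, 10)]
inside = burst_weight(batches, 1, 2, p0=0.4, s=2.0)
outside = burst_weight(batches, 0, 0, p0=0.4, s=2.0)
```

The weight is positive where the relevant fraction is high and negative where it is low, so ranking bursts by weight puts the most pronounced ones first.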

  24. Experiment Result

  25. Experiment Result

  26. Thoughts • Identifying whether a message belongs to a certain topic is the hard problem (this paper avoids it) • Text mining should handle this problem • The results are not very impressive, and the model might not be so meaningful • Why this generating model? • Why these parameters? • None of these choices can be mathematically proved (verified)
