
Mining Correlated Bursty Topic Patterns from Coordinated Text Streams


Presentation Transcript


  1. Mining Correlated Bursty Topic Patterns from Coordinated Text Streams. Xuanhui Wang, ChengXiang Zhai, Xiao Hu, Richard Sproat. KDD 2007.

  2. Outline. • Introduction • Preliminaries • Coordinated mixture model • Experiment • Conclusion

  3. Introduction. • Text mining research has almost exclusively focused on mining a single text stream. • Topic Detection and Tracking (TDT) • Others

  4. Preliminaries. • Text Stream: A text stream S of length n and with vocabulary V is an ordered sequence of text samples (S1, S2, ..., Sn) indexed by time, where Si is a sequence of words from the vocabulary set V at time point i.

  5. Preliminaries. (cont) • Coordinated Text Streams: A set of text streams is called coordinated text streams if all the streams share the same time index and have the same length.
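A minimal sketch of these two definitions in Python (the type alias, function name, and toy data below are illustrative assumptions, not from the paper):

```python
from typing import List

# A text stream: an ordered sequence of text samples indexed by time,
# where each sample S_i is a list of words from the stream's vocabulary V.
TextStream = List[List[str]]

def are_coordinated(streams: List[TextStream]) -> bool:
    """Coordinated text streams share the same time index, i.e. the same length."""
    return len({len(s) for s in streams}) <= 1

# Toy example: two streams (e.g. English and Chinese newswire), three time points each.
english: TextStream = [["apec", "summit"], ["trade", "talks"], ["economy", "growth"]]
chinese: TextStream = [["apec", "峰会"], ["贸易", "谈判"], ["经济", "增长"]]
print(are_coordinated([english, chinese]))  # True
```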

  6. Preliminaries. (cont) • Topic: A topic in stream Si is defined as a probability distribution over the words in its vocabulary Vi. We also call such a word distribution a topic model.

  7. Preliminaries. (cont) • Bursty Topic: Let θ be a topic (model) in stream Si. Let t ∈ [1, n] be a time index variable and p(θ|t, Si) be the relative coverage of the topic θ at time t in stream Si. θ is a bursty topic in stream Si if ∃t1, t2 ∈ [1, n] such that t2 − t1 ≥ σ and ∀t ∈ [t1, t2], p(θ|t, Si) ≥ κ where σ is a span threshold and κ is a coverage threshold.
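Read literally, the definition is a check over the coverage series p(θ|t, Si); a short Python sketch (the function name and inputs are assumptions for illustration):

```python
from typing import Sequence

def is_bursty(coverage: Sequence[float], sigma: int, kappa: float) -> bool:
    """coverage[t] = p(theta | t, S_i). The topic is bursty if some inclusive
    window [t1, t2] with t2 - t1 >= sigma has coverage >= kappa at every t."""
    run = 0  # length of the current run of time points with coverage >= kappa
    for c in coverage:
        run = run + 1 if c >= kappa else 0
        if run >= sigma + 1:  # a run of sigma + 1 points spans t2 - t1 = sigma
            return True
    return False
```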

  8. Preliminaries. (cont) • Correlated Bursty Topic Pattern: A correlated bursty topic pattern in a set of coordinated text streams S = {S1, ..., Sm} is defined as a set of topics {θ1 , ..., θm} such that θi is a bursty topic in stream Si and ∃t1, t2 ∈ [1, n] such that t2 − t1 ≥ σ and ∀t ∈ [t1, t2], ∀i ∈ [1,m], p(θi |t, Si) ≥ κ where σ is a span threshold and κ is a coverage threshold.
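The correlated pattern simply requires one common window to work for every stream at once; a sketch extending `is_bursty` above (again with assumed names):

```python
from typing import Sequence

def is_correlated_bursty(coverages: Sequence[Sequence[float]],
                         sigma: int, kappa: float) -> bool:
    """coverages[i][t] = p(theta_i | t, S_i) for the candidate topic in stream S_i.
    The set {theta_1, ..., theta_m} is a correlated bursty topic pattern if some
    inclusive window [t1, t2] with t2 - t1 >= sigma has coverage >= kappa in
    every stream at every time point of the window."""
    n = min(len(c) for c in coverages)
    run = 0
    for t in range(n):
        run = run + 1 if all(c[t] >= kappa for c in coverages) else 0
        if run >= sigma + 1:
            return True
    return False
```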

  9. Coordinated mixture model. • The basic idea of our approach is to align the text samples from different streams based on the shared time stamps and discover topics from multiple streams simultaneously with a single probabilistic mixture model.

  10. Coordinated mixture model. (cont) • There are two problems with the simpler approach of mining topics from each stream separately and then matching them: (1) We would need to match topics across different streams, which is difficult because the vocabularies of different streams do not necessarily overlap. (2) The topics discovered in each stream may explain the corresponding stream well but not necessarily match the common topics shared by multiple streams.

  11. Coordinated mixture model. (cont) • Formal Definition: Let S = {S1, ..., Sm} be m coordinated text streams with vocabularies V1, ..., Vm. Assume there are k correlated bursty topic patterns in the streams. Each pattern is indexed by a latent cause variable z ∈ [1, k], and w denotes a word in vocabulary Vi.

  12. Coordinated mixture model. (cont) • The generative model: • We assume that a word w appears at time t in stream Si with probability P(w|t, i). • λB is the mixture weight of the background model. • P(z|t) is the probability of choosing pattern z at time point t.

  13. Coordinated mixture model. (cont) • [The word-generation formulas on this slide appear as images in the original and are not reproduced in the transcript.]
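Given the quantities named on the previous slide, the mixture most plausibly has the following form (a hedged reconstruction from the surrounding definitions, not a formula copied from the paper): a word w at time t in stream Si comes either from a stream-specific background model Bi with weight λB, or from one of the k pattern topics, chosen with probability p(z|t):

```latex
p(w \mid t, i) \;=\; \lambda_B \, p(w \mid B_i) \;+\; (1 - \lambda_B) \sum_{z=1}^{k} p(z \mid t)\, p(w \mid z, i)
```

Note that p(z|t) carries no stream index, so the same pattern weights are shared by all m streams.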

  14. Coordinated mixture model. (cont) • The log-likelihood of generating a text sample Sit, where c(w, Sit) is the count of word w in Sit. • The log-likelihood of generating all m coordinated streams sums this over streams and time points (a reconstruction of both formulas is sketched below).
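A hedged reconstruction of the two log-likelihood expressions referred to above, using the mixture p(w|t, i) sketched earlier (the paper's exact notation may differ):

```latex
\log p(S_{i,t}) \;=\; \sum_{w \in V_i} c(w, S_{i,t}) \,\log p(w \mid t, i),
\qquad
\log p(S) \;=\; \sum_{i=1}^{m} \sum_{t=1}^{n} \sum_{w \in V_i} c(w, S_{i,t}) \,\log p(w \mid t, i)
```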

  15. Coordinated mixture model. (cont) • Parameter Estimation: The parameters P(w|z, i) and P(z|t) are estimated iteratively with the expectation-maximization (EM) algorithm.
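A compact NumPy sketch of this iterative estimation, under the assumptions made above (shared p(z|t), per-stream background and topic models); the function name, array layout, and defaults are illustrative and not the paper's implementation:

```python
import numpy as np

def coordinated_mixture_em(counts, k, lambda_b=0.95, n_iter=50, seed=0):
    """counts: list of m arrays, counts[i][t, w] = c(w, S_it), shape (n, |V_i|).
    Returns topic word distributions p(w|z, i) and pattern weights p(z|t)."""
    rng = np.random.default_rng(seed)
    m, n = len(counts), counts[0].shape[0]
    p_bg = [c.sum(axis=0) / c.sum() for c in counts]            # background per stream
    p_wz = [rng.dirichlet(np.ones(c.shape[1]), size=k) for c in counts]  # (k, |V_i|)
    p_zt = np.full((n, k), 1.0 / k)                              # shared across streams

    for _ in range(n_iter):
        new_wz = [np.zeros_like(p) for p in p_wz]
        new_zt = np.zeros_like(p_zt)
        for i in range(m):
            for t in range(n):
                c_t = counts[i][t]                               # word counts at time t
                topic = p_zt[t][:, None] * p_wz[i]               # (k, |V_i|)
                denom = lambda_b * p_bg[i] + (1 - lambda_b) * topic.sum(axis=0)
                post = (1 - lambda_b) * topic / np.maximum(denom, 1e-12)  # E-step
                new_wz[i] += post * c_t                          # M-step accumulators
                new_zt[t] += (post * c_t).sum(axis=1)
        p_wz = [w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12) for w in new_wz]
        p_zt = new_zt / np.maximum(new_zt.sum(axis=1, keepdims=True), 1e-12)
    return p_wz, p_zt
```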

  16. Coordinated mixture model. (cont) • The expectation step is to calculate: • The maximization step is to update the probabilities:
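The slide's formulas appear as images in the original. For the mixture sketched above, the standard EM updates would be as follows (an assumption about the exact form; the paper may present them differently). E-step, the posterior probability that word w at time t in stream i was generated by pattern z rather than the background:

```latex
p(z \mid w, t, i) \;=\; \frac{(1-\lambda_B)\, p(z \mid t)\, p(w \mid z, i)}
  {\lambda_B\, p(w \mid B_i) \;+\; (1-\lambda_B) \sum_{z'} p(z' \mid t)\, p(w \mid z', i)}
```

M-step, re-estimating the topic word distributions and the pattern weights (each normalized to sum to one):

```latex
p(w \mid z, i) \;\propto\; \sum_{t=1}^{n} c(w, S_{i,t})\, p(z \mid w, t, i),
\qquad
p(z \mid t) \;\propto\; \sum_{i=1}^{m} \sum_{w \in V_i} c(w, S_{i,t})\, p(z \mid w, t, i)
```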

  17. Coordinated mixture model. (cont) • Constraining EM with Temporal Dependency:
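The constraint itself is shown as a formula image on the slide and is not reproduced here. One plausible realization (an assumption, not the paper's confirmed formulation) is to smooth the estimated pattern weights toward their temporal neighbors after each M-step, with a small weight λ:

```latex
p(z \mid t) \;\leftarrow\; (1-\lambda)\, \hat{p}(z \mid t) \;+\; \frac{\lambda}{2}\,\bigl(\hat{p}(z \mid t-1) + \hat{p}(z \mid t+1)\bigr)
```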

  18. Coordinated mixture model. (cont) • Mutual Reinforcement across Streams: because the pattern distribution P(z|t) is shared by all m streams, evidence for a pattern in one stream reinforces the estimation of the matching topic word distributions P(w|z, i) in the other streams during EM.

  19. Experiment. • The news streams consist of six months of news articles from the Xinhua English and Chinese newswires, dated from June 8, 2001 through November 7, 2001.

  20. Experiment. (cont) • We use λB = 0.95 in our experiments. • We set λ = 0.1 in the following experiments. • Only bursty patterns that satisfy the span threshold σ = 5 with coverage threshold κ = 0.01 are kept.

  21. Experiment. (cont)

  22. Experiment. (cont)

  23. Experiment. (cont) • Comparison with PLSA (document-based clustering)

  24. Experiment. (cont) • Mutual reinforcement: noisy words such as "APEC", "economic."

  25. Conclusion.
