Hierarchical Topic Detection UMass - TDT 2004


Presentation Transcript


  1. Hierarchical Topic Detection UMass - TDT 2004 Ao Feng James Allan Center for Intelligent Information Retrieval University of Massachusetts Amherst

  2. Task this year • 4 times the size of TDT4 (407,503 stories in three languages) • Many clustering algorithms are not feasible: any algorithm of complexity Ω(n²) would take too long (a quick calculation follows) • Time limit: one month • This year is a pilot study • We need a simple algorithm that can run to completion in a short time
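To make the scale concrete, here is a quick back-of-the-envelope check (a sketch we added, not from the slides) counting the story pairs an exhaustive O(n²) pass would have to score:

```python
# An O(n^2) clustering pass must score every unordered story pair
# at least once, which is hopeless at this corpus size.
n = 407_503                    # stories in the TDT 2004 corpus
pairs = n * (n - 1) // 2       # unordered story pairs
print(f"{pairs:.2e} pairwise similarity computations")   # ~8.30e+10
```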

  3. HTD system of UMass • Two-step clustering • Step 1 – k-NN linking • Step 2 – agglomerative clustering • [Flowchart: each incoming story is tested with "similarity > threshold?" – √ links it to an existing cluster, × starts a new one]

  4. Step 1 – event threading • Why event threading? • Event: something that happens at a specific time and location • An event contains multiple stories • Each topic is composed of one or more related events • Events have temporal locality • What we do (a sketch follows) • Each story is compared to only a limited number of previous stories • For simplicity, events do not overlap (a knowingly false assumption)
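Below is a minimal Python sketch of this step. The function name and exact linking rule are our assumptions; the slides only specify that each story is scored against a limited number of recent stories and joins an event when the best similarity exceeds a threshold (the 0.3 threshold and 120-story window come from the baseline run on slide 6):

```python
from collections import deque

def thread_events(stories, sim, threshold=0.3, lookback=120):
    """Event-threading sketch exploiting temporal locality: each story
    (assumed to arrive in time order) is scored only against the last
    `lookback` stories, giving O(n * lookback) rather than O(n^2)."""
    recent = deque(maxlen=lookback)   # (story, its event) pairs in the window
    events = []
    for story in stories:
        scored = [(sim(story, s), ev) for s, ev in recent]
        best_sim, best_ev = max(scored, key=lambda t: t[0],
                                default=(0.0, None))
        if best_ev is not None and best_sim > threshold:
            best_ev.append(story)     # similarity > threshold: join the event
        else:
            best_ev = [story]         # otherwise start a new event;
            events.append(best_ev)    # events never overlap
        recent.append((story, best_ev))
    return events
```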

  5. Step 2 – agglomerative clustering • Agglomerative clustering has complexity Ω(n²), so modification is required • Online clustering algorithm with a limited window size (sketched below) • Merge until one third of the window remains • The first (oldest) half of the surviving clusters is removed and new events come in • Clusters do not overlap • Assumption: stories from the same source are more likely to belong to the same topic • Clusters from the same source are merged first • Then clusters in the same language • Finally across all languages
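A minimal sketch of the windowed pass, assuming events arrive in time order and that `sim` scores two clusters (e.g. cosine between centroids); the source-then-language merge ordering is left out for brevity:

```python
def online_agglomerate(events, sim, window=120):
    """Windowed agglomerative clustering sketch: hold at most `window`
    clusters, greedily merge the closest pair until one third remain,
    retire the older half, then refill the window with new events."""
    stream = iter(events)
    clusters, finished = [], []
    exhausted = False
    while True:
        # refill the window with incoming events
        while not exhausted and len(clusters) < window:
            ev = next(stream, None)
            if ev is None:
                exhausted = True
            else:
                clusters.append(ev)
        # merge the closest pair until one third of the window remains
        while len(clusters) > max(1, window // 3):
            i, j = max(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
            clusters[i] = clusters[i] + clusters[j]   # clusters never overlap
            del clusters[j]
        if exhausted:
            finished.extend(clusters)
            return finished
        # retire the first (oldest) half so new events can come in
        half = len(clusters) // 2
        finished.extend(clusters[:half])
        del clusters[:half]
```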

  6. Official runs • We submitted 3 runs for each condition • UMASSv1 (UMass3): baseline run • tf-idf term weighting • Cosine similarity (both sketched below) • Threshold = 0.3 • Window size = 120 • UMASSv12 (UMass2): smaller clusters get higher priority in agglomerative clustering • UMASSv19 (UMass1): same as UMASSv12 but with double the window size
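For reference, a self-contained sketch of the baseline's similarity computation; the exact tf-idf variant is not given on the slide, so raw term frequency times log inverse document frequency is an assumption here:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (tf * log(N/df), one assumed variant)."""
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [s.split() for s in ("oil price rises", "oil price falls", "election results")]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # > 0: shared "oil price" terms
print(cosine(vecs[0], vecs[2]))   # 0.0: no term overlap
```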

  7. Evaluation results

  8. Our results are not good – why? • Online clustering algorithm • Reduces complexity • But stories far apart in time can never share a cluster, and the time-locality assumption does not hold for topics • Non-overlapping clusters • Increase the miss rate • The correct granularity is missed and hard to recover • The UMass HTD system is reasonably quick but ineffective • About one day per run

  9. What did TNO do? • TNO vs. UMass: roughly 1/8 of our detection cost at a similar travel cost. How? • Four steps (a rough sketch follows) • Build the similarity matrix for a sample of 20,000 stories • Run agglomerative clustering to build a binary tree • Simplify the tree to reduce travel cost • For each story not in the sample, find the 10 closest sampled stories and add it to all of the relevant clusters
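A rough Python sketch of that pipeline under stated assumptions: scikit-learn's TfidfVectorizer and SciPy's average-link `linkage` stand in for TNO's actual components (which the slide does not name), and the tree-simplification step (step 3) is omitted:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tno_style_htd(sample_texts, rest_texts, knn=10, sample_cap=20_000):
    """Sketch of the four-step TNO pipeline described on this slide."""
    sample_texts = sample_texts[:sample_cap]
    vec = TfidfVectorizer()
    xs = vec.fit_transform(sample_texts)
    # Steps 1-2: similarity matrix over the sample, then a binary tree
    # (SciPy's linkage wants a condensed distance vector, hence 1 - sim).
    sims = cosine_similarity(xs)
    condensed = 1.0 - sims[np.triu_indices(len(sample_texts), k=1)]
    tree = linkage(condensed, method="average")
    # Step 4: attach each out-of-sample story to the clusters of its
    # knn nearest sampled stories (here we just return the neighbours).
    xr = vec.transform(rest_texts)
    neighbours = np.argsort(-cosine_similarity(xr, xs), axis=1)[:, :knn]
    return tree, neighbours
```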

  10. Why is TNO successful? • To cope with the corpus size, TNO clustered a 20,000-document sample • The clustering tree is binary, which preserves the largest possible number of granularities • A branching factor of 2 or 3 reduces travel cost • Each story can be assigned to up to 10 clusters • This greatly increases the probability of finding a perfect or nearly perfect cluster

  11. Detection cost • Overlapping clusters • According to TNO's observation, adding a story to several clusters decreases the miss rate significantly • Branching factor • A smaller branching factor keeps more of the possible granularities; in our experiment, a limited branching factor improved performance • Similarity function • There is no evidence that different similarity functions make a large difference • Time locality • Our experiments contradict the assumption: a larger window size gives better results

  12. Travel cost • With the current parameter setting, a smaller branching factor is preferred (optimal value 3) • Comparison of travel cost:

     Site    eng,nat   mul,eng
     ICT     0.0767    0.0934
     CUHK    0.0554    0.0481
     UMass   0.0030    0.0063
     TNO     0.0040    0.0027

  • Reason: branching factors • The current normalization factor is very large, so the normalized travel cost is negligible in comparison to the detection cost

  13. Toy example • Most topics are small: only 20 (8%) have more than 100 stories • Generate all possible clusters of size 1 to 100 and put them in a binary tree • The detection cost for 92% of the topics is 0! • Adding the empty cluster and the whole set, the remaining 8% cost at most 1 each • The travel cost is […], so the combined cost is […] – comparable to most participants! • With careful arrangement of the binary tree, this can easily be improved

  14. What is wrong? • The idea of the travel cost is to prevent cheating experiments like the power set • The normalized travel cost and the detection cost should be comparable • With the current parameter setting, a small branching factor reduces both travel cost and detection cost • Suggested modifications (a toy model follows) • Use a smaller normalization factor, like the old one – the travel cost of the optimal hierarchy • If the normalized travel cost is too large, give it a smaller weight • Increase C_TITLE and decrease C_BRANCH so that the optimal branching factor is larger (5-10?) • Consider other evaluation algorithms, such as expected travel cost (still too expensive; it needs an approximation algorithm)
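To make the branching-factor argument concrete, here is a toy cost model, which is our assumption and not the official TDT 2004 travel-cost formula: reaching a leaf in a balanced hierarchy of N clusters with branching factor b takes log_b(N) steps, each reading b child titles at C_TITLE apiece and following one branch at C_BRANCH. Scanning b shows that the C_TITLE : C_BRANCH ratio alone sets the optimal branching factor, which is why retuning these constants changes which trees score well:

```python
import math

def travel_cost(b, n_clusters, c_title, c_branch):
    """Toy travel-cost model (an assumption): log_b(n) levels, each
    reading b child titles and following one branch."""
    depth = math.log(n_clusters) / math.log(b)
    return depth * (b * c_title + c_branch)

n = 100_000
for c_title, c_branch in ((1.0, 0.5), (1.0, 10.0)):
    best = min(range(2, 21),
               key=lambda b: travel_cost(b, n, c_title, c_branch))
    print(f"c_title={c_title}, c_branch={c_branch}: optimal b = {best}")
# prints b = 3 for the first setting and b = 9 for the second
```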

  15. Summary • This year's evaluation shows that overlapping clusters and a small branching factor give better results • The current normalization scheme for travel cost does not work well • Some modification is needed • New evaluation methods? • Reference: Allan, J., Feng, A., and Bolivar, A. Flexible Intrinsic Evaluation of Hierarchical Clustering for TDT. In Proceedings of CIKM 2003, pp. 263-270.
