Hierarchical Topic Detection UMass - TDT 2004


Presentation Transcript


  1. Hierarchical Topic Detection UMass - TDT 2004 Ao Feng James Allan Center for Intelligent Information Retrieval University of Massachusetts Amherst

  2. Task this year • 4 times the size of TDT4 (407,503 stories in three languages) • Many clustering algorithms are not feasible: any algorithm of complexity Ω(n²) would take too long (a quick calculation follows) • Time limit: one month • This year is a pilot study • We need a simple algorithm that can run to completion in a short time
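To make the scale concrete, here is a quick back-of-the-envelope check (a sketch we added, not from the slides) counting the story pairs an exhaustive O(n²) pass would have to score:

```python
# An O(n^2) clustering pass must score every unordered story pair
# at least once, which is hopeless at this corpus size.
n = 407_503                    # stories in the TDT 2004 corpus
pairs = n * (n - 1) // 2       # unordered story pairs
print(f"{pairs:.2e} pairwise similarity computations")   # ~8.30e+10
```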

  3. HTD system of UMass • Two-step clustering • Step 1 – k-NN linking • Step 2 – agglomerative clustering • [Flowchart: each incoming story is tested with "similarity > threshold?" – √ links it to an existing cluster, × starts a new one]

  4. Step 1 – event threading • Why event threading? • Event: something that happens at a specific time and location • An event contains multiple stories • Each topic is composed of one or more related events • Events have temporal locality • What we do (a sketch follows) • Each story is compared to only a limited number of previous stories • For simplicity, events do not overlap (a knowingly false assumption)
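Below is a minimal Python sketch of this step. The function name and exact linking rule are our assumptions; the slides only specify that each story is scored against a limited number of recent stories and joins an event when the best similarity exceeds a threshold (the 0.3 threshold and 120-story window come from the baseline run on slide 6):

```python
from collections import deque

def thread_events(stories, sim, threshold=0.3, lookback=120):
    """Event-threading sketch exploiting temporal locality: each story
    (assumed to arrive in time order) is scored only against the last
    `lookback` stories, giving O(n * lookback) rather than O(n^2)."""
    recent = deque(maxlen=lookback)   # (story, its event) pairs in the window
    events = []
    for story in stories:
        scored = [(sim(story, s), ev) for s, ev in recent]
        best_sim, best_ev = max(scored, key=lambda t: t[0],
                                default=(0.0, None))
        if best_ev is not None and best_sim > threshold:
            best_ev.append(story)     # similarity > threshold: join the event
        else:
            best_ev = [story]         # otherwise start a new event;
            events.append(best_ev)    # events never overlap
        recent.append((story, best_ev))
    return events
```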

  5. Step 2 – agglomerative clustering • Agglomerative clustering has complexity Ω(n²), so modification is required • Online clustering algorithm with a limited window size (sketched below) • Merge until one third of the window remains • The first (oldest) half of the surviving clusters is removed and new events come in • Clusters do not overlap • Assumption: stories from the same source are more likely to belong to the same topic • Clusters from the same source are merged first • Then clusters in the same language • Finally across all languages
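A minimal sketch of the windowed pass, assuming events arrive in time order and that `sim` scores two clusters (e.g. cosine between centroids); the source-then-language merge ordering is left out for brevity:

```python
def online_agglomerate(events, sim, window=120):
    """Windowed agglomerative clustering sketch: hold at most `window`
    clusters, greedily merge the closest pair until one third remain,
    retire the older half, then refill the window with new events."""
    stream = iter(events)
    clusters, finished = [], []
    exhausted = False
    while True:
        # refill the window with incoming events
        while not exhausted and len(clusters) < window:
            ev = next(stream, None)
            if ev is None:
                exhausted = True
            else:
                clusters.append(ev)
        # merge the closest pair until one third of the window remains
        while len(clusters) > max(1, window // 3):
            i, j = max(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
            clusters[i] = clusters[i] + clusters[j]   # clusters never overlap
            del clusters[j]
        if exhausted:
            finished.extend(clusters)
            return finished
        # retire the first (oldest) half so new events can come in
        half = len(clusters) // 2
        finished.extend(clusters[:half])
        del clusters[:half]
```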

  6. Official runs • We submitted 3 runs for each condition • UMASSv1 (UMass3): baseline run • tf-idf term weighting • Cosine similarity (both sketched below) • Threshold = 0.3 • Window size = 120 • UMASSv12 (UMass2): smaller clusters get higher priority in agglomerative clustering • UMASSv19 (UMass1): same as UMASSv12 but with double the window size
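For reference, a self-contained sketch of the baseline's similarity computation; the exact tf-idf variant is not given on the slide, so raw term frequency times log inverse document frequency is an assumption here:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (tf * log(N/df), one assumed variant)."""
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [s.split() for s in ("oil price rises", "oil price falls", "election results")]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # > 0: shared "oil price" terms
print(cosine(vecs[0], vecs[2]))   # 0.0: no term overlap
```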

  7. Evaluation results

  8. Our results are not good – why? • Online clustering algorithm • Reduces complexity • But stories far apart in time can never share a cluster, and the time-locality assumption does not hold for topics • Non-overlapping clusters • Increase the miss rate • The correct granularity is missed and hard to recover • The UMass HTD system is reasonably quick but ineffective • About one day per run

  9. What did TNO do? • TNO vs. UMass: roughly 1/8 of our detection cost at a similar travel cost. How? • Four steps (a rough sketch follows) • Build the similarity matrix for a sample of 20,000 stories • Run agglomerative clustering to build a binary tree • Simplify the tree to reduce travel cost • For each story not in the sample, find the 10 closest sampled stories and add it to all of the relevant clusters
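A rough Python sketch of that pipeline under stated assumptions: scikit-learn's TfidfVectorizer and SciPy's average-link `linkage` stand in for TNO's actual components (which the slide does not name), and the tree-simplification step (step 3) is omitted:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tno_style_htd(sample_texts, rest_texts, knn=10, sample_cap=20_000):
    """Sketch of the four-step TNO pipeline described on this slide."""
    sample_texts = sample_texts[:sample_cap]
    vec = TfidfVectorizer()
    xs = vec.fit_transform(sample_texts)
    # Steps 1-2: similarity matrix over the sample, then a binary tree
    # (SciPy's linkage wants a condensed distance vector, hence 1 - sim).
    sims = cosine_similarity(xs)
    condensed = 1.0 - sims[np.triu_indices(len(sample_texts), k=1)]
    tree = linkage(condensed, method="average")
    # Step 4: attach each out-of-sample story to the clusters of its
    # knn nearest sampled stories (here we just return the neighbours).
    xr = vec.transform(rest_texts)
    neighbours = np.argsort(-cosine_similarity(xr, xs), axis=1)[:, :knn]
    return tree, neighbours
```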

  10. Why is TNO successful? • To cope with the corpus size, TNO clustered a 20,000-document sample • The clustering tree is binary, which preserves the largest possible number of granularities • A branching factor of 2 or 3 reduces travel cost • Each story can be assigned to up to 10 clusters • This greatly increases the probability of finding a perfect or nearly perfect cluster

  11. Detection cost • Overlapping clusters • According to TNO's observation, adding a story to several clusters decreases the miss rate significantly • Branching factor • A smaller branching factor keeps more of the possible granularities; in our experiment, a limited branching factor improved performance • Similarity function • There is no evidence that different similarity functions make a large difference • Time locality • Our experiments contradict the assumption: a larger window size gives better results

  12. Travel cost • With the current parameter setting, a smaller branching factor is preferred (optimal value 3) • Comparison of travel cost:

     Site    eng,nat   mul,eng
     ICT     0.0767    0.0934
     CUHK    0.0554    0.0481
     UMass   0.0030    0.0063
     TNO     0.0040    0.0027

  • Reason: branching factors • The current normalization factor is very large, so the normalized travel cost is negligible in comparison to the detection cost

  13. Toy example • Most topics are small: only 20 (8%) have more than 100 stories • Generate all possible clusters of size 1 to 100 and put them in a binary tree • The detection cost for 92% of the topics is 0! • Adding the empty cluster and the whole set, the remaining 8% cost at most 1 each • The travel cost is […], so the combined cost is […] – comparable to most participants! • With careful arrangement of the binary tree, this can easily be improved

  14. What is wrong? • The idea of the travel cost is to prevent cheating experiments like the power set • The normalized travel cost and the detection cost should be comparable • With the current parameter setting, a small branching factor reduces both travel cost and detection cost • Suggested modifications (a toy model follows) • Use a smaller normalization factor, like the old one – the travel cost of the optimal hierarchy • If the normalized travel cost is too large, give it a smaller weight • Increase C_TITLE and decrease C_BRANCH so that the optimal branching factor is larger (5-10?) • Consider other evaluation algorithms, such as expected travel cost (still too expensive; it needs an approximation algorithm)
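To make the branching-factor argument concrete, here is a toy cost model, which is our assumption and not the official TDT 2004 travel-cost formula: reaching a leaf in a balanced hierarchy of N clusters with branching factor b takes log_b(N) steps, each reading b child titles at C_TITLE apiece and following one branch at C_BRANCH. Scanning b shows that the C_TITLE : C_BRANCH ratio alone sets the optimal branching factor, which is why retuning these constants changes which trees score well:

```python
import math

def travel_cost(b, n_clusters, c_title, c_branch):
    """Toy travel-cost model (an assumption): log_b(n) levels, each
    reading b child titles and following one branch."""
    depth = math.log(n_clusters) / math.log(b)
    return depth * (b * c_title + c_branch)

n = 100_000
for c_title, c_branch in ((1.0, 0.5), (1.0, 10.0)):
    best = min(range(2, 21),
               key=lambda b: travel_cost(b, n, c_title, c_branch))
    print(f"c_title={c_title}, c_branch={c_branch}: optimal b = {best}")
# prints b = 3 for the first setting and b = 9 for the second
```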

  15. Summary • This year's evaluation shows that overlapping clusters and a small branching factor give better results • The current normalization scheme for travel cost does not work well • Some modification is needed • New evaluation methods? • Reference: Allan, J., Feng, A., and Bolivar, A. Flexible Intrinsic Evaluation of Hierarchical Clustering for TDT. In Proceedings of CIKM 2003, pp. 263-270.
