Overview of the TDT 2001 Evaluation and Results

Jonathan Fiscus
Gaithersburg Holiday Inn, Gaithersburg, Maryland
November 12-13, 2001

Outline
• TDT Evaluation Overview
• 2001 TDT Evaluation Result Summaries
  • First Story Detection (FSD)
  • Topic Detection
  • Topic Tracking
  • Link Detection
• Other Investigations
www.nist.gov/TDT

TDT 101: “Applications for organizing text”
Terabytes of unorganized data
• 5 TDT Applications
  • Story Segmentation
  • Topic Tracking
  • Topic Detection
  • First Story Detection
  • Link Detection

TDT’s Research Domain
• Technology challenge
  • Develop applications that organize and locate relevant stories from a continuous feed of news stories
• Research driven by evaluation tasks
• Composite applications built from
  • Automatic Speech Recognition
  • Story Segmentation
  • Document Retrieval

Definitions
A topic is … a seminal event or activity, along with all directly related events and activities.
A story is … a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.

Example Topic
Title: Mountain Hikers Lost
• WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January.
• WHERE: Orres, France
• WHEN: January 1998
• RULES OF INTERPRETATION:
  • Rule #5. Accidents

TDT 2001 Evaluation Corpus
• TDT3 + Supplemental Corpora used for the evaluation*†
  • TDT3 Corpus
    • Third consecutive use for evaluations
    • XXX stories, 4th Qtr. 1998
    • Used for Tracking and Link Detection development test
  • Supplement of 35K stories added to TDT3
    • No annotations
    • Data added from both 3rd and 4th Qtr. 1998
    • Not used for FSD tests
• LDC Annotations †
  • 120 fully annotated topics: divided into published and withheld sets
  • 120 partially annotated topics
  • FSD used all 240 topics
  • Topic Detection used the 120 fully annotated topics
  • Tracking and Link Detection used the 60 fully annotated withheld topics
* see www.nist.gov/speech/tests/tdt/tdt2001 for details
† see www.ldc.upenn.edu/Projects/TDT3/ for details

TDT3 Topic Division
• Two topic sets:
  • Published topics
  • Withheld topics
• Selection criteria:
  • 60 topics per set
    • 30 of the 1999 topics
    • 30 of the 2000 topics
  • Balanced by number of on-topic stories
[Figure label: TDT 2000 Systems]

TDT Evaluation Methodology
• Evaluation tasks are cast as detection tasks:
  • YES there is a target, or NO there is not
• Performance is measured in terms of detection cost: “a weighted sum of missed detection and false alarm probabilities”
  C_Det = C_Miss · P_Miss · P_target + C_FA · P_FA · (1 − P_target)
  • C_Miss = 1 and C_FA = 0.1 are preset costs
  • P_target = 0.02 is the a priori probability of a target
• Detection cost is normalized to generally lie between 0 and 1:
  (C_Det)_Norm = C_Det / min{C_Miss · P_target, C_FA · (1 − P_target)}
  • When based on the YES/NO decisions, it is referred to as the actual decision cost
• Detection Error Tradeoff (DET) curves graphically depict the performance tradeoff between P_Miss and P_FA
  • Makes use of likelihood scores attached to the YES/NO decisions
  • Minimum DET point is the best score a system could achieve with proper thresholds
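The cost formulas on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the NIST scoring software: the function names are hypothetical, and the threshold sweep pools all trials into one set, which is a simplifying assumption about how the minimum DET point is found.

```python
# Preset values quoted on the slide.
C_MISS = 1.0
C_FA = 0.1
P_TARGET = 0.02


def detection_cost(p_miss, p_fa, c_miss=C_MISS, c_fa=C_FA, p_target=P_TARGET):
    """C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def normalized_cost(p_miss, p_fa, c_miss=C_MISS, c_fa=C_FA, p_target=P_TARGET):
    """Divide C_Det by the cost of the cheaper trivial system (always NO or always YES)."""
    floor = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, c_miss, c_fa, p_target) / floor


def min_det_cost(scores, is_target):
    """Lowest normalized cost over all thresholds (hypothetical helper).

    scores: likelihood scores, higher means "more likely a target".
    is_target: reference booleans, aligned with scores.
    Assumption: all trials are pooled into a single sweep.
    """
    n_target = sum(1 for t in is_target if t)
    n_non = len(is_target) - n_target
    if n_target == 0 or n_non == 0:
        raise ValueError("need at least one target and one non-target trial")
    # Threshold above every score: the system says NO to everything.
    best = normalized_cost(1.0, 0.0)
    for threshold in sorted(set(scores)):
        # Decision rule for this sweep: YES when score >= threshold.
        misses = sum(1 for s, t in zip(scores, is_target) if t and s < threshold)
        false_alarms = sum(1 for s, t in zip(scores, is_target) if not t and s >= threshold)
        best = min(best, normalized_cost(misses / n_target, false_alarms / n_non))
    return best
```

With the preset values, a system that always answers NO (P_Miss = 1, P_FA = 0) gets a normalized cost of 1, while one that always answers YES gets C_FA · (1 − P_target) / (C_Miss · P_target) = 4.9, which is why the normalized cost only "generally" lies between 0 and 1.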

TDT: Experimental Control
• Good research requires experimental controls
• Conditions that affect performance in TDT
  • Newswire vs. Broadcast News
  • Manual vs. automatic transcription of Broadcast News
  • Manual vs. automatic story segmentation
  • Mono vs. multilingual language material
  • Topic training amounts and languages
  • Default automatic English translations of Mandarin vs. native Mandarin orthography
  • Decision deferral periods

First Story Detection Results
System Goal:
• To detect the first story that discusses each topic
• Evaluating “part” of a Topic Detection system, i.e., when to start a new cluster
[Figure legend: First Stories on two topics (Topic 1, Topic 2); Not First Stories]

2001 TDT Primary FSD Results
Newswire + BNews ASR, English texts, automatic story boundaries, 10 File Deferral

TDT Topic Detection Task
System Goal:
• To detect topics in terms of the (clusters of) stories that discuss them.
  • “Unsupervised” topic training
  • New topics must be detected as the incoming stories are processed.
  • Input stories are then associated with one of the topics.
[Figure labels: Story Stream; Topic 1; Topic 2]

Primary Topic Detection Sys.
Newswire+BNasr, Multilingual, Auto Boundaries, Deferral=10
[Figure labels: Native Mandarin; Translated Mandarin]

Topic Tracking Task
System Goal:
• To detect stories that discuss the target topic, in multiple source streams.
  • Supervised Training
    • Given Nt sample stories that discuss a given target topic
  • Testing
    • Find all subsequent stories that discuss the target topic
[Figure labels: training data (on-topic, unknown); test data (unknown)]

Primary Tracking Results
Newswire+BNman, English Training: 1 Positive-0 Negative

TDT Link Detection Task
System Goal:
• To detect whether a pair of stories discuss the same topic. (Can be thought of as a “primitive operator” to build a variety of applications)

Primary Link Det. Results
Newswire+BNasr, Deferral=10
• NTU’s thresholding is unusual
[Figure labels: Native Mandarin; Translated Mandarin]

Primary Topic Detection Sys.
Newswire+BNasr, Multilingual, Auto Boundaries, Deferral=10

Topic Detection: False Alarm Visualization
• Systems behave very differently
• IMHO a user would not like to use a high FA rate system
• Perhaps false alarms should get more weight in the cost function
• Outer Circle: Number of stories in a cluster
  • Light => cluster was mapped to a reference topic
  • Blue => unmapped cluster
• Inner Circle: Number of on-topic stories
[Figure panels: UMass1; TNO1-late. Axes: Topic ID; System clusters, ordered by size]

Topic Detection: 2000 vs. 2001 Index Files
Multilingual Text, Newswire + Broadcast News, Auto Boundaries, Deferral=10
• The 2000 test corpus covered 3 months
• The 2001 corpus covered 6 months
  • 35K more stories
  • Might affect performance, BUT appears not to.

Topic Detection Evaluation via a Link-Style Metric
• Motivation:
  • There is instability of measured performance during system tuning
    • Likely to be a direct result of the need to map reference topic clusters to system-defined clusters
  • We would like to avoid the assumption of independent topics

Topic Detection Evaluation via a Link-Style Metric
• Evaluation Criterion: “Does this pair of stories discuss the same topic?”
  • If a story pair is on the same topic
    • A missed detection is declared if the system put the stories in different clusters
    • Otherwise, it’s a correct detection
  • If a pair of stories is not on the same topic
    • A false alarm is declared if the system put the stories in the same cluster
    • Otherwise, it’s a correct non-detection
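The pairwise criterion above can be sketched as follows. This is a minimal illustration, not the scoring code used in the evaluation: the function name and the dictionary inputs are hypothetical, and it assumes every scored story carries exactly one reference topic label and one system cluster id.

```python
from itertools import combinations


def link_style_error_rates(reference_topic, system_cluster):
    """Score a clustering by asking, for every story pair, whether the
    system agrees with the reference about "same topic or not".

    reference_topic: dict mapping story id -> reference topic label
    system_cluster:  dict mapping story id -> system cluster id
    Returns (p_miss, p_fa). Both inputs are hypothetical structures.
    """
    misses = same_topic_pairs = 0
    false_alarms = diff_topic_pairs = 0
    for a, b in combinations(sorted(reference_topic), 2):
        same_topic = reference_topic[a] == reference_topic[b]
        same_cluster = system_cluster[a] == system_cluster[b]
        if same_topic:
            same_topic_pairs += 1
            if not same_cluster:
                misses += 1  # on-topic pair split across clusters
        else:
            diff_topic_pairs += 1
            if same_cluster:
                false_alarms += 1  # off-topic pair merged into one cluster
    p_miss = misses / same_topic_pairs if same_topic_pairs else 0.0
    p_fa = false_alarms / diff_topic_pairs if diff_topic_pairs else 0.0
    return p_miss, p_fa


# Usage: four stories, two reference topics, one story clustered wrongly.
reference = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
system = {"s1": 1, "s2": 1, "s3": 1, "s4": 2}
p_miss, p_fa = link_style_error_rates(reference, system)
```

Because every pair is scored directly, no mapping from reference topics to system clusters is needed. The resulting P_Miss and P_FA can then be combined with the same detection cost used for the other tasks.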

Link-Based vs. Topic Detection Metrics: Parameter Optimization Sweep
• The link curve is less erratic for System 1
• Link curve is higher: What does this mean?
[Figure panels: System 1: 62K Test Stories, 98 Topics; System 2: 27K Test Stories, 31 Topics]

What can be learned?
• Are all the experimental controls necessary?
• Tracking performance degrades 50% going from manual to automatic transcription, and an additional 50% going to automatic boundaries
• Cross-language issues still not solved
• Most systems used only the required deferral period
• Progress was modest: did the lack of a new evaluation corpus impede research?

Summary
• TDT Evaluation Overview
• 2001 TDT Evaluation Results
• Evaluating Topic Detection with the Link-based metric is feasible, but inconclusive
• The TDT3 corpus annotations are now public!
