120 likes | 263 Vues
This report explores advancements in text segmentation and topic detection, focusing on HMM-based segmentation methods. It examines the definition of tasks in the TDT final report, detailing the segmentation, detection, and tracking tasks. The study utilizes extensive training and testing data from broadcast news and applies various clustering techniques. Evaluation methods and experiment results are discussed, alongside suggestions for future improvements. This comprehensive overview equips researchers and practitioners in computational linguistics with essential insights into these vital areas.
E N D
Draft of segmentation Haei-Ming Chu 2004/08/09
Reference • Topic Detection and Tracking Pilot Study Final ReportJames Allan, Jaime Carbonelly, George Doddingtonz, Jonathan Yamronx, and Yiming YangyUMass Amherst, yCMU, zDARPA, xDragon Systems, and yCMU1997 • A Critique and Improvement of an Evaluation Metric for Text SegmentationLev Pevzne Marti A. HearstyHarvard University University of California, BerkeleyComputational Linguistics 2002 p.19~p.36 • Chinese Spoken Document Segmentation withConsideration of Features, Language Models and ExtraInformation - Examples using Broadcast NewNTU2003-陳家甫
IntroductionThe tasks definition in the TDT final report • The Segmentation Task • Segmenting a continuous stream of text into its constituent stories • The Detection Task • Retrospective Event Detection • Identify all of the events in a corpus of stories • On-line New event Detection • The Tracking Task • Associating incoming stories with events known to the system
IntroductionSegmentation • Automatically dividing a text stream into topically homogeneous blocks • Segmentation is therefore an “enabling” technology for other applications. • Segmenter which is designed for this task will find story boundaries
IntroductionCorpus • Training data • 40 hours公視新聞 人工轉寫文字檔 • 採用 “,”及 “。”做為斷句依據 • 只取長度大於 4 及小於 40 的句子 • 共23945句 • Testing data • 80 hours公視新聞人工轉寫文字檔 • 共58090句
IntroductionEvaluation • Window difference • b(i,j) represents the number of boundaries between positions i and j in the text • N represent the number of sentences in the text • Slightly affected by variation of segment size distribution • With less penalty in Near-Miss Error
MethodHMM based segmentation • Based on the Dragon approach
MethodHMM based segmentation • Training topic language model • Clustering the training data to 10 clusters (topics) • Using K-means Clustering with cosine measure and 20 iterations • Get the character unigram of each topics • Feature Selection • 中文詞庫-character count >9 and <1000 • total 2872 characters
MethodHMM based segmentation • Testing • Using Viterbi algorithm to find the maximum topic sequences • Using LogAdd() avoid the under flow • Set the unknown character probability in the sentence in testing data LSMALL (-0.5E10)
Future Wordother models • K-means algorithm modification • Language model smoothing • Another baseline : sentence similarity • PLSI model • Relevance model or Translation model • Survey method used in the segmentation or tracking