1 / 12

Draft of segmentation

Draft of segmentation. Haei-Ming Chu 2004/08/09. Reference. Topic Detection and Tracking Pilot Study Final Report James Allan, Jaime Carbonell y , George Doddington z , Jonathan Yamron x , and Yiming Yang y UMass Amherst, yCMU, zDARPA, xDragon Systems, and yCMU 1997

takoda
Télécharger la présentation

Draft of segmentation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Draft of segmentation Haei-Ming Chu 2004/08/09

  2. Reference • Topic Detection and Tracking Pilot Study Final ReportJames Allan, Jaime Carbonelly, George Doddingtonz, Jonathan Yamronx, and Yiming YangyUMass Amherst, yCMU, zDARPA, xDragon Systems, and yCMU1997 • A Critique and Improvement of an Evaluation Metric for Text SegmentationLev Pevzne Marti A. HearstyHarvard University University of California, BerkeleyComputational Linguistics 2002 p.19~p.36 • Chinese Spoken Document Segmentation withConsideration of Features, Language Models and ExtraInformation - Examples using Broadcast NewNTU2003-陳家甫

  3. IntroductionThe tasks definition in the TDT final report • The Segmentation Task • Segmenting a continuous stream of text into its constituent stories • The Detection Task • Retrospective Event Detection • Identify all of the events in a corpus of stories • On-line New event Detection • The Tracking Task • Associating incoming stories with events known to the system

  4. IntroductionSegmentation • Automatically dividing a text stream into topically homogeneous blocks • Segmentation is therefore an “enabling” technology for other applications. • Segmenter which is designed for this task will find story boundaries

  5. IntroductionCorpus • Training data • 40 hours公視新聞 人工轉寫文字檔 • 採用 “,”及 “。”做為斷句依據 • 只取長度大於 4 及小於 40 的句子 • 共23945句 • Testing data • 80 hours公視新聞人工轉寫文字檔 • 共58090句

  6. IntroductionEvaluation • Window difference • b(i,j) represents the number of boundaries between positions i and j in the text • N represent the number of sentences in the text • Slightly affected by variation of segment size distribution • With less penalty in Near-Miss Error

  7. MethodHMM based segmentation • Based on the Dragon approach

  8. MethodHMM based segmentation

  9. MethodHMM based segmentation • Training topic language model • Clustering the training data to 10 clusters (topics) • Using K-means Clustering with cosine measure and 20 iterations • Get the character unigram of each topics • Feature Selection • 中文詞庫-character count >9 and <1000 • total 2872 characters

  10. MethodHMM based segmentation • Testing • Using Viterbi algorithm to find the maximum topic sequences • Using LogAdd() avoid the under flow • Set the unknown character probability in the sentence in testing data LSMALL (-0.5E10)

  11. Experiment ResultHMM based segmentation

  12. Future Wordother models • K-means algorithm modification • Language model smoothing • Another baseline : sentence similarity • PLSI model • Relevance model or Translation model • Survey method used in the segmentation or tracking

More Related