Topic Detection and Tracking

Lexical Chains for Topic Detection and Tracking British Classification Society Feb 23rd 2001 Joe Carthy & Nicola Stokes University College Dublin joe.carthy@ucd.ie nicola.stokes@ucd.ie http://www.cs.ucd.ie/staff/jcarthy Tel. +353 1 706 2481 or 706 2469 Fax. +353 1 269 7262.

Presentation Transcript


  1. Lexical Chains for Topic Detection and Tracking • British Classification Society, Feb 23rd 2001 • Joe Carthy & Nicola Stokes, University College Dublin • joe.carthy@ucd.ie, nicola.stokes@ucd.ie • http://www.cs.ucd.ie/staff/jcarthy • Tel. +353 1 706 2481 or 706 2469 • Fax. +353 1 269 7262

  2. Topic Detection and Tracking • Topic Detection and Tracking (TDT) • DARPA-funded TDT project with UMass, CMU and Dragon Systems • Domain is all broadcast news: written and spoken • TDT includes: • First Story Detection • Event Tracking • Segmentation • Applications: • digital news editors • media analysts • equity traders

  3. Topic Tracking and Detection • Tracking may be defined as: • Take a corpus of news stories • Given 1 (or 2, 4, 8, 16) sample stories about an event • Find all subsequent stories in the corpus about that event • Detection: Is this a new story?
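The tracking and detection tasks above can be sketched in a few lines. This is only an illustration under stated assumptions: the word-set Jaccard similarity, the threshold value and the function names `track` and `is_new_event` are hypothetical and are not taken from the UCD system.

```python
from typing import Callable, List, Set


def word_set(story: str) -> Set[str]:
    """Lower-case a story and split it into a set of words."""
    return set(story.lower().split())


def jaccard(a: str, b: str) -> float:
    """Assumed similarity: shared words divided by all distinct words."""
    sa, sb = word_set(a), word_set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def track(samples: List[str], stream: List[str], threshold: float = 0.2,
          sim: Callable[[str, str], float] = jaccard) -> List[int]:
    """Tracking: indices of later stories similar to any sample story."""
    return [i for i, story in enumerate(stream)
            if any(sim(story, s) >= threshold for s in samples)]


def is_new_event(story: str, previous: List[str], threshold: float = 0.2,
                 sim: Callable[[str, str], float] = jaccard) -> bool:
    """Detection: a story is new if it resembles no previous story."""
    return all(sim(story, p) < threshold for p in previous)


samples = ["kobe earthquake kills hundreds"]
stream = ["rescue teams reach kobe after earthquake", "simpson trial opens"]
print(track(samples, stream))
print(is_new_event(stream[1], samples + stream[:1]))
```

Passing the similarity function as a parameter keeps the loop independent of the document representation, so a keyword-based or chain-based measure could be swapped in.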

  4. Topic Tracking and Detection • An event is defined by a list of stories that discuss the event, e.g. “Kobe earthquake” is defined by the first story that describes this event

  5. UCD TDT Architecture (diagram) • Server • Lexical Chainer • Event Tracker • Event Detector

  6. Topic Detection and Tracking (diagram) • Data stream: incoming story with DATE: 02:36 and TITLE: “O.J. Simpson Bought Knife, Murder Hearing Told” • Previous stories, grouped by event: Carlos the Jackal • NYC Subway Bombings • O.J. Simpson Murder Trial

  7. Benchmark Systems • Implemented benchmark systems using conventional IR techniques: • Stemmed keywords (Porter) • Stopword removal • Term weighting (Robertson, Sparck Jones)
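A keyword benchmark of this kind might look roughly like the sketch below. It is a simplified stand-in, not the benchmark actually built: the stopword list is a tiny hypothetical one, stemming is omitted (a Porter stemmer would be applied inside `tokens`), the weight is a plain tf × log(N/n) rather than the Robertson and Sparck Jones scheme, and cosine similarity is an assumed comparison measure.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}


def tokens(text: str) -> List[str]:
    """Lower-case, keep alphabetic words, drop stopwords (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def idf_weights(stories: List[str]) -> Dict[str, float]:
    """Inverse document frequency log(N / n) for every term in the corpus."""
    df: Counter = Counter()
    for story in stories:
        df.update(set(tokens(story)))
    return {term: math.log(len(stories) / n) for term, n in df.items()}


def vector(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Weighted term vector: term frequency times inverse document frequency."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens(text)).items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


stories = ["Earthquake hits Kobe",
           "Kobe earthquake death toll rises",
           "Simpson murder trial begins"]
idf = idf_weights(stories)
print(cosine(vector(stories[0], idf), vector(stories[1], idf)))
```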

  8. Lexical Chaining • Lexical chains derive from textual cohesion (Halliday & Hasan) • Cohesion: the text makes sense as a whole • Cohesion occurs where the interpretation of one item is dependent on that of another item in the text. It is this dependency that gives rise to cohesion.

  9. Lexical Chaining • Where the cohesive elements occur over a number of sentences, a cohesive chain is formed. • For example, the sentences “John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it.” give rise to the lexical chain: {mud pie, dessert, mud pie, chocolate, it} • Lexical cohesion is, as the name suggests, lexical: it involves the selection of a lexical item that is in some way related to one occurring previously.

  10. Lexical Chaining • Reiteration is a form of lexical cohesion which involves the repetition of a lexical item. This may involve simple repetition of the word but also includes the use of a synonym, near-synonym or superordinate. • For example, in the sentences “John bought a Jag. He loves the car.” a superordinate, car, refers back to a subordinate, Jag. • The part-whole relationship is also an example of lexical cohesion, e.g. airplane and wing. • A lexical chain is a sequence of related words in the text, spanning short or long distances.

  11. Lexical Chaining • A chain is independent of the grammatical structure of the text and in effect it is a list of words that captures a portion of the cohesive structure of the text. • A lexical chain can provide a context for the resolution of an ambiguous term and enable identification of the concept the term represents i.e. word sense disambiguation • Morris and Hirst were the first researchers to suggest the use of lexical chains to determine the structure of texts.

  12. Lexical Chaining • By identifying the lexical chains in a news story we hope to identify the focus of a news story. This can then be used in tracking and detection. • It is important to realise that determining lexical chains is not a sophisticated natural language analysis process. • Other applications of lexical chaining: • Hypertext links: Green • Summarisation: Barzilay • Segmentation: Okumura and Honda • IR: Stairmand, Ellman, Mochizuki • Malapropism detection: St-Onge • Multimedia indexing: Kazman, Al-Halimi

  13. Chain Generation • In order to construct lexical chains we must be able to identify relationships between terms. • This is made possible by the use of WordNet. • WordNet is a computational lexicon which was developed at Princeton University. • In WordNet, synonym sets (synsets) are used to represent concepts, where a synset corresponds to a concept and consists of all those terms that may be used to refer to that concept.

  14. Chain Generation • For example, the concept airplane is represented by the synset {airplane, aeroplane, plane}. • A WordNet synset has a numerical identifier such as 02054514. • Links between synsets in WordNet represent conceptual relations such as synonymy, hyponymy, meronymy (part-of) etc. • The synset identifier can be used to represent the concept referred to in the synset, for indexing and lexical chaining purposes.
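Indexing by synset identifier rather than by surface word can be illustrated with a toy lexicon. The sketch below does not call WordNet itself: `TOY_SYNSETS` is a hypothetical hand-made table, pairing the identifier 02054514 with the airplane synset is only for illustration, and the second identifier is made up.

```python
from collections import Counter
from typing import Dict, Iterable, Set

# Hypothetical stand-in for WordNet: synset identifier -> member terms.
TOY_SYNSETS: Dict[str, Set[str]] = {
    "02054514": {"airplane", "aeroplane", "plane"},  # pairing assumed for illustration
    "00000001": {"wing"},                            # made-up identifier
}


def term_to_synsets(synsets: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert the table so each term maps to the synsets it belongs to."""
    lookup: Dict[str, Set[str]] = {}
    for synset_id, terms in synsets.items():
        for term in terms:
            lookup.setdefault(term, set()).add(synset_id)
    return lookup


def concept_counts(tokens: Iterable[str], lookup: Dict[str, Set[str]]) -> Counter:
    """Count concepts (synset identifiers) instead of surface words."""
    counts: Counter = Counter()
    for token in tokens:
        for synset_id in lookup.get(token.lower(), ()):
            counts[synset_id] += 1
    return counts


lookup = term_to_synsets(TOY_SYNSETS)
print(concept_counts(["Airplane", "plane", "wing", "pilot"], lookup))
```

Here "Airplane" and "plane" are counted under the same identifier, which is the effect concept indexing is meant to achieve; words missing from the table are simply skipped.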

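  15. Word Sense Disambiguation (diagram) • 1st term: CAR, with synsets Automobile (057643) and Railway carriage (324932) • Related synsets shown: Exhaust (32748), Train (3984); relation label: “Has a Part of” • Term i: EXHAUST, with synsets Car_exhaust (32748) and Tire_out, Fatigue (374222)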

  16. Chain Generation • Chaining procedure for a story: • Take the i-th term in the story and generate the set Neighbour_i of its related synsets • For each other term, if it is a member of the set Neighbour_i then add it to the lexical chain for term_i • If the lexical chain contains 3 or more elements then store the chain in a chain index file • Repeat the above for all terms in the story.
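The chaining procedure above can be sketched as follows. Several details are assumptions made for the illustration: the hypothetical `lexicon` dictionary stands in for WordNet relations, terms are compared as plain strings rather than synset identifiers, each chain starts with term_i itself, and chains are returned in a list instead of being written to a chain index file.

```python
from typing import Dict, List, Set


def build_chains(terms: List[str], lexicon: Dict[str, Set[str]],
                 min_length: int = 3) -> List[List[str]]:
    """Build one candidate chain per term and keep those with enough elements."""
    chains: List[List[str]] = []
    for i, term in enumerate(terms):
        # Neighbour_i: everything the lexicon relates to the i-th term.
        neighbour_i = lexicon.get(term, set())
        chain = [term]  # assumption: the chain includes term_i itself
        for j, other in enumerate(terms):
            if j != i and other in neighbour_i:
                chain.append(other)
        if len(chain) >= min_length:
            chains.append(chain)
    return chains


# Hypothetical toy relations, not real WordNet data.
toy_lexicon = {
    "car": {"exhaust", "wheel", "engine"},
    "exhaust": {"car"},
    "wheel": {"car"},
    "train": {"carriage"},
}
story_terms = ["car", "exhaust", "banana", "wheel", "train"]
print(build_chains(story_terms, toy_lexicon))
```

Only the chain for "car" reaches three elements here, so it is the only one kept; the threshold of 3 comes from the procedure, while everything else in the example data is invented.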

  17. Computing Chain_Sim(Trackset_i, Story_j) • The Overlap Coefficient may be defined as follows, for two lexical chains c1 and c2: • Overlap Coefficient = |c1 ∩ c2| / min(|c1|, |c2|)
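A minimal sketch of the overlap coefficient for two chains, assuming the standard definition (size of the intersection divided by the size of the smaller chain) and treating each chain as a set of terms. How the per-chain scores are combined into Chain_Sim for a whole track set and story is not shown here, and the function name is hypothetical.

```python
from typing import Iterable


def overlap_coefficient(c1: Iterable[str], c2: Iterable[str]) -> float:
    """Shared elements divided by the size of the smaller chain."""
    s1, s2 = set(c1), set(c2)
    if not s1 or not s2:
        return 0.0  # assumption: an empty chain overlaps with nothing
    return len(s1 & s2) / min(len(s1), len(s2))


chain_a = ["car", "exhaust", "wheel"]
chain_b = ["car", "wheel", "train", "rail"]
print(overlap_coefficient(chain_a, chain_b))
```

Dividing by the smaller chain means a short chain fully contained in a longer one scores 1.0, which a Jaccard-style measure would not give.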

  18. Evaluation Metrics • System returns a set S of documents; S' is the set of documents not returned: • a = # in S discussing new events • b = # in S not discussing new events • c = # in S' discussing new events • d = # in S' not discussing new events • Recall = a / (a+c) • Precision = a / (a+b) • Miss Rate = c / (a+c) = 1 - Recall • False Alarm Rate = b / (b+d) = Fallout
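The four measures follow directly from the counts a, b, c and d. The sketch below is a straightforward transcription of the formulas; the function name and the choice to return 0.0 when a denominator is zero are assumptions made for the example.

```python
from typing import Dict


def tdt_metrics(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Compute recall, precision, miss rate and false alarm rate from counts."""
    def ratio(numerator: int, denominator: int) -> float:
        # Assumption: report 0.0 rather than fail when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    return {
        "recall": ratio(a, a + c),
        "precision": ratio(a, a + b),
        "miss_rate": ratio(c, a + c),
        "false_alarm_rate": ratio(b, b + d),
    }


# Made-up counts purely to exercise the formulas.
print(tdt_metrics(a=8, b=2, c=2, d=88))
```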

  19. Tracking Results

  20. Tracking Results

  21. Detection Results

  22. Analysis of results • Expected trade-off between precision and recall • Small number of stories are sufficient to construct a tracking query • Performance in line with other TDT researchers • Lexical Chains - Improvement not significant ?

  23. TDT and Lexical Chain References • Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y., “Topic Detection and Tracking Pilot Study: Final Report”, Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Morgan Kaufmann, San Francisco, 1998. • Allan, J., Papka, R., and Lavrenko, V., “Online New Event Detection and Tracking”, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998. • Barzilay, R., “Lexical Chains for Summarization”, M.Sc. Thesis, Ben-Gurion University of the Negev, Israel, November 1997. • Barzilay, R., and Elhadad, M., “Using Lexical Chains for Text Summarization”, The Fifth Bar-Ilan Symposium on Foundations of Artificial Intelligence Focusing on Intelligent Agents, Bar-Ilan University, Ramat Gan, Israel, June 1997. • Budanitsky, A., “Lexical Semantic Relatedness and its Application in Natural Language Processing”, Ph.D. Thesis, Technical Report CSRG-390, University of Toronto, 1999. • Ellman, J., “Using Roget's Thesaurus to Determine the Similarity of Texts”, Ph.D. Thesis, University of Sunderland, 2000. • Fellbaum, C., (Ed.), WordNet: An Electronic Lexical Database and Some of its Applications, MIT Press, 1998. • Green, S.J., “Automatically Generating Hypertext by Computing Semantic Similarity”, Ph.D. Thesis, University of Toronto, 1997.

  24. Halliday, M.A.K. and Hasan, R., “Cohesion In English”, Longman, 1976. • Hatch, P., “Lexical Chaining for the Online Detection of New Events”, M.Sc. Thesis, University College Dublin, 2000. • Hirst, G., and St-Onge, D., “Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms”, in WordNet: An Electronic Lexical Database and Some of its Applications, Fellbaum, C., (Ed.), MIT Press, 1998. • Kazman, R., Al-Halimi, R., Hunt, W., and Mantei, M., “Four Paradigms for Indexing Video Conferences”, IEEE MultiMedia, 3 (1), Spring 1996. • Mochizuki, H., Iwayama, M., and Okumura, M., “Passage Level Document Retrieval Using Lexical Chains”, RIAO 2000, Content Based Multimedia Information Access, 491-506, 2000. • Morris, J., and Hirst, G., “Lexical Cohesion, the Thesaurus, and the Structure of Text”, Computational Linguistics, 17 (1), 211-232, 1991. • Okumura, M., and Honda, T., “Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion”, in Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-94), Vol. 2, 755-761, Kyoto, Japan, August 1994. • Porter, M.F., “An Algorithm for Suffix Stripping”, Program, 14, 130-137, 1980. • Robertson, S.E. and Sparck Jones, K., “Simple Approaches to Text Retrieval”, University of Cambridge Computing Laboratory Technical Report Number 356, May 1997. • Stairmand, M.A., “A Computational Analysis of Lexical Cohesion with Applications in Information Retrieval”, Ph.D. Thesis, UMIST, 1996. • Stokes, N., and Carthy, J., “First Story Detection using a Composite Document Representation”, HLT 2001, Human Language Technology Conference, San Diego, California, March 18-21, 2001. • TDT2000, “The Year 2000 Topic Detection and Tracking (TDT2000) Task Definition and Evaluation Plan”, available at the following URL: http://morph.ldc.upenn.edu/TDT/Guide/manual.front.html, November 2000.
