
Lecture 16: Filtering & TDT



Presentation Transcript


  1. Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday 10:30 am - 12:00 pm Spring 2006 http://www.sims.berkeley.edu/academics/courses/is240/s06/ Lecture 16: Filtering & TDT Principles of Information Retrieval

  2. Overview • Review • LSI • Filtering & Routing • TDT – Topic Detection and Tracking

  3. Overview • Review • LSI • Filtering & Routing • TDT – Topic Detection and Tracking

  4. How LSI Works • Start with a matrix of terms by documents • Analyze the matrix using SVD to derive a particular “latent semantic structure model” • Two-Mode factor analysis, unlike conventional factor analysis, permits an arbitrary rectangular matrix with different entities on the rows and columns • Such as Terms and Documents

  5. How LSI Works • The rectangular matrix is decomposed into three other matrices of a special form by SVD • The resulting matrices contain “singular vectors” and “singular values” • The matrices show a breakdown of the original relationships into linearly independent components or factors • Many of these components are very small and can be ignored – leading to an approximate model that contains many fewer dimensions

  6. How LSI Works
  Titles:
  C1: Human machine interface for LAB ABC computer applications
  C2: A survey of user opinion of computer system response time
  C3: The EPS user interface management system
  C4: System and human system engineering testing of EPS
  C5: Relation of user-perceived response time to error measurement
  M1: The generation of random, binary, unordered trees
  M2: The intersection graph of paths in trees
  M3: Graph minors IV: Widths of trees and well-quasi-ordering
  M4: Graph minors: A survey
  Italicized words occur in multiple docs and are indexed

  7. How LSI Works
  Term-document matrix (terms x documents):
             c1  c2  c3  c4  c5  m1  m2  m3  m4
  human       1   0   0   1   0   0   0   0   0
  interface   1   0   1   0   0   0   0   0   0
  computer    1   1   0   0   0   0   0   0   0
  user        0   1   1   0   1   0   0   0   0
  system      0   1   1   2   0   0   0   0   0
  response    0   1   0   0   1   0   0   0   0
  time        0   1   0   0   1   0   0   0   0
  EPS         0   0   1   1   0   0   0   0   0
  survey      0   1   0   0   0   0   0   0   1
  trees       0   0   0   0   0   1   1   1   0
  graph       0   0   0   0   0   0   1   1   1
  minors      0   0   0   0   0   0   0   1   1

  8. How LSI Works [Figure: the term-document space reduced to 2 dimensions by SVD, plotted on dimensions 1 and 2. Blue dots are terms, red squares are documents, and the blue square is the query “Human Computer Interaction”. The dotted cone marks cosine 0.9 from the query; even documents with no terms in common with the query (c3 and c5) lie within the cone.]

  9. How LSI Works • X = T0 S0 D0′, where X is the t x d term-by-document matrix, T0 is t x m, S0 is m x m, and D0′ is m x d • T0 has orthogonal, unit-length columns (T0′ T0 = I) • D0 has orthogonal, unit-length columns (D0′ D0 = I) • S0 is the diagonal matrix of singular values • t is the number of rows in X, d is the number of columns in X, and m is the rank of X (≤ min(t, d))
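The decomposition above can be sketched in a few lines of NumPy. The matrix below is the term-document matrix from slide 7; truncating the SVD to k = 2 dimensions gives the two-dimensional space plotted on slide 8, and a query such as “human computer” can be folded into that space and compared to documents by cosine similarity. This is an illustrative sketch, not code from the lecture; the fold-in formula and the cosine comparison are the standard textbook approach.

```python
# Minimal LSI sketch (illustrative, not from the lecture).
import numpy as np

terms = ["human", "interface", "computer", "user", "system",
         "response", "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-document matrix X from slide 7.
X = np.array([
    [1,0,0,1,0,0,0,0,0],  # human
    [1,0,1,0,0,0,0,0,0],  # interface
    [1,1,0,0,0,0,0,0,0],  # computer
    [0,1,1,0,1,0,0,0,0],  # user
    [0,1,1,2,0,0,0,0,0],  # system
    [0,1,0,0,1,0,0,0,0],  # response
    [0,1,0,0,1,0,0,0,0],  # time
    [0,0,1,1,0,0,0,0,0],  # EPS
    [0,1,0,0,0,0,0,0,1],  # survey
    [0,0,0,0,0,1,1,1,0],  # trees
    [0,0,0,0,0,0,1,1,1],  # graph
    [0,0,0,0,0,0,0,1,1],  # minors
], dtype=float)

# Full SVD: X = T0 @ diag(s0) @ D0t
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values (rank-k approximation).
k = 2
Tk, sk, Dkt = T0[:, :k], s0[:k], D0t[:k, :]

# Document coordinates in the reduced space (one row per document).
doc_vecs = Dkt.T

# Fold the query into the same space: q_hat = q' Tk inv(Sk)
q = np.zeros(len(terms))
q[terms.index("human")] = 1
q[terms.index("computer")] = 1
q_vec = q @ Tk @ np.linalg.inv(np.diag(sk))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Rank documents by cosine similarity to the folded-in query.
ranking = sorted(zip(docs, (cosine(q_vec, d) for d in doc_vecs)),
                 key=lambda pair: -pair[1])
print(ranking)  # c3 and c5 score highly even though they share no terms with the query
```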

  10. Overview • Review • LSI • Filtering & Routing • TDT – Topic Detection and Tracking

  11. Filtering • Characteristics of Filtering systems: • Designed for unstructured or semi-structured data • Deal primarily with text information • Deal with large amounts of data • Involve streams of incoming data • Filtering is based on descriptions of individual or group preferences – profiles. May be negative profiles (e.g. junk mail filters) • Filtering implies removing non-relevant material as opposed to selecting relevant material

  12. Filtering • Similar to IR, with some key differences • Similar to Routing – sending relevant incoming data to different individuals or groups is virtually identical to filtering – with multiple profiles • Similar to Categorization systems – attaching one or more predefined categories to incoming data objects – is also similar, but is more concerned with static categories (might be considered information extraction)

  13. Structure of an IR System [Diagram adapted from Soergel, p. 19: a storage line, in which documents and data are indexed (descriptively and by subject) and stored as document representations, and a search line, in which interest profiles and queries are formulated in terms of descriptors and stored as profiles/search requests. Both sides follow shared “rules of the game” (rules for subject indexing plus a thesaurus consisting of a lead-in vocabulary and an indexing language). Comparison/matching of the two stores yields potentially relevant documents.]

  14. Structure of a Filtering System [Diagram adapted from Soergel, p. 19: an incoming stream of raw documents and data is indexed, categorized, and/or extracted into a stream of document surrogates; interest profiles from individual or group users are formulated in terms of descriptors and stored. The surrogate stream is compared against the stored profiles (filtering), again under shared “rules of the game” (rules for subject indexing plus a thesaurus of lead-in vocabulary and indexing language), yielding potentially relevant documents for those users.]

  15. Major differences between IR and Filtering • IR concerned with single uses of the system • IR recognizes inherent faults of queries • Filtering assumes profiles can be better than IR queries • IR concerned with collection and organization of texts • Filtering is concerned with distribution of texts • IR is concerned with selection from a static database. • Filtering concerned with dynamic data stream • IR is concerned with single interaction sessions • Filtering concerned with long-term changes

  16. Contextual Differences • In filtering the timeliness of the text is often of greatest significance • Filtering often has a less well-defined user community • Filtering often has privacy implications (how complete are user profiles? what do they contain?) • Filtering profiles can (should?) adapt to user feedback • Conceptually similar to Relevance feedback

  17. Methods for Filtering • Adapted from IR • E.g. use a retrieval ranking algorithm against incoming documents. • Collaborative filtering • Individual and comparative profiles
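As a concrete illustration of the first method (adapting an IR ranking algorithm), the sketch below scores each incoming document against stored profiles with bag-of-words cosine similarity and delivers it to any user whose score clears a threshold. The profile texts, thresholds, and the crude term weighting are assumptions for the example, not part of the lecture; a real system would use TF-IDF weights, stemming, and tuned thresholds.

```python
# Illustrative profile-based filtering/routing sketch (not from the lecture).
import math
from collections import Counter

def vectorize(text):
    """Crude bag-of-words vector; a real system would use TF-IDF, stemming, etc."""
    return Counter(text.lower().split())

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# One profile per user (or group); thresholds are per-profile.
profiles = {
    "alice": (vectorize("latent semantic indexing retrieval"), 0.2),
    "bob":   (vectorize("topic detection tracking news stream"), 0.2),
}

def route(incoming_stream):
    """Yield (user, doc) pairs for every incoming document that matches a profile."""
    for doc in incoming_stream:
        doc_vec = vectorize(doc)
        for user, (profile_vec, threshold) in profiles.items():
            if cosine(doc_vec, profile_vec) >= threshold:
                yield user, doc

stream = ["new results on topic tracking for broadcast news",
          "a recipe for sourdough bread"]
for user, doc in route(stream):
    print(user, "<-", doc)
```

With multiple profiles, as noted on slide 12, the same loop performs routing: each matching document is sent to every user whose profile it satisfies.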

  18. TREC Filtering Track • Original Filtering Track • Participants are given a starting query • They build a profile using the query and the training data • The test involves submitting the profile (which is not changed) and then running it against a new data stream • New Adaptive Filtering Track • Same, except the profile can be modified as each new relevant document is encountered. • Since streams are being processed, there is no ranking of documents
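In the adaptive track the profile is modified as relevant documents are encountered. A minimal sketch of such an update, in the spirit of Rocchio-style relevance feedback and reusing the vectorize() helper from the filtering sketch above; the weights alpha and beta are illustrative assumptions, not values from the track.

```python
# Illustrative adaptive profile update (Rocchio-style; weights are assumptions).
from collections import Counter

def update_profile(profile, relevant_doc_vec, alpha=0.9, beta=0.1):
    """Move the profile vector toward a document the user judged relevant."""
    updated = Counter()
    for term in set(profile) | set(relevant_doc_vec):
        updated[term] = (alpha * profile.get(term, 0.0) +
                         beta * relevant_doc_vec.get(term, 0.0))
    return updated
```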

  19. TREC-8 Filtering Track • Following Slides from the TREC-8 Overview by Ellen Voorhees • http://trec.nist.gov/presentations/TREC8/overview/index.htm

  20. Overview • Review • LSI • Filtering & Routing • TDT – Topic Detection and Tracking

  21. TDT: Topic Detection and Tracking • Intended to automatically identify new topics – events, etc. – from a stream of text and follow the development/further discussion of those topics

  22. Topic Detection and Tracking • Introduction and Overview • The TDT3 R&D Challenge • TDT3 Evaluation Methodology • Slides from “NIST Topic Detection and Tracking: Introduction and Overview” by G. Doddington • http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt99/presentations/index.htm

  23. TDT Task Overview*
  5 R&D Challenges: Story Segmentation, Topic Tracking, Topic Detection, First-Story Detection, Link Detection
  TDT3 Corpus Characteristics:†
  • Two types of sources: text and speech
  • Two languages: English (30,000 stories), Mandarin (10,000 stories)
  • 11 different sources: 8 English (ABC, CNN, VOA, PRI, NBC, MNB, APW, NYT), 3 Mandarin (VOA, XIN, ZBN)
  * see http://www.itl.nist.gov/iaui/894.01/tdt3/tdt3.htm for details
  † see http://morph.ldc.upenn.edu/Projects/TDT3/ for details

  24. Preliminaries • A topic is … a seminal event or activity, along with all directly related events and activities. • A story is … a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.

  25. Example Topic Title: Mountain Hikers Lost • WHAT: 35 or 40 young Mountain Hikers were lost in an avalanche in France around the 20th of January. • WHERE: Orres, France • WHEN: January 1998 • RULES OF INTERPRETATION: 5. Accidents

  26. The Segmentation Task: To segment the source stream into its constituent stories, for all audio sources. • Transcription: text (words) (for radio and TV only) • [Figure: the source stream divided into story and non-story segments.]
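One common (TextTiling-style) way to hypothesize story boundaries, sketched below purely for illustration and not the method evaluated in TDT3, is to measure lexical cohesion between adjacent windows of the transcript and place a boundary where cohesion dips. The window size and threshold are assumptions; the sketch reuses vectorize() and cosine() from the filtering sketch above.

```python
# Illustrative story-boundary sketch: boundary where lexical cohesion drops.
def find_boundaries(sentences, window=3, threshold=0.1):
    boundaries = []
    for i in range(window, len(sentences) - window):
        left = vectorize(" ".join(sentences[i - window:i]))
        right = vectorize(" ".join(sentences[i:i + window]))
        if cosine(left, right) < threshold:   # low cohesion -> likely story boundary
            boundaries.append(i)
    return boundaries
```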

  27. Story Segmentation Conditions • 1 Language Condition: • 3 Audio Source Conditions: • 3 Decision Deferral Conditions:

  28. The Topic Tracking Task: To detect stories that discuss the target topic, in multiple source streams. • Find all the stories that discuss a given target topic • Training: Given Nt sample stories that discuss a given target topic • Test: Find all subsequent stories that discuss the target topic • [Figure: a timeline of training data (Nt on-topic stories, the rest unknown) followed by test data (unknown). New this year: training stories other than the Nt samples are not guaranteed to be off-topic.]
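A minimal way to implement tracking, sketched below, is to build a centroid from the Nt training stories and flag any later story whose similarity to that centroid clears a threshold. The threshold and the bag-of-words representation (reusing vectorize() and cosine() from the filtering sketch) are assumptions for illustration, not the systems actually evaluated in TDT3.

```python
# Illustrative topic-tracking sketch: centroid of Nt training stories + threshold.
from collections import Counter

def build_centroid(training_stories):
    """Average the bag-of-words vectors of the Nt on-topic training stories."""
    centroid = Counter()
    for story in training_stories:
        for term, weight in vectorize(story).items():
            centroid[term] += weight / len(training_stories)
    return centroid

def track(centroid, story_stream, threshold=0.15):
    """Yield a YES/NO on-topic decision for each incoming story."""
    for story in story_stream:
        yield story, cosine(vectorize(story), centroid) >= threshold
```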

  29. Topic Tracking Conditions • 9 Training Conditions: • 1 Language Test Condition: • 3 Source Conditions: • 2 Story Boundary Conditions:

  30. The Topic Detection Task: To detect topics in terms of the (clusters of) stories that discuss them. • Unsupervised topic training: a meta-definition of topic is required, independent of topic specifics • New topics must be detected as the incoming stories are processed • Input stories are then associated with one of the topics • [Figure: incoming stories grouped into clusters, each cluster forming a topic.]
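Detection is often approached with single-pass incremental clustering: each incoming story is compared to the existing topic clusters and either joins the best-matching cluster or, if nothing is close enough, starts a new one. The sketch below (reusing the bag-of-words helpers defined earlier, with an assumed threshold) illustrates that idea; it is not the lecture's own algorithm. A story that starts a new cluster is exactly a first story in the sense of the next task.

```python
# Illustrative single-pass topic-detection sketch (threshold is an assumption).
from collections import Counter

def detect_topics(story_stream, threshold=0.2):
    clusters = []   # each cluster: {"centroid": Counter, "stories": [...]}
    for story in story_stream:
        vec = vectorize(story)
        scores = [cosine(vec, c["centroid"]) for c in clusters]
        best = max(range(len(clusters)), key=lambda i: scores[i], default=None)
        if best is not None and scores[best] >= threshold:
            clusters[best]["stories"].append(story)   # assign to an existing topic
            clusters[best]["centroid"].update(vec)    # crude centroid update
        else:
            # nothing close enough: start a new topic cluster (a first story)
            clusters.append({"centroid": Counter(vec), "stories": [story]})
    return clusters
```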

  31. Topic Detection Conditions • 3 Language Conditions: • 3 Source Conditions: • Decision Deferral Conditions: • 2 Story Boundary Conditions:

  32. The First-Story Detection Task: To detect the first story that discusses a topic, for all topics. • There is no supervised topic training (as in Topic Detection) • [Figure: a timeline of stories from two topics; the earliest story of each topic is marked as a first story, all later ones as not first stories.]
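First-story detection falls out of the same single-pass scheme: a story is declared a first story exactly when it is not close enough to anything seen so far. A minimal sketch, again with an assumed threshold and the earlier helpers:

```python
# Illustrative first-story detection sketch (threshold is an assumption).
def first_story_decisions(story_stream, threshold=0.2):
    seen = []   # vectors of all previously processed stories
    for story in story_stream:
        vec = vectorize(story)
        is_first = all(cosine(vec, prev) < threshold for prev in seen)
        seen.append(vec)
        yield story, is_first
```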

  33. First-Story Detection Conditions • 1 Language Condition: • 3 Source Conditions: • Decision Deferral Conditions: • 2 Story Boundary Conditions:

  34. The Link Detection Task To detect whether a pair of stories discuss the same topic. same topic? • The topic discussed is a free variable. • Topic definition and annotation is unnecessary. • The link detection task represents a basic functionality, needed to support all applications (including the TDT applications of topic detection and tracking). • The link detection task is related to the topic tracking task, with Nt = 1.
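Since link detection asks only whether a pair of stories discuss the same topic, a baseline is a single thresholded similarity comparison; the threshold and representation below are assumptions, reusing the earlier helpers.

```python
# Illustrative link-detection baseline: YES if the two stories are similar enough.
def same_topic(story_a, story_b, threshold=0.2):
    return cosine(vectorize(story_a), vectorize(story_b)) >= threshold
```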

  35. Link Detection Conditions • 1 Language Condition: • 3 Source Conditions: • Decision Deferral Conditions: • 1 Story Boundary Condition:

  36. TDT3 Evaluation Methodology • All TDT3 tasks are cast as statistical detection (yes-no) tasks. • Story Segmentation: Is there a story boundary here? • Topic Tracking: Is this story on the given topic? • Topic Detection: Is this story in the correct topic-clustered set? • First-story Detection: Is this the first story on a topic? • Link Detection: Do these two stories discuss the same topic? • Performance is measured in terms of detection cost, which is a weighted sum of miss and false alarm probabilities: CDet = CMiss • PMiss • Ptarget + CFA • PFA • (1 - Ptarget) • Detection Cost is normalized to lie between 0 and 1: (CDet)Norm = CDet / min{CMiss • Ptarget, CFA • (1 - Ptarget)}
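The cost function is straightforward to compute once miss and false-alarm rates are known. The sketch below implements the two formulas from the slide; the default parameter values (CMiss = 1, CFA = 0.1, Ptarget = 0.02) are commonly cited TDT settings, stated here as assumptions rather than read from the slide.

```python
# Detection cost as defined on the slide; default parameters are assumptions.
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    c_norm = c_det / min(c_miss * p_target, c_fa * (1 - p_target))
    return c_det, c_norm

# Example: a system that misses 10% of on-topic stories and false-alarms on 1%.
print(detection_cost(p_miss=0.10, p_fa=0.01))   # -> (0.00298, 0.149)
```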

  37. Example Performance Measures [Figure: normalized tracking cost on a log scale (roughly 0.01 to 1) for English and Mandarin; tracking results on newswire text (BBN).]

  38. More on TDT • Some slides from James Allan from the HICSS meeting in January 2005
