
Introduction to Cross-Document Coreference




  1. Introduction to Cross-Document Coreference Amit Bagga StreamSage/Comcast Amit_Bagga@cable.comcast.com

  2. Outline • Motivation and Definition • Comparison with Within-Document Coreference, WSD and other NL tasks • Methodologies for Entity Cross-Document Coreference • Other types of Cross-Document Coreference • Concept Cross-Document Coreference • Event Cross-Document Coreference • Cross-Media Coreference • Cross-Language, Cross-Document Coreference • Scoring Methodologies

  3. Motivation • Proper names comprise approximately 10% of news text (Coates-Stephens, 1992) • Names are often ambiguous across documents • increasingly becoming a challenge for NLP systems as collection size and generality grow • also as systems break the “document boundary”

  4. Definition • Cross-Document Coreference (CDC) for entities, in broad terms, asks • how can one computationally disambiguate the intended referent of a name • Winchester & Lee 2002 • for example, it asks, which ‘John Smith’ is meant by a particular occurrence of the string “John Smith”

  5. Comparison with Within-Document Coreference • Within a document • Identical or similarly named entities seldom appear in the same context • when they do, writers distinguish them explicitly • i.e. it is usually the case that we have one referent per discourse • Variant forms of the same name generally obey certain predictable regularities • For example: Michael Jordan may be referred to as Michael, Mr. Jordan, Jordan, etc.

  6. Across documents • The assumption that the same or similar names refer to the same entity is not valid • Linguistic theories do not apply • The only way to distinguish between these entities is to examine context

  7. Comparison with WSD • CDC can be thought of as disambiguating the “sense” of usage of a name • In WSD: • Usually possible to enumerate a priori all possible senses of word • Number of possible senses of word is small (1-10) • In CDC: • A large corpus can contain 10s or 100s of entities with same name which are impossible to enumerate a priori • From linguistic perspective, all entities equally plausible

  8. The Role of Context • Similar to WSD, context is vital for CDC • context can be of different sizes • window of words centered around a name, sentence containing name, group of sentences, or even whole document • modeling context can be done in many different ways • bag of words, set of phrases, set of entities, set of relations, etc. • All CDC systems use context in one form or another

  9. Bag of Words Approach • Bagga and Baldwin, 1998 • Within-document coreference system is used to identify all mentions of entity • Sentences containing a mention are extracted from each document • “summaries” with respect to entity • Set of summaries compared using VSM (tf*idf) • Single-link clustering used • Version 2 (1999) eliminates use of within-document coreference system • sentences containing any variant of name extracted
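
The summary comparison on this slide can be sketched in a few lines. This is a minimal illustration, not the Bagga and Baldwin implementation: the smoothed idf formula, the whitespace tokenizer, the threshold value, and the function names (`tfidf_vectors`, `cosine`, `single_link_clusters`) are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(summaries):
    """Turn each summary string into a sparse tf*idf vector (a dict)."""
    docs = [Counter(s.lower().split()) for s in summaries]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    # Assumed smoothing: terms found in every summary keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def single_link_clusters(summaries, threshold):
    """Single-link clustering: two summaries end up in the same chain if a
    path of pairwise similarities above the threshold connects them."""
    vecs = tfidf_vectors(summaries)
    parent = list(range(len(vecs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) > threshold:
                parent[find(i)] = find(j)

    chains = {}
    for i in range(len(vecs)):
        chains.setdefault(find(i), []).append(i)
    return sorted(chains.values())
```

Because single-link merging only needs one strong pair, a single bridging summary can join two otherwise unrelated chains, so the result is sensitive to the threshold.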

  10. Corpus, Evaluation, and Results • 197 articles containing “John Smith” extracted from 2 years of New York Times data • 35 different John Smiths • B-CUBED algorithm used • Version 2 results • 84% F-Measure • 90% Precision, 78% Recall • < 1% F-Measure drop when compared to original system
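
For readers unfamiliar with B-CUBED scoring, here is a small sketch of the per-document computation, assuming equal weight for every document and an F-Measure that is the plain harmonic mean of precision and recall. The function name `b_cubed` and the dict-based input format are choices made for this example.

```python
def b_cubed(system, truth):
    """B-CUBED precision, recall, and F-Measure.

    system, truth: dicts mapping each document id to a chain label.
    For each document, precision is the fraction of its system chain that
    shares its true entity; recall is the fraction of its true entity's
    documents that landed in its system chain. Both are averaged over
    documents (equal weights assumed here).
    """
    items = list(truth)
    precision = recall = 0.0
    for d in items:
        sys_chain = {x for x in items if system[x] == system[d]}
        true_chain = {x for x in items if truth[x] == truth[d]}
        correct = len(sys_chain & true_chain)
        precision += correct / len(sys_chain)
        recall += correct / len(true_chain)
    p = precision / len(items)
    r = recall / len(items)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

As a consistency check on the numbers above, the harmonic mean of 90% precision and 78% recall is about 83.6%, which rounds to the reported 84% F-Measure.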

  11. Minimizing Context Matches • Kazi and Ravin, 2000 • Problem with Bagga and Baldwin, 1998 • Prohibitively expensive in terms of storage and n-to-n comparisons (especially in a large corpus) • Use IBM’s Nominator for named entity identification and within-document coreference (non-pronominal) • CDC task is merging canonical names from different documents that refer to same entity • Context analysis done by use of a Context Thesaurus • Given a name, returns a ranked list of terms that are related to name in the corpus

  12. E = Exclusives – i.e. no merging possible • M = Mergeables – i.e. compatible with some or all exclusives

  13. Tables are created by analyzing two lists sorted by ambiguity • PERS names • George Walker Bush > George W. Bush > George Bush > G. Bush > Bush • PLACE names • Albany, NY > Albany • Merging steps • Merge identical canonical strings >= 2 words • Merges 28 George Bush, 2 President Bush, and 7 Vannevar Bush articles into 3 equivalence classes • Between mergeables and exclusives, combine if any variants share a common prefix • Merges E3, M1 and M3 (common prefix = President) • Reduces # of context matches from 58x58 to 7x4
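
The first merging step can be illustrated with a short sketch. It covers only "merge identical canonical strings >= 2 words"; the prefix-based step between mergeables and exclusives is not modeled. The function name and the `(doc_id, canonical_name)` input format are assumptions for this example.

```python
from collections import defaultdict


def merge_identical_canonical(names_by_doc, min_words=2):
    """Group documents whose canonical name strings are identical.

    names_by_doc: iterable of (doc_id, canonical_name) pairs.
    Names shorter than min_words are too ambiguous to merge on string
    identity alone, so they are returned separately for later steps.
    """
    classes = defaultdict(list)
    unmerged = []
    for doc_id, name in names_by_doc:
        if len(name.split()) >= min_words:
            classes[name].append(doc_id)
        else:
            unmerged.append((doc_id, name))
    return dict(classes), unmerged
```

Run on the counts from the slide (28 George Bush, 2 President Bush, and 7 Vannevar Bush articles), this yields the 3 equivalence classes mentioned above.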

  14. Corpus, Evaluation, and Results • Corpus – 1998 editions of New York Times • 15 name families • For example: Berger, Black, Brown, Bush, Clinton, Gore, etc. • B-CUBED algorithm for scoring • Without context comparisons: • Avg Precision = 98.5% • Avg Recall = 72.85% • No results reported when context comparisons are used (Ravin and Kazi, 1999)

  15. 3 Models of Similarity • Gooi and Allan, 2004 • Methodology similar to Bagga and Baldwin • extract 55 word snippets centered at name or its variant • Problem with Bagga and Baldwin • sharp drop off in F-Measure around threshold • 3 different models of similarity • Incremental Vector Space • tf*idf, but with average link clustering • KL divergence • snippets are represented as probability distribution of words • similarity = “distance” between two probability distributions • Agglomerative Vector Space • tf*idf with bottom-up, complete-link clustering
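
A minimal sketch of the KL divergence model follows. Plain KL divergence is asymmetric and undefined when a word has zero probability in one snippet, so this example assumes additive smoothing (`alpha`) and averages the two directions; neither choice is taken from Gooi and Allan, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_distribution(snippet, vocab, alpha=0.1):
    """Smoothed probability distribution of words over a shared vocabulary."""
    counts = Counter(snippet.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def snippet_distance(a, b, alpha=0.1):
    """Symmetrized KL divergence between two snippets (smaller = more alike)."""
    vocab = set(a.lower().split()) | set(b.lower().split())
    p = word_distribution(a, vocab, alpha)
    q = word_distribution(b, vocab, alpha)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

Since this is a distance rather than a similarity, a clustering step would merge snippets whose distance falls below a threshold.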

  16. Corpus • John Smith corpus (Bagga and Baldwin) • Person-x corpus • created by querying TREC collection with queries like arts, business, sports, etc. • BBN’s IdentiFinder used for named entity recognition • one name (and its corresponding variants) randomly replaced with phrase Person-x • 34,404 documents; 14,767 actual unique entities

  17. Evaluation and Results • B-CUBED algorithm used for scoring • Agglomerative VS best • 88.2% F-Measure for John Smith corpus • 83% F-Measure for Person-x corpus • When run on each sub-corpus (arts, sports, etc.) of Person-x corpus • F-Measure drops to 77% • shows that a more homogeneous corpus is more difficult • Results for Agglomerative VS degrade much more smoothly around threshold than others

  18. Second Order Co-Occurrence • Three methods – independently published • Bagga, Baldwin, and Ramesh, 2001 - 2-pass algorithm • First pass: as before • Second pass: • for each chain, compute set of most frequent overlapping words in chain (signature words for chain) • for each singleton document after pass 1, compare to each chain • use signature words to extract additional sentences • compare enhanced summary to every summary in chain • merge if similarity > threshold • if not merged with any chain, remains singleton
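
The second pass can be sketched roughly as follows. The slide does not say how "most frequent overlapping words" are chosen, so this example assumes they are the words shared by the largest number of summaries in a chain, with the name itself excluded through a stopword set. The function names and the `top_k` parameter are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def signature_words(chain_summaries, top_k=5, stopwords=frozenset()):
    """Words that occur in at least two summaries of a chain, ranked by the
    number of summaries containing them (ties broken alphabetically)."""
    df = Counter()
    for summary in chain_summaries:
        df.update(set(tokens(summary)) - stopwords)
    shared = [(w, c) for w, c in df.items() if c >= 2]
    shared.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in shared[:top_k]]


def enhance_summary(doc_sentences, summary_sentences, signature):
    """Add sentences from a singleton document that mention a signature word."""
    sig = set(signature)
    extra = [
        s for s in doc_sentences
        if s not in summary_sentences and sig & set(tokens(s))
    ]
    return list(summary_sentences) + extra
```

The enhanced summary would then be compared against every summary in the chain with the same similarity measure used in the first pass.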

  19. Winchester and Lee, 2001 • named entity detection and conflation within documents is done as pre-processing step • based on Schütze’s (1998) algorithm for context-group discrimination • 3 types of vectors are created • Term Vectors – formed for each name occurring in context of entity of interest and its variants • stores co-occurrence stats for term across whole corpus • Context Vectors – formed for entity of interest by summing all term vectors associated with its context • term vectors are weighted with their idf scores before sum • Entity Vectors – for each entity, it is centroid of set of context vectors • entity disambiguation is done by comparing Entity Vectors using VSM with single-link clustering
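
The three vector types can be sketched as below. This is a simplified illustration: it builds term vectors for every token rather than only for names, treats each context as an unordered token list, and uses `log(n / df)` as the idf weight. All of these are assumptions, as are the function names.

```python
import math
from collections import Counter, defaultdict


def build_term_vectors(contexts):
    """Term vector: for each term, how often every other term shares a
    context with it anywhere in the corpus."""
    vectors = defaultdict(Counter)
    for toks in contexts:
        unique = set(toks)
        for t in unique:
            for u in unique:
                if u != t:
                    vectors[t][u] += 1
    return vectors


def idf_weights(contexts):
    """idf = log(number of contexts / number of contexts containing term)."""
    n = len(contexts)
    df = Counter(t for toks in contexts for t in set(toks))
    return {t: math.log(n / c) for t, c in df.items()}


def context_vector(toks, term_vecs, idf):
    """Context vector: idf-weighted sum of the term vectors of its terms."""
    total = Counter()
    for t in set(toks):
        for u, c in term_vecs[t].items():
            total[u] += idf.get(t, 0.0) * c
    return total


def entity_vector(context_vecs):
    """Entity vector: centroid of the entity's context vectors."""
    centroid = Counter()
    for v in context_vecs:
        for u, w in v.items():
            centroid[u] += w / len(context_vecs)
    return centroid
```

The second-order effect shows up in the context vector: a word can receive weight even though it never appears in that context, because it co-occurs elsewhere with words that do.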

  20. Corpus, Evaluation, Results • Bagga, Baldwin, and Ramesh • John Smith corpus, B-CUBED scoring • new F-Measure 91% (+7 from before) • Winchester and Lee • 30 name sets; 10 each of PER, LOC, ORG • from 6000 WSJ articles • B-CUBED scoring • discovered that selective creation of 3 types of vectors boosts performance • for example, LOC helps disambiguate other LOC • Birmingham, Alabama vs UK; John Smith associated with Pocahontas • overall F-Measure 78.5% • NAM – 90.3%, LOC – 79.2%, ORG – 72.5%

  21. Guha and Garg, 2004 • mine descriptions associated with entity of interest (sketch) • descriptions are other entities + professions that are in close proximity • comparing descriptions • different weights given to different descriptions given type of entity of interest and entity-type of description • for example: location is more likely to be disambiguated by another location than by the name of a person • Corpus and Evaluation • 26 entities (names + places), 2-6 instances identified of each • sent as queries to search engines, top 150 results collated and manually tagged for truth • best F-Measure = 90.3%

  22. Maximum Entropy Model • Fleischman and Hovy, 2004 – use ME to determine if two concept/instance pairs are same entity • concept/instance pairs – ACL dataset (2M pairs) • John Edwards/lawyer and John Edwards/politician • Name features: NAME-COMMON (census), NAME-FAME (ACL dataset), WEB-FAME (Google) • Web features: based on # of Google hits with name plus headwords of concepts used as queries • Overlap features: based on # words overlapping in context of names and concepts • Semantic features: based on semantic relatedness of concepts (WordNet) • for example: lawyers are more likely to become politicians • Estimated Statistics features: probabilities that a name is associated with a particular concept (computed over entire ACL dataset) • Disambiguation using group-average agglomerative clustering • Tested on set of 31 concept/instance pairs (1875 used for training) • 20 had a single referent • F-Measure = 93.9% • baseline (all in same chain) = 92.4%

  23. Robust Reading Approach • Li, Morie, and Roth, 2004 • a global probabilistic view of how documents are generated and how entities are “sprinkled” into them • Model 1 (simplest – no notion of author) • entities are present in a document with a prior probability, independent of other entities • mentions (references) are selected according to probability distribution P(mj|ei) • i.e. entity referenced by a mention is not dependent on other mentions • Model 2 (more expressive) • # of entities in doc and # of mentions follow uniform distribution • entities enter doc with a prior probability, independent of others • representative (canonical form) for each entity is selected according to P(rj|ei) • for each representative, mentions are selected by P(mk|rj) • i.e. entity referenced by a mention depends on other mentions in the same document

  24. Model 3 (least relaxation) • # of entities based on uniform distribution – but not independent of each other • entities in doc viewed as nodes in a weighted directed graph with edges labeled as P(ej|ei) • entities inserted in document via a random walk starting at an entity with prior probability P(ek) • representatives and mentions follow the same probabilities as Model 2 • i.e. entity referenced by a mention depends on other mentions in same document, but also on other entities in entire corpus • Models learned using truncated EM algorithm • Evaluation • 300 NYT articles from TREC corpus • 8000 mentions corresponding to 2000 entities (people, locations, organizations) • compared to SOFT-TF-IDF and baseline (entities with identical writing are same) • overall F-Measure = 89% (model 2) • baseline = 70.7% and SOFT-TF-IDF = 79.8% • Model 3 does not perform best because • global dependencies enforce restrictions over groupings of similar mentions • because of limited document set, estimating global dependency is inaccurate

  25. Using IE Features • 3 different methods published • Mann and Yarowsky, 2003 • use unsupervised learning to learn patterns from corpus that capture biographical features • birth day, birth year, birth place and occupation • use bottom-up centroid agglomerative clustering for disambiguation • vectors for each document are generated by using the following • all words (plain) or proper nouns (nnp) • most relevant words (mi and tf-idf) • basic biographical features (feat) • extended biographical features (extfeat)

  26. Corpus, Evaluation, and Results • Mann and Yarowsky • Pseudoname corpus • query Google with names of 8 people • take 28 possible pairs and replace with different pseudonames • Naturally occurring corpus • query for 4 naturally occurring polysemous names • example: Jim Clark • 60 articles for each name • 3-way classification (top 2 occurring people + “others”) • Disambiguating accuracy for Pseudonames • 86.4% with nnp+feat+tf-idf • For naturally occurring corpus • using mutual information 88% Precision and 73% Recall

  27. Niu, Li, and Srihari, 2004 - use 3 different categories of contextual features • set of 50 words centered around name (or alias) • other entities occurring in 50-word context of name (or alias) • automatically extracted relationships (25 possible) • birth day, age, affiliation, title, address, degree, etc. • features combined using Maximum Entropy Model • Evaluation using B-CUBED algorithm • 4 sets of 4 famous names mixed together using pseudonames • 88% F-Measure achieved • 2 naturally occurring sets • Peter Sutherland – 96% F-Measure • John Smith – 85% F-Measure

  28. Dozier and Zielund, 2004 • CDC for people in legal domain • attorneys, judges, and expert witnesses • Combine IE techniques with record linkage techniques • biographical records for attorneys and judges created manually from Westlaw Legal Directory • biographical record for expert witnesses created through text mining • IE techniques extract templates associated with each type from document • record linkage part uses Bayesian network to match templates with biographical records • Evaluation • for docs with stereotypical syntax and full names – 98% precision and 95% recall • Otherwise, 95% precision and 60% recall

  29. Baseline • Guha and Garg, 2004 • established baseline when full docs were compared using TF-IDF without considering context for 26 entities (names and places) • 2-6 instances of each entity considered • for each instance, top 10 results evaluated • 22.5% accuracy overall

  30. Types of CDC • Named Entities • described earlier • Terms or Concept • Kazi and Ravin, 2000 • Events • Bagga and Baldwin, 1999 • Cross-Media and/or Multimedia Coreference • Between text and pictures for names (Bagga and Hu, unpublished) • Between text and video for names (Satoh and Kanade, 1997) • Between video streams (using image and text) for events (Bagga, Hu, and Zhong, 2002) • Cross-Language, Cross-Document Coreference • parallel corpus (Harabagiu and Maiorano, 2000) • non-parallel corpus – open problem, although manual results encouraging (Bagga and Baldwin, unpublished)

  31. Term or Concept CDC • Single or multi-word terms refer to concepts occurring in domain • Multi-word terms • identified by Terminator (rule-based) • form subset of noun phrases in document • discard those that occur only once in document • for example: price rose where rose is mistakenly identified as noun • discard those that are found only as proper sub-strings • for example: dimension space (part of lower dimension space) • are seldom ambiguous and are merged across documents

  32. Single Word Terms • Capitalized single words are most common sources of ambiguity • for example: Wired – name of magazine and an adjective that is first word in sentence • Within-doc categorization of single words • If capitalized word occurs in lowercase in document – consider as regular word • If capitalized word appears as capitalized in middle of sentence – consider as name • If no lowercase occurrences and word appears at beginning of sentence or in title/header - consider as term • All other single words not identified as part of name or multi-word terms – consider as lower-case term
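
The within-document rules above can be read as a small decision procedure. The sketch below assumes the document is already split into sentences of tokens, that the caller passes the capitalized surface form, and that titles and headers are treated like sentence-initial positions. The function name and the returned labels are hypothetical.

```python
def categorize_capitalized(word, sentences):
    """Classify a capitalized single word within one document.

    word: the capitalized surface form, e.g. "Wired".
    sentences: list of token lists for the document.
    """
    lower = word.lower()
    # Rule 1: the word also occurs in lowercase, so treat it as a regular word.
    if any(tok == lower for sent in sentences for tok in sent):
        return "regular word"
    # Rule 2: it appears capitalized in the middle of a sentence, so it is a name.
    if any(tok == word for sent in sentences for tok in sent[1:]):
        return "name"
    # Rule 3: no lowercase occurrence and only sentence-initial uses.
    return "term"
```

The order of the checks matters: a lowercase occurrence anywhere in the document overrides a capitalized mid-sentence occurrence in this reading of the slide.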

  33. Disambiguating Single Words Across Documents • Unambiguous cases – no merging • Ambiguous cases – merge if only name or only lower-case term found in corpus

  34. Single occurrences of single capitalized terms can be merged with occurrences of corresponding names if names occur more than once in at least one document • No evaluation was performed

  35. Event CDC • Bagga and Baldwin, 1999 • similar approach to entity-based CDC • Two events are coreferent iff the players, time, and location are the same • Event CDC system extracts as “summaries” sentences which contain: • main event verb (for example: resign) • nominalization of main verb (for example: resignation) • synonyms (for example: quit) • Summaries are clustered using single-link clustering and VSM similarity
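
Extracting an event "summary" can be sketched as a simple sentence filter. The example assumes the caller supplies the full set of trigger word forms (main verb with inflections, nominalization, synonyms); how the original system obtained those forms is not stated on the slide, and the function name is hypothetical.

```python
import re


def event_summary(sentences, trigger_forms):
    """Keep the sentences that contain any trigger form as a whole word."""
    forms = {f.lower() for f in trigger_forms}
    return [
        s for s in sentences
        if forms & set(re.findall(r"[a-z]+", s.lower()))
    ]
```

Matching whole tokens rather than prefixes avoids picking up unrelated words such as "quite" for the trigger "quit".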

  36. Evaluation and Results • Articles chosen for 3 events: resignations, elections, and espionage • 2 years of New York Times data • B-CUBED algorithm used for scoring

  37. Analysis • Events are harder than entities: • no within-document coreference • no explicit references • are at times spread over the entire document • Analysis of Elections event • elections are temporal in nature • disambiguating phrases largely use temporal references (for example – upcoming fall elections, elections last year, next elections, etc) • exposes weakness of using a bag of words approach • presence of sub-events • US General election consists of both Presidential elections and Congressional elections • “players” are the same due to high rate of incumbency • descriptions of events are very similar • issues in every election are similar (inflation, unemployment, economy)

  38. Cross-Media Coreference – Between Text and Video (Names) • Satoh and Kanade, 1997 • Association of face and name in video • given unknown face, infer name or, • given name, guess faces which are likely to have that name • Use closed caption transcripts and video images for correlation

  39. Face extraction: neural-network based face detector to locate faces in images • Name candidate extraction: use Oxford Text Archive dictionary (approx. 70k words) • Word is considered to be a proper noun if • annotated as one in dictionary • not found in dictionary • Face similarity: eigenvector based method to compute distance between two faces • Face and name co-occurrence: use co-occurrence factor • captures how well name and face co-occur in time
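
The name candidate rule is simple enough to sketch directly. The dictionary here is a toy stand-in, a dict from lowercase word to a set of part-of-speech tags, and the tag name `proper_noun` is an assumption; the actual Oxford Text Archive format is not described on the slide.

```python
import re


def is_name_candidate(word, dictionary):
    """A word is a name candidate if the dictionary marks it as a proper
    noun, or if the dictionary does not contain it at all."""
    tags = dictionary.get(word.lower())
    return tags is None or "proper_noun" in tags


def name_candidates(caption, dictionary):
    """Return the words of a closed-caption line that pass the rule above."""
    words = re.findall(r"[A-Za-z]+", caption)
    return [w for w in words if is_name_candidate(w, dictionary)]
```

The "not found in dictionary" branch also lets through any ordinary word the dictionary happens to lack, so coverage of the dictionary directly affects the quality of the candidates.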

  40. Corpus, Evaluation, and Results • No large scale evaluation done • Problem with technique: false positives • especially for famous people • Clinton mentioned by news anchor repeatedly • name gets associated with news anchor

  41. Between Text and Pictures (Names) • Bagga and Hu, unpublished (2004) • Algorithm • Use text and image based features to identify coreference • Tested on web pages • Text narrowed by extracting sentences containing name variants of entity • Image features computed by analyzing distribution of colors in L*a*b perceptual color space • Across URLs, first compute text similarity (VSM) and image similarity (L*a*b) and then combine

  42. Preliminary Results • Portraits of Captain John Smith • Maps related to Captain John Smith’s explorations • Captain John Smith as portrayed in the movie Pocahontas

  43. Cross-Media Coreference • Goal: identify and track “important” news events in broadcast news video • Observations: • “important” stories of the day are repeated within/across stations • common footage scenes can be used as representative clips for these stories

  44. Structure of Broadcast News • [Figure: diagram of a news program with the labels News; Story 1, Story 2, Commercial; Segment 1, Story seg. 1-3; Scene 1-7; and images, sound, and Closed Caption tracks]

  45. Methodology • For each video source, use closed caption text: • to identify segment boundaries (>> signs indicate speaker change) • identify and eliminate commercial segments (based upon text-tiling method) • cluster story segments into stories • Use complete link, hierarchical clustering to identify overlapping stories between programs • identify common footage scenes between each pair of overlapping stories
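
Two of the steps above, splitting closed captions at `>>` speaker-change markers and complete-link clustering of the resulting segments, can be sketched as follows. The Jaccard word-overlap similarity and the threshold are stand-ins chosen for the example; the slide does not specify the text similarity measure, and commercial removal is not modeled.

```python
def split_segments(caption_text):
    """Split closed-caption text at '>>' speaker-change markers."""
    return [seg.strip() for seg in caption_text.split(">>") if seg.strip()]


def jaccard(a, b):
    """Word-overlap similarity between two segments (assumed measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def complete_link_clusters(items, similarity, threshold):
    """Agglomerative clustering with complete linkage: the similarity of two
    clusters is that of their LEAST similar pair of members, and the best
    pair of clusters is merged while that value stays above the threshold."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best, best_pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(
                    similarity(items[i], items[j])
                    for i in clusters[a] for j in clusters[b]
                )
                if link > best:
                    best, best_pair = link, (a, b)
        if best_pair is None:
            break
        a, b = best_pair
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)
```

Unlike single-link clustering, complete linkage requires every member of a merged cluster to be similar to every other member, which tends to keep unrelated stories from chaining together.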

  46. Common Footage Detection • [Figure: flow diagram with the labels Scenes from video source 1 (key frames, text), Scenes from video source 2 (key frames, text), Visual similarity, Text similarity, Combined-Media clustering, Overlapping Story, Common Footages]

  47. Examples – Found by System • Flood rescue -> rescue school bus • US submarine -> US submarine incident • Topic: US/Iraq -> US bombing of Iraq • News conference on Iraqi bombing • [Key frames shown: CBS 3873, NBC 3885, NBC 5061, CBS 2829, NBC 20805, CBS 38805, CBS 13833, NBC 16317, CBS 4125, NBC 7377, CBS 4257]

  48. More Examples • Night at Baghdad -> night bombing at Iraq • Iraqi map • UN cars -> UN inspectors leaving Iraq • Found by algorithm, but missed by human subjects • Same stories and similar key-frame images, but not really identical footage • [Key frames shown: CBS 2253, NBC 4173, CBS 2001, NBC 3177, CBS 5193, NBC 30021]

  49. Missed by system • US submarine incident: missed because of a weak text link and an image intensity change • False positive: Death of Dale Earnhardt • [Key frames shown: CBS 13305, NBC 16977, CBS 501]
