Semi-supervised Relation Extraction with Large-scale Word Clustering

Presentation Transcript


  1. Semi-supervised Relation Extraction with Large-scale Word Clustering. Ang Sun, Ralph Grishman, Satoshi Sekine. New York University, June 20, 2011

  2. Outline • Task • Problems • Solutions and Experiments • Conclusion

  3. 1. Task • Relation Extraction • Example: "The last U.S. president to visit …", with entity mentions M1 and M2 marked (M := entity mention) • Is there a relation between M1 and M2? If so, what kind of relation?

  4. 1. Task • Relation Types (ACE 2004)

  5. 2. Problems • Sparsity of lexical features; word cluster features to the rescue • Training instances: US president, US senator, Arkansas governor, Israeli government spokesman, … → training features: HeadOfM2 = president, HeadOfM2 = spokesman, … • Testing instances: US ambassador, U.N. spokeswoman, … → testing features: HM2 = ambassador, HM2 = spokeswoman, … • A single word cluster C1 = {president, ambassador, spokesman, spokeswoman} lets training and testing instances share one cluster feature: WC_HM2 = C1 (see the sketch below)
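
To make the rescue concrete, here is a minimal Python sketch of head-word features augmented with a Brown cluster bit-string prefix. The HM2 / WC_HM2 feature names follow the slide; the bit strings themselves are hypothetical.

```python
# Hypothetical Brown-cluster bit strings; real ones come from running
# the Brown algorithm over a large corpus (TDT5 in this talk).
BROWN_CLUSTERS = {
    "president":   "11010010",
    "spokesman":   "11010011",
    "ambassador":  "11010100",
    "spokeswoman": "11010101",
}

def head_features(head_of_m2, prefix_len=4):
    """Emit the sparse lexical feature plus a cluster-prefix feature."""
    feats = [f"HM2={head_of_m2}"]
    bits = BROWN_CLUSTERS.get(head_of_m2)
    if bits is not None:
        feats.append(f"WC_HM2={bits[:prefix_len]}")
    return feats

# "president" was seen in training, "ambassador" only at test time,
# yet both fire the same cluster feature, so the model can generalize.
print(head_features("president"))   # ['HM2=president', 'WC_HM2=1101']
print(head_features("ambassador"))  # ['HM2=ambassador', 'WC_HM2=1101']
```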

  6. 2. Problems • Problem 1: How to choose effective clusters? • The Brown word hierarchy: where to cut? (see the sketch below)
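
Each word in a Brown hierarchy carries a bit string encoding its path from the root of the merge tree, so "cutting" at prefix length n merges all words sharing their first n bits; shorter prefixes give coarser clusters. A tiny sketch (the bit string is illustrative):

```python
def cluster_prefixes(bits, cut_lengths=(4, 6, 10)):
    """Cluster features at several cuts of the hierarchy; a shorter
    prefix corresponds to a coarser cluster."""
    return [bits[:n] for n in cut_lengths if n <= len(bits)]

print(cluster_prefixes("1101001011"))  # ['1101', '110100', '1101001011']
```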

  7. 2. Problems • Problem 2: Which lexical features should be augmented to improve generalization accuracy? • Named entity recognition augments every token with its cluster; should relation extraction do the same? • A relation instance has the form: LeftContext M1 MidContext M2 RightContext. Where to generalize?

  8. 3. Solutions and Experiments: 3.1 Cluster Selection • Main idea (sketched below) • Rank each prefix length (from 1 to the length of the longest bit string) using importance measures • Select a subset of lengths at which to cut the word hierarchy • Typically select 3 or 4 prefix lengths, to avoid committing to a single cluster granularity
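
A minimal sketch of that selection loop, with `importance` standing in for either of the measures defined on the next two slides (the toy scoring function is hypothetical):

```python
def select_cut_lengths(max_len, importance, k=4):
    """Rank prefix lengths 1..max_len by an importance measure and
    keep the k best cuts (k = 3 or 4 in the talk)."""
    ranked = sorted(range(1, max_len + 1), key=importance, reverse=True)
    return sorted(ranked[:k])

# Toy importance function (hypothetical) that favours mid-length prefixes.
print(select_cut_lengths(12, importance=lambda n: -abs(n - 6)))  # [4, 5, 6, 7]
```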

  9. 3.1 Cluster Selection • Importance measure 1: Information Gain (IG): IG(f_i) = H(C) - Σ_{v ∈ V(f_i)} P(v) · H(C | v), where H(C) is the prior entropy of the relation classes C, f_i is the cluster feature with prefix length i being ranked, V(f_i) is the set of values of the cluster feature, and H(C | v) is the posterior entropy of the classes given a value v of the feature f_i
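
Assuming the reconstruction of the formula above, IG for one prefix length can be computed from (feature value, relation class) pairs as in this sketch:

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(instances):
    """IG(f_i) = H(C) - sum_v P(v) * H(C | v), over (value, class) pairs."""
    prior = entropy(Counter(cls for _, cls in instances).values())
    by_value = defaultdict(Counter)
    for value, cls in instances:
        by_value[value][cls] += 1
    n = len(instances)
    posterior = sum(sum(ctr.values()) / n * entropy(ctr.values())
                    for ctr in by_value.values())
    return prior - posterior

# Toy data: a feature value that cleanly separates two classes has high IG.
print(information_gain([("1101", "EMP-ORG"), ("1101", "EMP-ORG"),
                        ("0011", "PHYS"), ("0011", "PHYS")]))  # 1.0
```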

  10. 3.1 Cluster Selection • Importance measure 2: Prefix Coverage (PC): PC(f_i) = Count(f_i ≠ null) / Count(f_l), where i is a prefix length, f_l is a lexical feature, f_i is the non-null cluster feature (length-i prefix) for that lexical feature, and Count(·) is the number of occurrences in the training data
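
Under that reading of the formula, PC is the fraction of a lexical feature's occurrences whose word has a cluster bit string of at least length i (so the length-i cluster feature is non-null). A sketch with hypothetical cluster assignments:

```python
def prefix_coverage(occurrences, clusters, i):
    """PC(f_i) = Count(f_i != null) / Count(f_l): the share of a lexical
    feature's occurrences whose word has a bit string of length >= i."""
    if not occurrences:
        return 0.0
    covered = sum(1 for word in occurrences
                  if len(clusters.get(word, "")) >= i)
    return covered / len(occurrences)

clusters = {"president": "110100", "spokesman": "1101"}  # hypothetical
heads = ["president", "president", "spokesman", "chairperson"]
print(prefix_coverage(heads, clusters, i=6))  # 0.5: only "president" covered
```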

  11. 3.1 Cluster Selection • Other measures to compare with • Use All Prefixes (UA): use every length, hoping that the underlying learning algorithm can assign proper weights • Exhaustive Search (ES): try every possible subset of lengths and pick the one that works best (sketched below)
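
The ES baseline amounts to a brute-force loop over subsets; in this sketch, `dev_score` is a stand-in for training the classifier with the chosen cuts and scoring it on the development set:

```python
from itertools import chain, combinations

def exhaustive_search(max_len, dev_score):
    """Try every non-empty subset of prefix lengths, keep the best."""
    lengths = range(1, max_len + 1)
    subsets = chain.from_iterable(
        combinations(lengths, k) for k in range(1, max_len + 1))
    return max(subsets, key=dev_score)

# Toy scorer (hypothetical): pretend the dev set prefers the cuts {4, 6}.
print(exhaustive_search(8, dev_score=lambda s: 1.0 if s == (4, 6) else 0.0))
# -> (4, 6)
```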

  12. 3.1 Cluster Selection • Experiment • Setup • 348 ACE 2004 bnews and nwire documents • 70 held out for testing; the remaining 278 are split into training and development sets at a 7:3 ratio • The development set is used to learn the best lengths • Choose only 3 or 4 lengths (to match prior work) • For simplicity, only the head of each mention is augmented with clusters • Induced 1,000 word clusters on the TDT5 corpus using the Brown algorithm • Baseline • Feature-based MaxEnt classification model • A large feature set: the full set from Zhou et al. (2005), plus cherry-picked effective features from Zhao and Grishman (2005), Jiang and Zhai (2007), and others

  13. 3.1 Cluster Selection • Experiment • Effectiveness of cluster selection methods

  14. 3.2 Effectiveness of cluster features • Explore cluster features in a systematic way • Rank each lexical feature according to its importance, based on linguistic intuition and on performance contributions reported in previous research • Test the effectiveness of each lexical feature when augmented with word clusters, both individually and incrementally (see the sketch below)
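
A sketch of that individual-plus-incremental protocol; `evaluate` is a stand-in for training the MaxEnt model with the given cluster-augmented lexical features and returning development-set F1, and the feature names in the usage example are hypothetical:

```python
def explore_augmentations(ranked_features, evaluate):
    """Measure each cluster-augmented lexical feature's effect alone
    (individually) and on top of higher-ranked ones (incrementally)."""
    results, so_far = {}, []
    for feat in ranked_features:        # ranked by linguistic importance
        individual = evaluate([feat])
        so_far.append(feat)
        incremental = evaluate(list(so_far))
        results[feat] = (individual, incremental)
    return results

# Hypothetical feature names and a dummy evaluator standing in for F1.
scores = explore_augmentations(
    ["HeadOfM1", "HeadOfM2", "ContextHeads"],
    evaluate=lambda feats: 0.70 + 0.01 * len(feats))
print(scores)
```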

  15. 3.2 Effectiveness of cluster features • Importance of lexical features • Simplify an instance into a 3-tuple (M1, Context, M2), with the tokens of each element split into Head vs. Other

  16. 3.2 Effectiveness of cluster features • Experiment • Setup • 5-fold cross-validation • PC4 was used to select effective clusters • Performance

  17. 3.2 Effectiveness of cluster features • The impact of training size (augmenting mention heads only): word cluster features can sometimes allow a reduction in annotation

  18. 3.2 Effectiveness of cluster features • Performance of each individual relation class • The five highlighted types share the same entity type, GPE; PER-SOC holds only between PERSON and PERSON; word clusters may therefore also help to distinguish between ambiguous relation types • No improvement for the PHYS relation? It is just too hard!

  19. 4. Conclusion • Main contributions • Proposed a principled way of choosing clusters at an appropriate level of granularity • Systematically explored the effectiveness of word cluster features for relation extraction • Future work • Extend to phrase clustering (Lin and Wu, 2009) and pattern clustering (Sun and Grishman, 2010)

  20. Thanks!
