This paper presents a method for identifying both direct and indirect sources of opinions using Conditional Random Fields and extraction patterns, useful for opinion question answering and summarization. The approach frames source identification as semantic tagging with a CRF, using capitalization, part-of-speech, opinion-lexicon, and extraction-pattern features. Baselines and an error analysis are discussed, along with future directions: a hybrid approach covering direct and indirect sources, crossing sentence boundaries, handling coreference, and expanding the coverage of the opinion lexicon.
Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns Yejin Choi and Claire Cardie (Cornell University) Ellen Riloff and Siddharth Patwardhan (University of Utah) EMNLP 2005
Introduction • Opinion source identification is especially useful in opinion QA and summarization. Ex. "How does X feel about Y?" • The goal is to identify both direct and indirect sources. Ex. "International officers said US officials want the EU to prevail."
Source Identification • View it as an information extraction task. • Tackle it using sequence tagging and pattern matching techniques. • Evaluate using NRRC corpus (Wiebe et al., 2005)
Big Picture • This task goes beyond NER: role relationships. • Learning-based methods: graphical models and extraction pattern learning. • Hybrid method: CRF for NER and AutoSlog for information extraction.
Semantic Tagging via CRF • A sequence of tokens x=x1x2…xn • A sequence of labels y=y1y2…yn • Label values ‘S’, ‘T’, ‘-’ • ‘S’: the first token of a source • ‘T’: a non-initial token of a source • ‘-’: tokens not part of any source
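The S/T/- scheme above can be sketched as a small encoding function. This is an illustrative sketch, not the authors' code; the token spans chosen for the example sentence are an assumption for demonstration.

```python
def encode_labels(tokens, source_spans):
    """Map each token to 'S' (first token of a source), 'T' (non-initial
    token of a source), or '-' (not part of any source).
    source_spans: (start, end) token-index pairs, end exclusive."""
    labels = ['-'] * len(tokens)
    for start, end in source_spans:
        labels[start] = 'S'
        for i in range(start + 1, end):
            labels[i] = 'T'
    return labels

# Example sentence from the slides; assuming the annotated sources are
# "International officers" and "US officials".
tokens = "International officers said US officials want the EU to prevail .".split()
labels = encode_labels(tokens, [(0, 2), (3, 5)])
```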
CRF Algorithm (1) • G=(V,E), V: the set of random variables Y={Yi | 1 <= i <= n}, n tokens in an input sentence. • E={(Yi-1, Yi) | 1 < i <= n} is the set of n-1 edges forming a linear chain. • For each edge: feature functions f_k(y_{i-1}, y_i, x) with weights λ_k • For each node: feature functions g_k(y_i, x) with weights μ_k
CRF Algorithm (2) • The conditional probability of a sequence of labels y given a sequence of tokens x is: P(y | x) = (1/Z_x) exp( Σ_i Σ_k λ_k f_k(y_{i-1}, y_i, x) + Σ_i Σ_k μ_k g_k(y_i, x) ), where Z_x is a normalization factor over all label sequences.
CRF Algorithm (3) • Training: maximize the conditional log-likelihood of the labeled training data, Σ_j log P(y^(j) | x^(j)). • Given a sentence x in the test data, the tagging sequence y* is given by y* = argmax_y P(y | x), computed with the Viterbi algorithm.
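The argmax over label sequences for a linear-chain model can be found with Viterbi dynamic programming. A minimal sketch, assuming node and edge log-potentials have already been computed from the weighted feature functions (the scores below are illustrative, not learned):

```python
import numpy as np

def viterbi(node_scores, edge_scores):
    """Find the highest-scoring label sequence for a linear chain.
    node_scores: (n, L) array of per-token, per-label log-potentials.
    edge_scores: (L, L) array of label-transition log-potentials.
    Returns a list of n label indices (the argmax_y)."""
    n, L = node_scores.shape
    dp = np.zeros((n, L))          # best score ending in each label
    back = np.zeros((n, L), dtype=int)  # backpointers
    dp[0] = node_scores[0]
    for i in range(1, n):
        # score of every (previous label -> current label) extension
        scores = dp[i - 1][:, None] + edge_scores + node_scores[i][None, :]
        back[i] = scores.argmax(axis=0)
        dp[i] = scores.max(axis=0)
    # trace back the best path from the last position
    path = [int(dp[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]
```

With uniform transitions, the decoder simply picks the best label per token; a learned transition matrix would, e.g., penalize 'T' following '-'.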
Features • The sources of opinions are mostly noun phrases. • The source phrases should be semantic entities that can bear or express opinions. • The source phrases should be directly related to an opinion expression.
Features (1) • Capitalization features: all-capital, initial-capital • POS features: in a [-2,+2] window • Opinion lexicon feature: in a [-1,+1] window; add subclass info such as "moderately subjective"
Features (2) • Dependency tree features: Syntactic chunking and Opinion word propagation. • Semantic class features (Sundance shallow parser)
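The per-token features on the two slides above can be sketched as a feature-extraction function. A simplified illustration, assuming pre-tokenized input with POS tags and a toy opinion lexicon (the dependency-tree and semantic-class features are omitted):

```python
def token_features(tokens, pos_tags, lexicon, i):
    """Feature dict for token i: capitalization, POS tags in a
    [-2,+2] window, and opinion-lexicon membership in a [-1,+1]
    window. lexicon is a set of subjective words (a stand-in for
    the paper's opinion lexicon)."""
    feats = {
        'word': tokens[i].lower(),
        'all_caps': tokens[i].isupper(),
        'init_cap': tokens[i][:1].isupper(),
    }
    for off in range(-2, 3):           # POS in a [-2,+2] window
        j = i + off
        if 0 <= j < len(tokens):
            feats[f'pos[{off}]'] = pos_tags[j]
    for off in (-1, 0, 1):             # lexicon hits in a [-1,+1] window
        j = i + off
        if 0 <= j < len(tokens):
            feats[f'lex[{off}]'] = tokens[j].lower() in lexicon
    return feats
```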
Extraction Pattern Features • SourcePatt: whether a word activates any pattern (ex. "complained") • SourceExtr: whether a word is extracted by any pattern (ex. "Jacques" in "Jacques complained …") • Frequency and probability variants added (4 extra features)
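The SourcePatt/SourceExtr distinction can be illustrated with a toy pattern representation. This is a hypothetical simplification: real AutoSlog patterns operate over syntactic roles, not fixed token offsets.

```python
def pattern_features(tokens, patterns):
    """Compute per-token SourcePatt / SourceExtr booleans.
    patterns: list of (trigger, offset) pairs, meaning the trigger
    word activates a pattern that extracts the token at
    trigger_index + offset as a source (toy stand-in for
    AutoSlog-style extraction patterns)."""
    source_patt = [False] * len(tokens)
    source_extr = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        for trigger, offset in patterns:
            if tok.lower() == trigger:
                source_patt[i] = True      # this word activates a pattern
                j = i + offset
                if 0 <= j < len(tokens):
                    source_extr[j] = True  # this word is extracted by it
    return source_patt, source_extr
```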
Measures • Overlap match (OL): lenient • Head match (HM): conservative • Exact match (EM): strict • Precision, Recall, F-measure
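The lenient-vs-strict distinction above comes down to the span-matching predicate used when counting true positives. A small sketch, assuming spans are (start, end) token intervals with exclusive ends:

```python
def prf(pred_spans, gold_spans, match):
    """Precision, recall, F1 given a span-matching predicate.
    A predicted span counts as correct if it matches any gold span,
    and a gold span counts as found if any prediction matches it."""
    tp_pred = sum(any(match(p, g) for g in gold_spans) for p in pred_spans)
    tp_gold = sum(any(match(p, g) for p in pred_spans) for g in gold_spans)
    p = tp_pred / len(pred_spans) if pred_spans else 0.0
    r = tp_gold / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

exact = lambda a, b: a == b                         # strict (EM)
overlap = lambda a, b: a[0] < b[1] and b[0] < a[1]  # lenient (OL)
```

The same prediction set can thus score 0 under exact match but well above 0 under overlap match.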
Baselines • Baseline1: label all phrases that belong to the semantic categories • Baseline2: Noun phrases + • NP is the subject of a verb phrase containing an opinion word • NP contains a possessive and is preceded by an opinion word • NP follows “according to” • NP follows “by” and attaches to an opinion word • Baseline3: NP + Baseline1 + Baseline2
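One of the Baseline2 heuristics can be sketched directly: selecting noun phrases that follow "according to". An illustrative sketch, assuming NP spans are supplied by a chunker (the other rules would need an opinion lexicon and parse information):

```python
def np_after_according_to(tokens, np_spans):
    """Select noun-phrase spans immediately preceded by the bigram
    'according to' (one of the Baseline2 heuristic rules).
    np_spans: (start, end) token-index pairs, end exclusive."""
    picked = []
    for start, end in np_spans:
        prev = [t.lower() for t in tokens[start - 2:start]]
        if start >= 2 and prev == ['according', 'to']:
            picked.append((start, end))
    return picked
```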
Error Analysis • Sentence boundary detection (GATE) and parsing errors • Complex and unusual sentence structure • Limited coverage of the opinion lexicon (slang, idioms)
Conclusion and Future Work • A hybrid approach • Make AutoSlog fully automatic • Handle both direct and indirect sources • Cross sentence boundaries • Coreference resolution • The strength of the opinion expression may be useful