This paper presents a method for identifying both direct and indirect sources of opinions using Conditional Random Fields and extraction patterns, useful for opinion question answering and summarization. The approach frames source identification as semantic tagging with a CRF, using capitalization, part-of-speech, opinion-lexicon, and extraction-pattern features. Baselines and an error analysis are discussed, along with future directions: a hybrid approach covering direct and indirect sources, crossing sentence boundaries, handling coreference, and expanding the coverage of the opinion lexicon.
Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns Yejin Choi and Claire Cardie (Cornell University) Ellen Riloff and Siddharth Patwardhan (University of Utah) EMNLP 2005
Introduction • Opinion source identification is especially useful in opinion QA and summarization. Ex. "How does X feel about Y?" • The goal is to identify both direct and indirect sources. Ex. "International officers said US officials want the EU to prevail."
Source Identification • View it as an information extraction task. • Tackle it using sequence tagging and pattern matching techniques. • Evaluate using NRRC corpus (Wiebe et al., 2005)
Big Picture • This task goes beyond NER: role relationships. • Learning-based methods: graphical models and extraction pattern learning. • Hybrid method: CRF for NER and AutoSlog for information extraction.
Semantic Tagging via CRF • A sequence of tokens x=x1x2…xn • A sequence of labels y=y1y2…yn • Label values ‘S’, ‘T’, ‘-’ • ‘S’: the first token of a source • ‘T’: a non-initial token of a source • ‘-’: tokens not part of any source
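The S/T/- scheme above can be sketched as a small encoding function. This is an illustrative sketch, not the authors' code; the token spans chosen for the example sentence are an assumption for demonstration.

```python
def encode_labels(tokens, source_spans):
    """Map each token to 'S' (first token of a source), 'T' (non-initial
    token of a source), or '-' (not part of any source).
    source_spans: (start, end) token-index pairs, end exclusive."""
    labels = ['-'] * len(tokens)
    for start, end in source_spans:
        labels[start] = 'S'
        for i in range(start + 1, end):
            labels[i] = 'T'
    return labels

# Example sentence from the slides; assuming the annotated sources are
# "International officers" and "US officials".
tokens = "International officers said US officials want the EU to prevail .".split()
labels = encode_labels(tokens, [(0, 2), (3, 5)])
```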
CRF Algorithm (1) • G=(V,E), V: the set of random variables Y={Yi | 1 <= i <= n}, n tokens in an input sentence. • E={(Yi-1, Yi) | 1 < i <= n} is the set of n-1 edges forming a linear chain. • For each edge: feature functions f_k(y_{i-1}, y_i, x) with weights λ_k • For each node: feature functions g_k(y_i, x) with weights μ_k
CRF Algorithm (2) • The conditional probability of a sequence of labels y given a sequence of tokens x is: P(y | x) = (1/Z_x) exp( Σ_i Σ_k λ_k f_k(y_{i-1}, y_i, x) + Σ_i Σ_k μ_k g_k(y_i, x) ), where Z_x is a normalization factor over all label sequences.
CRF Algorithm (3) • Training: maximize the conditional log-likelihood of the labeled training data, Σ_j log P(y^(j) | x^(j)). • Given a sentence x in the test data, the tagging sequence y* is given by y* = argmax_y P(y | x), computed with the Viterbi algorithm.
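The argmax over label sequences for a linear-chain model can be found with Viterbi dynamic programming. A minimal sketch, assuming node and edge log-potentials have already been computed from the weighted feature functions (the scores below are illustrative, not learned):

```python
import numpy as np

def viterbi(node_scores, edge_scores):
    """Find the highest-scoring label sequence for a linear chain.
    node_scores: (n, L) array of per-token, per-label log-potentials.
    edge_scores: (L, L) array of label-transition log-potentials.
    Returns a list of n label indices (the argmax_y)."""
    n, L = node_scores.shape
    dp = np.zeros((n, L))          # best score ending in each label
    back = np.zeros((n, L), dtype=int)  # backpointers
    dp[0] = node_scores[0]
    for i in range(1, n):
        # score of every (previous label -> current label) extension
        scores = dp[i - 1][:, None] + edge_scores + node_scores[i][None, :]
        back[i] = scores.argmax(axis=0)
        dp[i] = scores.max(axis=0)
    # trace back the best path from the last position
    path = [int(dp[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]
```

With uniform transitions, the decoder simply picks the best label per token; a learned transition matrix would, e.g., penalize 'T' following '-'.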
Features • The sources of opinions are mostly noun phrases. • The source phrases should be semantic entities that can bear or express opinions. • The source phrases should be directly related to an opinion expression.
Features (1) • Capitalization features: all-capital, initial-capital • POS features: in a [-2,+2] window • Opinion lexicon feature: in a [-1,+1] window; add subclass info such as "moderately subjective"
Features (2) • Dependency tree features: Syntactic chunking and Opinion word propagation. • Semantic class features (Sundance shallow parser)
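The per-token features on the two slides above can be sketched as a feature-extraction function. A simplified illustration, assuming pre-tokenized input with POS tags and a toy opinion lexicon (the dependency-tree and semantic-class features are omitted):

```python
def token_features(tokens, pos_tags, lexicon, i):
    """Feature dict for token i: capitalization, POS tags in a
    [-2,+2] window, and opinion-lexicon membership in a [-1,+1]
    window. lexicon is a set of subjective words (a stand-in for
    the paper's opinion lexicon)."""
    feats = {
        'word': tokens[i].lower(),
        'all_caps': tokens[i].isupper(),
        'init_cap': tokens[i][:1].isupper(),
    }
    for off in range(-2, 3):           # POS in a [-2,+2] window
        j = i + off
        if 0 <= j < len(tokens):
            feats[f'pos[{off}]'] = pos_tags[j]
    for off in (-1, 0, 1):             # lexicon hits in a [-1,+1] window
        j = i + off
        if 0 <= j < len(tokens):
            feats[f'lex[{off}]'] = tokens[j].lower() in lexicon
    return feats
```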
Extraction Pattern Features • SourcePatt: whether a word activates any pattern (ex. "complained") • SourceExtr: whether a word is extracted by any pattern (ex. "Jacques" in "Jacques complained …") • Frequency and probability variants added (4 extra features)
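The SourcePatt/SourceExtr distinction can be illustrated with a toy pattern representation. This is a hypothetical simplification: real AutoSlog patterns operate over syntactic roles, not fixed token offsets.

```python
def pattern_features(tokens, patterns):
    """Compute per-token SourcePatt / SourceExtr booleans.
    patterns: list of (trigger, offset) pairs, meaning the trigger
    word activates a pattern that extracts the token at
    trigger_index + offset as a source (toy stand-in for
    AutoSlog-style extraction patterns)."""
    source_patt = [False] * len(tokens)
    source_extr = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        for trigger, offset in patterns:
            if tok.lower() == trigger:
                source_patt[i] = True      # this word activates a pattern
                j = i + offset
                if 0 <= j < len(tokens):
                    source_extr[j] = True  # this word is extracted by it
    return source_patt, source_extr
```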
Measures • Overlap match (OL): lenient • Head match (HM): conservative • Exact match (EM): strict • Precision, Recall, F-measure
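The lenient-vs-strict distinction above comes down to the span-matching predicate used when counting true positives. A small sketch, assuming spans are (start, end) token intervals with exclusive ends:

```python
def prf(pred_spans, gold_spans, match):
    """Precision, recall, F1 given a span-matching predicate.
    A predicted span counts as correct if it matches any gold span,
    and a gold span counts as found if any prediction matches it."""
    tp_pred = sum(any(match(p, g) for g in gold_spans) for p in pred_spans)
    tp_gold = sum(any(match(p, g) for p in pred_spans) for g in gold_spans)
    p = tp_pred / len(pred_spans) if pred_spans else 0.0
    r = tp_gold / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

exact = lambda a, b: a == b                         # strict (EM)
overlap = lambda a, b: a[0] < b[1] and b[0] < a[1]  # lenient (OL)
```

The same prediction set can thus score 0 under exact match but well above 0 under overlap match.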
Baselines • Baseline1: label all phrases that belong to the semantic categories • Baseline2: Noun phrases + • NP is the subject of a verb phrase containing an opinion word • NP contains a possessive and is preceded by an opinion word • NP follows “according to” • NP follows “by” and attaches to an opinion word • Baseline3: NP + Baseline1 + Baseline2
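One of the Baseline2 heuristics can be sketched directly: selecting noun phrases that follow "according to". An illustrative sketch, assuming NP spans are supplied by a chunker (the other rules would need an opinion lexicon and parse information):

```python
def np_after_according_to(tokens, np_spans):
    """Select noun-phrase spans immediately preceded by the bigram
    'according to' (one of the Baseline2 heuristic rules).
    np_spans: (start, end) token-index pairs, end exclusive."""
    picked = []
    for start, end in np_spans:
        prev = [t.lower() for t in tokens[start - 2:start]]
        if start >= 2 and prev == ['according', 'to']:
            picked.append((start, end))
    return picked
```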
Error Analysis • Sentence boundary detection (GATE) and parsing errors • Complex and unusual sentence structure • Limited coverage of the opinion lexicon (slang, idioms)
Conclusion and Future Work • A hybrid approach • Make AutoSlog fully automatic • Handle both direct and indirect sources • Cross sentence boundaries • Coreference resolution • The strength of the opinion expression may be useful