
Boosting-based parse re-ranking with subtree features

Presentation Transcript


  1. Boosting-based parse re-ranking with subtree features Taku Kudo Jun Suzuki Hideki Isozaki NTT Communication Science Labs.

  2. Discriminative methods for parsing • have shown remarkable performance compared to traditional generative models, e.g., PCFG • two approaches • re-ranking [Collins 00, Collins 02] • discriminative machine learning algorithms are used to rerank the n-best outputs of generative/conditional parsers • dynamic programming • max-margin parsing [Taskar 04]

  3. Reranking • Let x be an input sentence and y be a parse tree for x • Let G(x) be a function that returns the set of n-best results for x • The re-ranker gives a score to each candidate parse and selects the one with the highest score • (Figure: the n-best parses y1, y2, y3, … for x = "I buy cars with money", with example scores 0.2, 0.5, 0.1)

  4. Scoring with linear model • Score(x, y) = w · Φ(y) • Φ(y) is a feature function that maps the output y into a d-dimensional feature space • w is a parameter vector (weights) estimated from training data
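
The scoring and selection steps can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; `subtree_features` is a hypothetical placeholder for the feature map Φ(y) that would return the subtree features found in a candidate parse.

```python
from typing import Callable, Dict, Iterable, List

def score(weights: Dict[str, float], features: Iterable[str]) -> float:
    """Linear score w . Phi(y) for a sparse, binary subtree-feature set."""
    return sum(weights.get(f, 0.0) for f in features)

def rerank(nbest: List[object],
           weights: Dict[str, float],
           subtree_features: Callable[[object], Iterable[str]]) -> object:
    """Return the candidate parse in G(x) with the highest linear score."""
    return max(nbest, key=lambda y: score(weights, subtree_features(y)))
```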

  5. Two issues in linear model [1/2] • How to estimate the weights w? • try to minimize a loss over the given training data • the definition of the loss differs across methods: ME, SVMs, Boosting
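
For concreteness, the per-example losses usually associated with the three methods on this slide are the log loss (ME), the hinge loss (SVMs), and the exponential loss (Boosting), sketched below as functions of a margin value. The ranking formulation in the paper applies such losses to score differences between candidate parses; this sketch is only a generic illustration.

```python
import math

def log_loss(margin: float) -> float:
    """Logistic loss, as in maximum-entropy / log-linear models."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin: float) -> float:
    """Hinge loss, as in SVMs."""
    return max(0.0, 1.0 - margin)

def exp_loss(margin: float) -> float:
    """Exponential loss, as in boosting."""
    return math.exp(-margin)
```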

  6. Two issues in linear model [2/2] • How to define the feature set Φ? • use all subtrees • Pros: a natural extension of CFG rules; can capture long-range contextual information • Cons: naïve enumeration yields a huge number of features and prohibitive complexity

  7. A question about all subtrees • Do we always need all subtrees? • only a small set of subtrees is informative • most subtrees are redundant • Goal: automatic feature selection from the set of all subtrees • enables fast parsing • gives a good interpretation of the selected subtrees • Boosting meets our demand!

  8. Why Boosting? • Different regularization strategies for the weights w: • L1 (Boosting) • better when most of the given features are irrelevant • can remove redundant features • L2 (SVMs) • better when most of the given features are relevant • uses all the given features to some extent • Boosting meets our demand, because most subtrees are irrelevant or redundant
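
The sparsity argument can be made concrete with the closed-form one-dimensional updates implied by each regularizer. This is a generic illustration of L1 vs. L2 behavior, not part of the paper.

```python
def l1_prox(w: float, lam: float) -> float:
    """Soft-thresholding: minimizer of 0.5*(v - w)**2 + lam*abs(v).
    Any weight with |w| <= lam is set exactly to zero, which is why an
    L1 penalty produces sparse feature sets."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_shrink(w: float, lam: float) -> float:
    """Minimizer of 0.5*(v - w)**2 + 0.5*lam*v**2.
    Weights are merely scaled toward zero, so almost none vanish exactly."""
    return w / (1.0 + lam)
```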

  9. RankBoost [Freund 03] • Update feature k of the current weights with an increment δ to obtain the next weights: w ← w + δ·e_k • Select the optimal pair <k, δ> that minimizes the loss
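
A minimal sketch of one weak-learner selection step. It uses the standard closed-form solution for binary features under the exponential ranking loss (the pairwise feature difference lies in {-1, 0, +1}); the data structures and the exhaustive loop over candidate features are illustrative assumptions, since the paper finds the best subtree by branch-and-bound rather than enumeration (described later in the deck).

```python
import math
from typing import List, Set, Tuple

# A ranking pair holds the subtree-feature sets of a better and a worse
# candidate for the same sentence, plus the current boosting weight D(p).
Pair = Tuple[Set[str], Set[str], float]

def best_update(pairs: List[Pair], candidate_features: Set[str],
                eps: float = 1e-6) -> Tuple[str, float]:
    """Select the pair <k, delta> minimizing the (smoothed) exponential ranking
    loss Z(k, delta) = W0 + W+ * exp(-delta) + W- * exp(delta)."""
    best_k, best_delta, best_z = None, 0.0, float("inf")
    for k in candidate_features:
        w_plus = w_minus = w_zero = 0.0
        for better, worse, d in pairs:
            diff = int(k in better) - int(k in worse)   # in {-1, 0, +1}
            if diff > 0:
                w_plus += d
            elif diff < 0:
                w_minus += d
            else:
                w_zero += d
        delta = 0.5 * math.log((w_plus + eps) / (w_minus + eps))        # arg min of Z
        z = w_zero + 2.0 * math.sqrt((w_plus + eps) * (w_minus + eps))  # min of Z
        if z < best_z:
            best_k, best_delta, best_z = k, delta, z
    return best_k, best_delta
```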

  10. How to find the optimal subtree? • The set of all subtrees is huge • We need to find the optimal subtree efficiently • Our solution: a variant of branch-and-bound • define a search space in which the whole set of subtrees is represented • find the optimal subtree by traversing this search space • prune the search space with a proposed criterion

  11. Ad-hoc techniques • Size constraint • use only subtrees whose size is less than s (s = 6–8) • Frequency constraint • use only subtrees that occur at least f times in the training data (f = 2–5) • Pseudo iterations • after every 5 or 10 boosting iterations, we perform 100 to 300 pseudo iterations, in which the optimal subtree is selected from a cache that maintains the features explored in the previous iterations
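
The size and frequency constraints are simple filters. The sketch below assumes subtree patterns are encoded as tuples of (depth, label) pairs, so the pattern length equals its number of nodes; this encoding is an assumption chosen to match the rightmost-extension sketch later in the deck.

```python
from collections import Counter
from typing import List, Set, Tuple

# A subtree pattern is encoded as a tuple of (depth, label) pairs in preorder,
# so len(pattern) is its number of nodes.
Pattern = Tuple[Tuple[int, str], ...]

def apply_constraints(per_tree_patterns: List[Set[Pattern]],
                      max_size: int = 8, min_freq: int = 2) -> Set[Pattern]:
    """Keep only patterns that are small enough and occur in at least
    min_freq training trees (thresholds roughly follow the slide's ranges)."""
    counts = Counter(p for patterns in per_tree_patterns for p in patterns)
    return {p for p, c in counts.items() if c >= min_freq and len(p) <= max_size}
```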

  12. Relation to previous work • Boosting vs. kernel methods [Collins 00] • Boosting vs. Data-Oriented Parsing (DOP) [Bod 98]

  13. Kernels [Collins 00] • Kernel methods reduce the problem to its dual form, which depends only on dot products between pairs of instances (parse trees) • Pros • no need to provide an explicit feature vector • dynamic programming is used to compute the dot product between two trees, which is very efficient • Cons • requires a large number of kernel evaluations at test time • parsing is slow • difficult to see which features are relevant
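
The dynamic program referred to here is presumably the all-subtrees convolution kernel of Collins and Duffy, which counts the subtrees two parse trees share. A compact sketch follows, with a minimal `Node` class assumed purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                            # non-terminal, POS tag, or word
    children: List["Node"] = field(default_factory=list)

def production(n: Node):
    return (n.label, tuple(c.label for c in n.children))

def all_nodes(t: Node) -> List[Node]:
    out, stack = [], [t]
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(n.children)
    return out

def tree_kernel(t1: Node, t2: Node, lam: float = 1.0) -> float:
    """K(T1, T2) = sum over node pairs of C(n1, n2), where C(n1, n2) counts
    (with decay lam) the common subtrees rooted at n1 and n2."""
    memo = {}

    def C(n1: Node, n2: Node) -> float:
        key = (id(n1), id(n2))
        if key not in memo:
            if production(n1) != production(n2) or not n1.children:
                memo[key] = 0.0           # different rules, or bare leaf nodes
            else:
                val = lam
                for c1, c2 in zip(n1.children, n2.children):
                    val *= 1.0 + C(c1, c2)
                memo[key] = val
        return memo[key]

    return sum(C(a, b) for a in all_nodes(t1) for b in all_nodes(t2))
```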

  14. DOP [Bod 98] • DOP is not based on re-ranking • DOP handles the all-subtrees representation explicitly, like our method • Pros • high accuracy • Cons • exact computation is NP-complete • does not always provide a sparse feature representation • very slow, since the number of subtrees DOP uses is huge

  15. Kernels vs DOP vs Boosting

  16. Experiments • WSJ parsing • Shallow parsing

  17. Experiments • WSJ parsing • standard data split: PTB sections 2-21 for training, section 23 for testing • Model 2 of [Collins 99] was used to obtain the n-best results • exactly the same setting as [Collins 00] (kernels) • Shallow parsing • CoNLL 2000 shared task • PTB sections 15-18 for training, section 20 for testing • a CRF-based parser [Sha 03] was used to obtain the n-best results

  18. Tree representations • WSJ parsing • lexicalized trees • each non-terminal has a special node labeled with its head word • Shallow parsing • right-branching trees in which adjacent phrases stand in a child/parent relation • special nodes mark the left/right phrase boundaries

  19. Results: WSJ parsing • LR/LP = labeled recall/precision; CBs is the average number of crossing brackets per sentence; 0 CBs and 2 CBs are the percentages of sentences with zero and with at most two crossing brackets, respectively • Comparable to other methods • Better than the kernel method, which uses the same all-subtrees representation with a different parameter estimation method

  20. Results: Shallow parsing • Fβ=1 is the harmonic mean of precision and recall • Comparable to other methods • Our method is comparable to Zhang's method even without extra linguistic features

  21. Advantages • Compact feature set • WSJ parsing: ~ 8,000 • Shallow parsing: ~ 3,000 • Kernels implicitly use a huge number of features • Parsing is very fast • WSJ parsing: 0.055 sec./sentence • Shallow parsing: 0.042 sec./sentence (n-best parsing time is NOT included)

  22. Advantages, cont’d • Sparse feature representations allow us to analyze which kinds of subtrees are relevant • (Figure: examples of positive and negative subtrees selected for WSJ parsing and for shallow parsing)

  23. Conclusions • All subtrees are potentially used as features • Boosting • L1 norm regularization performs automatic feature selection • Branch and bound • enables us to find the optimal subtrees efficiently • Advantages: • comparable accuracy to other parsing methods • fast parsing • good interpretability

  24. Efficient computation

  25. Rightmost extension [Asai 02, Zaki 02] • Extend a given tree of size (n-1) by adding a new node to obtain trees of size n • the new node is attached to a node on the rightmost path • the new node is added as the rightmost sibling • (Figure: a tree t and the trees obtained by attaching a new node along its rightmost path)

  26. Rightmost extension, cont. • Recursive application of rightmost extension creates a search space that covers all subtrees
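
A compact sketch of rightmost extension over labeled ordered trees. It uses the common preorder (depth, label) encoding, under which attaching a node along the rightmost path corresponds to appending one pair; the label set and size limit are placeholders.

```python
from typing import Iterator, List, Sequence, Tuple

# A tree pattern is encoded in preorder as a list of (depth, label) pairs,
# with the root at depth 0.  Appending a node at any depth from 1 to
# (depth of the last node) + 1 attaches it as the rightmost child of a node
# on the rightmost path, i.e., performs one rightmost extension.
Pattern = List[Tuple[int, str]]

def rightmost_extensions(pattern: Pattern,
                         labels: Sequence[str]) -> Iterator[Pattern]:
    last_depth = pattern[-1][0]
    for depth in range(1, last_depth + 2):
        for label in labels:
            yield pattern + [(depth, label)]

def enumerate_patterns(labels: Sequence[str], max_size: int) -> Iterator[Pattern]:
    """Enumerate every labeled ordered tree with up to max_size nodes exactly
    once by recursively applying rightmost extension."""
    def grow(pattern: Pattern) -> Iterator[Pattern]:
        yield pattern
        if len(pattern) < max_size:
            for ext in rightmost_extensions(pattern, labels):
                yield from grow(ext)
    for label in labels:
        yield from grow([(0, label)])
```

For example, `enumerate_patterns(["NP", "VP"], 2)` yields the 2 single-node patterns and the 4 two-node patterns, each exactly once.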

  27. Pruning strategy • For every subtree t, propose an upper bound μ(t) such that the gain of any supertree of t is no greater than μ(t) • We can prune the node t if μ(t) < τ, where τ is the current suboptimal gain (the best gain found so far) • Example: μ(t) = 0.4 implies that the gain of any supertree of t is no greater than 0.4
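
The pruning criterion plugs directly into the rightmost-extension search. In the sketch below, `gain` and `upper_bound` (μ) are abstract callables supplied by the caller, and τ is the best gain found so far; everything else is an illustrative assumption, not the authors' implementation.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pattern = List[Tuple[int, str]]   # preorder (depth, label) encoding, root at depth 0

def rightmost_extensions(pattern: Pattern, labels: Sequence[str]):
    last_depth = pattern[-1][0]
    for depth in range(1, last_depth + 2):
        for label in labels:
            yield pattern + [(depth, label)]

def find_best_subtree(labels: Sequence[str], max_size: int,
                      gain: Callable[[Pattern], float],
                      upper_bound: Callable[[Pattern], float]
                      ) -> Tuple[Optional[Pattern], float]:
    """Branch-and-bound over the rightmost-extension search space.
    A node t is not expanded when mu(t) = upper_bound(t) < tau, because mu(t)
    dominates the gain of every supertree of t."""
    best, tau = None, float("-inf")

    def search(pattern: Pattern) -> None:
        nonlocal best, tau
        g = gain(pattern)
        if g > tau:
            best, tau = pattern, g
        if len(pattern) >= max_size or upper_bound(pattern) < tau:
            return                         # prune this branch
        for ext in rightmost_extensions(pattern, labels):
            search(ext)

    for label in labels:
        search([(0, label)])
    return best, tau
```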

  28. Upper bound of the gain
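
The formula on this slide did not survive the transcript. The sketch below is only an illustrative reconstruction for the closely related classification setting (weighted examples d_i with labels y_i in {-1, +1} and a binary subtree indicator), where a valid bound follows from the fact that any tree containing a supertree t' of t must also contain t; the exact bound used for the ranking loss in the paper may differ in its details.

```python
from typing import List, Set, Tuple

Pattern = Tuple[Tuple[int, str], ...]          # hashable preorder encoding
Example = Tuple[Set[Pattern], int, float]      # (all subtrees of x_i, y_i in {-1,+1}, weight d_i)

def gain(t: Pattern, examples: List[Example]) -> float:
    """Weighted agreement of the stump h_t(x) = +1 if t occurs in x else -1.
    Each example's feature set is assumed to contain all of its subtrees."""
    return abs(sum(d * y * (1 if t in subtrees else -1)
                   for subtrees, y, d in examples))

def gain_upper_bound(t: Pattern, examples: List[Example]) -> float:
    """mu(t): bounds gain(t') for every supertree t' of t, because the set of
    examples containing t' is a subset of the examples containing t."""
    s = sum(d * y for _, y, d in examples)
    pos = sum(d for subtrees, y, d in examples if y > 0 and t in subtrees)
    neg = sum(d for subtrees, y, d in examples if y < 0 and t in subtrees)
    return max(2.0 * pos - s, 2.0 * neg + s)
```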
