This paper explores semi-supervised training methods for statistical parsers, addressing the resource bottleneck in natural language processing. The focus is on co-training with different parser models, particularly Collins CFG and LTAG, to improve portability and performance without extensive labeled data. Experiments demonstrate how initial seed set size and parse selection impact training outcomes, highlighting that leveraging two distinct parsing approaches enhances effectiveness. The findings suggest significant improvements, especially with small seed sets and in cross-domain scenarios.
Semi-supervised Training of Statistical Parsers CMSC 35100 Natural Language Processing January 26, 2006
Roadmap
• Motivation:
  • Resource bottleneck
• Co-training
• Co-training with different parsers
  • CFG & LTAG
• Experiments:
  • Initial seed set size
  • Parse selection
  • Domain porting
• Results and discussion
Motivation: Issues
• Current statistical parsers
  • Many grammatical models
  • Significant progress: F-score ~93%
• Issues:
  • Trained on the ~1M-word Penn WSJ treebank
  • Annotation is a significant investment of time and money
  • Portability: a single genre (business news); later treebanks are smaller and still news
• Training resource bottleneck
Motivation: Approach
• Goal:
  • Enhance portability and performance without large amounts of additional training data
• Observations:
  • "Self-training": train a parser on its own output
    • Very small improvement (better counts for heads)
    • Limited to slightly refining the current model
  • Ensemble methods, voting: useful
• Approach: co-training
Co-Training
• Co-training (Blum & Mitchell 1998)
  • Weakly supervised training technique
  • Successful for basic classification tasks
• Materials:
  • Small "seed" set of labeled examples
  • Large set of unlabeled examples
• Training: evidence from multiple models
  • Optimize the degree of agreement between models on unlabeled data
  • Train several models on the seed data
  • Run them on the unlabeled data
  • Use new "reliable" labeled examples to train the others
  • Iterate (see the sketch after this slide)
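A minimal Python sketch of this loop, assuming generic parser objects with train() and parse() methods and a pluggable select_reliable heuristic; all names here are illustrative, not the actual Collins-CFG or LTAG implementations:

```python
# Minimal sketch of the co-training loop described above (Blum & Mitchell style,
# applied to parsers). "parser_a"/"parser_b" are any objects with
# train(examples) and parse(sentence) -> (tree, score); select_reliable is a
# pluggable selection heuristic. All names are illustrative.

def co_train(parser_a, parser_b, seed, unlabeled,
             select_reliable, rounds=30, cache_size=500):
    """Grow each parser's training set with sentences labeled by the other."""
    train_a, train_b = list(seed), list(seed)
    parser_a.train(train_a)
    parser_b.train(train_b)

    for _ in range(rounds):
        # Draw a small cache of unlabeled sentences (the "cache manager").
        cache, unlabeled = unlabeled[:cache_size], unlabeled[cache_size:]
        if not cache:
            break

        # Each parser labels and scores the same cache.
        labeled_a = [(s,) + parser_a.parse(s) for s in cache]  # (sentence, tree, score)
        labeled_b = [(s,) + parser_b.parse(s) for s in cache]

        # Each parser is retrained on examples its partner labeled and judged
        # reliable (see the selection heuristics later in the slides).
        train_a += select_reliable(teacher=labeled_b, student=labeled_a)
        train_b += select_reliable(teacher=labeled_a, student=labeled_b)

        parser_a.train(train_a)
        parser_b.train(train_b)

    return parser_a, parser_b
```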
Co-training Issues
• Challenge: picking reliable novel examples
  • No guaranteed, simple approach; rely on heuristics
    • Intersection: ranked highly by the other model, ranked low by the model itself
    • Difference: the other model's score exceeds the model's own score by some margin
  • Possibly employ parser confidence measures
Experimental Structure
• Approach (Steedman et al., 2003)
  • Focus here: co-training with different parsers
  • Also examined reranking, supertaggers & parsers
• Co-train CFG (Collins) & LTAG parsers
• Data: Penn Treebank WSJ, Brown, NA News
• Questions:
  • How to select reliable novel samples?
  • How does labeled seed size affect co-training?
  • How effective is co-training within and across genres?
System Architecture
• Two "different" parsers
  • "Views" can differ by feature space
  • Here: Collins-CFG & LTAG
  • Comparable performance, different formalisms
• Cache manager
  • Draws unlabeled sentences for the parsers to label
  • Selects a subset of the newly labeled sentences to add to the training set
Two Different Parsers
• Both train on treebank input
  • Lexicalized, with head information percolated up the tree
• Collins-CFG
  • Lexicalized CFG parser
  • "Bi-lexical": each pair of non-terminals leads to a bigram relation between a pair of lexical items
  • Ph = head percolation; Pm = modifiers of the head daughter
• LTAG
  • Lexicalized TAG parser
  • Bigram relations between trees
  • Ps = substitution probability; Pa = adjunction probability
• The two differ in how trees are created and in the depth of lexical relations (rough factorizations sketched below)
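As a rough, hedged illustration only (not the exact parameterization of either parser), the two models factor a parse along these lines, where P is a parent non-terminal, H the head daughter, M_i(m_i) a modifier with head word m_i, h the head word, τ an elementary tree, and η a tree node:

```latex
% Collins-CFG style: generate the head daughter H, then each modifier M_i(m_i),
% conditioned on the lexical head h (one bi-lexical bigram per modifier).
\[
P_{\mathrm{CFG}}\bigl(P(h) \rightarrow \dots\bigr) \approx
  P_h(H \mid P, h)\, \prod_i P_m\bigl(M_i(m_i) \mid P, H, h\bigr)
\]

% LTAG style: a derivation combines lexicalized elementary trees by
% substitution and adjunction, each step scored by P_s or P_a at a node eta.
\[
P_{\mathrm{LTAG}}(\text{derivation}) \approx
  \prod_{\mathrm{subst.}} P_s(\tau' \mid \tau, \eta)\;
  \prod_{\mathrm{adj.}} P_a(\tau'' \mid \tau, \eta)
\]
```

The CFG factorization ties every modifier decision to the head word, while the LTAG factorization scores how elementary trees combine, which is what makes the two views genuinely different.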
Selecting Labeled Examples
• Scoring a parse
  • The ideal (true) score is impossible to compute
  • F-prob: trust the parser's own probability; F-norm-prob: normalize by sentence length
  • F-entropy: difference between the parse-score distribution and the uniform distribution
  • Baselines: number of parses, sentence length
• Selecting (newly labeled) sentences
  • Goal: minimize noise, maximize training utility
  • S-base: the n highest-scoring sentences (both parsers use the same method)
  • Asymmetric teacher/student methods:
    • S-topn: the teacher's top n
    • S-intersect: sentences in the teacher's top n and the student's bottom n
    • S-diff: the teacher's score exceeds the student's by some amount
• (Scoring and selection are sketched in code after this slide.)
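A hedged Python sketch of these scoring functions and selection heuristics; the exact definitions in Steedman et al. (2003) may differ in detail, and all function names and signatures here are illustrative:

```python
import math

# Illustrative versions of the scoring and selection heuristics named above.
# "parse_scores" is the list of scores for one sentence's parses;
# each "*_scored" argument is a list of (sentence, score) pairs.

def f_prob(parse_scores):
    """F-prob: trust the parser (score of its best parse)."""
    return max(parse_scores)

def f_norm_prob(parse_scores, sentence_len):
    """F-norm-prob: best-parse score normalized by sentence length."""
    return max(parse_scores) / max(sentence_len, 1)

def f_entropy(parse_scores):
    """F-entropy: gap between the parse-score distribution and a uniform one;
    a larger gap suggests the parser is more confident."""
    total = sum(parse_scores) or 1.0
    probs = [s / total for s in parse_scores if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    uniform = math.log(len(parse_scores)) if len(parse_scores) > 1 else 0.0
    return uniform - entropy

def s_base(scored, n):
    """S-base: each parser keeps its own n highest-scoring sentences."""
    return {s for s, _ in sorted(scored, key=lambda x: -x[1])[:n]}

def s_topn(teacher_scored, n):
    """S-topn: the student takes the teacher's top-n sentences."""
    return s_base(teacher_scored, n)

def s_intersect(teacher_scored, student_scored, n):
    """S-intersect: sentences in the teacher's top n AND the student's bottom n."""
    top_teacher = {s for s, _ in sorted(teacher_scored, key=lambda x: -x[1])[:n]}
    bottom_student = {s for s, _ in sorted(student_scored, key=lambda x: x[1])[:n]}
    return top_teacher & bottom_student

def s_diff(teacher_scored, student_scored, margin):
    """S-diff: sentences where the teacher's score beats the student's by a margin."""
    student = dict(student_scored)
    return {s for s, sc in teacher_scored if sc - student.get(s, 0.0) > margin}
```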
Experiments: Initial Seed Size
• Typically parsers are evaluated only after all training; here, also consider the convergence rate
  • Initial rapid growth, tailing off with more data
  • Largest improvement between 500 and 1000 instances
• Collins-CFG plateaus at 40K (89.3); LTAG is still improving and will benefit from additional training
• Co-training with a 500-instance vs. 1000-instance seed set
  • Less seed data, greater benefit: co-training enhances coverage
  • However, the 500-instance seed does not reach the level of the 1000-instance seed
Experiments: Parse Selection
• Contrast: selecting all newly labeled sentences vs. S-intersect (67%)
• Co-training experiments with a 500-instance seed set:
  • LTAG performs better with S-intersect
    • It reduces noise, and LTAG is sensitive to noisy trees
  • Collins-CFG performs better with select-all
    • The CFG parser needs to increase coverage, so more samples help
Experiments: Cross-domain
• Train on the Brown corpus (1000-instance seed), co-train on WSJ
  • Collins-CFG with S-intersect improves from 76.6 to 78.3
  • Mostly in the first 5 iterations: lexicalizing for the new domain's vocabulary
• Train on Brown plus a 100-sentence WSJ seed, co-train on other WSJ data
  • Baseline improves to 78.7, co-training to 80
  • Gradual improvement: picking up new constructions?
Summary
• Semi-supervised parser training via co-training
  • Two different parse formalisms provide different views
  • This enhances effectiveness
• Biggest gains with small seed sets
• Cross-domain enhancement
• Selection methods depend on the parse model and the amount of seed data
Findings
• Co-training enhances parsing when trained on small datasets: 500-10000 sentences
• Co-training aids genre porting without labels from the new genre
• Co-training is further improved with ANY labels for the new genre
• Approaches for the crucial problem of sample selection