This paper explores semi-supervised training methods for statistical parsers, addressing the resource bottleneck in natural language processing. The focus is on co-training with different parser models, particularly Collins CFG and LTAG, to improve portability and performance without extensive labeled data. Experiments demonstrate how initial seed set size and parse selection impact training outcomes, highlighting that leveraging two distinct parsing approaches enhances effectiveness. The findings suggest significant improvements, especially with small seed sets and in cross-domain scenarios.
Semi-supervised Training of Statistical Parsers CMSC 35100 Natural Language Processing January 26, 2006
Roadmap
• Motivation:
  • Resource bottleneck
• Co-training
• Co-training with different parsers
  • CFG & LTAG
• Experiments:
  • Initial seed set size
  • Parse selection
  • Domain porting
• Results and discussion
Motivation: Issues
• Current statistical parsers
  • Many grammatical models
  • Significant progress: F-score ~93%
• Issues:
  • Trained on the ~1M-word Penn WSJ treebank
  • Annotation is a significant investment of time and money
  • Portability: a single genre (business news); later treebanks are smaller and still news
• Training resource bottleneck
Motivation: Approach
• Goal:
  • Enhance portability and performance without large amounts of additional training data
• Observations:
  • "Self-training": train a parser on its own output
    • Very small improvement (better counts for heads)
    • Limited to slightly refining the current model
  • Ensemble methods, voting: useful
• Approach: co-training
Co-Training
• Co-training (Blum & Mitchell 1998)
  • Weakly supervised training technique
  • Successful for basic classification tasks
• Materials:
  • Small "seed" set of labeled examples
  • Large set of unlabeled examples
• Training: evidence from multiple models
  • Optimize the degree of agreement between models on unlabeled data
  • Train several models on the seed data
  • Run them on the unlabeled data
  • Use new "reliable" labeled examples to train the others
  • Iterate (see the sketch after this slide)
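A minimal Python sketch of this loop, assuming generic parser objects with train() and parse() methods and a pluggable select_reliable heuristic; all names here are illustrative, not the actual Collins-CFG or LTAG implementations:

```python
# Minimal sketch of the co-training loop described above (Blum & Mitchell style,
# applied to parsers). "parser_a"/"parser_b" are any objects with
# train(examples) and parse(sentence) -> (tree, score); select_reliable is a
# pluggable selection heuristic. All names are illustrative.

def co_train(parser_a, parser_b, seed, unlabeled,
             select_reliable, rounds=30, cache_size=500):
    """Grow each parser's training set with sentences labeled by the other."""
    train_a, train_b = list(seed), list(seed)
    parser_a.train(train_a)
    parser_b.train(train_b)

    for _ in range(rounds):
        # Draw a small cache of unlabeled sentences (the "cache manager").
        cache, unlabeled = unlabeled[:cache_size], unlabeled[cache_size:]
        if not cache:
            break

        # Each parser labels and scores the same cache.
        labeled_a = [(s,) + parser_a.parse(s) for s in cache]  # (sentence, tree, score)
        labeled_b = [(s,) + parser_b.parse(s) for s in cache]

        # Each parser is retrained on examples its partner labeled and judged
        # reliable (see the selection heuristics later in the slides).
        train_a += select_reliable(teacher=labeled_b, student=labeled_a)
        train_b += select_reliable(teacher=labeled_a, student=labeled_b)

        parser_a.train(train_a)
        parser_b.train(train_b)

    return parser_a, parser_b
```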
Co-training Issues
• Challenge: picking reliable novel examples
  • No guaranteed, simple approach; rely on heuristics
    • Intersection: ranked highly by the other model, ranked low by the model itself
    • Difference: the other model's score exceeds the model's own score by some margin
  • Possibly employ parser confidence measures
Experimental Structure
• Approach (Steedman et al., 2003)
  • Focus here: co-training with different parsers
  • Also examined reranking, supertaggers & parsers
• Co-train CFG (Collins) & LTAG parsers
• Data: Penn Treebank WSJ, Brown, NA News
• Questions:
  • How to select reliable novel samples?
  • How does labeled seed size affect co-training?
  • How effective is co-training within and across genres?
System Architecture
• Two "different" parsers
  • "Views" can differ by feature space
  • Here: Collins-CFG & LTAG
  • Comparable performance, different formalisms
• Cache manager
  • Draws unlabeled sentences for the parsers to label
  • Selects a subset of the newly labeled sentences to add to the training set
Two Different Parsers
• Both train on treebank input
  • Lexicalized, with head information percolated up the tree
• Collins-CFG
  • Lexicalized CFG parser
  • "Bi-lexical": each pair of non-terminals leads to a bigram relation between a pair of lexical items
  • Ph = head percolation; Pm = modifiers of the head daughter
• LTAG
  • Lexicalized TAG parser
  • Bigram relations between trees
  • Ps = substitution probability; Pa = adjunction probability
• The two differ in how trees are created and in the depth of lexical relations (rough factorizations sketched below)
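As a rough, hedged illustration only (not the exact parameterization of either parser), the two models factor a parse along these lines, where P is a parent non-terminal, H the head daughter, M_i(m_i) a modifier with head word m_i, h the head word, τ an elementary tree, and η a tree node:

```latex
% Collins-CFG style: generate the head daughter H, then each modifier M_i(m_i),
% conditioned on the lexical head h (one bi-lexical bigram per modifier).
\[
P_{\mathrm{CFG}}\bigl(P(h) \rightarrow \dots\bigr) \approx
  P_h(H \mid P, h)\, \prod_i P_m\bigl(M_i(m_i) \mid P, H, h\bigr)
\]

% LTAG style: a derivation combines lexicalized elementary trees by
% substitution and adjunction, each step scored by P_s or P_a at a node eta.
\[
P_{\mathrm{LTAG}}(\text{derivation}) \approx
  \prod_{\mathrm{subst.}} P_s(\tau' \mid \tau, \eta)\;
  \prod_{\mathrm{adj.}} P_a(\tau'' \mid \tau, \eta)
\]
```

The CFG factorization ties every modifier decision to the head word, while the LTAG factorization scores how elementary trees combine, which is what makes the two views genuinely different.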
Selecting Labeled Examples
• Scoring a parse
  • The ideal (true) score is impossible to compute
  • F-prob: trust the parser's own probability; F-norm-prob: normalize by sentence length
  • F-entropy: difference between the parse-score distribution and the uniform distribution
  • Baselines: number of parses, sentence length
• Selecting (newly labeled) sentences
  • Goal: minimize noise, maximize training utility
  • S-base: the n highest-scoring sentences (both parsers use the same method)
  • Asymmetric teacher/student methods:
    • S-topn: the teacher's top n
    • S-intersect: sentences in the teacher's top n and the student's bottom n
    • S-diff: the teacher's score exceeds the student's by some amount
• (Scoring and selection are sketched in code after this slide.)
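A hedged Python sketch of these scoring functions and selection heuristics; the exact definitions in Steedman et al. (2003) may differ in detail, and all function names and signatures here are illustrative:

```python
import math

# Illustrative versions of the scoring and selection heuristics named above.
# "parse_scores" is the list of scores for one sentence's parses;
# each "*_scored" argument is a list of (sentence, score) pairs.

def f_prob(parse_scores):
    """F-prob: trust the parser (score of its best parse)."""
    return max(parse_scores)

def f_norm_prob(parse_scores, sentence_len):
    """F-norm-prob: best-parse score normalized by sentence length."""
    return max(parse_scores) / max(sentence_len, 1)

def f_entropy(parse_scores):
    """F-entropy: gap between the parse-score distribution and a uniform one;
    a larger gap suggests the parser is more confident."""
    total = sum(parse_scores) or 1.0
    probs = [s / total for s in parse_scores if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    uniform = math.log(len(parse_scores)) if len(parse_scores) > 1 else 0.0
    return uniform - entropy

def s_base(scored, n):
    """S-base: each parser keeps its own n highest-scoring sentences."""
    return {s for s, _ in sorted(scored, key=lambda x: -x[1])[:n]}

def s_topn(teacher_scored, n):
    """S-topn: the student takes the teacher's top-n sentences."""
    return s_base(teacher_scored, n)

def s_intersect(teacher_scored, student_scored, n):
    """S-intersect: sentences in the teacher's top n AND the student's bottom n."""
    top_teacher = {s for s, _ in sorted(teacher_scored, key=lambda x: -x[1])[:n]}
    bottom_student = {s for s, _ in sorted(student_scored, key=lambda x: x[1])[:n]}
    return top_teacher & bottom_student

def s_diff(teacher_scored, student_scored, margin):
    """S-diff: sentences where the teacher's score beats the student's by a margin."""
    student = dict(student_scored)
    return {s for s, sc in teacher_scored if sc - student.get(s, 0.0) > margin}
```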
Experiments: Initial Seed Size
• Typically parsers are evaluated only after all training; here, also consider the convergence rate
  • Initial rapid growth, tailing off with more data
  • Largest improvement between 500 and 1000 instances
• Collins-CFG plateaus at 40K (89.3); LTAG is still improving and will benefit from additional training
• Co-training with a 500-instance vs. 1000-instance seed set
  • Less seed data, greater benefit: co-training enhances coverage
  • However, the 500-instance seed does not reach the level of the 1000-instance seed
Experiments: Parse Selection
• Contrast: selecting all newly labeled sentences vs. S-intersect (67%)
• Co-training experiments with a 500-instance seed set:
  • LTAG performs better with S-intersect
    • It reduces noise, and LTAG is sensitive to noisy trees
  • Collins-CFG performs better with select-all
    • The CFG parser needs to increase coverage, so more samples help
Experiments: Cross-domain
• Train on the Brown corpus (1000-instance seed), co-train on WSJ
  • Collins-CFG with S-intersect improves from 76.6 to 78.3
  • Mostly in the first 5 iterations: lexicalizing for the new domain's vocabulary
• Train on Brown plus a 100-sentence WSJ seed, co-train on other WSJ data
  • Baseline improves to 78.7, co-training to 80
  • Gradual improvement: picking up new constructions?
Summary
• Semi-supervised parser training via co-training
  • Two different parse formalisms provide different views
  • This enhances effectiveness
• Biggest gains with small seed sets
• Cross-domain enhancement
• Selection methods depend on the parse model and the amount of seed data
Findings
• Co-training enhances parsing when trained on small datasets: 500-10000 sentences
• Co-training aids genre porting without labels from the new genre
• Co-training is further improved with ANY labels for the new genre
• Approaches for the crucial problem of sample selection