
Constraint satisfaction inference for discrete sequence processing in NLP


Presentation Transcript


  1. Constraint satisfaction inference for discrete sequence processing in NLP Antal van den Bosch ILK / CL and AI, Tilburg University DCU Dublin April 19, 2006 (work with Sander Canisius and Walter Daelemans)

  2. Constraint satisfaction inference for discrete sequence processing in NLP Talk overview • How to map sequences to sequences, not output tokens? • Case studies: syntactic and semantic chunking • Discrete versus probabilistic classifiers • Constraint satisfaction inference • Discussion

  3. How to map sequences to sequences? • Machine learning’s pet solution: • Local-context windowing (NETtalk) • One-shot prediction of single output tokens. • Concatenation of predicted tokens.

  4. The near-sightedness problem • A local window never captures long-distance information. • No coordination of individual output tokens. • Long-distance information does exist; holistic coordination is needed.

  5. Holistic information • “Counting” constraints: • Certain entities occur only once in a clause/sentence. • “Syntactic validity” constraints: • On discontinuity and overlap; chunks have a beginning and an end. • “Cooccurrence” constraints: • Some entities must occur with others, or cannot co-exist with others.

  6. Solution 1: Feedback • Recurrent networks in ANN (Elman, 1991; Sun & Giles, 2001), e.g. word prediction. • Memory-based tagger (Daelemans, Zavrel, Berck, and Gillis, 1996). • Maximum-entropy tagger (Ratnaparkhi, 1996).
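
A minimal sketch of the feedback idea (illustrative Python, not the MBT or MXPOST code): the previous prediction is copied into the feature vector of the next instance, so errors can cascade exactly as described on the next slides. The classify() argument is a hypothetical stand-in for any trained point-wise classifier.

    # Feedback decoding sketch: the tag predicted at position i-1 becomes an
    # input feature at position i (classify() is a hypothetical stand-in for
    # any trained point-wise classifier, e.g. a memory-based or maxent model).
    def feedback_decode(words, classify, left=2, right=2):
        padded = ["_"] * left + words + ["_"] * right
        tags = []
        for i in range(len(words)):
            window = padded[i:i + left + 1 + right]   # local word context
            prev_tag = tags[-1] if tags else "_"      # feedback feature
            tags.append(classify(window + [prev_tag]))
        return tags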

  7. Feedback disadvantage • Label bias problem (Lafferty, McCallum, and Pereira, 2001). • Previous prediction is an important source of information. • Classifier is compelled to take its own prediction as correct. • Cascading errors result.

  8.–11. Label bias problem (figure slides)

  12. Solution 2: Stacking • Wolpert (1992) for ANNs. • Veenstra (1998) for NP chunking: • Stage-1 classifier, near-sighted, predicts sequences. • Stage-2 classifier learns to correct stage-1 errors by taking stage-1 output as windowed input.
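
A hedged sketch of the stacking setup (names are illustrative): stage-2 instances are the original features plus a window of stage-1 output tags around the focus position.

    # Stacking sketch: append a window of stage-1 predictions to each original
    # feature vector; the result is input for the stage-2 learner.
    def stack_features(base_instances, stage1_tags, tag_window=1):
        padded = ["_"] * tag_window + list(stage1_tags) + ["_"] * tag_window
        stacked = []
        for i, feats in enumerate(base_instances):    # feats: list of values
            tag_context = padded[i:i + 2 * tag_window + 1]
            stacked.append(list(feats) + tag_context)
        return stacked
    # As noted above, stage1_tags should ideally come from cross-validated runs
    # of the stage-1 classifier, so stage-2 trains on realistically noisy input.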

  13. Windowing and stacking

  14. Stacking disadvantages • Practical issues: • Ideally, train stage-2 on cross-validated output of stage-1, not “perfect” output. • Costly procedure. • Total architecture: two full classifiers. • Local, not global error correction.

  15. What exactly is the problem with mapping to sequences? • Born in Made, The Netherlands → O_O_B-LOC_O_B-LOC_I-LOC • Multi-class classification with 100s or 1000s of classes? • Lack of generalization • Some ML algorithms cannot cope very well: SVMs, rule learners, decision trees • However, others can: Naïve Bayes, maximum entropy, memory-based learning

  16. Solution 3: n-gram subsequences • Retain windowing approach, but • Predict overlapping n-grams of output tokens.

  17. Resolving overlapping n-grams • Probabilities available: Viterbi • Other options: voting
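
For the discrete (voting) option, a minimal illustrative sketch: each predicted trigram votes for the three token positions it spans, and the per-position majority wins. This is a reconstruction of the general idea, not necessarily the exact voting scheme used in the experiments.

    from collections import Counter

    # trigram_preds[i] is the trigram predicted at position i, covering tokens
    # i-1, i, i+1, encoded e.g. as "B-NP_I-NP_O"; underscores never occur
    # inside IOB labels, so splitting on "_" is safe.
    def resolve_trigrams(trigram_preds):
        n = len(trigram_preds)
        votes = [Counter() for _ in range(n)]
        for i, trigram in enumerate(trigram_preds):
            for offset, label in zip((-1, 0, 1), trigram.split("_")):
                if 0 <= i + offset < n:
                    votes[i + offset][label] += 1     # one vote per covered token
        return [v.most_common(1)[0][0] for v in votes]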

  18. N-gram+voting disadvantages • Classifier predicts syntactically valid trigrams, but • After resolving overlap, only local error correction. • End result is still a concatenation of local uncoordinated decisions. • Number of classes increases (problematic for some ML).

  19. Learning linguistic sequences Talk overview • How to map sequences to sequences, not output tokens? • Case studies: syntactic and semantic chunking • Discrete versus probabilistic classifiers • Constraint satisfaction inference • Discussion

  20. Four “chunking” tasks • English base-phrase chunking • CoNLL-2000, WSJ • English named-entity recognition • CoNLL-2003, Reuters • Dutch medical concept chunking • IMIX/Rolaquad, medical encyclopedia • English protein-related entity chunking • Genia, Medline abstracts

  21. Treated the same way • IOB-tagging. • Windowing: • 3-1-3 words • 3-1-3 predicted PoS tags (WSJ / Wotan) • No seedlists, suffix/prefix, capitalization, … • Memory-based learning and maximum-entropy modeling • MBL: automatic parameter optimization (paramsearch, Van den Bosch, 2004)

  22.–23. IOB-codes for chunks: step 1, PTB-II WSJ ((S (ADVP-TMP Once) (NP-SBJ-1 he) (VP was (VP held (NP *-1) (PP-TMP for (NP three months)) (PP without (S-NOM (NP-SBJ *-1) (VP being (VP charged) ))))) .))

  24. IOB codes for chunks: Flatten tree [Once]ADVP [he]NP [was held]VP [for]PP [three months]NP [without]PP [being charged]VP

  25. Example: Instances
      feature 1 (word -1)   feature 2 (word 0)   feature 3 (word +1)   class
      _                     Once                 he                    I-ADVP
      Once                  he                   was                   I-NP
      he                    was                  held                  I-VP
      was                   held                 for                   I-VP
      held                  for                  three                 I-PP
      for                   three                months                I-NP
      three                 months               without               I-NP
      months                without              being                 I-PP
      without               being                charged               I-VP
      being                 charged              .                     I-VP
      charged               .                    _                     O
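
A small sketch of how such instances can be generated from an IOB-tagged sentence (a 1-1-1 word window matching the table above; the experiments in the talk use 3-1-3 windows plus predicted PoS tags).

    # Build (word-1, word0, word+1) -> class instances from an IOB-tagged
    # sentence; "_" pads the sentence boundaries, as in the table above.
    def windowed_instances(words, iob_tags, left=1, right=1):
        padded = ["_"] * left + words + ["_"] * right
        return [(tuple(padded[i:i + left + 1 + right]), tag)
                for i, tag in enumerate(iob_tags)]

    sentence = "Once he was held for three months without being charged .".split()
    tags = ["I-ADVP", "I-NP", "I-VP", "I-VP", "I-PP", "I-NP",
            "I-NP", "I-PP", "I-VP", "I-VP", "O"]
    for features, cls in windowed_instances(sentence, tags):
        print(features, cls)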

  26. MBL • Memory-based learning • k-NN classifier (Fix and Hodges, 1951; Cover and Hart, 1967; Aha et al., 1991); applied to NLP by Daelemans et al. • Discrete point-wise classifier • Implementation used: TiMBL (Tilburg Memory-Based Learner)

  27. Memory-based learning and classification • Learning: • Store instances in memory • Classification: • Given new test instance X, • Compare it to all memory instances • Compute a distance between X and memory instance Y • Update the top k of closest instances (nearest neighbors) • When done, take the majority class of the k nearest neighbors as the class of X
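
A bare-bones sketch of this store-and-compare loop, with plain Overlap distance and unweighted majority voting (a toy stand-in for TiMBL; no feature weighting, and k counts instances here rather than distances as TiMBL does).

    from collections import Counter

    def overlap_distance(x, y):
        # count the number of mismatching feature values
        return sum(1 for xi, yi in zip(x, y) if xi != yi)

    def mbl_classify(memory, x, k=1):
        # memory: list of (features, class) pairs stored at training time
        ranked = sorted(memory, key=lambda item: overlap_distance(x, item[0]))
        nearest_classes = [cls for _, cls in ranked[:k]]
        return Counter(nearest_classes).most_common(1)[0][0]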

  28. Similarity / distance • A nearest neighbor has the smallest distance, or the largest similarity • Computed with a distance function • TiMBL offers two basic distance functions: • Overlap • MVDM (Stanfill & Waltz, 1986; Cost & Salzberg, 1989) • Feature weighting • Exemplar weighting • Distance-weighted class voting

  29. The Overlap distance function • “Count the number of mismatching features”
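
In formula form (the standard Overlap metric as documented for TiMBL):

    \Delta(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad
    \delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}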

  30. The MVDM distance function • Estimate a numeric “distance” between pairs of values • “e” is more like “i” than like “p” in a phonetic task • “book” is more like “document” than like “the” in a parsing task
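
The usual MVDM formulation of this value-pair distance, based on the class distributions the two values co-occur with:

    \delta(v_1, v_2) = \sum_{c=1}^{C} \bigl| P(c \mid v_1) - P(c \mid v_2) \bigr|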

  31. Feature weighting • Some features are more important than others • TiMBL metrics: Information Gain, Gain Ratio, Chi Square, Shared Variance • Ex. IG: • Compute the entropy of the full database • For each feature, partition the database on all values of that feature • For each value, compute the entropy of its sub-database • Take the weighted average entropy over all partitions • The difference between this “partitioned” entropy and the overall entropy is the feature’s Information Gain (see the formula below)
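
The recipe above corresponds to the standard Information Gain formula; Gain Ratio additionally divides by the feature's split info to correct for many-valued features:

    IG(f) = H(C) - \sum_{v \in V_f} P(v)\, H(C \mid v), \qquad
    H(C) = -\sum_{c} P(c) \log_2 P(c), \qquad
    GR(f) = \frac{IG(f)}{-\sum_{v \in V_f} P(v) \log_2 P(v)}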

  32. Feature weighting in the distance function • Mismatching on a more important feature gives a larger distance • Factor in the distance function:
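
The formula the slide refers to is presumably the weighted Overlap distance, in which each per-feature mismatch is multiplied by that feature's weight:

    \Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i)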

  33. Distance weighting • Relation between larger k and smoothing • Subtle extension: making more distant neighbors count less in the class vote • Linear inverse of distance (w.r.t. max) • Inverse of distance • Exponential decay
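
The three weighting schemes, in their usual formulation (d_j is the distance of neighbour j; d_1 and d_k are the distances of the nearest and furthest neighbours in the top k; alpha and beta are decay parameters):

    \text{inverse-linear:}\quad w_j = \frac{d_k - d_j}{d_k - d_1}
    \qquad
    \text{inverse distance:}\quad w_j = \frac{1}{d_j}
    \qquad
    \text{exponential decay:}\quad w_j = e^{-\alpha\, d_j^{\beta}}

In practice a small constant guards against division by zero when d_j = 0 or d_k = d_1.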

  34. Current practice • Default TiMBL settings: • k=1, Overlap, GR, no distance weighting • Work well for some morpho-phonological tasks • Rules of thumb: • Combine MVDM with bigger k • Combine distance weighting with bigger k • Very good bet: higher k, MVDM, GR, distance weighting • Especially for sentence and text level tasks

  35. Base phrase chunking • 211,727 training, 47,377 test examples • 22 classes • [He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only $ 1.8 billion]NP [in]PP [September]NP .

  36. Named entity recognition • 203,621 training, 46,435 test examples • 8 classes • [U.N.]organization official [Ekeus]person heads for [Baghdad]location

  37. Medical concept chunking • 428,502 training, 47,430 test examples • 24 classes • Bij [infantiel botulisme]disease kunnen in extreme gevallen [ademhalingsproblemen]symptom en [algehele lusteloosheid]symptom optreden. (“In infantile botulism, breathing problems and general listlessness can occur in extreme cases.”)

  38. Protein-related concept chunking • 458,593 training, 50,916 test examples • 51 classes • Most hybrids express both [KBF1]protein and [NF-kappa B]protein in their nuclei , but one hybrid expresses only [KBF1]protein .

  39. Results: feedback in MBT

  40. Results: stacking

  41. Results: trigram classes

  42. Numbers of trigram classes

  43. Error reductions

  44. Learning linguistic sequences Talk overview • How to map sequences to sequences, not output tokens? • Case studies: syntactic and semantic chunking • Discrete versus probabilistic classifiers • Constraint satisfaction inference • Discussion

  45.–46. Classification + inference (figure slides)

  47. Comparative study • Base discrete classifier: Maximum-entropy model (Zhang Le, maxent) • Extended with feedback, stacking, trigrams, combinations • Compared against • Conditional Markov Models (Ratnaparkhi, 1996) • Maximum-entropy Markov Models (McCallum, Freitag, and Pereira, 2000) • Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001) • On Medical & Protein chunking

  48. Maximum entropy • Probabilistic model: conditional distribution p(C|x) (= probability matrix between classes and values) with maximal entropy H(p) • Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible • Maximize entropy in matrix through iterative process: • IIS, GIS (Improved/Generalized Iterative Scaling) • L-BFGS • Discretized!
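
The conditional maximum-entropy model referred to here has the standard exponential form, with feature functions f_i(x, c) and weights \lambda_i estimated by GIS/IIS or L-BFGS:

    p(c \mid x) = \frac{1}{Z(x)} \exp\Bigl(\sum_i \lambda_i f_i(x, c)\Bigr),
    \qquad
    Z(x) = \sum_{c'} \exp\Bigl(\sum_i \lambda_i f_i(x, c')\Bigr)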

  49. Results: discrete Maxent variants

  50. Conditional Markov Models • Probabilistic analogue of Feedback • Processes from left to right • Each decision is a conditional probability that includes the previous classifications; decoding is limited by beam search • With beam=1, equal to Feedback • Can be trained with maximum entropy • E.g. MXPOST, Ratnaparkhi (1996)
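
Schematically (window size w and history length k are illustrative), a conditional Markov model factors the sequence probability into left-to-right local decisions, each conditioned on the local input context and on earlier classifications; decoding approximates the best sequence with a beam, and with a beam of 1 this reduces to the greedy feedback architecture:

    P(c_1 \ldots c_n \mid x_1 \ldots x_n) \approx
    \prod_{i=1}^{n} P(c_i \mid x_{i-w} \ldots x_{i+w},\, c_{i-k} \ldots c_{i-1})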
