
Constraint satisfaction inference for discrete sequence processing in NLP


Presentation Transcript


  1. Constraint satisfaction inference for discrete sequence processing in NLP Antal van den Bosch ILK / CL and AI, Tilburg University DCU Dublin April 19, 2006 (work with Sander Canisius and Walter Daelemans)

  2. Constraint satisfaction inference for discrete sequence processing in NLP Talk overview • How to map sequences to sequences, not output tokens? • Case studies: syntactic and semantic chunking • Discrete versus probabilistic classifiers • Constraint satisfaction inference • Discussion

  3. How to map sequences to sequences? • Machine learning’s pet solution: • Local-context windowing (NETtalk) • One-shot prediction of single output tokens. • Concatenation of predicted tokens.

  4. The near-sightedness problem • A local window never captures long-distance information. • No coordination of individual output tokens. • Long-distance information does exist; holistic coordination is needed.

  5. Holistic information • “Counting” constraints: • Certain entities occur only once in a clause/sentence. • “Syntactic validity” constraints: • On discontinuity and overlap; chunks have a beginning and an end. • “Cooccurrence” constraints: • Some entities must occur with others, or cannot co-exist with others.

  6. Solution 1: Feedback • Recurrent networks in ANN (Elman, 1991; Sun & Giles, 2001), e.g. word prediction. • Memory-based tagger (Daelemans, Zavrel, Berck, and Gillis, 1996). • Maximum-entropy tagger (Ratnaparkhi, 1996).
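
A minimal sketch of the feedback idea (illustrative Python, not the MBT or MXPOST code): the previous prediction is copied into the feature vector of the next instance, so errors can cascade exactly as described on the next slides. The classify() argument is a hypothetical stand-in for any trained point-wise classifier.

    # Feedback decoding sketch: the tag predicted at position i-1 becomes an
    # input feature at position i (classify() is a hypothetical stand-in for
    # any trained point-wise classifier, e.g. a memory-based or maxent model).
    def feedback_decode(words, classify, left=2, right=2):
        padded = ["_"] * left + words + ["_"] * right
        tags = []
        for i in range(len(words)):
            window = padded[i:i + left + 1 + right]   # local word context
            prev_tag = tags[-1] if tags else "_"      # feedback feature
            tags.append(classify(window + [prev_tag]))
        return tags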

  7. Feedback disadvantage • Label bias problem (Lafferty, McCallum, and Pereira, 2001). • Previous prediction is an important source of information. • Classifier is compelled to take its own prediction as correct. • Cascading errors result.

  8.–11. Label bias problem (figure slides)

  12. Solution 2: Stacking • Wolpert (1992) for ANNs. • Veenstra (1998) for NP chunking: • Stage-1 classifier, near-sighted, predicts sequences. • Stage-2 classifier learns to correct stage-1 errors by taking stage-1 output as windowed input.
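
A hedged sketch of the stacking setup (names are illustrative): stage-2 instances are the original features plus a window of stage-1 output tags around the focus position.

    # Stacking sketch: append a window of stage-1 predictions to each original
    # feature vector; the result is input for the stage-2 learner.
    def stack_features(base_instances, stage1_tags, tag_window=1):
        padded = ["_"] * tag_window + list(stage1_tags) + ["_"] * tag_window
        stacked = []
        for i, feats in enumerate(base_instances):    # feats: list of values
            tag_context = padded[i:i + 2 * tag_window + 1]
            stacked.append(list(feats) + tag_context)
        return stacked
    # As noted above, stage1_tags should ideally come from cross-validated runs
    # of the stage-1 classifier, so stage-2 trains on realistically noisy input.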

  13. Windowing and stacking

  14. Stacking disadvantages • Practical issues: • Ideally, train stage-2 on cross-validated output of stage-1, not “perfect” output. • Costly procedure. • Total architecture: two full classifiers. • Local, not global error correction.

  15. What exactly is the problem with mapping to sequences? • Born in Made, The Netherlands → O_O_B-LOC_O_B-LOC_I-LOC • Multi-class classification with 100s or 1000s of classes? • Lack of generalization • Some ML algorithms cannot cope very well: SVMs, rule learners, decision trees • However, others can: Naïve Bayes, maximum entropy, memory-based learning

  16. Solution 3: n-gram subsequences • Retain windowing approach, but • Predict overlapping n-grams of output tokens.

  17. Resolving overlapping n-grams • Probabilities available: Viterbi • Other options: voting
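
For the discrete (voting) option, a minimal illustrative sketch: each predicted trigram votes for the three token positions it spans, and the per-position majority wins. This is a reconstruction of the general idea, not necessarily the exact voting scheme used in the experiments.

    from collections import Counter

    # trigram_preds[i] is the trigram predicted at position i, covering tokens
    # i-1, i, i+1, encoded e.g. as "B-NP_I-NP_O"; underscores never occur
    # inside IOB labels, so splitting on "_" is safe.
    def resolve_trigrams(trigram_preds):
        n = len(trigram_preds)
        votes = [Counter() for _ in range(n)]
        for i, trigram in enumerate(trigram_preds):
            for offset, label in zip((-1, 0, 1), trigram.split("_")):
                if 0 <= i + offset < n:
                    votes[i + offset][label] += 1     # one vote per covered token
        return [v.most_common(1)[0][0] for v in votes]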

  18. N-gram+voting disadvantages • Classifier predicts syntactically valid trigrams, but • After resolving overlap, only local error correction. • End result is still a concatenation of local uncoordinated decisions. • Number of classes increases (problematic for some ML).

  19. Learning linguistic sequences Talk overview • How to map sequences to sequences, not output tokens? • Case studies: syntactic and semantic chunking • Discrete versus probabilistic classifiers • Constraint satisfaction inference • Discussion

  20. Four “chunking” tasks • English base-phrase chunking • CoNLL-2000, WSJ • English named-entity recognition • CoNLL-2003, Reuters • Dutch medical concept chunking • IMIX/Rolaquad, medical encyclopedia • English protein-related entity chunking • Genia, Medline abstracts

  21. Treated the same way • IOB-tagging. • Windowing: • 3-1-3 words • 3-1-3 predicted PoS tags (WSJ / Wotan) • No seedlists, suffix/prefix, capitalization, … • Memory-based learning and maximum-entropy modeling • MBL: automatic parameter optimization (paramsearch, Van den Bosch, 2004)

  22.–23. IOB-codes for chunks: step 1, PTB-II WSJ ((S (ADVP-TMP Once) (NP-SBJ-1 he) (VP was (VP held (NP *-1) (PP-TMP for (NP three months)) (PP without (S-NOM (NP-SBJ *-1) (VP being (VP charged) ))))) .))

  24. IOB codes for chunks: Flatten tree [Once]ADVP [he]NP [was held]VP [for]PP [three months]NP [without]PP [being charged]VP

  25. Example: Instances
      feature 1 (word -1)   feature 2 (word 0)   feature 3 (word +1)   class
      _                     Once                 he                    I-ADVP
      Once                  he                   was                   I-NP
      he                    was                  held                  I-VP
      was                   held                 for                   I-VP
      held                  for                  three                 I-PP
      for                   three                months                I-NP
      three                 months               without               I-NP
      months                without              being                 I-PP
      without               being                charged               I-VP
      being                 charged              .                     I-VP
      charged               .                    _                     O
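
A small sketch of how such instances can be generated from an IOB-tagged sentence (a 1-1-1 word window matching the table above; the experiments in the talk use 3-1-3 windows plus predicted PoS tags).

    # Build (word-1, word0, word+1) -> class instances from an IOB-tagged
    # sentence; "_" pads the sentence boundaries, as in the table above.
    def windowed_instances(words, iob_tags, left=1, right=1):
        padded = ["_"] * left + words + ["_"] * right
        return [(tuple(padded[i:i + left + 1 + right]), tag)
                for i, tag in enumerate(iob_tags)]

    sentence = "Once he was held for three months without being charged .".split()
    tags = ["I-ADVP", "I-NP", "I-VP", "I-VP", "I-PP", "I-NP",
            "I-NP", "I-PP", "I-VP", "I-VP", "O"]
    for features, cls in windowed_instances(sentence, tags):
        print(features, cls)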

  26. MBL • Memory-based learning • k-NN classifier (Fix and Hodges, 1951; Cover and Hart, 1967; Aha et al., 1991); applied to NLP by Daelemans et al. • Discrete point-wise classifier • Implementation used: TiMBL (Tilburg Memory-Based Learner)

  27. Memory-based learning and classification • Learning: • Store instances in memory • Classification: • Given new test instance X, • Compare it to all memory instances • Compute a distance between X and memory instance Y • Update the top k of closest instances (nearest neighbors) • When done, take the majority class of the k nearest neighbors as the class of X
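
A bare-bones sketch of this store-and-compare loop, with plain Overlap distance and unweighted majority voting (a toy stand-in for TiMBL; no feature weighting, and k counts instances here rather than distances as TiMBL does).

    from collections import Counter

    def overlap_distance(x, y):
        # count the number of mismatching feature values
        return sum(1 for xi, yi in zip(x, y) if xi != yi)

    def mbl_classify(memory, x, k=1):
        # memory: list of (features, class) pairs stored at training time
        ranked = sorted(memory, key=lambda item: overlap_distance(x, item[0]))
        nearest_classes = [cls for _, cls in ranked[:k]]
        return Counter(nearest_classes).most_common(1)[0][0]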

  28. Similarity / distance • A nearest neighbor has the smallest distance, or the largest similarity • Computed with a distance function • TiMBL offers two basic distance functions: • Overlap • MVDM (Stanfill & Waltz, 1986; Cost & Salzberg, 1989) • Feature weighting • Exemplar weighting • Distance-weighted class voting

  29. The Overlap distance function • “Count the number of mismatching features”
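
In formula form (the standard Overlap metric as documented for TiMBL):

    \Delta(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad
    \delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}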

  30. The MVDM distance function • Estimate a numeric “distance” between pairs of values • “e” is more like “i” than like “p” in a phonetic task • “book” is more like “document” than like “the” in a parsing task
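
The usual MVDM formulation of this value-pair distance, based on the class distributions the two values co-occur with:

    \delta(v_1, v_2) = \sum_{c=1}^{C} \bigl| P(c \mid v_1) - P(c \mid v_2) \bigr|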

  31. Feature weighting • Some features are more important than others • TiMBL metrics: Information Gain, Gain Ratio, Chi Square, Shared Variance • Ex. IG: • Compute the entropy of the full database • For each feature, partition the database on all values of that feature • For each value, compute the entropy of its sub-database • Take the weighted average entropy over all partitions • The difference between this “partitioned” entropy and the overall entropy is the feature’s Information Gain (see the formula below)
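
The recipe above corresponds to the standard Information Gain formula; Gain Ratio additionally divides by the feature's split info to correct for many-valued features:

    IG(f) = H(C) - \sum_{v \in V_f} P(v)\, H(C \mid v), \qquad
    H(C) = -\sum_{c} P(c) \log_2 P(c), \qquad
    GR(f) = \frac{IG(f)}{-\sum_{v \in V_f} P(v) \log_2 P(v)}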

  32. Feature weighting in the distance function • Mismatching on a more important feature gives a larger distance • Factor in the distance function:
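
The formula the slide refers to is presumably the weighted Overlap distance, in which each per-feature mismatch is multiplied by that feature's weight:

    \Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i)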

  33. Distance weighting • Relation between larger k and smoothing • Subtle extension: making more distant neighbors count less in the class vote • Linear inverse of distance (w.r.t. max) • Inverse of distance • Exponential decay
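
The three weighting schemes, in their usual formulation (d_j is the distance of neighbour j; d_1 and d_k are the distances of the nearest and furthest neighbours in the top k; alpha and beta are decay parameters):

    \text{inverse-linear:}\quad w_j = \frac{d_k - d_j}{d_k - d_1}
    \qquad
    \text{inverse distance:}\quad w_j = \frac{1}{d_j}
    \qquad
    \text{exponential decay:}\quad w_j = e^{-\alpha\, d_j^{\beta}}

In practice a small constant guards against division by zero when d_j = 0 or d_k = d_1.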

  34. Current practice • Default TiMBL settings: • k=1, Overlap, GR, no distance weighting • Work well for some morpho-phonological tasks • Rules of thumb: • Combine MVDM with bigger k • Combine distance weighting with bigger k • Very good bet: higher k, MVDM, GR, distance weighting • Especially for sentence and text level tasks

  35. Base phrase chunking • 211,727 training, 47,377 test examples • 22 classes • [He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only $ 1.8 billion]NP [in]PP [September]NP .

  36. Named entity recognition • 203,621 training, 46,435 test examples • 8 classes • [U.N.]organization official [Ekeus]person heads for [Baghdad]location

  37. Medical concept chunking • 428,502 training, 47,430 test examples • 24 classes • Bij [infantiel botulisme]disease kunnen in extreme gevallen [ademhalingsproblemen]symptom en [algehele lusteloosheid]symptom optreden. (“In infantile botulism, breathing problems and general listlessness can occur in extreme cases.”)

  38. Protein-related concept chunking • 458,593 training, 50,916 test examples • 51 classes • Most hybrids express both [KBF1]protein and [NF-kappa B]protein in their nuclei , but one hybrid expresses only [KBF1]protein .

  39. Results: feedback in MBT

  40. Results: stacking

  41. Results: trigram classes

  42. Numbers of trigram classes

  43. Error reductions

  44. Learning linguistic sequences Talk overview • How to map sequences to sequences, not output tokens? • Case studies: syntactic and semantic chunking • Discrete versus probabilistic classifiers • Constraint satisfaction inference • Discussion

  45.–46. Classification + inference (figure slides)

  47. Comparative study • Base discrete classifier: Maximum-entropy model (Zhang Le, maxent) • Extended with feedback, stacking, trigrams, combinations • Compared against • Conditional Markov Models (Ratnaparkhi, 1996) • Maximum-entropy Markov Models (McCallum, Freitag, and Pereira, 2000) • Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001) • On Medical & Protein chunking

  48. Maximum entropy • Probabilistic model: conditional distribution p(C|x) (= probability matrix between classes and values) with maximal entropy H(p) • Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible • Maximize entropy in matrix through iterative process: • IIS, GIS (Improved/Generalized Iterative Scaling) • L-BFGS • Discretized!
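
The conditional maximum-entropy model referred to here has the standard exponential form, with feature functions f_i(x, c) and weights \lambda_i estimated by GIS/IIS or L-BFGS:

    p(c \mid x) = \frac{1}{Z(x)} \exp\Bigl(\sum_i \lambda_i f_i(x, c)\Bigr),
    \qquad
    Z(x) = \sum_{c'} \exp\Bigl(\sum_i \lambda_i f_i(x, c')\Bigr)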

  49. Results: discrete Maxent variants

  50. Conditional Markov Models • Probabilistic analogue of Feedback • Processes from left to right • Each decision is a conditional probability that includes the previous classifications; decoding is limited by beam search • With beam=1, equal to Feedback • Can be trained with maximum entropy • E.g. MXPOST, Ratnaparkhi (1996)
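
Schematically (window size w and history length k are illustrative), a conditional Markov model factors the sequence probability into left-to-right local decisions, each conditioned on the local input context and on earlier classifications; decoding approximates the best sequence with a beam, and with a beam of 1 this reduces to the greedy feedback architecture:

    P(c_1 \ldots c_n \mid x_1 \ldots x_n) \approx
    \prod_{i=1}^{n} P(c_i \mid x_{i-w} \ldots x_{i+w},\, c_{i-k} \ldots c_{i-1})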
