Learning in NLP: When can we reduce or avoid annotation cost?

Presentation Transcript


  1. Learning in NLP: When can we reduce or avoid annotation cost? Tutorial at RANLP 2003, Ido Dagan, Bar Ilan University, Israel

  2. Introduction • Motivations for learning in NLP • NLP requires huge amounts of diverse types of knowledge – learning makes knowledge acquisition more feasible, whether automatic or semi-automatic • Much of language behavior is preferential in nature, so both quantitative and qualitative knowledge must be acquired

  3. Introduction (cont.) • Apparently, empirical modeling obtains (so far) mostly a “first-order” approximation of linguistic behavior • Often, computationally more complex learning models improve results only to a modest extent • Often, several learning models obtain comparable results • Proper linguistic modeling seems crucial

  4. Information Units of Interest - Examples • Explicit units: • Documents • Lexical units: words, terms (surface/base form) • Implicit (hidden) units – human stipulation: • Word senses, name types • Document categories • Lexical syntactic units: part of speech tags • Syntactic relationships between words – parsing • Semantic concepts and relationships

  5. Tasks and Applications • Supervised/classification: identify hidden units (concepts) of explicit units • Syntactic analysis, word sense disambiguation, name classification, categorization, … • Unsupervised: identify relationships and properties of explicit units (terms, docs) • Association, topicality, similarity, clustering • Combinations

  6. Data and Representations • Frequencies of units • Co-occurrence frequencies • Between all relevant types of units (term-doc, term-term, term-category, sense-term, etc.) • Representations and modeling • Sequences • Feature sets/vectors

  7. Characteristics of Learning in NLP • Very high dimensionality • Sparseness of data and relevant modeling • Addressing the basic problems of language: • Ambiguity – of concepts and features • One way to say many things • Variability • Many ways to say the same thing

  8. Supervised Classification • Hidden concept is defined by a set of labeled training examples (category, sense) • Classification is based on entailment of the hidden concept by related elements/features • Example: two senses of “sentence”: • word, paragraph, description → Sense 1 • judge, court, lawyer → Sense 2 • Single or multiple concepts per example • Word sense vs. document categories

  9. Supervised Tasks and Features • Typical Classification Tasks: • Lexical: Word sense disambiguation, target word selection in translation, name-type classification, accent restoration, text categorization (notice task similarity) • Syntactic: POS tagging, PP-attachment, parsing • Hybrid: anaphora resolution, information extraction • Features (“feature engineering”): • Adjacent context: words, POS, … • In various relationships – distance, syntactic • possibly generalized to classes • Other: morphological, orthographic, syntactic

  10. Learning to Classify • Two possibilities for acquiring “entailment” relationships: • Manually: by an expert (“rules”) • time consuming, difficult – “expert system” approach • Automatically: concept is defined by a set of training examples • training quantity/quality • Training: learn entailment of concept by features of training examples (a model) • Classification: apply model to new examples

  11. Supervised Learning Scheme • Training: “Labeled” Examples → Training Algorithm → Classification Model • Classification: New Examples + Classification Model → Classification Algorithm → Classifications

  12. Learning Approaches • Model-based: define entailment relations and their strengths by training algorithm • Statistical/Probabilistic: model is composed of probabilities (scores) computed from training statistics • Iterative feedback/search (neural network): start from some model, classify training examples, and correct model according to feedback • Memory-based: no training algorithm and model - classify by matching to raw training (compare to unsupervised tasks)

  13. Motivation of Tutorial Theme: Reducing or Avoiding Manual Labeling • Basic supervised setting – requires large manually labeled training corpora • Annotation is often very expensive • Many results rely on standard training materials, which were assembled through dedicated projects and evaluation frameworks • Penn Treebank, Brown Corpus, Semcor, TREC, MUC and SenseEval evaluations, CoNLL shared tasks. • Limited applicability for settings not covered by the generic resources • Different languages, specialized domains, full scope of word senses, text categories, … • Severely hurts industrial applicability

  14. Tutorial Scope • Obtaining some (noisy) labeled data without manual annotation • Exploiting bilingual resources • Generalizations by unsupervised methods • Bootstrapping • Unsupervised clustering as an alternative to supervised classes • Expectation-Maximization (EM) for detecting underlying structures/concepts • Selective sampling • These approaches are demonstrated for basic statistical and probabilistic learning models • Some of these approaches might be perceived as unsupervised learning, though they actually address supervised tasks of identifying externally imposed classes (“unsupervised” training)

  15. Sources • Major literature sources: • Foundations of Statistical Natural Language Processing, by Manning & Schutze, MIT Press, 2000 (2nd printing with corrections) • Articles (see bibliography) • Additional slide credits: • Prof. Shlomo Argamon, Chicago

  16. Evaluation • Evaluation mostly based on (subjective) human judgment of relevancy/correctness • In some cases – task is objective (e.g. OCR), or evaluate by applying mathematical criteria (likelihood) • Basic measure for classification – accuracy • Cross validation – different training/test splits • In many tasks (extraction, multiple class per-instance, …) most instances are “negative”; hence using recall/precision measures, following information retrieval (IR) tradition

  17. Evaluation: Recall/Precision • Recall: #correct extracted/total correct • Precision: #correct extracted/total extracted • Recall/precision curve – obtained by varying the number of extracted items, assuming the items are sorted by decreasing score (precision on the y-axis against recall on the x-axis, both ranging from 0 to 1)
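
A minimal sketch of how such a curve can be computed, assuming the extracted items are already scored and the set of correct items is known (function and variable names are illustrative, not from the tutorial):

    def precision_recall_curve(scored_items, gold):
        """scored_items: list of (item, score) pairs; gold: set of correct items."""
        ranked = sorted(scored_items, key=lambda x: x[1], reverse=True)
        points, correct = [], 0
        for k, (item, _) in enumerate(ranked, start=1):
            if item in gold:
                correct += 1
            points.append((correct / len(gold), correct / k))   # (recall, precision)
        return points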

  18. Simple Examples for Statistics-based Classification • Based on class-feature counts – from labeled data • Contingency table (feature f vs. class C):

            C    ~C
      f     a    b
      ~f    c    d

  • We will see several examples of simple models based on these statistics
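
As a rough illustration only (not from the tutorial), the four cells can be accumulated from labeled examples, where each example is a (feature set, label) pair:

    from collections import Counter

    def contingency_counts(examples, feature, cls):
        """Count the a/b/c/d cells for one feature and one class."""
        counts = Counter()
        for features, label in examples:
            has_f, is_c = feature in features, label == cls
            if has_f and is_c:
                counts['a'] += 1
            elif has_f:
                counts['b'] += 1
            elif is_c:
                counts['c'] += 1
            else:
                counts['d'] += 1
        return counts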

  19. Prepositional-Phrase Attachment • Simplified version of Hindle & Rooth (1993) [MS 8.3] • Setting: V NP-chunk PP • Moscow sent soldiers into Afghanistan • ABC breached an agreement with XYZ • Motivation for the classification task: • Attachment is often a problem for (full) parsers • Augment shallow/chunk parsers

  20. Relevant Probabilities • P(prep|n) vs. P(prep|v) • The probability of having the preposition prep attached to an occurrence of the noun n (the verb v). • Notice: a single feature for each class • Example: P(into|send) vs. P(into|soldier) • Decision measured by the likelihood ratio: λ = log2 [ P(prep|v) / P(prep|n) ] • Positive/negative λ → verb/noun attachment
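
A minimal sketch of this decision rule, assuming the two conditional probabilities have already been estimated (names are illustrative):

    import math

    def pp_attachment(p_prep_given_v, p_prep_given_n):
        """Return the attachment decision and the log-2 likelihood ratio."""
        lam = math.log2(p_prep_given_v / p_prep_given_n)
        return ('verb' if lam > 0 else 'noun'), lam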

  21. Estimating Probabilities • Based on attachment counts from a training corpus • Maximum likelihood estimates: P(prep|v) = C(v, prep) / C(v) and P(prep|n) = C(n, prep) / C(n) • How to count from an unlabeled ambiguous corpus? (Circularity problem) • Some cases are unambiguous: • The road to London is long • Moscow sent him to Afghanistan

  22. Heuristic Bootstrapping and Ambiguous Counting • Produce initial estimates (model) by counting all unambiguous cases • Apply the initial model to all ambiguous cases; count each case under the resulting attachment if |λ| is greater than a threshold • E.g. |λ|>2, meaning one attachment is at least 4 times more likely than the other • Consider each remaining ambiguous case as a 0.5 count for each attachment. • Likely n-p and v-p pairs would “pop up” in the ambiguous counts, while incorrect attachments are likely to accumulate low counts
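
A rough sketch of this counting procedure under assumed data layouts (a simplification for illustration, not the original implementation):

    import math
    from collections import defaultdict

    def bootstrap_counts(unambiguous, ambiguous, threshold=2.0):
        """unambiguous: (verb, noun, prep, site) tuples, site in {'verb', 'noun'};
           ambiguous:   (verb, noun, prep) tuples.  Returns attachment/occurrence counts."""
        c_vp, c_np = defaultdict(float), defaultdict(float)   # attachment counts
        c_v, c_n = defaultdict(float), defaultdict(float)     # occurrence counts

        for v, n, p, _ in unambiguous:
            c_v[v] += 1; c_n[n] += 1
        for v, n, p in ambiguous:
            c_v[v] += 1; c_n[n] += 1

        # Initial model: attachment counts from unambiguous cases only.
        for v, n, p, site in unambiguous:
            if site == 'verb':
                c_vp[(v, p)] += 1
            else:
                c_np[(n, p)] += 1

        def lam(v, n, p):
            # Log-2 likelihood ratio, with a small add-constant to avoid zeros (illustrative).
            p_v = (c_vp[(v, p)] + 0.5) / (c_v[v] + 1.0)
            p_n = (c_np[(n, p)] + 0.5) / (c_n[n] + 1.0)
            return math.log2(p_v / p_n)

        # Count ambiguous cases under the preferred attachment when |lambda| > threshold.
        undecided = []
        for v, n, p in ambiguous:
            score = lam(v, n, p)
            if score > threshold:
                c_vp[(v, p)] += 1
            elif score < -threshold:
                c_np[(n, p)] += 1
            else:
                undecided.append((v, n, p))

        # Remaining ambiguous cases contribute 0.5 to each attachment.
        for v, n, p in undecided:
            c_vp[(v, p)] += 0.5
            c_np[(n, p)] += 0.5

        return c_vp, c_np, c_v, c_n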

  23. Example Decision • Moscow sent soldiers into Afghanistan • Verb attachment is 70 times more likely
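
The underlying numbers are not preserved in the transcript; with purely hypothetical probability estimates, chosen only to illustrate the arithmetic, the decision would look like:

    \lambda = \log_2 \frac{P(\text{into} \mid \text{send})}{P(\text{into} \mid \text{soldiers})}
            = \log_2 \frac{0.35}{0.005} = \log_2 70 \approx 6.1

i.e. a positive λ, so verb attachment is chosen, being 70 times more likely under these estimates.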

  24. Hindle & Rooth Evaluation • H&R results for a somewhat richer model: • 80% correct if we always make a choice • 91.7% precision for 55.2% recall, when requiring |λ|>3 for classification. • Notice that the probability ratio doesn’t distinguish between decisions made based on high vs. low frequencies.

  25. Possible Extensions • Consider a-priori structural preference for “low” attachment (to noun) • Consider lexical head of the PP: • I saw the bird with the telescope • I met the man with the telescope • Such additional factors can be incorporated easily, assuming their independence • Addressing more complex types of attachments, such as chains of several PP’s • Similar attachment ambiguities within noun compounds: [N [N N]] vs. [[N N] N]

  26. Classify by Best Single Feature: Decision List • Training: for each feature, measure its “entailment score” for each class, and register the class with the highest score • Sort all features by decreasing score • Classification: for a given example, identify the highest entailment score among all “active” features, and select the appropriate class • Test all features for the class in decreasing score order, until first success → output the relevant class • Default decision: the majority class • For multiple classes per example: may apply a threshold on the feature-class entailment score • Suitable when relatively few strong features indicate the class (compare to manually written rules)
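
A compact sketch of decision-list training and classification along these lines; the scoring function and data layout are illustrative assumptions (Yarowsky uses a smoothed log-likelihood ratio as the score):

    def train_decision_list(examples, score):
        """examples: list of (feature_set, label) pairs; score(feature, label) -> strength.
           Returns (feature, label, strength) rules sorted by decreasing strength."""
        features = {f for feats, _ in examples for f in feats}
        labels = {label for _, label in examples}
        rules = []
        for f in features:
            best = max(labels, key=lambda lab: score(f, lab))   # register the best class per feature
            rules.append((f, best, score(f, best)))
        return sorted(rules, key=lambda r: r[2], reverse=True)

    def classify(decision_list, feats, default_label):
        for f, label, _ in decision_list:   # features in decreasing score order
            if f in feats:                  # first "active" feature wins
                return label
        return default_label                # fall back to the majority class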

  27. Example: Accent Restoration • (David Yarowsky, 1994): for French and Spanish • Classes: alternative accent restorations for words in text without accent marking • Labeled training generated from accented texts • Example: côte (coast) vs. côté (side) • A variant of the general word sense disambiguation problem - “one sense per collocation” motivates using decision lists • Similar tasks (with available training): • Capitalization restoration in ALL-CAPS text • Homograph disambiguation in speech synthesis (wind as noun and verb)

  28. Accent Restoration - Features • Word-form collocation features: • Single words in window: ±1, ±k (20-50) • Word pairs at <-1,+1>, <-2,-1>, <+1,+2> (complex features) • Easy to implement

  29. Accent Restoration - Features • Local syntax-based features (for Spanish) • Use a morphological analyzer • Lemmatized features - generalizing over inflections • POS of adjacent words as features • Some word classes (primarily time terms, to help with tense ambiguity for unaccented words in Spanish)

  30. Accent Restoration – Decision Score • Probabilities estimated from training statistics, taken from a corpus with accents • Smoothing - add a small constant to all counts • Pruning: • Remove redundancies for efficiency: remove specific features that score lower than their generalization (domingo - WEEKDAY, w1w2 – w1) • Cross validation: remove features that cause more errors than correct classifications on held-out data

  31. Probabilistic Estimation - Smoothing • Counts are obtained from a sample of the probability space • Maximum Likelihood Estimate proportional to sample counts: the MLE estimate assigns 0 probability to unobserved events • Smoothing discounts observed events, leaving probability “mass” to unobserved events: a discounted estimate for observed events, a positive estimate for unobserved events
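
The formulas on this slide did not survive the transcript; in standard notation (a reconstruction, with c(x) the sample count of event x and N the sample size):

    P_{MLE}(x) = \frac{c(x)}{N} \quad (= 0 \text{ for unobserved } x), \qquad
    \hat{P}(x) < P_{MLE}(x) \text{ for observed } x, \quad
    \hat{P}(x) > 0 \text{ for unobserved } x, \quad
    \sum_x \hat{P}(x) = 1

where \hat{P} is the smoothed (discounted) estimate.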

  32. “Add-1/Add-Constant” Smoothing
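
The slide body was not preserved; the standard add-constant (Lidstone) estimator it refers to is P(x) = (c(x) + λ) / (N + λB), where B is the number of possible events and λ = 1 gives add-1 (Laplace) smoothing. A minimal sketch:

    def add_lambda_prob(count, total, num_bins, lam=1.0):
        """Add-constant smoothed estimate: (c(x) + lambda) / (N + lambda * B)."""
        return (count + lam) / (total + lam * num_bins)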

  33. Accent Restoration – Results • Agreement with accented test corpus for ambiguous words: 98% • Vs. 93% for the baseline of the most frequent form • The accented test corpus also includes errors • Worked well for most of the highly ambiguous cases (see random sample in next slide) • Results slightly better than Naive Bayes (weighting multiple features) • Consistent with a related study on binary homograph disambiguation, where combining multiple features almost always agrees with using the single best feature • Incorporating many low-confidence features may introduce noise that would override the strong features

  34. Accent Restoration – Tough Examples

  35. Related Application: Anaphora Resolution • (Dagan, Justeson, Lappin, Leass, Ribak 1995) • Example: “The terrorist pulled the grenade from his pocket and threw it at the policeman” – what does “it” refer to? • Traditional AI-style approach: manually encoded semantic preferences/constraints, e.g. Actions → Cause_movement → {throw, drop} and Weapon → Bombs → {grenade}, linked by an <object – verb> constraint

  36. Statistical Approach • Statistics can be acquired from unambiguous (non-anaphoric) occurrences in a raw (English) corpus (cf. PP attachment) • “Semantic” judgment from corpus (text collection) counts: <verb–object: throw-grenade> 20 times, <verb–object: throw-pocket> 1 time ⇒ it → grenade • Semantic confidence combined with syntactic preferences • “Language modeling” for disambiguation

  37. Word Sense Disambiguation • Many words have multiple meanings • E.g., river bank vs. financial bank • Problem: Assign the proper sense to each ambiguous word in text • Applications: • Machine translation • Information retrieval (mixed evidence) • Semantic interpretation of text

  38. Approaches • Supervised learning: learn from a pre-tagged corpus (Semcor, SenseEval) • all sense-occurrences are hidden – vs. PP and anaphora • Bilingual-based methods: obtain sense labels by mapping to another language • Dictionary-based learning: learn to distinguish senses based on dictionary entries • Unsupervised learning: automatically cluster word occurrences into different senses

  39. Using an Aligned Bilingual Corpus • Goal: get sense tagging cheaply • Use correlations between phrases in two languages to disambiguate • E.g., interest = ‘legal share’ (acquire an interest – German: Beteiligung erwerben) vs. ‘attention’ (show interest – Interesse zeigen) • For each occurrence of an ambiguous word, determine which sense applies according to the aligned translation • Limited to senses that are discriminated by the other language; suitable for disambiguation in translation • Gale, Church and Yarowsky (1992) – Bayesian model
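
A toy sketch of the labeling step, assuming word-aligned sentence pairs and a hand-made map from target-language translations to sense labels (both are illustrative assumptions, not the details of the Gale, Church and Yarowsky model):

    SENSE_OF_TRANSLATION = {            # illustrative mapping for "interest"
        'Beteiligung': 'legal_share',
        'Interesse':   'attention',
    }

    def label_occurrences(aligned_pairs, source_word):
        """aligned_pairs: (source_tokens, target_tokens, alignment) triples, where
           alignment is a dict from source positions to target positions."""
        labeled = []
        for src, tgt, align in aligned_pairs:
            for i, tok in enumerate(src):
                if tok == source_word and i in align:
                    sense = SENSE_OF_TRANSLATION.get(tgt[align[i]])
                    if sense is not None:
                        labeled.append((src, i, sense))   # a (noisily) sense-tagged example
        return labeled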

  40. Evaluation Settings • Train and test on pre-tagged (or bilingual) texts • Difficult to come by • Artificial data – pseudo-senses – cheap to train and test: ‘merge’ two words to form an ‘ambiguous’ word with two ‘senses’ • E.g., replace all occurrences of door and of window with doorwindow and see if the system figures out which is which • Useful for developing sense disambiguation methods

  41. Performance Bounds • How good is (say) 83.2%?? • Evaluate performance relative to lower and upper bounds: • Baseline performance: how well does the simplest “reasonable” algorithm do? E.g., compare to selecting the most frequent sense • Human performance: what percentage of the time do people agree on classification? • Nature of the senses used impacts accuracy levels

  42. Word Sense Disambiguation for Machine Translation • Example: “I bought soap bars” vs. “I bought window bars” – is bar Sense 1 (‘chafisa’) or Sense 2 (‘sorag’)? • Corpus (text collection) counts: Sense 1: <noun-noun: soap-bar> 20 times, <noun-noun: chocolate-bar> 15 times; Sense 2: <noun-noun: window-bar> 17 times, <noun-noun: iron-bar> 22 times • Features: co-occurrence within distinguished syntactic relations • “Hidden” senses – manual labeling required(?)

  43. Solution: Dictionary-based Mapping to Target Language • English(-English)-Hebrew dictionary: bar1 → ‘chafisa’, bar2 → ‘sorag’, soap → ‘sabon’, window → ‘chalon’ • Map ambiguous “relations” to the second language (all possibilities) and count in a Hebrew corpus: <noun-noun: soap-bar> → (1) <noun-noun: ‘chafisat-sabon’> 20 times, (2) <noun-noun: ‘sorag-sabon’> 0 times; <noun-noun: window-bar> → (1) <noun-noun: ‘chafisat-chalon’> 0 times, (2) <noun-noun: ‘sorag-chalon’> 15 times • Exploiting differences in ambiguity • Principle – intersecting redundancies (Dagan and Itai 1994)
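
A toy sketch of this mapping-and-counting step, with a made-up dictionary layout and a precomputed table of target-corpus relation counts (both are illustrative assumptions):

    from itertools import product

    DICT = {                                  # illustrative English -> Hebrew alternatives
        'bar':    ['chafisa', 'sorag'],
        'soap':   ['sabon'],
        'window': ['chalon'],
    }

    def target_alternatives(rel, w1, w2):
        """All possible target-language instantiations of a source relation."""
        return [(rel, t1, t2) for t1, t2 in product(DICT[w1], DICT[w2])]

    def best_translation(rel, w1, w2, target_counts):
        """Pick the alternative with the highest count in the target-language corpus."""
        return max(target_alternatives(rel, w1, w2),
                   key=lambda alt: target_counts.get(alt, 0))

    # Usage with the counts on the slide:
    counts = {('noun-noun', 'sabon', 'chafisa'): 20, ('noun-noun', 'sabon', 'sorag'): 0}
    # best_translation('noun-noun', 'soap', 'bar', counts)
    #   -> ('noun-noun', 'sabon', 'chafisa')   i.e. bar = 'chafisa' here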

  44. The Selection Model • Constructed to choose (classify) the right translation for a complete relation rather than for each individual word at a time • since both words in a relation might be ambiguous, having their translations dependent upon each other • Assuming a multinomial model, under certain linguistic assumptions • The multinomial variable: a source relation • Each alternative translation of the relation is a possible outcome of the variable

  45. An Example Sentence • A Hebrew sentence with 3 ambiguous words: • The alternative translations to English:

  46. Example - Relational Representation

  47. Selection Model • We would like to use as a classification score the log of the odds ratio between the most probable relation i and all other alternatives (in particular, the second most probable one j): • Estimation is based on smoothed counts • A potential problem: the odds ratio for probabilities doesn’t reflect the absolute counts from which the probabilities were estimated. • E.g., a count of 3 vs. (smoothed) 0 • Solution: using a one sided confidence interval (lower bound) for the odds ratio
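
The score formula after the colon is missing from the transcript; in the spirit of the description (a reconstruction, not necessarily the exact notation of Dagan and Itai), with \hat{p}_i and \hat{p}_j the estimated (smoothed) probabilities of the most and second-most probable target relation:

    \text{score} = \ln \frac{\hat{p}_i}{\hat{p}_j}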

  48. Selection Model (cont.) • The distribution of the log of the odds ratio (across samples) converges to a normal distribution • Selection “confidence” score for a single relation - the lower bound for the odds-ratio: • The most probable translation i for the relation is selected if Conf(i), the lower bound for the log odds ratio, exceeds θ. • Notice the roles of θ vs. α, and the impact of n1, n2
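
The bound itself did not survive the transcript. A plausible reconstruction, assuming the usual normal approximation for a log odds ratio estimated from counts n1 and n2 of the two leading alternatives (this specific form is an assumption, not confirmed by the slides):

    \mathrm{Conf}(i) = \ln \frac{n_1}{n_2} \; - \; Z_{1-\alpha} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

with translation i selected when Conf(i) ≥ θ; a larger α (smaller Z) or larger counts n1, n2 bring the bound closer to the point estimate.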

  49. Handling Multiple Relations in a Sentence: Constraint Propagation • (1) Compute Conf(i) for each ambiguous source relation. • (2) Pick the source relation with the highest Conf(i). If Conf(i) < θ, or if no source relations are left, then stop; otherwise, select word translations according to target relation i and remove the source relation from the list. • (3) Propagate the translation constraints: remove any target relation that contradicts the selections made; remove source relations that now become unambiguous. • (4) Go to step (2). • Notice the similarity to the decision list algorithm
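
A schematic sketch of this greedy loop; the data layout and the conf() helper (implementing the Conf(i) bound of the previous slides) are assumptions for illustration:

    def select_translations(relations, conf, theta):
        """relations: list of source relations; each relation is a list of alternative
           target relations, where an alternative is a dict {source_word: translation}.
           conf(relation) -> (best_alternative, confidence)  -- assumed helper."""
        chosen = {}                                  # committed word translations
        remaining = [list(r) for r in relations]
        while remaining:
            # Steps (1)-(2): pick the source relation with the highest confidence.
            scored = [(alts, conf(alts)) for alts in remaining]
            alts, (best, c) = max(scored, key=lambda s: s[1][1])
            if c < theta:
                break                                # stop: nothing confident enough remains
            chosen.update(best)                      # select word translations of target relation i
            remaining.remove(alts)
            # Step (3): drop target alternatives that contradict the selections made,
            # and drop source relations that have now become unambiguous.
            pruned = []
            for r in remaining:
                consistent = [a for a in r
                              if all(chosen.get(w, t) == t for w, t in a.items())]
                if len(consistent) > 1:
                    pruned.append(consistent)
            remaining = pruned
        return chosen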

  50. Selection Algorithm Example
