Supervised Classification of Feature-based Instances

Simple Examples for Statistics-based Classification • Based on class-feature counts • Contingency table (rows: feature f present / absent; columns: class C / not C): f: a, b; ~f: c, d • We will see several examples of simple models based on these statistics

Prepositional-Phrase Attachment • Simplified version of Hindle & Rooth (1993) [MS 8.3] • Setting: V NP-chunk PP • Moscow sent soldiers into Afghanistan • ABC breached an agreement with XYZ • Motivation for the classification task: • Attachment is often a problem for (full) parsers • Augment shallow/chunk parsers

Relevant Probabilities • P(prep|n) vs. P(prep|v) • The probability of having the preposition prep attached to an occurrence of the noun n (the verb v). • Notice: a single feature for each class • Example: P(into|send) vs. P(into|soldier) • Decision measured by the likelihood ratio: $\lambda(v, n, \text{prep}) = \log_2 \frac{P(\text{prep} \mid v)}{P(\text{prep} \mid n)}$ • Positive/negative λ ⇒ verb/noun attachment

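Estimating Probabilities • Based on attachment counts from a training corpus • Maximum likelihood estimates: $P(\text{prep} \mid v) = \frac{C(v, \text{prep})}{C(v)}$, $P(\text{prep} \mid n) = \frac{C(n, \text{prep})}{C(n)}$, where C(x, prep) is the number of times prep attaches to x and C(x) is the number of occurrences of x • How to count from an unlabeled ambiguous corpus? (Circularity problem) • Some cases are unambiguous: • The road to London is long • Moscow sent him to Afghanistan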

Heuristic Bootstrapping and Ambiguous Counting • Produce initial estimates (model) by counting all unambiguous cases • Apply the initial model to all ambiguous cases; count each case under the resulting attachment if |λ| is greater than a threshold • E.g. |λ|>2, meaning one attachment is at least 4 times more likely than the other • Consider each remaining ambiguous case as a 0.5 count for each attachment. • Likely n-p and v-p pairs would “pop up” in the ambiguous counts, while incorrect attachments are likely to accumulate low counts
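The counting procedure above can be sketched as follows. This is a minimal illustration, not the Hindle & Rooth implementation: it takes λ to be the base-2 log of P(prep|v) / P(prep|n) (consistent with |λ|>2 meaning "4 times more likely"), and the function names, the `totals` dictionary of head frequencies, and the add-0.5 smoothing that keeps λ finite are all assumptions made for the example.

```python
from collections import Counter
from math import log2

def attachment_lambda(counts, totals, verb, noun, prep, eps=0.5):
    """Base-2 log likelihood ratio of verb vs. noun attachment.

    The add-eps smoothing is an assumption of this sketch; it only
    serves to avoid division by zero and log(0).
    """
    p_verb = (counts[(verb, prep)] + eps) / (totals[verb] + 2 * eps)
    p_noun = (counts[(noun, prep)] + eps) / (totals[noun] + 2 * eps)
    return log2(p_verb / p_noun)

def bootstrap_counts(unambiguous, ambiguous, totals, threshold=2.0):
    """unambiguous: list of (head, prep) attachments known for certain.
    ambiguous: list of (verb, noun, prep) triples.
    totals: head -> number of occurrences of that head in the corpus.
    Verbs and nouns are assumed to be distinct strings.
    """
    # Step 1: initial model from unambiguous cases only.
    initial = Counter(unambiguous)
    final = Counter(initial)
    # Step 2: score every ambiguous case with the initial model.
    for verb, noun, prep in ambiguous:
        lam = attachment_lambda(initial, totals, verb, noun, prep)
        if lam > threshold:            # confident verb attachment
            final[(verb, prep)] += 1
        elif lam < -threshold:         # confident noun attachment
            final[(noun, prep)] += 1
        else:                          # Step 3: split the count evenly
            final[(verb, prep)] += 0.5
            final[(noun, prep)] += 0.5
    return final
```

Scoring all ambiguous cases with the initial model (rather than with counts that change during the pass) keeps the result independent of the order in which the ambiguous cases are visited.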

Example Decision • Moscow sent soldiers into Afghanistan • Verb attachment is 70 times more likely

Hindle & Rooth Evaluation • H&R results for a somewhat richer model: • 80% correct if we always make a choice • 91.7% precision for 55.2% recall, when requiring |λ|>3 for classification. • Notice that the probability ratio doesn’t distinguish between decisions made based on high vs. low frequencies.

Possible Extensions • Consider a-priori structural preference for “low” attachment (to noun) • Consider lexical head of the PP: • I saw the bird with the telescope • I met the man with the telescope • Such additional factors can be incorporated easily, assuming their independence • Addressing more complex types of attachments, such as chains of several PP’s • Similar attachment ambiguities within noun compounds: [N [N N]] vs. [[N N] N]

Classify by Best Single Feature: Decision List • Training: for each feature, measure its “entailment score” for each class, and register the class with the highest score • Sort all features by decreasing score • Classification: for a given example, identify the highest entailment score among all “active” features, and select the appropriate class • Test all features for the class in decreasing score order, until first success → output the relevant class • Default decision: the majority class • For multiple classes per example: may apply a threshold on the feature-class entailment score • Suitable when relatively few strong features indicate class (compare to manually written rules)
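A minimal sketch of decision-list training and classification, under stated assumptions: the "entailment score" is taken here to be a smoothed estimate of P(class | feature), which is only one possible choice, and the names `train_decision_list`, `classify`, and the constant `alpha` are hypothetical.

```python
from collections import Counter, defaultdict

def train_decision_list(examples, alpha=0.1):
    """examples: iterable of (set_of_active_features, class_label).

    Returns (rules, default), where rules is a list of
    (score, feature, class) sorted by decreasing score.
    """
    class_counts = Counter()
    feature_class = defaultdict(Counter)
    for features, label in examples:
        class_counts[label] += 1
        for f in features:
            feature_class[f][label] += 1
    classes = sorted(class_counts)
    rules = []
    for f, counts in feature_class.items():
        total = sum(counts.values())
        best = max(classes, key=lambda c: counts[c])
        # Assumed score: add-alpha smoothed P(best class | feature).
        score = (counts[best] + alpha) / (total + alpha * len(classes))
        rules.append((score, f, best))
    rules.sort(key=lambda r: (-r[0], r[1]))
    default = class_counts.most_common(1)[0][0]  # majority class
    return rules, default

def classify(rules, default, features):
    """Return the class of the first (highest-scoring) active feature."""
    for _score, f, label in rules:
        if f in features:
            return label
    return default
```

With smoothing, a feature seen three times with one class outranks a feature seen once with one class, so the list prefers evidence backed by more data when the raw proportions tie.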

Example: Accent Restoration • (David Yarowsky, 1994): for French and Spanish • Classes: alternative accent restorations for words in text without accent marking • Example: côte (coast) vs. côté (side) • A variant of the general word sense disambiguation problem - “one sense per collocation” motivates using decision lists • Similar tasks: • Capitalization restoration in ALL-CAPS text • Homograph disambiguation in speech synthesis (wind as noun and verb)

Accent Restoration - Features • Word form collocation features: • Single words in window: ±1, ±k (20-50) • Word pairs at <-1,+1>, <-2,-1>, <+1,+2> (complex features) • Easy to implement

Accent Restoration - Features • Local syntactic-based features (for Spanish) • Use a morphological analyzer • Lemmatized features - generalizing over inflections • POS of adjacent words as features • Some word classes (primarily time terms, to help with tense ambiguity for unaccented words in Spanish)

Accent Restoration – Decision Score • Probabilities estimated from training statistics, taken from a corpus with accents • Smoothing - add small constant to all counts • Pruning: • Remove redundancies for efficiency: remove specific features that score lower than their generalization (domingo - WEEKDAY, w1w2 – w1) • Cross validation: remove features that cause more errors than correct classifications on held-out data

“Add-1/Add-Constant” Smoothing
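The formula for this slide did not survive in the text, so the following is a sketch of the standard add-constant estimate rather than the slide's own notation: each count is increased by a constant, and the total is increased by that constant times the number of possible outcomes so that the estimates still sum to 1. The function and parameter names are illustrative.

```python
def add_constant_prob(count, total, num_outcomes, constant=1.0):
    """Add-constant (add-1 when constant == 1) probability estimate.

    count: observed count of the outcome
    total: sum of observed counts over all outcomes
    num_outcomes: number of distinct possible outcomes
    """
    return (count + constant) / (total + constant * num_outcomes)
```

For example, with two outcomes and counts 3 and 0, add-0.5 smoothing gives 3.5/4 and 0.5/4 instead of the maximum likelihood estimates 1 and 0.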

Accent Restoration – Results • Agreement with accented test corpus for ambiguous words: 98% • Vs. 93% for baseline of most frequent form • Accented test corpus also includes errors • Worked well for most of the highly ambiguous cases (see random sample in next slide) • Results slightly better than Naive Bayes (weighing multiple features) • Consistent with related study on binary homograph disambiguation, where combining multiple features almost always agrees with using a single best feature • Incorporating many low-confidence features may introduce noise that would override the strong features

Accent Restoration – Tough Examples

Related Application: Anaphora Resolution • (Dagan, Justeson, Lappin, Leass, Ribak 1995) • The terrorist pulled the grenade from his pocket and threw it at the policeman • it = ? (grenade or pocket) • Traditional AI-style approach: manually encoded semantic preferences/constraints • [Diagram: concept hierarchies Weapon > Bombs > grenade and Actions > Cause_movement > throw, drop, linked by an <object – verb> relation]

Statistical Approach • “Semantic” judgment from a corpus (text collection): • <verb–object: throw-grenade> 20 times • <verb–object: throw-pocket> 1 time • Statistics can be acquired from unambiguous (non-anaphoric) occurrences in raw (English) corpus (cf. PP attachment) • Semantic confidence combined with syntactic preferences ⇒ it → grenade • “Language modeling” for disambiguation

Word Sense Disambiguation for Machine Translation • I bought soap bars: sense1 (‘chafisa’) or sense2 (‘sorag’)? • I bought window bars: sense1 (‘chafisa’) or sense2 (‘sorag’)? • Corpus (text collection): • Sense1: <noun-noun: soap-bar> 20 times, <noun-noun: chocolate-bar> 15 times • Sense2: <noun-noun: window-bar> 17 times, <noun-noun: iron-bar> 22 times • Features: co-occurrence within distinguished syntactic relations • “Hidden” senses – manual labeling required(?)

Solution: Mapping to Target Language • English(-English)-Hebrew dictionary: bar1 → ‘chafisa’, bar2 → ‘sorag’, soap → ‘sabon’, window → ‘chalon’ • Map ambiguous “relations” to the second language (all possibilities) and count them in a Hebrew corpus: • <noun-noun: soap-bar>: (1) <noun-noun: ‘chafisat-sabon’> 20 times; (2) <noun-noun: ‘sorag-sabon’> 0 times • <noun-noun: window-bar>: (1) <noun-noun: ‘chafisat-chalon’> 0 times; (2) <noun-noun: ‘sorag-chalon’> 15 times • Exploiting the difference in ambiguities between the two languages • Principle – intersecting redundancies (Dagan and Itai 1994)
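The mapping step can be sketched as below: generate every target-language alternative for a source relation from a bilingual dictionary, then look up how often each alternative occurs in target-language corpus statistics. The data structures and function names are assumptions for illustration, and the sketch works on lemmas, ignoring Hebrew construct-state morphology (‘chafisa’ vs. ‘chafisat’).

```python
from itertools import product

def candidate_translations(source_relation, dictionary):
    """source_relation: (relation_type, word1, word2).
    dictionary: source word -> list of alternative target words.
    Returns all target relations obtainable by translating both words.
    """
    rel_type, w1, w2 = source_relation
    return [(rel_type, t1, t2)
            for t1, t2 in product(dictionary[w1], dictionary[w2])]

def target_counts(source_relation, dictionary, corpus_counts):
    """Count each alternative target relation in the target corpus.
    corpus_counts: target relation -> frequency (missing means 0).
    """
    return {cand: corpus_counts.get(cand, 0)
            for cand in candidate_translations(source_relation, dictionary)}

# Toy data following the slide's example (counts as given there).
dictionary = {"bar": ["chafisa", "sorag"],
              "soap": ["sabon"],
              "window": ["chalon"]}
hebrew_counts = {("noun-noun", "sabon", "chafisa"): 20,
                 ("noun-noun", "chalon", "sorag"): 15}

soap_bar = target_counts(("noun-noun", "soap", "bar"),
                         dictionary, hebrew_counts)
window_bar = target_counts(("noun-noun", "window", "bar"),
                           dictionary, hebrew_counts)
```

Here `soap_bar` assigns 20 to the ‘chafisa’ alternative and 0 to ‘sorag’, while `window_bar` does the reverse, so no sense-labeled source corpus is needed.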

The Selection Model • Constructed to choose (classify) the right translation for a complete relation rather than for each individual word at a time • since both words in a relation might be ambiguous, having their translations dependent upon each other • Assuming a multinomial model, under certain linguistic assumptions • The multinomial variable: a source relation • Each alternative translation of the relation is a possible outcome of the variable

An Example Sentence • A Hebrew sentence with 3 ambiguous words: • The alternative translations to English:

Example - Relational Representation

Selection Model • We would like to use as a classification score the log of the odds ratio between the most probable relation i and all other alternatives (in particular, the second most probable one j): $\ln \frac{p_i}{p_j}$ • Estimation is based on smoothed counts • A potential problem: the odds ratio for probabilities doesn’t reflect the absolute counts from which the probabilities were estimated. • E.g., a count of 3 vs. (smoothed) 0 • Solution: using a one-sided confidence interval (lower bound) for the odds ratio

Confidence Interval (for a proportion) • Given an estimate, what is the confidence that the estimate is “correct”, or at least close enough to the true value?

Confidence Interval (cont.) • Approximating by normal distribution: the distribution of the sampled proportion (across samples) approaches a normal distribution for large n.
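As an illustration of the normal approximation, the sketch below computes a two-sided confidence interval for a proportion as p̂ ± z·sqrt(p̂(1−p̂)/n). This is the textbook (Wald) interval, offered as an assumption about what the slides derive, since their formulas are not in the text; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion.

    Only reasonable for large n, as the approximation requires.
    """
    p_hat = successes / n
    # Two-sided interval: put (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)
```

The half-width shrinks with the square root of n, so the same observed proportion gives a much tighter interval from 10,000 samples than from 100.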

Confidence Interval (cont.)

Selection Model (cont.) • The distribution of the log of the odds ratio (across samples) converges to normal distribution • Selection “confidence” score for a single relation - the lower bound for the log odds ratio: • The most probable translation i for the relation is selected if Conf(i), the lower bound for the log odds ratio, exceeds θ. • Notice roles of θ vs. α, and impact of n1, n2
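A sketch of such a confidence score, assuming the usual normal approximation in which ln(n1/n2) estimates the log odds ratio and its variance is estimated by 1/n1 + 1/n2, so that the one-sided lower bound is ln(n1/n2) − Z(1−α)·sqrt(1/n1 + 1/n2). The formula is stated here as an assumption because the slide's own expression is not in the text; replacing zero counts by 0.5 and the default α are likewise illustrative choices.

```python
from math import log, sqrt
from statistics import NormalDist

def selection_confidence(n1, n2, alpha=0.1, zero_count=0.5):
    """One-sided lower bound on ln(p1/p2) at confidence level 1 - alpha.

    n1, n2: counts of the most and second most frequent alternatives.
    Zero counts are replaced by zero_count (an assumed smoothing).
    """
    n1 = n1 if n1 > 0 else zero_count
    n2 = n2 if n2 > 0 else zero_count
    z = NormalDist().inv_cdf(1 - alpha)
    return log(n1 / n2) - z * sqrt(1 / n1 + 1 / n2)
```

Counts of 3 vs. (smoothed) 0 and counts of 30 vs. 5 have the same ratio of 6, but the bound is negative for the first pair and above 1 for the second: the threshold θ controls how large the ratio must be, α controls how much certainty is demanded, and larger n1, n2 narrow the interval.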

Handling Multiple Relations in a Sentence: Constraint Propagation • 1. Compute Conf(i) for each ambiguous source relation. • 2. Pick the source relation with highest Conf(i). If Conf(i) < θ, or if no source relations are left, then stop; otherwise, select word translations according to target relation i and remove the source relation from the list. • 3. Propagate the translation constraints: remove any target relation that contradicts the selections made; remove source relations that now become unambiguous. • 4. Go to step 2. • Notice similarity to the decision list algorithm
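The loop above can be sketched as follows. The data layout (each source relation maps to a list of candidate word assignments with target-corpus counts), the confidence function, the fixed quantile, and the default θ are all assumptions of this sketch; the source-word keys "V", "N", "M" stand in for Hebrew words.

```python
from math import log, sqrt

Z_ALPHA = 1.2816  # approx. standard normal quantile for alpha = 0.1 (assumed)

def confidence(candidates):
    """Lower bound on the log odds of the top two candidate counts."""
    counts = sorted((max(n, 0.5) for _a, n in candidates), reverse=True)
    n1, n2 = counts[0], counts[1]
    return log(n1 / n2) - Z_ALPHA * sqrt(1 / n1 + 1 / n2)

def select_translations(relations, theta=0.2):
    """relations: source relation -> list of (assignment, count), where
    assignment maps each source word to one target word.
    Returns the word translations selected with confidence >= theta.
    """
    # Only ambiguous relations (two or more alternatives) take part.
    remaining = {r: list(c) for r, c in relations.items() if len(c) > 1}
    chosen = {}
    while remaining:
        # Step 2: pick the most confident relation, or stop.
        best = max(remaining, key=lambda r: confidence(remaining[r]))
        if confidence(remaining[best]) < theta:
            break
        assignment, _count = max(remaining.pop(best), key=lambda c: c[1])
        chosen.update(assignment)
        # Step 3: drop contradicting alternatives and any relation
        # that is no longer ambiguous.
        for r in list(remaining):
            kept = [(a, n) for a, n in remaining[r]
                    if all(chosen.get(w, t) == t for w, t in a.items())]
            if len(kept) > 1:
                remaining[r] = kept
            else:
                del remaining[r]
    return chosen

relations = {
    "verb-object": [({"V": "sign", "N": "treaty"}, 40),
                    ({"V": "seal", "N": "treaty"}, 2),
                    ({"V": "sign", "N": "contract"}, 30),
                    ({"V": "seal", "N": "contract"}, 1)],
    "noun-noun": [({"M": "peace", "N": "treaty"}, 50),
                  ({"M": "peace", "N": "contract"}, 0)],
}
selected = select_translations(relations)
```

In this toy input the verb-object relation alone is too close to call (40 vs. 30), but once the noun-noun relation fixes "treaty", the remaining alternatives (40 vs. 2) pass the threshold.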

Selection Algorithm Example

Evaluation Results • Results - Hebrew→English translation: • Coverage: ~70% • Precision within coverage: ~90% • ~20% improvement over choosing most frequent translation (95% statistical confidence for an improvement relative to this common baseline)

Analysis • Correct selections capture: • Clear semantic preferences: sign/seal treaty • Lexical collocation usage: peace treaty/contract • No selection: • Mostly: no statistics for any alternative (data sparseness) • investigator/researcher of corruption • Also: similar statistics for several alternatives • Solutions: • Consult more features in remote (vs. syntactic) context: prime minister … take position/job • Class/similarity-based generalizations (corruption-crime)

Analysis (cont.) • Confusing multiple sources (senses) for the same target relation: • ‘sikkuy’ (chance/prospect) ‘kattan’ (small/young) • Valid (frequent) target relations: • small chance - correct • young prospect - incorrect: “young prospect” is the translation of another Hebrew expression, ‘tikva’ (hope) ‘zeira’ (young) • The “soundness” assumption of the multinomial model is violated: • The model assumes that counting the generated target relations corresponds to sampling the source relation, hence assuming a known 1:n mapping (also completeness – another source of errors) • Potential solutions: bilingual corpus, “reverse” translation

Sense Translation Model: Summary • Classification instance: a relation with multiple words, rather than a single word at a time, to capture immediate (“circular”) dependencies. • Make local decisions, based on a single feature • Taking into account statistical confidence of decisions • Constraint propagation for multiple dependent classifications (remote dependencies) • Decision list style rationale – classifying by a single piece of high-confidence evidence is simpler, and may work better, than considering all weaker evidence simultaneously • Computing statistical confidence for a combination of multiple events is difficult; easier to perform for each event at a time • Statistical classification scenario (model) constructed for the linguistic setting • Important to identify explicitly the underlying model assumptions, and to analyze the resulting errors

Word Sense Disambiguation • Many words have multiple meanings • E.g., river bank, financial bank • Problem: Assign proper sense to each ambiguous word in text • Applications: • Machine translation • Information retrieval (mixed evidence) • Semantic interpretation of text

Compare to POS Tagging? • Idea: Treat sense disambiguation like POS tagging, just with “semantic tags” • The problems differ: • POS tags depend on specific structural cues - mostly neighboring, and thus dependent, tags • Senses depend on semantic context – less structured, longer distance dependency ⇒ many relatively independent/unstructured features

Approaches • Supervised learning: learn from a pre-tagged corpus • Dictionary-based learning: learn to distinguish senses from dictionary entries • Unsupervised learning: automatically cluster word occurrences into different senses

Using an Aligned Bilingual Corpus • Goal: get sense tagging cheaply • Use correlations between phrases in two languages to disambiguate • E.g., interest = ‘legal share’ (acquire an interest) vs. ‘attention’ (show interest) • In German: Beteiligung erwerben vs. Interesse zeigen • For each occurrence of an ambiguous word, determine which sense applies according to the aligned translation • Limited to senses that are discriminated by the other language; suitable for disambiguation in translation • Gale, Church and Yarowsky (1992)

Evaluation • Train and test on pre-tagged (or bilingual) texts • Difficult to come by • Artificial data – cheap to train and test: ‘merge’ two words to form an ‘ambiguous’ word with two ‘senses’ • E.g., replace all occurrences of door and of window with doorwindow and see if the system figures out which is which • Useful to develop sense disambiguation methods
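The artificial-data idea can be sketched in a few lines: replace both words with their concatenation and keep the original word as the gold "sense" of each occurrence. The function name and the token-list input format are assumptions for illustration.

```python
def merge_pseudoword(tokens, word_a, word_b):
    """Replace every occurrence of word_a and word_b by a merged
    pseudo-word, remembering the original word as the gold label.

    Returns (new_tokens, gold), where gold is a list of
    (token_index, original_word) pairs.
    """
    merged = word_a + word_b
    new_tokens, gold = [], []
    for i, tok in enumerate(tokens):
        if tok in (word_a, word_b):
            new_tokens.append(merged)
            gold.append((i, tok))
        else:
            new_tokens.append(tok)
    return new_tokens, gold

tokens = "she opened the door and closed the window".split()
ambiguous_tokens, gold_labels = merge_pseudoword(tokens, "door", "window")
```

A disambiguation method is then trained and tested on `ambiguous_tokens`, with `gold_labels` supplying the answers at no annotation cost.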

Performance Bounds • How good is (say) 83.2%? • Evaluate performance relative to lower and upper bounds: • Baseline performance: how well does the simplest “reasonable” algorithm do? E.g., compare to selecting the most frequent sense • Human performance: what percentage of the time do people agree on classification? • Nature of the senses used impacts accuracy levels
