
LING / C SC 439/539 Statistical Natural Language Processing


Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 16 • 3/6/2013

  2. Recommended reading • Brill (1995): transformation-based learning for POS tagging • Jurafsky & Martin • 22.1 Named Entity Recognition • 5.7 Evaluation and error analysis, in the context of POS tagging • 13.5.2-3 Machine learning approaches to chunking, Chunking-system evaluations • Bikel et al. (1999): named entity recognition • CoNLL 2003 website and results

  3. Outline • Transformation-based learning • Comparison of HMMs and TBL • Named entity recognition and chunk classification • Phrase chunking • Evaluating chunks: Precision, Recall, and F-measure • Multiclass classification • CoNLL 2003 shared task on NER

  4. Transformation-based learning (TBL) • Eric Brill, 1992-1995 • Benefits of both rule-based and statistical systems: • Automatic training from an annotated corpus • Easy-to-understand symbolic rules • Performs competitively with statistical systems (such as HMMs) on POS tagging • (But it has not been applied much to problems other than POS tagging) • [Photo: Eric Brill — http://ccl.pku.edu.cn/doubtfire/NLP/Lexical_Analysis/Word_Segmentation_Tagging/Eric_Brill/Eric%20Brill.files/brill_bw.gif]

  5. Software • fnTBL • http://www.cs.jhu.edu/~rflorian/fntbl/index.html • NLTK demo • http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html

  6. General idea of TBL The process of Brill tagging is usually explained by analogy with painting. Suppose we were painting a tree, with all its details of boughs, branches, twigs and leaves, against a uniform sky-blue background. Instead of painting the tree first then trying to paint blue in the gaps, it is simpler to paint the whole canvas blue, then "correct" the tree section by over-painting the blue background. In the same fashion we might paint the trunk a uniform brown before going back to over-paint further details with even finer brushes. Brill tagging uses the same idea: begin with broad brush strokes then fix up the details, with successively finer changes. (NLTK book, section 5.6)

  7. Example sentence and rules • The President said he will ask Congress to increase grants to states for vocational rehabilitation • Consider these two rules: • Rule 1: replace NN with VB when the previous tag is TO • Rule 2: replace TO with IN when the next tag is NNS

  8. Tagging process • Begin with an initial tagger that assigns tags to all words • Apply transformation rules in order, to fix mistakes made in previous steps • Rule 1: replace NN with VB when the previous tag is TO • Rule 2: replace TO with IN when the next tag is NNS
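
Below is a minimal Python sketch (illustrative only, not Brill's code) of this process on the example sentence. The initial tags are an assumed most-frequent-tag assignment (e.g. the lexicon is taken to give NN for "increase"):

    # Applying ordered patch rules to an (assumed) initial tagging.
    def apply_rule(tags, old, new, triggered):
        """Change tag `old` to `new` at every position i where triggered(tags, i) holds."""
        return [new if t == old and triggered(tags, i) else t
                for i, t in enumerate(tags)]

    words = ("The President said he will ask Congress to increase "
             "grants to states for vocational rehabilitation").split()
    tags = ["DT", "NNP", "VBD", "PRP", "MD", "VB", "NNP", "TO", "NN",
            "NNS", "TO", "NNS", "IN", "JJ", "NN"]          # initial (lexicon) tags

    # Rule 1: replace NN with VB when the previous tag is TO
    tags = apply_rule(tags, "NN", "VB", lambda t, i: i > 0 and t[i-1] == "TO")
    # Rule 2: replace TO with IN when the next tag is NNS
    tags = apply_rule(tags, "TO", "IN", lambda t, i: i + 1 < len(t) and t[i+1] == "NNS")

    print(list(zip(words, tags)))   # "increase" is now VB; the "to" before "states" is now IN

Note that the rules are applied in the order they were learned, and each rule sees the output of the previous one: after Rule 1 retags "increase" as VB, Rule 2 only changes the "to" that still precedes a plural noun.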

  9. Basic system design for training • [Diagram of the training architecture; visible label: INITIAL TAGGER]

  10. Initial state tagger • Built by hand: • Proper noun finder, based on capitalization, etc. • Acquired from training corpus: • A lexicon • A list of all tags seen for each word, with one tag identified as most probable • (suffix trigram, POS) pair counts, used to process unknown words

  11. Tagging at the initial state • Tagging procedure • Tag each word with its most probable tag • Unknown words are tagged according to their final (suffix) letter trigram • Proper noun finder labels names • Initial performance: • Trained on 90% of the Brown Corpus • 92% correct • (Recall that the baseline of picking the most common tag per word gives 91% correct)
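
A minimal sketch of the lexicon component acquired from the training corpus (most-probable-tag lookup). The proper-noun finder and the suffix-trigram handling of unknown words are omitted here, and the default tag is an assumption for illustration:

    from collections import Counter, defaultdict

    def build_lexicon(tagged_corpus):
        """tagged_corpus: iterable of (word, tag) pairs from the training data.
        Returns a dict mapping each word to its most frequent tag."""
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word.lower()][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def initial_tag(words, lexicon, default="NN"):
        """Tag each known word with its most probable tag; unknown words get a
        default here (Brill's tagger instead uses suffix-trigram statistics)."""
        return [lexicon.get(w.lower(), default) for w in words]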

  12. TBL in more detail • In subsequent iterations, the tagger acquires patch rules to improve performance. • Patch rules correct the results of previous rules.

  13. Patch rule acquisition algorithm • While there are useful rules to be found: • Find the rule whose application results in the greatest reduction in error rate • Apply the rule to the tagged corpus and append it to the ordered list of patch rules
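
A simplified sketch of this greedy loop (not Brill's actual implementation). Here candidate_rules is a hypothetical generator of rules instantiated from the templates, and each rule is assumed to have an apply(tags) method returning revised tags:

    def errors(tags, gold_tags):
        return sum(t != g for t, g in zip(tags, gold_tags))

    def learn_patch_rules(corpus_tags, gold_tags, candidate_rules, min_gain=1):
        rules = []
        while True:
            base = errors(corpus_tags, gold_tags)
            # Score every candidate rule by how much it reduces the error count
            best_rule, best_gain = None, 0
            for rule in candidate_rules(corpus_tags, gold_tags):
                gain = base - errors(rule.apply(corpus_tags), gold_tags)
                if gain > best_gain:
                    best_rule, best_gain = rule, gain
            if best_rule is None or best_gain < min_gain:
                break                                       # no useful rule left
            corpus_tags = best_rule.apply(corpus_tags)      # commit the best rule
            rules.append(best_rule)
        return rules                                        # ordered list of patch rules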

  14. Patch rule templates • Patch rule templates are of the form: • Change POS tag A to B in triggering context C: A → B / C • The range of triggering contexts: tags in a window from ti-3 to ti+3 around the current tag ti • [Table: nine template schemas, each marking with * which context position(s) in that window it conditions on]

  15. Some possible patch templates Change tag A to tag B when: • The preceding (or following) word • is tagged Z • is the word W • One of the two preceding (or following) words • is tagged Z • is the word W • The preceding word is tagged Z and the following word is W. • The current word is (or is not) capitalized.
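
Templates of this kind can be written with NLTK's Brill-tagger module; the following is a usage sketch assuming the NLTK 3.x API (nltk.tag.brill and nltk.tag.brill_trainer), not the original Brill (1995) system:

    from nltk.tag.brill import Template, Pos, Word
    from nltk.tag.brill_trainer import BrillTaggerTrainer
    from nltk.tag import UnigramTagger
    from nltk.corpus import brown

    templates = [
        Template(Pos([-1])),             # preceding word is tagged Z
        Template(Pos([1])),              # following word is tagged Z
        Template(Pos([-2, -1])),         # one of the two preceding words is tagged Z
        Template(Word([-1])),            # preceding word is the word W
        Template(Pos([-1]), Word([1])),  # preceding tag is Z and following word is W
    ]

    train = brown.tagged_sents()[:2000]              # small training set for the demo
    initial = UnigramTagger(train)                   # most-frequent-tag initial tagger
    trainer = BrillTaggerTrainer(initial, templates, trace=0)
    tagger = trainer.train(train, max_rules=20)      # learn up to 20 patch rules
    print(tagger.rules()[:5])                        # inspect the first learned rules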

  16. TBL algorithm again

  17. Rule learning is greedy • Select the rule that maximizes the decrease in error at the next state of the tagger (from Günter Neumann, LT-Lab DFKI)

  18. First ten patch rules found in training, using original Brown Corpus tag set (different from Penn Treebank tags)

  19. Tagging new text • To tag new text: • Use initial tagger. • Apply each patch rule in order to the entire corpus.

  20. (Optional) Why does TBL work? • Combines two of the earliest good ideas in AI • General Problem Solver (Newell, Shaw, Simon 1958) • STRIPS MACROPS (Fikes, Nilsson, 1971)

  21. (Optional) Problem Spaces & Problem Solving • Problem Space: Operators and Goals • Operators: The set of legal moves • Goal: Ultimate solution to the problem • Problem Solving: • Involves a sequence of operations • Moves one makes using the available operators or legal moves • Subgoal decomposition (a subgoal is an intermediate goal along the route to the eventual solution of the problem) • Break the problem down into a series of subgoals, the achievement of which is necessary to achieve the overall goal

  22. (Optional) The General Problem Solver • Newell, Shaw and Simon (1958) • A computer simulation based on the Means-End Analysis heuristic: • Solving problems by repeatedly determining the difference between the current state and the goal or subgoal state, then finding an operator that reduces this difference. • Uses Productions: If Context then Operation.

  23. (Optional) An Example from Algebra • How can we represent the knowledge underlying the solution of a problem solving skill like solving for an algebraic unknown? • What do we know that allows us to solve equations such as 2(3x - 11) = 3x + 8 ? • Consider a GPS-like problem space representation -- goals, operators and states
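
As a worked illustration (not part of the original slides), the solution can be written as a sequence of operator applications, each chosen to reduce the difference between the current state and the goal state of an isolated x:

    \begin{aligned}
    2(3x - 11) &= 3x + 8  &&\text{initial state} \\
    6x - 22    &= 3x + 8  &&\text{operator: distribute} \\
    3x - 22    &= 8       &&\text{operator: subtract } 3x \text{ from both sides} \\
    3x         &= 30      &&\text{operator: add } 22 \text{ to both sides} \\
    x          &= 10      &&\text{operator: divide both sides by } 3
    \end{aligned}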

  24. (Optional) Macrops and STRIPS • STRIPS is an implementation of GPS that notes that sequences of operators are often reusable • These sequences of operators can be viewed as one operator • [Diagram: Initial state → Operator 1 → Operator 2 → Final state, with the composed sequence labeled Macrop 1]

  25. (Optional) TBL as early AI • Templates are just GPS operators • Error metric is just greedily reducing difference between current state & goal state • Learner: Solves POS tagging given training data by finding sequence of GPS operators • New material: Take the learned sequence as one giant MACROP, and simply apply it to new material.

  26. Outline • Transformation-based learning • Comparison of HMMs and TBL • Named entity recognition and chunk classification • Phrase chunking • Evaluating chunks: Precision, Recall, and F-measure • Multiclass classification • CoNLL 2003 shared task on NER

  27. Advantages of HMMs • General-purpose model for learning a mapping between observations and hidden representation of sequential data • Mathematically well-understood • Extremely popular algorithm, successfully applied to data in a wide range of domains

  28. Disadvantages of HMMs • Limited range of context for determining the current item in the sequence • Due to the sparse data problem • Assigns nonzero probability to “linguistically impossible” sequences • “Little her to she girl mother way on” • DT DT DT DT DT DT DT DT • etc.

  29. Disadvantage of HMM: lots of parameters • Lower end for English POS tagging: • 40 tags • 30,000 word vocabulary • Initial state probability distribution • 40 probabilities • State transition table: • 40 x 40 = 1,600 probabilities • Symbol emission tables: • 40 tags x 30,000 words = 1,200,000 probabilities • Very difficult to inspect tables of probabilities and understand how the HMM works

  30. Advantages of TBL • Patch rules can look at a wider range of context than n-grams • Context to the right of a word • Combination of left and right context • Look for spelling, capitalization, other things • No probability estimation and smoothing for rules • Rules are easy to understand

  31. Disadvantages of TBL • Long training procedure • At each iteration, generate many possible rules • Need to evaluate every rule on entire corpus • Tag for a word can be revised multiple times, unlike HMM • Algorithm did not become popular in NLP

  32. Comparative performance • [Results table (not reproduced); bottom row: TBL with rules for unknown words]

  33. Outline • Transformation-based learning • Comparison of HMMs and TBL • Named entity recognition and chunk classification • Phrase chunking • Evaluating chunks: Precision, Recall, and F-measure • Multiclass classification • CoNLL 2003 shared task on NER

  34. Named Entity Recognition (NER) • Find phrases in a sentence referring to named entities of different types, such as: • Person names • Company/organization names • Locations • Dates and times • Percentages • Monetary amounts

  35. Named Entity Recognition example

  36. List searching • Use lists of entities • List of names of people • Lists of locations • etc. • List searching doesn't perform well • Entities are too numerous to include in standard dictionaries • They change constantly • They appear in many variant forms, including abbreviations

  37. For locations, gazetteers are insufficient • [Image: locations.png from the NLTK book — http://nltk.googlecode.com/svn/trunk/doc/images/locations.png]

  38. Many forms for same entity • Hillary Clinton • Hillary Diane Clinton • Hillary D. Clinton • Secretary of State Clinton • Hillary • H D Clinton • Hillary Rodham Clinton • Mrs. Clinton • Clinton, Hillary • Clinton

  39. Recognition using hand-written regular expressions • Title = Secretary of State | Mrs. • First = Hillary | Diane • Last = Rodham | Clinton • Initial = D | H | R | C | D. | H. | R. | C. • Name = (Title) [(First|Initial) (First|Last|Initial)?(Last)*] | [(First|Initial)?(First|Last|Initial)?(Last)+] • For further improvements: • Use lists of first names, last names, titles • Capitalization patterns: [A-Z][a-z]*[A-Z]. [A-Z][a-z]*
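
A rough Python sketch of such a hand-written recognizer; the pattern below is a simplified, illustrative version of the grammar above, not an exact transcription of it:

    import re

    TITLE   = r"(?:Secretary of State|Mrs\.)"
    FIRST   = r"(?:Hillary|Diane)"
    LAST    = r"(?:Rodham|Clinton)"
    INITIAL = r"(?:[DHRC]\.?)"

    # Optional title, then one or two first-name/initial tokens, then a last name.
    NAME = re.compile(
        rf"(?:{TITLE}\s+)?"
        rf"(?:{FIRST}|{INITIAL})"
        rf"(?:\s+(?:{FIRST}|{LAST}|{INITIAL}))?"
        rf"\s+{LAST}"
    )

    for text in ["Secretary of State Clinton spoke.",   # no match: simplified pattern needs a first name/initial
                 "Hillary Diane Clinton spoke.",
                 "H. D. Clinton spoke."]:
        m = NAME.search(text)
        print(text, "->", m.group(0) if m else None)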

  40. Difficult cases • Foreign names • Wen Jiabao • Mikhail Khodorkovsky • Abdul Aziz bin Abdur Rahman Al Saud • Capitalization • hillary clinton • HILLARY CLINTON • Ambiguities • Clinton, S.C. = location • rich baker = person or phrase? • john = person or toilet?

  41. Problems with hand-written rules • Large number of rare names • Very difficult to provide coverage of all cases • Rules too general • Need information about how string appears in sentence • She became a rich baker by selling cupcakes. • Very difficult to specify exact combination of conditions for precise recognition • Hard to maintain system • Rule-based systems can get very, very large

  42. NER: example • Want to identify sequences of words in a text that are Named Entities • Schools Chancellor Joel Klein announced the grades at Williamsburg Preparatory School in Brooklyn, which received an A. • Want to extract: • PERSON: Schools Chancellor Joel Klein • LOCATION: Williamsburg Preparatory School in Brooklyn

  43. Nested entities • Long entities often contain shorter ones • Examples: • “Williamsburg Preparatory School in Brooklyn” contains: • “Williamsburg Preparatory School” • Williamsburg • “Brooklyn” • “Schools Chancellor Joel Klein” contains: • Joel Klein • Joel • Klein • Your application determines the types of entities desired (all entities, longer entities, shorter entities)

  44. Doing NER with machine learning • 1. Formulate NER as a machine learning problem • 2. Choose features to predict whether word(s) are named entities • Spelling of word • POS tag of word • Capitalization • Punctuation marks • Position in the sentence • Above features for previous, next words • etc. • 3. Train a classifier
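
As an illustration of step 3, NER can be set up as per-token classification: extract a feature dictionary for each word and train a classifier on tokens labeled with entity types (O = not part of an entity). This sketch uses scikit-learn and a toy two-feature representation, neither of which comes from the lecture; a real system would use the richer features on the following slides and far more data:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy training data: one token-labeled sentence.
    words  = "Sonny Bono was an advanced skier at Heavenly".split()
    labels = ["PER",  "PER",  "O",  "O", "O",       "O",    "O",  "LOC"]

    def features(words, i):
        # Only two features here; slides 45-48 list many more.
        return {"word": words[i].lower(), "firstCap": words[i][0].isupper()}

    X = [features(words, i) for i in range(len(words))]
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), labels)

    test = "he skied with Bono at Heavenly".split()
    pred = clf.predict(vec.transform([features(test, i) for i in range(len(test))]))
    print(list(zip(test, pred)))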

  45. Features for NER: words and morphemes in a 5-word window • Example text: Sonny Bono was an advanced skier on an intermediate run , the Orion trail . And he was familiar with this popular South Lake TAHOE resort , having skied at Heavenly for more than 20 years . • Window around “Bono”: {^-2, Sonny-1, Bono0, was+1(be), an+2} • Window around “Heavenly”: {skied-2(ski), at-1, Heavenly0, for+1, more+2}

  46. Features for NER: prefixes/suffixes of length up to 4 • Example text: Sonny Bono was an advanced skier on an intermediate run , the Orion trail . And he was familiar with this popular South Lake TAHOE resort , having skied at Heavenly for more than 20 years . • Affixes of “skier”: {s_, sk_, ski_, skie_, _r, _er, _ier, ..}

  47. Features for NER: string properties • Example text: Sonny Bono was an advanced skier on an intermediate run , the Orion trail . And he was familiar with this popular South Lake TAHOE resort , having skied at Heavenly for more than 20 years . • Properties: firstCap (e.g. Sonny, Heavenly), allCaps (TAHOE), allLower (e.g. skier), number (20)

  48. Features for NER: lists / gazetteers • Example text: Sonny Bono was an advanced skier on an intermediate run , the Orion trail . And he was familiar with this popular South Lake TAHOE resort , having skied at Heavenly for more than 20 years . • Gazetteer matches: Person (Sonny Bono), Location (South Lake TAHOE, Heavenly)
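
Pulling slides 45-48 together, here is a sketch of a per-token feature extractor; the feature names, edge handling, and gazetteer contents are illustrative assumptions, not from any particular system:

    PERSON_LIST   = {"sonny bono", "hillary clinton"}        # toy gazetteers
    LOCATION_LIST = {"south lake tahoe", "heavenly", "brooklyn"}

    def ner_features(words, i):
        w = words[i]
        feats = {}
        # Words in a 5-word window (with a placeholder at sentence edges)
        for offset in (-2, -1, 0, 1, 2):
            j = i + offset
            feats[f"w{offset:+d}"] = words[j].lower() if 0 <= j < len(words) else "^"
        # Prefixes and suffixes of length up to 4
        for n in (1, 2, 3, 4):
            feats[f"prefix{n}"] = w[:n].lower()
            feats[f"suffix{n}"] = w[-n:].lower()
        # String properties
        feats["firstCap"] = w[:1].isupper()
        feats["allCaps"]  = w.isupper()
        feats["allLower"] = w.islower()
        feats["number"]   = w.isdigit()
        # Gazetteer membership of the current word and the current bigram
        bigram = " ".join(words[i:i+2]).lower()
        feats["inPersonList"]   = w.lower() in PERSON_LIST or bigram in PERSON_LIST
        feats["inLocationList"] = w.lower() in LOCATION_LIST or bigram in LOCATION_LIST
        return feats

    sent = "Sonny Bono was skiing at Heavenly".split()
    print(ner_features(sent, 0)["inPersonList"])   # True: "sonny bono" is in the person list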
