Understanding Named Entity Tagging in NLP: Techniques and Frameworks

Overview of Machine Learning for NLP Tasks, Part II • Named Entity Tagging: A Phrase-Level NLP Task

Outline • Identify a (hard) problem • Frame the problem ‘appropriately’ • (...so that we can apply our tools, find appropriate labeled data) • Preprocess data • Apply FEX and SNoW • Process output from FEX, SNoW to annotate new text • FEX and SNoW server modes

Named Entity Tagging • Identify e.g. people, locations, organizations • Example:
    After receiving his [MISC M.B.A.] from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] ([ORG Georgetown University]) in [LOC Washington].

Framing NE-tagging Problem • Not an easy problem • We won’t seek stellar results • Just want to show that the tools work, and how to apply them • Where to begin? • Need labeled data • Data must work with FEX

Ways to Approach NE-tagging • BIO/Open-Close Chunking: • Word-level classification + inference • The chunks found depend on the labels you train with (e.g. NE labels) • Impose common-sense constraints on open/close labels • Optimize based on classifier confidence • V. Punyakanok and D. Roth, “The Use of Classifiers in Sequential Inference”, NIPS-13, Dec. 2000 • Use chunker to find phrase boundaries: • Phrase-level predicate: learn labels for phrases • Can use FEX’s phrase mode
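
To make the BIO encoding concrete, here is a minimal Python sketch that turns a sequence of word-level BIO tags into labeled spans. It is not part of FEX or SNoW; the function name `bio_to_spans` is hypothetical, and treating a stray "I-" tag as the start of a span is just one possible common-sense constraint.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert word-level BIO tags into (start, end_exclusive, label) spans."""
    spans: List[Tuple[int, int, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        # The current span continues only on an "I-" tag with the same label.
        continues = prefix == "I" and label is not None and kind == label
        if not continues:
            if label is not None:
                spans.append((start, i, label))
                start, label = None, None
            if prefix in ("B", "I"):
                # Assumed repair: a stray "I-X" opens a new span, like "B-X".
                start, label = i, kind
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


print(bio_to_spans(["O", "B-ORG", "O", "O", "B-PER", "I-PER"]))
# prints [(1, 2, 'ORG'), (4, 6, 'PER')]
```

An open/close classifier pair would need an inference step to guarantee that its outputs decode to well-formed spans like these; the phrase-level approach sidesteps that by classifying spans a chunker has already found.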

Framing NE-tagging Problem • We have some labeled Named Entity data • We can identify Noun-phrases with our chunker... • See the Demos page for an example • ...and FEX has a phrase mode... • ...So we can frame this as a (noun) phrase classification problem (assume all NEs are NPs) • avoids working with invalid phrases • avoids inference (as opposed to open-close classifiers)

Review: Machine Learning System • [Figure: system diagram with the components Raw Text, Preprocessing, Formatted Text, Feature Extraction, Feature Vectors, Training Examples, Testing Examples, Machine Learner, Function Parameters, Classifier(s), Inference, Labels]

Solution Sketch • Use labeled data to develop core classifier • Adapt our labeled data to our model of the problem • Experiment with FEX and SNoW to get good performance using our labeled data • Use the FEX and SNoW resources we develop as the core of our NE Tagger • Write tools to preprocess raw text into appropriate form for input to FEX, SNoW • Write tools to convert SNoW output to labels for preprocessed data • Convert labeled preprocessed data into desired output format • For the training/evaluation data, we’ve done the pre- and post-processing for you…

CONLL03 data • Have some column-format data... any problems?
    O      0  0   B-NP  PRP  He            x  TXT/1  0
    O      0  1   B-VP  VBD  said          x  TXT/1  0
    O      0  2   I-NP  DT   a             x  TXT/1  0
    O      0  3   I-NP  NN   proposal      x  TXT/1  0
    O      0  4   B-NP  JJ   last          x  TXT/1  0
    O      0  5   I-NP  NN   month         x  TXT/1  0
    O      0  6   B-PP  IN   by            x  TXT/1  0
    B-ORG  0  7   B-NP  NNP  EU            x  TXT/1  0
    O      0  8   I-NP  NNP  Farm          x  TXT/1  0
    O      0  9   I-NP  NNP  Commissioner  x  TXT/1
    B-PER  0  10  I-NP  NNP  Franz         x  TXT/1  0
    I-PER  0  11  I-NP  NNP  Fischler      x  TXT/1  0

Design Decisions • NE phrases are a subset of NPs • We can find NPs, so label only NPs • Given chunking, can use FEX phrase mode • CONLL03 data: NPs not labeled as NEs • NE phrases could be embedded • How to resolve embeddings? • Avoid embedding – ‘enlarge’ NE phrases • Data has been preprocessed to reflect our needs

Setting up... • Download NE data from CogComp tools page • ne_tut_processed.tar.gz • Download sample FEX script • link: ‘sample NE FEX script’ • file: NE-simple.scr

Review: What FEX is doing... • Think of FEX as generating a list of boolean variables, X1, X2, … , Xn • Lexicon maps boolean variable Xi to a propositional logic term • e.g. “1204 w[rejects*]” could be written X1204 == BEFORE(X, TARG) where X == “rejects”, TARG ∈ {too, to, two} • In FEX output: • If boolean variable is present, it is active • If boolean variable is not present, it is inactive
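
The lexicon idea can be sketched in a few lines of Python. This only illustrates the mapping from feature strings to boolean variable indices; the class `Lexicon`, the function `encode`, and the id numbering are hypothetical and do not reproduce FEX's actual lexicon file format.

```python
from typing import Dict, List


class Lexicon:
    """Maps feature strings to integer ids, assigning a new id on first sight."""

    def __init__(self, first_id: int = 1) -> None:
        self.ids: Dict[str, int] = {}
        self.next_id = first_id

    def lookup(self, feature: str) -> int:
        if feature not in self.ids:
            self.ids[feature] = self.next_id
            self.next_id += 1
        return self.ids[feature]


def encode(active_features: List[str], lexicon: Lexicon) -> List[int]:
    """Return the sorted ids of the active features.

    Inactive features are simply absent from the output, which is what
    keeps the representation sparse.
    """
    return sorted({lexicon.lookup(f) for f in active_features})


lex = Lexicon()
print(encode(["w[rejects*]", "w[*the]"], lex))  # prints [1, 2]
print(encode(["w[*the]"], lex))                 # prints [2]
```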

FEX advanced modes: Phrase Mode • Why do we need extensions? • The original design of FEX is “word-based” • Each element is a word, and so is the target • Phrase detection/classification problem: The target is a phrase. • E.g. Named Entity tagging, Shallow Parse tagging • Document classification problem: The target is the whole document. • Relations: Target is at some intermediate level of representation. • FEX also has an Entity-Relation mode…

Basic Structure • Two types of elements: phrases & words • FEX’s window semantics are different for phrase mode • Column format input only • [Figure: a Phrase element shown together with word elements W1 through W8]

Changes to FEX for Phrase Mode • Only accepts COLUMN format input • 1st column is used to store (phrase) labels. • 2nd column is used to store named entity tags. • Both use BIO format. • Columns 2-6 have fixed meanings: • 2 NE; • 3 Index; • 4 Phrase boundary; • 5 POS; • 6 Word

Sample Column Format Data
    O      0  0   I-NP  PRP  He            x  TXT/1  0
    O      0  1   I-VP  VBD  said          x  TXT/1  0
    O      0  2   I-NP  DT   a             x  TXT/1  0
    O      0  3   I-NP  NN   proposal      x  TXT/1  0
    O      0  4   B-NP  JJ   last          x  TXT/1  0
    O      0  5   I-NP  NN   month         x  TXT/1  0
    O      0  6   I-PP  IN   by            x  TXT/1  0
    B-ORG  0  7   I-NP  NNP  EU            x  TXT/1  0
    O      0  8   I-NP  NNP  Farm          x  TXT/1  0
    O      0  9   I-NP  NNP  Commissioner  x  TXT/1
    B-PER  0  10  I-NP  NNP  Franz         x  TXT/1  0
    I-PER  0  11  I-NP  NNP  Fischler      x  TXT/1  0
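
A row of this column format can be read with a small parser such as the Python sketch below. The `Token` class and its field names are illustrative only; they follow the column description given for phrase mode (column 1 label, 2 NE, 3 index, 4 phrase boundary, 5 POS, 6 word), and any columns after the sixth are ignored here.

```python
from dataclasses import dataclass


@dataclass
class Token:
    label: str   # column 1: label in BIO format
    ne: str      # column 2: NE column
    index: int   # column 3: position of the word in the sentence
    chunk: str   # column 4: phrase boundary tag in BIO format
    pos: str     # column 5: part-of-speech tag
    word: str    # column 6: the word itself


def parse_column_line(line: str) -> Token:
    """Parse one whitespace-separated row; extra trailing columns are ignored."""
    cols = line.split()
    if len(cols) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(cols)}: {line!r}")
    return Token(cols[0], cols[1], int(cols[2]), cols[3], cols[4], cols[5])


tok = parse_column_line("B-ORG 0 7 I-NP NNP EU x TXT/1 0")
print(tok.label, tok.index, tok.chunk, tok.pos, tok.word)
# prints B-ORG 7 I-NP NNP EU
```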

Phrase Mode Option • FEX command line option -P <length> • -P takes an integer as its argument, which stands for the maximum length of the candidate phrases. • For example, “fex -P 4” will generate examples for every phrase of length 1, 2, 3, and 4 from the corpus file. • If the length is equal to 0, then only positive examples will be generated.
    > fex -P 0 ne.scr ne.lex ne.corp ne.out
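
The effect of the length argument can be pictured with the following Python sketch, which enumerates candidate spans the way the slide describes. It is not FEX's implementation; `candidate_phrases` and the `(start, end_exclusive)` span convention are assumptions made for illustration, and "positive examples" is taken to mean the phrases already marked in the corpus.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end_exclusive)


def candidate_phrases(n_tokens: int, max_len: int, marked: Set[Span]) -> List[Span]:
    """Enumerate candidate phrase spans for a sentence of n_tokens words.

    max_len > 0: every span of length 1..max_len (as described for "-P 4").
    max_len == 0: only the spans already marked in the data, i.e. the
    positive examples.
    """
    if max_len == 0:
        return sorted(marked)
    return [
        (start, start + length)
        for start in range(n_tokens)
        for length in range(1, max_len + 1)
        if start + length <= n_tokens
    ]


print(candidate_phrases(3, 2, set()))
# prints [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
print(candidate_phrases(3, 0, {(1, 3)}))
# prints [(1, 3)]
```

With a positive length the number of candidates grows roughly as sentence length times the maximum phrase length, and most of them are negative examples, which is one reason to restrict attention to chunker-supplied phrases.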

Window Range in Phrase Mode • The meaning of the offsets in the window is different in Phrase mode:
    w1  w2  w3  W4  W5  W6  w7  w8  w9
    -3  -2  -1   0   0   0   1   2   3
• “-1: w[0,0]” returns w[W4], w[W5], w[W6].
• “-1 loc: w[0,0]” returns w[*W4]*, w[*_W5]*, w[*__W6]*. (NOTE: * after [] indicates ‘within phrase’)
• “-1 loc: w[-2,-1]” returns w[w2_*], w[w3*].
• “-1 loc: w[1, 2]” returns w[*w7], w[*_w8].
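
The offset convention shown above (every word inside the target phrase sits at offset 0, and counting starts from the phrase edges) can be written as a short Python sketch. The function name `phrase_offsets` is hypothetical and is not part of FEX.

```python
from typing import List


def phrase_offsets(n_tokens: int, start: int, end: int) -> List[int]:
    """Offset of each token relative to the target phrase [start, end)."""
    offsets = []
    for i in range(n_tokens):
        if i < start:
            offsets.append(i - start)    # left context: -1 is adjacent to the phrase
        elif i < end:
            offsets.append(0)            # every word inside the phrase is at 0
        else:
            offsets.append(i - end + 1)  # right context: 1 is adjacent to the phrase
    return offsets


# Nine words with the target phrase W4 W5 W6 (zero-based positions 3, 4, 5).
print(phrase_offsets(9, 3, 6))
# prints [-3, -2, -1, 0, 0, 0, 1, 2, 3]
```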

Phrase Type Sensors • How to specify patterns within phrase? • Several phrase type sensors can be used. • “-1 phLen[0,0]” returns 3 for the above corpus file, since "W4 W5 W6" contains 3 words. • phNoSmall is active if all words in the target phrase are either capitalized (initial), symbols, or numbers. • phAllWord is active if all the elements in the target phrase are words (a-z,A-Z) • Many other custom sensors – check the FEX source code (Sensor.h)
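
The three sensors described above can be approximated in Python as follows. These are sketches written from the slide's descriptions, not translations of the code in Sensor.h; in particular, reading "symbols or numbers" as "contains no letters" is an assumption.

```python
from typing import List


def ph_len(phrase: List[str]) -> int:
    """Number of words in the target phrase."""
    return len(phrase)


def ph_no_small(phrase: List[str]) -> bool:
    """Active if every token is initial-capitalized, a symbol, or a number.

    Assumption: a token with no letters at all counts as a symbol or number.
    """
    def ok(tok: str) -> bool:
        return tok[:1].isupper() or not any(c.isalpha() for c in tok)

    return bool(phrase) and all(ok(tok) for tok in phrase)


def ph_all_word(phrase: List[str]) -> bool:
    """Active if every token consists only of the letters a-z, A-Z."""
    return bool(phrase) and all(tok.isascii() and tok.isalpha() for tok in phrase)


print(ph_len(["W4", "W5", "W6"]))                      # prints 3
print(ph_no_small(["Harvard", "Business", "School"]))  # prints True
print(ph_no_small(["School", "of", "Business"]))       # prints False
print(ph_all_word(["M.B.A."]))                         # prints False
```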

RGF operator conjunct
    w1  w2  w3  W4  W5  W6  w7  w8  w9
    -3  -2  -1   0   0   0   1   2   3
• “conjunct(-1:w[-2,-1]; -1:phLen[0,0]; -1:w[1,2])” generates
• w[w2]--phLen[3]--w[w7], w[w2]--phLen[3]--w[w8]
• w[w3]--phLen[3]--w[w7], w[w3]--phLen[3]--w[w8]
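
The conjunct operator behaves like a Cartesian product over the features produced by each of its arguments. A minimal Python sketch, assuming each argument has already been evaluated to a list of feature strings (the function name `conjunct` here mirrors the operator but is not FEX code):

```python
from itertools import product
from typing import List


def conjunct(*feature_groups: List[str]) -> List[str]:
    """Join one feature from each group, over all combinations, with '--'."""
    return ["--".join(combo) for combo in product(*feature_groups)]


left = ["w[w2]", "w[w3]"]    # features from the left context
length = ["phLen[3]"]        # feature from the target phrase
right = ["w[w7]", "w[w8]"]   # features from the right context

for feature in conjunct(left, length, right):
    print(feature)
# prints:
# w[w2]--phLen[3]--w[w7]
# w[w2]--phLen[3]--w[w8]
# w[w3]--phLen[3]--w[w7]
# w[w3]--phLen[3]--w[w8]
```

Because the number of conjunctions is the product of the group sizes, wide windows in several arguments can enlarge the lexicon quickly.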

Choose FEX, SNoW parameters • Use FEX phrase mode:
    % ./fex -P 0 ne.scr ne.lex data.in ne-snow.ex
• Train SNoW with the resulting examples:
    % ./snow -train -I ne-snow.ex -F ne.net -W:0-5
• Test SNoW with examples from test data:
    % ./snow -test -I ne-snow2.ex -F ne.net -o allpredictions -R ne.res

Improving Classifier Performance • Tune fex script: experiment with different sensors • InitialCapitalized, NotInitialCapitalized, AllCapitalized • Tune SNoW using Test data • analyze.pl – a tool to help with tuning • Gives accuracy for each label • Requires SNoW’s ‘-o allpredictions’ mode
    % ./analyze.pl snow.res
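
The kind of per-label breakdown that helps with tuning can be sketched in Python as below. This is not analyze.pl and does not read SNoW's result files; it assumes gold and predicted labels have already been extracted into two parallel lists, and it defines "accuracy for a label" as the fraction of gold examples with that label that were predicted correctly, which may differ from what analyze.pl reports.

```python
from collections import defaultdict
from typing import Dict, List


def per_label_accuracy(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """Fraction of examples of each gold label that were predicted correctly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / total[label] for label in total}


gold = ["PER", "PER", "ORG", "O"]
pred = ["PER", "ORG", "ORG", "O"]
print(per_label_accuracy(gold, pred))
# prints {'PER': 0.5, 'ORG': 1.0, 'O': 1.0}
```

Looking at labels separately matters here because a dominant label such as O can make overall accuracy look good even when the entity labels are poorly predicted.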

We now have a classifier… • Need a way to apply it to new text… • No formatting or Gold Standard labeling • Need to enrich with POS, SP • Need to track SNoW output and use it to label the data • Sample tools: • link: ‘NE tagging: tools for new data’ • file: tut_ne_postprocess.tar.gz

Classifying New Data • First, let’s enrich our input: • POS-tagging – POS tagger • Chunking – Shallow Parser • NOTE: SP output format is not FEX-compatible • Convert to Column format • Tool available from ccg tools page
    % ./chunk-to-column.pl inputFile > outputFile
• Run data through FEX and SNoW servers • One file at a time • Doesn’t reload lexicon/network each time • Can pipe test data through both together

Making life easier... • Starting SNoW server:
    % ./snow -server <port> -F network.net &
• Starting FEX server:
    % ./fex -s <port> -P 0 <script> <lexicon> &
• Need client scripts to interact with the servers • See Snow_v3.1/tutorial/example-client.pl for SNoW • See fex/fexClient.pl for FEX • Clean up after use… • ‘ps’ • kill server processes

Post-processing • SNoW ‘-o winners’ mode
    % ./snow -test -I ... -F ... -R text.winners.res -o winners
• Adding results to original data • SNoW output mode must be ‘winners’
    % ./numbers-to-labels.pl text.winners.res ne.lex > text.lab
    % ./apply-labels.pl text.col text.lab > text.col.lab
• In my solution, there is a seeming disparity between performance on held-out data and on completely unseen text • WHY? • What is the best way to improve the performance? (i.e., what is likely to give the best return per unit time invested?)
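
The label-application step can be pictured with the Python sketch below, which writes phrase-level predictions back onto the words as BIO tags. It is not the logic of numbers-to-labels.pl or apply-labels.pl, whose file formats are not described here; the function `spans_to_bio` and the `(start, end_exclusive, label)` input are assumptions for illustration.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, labeled_spans: List[Tuple[int, int, str]]) -> List[str]:
    """Write phrase-level labels back onto words as BIO tags.

    Each span is (start, end_exclusive, label); phrases predicted as "O"
    (not a named entity) leave their words tagged "O".
    """
    tags = ["O"] * n_tokens
    for start, end, label in labeled_spans:
        if label == "O":
            continue
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


# Five words; the classifier labeled three candidate phrases.
predictions = [(0, 1, "O"), (1, 2, "ORG"), (3, 5, "PER")]
print(spans_to_bio(5, predictions))
# prints ['O', 'B-ORG', 'O', 'B-PER', 'I-PER']
```

Since the phrases come from a chunker they do not overlap, so no inference step is needed to reconcile conflicting tags.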

Summary: SNoW and FEX • SNoW is a supervised learning system • Needs labeled data • Performance constrained by the quality of the features it is given • Works with numerical features – needs preprocessing stage to extract those features • Fast, and good performance • FEX provides a framework for feature engineering • Designed to represent examples in SNoW input format • Does *not* generate features automatically – not a replacement for human expert! • Requires certain input formats • Fairly modular – write new sensors to capture new feature types • Terse, expressive feature descriptors

Summary: solving NLP problems • Need to frame problem appropriately (e.g. NE as noun phrase tagging) • Need appropriate labeled data • If you want an application, will have to write pre- and post-processing • SNoW and FEX work close to the mathematical models underlying machine learning • User has good control over ML algorithms • Be prepared to spend some time on error analysis and feature engineering!
