LING/C SC/PSYC 438/538

LING/C SC/PSYC 438/538 Lecture 9 Sandiway Fong

Administrivia
• Homework 2 graded
• Today's topics
  • Homework 3 review
  • Named Entity Recognition (NER)

Homework 3 Review
• Extract all the dollar amounts from WSJ9_001e.txt within the <TEXT> </TEXT> markups.
• Examples (actual):
  • $23 million
  • $38.3 billion
  • C$1,000
  • $95,142
  • $3.01
  • $38.375
  • $45
  • 16.125 Canadian dollars
• Compute:
  • How many dollar amounts you extracted
  • The largest dollar amount
  • The smallest dollar amount
  • The median dollar amount
• Appendix:
  • List all the dollar amounts

Homework 3 Review
• Start with the code template from last time…

Homework 3 Review
• Examples (actual):
  • $23 million
  • $38.3 billion
  • C$1,000
  • $95,142
  • $3.01
  • $38.375
  • $45
  • 16.125 Canadian dollars
• regex (part by part):
  • numeric:
    • \$
    • \$[\d\.,]+
    • \$[\d\.,]*\d
  • word:
    • (tr|[mb])illions?
  • numeric (word):
    • \$[\d\.,]*\d\s*((tr|[mb])illions?|)
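As a quick check of the final pattern, here is a minimal Perl sketch that runs it over the example strings above. The helper name `match_dollar` and the test list are illustrative assumptions, not part of the homework template. Note that the last example has no `$` sign, so this pattern is not expected to match it.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first dollar expression matched in a string,
# or undef if there is none. Inner groups are non-capturing so $1 is the whole match.
sub match_dollar {
    my ($text) = @_;
    return $text =~ /(\$[\d\.,]*\d\s*(?:(?:tr|[mb])illions?|))/ ? $1 : undef;
}

for my $s ('$23 million', '$38.3 billion', 'C$1,000', '$95,142', '$3.01',
           '16.125 Canadian dollars') {
    my $m = match_dollar($s);
    printf "%-26s => %s\n", $s, defined $m ? "'$m'" : 'no match';
}
```

Because `\s*` is greedy and the word part may be empty, a match that is followed by ordinary text can include trailing whitespace; trimming it is left to the caller.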

Homework 3 Review
• regex so far:
  • \$[\d\.,]*\d\s*((tr|[mb])illions?|)
• Decide on which parts we want to extract and store them:
    while ($line =~ /\$([\d\.,]*\d)\s*((tr|[mb])illions?|)/g) {
      $numeric = $1;        # group 1: the numeric part
      $numeric =~ s/,//g;   # remove commas
      $word = $2;           # group 2: the word part (may be empty)
      …
    }
• Compute value:
  • $word = million: $numeric * 1000000
  • $word = billion: $numeric * 1000000000
  • $word = trillion: $numeric * 1000000000000
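The value computation above can be sketched as a small lookup table. This is one possible arrangement, not the template's own code: the names `%multiplier` and `dollar_value` are assumptions, and the sample line is made up for illustration.

```perl
use strict;
use warnings;

# Assumed lookup table, keyed by the singular form of the magnitude word.
my %multiplier = (million => 1e6, billion => 1e9, trillion => 1e12);

# Hypothetical helper: combine the numeric string and magnitude word into a value.
sub dollar_value {
    my ($numeric, $word) = @_;
    $numeric =~ s/,//g;                            # remove commas
    $word    =~ s/s$//;                            # "millions" -> "million"
    return $numeric * ($multiplier{$word} // 1);   # empty word: no scaling
}

my $line = 'sales of $38.3 billion, up from $95,142 and $23 million';
while ($line =~ /\$([\d\.,]*\d)\s*((tr|[mb])illions?|)/g) {
    print "$1 $2 => ", dollar_value($1, $2), "\n";
}
```

A hash keeps the three cases in one place; an if/elsif chain on `$word` would work equally well.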

Homework 3 Review
• What about examples like?
  • 16.125 Canadian dollars
  • 100 million Canadian dollars
• regex:
  • numeric: \d[\d\.,]*
  • word: (tr|[mb])illions?
  • dollars: ([Cc]anadian|)\s+dollars?
• code:
    while ($line =~ /(\d[\d\.,]*)\s*((tr|[mb])illions?|)\s*([Cc]anadian|)\s+dollars?/g) {
      $numeric = $1;
      $numeric =~ s/,//g;   # remove commas
      $word = $2;
      …
    }
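To see which capture groups hold what in the spelled-out "dollars" pattern, here is a minimal sketch that collects the pieces from a made-up sentence. The helper name `word_dollar_amounts` is an assumption. Group 3 is the inner (tr|[mb]) group, so the nationality lands in group 4.

```perl
use strict;
use warnings;

# Hypothetical helper: find amounts written with the word "dollars" and
# return a list of [numeric, word, nationality] triples.
sub word_dollar_amounts {
    my ($line) = @_;
    my @found;
    while ($line =~ /(\d[\d\.,]*)\s*((tr|[mb])illions?|)\s*([Cc]anadian|)\s+dollars?/g) {
        my ($numeric, $word, $nation) = ($1, $2, $4);   # $3 is the inner (tr|[mb]) group
        $numeric =~ s/,//g;                             # remove commas
        push @found, [$numeric, $word, $nation];
    }
    return @found;
}

my $text = 'It paid 16.125 Canadian dollars and 100 million Canadian dollars.';
for my $hit (word_dollar_amounts($text)) {
    print join(' | ', @$hit), "\n";
}
```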

Homework 3 Review
Run:
• perl dollar.perl WSJ9_001e.txt
• Number: 342
• Smallest: 1.55
• Largest: 3100000000000
• Median(#170,#171): 85944.5
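The four statistics can be computed from the list of extracted values with a numeric sort. The sketch below is one way to do it, with an assumed helper name `summarize` and made-up input values. With 342 values, the two middle elements are at zero-based indices 170 and 171, which is consistent with the "#170,#171" label above.

```perl
use strict;
use warnings;

# Hypothetical helper: return (count, smallest, largest, median) for a list of values.
sub summarize {
    my @sorted = sort { $a <=> $b } @_;   # numeric sort, not the default string sort
    my $n = @sorted;
    return (0) unless $n;                 # empty input: only the count is returned
    my $mid = int($n / 2);
    my $median = $n % 2
        ? $sorted[$mid]                               # odd count: the middle element
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;    # even count: mean of the two middle elements
    return ($n, $sorted[0], $sorted[-1], $median);
}

my ($n, $min, $max, $median) = summarize(45, 3.01, 95142, 23e6);
print "Number: $n\nSmallest: $min\nLargest: $max\nMedian: $median\n";
```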

Named Entity Recognition (NER)

Named Entity Recognition (NER)
• Named-Entity Recognition (NER) (also Identification and Extraction) tries to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
  [paraphrased from http://en.wikipedia.org/wiki/Named-entity_recognition]

Example WSJ9_002.txt

Illinois NER System
• http://cogcomp.cs.illinois.edu/demo/ner/
• NLP systems might also compute: anaphora reference

Textbook
• See JM Chapter 22: Information Extraction
  • 22.1 Named Entity Recognition
  • 22.2 Relation Detection and Classification
• also Chapter 21 for Anaphora Resolution

JM Chapter 22

JM Chapter 22
• Ambiguity: sometimes systematic, sometimes not

JM Chapter 22
• Word by word labeling (IOB “inside outside beginning”)
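To make IOB labeling concrete, here is a minimal sketch that turns entity spans into one tag per token. The helper name `iob_tags`, the `B-TYPE`/`I-TYPE` tag spelling, the span format, and the example sentence are all assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given a token list and entity spans [start, end, type]
# (inclusive, zero-based token indices), return one IOB tag per token.
sub iob_tags {
    my ($tokens, $spans) = @_;
    my @tags = ('O') x @$tokens;                        # default: outside any entity
    for my $span (@$spans) {
        my ($start, $end, $type) = @$span;
        $tags[$start] = "B-$type";                      # beginning of the entity
        $tags[$_] = "I-$type" for $start + 1 .. $end;   # inside the entity
    }
    return @tags;
}

my @tokens = ('American', 'Airlines', ',', 'a', 'unit', 'of', 'AMR', 'Corp.', ',', 'rose');
my @tags   = iob_tags(\@tokens, [[0, 1, 'ORG'], [6, 7, 'ORG']]);
print "$tokens[$_]\t$tags[$_]\n" for 0 .. $#tokens;
```

Framing NER this way reduces it to assigning one label per word, which is what lets a sequence classifier be applied.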

JM Chapter 22
• POS information
• Shape
• Syntactic chunking
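A word-shape feature is commonly computed by mapping each character to a class symbol. The sketch below assumes the convention X for uppercase, x for lowercase, and d for digits, with other characters left unchanged. The helper names `word_shape` and `short_shape` are illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: map a token to its shape (X = upper, x = lower, d = digit).
sub word_shape {
    my ($token) = @_;
    my $shape = $token;
    $shape =~ s/[A-Z]/X/g;
    $shape =~ s/[a-z]/x/g;
    $shape =~ s/[0-9]/d/g;    # digits last, so the new "d" is not rewritten to "x"
    return $shape;
}

# Hypothetical helper: shortened shape, with each run of a repeated symbol collapsed to one.
sub short_shape {
    my $shape = word_shape(@_);
    $shape =~ s/(.)\1+/$1/g;
    return $shape;
}

for my $t ('Airlines', 'AMR', 'DC10-30', 'I.M.F.') {
    printf "%-10s %-10s %s\n", $t, word_shape($t), short_shape($t);
}
```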

JM Chapter 22
• What features to use in making a decision (used also for Machine Learning)
