This lecture focuses on Named Entity Recognition (NER) and reviews Homework 3, which involves extracting dollar amounts from a text file. Students will utilize regex patterns to identify monetary values, quantify the results, and determine statistical insights like largest, smallest, and median dollar amounts. The session covers techniques for recognizing numerical patterns and handling variations in currency formats, including Canadian dollars. Additionally, students will engage with practical coding examples utilizing Perl to apply learned concepts effectively.
LING/C SC/PSYC 438/538 Lecture 9 Sandiway Fong
Administrivia • Homework 2 graded • Today's topics • Homework 3 review • Named Entity Recognition (NER)
Homework 3 Review • extract all the dollar amounts from WSJ9_001e.txt within the <TEXT> </TEXT> markups. • Examples (actual): • $23 million • $38.3 billion • C$1,000 • $95,142 • $3.01 • $38.375 • $45 • 16.125 Canadian dollars • Compute: • how many dollar amounts you extracted • The largest dollar amount • The smallest dollar amount • The median dollar amount • Appendix: • list all the dollar amounts
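One way to restrict the search to the <TEXT> </TEXT> region is a simple in/out flag, sketched below. The sub name `text_lines` and the sample lines are made up for illustration, and the sketch assumes each tag sits on a line of its own, which may not hold for every file.

```perl
use strict;
use warnings;

# Keep only the lines between <TEXT> and </TEXT>.
# Assumption: each tag appears on a line of its own.
sub text_lines {
    my @kept;
    my $in_text = 0;
    for my $line (@_) {
        if    ($line =~ m{<TEXT>})  { $in_text = 1; next; }
        elsif ($line =~ m{</TEXT>}) { $in_text = 0; next; }
        push @kept, $line if $in_text;
    }
    return @kept;
}

# Made-up sample lines, not taken from WSJ9_001e.txt.
my @doc = ('<DOC>', 'Headline mentions $5', '<TEXT>', 'It paid $45.', '</TEXT>', '</DOC>');
print "$_\n" for text_lines(@doc);
```

Only the line inside the markup is printed, so a dollar amount in the headline would not be counted.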
Homework 3 Review start with code template from last time…
Homework 3 Review • Examples (actual): • $23 million • $38.3 billion • C$1,000 • $95,142 • $3.01 • $38.375 • $45 • 16.125 Canadian dollars • regex (part by part): • numeric: \$ • \$[\d\.,]+ • \$[\d\.,]*\d (must end in a digit, so a sentence-final period or comma is not included) • word: (tr|[mb])illions? • numeric (word): \$[\d\.,]*\d\s*((tr|[mb])illions?|)
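The combined pattern can be tried against the `$`-prefixed examples from the slide. This is a minimal sketch; the sub name `match_dollar` is hypothetical, and the inner groups are made non-capturing only so that the whole match lands in `$1`.

```perl
use strict;
use warnings;

# Sample amounts copied from the slide (the $-prefixed ones only).
my @samples = ('$23 million', '$38.3 billion', 'C$1,000', '$95,142',
               '$3.01', '$38.375', '$45');

# Return the matched dollar expression, or undef if the string has none.
sub match_dollar {
    my ($text) = @_;
    return $text =~ /(\$[\d\.,]*\d\s*(?:(?:tr|[mb])illions?|))/ ? $1 : undef;
}

for my $s (@samples) {
    my $m = match_dollar($s);
    printf "%-15s => %s\n", $s, defined $m ? $m : '(no match)';
}
```

Note that `C$1,000` matches from the `$` onward, so the leading `C` is not part of the match. "16.125 Canadian dollars" has no `$` and needs a separate pattern.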
Homework 3 Review • regex so far: • \$[\d\.,]*\d\s*((tr|[mb])illions?|) • Decide on which parts we want to extract and store them ($1 is the numeric part, $2 is the word): • while ($line =~ /\$([\d\.,]*\d)\s*((tr|[mb])illions?|)/g) { • $numeric = $1; • $numeric =~ s/,//g; # remove commas • $word = $2; • Compute value: • if $word is million: $numeric * 1000000 • if $word is billion: $numeric * 1000000000 • if $word is trillion: $numeric * 1000000000000 • if $word is empty: $numeric unchanged
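The value computation can be sketched with a lookup table instead of a chain of tests. The names `%scale` and `dollar_value` and the sample line are assumptions for illustration, not the homework solution; the `//` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Multipliers for the optional scale word; a missing word counts as 1.
my %scale = (million => 1e6, billion => 1e9, trillion => 1e12);

# Convert the two captured parts (numeric, word) into a plain number.
sub dollar_value {
    my ($numeric, $word) = @_;
    $numeric =~ s/,//g;     # remove commas, as on the slide
    $word    =~ s/s$//;     # "millions" -> "million"
    return $numeric * ($scale{$word} // 1);
}

# Made-up sample line.
my $line = 'a $3.1 trillion market and a $95,142 fee';
while ($line =~ /\$([\d\.,]*\d)\s*((tr|[mb])illions?|)/g) {
    print dollar_value($1, $2), "\n";
}
```

Stripping a final "s" is needed because the regex also accepts the plural forms, which would otherwise miss the table lookup.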
Homework 3 Review • What about examples like these? • 16.125 Canadian dollars • 100 million Canadian dollars • regex: • numeric: \d[\d\.,]* • word: (tr|[mb])illions? • dollars: ([Cc]anadian|)\s+dollars? • code: • while ($line =~ /(\d[\d\.,]*)\s*((tr|[mb])illions?|)\s*([Cc]anadian|)\s+dollars?/g) { • $numeric = $1; • $numeric =~ s/,//g; # remove commas • $word = $2; • … • }
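A small sketch of the spelled-out pattern, using the slide's regex unchanged. The sub name `spelled_out_amounts` and the sample sentence are made up; because `(tr|[mb])` is itself a capture group, the currency word ends up in `$4`, not `$3`.

```perl
use strict;
use warnings;

# Collect [numeric, word, currency] triples for amounts written with "dollars".
sub spelled_out_amounts {
    my ($text) = @_;
    my @found;
    while ($text =~ /(\d[\d\.,]*)\s*((tr|[mb])illions?|)\s*([Cc]anadian|)\s+dollars?/g) {
        my ($numeric, $word, $currency) = ($1, $2, $4);
        $numeric =~ s/,//g;    # remove commas
        push @found, [$numeric, $word, $currency];
    }
    return @found;
}

# Made-up sample sentence built from the two slide examples.
my $text = 'It paid 16.125 Canadian dollars a share, or 100 million Canadian dollars.';
for my $amount (spelled_out_amounts($text)) {
    my ($numeric, $word, $currency) = @$amount;
    print "numeric=$numeric word=[$word] currency=[$currency]\n";
}
```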
Homework 3 Review Run: • perl dollar.perl WSJ9_001e.txt • Number: 342 • Smallest: 1.55 • Largest: 3100000000000 • Median(#170,#171): 85944.5
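The four statistics can be computed from a sorted list, as in this sketch. The sub name `summarize` and the sample values are assumptions. With an even count the median is the mean of the two middle values; for 342 values those sit at positions 170 and 171 counting from 0, which appears consistent with the labels in the output above.

```perl
use strict;
use warnings;

# Summarize a list of numbers: (count, smallest, largest, median).
sub summarize {
    my @sorted = sort { $a <=> $b } @_;   # numeric sort, not string sort
    my $n = @sorted;
    return unless $n;
    my $mid = int($n / 2);
    # Odd count: the middle value; even count: mean of the two middle values.
    my $median = $n % 2 ? $sorted[$mid]
                        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
    return ($n, $sorted[0], $sorted[-1], $median);
}

# Made-up sample values.
my ($n, $min, $max, $median) = summarize(45, 3.01, 1000, 95142);
print "Number: $n\nSmallest: $min\nLargest: $max\nMedian: $median\n";
```

The explicit `<=>` comparison matters: Perl's default `sort` compares strings, which would place 1000 before 45.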
Named Entity Recognition (NER) • Named-Entity Recognition (NER), also called entity identification and entity extraction, tries to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. [paraphrased from http://en.wikipedia.org/wiki/Named-entity_recognition]
Example WSJ9_002.txt
Illinois NER System • http://cogcomp.cs.illinois.edu/demo/ner/ • NLP systems might also compute: anaphora reference
Textbook • See JM Chapter 22: Information Extraction • 22.1 Named Entity Recognition • 22.2 Relation Detection and Classification • also Chapter 21 for Anaphora Resolution
JM Chapter 22 Ambiguity: sometimes systematic, sometimes not
JM Chapter 22 • Word by word labeling (IOB “inside outside beginning”)
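A minimal sketch of IOB labeling: every token gets `O` unless it falls inside an entity span, where the first token is tagged `B-TYPE` and the rest `I-TYPE`. The sentence, the spans, the category names, and the sub name `iob_tags` are all made up for illustration; a real tagger has to predict these labels, not read them off given spans.

```perl
use strict;
use warnings;

# Tokens of a made-up sentence, and hypothetical entity spans given as
# [first token index, last token index, category].
my @tokens = qw(American Airlines spokesman Tim Wagner spoke in Chicago);
my @spans  = ([0, 1, 'ORG'], [3, 4, 'PER'], [7, 7, 'LOC']);

# Turn entity spans into one IOB tag per token.
sub iob_tags {
    my ($ntokens, @spans) = @_;
    my @tags = ('O') x $ntokens;                       # default: outside
    for my $span (@spans) {
        my ($start, $end, $type) = @$span;
        $tags[$start] = "B-$type";                     # beginning of entity
        $tags[$_] = "I-$type" for $start + 1 .. $end;  # inside the entity
    }
    return @tags;
}

my @tags = iob_tags(scalar @tokens, @spans);
print "$tokens[$_]/$tags[$_]\n" for 0 .. $#tokens;
```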
JM Chapter 22 POS information Shape Syntactic chunking
JM Chapter 22 • What features to use in making a decision (used also for Machine Learning)
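As one concrete feature, a word's shape can be computed by mapping uppercase letters to `X`, lowercase letters to `x`, and digits to `d`, leaving punctuation as is. The sketch below is an assumption about how such a feature might be coded, not the textbook's implementation; the sub names `word_shape` and `short_shape` are hypothetical.

```perl
use strict;
use warnings;

# Map a token to its shape: uppercase -> X, lowercase -> x, digit -> d.
# Other characters (punctuation, symbols) are kept unchanged.
sub word_shape {
    my ($token) = @_;
    my $shape = $token;
    $shape =~ s/[A-Z]/X/g;
    $shape =~ s/[a-z]/x/g;
    $shape =~ s/[0-9]/d/g;   # digits last, so the new "d" is not remapped to "x"
    return $shape;
}

# Shortened shape: collapse each run of identical characters to one.
sub short_shape {
    my $shape = word_shape($_[0]);
    $shape =~ s/(.)\1+/$1/g;
    return $shape;
}

for my $token ('DC10-30', 'I.M.F', 'Canadian', '$38.3') {
    printf "%-10s %-10s %s\n", $token, word_shape($token), short_shape($token);
}
```

The shape of `$38.3` is `$dd.d`, which illustrates why shape can be a useful cue for monetary values as well as for names.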