LING/C SC/PSYC 438/538 Lecture 9 Sandiway Fong
Administrivia
• Homework 2 graded
• Today's topics
  • Homework 3 review
  • Named Entity Recognition (NER)
Homework 3 Review
• Extract all the dollar amounts from WSJ9_001e.txt within the <TEXT> </TEXT> markups.
• Examples (actual):
  • $23 million
  • $38.3 billion
  • C$1,000
  • $95,142
  • $3.01
  • $38.375
  • $45
  • 16.125 Canadian dollars
• Compute:
  • How many dollar amounts you extracted
  • The largest dollar amount
  • The smallest dollar amount
  • The median dollar amount
• Appendix:
  • List all the dollar amounts
Homework 3 Review
• Start with the code template from last time…
Homework 3 Review
• Examples (actual):
  • $23 million
  • $38.3 billion
  • C$1,000
  • $95,142
  • $3.01
  • $38.375
  • $45
  • 16.125 Canadian dollars
• regex (part by part):
  • numeric:
    • \$
    • \$[\d\.,]+
    • \$[\d\.,]*\d
  • word:
    • (tr|[mb])illions?
  • numeric (word):
    • \$[\d\.,]*\d\s*((tr|[mb])illions?|)
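The part-by-part regex above can be tried out on a few of the example strings. This is a minimal sketch, not the homework solution: the helper name first_match and the words around each amount are made up for illustration. It shows why the numeric part ends in \d (a sentence-final period is not swallowed) and that the last example needs a different pattern.

```perl
use strict;
use warnings;

# Regex from the slide: "$", digits with optional "." or ",", optional scale word.
my $dollar_re = qr/\$[\d\.,]*\d\s*((tr|[mb])illions?|)/;

# Amounts are from the slide; the surrounding words are invented.
my @samples = (
    'a $23 million loss',
    'assets of $38.3 billion',
    'about C$1,000 each',
    'closed at $38.375.',
    '16.125 Canadian dollars',
);

# Hypothetical helper: return the matched text, or undef when nothing matches.
sub first_match {
    my ($line) = @_;
    return $line =~ /($dollar_re)/ ? $1 : undef;
}

for my $s (@samples) {
    my $m = first_match($s);
    # Brackets make any trailing whitespace matched by \s* visible.
    printf "%-28s => %s\n", $s, defined($m) ? "[$m]" : 'no match';
}
```

Note that \s* may consume a trailing space even when no scale word follows; this is harmless once only the capture groups are used.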
Homework 3 Review
• regex so far:
  • \$[\d\.,]*\d\s*((tr|[mb])illions?|)
• Decide on which parts we want to extract and store them ($1 = numeric, $2 = word):
    while ($line =~ /\$([\d\.,]*\d)\s*((tr|[mb])illions?|)/g) {
      $numeric = $1;
      $word = $2;          # save $2 first: a successful s/// resets $1, $2
      $numeric =~ s/,//g;  # remove commas
      …
    }
• Compute value:
  • $word = million: $numeric * 1000000
  • $word = billion: $numeric * 1000000000
  • $word = trillion: $numeric * 1000000000000
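The value computation can be sketched as a lookup table keyed on the scale word. The names %scale and dollar_value and the sample line are assumptions for illustration, and the // operator needs Perl 5.10 or later. Both captures are copied before any other match runs, since a later successful match or substitution would overwrite $1 and $2.

```perl
use strict;
use warnings;

# Multipliers for the scale words accepted by (tr|[mb])illions?
my %scale = (million => 1e6, billion => 1e9, trillion => 1e12);

# Hypothetical helper: turn the captured numeric string and scale word into a number.
sub dollar_value {
    my ($numeric, $word) = @_;
    $numeric =~ s/,//g;                 # remove commas: "1,000" -> "1000"
    $word = '' unless defined $word;
    $word =~ s/s$//;                    # "millions" -> "million"
    return $numeric * ($scale{$word} // 1);   # no scale word: multiply by 1
}

# Made-up line using amounts from the slide.
my $line = 'sales rose to $38.3 billion from $95,142 and $23 million';
my @values;
while ($line =~ /\$([\d\.,]*\d)\s*((tr|[mb])illions?|)/g) {
    my ($numeric, $word) = ($1, $2);    # copy captures before anything else matches
    push @values, dollar_value($numeric, $word);
}
print join("\n", @values), "\n";
```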
Homework 3 Review
• What about examples like?
  • 16.125 Canadian dollars
  • 100 million Canadian dollars
• regex:
  • numeric: \d[\d\.,]*
  • word: (tr|[mb])illions?
  • dollars: ([Cc]anadian|)\s+dollars?
• code:
    while ($line =~ /(\d[\d\.,]*)\s*((tr|[mb])illions?|)\s*([Cc]anadian|)\s+dollars?/g) {
      $numeric = $1;
      $word = $2;          # save $2 first: a successful s/// resets $1, $2
      $numeric =~ s/,//g;  # remove commas
      …
    }
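A minimal sketch of the spelled-out case, using the regex above. The function name spelled_out_amounts, the %scale table, and the sample sentence are assumptions, not part of the homework template. Each result pairs the computed value with whatever the ([Cc]anadian|) group captured (an empty string when the word is absent).

```perl
use strict;
use warnings;

# Pattern from the slide for amounts written with the word "dollars".
my $words_re = qr/(\d[\d\.,]*)\s*((tr|[mb])illions?|)\s*([Cc]anadian|)\s+dollars?/;

my %scale = (million => 1e6, billion => 1e9, trillion => 1e12);

# Hypothetical helper: return [value, country] pairs for each match in a line.
sub spelled_out_amounts {
    my ($line) = @_;
    my @found;
    while ($line =~ /$words_re/g) {
        my ($numeric, $word, $country) = ($1, $2, $4);  # copy before s/// below
        $numeric =~ s/,//g;                             # remove commas
        $word    =~ s/s$//;                             # "millions" -> "million"
        push @found, [ $numeric * ($scale{$word} // 1), $country ];
    }
    return @found;
}

# Made-up sentence built from the two slide examples.
my @amounts = spelled_out_amounts(
    'paid 16.125 Canadian dollars and 100 million Canadian dollars');
printf "%s\t%s\n", $_->[0], $_->[1] for @amounts;
```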
Homework 3 Review
• Run: perl dollar.perl WSJ9_001e.txt
• Number: 342
• Smallest: 1.55
• Largest: 3100000000000
• Median (#170, #171): 85944.5
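The four statistics can be computed from a numerically sorted list. This is a sketch with a made-up input list (not the WSJ data) and a hypothetical function name, summarize. For an even count it averages the two middle values at 0-based indices n/2 - 1 and n/2, which for 342 values are #170 and #171.

```perl
use strict;
use warnings;

# Hypothetical helper: count, smallest, largest and median of a list of numbers.
sub summarize {
    my @sorted = sort { $a <=> $b } @_;   # numeric sort, not the default string sort
    my $n = @sorted;
    return unless $n;                     # empty input: nothing to report
    my $mid = int($n / 2);
    # Odd count: the middle value. Even count: average of the two middle values.
    my $median = $n % 2 ? $sorted[$mid]
                        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
    return (count => $n, smallest => $sorted[0],
            largest => $sorted[-1], median => $median);
}

my %s = summarize(45, 3.01, 1000, 23e6);  # made-up amounts for illustration
print "Number: $s{count}\n";
print "Smallest: $s{smallest}\n";
print "Largest: $s{largest}\n";
print "Median: $s{median}\n";
```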
Named Entity Recognition (NER)
• Named-Entity Recognition (also Identification and Extraction) tries to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
• [paraphrased from http://en.wikipedia.org/wiki/Named-entity_recognition]
Example WSJ9_002.txt
Illinois NER System
• http://cogcomp.cs.illinois.edu/demo/ner/
• NLP systems might also compute: anaphora reference
Textbook
• See JM Chapter 22: Information Extraction
  • 22.1 Named Entity Recognition
  • 22.2 Relation Detection and Classification
• Also Chapter 21 for Anaphora Resolution
JM Chapter 22 Ambiguity: sometimes systematic, sometimes not
JM Chapter 22 • Word by word labeling (IOB “inside outside beginning”)
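Word-by-word IOB labeling can be illustrated by converting entity spans into one tag per token: B- for the first token of an entity, I- for tokens inside it, and O for tokens outside any entity. The sketch below assumes spans are given as [start, end, type] with inclusive 0-based token indices; the function name iob_tags, the tokens, and the spans are hand-made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: assign IOB tags given tokens and entity spans.
sub iob_tags {
    my ($tokens, $spans) = @_;
    my @tags = ('O') x @$tokens;                 # default: outside any entity
    for my $span (@$spans) {
        my ($start, $end, $type) = @$span;
        $tags[$start] = "B-$type";               # beginning of the entity
        $tags[$_] = "I-$type" for $start + 1 .. $end;   # inside the entity
    }
    return @tags;
}

# Hand-labeled illustration: two organization names.
my @tokens = ('American', 'Airlines', ',', 'a', 'unit', 'of', 'AMR', 'Corp.');
my @spans  = ([0, 1, 'ORG'], [6, 7, 'ORG']);
my @tags   = iob_tags(\@tokens, \@spans);
print "$tokens[$_]\t$tags[$_]\n" for 0 .. $#tokens;
```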
JM Chapter 22
• POS information
• Shape
• Syntactic chunking
JM Chapter 22 • What features to use in making a decision (used also for Machine Learning)
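As one concrete illustration of per-token features, the sketch below computes a word-shape feature (uppercase letters become X, lowercase become x, digits become d, other characters are kept) together with the neighboring words. The feature set, the function names word_shape and token_features, and the boundary markers <s> and </s> are assumptions chosen for illustration, not a prescribed feature list.

```perl
use strict;
use warnings;

# Word shape: uppercase -> X, lowercase -> x, digit -> d, everything else kept.
sub word_shape {
    my ($w) = @_;
    my $shape = $w;
    $shape =~ tr/A-Z/X/;   # order matters: digits are mapped last so the
    $shape =~ tr/a-z/x/;   # new "d" characters are not turned into "x"
    $shape =~ tr/0-9/d/;
    return $shape;
}

# Hypothetical feature extractor for the token at position $i.
sub token_features {
    my ($tokens, $i) = @_;
    my $w = $tokens->[$i];
    return {
        word  => lc($w),
        shape => word_shape($w),
        prev  => $i > 0         ? lc($tokens->[$i - 1]) : '<s>',
        next  => $i < $#$tokens ? lc($tokens->[$i + 1]) : '</s>',
    };
}

my @tokens = ('AMR', 'Corp.', 'rose');    # made-up token sequence
my $f = token_features(\@tokens, 1);
print "$_=$f->{$_}\n" for sort keys %$f;
```

Feature hashes like this could then be passed, one per token, to whatever classifier assigns the labels.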