Information Extraction

Information Extraction
CS511 - UIUC - Fall 2005
Group Presentation: Hieu Le, Nhung Nguyen, Mayssam Sayyadian
(With most of the slides from Anhai Doan, D.W. Embley, D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith, S.W. Liddle, D.W. Quass, Yi, William W. Cohen & Scott Wen-Tau)

Road Map
• Motivational Problem
• What is Information Extraction (IE)?
• IE History
• Landscape of problems and solutions
• Techniques in IE
• Conclusion

Example: The Problem
[Figure labels: Martin Baker, a person; Genomics job; Employers; job posting form]

Example: A Solution

Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

Job Openings: Category = Food Services Keyword = Baker Location = Continental U.S.

IE from Research Papers

What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION

What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.
Applying IE to the news article above (October 14, 2002) fills the table:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
Segments marked in the news article above (aka “named entity extraction”):
Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation

What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
Applied to the news article above, the segments are:
Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation
and the resulting table is:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

IE History
Pre-Web
• Mostly news articles
• De Jong’s FRUMP [1982]
• Hand-built system to fill Schank-style “scripts” from news wire
• Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]
• Early work dominated by hand-built models
• E.g. SRI’s FASTUS, hand-built FSMs.
• But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Web
• AAAI ’94 Spring Symposium on “Software Agents”
• Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
• Tom Mitchell’s WebKB, ’96
• Build KBs from the Web.
• Wrapper Induction
• Initially hand-built, then ML: [Soderland ’96], [Kushmerick ’97], …
• Citeseer; Cora; FlipDog; contEd courses, corpInfo, …

IE History
Biology
• Gene/protein entity extraction
• Protein/protein interaction facts
• Automated curation/integration of databases
• At CMU: SLIF (Murphy et al, subcellular information from images + text in journal articles)
Email
• EPCA, PAL, RADAR, CALO: intelligent office assistant that “understands” some part of email
• At CMU: web site update requests; office-space requests; calendar scheduling requests; social network analysis of email.

IE is different in different domains!
Example: on the Web there is less grammar, but more formatting & linking.
Newswire:
Apple to Open Its First Retail Store in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. "Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."
Web:
www.apple.com/retail
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
The directory structure, link structure, formatting & layout of the Web is its own new grammar.

Landscape of IE Tasks (1/4): Degree of Formatting
• Text paragraphs without formatting
• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables
Example of a text paragraph without formatting:
Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Landscape of IE Tasks (2/4): Intended Breadth of Coverage
• Web site specific: Formatting; Amazon.com Book Pages
• Genre specific: Layout; Resumes
• Wide, non-specific: Language; University Names

Landscape of IE Tasks (3/4): Complexity
E.g. word patterns:
• Closed set: U.S. states
  He was born in Alabama…
  The big Wyoming sky…
• Regular set: U.S. phone numbers
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299
• Complex pattern: U.S. postal addresses
  University of Arkansas P.O. Box 140 Hope, AR 71802
  Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
• Ambiguous patterns, needing context and many sources of evidence: Person names
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
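The two simplest cases on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PHONE` pattern covers just the two phone formats shown on the slide, and `STATES` is a small sample of the closed set, not the full list of U.S. states. Both names are made up for this sketch.

```python
import re

# Regular set: one illustrative pattern for the two phone formats on the slide
# (an assumption; real U.S. phone numbers appear in more formats than this).
PHONE = re.compile(r"\(?\b\d{3}\)?[ -]?\d{3}-\d{4}\b")

# Closed set: a small sample of state names (assumption: a real system
# would list all of them).
STATES = {"Alabama", "Arkansas", "Ohio", "Wyoming"}


def find_phones(text):
    """Return every substring that matches the phone pattern."""
    return PHONE.findall(text)


def find_states(text):
    """Return every word that is a member of the closed set."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if w in STATES]
```

Person names and postal addresses do not fit either scheme: "Hope" is a city in one example and a first name in the other, which is why those cases need context and several sources of evidence.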

Landscape of IE Tasks (4/4): Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
Single entity (“Named entity” extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship:
  Relation: Person-Title; Person: Jack Welch; Title: CEO
  Relation: Company-Location; Company: General Electric; Location: Connecticut
N-ary record:
  Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Why is IE Difficult?
• Different languages
  • Morphology is very easy in English, much harder in German and Hebrew.
  • Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese.
  • Some languages carry useful cues in their orthography (English capitalizes names, for example), while others (such as Hebrew and Arabic) do not.
• Different types of style
  • Scientific papers
  • Newspapers
  • Memos
  • Emails
  • Speech transcripts
• Type of document
  • Tables
  • Graphics
  • Small messages vs. books

Techniques in IE

Overview of Techniques in IE • Wrapper Construction • NLP oriented • Conceptual Model Based

Overview: Wrapper Construction Approach
• Depends heavily on the structure of HTML pages
• Different wrappers are needed for different sources, even within the same domain
• Wrapper maintenance is a big problem

Overview: NLP Approach
• Based heavily on NLP and machine learning techniques
• Aims at a deep understanding of the sentence
• Prefers complete sentences over partial ones

Overview: Conceptual Model Based
• New
• Depends on an AI technique: knowledge representation
• More details follow later

Overview of some ML techniques in IE • Rote Learning • Sliding window* • Naïve Bayes • HMM • ME • SRV • Multistrategy*

Rote Learning • Remember everything • Good for Person Name, City Name, Country Name
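A minimal sketch of rote learning, assuming it means memorizing every labeled phrase seen in training and extracting only exact matches. The function names, the longest-match-first strategy, and the toy lexicon are assumptions of this sketch, not part of the slides.

```python
def train_rote(examples):
    """Memorize every (phrase, label) pair seen in training."""
    return {phrase.lower(): label for phrase, label in examples}


def extract_rote(lexicon, text, max_len=3):
    """Scan left to right and extract the longest memorized phrase at each position."""
    tokens = [t.strip(".,;:") for t in text.split()]
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            label = lexicon.get(phrase.lower())
            if label is not None:
                found.append((phrase, label))
                i += n          # skip past the matched phrase
                break
        else:
            i += 1              # nothing memorized starts here
    return found


# Toy training data, invented for illustration.
lexicon = train_rote([
    ("New York City", "City"),
    ("Steve Jobs", "Person"),
    ("France", "Country"),
])
```

The weakness is visible immediately: a name that never occurred in training, such as "Pedro Domingos", is never extracted.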

Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement

Naïve Bayes (1)
• Simple but powerful: highly recommended
• Makes a strong assumption: all features are independent of one another given the label

A “Naïve Bayes” Sliding Window Model [Freitag 1997]
Example text: … 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
Window layout: prefix = w_{t-m} … w_{t-1}; contents = w_t … w_{t+n}; suffix = w_{t+n+1} … w_{t+n+m}
Estimate Pr(LOCATION|window) using Bayes rule.
Try all “reasonable” windows (vary length, position).
Assume independence for length, prefix words, suffix words, content words.
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION).
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
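The model above can be sketched as a two-class Naive Bayes over windows. This is a simplified illustration, not Freitag's implementation: the prefix/suffix width `M`, the smoothing constant `ALPHA`, the choice to use only features that are present in a window, the choice to skip features never seen in training, and the toy announcements are all assumptions of this sketch.

```python
import math
from collections import Counter

M = 2        # prefix/suffix width in tokens (assumed)
ALPHA = 0.1  # additive smoothing constant (assumed)


def features(tokens, start, end):
    """Length, prefix words, content words and suffix words of tokens[start:end]."""
    feats = {("len", end - start)}
    feats |= {("prefix", w) for w in tokens[max(0, start - M):start]}
    feats |= {("content", w) for w in tokens[start:end]}
    feats |= {("suffix", w) for w in tokens[end:end + M]}
    return feats


def windows(tokens, max_len=4):
    """All candidate windows up to max_len tokens (vary length and position)."""
    return [(s, e) for s in range(len(tokens))
            for e in range(s + 1, min(s + max_len, len(tokens)) + 1)]


def train(docs):
    """docs: list of (tokens, (start, end)), one labeled field span per document."""
    counts = {True: Counter(), False: Counter()}
    n = {True: 0, False: 0}
    for tokens, span in docs:
        for w in windows(tokens):
            label = (w == span)          # the labeled span is the only positive window
            n[label] += 1
            counts[label].update(features(tokens, *w))
    return counts, n


def posterior(model, tokens, start, end):
    """Pr(FIELD | window) by Bayes rule, treating features as independent."""
    counts, n = model
    log_odds = math.log(n[True] / n[False])
    for f in features(tokens, start, end):
        if counts[True][f] + counts[False][f] == 0:
            continue                     # never seen in training: no evidence
        p_pos = (counts[True][f] + ALPHA) / (n[True] + 2 * ALPHA)
        p_neg = (counts[False][f] + ALPHA) / (n[False] + 2 * ALPHA)
        log_odds += math.log(p_pos / p_neg)
    return 1 / (1 + math.exp(-log_odds))


# Toy announcements, invented for illustration; spans mark the LOCATION field.
docs = [
    ("Time : 3:30 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split(), (6, 10)),
    ("Place : Baker Hall Rm 120 Speaker : Jaime Carbonell".split(), (2, 6)),
]
model = train(docs)
test_tokens = "Place : Doherty Hall Rm 2315 Speaker : Tom Mitchell".split()
best = max(windows(test_tokens), key=lambda w: posterior(model, test_tokens, *w))
```

In this toy setting the highest-scoring window in the unseen announcement is the one between "Place :" and "Speaker :", even though "Doherty" and "2315" never occurred in training; the evidence comes from the prefix, suffix, length and the shared words "Hall" and "Rm".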

“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements (the example announcement is the one shown on the “Extraction by Sliding Window” slide)
Field         F1
Person Name   30%
Location      61%
Start Time    98%

Hidden Markov Model (HMM)
• A very popular ML algorithm in NLP and speech recognition.
• Assumption: the state (label) at each position depends only on a limited number of previous states, and each observation depends only on the state at that position.

IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM with states: person name, location name, background
Find the most likely state sequence (Viterbi):
  Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated “person name” state are extracted as a person name:
  Person name: Pedro Domingos
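The decoding step can be sketched with a standard Viterbi implementation. The two-state HMM below is hand-set for illustration only: the probabilities, the `unk` floor for unseen words, and the reduction to two states ("background", "person") are assumptions of this sketch, whereas a real system would estimate the parameters from labeled text.

```python
import math


def viterbi(obs, states, start, trans, emit, unk=1e-6):
    """Return the most likely state sequence for obs (log-space Viterbi)."""
    V = [{s: math.log(start[s]) + math.log(emit[s].get(obs[0], unk)) for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            # Best previous state for reaching s at this position.
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            scores[s] = (V[-1][prev] + math.log(trans[prev][s])
                         + math.log(emit[s].get(o, unk)))
            ptr[s] = prev
        V.append(scores)
        back.append(ptr)
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract(tokens, path, target):
    """Join consecutive tokens assigned to the target state."""
    spans, current = [], []
    for tok, state in zip(tokens, path):
        if state == target:
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hand-set toy parameters (assumed, not trained).
states = ["background", "person"]
start = {"background": 0.9, "person": 0.1}
trans = {"background": {"background": 0.8, "person": 0.2},
         "person": {"background": 0.4, "person": 0.6}}
emit = {"background": {"yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                       "example": 0.2, "sentence": 0.2},
        "person": {"pedro": 0.5, "domingos": 0.5}}

tokens = "Yesterday Pedro Domingos spoke this example sentence".split()
path = viterbi([t.lower() for t in tokens], states, start, trans, emit)
names = extract(tokens, path, "person")
```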

HMM for Segmentation • Simplest Model: One state per entity type

HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”]
Task: Named Entity Extraction
States: start-of-sentence, Person, Org, (five other name classes), Other, end-of-sentence
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}); back-off to P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}); back-off to P(o_t | s_t), then P(o_t)
Train on ~500k words of news wire text.
Results:
Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ’99]

Maximum Entropy: Algorithm
• Learning algorithm: maximum entropy classifier
• Estimates probabilities based on the principle of making as few assumptions as possible
• Among the probability distributions consistent with the constraints observed in the training data, pick the one that has the highest entropy
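A small numeric illustration of the principle, not a maximum entropy trainer. Assume three labels and a single made-up constraint from training data, P(person) = 0.5. Among candidate distributions that satisfy the constraint, the one that spreads the remaining mass evenly has the highest entropy.

```python
import math


def entropy(p):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


# Candidate distributions over (person, location, other); all satisfy the
# assumed constraint P(person) = 0.5.
candidates = [
    [0.5, 0.25, 0.25],
    [0.5, 0.4, 0.1],
    [0.5, 0.5, 0.0],
]
best = max(candidates, key=entropy)
```

The first candidate wins because it commits to nothing beyond the constraint; the other two assert extra preferences between "location" and "other" that the data did not supply.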

SRV: a realistic sliding-window-classifier IE system [Freitag AAAI ’98]
• What windows to consider?
  • All windows containing as many tokens as the shortest example, but no more tokens than the longest example
• How to represent a classifier? It might:
  • Restrict the length of window;
  • Restrict the vocabulary or formatting used before/after/inside window;
  • Restrict the relative order of tokens;
  • Use inductive logic programming techniques to express all these…
Example page: <title>Course Information for CS213</title> <h1>CS 213 C++ Programming</h1>
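The window-selection rule can be sketched directly: take the shortest and longest training examples and enumerate every window whose length falls in that range. The function name and the whitespace tokenization are assumptions of this sketch.

```python
def candidate_windows(tokens, examples):
    """All windows no shorter than the shortest example and no longer than the longest."""
    lo = min(len(e) for e in examples)
    hi = max(len(e) for e in examples)
    return [tokens[s:s + n]
            for n in range(lo, hi + 1)
            for s in range(len(tokens) - n + 1)]


# Toy data based on the example page: training instances of 1 and 2 tokens.
examples = [["CS213"], ["CS", "213"]]
tokens = "CS 213 C++ Programming".split()
cands = candidate_windows(tokens, examples)
```

With four tokens and lengths 1 to 2, this yields 4 + 3 = 7 candidates, each of which the learned classifier would then accept or reject.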

SRV: a rule-learner for sliding-window classification • Primitive predicates used by SRV: • token(X,W), allLowerCase(W), numerical(W), … • nextToken(W,U), previousToken(W,V) • HTML-specific predicates: • inTitleTag(W), inH1Tag(W), inEmTag(W),… • emphasized(W) = “inEmTag(W) or inBTag(W) or …” • tableNextCol(W,U) = “U is some token in the column after the column W is in” • tablePreviousCol(W,V), tableRowHeader(W,T),…

SRV: a rule-learner for sliding-window classification
• Non-primitive “conditions” used by SRV:
  • every(+X, f, c) = for all W in X : f(W)=c
  • some(+X, W, <f1,…,fk>, g, c) = exists W: g(fk(…(f1(W))…))=c
  • tokenLength(+X, relop, c):
  • position(+W, direction, relop, c):
  • e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)
Example rule:
courseNumber(X) :- tokenLength(X,=,2), every(X, inTitle, false), some(X, A, <previousToken>, inTitle, true), some(X, B, <>, tripleton, true)
Non-primitive conditions make greedy search easier.
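To make the courseNumber rule concrete, here is a minimal sketch that evaluates it over the CS213 example page. The `Token` class, the index-based window representation, and the reading of `tripleton` as "a three-character token" are assumptions of this sketch; it evaluates a given rule and does not learn one.

```python
import operator
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    in_title: bool


RELOPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


# Primitive features and relations; each takes the document and a token index.
def in_title(doc, i):
    return doc[i].in_title


def tripleton(doc, i):
    return len(doc[i].text) == 3      # assumed meaning: three-character token


def previous_token(doc, i):
    return i - 1 if i > 0 else None


# Non-primitive conditions; a window is a range of token indices.
def token_length(window, relop, c):
    return RELOPS[relop](len(window), c)


def every(doc, window, feature, value):
    return all(feature(doc, i) == value for i in window)


def some(doc, window, path, feature, value):
    for i in window:
        for step in path:             # follow the relation path f1, ..., fk
            i = step(doc, i) if i is not None else None
        if i is not None and feature(doc, i) == value:
            return True
    return False


def course_number(doc, window):
    return (token_length(window, "=", 2)
            and every(doc, window, in_title, False)
            and some(doc, window, [previous_token], in_title, True)
            and some(doc, window, [], tripleton, True))


# Tokens of the example page; the first four come from the <title> element.
doc = ([Token(t, True) for t in "Course Information for CS213".split()]
       + [Token(t, False) for t in "CS 213 C++ Programming".split()])
matches = [" ".join(doc[i].text for i in w)
           for w in (range(s, s + 2) for s in range(len(doc) - 1))
           if course_number(doc, w)]
```

Only the window "CS 213" passes: it has two tokens, neither is in the title, the token before "CS" is in the title, and "213" has three characters.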

Multistrategy: consulting many experts
• Practice shows that there is no single dominant ML algorithm.
• Each ML algorithm has its own bias.
• Combining learners into a meta-learner can outperform the individual learners (worst case?).
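One simple way to combine learners is a majority vote. The slides do not say which combination scheme is meant, so the voting rule and the three toy "experts" below are assumptions chosen only to show the idea.

```python
from collections import Counter


def majority_vote(learners, token):
    """Ask every learner for a label and return the most common answer."""
    votes = Counter(learner(token) for learner in learners)
    return votes.most_common(1)[0][0]


# Three toy experts, each with its own bias (all invented for illustration).
def by_capitalization(token):
    return "name" if token[:1].isupper() else "other"


def by_lexicon(token):
    return "name" if token in {"Pedro", "Jaime"} else "other"


def by_stoplist(token):
    return "other" if token.lower() in {"the", "yesterday", "spoke"} else "name"


experts = [by_capitalization, by_lexicon, by_stoplist]
```

"Yesterday" fools the capitalization expert and "Domingos" is missing from the lexicon, but in both cases the other two experts outvote the mistaken one.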

Conceptual Model Based

Motivation
• Database-style queries are effective
  • Find red cars, 1993 or newer, < $5,000
  • Select * From Car Where Color='red' And Year >= 1993 And Price < 5000
• Web is not a database
  • Uses keyword search
  • Retrieves documents, not records
  • Assuming we have a range operator:
  • “red” and (1993 to 1998) and (1 to 5000)
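For contrast with keyword search, the database-style query can be run as-is once the data sits in a table. A minimal sketch with Python's built-in `sqlite3` module; the table layout and the four rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Car (Year INTEGER, Make TEXT, Color TEXT, Price INTEGER)")

# Toy rows, invented for illustration.
rows = [
    (1997, "CHEVY", "red", 11995),
    (1994, "DODGE", "red", 4995),
    (1991, "FORD", "red", 3500),
    (1994, "DODGE", "white", 4500),
]
conn.executemany("INSERT INTO Car VALUES (?, ?, ?, ?)", rows)

# Red cars, 1993 or newer, under $5,000.
result = conn.execute(
    "SELECT Year, Make, Price FROM Car WHERE Color = ? AND Year >= ? AND Price < ?",
    ("red", 1993, 5000),
).fetchall()
```

The query returns records rather than documents, and the range conditions on Year and Price are evaluated on typed values, which a plain keyword search cannot do.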

GOAL: Query the Web like we query a database
Example: Get the year, make, model, and price for 1987 or later cars that are red or white.
Year  Make   Model     Price
------------------------------
97    CHEVY  Cavalier  11,995
94    DODGE            4,995
94    DODGE  Intrepid  10,000
91    FORD   Taurus    3,500
90    FORD   Probe
88    FORD   Escort    1,000

PROBLEM: The Web is not structured like a database.
Example:
<html>
<head>
<title>The Salt Lake Tribune Classified’s</title>
</head>
…
<hr>
<h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888</h4>
<hr>
…
</html>

Making the Web Look Like a Database
• Web Query Languages
  • Treat the Web as a graph (pages = nodes, links = edges).
  • Query the graph (e.g., Find all pages within one hop of pages with the words “Cars for Sale”).
• Wrappers
  • Find page of interest.
  • Parse page to extract attribute-value pairs and insert them into a database.
    • Write parser by hand.
    • Use syntactic clues to generate parser semi-automatically.
  • Query the database.
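The graph query from the first bullet can be sketched over a toy page graph. The page URLs and texts are hypothetical, and "within one hop" is read here as "the matching page itself or any page it links to", which is one possible interpretation.

```python
def within_one_hop(pages, phrase):
    """Pages containing the phrase, plus every page they link to directly."""
    seeds = {url for url, (text, _links) in pages.items()
             if phrase.lower() in text.lower()}
    result = set(seeds)
    for url in seeds:
        result.update(pages[url][1])
    return result


# Hypothetical pages: url -> (text, outgoing links).
pages = {
    "a.example/cars":  ("Cars for Sale: used sedans", ["a.example/ford", "a.example/dodge"]),
    "a.example/ford":  ("Ford Taurus listing", []),
    "a.example/dodge": ("Dodge Intrepid listing", ["a.example/about"]),
    "a.example/about": ("About us", []),
}
hits = within_one_hop(pages, "Cars for Sale")
```

The result is still a set of pages, not records, which is why the wrapper approach in the second bullet is needed to get attribute-value pairs.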

Automatic Wrapper Generation
for a page of unstructured documents, rich in data and narrow in ontological breadth
[Figure: system architecture. Labels: Application Ontology; Web Page; Ontology Parser; Record Extractor; Constant/Keyword Matching Rules; Record-Level Objects, Relationships, and Constraints; Database Scheme; Constant/Keyword Recognizer; Unstructured Record Documents; Database-Instance Generator; Populated Database; Data-Record Table]
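The constant/keyword matching step can be illustrated with a handful of regular expressions applied to the car ad from the Salt Lake Tribune example. In the approach described here such rules come from the application ontology; the hand-written `RULES` table below, including its small sets of makes and colors, is an assumption made for this sketch.

```python
import re

# The ad text from the example page (apostrophe written as a plain ').
AD = ("'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. "
      "Previous owner heart broken! Asking only $11,995. "
      "#1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888")

# One matching rule per attribute (hand-written and deliberately incomplete).
RULES = {
    "Year":  r"'(\d{2})\b",
    "Make":  r"\b(CHEVY|DODGE|FORD)\b",
    "Model": r"\b(?:CHEVY|DODGE|FORD)\s+(\w+)",
    "Color": r"\b(Red|White|Blue|Black)\b",
    "Price": r"\$(\d{1,3}(?:,\d{3})*)",
}


def extract_record(text):
    """Apply each rule once and collect the first match per attribute."""
    record = {}
    for field, pattern in RULES.items():
        match = re.search(pattern, text)
        record[field] = match.group(1) if match else None
    return record


record = extract_record(AD)
```

The price rule keys on the dollar sign, so the mileage figure "7,000" is not mistaken for a price; an attribute with no match stays empty, like the blank cells in the goal table.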
