
IE by Candidate Classification: Califf & Mooney


Presentation Transcript


  1. IE by Candidate Classification: Califf & Mooney. William Cohen, 1/21/03.

  2. Where this paper fits. What happens when the candidate generator becomes very general?
  [Pipeline diagram: Candidate Generator → candidate phrase → Learned filter → extracted phrase]

  3. Example task for Rapier

  4. Rapier: the 3-slide version [Califf & Mooney, AAAI ‘99]
  A bottom-up rule learner:
    initialize RULES to be one rule per example;
    repeat {
      randomly pick N pairs of rules (Ri, Rj);
      let {G1, …, GN} be the consistent pairwise generalizations;
      let G* = the Gi that optimizes “compression”;
      let RULES = RULES + {G*} – {R’: covers(G*, R’)}
    }
  where compression(G, RULES) = size of RULES – {R’: covers(G, R’)}, and covers(G, R) means every example matching R also matches G.
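  As a rough executable picture of that loop, here is a minimal Python sketch under simplifying assumptions: a rule is modeled only by the set of examples it covers, generalize() stands in for Rapier's pairwise generalization, and all helper names are illustrative rather than taken from the paper.

  import random

  def learn(examples, generalize, n_pairs=10, n_rounds=50):
      # Initialize RULES to one (maximally specific) rule per example.
      rules = [frozenset([ex]) for ex in examples]
      for _ in range(n_rounds):
          if len(rules) < 2:
              break
          # Randomly pick N pairs of rules and form their consistent pairwise generalizations.
          pairs = [random.sample(rules, 2) for _ in range(n_pairs)]
          candidates = [g for ri, rj in pairs
                        for g in [generalize(ri, rj)] if g is not None]
          if not candidates:
              continue
          # covers(G, R): every example matching R also matches G (here: a subset test).
          def covered(g):
              return [r for r in rules if r <= g]
          # Pick the G* that best "compresses" the rule set, i.e. subsumes the most rules.
          g_star = max(candidates, key=lambda g: len(covered(g)))
          cov = covered(g_star)
          rules = [r for r in rules if r not in cov] + [g_star]
      return rules

  In this toy setting a trivial generalize could be lambda a, b: a | b (the union of the examples covered); in Rapier the generalization is of course computed over the rule patterns themselves.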

  5. Differences are dropped; common conditions are carried over to the generalization.
  Example 1:
  <title>Course Information for CS213</title> <h1>CS 213 C++ Programming</h1> …
  courseNum(window1) :- token(window1,’CS’), doubleton(‘CS’), prevToken(‘CS’,’CS213’), inTitle(‘CS213’), nextTok(‘CS’,’213’), numeric(‘213’), tripleton(‘213’), nextTok(‘213’,’C++’), tripleton(‘C++’), ….
  Example 2:
  <title>Syllabus and meeting times for Eng 214</title> <h1>Eng 214 Software Engineering for Non-programmers</h1> …
  courseNum(window2) :- token(window2,’Eng’), tripleton(‘Eng’), prevToken(‘Eng’,’214’), inTitle(‘214’), nextTok(‘Eng’,’214’), numeric(‘214’), tripleton(‘214’), nextTok(‘214’,’Software’), …
  Generalization:
  courseNum(X) :- token(X,A), prevToken(A,B), inTitle(B), nextTok(A,C), numeric(C), tripleton(C), nextTok(C,D), …

  6. Rapier: an alternative approach
  • Combines top-down and bottom-up learning:
    – bottom-up to find common restrictions on the content
    – top-down greedy addition of restrictions on the context
  • Uses part-of-speech and semantic features (from WordNet).
  • Uses a special “pattern language” based on sequences of tokens, each of which satisfies one of a set of given constraints, e.g.
    < <tok{‘ate’,’hit’}, POS{‘vb’}>, <tok{‘the’}>, <POS{‘nn’}> >

  7. Rapier: the N slide version

  8. Rapier: IE with “rules”
  • A rule consists of:
    – a pre-filler pattern
    – a filler pattern
    – a post-filler pattern
  • A pattern is composed of elements, and each pattern element matches a sequence of words that obeys constraints on:
    – the value, POS, or WordNet class of the words in the sequence
    – the total length of the sequence
  • Example: “IBM paid an undisclosed amount” matches
    PreFiller: <pos=nn or nnp>:1 <anyword>:2
    Filler: <word=‘undisclosed’>:1
    PostFiller: <semanticClass=‘price’>
  • A rule might match many times in a document.
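  To make the pre-filler / filler / post-filler matching concrete, here is a small Python sketch of pattern elements and a matcher. PatternElement, admits, and matches are my own names, and the semantics (each element may absorb zero up to max_len admissible tokens) is a simplification of Rapier’s pattern items and pattern lists, not the system’s actual code.

  from dataclasses import dataclass
  from typing import List, Optional, Set, Tuple

  @dataclass
  class PatternElement:
      words: Optional[Set[str]] = None   # allowed token values (None = any word)
      pos: Optional[Set[str]] = None     # allowed POS tags (None = any POS)
      max_len: int = 1                   # element may absorb up to max_len tokens

      def admits(self, token: Tuple[str, str]) -> bool:
          word, tag = token
          return ((self.words is None or word in self.words) and
                  (self.pos is None or tag in self.pos))

  def matches(pattern: List[PatternElement], tokens: List[Tuple[str, str]]) -> bool:
      """True if tokens can be segmented so each element covers 0..max_len admissible tokens."""
      def rec(p: int, t: int) -> bool:
          if p == len(pattern):
              return t == len(tokens)
          elem = pattern[p]
          for k in range(0, elem.max_len + 1):
              if t + k > len(tokens) or any(not elem.admits(tok) for tok in tokens[t:t + k]):
                  break
              if rec(p + 1, t + k):
                  return True
          return False
      return rec(0, 0)

  # The pre-filler pattern from the slide, matched against "IBM paid an" (with toy POS tags):
  prefiller = [PatternElement(pos={"nn", "nnp"}, max_len=1), PatternElement(max_len=2)]
  print(matches(prefiller, [("IBM", "nnp"), ("paid", "vbd"), ("an", "dt")]))  # True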

  9. Algorithm: 1
  • Start with a huge rule set: one rule per example (i.e., a redundant rule set).
  • Every new rule “compresses” the rule set.
  • Expect many high-precision, low-recall rules, each covering many positive and few negative examples.

  10. PreFiller: <word=‘Subject’> <word=‘:’> … <word=‘com’> <word=‘>’>
  Filler: <word=‘SOFTWARE’> <word=‘PROGRAMMER’>
  PostFiller: <word=‘Position’> <word=‘available’> <word=‘for’> … <word=‘memphisonline’> <word=‘.’> <word=‘com’>
  Plus POS info for each word in each pattern (but not semantic class).

  11. Algorithm: 2. What is FindNewRule?

  12. An “obvious choice”
  • Some sort of search based on this primitive step:
    – pick a PAIR of rules R1, R2
    – form the most specific possible rule that is more general than both R1 and R2
  • But in Rapier’s language there are multiple such generalizations of R1 and R2…

  13. Algorithm: 3. Specialize by adding conditions from pairwise generalizations, starting with the filler and working outward.

  14. Algorithm 4: pairwise generalization of patterns. Heuristic search for the best possible (in some graph sense) generalization of WordNet semantic classes.

  15. Algorithm 4: pairwise generalization of patterns. Explore several ways to generalize differing POS or word values:
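  The point that two differing elements admit several generalizations can be illustrated with a small sketch: for each constraint (word set, POS set), either take the union of the two value sets or drop the constraint entirely. This is a deliberate simplification with my own helper names, not Rapier’s actual search (which also climbs the WordNet hierarchy for semantic classes).

  from itertools import product

  def generalize_elements(e1, e2):
      """e1, e2 are dicts like {'words': {'ate', 'hit'}, 'pos': {'vb'}, 'max_len': 1}."""
      options = {}
      for key in ('words', 'pos'):
          s1, s2 = e1.get(key), e2.get(key)
          if s1 is None or s2 is None:
              options[key] = [None]              # already unconstrained
          else:
              options[key] = [s1 | s2, None]     # union the value sets, or drop the constraint
      for words, pos in product(options['words'], options['pos']):
          yield {'words': words, 'pos': pos, 'max_len': max(e1['max_len'], e2['max_len'])}

  e1 = {'words': {'ate'}, 'pos': {'vbd'}, 'max_len': 1}
  e2 = {'words': {'hit'}, 'pos': {'vbd', 'vbn'}, 'max_len': 1}
  for g in generalize_elements(e1, e2):
      print(g)   # four candidates, from {'ate','hit'} / {'vbd','vbn'} down to fully unconstrained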

  16. Algorithm 4: pairwise generalization of patterns. Patterns of the same length can be generalized element-by-element. For patterns of different length:
  - special cases apply when one list has length zero or one
  - “punt” and generate a single very general pattern if both lists are long
  - if both patterns are moderately short, consider many possible pairings of pattern elements

  17. Algorithm 4: pairwise generalization of patterns. If both patterns are moderately short, consider many possible pairings of pattern elements. For example, pairing up the elements of the patterns ABCBA and ABD in different ways can yield generalizations such as:
  A + <BC>? + B + [AD]
  A + B + <[ABCD]{1,4}>

  18. Algorithm: 5 Evaluation of Rules

  19. Results

  20. Extraction by Sliding Window
  Example document (a CMU UseNet seminar announcement):
  GRAND CHALLENGES FOR MACHINE LEARNING
  Jaime Carbonell, School of Computer Science, Carnegie Mellon University
  3:30 pm, 7500 Wean Hall
  Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

  21. A “Naïve Bayes” Sliding Window Model [Freitag 1997]
  [Diagram: a candidate window over “… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …”, with tokens w_{t-m} … w_{t-1} as the prefix, w_t … w_{t+n} as the contents, and w_{t+n+1} … w_{t+n+m} as the suffix.]
  Estimate Pr(LOCATION | window) using Bayes’ rule. Try all “reasonable” windows (vary length and position). Assume independence of the length, prefix words, suffix words, and content words. Estimate from data quantities like Pr(“Place” in prefix | LOCATION). If Pr(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
  Other examples of sliding-window methods: [Baluja et al. 2000] (decision tree over individual words and their context).
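  A minimal Python sketch of how such a scorer could be assembled, assuming the smoothed per-field estimates (prior, length distribution, per-position word likelihoods) have already been computed from labeled announcements. All function and parameter names here are illustrative, not Freitag’s implementation.

  import math

  def window_log_score(prefix, contents, suffix, p_prefix, p_content, p_suffix, p_len, prior):
      """log Pr(LOCATION) plus log-likelihoods, assuming independence of window length,
      prefix words, content words, and suffix words (unseen words get a small floor)."""
      score = math.log(prior) + math.log(p_len.get(len(contents), 1e-6))
      for w in prefix:
          score += math.log(p_prefix.get(w, 1e-6))
      for w in contents:
          score += math.log(p_content.get(w, 1e-6))
      for w in suffix:
          score += math.log(p_suffix.get(w, 1e-6))
      return score

  def extract(tokens, max_len, context, threshold, **models):
      """Slide windows of every reasonable length and position; keep spans scoring above threshold."""
      spans = []
      for start in range(len(tokens)):
          for length in range(1, max_len + 1):
              end = start + length
              if end > len(tokens):
                  break
              score = window_log_score(tokens[max(0, start - context):start],
                                       tokens[start:end],
                                       tokens[end:end + context], **models)
              if score > threshold:
                  spans.append((start, end, score))
      return spans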

  22. “Naïve Bayes” Sliding Window Results
  Domain: CMU UseNet Seminar Announcements (the example announcement from slide 20).
  Field          F1
  Person Name    30%
  Location       61%
  Start Time     98%

  23. Results: second domain

  24. Discussion questions
  • Is this candidate classification?
  • Is complexity good or bad in a learned hypothesis? In a learning system?
  • What are the tradeoffs in expressive rule languages vs. simple ones?
  • Is RAPIER successful in using long-range information?
  • What other ways are there to get this information?

  25. Wrapper learning [Cohen et al., WWW 2002]

  26. Goal: learn from a human teacher how to extract certain database records from a particular web site.

  27. Learner

  28. Why learning from few examples is important At training time, only four examples are available—but one would like to generalize to future pages as well… Must generalize across time as well as across a single site

  29. Now, some details…

  30. Improving a Page Classifier with Anchor Extraction and Link Analysis. William W. Cohen, NIPS 2002.

  31. Previous work in page classification using links:
  • Exploit hyperlinks (Slattery & Mitchell, 2000; Cohn & Hofmann, 2001; Joachims, 2001): documents pointed to by the same “hub” should have the same class.
  What’s new in this paper:
  • Use the structure of hub pages (as well as the structure of the site graph) to find better “hubs”.
  • Adapt an existing “wrapper learning” system to find structure, on the task of classifying “executive bio pages”.

  32. Intuition: links from this “hub page” are informative… especially these links. [Callouts on a hub-page screenshot.]

  33. Task: train a page classifier, then use it to classify pages on a new, previously-unseen web site as executiveBio or other.
  Question: can index pages for executive biographies be used to improve classification?
  Idea: use the wrapper-learner to learn to extract links to execBio pages, smoothing the “noisy” data produced by the initial page classifier.

  34. Background: “co-training” (Blum & Mitchell, ’98)
  • Suppose examples are of the form (x1, x2, y), where x1 and x2 are independent (given y), each xi is sufficient for classification, and unlabeled examples are cheap.
  • (E.g., x1 = bag of words, x2 = bag of links.)
  • Co-training algorithm:
    1. Use the x1’s (on labeled data D) to train f1(x1) = y.
    2. Use f1 to label additional unlabeled examples U.
    3. Use the x2’s (on the labeled part of U + D) to train f2(x2) = y.
    4. Repeat…
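  A compact sketch of one such round in Python, assuming scikit-learn-style classifiers with fit / predict_proba. The confidence threshold is my own addition (the slide simply labels all of U), and none of these names come from the original paper.

  import numpy as np

  def cotrain_one_round(clf1, clf2, X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, conf=0.95):
      # 1. Train f1 on the x1 view of the labeled data D.
      clf1.fit(X1_lab, y_lab)
      # 2. Use f1 to label unlabeled examples U (here, keeping only confident predictions).
      proba = clf1.predict_proba(X1_unlab)
      keep = proba.max(axis=1) >= conf
      y_new = clf1.classes_[proba.argmax(axis=1)[keep]]
      # 3. Train f2 on the x2 view of D plus the newly labeled examples.
      X2_aug = np.vstack([X2_lab, X2_unlab[keep]])
      y_aug = np.concatenate([y_lab, y_new])
      clf2.fit(X2_aug, y_aug)
      return clf1, clf2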

  35. Simple 1-step co-training for web pages
  Here f1 is a bag-of-words page classifier, and S is a web site containing unlabeled pages.
  • Feature construction: represent a page x in S as the bag of pages that link to x (a “bag of hubs”).
  • Learning: learn f2 from the bag-of-hubs examples, labeled with f1.
  • Labeling: use f2(x) to label pages from S.
  Idea: use one round of co-training to bootstrap the bag-of-words classifier into one that uses site-specific features x2/f2.

  36. Improved 1-step co-training for web pages
  Feature construction (sketched in code below):
  - Label an anchor a in S as positive iff it points to a positive page x (according to f1). Let D = {(x’, a): a is a positive anchor on x’}.
  - Generate many small training sets Di from D, by sliding small windows over D.
  - Let P be the set of all “structures” found by any builder from any subset Di.
  - Say that p links to x if p extracts an anchor that points to x. Represent a page x as the bag of structures in P that link to x.
  Learning and labeling: as before.
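  A hedged sketch of that feature construction in Python, with the builders abstracted behind a hypothetical interface (build(examples) returns an extractor, and an extractor maps a page to the anchors it extracts). This is my own illustration, not the wrapper-learning system’s actual code; pages are assumed to have a url attribute and anchors a target attribute.

  def bag_of_structures(pages, positive_anchors, builders, window=5):
      """positive_anchors: the list D of (page, anchor) pairs; builders: list of build functions."""
      # Generate many small training sets Di by sliding a small window over D.
      subsets = [positive_anchors[i:i + window]
                 for i in range(max(1, len(positive_anchors) - window + 1))]
      # P = all structures (extractors) proposed by any builder from any subset Di.
      structures = [build(Di) for build in builders for Di in subsets]
      # For each structure p, the pages it "links to": targets of anchors it extracts on the site.
      linked = [{a.target for page in pages for a in p(page)} for p in structures]
      # Represent each page x as the bag of structures in P that link to x.
      return {x.url: {j for j, targets in enumerate(linked) if x.url in targets}
              for x in pages}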

  37. builder extractor List1

  38. builder extractor List2

  39. builder extractor List3

  40. BOH (bag-of-hubs) representation fed to the learner:
  { List1, List3, … }, PR
  { List1, List2, List3, … }, PR
  { List2, List3, … }, Other
  { List2, List3, … }, PR
  …

  41. Experimental results [chart; callouts mark sites with no improvement and sites where co-training hurts]

  42. Summary
  - “Builders” (from a wrapper-learning system) let one discover and use the structure of web sites and index pages to smooth page-classification results.
  - Discovering good “hub structures” makes it possible to use 1-step co-training on small (50-200 example) unlabeled datasets.
    – The average error rate was reduced from 8.4% to 3.6%.
    – The difference is statistically significant with a 2-tailed paired sign test or t-test.
    – EM with probabilistic learners also works; see (Blei et al., UAI 2002).
