
Automatic Wrappers for Large Scale Web Extraction


Presentation Transcript


  1. Automatic Wrappers for Large Scale Web Extraction Nilesh Dalvi (Yahoo!), Ravi Kumar (Yahoo!), Mohamed Soliman (EMC). VLDB 2011, Seattle, USA

  2. Task: Learn rules to extract information (e.g. Directors) from structurally similar pages.

  3. [Figure: DOM tree of a movie page: html → body → divs (class='head', class='content') → a table (width=80%) whose cells hold Title : Godfather, Director : Coppola, Runtime 118min] • We can use the following XPath rule to extract directors: W1 = /html/body/div[2]/table/td[2]/text()
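To make the rule concrete, here is a minimal sketch (the sample HTML is hypothetical, not from the paper) of applying such an XPath with lxml; the slide's rule skips the tr level, which the sketch writes out explicitly:

```python
# Hypothetical miniature of the movie page from the slide.
from lxml import html

page = html.fromstring("""
<html><body>
  <div class="head">site navigation ...</div>
  <div class="content">
    <table width="80%">
      <tr><td>Title : Godfather</td><td>Coppola</td><td>Runtime 118min</td></tr>
    </table>
  </div>
</body></html>""")

# W1 with the implied tr level: second cell of the table in the second div.
print(page.xpath("/html/body/div[2]/table/tr/td[2]/text()"))  # ['Coppola']
```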

  4. Wrappers • Can be learned with a small amount of supervision. • Very effective for site-level extraction. • Have been extensively studied in the literature.

  5. In This Work: • Objective: learn wrappers without site-level supervision.

  6. [figure slide]

  7. Idea • Obtain training data cheaply using dictionaries or automatic labelers. • Make wrapper induction tolerant to noise.

  8. [figure slide]

  9. Summary of Approach • A generic framework that can incorporate any wrapper inductor satisfying a few plausible properties. • Input: a wrapper inductor Φ and a set of labels L. • Idea: apply Φ on all subsets of L and choose the wrapper that gives the best list.
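As a baseline, here is a brute-force sketch of this idea (the `phi` and `score` interfaces are hypothetical stand-ins, not the paper's API); it is exponential in |L|, which is exactly the blow-up the enumeration algorithms on the following slides avoid:

```python
from itertools import combinations

def best_wrapper(phi, score, labels):
    # Apply the inductor phi to every non-empty subset of the (noisy)
    # labels and keep the wrapper whose extracted list scores best.
    best, best_score = None, float("-inf")
    for k in range(1, len(labels) + 1):
        for subset in combinations(labels, k):
            w = phi(frozenset(subset))    # learn a wrapper from this subset
            s = score(w, labels)          # rank by the quality of w's output
            if s > best_score:
                best, best_score = w, s
    return best
```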

  10. Summary of Approach • Two main problems: • Wrapper Enumeration: how to generate the space of all possible wrappers efficiently? • Wrapper Ranking: how to rank the enumerated wrappers by quality?

  11. Example: TABLE wrapper system • Works on a table. • Generates wrappers from the following space: a single cell, a row, a column, or the entire table.

  12. Example: TABLE wrapper system • L = {n1, n2, n4, a4, z5} • 32 possible subsets • 8 unique wrappers: {n1, n2, n4, a4, z5, C1, R4, T}
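A minimal sketch of such an inductor, under the assumption (hypothetical encoding, not the paper's code) that each label is a (row, col) cell coordinate:

```python
def table_inductor(labels):
    # Return the smallest wrapper in {cell, row, column, table} that
    # covers all labeled cells, encoded as small tuples.
    rows = {r for r, _ in labels}
    cols = {c for _, c in labels}
    if len(rows) == 1 and len(cols) == 1:
        return ("cell",) + next(iter(labels))   # a single labeled cell
    if len(rows) == 1:
        return ("row", rows.pop())
    if len(cols) == 1:
        return ("col", cols.pop())
    return ("table",)
```

For example, if n1 and n2 sit in column 1, table_inductor({(1, 1), (2, 1)}) returns ('col', 1), i.e. the wrapper C1 above.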

  13. Wrapper Enumeration Problem • Input: a wrapper inductor Φ and a set of labels L • The wrapper space of L is defined as W(L) = {Φ(S) | S ⊆ L} • Problem: enumerate the wrapper space of L in time polynomial in the size of the wrapper space and L.

  14. Wrapper Inductors • TABLE: the wrapper inductor defined before. • XPATH: learn the minimal XPath rule, in a simple fragment of XPath, that covers all the training examples. • LR: find the maximal pair of strings preceding and following all the training examples; the wrapper outputs all strings delimited by that pair.
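A sketch of the LR idea on a single page string (assuming, for brevity, that each training example occurs exactly once in the page):

```python
import os

def lr_inductor(page, examples):
    # Longest common left/right contexts of all example occurrences.
    lefts, rights = [], []
    for ex in examples:
        i = page.index(ex)
        lefts.append(page[:i])
        rights.append(page[i + len(ex):])
    left = os.path.commonprefix([s[::-1] for s in lefts])[::-1]  # common suffix
    right = os.path.commonprefix(rights)                         # common prefix
    return left, right

# e.g. lr_inductor("<b>Coppola</b> and <b>Scorsese</b>",
#                  ["Coppola", "Scorsese"])  ->  ('<b>', '</b>')
```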

  15. Well-behaved Inductor • A wrapper inductor Φ is well-behaved if it has the following properties: • [Fidelity] L ⊆ Φ(L) • [Closure] l ∈ Φ(L) ⇒ Φ(L) = Φ(L ∪ {l}) • [Monotonicity] L1 ⊆ L2 ⇒ Φ(L1) ⊆ Φ(L2) • Theorem: TABLE, LR, and XPATH are well-behaved wrapper inductors.
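These properties can be spot-checked by brute force on a small label set; a test-only sketch, assuming a hypothetical `phi`/`extract` interface where `extract(w)` returns the node set a wrapper selects:

```python
from itertools import combinations

def check_well_behaved(phi, extract, labels):
    subsets = [frozenset(c) for k in range(1, len(labels) + 1)
               for c in combinations(labels, k)]
    for L in subsets:
        out = extract(phi(L))
        assert L <= out                                      # fidelity
        for l in out:
            assert extract(phi(L | {l})) == out              # closure
    for L1 in subsets:
        for L2 in subsets:
            if L1 <= L2:
                assert extract(phi(L1)) <= extract(phi(L2))  # monotonicity
    return True
```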

  16. Bottom-up Algorithm • Start with the singleton labels in L as candidate label sets • Learn wrappers by feeding candidate label sets to Φ • Incrementally apply one-label extensions to each candidate • Extend the candidates with the closures of the wrappers learned by Φ • Theorem: the bottom-up algorithm is sound and complete • Theorem: the bottom-up algorithm makes at most k·|L| calls to the inductor, where k is the size of the wrapper space.

  17. Can we do better? • A wrapper inductor is a feature-based inductor if: • every label is associated with a set of features ((attribute, value) pairs) • Φ(L) = the intersection of all the features of L • the output of a wrapper w = the text nodes satisfying all the features of w • E.g., TABLE can be expressed as a feature-based inductor with two features, row and col. • Both LR and XPATH can be expressed as feature-based inductors.
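A sketch of this construction (hypothetical interface: `features_of` maps a node to its frozenset of (attribute, value) pairs):

```python
def feature_based_inductor(features_of, all_nodes):
    # Phi(L) keeps the features shared by every label; the wrapper
    # then selects each node carrying all of those features.
    def phi(labels):
        shared = frozenset.intersection(*(features_of(n) for n in labels))
        extracted = frozenset(n for n in all_nodes
                              if shared <= features_of(n))
        return shared, extracted          # the wrapper and its output
    return phi
```

For TABLE, features_of(cell) would be {('row', r), ('col', c)}: a single cell keeps both features, a row or column keeps one, and the whole table keeps none.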

  18. Top-down Algorithm • We give a top-down algorithm for feature-based inductors that makes exactly k calls to the inductor, where k is the size of the wrapper space.
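A sketch of a top-down enumeration in this spirit (hypothetical feature interface; the paper's exact algorithm may differ in details): start from Φ(L) and repeatedly specialize by grouping the labels that agree on one more feature:

```python
from collections import defaultdict

def top_down(features_of, labels):
    wrappers, seen = [], set()

    def visit(label_set):
        # The wrapper for this set is the feature intersection of its labels.
        shared = frozenset.intersection(*(features_of(l) for l in label_set))
        if shared in seen:                # each wrapper is expanded only once
            return
        seen.add(shared)
        wrappers.append(shared)
        groups = defaultdict(set)
        for l in label_set:
            for f in features_of(l) - shared:
                groups[f].add(l)          # labels agreeing on one more feature
        for subset in groups.values():
            visit(subset)

    visit(set(labels))
    return wrappers
```

On the TABLE example of slide 12 this walks exactly the tree of slide 42: T at the root, then C1 and R4, then the five labeled cells.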

  19. Wrapper Ranking Problem • Given a set of wrappers, we want to output the one that gives the "best" list. • Let X be the list extracted by a wrapper w • Choose the wrapper that maximizes P[X | L] or, equivalently (by Bayes' rule), P[L | X] · P[X]

  20. Example: Extracting names from business listings • Let us rank the following three lists as candidates for the set of names: • X1 = first column • X2 = entire table • X3 = first two columns

  21. Example: Extracting names from business listings • X1 = first column • P[L | X1]: 2 wrong labels, 3 correct labels • P[X1]: nice repeating structure, schema size = 4

  22. Example: Extracting names from business listings • X2 = entire table • P[L | X2]: 0 wrong labels, 5 correct labels • P[X2]: nice repeating structure, schema size = 1

  23. Example: Extracting names from business listings • X3 = first two columns • P[L | X3]: 1 wrong label, 4 correct labels • P[X3]: poor repeating structure, schema size = 1 or 3

  24. Ranking Model • P[L | X] • Assume a simple annotator with precision p and recall r that independently labels each node. • Each node in X is added to L with probability r • Each node not in X is added to L with probability 1 - p
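A sketch of this likelihood in log space (hypothetical node sets; assumes 0 < p < 1 and 0 < r < 1):

```python
import math

def log_p_l_given_x(X, L, all_nodes, p, r):
    # Independent annotator: a node in X enters L with probability r,
    # a node outside X enters L with probability 1 - p.
    logp = 0.0
    for n in all_nodes:
        if n in X:
            logp += math.log(r if n in L else 1.0 - r)
        else:
            logp += math.log(1.0 - p if n in L else p)
    return logp
```

The ranker then picks the X that maximizes this quantity plus log P[X] from the next slide.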

  25. Ranking Model • P[X] • Define features of the grammar that describes X, e.g. schema size and repeating structure • Learn distributions over the feature values, or take them as input as part of domain knowledge.

  26. Experiments • Datasets: • DEALERS: used automatic form-filling techniques to obtain dealer listings from 300 store-locator pages • DISCOGRAPHY: crawled 14 music websites that contain track listings of albums • Task: automatically learn wrappers to extract business names/track titles for each website.

  27. [figure slide]

  28. [figure slide]

  29. Summary • A new framework for noise-tolerant wrapper induction • Two efficient wrapper enumeration algorithms • A probabilistic wrapper ranking model • Web-scale information extraction • No site-level supervision ⇒ no manual labeling • Tolerates noise in automatic labeling

  30. [figure slide]

  31. Bottom-up Algorithm • INPUT: Φ, L • Z = all singleton subsets of L • W = Z • while Z is not empty: remove the smallest set S from Z; for each possible single-label expansion S' of S: add Φ(S') to W and add (Φ(S') ∩ L) back to Z
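A runnable sketch of this pseudocode (hypothetical `phi`/`extract` interface; a `seen` guard keeps already-generated closures from re-entering Z):

```python
def bottom_up(phi, extract, labels):
    L = frozenset(labels)
    Z = {frozenset([l]) for l in L}        # singleton candidate label sets
    seen = set(Z)
    W = {phi(S) for S in Z}                # wrappers must be hashable here
    while Z:
        S = min(Z, key=len)                # remove the smallest set S from Z
        Z.remove(S)
        for l in L - S:                    # each one-label expansion S' of S
            w = phi(S | {l})
            W.add(w)                       # add Phi(S') to W
            closure = frozenset(extract(w)) & L
            if closure not in seen:        # add Phi(S') n L back to Z
                seen.add(closure)
                Z.add(closure)
    return W
```

With the TABLE inductor of slide 11 this reproduces the trace on slides 32 to 41.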

  32. Bottom-up Algorithm • Z = {n1, n2, n4, a4, z5} [figure: the five labeled nodes as singleton candidates]

  33. Bottom-up Algorithm • Z = {n2, n4, a4, z5, {n1, n2, n4}} [figure: expanding n1 yields the column wrapper C1; its closure {n1, n2, n4} joins Z]

  34. Bottom-up Algorithm • Z = {n2, n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}} [figure: the table wrapper T is learned; its closure {n1, n2, n4, a4, z5} joins Z]

  35. Bottom-up Algorithm • Z = {n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}} [figure: n2 is processed without new wrappers]

  36. Bottom-up Algorithm • Z = {a4, z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}} [figure: expanding n4 yields the row wrapper R4; its closure {n4, a4} joins Z]

  37. Bottom-up Algorithm • Z = {z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}

  38. Bottom-up Algorithm • Z = {{n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}

  39. Bottom-up Algorithm • Z = {{n1, n2, n4}, {n1, n2, n4, a4, z5}}

  40. Bottom-up Algorithm • Z = {{n1, n2, n4, a4, z5}}

  41. Bottom-up Algorithm • Z = {} [figure: the final space contains the eight wrappers n1, n2, n4, a4, z5, C1, R4, T]

  42. Top-down Algorithm [figure: feature-split tree over the labels. Root {n1, n2, n4, a4, z5}; a column split gives {n1, n2, n4} and a row split gives {n4, a4}, alongside a4 and z5; a further row split of {n1, n2, n4} reaches n1, n2, n4]

  43. Wrapper Ranking [figure: the set H of all page nodes, with the wrapper output X and the label set L overlapping: labeled nodes in X, labeled nodes outside X, non-labeled nodes in X, non-labeled nodes outside X] • argmaxX P(L|X) P(X)? • The possible values of X are the possible wrappers computed by Φ • P(L|X): the probability of observing L given that X is the right wrapper • The annotator has precision p and recall r (estimated from tested labelings) • Independent annotation process: nodes are labeled independently; each node in X is added to L with probability r, and each node not in X is added to L with probability 1 - p
