Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model

Nilesh Dalvi, Philip Bohannon, Fei Sha • Presented by Vinay Rambhia

Introduction • Script-generated websites have an HTML tree structure • Wrappers are used to extract information from them • Example: an XPath expression that extracts the director information: w1 = /html/body/div[2]/table/td[2]/text() • The same wrapper works for similarly structured pages

Introduction • As pages evolve, wrappers break, so they need a lot of maintenance • Other wrappers for the same data: w2 = //div[@class='content']/*/td[2]/text() • w3 = //table[@width='80%']/td[2]/text() • w4 = //text()[psib::*[1][text()='director']] (psib abbreviates the preceding-sibling axis)
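To make the breakage concrete, here is a minimal sketch comparing a positional wrapper (like w1) with an attribute-anchored one (like w2) on a page before and after a small layout change. The page markup, the name "Jane Doe", and the inserted banner are all hypothetical. Python's standard `xml.etree.ElementTree` supports only a subset of XPath, so paths are written relative to the `<html>` root and `text()` is replaced by reading each node's `.text`.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, before and after a redesign that adds a banner <div>.
PAGE_V1 = (
    "<html><body>"
    "<div class='header'>Movies</div>"
    "<div class='content'><table width='80%'>"
    "<td>director</td><td>Jane Doe</td>"
    "</table></div>"
    "</body></html>"
)
PAGE_V2 = PAGE_V1.replace("<body>", "<body><div class='banner'>Ad</div>")

def extract(page, path):
    """Return the text of every element matched by `path` under <html>."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(path)]

# ElementTree-compatible analogues of w1 and w2 (assumed translations).
W1 = "./body/div[2]/table/td[2]"          # purely positional
W2 = ".//div[@class='content']/*/td[2]"   # anchored on an attribute

for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, "w1:", extract(page, W1), "w2:", extract(page, W2))
```

On the changed page the positional wrapper selects the wrong `div` and returns nothing, while the attribute-anchored wrapper still finds the director cell. A different change (for example, renaming the class) would break w2 instead, which is why robustness has to be judged against a model of likely changes.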

Introduction • This paper discusses how to: • use temporal snapshots of web pages to develop a probabilistic tree-edit model • use this model to improve wrapper construction • The model's probabilities can be estimated efficiently, in time quadratic in the size of the tree • When applied to IMDb, the approach was 86% robust, whereas traditional wrappers were 40% robust

Robust Extraction Framework

Change Model • The change model is defined in terms of a conditional transducer process $\pi$ • When a forest T is given to the process $\pi$, it is converted into a forest S • The process $\pi$ is composed of two subprocesses, $\pi_{ins}$ and $\pi_{ds}$

Change Model • To summarize, the generative process $\pi$ is characterized by the parameters $\theta = (p_{stop}, \{p_{del}(l)\}, \{p_{ins}(l)\}, \{p_{sub}(l_1, l_2)\})$ for $l, l_1, l_2 \in \Sigma$, along with the following conditions: • $0 < p_{stop} < 1$ • $0 \le p_{del}(l) \le 1$ • $p_{ins}(l) \ge 0$, $\sum_{l} p_{ins}(l) = 1$ • $p_{sub}(l_1, l_2) \ge 0$, $\sum_{l_2} p_{sub}(l_1, l_2) = 1$ ... Eq (A)

Model Learner • The archival data contains (S, T) pairs, where S is an old version of a page and T is a newer version • The model is specified by a set of parameters $\theta$ • We want to find $\theta^* = \arg\max_{\theta} \prod_{(S,T) \in \text{ArchivalData}} P_\theta(T \mid S)$ • $P_\theta(T \mid S)$ is the transformation probability, i.e. the probability that the model transforms S into T

Computing Transformation Probabilities • The transducer $\pi$ performs a sequence of edit operations (insertions, deletions, and substitutions) to transform a tree S into another tree T • Because many different edit sequences can produce the same T, dynamic programming is used to sum their probabilities

Computing Transformation Probabilities • Let $DP_1(F_s, F_t)$ denote the probability that $\pi(F_s) = F_t$, due to $\pi_{ins}$ and $\pi_{sub}$ • There are two cases for the node v of $F_t$: • The node v was the result of an insertion by the $\pi_{ins}$ operator. Let p be the probability that $\pi_{ins}$ inserts the node v into $F_t - v$ to form $F_t$. Then the probability of this case is $DP_1(F_s, F_t - v) \cdot p$ • The node v was the result of a substitution. The probability of this case is $DP_2(F_s, F_t)$ • Hence we have $DP_1(F_s, F_t) = DP_2(F_s, F_t) + p \cdot DP_1(F_s, F_t - v)$ ... Eq (1)

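Computing Transformation Probabilities • Let $DP_2(F_s, F_t)$ denote the probability that $\pi(F_s) = F_t$ with v produced by $\pi_{sub}$ • There are two cases: • v was substituted for u. In this case, $F_s - [u]$ must transform to $F_t - [v]$ and $\lfloor u \rfloor$ must transform to $\lfloor v \rfloor$. Writing $p_1 = p_{sub}(label(u), label(v))$, the total probability of this case is $p_1 \cdot DP_1(F_s - [u], F_t - [v]) \cdot DP_1(\lfloor u \rfloor, \lfloor v \rfloor)$ • v was substituted for some node other than u. Then u itself was deleted, which has probability $p_2 = p_{del}(label(u))$, and the probability of this case is $p_2 \cdot DP_2(F_s - u, F_t)$ • Hence we have $DP_2(F_s, F_t) = p_1 \cdot DP_1(F_s - [u], F_t - [v]) \cdot DP_1(\lfloor u \rfloor, \lfloor v \rfloor) + p_2 \cdot DP_2(F_s - u, F_t)$ ... Eq (2)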

Computing Transformation Probabilities

Computing Transformation Probabilities • Let $T_1$ be the tree with the nodes a and b (a is the root and b is its child) • Let $T_2$ be the tree with the single node c • We compute the probability that $\pi(T_1) = T_2$, which is denoted by $DP_1(T_1, T_2)$. Applying Eq (1) we get $DP_1(T_1, T_2) = DP_2(T_1, T_2) + p_{ins}(c) \cdot DP_1(T_1, \emptyset)$ • Let $T_3$ denote the tree with the single node b. Then, by Eq (2), $DP_2(T_1, T_2) = p_{sub}(a, c) \cdot DP_1(\emptyset, \emptyset) \cdot DP_1(T_3, \emptyset) + p_{del}(a) \cdot DP_2(T_3, T_2)$ • To compute $DP_2(T_3, T_2)$, we apply Eq (2) again: $DP_2(T_3, T_2) = p_{sub}(b, c) \cdot DP_1(\emptyset, \emptyset) \cdot DP_1(\emptyset, \emptyset) + p_{del}(b) \cdot DP_2(\emptyset, T_2)$ • Total probability: $DP_1(T_1, T_2) = p_{sub}(a, c) \cdot p_{del}(b) \cdot p_{stop}^2 + p_{sub}(b, c) \cdot p_{del}(a) \cdot p_{stop}^2 + p_{del}(a) \cdot p_{del}(b) \cdot p_{ins}(c) \cdot p_{stop}$
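The recursion of Eq (1) and Eq (2) can be sketched in Python as below. This is a simplified illustration, not the paper's implementation. The following are assumptions chosen so that the code reproduces the worked example above: u and v are taken to be the roots of the rightmost trees, F − u removes u and promotes its children, F − [u] removes the whole subtree, ⌊u⌋ is the forest of u's children, p in Eq (1) is p_ins(label(v)), and the base cases are DP1(F, ∅) = p_stop times the product of p_del over all nodes of F, and DP2(∅, F) = 0. The numeric parameter values are arbitrary.

```python
from functools import lru_cache

# A tree is (label, children); a forest is a tuple of trees, left to right.

def make_dp1(p_stop, p_del, p_ins, p_sub):
    """Build DP1 for the given (hypothetical) parameter values."""

    def all_deleted(forest):
        # Probability that every node of the forest is deleted.
        prob = 1.0
        for label, children in forest:
            prob *= p_del.get(label, 0.0) * all_deleted(children)
        return prob

    @lru_cache(maxsize=None)
    def dp1(fs, ft):
        if not ft:
            # Assumed base case: delete all of fs, then stop inserting.
            return p_stop * all_deleted(fs)
        v_label, v_children = ft[-1]
        ft_minus_v = ft[:-1] + v_children  # remove v, promote its children
        # Eq (1): v came from a substitution, or v was inserted.
        return dp2(fs, ft) + p_ins.get(v_label, 0.0) * dp1(fs, ft_minus_v)

    @lru_cache(maxsize=None)
    def dp2(fs, ft):
        if not fs:
            return 0.0  # nothing left in fs to substitute v for
        u_label, u_children = fs[-1]
        v_label, v_children = ft[-1]
        p1 = p_sub.get((u_label, v_label), 0.0)
        p2 = p_del.get(u_label, 0.0)
        # Eq (2): v replaces u, or u is deleted and v replaces another node.
        substituted = p1 * dp1(fs[:-1], ft[:-1]) * dp1(u_children, v_children)
        deleted = p2 * dp2(fs[:-1] + u_children, ft)
        return substituted + deleted

    return dp1

# Only the entries this example touches are listed; the rest of each
# distribution is assumed to sit on labels that do not occur here.
P_STOP = 0.5
P_DEL = {"a": 0.2, "b": 0.3}
P_INS = {"c": 0.4}
P_SUB = {("a", "c"): 0.6, ("b", "c"): 0.7}

T1 = (("a", (("b", ()),)),)  # root a with child b
T2 = (("c", ()),)            # single node c

dp1 = make_dp1(P_STOP, P_DEL, P_INS, P_SUB)
prob = dp1(T1, T2)
print(prob)
```

With these values the result agrees with the closed-form total from the slide: 0.6·0.3·0.5² + 0.7·0.2·0.5² + 0.2·0.3·0.4·0.5.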

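Parameter Estimation • $\theta^* = \arg\max_{\theta} \sum_{n=1}^{N} \log P_\theta(T_n \mid S_n)$ • It is difficult to calculate $\theta^*$ directly, so we compute it by gradient ascent: $\theta_{t+1} = \theta_t + \eta\, g(\theta_t)$ ... Eq (3), where $g(\theta) = \partial \log \ell(\theta) / \partial \theta = \sum_{n=1}^{N} \partial \log P_\theta(T_n \mid S_n) / \partial \theta$ • $\theta$ has to satisfy Eq (A) • So we use the variable reparameterization $\theta_{ij} = e^{\alpha_{ij}} / \sum_{k=1}^{N} e^{\alpha_{ik}}$ • Eq (3) becomes $\alpha_{t+1} = \alpha_t + \eta\, g(\alpha_t)$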
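The effect of the reparameterization can be seen in a small sketch: gradient ascent runs on unconstrained values α, and θ = softmax(α) is always non-negative and sums to 1, so the constraints of Eq (A) never need to be enforced explicitly. The objective below is a stand-in (the log-likelihood of a single categorical distribution given hypothetical counts), not the tree-edit likelihood. The step size and iteration count are arbitrary choices.

```python
import math

def softmax(alpha):
    """Map unconstrained alpha to a probability vector theta."""
    m = max(alpha)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]

def fit(counts, eta=0.1, steps=500):
    """Maximize sum_j counts[j] * log(theta_j) by gradient ascent on alpha."""
    alpha = [0.0] * len(counts)
    n = sum(counts)
    for _ in range(steps):
        theta = softmax(alpha)
        # d/d alpha_j of sum_k c_k log theta_k  =  c_j - n * theta_j
        grad = [c - n * t for c, t in zip(counts, theta)]
        alpha = [a + eta * g for a, g in zip(alpha, grad)]
    return softmax(alpha)

theta = fit([6, 3, 1])  # hypothetical observation counts
print(theta)
```

For this toy objective the maximizer is known in closed form (the normalized counts), which makes it easy to check that the ascent on α converges to a valid distribution.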

Generating Candidate Wrappers • We use a bottom-up algorithm that starts from a general XPath and specializes it until it matches only the target node • Example: w0 = //table/*/td/text() can be specialized to //table/tr/td/text(), //table[@bgcolor='red']/*/td/text(), or //table/*/td[2]/text() • The algorithm maintains a set P of partial wrappers, which have recall = 1 and precision < 1 • The algorithm applies specialization steps to the XPaths in P to obtain new XPaths, until precision becomes 1
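The specialization loop can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the paper's algorithm: the page and target are hypothetical, the set of allowed specialization steps is a fixed hand-written list (`REFINEMENTS`), and wrappers are evaluated with Python's `xml.etree.ElementTree`, which supports only a subset of XPath (so `text()` is dropped and the matched elements' text is compared instead).

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two similar tables; only one cell is the target.
PAGE = ET.fromstring(
    "<html><body>"
    "<table bgcolor='red'><tr><td>Director</td><td>Jane Doe</td></tr></table>"
    "<table bgcolor='blue'><tr><td>Year</td><td>1999</td></tr></table>"
    "</body></html>"
)
TARGET = "Jane Doe"

def precision_recall(steps):
    """Precision and recall of the wrapper .//step1/step2/... on PAGE."""
    matched = [e.text for e in PAGE.findall(".//" + "/".join(steps))]
    hits = matched.count(TARGET)
    precision = hits / len(matched) if matched else 0.0
    return precision, float(hits)  # there is exactly one target node

# Assumed specialization steps: (position in the path, more specific step).
REFINEMENTS = [(0, "table[@bgcolor='red']"), (1, "tr"), (2, "td[2]")]

def specialize(w0):
    """Specialize w0 until precision reaches 1, keeping recall at 1."""
    partial = [tuple(w0)]   # the set P: recall = 1, precision < 1
    complete = set()        # wrappers with precision = 1
    seen = {tuple(w0)}
    while partial:
        steps = partial.pop()
        for i, new_step in REFINEMENTS:
            cand = steps[:i] + (new_step,) + steps[i + 1:]
            if cand in seen:
                continue
            seen.add(cand)
            precision, recall = precision_recall(cand)
            if recall < 1:
                continue            # lost the target: discard
            if precision == 1:
                complete.add(cand)  # matches only the target: done
            else:
                partial.append(cand)
    return complete

wrappers = specialize(["table", "*", "td"])
for w in sorted(wrappers):
    print(".//" + "/".join(w))
```

On this page no single specialization step is enough: each one-step candidate still matches a second cell, so it stays in P and is specialized again. Wrappers that already have precision 1 are not specialized further.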

Generating Candidate Wrappers

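Evaluating Robustness of Wrappers • $\mathrm{Rob}_{X,\theta}(\phi) = \sum_{Y:\,(X,Y) \models \phi} P_\theta(Y \mid X)$, i.e. the total probability of the future versions Y of page X on which the wrapper $\phi$ still works • Algorithm for calculating robustness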
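As a toy illustration of the robustness sum, the sketch below replaces the tree-edit model with a hand-written distribution over three possible next versions of a page and adds up the probability of those versions on which a wrapper still returns the target. The pages, the probabilities, and the target value are hypothetical. In the actual model the sum ranges over all trees and is not computed by enumeration.

```python
import xml.etree.ElementTree as ET

BASE = (
    "<html><body><div class='header'>Movies</div>"
    "<div class='content'><table>"
    "<td>director</td><td>Jane Doe</td>"
    "</table></div></body></html>"
)

# Hypothetical stand-in for P(Y | X): (next version Y, probability).
SUCCESSORS = [
    (BASE, 0.5),                                                          # unchanged
    (BASE.replace("<body>", "<body><div class='banner'>Ad</div>"), 0.3),  # banner added
    (BASE.replace("class='content'", "class='main'"), 0.2),               # class renamed
]

def still_works(page, path, target="Jane Doe"):
    """True if the wrapper extracts exactly the target from the page."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(path)] == [target]

def robustness(path):
    """Sum P(Y | X) over the versions Y on which the wrapper still works."""
    return sum(prob for page, prob in SUCCESSORS if still_works(page, path))

W1 = "./body/div[2]/table/td[2]"          # positional wrapper
W2 = ".//div[@class='content']/*/td[2]"   # attribute-anchored wrapper
print(robustness(W1), robustness(W2))
```

Under this made-up distribution the positional wrapper survives the class rename but not the banner, and the attribute-anchored wrapper the reverse, so their scores differ only because the two changes were given different probabilities.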

Experimental Evaluation: Change Model

Experimental Evaluation: Generating Robust Wrappers

Experimental Evaluation: Evaluation of Model Learner

Thank you. Any questions?
