Web Data Extraction Based On Partial Tree Alignment

Web Data Extraction Based On Partial Tree Alignment. To appear in WWW 2005, May 10-14, Chiba, Japan. Authors: Yanhong Zhai, Bing Liu. Presenters: Waseem Ahmad, Reem Jaghlit

Introduction • A large amount of information on the Web is contained in regularly structured data objects, which are often data records retrieved from a backend database and converted to HTML. • Such Web data records are important because they often present the essential information of their host pages, e.g., lists of products and services. • Mining such data records and extracting data from them enables value-added services, e.g., comparative shopping, meta-search, and meta-query.

Examples

Wrapper generation A wrapper is a program that extracts data from a Web site and puts it in a database. Two approaches: • Wrapper induction • Automatic extraction

Wrapper induction • Learns data extraction rules from a set of manually labeled positive and negative examples. • Labor intensive and time consuming. • Needs to be repeated, since pages of the same site may follow different patterns.

Automatic extraction

MDR • Identifies data records only, by making use of the HTML tag tree of the Web page. • Has 2 shortcomings: 1. Erroneous tags produce incorrect trees. 2. Noise may cause wrong combinations of sub-trees if a single data record is composed of multiple sub-trees.

MDR (continued) MDR relies on two observations: • A data record region is a group of similar data records. These data records lie in a contiguous region and are formatted using almost the same sequence of HTML tags.

MDR (continued) • A set of similar data records is formed by some child sub-trees of the same parent node. It is unlikely that a data record starts in the middle of a child sub-tree and ends in the middle of another child sub-tree.

MDR (the 3-step algorithm) • Build an HTML tag tree of the page. • Mine data regions in the page using the tag tree. • Identify data records from each data region. The main enhancement to the MDR algorithm is the use of visual information to build robust trees and find accurate data regions.

MDR-2: 1. Building an HTML Tag Tree • Call the embedded parsing and rendering engine of a browser to find the 4 boundaries of the rectangle of each HTML element. • Detect containment relationships among the rectangles.
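As a rough illustration of the containment step, the sketch below nests elements by rectangle containment. It assumes the rendering engine has already supplied each element's rectangle as (left, top, right, bottom), listed in document order. The names `Box`, `contains`, and `build_tag_tree` are hypothetical and are not taken from the MDR-2 implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class Box:
    tag: str
    rect: Rect
    children: List["Box"] = field(default_factory=list)


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def build_tag_tree(elements: List[Tuple[str, Rect]]) -> Box:
    """Nest elements (given in document order) by rectangle containment."""
    inf = float("inf")
    root = Box("root", (-inf, -inf, inf, inf))  # artificial root contains everything
    stack = [root]
    for tag, rect in elements:
        # Climb up until the element on top of the stack encloses this rectangle.
        while not contains(stack[-1].rect, rect):
            stack.pop()
        node = Box(tag, rect)
        stack[-1].children.append(node)
        stack.append(node)
    return root


elements = [
    ("table", (0, 0, 100, 100)),
    ("tr", (0, 0, 100, 50)),
    ("td", (0, 0, 50, 50)),
    ("td", (50, 0, 100, 50)),
    ("tr", (0, 50, 100, 100)),
]
tree = build_tag_tree(elements)
```

Because nesting is decided from the rendered rectangles rather than from open and close tags, a missing closing tag in the HTML source does not by itself distort the tree in this sketch.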

2. Mining Data Regions We can find each data region by comparing the tag strings of individual nodes (including their descendants) and of combinations of multiple adjacent nodes. A generalized node denotes each similar individual node or node combination.
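The comparison of tag strings can be sketched as follows. This is a simplified illustration, not the MDR procedure itself: it assumes similarity is measured with an edit distance normalized by the longer string, uses an arbitrary threshold of 0.3, and considers only generalized nodes made of a single node (combinations of adjacent nodes are left out). The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def find_regions(children: List[Sequence[str]],
                 threshold: float = 0.3) -> List[Tuple[int, int]]:
    """Return (first, last) index ranges of adjacent children with similar tag strings."""
    regions = []
    start = 0
    for k in range(1, len(children) + 1):
        boundary = (k == len(children)
                    or normalized_distance(children[k - 1], children[k]) > threshold)
        if boundary:
            if k - start >= 2:  # a region needs at least two similar nodes
                regions.append((start, k - 1))
            start = k
    return regions


nav = ["div", "a", "a"]
rec1 = ["tr", "td", "img", "td", "a"]
rec2 = ["tr", "td", "img", "td", "a", "b"]
rec3 = ["tr", "td", "img", "td", "a"]
footer = ["p"]
regions = find_regions([nav, rec1, rec2, rec3, footer])
```

Here each child of a parent node is represented by the tag string of its sub-tree, and the three record-like children form one candidate region.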

Visual enhancement for mining data regions The gap between 2 data records in a data region should be no smaller than any gap within a data record.
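The gap rule can be checked with a few lines of code. The sketch below is only an illustration under stated assumptions: each data record is given as a list of vertical (top, bottom) spans of its rendered parts, ordered from top to bottom, and the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (top, bottom) of a rendered element


def vertical_gaps(spans: List[Span]) -> List[float]:
    """Vertical gaps between consecutive spans."""
    return [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]


def segmentation_is_plausible(records: List[List[Span]]) -> bool:
    """True if no gap between records is smaller than a gap inside a record."""
    within = [g for rec in records for g in vertical_gaps(rec)]
    outlines = [(rec[0][0], rec[-1][1]) for rec in records]
    between = vertical_gaps(outlines)
    if not within or not between:
        return True  # nothing to compare
    return min(between) >= max(within)


good = [[(0, 20), (22, 40)], [(50, 70), (72, 90)]]
bad = [[(0, 20), (30, 40)], [(42, 70)]]
```

In `good` the records are 10 units apart while their parts are 2 units apart, so the segmentation passes; in `bad` a 10-unit gap sits inside the first record while the records are only 2 units apart, so it is rejected.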

3. Identifying Data Records Non-contiguous Data Records Case 1: The name and description of a data record are not in a contiguous segment of the HTML code. Corresponding child nodes of every tag node in a generalized node are joined together to form non-contiguous data records.

Case 2: Two or more data regions form multiple data records. Again, corresponding generalized nodes of each data region are joined together to form non-contiguous data records.

Note on MDR and MDR-2 • The algorithm finds all data records; simple heuristics can therefore be designed to output only the required type of data records.

DEPTA (Data Extraction Based on Partial Tree Alignment): Extract Data from Data Records • Once a list of data records is identified, we can also extract the data items in them. • The key task is to match corresponding data items or fields from all data records. • Produce one rooted tag tree for each data record. • Partial Tree Alignment • Approaches (align multiple data records): • Multiple string alignment • There are many ambiguities due to the pervasive use of table-related tags. • Authors’ Contribution: Multiple tree alignment (partial tree alignment) • Combined with visual information, this is an effective approach.

DEPTA (Data Extraction Based on Partial Tree Alignment): Extract Data from Data Records The problem is that the amount of information associated with each item differs.

Tree Matching (edit distance) • Minimum cost mapping between two trees. • Formally: Let X be a tree and let X[i] be the i-th node of tree X in a preorder walk of the tree. A mapping M between a tree A of size n1 and a tree B of size n2 is a set of ordered pairs (i, j), one from each tree, satisfying the following conditions for all (i1, j1), (i2, j2) ∈ M: • i1 = i2 iff j1 = j2; • A[i1] is on the left of A[i2] iff B[j1] is on the left of B[j2]; • A[i1] is an ancestor of A[i2] iff B[j1] is an ancestor of B[j2].

Intuitive idea • The definition requires that each node can appear no more than once in a mapping, and that both the order between sibling nodes and the hierarchical relation between nodes are preserved. [Figure: example mapping between two trees A and B.]

Simple Tree Matching (Yang 1991)
• General tree matching is computationally intensive. The authors propose the use of Simple Tree Matching.
• No node replacement and no level crossing are allowed.
• The dynamic programming solution costs O(n1n2), where n1 and n2 are the sizes of trees A and B respectively.
Algorithm: Simple_Tree_Matching(A, B)
  if the roots of the two trees A and B contain distinct symbols or have visual conflict then
    return (0);
  else
    m := the number of first-level sub-trees of A;
    n := the number of first-level sub-trees of B;
    Initialization: M[i, 0] := 0 for i = 0, …, m;
                    M[0, j] := 0 for j = 0, …, n;
    for i = 1 to m do
      for j = 1 to n do
        M[i, j] := max(M[i, j-1], M[i-1, j], M[i-1, j-1] + W[i, j]);
          where W[i, j] = Simple_Tree_Matching(Ai, Bj)
    return (M[m, n] + 1)
• We can trace back in the M matrices to find the matched/aligned nodes in the two trees.
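The pseudocode translates fairly directly into Python. The version below is a minimal sketch: the `Node` class is a hypothetical stand-in for a tag-tree node, nodes are compared by tag name only, and the visual-conflict test is omitted because it needs rendering information.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum matching of trees a and b."""
    if a.tag != b.tag:
        return 0  # distinct root symbols: nothing below can match either
    m, n = len(a.children), len(b.children)
    # M[i][j]: best matching between the first i sub-trees of a
    # and the first j sub-trees of b; row 0 and column 0 stay 0.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 for the matched roots


tree_a = Node("tr", [Node("td", [Node("b")]), Node("td", [Node("a")])])
tree_b = Node("tr", [Node("td", [Node("b")]),
                     Node("td", [Node("i")]),
                     Node("td", [Node("a")])])
score = simple_tree_matching(tree_a, tree_b)  # 5: tr, td, b, td, a
```

The extra middle cell in `tree_b` is skipped, and all five nodes of `tree_a` find a partner, so the score is 5. Recovering which nodes matched would additionally require keeping the M matrices and tracing back through them.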

Multiple alignment • We need to align multiple tag trees in order to produce a single database table with all the corresponding data items/fields in the same column of the table. Each row in the table represents a tree and each column represents nodes in the tree (data fields in the data record). • Most multiple alignment methods work like hierarchical clustering and require n^2 pair-wise matchings. • Too expensive. • Optimal matching is exponential. • Partial tree alignment is proposed in the DEPTA system to perform multiple alignment.

Partial tree alignment • The seed tree, denoted by Ts, is initially picked to be the tree with the maximum number of data fields (items). • Then for each tree Ti (i ≠ s), the algorithm tries to find for each node in Ti a matching node in Ts. • When a match is found for node ni, a link is created from ni to ns to indicate its match in the seed tree. • If no match can be found for node ni, then the algorithm attempts to expand the seed tree by inserting ni into Ts. • We only insert ni into Ts if a position for inserting ni can be uniquely determined in Ts. • The expanded seed tree Ts is then used in subsequent matching.
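The steps above can be sketched on a simplified, single-level case. The code below is an illustration under strong assumptions, not the DEPTA implementation: each tree is reduced to the ordered list of its root's child labels, nodes match when their labels are equal (found with a longest common subsequence instead of Simple Tree Matching), and the seed is passed in by the caller. An unmatched run of nodes is inserted only when its matched neighbours are adjacent in the seed, or when it hangs off the seed's leftmost or rightmost node. The function names are hypothetical.

```python
from typing import List, Tuple


def lcs_pairs(seed: List[str], other: List[str]) -> List[Tuple[int, int]]:
    """Index pairs (i, j) of one longest common subsequence."""
    m, n = len(seed), len(other)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seed[i - 1] == other[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    pairs, i, j = [], m, n
    while i > 0 and j > 0:  # trace back through the table
        if seed[i - 1] == other[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def expand_seed(seed: List[str], other: List[str]) -> Tuple[List[str], bool]:
    """Insert unmatched nodes of `other` where the position is unique.

    Returns the (possibly expanded) seed and whether `other` is fully aligned.
    """
    pairs = lcs_pairs(seed, other)
    if not pairs:
        return list(seed), len(other) == 0  # no anchor, so nothing can be placed
    # Sentinels let the leftmost and rightmost gaps use the same adjacency test.
    bounds = [(-1, -1)] + pairs + [(len(seed), len(other))]
    inserts, complete = {}, True
    for (i_left, j_left), (i_right, j_right) in zip(bounds, bounds[1:]):
        gap = other[j_left + 1:j_right]
        if not gap:
            continue
        if i_right == i_left + 1:  # neighbours adjacent in seed: unique position
            inserts[i_right] = gap
        else:
            complete = False       # ambiguous position: leave for a later pass
    new_seed = []
    for pos in range(len(seed) + 1):
        new_seed.extend(inserts.get(pos, []))
        if pos < len(seed):
            new_seed.append(seed[pos])
    return new_seed, complete


def partial_align(seed: List[str], others: List[List[str]]) -> List[str]:
    """Grow the seed, retrying unaligned trees while the seed keeps growing."""
    seed, pending, progress = list(seed), list(others), True
    while pending and progress:
        progress, retry = False, []
        for tree in pending:
            new_seed, complete = expand_seed(seed, tree)
            progress = progress or len(new_seed) > len(seed)
            seed = new_seed
            if not complete:
                retry.append(tree)
        pending = retry
    return seed


final_seed = partial_align(["x", "b", "d"],
                           [["b", "n", "c", "k", "g"], ["b", "c", "d", "h", "k"]])
```

With seed children `a, b, e`, a tree `b, c, d, e` expands the seed to `a, b, c, d, e`, whereas a tree `a, x, e` changes nothing because `x` could go either before or after `b`. In the three-tree call above, the first tree contributes nothing on the first pass, the second inserts `c`, `h`, and `k`, and the first tree is then matched again, giving `x, b, n, c, d, h, k, g`.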

[Figure: two cases of expanding the seed tree Ts with a tree Ti. In the first case insertion is possible and Ts gains a new part; in the second case insertion is not possible.]

[Figure: worked example of partial tree alignment with three trees, Ts = T1, T2, and T3. Matching T2 against Ts: no node inserted. Matching T3 against Ts: c, h, and k inserted, giving a new Ts. T2 is then matched again against the new Ts.]

Output Data Table The final tree may also be used to match and extract data from other similar pages. Note: the MDR (and DEPTA) framework can handle nested data records using post-order tree traversal

Empirical Evaluations • Step 1 is data record extraction. • Step 2 is data item alignment and extraction. • Number of sites: 49 • Number of pages: 72 • x/y means that x is the number of extracted results that are incorrect and y is the number of results that are not extracted.

Conclusions • Data extraction is highly effective. Almost all the errors are due to data record extraction. • DEPTA makes no assumptions about the page other than requiring that it contains more than one data record. • Results show that the new 2-step technique can segment data records and extract data from them very accurately.

Limitations and issues • Not for a page with only a single data record. • May find unwanted data. • Not able to generate attribute names for the extracted data. • Extracted data from multiple sites needs integration. • In general, automatic integration is hard. • It is possible in restricted domains, e.g., • products sold online. • We want “product name”, “image”, and “price”. • Identifying only these three fields may not be too hard.
