Web Data Extraction

Web Data Extraction Aki Hecht Seminar in Databases (236826) January 2009

Agenda • Introduction • Building Tag Trees • Mining Data Regions • Partial Tree Alignment • Extraction Given Multiple Pages

Introduction • Enormous amount of data is stored in open databases. • Most databases retrieve web pages with structured data objects. • Usually “Deep Web” pages • Non trivial task to crawl those pages • The data is important and useful for many applications: • Price comparison engines • Collecting individuals information

The goal Given a HTML page containing multiple data records – insert the data into a table. • No assumptions allowed on the amount of data records in the page nor on their structure/content. • The extraction should be done automatically • Human intervention can help in getting more accurate results, but the cost is too high.

Example 1

Example 2 More than one data region!

General idea • Given a Web page: • Build the HTML tag tree • Mine data regions • Mining data records directly is hard • Identify data records from each data region • Learn the structure of a general data record • A data record can contain optional fields • Extract the data

Building a tag tree • Most HTML tags work in pairs. Within each corresponding tag-pair, there can be other pairs of tags, resulting in a nested structure. • Some tags do not require closing tags (e.g., <li>, <hr> and <p>) although they have closing tags. • Additional closing tags need to be inserted to ensure all tags are balanced. • Building a tag treefrom a page using its HTML code is thus natural.

An example

The tag tree

Building trees using visual cues • The HTML code can contain errors. • Browsers are sophisticated enough to display pages with HTML errors. • We can build the tag tree using the browser’s mechanism. • Each HTML element is rendered as a rectangle. • Containments of rectangles representing nesting.

An example

Tree Edit Distance • Tree edit distance between two trees A and B is the cost associated with the minimum set of operations needed to transform A into B. • The set of operations used to define tree edit distance includes three operations: • node removal • node insertion • node replacement A cost is assigned to each of the operations.

Finding Tree Edit Distance • Tree edit distance is very similar to string edit distance. • Can be found in the same way • Done by finding the minimal cost mapping between the two trees.

Finding Tree Edit Distance cont. • The algorithm for finding the minimal cost mapping is identical for both trees and strings. • Based on dynamic programming

Mining Data Regions • Definition: A generalized node of length r consists of r (r  1) nodes in the tag tree with the following two properties: • the nodes all have the same parent. • the nodes are adjacent. • Definition: A data region is a collection of two or more generalized nodes with the following properties: • the generalized nodes all have the same parent. • the generalized nodes all have the same length. • the generalized nodes are all adjacent. • the similarity between adjacent generalized nodes is greater than a fixed threshold.

An Example 1 The regions were found using tree edit distance. For example, nodes 5 and 6 are similar (low cost mapping), they also suit the above definition and therefore they define a data region 2 3 4 6 9 10 5 7 8 12 11 Region 1 Region 2 13 16 17 14 15 18 19 Region 3

Partial Tree Alignment • For each data region we have found we need to understand the structure of the data records in the region. • Not all data records contain the same fields (optional fields are possible) • We will use (partial) tree alignment to gather the structure.

The algorithm • Choose a seed tree: • A seed tree, denoted by Ts, is picked with the maximum number of data items. • Tree matching: • For each unmatched tree Ti (i ≠ s), • match Ts and Ti. • Each pair of matched nodes are linked (aligned). • For each unmatched node nj in Ti do • expand Ts by inserting nj into Ts if a position for insertion can be uniquely determined in Ts. The expanded seed tree Ts is then used in subsequent matching.

Partial Tree Alignment of two trees Ti Ts p p e d a c b e b Insertion is possible p New part of Ts e d c b a Ti p p Ts Insertion is not possible e a b a x e

Full algorithm

A complete example p T2 p Ts = T1 p T3 … k c n x b g d b d h k b c Ts p No node inserted … x b d p New Ts c b x d h k … T2 is matched again p T2 p … g n x c d h k b g k c b n

Output data table Different data records contain different fields!

Extraction given multiple pages • The described technique is good for a single list page. • It can clearly be used for multiple list pages. • Templates from all input pages may be found separately and merged to produce a single refined pattern. • Extraction results will get more accurate. • In many applications, one needs to extract the data from the detail pages as they contain more information on the object.

Detail pages – an example More data in the detail pages A list page

Extraction from detail pages • For extraction, we can treat each detail page as a data record, then extract using partial tree alignment. • For instance, to apply the algorithm, we simply create a rooted tree as follows: • create an artificial root node, and • make the tag tree of each page as a child sub-tree of the artificial root node.

An example r … We already know how to extract data from a data region

Difficulty with detail pages • Although a detail page focuses on a single object, the page may contain a large amount of “noise”, at the top, on the left and right and at the bottom. • Mostly in commercial websites • Since we treat each page as a data record, the algorithm will also extract the “noise”.

An example (a lot of noise)

The solution • To start, a sample page is taken as the wrapper. • The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper. • A mismatch occurs when some token in the sample does not match the grammar of the wrapper.

Wrapper generalization • Different types of mismatches: • Text string mismatches: indicate data fields (or items). • Tag mismatches: indicate list of repeated patterns or optional elements. • Find the last token of the mismatch position and identify some candidate repeated patterns from the wrapper and sample by searching forward.

An example

Summary • Automatic extraction of data from a web page requires understanding of the data records’ structure. • First step is finding the data records in the page. • Second step is merging the different structures and build a generic template for a data record. • Partial tree alignment is one method for building the template.

Summary cont. Automatic extraction • Advantages: • It is scalable to a huge number of sites due to the automatic process. • Disadvantages: • It may extract a large amount of unwanted data because the system does not know what is interesting to the user. Domain heuristics or manual filtering may be needed to remove unwanted data. • Extracted data from multiple sites need integration, i.e., their schemas need to be matched.

Thank you! Question?

Bibliography • Y. Zhai, B. Liu “Web data extraction based on partial tree alignment”. International World Wide Web Conference (2005) • Y. zhai, B. Liu "Structured data extraction from the web based on partial tree alignment," IEEE Transactions on Knowledge and Data Engineering (2006) • DC Reis, PB Golgher, AS Silva, AF Laender “Automatic web news extraction using tree edit distance” Proceedings of the 13th international conference on World Wide Web Conference (2004)

Web Data Extraction