Structured Data Extraction from the Web Based on Partial Tree Alignment, by Yanhong Zhai and Bing Liu
Introduction
• A large amount of information on the Web is contained in regularly structured data objects, which are data records retrieved from databases.
• Such Web data records are important because they often present the essential information of their host pages, e.g., lists of products and services.
• Applications: integrated and value-added services, e.g., comparative shopping, meta-search and query.
Existing Methods
• Wrapper programming languages: this approach provides languages that facilitate the construction of data extraction programs.
• Wrapper induction: this approach uses machine learning techniques to learn data extraction rules from a set of manually labeled examples.
• Automatic extraction: this approach is based on the idea of automatic pattern discovery.
Proposed Method
• DEPTA (Data Extraction based on Partial Tree Alignment)
• The method consists of two steps:
  1) Identifying individual records in a page.
  2) Aligning and extracting data items from the identified records.
Architecture of the DEPTA System
• Input: a Web page → DOM Tree Builder → Data Region Identifier → Data Records Identifier → Data Items Extractor → Output: data tables
DATA RECORD IDENTIFICATION
• MDR: Mining Data Records
• Given a single page with multiple data records, MDR extracts the data records, but not the data items (step 1).
• MDR is based on
  • two observations about data records in a Web page, and
  • a tree matching algorithm.
• It considers both contiguous and non-contiguous records.
Two Observations
• A group of data records is presented in a contiguous region (a data region) of a page and is formatted using similar HTML tags.
• A set of similar data records is formed by some child subtrees of the same parent node.
DOM tree of the previous page
• [Figure: TABLE → TBODY → TR nodes, each with TD children; the subtrees forming data record 1 and data record 2 are marked.]
The Approach
• Given a page:
  • Build the DOM tree based on visual information
  • Mine data regions
  • Identify data records
• Rendering (or visual) information is very useful in the whole process.
Building DOM Trees Based on Visual Information
<table>
<tr>
<td>data1</td>
<td>data2</td>
<tr>
<td>data3</td>
<td>data4</td>
</tr>
</table>
[Figure: the bounding box (left, right, top, bottom) of each rendered element, e.g. table = (100, 300, 200, 400), and the DOM tree derived from how the boxes nest: table → two tr nodes, each with td children.]
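One way to picture building a tree from rendering information is to nest elements by bounding-box containment. The sketch below is only an illustration under stated assumptions: the `Box` class, `build_tree`, and `outline` are hypothetical names, the coordinates are illustrative values for a 2x2 table, and the first box is assumed to be the root that encloses all others.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A rendered element with its bounding box (assumed representation)."""
    tag: str
    left: int
    right: int
    top: int
    bottom: int
    children: list = field(default_factory=list)

    def contains(self, other):
        # True when `other` lies entirely inside this box (edges may touch).
        return (self.left <= other.left and other.right <= self.right
                and self.top <= other.top and other.bottom <= self.bottom)


def build_tree(boxes):
    """Nest boxes, given in document order, by containment."""
    root = boxes[0]  # assumption: the first box encloses all the others
    for box in boxes[1:]:
        parent = root
        while True:
            # Descend into the most recent child that encloses the new box.
            inner = next((c for c in reversed(parent.children)
                          if c.contains(box)), None)
            if inner is None:
                break
            parent = inner
        parent.children.append(box)
    return root


def outline(node):
    """Return the tree as nested (tag, children) tuples for inspection."""
    return (node.tag, [outline(c) for c in node.children])


# Illustrative coordinates (left, right, top, bottom) for a 2x2 table.
boxes = [
    Box("table", 100, 300, 200, 400),
    Box("tr", 100, 300, 200, 300),
    Box("td", 100, 200, 200, 300),
    Box("td", 200, 300, 200, 300),
    Box("tr", 100, 300, 300, 400),
    Box("td", 100, 200, 300, 400),
    Box("td", 200, 300, 300, 400),
]
root = build_tree(boxes)
print(outline(root))
```

Because nesting is decided by geometry rather than by closing tags, the second row still ends up as a sibling of the first even though the HTML above never closes the first `<tr>`.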
Enhanced Simple Tree Matching
• [Figure (a): two trees T1 and T2 with more than one possible match. Figure (b): a wrong alignment and the correct alignment of their data items.]
• Alignment using tags only can produce wrong alignments.
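The slide names the matching algorithm without giving its recurrence. As a rough sketch, the classic Simple Tree Matching dynamic program (not the enhanced variant, whose details are not shown here) can be written as follows. The `(label, [children])` tuple encoding and the function name are assumptions made for this illustration.

```python
def simple_tree_matching(a, b):
    """Size of the maximum order-preserving matching between trees a and b.

    Each tree is a (label, [children]) tuple. Roots must carry the same
    label to match at all; children are then paired by dynamic programming.
    """
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    # m[i][j]: best matching between the first i children of a
    # and the first j children of b.
    m = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i in range(1, len(kids_a) + 1):
        for j in range(1, len(kids_b) + 1):
            w = simple_tree_matching(kids_a[i - 1], kids_b[j - 1])
            m[i][j] = max(m[i][j - 1], m[i - 1][j], m[i - 1][j - 1] + w)
    return m[len(kids_a)][len(kids_b)] + 1  # +1 for the matched roots


def leaf(label):
    return (label, [])


t1 = ("p", [leaf("a"), leaf("b"), leaf("e")])
t2 = ("p", [leaf("b"), leaf("c"), leaf("d"), leaf("e")])
print(simple_tree_matching(t1, t2))  # matched nodes: p, b, e
```

This version compares tag labels only, which is exactly the situation where several matchings can tie and a wrong alignment may be chosen.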
Mining Data Regions
• Find every data region with similar data records.
• Definition: A generalized node (or a node combination) of length r consists of r (r ≥ 1) nodes in the HTML tag tree with the following two properties:
  1. The nodes all have the same parent.
  2. The nodes are adjacent.
• Definition: A data region is a collection of two or more generalized nodes with the following properties:
  1. The generalized nodes all have the same parent.
  2. The generalized nodes are all adjacent.
  3. Adjacent generalized nodes are similar.
Determining Data Regions
• To find each data region, the algorithm needs to answer the following:
  1. Where does the first generalized node of the data region start? Try starting from each child node under a parent.
  2. How many tag nodes or components does a generalized node have? Try one-node, two-node, ..., K-node combinations.
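The two questions above (where a region starts, and how many nodes make up a generalized node) suggest a search over start positions and combination lengths. The sketch below is a simplified greedy scan, not the MDR algorithm itself: each child is represented by the list of tags in its subtree, similarity is approximated with `difflib.SequenceMatcher`, and `find_data_regions`, `max_k`, and the 0.8 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similar(tags_a, tags_b, threshold):
    """Approximate similarity of two generalized nodes by their tag lists."""
    ratio = SequenceMatcher(None, tags_a, tags_b, autojunk=False).ratio()
    return ratio >= threshold


def find_data_regions(children, max_k=3, threshold=0.8):
    """Scan the children of one parent for runs of similar generalized nodes.

    Returns (start, k, count) triples: a region starts at child `start` and
    consists of `count` adjacent generalized nodes of `k` children each.
    """
    def group(pos, k, i):
        # Tags of the i-th k-node combination starting at `pos`.
        nodes = children[pos + i * k: pos + (i + 1) * k]
        return [tag for node in nodes for tag in node]

    regions, pos, n = [], 0, len(children)
    while pos < n:
        best = None
        for k in range(1, max_k + 1):
            count = 1
            while (pos + (count + 1) * k <= n
                   and similar(group(pos, k, count - 1),
                               group(pos, k, count), threshold)):
                count += 1
            # Keep the combination length that covers the most children.
            if count >= 2 and (best is None or count * k > best[1] * best[2]):
                best = (pos, k, count)
        if best:
            regions.append(best)
            pos += best[1] * best[2]
        else:
            pos += 1
    return regions


# Hypothetical children of one parent, each given as its subtree's tag list.
children = [
    ["h2"],
    ["div", "img", "a"], ["div", "img", "a"], ["div", "img", "a", "b"],
    ["hr"],
    ["h3", "a"], ["p"], ["h3", "a"], ["p"],
]
regions = find_data_regions(children)
print(regions)
```

In this example the first region is made of one-node generalized nodes, while the second needs two-node combinations because a heading and its paragraph only look alike when taken as a pair.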
An illustration of generalized nodes and data regions
• [Figure: a tag tree with nodes numbered 1 to 19. Shaded nodes are generalized nodes; they are grouped into three data regions, Region 1, Region 2 and Region 3.]
Identifying Data Records
• A generalized node may not be a data record.
• Extra mechanisms are needed to identify true atomic objects.
• Some highlights: contiguous and non-contiguous data records.
• [Figure: two layouts of four objects. In the first, each name is immediately followed by its description (Name1, Description of object 1, Name2, Description of object 2, and so on). In the second, a row of names (Name1, Name2) is followed by a row of their descriptions, then Name3, Name4 and their descriptions, so the parts of one record are not contiguous.]
DEPTA: Extract Data from Data Records
• Once a list of data records is identified, we can align and extract the data items in them.
• Multiple tree alignment:
  • We need multiple alignment because we have multiple data records.
  • Most multiple alignment methods work like hierarchical clustering and require n^2 pairwise matchings, which is too expensive.
  • Optimal alignment/matching is exponential.
• A partial tree matching algorithm is proposed in DEPTA to perform multiple tree alignment.
The Partial Tree Alignment Approach
• Choose a seed tree: the seed tree, denoted by Ts, is the tree with the maximum number of data items.
• Tree matching: for each unmatched tree Ti (i ≠ s):
  • Match Ts and Ti.
  • Each pair of matched nodes is linked (aligned).
  • For each unmatched node nj in Ti, expand Ts by inserting nj into Ts if a position for insertion can be uniquely determined in Ts.
• The expanded seed tree Ts is then used in subsequent matching.
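The steps above can be sketched in a deliberately simplified form. Assumptions, all made for illustration: each record is a flat list of child labels rather than a full tree, `difflib.SequenceMatcher` stands in for tree matching, a run of unmatched nodes is inserted only when its matched neighbours are adjacent in the seed (or the run sits at a matching end of the seed), and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def match_positions(seed, tree):
    """Map index in `tree` -> index in `seed` for an order-preserving match."""
    matcher = SequenceMatcher(None, seed, tree, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.b + k] = block.a + k
    return mapping


def try_expand(seed, tree):
    """Insert unmatched runs of `tree` into `seed` where the spot is unique.

    Returns (new_seed, complete); `complete` is False when some unmatched
    node could not be placed, so the tree should be matched again later.
    """
    mapping = match_positions(seed, tree)
    inserts, complete = [], True
    i, n = 0, len(tree)
    while i < n:
        if i in mapping:
            i += 1
            continue
        j = i
        while j < n and j not in mapping:
            j += 1
        left, right = mapping.get(i - 1), mapping.get(j)
        if left is not None and right is not None and right == left + 1:
            inserts.append((right, tree[i:j]))      # between adjacent nodes
        elif left is None and right == 0:
            inserts.append((0, tree[i:j]))          # before the first node
        elif right is None and left == len(seed) - 1:
            inserts.append((len(seed), tree[i:j]))  # after the last node
        else:
            complete = False                        # position is ambiguous
        i = j
    new_seed = list(seed)
    for pos, run in sorted(inserts, key=lambda item: item[0], reverse=True):
        new_seed[pos:pos] = run
    return new_seed, complete


def partial_tree_alignment(trees):
    """Grow a seed from the largest tree; retry trees that were left pending."""
    s = max(range(len(trees)), key=lambda i: len(trees[i]))
    seed = list(trees[s])
    pending = [t for i, t in enumerate(trees) if i != s]
    progress = True
    while pending and progress:
        progress, still_pending = False, []
        for tree in pending:
            new_seed, complete = try_expand(seed, tree)
            progress = progress or len(new_seed) > len(seed)
            seed = new_seed
            if not complete:
                still_pending.append(tree)
        pending = still_pending
    return seed


print(try_expand(["a", "b", "e"], ["b", "c", "d", "e"]))
print(try_expand(["a", "b", "e"], ["a", "x", "e"]))
```

The first call places c and d between b and e; the second leaves the seed unchanged because x could sit on either side of b. Re-matching pending trees after the seed has grown is what lets a node that was ambiguous earlier find a unique position later.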
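Illustration of partial tree alignment
• Insertion is possible: Ts = p(a, b, e) and Ti = p(b, c, d, e). The unmatched nodes c and d lie between b and e, which are adjacent in Ts, so they are inserted there. New Ts = p(a, b, c, d, e); c and d are the new part of Ts.
• Insertion is not possible: Ts = p(a, b, e) and Ti = p(a, x, e). The unmatched node x lies between a and e, but Ts has b between them, so x could go either before or after b and its position is not unique.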
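A complete example
• Ts = T1 = p(…, x, b, d); T2 = p(b, n, c, k, g); T3 = p(b, c, d, h, k)
• T2 is matched against Ts: no node is inserted, so Ts stays p(…, x, b, d).
• T3 is matched against Ts: c, h and k are inserted. New Ts = p(…, x, b, c, d, h, k).
• T2 is matched again: n and g are inserted. Final Ts = p(…, x, b, n, c, d, h, k, g).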
Output data table
      …  x  b  n  c  d  h  k  g
T1    …  1  1        1
T2          1  1  1        1  1
T3          1     1  1  1  1
• The final tree may also be used to match and extract data from other similar pages.
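Once the final seed fixes the column order, filling the table is a lookup per record. The sketch below is a minimal illustration, assuming flat label lists for the three records of the example, leaving out the elided "…" part of T1, and using the hypothetical name `presence_table`. It marks presence with 1 as on the slide; in an actual extraction each cell would presumably hold the aligned data item instead.

```python
def presence_table(columns, records):
    """For each record, mark with 1 the columns whose node it contains."""
    table = {}
    for name, items in records.items():
        present = set(items)
        table[name] = [1 if col in present else 0 for col in columns]
    return table


# Column order taken from the final seed tree of the example.
columns = ["x", "b", "n", "c", "d", "h", "k", "g"]
records = {
    "T1": ["x", "b", "d"],
    "T2": ["b", "n", "c", "k", "g"],
    "T3": ["b", "c", "d", "h", "k"],
}
table = presence_table(columns, records)

print("   ", "  ".join(columns))
for name, row in table.items():
    # Empty cells are printed as blanks, as in the slide's table.
    print(name, " ", "  ".join("1" if v else " " for v in row))
```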
Conclusion
• Existing techniques are either inaccurate or make several assumptions.
• Our method does not make these assumptions.
• Our technique consists of two steps:
  • Identifying data records
  • Aligning corresponding data items from multiple data records
• Step 1 is based on visual cues.
• Step 2 is based on partial tree alignment.