
A Survey of WEB Information Extraction Systems


Presentation Transcript


  1. A Survey of WEB Information Extraction Systems Chia-Hui Chang National Central University Sep. 22, 2005

  2. Introduction • Abundant information on the Web • Static Web pages • Searchable databases: the Deep Web • Information Integration • Information for daily life • e.g. shopping agents, travel agents • Data for research purposes • e.g. bioinformatics, auction economy

  3. Introduction (Cont.) • Information Extraction (IE) • is the task of identifying relevant information in documents, pulling information from a variety of sources and aggregating it into a homogeneous form • An IE task is defined by its input and output

  4. An IE Task

  5. Web Data Extraction (figure: an example Web page with two data records highlighted)

  6. IE Systems • Wrappers • Programs that perform the task of IE are referred to as extractors or wrappers. • Wrapper Induction • IE systems are software tools that are designed to generate wrappers.

  7. Various IE Surveys • Muslea • Hsu and Dung • Chang • Kushmerick • Laender • Sarawagi • Kuhlins and Tredwell

  8. Related Work: Time • MUC Approaches • AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995] • Post-MUC Approaches • WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]

  9. Related Work: Automation Degree • Hsu and Dung [1998] • hand-crafted wrappers using general programming languages • specially designed programming languages or tools • heuristic-based wrappers, and • WI approaches

  10. Related Work: Automation Degree • Chang and Kuo [2003] • systems that need programmers, • systems that need annotation examples, • annotation-free systems and • semi-supervised systems

  11. Related Work: Input and Extraction Rules • Muslea [1999] • The first class performs IE from free text using extraction patterns that are mainly based on syntactic/semantic constraints. • The second class, wrapper induction systems, relies on delimiter-based rules. • The third class also processes IE from online documents; however, the patterns of these tools are based on both delimiters and syntactic/semantic constraints.

  12. Related Work: Extraction Rules • Kushmerick [2003] • Finite-state tools (regular expressions) • Relational learning tools (logic rules)

  13. Related Work: Techniques • Laender [2002] • languages for wrapper development • HTML-aware tools • NLP-based tools • Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER), • Modeling-based tools • Ontology-based tools • New Criteria: • degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness.

  14. Related Work: Output Targets • Sarawagi [VLDB 2002] • Record-level • Page-level • Site-level

  15. Related Work: Usability • Kuhlins and Tredwell [2002] • Commercial • Noncommercial

  16. Three Dimensions • Task Domain • Input (unstructured, semi-structured) • Output targets (record-level, page-level, site-level) • Automation Degree • Programmer-involved, learning-based or annotation-free approaches • Techniques • Regular expression rules vs. Prolog-like logic rules • Deterministic finite-state transducers vs. probabilistic hidden Markov models

  17. Task Domain: Input

  18. Task Domain: Output • Missing Attributes • Multi-valued Attributes • Multiple Permutations • Nested Data Objects • Various Templates for an attribute • Common Templates for various attributes • Untokenized Attributes

  19. Classification by Automation Degree • Manually • TSIMMIS, Minerva, WebOQL, W4F, XWrap • Supervised • WIEN, Stalker, SoftMealy • Semi-supervised • IEPAD, OLERA • Unsupervised • DeLa, RoadRunner, EXALG

  20. Automation Degree • Page-fetching Support • Annotation Requirement • Output Support • API Support

  21. Technologies • Scan passes • Extraction rule types • Learning algorithms • Tokenization schemes • Features used

  22. A Survey of Contemporary IE Systems • Manually-constructed IE tools • Programmer-aided • Supervised IE systems • Label-based • Semi-supervised IE systems • Unsupervised IE systems • Annotation-free

  23. Manually-constructed IE Systems • TSIMMIS [Hammer et al., 1997] • Minerva [Crescenzi, 1998] • WebOQL [Arocena and Mendelzon, 1998] • W4F [Sahuguet and Azavant, 2001] • XWrap [Liu et al., 2000]

  24. A Running Example

  25. TSIMMIS • Each command is of the form: [variables, source, pattern] where • source specifies the input text to be considered • pattern specifies how to find the text of interest within the source, and • variables are a list of variables that hold the extracted results. • Note: • # means “save in the variable” • * means “discard”
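A rough illustration of how such a command triple could be interpreted: the Python sketch below compiles the pattern into a regular expression, turning * into a skip and # into a capture. The function name and the regex-based treatment are assumptions for illustration, not TSIMMIS's actual extractor.

    import re

    def apply_command(variables, source, pattern):
        # Interpret one TSIMMIS-style [variables, source, pattern] command:
        # '*' skips (discards) text, '#' captures text into the next variable.
        regex = ""
        for part in re.split(r"([*#])", pattern):
            if part == "*":
                regex += ".*?"          # discard
            elif part == "#":
                regex += "(.*?)"        # save in the variable
            else:
                regex += re.escape(part)
        m = re.search(regex, source, re.DOTALL)
        return dict(zip(variables, m.groups())) if m else {}

    page = "<b>Book Name</b> Databases <b>Reviews</b>"
    print(apply_command(["title"], page, "*</b>#<b>*"))
    # {'title': ' Databases '}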

  26. Minerva • The grammar used by Minerva is defined in an EBNF style

  27. WebOQL • A page is abstracted as a hypertree whose nodes carry (Tag, Source, Text) triples: the root (Tag: Body, Source: <Body>…</Body>) holds the <b> headings (Book Name, Reviews) and the <ol> (Source: <ol>…</ol>) whose <li> subtrees contain each reviewer's name, rating and text [hypertree figure omitted] • Query for the running example:
  Select [Z!'.Text]
  From x in browse("pe2.html")', y in x', Z in y'
  Where x.Tag = "ol" and Z.Text = "Reviewer Name"

  28. W4F • WYSIWYG support • Java toolkit • Extraction rules over the HTML parse tree (DOM object) • e.g. html.body.ol[0].li[*].pcdata[0].txt • Regular expressions to address finer pieces of information
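To make the path expression concrete, here is a hedged sketch that emulates html.body.ol[0].li[*].pcdata[0].txt over the running example. BeautifulSoup stands in for the parse tree; W4F itself is a Java toolkit with its own HTML parser and rule language.

    from bs4 import BeautifulSoup  # stand-in for W4F's Java parse tree

    html = """<html><body><b>Book Name</b> Databases <b>Reviews</b>
    <ol><li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> fine </li>
    <li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> so-so </li>
    </ol></body></html>"""

    soup = BeautifulSoup(html, "html.parser")
    ol = soup.body.find_all("ol")[0]          # html.body.ol[0]
    for li in ol.find_all("li"):              # .li[*]
        # .pcdata[0].txt: the first text node that is a direct child of <li>
        pcdata = [c for c in li.contents if isinstance(c, str) and c.strip()]
        print(pcdata[0].strip())              # John, then Jane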

  29. Supervised IE systems • SRV [Freitag, 1998] • Rapier [Califf and Mooney, 1998] • WIEN [Kushmerick, 1997] • WHISK [Soderland, 1999] • NoDoSE [Adelberg, 1998] • SoftMealy [Hsu and Dung, 1998] • Stalker [Muslea, 1999] • DEByE [Laender, 2002b]

  30. SRV • Single-slot information extraction • Top-down (general to specific) relational learning algorithm • Positive examples • Negative examples • Learning algorithm works like FOIL • Token-oriented features • Logic rule for the running example: Rating extraction rule :- Length(=1), Every(numeric true), Every(in_list true)
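A toy rendering of how these token-level predicates could be evaluated. The learned in_list set is hypothetical; the real SRV induces such rules FOIL-style from labeled examples rather than evaluating hand-written ones.

    LEARNED_LIST = {"1", "2", "3", "4", "5", "6", "7", "8", "9", "10"}  # hypothetical

    def matches_rating_rule(fragment):
        # Rating rule from the slide: length(=1), every(numeric), every(in_list)
        return (len(fragment) == 1
                and all(t.isdigit() for t in fragment)
                and all(t in LEARNED_LIST for t in fragment))

    print(matches_rating_rule(["7"]))     # True  -> extract as Rating
    print(matches_rating_rule(["John"]))  # False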

  31. Rapier • Field-level (single-slot) data extraction • Bottom-up (specific to general) • The extraction rules consist of 3 parts: • Pre-filler • Slot-filler • Post-filler • Book Title extraction rule for the running example (applied in the sketch below):
  Pre-filler: word: Book, word: Name, word: </b>
  Slot-filler: Length = 2, Tag: [nn, nns]
  Post-filler: word: <b>
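A hedged sketch of applying such a three-part rule: scan for the pre-filler context, try slots up to the length bound, and check the post-filler. The noun-tag test is faked with a simple predicate; this is a simplification, not Rapier's rule interpreter.

    def rapier_match(tokens, pre, post, max_len, slot_test):
        # Find a slot of at most max_len tokens between pre- and post-filler.
        for i in range(len(pre), len(tokens)):
            if tokens[i - len(pre):i] != pre:
                continue
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                slot = tokens[i:j]
                if tokens[j:j + len(post)] == post and all(map(slot_test, slot)):
                    return slot
        return None

    tokens = "<b> Book Name </b> Databases <b> Reviews </b>".split()
    print(rapier_match(tokens, ["Book", "Name", "</b>"], ["<b>"],
                       2, str.isalpha))   # ['Databases']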

  32. WIEN • LR Wrapper • (‘Reviewer Name </b>’, ‘<b>’, ‘Rating </b>’, ‘<b>’, ‘Text </b>’, ‘</li>’) • HLRT Wrapper (Head LR Tail) • OCLR Wrapper (Open-Close LR) • HOCLRT Wrapper • N-LR Wrapper (Nested LR) • N-HLRT Wrapper (Nested HLRT)
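Executing an LR wrapper is a plain scan for left/right delimiter pairs. A minimal sketch using the six delimiters above on a condensed version of the running example:

    def lr_extract(page, rule):
        # rule: one (left, right) delimiter pair per attribute of a record.
        records, pos = [], 0
        while True:
            record = []
            for left, right in rule:
                start = page.find(left, pos)
                if start < 0:
                    return records                 # no more records
                start += len(left)
                end = page.find(right, start)
                record.append(page[start:end].strip())
                pos = end
            records.append(tuple(record))

    page = ("<li><b>Reviewer Name </b> John <b>Rating </b> 7 <b>Text </b> fine </li>"
            "<li><b>Reviewer Name </b> Jane <b>Rating </b> 6 <b>Text </b> so-so </li>")
    rule = [("Reviewer Name </b>", "<b>"), ("Rating </b>", "<b>"),
            ("Text </b>", "</li>")]
    print(lr_extract(page, rule))
    # [('John', '7', 'fine'), ('Jane', '6', 'so-so')]

The HLRT, OCLR and nested variants add head/tail and open/close delimiters around this same scan to skip distracting page regions and to handle nested records.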

  33. WHISK • Top-down (general to specific) learning • Example • To generate 3-slot book reviews, it starts with the empty rule “*(*)*(*)*(*)*” • Each pair of parentheses indicates a phrase to be extracted • The phrase in the first pair of parentheses is bound to variable $1, the second to $2, etc. • The extraction logic is similar to WIEN's LR wrapper. Pattern:: * ‘Reviewer Name </b>’ (Person) ‘<b>’ * (Digit) ‘<b>Text</b>’ (*) ‘</li>’ Output:: BookReview {Name $1} {Rating $2} {Comment $3}
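The pattern translates almost directly into a regular expression. In the hedged sketch below the semantic classes Person and Digit are approximated by character classes; real WHISK patterns are matched against semantically tagged phrases.

    import re

    page = ("<li><b>Reviewer Name </b> John <b>Rating</b> 7 "
            "<b>Text</b> A fine book. </li>")

    # * 'Reviewer Name </b>' (Person) '<b>' * (Digit) '<b>Text</b>' (*) '</li>'
    pattern = re.compile(
        r".*?Reviewer Name </b>\s*([A-Z][a-z]+)\s*<b>.*?(\d+)\s*"
        r"<b>Text</b>\s*(.*?)\s*</li>", re.DOTALL)

    m = pattern.search(page)
    print({"Name": m.group(1), "Rating": m.group(2), "Comment": m.group(3)})
    # {'Name': 'John', 'Rating': '7', 'Comment': 'A fine book.'}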

  34. NoDoSE • Assumes the order of attributes within a record is fixed • The user interacts with the system to decompose the input. • For the running example: • a book title (an attribute of type string) and • a list of Reviewer records • RName (string), Rate (integer), and Text (string).

  35. SoftMealy • Finite-state transducer [transducer figure omitted: states b, N, R, T, e; separator transitions such as s<b,N>, s<,R> and s<,T> output “N=” + next_token, “R=” + next_token and “T=” + next_token, while ?/next_token and ?/ε edges consume the remaining tokens] • Contextual rules that recognize the separators, e.g. for the Rating field:
  s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>)
  s<,R>R ::= Spc(-) Num(-)
  s<R,>L ::= Num(-)
  s<R,>R ::= NL(-) HTML(<b>)

  36. Stalker • Embedded Catalog Tree • Multi-pass SoftMealy

  37. DEByE • Bottom-up extraction strategy • Comparison • DEByE: the user marks only atomic (attribute) values to assemble nested tables • NoDoSE: the user decomposes the whole document in a top-down fashion

  38. Semi-supervised Approaches • IEPAD [Chang and Lui, 2001] • OLERA [Chang and Kuo, 2003] • Thresher [Hogue, 2005]

  39. IEPAD • Encoding of the input page • Multiple-record pages • Pattern mining by PAT tree • Multiple string alignment • For the running example (T denotes a text token): • <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
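A toy stand-in for the mining step: brute-force counting of repeated substrings in the encoded page. The real system finds maximal repeats with a PAT tree and then generalizes near-matches by multiple string alignment.

    from collections import Counter

    # Two records of the running example in the tag-token encoding.
    record = "<li><b>T</b>T<b>T</b>T<b>T</b>T</li>"
    encoding = record * 2

    def longest_repeat(s, min_len=8, min_count=2):
        # Count every substring of length >= min_len; keep the longest
        # one that occurs at least min_count times (the record pattern).
        counts = Counter(s[i:i + n]
                         for n in range(min_len, len(s) // min_count + 1)
                         for i in range(len(s) - n + 1))
        repeats = [p for p, c in counts.items() if c >= min_count]
        return max(repeats, key=len, default=None)

    print(longest_repeat(encoding))   # '<li><b>T</b>T<b>T</b>T<b>T</b>T</li>'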

  40. OLERA • Online extraction rule analysis • Enclosing • Drill-down / Roll-up • Attribute Assignment

  41. Thresher • Works similarly to OLERA • Applies tree alignment instead of string alignment

  42. Unsupervised Approaches • RoadRunner [Crescenzi, 2001] • DeLa [Wang, 2002; 2003] • EXALG [Arasu and Garcia-Molina, 2003] • DEPTA [Zhai et al., 2005]

  43. RoadRunner • Input: multiple pages generated from the same template • Matches two input pages at a time: the wrapper is initialized to the first sample page (book "Databases", reviewer John with rating 7) and refined wherever the second page (book "Data mining", reviewers Jeff with rating 2 and Jane with rating 6) mismatches it • String mismatches (“Databases” vs. “Data mining”, “John” vs. “Jeff”, “7” vs. “2”) are generalized to #PCDATA data fields • Tag mismatches (the second page contains an additional <LI>…</LI> block for reviewer Jane) are solved by discovering repeated patterns ( … )+ • Wrapper after solving the mismatches:
  <html><body><b> Book Name </b> #PCDATA <b> Reviews </b> <OL> ( <LI><b> Reviewer Name </b> #PCDATA <b> Rating </b> #PCDATA <b> Text </b> #PCDATA </LI> )+ </OL></body></html>
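A much-simplified sketch of one matching pass, covering only the string-mismatch case; discovering optionals and ( … )+ iterators from tag mismatches is the hard part of the real algorithm and is omitted here.

    def generalize(wrapper, page):
        # Where two token sequences disagree on a text token, generalize
        # the wrapper position to a #PCDATA field (string mismatch).
        out = []
        for w, p in zip(wrapper, page):
            if w == p or w == "#PCDATA":
                out.append(w)
            elif not (w.startswith("<") or p.startswith("<")):
                out.append("#PCDATA")
            else:
                raise NotImplementedError("tag mismatch: needs ( .. )+ discovery")
        return out

    w = "<b> Book Name </b> Databases <b> Reviews </b>".split()
    p = "<b> Book Name </b> Datamining <b> Reviews </b>".split()
    print(" ".join(generalize(w, p)))
    # <b> Book Name </b> #PCDATA <b> Reviews </b>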

  44. DeLa • Similar to IEPAD • Works on a single input page • Handles nested data structures • Example • <P><A>T</A><A>T</A>T</P><P><A>T</A>T</P> • <P><A>T</A>T</P><P><A>T</A>T</P> • Discovered pattern: (<P>(<A>T</A>)*T</P>)*
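Since the discovered pattern is itself an ordinary regular expression over the encoded page, it can be verified directly:

    import re

    pattern = re.compile(r"(?:<P>(?:<A>T</A>)*T</P>)*")
    for page in ("<P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>",
                 "<P><A>T</A>T</P><P><A>T</A>T</P>"):
        print(bool(pattern.fullmatch(page)))   # True, True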

  45. EXALG • Input: multiple pages generated from the same template • Techniques: • Differentiating token roles • Equivalence classes (ECs) form the template • An EC groups tokens with the same occurrence vector
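The occurrence-vector idea can be sketched directly: count every token per page and group tokens whose vectors agree. This toy omits EXALG's differentiation of token roles and its handling of nesting and noise.

    from collections import Counter, defaultdict

    def equivalence_classes(pages):
        # Group tokens by their tuple of per-page occurrence counts.
        counters = [Counter(page) for page in pages]
        classes = defaultdict(list)
        for token in sorted({t for page in pages for t in page}):
            classes[tuple(c[token] for c in counters)].append(token)
        return classes

    pages = ["<b> Book Name </b> Databases <b> Reviews </b>".split(),
             "<b> Book Name </b> Datamining <b> Reviews </b>".split()]
    for vector, tokens in equivalence_classes(pages).items():
        print(vector, tokens)
    # (2, 2) ['</b>', '<b>'] and (1, 1) ['Book', 'Name', 'Reviews'] are
    # template ECs; singleton vectors mark 'Databases'/'Datamining' as data.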

  46. DEPTA • Identify data regions • Allows mismatches between data records • Identify data records • Data records may not be contiguous • Identify data items • By partial tree alignment

  47. Comparison • How do we differentiate template tokens from data tokens? • DeLa and DEPTA assume HTML tags are template tokens while other tokens are data • IEPAD and OLERA leave the problem to the user • How is information from multiple pages exploited? • DeLa and DEPTA conduct the mining on a single page • RoadRunner and EXALG analyze multiple pages

  48. Comparison (Cont.) • Technique improvements • From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher) • From full alignment (IEPAD) to partial alignment (DEPTA)

  49. Task domain comparison • Page type • structured, semi-structured, or free-text Web pages • Non-HTML support • Extraction level • Field-level, record-level, page-level
