580 likes | 595 Vues
This survey explores web information extraction systems, including manual, supervised, semi-supervised, and unsupervised systems, with a focus on automation degree, techniques, and output targets. It covers technologies, tools, and related work in the field as of 2005.
E N D
A Survey of WEB Information Extraction Systems Chia-Hui Chang National Central University Sep. 22, 2005
Introduction • Abundant information on the Web • Static Web pages • Searchable databases: Deep Web • Information Integration • Information for life • e.g. shopping agents, travel agents • Data for research purpose • e.g. bioinformatics, auction economy
Introduction (Cont.) • Information Extraction (IE) • is to identify relevant information from documents, pulling information from a variety of sources and aggregates it into a homogeneous form • An IE task is defined by its input and output
Web Data Extraction Data Record Data Record
IE Systems • Wrappers • Programs that perform the task of IE are referred to as extractors or wrappers. • Wrapper Induction • IE systems are software tools that are designed to generate wrappers.
Various IE Survey • Muslea • Hsu and Dung • Chang • Kushmerick • Laender • Sarawagi • Kuhlins and Tredwell
Related Work: Time • MUC Approaches • AutoSolg [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995] • Post-MUC Approaches • WHISK [Soderland, 1999], RAPIER [califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]
Related Work: Automation Degree • Hsu and Dung [1998] • hand-crafted wrappers using general programming languages • specially designed programming languages or tools • heuristic-based wrappers, and • WI approaches
Related Work: Automation Degree • Chang and Kuo [2003] • systems that need programmers, • systems that need annotation examples, • annotation-free systems and • semi-supervised systems
Related Work: Input and Extraction Rules • Muslea [1999] • IE from free text using extraction patterns that are mainly based on syntactic/semantic constraints. • The second class is Wrapper induction systems which rely on the use of delimiter-based rules. • The third class also processes IE from online documents; however the patterns of these tools are based on both delimiters and syntactic/semantic constraints.
Related Work: Extraction Rules • Kushmerick [2003] • Finite-state tools (regular expressions) • Relational learning tools (logic rules)
Related Work: Techniques • Laender [2002] • languages for wrapper development • HTML-aware tools • NLP-based tools • Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER), • Modeling-based tools • Ontology-based tools • New Criteria: • degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness.
Related Work: Output Targets • Sarawagi [VLDB 2002] • Record-level • Page-level • Site-level
Related Work: Usability • Kuhlins and Tredwell [2002] • Commercial • Noncommercial
Three Dimensions • Task Domain • Input (Unstructured, semi-structured) • Output Targets (record-level, page-level, site-level) • Automation Degree • Programmer-involved, learning-based or annotation-free approaches • Techniques • Regular expression rules vs Prolog-like logic rules • Deterministic finite-state transducer vs probabilistic hidden Markov models
Task Domain: Output • Missing Attributes • Multi-valued Attributes • Multiple Permutations • Nested Data Objects • Various Templates for an attribute • Common Templates for various attributes • Untokenized Attributes
Classification by Automation Degree • Manually • TSIMMIS, Minerva, WebOQL, W4F, XWrap • Supervised • WIEN, Stalker, Softmealy • Semi-supervised • IEPAD, OLERA • Unsupervised • DeLa, RoadRunner, EXALG
Automation Degree • Page-fetching Support • Annotation Requirement • Output Support • API Support
Technologies • Scan passes • Extraction rule types • Learning algorithms • Tokenization schemes • Feature used
A Survey of Contemporary IE Systems • Manually-constructed IE tools • Programmer-aided • Supervised IE systems • Labeled based • Semi-supervised IE systems • Unsupervised IE systems • Annotation-free
Manually-constructed IE Systems • TSIMMIS [Hammer, et al, 1997] • Minerva [Crescenzi, 1998] • WebOQL [Arocena and Mendelzon, 1998] • W4F [Saiiuguet and Azavant, 2001] • XWrap [Liu, et al. 2000]
TSIMMIS • Each command is of the form: [variables, source, pattern] where • source specifies the input text to be considered • pattern specifies how to find the text of interest within the source, and • variables are a list of variables that hold the extracted results. • Note: • # means “save in the variable” • * means “discard”
Minerva • The grammar used by Minerva is defined in an EBNF style
Tag: Body, Source: <Body>…</Body> Text: Book Name … Tag: OL, Source: <ol>…</ol> Text: Reviewer Name … Tag: <b> Source:<b>Book Name</b> Text: Book Name Tag: NOTAG Source: Databases Text: Database Tag: <b> Source:<b>Reviews</b> Text: Reviews Tag: LI, Source: <li>…</li> Text: Reviewer Name … Tag: <b> Source:<b>Reviewer Name</b> Text: Reviewer Name Tag: NOTAG Source: … Text: … Tag: NOTAG Source: John Text: John Tag: <b> Source:<b>Rating</b> Text: Rating Tag: <b> Source:<b>Text</b> Text: Text Tag: NOTAG Source: 7 Text: 7 WebOQL Select [ Z!’.Text] From x in browse (“pe2.html”)’, y in x’, Z in y’ Where x.Tag = “ol” and Z.Text=”Reviewer Name”
W4F • Wysiwyg support • Java toolkit • Extraction rule • HTML parse tree (DOM object) • e.g. html.body.ol[0].li[*].pcdata[0].txt • Regular expression to address finer pieces of information
Supervised IE systems • SRV [Freitag, 1998] • Rapier [Califf and Mooney, 1998] • WIEN [Kushmerick, 1997] • WHISK [Soderland, 1999] • NoDoSE [Adelberg, 1998] • Softmealy [Hsu and Dung, 1998] • Stalker [Muslea, 1999] • DEByE [Laender, 2002b ]
SRV • Single-slot information extraction • Top-down (general to specific) relational learning algorithm • Positive examples • Negative examples • Learning algorithm work like FOIL • Token-oriented features • Logic rule Rating extraction rule:- Length(=1), Every(numeric true), Every(in_list true).
Rapier • Field-level (Single-slot) data extraction • Bottom-up (specific to general) • The extraction rules consist of 3 parts: • Pre-filler • Slot-filler • Post-filler Book Title extraction rule:- Pre-filler slot-filler post-filler word: Book Length=2 word=<b> word: Name Tag: [nn, nns] word: </b>
WIEN • LR Wrapper • (‘Reviewer name </b>’, ‘<b>’, ‘Rating </b>’, ‘<b>’, ‘Text </b>’, ‘</li>’) • HLRT Wrapper (Head LR Tail) • OCLR Wrapper (Open-Close LR) • HOCLRT Wrapper • N-LR Wrapper (Nested LR) • N-HLRT Wrapper (Nested HLRT)
WHISK • Top-down (general to specific) learning • Example • To generate 3-slot book reviews, it start with empty rule “*(*)*(*)*(*)*” • Each parenthesis indicates a phrase to be extracted • The phrase in the first set of parenthesis is bound to variable $1, and 2nd to $2, etc. • The extraction logic is similar to the LR wrapper for WIEN. Pattern:: * ‘Reviewer Name </b>’ (Person) ‘<b>’ * (Digit) ‘<b>Text</b>’(*) ‘</li>’ Output:: BookReview {Name $1} {Rating $2} {Comment $3}
NoDoSE • Assume the order of attributes within a record to be fixed • The user interacts with the system to decompose the input. • For the running example • a book title (an attribute of type string) and • a list of Reviewer • RName (string), Rate (integer), and Text (string).
?/next_token ?/next_token ?/next_token ?/ε ?/ε ?/ε s<,T>/ “T=”+ next_tokn s<b,N>/ “N=”+ next_tokn s<,R>/ “R=”+ next_tokn s<N, > / ε s<T,e> / ε e N R R T b N s<R, e>/ ε Softmealy • Finite transducer • Contextual rules s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>) s<,R>R ::= Spc(-) Num(-) s<R,>L ::= Num(-) s<R,>R ::= NL(-) HTML(<b>)
Stalker • Embedded Category Tree • Multipass Softmealy
DEByE • Bottom-up extraction strategy • Comparison • DEByE: the user marks only atomic (attribute) values to assemble nested tables • NoDoSE: the user decomposes the whole document in a top-down fashion
Semi-supervised Approaches • IEPAD [Chang and Lui, 2001] • OLERA [Chang and Kuo, 2003] • Thresher [Hogue, 2005]
IEPAD • Encoding of the input page • Multiple-record pages • Pattern Mining by PAT Tree • Multiple string alignment • For the running example • <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
OLERA • Online extraction rule analysis • Enclosing • Drill-down / Roll-up • Attribute Assignment
Thresher • Work similar to OLERA • Apply tree alignment instead of string alignment
Unsupervised Approaches • Roadrunner [Crescenzi, 2001] • DeLa [Wang, 2002; 2003] • EXALG [Arasu and Garcia-Molina, 2003] • DEPTA [Zhai, et al., 2005]
Terminal search match Wrapper after solving mismatch <html><body><b> Book Name </b> #PCDATA<b> Reviews </b> <OL> ( <LI><b> Reviewer Name </b> #PCDATA <b> Rating </b> #PCDATA <b> Text </b> #PCDATA </LI> )+ </OL></body></html> Roadrunner • Input: multiple pages with the same template • Match two input pages at one time Wrapper (initially) 01: <html><body> 02: <b> 03: Book Name 04: </b> 05: Databases 06: <b> 07: Reviews 08: </b> 09: <OL> 10: <LI> 11: <b> Reviewer Name </b> 12: John 13: <b> Rating </b> 14: 7 15: <b>Text </b> 16: … 17: </LI> 10: </OL> 11:</body></html> Sample page 01: <html><body> 02: <b> 03: Book Name 04: </b> 05: Data mining 06: <b> 07: Reviews 08: </b> 09: <OL> 10: <LI> 11: <b> Reviewer Name </b> 12: Jeff 13: <b> Rating </b> 14: 2 15: <b>Text </b> 16: … 17: </LI> 18: <LI> 19: <b> Reviewer Name </b> 20: Jane 21: <b> Rating </b> 22: 6 23: <b>Text </b> 24: … 25: </LI> 26: </OL> 27:</body></html> parsing String mismatch String mismatch String mismatch String mismatch tag mismatch
DeLa • Similar to IEPAD • Works for one input page • Handle nested data structure • Example • <P><A>T</A><A>T</A> T</P><P><A>T</A>T</P> • <P><A>T</A>T</P><P><A>T</A>T</P> • (<P>(<A>T</A>)*T<P>)*
EXALG • Input: multiple pages with the same template • Techniques: • Differentiating token roles • Equivalence class (EC) form a template • Tokens with the same occurrence vector
DEPTA • Identify data region • Allow mismatch between data records • Identify data record • Data records may not be continuous • Identify data items • By partial tree alignment
Comparison • How do we differentiate template token from data token? • DeLa and DEPTA assume HTML tags are template while others are data tokens • IEPAD and OLERA leaves the problems to users • How to apply the information from multiple pages? • DeLa and DEPTA conduct the mining from single page • Roadrunner and EXALG do the analysis from multiple pages
Comparison (Cont.) • Techniques improvement • From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher) • From full alignment (IEPAD) to partial alignment (DEPTA)
Task domain comparison • Page type • structured, semi-structured or free-text Web pages • Non-HTML support • Extraction level • Field level, record-level, page-level