Information Extraction on the Web

Information Extraction on the Web Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw

Outline • What is information extraction? • Document types • Applications • Wrapper induction • Automatic Wrapper generator • Conclusions

What’s information extraction? • An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically. • Example: a parser • Input: a sequence of lexical items and perhaps small-scale structures (phrases) • Output: a set of parse tree fragments, possibly complete

Modules • Text Zoner: turns a text into a set of text segments • Preprocessor: turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes • Filter: turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones
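The cascade of modules above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: segments are separated by blank lines, sentences end at `.`, `!` or `?`, and relevance is a simple keyword test. The function names (`text_zoner`, `preprocessor`, `sentence_filter`, `run_pipeline`) and the single lexical attribute (`is_capitalized`) are hypothetical and do not come from any system named in these slides.

```python
import re

def text_zoner(text):
    """Turn a text into a list of text segments (assumed: blank-line separated)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]

def preprocessor(segment):
    """Turn a segment into sentences; each sentence is a list of lexical items."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", segment) if s]
    return [
        [{"word": w, "is_capitalized": w[0].isupper()} for w in re.findall(r"\w+", s)]
        for s in sentences
    ]

def sentence_filter(sentences, keywords):
    """Keep only sentences containing a keyword (assumed relevance test)."""
    return [s for s in sentences if any(item["word"].lower() in keywords for item in s)]

def run_pipeline(text, keywords):
    """Apply zoner, preprocessor and filter in sequence."""
    kept = []
    for segment in text_zoner(text):
        kept.extend(sentence_filter(preprocessor(segment), keywords))
    return kept

text = "The weather is nice. Call the department today.\n\nNothing here."
relevant = run_pipeline(text, {"call"})
```

Each stage adds structure (segments, sentences, lexical items) and the filter discards material, which mirrors the "add structure, lose information" description of an extraction cascade.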

Document types • Plain text: (sentence after sentence, plain narrative) • Uses lexical and semantic analysis. • AutoSlog (Riloff 93), LIEP (Huffman 95), CRYSTAL (Soderland 95), HASTEN (Krupka 95). • Web page: (semi-structured documents) • Exploits HTML syntax features: tags. • Heuristics derived from observation: layout.

Applications • Meta Search Engines • Information Agents • Oriented toward a specific purpose, e.g.: • News agents (news spiders) • Collecting news • Shopping price comparison • Job hunting • ShopBot (Doorenbos 97), Software LEGO (Hsu 99).

Information Integration Systems • [Figure: layered architecture diagram] • Human & Computer Users • User Services: Query, Monitor, Update • Information Integration Service: Mediators (Agent/Module Coordination, Mediation, Semantic Integration; Abstracted Information) • Wrappers (Translation and Wrapping; SQL, ORB) • Heterogeneous Data Sources (Unprocessed, Unintegrated Details): Text, Images/Video, Spreadsheets; Hierarchical & Network Databases; Object & Knowledge Bases; Relational Databases

What is a wrapper? • Wrapper • A program that extracts desired information from Web pages. • Semi-structured document → (wrapper) → structured information

Web Wrappers • Web wrappers wrap... • “Query-able” or “search-able” Web sites • Web pages with large itemized lists • The primary issues are: • How to build the extractor quickly?

Free Text Extraction vs. Semi-structured Text Extraction • Example: extract the attributes job title, employer and phone number from a job item list • Free text extraction can depend on NL knowledge: “The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details.” • Semi-structured text extraction depends on appearance and regularity: “Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555”
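For the semi-structured job item above, appearance alone is enough to pull out the three attributes. The sketch below assumes the first comma-separated field is the job title, the remaining fields before the first period name the employer, and the phone number has the form `(ddd)ddd-dddd`. The function name `extract_job_item` and these layout rules are illustrative assumptions, not rules given in the slides.

```python
import re

# Assumed phone layout: (ddd)ddd-dddd, as in the slide's example.
PHONE = re.compile(r"\(\d{3}\)\d{3}-\d{4}")

def extract_job_item(item):
    """Extract job title, employer and phone using only surface regularities."""
    phone_match = PHONE.search(item)
    # Everything before the first period is treated as a comma-separated header.
    header = item.split(".", 1)[0]
    fields = [f.strip() for f in header.split(",")]
    return {
        "job_title": fields[0],
        "employer": ", ".join(fields[1:]),
        "phone": phone_match.group(0) if phone_match else None,
    }

item = ("Faculty position, department of computer science, "
        "Cranberry Lemon University. Call (555)333-5555")
record = extract_job_item(item)
```

The same function would do poorly on the free-text version of the item, where the attributes appear inside a grammatical sentence and no delimiter pattern separates them.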

Wrapper Representations • Delimiter-based finite state automata • [Figure: automaton with states 1 to 4 and transitions labeled skip, extract, skip, extract] • <HTML><TITLE>Some Country Codes</TITLE><BODY> Congo242 Egypt20 Belize501 Spain34 </BODY></HTML>
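A delimiter-based wrapper of this kind can be sketched as a loop that alternates "skip to the left delimiter" and "extract up to the right delimiter". The inline markup of the country-code page is not visible in the slide text, so the sketch assumes country names are wrapped in `<B>…</B>` and codes in `<I>…</I>`; both the markup and the function name `lr_extract` are assumptions made for illustration.

```python
def lr_extract(page, delimiters):
    """Delimiter-based extraction: for each attribute, skip to its left
    delimiter, then extract up to its right delimiter; repeat per tuple."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)       # skip
            if start == -1:
                return tuples
            start += len(left)
            end = page.find(right, start)      # extract
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))

# Assumed markup; the tags inside <BODY> are hypothetical.
page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR>"
        "<B>Spain</B> <I>34</I><BR>"
        "</BODY></HTML>")
rows = lr_extract(page, [("<B>", "</B>"), ("<I>", "</I>")])
```

With two attributes the loop passes through four steps per tuple (skip, extract, skip, extract), matching the four transitions of the automaton.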

Related Work • ShopBot • Doorenbos, Etzioni, Weld, AA-97 • Ariadne • Ashish, Knoblock, CoopIS-97 • WIEN • Kushmerick, Weld, IJCAI-97

Related Work (Cont.) • SoftMealy wrapper representation • Hsu, IJCAI-99 • STALKER • Muslea, Minton, Knoblock, AA-99 • A hierarchical FST • IEPAD • Chang, WWW01

WIEN • HLRT (Head-Left-Right-Tail) • Labeling: by PageOracle, LabelOracle. • PAC analysis • Extracts 48% of Web pages successfully. • Weaknesses: • Missing attributes, attributes not in order, tabular data, etc.
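The HLRT idea can be sketched by adding a head and a tail delimiter around left/right delimiter pairs, so that matches in the page header or footer are ignored. The page markup, the chosen delimiters (`<P>` as head, `<HR>` as tail) and the function name `hlrt_extract` are assumptions for illustration; in WIEN the delimiters are learned from labeled pages rather than written by hand.

```python
def hlrt_extract(page, head, tail, delimiters):
    """HLRT sketch: skip past the head delimiter, stop at the tail delimiter,
    and apply the left/right delimiter pairs only in between."""
    h = page.find(head)
    if h == -1:
        return []
    pos = h + len(head)
    stop = page.find(tail, pos)
    if stop == -1:
        stop = len(page)
    tuples = []
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos, stop)
            if start == -1:
                return tuples
            start += len(left)
            end = page.find(right, start, stop)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))

# Hypothetical page: the bold heading and footer would confuse a plain
# left/right wrapper, because they also use <B>...</B>.
page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Some Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<HR><B>End</B></BODY></HTML>")
rows = hlrt_extract(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```

Because every tuple must supply every attribute in a fixed order, a sketch like this also shows the listed weaknesses: a missing or reordered attribute misaligns the remaining fields.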

SoftMealy • Chun-Nan Hsu, 1998 • Arizona State University

SoftMealy • Finite-State Transducers for Semi-Structured Text Mining • Labeling: examples are labeled manually through an interface. • FST (Finite-State Transducer) • Single-pass • Multi-pass

SoftMealy wrapper representation • Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path • Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes

Example

Output: 4 cases

Finite State Transducer • Additionally handles two more cases: (N, M) and (N, A, M) • [Figure: transducer with states b, U, -U, N, -N, A, -A, M, e and transitions labeled skip / extract]
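A much simplified, single-pass transducer in this spirit can be sketched as follows. Each attribute state has a contextual rule, and the allowed transitions encode which attribute permutations form a successful path, including (N, M) and (N, A, M). The state names b, U, N, A, M follow the figure labels; the regular expressions, the markup they assume, and the function name `transduce` are hypothetical, and the dummy states (-U, -N, -A) are folded into the regex search. SoftMealy learns its contextual rules from labeled examples, which this hand-written sketch does not attempt.

```python
import re

# Hypothetical contextual rules: the context that delimits each attribute.
RULES = {
    "U": re.compile(r'<A HREF="([^"]*)">'),
    "N": re.compile(r"<B>([^<]*)</B>"),
    "A": re.compile(r"<I>([^<]*)</I>"),
    "M": re.compile(r"<EM>([^<]*)</EM>"),
}

# Allowed transitions; every path through this graph is one permutation.
TRANSITIONS = {
    "b": ["U", "N"],
    "U": ["N"],
    "N": ["A", "M"],
    "A": ["M"],
    "M": [],
}

def transduce(item):
    """Scan the item once, moving to whichever allowed state matches earliest."""
    state, pos, output = "b", 0, {}
    while True:
        best = None
        for nxt in TRANSITIONS[state]:
            m = RULES[nxt].search(item, pos)
            if m and (best is None or m.start() < best[1].start()):
                best = (nxt, m)
        if best is None:
            return output          # no further transition: end state reached
        state, m = best
        output[state] = m.group(1)
        pos = m.end()

full = transduce('<A HREF="http://example.edu/x"><B>Ann Lee</B></A>, '
                 '<I>Professor</I>, <EM>Chair</EM>')
short = transduce("<B>Bob Wu</B>, <EM>Dean</EM>")
```

Here `short` follows the path b, N, M, which a fixed-order delimiter wrapper could not express without also expecting U and A.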

Find the starting position: Single Pass • Newly added definitions

Taxonomy Tree

STALKER • Muslea, Minton, Knoblock, AA-99 • A Hierarchical FST

STALKER • “STALKER: Learning Extraction Rules for Semi-structured, Web-based Information Sources”. AAAI-98, Muslea. • The Embedded Catalog Description is a tree-like structure.

EC Tree of a page

Multi-Pass or Hierarchical Wrapper • Multi-pass: • Pass 1: extract U • Pass 2: extract N • Pass 3: extract A • Pass 4: extract M • Hierarchical: • First extract the Body • Then extract the Tuples
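The multi-pass, hierarchical scheme can be sketched as three nested steps: extract the body, split it into tuples, then run one independent pass per attribute over each tuple. The markup (`<UL>` for the body, `<LI>` for tuples), the per-attribute regular expressions and all function names are assumptions chosen for illustration only.

```python
import re

# Hypothetical per-attribute rules; each pass looks for one attribute only.
ATTRIBUTE_RULES = {
    "U": re.compile(r'<A HREF="([^"]*)">'),
    "N": re.compile(r"<B>([^<]*)</B>"),
    "A": re.compile(r"<I>([^<]*)</I>"),
    "M": re.compile(r"<EM>([^<]*)</EM>"),
}

def extract_body(page):
    """First level: isolate the list body (assumed to be a <UL> element)."""
    m = re.search(r"<UL>(.*?)</UL>", page, re.S)
    return m.group(1) if m else ""

def extract_tuples(body):
    """Second level: split the body into tuples (assumed <LI> items)."""
    return [t.strip() for t in body.split("<LI>") if t.strip()]

def extract_attributes(tuple_text):
    """Third level: one pass per attribute, each independent of the others."""
    record = {}
    for name, rule in ATTRIBUTE_RULES.items():
        m = rule.search(tuple_text)
        if m:
            record[name] = m.group(1)
    return record

def multi_pass(page):
    return [extract_attributes(t) for t in extract_tuples(extract_body(page))]

page = ("<HTML><BODY><UL>"
        "<LI><EM>Chair</EM> <B>Ann Lee</B>"
        "<LI><B>Bob Wu</B>, <I>Professor</I>"
        "</UL></BODY></HTML>")
records = multi_pass(page)
```

Because every attribute has its own pass, the order of attributes within a tuple and the absence of some attributes do not affect the result, at the cost of scanning each tuple several times.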

Rule Generation • Extract Credit info. • 1st: terminals: {; reservation _Symbol_ _Word_} • Candidates: {; _Symbol_ _HtmlTag_} • Perfect Disj: { _HtmlTag_} • Positive examples: D3, D4 • 2nd: uncovered {D1, D2} • Candidates: {; _Symbol_}
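How a learned disjunctive rule might be applied can be sketched with a small landmark matcher. The wildcards `_HtmlTag_`, `_Symbol_` and `_Word_` come from the slide; the tokenizer, the `skip_to` and `apply_disjunction` functions, the two landmarks and the example strings are assumptions for illustration, and the rule learning step itself is not shown.

```python
import re

# Assumed tokenizer: HTML tags, words, and single punctuation symbols.
TOKEN = re.compile(r"<[^>]+>|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

def matches(token, terminal):
    """Test a token against a wildcard class or a literal terminal."""
    if terminal == "_HtmlTag_":
        return token.startswith("<") and token.endswith(">") and len(token) > 2
    if terminal == "_Word_":
        return re.fullmatch(r"\w+", token) is not None
    if terminal == "_Symbol_":
        return re.fullmatch(r"[^\w\s]", token) is not None
    return token == terminal

def skip_to(tokens, landmark, start=0):
    """Return the index just past the first match of the landmark, or None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(matches(tokens[i + j], landmark[j]) for j in range(n)):
            return i + n
    return None

def apply_disjunction(tokens, landmarks):
    """Try each disjunct in order; the first one that matches wins."""
    for landmark in landmarks:
        pos = skip_to(tokens, landmark)
        if pos is not None:
            return pos
    return None

# Hypothetical disjunctive start rule built from the slide's terminals.
rule = [["_HtmlTag_"], [";", "_Symbol_"]]
tagged = tokenize("Open daily; <i>Visa")
untagged = tokenize("Reservation ;: Visa")
start_tagged = apply_disjunction(tagged, rule)
start_untagged = apply_disjunction(untagged, rule)
```

The first disjunct covers items where an HTML tag precedes the field, and the second covers the items the first leaves uncovered, which is the covering pattern the slide outlines.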

Possible Rules

Features • Extraction is performed in a hierarchical manner. • Attributes appearing out of order are not a problem. • Disjunctive rules can handle missing attributes.

Comparison • Both: • Can handle irregular missing attributes. • Need training for attributes not seen before • Single-pass: • Only a limited set of attribute permutations is allowed • Single-pass is good for tabular pages • Faster • Multi-pass: • Unaffected by attribute permutations • Multi-pass is good for tagged-list pages • Slower

Comparison • Quote Server • STALKER: 10 example tuples, 79%, 500 test • WIEN: the collection is beyond the learner’s capability • SoftMealy: multi-pass 85%, single-pass 97% • Internet Address Finder • STALKER: 80% ~ 100%, 500 test • WIEN: the collection is beyond the learner’s capability • SoftMealy: multi-pass 68%, single-pass 41%

Comparison • Okra (tabular pages) • STALKER: 97%, 1 example tuple • WIEN: 100%, 13 example tuples, 30 test • SoftMealy: single-pass 100%, 1 example tuple, 30 test • Big-book (tagged-list pages) • STALKER: 97%, 8 example tuples • WIEN: perfect, 18 example tuples, 30 test • SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test
