
Information Extraction on the Web

Information Extraction on the Web. Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw. Outline. What is information extraction? Document types Applications Wrapper induction Automatic Wrapper generator Conclusions.


Presentation Transcript


  1. Information Extraction on the Web • Chia-Hui Chang • Department of Computer Science & Information Engineering, National Central University • chia@csie.ncu.edu.tw

  2. Outline • What is information extraction? • Document types • Applications • Wrapper induction • Automatic Wrapper generator • Conclusions

  3. What is information extraction? • An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information (hopefully irrelevant information) by applying rules that are acquired manually and/or automatically. • Example: a parser • Takes as input a sequence of lexical items and perhaps small-scale structures (phrases), and outputs a set of parse tree fragments, possibly complete

  4. Modules • Text Zoner: turns a text into a set of text segments • Preprocessor: turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes • Filter: turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones
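
The cascade of modules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the blank-line segmentation, the punctuation-based sentence splitter, and the keyword filter are all hypothetical choices, not the design of any system named in these slides.

```python
import re


def text_zoner(text):
    """Split a text into segments. Assumption: segments are separated by blank lines."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]


def preprocessor(segment):
    """Turn a segment into sentences; each sentence is a list of lexical items
    (a word together with a few illustrative lexical attributes)."""
    raw_sentences = re.split(r"(?<=[.!?])\s+", segment)
    return [
        [
            {"word": w, "is_capitalized": w[:1].isupper(), "is_number": w.isdigit()}
            for w in re.findall(r"\w+", s)
        ]
        for s in raw_sentences
        if s
    ]


def sentence_filter(sentences, keywords):
    """Keep only the sentences that contain at least one keyword."""
    return [s for s in sentences if any(item["word"].lower() in keywords for item in s)]


# Hypothetical input text and keywords, chosen only to exercise the three modules.
text = "Weather is nice today. Acme hired Bob.\n\nStocks fell. Bob joined Acme in 1999."
sentences = [s for seg in text_zoner(text) for s in preprocessor(seg)]
relevant = sentence_filter(sentences, {"hired", "joined"})
```

Each stage adds structure (segments, sentences, lexical items) and the filter discards material, which mirrors the "add structure, lose information" description on slide 3.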

  5. Document types • Plain text (sentence by sentence, plain narrative) • Uses lexical and semantic analysis. • AutoSlog (Riloff 93), LIEP (Huffman 95), CRYSTAL (Soderland 95), HASTEN (Krupka 95). • Web pages (semi-structured documents) • Exploit HTML syntax features: tags. • Heuristics obtained from observation: layout.

  6. Applications • Meta Search Engines • Information Agents • Oriented toward a specific purpose, for example: • News agents (news spiders) • Gathering news • Comparison shopping • Job hunting • ShopBot (Doorenbos 97), Software LEGO (Hsu 99).

  7. Information Integration Systems [Figure: Human & computer users reach user services (query, monitor, update) through an information integration service. Mediators sit above wrappers (SQL, ORB), which sit above heterogeneous data sources: text, images/video, spreadsheets; hierarchical & network databases; object & knowledge bases; relational databases. Layer labels, from abstracted information down to unprocessed, unintegrated details: agent/module coordination, mediation, semantic integration, translation and wrapping.]

  8. What is a wrapper? • Wrapper • A program that extracts the desired information from Web pages. • Semi-structured document → wrapper → structured information

  9. Web Wrappers • Web wrappers wrap... • “Query-able” or “search-able” Web sites • Web pages with large itemized lists • The primary issue is: • How to build the extractor quickly?

  10. Free Text Extraction vs. Semi-structured Text Extraction • Example: extract the attributes job title, employer, and phone number from a job item list • Free text extraction can depend on natural-language (NL) knowledge: “The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details.” • Semi-structured text extraction depends on appearance and regularity: “Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555”
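
To make the contrast concrete, here is a minimal sketch of extraction that relies purely on appearance and regularity. The pattern below is an assumption made for illustration (title before the first comma, employer up to the period, phone number after the literal word "Call"); it is not a rule taken from the slides.

```python
import re

# Hypothetical layout rule for one job item: "<title>, <employer>. Call <phone>"
ITEM = re.compile(
    r"^(?P<title>[^,]+),\s*"                      # job title: text up to the first comma
    r"(?P<employer>.+?)\.\s*"                     # employer: text up to the period
    r"Call\s+(?P<phone>\(\d{3}\)\d{3}-\d{4})$"    # phone number after the word "Call"
)


def extract_job(item):
    """Return a dict of attributes, or None if the item does not follow the layout."""
    m = ITEM.match(item.strip())
    return m.groupdict() if m else None


semi_structured = ("Faculty position, department of computer science, "
                   "Cranberry Lemon University. Call (555)333-5555")
free_text = ("The department of computer science at Cranberry Lemon University "
             "has a faculty position opening. Please call (555)333-5555 for more details.")

job = extract_job(semi_structured)
no_match = extract_job(free_text)
```

The same pattern returns `None` on the free-text version, which carries the same facts but lacks the regular layout; handling it would need the kind of linguistic knowledge the slide mentions.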

  11. Wrapper Representations • Delimiter-based finite state automata [Figure: automaton with states 1 to 4 and transitions skip, extract, skip, extract, driven by the delimiters <B>, </B>, <I>, </I>] • Example page: <HTML><TITLE>Some Country Codes</TITLE><BODY> <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> <B>Belize</B><I>501</I><BR> <B>Spain</B><I>34</I><BR> </BODY></HTML>
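
A delimiter-based wrapper for the country-code page can be sketched as a loop that alternates between skipping to a left delimiter and extracting up to the matching right delimiter. The function name `extract_tuples` and the way delimiters are passed in are assumptions made for this sketch.

```python
def extract_tuples(page, delimiters):
    """Scan the page left to right. For each attribute, skip to its left
    delimiter and extract the text up to its right delimiter."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:          # no further left delimiter: extraction is done
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end == -1:            # unterminated attribute: stop
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Congo</B><I>242</I><BR>"
    "<B>Egypt</B><I>20</I><BR>"
    "<B>Belize</B><I>501</I><BR>"
    "<B>Spain</B><I>34</I><BR>"
    "</BODY></HTML>"
)

# One (left, right) delimiter pair per attribute: country name, then code.
codes = extract_tuples(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
```

The four delimiters correspond to the four transitions of the automaton: skip to `<B>`, extract to `</B>`, skip to `<I>`, extract to `</I>`.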

  12. Related Work • ShopBot • Doorenbos, Etzioni, Weld, AA-97 • Ariadne • Ashish, Knoblock, CoopIS-97 • WIEN • Kushmerick, Weld, IJCAI-97

  13. Related Work (Cont.) • SoftMealy wrapper representation • Hsu, IJCAI-99 • STALKER • Muslea, Minton, Knoblock, AA-99 • A hierarchical FST • IEPAD • Chang, WWW01

  14. WIEN • HLRT (Head-Left-Right-Tail) • Labeling: by PageOracle and LabelOracle. • PAC analysis • Extracts successfully from 48% of Web pages. • Weaknesses: • Missing attributes, attributes not in order, tabular data, etc.
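
The idea behind Head-Left-Right-Tail can be sketched as follows: a head delimiter marks where the data region starts, a tail delimiter marks where it ends, and left/right delimiters pick out each attribute in between. This is a simplified sketch, not WIEN's actual procedure; the function names and the example page (a hypothetical variant of the country-code page with a bold heading and footer) are assumptions.

```python
def lr_rows(text, delimiters):
    """Plain left-right extraction: cut out the text between each (left, right) pair."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = text.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = text.find(right, start)
            if end == -1:
                return rows
            row.append(text[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


def hlrt_rows(page, head, tail, delimiters):
    """HLRT-style extraction: skip past the head, stop at the tail, then apply
    the left-right delimiters only inside that region."""
    head_at = page.find(head)
    if head_at == -1:
        return []
    body_start = head_at + len(head)
    tail_at = page.find(tail, body_start)
    body = page[body_start:] if tail_at == -1 else page[body_start:tail_at]
    return lr_rows(body, delimiters)


# Hypothetical page: the heading and footer also use <B>, which confuses plain LR.
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
DELIMS = [("<B>", "</B>"), ("<I>", "</I>")]

with_head_tail = hlrt_rows(PAGE, "<P>", "<HR>", DELIMS)
without_head_tail = lr_rows(PAGE, DELIMS)
```

Without the head and tail, the bold heading is mistaken for a country name. The sketch still assumes every tuple has all attributes in a fixed order, which is the weakness listed on the slide.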

  15. SoftMealy • Chun-Nan Hsu, 1998, Arizona State University

  16. SoftMealy • Finite-State Transducers for Semi-Structured Text Mining • Labeling: examples are labeled manually through an interface. • FST (Finite-State Transducer) • Single-pass • Multi-pass

  17. SoftMealy wrapper representation • Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path • Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes
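
The phrase "each distinct attribute permutation can be encoded as a successful path" can be illustrated with a small transition table. The attribute names U, N, A, M follow the later slides; the specific transitions below are an assumption made for this sketch, and the dummy skip states and contextual rules of the real representation are left out.

```python
# Hypothetical transition table: "b" is the start state, "e" the end state.
TRANSITIONS = {
    "b": ["U", "N"],   # a tuple may start with U, or directly with N
    "U": ["N"],
    "N": ["A", "M"],   # A may be missing
    "A": ["M"],
    "M": ["e"],
}


def successful_paths(state="b", path=()):
    """Enumerate every attribute sequence that leads from the start state to "e"."""
    if state == "e":
        yield path
        return
    for nxt in TRANSITIONS[state]:
        yield from successful_paths(nxt, path + ((nxt,) if nxt != "e" else ()))


def accepts(attributes):
    """Check whether a sequence of attributes corresponds to a successful path."""
    state = "b"
    for attr in attributes:
        if attr not in TRANSITIONS.get(state, []):
            return False
        state = attr
    return "e" in TRANSITIONS.get(state, [])


paths = list(successful_paths())
```

Adding a new permutation means adding an edge to the table rather than writing a new wrapper, which is the flexibility a fixed delimiter sequence lacks.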

  18. Example

  19. Output: four cases

  20. Finite State Transducer • Additionally handles two more cases: (N, M) and (N, A, M) [Figure: FST with states b, U, -U, N, -N, A, -A, M, e and transitions labeled skip and extract]

  21. Find the starting position (single pass) • Newly added definitions

  22. Taxonomy Tree

  23. STALKER • Muslea, Minton, Knoblock, AA-99 • A hierarchical FST

  24. STALKER • “STALKER: Learning Extraction Rules for Semi-structured, Web-based Information Sources”. AAAI-98, Muslea. • The Embedded Catalog (EC) description is a tree-like structure.

  25. EC Tree of a page

  26. Multi-Pass or Hierarchical Wrapper • Multi-pass: Pass 1: extract U • Pass 2: extract N • Pass 3: extract A • Pass 4: extract M • Hierarchical: first extract the Body, then extract the Tuples
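
A hierarchical, multi-pass extraction can be sketched as: isolate the body, split it into tuples, then run one independent pass per attribute inside each tuple. The page, the regular expressions, and the mapping of U, N, A, M to HTML tags below are all hypothetical; they only illustrate the control flow.

```python
import re

# Hypothetical faculty list; the second item has its attributes in a different
# order and has no A attribute.
PAGE = (
    "<html><h1>Faculty</h1><ul>"
    "<li><a href='u1'>Ann Lee</a> <i>Professor</i> <b>Chair</b></li>"
    "<li><b>Dean</b> <a href='u2'>Bo Chen</a></li>"
    "</ul><p>footer</p></html>"
)

# One extraction rule per attribute; each is applied in its own pass.
ATTRIBUTE_RULES = {
    "U": re.compile(r"<a href='([^']*)'>"),
    "N": re.compile(r"<a [^>]*>([^<]*)</a>"),
    "A": re.compile(r"<i>([^<]*)</i>"),
    "M": re.compile(r"<b>([^<]*)</b>"),
}


def extract(page):
    body_match = re.search(r"<ul>(.*)</ul>", page, re.S)      # first: the body
    if body_match is None:
        return []
    tuples = re.findall(r"<li>(.*?)</li>", body_match.group(1), re.S)  # then: the tuples
    records = []
    for item in tuples:
        record = {}
        for name, rule in ATTRIBUTE_RULES.items():             # one pass per attribute
            m = rule.search(item)
            record[name] = m.group(1) if m else None
        records.append(record)
    return records


records = extract(PAGE)
```

Because each attribute is located by its own pass over the tuple, the order in which attributes appear does not matter, at the cost of scanning each tuple several times.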

  27. Rule Generation • Task: extract the credit info. • 1st iteration: terminals: {; reservation _Symbol_ _Word_} • Candidates: {; <i> _Symbol_ _HtmlTag_} • Perfect disjuncts: {<i> _HtmlTag_} • Positive examples: D3, D4 • 2nd iteration: uncovered: {D1, D2} • Candidates: {; _Symbol_}
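
The two iterations above follow a sequential-covering pattern: find a disjunct that is correct on some examples and wrong on none, remove the examples it covers, and repeat on the rest. The sketch below is a heavily simplified stand-in, not STALKER's actual learner: candidates are limited to the single token just before the target, wildcards such as _Symbol_ are omitted, and the token lists D1 to D4 are hypothetical.

```python
def skip_to(tokens, landmark):
    """Index just past the first occurrence of landmark, or None if it is absent."""
    return tokens.index(landmark) + 1 if landmark in tokens else None


def learn_disjunctive_rule(examples):
    """Sequential covering over (tokens, target_index) examples."""
    uncovered, rule = list(examples), []
    while uncovered:
        candidates = {toks[target - 1] for toks, target in uncovered if target > 0}
        best, best_covered = None, []
        for cand in sorted(candidates):
            results = [(skip_to(toks, cand), target) for toks, target in uncovered]
            if any(r is not None and r != t for r, t in results):
                continue  # matches somewhere wrong: not a perfect disjunct
            covered = [ex for ex, (r, t) in zip(uncovered, results) if r == t]
            if len(covered) > len(best_covered):
                best, best_covered = cand, covered
        if best is None:
            break  # no perfect disjunct found for the remaining examples
        rule.append(best)
        uncovered = [ex for ex in uncovered if ex not in best_covered]
    return rule


def apply_rule(tokens, rule):
    """Try the disjuncts in order and return the first position found."""
    for landmark in rule:
        pos = skip_to(tokens, landmark)
        if pos is not None:
            return pos
    return None


# Hypothetical training examples: (token list, index where the target starts).
D1 = (["Card", ";", "Visa"], 2)
D2 = (["Card", ";", "Amex"], 2)
D3 = (["Joe", ";", "Card", "<i>", "MC", "</i>"], 4)
D4 = (["Ann", ";", "Card", "<i>", "Visa", "</i>"], 4)

rule = learn_disjunctive_rule([D1, D2, D3, D4])
```

On this data the first pass rejects `;` because it also occurs too early in D3 and D4, keeps `<i>` as a perfect disjunct for D3 and D4, and the second pass covers D1 and D2 with `;`.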

  28. Possible Rules

  29. Features • Processing is performed in a hierarchical manner. • Attributes appearing out of order are not a problem. • Disjunctive rules can handle missing attributes.

  30. Comparison • Both: • Can handle irregular missing attributes. • Require training for previously unseen attributes. • Single-pass: • Only a limited set of attribute permutations is allowed • Good for tabular pages • Faster • Multi-pass: • Unaffected by attribute permutations • Good for tagged-list pages • Slower

  31. Comparison • Quote Server • STALKER: 10 example tuples, 79%, 500 test • WIEN: the collection is beyond the learner's capability • SoftMealy: multi-pass 85%, single-pass 97% • Internet Address Finder • STALKER: 80% ~ 100%, 500 test • WIEN: the collection is beyond the learner's capability • SoftMealy: multi-pass 68%, single-pass 41%

  32. Comparison • Okra (tabular pages) • STALKER: 97%, 1 example tuple • WIEN: 100%, 13 example tuples, 30 test • SoftMealy: single-pass 100%, 1 example tuple, 30 test • Big-book (tagged-list pages) • STALKER: 97%, 8 example tuples • WIEN: perfect, 18 example tuples, 30 test • SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test
