
An Automatic Wrapper Constructor Agent for E-trading

Elektrotehniška in Računalniška Konferenca 2002, Portorož, Slovenia. Aleksander Pivk, Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia. 25 September 2002.



Presentation Transcript


  1. An Automatic Wrapper Constructor Agent for E-trading • Elektrotehniška in Računalniška Konferenca 2002, Portorož, Slovenia • Aleksander Pivk, Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia • 25 September 2002

  2. What is an (intelligent) agent? • An intelligent agent is a computer system capable of flexible, autonomous action in some environment. • Examples: • Environment: internet agent, OS agent, desktop agent, WWW agent, etc. • Task: information agent, shopping agent, interface agent, email agent, notification agent, etc.

  3. Information agents • Task: • access/integrate information from a variety of data sources • Types: • Information Retrieval Agents • search engines • Information Filtering Agents • mail agents, news-delivery agents • Information Extraction Agents • wrappers • Information Integration Agents • meta-search engine, comparison-shopping

  4. Information Extraction • IE is the task of identifying the specific fragments of a single document that constitute its core semantic content. • Examples: a) from a weather report, identify locations, dates, and temperatures (high and low); b) from online stores, get product names, their images, and prices. • Example of an extracted record: NAME: Casablanca Restaurant; STREET: 220 Lincoln Boulevard; CITY: Venice; PHONE: (310) 392-5751

  5. Wrappers • A wrapper is: • a procedure or a rule that explains how to extract information from an information source • tailored to a particular document collection • suited to semi-structured information sources • Why use wrappers? • heterogeneous information sources • different styles of user interface and different formats of output display
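One common way to express a wrapper as a rule is a pair of left and right delimiters per attribute. The sketch below illustrates that idea only; the slides do not say which rule form the presented systems use. The `RULES` table, the helper `extract_between`, and the HTML snippet are hypothetical; only the restaurant name and phone number come from the record on slide 4.

```python
def extract_between(page, left, right):
    """Return every text fragment found between the delimiters left and right."""
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end].strip())
        pos = end + len(right)
    return values


# Hypothetical wrapper: one (left, right) delimiter pair per attribute.
RULES = {"NAME": ("<b>", "</b>"), "PHONE": ("<i>", "</i>")}

# Hypothetical semi-structured source; the markup is invented for illustration.
page = "<li><b>Casablanca Restaurant</b> <i>(310) 392-5751</i></li>"

record = {attr: extract_between(page, left, right)
          for attr, (left, right) in RULES.items()}
print(record)
```

Such a rule set works only for the document collection it was written for, which is why a separate wrapper is needed per source.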

  6. Implemented Systems • EMA – Employment Agent • memory-based approach • hand-coded wrappers • depends upon the profession ontology (domain-knowledge) • ShinA – Customized Comparison Shopping Agent • simple heuristic-based approach • little domain-knowledge used

  7. ShinA – Shopping Assistant

  8. Our focus • Wrapper learning in real time • to realize a customized comparison shopper • Little use of domain knowledge • rather use simple heuristics • exploit the characteristics of semi-structured documents • Flexible and practical • handle both table-type and list-type displays • handle noisy product descriptions (missing attributes) • handle a single product description spanning multiple lines

  9. Learning Query Scheme Templates
      <form site="amazon.com">
        <name>searchform</name>
        <method>post</method>
        <action>www.amazon.com/exec/obidos/search-handle-form</action>
        <input type="text" name="field-keywords" size="15" />
        <input type="image" name="Go" />
        <select name="index">
          <option value="all products" selected />
          <option value="books" />
          <option value="…" />
        </select>
      </form>
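Once a query scheme template has been learned, it can be used to build a search request for a store. The following is a minimal sketch of that step, assuming the template has already been loaded into a plain dictionary. The dictionary layout, the function `build_query`, and the added `http://` scheme are assumptions for illustration; the field names and the action path are taken from the template above.

```python
from urllib.parse import urlencode

# Hypothetical in-memory form of the learned template for amazon.com.
template = {
    "site": "amazon.com",
    "method": "post",
    "action": "www.amazon.com/exec/obidos/search-handle-form",
    "text_field": "field-keywords",
    "select": {"name": "index", "default": "all products"},
}


def build_query(template, keywords):
    """Return (HTTP method, URL, url-encoded form body) for a keyword search."""
    fields = {
        template["text_field"]: keywords,
        template["select"]["name"]: template["select"]["default"],
    }
    # Assumption: the action in the template lacks a scheme, so one is added.
    url = "http://" + template["action"]
    return template["method"].upper(), url, urlencode(fields)


method, url, body = build_query(template, "intelligent agent")
print(method, url, body)
```

The request itself is not sent here; the sketch only shows how the template fields map onto the submitted form data.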

  10. Learning product descriptions • PDU: Product Description Unit • Figure: table-type display of 5 different PDUs • Task • recognize each PDU • recognize attributes within a PDU • learn rules to extract attributes

  11. PDU Pattern Learning: Algorithm • First phase • remove irrelevant parts of HTML source (header, advertisements, footer) • the remaining HTML source is broken into logical lines • Second phase • categorize each logical line • 9 different categories (PRICE, TITLE, IMAGE, URL_LINK, TTAG, LBTAG, etc.) • Third phase • find most frequent pattern(s) for PDU(s) in the sequence of logical line categories
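The third phase can be sketched as a search for the most frequent substring in the string of category codes. The slides do not give the exact selection criteria, so the ones below are assumptions: a candidate must contain a price code (`0`) and a title code (`1`), must occur at least twice without overlapping, and ties on frequency are broken in favour of the longer candidate. The function name and the length bounds are hypothetical.

```python
def most_frequent_pattern(categories, min_len=3, max_len=12, required=("0", "1")):
    """Return the most frequent repeated substring of category codes, or None."""
    best_key, best_pattern = None, None
    seen = set()
    for n in range(min_len, min(max_len, len(categories)) + 1):
        for i in range(len(categories) - n + 1):
            cand = categories[i:i + n]
            if cand in seen:
                continue
            seen.add(cand)
            # Assumption: a PDU pattern must contain a price and a title.
            if not all(code in cand for code in required):
                continue
            count = categories.count(cand)  # non-overlapping occurrences
            if count < 2:
                continue
            key = (count, n)  # most frequent first, then longest
            if best_key is None or key > best_key:
                best_key, best_pattern = key, cand
    return best_pattern


# Four PDUs with the pattern from the Amazon example in these slides,
# surrounded by unrelated lines (invented noise).
sequence = "99" + "244531950" * 4 + "49"
print(most_frequent_pattern(sequence))
```

Requiring a price code mirrors the limitation stated in the conclusion that price information must exist.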

  12. PDU Pattern Learning: Example • A fragment of the HTML source of the search result for the query "intelligent agent" sent to the Amazon bookstore. Each logical line is followed by its category code:
      <img src="http://g-images.amazon.com/images/G/01/v9/130668.jpg" width="80" height="80" vspace="2" alt=""> --2
      </td> --4
      <td> --4
      <p> --5
      <a href="http://www.amazon.com/book.asp?id=010101&book=130668"> --3
      Intelligent Internet Agents: Agent-Based Information Discovery on the Internet --1
      </a> --9
      <br> --5
      $59.95 --0
      Category codes: 0: price; 1: title; 2: image; 3: link; 4: table tag; 5: line tag; 9: other tag.
      Extracted PDU pattern: 244531950
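The categorization in this example can be reproduced with a small classifier over logical lines. This is a sketch under assumptions: the tag groupings in `TABLE_TAGS` and `LINE_TAGS`, the regular expressions, and the fallback of code 9 for unrecognized text are guesses that happen to fit the fragment, not the rules of the presented system.

```python
import re

TABLE_TAGS = {"table", "tr", "td", "th"}   # assumed members of "table tag"
LINE_TAGS = {"p", "br", "hr", "li"}        # assumed members of "line tag"
TAG_RE = re.compile(r"^<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)")
PRICE_RE = re.compile(r"[$€]\s*\d|\d\s*(?:EUR|SIT)\b")


def categorize(line, query_words):
    """Map one logical line to a category code from the slide's legend."""
    text = line.strip()
    tag = TAG_RE.match(text)
    if tag:
        closing, name = tag.group(1), tag.group(2).lower()
        if name == "img":
            return 2                      # image
        if name == "a" and not closing:
            return 3                      # link
        if name in TABLE_TAGS:
            return 4                      # table tag
        if name in LINE_TAGS:
            return 5                      # line tag
        return 9                          # other tag
    if PRICE_RE.search(text):
        return 0                          # price
    if any(w.lower() in text.lower() for w in query_words):
        return 1                          # title
    return 9                              # assumption: unrecognized text


lines = [
    '<img src="http://g-images.amazon.com/images/G/01/v9/130668.jpg" '
    'width="80" height="80" vspace="2" alt="">',
    "</td>",
    "<td>",
    "<p>",
    '<a href="http://www.amazon.com/book.asp?id=010101&book=130668">',
    "Intelligent Internet Agents: Agent-Based Information Discovery on the Internet",
    "</a>",
    "<br>",
    "$59.95",
]
pattern = "".join(str(categorize(l, ["intelligent", "agent"])) for l in lines)
print(pattern)
```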

  13. Simple Heuristics • Recognizing a title • contains at least one query word • is the text line at the title position of the pre-determined pattern • Recognizing a price • contains a currency symbol ($, €) • contains a currency token (EUR, SIT) • contains digit(s) with relevant delimiters (',' and '.') • Recognizing an image • unique image URL within the pattern • Other attributes recognizable with heuristic rules • examples: ISBN numbers, dates, discount rates • Attributes that cannot be recognized • authors, review comments, recommendation status
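The title and price heuristics can be sketched as two small predicates. How the individual price cues are combined is not stated in the slides; the version below assumes that a price needs digits plus either a currency symbol or a currency token. The function names and regular expressions are hypothetical.

```python
import re

CURRENCY_SYMBOLS = "$€"
CURRENCY_TOKENS = ("EUR", "SIT")
# Digits, optionally grouped with ',' or '.' delimiters, e.g. 59.95 or 12.500,00
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def looks_like_price(text):
    """Assumed rule: digits plus a currency symbol or a currency token."""
    has_number = NUMBER_RE.search(text) is not None
    has_symbol = any(symbol in text for symbol in CURRENCY_SYMBOLS)
    has_token = any(re.search(rf"\b{token}\b", text) for token in CURRENCY_TOKENS)
    return has_number and (has_symbol or has_token)


def looks_like_title(text, query):
    """A text line is a title candidate if it contains at least one query word."""
    words = {w.lower() for w in re.findall(r"\w+", text)}
    return any(q.lower() in words for q in query.split())


print(looks_like_price("$59.95"))
print(looks_like_title("Intelligent Internet Agents", "intelligent agent"))
```

Matching currency tokens on word boundaries keeps ordinary words that merely contain the letters of a token from being treated as prices.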

  14. Conclusion • Limitations • query search box must exist • price information must exist • extracts only a few attributes (title, price, image, link, …) • Future work • more use of domain knowledge (ontologies) • extract other non-price attributes • use of XML-based wrappers • applications to other domains
