Web Crawler Agent (WCA)

Web Crawler Agent (WCA), presented by Kirk Martinez, University of Southampton

Introduction
• WCA searches the Web for missing information (fragments)
• WCA structures the information it finds according to an ontology, e.g. the relation “place_of_birth” (Person, Place)
• Techniques used: natural language processing (NLP), information extraction, relation extraction, question answering
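To make the ontology idea concrete, here is a minimal sketch of how a relation such as “place_of_birth” (Person, Place) and the fragments that fill it could be represented. The class names `Relation` and `Fact`, the `source_url` field and the helper `missing_relations` are assumptions made for illustration; the slides do not describe WCA's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """An ontology relation with a typed subject and object (hypothetical schema)."""
    name: str
    subject_type: str
    object_type: str


@dataclass(frozen=True)
class Fact:
    """One extracted fragment that fills a relation, plus the page it came from."""
    relation: Relation
    subject: str
    value: str
    source_url: str


# The relation named on the slide: place_of_birth(Person, Place).
PLACE_OF_BIRTH = Relation("place_of_birth", "Person", "Place")


def missing_relations(subject, relations, known_facts):
    """Return the relations for which no fact about `subject` is known yet.

    These are the "missing fragments" a crawler would go and look for.
    """
    filled = {f.relation for f in known_facts if f.subject == subject}
    return [r for r in relations if r not in filled]
```

Keeping the source URL on every fact is a design choice of this sketch: it allows values from different pages to be compared later.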

Overview

Is it something like “Google”?
• Search for “date_of_birth” (when Rembrandt was born) with Google

Searching information with Google
• The “old” Web search (e.g. Google) is good for retrieving documents but NOT for extracting concise answers (e.g. “15-July-1606”)
• No analysis is done to “understand” the documents (e.g. “Rembrandt” can also be the name of a hotel or a bookstore)
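The difference between returning documents and returning a concise answer can be illustrated with a small pattern-based sketch. It is not WCA's method (the slides name NLP and relation extraction, which go well beyond regular expressions); the cue word "born", the date pattern and the function names are all assumptions chosen for illustration.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Matches dates such as "15 July 1606", "15-July-1606" or "29, March 1891".
DATE_RE = re.compile(r"\b(\d{1,2})[\s,\-]+(" + MONTHS + r")[\s,\-]+(\d{4})\b")


def extract_dates(text):
    """Return every date found in `text`, normalised to 'D-Month-YYYY'."""
    return [f"{int(day)}-{month}-{year}" for day, month, year in DATE_RE.findall(text)]


def answer_birth_date(person, sentences):
    """Return dates from sentences that mention `person` and the cue word 'born'.

    Requiring the cue word is a crude stand-in for "understanding": a sentence
    about a hotel called Rembrandt is skipped unless it also says 'born'.
    """
    answers = []
    for sentence in sentences:
        if person in sentence and "born" in sentence.lower():
            answers.extend(extract_dates(sentence))
    return answers
```

A keyword filter like this is easily fooled, which is why the approach on the slides relies on named-entity recognition and sentence analysis instead.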

Information extraction on the Web
• Data may be low quality and repeated, e.g. Georges Seurat’s date of death:
  • 29 March 1891 (http://www.ibiblio.org/wm/paint/auth/seurat/)
  • 19 March 1891 (http://www.rickdoble.net/influence/20seurat.htm)
• WCA depends on:
  • Well-structured sentences and documents
  • Good named-entity recognisers
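The Seurat example suggests one simple way to cope with repeated and inconsistent data: group candidate values by the pages that report them and flag any disagreement. The sketch below is only an illustration under that assumption; the function names are hypothetical, and the third candidate is a made-up repeat of the first, added to show de-duplication.

```python
from collections import defaultdict


def group_by_value(candidates):
    """Map each distinct value to the set of source URLs that report it."""
    sources = defaultdict(set)
    for value, url in candidates:
        sources[value].add(url)  # a repeated (value, url) pair collapses here
    return dict(sources)


def is_conflicting(candidates):
    """True when the sources report more than one distinct value."""
    return len(group_by_value(candidates)) > 1


# Candidate dates of death for Georges Seurat, as listed on the slide.
candidates = [
    ("29 March 1891", "http://www.ibiblio.org/wm/paint/auth/seurat/"),
    ("19 March 1891", "http://www.rickdoble.net/influence/20seurat.htm"),
    # Hypothetical repeat of the first page, to show that duplicates are merged.
    ("29 March 1891", "http://www.ibiblio.org/wm/paint/auth/seurat/"),
]

for value, urls in group_by_value(candidates).items():
    print(value, "reported by", len(urls), "source(s)")
print("conflict:", is_conflicting(candidates))
```

With one independent source on each side, counting alone cannot decide which date is right; the sketch only detects the disagreement.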

Future work
• Verification
• Performance
• Autonomous operation
