Learn about Thresher, a tool to automate unwrapping semantic content from the web. See how it transforms complex web data into a user-friendly interface. Discover the underlying algorithms and techniques used for wrapper induction and pattern matching. Explore how Thresher ties wrappers to semantic content, making it easier to interact with web objects. Find out how additional examples can be automatically added to improve the wrapper's accuracy and efficiency. With Thresher, you can simplify the process of extracting and interacting with semantic content from the web.
Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web
Andrew Hogue
Google, MIT CSAIL
WWW 2005 -- Chiba, Japan

Acknowledgments
• David Karger (karger@csail.mit.edu)
• Haystack Group (http://haystack.csail.mit.edu)

Agenda
• Overview
• Demo
• Details
  • Induction
  • Matching
  • Semantics
  • Heuristics

Unwrapping the Web
• Majority of semantic content in “deep web”
• Transformed into human-readable HTML by scripts
• HTML is difficult for automated agents to understand
• Little incentive for content providers to provide RDF markup
• How to “unwrap” this content?

Thresher
• Simple UI for wrapper induction on structured web content
• “Demonstrate” examples of objects
• Induce wrapper, or pattern, based on DOM
• User may also label properties with RDF

Thresher
• Built on Haystack Semantic Web client
• Everything is RDF
• Everything has context menus
• Thresher brings RDF into the web browser
• Wrappers reify web objects for full interaction

Thresher
• Underlying wrapper algorithm based on tree edit distance
• Align user’s examples
• Keep aligned nodes (layout elements)
• Wildcard non-aligned nodes (content)
• Pattern matching is also alignment

Wrapper Induction
• Wrapper: pattern created from examples
• User provides positive examples
• Generalize examples into reusable pattern
• Existing techniques:
  • Head-left-right-tail (HLRT) descriptors
  • Hidden Markov models
  • Support vector machines
  • Other machine learning

Wrapper Induction
• Our approach: take advantage of hierarchical structure of HTML
• Each example picks out a subtree of DOM
• Calculate tree edit distance between examples
• Least-cost edit distance gives best mapping
• Remove unmapped nodes to make pattern

Tree Edit Distance
• Calculate the cost of the sequence of operations that transforms one tree into the other
• Operations: insert, delete, change a node
• Cost of an operation = size of subtree it affects
• Least-cost set of operations gives best mapping between elements

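
To make the induction step concrete, here is a minimal Python sketch of aligning two example subtrees and turning the result into a pattern. It is a simplified top-down variant written for illustration, not the algorithm as implemented in Thresher. The tuple tree encoding, the helper names (`n`, `size`, `squeeze`, `align`), the "text:" label prefix for text nodes, and the rule that a label mismatch costs the size of both subtrees are all assumptions of this sketch.

```python
from functools import lru_cache

WILDCARD = ("*", ())  # stands for content that differs between examples


def n(label, *children):
    """Build a tree node as (label, children); tuples keep nodes hashable."""
    return (label, tuple(children))


def size(tree):
    """Number of nodes in the subtree rooted at tree."""
    return 1 + sum(size(child) for child in tree[1])


def squeeze(kids):
    """Merge runs of adjacent wildcards into a single wildcard."""
    out = []
    for kid in kids:
        if kid == WILDCARD and out and out[-1] == WILDCARD:
            continue
        out.append(kid)
    return tuple(out)


@lru_cache(maxsize=None)
def align(a, b):
    """Return (cost, pattern) for the cheapest top-down mapping of a onto b."""
    if a[0] != b[0]:
        # Assumption: a mismatch deletes one subtree and inserts the other.
        return size(a) + size(b), WILDCARD
    xs, ys = a[1], b[1]
    # dp[i, j] = (cost, pattern children) for aligning xs[:i] with ys[:j]
    dp = {(0, 0): (0, ())}
    for i in range(len(xs) + 1):
        for j in range(len(ys) + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:  # map xs[i-1] onto ys[j-1]
                cost, pat = dp[i - 1, j - 1]
                sub_cost, sub_pat = align(xs[i - 1], ys[j - 1])
                options.append((cost + sub_cost, pat + (sub_pat,)))
            if i > 0:  # delete xs[i-1]; cost is the size of its subtree
                cost, pat = dp[i - 1, j]
                options.append((cost + size(xs[i - 1]), pat + (WILDCARD,)))
            if j > 0:  # insert ys[j-1]; cost is the size of its subtree
                cost, pat = dp[i, j - 1]
                options.append((cost + size(ys[j - 1]), pat + (WILDCARD,)))
            dp[i, j] = min(options, key=lambda option: option[0])
    cost, kids = dp[len(xs), len(ys)]
    return cost, (a[0], squeeze(kids))


# Two hypothetical table rows with the same layout but different content.
ex1 = n("tr", n("td", n("b", n("text:Talk A"))), n("td", n("text:3:30 PM")))
ex2 = n("tr", n("td", n("b", n("text:Talk B"))), n("td", n("text:4:00 PM")))
cost, pattern = align(ex1, ex2)
# The layout nodes (tr, td, b) are mapped and kept; the two differing
# text nodes are unmapped and become wildcards.
```

Aligned nodes survive into `pattern` unchanged, which mirrors the "keep layout, wildcard content" idea from the slides; a production implementation would more likely use a general tree edit distance algorithm than this child-by-child recursion.
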
Mapping Examples

Pattern Matching
• Look for document subtrees with similar structure
• Find alignments of wrapper in tree
• Require every node in wrapper be mapped to some node in document subtree
• Wildcards match zero or more times
• Each valid alignment is a match

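
The matching rules above can be sketched in a few lines of Python. This is an illustrative simplification, not Thresher's matcher: it assumes the same tuple tree encoding as a plain `(label, children)` pair, treats a node labeled `"*"` as a wildcard that absorbs zero or more sibling subtrees, and only tolerates extra document nodes where a wildcard sits. The function names are hypothetical.

```python
WILDCARD = "*"


def n(label, *children):
    """Build a tree node as (label, children)."""
    return (label, tuple(children))


def matches(pattern, node):
    """True if every pattern node maps onto the subtree rooted at node."""
    return pattern[0] == node[0] and match_seq(pattern[1], node[1])


def match_seq(pats, kids):
    """Match a sequence of pattern children against document children."""
    if not pats:
        return not kids
    head, rest = pats[0], pats[1:]
    if head[0] == WILDCARD:
        # A wildcard may absorb zero or more consecutive siblings.
        return any(match_seq(rest, kids[k:]) for k in range(len(kids) + 1))
    return bool(kids) and matches(head, kids[0]) and match_seq(rest, kids[1:])


def find_matches(pattern, doc):
    """Collect every subtree of doc that the pattern aligns with."""
    found = [doc] if matches(pattern, doc) else []
    for child in doc[1]:
        found.extend(find_matches(pattern, child))
    return found


# Hypothetical wrapper and page: a header row plus two talk rows.
wrapper = n("tr", n("td", n("b", n("*"))), n("td", n("*")))
row1 = n("tr", n("td", n("b", n("text:Talk A"))), n("td", n("text:3:30 PM")))
row2 = n("tr", n("td", n("b", n("text:Talk B"))), n("td", n("text:4:00 PM")))
table = n("table", n("tr", n("th", n("text:Title"))), row1, row2)
hits = find_matches(wrapper, table)  # the header row is skipped
```

Each element of `hits` is one valid alignment, that is, one object found on the page.
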
Matching Example

Adding Semantics
• How to tie wrappers to semantic content?
• Assert RDF statements about unwrapped objects
• Tied to wrapper structure
• Classes bound to wrappers
• Properties bound to wildcards

Semantic Labels

Semantic Matching
[ <rdf:type> <TalkAnnouncement> ;
  <series> “Dertouzos Lect…” ;
  <dc:title> “Distributed Hash…” ;
  <time> “3:30 PM” ]

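
One way to picture "classes bound to wrappers, properties bound to wildcards" is a matcher that records what each labeled wildcard absorbed and then emits statements about the matched object. The Python sketch below is an assumption-laden illustration, not Thresher's implementation: wildcards are encoded as `("*", property_name)`, text nodes carry a "text:" label prefix, statements are plain tuples rather than real RDF, and the subject `_:talk1` and the sample row are hypothetical.

```python
def n(label, *children):
    """Build a tree node as (label, children)."""
    return (label, tuple(children))


def wild(prop=None):
    """A wildcard, optionally labeled with the property it fills."""
    return ("*", prop)


def text_of(node):
    """Concatenate the text found under a document node."""
    label, kids = node
    if label.startswith("text:"):
        return label[len("text:"):]
    return "".join(text_of(kid) for kid in kids)


def bind(pattern, node):
    """Return {property: text} if node matches pattern, else None."""
    if pattern[0] != node[0]:
        return None
    return bind_seq(pattern[1], node[1])


def bind_seq(pats, kids):
    if not pats:
        return {} if not kids else None
    head, rest = pats[0], pats[1:]
    if head[0] == "*":
        # Try absorbing 0, 1, 2, ... siblings until the rest matches.
        for k in range(len(kids) + 1):
            tail = bind_seq(rest, kids[k:])
            if tail is not None:
                found = dict(tail)
                if head[1] is not None:
                    found[head[1]] = "".join(text_of(c) for c in kids[:k])
                return found
        return None
    if not kids:
        return None
    first = bind(head, kids[0])
    tail = bind_seq(rest, kids[1:]) if first is not None else None
    return None if tail is None else {**first, **tail}


def statements(subject, rdf_class, bindings):
    """Turn a match into (subject, predicate, object) tuples."""
    triples = [(subject, "rdf:type", rdf_class)]
    triples += [(subject, p, v) for p, v in sorted(bindings.items())]
    return triples


# The wrapper is bound to a class; its wildcards are bound to properties.
wrapper = n("tr", n("td", n("b", wild("dc:title"))), n("td", wild("time")))
row = n("tr", n("td", n("b", n("text:Talk A"))), n("td", n("text:3:30 PM")))
triples = statements("_:talk1", "TalkAnnouncement", bind(wrapper, row))
```

Because the labels live on the wrapper rather than on any one page, every later match of the same wrapper yields the same kind of statements with new values.
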
Automatically Adding Examples
• Find additional examples automatically
• Consider nodes neighboring the example
• Require low normalized cost
• Often allows us to create wrappers with a single example

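
As a rough illustration of this heuristic, the Python sketch below scores the siblings of a user-selected example and keeps those that are structurally close. Several things here are assumptions rather than details taken from the slides: the edit cost is normalized by the combined size of the two subtrees, the threshold of 0.5 is arbitrary, "neighboring" is read as "sibling", and the edit cost itself is a simplified top-down variant with hypothetical helper names.

```python
from functools import lru_cache


def n(label, *children):
    """Build a tree node as (label, children); tuples keep nodes hashable."""
    return (label, tuple(children))


def size(tree):
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def edit_cost(a, b):
    """Simplified top-down edit cost; each operation costs a subtree size."""
    if a[0] != b[0]:
        return size(a) + size(b)  # assumption: delete one, insert the other
    xs, ys = a[1], b[1]
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),
                dp[i][j - 1] + size(ys[j - 1]),
                dp[i - 1][j - 1] + edit_cost(xs[i - 1], ys[j - 1]),
            )
    return dp[-1][-1]


def normalized_cost(a, b):
    """Assumed normalization: edit cost divided by the combined tree size."""
    return edit_cost(a, b) / (size(a) + size(b))


def similar_siblings(parent, index, threshold=0.5):
    """Indices of siblings of parent's child `index` that look like it."""
    example = parent[1][index]
    return [
        i
        for i, sibling in enumerate(parent[1])
        if i != index and normalized_cost(example, sibling) <= threshold
    ]


# Hypothetical page: the user selects only the first talk row (index 1).
header = n("tr", n("th", n("text:Title")), n("th", n("text:Time")))
row1 = n("tr", n("td", n("b", n("text:Talk A"))), n("td", n("text:3:30 PM")))
row2 = n("tr", n("td", n("b", n("text:Talk B"))), n("td", n("text:4:00 PM")))
table = n("table", header, row1, row2)
extra = similar_siblings(table, 1)  # row2 qualifies, the header does not
```

Normalizing keeps the threshold meaningful across small and large subtrees; without it, large examples would never find a neighbor that looks cheap enough.
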
List Collapse
• Current wrappers generalize well for single elements
• Will not recognize variable length lists
• Collapse neighboring nodes with low normalized cost
• For matching, allow nodes to match more than once

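
A small Python sketch can show why collapsing helps with variable-length lists. It deliberately simplifies the heuristic: instead of a normalized-cost threshold, it collapses only adjacent pattern children that are identical (the zero-cost case). The three-field pattern encoding `(label, children, repeat)` and all function names are assumptions of this sketch.

```python
def p(label, *children, repeat=False):
    """Pattern node: (label, children, repeat flag)."""
    return (label, tuple(children), repeat)


def d(label, *children):
    """Document node: (label, children)."""
    return (label, tuple(children))


def collapse(pat):
    """Merge runs of identical adjacent children into one repeatable child."""
    label, kids, repeat = pat
    merged = []
    for kid in map(collapse, kids):
        if merged and merged[-1][:2] == kid[:2]:
            merged[-1] = (kid[0], kid[1], True)  # mark as repeatable
        else:
            merged.append(kid)
    return (label, tuple(merged), repeat)


def matches(pat, node):
    return pat[0] == node[0] and match_seq(pat[1], node[1])


def match_seq(pats, kids):
    if not pats:
        return not kids
    head, rest = pats[0], pats[1:]
    if head[0] == "*":
        # A wildcard absorbs zero or more siblings.
        return any(match_seq(rest, kids[k:]) for k in range(len(kids) + 1))
    if not kids or not matches(head, kids[0]):
        return False
    # A repeatable node may match again on the next sibling.
    return match_seq(rest, kids[1:]) or (head[2] and match_seq(pats, kids[1:]))


# A pattern induced from a two-item list only matches two-item lists ...
two_items = p("ul", p("li", p("*")), p("li", p("*")))
# ... until its identical neighbors are collapsed into one repeatable item.
any_length = collapse(two_items)

three = d("ul", d("li", d("text:a")), d("li", d("text:b")), d("li", d("text:c")))
```

Here `matches(two_items, three)` fails while `matches(any_length, three)` succeeds, which is the behavior the last two bullets describe.
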
Wrapper Wrap-up
• Gather user example(s)
• Automatically find additional examples
• Generalize examples using best mapping
• Add semantic labels
• Match by finding alignments
• Overlay objects on the page for interaction

Additional Tools
• Wrapper Sharing
• RSS
• Web Operations

Our Contributions
• End-user wrapper induction
• Few examples required
• Bring object interaction into the browser
• Wrappers bridge syntactic-semantic gap

Future Work and Applications
• Document-level classes
• Page reformatting
• Autonomous agent interaction
• Negative examples
• Automatic wrapper induction

ahogue@google.com
http://haystack.csail.mit.edu