Explore the utilization of site-level knowledge for extracting structured data from web forums. Understand forum crawling, data extraction quality, and assessment challenges. Discover techniques for forum sitemap creation, layout characterization, and data record relationships analysis.
Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma Web Search & Mining Group Microsoft Research Asia 2009-04
Web Forum Data • An important information resource containing a great deal of human knowledge. • Topics include recreation, sports, games, computers, art, society, science, home, and health. • 20% of the pages in search results come from forums.
Understanding Forums • Crawling • Data Extraction • Quality Assessment • WWW’08: iRobot: An Intelligent Crawler for Web Forums • SIGIR’08: Exploring Traversal Strategy • KDD’09: Incremental Crawling • WWW’09: Automated Data Extraction • SIGIR’09: Quality Assessment
Challenge • Leverage more site-level knowledge
Forum Sitemap • A sitemap is a directed graph consisting of a set of vertices and the links between them. • Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference.
Page Clustering • Forum pages are generated from a database and templates. • Layout is a robust way to describe a template. • Layout can be characterized by the HTML elements found in different DOM paths.
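The idea of characterizing layout by DOM paths can be sketched as follows. This is a minimal illustration, not the authors' clustering method: the names `DomPathCounter`, `dom_path_features`, and `layout_similarity` are hypothetical, and the weighted Jaccard similarity is an assumed stand-in for whatever distance the actual system uses.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomPathCounter(HTMLParser):
    """Counts how often each root-to-element tag path occurs in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_path_features(html):
    """Return a Counter mapping each DOM path to its number of occurrences."""
    parser = DomPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def layout_similarity(a, b):
    """Weighted Jaccard similarity between two path-count vectors (assumed metric)."""
    keys = set(a) | set(b)
    union = sum(max(a[k], b[k]) for k in keys)
    inter = sum(min(a[k], b[k]) for k in keys)
    return inter / union if union else 1.0


row = "<tr><td><a href='#'>topic</a></td></tr>"
list_a = dom_path_features(f"<html><body><table>{row * 2}</table></body></html>")
list_b = dom_path_features(f"<html><body><table>{row * 3}</table></body></html>")
post = dom_path_features(
    "<html><body><div><p>hello</p></div><div><p>world</p></div></body></html>")

# Two list pages from the same template score higher than a list/post pair.
print(layout_similarity(list_a, list_b), layout_similarity(list_a, post))
```

Pages whose similarity exceeds some threshold could then be grouped into the same cluster; the threshold would have to be chosen empirically.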
Page Clustering • DOM Path • Feature Discovery • Clustering by Virtual Tables
Link Analysis • A link = URL pattern + location
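One way to read "URL pattern + location" is as a two-field record. The sketch below is an assumption-laden illustration: the `Link` type, the `url_pattern` generalization rule (digits and query values replaced by placeholders), and the use of a DOM path string as the location are all hypothetical choices, and `forum.example.com` is a made-up host.

```python
import re
from typing import NamedTuple
from urllib.parse import parse_qsl, urlsplit


class Link(NamedTuple):
    url_pattern: str  # URL with its variable parts generalized
    location: str     # where the anchor sits in the page (a DOM path here)


def url_pattern(url):
    """Generalize a URL: keep the path shape and query keys, drop the values."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{num}", parts.path)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    query = "&".join(f"{k}=*" for k in keys)
    return f"{path}?{query}" if query else path


def make_link(url, dom_path):
    """Build a Link from a concrete URL and the DOM path of its anchor."""
    return Link(url_pattern(url), dom_path)


where = "html/body/table/tr/td/a"
first = make_link("http://forum.example.com/viewtopic.php?t=123&f=4", where)
second = make_link("http://forum.example.com/viewtopic.php?f=9&t=77", where)

# Both anchors share a pattern and a location, so they count as the same link type.
print(first == second, first.url_pattern)
```

Two anchors with the same URL pattern but different locations (for example, a thread link in the main table versus one in a sidebar) would be kept apart under this representation.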
Inner-Page Features • The inclusion relation: data records usually have inclusion relations. • The alignment relation: because data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page. • Time order: because post records are generated sequentially along a timeline, post times should be sorted in ascending or descending order.
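The time-order feature lends itself to a simple check. The sketch below assumes candidate timestamps have already been extracted as strings in a known format; the function names and the `"%Y-%m-%d %H:%M"` format are illustrative assumptions, since real forums use many date formats.

```python
from datetime import datetime


def parse_times(strings, fmt="%Y-%m-%d %H:%M"):
    """Parse candidate timestamp strings (the format is an assumed example)."""
    return [datetime.strptime(s, fmt) for s in strings]


def is_time_ordered(timestamps):
    """True if the sequence is entirely non-decreasing or entirely non-increasing."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


post_times = parse_times(["2009-04-01 10:00", "2009-04-01 10:05", "2009-04-02 08:30"])
shuffled = [post_times[1], post_times[0], post_times[2]]

print(is_time_ordered(post_times), is_time_ordered(shuffled))
```

A set of aligned elements that parse as dates and pass this check is a plausible candidate for the post-time field, whereas dates scattered through post content (quoted text, signatures) would usually fail it.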
Problem Setting Author Title Content
Formulas for List Pages • Formulas for identifying list records • Formulas for identifying list titles
Formulas for Post Pages • Formulas for identifying post records • Formulas for identifying post authors
Formulas for Post Pages (continued) • Formulas for identifying post times • Formulas for identifying post content
Markov Logic Networks • An MLN can be viewed as a template for constructing Markov Random Fields. • With a set of formulas and constants, MLNs define a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
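The formula referred to here is not preserved in this transcript. Assuming the slide showed the standard MLN distribution, it reads as follows, where $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of true groundings of formula $i$ in state $x$, and $Z$ is the normalizing constant:

```latex
P(X = x) = \frac{1}{Z} \exp\!\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\!\Big( \sum_{i} w_i \, n_i(x') \Big)
```

A formula with a large positive weight makes states that satisfy many of its groundings exponentially more probable, which is how soft constraints such as the time-order feature can influence labeling without being hard rules.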
Markov Logic Networks • Divide DOM tree elements into three categories: text elements, hyperlink elements, and inner elements. • Benefits: fewer possible groundings during inference; less ambiguity and better performance.
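A rough sketch of the three-way split is shown below. The slide does not give the exact criteria, so the rules used here are assumptions: an `<a>` element counts as a hyperlink element, a childless element with non-empty text counts as a text element, and everything else counts as an inner element. The `Node`, `TreeBuilder`, and `categorize_page` names are hypothetical.

```python
from html.parser import HTMLParser

# Elements without closing tags; they can never become the current parent.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree that records tags and direct text."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element; ignore stray closing tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def categorize(node):
    """Assumed rules: <a> is a hyperlink, a leaf with text is text, else inner."""
    if node.tag == "a":
        return "hyperlink"
    if not node.children and node.text:
        return "text"
    return "inner"


def categorize_page(html):
    """Return (tag, category) pairs for every element, in document order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    result = []

    def walk(node):
        for child in node.children:
            result.append((child.tag, categorize(child)))
            walk(child)

    walk(builder.root)
    return result


print(categorize_page("<div><span>alice</span><a href='/u/1'>profile</a><br></div>"))
```

With such a split, a formula about post authors would only need to be grounded over hyperlink and text elements rather than over every node in the DOM tree, which is the reduction in groundings the slide mentions.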
Experiments List Pages Post Pages
Future Work • http://discussions.apple.com/
Conclusion • A template-independent approach to extracting structured data from web forum sites. • We can leverage the power of site-level information, such as the mutual information among pages within a vertex or across vertices of the sitemap. • http://research.microsoft.com/people/jmyang/