
Block-level Link Analysis

Explore how different semantic blocks on a web page affect its importance and topic relevance using link analysis at the block level. This study presents a method to segment web pages into semantic blocks, construct a semantic tree, and apply the PageRank and HITS algorithms to capture the semantic structure of web pages. Experiments conducted on the TREC2003 dataset show the effectiveness of block-level analysis in improving search relevance.



Presentation Transcript


  1. Block-level Link Analysis • Presented by Lan Nie, 11/08/2005, Lehigh University

  2. Introduction • A web page often contains multiple semantic units • Different parts of the page have different importance and topics • Links contained in different semantic blocks point to pages on different topics • The importance of a page may be miscalculated by PageRank, and topic drift may happen in HITS • Split the page into semantic blocks • Apply link analysis at the block level

  3. Vision-Based Page Segmentation • Construct a semantic tree for a page based on its layout structure • Extract blocks from the HTML DOM tree • Arrange the blocks into a semantic tree based on separators • Node: a block with a value (DOC, degree of coherence) indicating how coherent the content in the block is
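The semantic tree described on this slide can be pictured as a small recursive data structure. The following is only a sketch: the `Block` class, its field names, the example page, and the DOC values are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A node of the semantic tree: a block plus its coherence (DOC) value."""
    name: str
    doc: float  # assumed scale: higher means more coherent content
    children: List["Block"] = field(default_factory=list)


def leaf_blocks(node: Block) -> List[Block]:
    """Return the finest-grained blocks (the leaves) under a node, left to right."""
    if not node.children:
        return [node]
    leaves: List[Block] = []
    for child in node.children:
        leaves.extend(leaf_blocks(child))
    return leaves


# Hypothetical page: a header, a main area split into article and sidebar, a footer.
page = Block("page", 0.2, [
    Block("header", 0.9),
    Block("main", 0.5, [Block("article", 0.95), Block("sidebar", 0.8)]),
    Block("footer", 0.9),
])

leaf_names = [b.name for b in leaf_blocks(page)]
print(leaf_names)
```

In this sketch the whole page has a low DOC value because it mixes topics, while the leaves are more coherent.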

  4. Block Level Web Graph • P: set of all the pages • B: set of all the blocks • X: page-to-block matrix (layout structure), with entries given by the block importance function f • f: block importance function; a large, centered block scores higher than a small block at the margin • Z: block-to-page matrix (link structure), defined using the number of pages that block i links to
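A toy construction of the two matrices may make the definitions more concrete. The formulas were lost from the slide, so everything specific here is an assumption: the importance function (size divided by distance from the page center), the per-page normalization of X, the choice of 1 / (number of linked pages) for the entries of Z, and all names and data.

```python
# Hypothetical toy data: block -> (owning page, size, distance from page center).
pages = ["p0", "p1"]
blocks = {
    "b0": ("p0", 600.0, 1.0),  # big, centered block
    "b1": ("p0", 100.0, 4.0),  # small block at the margin
    "b2": ("p1", 300.0, 1.0),
}
# Hyperlinks contained in each block (block -> target pages).
links = {"b0": ["p1"], "b1": ["p0", "p1"], "b2": ["p0"]}


def importance(size, dist):
    """Assumed importance: grows with size, shrinks with distance from the center."""
    return size / dist


def build_X(pages, blocks):
    """Page-to-block matrix; each page's block importances are scaled to sum to 1."""
    X = [[0.0] * len(blocks) for _ in pages]
    for j, (page, size, dist) in enumerate(blocks.values()):
        X[pages.index(page)][j] = importance(size, dist)
    for row in X:
        total = sum(row)
        if total > 0:
            for j in range(len(row)):
                row[j] /= total
    return X


def build_Z(pages, blocks, links):
    """Block-to-page matrix; block i spreads weight 1/s_i over the s_i pages it links to."""
    Z = []
    for b in blocks:
        targets = links.get(b, [])
        row = [0.0] * len(pages)
        for t in targets:
            row[pages.index(t)] = 1.0 / len(targets)
        Z.append(row)
    return Z


X = build_X(pages, blocks)
Z = build_Z(pages, blocks, links)
print(X)
print(Z)
```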

  5. WP: Page-to-Page Graph • A weighted adjacency matrix: links in blocks with a high importance value get more weight than those in blocks with a low importance value
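One way to obtain such a weighted page-to-page matrix, assumed here because the slide's formula did not survive, is to multiply the page-to-block matrix X by the block-to-page matrix Z. The matrices below are hypothetical toy values.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    inner, cols = len(B), len(B[0])
    return [
        [sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for row in A
    ]


# Hypothetical X (2 pages x 3 blocks): page 0 has an important block and a minor one.
X = [[0.8, 0.2, 0.0],
     [0.0, 0.0, 1.0]]
# Hypothetical Z (3 blocks x 2 pages): each row spreads weight over the linked pages.
Z = [[0.0, 1.0],
     [0.5, 0.5],
     [1.0, 0.0]]

WP = matmul(X, Z)  # assumed definition: WP = X * Z
print(WP)
```

Page 0 reaches page 1 mostly through its important block, so that edge dominates its row of WP.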

  6. WB: Block-to-Block Graph (not used in this paper) • Extension: the probability of jumping from block a to block b within a page is the DOC value of the smallest block containing both block a and block b
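The "smallest block containing both" is the lowest common ancestor of the two blocks in the semantic tree. A minimal sketch of that lookup follows; the nested-dict tree layout, the function names, and the DOC values are hypothetical.

```python
def path_to(node, target, trail=()):
    """Return the chain of nodes from `node` down to the block named `target`, or None."""
    trail = trail + (node,)
    if node["name"] == target:
        return trail
    for child in node.get("children", []):
        found = path_to(child, target, trail)
        if found:
            return found
    return None


def jump_probability(root, a, b):
    """DOC of the smallest block containing both a and b (their lowest common ancestor)."""
    pa, pb = path_to(root, a), path_to(root, b)
    if pa is None or pb is None:
        raise KeyError("block not found in the tree")
    common = None
    for x, y in zip(pa, pb):
        if x is not y:
            break
        common = x  # last node shared by both root-to-block paths
    return common["doc"]


# Hypothetical semantic tree with made-up DOC values.
page = {"name": "page", "doc": 0.2, "children": [
    {"name": "header", "doc": 0.9},
    {"name": "main", "doc": 0.5, "children": [
        {"name": "article", "doc": 0.95},
        {"name": "sidebar", "doc": 0.8},
    ]},
]}

near = jump_probability(page, "article", "sidebar")  # both inside "main"
far = jump_probability(page, "article", "header")    # only share the whole page
print(near, far)
```

Blocks that share a tight, coherent container get a higher jump probability than blocks that only share the whole page.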

  7. Block Level PageRank (BLPR) • Apply PageRank on the weighted adjacency matrix WP • Each edge is weighted by its block's importance value • Pages pointed to by advertisement hyperlinks might not be assigned a large score, since such links typically sit in less important blocks • Block level PageRank can reflect the semantic structure of the web
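A minimal power-iteration sketch shows how edge weights change the outcome. The damping factor 0.85, the iteration count, the row normalization, and the toy matrix are assumptions, not values from the presentation.

```python
def weighted_pagerank(W, damping=0.85, iters=100):
    """PageRank by power iteration on a weighted adjacency matrix W (list of rows)."""
    n = len(W)
    # Row-normalize so each page's outgoing weights form a probability distribution;
    # pages with no out-links are assumed to jump uniformly.
    P = []
    for row in W:
        total = sum(row)
        P.append([w / total for w in row] if total > 0 else [1.0 / n] * n)
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [
            (1 - damping) / n + damping * sum(rank[i] * P[i][j] for i in range(n))
            for j in range(n)
        ]
    return rank


# Hypothetical graph: page 0 links to page 1 from its main block (weight 0.9)
# and to page 2 from an advertisement block (weight 0.1); both link back to page 0.
WP = [[0.0, 0.9, 0.1],
      [1.0, 0.0, 0.0],
      [1.0, 0.0, 0.0]]

ranks = weighted_pagerank(WP)
print(ranks)
```

With unweighted links, pages 1 and 2 would tie; with block weights, the page reached from the advertisement block scores lower.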

  8. Block Level HITS (BLHITS) • Apply HITS on the block-to-page matrix Z • A page has only an authority score A and a block has only a hub score H • Different parts of a page are treated as separate hubs, so the links they contain are treated differently
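The hub and authority updates on a block-to-page matrix can be sketched as follows. The binary toy matrix, the L2 normalization, and the iteration count are assumptions for illustration.

```python
import math


def block_hits(Z, iters=50):
    """HITS on a block-to-page matrix: blocks get hub scores, pages get authority scores."""
    n_blocks, n_pages = len(Z), len(Z[0])
    hub = [1.0] * n_blocks
    auth = [1.0] * n_pages
    for _ in range(iters):
        # A page's authority is the sum of the hub scores of blocks linking to it.
        auth = [sum(hub[i] * Z[i][j] for i in range(n_blocks)) for j in range(n_pages)]
        # A block's hub score is the sum of the authority scores of pages it links to.
        hub = [sum(Z[i][j] * auth[j] for j in range(n_pages)) for i in range(n_blocks)]
        norm_a = math.sqrt(sum(a * a for a in auth)) or 1.0
        norm_h = math.sqrt(sum(h * h for h in hub)) or 1.0
        auth = [a / norm_a for a in auth]
        hub = [h / norm_h for h in hub]
    return hub, auth


# Hypothetical links: block 0 -> pages 0 and 1, block 1 -> page 1, block 2 -> page 2.
Z = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

hub, auth = block_hits(Z)
print(hub)
print(auth)
```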

  9. Main differences between BLHITS and HITS • Links from blocks to pages vs. links from pages to pages • The root set is made up of top-ranked blocks rather than top-ranked pages • When expanding the root set, only out-links contained in the top-ranked blocks of a page are considered, instead of all links • Content analysis is combined at the block level instead of the page level • Link weight: importance value of the block / maximum block importance value
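The link-weight rule in the last bullet amounts to a simple normalization. In this sketch the maximum is taken over whatever set of blocks is passed in, which is an assumption (the slide does not say whether the maximum is per page or global); the block names and values are hypothetical.

```python
def link_weights(block_importance):
    """Weight for links in each block: block importance / maximum block importance."""
    top = max(block_importance.values())
    if top <= 0:
        # Degenerate case: no block carries any importance.
        return {name: 0.0 for name in block_importance}
    return {name: value / top for name, value in block_importance.items()}


# Hypothetical importance values for the blocks of one page.
weights = link_weights({"article": 0.6, "sidebar": 0.3, "footer": 0.15})
print(weights)
```

Links in the most important block keep weight 1, and links elsewhere are scaled down proportionally.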

  10. Experiments • Dataset: TREC2003 • Relevance weighting: BM2500 • PR and BLPR • HITS and BLHITS • Size of root set: 200 • In-link parameter d: 50 • Adopting Bharat and Henzinger's idea: eliminate mutually reinforcing relationships between hosts, and combine connectivity and content analysis

  11. Results on PR & BLPR (1) • First 15 pages in the .GOV dataset

  12. Results on PR & BLPR (2) • Results on TREC2003 • Combine the relevance score (using BM2500) and the importance score (using the ranking algorithm)
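The slide does not say how the two scores are combined. One common option, assumed here purely for illustration, is a linear interpolation of min-max normalized scores; the weight `alpha`, the function names, and the toy scores are all hypothetical.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (value - lo) / (hi - lo) for doc, value in scores.items()}


def combine(relevance, importance, alpha=0.8):
    """Assumed combination: alpha * relevance + (1 - alpha) * importance."""
    rel, imp = min_max(relevance), min_max(importance)
    return {doc: alpha * rel[doc] + (1 - alpha) * imp[doc] for doc in relevance}


# Hypothetical scores for three documents.
relevance = {"a": 12.0, "b": 11.0, "c": 4.0}   # e.g. from a BM2500-style ranker
importance = {"a": 0.01, "b": 0.05, "c": 0.02}  # e.g. from PR or BLPR

combined = combine(relevance, importance)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Document "b" overtakes "a" here because it is nearly as relevant and clearly more important under the link-based score.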

  13. Results on HITS & BLHITS

  14. Summary
