Finding Related Communities on the Web

Finding Related Communities on the Web. Masashi Toyoda. We propose a new web search technique, which finds related communities from a given URL.

Presentation Transcript


  1. Finding Related Communities on the Web
     Masashi Toyoda

     We propose a new web search technique that finds related communities from a given URL. A community is a set of web pages written by authors who share an interest in a specific topic, such as fan pages of a professional baseball team. Using hyperlink analysis, our technique finds the community that includes the given URL, along with communities on related topics.

     What the proposed technique finds
     • Communities related to a given seed page
     • Example from the figure: a seed page by a fan of the SONY VAIO PC, a community of VAIO fans, and a community of PC vendors

     HITS [Kleinberg ’97]
     • Extracts good authorities and hubs from a given subset of the web graph (here, the sub-graph for finding related pages, built around the seed)
     • Authorities: pages pointed to by many good hubs
     • Hubs: pages pointing to many good authorities
     • auth(n) = Σ hub(m), for all m pointing to n
     • hub(n) = Σ auth(m), for all m pointed to by n

     Algorithm
     • Run HITS from the seed URL (http://foo.bar/) to obtain the first top 10 authorities (URL1, URL2, …, URL10)
     • Use each authority as the next seed and run HITS again, giving 10 lists of top 10 authorities (URL1.1 … URL1.10, URL2.1 … URL2.10, …, URL10.1 … URL10.10)
     • Clustering: merge two "top 10 authorities" lists into a cluster when they share 3 or more URLs
     • The resulting clusters are the result communities

     Typical Behavior (labels recovered from the figure)
     • Top 10 authorities for the seed: VAIO fan A, VAIO fan B, VAIO official page, VAIO fan C, VAIO and WinCE, …
     • Groups shown after expansion and clustering:
       • VAIO fan A, VAIO fan B, VAIO fan C, VAIO fan D, …
       • VAIO official page, SONY, IBM, TOSHIBA, …
       • VAIO and WinCE, WinCE fan A, WinCE fan B, …

     Data Set
     • 17 million web pages (90 GB)
     • Crawled from July to September 1999
     • Pages in the jp domain, or pages in other domains that include Japanese characters
     • Root URL: http://www.yahoo.co.jp/
     • Crawling strategy: breadth first

     Web Graph
     • 38 M URLs: 17 M pages retrieved by the crawler and 21 M pages pointed to by retrieved pages
     • 23 M inter-server links
     • Mapped on main memory (2.5 GB)

     Experiment
     • Randomly select 50 moderately popular pages as seeds
     • 10 ≤ # of in-links ≤ 50
     • Examine whether the result communities are related

     Result
     • 35 seeds produce related communities
     • 15 seeds produce unrelated communities
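
The HITS update rules and the merge step from the slide above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function names (`hits`, `top_authorities`, `merge_authority_lists`), the fixed iteration count, the L2 normalisation, and the choice to repeat merging until no two clusters share enough URLs are all assumptions, since the slides do not specify them. Building the sub-graph around a seed is also outside what the slides describe and is not shown.

```python
from collections import defaultdict
from math import sqrt


def hits(edges, iterations=50):
    """Iterate the HITS rules over a list of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    in_links, out_links = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):  # fixed count is an assumption
        # auth(n) = sum of hub(m), for all m pointing to n
        auth = {n: sum(hub[m] for m in in_links[n]) for n in nodes}
        # hub(n) = sum of auth(m), for all m pointed to by n
        hub = {n: sum(auth[m] for m in out_links[n]) for n in nodes}
        # L2 normalisation keeps scores bounded (assumed, not from the slides)
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub


def top_authorities(auth, k=10):
    """Return the k highest-scoring pages, breaking ties by name."""
    return sorted(auth, key=lambda n: (-auth[n], n))[:k]


def merge_authority_lists(lists, min_shared=3):
    """Merge authority lists that share at least min_shared URLs.

    Merging repeats until no pair of clusters overlaps enough; whether
    the original work merges transitively like this is an assumption.
    """
    clusters = [set(urls) for urls in lists]
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i] & clusters[j]) >= min_shared:
                    clusters[i] |= clusters.pop(j)
                    merged_any = True
                    break
            if merged_any:
                break
    return clusters
```

In this sketch, each top 10 list obtained by re-running `hits` with an authority as the next seed would be passed to `merge_authority_lists`, and the returned sets would play the role of the result communities.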
