This paper discusses VIPAS, an advanced search mechanism developed at National Taiwan University, which enhances traditional link-analysis methods by incorporating "virtual links". Traditional search engines rely heavily on keyword-based ranking, often overlooking relevant pages without matching keywords. VIPAS analyzes user interaction data to create virtual links that bolster authority scores for frequently accessed pages, ultimately aiming to improve the relevance of search results. The paper details its framework, algorithms, and experimental findings demonstrating VIPAS's effectiveness in meeting user information needs.
VIPAS: Virtual Link Powered Authority Search in the Web
Chi-Chun Lin and Ming-Syan Chen
Network Database Laboratory, National Taiwan University
Outline • Motivation and Goal • Preliminaries and Related work • Introduction to Link-analysis • Defects of Traditional Link-analysis and Ideas for Improvement • System Framework and Algorithms • Implementation and Experimental Results • Conclusions NTU
Motivation and Goal • To find the most relevant pages satisfying the user’s information need in the Web • Traditional means for this task • Keyword-based search engines • Problems • Some relevant pages do not contain the keywords in the page text • An alternative method • Analyze the links contained in Web pages instead of ranking by keywords
HITS (1/3) • Authority pages • A page pointed to by many other pages • Hub pages • A page pointing to many other pages • Mutual reinforcement • An authority pointed to by many hub pages is an even better authority • A hub pointing to many authority pages is an even better hub • Based on this argument, the goal of HITS is to find the set of best authority pages
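HITS (2/3) • Let $x_p$ and $y_p$ denote the authority and hub scores of page p, respectively
• Authority update: $x_p := \sum_{q \to p} y_q$ (sum over all pages q that link to p; in the figure, q1, q2, and q3 point to p)
• Hub update: $y_p := \sum_{p \to q} x_q$ (sum over all pages q that p links to; in the figure, p points to q1, q2, and q3)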
HITS (3/3) • Iterative algorithm
1. Obtain a set of Web pages using a keyword-based query and expand it to form a base set
2. Assign each page of the base set an initial authority score and hub score of 1
3. Update the scores of each page according to its links
4. Normalize the scores so that $\sum_p (x_p)^2 = 1$ and $\sum_p (y_p)^2 = 1$, summing over all p in the base set
5. Repeat steps 3 and 4 until the scores converge
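The iterative procedure on the HITS slides can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `hits`, the dictionary-based graph format, and the `max_iter` and `tol` parameters are assumptions, and the hub update here uses the freshly computed authority scores, which is one common ordering.

```python
import math


def hits(links, max_iter=100, tol=1e-9):
    """Run HITS on a base set.

    links: dict mapping each page to the list of pages it points to
           (hypothetical input format chosen for this sketch).
    Returns (authority, hub) dicts normalized to unit sum of squares.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)

    # Every page starts with authority and hub score 1.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values()))
        if norm == 0:
            return dict(scores)
        return {p: v / norm for p, v in scores.items()}

    for _ in range(max_iter):
        # x_p := sum of y_q over all q that link to p
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # y_p := sum of x_q over all q that p links to
        new_hub = {p: sum(new_auth[t] for t in links.get(p, ())) for p in pages}

        new_auth, new_hub = normalize(new_auth), normalize(new_hub)
        delta = max(
            max(abs(new_auth[p] - auth[p]) for p in pages),
            max(abs(new_hub[p] - hub[p]) for p in pages),
        )
        auth, hub = new_auth, new_hub
        if delta < tol:  # scores have converged
            break
    return auth, hub


# Toy graph: two hubs point to "a", only one of them points to "b".
authority, hubs = hits({"h1": ["a", "b"], "h2": ["a"], "a": []})
```

In the toy graph, page "a" ends up with a higher authority score than "b" because both hubs point to it, which is the mutual-reinforcement effect the slides describe.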
The Problem with HITS • Links in Web pages only reflect page creators’ judgment • Sometimes a link will not be put in a page even though its destination is very relevant • e.g., a company’s homepage will not link to a competitor in the same industry • We argue: page readers’ considerations should be of equal importance
The Notion of Virtual Links • The basic idea • Identify pages that are heavily accessed within a period, and form a “hot set” from these pages • Create “virtual links” for pages in the hot set and incorporate them into the computation of authority scores • Design a Web warehouse for this task and utilize it to identify authoritative Web pages
System Framework • [Figure: system architecture. Components: Web Pages, Page Archive, Query Interface, Keyword & Ranking Database, Virtual Link Creator, Authority Evaluator, Clickstream Database, Clicking Observer. Labeled data flows: page content & links, keywords, virtual links, scores, query results]
Creating Virtual Links • Scenario: A user interested in Java-related Web pages comes to our system • She submits a query with the keyword “java” • Assume that the query result contains 100 URLs • She clicks each of the top 10 URLs except the 6th • The hot set consists of the 9 URLs clicked
Creating Virtual Links (cont’d) • 2 criteria • [Figure: links drawn from Hub 1, Hub 2, …, Hub n and from a Virtual Hub to URL 1, URL 2, URL 5, URL 6, URL 7, and URL 10]
Algorithm VIPAS (Virtual Link Powered Authority Search) • Initialization Phase • For a query term, perform the regular HITS analysis • Collect a base set of pages with computed authority and hub scores and store them in the database • Virtual Link Collection Phase • Monitor user behavior to see whether each URL in the result list is clicked • After a period of observation, put URLs that are often accessed into the “hot set” • Create virtual links for pages in the hot set
Algorithm VIPAS (cont’d) • Refinement Phase • For each page in the hot set, compute its new authority and hub scores • Run several iterations of score updating for pages in the base set • 2 flavors • VIPAS-VH (VIPAS with virtual links from a Virtual Hub) • VIPAS-TH (VIPAS with virtual links from Top Hubs)
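One way to picture the refinement phase in the VIPAS-VH flavor is to add a single extra hub node that links to every hot-set page and then run a few more score-update iterations. The sketch below does that under loudly stated assumptions: the node name `VIRTUAL_HUB`, the function `refine_with_virtual_hub`, the choice to seed the virtual hub with the largest existing hub score, and the weighted update rule are all illustrative and are not taken from the slides.

```python
import math

VIRTUAL_HUB = "__virtual_hub__"  # hypothetical node name for this sketch


def _normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {p: v / norm for p, v in scores.items()}


def refine_with_virtual_hub(links, auth, hub, hot_weights, iterations=5):
    """Re-run a few weighted HITS-style updates with virtual links added.

    links:       page -> list of pages it points to (real links, weight 1)
    auth, hub:   scores stored after the initialization phase
    hot_weights: hot-set page -> virtual link weight (assumed to be given)
    """
    # Weighted edge list: real links plus virtual links from the virtual hub.
    edges = [(src, dst, 1.0) for src, dsts in links.items() for dst in dsts]
    edges += [(VIRTUAL_HUB, page, w) for page, w in hot_weights.items()]

    pages = {VIRTUAL_HUB} | {s for s, _, _ in edges} | {d for _, d, _ in edges}
    auth = {p: auth.get(p, 0.0) for p in pages}
    hub_scores = {p: hub.get(p, 0.0) for p in pages}
    # Assumption: the virtual hub starts as strong as the best real hub.
    hub_scores[VIRTUAL_HUB] = max(hub.values(), default=1.0)

    for _ in range(iterations):
        new_auth = {p: 0.0 for p in pages}
        for src, dst, w in edges:
            new_auth[dst] += w * hub_scores[src]
        new_hub = {p: 0.0 for p in pages}
        for src, dst, w in edges:
            new_hub[src] += w * new_auth[dst]
        auth, hub_scores = _normalize(new_auth), _normalize(new_hub)

    # The virtual hub is bookkeeping only; drop it from the results.
    auth.pop(VIRTUAL_HUB, None)
    hub_scores.pop(VIRTUAL_HUB, None)
    return auth, hub_scores


links = {"h1": ["a"], "h2": ["a"], "h3": ["b"]}
start = {p: 1.0 for p in ["a", "b", "h1", "h2", "h3"]}
plain_auth, _ = refine_with_virtual_hub(links, start, start, {})
boosted_auth, _ = refine_with_virtual_hub(links, start, start, {"b": 5.0})
```

With no virtual links, "a" outranks "b" because two hubs point to it; once "b" receives a heavily weighted virtual link, its authority overtakes "a". A VIPAS-TH variant would presumably attach the virtual links to existing top hubs instead of a new node.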
Finding Hot Sets • In an observing period, pay attention to runs of consecutively clicked URLs in the result list • When a user clicks several consecutive URLs and then skips some of the URLs that follow, we mark the skipped URLs • Pages marked with a frequency greater than a given threshold are excluded when forming the hot set • Among the remaining pages, those accessed by at least a given percentage of users are put into the hot set • Caveat: some relevant URLs will be skipped simply because the user has already browsed them
Finding Hot Sets (cont’d)
• Example 1: URL 4 is marked
  1. http://java.sun.com/ (clicked)
  2. http://www.sun.com/java/ (clicked)
  3. http://www.javaworld.com/ (clicked)
  4. http://java.oreilly.com/ (skipped)
  5. http://www.jars.com/ (clicked)
  …………..
• Example 2: URL 4 is marked, but URL 1 is not
  1. http://java.sun.com/ (skipped)
  2. http://www.sun.com/java/ (clicked)
  3. http://www.javaworld.com/ (clicked)
  4. http://java.oreilly.com/ (skipped)
  5. http://www.jars.com/ (clicked)
  …………..
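The marking rule and the hot-set filter can be sketched as below. Several details are assumptions made for illustration because the slides do not fix them: a "run" is taken to be at least `min_run = 2` consecutive clicks, a skipped URL is only marked if the user clicked something further down the list, each session stands for one user, and both thresholds (`max_mark_freq`, `min_user_fraction`) are expressed as fractions of sessions.

```python
from collections import Counter


def marked_skips(clicked, min_run=2):
    """Return the 0-based positions marked as skipped in one session.

    clicked: list of booleans, one per result position (True = clicked).
    A skipped position is marked when it follows a run of at least
    `min_run` consecutive clicks and lies before the user's last click.
    """
    if True not in clicked:
        return []
    last_click = len(clicked) - 1 - clicked[::-1].index(True)
    marked, run, marking = [], 0, False
    for pos in range(last_click):
        if clicked[pos]:
            run += 1
            marking = False
        else:
            if run >= min_run:
                marking = True  # this skip directly follows a click run
            run = 0
            if marking:
                marked.append(pos)
    return marked


def find_hot_set(sessions, urls, max_mark_freq, min_user_fraction):
    """Pick hot-set URLs from click sessions (one session per user)."""
    n = len(sessions)
    if n == 0:
        return []
    marks, clicks = Counter(), Counter()
    for clicked in sessions:
        for pos in marked_skips(clicked):
            marks[urls[pos]] += 1
        for pos, was_clicked in enumerate(clicked):
            if was_clicked:
                clicks[urls[pos]] += 1
    return [
        u for u in urls
        if marks[u] / n <= max_mark_freq and clicks[u] / n >= min_user_fraction
    ]


urls = [
    "http://java.sun.com/",
    "http://www.sun.com/java/",
    "http://www.javaworld.com/",
    "http://java.oreilly.com/",
    "http://www.jars.com/",
]
example_1 = [True, True, True, False, True]
example_2 = [False, True, True, False, True]
hot = find_hot_set([example_1, example_2], urls, 0.4, 0.5)
```

Both click patterns from this slide mark only the fourth URL, so with the illustrative thresholds above it is excluded and the other four URLs form the hot set.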
Assigning Weights to Virtual Links • n pages in the hot set: t1, t2, …, tn • Clickstream 1: (t1, t2, t3, t4, x1, x2) • Clickstream 2: (t3, x1, t1)
Assigning Weights to Virtual Links (cont’d) • Final weight: • For period $T_i$ where $i \ge 2$ (1/3 is the degeneration factor)
Computing the New Scores • Let xp and yp denote the authority and hub score of page p, respectively • For each page p, we update p’s authority score by • Similarly, we update p’s hub score by
User-behavior Observation • Use an ASP script • In the query result page (e.g., for the keyword “Java”), each plain URL such as http://java.sun.com/ is replaced by wrapper.asp?URL=http://java.sun.com/ • Example result entry: “The Source of Java(TM) Technology”, http://java.sun.com/ • When the user clicks the wrapped link, the script will: • Increment the click count of http://java.sun.com/ • Record the time • Redirect the user to http://java.sun.com/
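The slides implement the click observer as an ASP script. The Python sketch below only mirrors its three steps (count, timestamp, redirect) to make the flow concrete. The function names `wrap` and `handle_click`, the in-memory `click_log` dictionary, and the percent-encoding of the target URL are assumptions for illustration and are not part of the original system.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlencode, urlsplit


def wrap(url):
    """Rewrite a plain result URL so that clicks pass through the wrapper."""
    # Percent-encoding the target is an addition of this sketch.
    return "wrapper.asp?" + urlencode({"URL": url})


def handle_click(wrapped, click_log, now=None):
    """Record one click and return the URL the user should be redirected to.

    click_log: dict of url -> {"count": int, "times": [datetime, ...]},
               standing in for the clickstream database.
    """
    target = parse_qs(urlsplit(wrapped).query)["URL"][0]
    entry = click_log.setdefault(target, {"count": 0, "times": []})
    entry["count"] += 1                                        # increment the click count
    entry["times"].append(now or datetime.now(timezone.utc))   # record the time
    return target                                              # redirect destination


click_log = {}
link = wrap("http://java.sun.com/")
destination = handle_click(link, click_log)
```

A real deployment would answer with an HTTP redirect to `destination` and persist the log in a database rather than a dictionary.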
Implementation and Experiments • Experimental testbed • NTUEE website (http://www.ee.ntu.edu.tw/) • Data collection • 03/28/’02 ~ 05/31/’02 • Parameters
Evaluation Method • For a keyword, we manually select a list of authority pages and compare it with the output of each algorithm • Discrepancy coefficient
Discrepancy Coefficient: Regular HITS • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 41 (SN 7228)
Discrepancy Coefficient: VIPAS-VH • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 6 (SN 7228)
Evaluation Method • Grouping coefficient • Stability • The standard deviation of each algorithm’s discrepancy coefficients for all of the keywords
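The stability measure is a standard deviation over per-keyword discrepancy coefficients, which can be computed as in the short sketch below. The function name `stability`, the algorithm labels, and the coefficient values are made up for illustration, and the population standard deviation is assumed because the slides do not say which variant is used.

```python
from statistics import pstdev


def stability(coefficients_by_algorithm):
    """Map each algorithm to the standard deviation of its discrepancy
    coefficients across keywords (lower means more stable)."""
    return {
        name: pstdev(values)
        for name, values in coefficients_by_algorithm.items()
    }


# Hypothetical discrepancy coefficients, one value per keyword.
example = {
    "algorithm_A": [2, 4, 4, 4, 5, 5, 7, 9],
    "algorithm_B": [3, 3, 3, 3, 3, 3, 3, 3],
}
scores = stability(example)
```

In this made-up example, `algorithm_B` has zero spread across keywords and would be judged the more stable of the two.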
Grouping Coefficient: Regular HITS • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 41 (SN 7228)
Grouping Coefficient: VIPAS-VH • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 6 (SN 7228)
Conclusions • Link-analysis algorithms are popular in Web information retrieval • But they need further improvement • In our work, we built a Web warehouse • It incorporates user feedback into the identification of authoritative resources (Algorithm VIPAS) • Experimental results show that VIPAS is effective and that the warehouse retrieves more valuable information for users than regular HITS