Usage Meets Link Analysis: Towards Improving Intranet and Site Specific Search via Usage Statistics by Bilgehan Uygar Oztekin Thesis committee: Vipin Kumar, George Karypis, Jaideep Srivastava, Gediminas Adomavicius Dept. of Computer Science, University of Minnesota 200 Union Street SE, Minneapolis, MN 55455 Final Oral Presentation July 20, 2005
Outline • Aim: Improve ranking quality by incorporating usage statistics. • Summary of what is and is not addressed in this study. • Ranking methods • Existing link analysis approaches: PageRank and HITS • Usage based modifications/approaches: UPR, UHITS, and Counts • Infrastructure/modules. • Experimental results • Global comparisons/observations • Query dependent comparisons • Discussions
Ranking in search engines • Given a description of what the user is looking for: • How to select relevant documents? • How to order them? • Typically, a large number of items that match the criteria. • Coverage/recall tend to be relatively high without much effort. • Precision/ordering are the primary concerns.
Quality signals come to aid • Modern search engines use various quality measures. • Some are used to increase recall • Stemming • Synonyms/concept indexing • Query modification (expansion/rewrite). • More effort is spent for better precision/ordering: • One of the important milestones: • Link analysis, in particular, PageRank [Brin98]. • Topic sensitive link analysis 2002. • What is next? Usage statistics?
Why usage? • Most classical link analysis approaches see the world from author’s or site administrator’s point of view. • Algorithms mostly depend on static link structure, and to a small degree, on the page content. • Usage statistics offer additional, valuable information, telling the story from the user’s perspective. • If collected properly, it can be used to estimate • The probability that the average user will go to a particular page directly (page visit statistics). • The probability that the average user will follow a particular link (link usage statistics).
Question • Can we improve ranking quality by employing usage statistics? • How can we design and test usage based ranking algorithms? • Don’t have the resources, more importantly, the data to test it on a global scale. • We can test it in our own domain, i.e. intranet/site specific search. • Full usage statistics are available.
Internet vs. intranets • Are intranets similar to the Internet? In some ways, yes. • We still have documents and links of similar nature. • Smaller scale. • Major differences of particular interest: • Intranets tend to have poorer connectivity • Link analysis algorithms are not as effective as in the case of web search. • Heuristics that work well for link analysis on web search may not be applicable (e.g. site level aggregation approaches, applying link analysis on the site level). • Extensive usage statistics for the domain/intranet are available. • There is no incentive for spamming the search results within a single organization. The aim is to provide the information as well as possible (mostly cooperation, not competition). • Some of these observations make intranet/site specific search a prime candidate for usage based signals.
Scope of the project: Algorithms • Developed and implemented a number of usage based quality measures: • UPR, Usage Aware PageRank [Oztekin03], a novel PageRank variant incorporating usage statistics. • UHITS, usage augmented HITS (Hypertext Induced Topic Search), a generalization of [Miller01]. • A naïve usage based quality measure, a baseline benchmark. • Developed an improvement to UPR, which makes spamming more difficult/expensive using a modified counting scheme. • Implemented two major classical link analysis algorithms: • PageRank • HITS
Scope: Infrastructure • Implemented a simple crawler and took a snapshot of *.cs.umn.edu domain (20K core URLs with full usage statistics, 65K total URLs). • Processed 6 months worth of usage logs around the snapshot. • Built static and usage based site-graphs. • Developed a full fledged site specific search engine, USearch, http://usearch.cs.umn.edu/, along with various supporting components to evaluate the algorithms in a real-life search setting.
Scope: Experiments • Studied the effects of adding various degrees of usage information. • Sample the parameter space. • Studied the effects of the suggested improvement (modified counting scheme). • Compared various methods in terms of • Distinguishing power they offer. • The nature of the score distributions they provide. • How they are correlated with other methods. • How well they perform in a query based setting. • Designed 3 sets of query based experiments: • two experiments based on blind human evaluations. • one experiment independent of user judgments.
Items that are not addressed • Test the algorithms on a larger scale • Large intranets (e.g. a few million pages). • Internet. • Test a few aspects that did not occur in our domain. • Spamming. • Uneven usage availability. • Gracefully converges to PageRank, but studying this on larger scale would have been interesting. • Conduct implicit user judgment based evaluations which may offer higher coverage/statistical significance. • Not enough usage base. • Parallel/distributed implementations of UPR or other methods. • Have ideas/suggestions for reasonable implementations.
PageRank [Brin98] • Based on a random walk model • Random walk user has: • Equal probability to follow each link in a given page • Equal probability to go to each page directly • Under certain conditions, forms a probability distribution. • Relatively inexpensive. • Stable: In general, changing a subset of nodes / links does not affect overall scores dramatically. • It scales well to the Web. • Relatively resistant against spamming.
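PageRank (PR) and Usage aware PageRank (UPR)
PageRank (PR):
$$PR(p) = \frac{1-d}{n} + d \sum_{i \to p} \frac{PR(i)}{C(i)}$$
Usage aware PageRank (UPR):
$$UPR(p) = (1-d)\left[\frac{1-a_1}{n} + a_1 W_{direct}(p)\right] + d \sum_{i \to p} UPR(i)\left[\frac{1-a_2}{C(i)} + a_2 \frac{W_{link}(i \to p)}{W_{total}(i)}\right]$$
d is the damping factor. n is the total number of pages in the dataset. a1 controls the usage emphasis in initial weights of pages. a2 controls the usage emphasis on the links. C(i) is the number of outgoing links from page i. Wdirect(p) : estimated probability to go to page p directly. Wlink(i→p) : weight of the link from page i to page p in the usage graph. Wtotal(i) : total weight of outgoing links from page i in the usage graph.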
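The update rule described by these definitions can be sketched as a plain power iteration. This is a minimal illustration, not the implementation used in the study: it assumes the UPR score blends a uniform jump with Wdirect (weighted by a1) and a uniform link choice with Wlink/Wtotal (weighted by a2). The function name, the dictionary-based graph layout, the fixed iteration count, and the fallback to the static link share when a page has no recorded link usage are all assumptions made for the example.

```python
def usage_aware_pagerank(links, w_direct, w_link, d=0.85, a1=0.5, a2=0.5, iters=50):
    """Iterate a UPR-style update on a small in-memory graph (illustrative sketch).

    links    : dict page -> list of distinct pages it links to (static graph);
               every link target must itself be a key of `links`
    w_direct : dict page -> estimated probability of a direct visit (sums to 1)
    w_link   : dict (i, p) -> usage weight of the link i -> p
    """
    pages = sorted(links)
    n = len(pages)
    # W_total(i): total usage weight over the outgoing links of page i
    w_total = {i: sum(w_link.get((i, p), 0.0) for p in links[i]) for i in pages}
    score = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Jump term: mix of a uniform jump and the direct-visit distribution
        new = {p: (1 - d) * ((1 - a1) / n + a1 * w_direct.get(p, 0.0)) for p in pages}
        for i in pages:
            out = links[i]
            for p in out:
                static_share = 1.0 / len(out)
                if w_total[i] > 0:
                    usage_share = w_link.get((i, p), 0.0) / w_total[i]
                else:
                    # Assumption: no link usage recorded for page i,
                    # so fall back to the static (uniform) share
                    usage_share = static_share
                new[p] += d * score[i] * ((1 - a2) * static_share + a2 * usage_share)
        score = new
    return score


# Toy graph with hypothetical usage numbers
links = {"home": ["a", "b"], "a": ["home"], "b": ["home", "a"]}
w_direct = {"home": 0.1, "a": 0.1, "b": 0.8}
w_link = {("home", "a"): 1, ("home", "b"): 9, ("a", "home"): 1,
          ("b", "home"): 1, ("b", "a"): 1}
pr = usage_aware_pagerank(links, w_direct, w_link, a1=0.0, a2=0.0)   # plain PageRank
upr = usage_aware_pagerank(links, w_direct, w_link, a1=1.0, a2=1.0)  # pure usage
```

With a1=a2=0 the usage inputs have no effect and the iteration is ordinary PageRank; in the toy data, the heavily used page "b" scores higher under pure usage than under PageRank. Pages without outgoing links are not treated specially in this sketch.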
UPR implementation • Web logs: • Referrer field is empty: user visited the page directly. • Referrer field is non-empty: user followed a link from the referrer page to the target page. • Simple approach: Directly use counts to approximate the probabilities. • Wdirect(p) = count(directly visiting p) / ∑_i count(directly visiting i) • Wlink(i→p) = count(link from page i to p is traversed) • Wtotal(i) = total number of times an outgoing link from page i is traversed. • Wlink(i→p) / Wtotal(i) = estimated probability of following the link i→p, given that the user follows a link from page i.
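The simple counting approach above can be sketched as follows. The input format is an assumption: each log entry is reduced to a hypothetical (referrer, target) pair, with an empty referrer meaning a direct visit. Real web logs would first need parsing and URL normalization, which is not shown.

```python
from collections import Counter


def usage_weights(log_entries):
    """Derive W_direct, W_link and W_total from (referrer, target) pairs.

    An empty referrer is counted as a direct visit to `target`;
    a non-empty referrer is counted as one traversal of referrer -> target.
    """
    direct = Counter()
    link = Counter()
    for referrer, target in log_entries:
        if referrer:
            link[(referrer, target)] += 1
        else:
            direct[target] += 1

    total_direct = sum(direct.values())
    # W_direct(p): share of all direct visits that went to p
    w_direct = {p: c / total_direct for p, c in direct.items()} if total_direct else {}

    # W_total(i): number of times any outgoing link of page i was traversed
    w_total = Counter()
    for (i, _p), c in link.items():
        w_total[i] += c
    return w_direct, dict(link), dict(w_total)


# Hypothetical log: three direct visits, three link traversals from "/"
entries = [("", "/"), ("", "/"), ("", "/a"),
           ("/", "/a"), ("/", "/a"), ("/", "/b")]
w_direct, w_link, w_total = usage_weights(entries)
follow_prob = w_link[("/", "/a")] / w_total["/"]  # estimated P(follow / -> /a)
```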
Improvement • Idea: Many people accessing a page/link a total of x times should count more than a few people accessing the page/link a total of x times. Divide the dataset into time windows. For each window, use modified_count=log2(1+count) • Same as counts if everybody accesses the page once. • If the same person (IP number) accesses the page/link more than once in the same time window, subsequent contributions are lowered. • Makes the system more robust against usage based spamming and reduces some undesirable effects (e.g. browser homepages).
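A minimal sketch of the modified counting scheme, under one reading of the slide: the count is taken per IP per time window, log2(1+count) is applied to each such count, and the results are summed per page. The (ip, timestamp, page) input format and the one-day window length are assumptions for illustration; the window length used in the study is not stated here.

```python
import math
from collections import Counter, defaultdict


def modified_counts(hits, window_seconds=86400):
    """Sum log2(1 + count) over (ip, time window) groups for each page.

    hits: iterable of (ip, unix_timestamp, page) tuples (hypothetical format).
    """
    per_visitor = Counter()
    for ip, timestamp, page in hits:
        window = int(timestamp // window_seconds)
        per_visitor[(ip, window, page)] += 1

    totals = defaultdict(float)
    for (_ip, _window, page), count in per_visitor.items():
        # First access contributes log2(2) = 1; repeats add progressively less
        totals[page] += math.log2(1 + count)
    return dict(totals)


# Eight different visitors, one hit each, versus one visitor with seven hits
many_users = [(f"10.0.0.{k}", 100, "/popular") for k in range(8)]
one_user = [("10.0.0.99", 100 + k, "/ip") for k in range(7)]
scores = modified_counts(many_users + one_user)
```

In this toy data the page visited once by eight visitors keeps a modified count of 8, while seven hits from a single visitor in one window collapse to log2(8) = 3.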
UPR properties • Inherits basic PageRank properties. • Usage information can be updated incrementally and efficiently. • UPR iterations have time complexity similar to PageRank iterations (only a very small constant factor more expensive). • Usage importance can be controlled smoothly via the parameters a1 and a2 (used as sliders, 0=pure static structure, 1=pure usage). • Can also work with limited usage information (it gradually converges to PR as less and less usage information is available; in the extreme case it reduces to PR).
HITS [Kleinberg99] • Two scores are assigned to each page: Hub and authority. • A good authority is a page that is pointed to by good hubs, and a good hub is a page that points to good authorities. • Not as stable and scalable as PageRank • Mostly used for a limited number of documents (e.g. in the context of focused crawling).
HITS and UHITS
HITS:
$$a = A^{T} h, \qquad h = A\,a$$
UHITS:
$$A_m = (1-\alpha)\,A_{static} + \alpha\,A_{usage}, \qquad a = A_m^{T} h, \qquad h = A_m\,a$$
a is the authority score vector. h is the hub score vector. A is the adjacency matrix. A(i,j) is nonzero if there is a link from i to j, 0 otherwise. Am is the modified adjacency matrix. Astatic (same as A) is the adjacency matrix in static linkage graph. Ausage is the adjacency matrix in usage based linkage graph. α controls usage emphasis. Adjacency matrices are normalized on outgoing links. UHITS is a generalization of [Miller01]
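The definitions above can be sketched as an iteration over a blended adjacency matrix. This is an illustrative sketch, assuming Am = (1-α)·Astatic + α·Ausage with both matrices normalized per row (outgoing links), and the usual alternating authority/hub updates. The nested-dictionary representation, the Euclidean normalization after each step, and the fixed iteration count are choices made for the example.

```python
import math


def row_normalize(weights):
    """weights: dict i -> dict j -> weight. Each non-empty row is scaled to sum to 1."""
    out = {}
    for i, row in weights.items():
        total = sum(row.values())
        out[i] = {j: w / total for j, w in row.items()} if total > 0 else {}
    return out


def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm > 0 else vec


def uhits(static_links, usage_links, alpha=0.5, iters=50):
    a_static = row_normalize(static_links)
    a_usage = row_normalize(usage_links)
    nodes = set(static_links) | set(usage_links)
    for rows in (static_links, usage_links):
        for row in rows.values():
            nodes |= set(row)

    # A_m = (1 - alpha) * A_static + alpha * A_usage
    a_m = {}
    for i in nodes:
        row = {}
        for j, w in a_static.get(i, {}).items():
            row[j] = row.get(j, 0.0) + (1 - alpha) * w
        for j, w in a_usage.get(i, {}).items():
            row[j] = row.get(j, 0.0) + alpha * w
        a_m[i] = row

    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}
    for _ in range(iters):
        # a = A_m^T h
        auth = {p: 0.0 for p in nodes}
        for i, row in a_m.items():
            for j, w in row.items():
                auth[j] += w * hub[i]
        auth = normalize(auth)
        # h = A_m a
        hub = normalize({i: sum(w * auth[j] for j, w in row.items())
                         for i, row in a_m.items()})
    return auth, hub


# Toy graphs: the link y -> u exists statically but was never used
static_links = {"x": {"t": 1}, "y": {"t": 1, "u": 1}}
usage_links = {"x": {"t": 5}}
auth_static, _ = uhits(static_links, usage_links, alpha=0.0)  # plain HITS on normalized A
auth_usage, _ = uhits(static_links, usage_links, alpha=1.0)   # usage graph only
```

With α=1 only links present in the usage graph carry weight, so in the toy data the authority score of "u" drops to zero, while any α<1 keeps a contribution from the static link.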
Naïve approach • Counts: use number of visits to a page as a direct quality measure for that page. • Score(p) = count(p) / ∑_i count(i) • MCounts: instead of direct counts, use modified counts (log transform) • Score(p) = mcount(p) / ∑_i mcount(i) • For a given time window, contribution of subsequent accesses from the same IP number is reduced. • Sum of scores is 1.
Previous usage based approaches • [Schapira99] or Directhit approach. • A page’s score is proportional to the number of times users clicked the URL of the page from search engine results (if you record them). • Somewhat similar to naïve approaches considered in this study, but instead of using global statistics, statistics available to the search engine’s web server are used. • Potential self-reinforcement problem. • [Miller01]: Modification to HITS. Adjacency matrix is replaced by usage based adjacency matrix (similar to UHITS where parameter α is set to 1). • Unlike UHITS, if a link does not appear within the usage based graph, it is ignored. Thus scores of many pages may converge to zero. • Practically no experimental results. • [Zhu01]: PageRate, claims to be a modification of PageRank, but has very different properties. • Does not have the basic PageRank properties. • Normalization is done on incoming links. • Requires non-trivial additional nodes, or normalization steps. • Score of a page is not distributed to the pages it points to, proportional to the weights of the links. • Unlike UPR, it does not use page visit statistics. • No experimental results provided.
Impact of proposed algorithms • In particular, UPR has a number of properties that the two previous approaches to incorporating usage statistics into link analysis did not offer: • Ability to control usage emphasis. • Ability to work with limited usage. • Inherits basic PageRank properties: • Stable • Scalable • Normalized scores • Intuitive interpretation
Overview • Global comparisons • Effects of usage. • Correlation between various methods. • Distinguishing power different methods offer. • Query dependent comparisons • Compare a representative of each method in a query dependent setting via blind evaluations. • Design an experiment in which the relevant documents can be selected objectively without user feedback (A very specific query type).
Experimental Setup • Methods under comparison: • Counts: use number of visits to pages as a direct quality measure • MCounts: similar to Counts, but contribution of the same IP in the same time window is decaying after the first visit (modified counting scheme). • PR: PageRank (Brin and Page) • UPR: sampled a1 and a2 in 0.25 increments. A total of 25 score vectors computed for UPR (a1=a2=0 is the same as PR). Damping factor was set to 0.85. • HITS: authorities and hubs (Kleinberg). • UHITS: HITS modified via usage statistics. Usage emphasis is sampled in 0.25 increments (0 is the same as HITS). Total of 5 samples from each of the score vectors (authority, hub).
Effects of usage: PR vs. UPR • Using PR, pages with high connectivity such as manual pages, FAQs, and discussion groups dominate top ranks. The department’s main page is at the 136th position. • As more usage emphasis is introduced, an increasing number of highly used pages (e.g. user home pages) start to appear in top ranks. For a1=a2=0.25, the department’s main page is at the 6th position; for values above 0.5, it is the highest ranked page. • Divided PR by UPR(a1=a2=1.0) and sorted the list in ascending order: 389 out of the top 500 pages have “~” in the URL, compared to 79 out of 500 at the bottom of the list. UPR boosts highly used pages including user homepages.
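The ratio analysis in the last bullet can be sketched in a few lines. The function name and the toy scores are hypothetical; the sketch only shows the procedure (divide PR by UPR, sort ascending, count URLs containing "~" at both ends of the list).

```python
def tilde_counts(pr, upr, k=500):
    """Sort pages by PR/UPR ascending and count '~' URLs among the first and last k.

    Pages at the start of the list are those boosted most by UPR relative to PR.
    """
    ratio = sorted((pr[u] / upr[u], u) for u in pr if upr.get(u, 0.0) > 0)
    top = [u for _, u in ratio[:k]]
    bottom = [u for _, u in ratio[-k:]]
    return sum("~" in u for u in top), sum("~" in u for u in bottom)


# Toy scores: user pages gain under UPR, highly linked manual pages lose
pr = {"/~u1/": 0.1, "/man/ls": 0.5, "/~u2/": 0.2, "/faq": 0.4}
upr = {"/~u1/": 0.4, "/man/ls": 0.1, "/~u2/": 0.4, "/faq": 0.1}
top_tildes, bottom_tildes = tilde_counts(pr, upr, k=2)
```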
Effects of suggested improvement UPR simple/UPR modified. Does not affect most URLs.
Effects of suggested improvement UPR simple/UPR modified. Top and bottom 500 pages.
Effects of suggested improvement • Scores of most pages did not change dramatically. • However, it helped reduce the scores of pages that are accessed by very few users a large number of times. • www-users.cs.umn.edu/~*userone*/ip • Likely to be used as a poor man’s dynamic DNS solution: Home computer’s IP address uploaded periodically, and checked frequently to obtain the latest IP. • Ranks with emphasis values 1, 0.75, 0.5: • UPR simple: 2nd, 6th, 10th • UPR modified: 130th, 180th, 329th • www-users.cs.umn.edu/~*usertwo*/links.html • Gateway to various search engines, as well as local and global links. Used as homepage by a few graduate students, producing a hit every time they open up a browser. • Ranks with emphasis values 1, 0.75, 0.5: • UPR simple: 3rd, 10th, 18th • UPR modified: 40th, 67th, 116th
Pairwise correlations • Focusing on the internal 20k nodes, for each method pair, calculate • Pearson correlation (compares scores) • Spearman correlation (compares ordering) • Pearson correlation does not tell the whole story: Score distribution in link analysis is often exponential. Pearson correlation effectively measures the correlation between highly scored pages. • Spearman correlation focuses on the ordering. Seemingly highly correlated scores may suggest relatively different rank order: • UPR(0.5, 0.5) vs. Counts has a Pearson correlation of 0.96, the Spearman correlation is relatively lower, 0.38. It is likely that different methods suggest different orderings even in top 5 positions.
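To illustrate why the two coefficients can disagree, here is a small self-contained sketch with hand-made score vectors (not data from the study). Two methods agree on the few high-scoring pages but order the low-scoring tail in opposite ways; the large scores dominate the Pearson correlation, while the Spearman correlation reflects the disagreement in ordering.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the rank vectors
    return pearson(ranks(x), ranks(y))


# Hypothetical scores: same top two pages, reversed order in the tail
method_a = [100, 50, 4, 3, 2, 1]
method_b = [90, 60, 1, 2, 3, 4]
r_pearson = pearson(method_a, method_b)
r_spearman = spearman(method_a, method_b)
```

For these vectors the Pearson coefficient is close to 1, while the Spearman coefficient is 3/7, well below it.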
Observations • PageRank variants offer a global ordering and behave smoothly as the parameters are changed. • UPR is more correlated with PR when usage emphasis is low, and more correlated with Counts/MCounts when usage emphasis is high. • Counts/MCounts behave reasonably well, but have a step-like behavior. • HITS variants fail to offer a smooth global ordering between pages. A significant portion of scores converge to zero, especially with higher usage emphasis.
Query dependent evaluations • Evaluation version of USearch • Top 50 documents according to similarity to query are presented to the user. • No selection of methods; always uses cosine similarity. • User selects up to 5 relevant documents. • Top 50 results are reranked with methods under comparison. • Average position of the selected documents is calculated for each method (lower is better). • Selected methods: • PageRank, UPR(0.75), MCounts, HITS authority/hub, UHITS(0.75) authority/hub.
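The evaluation measure can be sketched as follows. This is an illustration under assumptions, not the USearch code: `candidates` is taken to be the list of retrieved documents in cosine-similarity order, `scores` holds one method's quality score per document, and documents with equal scores keep their cosine order because Python's sort is stable.

```python
def average_position(candidates, selected, scores):
    """Rerank `candidates` by `scores` (descending) and average the 1-based
    positions of the user-selected documents. Lower is better.

    Ties keep the original (cosine) order because sorted() is stable.
    """
    reranked = sorted(candidates, key=lambda doc: scores.get(doc, 0.0), reverse=True)
    positions = [reranked.index(doc) + 1 for doc in selected]
    return sum(positions) / len(positions)


# Hypothetical query: four retrieved documents, the user marked "d3" as relevant
candidates = ["d1", "d2", "d3", "d4"]               # cosine order
selected = ["d3"]
method_scores = {"d3": 0.9, "d1": 0.1}              # "d2" and "d4" are unscored
cosine_only = average_position(candidates, selected, {})
with_method = average_position(candidates, selected, method_scores)
```

With no scores the cosine order is kept and "d3" stays at position 3; with the hypothetical method scores it moves to position 1.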
Query dependent comparisons • 3 sets of experiments with different characteristics: • Public evaluations (106 queries): • Everyone can participate. Announced to grads/faculty. • Queries selected by users. • Less accurate results per query. • Queries tend to be more general. • Presentation bias probable. • Arbitrated set (22 queries) • Queries selected and evaluated by 5 grad students depending on their interests. • More specific queries. • Good quality results are discussed before issuing the query. • Results examined and discussed by at least 2 raters. • Presentation bias is less likely. • Name search (25 queries) • 25 users are randomly selected from CS directory. • Finger utility is used to obtain the username. • Last name is issued as the query. • Average position of the user homepages is calculated for each method. • Presentation bias is not applicable.
Results • UPR behaves consistently better than other methods. • MCounts behaves reasonably well. • UHITS fails to offer a distinction between pages in general, except for the top few documents with very high usage (cosine order is preserved for the rest). • Cosine order is better than other methods in the Public evaluation set. Potential bias towards presentation order.
Discussions • Applicability to web search • Search engines can also collect general usage information: • Specialized browsers/tools/proxies (e.g. toolbars, Google web accelerator). • There is no reason why UPR cannot be computed on a larger scale: • Leading search engines have the infrastructure to compute UPR. For Google: • MapReduce [Dean 2004]: Divide the data set among a large number of machines. Minimal inter-communication requirements. UPR can be implemented in terms of almost embarrassingly parallel tasks. • GFS [Ghemawat 2003]: Transparent, large scale storage.
Discussions (cont) • Is there a reason why web search engines may be reluctant to use extensive usage information? • Spamming… • An arms race. • As long as cost of spamming is relatively low compared to its commercial benefits, spamming is likely to remain an issue. • Search engines introduce new signals to increase the cost of spamming.