This paper proposes a framework for mining user behavior data from web sessions to improve web search ranking and the user experience. It focuses on session context models and uses session clustering and the ClickRank algorithm.
 
                
Mining Rich Session Context to Improve Web Search • Guangyu Zhu (University of Maryland) • Gilad Mishne (Yahoo! Labs)
Motivations • To propose an efficient and scalable framework for mining general web user behavior data • Query/click logs are useful, but limited (< 5% of traffic) • All user actions count • The web and web user behaviors both constantly evolve • Focus on sessions of general web browsing activities • A logical unit that is general across all categories • To learn the preferences, intents, and judgment of users from rich contextual information • To learn session context models to improve core web search ranking, and other web search experience
Roadmap • Motivations • Mining web sessions • ClickRank • Applications to web search • Site ranking • Page ranking • Mining dynamic quicklinks
Session identification • We define a session as an active trail of user clicks represented by the URL referral structure • A new session starts • After 30 minutes of inactivity • On occurrence of a URL without a referrer URL • We used aggregate, anonymous general user behavior data collected by Yahoo! Toolbar • 30 billion events over a 6-month period in 2008 • {cookie, timestamp, URL, referral URL, event attributes} • No personal information in source data
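The two session-boundary rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Event` type, the `split_sessions` helper, and the use of seconds for timestamps are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple, Optional

SESSION_TIMEOUT_SECONDS = 30 * 60  # 30 minutes of inactivity, as stated on the slide


class Event(NamedTuple):
    cookie: str               # anonymous browser identifier
    timestamp: float          # seconds (assumed unit)
    url: str
    referrer: Optional[str]   # None or "" when the event has no referrer URL


def split_sessions(events):
    """Group events by cookie, then cut each stream at the two boundary rules."""
    by_cookie = defaultdict(list)
    for event in events:
        by_cookie[event.cookie].append(event)

    sessions = []
    for user_events in by_cookie.values():
        user_events.sort(key=lambda e: e.timestamp)
        current = []
        for event in user_events:
            if current:
                idle = event.timestamp - current[-1].timestamp > SESSION_TIMEOUT_SECONDS
                # A long gap or a missing referrer starts a new session.
                if idle or not event.referrer:
                    sessions.append(current)
                    current = []
            current.append(event)
        if current:
            sessions.append(current)
    return sessions
```

The sketch treats each cookie's events as a single stream. Following the referral structure across interleaved tabs would need an extra lookup from each referrer to the session that contains it.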
Session characteristics • Search sessions account for less than 5% of users' online activity • A web session contains significantly richer activity context and diversity than a search session
Session characteristics • The events per session and session duration exhibit power law behaviors in web-scale general user behavior data sources
Histogram session representation • We compute a distribution of activities over structured intents, given a list of URLs and their intent interpretations • 7-dimensional feature vector for each session: the histogram representation of the session, the total number of events in the session, and the session duration • Sessions are highly diverse • Use PCA to reduce dimensions • The first 6 eigenvalues are significant
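As a rough sketch of the feature vector described above, the function below builds five intent percentages plus the event count and the duration. The intent names are taken from the cluster table in the next slide; the `session_features` helper and the assumption that every event already carries exactly one intent label are illustrative.

```python
INTENTS = ("search", "mail", "information", "rich_content", "shopping")


def session_features(intents, timestamps):
    """Return [5 intent percentages, total events, session duration].

    intents    -- one intent label per event (assumed to be given upstream)
    timestamps -- one timestamp per event, in any consistent time unit
    """
    total = len(intents)
    if total == 0:
        raise ValueError("a session needs at least one event")
    histogram = [
        100.0 * sum(1 for label in intents if label == name) / total
        for name in INTENTS
    ]
    duration = max(timestamps) - min(timestamps)
    return histogram + [float(total), float(duration)]
```

For example, a four-event session with two search events, one mail event, and one shopping event spanning 120 time units yields `[50.0, 25.0, 0.0, 0.0, 25.0, 4.0, 120.0]`.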
Session categorization • Cluster centroids
Cluster#       Full Data         1         2         3         4         5         6         7         8         9        10
Share               100%     29.8%     16.6%     14.3%     11.9%     11.0%      4.7%      4.6%      3.5%      2.1%      1.5%
============================================================================================================================
Search            23.630     0.340    98.430     1.190     2.350     2.350    56.180    41.520    52.230     6.460     0.090
Mail              16.810     0.070     0.660    97.250     0.390     0.400     1.290    51.790     0.710     9.790     0.080
Information       12.260     0.040     0.270     0.390     1.030    96.500    24.580     2.650     0.500     5.970     0.020
Rich content      34.320    99.420     0.370     0.650     0.450     0.360     0.640     0.950    45.250    60.510    99.540
Shopping          12.850     0.080     0.240     0.410    95.670     0.290    16.920     2.600     0.860    16.840     0.060
Total events       9.040    11.140     2.890     5.660     6.250     5.330     4.240     5.380     4.260     7.850   151.680
Total time       420.300   532.490   261.370   303.850   235.780   298.910   228.400   455.580   218.010   439.780  4237.650
Cluster annotations in the figure: addiction to content-rich websites; collecting info during shopping; browsing content-rich websites; reformulating search queries; reading email; informational queries; navigational queries
• Intent-driven web browsing patterns emerge from session clusters • K-means clustering is sufficient to reveal meaningful intent patterns, such as long sessions of content browsing and query reformulation • Simple and effective
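For readers unfamiliar with k-means, here is a small pure-Python version of Lloyd's algorithm applied to session feature vectors. It is a generic sketch, not the configuration used in the slides: the `kmeans` function, the random initialisation, and the fixed iteration count are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, point in enumerate(points):
            assignment[idx] = min(
                range(k), key=lambda c: squared_distance(point, centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment
```

Because the intent percentages and the total time live on very different scales, features would normally be rescaled or projected (the earlier slide mentions PCA) before clustering.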
Roadmap • Motivations • Mining web sessions • ClickRank • Applications to web search • Site ranking • Page ranking • Mining dynamic quicklinks
ClickRank Overview • ClickRank is derived from contextual indicators of user preferences and judgment in general web sessions • Dwell time on the page • Click order in the session • Page load time • Frequency of occurrence in the session • We compute a local ClickRank function for each visited page in a session by incorporating session context models, and then aggregate these values to obtain the global ClickRank
Local ClickRank • We define the local ClickRank function of a page in a session from two weight functions and an indicator function • The first weight function is computed from the rank of the page visit event in the session • The second weight function is computed from temporal information associated with browsing of the page
ClickRank incorporates click order • We define a weight function for an event of a given rank in a session with a given total number of events • Motivated by experiments on implicit user preference judgments in Joachims et al., SIGIR 2005 • The weight function is monotonically decreasing w.r.t. the rank of the event within a session • The mean and variance of the local ClickRank function are finite
ClickRank incorporates temporal signals • We define another weight function to incorporate more temporal information, based on the dwell time on the page and the page load time, both normalized w.r.t. the entire session • The indicator function defines a filter that factors in the time range of interest
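The slides name the ingredients of the local ClickRank function (a rank weight, a temporal weight, and an indicator function), but the formulas themselves are not reproduced in this transcript. The sketch below therefore uses stand-in formulas that are purely hypothetical: `rank_weight` is one possible monotonically decreasing weight, `temporal_weight` is one possible combination of session-normalized dwell and load time, and multiplying the two and summing per page is an assumed way of combining them.

```python
from typing import NamedTuple


class Visit(NamedTuple):
    url: str
    dwell_time: float  # time spent on the page
    load_time: float   # time the page took to load


def rank_weight(rank, total):
    """Hypothetical stand-in: decreases linearly with rank, sums to 1 per session."""
    return 2.0 * (total + 1 - rank) / (total * (total + 1))


def temporal_weight(visit, session_dwell, session_load):
    """Hypothetical stand-in: rewards long dwell time, penalizes slow loads."""
    dwell = visit.dwell_time / session_dwell if session_dwell > 0 else 0.0
    load = visit.load_time / session_load if session_load > 0 else 0.0
    return dwell / (1.0 + load)


def local_clickrank(session):
    """Map each URL visited in one session to an (assumed) local score."""
    total = len(session)
    session_dwell = sum(v.dwell_time for v in session)
    session_load = sum(v.load_time for v in session)
    scores = {}
    for rank, visit in enumerate(session, start=1):
        weight = rank_weight(rank, total) * temporal_weight(
            visit, session_dwell, session_load
        )
        # Accumulating per URL means only events that visit a page contribute to it.
        scores[visit.url] = scores.get(visit.url, 0.0) + weight
    return scores
```

The time-range filter mentioned on the slide is left out here to keep the example short.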
Global ClickRank • Given a set of web sessions, the global ClickRank is computed from local ClickRank functions by an aggregation function • The aggregation operators used to compute the global ClickRank are general • Sum, average, and filter, e.g. by criteria such as time and demography • Filtering sessions is much more flexible than filtering links
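A minimal sketch of the aggregation step, assuming each session has already been reduced to a dictionary of local scores. The `global_clickrank` function, the metadata dictionaries, and the choice to average over only the sessions in which a page appears are illustrative assumptions.

```python
from collections import defaultdict


def global_clickrank(sessions, operator="sum", keep_session=None):
    """Aggregate local scores into global scores.

    sessions     -- iterable of (metadata, {url: local_score}) pairs
    operator     -- "sum" or "average"
    keep_session -- optional predicate on metadata, used to filter sessions
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for metadata, local_scores in sessions:
        if keep_session is not None and not keep_session(metadata):
            continue  # session-level filter, e.g. by time or demography
        for url, value in local_scores.items():
            totals[url] += value
            counts[url] += 1
    if operator == "sum":
        return dict(totals)
    if operator == "average":
        return {url: totals[url] / counts[url] for url in totals}
    raise ValueError("unknown operator: " + operator)
```

Filtering happens on whole sessions before any scores are combined, which is what makes criteria such as a date range easy to apply.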
Theoretical framework of ClickRank • The local ClickRank function defines a random variable associated with a web page, given an observed session • Convergence property: as the number of observed sessions grows, the average of the local ClickRank values of a page converges to its expected value by the strong law of large numbers
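The convergence property can be written out as follows. The notation is assumed for illustration and is not taken from the slides: $X_j$ denotes the local ClickRank of a fixed page $p$ in the $j$-th session, and the sessions are treated as independent and identically distributed with finite mean $\mu_p$.

```latex
\[
  \frac{1}{n}\sum_{j=1}^{n} X_j
  \;\xrightarrow{\text{a.s.}}\;
  \mu_p = \mathbb{E}[X_1]
  \qquad \text{as } n \to \infty .
\]
```

Under these assumptions, an average-type global score stabilizes as more sessions are observed.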
Relation to graph-based models • ClickRank is based on an intentional surfer model • ClickRank is data driven • ClickRank does not embed rigid assumptions on the traversing scheme over the web • Better reflects users’ information needs and adapts faster to constantly changing user behaviors • Significantly more efficient and scalable compared to approaches based on explicit graph formulations • The ClickRank computational framework is well suited for distributed computing • ClickRank can be computed incrementally • One pass over the entire data and memory friendly
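One way to picture the incremental, distributed computation is a per-shard accumulator whose partial results can be merged. The `ClickRankAccumulator` class below is a hypothetical illustration of that pattern, not the authors' implementation.

```python
class ClickRankAccumulator:
    """Running per-URL sum and count; memory grows with distinct URLs only."""

    def __init__(self):
        self.total = {}
        self.count = {}

    def add_session(self, local_scores):
        """Fold one session's local scores in; the session can then be discarded."""
        for url, value in local_scores.items():
            self.total[url] = self.total.get(url, 0.0) + value
            self.count[url] = self.count.get(url, 0) + 1

    def merge(self, other):
        """Combine the partial result of another shard into this one."""
        for url, value in other.total.items():
            self.total[url] = self.total.get(url, 0.0) + value
            self.count[url] = self.count.get(url, 0) + other.count[url]
        return self

    def average(self, url):
        return self.total[url] / self.count[url]
```

Each worker makes a single pass over its own sessions, and new data can be folded into an existing accumulator without recomputing earlier results.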
Roadmap • Motivations • Mining web sessions • ClickRank • Applications to web search • Site ranking • Page ranking • Mining dynamic quicklinks
Applications to web search • Datasets • 3.3 billion web sessions extracted from Yahoo! Toolbar data over 6 months in 2008 • Site ranking • Compute ClickRank of 16.3 million websites in 56 minutes • Page ranking • Compute ClickRank of 3.1 billion web pages in 1 hour and 32 minutes
Site ranking • ClickRank is more reliable and richer than results computed using only static link structure * The BrowseRank results are cited from Liu et al., SIGIR’08, which used MSN Toolbar data
Page ranking methodology • We evaluated ClickRank in a state-of-the-art search engine with hundreds of ranking signals • We learn the ranking model using gradient boosted decision trees (GBDT) • We quantify the variable importance of individual features
Page ranking • We used a set of 9,000+ randomly sampled queries from search logs • We computed the ClickRank feature only for documents visited by more than 5 users over time • (Table: summary of the page ranking experiment)
Page ranking • The ClickRank value is quantized within the range of [0, 255], to mirror the setting in a production system • We used DCG and NDCG to quantitatively evaluate ranking performance
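For reference, the quantization and the evaluation metrics can be sketched as below. The slides do not specify the quantization scheme or the DCG variant, so the linear min-max mapping in `quantize` and the exponential-gain, log2-discount form of `dcg` are assumptions.

```python
import math


def quantize(value, lo, hi, levels=256):
    """Linearly map value from [lo, hi] to an integer in [0, levels - 1]."""
    if hi <= lo:
        return 0
    clipped = min(max(value, lo), hi)
    return int(round((clipped - lo) / (hi - lo) * (levels - 1)))


def dcg(relevances, k):
    """DCG@k with gain 2^rel - 1 and discount log2(position + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` is the list of graded relevance labels of the returned documents, in ranked order.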
Page ranking • The ClickRank feature improves web search by 1.02%, 0.97%, 1.11%, and 1.331% in DCG(1), DCG(5), DCG(10), and NDCG, respectively • A 1% gain over a production system is very significant • ClickRank affects 81.2% of the more than 9,000 queries and covers 62.5% of documents
Competitive insights of ClickRank • ClickRank brings higher improvements to long queries • ClickRank ranked 25th in variable importance among several hundred ranking signals • For comparison, the highest-ranking feature derived from page visit count ranked 56th, and a feature based on propagation of authority through the web link graph ranked 108th
Mining dynamic quicklinks • Many commercial search engines provide quick access links to popular destinations within the site • These links are traditionally mined from search engine query logs • Query or search session logs are limited in scope and coverage • Query logs favor old, navigational links
Mining dynamic quicklinks • We demonstrate ClickRank for discovering recent, dynamic content • We adapt the time range in the temporal weight function w.r.t. the content refresh rate found by the crawler • Use the indicator function as a term that specifies recency of the content
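The recency idea can be illustrated with a time-window indicator applied to per-visit weights. Everything in this sketch is hypothetical: the `mine_quicklinks` function, the `(url, timestamp, weight)` tuples, and the assumption that the window bounds have already been derived from the crawler's refresh rate.

```python
def mine_quicklinks(visits, site_prefix, window_start, window_end, top_k=3):
    """Return the top-scoring URLs under site_prefix within a time window.

    visits -- iterable of (url, timestamp, weight) tuples, where weight is an
              assumed precomputed per-visit score
    """
    scores = {}
    for url, timestamp, weight in visits:
        # Indicator function: 1 inside the window of interest, 0 outside.
        in_window = 1.0 if window_start <= timestamp <= window_end else 0.0
        if url.startswith(site_prefix) and url != site_prefix:
            scores[url] = scores.get(url, 0.0) + weight * in_window
    ranked = sorted(
        (url for url in scores if scores[url] > 0.0),
        key=lambda url: (-scores[url], url),
    )
    return ranked[:top_k]
```

Narrowing the window favors recently visited pages, while pages visited only before the window drop out of the candidate list.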
Mining dynamic quicklinks Search results with quicklinks mined by ClickRank for August 10, 2008
Mining dynamic quicklinks Search results with quicklinks mined by ClickRank for August 16, 2008
Conclusion • We expand the use of general user behavior data for web search ranking and other applications • We introduce ClickRank, an efficient, scalable algorithm for estimating web page importance by incorporating rich contextual information • ClickRank is shown to be a novel and effective query-independent ranking signal, especially on long queries • Our results highlight the potential of data-driven user behavior modeling at the web scale