Predictive Caching and Prefetching of Query Results in Search Engines
Based on a paper by: Ronny Lempel and Shlomo Moran
Presentation: Omid Fatemieh, CS598CXZ, Spring 2005
Outline • Introduction • Some analysis of the query log • Existing and proposed approaches for caching and prefetching • Results • Conclusion
Introduction • Search engines receive millions of queries per day • From millions of unrelated people • Studies show that a small set of popular queries accounts for a significant fraction of the query stream • Answering some queries from the cache instead of through the index can: • Lower the response time • Reduce hardware requirements
Caching vs Prefetching • Caching: storing results that users have requested in a cache, in case other users request the same pages. • E.g., many people would search for information about Illinois and the NCAA on game day. • Prefetching: storing results that we predict will be requested shortly. • E.g., prefetching the second page of results whenever a user submits a new query.
The Query Log • Over 7 million keyword-driven search queries submitted to AltaVista in the summer of 2001 • The number of requested results is always a multiple of 10 • For r >= 1, the results whose rank is 10(r-1)+1, …, 10r form the r-th page • Queries asking for a batch of 10k results are asking for k logical result pages • Each query: q = (z, t, f, l) • z: timestamp • t: topic of the query • f and l: the range of result pages requested
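As a small illustration (a Python sketch with hypothetical names, not from the paper), the mapping between result ranks and logical pages looks like this:

```python
from collections import namedtuple

# A query as described above: z = timestamp, t = topic,
# f and l = first and last requested result pages (illustrative field names).
Query = namedtuple("Query", ["z", "t", "f", "l"])

def rank_to_page(rank):
    """Result ranks 10(r-1)+1 .. 10r belong to logical page r."""
    return (rank - 1) // 10 + 1

def pages_requested(num_results):
    """A query asking for a batch of 10k results asks for k logical pages."""
    return num_results // 10

assert rank_to_page(1) == rank_to_page(10) == 1 and rank_to_page(11) == 2
assert pages_requested(30) == 3
```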
Analysis of the Query Log • 97.7 percent of queries requested 10 results (one logical page) • This implies over 7,500,000 logical pages (also called page views) were requested • 63.5% of these were for the first pages of results • 11.7% of the views were of second pages, …
Views of Result Pages The sequential browsing behavior allows for predictive prefetching of result pages.
Population of Topics • The number of distinct result pages viewed per topic is called the population of the topic. • For 78% of topics, only a single page was viewed (usually the first page). • The next slide shows the distribution for the remaining topics.
Population of Topics (Number of Pages Requested per Topic) Note the unusual strength of topics with populations of 10, 15, and 20
Topic Popularities • The log contained over 2,600,000 distinct topics • Over 67% were requested only once, in a single query • The most popular topic was queried 31,546 times • The plot of topic popularity conforms to a power law, except for the most popular topics • Power law: the probability of a topic being requested x times is proportional to x^(-c) • c ≈ 2.4 for topic popularities
Page Popularities • A similar phenomenon is observed when counting the number of requests for individual result pages • 48% of the result pages are requested only once • The 50 most requested pages account for almost 2% of the total number of page views • Again, the distribution follows a power law for all but the most frequently requested pages • Power law in this log: c ≈ 2.8 for page popularities
Behaviour of the Most Popular Result Pages • The log contained 200 result pages that were requested more than 512 times • These pages do not obey the power law • The number of requests for the r-th most popular page is proportional to r^(-c) • Here, c ≈ 0.67
Fetch Units and Result Prefetching • Fetching more results than requested may be relatively cheap • The dilemma is whether storing the extra results in the cache is worthwhile, since this comes at the expense of evicting previously stored results • All the caching schemes fetch results in bulks whose size is a multiple of k, the basic fetch unit
Fetch Units and Result Prefetching • q: a query requesting result pages f through l for some topic • Let a and b be the first and last uncached pages in that range • A k-fetch policy effectively fetches pages a, a+1, …, a+mk-1, where m is the smallest integer such that a+mk-1 >= b • k = 1: fetching on demand
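A minimal sketch of such a k-fetch step (the function name and signature are assumptions, not the paper's code):

```python
import math

def k_fetch_range(cached, f, l, k):
    """Pages fetched by a k-fetch policy for a query requesting pages f..l.

    cached: set of result-page numbers of this topic already in the cache
    k:      the basic fetch unit
    Returns the list of pages fetched from the index ([] on a full cache hit).
    """
    uncached = [p for p in range(f, l + 1) if p not in cached]
    if not uncached:
        return []                      # every requested page is cached
    a, b = uncached[0], uncached[-1]   # first and last uncached pages
    m = math.ceil((b - a + 1) / k)     # smallest m with a + m*k - 1 >= b
    return list(range(a, a + m * k))   # pages a, a+1, ..., a+mk-1
```

With k = 1 the fetched block is simply pages a through b (fetching on demand).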
Upper Bounds on Hit Ratios • Assume a cache of infinite size • For each topic t, P_t is the subset of the 32 potential result pages that were actually requested in the log • Cover P_t with the minimal number of fetch units possible: f_k(t) • Σ_t f_k(t) is a close approximation to the minimal number of faults that any policy whose fetch unit is k will have on this query log
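The covering count f_k(t) can be computed greedily; a sketch under the assumption that a fetch unit may start at any requested page:

```python
def f_k(requested_pages, k):
    """Minimal number of length-k fetch units needed to cover the set of
    result pages actually requested for one topic (greedy is optimal for
    covering points with fixed-length intervals)."""
    faults, covered_up_to = 0, 0
    for page in sorted(requested_pages):
        if page > covered_up_to:          # not covered by the previous unit
            faults += 1
            covered_up_to = page + k - 1  # this unit covers page .. page+k-1
    return faults

# Summing f_k over all topics approximates the minimal number of faults of
# any policy with fetch unit k on this log, assuming an infinite cache.
```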
Cache Replacement Policies • Page Least Recently Used (PLRU) • Page Segmented LRU (PSLRU) • Topic LRU (TLRU) • Topic SLRU (TSLRU) • All of the above approaches require O(1) work per query • Probability Driven Cache (PDC) • The novel approach presented in the paper • Requires O(log(cache size)) operations per query
Terminology • C(q) is the subset of the requested pages P(q) that are cached when q is submitted. • G(q) denotes the set of pages that are fetched as a consequence of q; F(q) is the subset of G(q) consisting of pages that were not already cached.
PLRU • PLRU: The pages of C(q), merged with pages of F(q) are moved back to the tail of the queue. • Once the queue is full, cached pages are evicted from the head of the queue. • Tail: the most recently requested (and prefetched) pages • Head: The least recently requested pages.
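A rough sketch of the queue behind PLRU (class and method names are assumptions; an OrderedDict stands in for the LRU queue of (topic, page) entries):

```python
from collections import OrderedDict

class PLRU:
    """Minimal sketch of the Page LRU policy described above."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = OrderedDict()          # keys are (topic, page) pairs

    def access(self, cached_pages, fetched_pages):
        # C(q), merged with F(q), is moved to / inserted at the tail.
        for page in list(cached_pages) + list(fetched_pages):
            if page in self.queue:
                self.queue.move_to_end(page)
            else:
                self.queue[page] = True
        # Once the queue is full, evict from the head (least recently used).
        while len(self.queue) > self.capacity:
            self.queue.popitem(last=False)
```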
SLRU • Two LRU segments: • A protected segment • A probationary segment • F(q): inserted into the probationary segment (at its tail) • C(q): transferred to the protected segment (at its tail) • Pages evicted from the protected segment remain cached and are moved to the tail of the probationary segment • Pages are removed from the cache only when they are evicted from the probationary segment • Pages in the protected segment have been requested at least twice since they were last fetched
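The two-segment bookkeeping might look like this (a sketch; segment sizes and the demotion detail are assumptions based on the description above):

```python
from collections import OrderedDict

class SLRU:
    """Rough sketch of the segmented LRU cache described above."""

    def __init__(self, probationary_size, protected_size):
        self.prob_size, self.prot_size = probationary_size, protected_size
        self.probationary, self.protected = OrderedDict(), OrderedDict()

    def _insert(self, segment, size, page):
        segment[page] = True
        segment.move_to_end(page)
        if len(segment) > size:
            return segment.popitem(last=False)[0]   # page evicted from segment
        return None

    def access(self, cached_pages, fetched_pages):
        for page in fetched_pages:                   # F(q): probationary tail
            self._insert(self.probationary, self.prob_size, page)
            # pages evicted from the probationary segment leave the cache
        for page in cached_pages:                    # C(q): protected tail
            self.probationary.pop(page, None)
            demoted = self._insert(self.protected, self.prot_size, page)
            if demoted is not None:
                # evicted from protected: stays cached, demoted to probationary
                self._insert(self.probationary, self.prob_size, demoted)
```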
TLRU • The Topic LRU (TLRU) policy is a variation on the PLRU scheme. • Let t(q) denote the topic of the query q. TLRU performs two actions for every query q: • The pages of F(q) are inserted into the page queue. • Any cached result page of t(q) is moved to the tail of the queue. • Each topic's pages will therefore always reside contiguously in the queue, with blocks of different topics ordered by the LRU policy.
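One way to picture a TLRU step (illustrative only; the queue is an OrderedDict keyed by (topic, page) as in the PLRU sketch):

```python
from collections import OrderedDict

def tlru_access(queue, capacity, topic, fetched_pages):
    """Sketch of one TLRU step. Moving every cached page of t(q) to the tail
    keeps each topic's pages contiguous; topic blocks are LRU-ordered."""
    for key in [k for k in queue if k[0] == topic]:   # cached pages of t(q)
        queue.move_to_end(key)
    for page in fetched_pages:                        # insert F(q)
        queue[(topic, page)] = True
    while len(queue) > capacity:
        queue.popitem(last=False)                     # evict at the head

queue = OrderedDict()
tlru_access(queue, capacity=4, topic="illinois", fetched_pages=[1, 2])
```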
TSLRU • The Topic SLRU (TSLRU) policy is a variation on the PSLRU scheme. It performs two actions for every query q (whose topic is t(q)): • The pages of F(q) are inserted into the probationary queue. • Any cached result page of t(q) is moved to the tail of the protected queue.
Probability Driven Cache (PDC) • {q_i^u = (z_i^u, t_i^u, f_i^u, l_i^u)}, i >= 1: the sequence of queries that user u issues • Consider q_i^u and q_{i+1}^u, two successive queries issued by user u. • q_{i+1}^u may be a follow-up of q_i^u, or the start of a new session. • If q_{i+1}^u is a follow-up of q_i^u, then q_{i+1}^u is submitted no more than W time units after q_i^u (that is, z_{i+1}^u <= z_i^u + W) and f_{i+1}^u = l_i^u + 1. Obviously, in this case t_{i+1}^u = t_i^u. • q_{i+1}^u starts a new search session whenever f_{i+1}^u = 1. • We also assume that the first query of every user u, q_1^u, requests the top result page of some topic.
Probability Driven Cache (PDC) • W = attention span: users do not submit follow-up queries after being inactive for W time units. • The result pages viewed in every search session are pages (t,1), …, (t,m) for some topic t and m >= 1. • At any given moment z, every user has at most one query that will potentially be followed up on.
Probability Driven Cache (PDC) • The set of queries that will potentially be followed up on is defined by • Q = { q_i^u : q_i^u was submitted after z - W } • The model assumes that there are topic- and user-independent probabilities s_m, m >= 1, such that: • s_m is the probability of a search session requesting exactly m result pages • The s_m values are assumed to be known
Probability Driven Cache (PDC) • For a query q: • t(q): the query's topic • l(q): the last result page requested in q • For every result page (t,m), we can now calculate P_Q(t,m), the probability that (t,m) will be requested as a follow-up to at least one of the queries in Q. • P[m|l] is the probability that a session will request result page m, given that the last result page requested so far was page l.
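The slide leaves the actual expression implicit; under the stated session model, and treating follow-ups of distinct queries as independent (a reconstruction, not a quote from the paper), one plausible form is:

```latex
P[\,m \mid l\,] = \frac{\sum_{j \ge m} s_j}{\sum_{j \ge l} s_j} \quad (m > l),
\qquad
P_Q(t,m) = 1 - \prod_{\substack{q \in Q \\ t(q) = t}} \bigl(1 - P[\,m \mid l(q)\,]\bigr).
```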
Implementation • P_Q(t,1) cannot be determined by this model: • it cannot predict the topics that will be the focus of future search sessions. • PDC therefore prioritizes cached (t,1) pages by a different mechanism. • PDC maintains two buffers: • A priority queue PQ for prioritizing the cached pages • An SLRU buffer for caching (t,1) pages • The relative size of these buffers is subject to optimization.
Implementation • The probabilities s_m, m >= 1, are determined based on the characteristics of the log. • PDC tracks the set Q by maintaining a query window QW. • QW holds a subset of the queries submitted during the last W time units. • For every kept query q = (z, t, f, l), its time z and last requested page (t, l) are saved.
Steps for a Query q = (z, t, f, l) • q is inserted into QW; queries submitted before z - W are removed from QW. • If there is a query q' in QW such that the last page requested by q' is (t, f-1), the least recent such query is also removed from QW. • Let T be the set of topics whose corresponding set of QW queries has changed; the priorities of all T-pages in PQ are updated. • If f = 1 and page (t,1) is not cached, (t,1) is inserted at the tail of the probationary segment of the SLRU. • If (t,1) is already cached, it is moved to the tail of the protected segment of the SLRU. • For every page (t,m), 1 < m <= l, that is requested by q but not cached: its priority is calculated and, if it merits it, the page is kept in PQ (possibly evicting a lower-priority page).
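A toy Python sketch of these per-query steps; the class, its fields, and the dict-based priority structure are illustrative assumptions, the SLRU handling of (t,1) pages is stubbed out, and the linear victim scan stands in for the paper's O(log cache-size) priority queue:

```python
class PDCSketch:
    """Toy illustration of PDC's per-query bookkeeping (not the paper's code)."""

    def __init__(self, window_w, pq_capacity, prob_fn):
        self.W = window_w        # attention span
        self.qw = []             # query window QW: list of (z, t, f, l)
        self.pq_capacity = pq_capacity
        self.pq = {}             # cached page (t, m) -> current priority
        self.prob = prob_fn      # prob_fn(topic, m, qw) estimates P_Q(t, m)

    def process(self, q):
        z, t, f, l = q
        # 1. Insert q into QW and drop queries submitted before z - W.
        self.qw = [x for x in self.qw if x[0] >= z - self.W] + [q]
        # 2. If some q' in QW has (t, f-1) as its last requested page,
        #    remove the least recent such q' (q is its follow-up).
        followups = [x for x in self.qw if x[1] == t and x[3] == f - 1]
        if followups:
            self.qw.remove(min(followups, key=lambda x: x[0]))
        # 3. Re-prioritize the cached pages of the affected topic.
        for page in list(self.pq):
            if page[0] == t:
                self.pq[page] = self.prob(t, page[1], self.qw)
        # (Insertion/promotion of (t, 1) in the SLRU component is omitted.)
        # 4. Requested but uncached pages (t, m), 1 < m <= l, compete for PQ.
        for m in range(max(f, 2), l + 1):
            if (t, m) not in self.pq:
                self._offer((t, m), self.prob(t, m, self.qw))

    def _offer(self, page, priority):
        if len(self.pq) < self.pq_capacity:
            self.pq[page] = priority
        elif self.pq:
            victim = min(self.pq, key=self.pq.get)   # lowest-priority page
            if priority > self.pq[victim]:
                del self.pq[victim]
                self.pq[page] = priority
```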
Results • Hits and Faults • A query counts as a cache hit only if it can be fully answered from the cache; otherwise it is a fault. • Cold and Warm caches • The reported hit ratios are for warm caches. • The definition of a warm cache: • PLRU and TLRU: once the page queue is full. • PSLRU and TSLRU: once the probationary segment of the SLRU becomes full for the first time. • PDC: once either the probationary segment of the SLRU or the PQ component reaches full capacity.
Results (PLRU and TLRU) • The hit ratio of LRU(s,3) is always higher than that of LRU(4s,1). In fact, LRU(16000,3) is higher than LRU(512000,1), despite the latter cache being 32 times larger. • For s = 4000, the optimal fetch unit for both schemes (PLRU, TLRU) is 3. • The optimal fetch unit increases as the size of the cache increases. • The increase in performance gained by doubling the cache size is large for small caches.
Results for PSLRU and TSLRU • The ratio: the ratio of the probationary segment to the whole cache. • For larger cache sizes, an increased fetch unit helps more. • Lower ratios are more beneficial when the cache size is large.
Results for PDC • Two new degrees of freedom: • The length of the query window (QW) • The ratio between the capacity of the SLRU, which holds the (t,1) pages, and the capacity of the priority queue PQ, which holds all other result pages. • Again, the most significant degree of freedom is the fetch unit.
Results for PDC • Again the optimal fetch unit increases as the cache size increases. • The results indicate that the optimal window length grows as the cache size grows. • With small cache sizes, it is best to consider only the most recent requests. As the cache grows, it pays to consider growing request histories when replacing cached pages. • The optimal PQ size shrinks as the cache grows. • The probationary segment of the SLRU should be dominant for small caches, but both SLRU segments should be roughly of equal size in large caches.
Conclusion • Five replacement policies for cached search result pages. • PDC is superior to the other tested caching schemes. • For large cache sizes, PDC outperforms LRU-based caches that are twice as large. • It achieves a hit ratio of 0.53 on a query log whose theoretical hit ratio is bounded by 0.629. • The impact of the fetch unit is much greater than that of any other internal partitioning of the cache. • The optimal fetch unit depends only on the total size of the cache.
Discussion • Is the computational complexity of O(log(cache size)) acceptable? • Considering that memory is getting cheaper, is the slight improvement worth these computations? • They don't show how they used their initial analysis of the logs in designing their method.