Improved Techniques for Result Caching in Web Search Engines

Qinqing Gan, Torsten Suel. Presenters: Arghyadip, Konark. Summary: query result caching in web search engines to improve query processing performance.


Presentation Transcript


  1. Improved Techniques for Result Caching in Web Search Engines • Qinqing Gan, Torsten Suel • Presenters: Arghyadip, Konark

  2. Summary: Result caching in web search engines • Query result caching in search engines improves query processing performance. • It increases the effective throughput of the entire search engine system. • Discussion of various weighted, unweighted, and hybrid query result caching techniques. • Performance evaluation.

  3. Query Processing • The main challenge for query processing is the large amount of index data that must be accessed for a query • Need to optimize to scale with users and data • Caching is one such optimization • Result caching: has the query occurred before? • List caching: has the index data for a term been accessed before?

  4. Query Co-ordinator

  5. Related Work • A number of subsequent papers on result caching (cache hit ratio only): • Baeza-Yates et al. (SPIRE 2003, 2007; SIGIR 2003) • Fagni et al. (TOIS 2006) • Lempel/Moran (WWW 2003) • Saraiva et al. (SIGIR 2001) • Xie/O'Hallaron (Infocom 2002) • Fagni et al. propose hybrid methods that combine a dynamic cache with a more static cache • Baeza-Yates et al. (SPIRE 2007) use some features for a cache admission policy

  6. Caching Basics • LRU: least recently used • LFU: least frequently used • Both can be implemented using basic data structures • Each cached query gets a score: the time of its last occurrence in LRU, or its frequency in LFU • Evict the query with the smallest score • Recency (LRU) vs. frequency (LFU) • Various hybrids combine two or more policies
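
The two basic policies can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the class names `LRUCache` and `LFUCache`, the `get`/`put` interface, and the choice to count every lookup (hit or miss) toward a query's frequency are all assumptions made here.

```python
from collections import Counter, OrderedDict


class LRUCache:
    """Result cache that evicts the least recently used query."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # query -> result, oldest first

    def get(self, query):
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)  # mark as most recently used
        return self.entries[query]

    def put(self, query, result):
        if query in self.entries:
            self.entries.move_to_end(query)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[query] = result


class LFUCache:
    """Result cache that evicts the cached query with the lowest frequency."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}      # query -> result
        self.freq = Counter()  # query -> number of lookups seen so far

    def get(self, query):
        self.freq[query] += 1  # assumption: misses also count as occurrences
        return self.entries.get(query)

    def put(self, query, result):
        if query not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda q: self.freq[q])
            del self.entries[victim]
        self.entries[query] = result
```

The linear scan in `LFUCache.put` keeps the sketch short; a heap or frequency buckets would be the usual choice for a large cache.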

  7. SDC (Static and Dynamic Caching) • Fagni et al. (TOIS 2006) • [Figure: cache divided into an LFU part and an LRU part, alpha = 0.7]
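
A rough sketch of the static/dynamic split follows. It assumes that the static part is filled once with the most frequent queries of a training log (the "LFU" label on the slide) and that alpha is the fraction of capacity given to the static part; the slide states neither explicitly. The class name `SDCCache` and its `access` method are hypothetical.

```python
from collections import Counter, OrderedDict


class SDCCache:
    """Static part: frequent training queries. Dynamic part: LRU."""

    def __init__(self, capacity, training_log, alpha=0.7):
        # Assumption: alpha is the share of capacity reserved for the static part.
        static_size = round(alpha * capacity)
        top = Counter(training_log).most_common(static_size)
        self.static = {query for query, _ in top}  # never evicted
        self.dynamic = OrderedDict()               # LRU order, oldest first
        self.dynamic_size = capacity - static_size

    def access(self, query):
        """Return True on a hit; on a miss, admit the query to the LRU part."""
        if query in self.static:
            return True
        if query in self.dynamic:
            self.dynamic.move_to_end(query)  # refresh recency
            return True
        if self.dynamic_size > 0:
            if len(self.dynamic) >= self.dynamic_size:
                self.dynamic.popitem(last=False)  # evict least recently used
            self.dynamic[query] = True
        return False
```

Only membership is tracked here; a real result cache would store the result pages alongside each query.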

  8. Characteristics of Queries (AOL Query Log) • Query frequencies follow a Zipf distribution • While a few queries are quite frequent, most queries occur only once or a few times • [Figure: query frequency distribution, plotted on a double logarithmic scale]
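
A power law appears as a straight line on a double logarithmic plot. As a sketch, write $f(r)$ for the frequency of the query of rank $r$; the constants $C$ and $s$ are introduced here for illustration and do not come from the slides:

```latex
f(r) = C \, r^{-s}
\quad\Longrightarrow\quad
\log f(r) = \log C - s \log r
```

Log frequency is therefore linear in log rank, with slope $-s$.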

  9. Characteristics of Queries • Query traces exhibit some amount of burstiness: repeats of a query tend to cluster in time, while most queries occur only once or twice • A significant part of this burstiness is due to the same user reissuing a query to the engine. • With an assumed query arrival rate of 132 queries per minute, most repeated queries recur within a few minutes or hours

  10. Only Cache Hit? • Query Result Fails. • Frequent Admission and Eviction Occur.

  11. Key Idea: • Study result caching as a weighted caching problem, evaluated by hit ratio and by cost saving • Hybrid algorithms for weighted caching

  12. Weighted Caching • Assume all cache entries have the same size. • Standard caching: all entries also have the same cost. • Weighted caching: entries have different costs. • Result caching: some queries are more expensive to recompute than others • In fact, costs are highly skewed • Expensive results should be kept longer
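
The difference between the two metrics shows up even on a toy trace. The sketch below assumes that hit ratio is hits divided by total queries and that cost saving is the cost of the hits divided by the cost of computing every query from scratch; the slides do not give exact definitions. The function name and the unbounded-cache simplification are also assumptions.

```python
def hit_ratio_and_cost_saving(trace, cost):
    """trace: list of queries; cost: dict mapping query -> cost to compute it.

    Simulates an unbounded cache, so every repeated query is a hit.
    Returns (hit ratio, fraction of total cost saved).
    """
    seen = set()
    hits = 0
    saved = 0.0
    total = 0.0
    for query in trace:
        total += cost[query]
        if query in seen:
            hits += 1
            saved += cost[query]  # a hit avoids recomputing this query
        else:
            seen.add(query)
    return hits / len(trace), saved / total
```

On a trace with three cheap queries (cost 1) and two expensive ones (cost 10), the single expensive hit contributes more to cost saving than both cheap hits combined, which is the motivation for treating result caching as a weighted problem.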

  13. Weighted Caching Algorithms • LFU_w: evict the entry with the smallest value of past frequency * cost (weighted version of LFU) • Landlord: weighted version of LRU (Young, Cao/Irani 1998) • On insertion, give the entry a deadline equal to its cost • Evict the entry with the smallest deadline, and deduct this deadline from all other deadlines in the cache • SDC_w: combination of LFU_w and Landlord.
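
The Landlord rules and the LFU_w eviction choice can be sketched as follows. The slide does not say what happens on a hit; this sketch assumes the deadline is reset to the entry's cost. The names `LandlordCache`, `access`, and `lfu_w_victim` are hypothetical, and the capacity is assumed to be at least 1.

```python
class LandlordCache:
    """Weighted cache following the deadline rules described on the slide."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed to be >= 1
        self.deadline = {}        # query -> remaining deadline

    def access(self, query, cost):
        """Return True on a hit; on a miss, insert the query."""
        if query in self.deadline:
            self.deadline[query] = cost  # assumption: a hit resets the deadline
            return True
        if len(self.deadline) >= self.capacity:
            victim = min(self.deadline, key=self.deadline.get)
            smallest = self.deadline.pop(victim)
            for other in self.deadline:
                self.deadline[other] -= smallest  # deduct from all survivors
        self.deadline[query] = cost  # new entry starts with deadline = cost
        return False


def lfu_w_victim(freq, cost):
    """LFU_w: pick the cached query with the smallest frequency * cost."""
    return min(freq, key=lambda q: freq[q] * cost[q])
```

Subtracting from every entry on each eviction is linear in the cache size; implementations commonly keep a global offset instead, which this sketch omits for clarity.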

  14. Hit Ratio of Basic Algorithms

  15. Cost Reduction

  16. New Hybrid Algorithms • SDC • lru_lfu • landlord_lfu_w

  17. Weighted Caching and Power Laws • Problem with weighted caching with high skew • Suppose q_1 has occurred once and has cost 10, and q_2 has occurred 10 times and has cost 1 • LFU_w gives both the same priority. Is that right? • Lottery: • Multiple rounds, one winner per round • Some people buy more tickets than others • But each person buys the same number each week • Given past history, guess future winners • Suppose ticket sales are Zipfian

  18. Weighted Caching and Power Laws • Compare: smoothing techniques in language models • Three solutions: • Good-Turing estimator • Estimator derived from power law • Pragmatic: fit correction factors from real data • Last solution subsumes others
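
For reference, the basic (unsmoothed) Good-Turing estimator replaces an observed count r with r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct items seen exactly r times. The sketch below applies it to query counts; how the presenters would plug it into a caching policy is not stated, and the function name is an assumption.

```python
from collections import Counter


def good_turing_adjusted_counts(query_counts):
    """Map each observed frequency r to r* = (r + 1) * N_{r+1} / N_r.

    query_counts: dict mapping query -> number of past occurrences.
    Frequencies with N_{r+1} == 0 are skipped, since the unsmoothed
    estimate is not usable there.
    """
    n = Counter(query_counts.values())  # r -> N_r
    adjusted = {}
    for r, n_r in n.items():
        if n.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * n[r + 1] / n_r
    return adjusted
```

With six queries seen once and two seen twice, the adjusted count for a singleton is 2 * 2 / 6, roughly 0.67. A query seen once is thus treated as less likely to recur than its raw count suggests.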

  19. Weighted Zipfian Caching • E.g., in LFU_w: priority score = cost * frequency * g()
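
The effect of a correction factor can be illustrated on the earlier q_1/q_2 example. The slide leaves the argument of g unspecified; this sketch assumes g depends on the past frequency. The values in `example_g` are invented for illustration and are not fitted from any data.

```python
def priority(cost, frequency, g):
    """Priority score = cost * frequency * g(frequency)."""
    return cost * frequency * g(frequency)


def no_correction(frequency):
    return 1.0  # plain LFU_w


def example_g(frequency):
    # Hypothetical correction factors (NOT fitted from real data):
    # rarely seen queries are discounted, frequent ones are left alone.
    table = {1: 0.5, 2: 0.8}
    return table.get(frequency, 1.0)


q1 = (10, 1)  # cost 10, seen once
q2 = (1, 10)  # cost 1, seen 10 times

tie = priority(*q1, no_correction) == priority(*q2, no_correction)
q1_lower = priority(*q1, example_g) < priority(*q2, example_g)
```

Without correction the two queries tie; with the discount on singletons, q_1 receives the lower priority and would be evicted first.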

  20. Hybrid Algorithms After Adding Correction

  21. Dataset and Evaluations • 2006 AOL query log with 36 million queries • 4 GB of data collected as HTML pages from Quora • The Lemur search engine has no support for result caching • Plan to develop weighted LRU, LFU, and SDC result caching on top of Lemur • Compare performance across all of the above caching variants with different weights assigned to hit ratio and load • Evaluate which weight metric works best

  22. Evaluation Methodology

  23. Questions?