Web Warehouse: Non-Transparent Cache with Weak Storage Capacity Bound

Yahiko Kambayashi, Kai Cheng, Sinotaro Hirano
Graduate School of Informatics, Kyoto University, Japan

Motivation

Background: What Can Data Management Technologies Contribute?
Cache/data duplication is important.
[Figure: Internet content grows x9/year while Internet backbone bandwidth grows x1.4/year]
Web Characteristics
• Open
  • Everyone Can Publish Content Freely
  • No Centralized “Data Dictionary”
• Biased Usage
  • Dynamically Changing “Hot Spots”
• Web Content Doubles Every 3-6 Months, while Bandwidth Grows Only x1.4/Year
→ Increasing Gap Between Traffic and Available Bandwidth

Cache is Everywhere!!
Caches in Multi-Tiered Web Architecture
[Figure: Client Caches, Proxy Caches, Internet, Front-End Caches, Web Server Caches, Application Server Mid-Tier Caches, DB]

Major Results

Cache and Web
Assumptions on cache algorithms no longer hold.
• Traditional Cache
  • Simple and Fast Algorithms (e.g., LRU) are Required
  • Strict Limitation on Storage Size
  • Transparency (You Cannot See the Contents to Use Them Efficiently)
  • All Data are Treated Equally when Retrieved
• Web Environment
  • Complicated Algorithms are Permitted
  • Disks can be Used, so Cache Size is No Longer a Limiting Factor
  • Data in the Cache can be Used Directly (Non-Transparent)
  • Most Cache Storage is Occupied by Unimportant Data

Databases/Data Warehouses and Web
Cache + database may be suitable for web applications.
• Traditional Databases
  • A large amount of selected data is stored following a DB schema
  • Data are shared and a user-friendly query interface is supplied
  • The DB is designed by properties of data, not by applications; application-specific characteristics are handled by query processors
  • Query processors usually do not use statistics of past queries
• Web Environment
  • A large amount of data should be shared
  • Dynamically changing hot spots should be handled by an advanced self-organizing structure (composite pages, linked pages)
  • Similar contents tend to be accessed in succession
  • Usage data are important and should be shared; they are also used to define dynamically changing priorities

Contribution of This Paper
1. Dynamically changing, strongly biased hot spot data → Self-organizing capability. Priority is determined by past data usage and popular topics. Topic sensors detect global hot topics in the Web.
2. Handling of link structures → Priority is determined by the logical web structure.
3. Small reuse ratio in Web cache / many similar topics → Non-transparency; use of DB systems for web cache.
4. Priority-based file organization → Unlike LRU (priority is determined at the last moment), the initial value of priority should be determined when a web page is retrieved. Data with high priority will be located in fast-access storage.
5. Application-sensitive data management → Store and manage both data and usage data (metadata); usage/popularity-aware queries.

Data Management Systems
[Figure labels: Data Sources, Selection, DW, Modeling, DB, Storage, FS, Retrieval]
Past usage patterns are not used in conventional DBMS.

System Overview
• A New Data System, Called Capacity Bound-free Web Warehouse (CBFWW)
• A Cache without Capacity Bounds, Capable of Storing All Important Data
• A Data Management System with Priority Decision and Usage Data
[Figure labels: Priority-Based Data Selection; Web Object Hierarchy Model; Self-Organizing Storage Management; Popularity-Aware Query/Navigation; Priority Decision; Description; Topic Sensor; Storage; Retrieval; Self-Organization]

Overview of Web Cache

History of Web Cache Research
Existing Research Focuses on
• Getting More Hits (Collaborative Cache, Caching the Uncacheable)
• Increasing Freshness of Each Hit (Consistency Management)
Topics
• Cache Algorithms (90s)
  • Replacement Algorithms, e.g., LRU, LFU, LRU-SIZE, SIZE, etc.
  • As Storage Space is No Longer a Limiting Factor: “Publish No More Papers on Cache Replacement Algorithms” (Panel Discussion, 2001 Web Cache Workshop)
• Consistency Management, e.g., Client Polling, Server Invalidation
• Caching of Uncacheable Contents
  • E.g., Using Proxylets, Active Cache (P. Cao, et al.)
• Collaborative Caching
  • Hierarchical Cache (e.g., Harvest Project)

Characteristics of Web Cache

Factors for Web Cache Evaluation
Traditional Factor
• Recency: The More Recently an Object was Used, the More Likely It will be Used Again
New Factors
• Popularity: The More Popular an Object has been, the More Likely It will Get More Accesses in the Future
• Size: Caching a Larger Object may Displace Many Smaller Ones
• Update Frequency
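The four factors above can be folded into a single score. The following is a minimal sketch, not the scoring used by the authors: the `CachedObject` fields, the half-life, and the way the terms are combined are all illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class CachedObject:
    url: str
    size_bytes: int
    hits: int               # popularity: number of past accesses
    last_access: float      # seconds since the epoch
    updates_per_day: float  # observed update frequency at the origin


def priority(obj: CachedObject, now: float, half_life: float = 86400.0) -> float:
    """Toy score: a higher value means the object is more worth keeping."""
    # Recency: halves every `half_life` seconds since the last access.
    recency = 0.5 ** ((now - obj.last_access) / half_life)
    # Popularity: log-damped so a few very hot objects do not dominate.
    popularity = math.log1p(obj.hits)
    # Size: a larger object may displace many smaller ones, so it is penalized.
    size_penalty = math.log2(2 + obj.size_bytes / 1024)
    # Update frequency: frequently changing objects go stale sooner.
    staleness_penalty = 1.0 + obj.updates_per_day
    return recency * popularity / (size_penalty * staleness_penalty)
```

Under these assumptions a small, recently used, popular page outranks a large, rarely used, frequently updated one; other combinations (weighted sums, per-factor thresholds) would be equally valid starting points.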

Algorithms for Web Cache

Web Warehouse

The Architecture of Web Warehouse
[Figure components: Topic Sensor, Web Requester (Proxy), Recommender, Topic Manager, Constraint Manager, Data Analyzer, Priority Manager, Data/Usage, Query Processor, Storage Manager, Version Manager, Data/Usage]
Storage
• Memory
• Disks
• Tertiary Storage

Data Warehouse Capability is Required
Most Contents in Web Caches are Never Reused
[Figure: Zipf’s distribution; 70% of HTML files are never reused]
• Data Obtained from a Large ISP, Kyoto-Inet
• Only HTML Documents Considered

Data for Web Warehouse

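Logical Document
• lgPath = d1, d2, d3 (Frequently-Used Path Toward d3)
• Title = Anch_text1 + Anch_text2 + title
• Body = Contents Corresponding to Logical Pages
[Figure: d1 links to d2 through a hyperlink with anchor text Anch_text1; d2 links to d3 through a hyperlink with anchor text Anch_text2; d3 supplies the title and body]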
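The construction of a logical document from a frequently used path can be sketched as follows. This is an illustration only: the class names, the space used to join anchor texts, and the example URLs are assumptions, and the body is taken to be the body of the last document on the path.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PhysicalDoc:
    url: str
    title: str
    body: str


@dataclass
class LogicalDoc:
    lg_path: List[str]  # URLs of d1, d2, ..., dn
    title: str
    body: str


def build_logical_doc(start: PhysicalDoc,
                      hops: List[Tuple[str, PhysicalDoc]]) -> LogicalDoc:
    """Each hop is (anchor text of the followed hyperlink, document reached)."""
    if not hops:
        raise ValueError("a logical path needs at least one hyperlink")
    target = hops[-1][1]
    anchors = [anchor for anchor, _ in hops]
    return LogicalDoc(
        lg_path=[start.url] + [doc.url for _, doc in hops],
        # Title = Anch_text1 + Anch_text2 + ... + title of the target page
        title=" ".join(anchors + [target.title]),
        body=target.body,
    )
```

For a path d1, d2, d3, the result carries both anchor texts in its title, so a page with an uninformative title such as "Overview" still becomes searchable through the words users clicked to reach it.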

Physical Documents
• Container
  • Textual Content
  • Anchor
  • Placeholder for Other Media
• Components
  • Media Files Other than Text
• Use Counter
Contentof(d) = <title, body>, for (physical) document d
Both Container and Components are Called Raw Data

Data Organization: Organization of Web Data in CBFWW
• Data Organization Based on Locality of Reference
  • Page → Embedded Objects: Access to a Page Causes Its Embedded Components to be Accessed
  • Page → Linked Pages: Access to a Page Makes Linked Pages More Likely to be Accessed
  • Page → Similar Pages: Access to a Page Implies Interest in Similar Pages
Object Hierarchy
• Semantic Region (Topic): Cluster of Similar Logical Pages
• Logical Pages: Frequently-Used Path to a Physical Page
• Physical Pages: Composite Page, Container (1) + Components (M)
• Raw Data: Indivisible Web Objects (e.g., Files)

Computation of Priority Priority for various storage levels

Priority Decision When Retrieved
• LRU Determines the Priority at the Last Moment
• A Semantic Region (R) is a Cluster of Semantically Close Logical Documents
• Each Document Belongs to Exactly One Cluster
• A New Document Belongs to the Cluster whose Centroid is the Nearest
• The Number of Semantic Regions is Given
• Existing High-Performance Single-Pass Randomized K-Median Clustering Algorithms can be Adopted (e.g., LSEARCH)
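The rule that a new document joins the cluster with the nearest centroid can be sketched as below. This covers only the assignment step, not LSEARCH or any other clustering algorithm; the bag-of-words vectors, the cosine measure, and the region names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts (a stand-in for a real document vector)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_region(doc: Counter, centroids: Dict[str, Counter]) -> str:
    """Return the name of the semantic region whose centroid is nearest."""
    if not centroids:
        raise ValueError("at least one semantic region is required")
    return max(centroids, key=lambda name: cosine(doc, centroids[name]))
```

Because the number of regions is given in advance, this assignment can run at retrieval time, which is what allows an initial priority to be attached to a page as soon as it is fetched.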

Topic Sensor
• Priority Decision by Global Popularity
• Analysis of Data Provided by the Provider Kyoto-Inet
  • Very popular web pages are influenced by news on TV and in newspapers
  • Web pages related to local events, in particular, are accessed only during a short period of time
• Priority by past usage is not enough
• Topic Sensor finds important topics from news sites
• Contents Graph: keywords alone are not enough, so keywords are expressed together with their co-occurrence relationships

Similarity by Concept Graphs
• Keywords → Keyphrases
• Co-occurrence → Association Rules
Reference: Y. Lee and Y. Kambayashi, “Concept Graphs for Extracting Topics,” 2002.

Priority Decision by Usage History
• History of keyword usage: a popular web search technique
• Depending on the interval and selected patterns, priority values will differ
• Various kinds of priority functions can be defined using past usage data; they can be dynamically modified depending on the occupation rate of storage
[Figure: Change of Topic Popularity w.r.t. Usage Patterns. Each panel plots frequency (Freq) over the intervals M, W, D against the average. Hot topics: Going Down, No Change, Going Up. New/obsolete topics: No Change, Obsolete Topic, New Topic]
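One way to turn such usage patterns into labels is sketched below. It assumes that M, W, and D stand for month, week, and day, that the inputs are access counts over nested windows of 30, 7, and 1 days, and that a 20% tolerance separates a trend from noise; none of these details are given on the slide.

```python
def classify_trend(month_count: int, week_count: int, day_count: int,
                   tolerance: float = 0.2) -> str:
    """Label a topic from access counts in the last 30, 7, and 1 days.

    The windows are nested: the day lies inside the week, and the week
    inside the month.
    """
    if month_count == 0:
        return "unused"
    if week_count == 0:
        return "obsolete"       # accessed earlier in the month, not this week
    if week_count == month_count:
        return "new"            # every access falls within the last week
    # Compare per-day rates across the three windows.
    month_rate = month_count / 30
    week_rate = week_count / 7
    day_rate = float(day_count)
    if week_rate > month_rate * (1 + tolerance) and day_rate >= week_rate:
        return "going up"
    if week_rate < month_rate * (1 - tolerance) and day_rate <= week_rate:
        return "going down"
    return "no change"
```

A priority function could then, for example, raise the priority of "new" and "going up" topics and lower it for "obsolete" ones, with the thresholds tightened as the occupation rate of storage grows.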

Experiments and Prototype: TOP 10 Search Results for “Sports” (Jan. 14 ~ Feb. 14, 2002)
[Figure: (a) Usage-Blind Search vs. (b) Usage-Aware Search. Annotated items: 2002 World Cup, Skiing Season, Local Baseball News. Gray: disappeared items; red: new items]

Consistency Management
• Consistency: Data in CBFWW Should be Kept Up to Date with Data at Origin Sites
• With Usage Data Available, Consistency Management Can be Done Adaptively, Depending on
  • Frequency of Updates: How Often the Data are Updated
  • Frequency of Reference: How Often the Data are Used
  • Time Interval of Reference: When the Data are Used (Day or Night)
• Similar to the View Selection Problem
[Figure: materialized view vs. computational view, with updates and references]
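The analogy with view selection suggests a simple adaptive rule, sketched here under stated assumptions: an object referenced at least as often as it is updated is kept fresh eagerly (like a materialized view), and otherwise it is validated only when used (like a computed view). The policy names, the comparison rule, and the "one hour before the busiest hour" heuristic are hypothetical, not taken from the slides.

```python
from collections import Counter
from typing import Iterable


def choose_policy(updates_per_day: float, refs_per_day: float) -> str:
    """Pick a consistency policy from update and reference frequencies."""
    if refs_per_day <= 0:
        return "validate-on-access"   # never used: do not spend bandwidth
    if refs_per_day >= updates_per_day:
        return "refresh-on-update"    # read-mostly: keep the copy fresh
    return "validate-on-access"       # update-mostly: check only when used


def refresh_hour(reference_hours: Iterable[int]) -> int:
    """Schedule a refresh one hour before the busiest hour of reference."""
    counts = Counter(reference_hours)
    if not counts:
        raise ValueError("no reference history available")
    busiest, _ = counts.most_common(1)[0]
    return (busiest - 1) % 24
```

For example, a page referenced mostly around 9:00 would be refreshed at 8:00, which uses the time interval of reference in addition to the two frequencies.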

Storage Management: Priority Management
• Priorities Based on
  • Sizes (Raw Data and Physical Pages)
  • Recency (All Objects)
  • Frequency (All Objects)
  • Link-Structure-Based Ranks (Physical Pages)
  • Importance of Topics Obtained from Topic Sensor
• Priorities of Lower-Level Objects Depend on Those of Upper Levels
  • Raw Data Can be Higher in Priority when Belonging to a High-Priority Physical Page
  • Physical Pages Can be Higher in Priority when Belonging to a High-Priority Logical Page
  • Logical Pages Can be Higher in Priority if Belonging to a High-Priority Topic
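The dependence of lower-level priorities on upper levels can be sketched as a top-down pass over the object hierarchy. This is an assumption-laden illustration: it treats the hierarchy as a tree (each object has one parent), and the `damping` factor and the `max` rule are hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    own_priority: float  # from size, recency, frequency, rank, or topic
    children: List["Node"] = field(default_factory=list)


def propagate(node: Node, inherited: float = 0.0, damping: float = 0.9,
              out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Effective priority = max(own priority, damped priority of the parent)."""
    if out is None:
        out = {}
    effective = max(node.own_priority, damping * inherited)
    out[node.name] = effective
    for child in node.children:
        propagate(child, effective, damping, out)
    return out
```

With a hot topic at the root, an image that was rarely accessed on its own still receives a high effective priority because the physical page, logical page, and topic above it are important.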

Storage Organization

Storage Management: Self-Organizing Storage Management
• Adaptively Mapping the Object Hierarchy to the Storage Hierarchy
• Mapping Based on Priorities of Data Objects
• Data Migration to Higher Levels Does Not Delete the Physical Data in Lower Storage Levels
  • Data in Main Memory have Exact Copies on Disk
  • Data on Disks have Backup Copies in Tertiary Storage
[Figure: mappings from priorities at all levels (Raw Data, Physical Pages, Logical Pages, Semantic Regions) to the storage hierarchy]
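The inclusive mapping described above (memory contents are a subset of disk contents, which are a subset of tertiary storage) can be sketched with a greedy placement by priority. The tuple layout, the capacities, and the greedy strategy are illustrative assumptions; tertiary storage is treated as unbounded, in line with the weak capacity bound.

```python
from typing import Dict, List, Tuple


def map_to_tiers(objects: List[Tuple[str, float, int]],
                 memory_cap: int, disk_cap: int) -> Dict[str, List[str]]:
    """Place (name, priority, size) objects into an inclusive hierarchy."""
    placement: Dict[str, List[str]] = {"memory": [], "disk": [], "tertiary": []}
    memory_used = disk_used = 0
    # Highest priority first, so the fastest tiers fill with important data.
    for name, _priority, size in sorted(objects, key=lambda o: o[1], reverse=True):
        placement["tertiary"].append(name)      # everything is kept here
        if disk_used + size <= disk_cap:
            placement["disk"].append(name)
            disk_used += size
            # Only objects already on disk may also be copied into memory.
            if memory_used + size <= memory_cap:
                placement["memory"].append(name)
                memory_used += size
    return placement
```

Because promotion never removes the lower-level copy, demoting an object later only requires dropping it from the faster tier; no data movement back down is needed.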

Level of Details
• Data in CBFWW Can be Preprocessed to Provide Different Data Formats and Levels of Detail to Users
• E.g., if the size of A is very big, we may not be able to store it on the same storage device
  • We can generate A’, which contains only the word/phrase information of A. Since A’ is small, it can be stored at the same level as A, although A should be stored as well. A’ can be regarded as an index for A
  • For pictures, we may be able to use low-resolution versions
• Transcoding: Generating New Formats for Original Data
• Summarizing: Generating a Text-Only Summary of Original Data

Queries for Data and Usage Data

Queries: Queries to Data Objects in CBFWW
• A Salient Feature that Distinguishes CBFWW from a Cache is its Query Capability
  • Caches Do Not Allow Direct Use of Cached Data
  • Caches Cause the Majority of Data to be Wasted
  • Our Analysis of ISP Data Reveals that Nearly 70% of Cached Contents are Never Reused
  • The Rareness (Reverse of Frequency) Also Obeys a Zipf-Like Distribution
• Using Usage Information Maintained by the System, We Can Introduce New Queries
  • Popularity-Aware Queries
  • Guided Navigation
  • Topic Sensor (What is Popular and How Popular)
[Figure labels: Usage; Results with Popularity]

Queries: Popularity-Aware Queries
• Assume an OQL (Object Query Language)-Like Language, Extended with the Following Modifiers (Like DISTINCT in SQL) and Variables
  • Modifiers: MRU, LRU, MFU, LFU
  • Variables: lastref, firstref
• CBFWW Enables Popularity-Aware Queries, e.g.,

    SELECT MRU p.oid, p.title
    FROM Physical_Page p
    WHERE p.title MENTION 'data warehouse'

This finds the most recently retrieved physical pages whose titles contain the phrase “data warehouse”.
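The effect of the modifiers can be emulated over in-memory records, as in the sketch below. It assumes that each modifier orders the matching pages by a usage attribute (MRU/LRU by last reference time, MFU/LFU by reference count) and that MENTION is a case-insensitive phrase match; the `refcount` field, the `limit` parameter, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PhysicalPage:
    oid: int
    title: str
    firstref: float  # time of first reference
    lastref: float   # time of most recent reference
    refcount: int    # number of references so far


# modifier -> (sort key, descending?)
ORDERINGS = {
    "MRU": (lambda p: p.lastref, True),
    "LRU": (lambda p: p.lastref, False),
    "MFU": (lambda p: p.refcount, True),
    "LFU": (lambda p: p.refcount, False),
}


def select(pages: List[PhysicalPage], modifier: str, phrase: str,
           limit: int = 10) -> List[Tuple[int, str]]:
    """Emulate: SELECT <modifier> p.oid, p.title ... WHERE p.title MENTION phrase."""
    key, descending = ORDERINGS[modifier]
    matches = [p for p in pages if phrase.lower() in p.title.lower()]
    matches.sort(key=key, reverse=descending)
    return [(p.oid, p.title) for p in matches[:limit]]
```

Calling `select(pages, "MRU", "data warehouse")` then corresponds to the example query, returning matching pages with the most recently referenced one first.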

Queries: Query for Logical Pages
• Queries for Logical Pages are Useful for Finding Cut-Paths when Looking for Information, e.g.,

    SELECT MFU l.path
    FROM Logical_Page l
    WHERE end_at(l.oid) IN (
        SELECT p.oid
        FROM Physical_Page p
        WHERE p.url = "http://www-db.cs.wisc.edu/cidr/");

This finds the most frequently traversed paths that lead to the home page of the CIDR conference.

Experiments and Prototype Implementation
• Experiments
  • Demonstrate the Limit of the Cache-Only Approach: the Majority of Cached Data are Never Reused
• Prototype Implementation
  • Show the Benefit of Managing History-Rich Web Data: Develop a Usage-Aware Search Engine
[Figure: the usage-aware search engine accepts queries with usage constraints and returns results, using keyword indices for a set of documents sampled from proxy logs and the frequency of reference for those documents]

Experiments and Prototype: Usage-Aware Web Search
[Figure: usage-aware vs. usage-blind search results for the query “sports”]
Usage Data from a Large ISP: Kyoto-Inet (Jan. 14, 2002 ~ Feb. 14, 2002)

Conclusions
• To Meet the Challenges Posed by the Web, We Proposed Adding the Data Selection Capability of Caches to Data Management, and Developed a New Data System Called Capacity Bound-free Web Warehouse (CBFWW)
• We Have Addressed the Following Issues Involved in the System
  • An Architecture
  • Data Management
  • Storage Management
  • Queries Using Usage Data
• We are Currently Developing a Prototype to be Used by the Provider Kyoto-Inet
