Web Caching

Web Caching • Elliot Jaffe • Presentation for the Seminar on Database and Internet, Hebrew University, Fall 2002

Agenda • Caching: Why, Where, How, What • Some empirical data: Zipf’s Law • Content Delivery Networks • Bibliography

Why cache? • Number of unique pages: 800M < X < 2.2B • Number of unique web sites: 8,500,000 • Static pages: 30% - 40% • Pages revisited: 80% • Expected hit-rate: 24% - 32%
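
The slide does not say how the expected hit-rate is derived. It appears to be the product of the static-page share and the revisit share, on the assumption that only static pages are cacheable and only revisits can hit:

$$0.30 \times 0.80 = 0.24, \qquad 0.40 \times 0.80 = 0.32$$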

Why cache? • Bandwidth • Latency • Performance = Response Time • Server Load • Failure Redundancy

Where [Diagram: possible cache locations along the request path. Labels: Browser cache, Intranet cache, Local ISP cache, ISP, CDN, L4 Switch, Data Center, Reverse Proxy, Content Server.]

Hot-potato routing • Get traffic off of your network as soon as possible • Bounces traffic around the internet • Increases chance of dropped packets • Increases latency [Diagram labels: You are here, Destination]

How: Types of Caches • Simple Proxy • Transparent Proxy • Reverse Proxy • Adaptive Caching • Push Caching • Active Caching • Streaming Caches

How: Simple Proxy • Harvest/Squid • Provide web content for a fixed user base • Standalone operation • May be transparent • Commodity product/technology • Easy to get 90% correct

How: Transparent Proxy • No client configuration • Violates end-to-end paradigm • Client thinks it is talking directly to server • Server thinks it is talking to cache • Implemented as • Pass-through unit • L4 switch

How: Reverse Proxy • Designed to offload duties from one or more specific servers • Data size is limited to size of static content on the server • Challenge is fast, disk-less operation • Cache consistency is easy • Single point of failure

How: Adaptive Caching • ISP Level caching • Cooperating multiple distributed caches • Operate as a cache-mesh based on content demand • Multicast for group membership (GCS) • Content Routing Protocol sends request to the appropriate cache within the mesh

How: Push Caching • Send the data out proactively • Content Delivery Networks • Paid for by data providers • More on this later!

How: Active Caching • Use an applet inside of the cache to customize dynamic pages on the fly • How do you identify dynamic pages? • Where does the custom data come from? • Who is going to pay for this service?

How: Streaming Caches • What about streaming content? • Movies • Audio • Proprietary streaming protocols • Challenge is to maintain quality of content and service • Who pays for this?

What: Content and Protocols • Mostly Static Content • HTML • XML • GIF • AVI • EXE • Etc.

What: Content and Protocols • HTTP 1.0 Basic protocol • Send request based on a fixed number of verbs • GET • HEAD • POST • Receive response, meta-data, content

What: Content and Protocols
• HTTP Request
    Request        = Simple-Request | Full-Request
    Simple-Request = "GET" SP Request-URI CRLF
    Full-Request   = Request-Line
                     *( General-Header
                      | Request-Header
                      | Entity-Header )
                     CRLF
                     [ Entity-Body ]
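
As a rough illustration of the grammar above, the sketch below tells a Simple-Request from the Request-Line of a Full-Request. The function name `classify_request` and the returned dictionary layout are assumptions made for this example; headers and the entity body are not parsed.

```python
def classify_request(first_line: str) -> dict:
    """Classify the first line of an HTTP/1.0 request (illustrative only)."""
    if not first_line.endswith("\r\n"):
        raise ValueError("request line must end with CRLF")
    parts = first_line[:-2].split(" ")
    # Simple-Request = "GET" SP Request-URI CRLF
    if len(parts) == 2 and parts[0] == "GET":
        return {"kind": "simple", "method": "GET", "uri": parts[1]}
    # Request-Line of a Full-Request: Method SP Request-URI SP HTTP-Version CRLF
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        method, uri, version = parts
        return {"kind": "full", "method": method, "uri": uri, "version": version}
    raise ValueError("not a valid request line")
```

For example, `classify_request("GET /pub/www/index.html HTTP/1.0\r\n")` is classified as a full request, while the same line without the version token is a simple request.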

What: Content and Protocols
• Example:
    GET /pub/www/index.html HTTP/1.0
• Response:
    HTTP/1.1 200 OK
    Server: Microsoft-IIS/5.0
    Date: Sat, 19 Oct 2002 05:46:53 GMT
    Expires: Sun, 20 Oct 2002 16:00:00 GMT
    Content-Length: 2291
    Content-Type: text/html
    Cache-control: private

What: Content and Protocols
• Example “if-modified-since”:
    GET /pub/www/index.html HTTP/1.0
    If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
• Response:
    HTTP/1.1 200 OK
    Server: Microsoft-IIS/5.0
    Date: Thu, 13 Jul 2000 05:46:53 GMT
    Expires: Sun, 20 Oct 2002 16:00:00 GMT
    Content-Length: 2291
    Content-Type: text/html
    Cache-control: private

What: Content and Protocols
• Example “if-modified-since”:
    GET /pub/www/index.html HTTP/1.0
    If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
• Response:
    HTTP/1.1 304 Not Modified
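
The conditional request in this example can be sketched as follows. This is a minimal illustration, not a full HTTP client: `build_conditional_get` and `body_to_serve` are names invented here, and the code only formats the request text and decides which body to serve once a status code is known.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def build_conditional_get(path: str, cached_at: datetime) -> str:
    """Format an HTTP/1.0 GET carrying an If-Modified-Since header."""
    # format_datetime(..., usegmt=True) needs a UTC-aware datetime and
    # produces the "Sat, 19 Oct 2002 19:43:31 GMT" style used on the slide.
    stamp = format_datetime(cached_at, usegmt=True)
    return f"GET {path} HTTP/1.0\r\nIf-Modified-Since: {stamp}\r\n\r\n"


def body_to_serve(status_code: int, cached_body: bytes, response_body: bytes) -> bytes:
    """304 means the cached copy is still good; otherwise use the new body."""
    if status_code == 304:
        return cached_body
    return response_body
```

A 304 response carries no body, which is why the cache answers from its stored copy in that case.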

Basic caching algorithm Pages may be • Fresh: up-to-date • Expired: current date > expiration date • Stale: “old”

Basic caching algorithm - #2
    If page is in the cache
        If page is expired or stale
            Get from server with if-modified-since (soft miss)
            If not modified, get from cache
        Else
            Get from cache
    Else
        Get from server
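
A runnable sketch of this lookup logic follows. Several details are assumptions for illustration: the `Entry` record, the `fetch(url, if_modified_since)` callback that returns `(status, body, expires_at)`, and the reading of "stale" as "older than `max_age` seconds". The slide defines stale only as "old".

```python
import time
from dataclasses import dataclass


@dataclass
class Entry:
    body: bytes
    fetched_at: float   # when the copy was last fetched or revalidated
    expires_at: float   # expiration time supplied by the server


def lookup(cache, url, fetch, now=None, max_age=3600.0):
    """Return (body, outcome), where outcome is 'hit', 'soft miss' or 'miss'."""
    now = time.time() if now is None else now
    entry = cache.get(url)
    if entry is None:
        # Not in the cache: plain fetch from the server.
        _status, body, expires_at = fetch(url, None)
        cache[url] = Entry(body, now, expires_at)
        return body, "miss"
    expired = now > entry.expires_at
    stale = now - entry.fetched_at > max_age   # assumed meaning of "old"
    if not (expired or stale):
        return entry.body, "hit"
    # Expired or stale: revalidate with if-modified-since.
    status, body, expires_at = fetch(url, entry.fetched_at)
    if status == 304:
        entry.fetched_at = now
        entry.expires_at = expires_at
        return entry.body, "soft miss"
    cache[url] = Entry(body, now, expires_at)
    return body, "miss"
```

Passing `fetch` in as a callback keeps the sketch independent of any particular HTTP library.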

Basic caching algorithm - #3 If cache has space Store the file Else • Delete expired from cache • Delete stale from cache • Delete LRU from cache • Delete largest/smallest from cache?
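
The eviction order on this slide (expired first, then stale, then least recently used) might be sketched as below. The `Item` fields, the `make_room` name, and the `max_age` rule for staleness are assumptions; the size-based step is left out because the slide itself poses it as an open question.

```python
from dataclasses import dataclass


@dataclass
class Item:
    size: int
    expires_at: float
    fetched_at: float
    last_used: float


def make_room(cache, needed, capacity, now, max_age=3600.0):
    """Evict until `needed` bytes fit; return True if they now fit."""
    def used():
        return sum(item.size for item in cache.values())

    def evict_while(candidates):
        for url in candidates:
            if used() + needed <= capacity:
                return
            del cache[url]

    # 1. Expired entries.
    evict_while([u for u, i in cache.items() if now > i.expires_at])
    # 2. Stale entries (assumed: fetched more than max_age seconds ago).
    evict_while([u for u, i in cache.items() if now - i.fetched_at > max_age])
    # 3. Least recently used entries, oldest access first.
    evict_while(sorted(cache, key=lambda u: cache[u].last_used))
    return used() + needed <= capacity
```

Each stage stops as soon as enough space is free, so entries that are still fresh are removed only when the cheaper victims do not suffice.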

Zipf’s law • The frequency of an event P as a function of rank i is a power-law function: $P_i = \Omega / i^{\alpha}$, where $\alpha \le 1$
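
To make the formula concrete, the sketch below computes the probabilities for a finite set of ranked items, choosing Ω so that they sum to 1. The function name and the finite-population normalization are assumptions for this example.

```python
def zipf_probabilities(n_items: int, alpha: float) -> list:
    """Return P_i = omega / i**alpha for ranks 1..n_items, normalized to sum to 1."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_items + 1)]
    omega = 1.0 / sum(weights)   # normalizing constant
    return [omega * w for w in weights]
```

With this normalization the ratio between the first and second most popular items is always `2 ** alpha`, whatever the population size.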

Zipf’s law • Observed to be true for • Frequency of written words in English texts • Population of cities • Income of a company as a function of rank

Zipf’s law and web access • For a given server, page access by rank follows Zipf’s law • Web requests from a fixed population of users follow Zipf’s law • 0.64 < α < 0.83

Observations • Top 1% of all documents account for 20% - 35% of proxy requests • Top 10% account for 45% - 55% of requests • It takes 25% to 40% of all documents to account for 70% of requests • It takes 70% to 80% of all documents to account for 90% of requests
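
These figures are measurements. As a rough way to see how a Zipf-like popularity curve concentrates requests on a few documents, the sketch below computes the share of requests going to the most popular fraction of documents under an idealized model. The function name and parameters are assumptions, and a pure Zipf model will not necessarily reproduce the measured percentages.

```python
def top_share(n_docs: int, alpha: float, fraction: float) -> float:
    """Share of requests hitting the top `fraction` of documents under Zipf."""
    weights = [rank ** -alpha for rank in range(1, n_docs + 1)]
    top_n = max(1, round(n_docs * fraction))
    return sum(weights[:top_n]) / sum(weights)
```

Comparing `top_share(n, alpha, 0.01)` with `top_share(n, alpha, 0.10)` for α in the observed range shows the request share growing far more slowly than the document share.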

Observations • For an infinite sized cache, the hit-ratio for a web-proxy grows in a log-like fashion as a function of the client population of the proxy and the number of requests seen by the proxy.

Observations • The hit-ratio of a web cache grows in a log-like fashion as a function of the cache size.

Observations Locality of Reference • The probability that a document will be referenced k requests after it was last referenced is roughly proportional to 1/k.

Observations - NOT • There is very little correlation between access frequency and document size • There is no correlation between access frequency and the change rate of a document • No single web server contributes to most of the popular pages

Zipf’s Law and Caching Discussion • How does this help in cache design? • Are there any business implications?

CDN • “Traditional” CDN • Dirty Secrets • P2P content delivery systems

Why use a CDN? [Diagram: the same cache-location picture as on the "Where" slide. Labels: Browser cache, Intranet cache, Local ISP cache, ISP, CDN, L4 Switch, Data Center, Reverse Proxy, Content Server.]

What is a CDN? • Content Delivery Networks = PUSH • PUSH = Prefetch

CDN Mechanisms • DNS redirection: complete or partial • URL rewrite

Network Model [Diagram: a DNS-redirecting CDN. The client asks the DNS redirector for example.com, is answered with server B, and sends GET http://example.com/foo to HTTP server B. Also shown: HTTP servers A and C and the original server.] Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

CDN DNS Full Redirection • (Semi)automatic mechanism to replicate the original site on CDN servers • Replace the original DNS entry with an enhanced DNS server that uses knowledge of network and server load to direct clients to an appropriate CDN server • TTLs on DNS entries are very short • Adero, NetCaching, IntelliDNS
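
The selection step of such an enhanced DNS server might look roughly like the sketch below. Everything here is hypothetical: the `Replica` fields, the scoring rule that combines latency and load, the load cut-off, and the 20-second TTL are illustrative choices, not the behavior of any vendor named on the slide.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    address: str
    load: float     # 0.0 (idle) to 1.0 (saturated)
    rtt_ms: float   # estimated round-trip time to the client's resolver


def resolve(replicas, max_load=0.9, ttl_seconds=20):
    """Pick a replica address and return it with a deliberately short TTL."""
    usable = [r for r in replicas if r.load < max_load]
    if not usable:
        usable = list(replicas)   # every replica is busy: pick the least bad
    # Hypothetical score: latency inflated by current load.
    best = min(usable, key=lambda r: r.rtt_ms * (1.0 + r.load))
    return best.address, ttl_seconds
```

The short TTL is what lets the redirector change its answer quickly as load and network conditions shift, at the cost of more DNS lookups.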

CDN DNS Partial Redirection • Statically modify selected URLs within pages to point to the CDN service • Replicate selected objects to the CDN service • Redirect clients of selected URLs using an enhanced DNS server that uses knowledge of network and server load • Akamai, Digital Island, MirrorImage, SolidSpeed, Speedera

CDN rewrite • Modify pages at the origin server on the fly • Change embedded URLs based on up-to-date knowledge of the network and CDN server loads • Does not require additional DNS lookups • Fasttide, Clearway
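
A minimal sketch of rewriting embedded URLs so that static objects are served from a CDN host. The host names, the list of static suffixes, and the regular expression are assumptions for illustration. A real deployment would choose the CDN host per request from network and load data, and would use a proper HTML parser.

```python
import re
from urllib.parse import urlsplit, urlunsplit

STATIC_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")   # assumed object types
ATTR = re.compile(r'(?P<attr>\b(?:src|href))="(?P<url>[^"]+)"')


def rewrite_page(html: str, origin_host: str, cdn_host: str) -> str:
    """Point absolute URLs of static objects on origin_host at cdn_host."""
    def swap(match):
        parts = urlsplit(match.group("url"))
        if parts.netloc == origin_host and parts.path.lower().endswith(STATIC_SUFFIXES):
            parts = parts._replace(netloc=cdn_host)
        return f'{match.group("attr")}="{urlunsplit(parts)}"'
    return ATTR.sub(swap, html)
```

Only the embedded objects move to the CDN; the page itself, and any links to other pages, still come from the origin server.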

Measuring a CDN’s performance • Two papers • K. L. Johnson, J. F. Carr, M. S. Day, and M. F. Kaashoek, “The measured performance of content distribution networks,” in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop (Lisbon, Portugal), May 2000. • B. Krishnamurthy, C. Wills, and Y. Zhang, “On the Use and Performance of Content Distribution Networks,” in ACM SIGCOMM Internet Measurement Workshop, 2001.

The measured performance of content distribution networks Client Actions • R: Resolve domain name • F: Fetch content • Ordinary client use of CDN: RF • Instead of doing (RF)+ we do R+ then F+ • This allows us to compare the server chosen to some other servers that could have been chosen, over a large number of fetches.

The measured performance of content distribution networks Procedure • R+: Collect a set of servers by repeated DNS queries • to a variety of name servers • over a number of hours • F+: Fetch a particular piece of content from each member of the set, measuring latency

The measured performance of content distribution networks Important Details • Interleaved fetches • Fetch1 at server1, fetch1 at server2, etc. • Not fetch1 at server1, fetch2 at server1, etc. • Unmeasured fetch before measured fetch • Avoids cache misses • Measure only HTTP fetch latency • CDN not penalized for cost of DNS resolution
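
The F+ phase with these details could be sketched as follows. The `fetch` callback, the injectable `clock`, and the function name are assumptions made so that the example is self-contained. The R+ phase, which collects server addresses through repeated DNS queries, is not shown.

```python
import time


def measure_interleaved(servers, fetch, rounds, clock=time.perf_counter):
    """Return {server: [latency, ...]} using interleaved, pre-warmed fetches."""
    latencies = {server: [] for server in servers}
    # Unmeasured fetch first, so the timed fetches avoid cache misses.
    for server in servers:
        fetch(server)
    for _ in range(rounds):
        # Interleave: fetch 1 at every server, then fetch 2 at every server...
        for server in servers:
            start = clock()
            fetch(server)
            latencies[server].append(clock() - start)
    return latencies
```

Only the HTTP fetch is timed, so DNS resolution cost does not enter the result.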

The measured performance of content distribution networks: Looking at these graphs • Note: log plot of latency • Gray line: cumulative distribution at one server • Red line: cumulative distribution at all servers • Blue line: cumulative distribution at CDN

The measured performance of content distribution networks Cumulative Distribution • Right way to look at this data • Want to understand frequency and magnitude of bad choices • Consistent = vertical • Fast = to the left
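
A cumulative distribution of measured latencies can be computed with a few lines. This is a generic empirical CDF, not code from the paper, and the function names are assumptions.

```python
def empirical_cdf(latencies):
    """Return sorted (latency, fraction of samples <= latency) pairs."""
    ordered = sorted(latencies)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


def fraction_at_or_below(latencies, threshold):
    """Fraction of fetches that finished within `threshold`."""
    return sum(1 for v in latencies if v <= threshold) / len(latencies)
```

Plotted with latency on the x-axis, a steep, near-vertical curve means consistent latency and a curve further to the left means faster fetches, which is how the slide reads the graphs.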

The measured performance of content distribution networks Results • Akamai does a better job than Digital Island • Neither does a particularly good job of selecting the optimal server
