
Towards Understanding Modern Web Traffic

Towards Understanding Modern Web Traffic. Sunghwan Ihm and Vivek S. Pai, Google Inc. / Princeton University. Web Changes and Growth: simple static documents → complex rich media applications; heavy client-side interactions (e.g., Ajax); traffic increase.


Presentation Transcript


  1. Towards Understanding Modern Web Traffic Sunghwan Ihm and Vivek S. Pai Google Inc. / Princeton University

  2. Web Changes and Growth • Simple static documents → complex rich media applications • Heavy client-side interactions (e.g., Ajax) • Traffic increase • Social networking, file-sharing, and video streaming sites • Trends expected to continue • Applications migrated to the Web • A de facto standard interface of cloud services Sunghwan Ihm, Princeton University

  3. Understanding Changes • Goal: shape system design by better understanding the traffic optimization opportunities • Improve response times • Understand caching effectiveness • Design intermediary systems: firewalls, security analyzers, and reporting/monitoring systems

  4. Challenges • Tracking changes • Requires a large-scale data set spanning many years, collected under the same conditions • Web page analysis • Requires new analysis techniques suitable for dynamic Web pages with client-side interactions (e.g., Ajax) • Redundancy and caching • Requires full content instead of simple access logs for assessing implications of content-based caching • We address these challenges by • Analyzing large-scale data with full content • Developing a new Web page analysis technique

  5. CoDeeN Traffic • CoDeeN content distribution network (CDN) • http://codeen.cs.princeton.edu/ • A semi-open globally distributed open proxy on 500+ PlanetLab nodes • Running since 2003 • 30+ million requests per day

  6. Data Collection • [Diagram labels: User, Browser Cache, Local Proxy Cache, CoDeeN Cache, WAN, Origin Web Server; Access Logs, Full Content] • Assume local proxy caches • 1. Access logs (all requests, but limited info.) • URL, Timestamp, Content-Length, Content-Type, Referer, etc. • 2. Full content (cache misses) • Header + body

  7. Data Set • 5 years: from 2006 to 2010 • Focus on one month (April) per year • Full content data only for 2010 • Total volume per month • 3.3~6.6 TB • 280~460 million requests • 240~360K unique client IPs (40~60% /8 nets) • 168~187 countries and regions • 820K~1.2 million servers • Focus on US, CN, FR, BR: 100M+ requests / 1TB+ / 100K+ users

  8. Analysis Outline • 1. High-level analysis • 2. Page-level analysis • 3. Caching analysis • [Diagram labels: Access Logs, Full Content]

  9. 1. High-Level Analysis • Q: What has changed over five years? • Connection speed • NAT usage • Max # concurrent browser connections • Content type • Object Size • Traffic share of Web sites

  10. Content Type • US, 2006 → 2010, both X and Y log-scale • A sharp increase of Ajax: JavaScript / CSS / XML • A sharp increase of Flash video (FLV) (<5% → 25%)

  11. Traffic Share of Web Sites • Increase in video sites’ traffic • Increase in ad networks and analytics sites’ requests (~12%) • Ad networks market growth • Most accessed sites by users • search / analytics • google.com, baidu.com, google-analytics.com • User share increasing: these sites track up to 65% of users

  12. 2. Page-Level Analysis • Q: How have Web pages changed? • New page detection heuristic • Initial page characteristics • Page size / # of embedded objects / latency • Page load latency simulation • Entire page characterization

  13. Page Detection Problem • Given a set of access logs, detect the page boundaries • # of embedded objects, page size, time, etc. • Challenge: previous approaches from the 1990s are a poor fit and inaccurate for modern Web traffic • [Diagram labels: Time, main, embedded]

  14. Previous Approach #1: Time-based • Check idle time between requests • If within a threshold (e.g., 1 second), they belong to the same page • Misclassifies client-side interactions (Ajax) with longer idle time as pages
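
The time-based rule on this slide can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `split_pages_by_idle_time` and the sample timestamps are hypothetical, and the 1-second default follows the example threshold on the slide.

```python
from typing import List


def split_pages_by_idle_time(timestamps: List[float],
                             threshold: float = 1.0) -> List[List[float]]:
    """Group request timestamps (in seconds) into pages.

    A request joins the current page when the idle time since the previous
    request is within `threshold`; otherwise it starts a new page.
    """
    pages: List[List[float]] = []
    for ts in sorted(timestamps):
        if pages and ts - pages[-1][-1] <= threshold:
            pages[-1].append(ts)   # short gap: same page
        else:
            pages.append([ts])     # long gap (or first request): new page
    return pages


# Hypothetical request times for one client.
requests = [0.0, 0.2, 0.5, 3.0, 3.1, 9.0]
pages = split_pages_by_idle_time(requests)
```

Under this rule an Ajax request issued several seconds after the page has loaded opens a new "page", which is the misclassification the slide points out.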

  15. Previous Approach #2: Type-based • Check file extension / content type • Regard every HTML object as a main object • Misclassifies frames/iframes within a page as separate pages
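
A minimal sketch of the type-based rule, assuming log entries are `(url, content_type)` pairs; the helper `is_main_object` and the example URLs are hypothetical and only illustrate the idea on the slide.

```python
from typing import List, Tuple


def is_main_object(url: str, content_type: str) -> bool:
    """Treat a request as a main object if it looks like an HTML document."""
    path = url.split("?", 1)[0].lower()   # ignore the query string
    return (content_type.lower().startswith("text/html")
            or path.endswith((".html", ".htm")))


# Hypothetical log for a single page that embeds an iframe.
log: List[Tuple[str, str]] = [
    ("http://example.com/index.html", "text/html"),
    ("http://example.com/style.css", "text/css"),
    ("http://example.com/ad_frame.html", "text/html"),  # iframe in the same page
    ("http://example.com/logo.png", "image/png"),
]
main_objects = [url for url, ctype in log if is_main_object(url, ctype)]
```

Here the iframe is reported as a second page even though the user loaded only one, which is the weakness noted on the slide.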

  16. StreamStructure Algorithm • 1. Group logs into streams by Referer field • 2. Consider all HTML objects as main object candidates (Type-based) • 3. Ignore those with no children (embedded objects) • 4. Apply idle time among the candidates for finalizing selection (Time-based) • [Diagram labels: Ajax, frames/iframes]
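
The four steps above can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the authors' implementation: stream grouping is reduced to a Referer lookup, step 4 is interpreted as "a candidate that follows the previous request too closely is treated as embedded", and the `Request` type, `find_main_objects`, and the sample log are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    timestamp: float          # seconds
    url: str
    referer: Optional[str]    # value of the Referer header, if any
    content_type: str


def find_main_objects(log: List[Request],
                      idle_threshold: float = 1.0) -> List[str]:
    log = sorted(log, key=lambda r: r.timestamp)
    # Step 1 (simplified): a URL has children if some request names it as Referer.
    referers = {r.referer for r in log if r.referer}
    main_objects: List[str] = []
    prev_ts: Optional[float] = None
    for req in log:
        is_html = req.content_type.startswith("text/html")   # step 2
        has_children = req.url in referers                    # step 3
        idle = float("inf") if prev_ts is None else req.timestamp - prev_ts
        if is_html and has_children and idle >= idle_threshold:  # step 4
            main_objects.append(req.url)
        prev_ts = req.timestamp
    return main_objects


A, B = "http://example.com/a.html", "http://example.com/b.html"
FRAME = "http://example.com/frame.html"
sample_log = [
    Request(0.0, A, None, "text/html"),
    Request(0.1, "http://example.com/a.css", A, "text/css"),
    Request(0.2, FRAME, A, "text/html"),                       # iframe
    Request(0.3, "http://example.com/frame.png", FRAME, "image/png"),
    Request(5.0, "http://example.com/update.json", A, "application/json"),  # Ajax
    Request(10.0, B, A, "text/html"),
    Request(10.1, "http://example.com/b.js", B, "application/javascript"),
    Request(20.0, "http://example.com/c.html", B, "text/html"),  # no children
]
detected = find_main_objects(sample_log)
```

In this sketch the iframe is rejected by the idle-time check, the Ajax call is rejected by its content type, and the childless HTML object is rejected by step 3.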

  17. Validation • Ground truth: browse Alexa’s top 100 sites • Visit about 10 pages per site • Record Web page URLs (main objects) • Total 1197 pages • Precision • # correct pages found / # total pages found • Recall • # correct pages found / # total correct pages
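
The two metrics defined on this slide can be computed as in the short sketch below; the URL sets are made-up examples, not data from the study.

```python
from typing import Set, Tuple


def precision_recall(found: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Precision = correct found / total found; recall = correct found / total correct."""
    correct = len(found & truth)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


truth = {"/a", "/b", "/c", "/d"}   # hypothetical ground-truth main objects
found = {"/a", "/b", "/x"}         # hypothetical detector output
precision, recall = precision_recall(found, truth)
```

With these sets, two of the three detected pages are correct and two of the four true pages are found.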

  18. Validation Result • StreamStructure outperforms other approaches • Robust to the idle time parameter selection • [Figure annotations: Better; 4; 26~33; 19~30; 4~24; 1 sec]

  19. Identifying Initial Page Loads • [Diagram labels: Initial Page Load, Client-side Interactions (e.g., Ajax)] • Initial page: user-perceived page → user-perceived latency → traffic/revenue of Web sites • Apply Time-based approach, but DNS lookup or browser processing time can vary significantly • Use Google Analytics beacon • JavaScript collecting various client-side info. • Fires when the document is loaded • 40-60% of traffic occurs after initial page loads
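
One way to picture the beacon idea: treat the first analytics beacon request within a page as the end of the initial load and count what follows as client-side interaction traffic. The sketch below is an illustration only; the marker string, the `(timestamp, url, size)` record layout, and the sample requests are assumptions, and the beacon itself is left out of both halves.

```python
from typing import List, Tuple

Record = Tuple[float, str, int]   # (timestamp, url, response size in bytes)

# Assumed substring identifying the beacon request; adjust for the logs at hand.
BEACON_MARKER = "google-analytics.com/__utm.gif"


def split_at_beacon(page: List[Record],
                    marker: str = BEACON_MARKER) -> Tuple[List[Record], List[Record]]:
    """Return (initial load requests, requests after the beacon)."""
    for i, (_, url, _) in enumerate(page):
        if marker in url:
            return page[:i], page[i + 1:]
    return page, []   # no beacon seen: treat everything as the initial load


page = [
    (0.0, "http://example.com/", 5000),
    (0.2, "http://example.com/app.js", 3000),
    (0.9, "http://www.google-analytics.com/__utm.gif?utmwv=1", 35),
    (4.0, "http://example.com/api/update", 2000),
    (8.0, "http://example.com/api/update", 2000),
]
initial, later = split_at_beacon(page)
initial_bytes = sum(size for _, _, size in initial)
later_bytes = sum(size for _, _, size in later)
later_share = later_bytes / (initial_bytes + later_bytes)
```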

  20. Initial Page Size and # Objects • Initial pages become increasingly complex • US: about 2x increase • 2006: 69 KB / 6 objects • 2010: 133 KB / 12 objects • [Figure annotation: Caching Effectiveness]

  21. Initial Page Load Latency • Median latency dropped in 2009 and 2010 • Increased # of concurrent browser connections • Reduced per-object latency from improved caching behavior / client bandwidth

  22. 3. Caching Analysis • Q: Implications for caching? • URL popularity • Caching effectiveness • Required cache storage size • Impact of aborted transfers

  23. Two Caching Approaches • HTTP Object-based Approach • Whole object • HTTP-cacheable only • Previously reported cache hit rate: 35~50% • Byte hit rate usually much less • Content-based Approach • Cache smaller chunks instead of objects • Protocol independent • Effective for uncacheable content as well • WAN accelerators, storage/file systems

  24. Ideal Cache Hit Rate • HTTP object-based: 17~28% • Mainly effective for JavaScript and image • Content-based: 42~51% with 128-byte chunks • Effective for any content type • Growth of tail that hurts caching • [Figure annotation: 1.8~2.5x]
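
To illustrate why chunk-level caching can find redundancy that whole-object caching misses, the sketch below measures the fraction of bytes already seen when responses are split into chunks and indexed by hash. It is a simplification: it uses fixed-size chunks and an unbounded cache, whereas deployed systems often pick chunk boundaries from the content itself; the function name and the sample bodies are hypothetical.

```python
import hashlib
from typing import Iterable


def chunk_byte_hit_rate(bodies: Iterable[bytes], chunk_size: int = 128) -> float:
    """Fraction of bytes whose chunk was already seen earlier in the stream."""
    seen = set()
    hit_bytes = 0
    total_bytes = 0
    for body in bodies:
        for off in range(0, len(body), chunk_size):
            chunk = body[off:off + chunk_size]
            digest = hashlib.sha1(chunk).digest()
            total_bytes += len(chunk)
            if digest in seen:
                hit_bytes += len(chunk)   # chunk content already cached
            else:
                seen.add(digest)          # first sighting: store it
    return hit_bytes / total_bytes if total_bytes else 0.0


# Two versions of the same URL: the first half is unchanged, the second differs.
version1 = b"A" * 128 + b"B" * 128
version2 = b"A" * 128 + b"C" * 128
rate = chunk_byte_hit_rate([version1, version2])
```

A whole-object cache would treat the updated version as a complete miss, while the chunk index still serves the unchanged half.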

  25. Origins of Redundancy • Most of the additional savings come from redundancy • across different versions (intra-URL) • across different objects (inter-URL) • [Figure labels: US, 128 byte; Aborted; Content updates]

  26. Required Cache Storage Size • 1-KB chunks outperform 128-B chunks once metadata overhead is counted • MRC: Multi-Resolution Chunking (USENIX’10) • Increases working set size • Large cache storage highly desirable • [Figure annotation: CN: 218GB]
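
A back-of-the-envelope view of the metadata overhead mentioned above: if every cached chunk needs an index entry of fixed size, the relative overhead grows as chunks shrink. The 20-byte entry size below is an assumption for illustration (the length of a SHA-1 digest), not a figure from the talk.

```python
def metadata_overhead(chunk_size: int, entry_bytes: int = 20) -> float:
    """Index bytes needed per byte of cached content (entry size is assumed)."""
    return entry_bytes / chunk_size


overhead_128 = metadata_overhead(128)    # 128-byte chunks
overhead_1k = metadata_overhead(1024)    # 1-KB chunks
ratio = overhead_128 / overhead_1k       # how much more index the small chunks need
```

Under this assumption the 128-byte index is eight times larger per cached byte, which is the kind of cost that can offset the finer-grained redundancy detection.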

  27. Conclusions • Analyzed five years of real Web traffic with over 70,000 users • Observed a rise of Ajax and Flash video, and search engine / analytics sites tracking 65% of users • Developed StreamStructure • Half of the traffic occurs due to client-side interactions after initial page loads • Pages have become increasingly complex • Content-based caching with large cache storage highly desirable • 2x larger byte hit rate, aborted transfers

  28. Thank You • sihm@cs.princeton.edu • http://www.cs.princeton.edu/~sihm/
