This study evaluates how realistic the goal of non-stop, 24/7 web availability actually is. It reviews traditional uptime metrics such as "nines" and argues that availability must be measured from the end-user's perspective rather than as server uptime alone. The experiment ran real web transactions for more than six months against a variety of sites and found that the large majority of errors originate at the local (client) end, often involving hardware faults and human error. The talk also considers the impact on user satisfaction, response times, and error-management strategies such as retrying failed requests.
Measuring End-User Availability on the Web: Practical Experience
Matthew Merzbacher (visiting research scientist)
Dan Patterson (undergraduate)
Recovery-Oriented Computing (ROC)
University of California, Berkeley
http://roc.cs.berkeley.edu
E-Commerce Goal
• Non-stop availability
  • 24 hours/day
  • 365 days/year
• How realistic is this goal?
• How do we measure availability?
  • To evaluate competing systems
  • To see how close we are to optimum
The State of the World
• Uptime measured in “nines”
  • Four nines == 99.99% uptime (just under an hour of downtime per year)
  • Does not include scheduled downtime
• Manufacturers advertise six nines
  • Roughly 30 s of unscheduled downtime per year
  • May be true in a perfect world
  • Not true in practice on the real Internet
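The downtime figures above are easy to verify; here is a minimal Python sketch (the names are illustrative, not from the talk) that converts a “nines” figure into the yearly downtime budget it allows:

```python
# Convert an availability target expressed in "nines" into the downtime
# it permits over one (non-leap) year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s


def downtime_per_year(nines: int) -> float:
    """Allowed downtime in seconds for the given number of nines."""
    availability = 1 - 10 ** (-nines)     # e.g. 4 nines -> 0.9999
    return SECONDS_PER_YEAR * (1 - availability)


for n in (3, 4, 5, 6):
    secs = downtime_per_year(n)
    print(f"{n} nines: {secs / 60:7.1f} min/year ({secs:10.1f} s)")
```

Four nines works out to about 53 minutes per year (“just under an hour”), and six nines to roughly 31 seconds, matching the figures on the slide.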
Measuring Availability
• Measuring “nines” of uptime is not sufficient
  • Reflects unrealistic operating conditions
• Must capture the end-user’s experience
  • Server + network + client
  • Client machine and client software
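The slides do not give an explicit formula, so the following is an assumed, minimal definition of end-user availability: the fraction of attempted end-to-end transactions (client, network, and server together) that actually succeed.

```python
# Assumed end-user availability metric: successful end-to-end
# transactions divided by attempted transactions.
def end_user_availability(successes: int, attempts: int) -> float:
    if attempts <= 0:
        raise ValueError("no transactions attempted")
    return successes / attempts


# Example: 970 of 1000 hourly probes succeeded -> 0.97 measured availability.
print(f"{end_user_availability(970, 1000):.4f}")
```

This differs from server uptime because a transaction counts as failed whenever any link in the chain fails, including the client’s own machine and software.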
Existing Systems
• Topaz, Porvio, SiteAngel
  • Measure response time, not availability
  • Monitor service-level agreements
• NetCraft
  • Measures availability, not performance or end-user experience
• We measured end-user experience and located common problems
Experiment
• “Hourly” small web transactions
• From two relatively proximate sites
  • (Mills CS, Berkeley CS)
• To a variety of sites, including
  • Internet retailer (US and international)
  • Search engine
  • Directory service (US and international)
• Ran for 6+ months
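The talk describes the probes only at this level of detail. A plausible shape for such a client-side probe is sketched below in Python; the target URLs, timeout, and log format are assumptions for illustration, not the authors’ actual harness.

```python
# Hypothetical end-to-end probe: fetch each target URL once, record
# success/failure, an error label, and the response time. An hourly
# cron job appending to a CSV log would approximate the experiment.
import csv
import time
import urllib.error
import urllib.request

TARGETS = [            # placeholder URLs, not the sites used in the study
    "https://www.example.com/",
    "https://search.example.org/?q=test",
]
TIMEOUT_S = 30         # assumed per-request timeout


def probe(url: str) -> tuple[bool, str, float]:
    """Return (success, error label, elapsed seconds) for one transaction."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            resp.read()                      # pull the whole page, like a browser
            return True, "", time.monotonic() - start
    except urllib.error.HTTPError as exc:    # server answered with an error status
        return False, f"http {exc.code}", time.monotonic() - start
    except urllib.error.URLError as exc:     # DNS, connect, TLS failures
        return False, str(exc.reason), time.monotonic() - start
    except OSError as exc:                   # local socket / network-down errors
        return False, str(exc), time.monotonic() - start


with open("probe_log.csv", "a", newline="") as log:
    writer = csv.writer(log)
    for target in TARGETS:
        ok, err, elapsed = probe(target)
        writer.writerow([int(time.time()), target, int(ok), f"{elapsed:.3f}", err])
```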
Types of Errors
• Local: 82%
• Network (medium): 11%
• Network (severe): 4%
• Server: 2%
• Corporate: 1%
Client Hardware Problems Dominate User Experience
• System-wide crashes
• Administration errors
• Power outages
• And many, many more…
• Many, if not most, caused or aggravated by human error
Does Retry Help?
• [Chart: retry success rates; green indicates > 80% success on retry, red < 50%]
What Guides Retry?
• Uniqueness of the data
• Importance of the data to the user
• Loyalty of the user to the site
• Transience of the information
• And more…
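A client-side retry policy consistent with these observations might look like the sketch below; the attempt count, backoff, and the list of “looks local” error markers are assumptions, chosen to reflect the study’s finding that retry rarely helps when the failure is on the local end.

```python
# Hypothetical retry policy: retry remote failures with a simple backoff,
# but give up immediately on errors that look local to the client, since
# retrying the remote site cannot fix those.
import time

LOCAL_HINTS = ("network is unreachable", "no route to host")  # assumed markers


def fetch_with_retry(fetch, url: str, attempts: int = 3, backoff_s: float = 5.0):
    """Call fetch(url) up to `attempts` times; fetch returns (ok, error, elapsed)."""
    for i in range(attempts):
        ok, error, elapsed = fetch(url)
        if ok:
            return True, elapsed
        if any(hint in error.lower() for hint in LOCAL_HINTS):
            break                            # local problem: stop retrying
        time.sleep(backoff_s * (i + 1))      # linear backoff between attempts
    return False, None
```

Combined with the probe() sketch above, fetch_with_retry(probe, url) would retry transient network or server failures but fail fast when the client’s own connectivity is the problem.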
Conclusion
• Experiment modeled the end-user experience
• Vast majority (81%) of errors were on the local end
• Almost all errors were in the “last mile” of service
• Retry doesn’t help for local errors
  • The user may be aware of the problem and therefore less frustrated by it