Performance Forensics Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering Stephen Feldman Senior Director Performance Engineering and Architecture email@example.com
Welcome to BbWorld’08 • Finishing my 5th year at Blackboard. • Brought in to build a Performance Engineering Practice. • Team of 15 including myself • Half of the team are Performance Test Engineers • Half of the team are Software Developers • Responsible for the performance and scalability of the BbLearn architecture.
Session Housekeeping • Three hours of fun and excitement. • Feel free to fire up your laptops. • We will take one break at the half-way point • Take a break whenever you need to • Questions are welcome at any time.
Our Session Schedule • Part One: Introduction to Performance Forensics • 1:00 to 2:25pm • Break • 2:25 to 2:35pm • Part Two: Advanced Performance Forensics • 2:35pm to 4:00pm
Session Goals The goals of today’s session are… • Introduce you to the science of performance forensics. • Present a methodology for performing forensics. • Discuss techniques for arriving at root cause analysis. • Familiarize the audience with tools that can be used to assist the forensics process.
Session Learning Objectives At the end of the session you should be able to… • Write your own problem statements. • Perform the process of evidence collection and interviewing. • Apply techniques for using data and analysis to avoid diagnosis bias and value attribution. • Perform root cause analysis as part of the performance forensics process. • Begin using different tools for capturing key performance data
Part One: Introduction to Performance Forensics What is forensic engineering?
A Practical Definition • The term forensics means “The science and practice of collection, analysis, and presentation of information relating to a crime in a manner suitable for use in a court of law.” • This definition is in the context of a crime. • Forensic engineering is the application of accepted engineering practices and principles for discussion, debate, argumentation, or legal purposes.
Definition of Performance Forensics • The practice of collecting evidence, performing interviews and modeling for the purpose of root cause analysis of a performance or scalability problem. • Performance problems can be classified in two main categories: • Response Time Latency • Queuing Latency
Identifying the Problem • Problems are not always easily identifiable. • When they are readily apparent, a simple problem statement should be declared so that the investigation can commence. • Call out symptoms; do not diagnose. • When the problem is not clear, the appropriate course of action is to narrow down the possibilities of what the problem could be. • Be willing to leave the problem statement open-ended until a better-formulated problem statement can be attained.
Problem Statements • Example Weak Problem Statement: • Sally Simpleton is experiencing response time latency in the Grade Center. • Why is the statement weak? • Who is Sally Simpleton? • What defines response time latency? • What is she doing in the Grade Center? • When does it happen? • Can it be reproduced?
Strengthen the Problem Statement • Sand College is reporting response time latency of 90 to 120 seconds when course administrators edit Grade Center cells. • The problem is reproducible when using Sally Simpleton’s login credentials and accessing her course section (Introduction to Software Performance Engineering). • The problem has been reproduced at all times of day, across different course sections and on different systems.
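The questions asked of the weak statement can be treated as a checklist. The Python sketch below is one hedged way to do that; the class name, field names, and `missing()` helper are hypothetical and not part of any Blackboard tool. It only detects unanswered questions and cannot judge whether an answer is vague.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProblemStatement:
    """Hypothetical checklist: one field per question asked of a weak statement."""
    reporter: Optional[str] = None      # Who is reporting, and in what role?
    measurement: Optional[str] = None   # What defines the latency, in numbers?
    action: Optional[str] = None        # What is the user doing?
    timing: Optional[str] = None        # When does it happen?
    reproduction: Optional[str] = None  # Can it be reproduced, and how?

    def missing(self) -> List[str]:
        """Return the names of questions that are still unanswered."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# The weak statement names a person and little else.
weak = ProblemStatement(reporter="Sally Simpleton")

# The strengthened statement answers every question.
strong = ProblemStatement(
    reporter="Sand College course administrators",
    measurement="90 to 120 seconds response time",
    action="editing Grade Center cells",
    timing="all times of day",
    reproduction="Sally Simpleton's login and course section",
)

print("weak is missing:", weak.missing())
print("strong is missing:", strong.missing())
```

A statement with an empty `missing()` list is ready for investigation; anything else points to the next interview question.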
Evidence • Multiple types of gathered evidence are used to solve performance problems. • Log artifacts • Monitoring/Measurement tools • Instrumentation/Sensors • Interactive evidence gathering through interviews. • Evidentiary support through discrete simulation • Improving future evidentiary capabilities by improving the Performance Maturity Model
Log Artifacts • Understand what logs are in place and where they can be found. • Know what they are used for and whether they provide the right information. • Keep them slim and usable. • Learn how to associate and correlate • Associate multiple log artifacts • Correlate events to the problem statement
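One common way to associate two log artifacts is to join them on time. The Python sketch below pairs slow requests from an access log with database events that occurred within a few seconds of them. The log lines, their formats, the 90-second threshold, and the 5-second window are all assumptions made for illustration; real logs will need their own parsers.

```python
from datetime import datetime, timedelta

# Hypothetical log excerpts; real formats differ per product and version.
ACCESS_LOG = [
    "2008-07-15 13:02:11 /gradebook/edit 95.2",
    "2008-07-15 13:05:40 /course/home 0.8",
    "2008-07-15 13:09:03 /gradebook/edit 110.7",
]
DB_LOG = [
    "2008-07-15 13:02:09 lock wait on table GRADES",
    "2008-07-15 13:09:01 lock wait on table GRADES",
    "2008-07-15 13:20:00 checkpoint complete",
]


def parse(line):
    """Split a line into (timestamp, rest of line)."""
    stamp, rest = line[:19], line[20:]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), rest


def slow_requests(lines, threshold_s):
    """Yield (timestamp, url, seconds) for requests at or above threshold_s."""
    for line in lines:
        ts, rest = parse(line)
        url, seconds = rest.rsplit(" ", 1)
        if float(seconds) >= threshold_s:
            yield ts, url, float(seconds)


def correlate(access_lines, db_lines, threshold_s=90.0, window_s=5):
    """Pair each slow request with database events within window_s seconds."""
    db_events = [parse(line) for line in db_lines]
    window = timedelta(seconds=window_s)
    matches = []
    for ts, url, seconds in slow_requests(access_lines, threshold_s):
        nearby = [msg for db_ts, msg in db_events if abs(db_ts - ts) <= window]
        matches.append((ts, url, seconds, nearby))
    return matches


matches = correlate(ACCESS_LOG, DB_LOG)
for ts, url, seconds, nearby in matches:
    print(ts, url, seconds, nearby)
```

A pairing like this is a lead to investigate, not a diagnosis: events that occur close together in time are not necessarily cause and effect.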
Putting Collectors/Sensors in Place • When should this happen? • When a problem statement cannot be developed from the data you do have (evidence or interviews) and more data needs to be collected. • How should you go about this? • Minimize disruption to the production environment. • Adaptive collection: move from less intensive to more intensive over time (basic sampling, then continuous collection, then profiling).
Monitoring and Measurement • Third-party components, whether commercial or open source, deployed to measure responsiveness and resource utilization • Excellent tools for trending and correlation • Specialization of tools to solve different types of problems. • Used in forensics to correlate resource utilization with event occurrences.
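The adaptive collection idea can be sketched as a small escalation rule: stay at the current level once a problem statement can be written, otherwise move one step up. The Python below is a minimal illustration; the `Level` enum and `next_level` function are hypothetical names, and a real decision would also weigh the overhead each level adds to production.

```python
from enum import IntEnum


class Level(IntEnum):
    """Collection levels, ordered from least to most intrusive."""
    BASIC_SAMPLING = 1
    CONTINUOUS_COLLECTION = 2
    PROFILING = 3


def next_level(current, have_problem_statement):
    """Escalate one step only while the evidence is still insufficient."""
    if have_problem_statement or current == Level.PROFILING:
        return current
    return Level(current + 1)


# Simulated review cycles: two rounds without a usable problem statement,
# then one round where the collected data is finally sufficient.
level = Level.BASIC_SAMPLING
history = [level]
for have_statement in (False, False, True):
    level = next_level(level, have_statement)
    history.append(level)

print([lvl.name for lvl in history])
```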
Interviewing • Techniques • Lassie Question • Time Association • User experienced • Locality • Component/Feature Specific • Gathering non-discrete clues • Making use of Method-R • Avoiding diagnosis bias • Eliminating value attribution • Can a pattern be identified?
Diagnosis Bias • It is human nature to label people, ideas or things based on our initial opinions of them. • Not necessarily scientific, but rather a combination of gut feelings, irrational judgment or failure to process enough conclusive data. • We often diagnose before we can get to root cause analysis based on a hunch or perception.
Value Attribution • Humans have a tendency to imbue someone or something with certain qualities based on its perceived value rather than objective data. • Example 1: The problem can’t be my SAN, I spent $250,000 on it. • Example 2: It can’t be the network, my engineers are the best in the field. They won’t allow a network problem to happen.
Discrete Simulation as Evidentiary Support • Performance testing is another technique for gathering evidence. • Provides the opportunity to increase logging and watch for events or occurrences not seen originally. • Also provides the opportunity to reproduce conditions that cause the performance issue.
An Abstract Example • The role of temperature in O-ring failures was difficult to determine by focusing on individual cases. Attention was focused on two key cases with O-ring failures: • SRM15 (cold launch) • SRM22 (warm launch)
Hypothesis versus Diagnosis • Hypothesis: A prediction or educated guess about a problem, made before it has been proven scientifically or mathematically. • Diagnosis: A scientific, empirical or measured conclusion about a problem. • Not necessarily the correct answer, but enough data has been gathered to propose a diagnosis. • A problem statement needs to be in place for both to exist. • Both need supporting data to be developed.
Quick Comments About Method-R • Method-R is a preferred methodology for problem statement development and problem diagnosis. • While it was created for Oracle performance analysis, it can be applied to all aspects of software performance forensics. • It focuses on identifying the user actions that matter most to the business and improving the performance of those first.
What is Correlation? • Correlation is a measure of the statistical relationship between two comparable data series. • Time associations are typically made. • Correlate to resource demand • Correlate to event or occurrence • Correlation is primarily a part of hypothesis and diagnosis.
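As a concrete illustration, the Pearson coefficient is one common way to quantify how strongly two time-aligned series move together. The slides do not name a specific statistic, so the choice of Pearson is an assumption here, and the sample values below are invented for illustration rather than measured.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps (time association).
cpu_percent = [20, 35, 50, 65, 80, 95]
response_s = [1.1, 1.3, 1.9, 2.8, 4.6, 9.0]

r = pearson(cpu_percent, response_s)
print(f"correlation between CPU utilization and response time: {r:.2f}")
```

A coefficient near 1 or -1 supports a hypothesis that the two measurements are related; it does not by itself establish which one drives the other, which is why correlation feeds the hypothesis and still needs to be proven out.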
Getting to Root Cause Analysis • Devising a strong problem statement • Foundation steps of Method-R • Knowing where to collect evidence • Formulating a data-driven hypothesis • Appropriate use of correlation, modeling and visualizing • Proving the hypothesis out (test-driven approach) • Establishing a diagnosis • Avoid diagnosis bias and value attribution • Treating the symptoms • A diagnosis is not always black and white