Exploring Tools and Techniques for Distributed Continuous Quality Assurance Adam Porter aporter@cs.umd.edu University of Maryland http://www.cs.umd.edu/projects/skoll
Quality Assurance for Large-Scale Systems • Modern systems are increasingly complex • Run on numerous platform, compiler & library combinations • Have 10s, 100s, even 1000s of configuration options • Are evolved incrementally by geographically-distributed teams • Run atop other frequently changing systems • Have multi-faceted quality objectives • How do you QA systems like this? http://www.cs.umd.edu/projects/skoll
A Typical QA Process • Developers make changes & perform some QA locally • Continuous build, integration & test of CVS HEAD • Sometimes performed by multiple (uncoordinated) machines • Lots of QA being done, but systems still break, too many bugs escape to the field & there are too many late integration surprises • Some contributors to QA process failures • QA only for the most readily-available platforms & default configurations • Knowledge, artifacts, effort & results not shared • Often limited to compilation & simple functional testing • Static QA processes that focus solely on CVS HEAD http://www.cs.umd.edu/projects/skoll
General Assessment • Works reasonably well for a few QA processes on a very small part of the full system space • For everything else it has deficiencies & leads to serious blind spots as to the system’s actual behavior • Large portions of system space go unexplored • Can’t fully predict effect of changes • Hard to interpret from-the-field failure reports & feedback http://www.cs.umd.edu/projects/skoll
Distributed Continuous Quality Assurance • DCQA: QA processes conducted around-the-world, around-the-clock on powerful, virtual computing grids • Grids can be made up of end-user machines, company-wide resources, or dedicated computing clusters • General Approach • Divide QA processes into numerous tasks • Intelligently distribute tasks to clients who then execute them • Merge and analyze incremental results to efficiently complete the desired QA process http://www.cs.umd.edu/projects/skoll
DCQA (cont.) • Expected benefits • Massive Parallelization • Greater access to resources/environs. not readily found in-house • Coordinate QA efforts to enable more sophisticated analyses • Recurring themes: • Leverage developer & application community resources • Maximize task & resource diversity • Opportunistically divide & conquer large QA analyses • Steer process to improve efficiency & effectiveness • Gather information proactively http://www.cs.umd.edu/projects/skoll
Group Collaborators • Doug Schmidt & Andy Gokhale • Alex Orso • Myra Cohen • Murali Haran, Alan Karr, Mike Last & Ashish Sanil • Sandro Fouché, Atif Memon, Alan Sussman, Cemal Yilmaz (now at IBM TJ Watson) & Il-Chul Yoon http://www.cs.umd.edu/projects/skoll
Overview • The Skoll DCQA infrastructure & approach • Novel DCQA processes • Future work http://www.cs.umd.edu/projects/skoll
Skoll DCQA Infrastructure & Approach • Client registers & receives client kit • When a client becomes available it requests a QA task • Server selects the best task matching the client's characteristics • Client executes the task & returns results • Server processes results & updates internal databases See: A. Memon, A. Porter, C. Yilmaz, A. Nagarajan, D. C. Schmidt, and B. Natarajan, Skoll: Distributed Continuous Quality Assurance, ICSE’04 http://www.cs.umd.edu/projects/skoll
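A minimal sketch (Python) of the client-side loop implied by these steps; SkollServer/ClientKit-style objects and their methods are hypothetical names for illustration, not the real Skoll API:

```python
import time

def client_loop(server, client_kit, characteristics):
    """Register once, then repeatedly request, execute, and report QA tasks."""
    server.register(characteristics)                 # client registers & receives client kit
    while True:
        task = server.request_task(characteristics)  # server picks the best-matching task
        if task is None:
            time.sleep(600)                          # nothing suitable right now; back off
            continue
        results = client_kit.execute(task)           # run the QA task locally
        server.return_results(task, results)         # server merges results, updates databases
```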
Implementing DCQA Processes in Skoll • Define how QA analysis will be subdivided by defining generic QA tasks & the QA space in which they operate • QA tasks: Templates parameterized by “configuration” • Variable information needed to run concrete QA tasks • QA space: the set of all “configurations” in which QA tasks run • Define how QA tasks will be scheduled & how results will be processed by defining navigation strategies • Navigation strategies “visit” points in the QA space, applying QA tasks & processing incremental results http://www.cs.umd.edu/projects/skoll
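A minimal sketch (Python, with made-up option names and constraints) of what a QA space and a trivial navigation strategy might look like; real Skoll specifications are richer, and qa_task stands in for a concrete task such as run_testsuite:

```python
from itertools import product
import random

# Illustrative QA space: each option maps to its legal settings.
QA_SPACE = {
    "CORBA_MESSAGING": [0, 1],
    "AMI":             [0, 1],
    "COMPILER":        ["gcc", "msvc"],
}

def is_valid(cfg):
    # Illustrative inter-option constraint: AMI requires CORBA messaging.
    return not (cfg["AMI"] == 1 and cfg["CORBA_MESSAGING"] == 0)

def all_configurations():
    names = list(QA_SPACE)
    for values in product(*(QA_SPACE[n] for n in names)):
        cfg = dict(zip(names, values))
        if is_valid(cfg):
            yield cfg

def qa_task(cfg):
    """Generic QA task template, instantiated with one configuration."""
    return {"config": cfg, "outcome": "PASS"}   # placeholder result

# Trivial navigation strategy: visit the valid points in random order.
schedule = list(all_configurations())
random.shuffle(schedule)
results = [qa_task(cfg) for cfg in schedule]
```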
The ACE+TAO+CIAO (ATC) System • ATC characteristics • 2M+ line open-source CORBA implementation • maintained by 40+ geographically-distributed developers • 20,000+ users worldwide • Product Line Architecture with 500+ configuration options • runs on dozens of OS and compiler combinations • Continuously evolving – 200+ CVS commits per week • Quality concerns include correctness, QoS, footprint, compilation time & more • Beginning to use Skoll for continuous build, integration & test of ATC http://www.cs.umd.edu/projects/skoll
Define QA Space http://www.cs.umd.edu/projects/skoll
Define Generic QA Tasks • Implemented as a workflow of “services” provided by client kit • Some built-in: log, cvs_download, upload_results • Most interesting ones application-specific: e.g., build, run_tests • Ex: The run_testsuite QA task invokes services that: • Enable logging • Define component to be built • Set configuration options & environment variables • Download source from a CVS repo • Configure & build component • Compile tests • Execute tests • Upload results to server http://www.cs.umd.edu/projects/skoll
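A sketch of the run_testsuite workflow in Python; the stubbed services below only echo what they would do, since the real implementations are application-specific:

```python
def log(msg):                 print(f"[skoll] {msg}")
def set_options(opts, env):   log(f"options={opts} env={env}")
def cvs_download(repo):       log(f"checkout {repo}")
def configure_build(comp):    log(f"configure & build {comp}")
def compile_tests(comp):      log(f"compile tests for {comp}")
def execute_tests(comp):      log(f"run tests for {comp}"); return {"passed": 0, "failed": 0}
def upload_results(results):  log(f"upload {results}")

def run_testsuite(component, repo, options, env):
    """The generic QA task as an ordered workflow of client-kit services."""
    log("run_testsuite start")            # enable logging
    set_options(options, env)             # set configuration options & env. variables
    cvs_download(repo)                    # download source from the CVS repo
    configure_build(component)            # configure & build the component
    compile_tests(component)              # compile tests
    results = execute_tests(component)    # execute tests
    upload_results(results)               # upload results to the server
    return results
```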
Navigation Strategy: Nearest Neighbor Search http://www.cs.umd.edu/projects/skoll
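One plausible reading of the nearest neighbor strategy, sketched in Python: when a configuration fails, schedule every valid configuration that differs from it in exactly one option setting, so the failing subspace gets mapped out. The function and variable names are illustrative only:

```python
def neighbors(cfg, qa_space, is_valid):
    """Yield valid configurations that differ from cfg in exactly one option."""
    for option, settings in qa_space.items():
        for alt in settings:
            if alt == cfg[option]:
                continue
            candidate = dict(cfg, **{option: alt})
            if is_valid(candidate):
                yield candidate

def on_result(cfg, outcome, queue, qa_space, is_valid, scheduled):
    """Navigation callback: expand the search around failing configurations."""
    if outcome == "FAIL":
        for nbr in neighbors(cfg, qa_space, is_valid):
            key = tuple(sorted(nbr.items()))
            if key not in scheduled:
                scheduled.add(key)
                queue.append(nbr)
```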
Extension: Adding Temporary Constraints http://www.cs.umd.edu/projects/skoll
Overview • The Skoll DCQA infrastructure & framework • Novel DCQA processes & feasibility studies • Configuration-level fault characterization • Performance-oriented regression testing • Code-level fault modeling • Future work http://www.cs.umd.edu/projects/skoll
Configuration-Level Fault Characterization Goal • Help developers localize configuration-related faults Solution Approach • Strategically sample the QA space to test for subspaces in which (1) compilation fails or (2) regression tests fail • Build models that characterize the configuration options and specific settings that define the failing subspace See: C. Yilmaz, M. Cohen, A. Porter, Covering Arrays for Efficient Fault Characterization in Complex Configuration Spaces, ISSTA’04, TSE v32 (1) http://www.cs.umd.edu/projects/skoll
Sampling Strategy • Goals: Maximize “coverage” of option-setting combinations, while minimizing the number of configurations tested • Approach: compute the test schedule from t-way covering arrays • A set of configurations in which all ordered t-tuples of option settings appear at least once • A 2-way covering array example is sketched below http://www.cs.umd.edu/projects/skoll
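The sketch below builds a 2-way (pairwise) covering array greedily; real covering-array tools use stronger constructions and also honor inter-option constraints, but this shows the coverage idea:

```python
from itertools import combinations, product

def pairwise_covering_array(options):
    """options: dict mapping option name -> list of legal settings."""
    names = list(options)
    # All (option, setting) pairs that still need to be covered.
    uncovered = {((a, x), (b, y))
                 for a, b in combinations(names, 2)
                 for x in options[a] for y in options[b]}
    rows = []
    while uncovered:
        best_row, best_gain = None, -1
        # Greedily pick the candidate configuration covering the most pairs.
        for values in product(*(options[n] for n in names)):
            cfg = dict(zip(names, values))
            gain = sum(1 for (a, x), (b, y) in uncovered
                       if cfg[a] == x and cfg[b] == y)
            if gain > best_gain:
                best_row, best_gain = cfg, gain
        rows.append(best_row)
        uncovered = {((a, x), (b, y)) for (a, x), (b, y) in uncovered
                     if not (best_row[a] == x and best_row[b] == y)}
    return rows

# Example: 3 binary options are pairwise-covered by a handful of rows
# instead of the 8 exhaustive configurations.
print(pairwise_covering_array({"A": [0, 1], "B": [0, 1], "C": [0, 1]}))
```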
Fault Characterization • We used classification trees (Breiman et al.) to model option & setting patterns that predict test failures • Example tree: splits on the CORBA_MESSAGING, AMI, AMI_POLLER & AMI_CALLBACK settings, with leaves labeled OK, ERR-1, ERR-2 & ERR-3 http://www.cs.umd.edu/projects/skoll
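A sketch of the modeling step using scikit-learn's CART-style DecisionTreeClassifier (Breiman et al.); the configurations and outcomes below are made up purely to show the shape of the analysis:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# One row per tested configuration:
# [CORBA_MESSAGING, AMI, AMI_POLLER, AMI_CALLBACK]
X = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 0]]
y = ["OK", "ERR-1", "OK", "OK", "ERR-3"]   # observed test outcomes

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=[
    "CORBA_MESSAGING", "AMI", "AMI_POLLER", "AMI_CALLBACK"]))
```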
Feasibility Study • QA Scenario: Do regression tests pass across the entire QA space? • 37,584 valid configurations • 10 compile-time options with 12 constraints • 96 test options with 120 test constraints • 6 run-time options with no constraints & 2 operating systems • Navigation strategies: • Nearest neighbor search • t-way covering arrays for each 2 ≤ t ≤ 6 • Applied to one release of ATC using clients running on 120 CPUs • Exhaustive process takes 18,800 CPU hours (excl. compilation) http://www.cs.umd.edu/projects/skoll
Some Sample Results • Found 89 possibly option-related failures (40 Linux, 49 Windows) • Models based on covering arrays (even for t=2) were nearly as good as those based on exhaustive testing • Several tests failed that don’t fail with default option settings • Root cause: problems in feature-specific code • Some failed on every single configuration (even though they don’t fail when using default options!) • Root cause: problems in option-processing code • 3 tests failed when ORBCollocation = NO • Background: when ORBCollocation = GLOBAL or PER-ORB, local objects communicate directly; when NO, objects always communicate over the network • Root cause: data marshalling/unmarshalling broken http://www.cs.umd.edu/projects/skoll
Summary • Basic infrastructure in place and working • Initial results were encouraging • Defined large control spaces and performed complex QA processes • Found many test failures corresponding to real bugs, some of which had not been found before • Fault characterization sped up debugging time • Documenting & centralizing inter-option constraints beneficial • Ongoing extensions • Expanding QA space for ATC continuous build, test & integration • Extending model for platform virtualization http://www.cs.umd.edu/projects/skoll
Performance-Oriented Regression Testing Goal • Quickly determine whether a software update degrades performance Solution Approach • Proactively find a small set of configurations on which to estimate performance across entire configuration space • Later, whenever software changes, benchmark this set and compare post-update & pre-update performance See: C. Yilmaz, A. Krishna, A. Memon, A. Porter, D. C. Schmidt, A. Gokhale, and B. Natarajan, Main Effects Screening: A Distributed Continuous Quality Assurance Process for Monitoring Performance Degradation in Evolving Software Systems, ICSE’05 http://www.cs.umd.edu/projects/skoll
Reliable Effects Screening (RES) Process • Proactive • Compute a formal experimental plan (we use screening designs) for identifying “important” options • Execute the experimental plan across the QA grid • Analyze the data to identify important options • Recalibrate frequently as the system changes • Reactive • When the software changes, estimate the distribution of performance across the entire configuration space by evaluating all combinations of the important options (called the screening suite) while randomizing the rest • Distributional changes signal possible performance degradations http://www.cs.umd.edu/projects/skoll
Feasibility Study • Several recent changes to the message queuing strategy • Has this degraded performance? • Developers identified 14 potentially important binary options (2^14 = 16,384 configurations) http://www.cs.umd.edu/projects/skoll
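A minimal sketch of the "identify important options" step: estimate each binary option's main effect on latency from the benchmarked runs and keep the options whose absolute effect is large. A real screening design chooses which runs to execute so that these estimates are reliable; the data layout and threshold below are illustrative:

```python
from statistics import mean

def main_effects(runs):
    """runs: list of (config: dict of 0/1 settings, latency: float)."""
    options = runs[0][0].keys()
    effects = {}
    for opt in options:
        high = [lat for cfg, lat in runs if cfg[opt] == 1]
        low  = [lat for cfg, lat in runs if cfg[opt] == 0]
        effects[opt] = mean(high) - mean(low)   # simple main-effect estimate
    return effects

def important_options(runs, threshold):
    """Options whose absolute main effect exceeds the chosen threshold."""
    return [opt for opt, eff in main_effects(runs).items()
            if abs(eff) >= threshold]
```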
Identify Important Options • Half-normal probability plots of option effects on latency: 128-run screening design vs. full data set • Options B and J: clearly important; options I, C, and F: arguably important • Screening designs mimic exhaustive testing at a fraction of the cost http://www.cs.umd.edu/projects/skoll
Estimate Performance with Screening Suite • Q-Q plot of latency: screening suite vs. full data set • Estimates closely track actuals http://www.cs.umd.edu/projects/skoll
Applying RES to an Evolving System • Tracked the latency distribution over time by estimating performance once a day on CVS snapshots of ATC • Observations • Developers noticed the deviation on 12/14/03, but not before • Developers estimated a drop of 5% on 12/14/03 • Our process more accurately shows a drop of around 50% http://www.cs.umd.edu/projects/skoll
Summary • Reliable Effects Screening is an efficient, effective, and repeatable approach for estimating the performance impact of software updates • Was effective in detecting actual performance degradations • Cut benchmarking work 1000x (from 2 days to 5 mins.) • Fast enough to be part of check-in process • Ongoing extensions • New experimental designs: Pooled ANOVA • Applications to component-based systems (multilevel interactions) http://www.cs.umd.edu/projects/skoll
Code-Level Fault Modeling Goal • Automatically gain code-level insight into why/when specific faults occur in fielded program runs Solution Approach • Lightly instrument, deploy & execute program instances • Capture outcome-labeled exec. profiles (e.g., pass/fail) • Build models that explain outcome in terms of profile data See: Murali Haran, Alan Karr, Michael Last, Alessandro Orso, Adam Porter, Ashish Sanil & Sandro Fouche, Techniques for Classifying Executions of Deployed Software to Support Software Engineering Tasks. UMd Technical Report #cs-tr-4826 http://www.cs.umd.edu/projects/skoll
Lightweight Instr. via Adaptive Sampling • Want to minimize overhead (e.g., runtime slowdown, code bloat, data volume & analysis costs) • Use an incremental data-collection approach • Individual instances sparsely sample the measurement space • Sampling weights are initially uniform • Weights are incrementally adjusted to favor high-utility measurements • Over time, high-utility measurements are sampled frequently while low-utility ones aren’t • Typically requires an increase in the number of instances observed http://www.cs.umd.edu/projects/skoll
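A minimal sketch of the adaptive-sampling idea: each deployed instance draws a small, weighted sample of probes, and the weights are nudged toward measurements that prove useful. The update rule and constants are illustrative, not the published algorithm:

```python
import random

def choose_probes(weights, budget):
    """Sparsely sample `budget` probes (with replacement), biased by current weights."""
    probes = list(weights)
    return random.choices(probes, weights=[weights[p] for p in probes], k=budget)

def update_weights(weights, utilities, rate=0.1):
    """Shift weight toward high-utility probes; keep a floor so none die out."""
    for probe, utility in utilities.items():
        weights[probe] = max(0.01, (1 - rate) * weights[probe] + rate * utility)
```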
Association Trees • Resulting data set is very sparse (i.e., 90-99% missing) • Problematic for many learning algorithms • Developed a new learning algorithm called association trees • For each measurement m, split its data range into 2 parts • e.g., low (m < k) or high (m ≥ k), where k is the best split point • Convert the data to items that are either present or missing • Use the Apriori algorithm to find item combinations whose presence is highly correlated with each outcome • Uses of association trees • Analyze rules for clues to fault location • Apply models to new instances, e.g., to determine whether an in-house test case likely reproduces an in-the-field failure http://www.cs.umd.edu/projects/skoll
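A sketch of the preprocessing and mining idea: discretize each sampled measurement into a low/high item, then search for small item combinations whose presence is strongly associated with failing runs. A real implementation would use the Apriori algorithm; the brute-force pass over pairs below, and its thresholds, are only illustrative:

```python
from itertools import combinations

def discretize(profile, split_points):
    """Turn a sparse profile of counts into present items like 'methodA=high'."""
    items = set()
    for m, value in profile.items():          # unsampled measurements simply stay absent
        k = split_points[m]
        items.add(f"{m}={'high' if value >= k else 'low'}")
    return items

def failing_patterns(runs, min_support=5, min_confidence=0.9):
    """runs: list of (item_set, outcome) with outcome 'pass' or 'fail'."""
    all_items = set().union(*(items for items, _ in runs))
    patterns = []
    for combo in combinations(sorted(all_items), 2):
        matching = [out for items, out in runs if set(combo) <= items]
        fails = matching.count("fail")
        if fails >= min_support and fails / max(len(matching), 1) >= min_confidence:
            patterns.append(combo)            # item pair highly correlated with failure
    return patterns
```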
Feasibility Study • Subject program: Java Byte Code Analyzer (JABA) • 60k statements, 400 classes, 3000 methods • 707 test cases • 14 versions • Measured method entry and block execution counts • Ran clients on 120 CPUs • Standard classification techniques using exhaustive measurement yielded near-perfect models (with some outliers) http://www.cs.umd.edu/projects/skoll
Some Sample Results • Competitive with standard methods using exhaustive data • Measured 5-8% of data/instance; ~4x as many instances total http://www.cs.umd.edu/projects/skoll
Summary • Preliminary results: created highly effective models at greatly reduced cost • Scales well to finer-grained data & larger systems • Ongoing extensions • Tradeoffs in # of measurements vs. instances observed • New utility functions • Cost-benefit models • Fixed-cost sampling • New applications • Lightweight test oracles • Failure report clustering http://www.cs.umd.edu/projects/skoll
Overview • The Skoll DCQA infrastructure & framework • Novel DCQA processes • Future work http://www.cs.umd.edu/projects/skoll
Future Work • Looking for more example systems (help!) • New problem classes • Performance and robustness optimization • Configuration advice • Integrate with model-based testing systems • GUITAR -- http://www.cs.umd.edu/~atif/newsite/guitar.htm • Distributed QA tasks • Improved statistical treatment of noisy data • Modeling unobservables and/or detecting outliers http://www.cs.umd.edu/projects/skoll