Incorporating Historical Test Case Performance Data and Resource Constraints into Test Case Prioritization. Mohammad Abdollahi Azgomi Department of Computer Engineering Iran University of Science and Technology Test and Proof (TAP’09) ETH Zurich, Switzerland, July 2-3, 2009.



Presentation Transcript


  1. Incorporating Historical Test Case Performance Data and Resource Constraints into Test Case Prioritization Mohammad Abdollahi Azgomi Department of Computer Engineering Iran University of Science and Technology Test and Proof (TAP’09) ETH Zurich, Switzerland, July 2-3, 2009

  2. Outline • Context and Motivation • Software Regression Testing, Its Problems and Techniques • Test Case Prioritization and Its Types • History-Based Test Case Prioritization • Kim and Porter’s History-Based Prioritization Approach and Its Flaws • Proposed Equation for History-Based Prioritization • Evaluation • Conclusion and Future Work

  3. Motivation: Why HB Prioritization? • Local information is not sufficient for long-term effective regression testing: regression testing is not a one-time activity but a continuous process. • Prevent losing important information. • Local program change information → a change is tested once, with only one chance to reveal a fault • Prevent misinterpreting results. • Ignoring the effect of regression test frequency on its effectiveness severely limits the applicability of results • Hold on to improvement opportunities. • Analyzing historical data reveals test dependencies → further test suite reduction • Adapt to real test environment conditions. • Time and resource constraints in test environments

  4. Software Maintenance Problem • Software is constantly modified • Bug fixes • Addition of functionality • After changes, regression testing – rerun the test cases in the test suite and add new ones • Provides confidence that modifications are correct • Helps find (unintended) new faults • Large number of test cases – and it continues to grow • Weeks/months to run the entire test suite • Many test cases are broken, obsolete or redundant • Costs are high – about half the cost of maintenance

  5. Terminology - I • Regression fault: a fault revealed by a test case that has previously passed but no longer passes. • Test case: a test-related item which contains the following information: 1. a set of test inputs, 2. execution conditions, 3. expected outputs. • Test suite: a group of related tests that are associated with a database, and are usually run together. • Test requirement (TR): specific elements of software artifacts that must be satisfied or covered (a test goal).

  6. Terminology - II • Coverage: a test requirement tr in TR is covered if and only if at least one test t in test set T satisfies tr. • Test adequacy: given a set of test requirements TR for coverage criterion C, test adequacy is achieved (test set T satisfies coverage criterion C) if and only if for every test requirement tr in TR there is at least one test t in T such that t satisfies tr.
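The two definitions above can be made concrete with a small sketch (the names and data are illustrative, not from the talk):

```python
# Test adequacy as a covering check: test set T satisfies criterion C
# iff every test requirement in TR is satisfied by at least one test.
# 'satisfies' maps each test to the set of requirements it covers.

def is_adequate(test_requirements, satisfies):
    covered = set()
    for reqs in satisfies.values():
        covered |= reqs
    return test_requirements <= covered

# Hypothetical example: three requirements, two tests.
TR = {"tr1", "tr2", "tr3"}
satisfies = {"t1": {"tr1", "tr2"}, "t2": {"tr3"}}
print(is_adequate(TR, satisfies))  # True: every tr is covered
```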

  7. Outline • Context and Motivation • Software Regression Testing, Its Problems and Techniques • Test Case Prioritization and Its Types • History-Based Test Case Prioritization • Kim and Porter’s History-Based Prioritization Approach and Its Flaws • Proposed Equation for History-Based Prioritization • Evaluation • Conclusion and Future Work

  8. Defining Regression Testing • Problem (Rothermel et al.): • Given program P, its modified version P’, and a test set T that was previously used to test P, • find a way to utilize T to gain sufficient confidence in the correctness of P’. • Regression testing is a software maintenance task performed on a modified program to instill confidence that: • the changes are correct • the changes have not adversely affected unchanged portions of the program (it does not “regress”) • Rerunning test cases that a program has previously executed correctly, in order to detect errors spawned by changes or corrections made during software development and maintenance

  9. The Place of Regression Testing

  10. Regression Testing’s Problem • Old test cases are rarely put aside. • As software evolves, regression testing and its costs grow. • Re-running all test cases is costly and often infeasible due to time and resource constraints. • An industrial report: 7 weeks to rerun all test cases for a product of about 20,000 LOC! • Arbitrarily (randomly) putting aside some test cases severely threatens the software product’s validity. • Regression testing’s challenge: how to select an appropriate subset of the existing test suite each time regression testing occurs, in order to meet test goals at the fastest rate possible?

  11. Regression Testing’s Main Techniques • Retest all • Simply re-execute all tests – costly, often infeasible • Regression test selection • Selecting an appropriate subset of the existing test suite, based on information about the program, its modified version, and the test suite → safety/cost trade-off • Test suite reduction (minimization) • Reducing a test suite to a minimal subset that maintains coverage equivalent to the original test suite with respect to a particular test adequacy criterion → size reduction vs. fault-loss trade-off • Test case prioritization • Ordering test cases so that those with the highest priority, according to some criterion, are executed earlier → suits time and resource constraints

  12. Outline • Context and Motivation • Software Regression Testing, Its Problems and Techniques • Test Case Prioritization and Its Types • History-Based Test Case Prioritization • Kim and Porter’s History-Based Prioritization Approach and Its Flaws • Proposed Equation for History-Based Prioritization • Evaluation • Conclusion and Future Work

  13. Test Case Prioritization Problem • Given: T, a test suite; PT, the set of permutations of T; and f, a function from PT to the real numbers. Problem: Find T’ ∈ PT such that (∀T”)(T” ∈ PT)(T” ≠ T’) [f(T’) ≥ f(T”)]. PT is the set of all possible prioritizations (orderings) of T, and f is a function that assigns an award value to any such ordering. • Related to the 0/1 knapsack problem → NP-hard, intractable, with no efficient exact solution → all existing prioritization techniques are heuristics.
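The formal definition above can be illustrated by a deliberately naive brute-force search; the award function here is a made-up example (decaying weights over historical fail rates), not one prescribed by the talk:

```python
from itertools import permutations

# Brute-force illustration of the definition: enumerate every ordering
# T'' in PT and keep the one maximizing the award function f. With n
# test cases this examines n! orderings, which is why all practical
# prioritization techniques are heuristics.

def best_ordering(T, f):
    return max(permutations(T), key=f)

# Hypothetical award function: tests with higher historical fail rates
# placed earlier score higher (weights decay with position).
fail_rate = {"t1": 0.1, "t2": 0.9, "t3": 0.5}
f = lambda order: sum(fail_rate[t] / (i + 1) for i, t in enumerate(order))

print(best_ordering(["t1", "t2", "t3"], f))  # ('t2', 't3', 't1')
```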

  14. Test Case Prioritization Goals • Increase the rate of fault detection. • Increase the rate of detection of high-risk faults. • Increase the likelihood of revealing regression errors related to specific code changes earlier. • Increase coverage of coverable code in the system under test at a faster rate. • Increase confidence in the reliability of the system under test at a faster rate.

  15. Outline • Context and Motivation • Software Regression Testing, Its Problems and Techniques • Test Case Prioritization and Its Types • History-Based Test Case Prioritization • Kim and Porter’s History-Based Prioritization Approach and Its Flaws • Proposed Equation for History-Based Prioritization • Evaluation • Conclusion and Future Work

  16. Prioritization Techniques’ Categories • Code-Based Prioritization (CBP): mostly based on code changes and tests’ execution profiles; aims to cover the changed parts of the code at the fastest rate. • Model-Based Prioritization (MBP): software or system specifications → a model or formal description; prioritizes based on information collected during model execution. • History-Based Prioritization (HBP): models regression testing as an ordered sequence of testing sessions; uses tests’ historical execution data to estimate the current order.

  17. Non-HB Prioritization Techniques’ Drawbacks • They consider regression testing a one-time activity (memoryless) rather than a continuous, long-lived process performed each time the code changes during maintenance. • They do not take real-world time and resource constraints into consideration. • They ignore the fact that regression testing is an ordered sequence of testing sessions, each of whose performance may depend upon prior testing sessions, and each of which is subject to time and resource constraints.

  18. Outline • Context and Motivation • Software Regression Testing, Its Problems and Techniques • Test Case Prioritization and Its Types • History-Based Test Case Prioritization • Kim and Porter’s History-Based Prioritization Approach and Its Flaws • Proposed Equation for History-Based Prioritization • Evaluation • Conclusion and Future Work

  19. Kim and Porter’s HB Prioritization Approach Each time regression testing occurs: Step 1: Select a subset T’ from the original test set T. Step 2: Calculate a selection probability Ptc,t(Htc, α) for each test case tc ∈ T’ at time t, where Htc = {h1, h2, …, ht} is a set of t time-ordered observations drawn from previous executions of tc. Ptc,t(Htc, α) for each tc is computed as follows: P0 = h1; Pk = α·hk + (1 − α)·Pk−1, for k ≥ 1, 0 ≤ α < 1. Step 3: Draw a test case from T’ using the probabilities assigned in Step 2, and run it. Step 4: Repeat Step 3 until testing time is exhausted. (The calculated values are probabilities.)
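A minimal sketch of the smoothing recurrence above, assuming 0/1 history values (e.g. "failed in session i"); recently observed values dominate:

```python
# Kim and Porter smoothing recurrence as described on this slide:
#   P0 = h1,  Pk = alpha*h_k + (1 - alpha)*P_{k-1},  0 <= alpha < 1.

def selection_probability(history, alpha):
    """history: time-ordered 0/1 observations h1..ht; returns P_t."""
    p = history[0]                      # P0 = h1
    for h in history[1:]:
        p = alpha * h + (1 - alpha) * p
    return p

# A test that failed recently gets a higher value than one whose
# failure lies far in the past.
print(selection_probability([1, 0, 0, 0], 0.5))  # 0.125
print(selection_probability([0, 0, 0, 1], 0.5))  # 0.5
```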

  20. [Figure: regression test selection picks a subset T’ from test set T; selection probabilities are associated with each tc based on previous execution data; at each step the tc with the highest selection probability is drawn and executed.]

  21. Different Test History Definitions • Execution history: for every testing session i in which test case tc is executed, hi takes the value 0; otherwise it takes the value 1. Net effect: cycle through all test cases over multiple testing sessions. • Demonstrated fault detection effectiveness: for every testing session i in which test case tc passed, hi takes the value 0; otherwise it takes the value 1. Net effect: 1. limits the running of test cases that rarely, if ever, reveal faults; 2. test cases whose failures are related to unstable sections of code continue to be selected until that code stabilizes. • Coverage of program entities: give higher priority to test cases that cover functions infrequently covered in past testing sessions. Net effect: limits the possibility that any particular function goes unexercised for long periods of time.

  22. Flaws of the Kim and Porter HBP Approach • Using hk to determine selection probability (especially with only the two values 0 and 1, and based only on the most recent execution of each test case) is not an appropriate criterion for building an execution history for the test cases. • Increasing a test case’s selection probability based only on: • whether or not the test case was executed in the recent session, or • whether it exposed a fault in its recent execution, will not produce an efficient ordering of test cases in history-based prioritization. • We should not treat the number of regression test sessions in which a test case was executed (ec) and the number of sessions in which it revealed fault(s) (fc) separately; rather, these two factors together, as the single factor fc/ec, show the test case’s historical performance.

  23. Outline • Context and Motivation • Software Regression Testing, Its Problems and Techniques • Test Case Prioritization and Its Types • History-Based Test Case Prioritization • Kim and Porter’s History-Based Prioritization Approach and Its Flaws • Proposed Equation for History-Based Prioritization • Evaluation • Conclusion and Future Work

  24. Solution: Proposed HBP Equation • Determine the priority of each test case tc in each regression test session based on three factors: • historical fault detection effectiveness of tc during past test sessions • execution history (the period in which tc has not been executed) • previous priority of tc • Net effect of considering these factors in tc’s history-based priority: • corroborate a test case’s priority based on its demonstrated effectiveness with respect to fault detection, and • cycle through all test cases during long regression test runs, preventing any test case in the suite from becoming obsolete.

  25. 1. Historical Fault Detection Effectiveness • Executing a test case → its priority is weakened in the next regression test session. • Test cases that are more effective with respect to fault detection should return to the set of executed test cases faster than other ones. • The historical fault detection effectiveness factor is more corroborative for the priority of those test cases which have been more effective with respect to fault detection during past test sessions.

  26. 1. Historical Fault Detection Effectiveness • In the kth execution of the regression test (the software has been modified k times so far): • fck: the number of times the execution of test case tc has failed (revealed fault(s)) • eck: the number of executions of tc so far • The relation between each test case’s priority and its fault detection performance in the kth execution is as follows: fck = Σ_{i=1}^{k} fi, where fi = 1 if tc revealed fault(s) in test session i and fi = 0 otherwise; eck = Σ_{i=1}^{k} ei, where ei = 1 if tc was executed in test session i and ei = 0 otherwise.

  27. 2. Execution History (Period of Non-execution) • In the context of OS process scheduling → the problem of process starvation. • Starvation: during the execution of various processes in a system, some process is not selected for execution for a long time. • Solution: the job scheduler assigns a counter to each process that records the number of times the process has not been selected → used to increase the process’s priority. • Following a similar idea: • consider the period of time during which a test case has not been executed. • The execution history factor is more corroborative for the priority of those test cases which have not been executed for a long time during past test sessions.

  28. [Figure: example test sessions 1-4 showing the existing test cases, the executable (selected) subset in each session, and the fault-revealing test cases.]

  29. 2. Execution History (Period of Non-execution) • Relationship between a test case’s priority and its execution history in the kth execution: hk = 0 if tc was executed in test session k−1; otherwise hk = hk−1 + 1. • Each time a test case is not executed → its execution history is increased by one. • Once the test case is executed → its execution history becomes 0 and the process repeats. • This ensures that no test case remains unexecuted for a long time, so the corresponding faults will be revealed.

  30. 3. Previous Priority of the Test Case • Reasons for using this factor: • It leads to a smoother selection of test cases in successive executions of the regression test → it limits severe changes in the selection of executed test cases from each run to the next. • When test cases’ historical fault detection effectiveness and execution history are the same → another factor is needed to establish a proper priority between them. • Relationship between a test case’s priority and its previous priority in the kth execution: • the previous priority factor is more corroborative for those test cases which had high priority in the recent test session.

  31. Proposed HBP Equation • PR0 = percentage of code coverage of the test case • PRk = α·(fck/eck) + β·hk + γ·PRk−1, for k ≥ 1 • The calculated values are priorities.

  32. Notes About the Proposed HBP Equation • α, β and γ are smoothing constants → they control the effect of the mentioned factors in test case prioritization. • fck/eck is between 0 and 1, and PRk−1 is a small real number often near 1 → we must control the effect of hk so that it does not mask the other factors’ effects by mistake. • β must be smaller than α and γ, and set close to 0. • It is preferable to set α and γ to values between 0.5 and 1. • Difference: Pk in the Kim and Porter HBP approach is the selection probability of each test case in the kth execution; PRk in the proposed HBP equation is the test case’s priority in the kth execution. • Under time and resource constraints → a sufficient number of test cases is executed, beginning from the highest priorities.
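Putting the three factors together, a minimal sketch of the priority computation as reconstructed from these slides; the function and variable names, and the sample constants α = γ = 0.7, β = 0.01, are illustrative, not taken from the paper:

```python
# Sketch of the proposed history-based priority:
#   PR_0 = initial code-coverage percentage of the test case
#   PR_k = alpha*(fc_k/ec_k) + beta*h_k + gamma*PR_{k-1}
# fc_k/ec_k: historical fault-detection rate; h_k: sessions since the
# test was last executed; alpha, gamma in (0.5, 1), beta near 0.

def priority(coverage, sessions, alpha=0.7, beta=0.01, gamma=0.7):
    """sessions: per past session, a pair (executed, failed) of booleans."""
    pr, fc, ec, h = coverage, 0, 0, 0
    for executed, failed in sessions:
        if executed:
            ec += 1
            fc += int(failed)
            h = 0          # execution resets the non-execution counter
        else:
            h += 1         # starvation counter grows while skipped
        rate = fc / ec if ec else 0.0
        pr = alpha * rate + beta * h + gamma * pr
    return pr

# A test that fails whenever it runs outranks one that always passes.
print(priority(0.8, [(True, True), (True, True)]) >
      priority(0.8, [(True, False), (True, False)]))  # True
```

Note how the small β term nudges skipped test cases upward over time (the anti-starvation effect), while γ carries the previous priority forward.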

  33. Process of the Proposed HBP Approach

  34. Outline • Context and Motivation • Software Regression Testing, Its Problems and Techniques • Test Case Prioritization and Its Types • History-Based Test Case Prioritization • Kim and Porter’s History-Based Prioritization Approach and Its Flaws • Proposed Equation for History-Based Prioritization • Evaluation • Conclusion and Future Work

  35. Evaluation of the Proposed HBP Approach • Benchmark programs: • 7 programs of the Siemens suite and the Space program. • Coverage types and tools: • branch (decision) coverage • all-uses coverage: ATAC tool • Test suites: • 1000 branch-coverage-adequate test suites (testplans-bigcov) • Faulty versions: • 29 Siemens and Space multi-fault versions, for 29 regression test executions. • Evaluation metric: APFD (with respect to fault detection rate). • Comparison: against random ordering and the Kim and Porter HBP approach.

  36. Benchmark Programs: Siemens and Space

Program Name | Description | LOC | No. of test cases | Average test suite size | Multi-fault versions
print-tokens | Lexical analyzer | 402 | 4130 | 16 | 7
print-tokens2 | Lexical analyzer | 438 | 4115 | 12 | 10
replace | Pattern recognition | 516 | 5542 | 19 | 32
schedule | Priority scheduler | 299 | 2650 | 8 | 9
schedule2 | Priority scheduler | 297 | 2710 | 8 | 10
tcas | Collision avoidance | 138 | 1608 | 6 | 41
tot-info | Statistics computing | 346 | 1052 | 7 | 23
space | ADL language interpreter | 6218 | 13585 | 155 | 35

  37. Types of Test Suite Coverage in Experiments • Branch (decision) coverage: either the true or the false branch of a decision • if, if-else, while, do-while, for, switch, and entry of functions with no decisions • All-uses coverage: all uses of a definition → ATAC tool

Original code (a piece of the replace program):

    if (*j >= maxset)
        result = false;
    else {
        outset[*j] = c;
        *j = *j + 1;
        result = true;
    }

Branch-instrumented version:

    if (*j >= maxset) {
        fprintf(fp, "bT1,");
        result = false;
    } else {
        fprintf(fp, "bF1,");
        outset[*j] = c;
        *j = *j + 1;
        result = true;
    }

Branch instrumentation of a piece of replace program code

  38. Types of Test Suite Coverage in Experiments

  39. Faulty Versions of Programs • Siemens programs and Space: • Siemens: hand-seeded faults • Space: a real, large program (about 11 KLOC) with real faults • Creating multi-fault versions: • single-fault versions were created for each program • sets of multi-fault versions were composed of non-interfering single faults • For each program, 29 multi-fault versions were chosen for 29 runs of the regression test

[Figure: in each regression test session, the 1000 test suites are prioritized both by the proposed approach and by random ordering, and the resulting 1000 prioritized suites are compared pairwise.]

  40. APFD: Prioritization Techniques Comparison • Comparison of prioritization techniques with respect to fault detection: the APFD metric • APFD: Average Percentage of Faults Detected over the test suite’s lifetime:

APFD = 1 − (TF1 + TF2 + … + TFm) / (n·m) + 1/(2n)

n: number of test cases; m: number of faults; TFi: position of the first test case in the ordered test suite that reveals fault i
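A sketch of the APFD computation; the 5-test, 2-fault example below is made up for illustration:

```python
# APFD (Average Percentage of Faults Detected) as defined above:
#   APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2 * n)
# n: number of test cases, m: number of faults,
# tf[i]: 1-based position of the first test revealing fault i.

def apfd(n, m, tf):
    return 1 - sum(tf) / (n * m) + 1 / (2 * n)

# Made-up example: 5 tests, 2 faults; fault 1 is first revealed by the
# test in position 1, fault 2 by the test in position 3.
print(apfd(5, 2, [1, 3]))  # 0.7
```

Orderings that reveal faults earlier (smaller TFi) yield APFD values closer to 1, which is why a higher boxplot on the following slides means faster fault detection.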

  41. Displaying Experiment Results: Boxplot Diagrams • Boxplot diagrams help to (1) statistically analyze results, (2) observe any differences between experiments, and (3) visualize the empirical results in test case prioritization studies. • Boxplot diagrams were created with SAS 9.1.3. • How to interpret a boxplot diagram? • Lower quartile of the data → Q1 • Median of the data → Q2 • Upper quartile of the data → Q3 • The higher the boxplot sits → the faster the prioritization technique reveals faults. • The tighter the box → the more stable the prioritization technique’s behavior with respect to fault detection.

  42. Exp 1: Proposed HBP Approach vs. Random Ordering • Experiments: compare the proposed approach versus random ordering with respect to faster fault detection (APFD values for prioritizing 1000 test suites) • For each Siemens program and the Space program • Percentage of branch coverage as the initial ordering of test cases • For each program, the regression test was repeated 29 times (on 29 multi-fault versions) • Better comparison: • a diagram covering all regression test executions (all 29 runs) is plotted. • Each pair of same-color boxplots compares the proposed HBP approach versus random-ordering test case prioritization.

  43. Exp 1: Proposed HBP Approach vs. Random Ordering, All-Versions Diagram • Initial prioritization: percentage of branch coverage of test cases (control-flow criterion) • Time and resource constraints: only 70% of the prioritized suite executed • For each program → each pair of same-color boxes: proposed HBP approach (left box) vs. random ordering approach (right box)

  44. Exp 2: Proposed HBP Approach vs. Random Ordering (different initial ordering criterion), All-Versions Diagram • Initial prioritization: percentage of all-uses coverage of test cases (data-flow criterion) • Time and resource constraints: only 70% of the prioritized suite executed • For each program → each pair of same-color boxes: proposed HBP approach (left box) vs. random ordering approach (right box)

  45. Exp 3: Proposed HBP Approach vs. Kim and Porter HBP Approach, All-Versions Diagram • Initial prioritization: percentage of branch coverage of test cases (control-flow criterion) • Time and resource constraints: only 30% of the prioritized suite executed • For each program → each pair of same-color boxes: proposed HBP approach (left box) vs. Kim and Porter HBP approach (right box)

  46. Outline • Context and Motivation • Software Regression Testing, Its Problems and Techniques • Test Case Prioritization and Its Types • History-Based Test Case Prioritization • Kim and Porter’s History-Based Prioritization Approach and Its Flaws • Proposed Equation for History-Based Prioritization • Evaluation • Conclusion and Future Work

  47. Proposed HBP Approach Specifications • In the proposed HBP approach, three factors determine each test case’s execution priority in each regression test session: • historical demonstrated fault detection performance during the regression test lifetime • the test case’s priority in the previous regression test session • the duration for which the test case has not been executed • The proposed HBP approach uses these three factors directly in prioritization. • In each test session, priority values are computed for all test cases. • According to resource and time constraints, test cases are executed beginning from the highest priorities as far as possible (also serving as a regression test selection technique).

  48. Conclusions • Incorporating historical fault detection effectiveness, a test case’s previous priority, and the duration for which each test case has not been executed leads to effective prioritization of test cases across continuous regression test runs. • The proposed HBP approach consistently detects faults significantly faster than both random ordering and the Kim and Porter approach. • The proposed approach’s fault detection results are independent of the initial code-coverage prioritization criterion. • Under more severe resource constraints, the performance gap between the proposed HBP approach and the Kim and Porter HBP approach widens.

  49. Future Work • Considering fault severity and differing test case costs (e.g. execution time) in the proposed HBP approach. • More empirical studies on programs with real faults, to study the performance of this approach in real environments more precisely. • Further studies using available Java benchmarks, to investigate the proposed technique for object-oriented programs. • Investigating the proposed HBP approach on successive faulty versions of real software during regression testing (probably yielding more interesting results). • Determining the α, β and γ coefficients more precisely based on obtained historical data.

  50. Main References of the Study
[1] J. M. Kim and A. Porter, "A History-Based Test Prioritization Technique for Regression Testing in Resource Constrained Environment," in 24th International Conference on Software Engineering, 2002, pp. 119-129.
[2] H. Park, H. Ryu, and J. Baik, "Historical Value-Based Approach for Cost-Cognizant Test Case Prioritization to Improve the Effectiveness of Regression Testing," in 2nd International Conference on Secure System Integration and Reliability Improvement, Yokohama, Japan, 2008, pp. 39-46.
[3] J. M. Kim, A. Porter, and G. Rothermel, "An empirical study of regression test application frequency," in 22nd International Conference on Software Engineering, 2000, pp. 126-135. Also in Software Testing, Verification and Reliability, vol. 15, no. 4, pp. 257-279, 2005.
[4] S. Elbaum, A. G. Malishevsky, and G. Rothermel, "Test Case Prioritization: A Family of Empirical Studies," IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 159-182, 2002.
[5] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, "Prioritizing Test Cases for Regression Testing," IEEE Transactions on Software Engineering, pp. 102-112, 2001.
[6] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, "Test case prioritization: an empirical study," in IEEE International Conference on Software Maintenance, Oxford, England, 1999.
[7] I. Burnstein, Practical Software Testing: A Process-Oriented Approach. New York: Springer-Verlag, 2003.
[8] S. Elbaum, A. Malishevsky, and G. Rothermel, "Incorporating varying test costs and fault severities into test case prioritization," in 23rd International Conference on Software Engineering, Toronto, Canada, 2001, pp. 329-338.
[9] G. Rothermel and M. J. Harrold, "A Safe, Efficient Regression Test Selection Technique," ACM Transactions on Software Engineering and Methodology, vol. 6, no. 2, pp. 173-210, 1997.
[10] W. E. Wong, J. R. Horgan, A. P. Mathur, and A. Pasquini, "Test set size minimization and fault detection effectiveness: A case study in a space application," Journal of Systems and Software, vol. 48, pp. 79-89, 1999.
