
TestRank: Eliminating Waste from Test-Driven Development


Presentation Transcript


  1. TestRank: Eliminating Waste from Test-Driven Development. Presented by Hagai Cibulski, Tel Aviv University, Advanced Software Tools Research Seminar, 2010

  2. Elevator Pitch • TDD bottleneck: repeated runs of an ever-growing test suite • Less productivity → casualness in following TDD → loss of quality • TestRank – finds appropriate tests to run after given code edits • Run a fraction of the tests in each cycle: eliminate waste + high bug detection rate

  3. Agenda • In this talk we will: • learn Test-Driven Development (TDD) in two minutes • observe insights into the nature of TDD tests • identify the TDD bottleneck • define the Regression Test Selection (RTS) problem for the TDD context • review past work on RTS • see alternative Program Analysis techniques: • Dynamic PA • Natural Language PA • present TestRank – an RTS tool for TDD

  4. Test-Driven Development • Agile software development methodology • Short development iterations • Pre-written test cases define functionality • Each iteration: code to pass that iteration's tests

  5. Test-Driven Development Cycle • Repeat: • Add a test • Run the tests and see the new one fail • Write some code • Run the tests and see them succeed • Refactor the code • Run the tests and see them succeed
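A minimal illustration of one such cycle, assuming JUnit 4 and a hypothetical PriceCalculator class (not from the original slides): first the test is added and fails, then just enough code is written to make it pass, and the code is refactored while the tests stay green.

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    // Step 1: add a test that defines the desired functionality; it fails at first.
    public class PriceCalculatorTest {
        @Test
        public void appliesTenPercentDiscountFromOneHundredUp() {
            assertEquals(90.0, new PriceCalculator().discountedPrice(100.0), 0.001);
        }
    }

    // Step 2: write just enough production code to make the test pass, then refactor.
    class PriceCalculator {
        double discountedPrice(double price) {
            return price >= 100.0 ? price * 0.9 : price;
        }
    }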

  6. TDD Tests - Observations • TDD tests define functionality • TDD code is highly factored • Therefore: • A single test may cross multiple units of code • A single code unit implements functionalities defined in multiple tests

  7. Test Suite - Observations • Tests are added over time • 5 developers x 1 test a day x 240 days = 1200 tests • 1200 tests x 200 ms = 4 minutes • Integrated into nightly/integration builds • Committed changes are covered nightly/continuously • Integrated into the team developers' IDE • Programmers can run isolated tests quickly

  8. The Motivation: Early detection of software bugs • A developer edits a block of code • Using strict unit tests as a safety net • Finding the unit test to run is straightforward (1-1 or 1-n) • Using TDD functional tests as a safety net • Finding the tests to run? (n-n) • Where is the code-test correlation? • Must run the entire test suite? • Might not be cost effective • Delay running the entire test suite? • Delays the detection of software bugs • Bugs become harder to diagnose the further the symptom is removed from the cause

  9. TestRank Problem Definition • Given: • P – program under test • T – test suite (assuming all tests in T pass on P) • Q – query about a location (method) L • Find: • Ranking t1, t2, …, tn s.t. if a change in L causes test ti to fail, i is minimal • Application: • Select the top α fraction (e.g. 20%) of the ranking • Goal: achieve 1−ε bug detection, s.t. ε is minimal
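As a sketch of the intended application (the class and method names here are hypothetical), selecting the top α fraction of an already-ranked test list could look like this:

    import java.util.List;

    // Minimal sketch: given tests already ranked for the edited method,
    // run only the top alpha fraction (e.g. alpha = 0.2 for the top 20%).
    class TopFractionSelector {
        static List<String> select(List<String> rankedTests, double alpha) {
            if (rankedTests.isEmpty()) {
                return rankedTests;
            }
            int k = Math.max(1, (int) Math.ceil(alpha * rankedTests.size()));
            return rankedTests.subList(0, k);
        }
    }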

  10. TestRank – Application • Rank the tests such that running the top-ranked 20% of tests will reveal a failing test with 80% probability • 20% is our promise to the developer • 80% is justified assuming all tests will eventually be run • Usually a new bug's first chance of being detected is when all the tests are run (typically on the nightly build) • Don't waste time reconciling which (or whose) coding changes are responsible for new bugs • The bugs never get checked into the master source

  11. Related Work • Past Work on Test Suite Optimization • Test Selection • Lower total cost by selecting an appropriate subset of the existing test suite based on information about the program, modified version, and test suite • Usually conservative ("safe") analyses • Test Prioritization • Schedule test cases in an order that increases rate of fault detection (e.g. by decreasing coverage delta)

  12. TestTube • TestTube: a system for selective regression testing • Chen, Rosenblum and Vo, 1994 • "Safe" RTS: identify all global entities that test t covers • Assumes a deterministic system • Coarse level of granularity – C functions • Instrumentation: t → {fi} • closure(f) = global vars, types, and macros used by f • A test case t in T is selected for retesting P' if: diff(P, P') ∩ closure(trace(t)) ≠ ∅ • Reduction of 50%+ in the number of test cases • Only in the case of "feature functions" (diagram contrasting core functions vs. feature functions) • Nondeterministic version – "transitive closure" technique → 0% reduction for "core functions"
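A minimal sketch of this selection rule (types and names assumed, not TestTube's actual code): a test is re-run when the entities it transitively depends on intersect the diff between the two program versions.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the TestTube-style predicate: select test t for retesting P'
    // iff diff(P, P') ∩ closure(trace(t)) is non-empty.
    class SelectionRule {
        static boolean shouldRetest(Set<String> changedEntities, Set<String> closureOfTrace) {
            Set<String> overlap = new HashSet<>(closureOfTrace);
            overlap.retainAll(changedEntities);   // intersection of the two entity sets
            return !overlap.isEmpty();
        }
    }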

  13. DejaVu • A safe, efficient regression test selection technique • Rothermel and Harrold, 1997 • Conservative, with granularity at the statement level – improving precision • Control-flow based • CFG for each procedure • Instrumentation: t → {e} (edges covered) • Simultaneous DFS on G, G' for each procedure and its modified version in P, P' • A test case t in T is selected for retesting P' if its execution trace contains a "dangerous" edge • A lot of work goes into calculating diff(P, P') • Might be too expensive to be used on large systems • Results: two studies found average reductions of 44.4% and 95%

  14. DejaVOO - Two Phase Technique • A comparative study [Bible, Rothermel & Rosenblum, 2001] found that TestTube/DejaVu exhibit a trade-off of efficiency versus precision • Roughly, analysis time + test-execution time ≈ constant • Scaling Regression Testing to Large Software Systems • Orso, Shi, and Harrold (DejaVu), 2004 • JBoss ≈ 1 MLOC • Efficient approach: selected too many tests • Precise approach: analysis took too much time • In each case: analysis + execution > naïve retest-all • Implementing a technique for Java programs that is "safe", precise, and yet scales to large systems • Phase #1: fast, high-level analysis to identify the parts of the system that may be affected by the changes • Phase #2: low-level analysis of these parts to perform precise test selection

  15. DejaVOO Results • Considerable increase in efficiency • Same precision

  16. Standard RTS vs. RTS for TDD

  17. Commercial/Free Tools • Google Testar: selective testing tool for Java • Works with JUnit • Records coverage by instrumenting bytecode • Clover's "Test Optimization" • A coverage tool with a new test-optimization feature • Speeds up CI builds • Leverages "per-test" coverage data for selective testing • JUnitMax by Kent Beck • A continuous test runner for Eclipse • Supports test prioritization to encourage fast failures • Run short tests first • Run recently failed (and newly written) tests first • JTestMe: another selective testing tool for Java • Uses AspectJ, method-level coverage • Infinitest: a continuous test runner for JUnit tests • Whenever you make a change, Infinitest runs tests for you • It selects tests intelligently, and runs the ones you need • Uses static analysis → will not work with dynamic/reflection-based invocations

  18. CodePsychologist • Locating Regression Bugs • Nir, Tyszberowicz and Yehudai, 2007 • The same problem in reverse • Given a checkpoint C that failed and the source code S of the AUT, find the places (changes) in S that cause C to fail • System testing • UI level • Using scripts / manual testing • Checkpoint C is defined at the UI level

  19. CodePsychologist - Code-lines affinity • Example checkpoint: Select "clerk 1" from the clerk tree (clerk number 2). Go to the next clerk. The next clerk is "clerk 3"

  20. CodePsychologist - The affinity problem • Example: the affinity between the word groups {red, flower, white, black, cloud} and {red, flower, white, black, cloud} should be greater than the affinity between {rain, green, red, coat} and {train, table, love}

  21. CodePshychologist - Words affinity • Taxonomy of words, graph where each node represents a synonym set • Wordnet: An electronic lexical database. 1998 • Wordnet-based semantic similarity measurement. Simpson & Dao, 2005

  22. CodePsychologist – Word-groups affinity

  23. TestRank Marketecture (architecture diagram): the program under test P and the test suite T feed dynamic & static analyses, which produce correlation scores and a locator; a query Q (file:line) goes to the query engine, which returns a ranking t1, t2, t3, …

  24. TestRank - Preprocessing Phase • Pre-compute test/unit correlation during a run of the test suite by tracing the tests through the production code • AspectJ • Use coverage as basic soundness filter • Collect dynamic metrics: • Control flow • Data flow • Look for natural language clues in sources text • WordNet
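A minimal sketch of such tracing, assuming JUnit 4 style tests and AspectJ's annotation syntax (the aspect, pointcuts and package pattern below are illustrative, not TestRank's actual implementation): it records, for each running test, which production methods it reaches.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.aspectj.lang.JoinPoint;
    import org.aspectj.lang.annotation.Aspect;
    import org.aspectj.lang.annotation.Before;

    // Sketch: map each production method to the set of tests that reached it.
    @Aspect
    public class CoverageTracer {
        private static String currentTest;
        private static final Map<String, Set<String>> testsPerMethod = new HashMap<>();

        // Remember which test method is currently executing (JUnit 4 @Test methods).
        @Before("execution(@org.junit.Test * *(..))")
        public void enteringTest(JoinPoint jp) {
            currentTest = jp.getSignature().toLongString();
        }

        // Record every production method executed while that test is running
        // (the package pattern is an assumption; adjust to the code under test).
        @Before("execution(* org.apache.log4j..*.*(..)) && !within(CoverageTracer)")
        public void enteringProductionMethod(JoinPoint jp) {
            if (currentTest != null) {
                testsPerMethod
                    .computeIfAbsent(jp.getSignature().toLongString(), k -> new HashSet<>())
                    .add(currentTest);
            }
        }

        public static Map<String, Set<String>> coverage() {
            return testsPerMethod;
        }
    }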

  25. TestRank – Online Phase • Use correlation data during code editing to expose to the developer a list of tests which might conflict with the block of code currently being edited • Sorted in descending order of correlation • Developers can run just the specific functional tests

  26. Dynamic PA • Execution Count Predictor • How many times this method was called during the execution stemming from test t? • Call Count Predictor • How many distinct calls to this method when called during the execution stemming from test t? • Normalize to [0, 1] • score = c / (c+1)
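As a small sketch (the class name is hypothetical), the c / (c + 1) normalization maps any non-negative count into [0, 1): 0 → 0, 1 → 0.5, 9 → 0.9, and so on.

    // Sketch: normalize a raw per-test count (execution count, distinct call
    // count, ...) into a score in [0, 1); larger counts approach 1.
    class CountPredictor {
        static double score(int count) {
            return (double) count / (count + 1);   // score = c / (c + 1)
        }
    }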

  27. More Dynamic PA – The Stack • Two Stack Frames Count Predictor • How many distinct configurations of the calling frame and frame before that on the call stack? • Stack Count Predictor • How many distinct call stack configurations? • Stack Depth Sum Predictor • Sum the inverse depth of call stack at each execution of this method stemming from test t.
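One way to picture the two-stack-frames predictor (a sketch with an assumed stack representation, innermost frame first): count the distinct (caller, caller-of-caller) pairs observed while the method executes under test t.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch: count distinct configurations of the two frames above the
    // observed method on the call stack.
    class TwoFrameCounter {
        private final Set<String> configurations = new HashSet<>();

        void record(List<String> callStack) {   // callStack.get(0) is the observed method itself
            String caller = callStack.size() > 1 ? callStack.get(1) : "<root>";
            String callerOfCaller = callStack.size() > 2 ? callStack.get(2) : "<root>";
            configurations.add(caller + " <- " + callerOfCaller);
        }

        int distinctConfigurations() {
            return configurations.size();
        }
    }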

  28. Dynamic PA – Data Flow • Value Propagation Predictor • Compare values of simple typed arguments (and return value), between those flowing out of the test and those reaching the method under test. • Size of intersection between the two sets of values. • For each test/method pair, find the maximum intersection m. • score = m / (m+1)
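A sketch of that comparison for one test/method pair (the set representation is an assumption): intersect the values observed on the test side with those reaching the method, and normalize the intersection size.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: score = m / (m + 1), where m is the size of the intersection
    // between the values flowing out of the test and those reaching the method.
    class ValuePropagationPredictor {
        static double score(Set<Object> testValues, Set<Object> methodValues) {
            Set<Object> shared = new HashSet<>(testValues);
            shared.retainAll(methodValues);
            int m = shared.size();
            return (double) m / (m + 1);
        }
    }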

  29. Natural Language PA • Adapted CodePsychologist algorithm • Coverage soundness filter (execution count > 0) • For each test/method pair, look for: • similarMethodName() • "Similar literals" • // Similar comments • Similar words extracted from meaningfulIdentifierNames

  30. NL analysis • During tracing, build a SourceElementLocator: fileName → beginLine → ElementInfo{signature, begin, end, WordGroup} • For each source file, extract words and literals and map them by line numbers • Literals are whole identifiers, strings and numbers • Words are extracted from identifiers, e.g. assuming_namingConventions → {assuming, naming, conventions} • For each method: • include the comments before the method • collect the group of words and literals mapped to line numbers between the beginning and the end of the method
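A small sketch of that extraction step (the class name is hypothetical), splitting identifiers on underscores and camelCase boundaries:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: "assuming_namingConventions" -> [assuming, naming, conventions]
    class IdentifierSplitter {
        static List<String> words(String identifier) {
            List<String> result = new ArrayList<>();
            for (String part : identifier.split("_")) {
                for (String word : part.split("(?<=[a-z0-9])(?=[A-Z])")) {
                    if (!word.isEmpty()) {
                        result.add(word.toLowerCase());
                    }
                }
            }
            return result;
        }
    }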

  31. NLPA – Word Group Affinity • for each test/method pair (t, m), locate the two code elements and get two word groups wg(t), wg(m) • calculate GrpAff(wg(t), wg(m)) using adapted CodePsychologist algorithm: • separate words from literals, compute GrpAff for each type separately and take the average affinity. • filter out 15% most common words in the text.

  32. TF-IDF: Term Frequency × Inverse Document Frequency • Balances the relative frequency of a word in a particular method with its overall frequency • w occurs n_{w,p} times in a method p, and there are a total of N_p terms in the method • w occurs in d_w methods, and there are a total of D methods in the traces • tfidf(w, p) = tf(w, p) × idf(w) = (n_{w,p} / N_p) × log(D / d_w)
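A direct sketch of that formula (parameter names are descriptive stand-ins for the counts gathered during tracing):

    // Sketch: tfidf(w, p) = (n_{w,p} / N_p) * log(D / d_w)
    class TfIdf {
        static double weight(int occurrencesInMethod,   // n_{w,p}
                             int termsInMethod,         // N_p
                             int methodsContainingWord, // d_w
                             int totalMethods) {        // D
            double tf = (double) occurrencesInMethod / termsInMethod;
            double idf = Math.log((double) totalMethods / methodsContainingWord);
            return tf * idf;
        }
    }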

  33. NLPA – Weighted Group Affinity • AsyGrpAff'(A, B) = (1/n) · Σ_{1 ≤ i ≤ n} [ max{ WrdAff(a_i, b_j) | 1 ≤ j ≤ m } · tfidf2(a_i, A) · factor(a_i) ] (*) • (*) Words appearing in the method name are given a ×10 weight factor
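A sketch of how that weighted, asymmetric affinity could be computed (wordAffinity and tfidf below are hypothetical placeholders for the WordNet-based word affinity and the tf-idf weight; the "tfidf2" term on the slide is kept abstract here):

    import java.util.List;
    import java.util.Set;

    // Sketch: for each word a in A, take its best affinity to any word in B,
    // weight it by its tf-idf in A and by x10 if it appears in the method name,
    // then average over the words of A.
    class WeightedGroupAffinity {
        static double affinity(List<String> groupA, List<String> groupB, Set<String> methodNameWords) {
            double sum = 0.0;
            for (String a : groupA) {
                double best = 0.0;
                for (String b : groupB) {
                    best = Math.max(best, wordAffinity(a, b));
                }
                double factor = methodNameWords.contains(a) ? 10.0 : 1.0;
                sum += best * tfidf(a, groupA) * factor;
            }
            return sum / groupA.size();
        }

        // Placeholders: real implementations would use WordNet similarity and tf-idf.
        static double wordAffinity(String a, String b) { return a.equals(b) ? 1.0 : 0.0; }
        static double tfidf(String word, List<String> group) { return 1.0; }
    }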

  34. Synthetic Experiment – Code Base • Log4J • Apache’s open source logging project • 33.3KLOC • 8.4K statements • 252 test methods • Used CoreTestSuite = 201 test methods • 1,061 actual test/method pairs traced

  35. Synthetic Experiment – Performance • CPU: Intel Core2-6320 1.86 GHz; RAM: 2 GB • Preprocessing: • Dynamic PA ≈ 6 sec • Natural Language PA ≈ another 12 sec • Creates two database files: • affinities.ser ~1.1 MB • testrank.ser ~2 MB • Query < 1 sec

  36. Synthetic Experiment – Method • Identified "core methods" covered by 20-30 tests each • Manually mutated four methods in order to get a test failing • Got ten test failures: • getLoggerRepository → {testTrigger, testIt} • setDateFormat → {testSetDateFormatNull, testSetDateFormatNullString} • getRenderedMessage → {testFormat, testFormatWithException, … 3 more} • getLogger → {testIt} • Example: LogManager.getLoggerRepository is covered by 30 tests. Planted bug: removed the "if" condition:
    // if (repositorySelector == null) {
    repositorySelector = new DefaultRepositorySelector(new NOPLoggerRepository());
    guard = null;
    LogLog.error("LogMananger.repositorySelector was null likely due to error in class reloading.");
    // }
    return repositorySelector.getLoggerRepository();
Actual result – Errors: SMTPAppenderTest.testTrigger; Failures: TelnetAppenderTest.testIt

  37. Synthetic Experiment – Method (2) • Pairs (mi, ti) of mutated method and actual failing test • e.g. input file actual_1.txt (descriptor of 3 such pairs):
    LogManager.java:174
    LoggerRepository org.apache.log4j.LogManager.getLoggerRepository()
    void org.apache.log4j.net.SMTPAppenderTest.testTrigger()
    void org.apache.log4j.net.TelnetAppenderTest.testIt()

  38. Synthetic Experiment – Method (3) • Reverted all mutations back to the original code • Ran TestRank preprocessing • For each pair (mi, ti), ran a query on mi and compared ti to each TestRank predictor's ranking (ti1, ti2, …, tim) • For predictor p, let the actual failed test's relative rank be RR_p = j/m, where i_j = i
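For example (hypothetical numbers, not from the experiment): if a predictor ranks m = 20 tests for the mutated method and the actual failing test appears at position j = 4, its relative rank is RR_p = 4/20 = 0.2, i.e. the bug would have been caught by running only the top 20% of that predictor's ranking.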

  39. Synthetic Experiment – Predictors' Results • Different heuristics predicted different failures • Best heuristics: Value Propagation, Affinity • Stack Depth Sum was very good on 4 experiments, and among the worst on the other 6 • Worst heuristic: Execution Count • (*) Bug #3 caused five tests to fail

  40. Synthetic Experiment – Predictors' Statistics: Improvement vs. "Safe" RTS • Affinity is best on average and median • Call Count is best at the 80th percentile • Worst heuristics: Execution Count, Stack Depth Sum • Simple Average is a bad meta-heuristic

  41. Conclusions • TDD is powerful but over time introduces waste in the retesting phase • "Safe" RTS techniques are too conservative for TDD (and are not really safe…) • Our technique makes it possible to find and run the tests most relevant to a given code change • Dynamic and natural-language analyses are key • Developers can run the relevant tests and avoid wasting time running the irrelevant ones • Eliminate waste from the TDD cycle while maintaining a high bug detection rate • Makes it easy to practice TDD rigorously

  42. (Near) Future Work • We are currently working on: • Affinity Propagation through call tree • Meta heuristics • Weighted average • Use experimental results as training data? • Self weighting heuristics • Further validation

  43. Future Work • Reinforcement learning • Strengthen correlation for true positives and weaken it for false positives • Interactive confirm/deny • Add annotations/tagging • Finer granularity → greater precision • String edit distance between literals • Consider external resources • Changes in files such as XML and properties • Combine global ranking • Test cyclomatic complexity / test code size • Use timing of tests for cost-effective ranking (short tests rank higher) • Selection should have good total coverage • Handle multiple edits • Integration with Eclipse and JUnit • Changes filtering: comments, refactoring, dead code • Combine static analysis (combine existing tools)

  44. Further Applications of Code/Test Correlation • Assist code comprehension • What does this code do? • Assist test maintenance • What is the sensitivity/impact of this code? • Which tests to change? • Find regressions to known past bugs • Related bug descriptions in the bug-tracking system • Reverse applications • Find a regression's cause: a test fails → where to fix (CodePsychologist++) • Find a bug's cause: find code relevant to a bug description in the bug-tracking system • TDD implementation assist: a spec (test) changes → where to implement

  45. Questions?

  46. Thank You

  47. How Often? Quote from JUnit FAQ: http://junit.sourceforge.net/doc/faq/faq.htm How often should I run my tests? Run all your unit tests as often as possible, ideally every time the code is changed. Make sure all your unit tests always run at 100%. Frequent testing gives you confidence that your changes didn't break anything and generally lowers the stress of programming in the dark. For larger systems, you may just run specific test suites that are relevant to the code you're working on. Run all your acceptance, integration, stress, and unit tests at least once per day (or night).

  48. How much time? We posted a question on stackoverflow.com http://stackoverflow.com/questions/1066415/how-much-time-do-you-spend-running-regression-tests-on-your-ide How much time do you spend running regression tests on your IDE, i.e. before check-in? • In most cases these test will run in < 10 seconds. To run the complete test suite I rely on the Hudson Continuous Integration server... (within an hour). • sometimes I run a battery of tests which takes an hour to finish, and is still far from providing complete coverage. • My current project has a suite of unit tests that take less than 6 seconds to run and a suite of system tests that take about a minute to run. • I would generally run all my tests once per day or so, as, in one job, I had about 1200 unit tests.

  49. Assumptions • Baseline - All tests in T pass on P • Change is localized to a single method • We currently ignore some possible inputs: • Source control history • Test results history • Test durations • Recently failed/ added/changed tests
