Enhancing Automated Software Testing for Critical Space Missions: A Mars Perspective

(Quickly) Testing the Tester via Path Coverage Alex Groce Oregon State University (formerly NASA/JPL Laboratory for Reliable Software)

A Sad Software Story A Critical Module: Multiplier FOR MARS A Very Important Space Mission Test Engineer

A Sad Software Story A Critical Module: Multiplier FOR MARS A Very Important Space Mission Automated testing! “If this fails, we could lose the mission!” Test Engineer

A Sad Software Story Test Engineer Complex automated test framework Multiplier FOR MARS

A Sad Software Story 6 months… 8,976,423,124 tests… Improvements… Bug fixes… Tester changes… 1,000,000,000 tests with NO failures! Test Engineer Complex automated test framework Multiplier FOR MARS

A Sad Software Story Multiplier FOR MARS Launch! Mission Day 9: 6 x 9 = 42… 42??? Test Engineer

A Sad Software Story • “We found three very subtle bugs. Manual testing would never have found them. We assumed it would find all the important bugs.” • “The automated tests had very high branch coverage.” • “We ran the tester for six days in a row, and found no bugs.” Congressional hearings

Automated Software Testing Congressional hearings: conclusions • Powerful, effective, important, but… • Relies on a large code base that may be nearly as complex as the module to be tested! • Behavior too complex to really understand • Configuration management can be a nightmare • Invites complacency about testing, neglect of manual tests • When a bug is introduced into the tester, the result may be lots of passing tests • Very hard to know when something is wrong

The Problem • Very hard to know when something is wrong • How do we know when an automated tester is producing false negatives (no failed tests) due to a bug in the tester? • “Bug” may mean a coding error, a configuration foul-up, or a fundamentally bogus assumption

The Problem • Automated testers are highly complex software systems with behavior that is • Particularly hard to specify (“find all the bugs” is not a nice clean LTL property or assertion) • Pretty much impossible for humans to understand (how do you summarize 100,000,000 tests?) • Easy to get wrong • Potentially mission- or safety-critical

Possible Solutions? • Traditional Regression Testing • Differential Testing (“bakeoff”) • Coverage Measures

Traditional Regression Testing • Run latest tester on old (known buggy) versions of the SUT • Good: • Good for detecting regressions of the tester • Easy to understand results (“Yesterday, my tester caught this bug; today, it does not”)

Traditional Regression Testing • Bad: • Changes to interface of SUT require lots of work • Very coarse, very slow – need full run to compare • Old bugs may be easier to find • As software becomes more mature, remaining bugs are (almost by definition) lower probability

Differential Testing • A variation: compare to a different tester on current software version • Problems: • Where do we get another effective automated tester? These things are hard to write! • If it’s better, why not just use that one? • Why bother with the copper tester when we have a gold standard available?

Coverage • Branch and statement coverage • Good, minimal checks: know why lines that aren’t covered aren’t covered • RED ALERT if a previously covered branch isn’t covered by latest version of the tester
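The red-alert check on this slide can be sketched as a comparison of two branch-coverage bitmaps, one from a previous tester version and one from the latest. This is only an illustration: the bitmap layout (one bit per branch) and the name `count_lost_branches` are assumptions, not something shown in the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: one bit per branch, packed 8 branches per byte.
 * Returns how many branches were covered in the old run but are
 * not covered in the new run. Any nonzero result is a red alert. */
size_t count_lost_branches(const uint8_t *old_cov, const uint8_t *new_cov,
                           size_t nbytes)
{
    assert(old_cov != NULL && new_cov != NULL);
    size_t lost = 0;
    for (size_t i = 0; i < nbytes; i++) {
        /* Bits set in the old bitmap and clear in the new one. */
        uint8_t gone = (uint8_t)(old_cov[i] & (uint8_t)~new_cov[i]);
        while (gone) {
            lost += gone & 1u;
            gone >>= 1;
        }
    }
    return lost;
}
```

Newly covered branches are deliberately ignored here; only coverage that disappears between tester versions is counted.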

Coverage • Branch and/or statement coverage • Coarse: random testing and model checking perform similarly, even in cases where model checking is known to be better for fault detection • Slow: may take full test period to find a difference in branch coverage • Full automated test runs often take a day or two • When do we declare the coverage worse, given the all-or-nothing nature of covering branches?

Path Coverage • Fine-grained • Therefore often quick • Exposes differences between test approaches that aren’t detected with branch coverage
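A small example may help show why path coverage can separate test approaches that branch coverage cannot. The function below is hypothetical (two independent decisions, so 4 branch outcomes and 4 paths); the names `run_case` and `count_bits` are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

struct coverage {
    unsigned branches; /* bit 0: a true, bit 1: a false,
                          bit 2: b true, bit 3: b false */
    unsigned paths;    /* bit p set when path p (0..3) was taken */
};

/* Run one test case through the two decisions and record
 * both the branch outcomes and the path taken. */
void run_case(struct coverage *cov, int a, int b)
{
    assert(cov != NULL);
    unsigned path = 0;
    if (a > 0) { cov->branches |= 1u << 0; path |= 2u; }
    else       { cov->branches |= 1u << 1; }
    if (b > 0) { cov->branches |= 1u << 2; path |= 1u; }
    else       { cov->branches |= 1u << 3; }
    cov->paths |= 1u << path;
}

/* Count the set bits in a coverage word. */
int count_bits(unsigned v)
{
    int n = 0;
    while (v) { n += (int)(v & 1u); v >>= 1; }
    return n;
}
```

With the two cases (1, 1) and (0, 0), all 4 branch outcomes are covered but only 2 of the 4 paths are; adding (1, 0) and (0, 1) leaves branch coverage unchanged while path coverage doubles.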

Another Software Story • File system modules for JPL’s Mars Science Laboratory mission • Automated testing system based on explicit-state model checking [VMCAI 08, WODA 08, CFV 08, ASE 08] • Weeks of “no bugs” testing • Developer of file system happened to stumble across some bugs while testing new functionality • “How did we miss this stuff???”

Path Coverage • Instrument with CIL • Track path bitvector, function entry

    if (x == 3) {
        x++;
        if (y > 0) {
            y++;
        }
    } else {
        x--;
    }

becomes

    if (x == 3) {
        add_to_bv(pathBV, 1);
        x++;
        if (y > 0) {
            add_to_bv(pathBV, 1);
            y++;
        } else {
            add_to_bv(pathBV, 0);
        }
    } else {
        add_to_bv(pathBV, 0);
        x--;
    }
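The slide does not show `add_to_bv` itself. A minimal sketch of what such a helper might look like follows, assuming a fixed-size bit vector that is reset at each top-level function entry. The struct layout, the 256-bit cap, `reset_bv`, and the wrapper `example_entry` are all hypothetical; only the `add_to_bv(pathBV, bit)` call shape comes from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PATH_BV_MAX_BITS 256   /* assumed cap, not from the talk */

struct path_bv {
    uint8_t bits[PATH_BV_MAX_BITS / 8];
    size_t  len;               /* number of decisions recorded so far */
};

/* Clear the vector; called on entry to a top-level function. */
void reset_bv(struct path_bv *bv)
{
    assert(bv != NULL);
    memset(bv, 0, sizeof *bv);
}

/* Append one branch decision. Returns 0 on success, -1 if full. */
int add_to_bv(struct path_bv *bv, int taken)
{
    assert(bv != NULL);
    if (bv->len >= PATH_BV_MAX_BITS)
        return -1;
    if (taken)
        bv->bits[bv->len / 8] |= (uint8_t)(1u << (bv->len % 8));
    bv->len++;
    return 0;
}

/* The instrumented fragment from the slide, wrapped in a
 * hypothetical entry function so it can be exercised. */
void example_entry(struct path_bv *pathBV, int *x, int *y)
{
    reset_bv(pathBV);
    if (*x == 3) {
        add_to_bv(pathBV, 1);
        (*x)++;
        if (*y > 0) {
            add_to_bv(pathBV, 1);
            (*y)++;
        } else {
            add_to_bv(pathBV, 0);
        }
    } else {
        add_to_bv(pathBV, 0);
        (*x)--;
    }
}
```

Storing the length alongside the bits matters: the paths "0" and "0, 0" have the same bit pattern and differ only in how many decisions were recorded.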

Path Coverage • Coverage here is per entry function, not whole program paths • Our application is a file system • Testing of a library: therefore we care about top-level function entry paths, not whole test cases • Takes less storage, still guarantees a unique path • Overhead is acceptable (~15%) because it does not change model checking storage time, which dominates test runtime
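Per-entry-function path coverage amounts to keeping, for each top-level function, the set of distinct paths seen so far. The sketch below is one possible bookkeeping scheme, not the one used in the talk: it assumes paths of at most 64 decisions, a fixed-capacity table, and a linear search, and the names `path_set`, `record_path`, and `paths_for_function` are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PATHS 1024   /* assumed capacity */

struct path_rec {
    int      func_id;    /* which top-level function was entered */
    unsigned len;        /* number of branch decisions on the path */
    uint64_t bits;       /* the decisions, one bit each */
};

struct path_set {
    struct path_rec recs[MAX_PATHS];
    size_t count;
};

/* Record a path. Returns 1 if it is new, 0 if already seen,
 * and -1 if the table is full. */
int record_path(struct path_set *set, int func_id, unsigned len, uint64_t bits)
{
    assert(set != NULL && len <= 64);
    for (size_t i = 0; i < set->count; i++) {
        const struct path_rec *r = &set->recs[i];
        if (r->func_id == func_id && r->len == len && r->bits == bits)
            return 0;
    }
    if (set->count == MAX_PATHS)
        return -1;
    set->recs[set->count++] = (struct path_rec){ func_id, len, bits };
    return 1;
}

/* Number of distinct paths recorded for one entry function. */
size_t paths_for_function(const struct path_set *set, int func_id)
{
    assert(set != NULL);
    size_t n = 0;
    for (size_t i = 0; i < set->count; i++)
        if (set->recs[i].func_id == func_id)
            n++;
    return n;
}
```

A real tool would likely use a hash table rather than a linear scan; the full path is stored here (rather than only a hash) so that two different paths are never merged.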

Traditional Regression Testing Ten minutes of testing (x 6 processors)

Ten minutes of testing (x 6 processors)

Swarm Model Checking • Standard Depth First Search on a very large model gets lost somewhere in a branch of a branch of a very big tree • Heuristics? But we have no idea • Where the bugs are • The structure of the state space • So, generate a vast array of different search configurations and transition orderings • And let parallelism (multicore desktops) have at it! • Most effective method we know for testing programs with very large state spaces
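The idea of generating many different search configurations can be illustrated with a seeded shuffle of transition orderings, one ordering per configuration. This is a toy sketch under stated assumptions, not the swarm tooling used in the talk: `NUM_TRANSITIONS`, `search_config`, `make_config`, and `make_swarm` are hypothetical, and the random source is a simple linear congruential generator chosen only for reproducibility.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TRANSITIONS 8   /* assumed number of model transitions */

struct search_config {
    uint32_t seed;                    /* seed that produced this ordering */
    int      order[NUM_TRANSITIONS];  /* permutation of 0..NUM_TRANSITIONS-1 */
};

/* Small LCG so that a given seed always yields the same ordering. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Build one configuration: a Fisher-Yates shuffle of the transitions. */
void make_config(struct search_config *cfg, uint32_t seed)
{
    assert(cfg != NULL);
    uint32_t state = seed;
    cfg->seed = seed;
    for (int i = 0; i < NUM_TRANSITIONS; i++)
        cfg->order[i] = i;
    for (int i = NUM_TRANSITIONS - 1; i > 0; i--) {
        int j = (int)(next_rand(&state) % (uint32_t)(i + 1));
        int tmp = cfg->order[i];
        cfg->order[i] = cfg->order[j];
        cfg->order[j] = tmp;
    }
}

/* Build n configurations with consecutive seeds, one per parallel job. */
void make_swarm(struct search_config *cfgs, size_t n, uint32_t base_seed)
{
    assert(cfgs != NULL);
    for (size_t i = 0; i < n; i++)
        make_config(&cfgs[i], base_seed + (uint32_t)i);
}
```

Each configuration would then be handed to a separate search process; recording the seed makes any run that finds a bug reproducible.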

Test Focus • Worse overall path coverage doesn’t always mean the tester is buggy • Can get better coverage of some functions if we don’t cover other functions at all • But we don’t want to cover only some functions… • Bugs may only arise when both are called • Or build 500 different configurations… • Automatic generation of a diverse set of focuses • Swarm for test focus
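Automatic generation of diverse focuses can be sketched as drawing random non-empty subsets of the API functions, then checking that a given pair of functions is enabled together in at least one focus (since some bugs may only arise when both are called). Everything named here is hypothetical: `NUM_FUNCS`, `make_focus`, and `pair_covered` are illustrations, not the talk's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_FUNCS 6   /* assumed number of API functions under test */

/* Draw one focus: bit f set means function f may be called.
 * Uses a small LCG; the caller owns the generator state. */
uint32_t make_focus(uint32_t *rng_state)
{
    assert(rng_state != NULL);
    uint32_t mask;
    do {
        *rng_state = *rng_state * 1664525u + 1013904223u;
        mask = (*rng_state >> 16) & ((1u << NUM_FUNCS) - 1u);
    } while (mask == 0);   /* an empty focus would test nothing */
    return mask;
}

/* Returns 1 if some focus enables both function a and function b. */
int pair_covered(const uint32_t *focuses, size_t n, int a, int b)
{
    assert(focuses != NULL);
    assert(a >= 0 && a < NUM_FUNCS && b >= 0 && b < NUM_FUNCS);
    uint32_t both = (1u << a) | (1u << b);
    for (size_t i = 0; i < n; i++)
        if ((focuses[i] & both) == both)
            return 1;
    return 0;
}
```

If `pair_covered` fails for some pair, more focuses can be drawn until every pair of interest appears together at least once.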

Is Path Coverage the Solution? • Not really • It’s helpful, and it finds some problems • Branch/path coverage measures should be seen as basic due diligence for critical systems testing • But testing the tester is still very difficult

Questions? Suggestions? • How do you test your automated testers?
