Finding Errors in .NET with Feedback-Directed Random Testing By Carlos Pacheco, Shuvendu K. Lahiri and Thomas Ball Presented by Bob Mazzi 10/7/08
Introduction • Testing software is expensive • Testing is 50% of total software cost • Microsoft has about one tester for each developer • Time-consuming part of the development cycle • Limited by testers' ability to determine what to test • Random testing • Overall effectiveness still unproven • Effectiveness in individual studies seems promising
This Paper’s Contributions • Goals of this research • Studies needed in real-world situations • Industrial environment vs. research environment • Are the errors found significant enough to justify testing and correction? • Case study • This study confirms feedback-directed random testing can find errors in an industrial environment that other forms of testing had not found. • New information developed to compare different types of testing. • Specific results • 30 new errors found • Errors also found in other testing tools • Error detection rate diminished to nil after 150 hours of testing.
Overview of the .NET Component Under Test • Critical component • 100K lines of code • 5 years of development • About 40 developers currently work on this component • About 40 test engineers are responsible for testing it • The component is used by many MS applications
Current Testing Approach • Developers • Minimal testing of units as work is performed • Some design for testing • Test engineers • Manual testing • Internally developed testing tools • Beta testing • Thousands of users within Microsoft use this component as part of the projects that they develop. • A large number of end users who install beta versions of software that relies upon this component also perform beta and production testing.
Implications of Current Testing Approach • Error-finding rate was high earlier in the cycle • Error-finding rate using the existing testing team has diminished to about 20 per man-year • Component is mature and stable • One of the stated goals of this study was to determine whether this was “good enough”, or whether feedback-directed random testing could improve the component further.
This Paper: Feedback-Directed Random testing • Addresses Automatic Generation of Unit tests • Randoop = RANDom tester for Object Oriented Programs
How Randoop works • Inputs • Module to develop test cases for • How long to run Randoop • Optional configuration files (areas not to test because they contain known errors not yet fixed) • Process • Randomly determine a method call to test • Apply test inputs to the method • Review outputs looking for error-revealing sequences • Save as a method sequence • Outputs • Test cases that should not fail • Test cases that should fail
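The feedback loop can be illustrated with a minimal sketch. The bounded-stack class, its injected bug, and the call budget below are illustrative assumptions, not the component from the study; the point is only the generation strategy: extend a sequence that did not fail, execute it, and classify it by the observed outcome.

```python
import random

# Toy class under test (an assumption for illustration): pop() is missing an
# empty-check, so popping an empty stack raises an unexpected IndexError.
class BoundedStack:
    def __init__(self):
        self.items = []

    def push(self, x):
        self.items.append(x)

    def pop(self):
        return self.items.pop()   # bug: no guard for the empty case

def random_call():
    """Pick one concrete method call to append to a sequence."""
    return ("push", random.randint(-5, 5)) if random.random() < 0.5 else ("pop",)

def run(sequence):
    """Replay a sequence of concrete calls against a fresh object."""
    stack = BoundedStack()
    for call in sequence:
        if call[0] == "push":
            stack.push(call[1])
        else:
            stack.pop()

def generate(budget=500):
    """Feedback-directed generation: extend a previously non-failing sequence
    by one random call, execute it, and use the outcome to classify it."""
    pool = [[]]                  # sequences known not to fail (start with the empty one)
    error_revealing = []
    for _ in range(budget):
        seq = random.choice(pool) + [random_call()]
        try:
            run(seq)
        except Exception as exc:
            error_revealing.append((seq, exc))   # candidate "test case that should fail"
        else:
            pool.append(seq)                     # feedback: reuse it as a prefix later
    return pool, error_revealing

if __name__ == "__main__":
    pool, errors = generate()
    print(f"{len(pool)} non-failing sequences, {len(errors)} error-revealing sequences")
    if errors:
        shortest = min(errors, key=lambda e: len(e[0]))
        print("shortest error-revealing sequence:", shortest[0], "->", repr(shortest[1]))
```

The split at the end mirrors the two outputs listed above: the pool becomes the “should not fail” regression cases, and the error-revealing sequences are the candidate bug reports.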
Insulating Randoop from the Operating System • The problem • Because Randoop is testing a component that accesses the OS directly at a low level, some test runs will cause a crash by generating a method sequence that interacts inappropriately with the OS. • The solution • Insulate Randoop from the OS by placing it in a “wrapper”. • The wrapper acts as a virtual machine so that a crashing case can be terminated. • Once this happens, a new instance can be started and testing can continue.
How the Randoop Wrapper Works • What the wrapper does • The Randoop wrapper is designed to spawn a new instance of Randoop after a crash • As Randoop creates a sequence to explore, it records the sequence before attempting to execute it • When a crash occurs, the process starts again • Methods whose crashes are expected can be excluded from being explored again • Methods that should NOT have crashed can be explored further, as they may be part of an error-revealing sequence
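A minimal sketch of the wrapper idea, assuming a hypothetical command line and exclusion flag (the paper does not give the wrapper's actual interface): a supervising process keeps restarting the generator after a crash, and because the generator records each sequence before executing it, the last recorded call identifies what brought the process down.

```python
import subprocess

def supervise(generator_cmd, max_restarts=10):
    """Keep restarting the test generator whenever a generated sequence
    crashes the whole process, growing a list of calls to exclude."""
    excluded = []            # calls whose crashes are expected/legitimate
    for restart in range(max_restarts):
        cmd = generator_cmd + [f"--exclude={name}" for name in excluded]
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            print("generator finished its time budget cleanly")
            return
        # The generator logs each sequence before executing it, so the last
        # line of output identifies the call that crashed the process.
        out = proc.stdout.strip()
        last_call = out.splitlines()[-1] if out else "?"
        print(f"restart {restart}: process crashed in {last_call}")
        if crash_is_expected(last_call):
            excluded.append(last_call)     # don't explore it again
        # otherwise leave it in: it may be part of an error-revealing sequence

def crash_is_expected(call_name):
    # Placeholder policy: in the study, this decision was made by the test team.
    return False

# Example (hypothetical command line and flags):
# supervise(["randoop-runner", "--assembly=Component.dll", "--minutes=30"])
```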
Using Randoop • Randoop was provided to the test team along with instructions • Initial use relied on default settings, with no configuration files • As errors were identified, they were added to configuration files to prevent generation of unproductive test cases • Regular meetings were held to discuss the use of Randoop • Some of these meetings resulted in suggestions for Randoop changes, which were implemented during the study • As the case study progressed, the test team started to use Randoop in more sophisticated ways, targeting specific areas and using longer test-generation runs.
Overall Results • 30 serious errors were detected • These were previously unknown errors • Time spent over the test period included reviews of error-revealing tests • Each error found used about 5 hours of CPU time and ½ hour of tester time • Prior testing averaged approximately 100 hours of tester time per error found
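Taking the figures above at face value, the tester-time saving can be made explicit (CPU time is left unpriced here, matching the slide's framing):

```python
# Rough cost comparison implied by the stated numbers.
errors_found      = 30
tester_hours_each = 0.5      # reviewing each error-revealing test
prior_hours_each  = 100      # historical tester-hours per error found

randoop_total = errors_found * tester_hours_each   # 15 tester-hours
prior_total   = errors_found * prior_hours_each    # 3,000 tester-hours
print(f"Randoop: {randoop_total} tester-hours for {errors_found} errors")
print(f"Prior approach: {prior_total} tester-hours for the same number of errors")
print(f"Ratio: {prior_hours_each / tester_hours_each:.0f}x less tester time per error")
```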
Error Characteristics • New errors found in previously well-tested code • A previously untested path was explored that caused an error • Analysis revealed that prior testing was not looking for illegal reference addresses • Additional testing was implemented to look for this type of error in other places • Output of Randoop (test cases) used as input to other testing tools • During the study, a request was made to modify Randoop to optionally output all test cases • Test cases were used as input to other tools • Other errors were found using these tools • Test cases from Randoop also revealed errors when run in stress and concurrency tests
Error Characteristics • Errors were found that pointed to errors in other testing tools • Testing revealed that an output message describing a specific error was missing in the production build • The tool that was being used did not correctly test for some missing messages • By correcting this issue, the other test tool was fixed • Corner cases and other testing • Unexpected tests were generated that found areas not covered by existing tests • New ways to apply other tests were found • Test cases were generated that identified errors revealing unsuspected gaps in the program design • Additional manual tests were developed • Manual testing policies were updated
General Comparisons to Other Test Generation Tools • Randoop was able to find previously unknown errors not found by other test tools • Randoop was not able to find some known errors • Randoop did not have a clear stopping point • Randoop's good performance allowed it to be applied to a larger component and to generate test cases more quickly than another existing, similar tool • As a random test-case generation tool, Randoop takes a basically unbiased approach
The Plateau Effect? • During testing it was noticed that the rate at which errors were found appeared to diminish in steps as testing proceeded • Once the final plateau was reached, the rate of errors found appeared to drop to zero • This “plateau” was first noticed on a single PC • The first 2 hours found 10 errors, a rate of 5 errors per hour • The next 10 hours found 20 additional errors, a rate of 2 errors per hour • Further testing did not reveal additional errors • Later this effect was noticed on a large cluster of PCs as well • Additional new errors were detected • Decreasing rate of errors found
Related Work • Random testing has been used in many other testing areas • Unix utilities - Miller et al. 90 • Windows GUI applications - Forrester and Miller 00 • Object-oriented code - Csallner et al. 04, Pacheco and Ernst 05, Jartege 05 • Some approaches combine random testing with some form of direction • Ferguson and Korel 96, Godefroid et al. (DART) 05, Sen et al. (CUTE) 06 • There is no consensus on the validity of random testing • Myers et al. 04, Hamlet and Taylor 90, Ntafos 98, Groce et al. 07, Pacheco et al. 07 • There is relatively little research on true random testing as compared to forms of directed testing • Ferguson and Korel 96, Marinov et al. 03, Visser et al. 06
Conclusions • This case study was designed to • Test in a production environment • Test on a mature, well-tested product • Implementing Randoop found errors that prior testing did not • Randoop did NOT find some errors that other test tools did • Some errors found point to other issues • Faults in other tests • Omissions in the prior testing process
Comments – Future Work • Comparisons to other test generation tools • Performance was stated to be superior to other tools • Randoop is cited as being unbiased in comparison to other tools • Coverage – lack of a stopping point • Because of the particulars of this component, the number of possible tests becomes extremely large • This number itself may not be readily calculable • Is there a valid stopping point? • Plateau effect • The rate of errors found is not consistent • The rate drops in two steps: 5 per hr → 2 per hr → no additional errors? • Is this valid, and why?
Comments • Error in the number of errors? • There also seems to be an error in the paper regarding the number of errors found. In the initial comments on the plateau effect, the authors stated that they found 30 errors (10 errors in the first 2 hours plus 20 errors in the next 10 hours). Later, when they installed Randoop on a cluster, they found an unspecified number of additional errors. • Claim that Randoop is unbiased • As Randoop was not able to find other known errors, I find this claim less than convincing. Is it possible that Randoop is biased internally in some fashion such that it missed the other known issues? I would be much more comfortable with stating that it is “relatively unbiased”, “differently biased”, or even “significantly less biased” than stating “unbiased” outright when I do not know why it could not find the other errors. Regardless of the terminology, it is obvious that it is either less biased or differently biased, as it found errors that were previously unknown.
Comments • Plateau effect • In this type of production environment, the best approach may be to dedicate a CPU to run Randoop continuously. By stopping execution only to modify configuration files to exclude found-but-unfixed errors, or when a fresh build of the component is available, it can be said that Randoop has found all of the errors it is capable of finding at any specific point in time. The cost of running the tool continuously should be compared to the cost of a test engineer. If the burdened cost of an engineer is $120K per year, that works out to $6K per error found. If Randoop costs the same per year as a person and finds a new error every two weeks, it will still find more errors in a year than the person. I find it unlikely that in 150 hours of CPU time (less than a week of real time), during which it found 30 errors, Randoop suddenly hit a plateau where it had found all of the errors it was capable of detecting. • If the plateau were truly “flat”, the cluster testing should not have revealed any additional errors. In practice, it did reveal some, which indicates that all findable errors had not been found. The authors go on to state that the cluster runs used different seeds and configuration files. • This leads me to question just how random Randoop is. It would be interesting to try multiple runs with a single seed and different configuration files, as well as runs with a single configuration file and different seeds. • Another question has to do with the Randoop wrapper. Early in the paper it states that the wrapper respawns Randoop using a different seed, yet the cluster testing states that a different seed was entered. Does the wrapper generate a truly random seed to start Randoop?
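The cost argument above can be checked with a few lines (the $120K burdened cost and the one-error-every-two-weeks rate are the reviewer's own assumptions, carried over from the paragraph):

```python
# Back-of-the-envelope check of the cost argument.
burdened_cost_per_year = 120_000          # dollars, assumed tester cost
tester_errors_per_year = 20               # prior rate: ~20 errors per man-year
cost_per_error_tester  = burdened_cost_per_year / tester_errors_per_year   # $6,000

randoop_errors_per_year = 52 / 2          # one new error every two weeks ≈ 26/year
print(f"Tester: ${cost_per_error_tester:,.0f} per error, {tester_errors_per_year} errors/year")
print(f"Randoop at the same annual cost: {randoop_errors_per_year:.0f} errors/year")
```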