On Some Fault Prediction Issues

Presentation Transcript


  1. On Some Fault Prediction Issues CAMARGO CRUZ Ana Erika and IIDA Hajimu, Software Design Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology

  2. Introduction At the NATO Software Engineering Conference (1968), Mr. Nash from the IBM UK Laboratories suggested: • Test Planning and Status Control as valuable tools for managing the later part of a software development cycle. Test Planning spans various research areas; one of them is: • Fault Prediction

  3. Fault Prediction • Goal: • Predict faulty components → more effective criteria for the elaboration of test cases. • State of the art: • A large number of prediction models have been proposed over the last decades, but they are hardly put into practice. • Projects: mostly open source software. • Predictors: mainly product metrics.

  4. Fault Prediction A brief history of predictors: • Akiyama et al. [1971]: LOC. • Halstead et al. [1975]: complexity metrics. • Basili et al. [1996]: design complexity metrics (coupling, inheritance, etc.). • E. Arisholm et al. [2007], Moser et al. [2008], S. Shivaji et al. [2009], Y. Kamei et al. [2010]: process, history and repository metrics (number of commits, LOC modified, past faults, times a file is refactored, etc.). The latest literature review [1] finds that, overall, LOC is useful for fault prediction. Are we reinventing the wheel? Logically, the components with the greatest number of LOC are also the most complex and the most frequently changed. [1] Hall, T., Beecham, S., Bowes, D., Gray, D. and Counsell, S.: A Systematic Literature Review on Fault Prediction Performance in Software Engineering, IEEE Transactions on Software Engineering, Vol. 38, No. 6, pp. 1276–1304 (2012).

  5. Research Questions • RQ1: Does multicollinearity exist? • If our supposition is right, these metrics would be highly inter-correlated. • RQ2: How much more accurate is a model using inter-correlated metrics than a single-metric model? • RQ3: Is their usage worth the cost? • If Model A is more accurate than Model B, is its prediction accuracy worth the effort invested to construct it? [Figure: Prediction Model A takes multiple metrics (Metric1 … MetricN) as input, Prediction Model B takes a single metric; both classify components as faulty or not faulty.]

  6. The Multicollinearity Problem • The most common prediction techniques used for fault prediction (Naive Bayes and Logistic Regression): • assume the predictor variables to be independent of each other. • For example, suppose a correlation matrix among the predictors such as the following:
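The correlation matrix from the original slide did not survive the transcript. As a hedged illustration only, the following Python sketch builds such a matrix from toy data; the column names (LOC, NCommits, Faults) echo the next slide and are not real measurements.

```python
# A minimal sketch (not from the slides): a correlation matrix
# for hypothetical fault-prediction metrics, built with pandas.
import pandas as pd

# Hypothetical per-file measurements; all values are toy data.
data = pd.DataFrame({
    "LOC":      [120, 450, 80, 900, 300, 60],
    "NCommits": [5,   20,  3,  41,  14,  2],
    "Faults":   [1,   6,   0,  12,  4,   0],
})

# Pearson correlations: values near 1 between two *predictors*
# (e.g. LOC and NCommits) signal potential multicollinearity.
print(data.corr(method="pearson"))
```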

  7. The Multicollinearity Problem • Shared variance: the amount of variation of two variables that tend to vary together. • Multiple correlation coefficient: a measure of the fit of a multiple linear regression model → [0, 1]. [Figure: Venn diagrams over LOC, NCommits and Faults. When NCommits and LOC largely overlap, NCommits adds poor significance to the model; ideally, each predictor shares variance with Faults but not with the other predictors.]
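As a sketch of how the multiple correlation coefficient can be obtained (our own illustration, not taken from the slides): fit a multiple linear regression and correlate its fitted values with the observed ones. All numbers below are toy data.

```python
# Multiple correlation coefficient R: the correlation between the
# observed Faults and the fitted values of a multiple linear
# regression on LOC and NCommits.
import numpy as np

loc      = np.array([120, 450, 80, 900, 300, 60], dtype=float)
ncommits = np.array([5,   20,  3,  41,  14,  2],  dtype=float)
faults   = np.array([1,   6,   0,  12,  4,   0],  dtype=float)

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(loc), loc, ncommits])
coef, *_ = np.linalg.lstsq(X, faults, rcond=None)
fitted = X @ coef

# R lies in [0, 1]; values near 1 indicate a good regression fit.
R = np.corrcoef(faults, fitted)[0, 1]
print(f"multiple correlation coefficient R = {R:.3f}")
```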

  8. A Rapid Literature Review [1] • From 208 research papers, only 36 passed their review, i.e., reported sufficient contextual and methodological information. • We reviewed 13 of the 36. • (RQ1) Finding 1: Studies report multicollinearity among the predictors they use, but no details are provided. • The correlation matrix among predictors is missing. • Principal Component Analysis is used to alleviate the problem, but no major details are given. [1] Hall, T., Beecham, S., Bowes, D., Gray, D. and Counsell, S.: A Systematic Literature Review on Fault Prediction Performance in Software Engineering, IEEE Transactions on Software Engineering, Vol. 38, No. 6, pp. 1276–1304 (2012).
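Since the slide names Principal Component Analysis as the mitigation the reviewed studies mention, here is a minimal sketch of the idea; the use of scikit-learn and the toy data are our choices, not something the reviewed papers prescribe.

```python
# Replacing inter-correlated predictors with orthogonal principal
# components, which are uncorrelated by construction.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy matrix of correlated metrics (rows = files; columns = LOC, NCommits).
X = np.array([[120, 5], [450, 20], [80, 3],
              [900, 41], [300, 14], [60, 2]], dtype=float)

# Standardize first, since PCA is sensitive to the scale of each metric.
Xs = StandardScaler().fit_transform(X)
pca = PCA()
components = pca.fit_transform(Xs)

# With strongly correlated inputs, the first component captures most
# of the variance; a model can then be trained on the components.
print(pca.explained_variance_ratio_)
```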

  9. A Rapid Literature Review • (RQ2) Finding 2: Studies do not report their prediction results on single metrics, but on sets of metrics. • We cannot know how much more accurate a model using multiple metrics is than one using a single metric. • Exceptions: • LOC is a useful predictor with stable performance [2,3]. • A LOC-only model was suggested as a viable alternative to more complex models [4]. • One study exploring design complexity metrics (Chidamber and Kemerer) found that using a single-predictor model yielded better prediction accuracy than using multiple metrics [5]. [2] M. D'Ambros, M. Lanza, and R. Robbes: An extensive comparison of bug prediction approaches, MSR 2010, 7th IEEE Working Conference on Mining Software Repositories, pp. 31–41 (2010). [3] Z. Hongyu: An investigation of the relationships between lines of code and defects, ICSM 2009, IEEE International Conference on Software Maintenance, pp. 274–283 (2009). [4] R. Bell, T. Ostrand, and E. Weyuker: Looking for bugs in all the right places, Proceedings of the 2006 International Symposium on Software Testing and Analysis, ACM (2006). [5] T. Gyimothy, R. Ferenc, and I. Siket: Empirical validation of object-oriented metrics on open source software for fault prediction, IEEE Transactions on Software Engineering, Vol. 31, No. 10, pp. 897–910 (2005).
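To make Finding 2 concrete, here is a minimal sketch of the missing comparison: a LOC-only logistic regression versus one that adds a correlated NCommits metric, evaluated on the same held-out split. The data are synthetic and the metric names hypothetical; because NCommits is generated to track LOC, the two models tend to score similarly, which is exactly the comparison the reviewed studies do not report.

```python
# Single-metric vs. multi-metric fault-proneness classifiers on the
# same synthetic data and the same train/test split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
loc = rng.integers(50, 1000, size=200).astype(float)
ncommits = loc / 25 + rng.normal(0, 2, size=200)   # correlated with LOC
y = (loc + rng.normal(0, 150, size=200) > 500).astype(int)  # faulty or not

X_single = loc.reshape(-1, 1)
X_multi = np.column_stack([loc, ncommits])

for name, X in [("LOC only", X_single), ("LOC + NCommits", X_multi)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```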

  10. A Rapid Literature Review • (RQ3) Finding 3: The gain of using multiple metrics is not assessed either. • An exception is the work of E. Arisholm et al. [7,8], which proposes a measure of cost-effectiveness. • However, their results are reported on sets of metrics. → Size: if the only thing a prediction model does is capture the fact that the number of faults of a class is proportional to its size, there would likely be little gain from such a model. [7] E. Arisholm, L. C. Briand, and M. Fuglerud: Data mining techniques for building fault-proneness models in telecom Java software, ISSRE 2007, 18th IEEE International Symposium on Software Reliability Engineering, pp. 215–224 (2007). [8] E. Arisholm, L. C. Briand, and E. B. Johannessen: A systematic and comprehensive investigation of methods to build and evaluate fault prediction models, Journal of Systems and Software, Vol. 83, No. 1, pp. 2–17 (2010).
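As a loose sketch of the cost-effectiveness idea (a simplification inspired by Arisholm et al. [7,8], not their exact measure): rank files by predicted fault-proneness and track the share of faults covered per LOC inspected. All values below are toy data.

```python
# Cost-effectiveness sketch: faults found vs. effort (LOC inspected)
# when files are inspected in order of predicted fault-proneness.
loc    = [120, 450, 80, 900, 300, 60]
faults = [1,   6,   0,  12,  4,   0]
score  = [0.2, 0.7, 0.1, 0.9, 0.5, 0.05]  # hypothetical model output

order = sorted(range(len(loc)), key=lambda i: score[i], reverse=True)
total_loc, total_faults = sum(loc), sum(faults)

seen_loc = seen_faults = 0
for i in order:
    seen_loc += loc[i]
    seen_faults += faults[i]
    print(f"inspected {seen_loc / total_loc:5.1%} of LOC -> "
          f"found {seen_faults / total_faults:5.1%} of faults")

# A model that merely mirrors size traces a curve close to the one
# obtained by inspecting the largest files first, i.e. little gain.
```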

  11. For Discussion: Course of Action • Study other metrics. Which ones? Where from? • Publicly available data comes only from open source software projects. • Metrics mined from these projects (product, repository) may be telling us the same thing. • Metrics on experience and team communication: where would the data come from? • Include results and analysis: • on single-predictor models as opposed to multiple-predictor models. • on cost-effectiveness measures.

  12. Conclusions • Although multicollinearity among predictors of faulty code is reported: • little is known about the gain of using multiple predictors as opposed to single-predictor models. • We think that researchers have exhausted the exploration of metrics from open source software. • Other factors which may be related to faulty code are difficult to study due to the lack of publicly available data.
