Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects

Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and OchimizuKoichiro Japan Institute of Science and Technology ESEM 2009

Contents • Abstract • Background • Problem Analysis • Case study • Results • Conclusion and Future Work

Abstract Challenge: To make logistic regression (LR) models, which use design-complexity metrics, able to predict fault-prone o-o classes across software projects. First attempt of solution: simple log data transformations P(Fault prone class) X = design-complexity metric P(y=1) x

Background • Some design-complexity metrics have shown to be good predictors of fault-prone classes in LR models • Among these metrics are the Chidamber & Kemerer (CK) metrics • 80th and 20th percentiles of the distributions can be used to determine high and low values • Their thresholds cannot be determined before their use and should be derivedand used locally

Problem Analysis Can a LR model built with these kind of metrics work efficiently with different software projects? LEAST FAULTY MOST FAULTY P (y=1) Large Size SW project Small Size SW project X = Number of Methods 20 10

Case Study • Data analysis of 7 different projects andapplication of simple log data transformations. • Construction of 3 univariate LR models using a large open source project (1st release of the MYLYN System with 638 Java classes). • Dependent Variables: CK-CBO, CK-RFC, CK-WMC • Independent Variables: Defects (from Bugzilla & CVS) • Test these models with 2 other smaller projects (with 11 and13 Java classes)

Challenge BNS: Banking system (2006) * CRS: Cruise control system (2005) * ECS: ecommerce system (2006) * ELCS: Elevator control system (2003)* FACS: Factory automation system (2005) * GMF: Graphic Modeling Framework ** MYL : Mylyn system ** produced biased regression estimates and reduce the predictive power of regression models (**) Eclipse Project (*) systems developed by students of JAIST, described in: Gomaa Hassan, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison Wesley-Object Technology Series Editors, July 2000.

BNS: Banking system (2006) * CRS: Cruise control system (2005) * ECS: ecommerce system (2006) * ELCS: Elevator control system (2003)* FACS: Factory automation system (2005) * GMF: Graphic Modeling Framework ** MYL : Mylyn system ** RFC Data of BNS is more spread than the data of the MYL (**) Eclipse Project (*) systems developed by students of JAIST, described in: Gomaa Hassan, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison Wesley-Object Technology Series Editors, July 2000.

Case Study Solution. Simple data transformation using “Log10” Example : • Number of Outliers are less • Data Spread is more uniform LCBO = Log10(CBO+1) LTCBO = Log10(CBO+1) + dm; Where dm is the difference of CBO medias of the Mylyn system and the system which data is being transformed

Results Effects of the Log data Transformations: • Elimination of great number of outliers • Overall goodness of fit of the 3 models is better • Discrimination (Most Faulty/Least Faulty) • All models discriminate well between most Faulty and Least Faulty classes of the Mylyn System • What about using different projects?

Results MF: Most Faulty LF: Least Faulty BANKING SYSTEM

Results MF: Most Faulty LF: Least Faulty E-COMMERCE SYSTEM

Conclusions and Future work • CK-CBO, CKR-RFC ad CK-WMC can have different distributions in different projects • Simple Log Transformations seem to improve the prediction ability of LR models, specially when the project measures are not as spread as those used in the construction of the model. • Further data exploration and study of data transformations

Thank you! questions, comments … contact: erika.camargo@jaist.ac.jp

Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects

Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects

Presentation Transcript

Logistic regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic regression

Logistic Regression