This paper presents the development of logistic regression (LR) models that use design-complexity metrics to predict fault-prone object-oriented classes across software projects. Data from seven distinct projects were examined, and simple log data transformations were applied to improve model performance. The key finding is that predictive ability improves, particularly for projects whose metric distributions differ from those of the project used to build the model. The results underscore the importance of locally derived thresholds for complexity metrics, as well as the potential of log transformations in regression modeling.
Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects. Erika Camargo and Ochimizu Koichiro, Japan Advanced Institute of Science and Technology. ESEM 2009
Contents • Abstract • Background • Problem Analysis • Case study • Results • Conclusion and Future Work
Abstract Challenge: To make logistic regression (LR) models, which use design-complexity metrics, able to predict fault-prone object-oriented classes across software projects. First attempted solution: simple log data transformations. [Figure: P(y=1), the probability of a fault-prone class, plotted against x, a design-complexity metric]
Background • Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models • Among these metrics are the Chidamber & Kemerer (CK) metrics • The 80th and 20th percentiles of the metric distributions can be used to determine high and low values • Their thresholds cannot be determined before their use and should be derived and used locally
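The percentile-based thresholds mentioned above can be sketched as follows. This is only an illustration: the nearest-rank percentile definition, the function names, and the CBO values are assumptions made for the example and do not come from the study.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def local_thresholds(values, low_pct=20, high_pct=80):
    """Return (low, high) thresholds derived from one project's own metric values."""
    return percentile(values, low_pct), percentile(values, high_pct)


# Hypothetical CBO values for the classes of a single project (illustrative only).
cbo_values = [1, 2, 2, 3, 4, 5, 7, 9, 12, 30]
low, high = local_thresholds(cbo_values)
print(f"low threshold: {low}, high threshold: {high}")
```

Because the thresholds are computed from a project's own distribution, running the same function on another project's values will generally give different cut-offs, which is the sense in which they are derived and used locally.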
Problem Analysis Can an LR model built with this kind of metric work efficiently with different software projects? [Figure: P(y=1) against x = Number of Methods, from least faulty to most faulty, for a large software project and a small software project, with the values 20 and 10 marked]
Case Study • Data analysis of 7 different projects and application of simple log data transformations • Construction of 3 univariate LR models using a large open source project (1st release of the Mylyn system, with 638 Java classes) • Independent variables: CK-CBO, CK-RFC, CK-WMC • Dependent variable: Defects (from Bugzilla & CVS) • Testing of these models with 2 other, smaller projects (with 11 and 13 Java classes)
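A univariate LR model relates a single metric x to the probability P(y=1) = 1 / (1 + e^-(b0 + b1*x)). The sketch below fits such a model with plain batch gradient ascent on the log-likelihood. It is a minimal illustration under stated assumptions: the fitting procedure, the learning rate, and the data are made up for the example, and the slides do not say which tool or estimation method the authors used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_univariate_lr(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by batch gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # residual for this class
            g0 += err
            g1 += err * x
        b0 += learning_rate * g0 / n
        b1 += learning_rate * g1 / n
    return b0, b1


def predict_proba(x, b0, b1):
    """Predicted probability that a class with metric value x is fault-prone."""
    return sigmoid(b0 + b1 * x)


# Hypothetical training data: one metric value per class, and 1 if the class had defects.
xs = [0.0, 0.3, 0.5, 0.6, 0.8, 1.0, 1.1, 1.3, 1.4, 1.6]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_univariate_lr(xs, ys)
print(f"b0={b0:.3f}, b1={b1:.3f}, P(y=1 | x=1.6)={predict_proba(1.6, b0, b1):.3f}")
```

Testing across projects then amounts to calling `predict_proba` with the fitted `b0` and `b1` on metric values taken from a different project.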
Challenge: outliers produce biased regression estimates and reduce the predictive power of regression models. Projects: BNS: Banking system (2006) * CRS: Cruise control system (2005) * ECS: E-commerce system (2006) * ELCS: Elevator control system (2003) * FACS: Factory automation system (2005) * GMF: Graphic Modeling Framework ** MYL: Mylyn system ** (*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000. (**) Eclipse project.
The RFC data of BNS is more spread out than the RFC data of MYL.
Case Study Solution: simple data transformation using log10. Example: LCBO = log10(CBO + 1); LTCBO = log10(CBO + 1) + dm, where dm is the difference between the average CBO values of the Mylyn system and of the system whose data is being transformed. Effects: • Fewer outliers • More uniform data spread
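The two transformations can be sketched as below. The slide does not pin down how dm is computed, so this sketch makes three assumptions that should be checked against the paper: the central value is the median, it is taken on the log10 scale, and dm is the Mylyn value minus the target project's value. The function names and data are hypothetical.

```python
import math
import statistics


def log_transform(values):
    """L = log10(x + 1); the +1 keeps classes with a metric value of 0 defined."""
    return [math.log10(v + 1) for v in values]


def shifted_log_transform(values, reference_values, center=statistics.median):
    """LT = log10(x + 1) + dm.

    ASSUMPTION: dm = center(log-scaled reference) - center(log-scaled target).
    Pass center=statistics.mean to use means instead of medians.
    """
    logged = log_transform(values)
    dm = center(log_transform(reference_values)) - center(logged)
    return [v + dm for v in logged], dm


# Hypothetical CBO values (illustrative only).
reference_cbo = [0, 9, 99]  # stands in for the project the model was built on
target_cbo = [0, 0, 9]      # stands in for a smaller project to be predicted

lcbo = log_transform(target_cbo)
ltcbo, dm = shifted_log_transform(target_cbo, reference_cbo)
print("LCBO:", lcbo)
print("LTCBO:", ltcbo, "dm:", dm)
```

Under these assumptions the shift aligns the centre of the target project's log-scaled values with that of the reference project while leaving their spread unchanged.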
Results Effects of the log data transformations: • Elimination of a large number of outliers • Better overall goodness of fit for the 3 models • Discrimination (Most Faulty/Least Faulty): all models discriminate well between the most faulty and least faulty classes of the Mylyn system • What about using different projects?
Results: Banking system [Figure: MF = Most Faulty, LF = Least Faulty]
Results: E-commerce system [Figure: MF = Most Faulty, LF = Least Faulty]
Conclusions and Future Work • CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects • Simple log transformations seem to improve the prediction ability of LR models, especially when the project measures are not as spread out as those used to build the model • Future work: further data exploration and study of data transformations
Thank you! questions, comments … contact: erika.camargo@jaist.ac.jp