
Identifying Reasons for Software Changes Using Historic Databases


Presentation Transcript


  1. Identifying Reasons for Software Changes Using Historic Databases. A CISC 864 Analysis by Lionel Marks

  2. Purpose of the Paper • Using the textual description of a change, try to understand why that change was performed (Adaptive, Corrective, or Perfective) • Observe difficulty, size, and interval across the different types of changes

  3. Three Different Types of Changes • Traditionally, the three types of changes are (Taken from ELEC 876 Slides):

  4. Three Types of Changes in This Paper • Adaptive: adding new features wanted by the customer (switched with Perfective) • Corrective: fixing faults • Perfective: restructuring code to accommodate future changes (switched with Adaptive) • The authors do not explain why they departed from the traditional definitions

  5. The Case Study Company • The paper does not divulge the company used for the case study, but it is a real business • Developer names and actions were kept anonymous in the study • This allowed the authors to study a real system that has run for many years and has a large (and old) version control history

  6. Structure of the ECMS • The company’s source code control system: ECMS (Extended Change Management System) • MRs vs. deltas • Each MR can contain multiple deltas • Delta: a record of each time a single file was “touched”
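
A minimal sketch of the MR-to-delta relationship described on this slide; the class and field names are hypothetical illustrations, not taken from ECMS itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Delta:
    """One 'touch' of a single file recorded by the version control system."""
    file_path: str
    lines_added: int
    lines_deleted: int

@dataclass
class ModificationRequest:
    """An MR groups the deltas made to complete one change, plus its textual abstract."""
    mr_id: str
    abstract: str                                # free-text description written by the developer
    deltas: List[Delta] = field(default_factory=list)

    def size(self) -> int:
        # Size of an MR measured as the number of deltas (files touched)
        return len(self.deltas)
```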

  7. The Test System • Called “System A” for anonymity purposes • Has: • 2M lines of source code • 3000 files • 100 modules • Over the last 10 years: • 33171 MRs • An average of 4 deltas each

  8. How they Classified Maintenance Activities (Adaptive, Corrective, Perfective) • Imagine you were given this project • You have: • the CVS repository and access to the descriptions attached to commits • the goal of labelling each commit as “Adaptive”, “Corrective”, or “Perfective” • What would you intuitively look for in the descriptions?

  9. How they Classified Maintenance Activities (Adaptive, Corrective, Perfective) • They had a 5 step process: • Cleanup and normalization • Word Frequency Analysis • Keyword Clustering and Classification • MR abstract classification • Repeat analysis from step 2 on unclassified MR abstracts

  10. Step 1: Cleanup and Normalization • Their approach used WordNet • A tool that strips prefixes and suffixes to recover the root word, e.g. “fixing” and “fixes” both reduce to the root “fix” • WordNet also offers a synonym feature, but it was not used: synonyms would be hard to correlate correctly with the context of software maintenance and could be misinterpreted
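
A small sketch of this kind of normalization using NLTK’s interface to WordNet (its `morphy` function reduces an inflected word to its base form); NLTK is my choice of tool here for illustration, not something the paper names.

```python
# pip install nltk, then run nltk.download('wordnet') once
from nltk.corpus import wordnet

def normalize(word: str) -> str:
    """Reduce an inflected word to its WordNet base form, e.g. 'fixing' -> 'fix'."""
    # Try the verb form first (most maintenance terms are verbs), then any part of speech
    base = wordnet.morphy(word.lower(), wordnet.VERB) or wordnet.morphy(word.lower())
    return base or word.lower()

print(normalize("fixing"), normalize("fixes"))  # both print 'fix'
```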

  11. Step 2: Word Frequency Analysis • Determine the frequency of a set of words in the descriptions (Histogram for each description) • What words in the English language would be “neutral” to these classifications and be noise in this experiment?
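
A sketch of the frequency count over MR abstracts, with a small hypothetical stop-word list standing in for the “neutral” words the slide asks about; `collections.Counter` is my choice of tooling, not the paper’s.

```python
import re
from collections import Counter

# Hypothetical stop words that carry no maintenance-type signal
NEUTRAL = {"the", "a", "an", "of", "to", "in", "for", "and", "on", "with"}

def word_frequencies(abstracts):
    """Count how often each non-neutral word appears across the MR abstracts."""
    counts = Counter()
    for text in abstracts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in NEUTRAL)
    return counts

freqs = word_frequencies(["fix null pointer bug in parser",
                          "add new export feature for customer"])
print(freqs.most_common(5))
```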

  12. Step 3: Keyword Clustering • Classification was validated by a human reading the descriptions of 20 randomly selected changes for each selected term in the set, e.g. “cleanup” indicating perfective maintenance • If a word matched its expected class in fewer than 75% of cases, it was deemed “neutral” • They found that “rework” was used heavily in “code inspection” changes, which became a new classification
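
A sketch of the 75% agreement check: for each candidate keyword, sample MRs containing it, compare the human label with the keyword’s assumed class, and drop the keyword as “neutral” if agreement falls below the threshold. The function and variable names are illustrative, not from the paper.

```python
import random

def keyword_precision(keyword, assumed_class, labelled_mrs, sample_size=20):
    """Estimate how often MRs containing `keyword` were human-labelled as `assumed_class`.

    labelled_mrs: list of (abstract, human_label) pairs that a person has already read.
    """
    matching = [(text, label) for text, label in labelled_mrs if keyword in text.lower()]
    if not matching:
        return 0.0
    sample = random.sample(matching, min(sample_size, len(matching)))
    hits = sum(1 for _, label in sample if label == assumed_class)
    return hits / len(sample)

# Keep the keyword only if it predicts its assumed class at least 75% of the time
labelled = [("code cleanup in module x", "perfective")]
if keyword_precision("cleanup", "perfective", labelled) < 0.75:
    print("'cleanup' is neutral")
else:
    print("'cleanup' indicates perfective maintenance")
```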

  13. Step 4: MR Classification Rules • These act like the “hard-coded” answers used when the learning step fails • If an inspection word is found, the MR is classified as inspection • If fix, bug, error, fixup, or fail is present, the change is corrective • If keywords of more than one type are present, the type with the dominating frequency wins
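
A loose sketch of such rules. Only the corrective and inspection keywords come from the slide; the adaptive and perfective lists below are placeholders, and the tie-breaking logic is my simplified reading of “dominating frequency wins”.

```python
import re
from collections import Counter

KEYWORDS = {
    "inspection": {"inspection", "rework"},
    "corrective": {"fix", "bug", "error", "fixup", "fail"},
    "adaptive":   {"add", "new", "feature"},        # placeholder list
    "perfective": {"cleanup", "restructure"},       # placeholder list
}

def classify(abstract: str) -> str:
    words = re.findall(r"[a-z]+", abstract.lower())
    hits = Counter({cls: sum(w in keys for w in words) for cls, keys in KEYWORDS.items()})
    if hits["inspection"]:
        return "inspection"                         # inspection keywords take priority
    best, count = hits.most_common(1)[0]
    return best if count else "unclassified"        # otherwise the dominating frequency wins

print(classify("fix null pointer bug"))             # corrective
print(classify("rework after code inspection"))     # inspection
```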

  14. Step 5: Cycle Back to Step 2 • Since Step 2 cannot cover the frequency of every word in the corpus at once, more terms are taken in each pass • Perform more “learning” and check whether newly frequent terms fit the existing classes • Use the static rules to resolve still-unclassified descriptions • When all else failed, the change was assumed to be a fix and classified as corrective
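
A loose sketch of this iterate-until-done loop. The helper functions (`classify`, `extend_keywords`) stand for the Step 2–4 machinery sketched above, and the fallback mirrors the slide’s “when all else fails, call it corrective”; none of the names come from the paper.

```python
def iterative_classification(mr_abstracts, classify, extend_keywords, max_rounds=5):
    """Repeat the analysis on still-unclassified MRs until nothing new is learned.

    classify(text) -> a label or "unclassified"
    extend_keywords(texts) -> True if new keywords were added to the rule set
    """
    labels = {}
    for _ in range(max_rounds):
        pending = [t for t in mr_abstracts if labels.get(t, "unclassified") == "unclassified"]
        for text in pending:
            labels[text] = classify(text)
        still_open = [t for t, lbl in labels.items() if lbl == "unclassified"]
        if not still_open or not extend_keywords(still_open):
            break
    # Fallback from the slide: whatever is still open is treated as corrective
    return {t: ("corrective" if lbl == "unclassified" else lbl) for t, lbl in labels.items()}
```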

  15. Case Study: Compare Against Human Classification • 20 participants, 150 MRs • More than 61% of the time, the tool and the human raters arrived at the same classification • Kappa and ANOVA were used to show the significance of the results
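
A sketch of measuring tool-versus-human agreement with Cohen’s kappa via scikit-learn; the labels below are made up and scikit-learn is simply one convenient way to compute the statistic, not the paper’s tooling.

```python
from sklearn.metrics import cohen_kappa_score

tool  = ["corrective", "adaptive", "perfective", "corrective", "inspection"]
human = ["corrective", "adaptive", "corrective", "corrective", "inspection"]

# Kappa corrects raw percent agreement for the agreement expected by chance
print(cohen_kappa_score(tool, human))
```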

  16. How Purposes Affect Size and Interval • Corrective and adaptive changes had the lowest change intervals • New code development and inspection changes added the most lines • Inspection changes deleted the most lines • Differences in the distribution functions are significant at the 0.01 level; ANOVA also indicated significance, but it is inappropriate due to the skewed distributions
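
The paper does not specify the exact test behind “significant at a 0.01 level”, so as a hedged illustration, here is the kind of nonparametric comparison one might use instead of ANOVA on skewed size data (a Mann-Whitney U test from SciPy, on made-up samples).

```python
from scipy.stats import mannwhitneyu

# Hypothetical lines-added samples for two change types (heavily skewed, so ANOVA is a poor fit)
corrective_sizes = [1, 2, 2, 3, 5, 8, 40]
new_code_sizes   = [10, 25, 60, 80, 120, 300, 900]

stat, p = mannwhitneyu(corrective_sizes, new_code_sizes, alternative="two-sided")
print(f"U={stat}, p={p:.4f}")   # significant at the 0.01 level if p < 0.01
```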

  17. Change Difficulty • 20 participants, 150 MRs • Goal: model the difficulty of each MR. Is the change classification a significant factor?

  18. Modeling Difficulty • Size was modeled as the number of deltas (files touched) • Difficulty varied with the number of deltas, except for corrective and perfective (SW/HW) changes • The length of the change interval was modeled as part of difficulty as well
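
A sketch of one way to relate reported difficulty to MR size (deltas) and interval, using an ordinary least-squares fit from statsmodels on made-up data; this is an illustration of the modeling idea, not a reproduction of the paper’s actual model.

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: difficulty rating vs. number of deltas and change interval (days)
deltas     = np.array([1, 2, 3, 4, 6, 8, 10, 15])
interval   = np.array([1, 3, 2, 7, 5, 14, 10, 30])
difficulty = np.array([1, 1, 2, 2, 3, 3, 4, 5])

X = sm.add_constant(np.column_stack([deltas, interval]))
model = sm.OLS(difficulty, X).fit()
print(model.params)   # intercept, effect of deltas, effect of interval
```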

  19. Likes and Dislikes of this Paper • Likes • The algorithm used to make classifications is a good way to break down the problem • The accumulation graphs were interesting • The use of a real company is a breath of fresh air: real data! • Dislikes • Developers were asked, months after the work, how hard the changes had been; there may be no better option, but memories fade and the results can be skewed over time • Because a real company was used, the required anonymity made the product comparison in the paper less interesting
