Audris Mockus
Lawrence G. Votta
April 17, 2000
Copyright (c) 1993 by the Institute of Electrical and Electronics Engineers.
Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
The scripts used in the experiment are in experimentScripts.tar. The archive also includes notes and scrap, so please focus on the file testClass2.perl, which performs the classification.
The traditional approaches to understanding the software development process define specific questions, experiments to answer those questions, and the instrumentation needed to collect data (see, e.g., the GQM model [2]). While such an approach has advantages (e.g., in some cases it defines a controlled experiment), we believe that a less intrusive and more widely applicable approach is to obtain the fundamental characteristics of a process from the extensive data available in every software development project. To ensure that our methods could be easily applied to any such project, we used data from a version control system. Besides being widely available, the version control system provides a data source that is consistent over the duration of the project (unlike many other parts of the software development process). Our model of a minimal version control system (VCS) associates date, time, size, developer, and a textual description with each change.
Implicit in our approach is the assumption that we consider only software process properties observable or derivable from the common source -- the VCS. Because VCSs are not designed to answer questions about process properties (they are designed to support versioning and group development), there is a risk that they may contain minimal amounts of useful process information despite their large size. There may be important process properties that cannot be observed or derived from VCSs and require more specialized data sources.
The quantitative side of our approach focuses on finding main factors that contribute to the variability of observable quantities: size, interval, quality, and effort. Since those quantities are interdependent, we also derive relationships among them. We use developer surveys and apply our methods on different products to validate the findings.
This work exemplifies the approach by testing the hypothesis that the textual description field of a change is essential to understanding why the change was performed. We also hypothesize that difficulty, size, and interval vary across different types of changes.
To test our hypotheses, we analyzed the version control database of a large telecommunications software system (System A). We designed an algorithm to automatically classify changes according to maintenance activity based on the textual description field. We identified three primary reasons for change: adding new features (adaptive), fixing faults (corrective), and restructuring the code to accommodate future changes (perfective), consistent with previous studies, such as [20].
We discovered a high level of perfective activity in the system, which might explain why it has remained on the market for so long and remains the most reliable among comparable products. We also discovered that a number of changes could not be classified into one of the primary types. In particular, changes implementing the recommendations of code inspections were numerous and had both perfective and corrective aspects. The three primary types of changes (adaptive, corrective, perfective), as well as inspection rework, are easily identifiable from the textual description and have strikingly different size and interval.
To verify the classification, we conducted a survey in which we asked developers of System A to classify their own recent changes. The automatic classification was in line with developer opinions. We describe the methods and results used to obtain relationships between the type of a change and its size or interval.
We then applied the classifier to a different product (System B) and found that change size and interval vary much less between products than between types of changes. This indicates that size and interval might be used to identify the reason for a change. It also indicates that the classification method is applicable to other software products (i.e., it has external validity). We conclude by suggesting new ways to improve data collection in configuration management systems.
Section 2 describes the System A software product. Section 3 introduces the automatic classification algorithm and Section 4.1 describes the developer validation study. We illustrate some uses of the classification by investigating size, interval, and difficulty for different types of changes in Section 5. In Subsection 5.2 the classifier is applied to System B. Finally, we conclude with recommendations for new features of change control systems that would allow analysis of the changes and hence of the evolution of a software product.
Our database consisted of the version control and maintenance records from a multi-million line real-time software system that was developed over more than a decade. The source code is organized into subsystems, with each subsystem further subdivided into a set of modules. Each module contains a number of source code files. The change history of the files is maintained using the Extended Change Management System (ECMS) [14], for initiating and tracking changes, and the Source Code Control System (SCCS) [17], for managing different versions of the files. Our data contained the complete change history, including every modification made during the project, as well as many related statistics.
Each logically distinct change request is recorded as a Modification Request (MR) by the ECMS (see Figure 1). Each MR is owned by a developer, who makes changes to the necessary files to implement the MR. The lines in each file that were added, deleted, and changed are recorded as one or more ``deltas'' in SCCS. While it is possible to implement all MR changes restricted to one file in a single delta, in practice developers often perform multiple deltas on a single file, especially for larger changes. For each delta, the time of the change, the login of the developer who made it, the number of lines added and deleted, the associated MR, and several other pieces of information are all recorded in the ECMS database. This delta information is then aggregated for each MR. Each MR has associated English text describing the reasons for the change and the change itself. There is no protocol on how and what information is entered, but the text is sufficient for other developers to understand what changes were made and why. A detailed description of how to construct change measures is provided in [15].
In the analysis that follows we use the following measures of size: the number of deltas and the numbers of lines of code added, deleted, and unmodified by the change. To obtain these measures we simply count all deltas in a change and sum the last three measures over all deltas in the change (each SCCS delta records the numbers of lines of code added, deleted, and unmodified). We measure the interval of a change by the time lag between the first and the last delta in the change.
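To make these measures concrete, the following minimal sketch (in Python, with illustrative field names rather than the actual ECMS/SCCS schema) aggregates a list of delta records into the MR-level size and interval measures described above.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Delta:
    """One SCCS delta; field names are illustrative, not the real schema."""
    mr_id: str
    developer: str
    date: datetime
    lines_added: int
    lines_deleted: int
    lines_unmodified: int

def mr_measures(deltas):
    """Aggregate the deltas of a single MR into size and interval measures."""
    deltas = sorted(deltas, key=lambda d: d.date)
    return {
        "n_delta": len(deltas),
        "lines_added": sum(d.lines_added for d in deltas),
        "lines_deleted": sum(d.lines_deleted for d in deltas),
        "lines_unmodified": sum(d.lines_unmodified for d in deltas),
        # Interval: time lag between the first and the last delta of the MR.
        "interval_days": (deltas[-1].date - deltas[0].date).days,
    }
```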
We selected a subsystem (System A) for our analysis. The subsystem contains approximately 2M source lines, 3000 files, and 100 modules. Over the last decade it had 33171 MRs, each having an average of 4 deltas. Although it is part of a larger system, the subsystem functionality is sold as a separate product to customers.
The classification proceeds in five steps:
The described decision rule is designed to reject, at the 0.05 level, the null hypothesis that half or fewer of the MR abstracts containing the term belong to the type assigned to the keyword.
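As an illustration only (the exact statistical procedure is not reproduced here), such a decision rule can be implemented as a one-sided binomial test: for a candidate keyword, count how many of the sampled MR abstracts containing it belong to the proposed type, and accept the keyword only if the null hypothesis of at most half can be rejected at the 0.05 level. The function and the example counts below are hypothetical.

```python
from scipy.stats import binomtest

def keyword_accepted(n_matching_type: int, n_containing_term: int,
                     alpha: float = 0.05) -> bool:
    """Accept a keyword for a type if we can reject, at level alpha, the
    null hypothesis that at most half of the abstracts containing the
    term belong to that type (one-sided binomial test)."""
    result = binomtest(n_matching_type, n_containing_term,
                       p=0.5, alternative="greater")
    return result.pvalue < alpha

# Hypothetical example: 14 of 16 sampled abstracts containing a candidate
# keyword were judged corrective, so the keyword would be kept.
print(keyword_accepted(14, 16))  # True
```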
As a result of this activity, we discovered that the term rework is frequently used in conjunction with code inspection. The development process in this environment requires formal code inspections for any change in excess of 50 lines of source code. A code inspection is performed by a team of experts who review the code and recommend changes [7,8]. Typically, those changes are then implemented by a developer in a separate MR. The purpose of such changes is both corrective and perfective, reflecting errors found and minor code improvements recommended in a typical code inspection. Since code inspection changes are an essential part of new code development and contain a mixture of purposes, we chose to place them in a separate class so that we could better discern the patterns of changes that have a single purpose. As it turned out, the developer perception of change difficulty and the size of code inspection changes were distinct from other types of changes.
After keyword classification, we looked at keywords and designed simple rules to resolve some of the conflicts when keywords of several types are present in one abstract. For example, the presence of code inspection terms would assign an abstract to the inspection category, independent of the presence of other terms like new, or fix. The rules were obtained based on our knowledge of the change process (to interpret the meaning of the keyword) and the knowledge obtained from classifying the keywords.
The following examples illustrate the three rules (the actual module, function, and process names have been replaced by three letters). The abstract ``Code inspection fixes for module XXX'' will be classified as inspection because of the keyword inspection. The abstract ``Fix multiple YYY problem for new ZZZ process'' will be classified as corrective because of the keyword fix. The abstract ``Adding new function to cleanup LLL'' will be classified as adaptive because there are two adaptive keywords, add and new, and only one perfective keyword, cleanup.
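The classification itself was performed by a Perl script (testClass2.perl, included in experimentScripts.tar). The sketch below is only a simplified Python approximation of the kind of precedence and keyword-count rules illustrated above; the keyword lists and the tie-breaking rule are assumptions for illustration, not the lists and rules used in the study.

```python
import re

# Illustrative keyword lists only; the study derived and validated its
# lists empirically using the decision rule described earlier.
KEYWORDS = {
    "inspection": {"inspection", "rework"},
    "corrective": {"fix", "bug", "error", "fail"},
    "adaptive":   {"add", "adding", "new", "feature"},
    "perfective": {"cleanup", "restructure", "remove"},
}

def classify_abstract(abstract: str) -> str:
    """Classify one MR abstract; returns 'unclassified' if no keyword matches."""
    words = re.findall(r"[a-z]+", abstract.lower())
    counts = {cls: sum(w in keys for w in words)
              for cls, keys in KEYWORDS.items()}
    # Rule: inspection terms dominate regardless of other keywords.
    if counts["inspection"]:
        return "inspection"
    if not any(counts.values()):
        return "unclassified"
    # Otherwise pick the type with the most keyword matches, breaking ties
    # in favour of corrective (an assumption, not the published rule set).
    order = ["corrective", "adaptive", "perfective"]
    return max(order, key=lambda c: counts[c])

print(classify_abstract("Code inspection fixes for module XXX"))          # inspection
print(classify_abstract("Fix multiple YYY problem for new ZZZ process"))  # corrective
print(classify_abstract("Adding new function to cleanup LLL"))            # adaptive
```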
The MR abstracts to which none of the rules applied were subjected to classification step 2 (word frequency analysis) and then to step 3. There were 33171 MRs, of which 56 percent were classified (one of the rules applied) in the first round and another 32 percent in the second round, leaving 12 percent unclassified after the second round. As we later found from the developer survey, the unclassified MRs were mostly corrective. One possible reason is that adaptive, perfective, and inspection changes need more explanation, while corrective activity is mostly implied and the use of corrective keywords is not considered necessary to identify the change as a fault fix. The resulting classification is presented in Table 1. It is worth noting that the unclassified MRs account for proportionally fewer deltas than the classified MRs (12% of MRs were unclassified, but less than 10% of deltas, added, deleted, or unmodified lines).
Table 1: Distribution of changes by type for each measure (the last column gives the total for that measure).

|                 | Corrective | Adaptive | Perfective | Inspection | Unclassified | Total     |
| MR              | 33.8%      | 45.0%    | 3.7%       | 5.3%       | 12.0%        | 33171     |
| delta           | 22.6%      | 55.2%    | 4.3%       | 8.5%       | 9.4%         | 129653    |
| lines added     | 18.0%      | 63.2%    | 3.5%       | 5.4%       | 9.8%         | 2707830   |
| lines deleted   | 18.0%      | 55.7%    | 5.8%       | 10.8%      | 9.6%         | 940321    |
| lines unchanged | 27.2%      | 48.3%    | 4.5%       | 10.3%      | 9.6%         | 328368903 |
A number of change properties are apparent or can be derived from this table.
The subsystem management then selected 8 developers from the list of candidates. The management chose them because they were not working on projects with tight deadlines at the time of the survey. We called the developers to introduce the goals, the format, and the estimated amount of developer time (less than 30 minutes) required for the survey, and asked for their commitment. Only one developer could not participate.
After obtaining developer commitment we sent the description of the survey and the respondents' ``bill of rights'':
The researchers guarantee that all data collected will be only reported in statistical summaries or in a blind format where no individual can be identified. If any participants at any time feel that their participation in this study might have negative effects on their performance, they may withdraw with a full guarantee of anonymity.
None of the developers withdrew from the survey.
The survey forms are described in the Appendix.
All of the developers surveyed had completed many more than 30 MRs in the past two years, so we sampled a subset of their MRs, limiting the number to 10 in the first stage and 30 in the second stage.
In the first stage we sampled uniformly from each type of MR. The results of the survey (see tables below) indicated almost perfect correspondence between developer and automatic classification. The MRs classified as other by the developer were typical perfective MRs, as was indicated in the response comment field and in the subsequent interview. We discovered that perfective changes might be classified as either corrective or adaptive, while all four inspection changes were classified as adaptive.
To get a full picture, in the second stage we also sampled from the unclassified MRs and from the perfective and inspection classes. To obtain better discrimination of the perfective and inspection activity, we sampled with higher probability from the perfective and inspection classes than from the other classes. Otherwise we might have ended up with one or no MRs per developer in these two classes.
The survey indicates that the automatic classification is much more likely to leave corrective changes unclassified. Hence, in the results that follow, we treat all unclassified changes as corrective. This can be considered the last rule of the automatic classification in Section 3.4.
The overall comparison of developer and automatic classification is in Table 4.
We discussed the two MRs in the row ``Other'' with the developers. The developers indicated that both represented perfective activity; however, we excluded these two MRs from further analysis.
More than 61% of the time, both the developer and the program doing the automatic classification put changes in the same class. A widely accepted way to evaluate the agreement of two classifications is Cohen's Kappa ($\kappa$) [4], which can be calculated using a statistics package such as SPSS. The Kappa coefficient for Table 4 is above 0.5, indicating moderate agreement [6].
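The coefficient itself is straightforward to compute from the agreement table. The short sketch below is a generic reimplementation, not the statistics package used in the study, and the example counts are made up rather than taken from Table 4.

```python
import numpy as np

def cohen_kappa(table: np.ndarray) -> float:
    """Cohen's kappa from a square agreement (confusion) table of counts."""
    table = table.astype(float)
    n = table.sum()
    p_observed = np.trace(table) / n                 # observed agreement
    p_expected = (table.sum(axis=0) * table.sum(axis=1)).sum() / n**2  # chance agreement
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical 4x4 agreement table (rows: automatic, columns: developer).
example = np.array([[20,  3, 1, 0],
                    [ 4, 25, 2, 1],
                    [ 2,  3, 6, 0],
                    [ 0,  1, 0, 8]])
print(round(cohen_kappa(example), 2))
```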
To investigate the structure of the agreement between the automatic and developer classifications we fitted a log-linear model (see [13]) to the counts of the two-way comparison table. The factors included margins of the table as well as coincidence of both categories.
Let $m_{ij}$ be the counts in the comparison table, i.e., $m_{ij}$ is the number of MRs placed in category $i$ by the automatic classification and in category $j$ by the developer classification. We modeled $m_{ij}$ as Poisson distributed with mean
$$\exp\bigl(C + \alpha_i + \beta_j + \gamma\, I(i=j) + \gamma_{insp}\, I(i=j=\mathrm{inspection})\bigr),$$
where $C$ is the adjustment for the total number of observations; $\alpha_i$ adjusts for the automatic classification margins ($\sum_i \alpha_i = 0$); $\beta_j$ adjusts for the developer classification margins ($\sum_j \beta_j = 0$); $I(\cdot)$ is the indicator function; $\gamma$ represents the interaction (agreement) between the classifications; $\gamma_{insp}$ captures additional agreement specific to the inspection class; and the indexes $i, j$ range over the adaptive, corrective, perfective, and inspection classes.
In Table 5 we compare the full model to simpler models. The low residual deviance (RD) of the second model indicates that the model explains the data well. The difference between the deviances of the second and the third models indicates that the extra factor $\gamma_{insp}\, I(i=j=\mathrm{inspection})$ (which increases the degrees of freedom (DF) by 1) is needed to explain the observed data. The ANOVA table for the second model (Table 6) illustrates the relative importance of the different factors.
The fact that the coefficient $\gamma$ is significantly larger than zero shows that there is significant agreement between the automatic and developer classifications. The fact that the coefficient $\gamma_{insp}$ is significantly larger than zero shows that inspection changes are easier to identify than other changes. This is not surprising: for the less frequent types of changes, developers feel the need to identify the purpose, while for the more frequent types the purpose might be implied and only a more detailed technical description of the change is provided.
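A minimal sketch of how such a model could be fitted today with statsmodels, assuming the comparison table has been flattened into one row per (automatic, developer) cell; the column names, the placeholder counts, and the choice of statsmodels (rather than the package used in the original analysis) are all assumptions for illustration.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

classes = ["adaptive", "corrective", "perfective", "inspection"]
rows = []
for i, auto in enumerate(classes):
    for j, dev in enumerate(classes):
        rows.append({
            "auto": auto, "dev": dev,
            # Placeholder counts, not the data from Table 4.
            "count": (40 if auto == "inspection" else 25) if i == j else 3,
            "agree": int(i == j),                                # I(i = j)
            "agree_insp": int(i == j and auto == "inspection"),  # I(i = j = inspection)
        })
cells = pd.DataFrame(rows)

# Full model: margins + overall agreement + extra agreement for inspection.
full = smf.glm("count ~ C(auto) + C(dev) + agree + agree_insp",
               data=cells, family=sm.families.Poisson()).fit()
# Simpler model without the inspection-specific agreement term.
reduced = smf.glm("count ~ C(auto) + C(dev) + agree",
                  data=cells, family=sm.families.Poisson()).fit()

# The drop in residual deviance for one extra degree of freedom plays the
# role of the model comparison discussed in the text.
print(full.deviance, reduced.deviance)
```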
This section exemplifies some of the possibilities provided by the classification. After validating the classification using the survey, we proceeded to study the size and interval properties of different types of changes. We then validated these properties by applying the classifier to a different software product.
The change interval is important to track the time it takes to resolve problems (especially since we determined which changes are corrective), while change size is strongly related to effort, see, e.g. [1].
Figure 2 compares the empirical distribution functions of change size (numbers of added and deleted lines) and change interval for different types of changes. Skewed distributions, large variances, and integer values make more traditional summaries, such as boxplots and probability density plots, less effective. Because of the large sample size, the empirical distribution functions had small variance and could be reliably used to compare different types of maintenance activities.
The empirical distribution functions in Figure 2 are interpreted as follows: the vertical scale defines the observed probability that the value of a quantity is less than the value of the corresponding point on the curve as indicated on the horizontal axis. In particular, the curves to the right or below other curves indicate larger quantities, while curves above or to the left indicate smaller quantities.
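As an illustration of this kind of plot (not a reproduction of Figure 2), empirical distribution functions of a size measure can be drawn per change type as sketched below; the DataFrame columns `type` and `lines_added` are assumed names.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ecdf(values, label):
    """Step plot of the empirical distribution function of one sample."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    plt.step(x, y, where="post", label=label)

def compare_types(mrs, measure="lines_added"):
    """One ECDF curve per change type; `mrs` is a DataFrame with a `type` column."""
    for change_type, group in mrs.groupby("type"):
        plot_ecdf(group[measure], change_type)
    plt.xscale("log")   # the size measures are highly skewed
    plt.xlabel(measure)
    plt.ylabel("P(value < x)")
    plt.legend()
    plt.show()
```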
The interval comparison shows that corrective changes have the shortest intervals, followed by perfective changes. The distribution functions for inspection and adaptive changes intersect at the 5-day interval and 65th percentile. This shows that the most time-consuming 35 percent of adaptive changes took much longer to complete than the most time-consuming 35 percent of inspection changes. On the other hand, the least time-consuming 60 percent of inspection changes took longer to complete than the corresponding portion of adaptive changes. This is not surprising, since formal inspection is usually done only for changes that add more than 50 lines of code. Even the smallest inspections deal with relatively large and complex changes, so implementing the inspection recommendations is rarely a trivial task.
As expected, new code development and inspection changes add most lines, followed by perfective, and then corrective activities. The inspection activities delete much more code than does new code development, which in turn deletes somewhat more than corrective and perfective activities.
All of those conclusions are intuitive and indicate that the classification algorithm did a good job of assigning each change to the correct type of maintenance activity.
All the differences between the distribution functions are significant at the 0.01 level using either the Kruskal-Wallis test or the Smirnov test (see [11]). Traditional ANOVA also showed significant differences, but we believe it is inappropriate because of the undue influence of extreme outliers in highly skewed distributions that we observed. Figure 3 shows that even the logarithm of the number of deleted lines has a highly skewed distribution.
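A minimal sketch of these nonparametric tests with scipy, assuming the measure of interest has already been grouped into one array per change type (the two-sample Kolmogorov-Smirnov test stands in for the pairwise Smirnov comparisons):

```python
from itertools import combinations
from scipy.stats import kruskal, ks_2samp

def compare_groups(samples_by_type, alpha=0.01):
    """samples_by_type: dict mapping change type -> array of one measure."""
    # Overall test that at least one type differs in location.
    h_stat, p_value = kruskal(*samples_by_type.values())
    print(f"Kruskal-Wallis: H={h_stat:.1f}, p={p_value:.3g}")
    # Pairwise two-sample Kolmogorov-Smirnov (Smirnov) tests.
    for a, b in combinations(samples_by_type, 2):
        stat, p = ks_2samp(samples_by_type[a], samples_by_type[b])
        verdict = "significant" if p < alpha else "not significant"
        print(f"{a} vs {b}: D={stat:.2f}, p={p:.3g} ({verdict})")
```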
This section compares the size and interval profiles of changes in different products in order to validate the classification algorithm on another software product. We applied the automatic classification algorithm described in Section 3 to a different software product (System B) developed in the same company.
Although System B was developed by different people and in a different organization, both systems have the same type of version control databases and both systems may be packaged as parts of a much larger telecommunications product. We used the keywords obtained in the classification of System A, so there was no manual input to the automatic classification algorithm. System B is slightly bigger and slightly older than System A and implements different functionality.
Figure 4 examines whether the types of changes differ between the two products in terms of the empirical distribution functions of change size (numbers of added and deleted lines) and change interval.
The plots indicate that the differences between products are much smaller than the differences between types of changes. This suggests that the size and interval characteristics can be used as a signature of change purpose across different software products. However, there are certain differences between the two systems:
This section illustrates a different application of the change classification: a model relating the difficulty of a change to its type. In the survey (see Section 4.1) developers matched purpose with perceived difficulty for 170 changes. To check the relationship between type and difficulty, we fitted a log-linear model to the count data in a two-way table: type of change (corrective, adaptive, perfective, or inspection) versus difficulty of the change (easy, medium, or hard). Table 7 shows that corrective changes are most likely to be rated hard, followed by perfective changes. Most inspection changes are rated as easy.
In the next step we fitted a linear model to find the relationship between difficulty and other properties of the change. Since the difficulty might have been perceived differently by different developers, we included a developer factor among the predictors. To deal with outliers in the interval (the three longest MRs took 112, 91, and 38 days to complete), we used a logarithmic transformation of the intervals.
We started with the full model:
Using stepwise regression we arrived at a smaller model:
Because the numbers of deltas, numbers of added or deleted lines, and numbers of files touched were strongly correlated with each other, any of those change measures could be used as a change size predictor in the model. We chose to use the number of deltas because it is related to the number of files touched (you need at least one delta for each file touched) and to the number of lines (many lines are usually added over several days, often resulting in multiple check-ins). As expected, the difficulty increased with the number of deltas, except for corrective or perfective changes, which may be small but are still very hard. Not surprisingly, developers had different subjective scales of difficulty. Table 8 gives an analysis of variance (ANOVA) for the full model and Table 9 gives the ANOVA for the model selected by stepwise regression. Since the $R^2$ values are so similar, the second model is preferable because it is simpler, having three fewer parameters. We see that the three obvious explanatory variables are size, corrective maintenance, and developer identity. The other two explanatory variables (interval and perfective type), although present in the final model, are not as strong because their effect is not clearly different from zero. This may appear surprising, because interval seems like an obvious indicator of difficulty. However, this is in line with other studies in which the change interval (in addition to size) does not appear to help predict change effort [1,10,19]. One possible explanation is that size might account for the difficult adaptive changes, while corrective changes have to be completed in a short time, no matter how difficult they might be.
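A rough sketch of this kind of model fit with statsmodels, assuming the survey responses have been coded numerically (easy = 1, medium = 2, hard = 3) in a DataFrame with the columns named below; the formulas shown only approximate the full and reduced models, which are not reproduced in this text.

```python
import numpy as np
import statsmodels.formula.api as smf

def fit_difficulty_models(survey):
    """`survey` is assumed to have columns: difficulty (1-3), n_delta,
    lines_added, lines_deleted, interval_days, type, and developer."""
    survey = survey.copy()
    # Logarithmic transformation of the interval (log1p handles zero-day MRs).
    survey["log_interval"] = np.log1p(survey["interval_days"])
    full = smf.ols("difficulty ~ n_delta + lines_added + lines_deleted"
                   " + log_interval + C(type) + C(developer)",
                   data=survey).fit()
    # A reduced model of the kind stepwise selection might retain: the
    # correlated size measures are dropped in favour of the delta count.
    reduced = smf.ols("difficulty ~ n_delta + log_interval"
                      " + C(type) + C(developer)", data=survey).fit()
    print(full.rsquared, reduced.rsquared)
    return full, reduced
```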
We studied a large legacy system to test the hypothesis that historic version control data can be used to determine the purpose of software changes. The study focused on the changes, rather than the source code. To make results applicable to any software product, we assume a model of a minimal VCS so that any real VCS would contain a superset of considered data.
Since the change purpose is not always recorded, and when it is recorded the value is often not reliable, we designed a tool to automatically extract the purpose of a change from its textual description (the same algorithm can also extract other features). To verify the validity of the classification, we used developer surveys and also applied the classification to a different product.
We discovered four identifiable types of changes: adding new functionality, repairing faults, restructuring the code to accommodate future changes, and code inspection rework changes, which represent a mixture of corrective and perfective changes. Each has a distinct size and interval profile. The interval for adaptive changes is the longest, followed by inspection changes, with corrective changes having the shortest interval.
We discovered a strong relationship between the difficulty of a change and its type: corrective changes tend to be the most difficult, while adaptive changes are difficult only if they are large. Inspection changes are perceived as the easiest. Since we were working with large non-Gaussian samples, we used non-parametric statistical methods. The best way to understand size profiles was to compare empirical distribution functions.
In summary, we were able to use data available in a version control system to discover significant quantitative and qualitative information about various aspects of the software development process. To do that we introduced an automatic method of classifying software changes based on their textual descriptions. The resulting classification showed a number of strong relationships between size and type of maintenance activity and the time required to make the change.
Our summaries of the version control database can be easily replicated on other software development projects since we use only the basic information available from any version control database: time of change, numbers of added, deleted, and unchanged lines, and textual description of the change.
We believe that software change measurement tools should be built directly into the version control system to summarize fundamental patterns of changes in the database.
We see this work as an infrastructure to answer a number of questions related to effort, interval, and quality of the software. It has been used in work on code fault potential [9] and decay [5]. However, we see a number of other important applications. One of the questions we intend to answer is how perfective maintenance reduces future effort in adaptive and corrective activity.
The textual description field proved to be essential to identify the reason for a change, and we suspect that other properties of the change could be identified using the same field. We therefore recommend that a high quality textual abstract should always be provided, especially since we cannot anticipate what questions may be asked in the future.
Although the purpose of a change could be recorded as an additional field, there are at least three important reasons why using the textual description is preferable:
Listed below are 10 MRs that you have worked on during the last two years. We ask you to please classify them according to whether they were (1) new feature development, (2) software fault or "bug" fix, (3) other. You will also be asked to rate the difficulty of carrying out the MR in terms of effort and time relative to your experience, and to record a reason for your answer if one occurs to you. For each MR, please mark one of the types (N = new, B = bug, O = other), and one of the levels of difficulty (E = easy, M = medium, H = hard). You may add a comment at the end if the type is O or if you feel it is necessary.
The second stage survey form began with the following introduction.
Listed below are 30 MRs that you have worked on during the last two years. We ask you to please classify them according to whether they were (1) new feature development, (2) software fault or "bug" fix, (3) the result of a code inspection, (4) code improvement, restructuring, or cleanup, (5) other. You will also be asked to rate the difficulty of carrying out the MR in terms of effort and time relative to your experience, and to record a reason for your answer if one occurs to you. For each MR, please mark one of the type options (N = new, B = bug, I = inspection, C = cleanup, O = other), and one of the levels of difficulty (E = easy, M = medium, H = hard). You may add a comment at the end if the type is O or if you feel it is necessary.