Measuring Domain Engineering Effects on Software Change Cost

Harvey Siy
Software Engineering Technology Transfer
Lucent Technologies
Bell Laboratories
hpsiy@lucent.com - Audris Mockus
Software Production Research Department
Lucent Technologies
Bell Laboratories
audris@mockus.org

Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Abstract:

Domain Engineering (DE) is an increasingly popular process for efficiently producing software. DE uses detailed knowledge of a particular application domain to define rigorously a family of software products within that domain. We describe a methodology for precise quantitative measurement of DE impact on software change effort. The methodology employs measures of small software changes to determine the effect of DE. We illustrate this approach in a detailed case study of DE in a telecommunications product. In the particular case the change effort was dramatically reduced. The methodology can precisely measure cost savings in change effort and is simple and inexpensive since it relies on information automatically collected by version control systems.

Software engineering productivity is notorious for being difficult to improve [4]. Domain Engineering (DE) is a promising new approach for improving productivity by simplifying coding tasks that are performed over and over again and by eliminating some tasks, such as design [16,5]. DE practitioners believe that it can improve productivity by a factor of between two and ten, although there has been no quantitative empirical support for such claims.

Quantifying the impact of a technology on software development is particularly important in making a case for transferring new technology to the mainstream development process. Rogers [14] cites observability of impact as a key factor in successful technology transfer. Observability usually implies that the impact of the new technology can be measured in some way. Most of the time, the usefulness of a new technology is demonstrated through best subjective judgment. This may not be persuasive enough to convince managers and developers to try the new technology.

In this paper we describe a simple-to-use methodology to measure the effects of one DE project on software change effort. The methodology is based on modeling measures of small software changes or Modification Requests (MRs). By modeling such small changes we can take into account primary factors affecting effort such as individual developer productivity and the purpose of a change. The measures of software changes come from existing change management systems and do not require additional effort to collect.

We apply this methodology to DE project at Lucent Technologies. We show that using DE increased productivity around four times.

Sections 2 and 3 describe DE in general and specifics of the particular case study. Sections 4 and 5 describe a general methodology to estimate the effects of DE and the analysis we performed on one DE project. Finally we conclude with a relevant work section and a summary.

Traditional software engineering deals with the design and development of individual software products. In practice, an organization often develops a set of similar products, called a product line. Traditional methods of design and development don't provide formalisms or methods for taking advantage of these similarities. As a result the developers practice some informal means of reusing designs, code and other artifacts, massaging the reused artifact to fit into new requirements. This can lead to software that is fragile and hard to maintain because the reused components were not meant for reuse.

There are many approaches to implementing systematic reuse, among them Domain Engineering. Domain Engineering approaches the problem by defining and facilitating the development of software product lines (or software families) rather than individual software products. This is accomplished by considering all of the products together as one set, analyzing their characteristics, and building an environment to support their production. In doing so, development of individual products (henceforth called Application Engineering) can be done rapidly at the cost of some up-front investment in analyzing the domain and creating the environment. At Lucent Technologies, Domain Engineering researchers have created a process around Domain Engineering called FAST (Family-oriented Abstraction, Specification and Translation) [6]. FAST is an iterative process of conducting Domain Engineering and Application Engineering, as shown in Figure 1.

**Figure 1:** FAST process. FAST is an iterative process of conducting Domain Engineering and Application Engineering.
$\begin{figure}\epsfig{file=fast.ps,width=3.5in}\protect\end{figure}$

Application Engineering is the process of producing members of the product line using the application engineering environment created during Domain Engineering. Feedback is then sent to the Domain Engineering team, which makes necessary adjustments to the domain analysis and environment.

Lucent Technologies' 5ESSTMswitch is used to connect local and long distance calls involving voice, data and video communications. To maintain subscriber and office information at a particular site, a database is maintained within the switch itself. Users of this database are telecommunications service providers, such as AT&T, which need to keep track of information, such as specific features phone subscribers buy. Access to the database is provided through a set of screen forms. This study focuses on a domain engineering effort conducted to reengineer the software and the process for developing these screen forms.

A set of customized screen forms is created for each service provider who purchases a 5ESS switch corresponding to the 5ESS features purchased by the service provider. When a service provider purchases new features, its screen forms have to be updated. Occasionally, a service provider may request a new database view to be added, resulting in a new screen form. Each of these tasks requires significant development effort.

In the old process, screen forms were customized during compile-time. This often means inserting #ifdef-like compiler directives into existing screen specification files. Forms have had as many as 30 variants. The resulting specification file is hard to maintain and modify because of the high density of compiler directives. In addition, several auxiliary files specifying entities such as screen items need to be updated.

The Asset Implementation Manager (AIM) project is an effort to automate much of this tedious and error-prone process. The FAST process was used to factor out the customer-specific code and create a new environment that uses a data-driven approach to customization. In the new process, screen customization is done at run-time, using a feature tag table showing which features are turned on for a particular service provider. A GUI system was implemented in place of hand-programming the screen specification files. In place of screen specification files, a small specification file using a domain specific language is stored for each individual screen item such as a data entry field. This new system also automatically updates any relevant auxiliary files.

We undertook to evaluate the impact of the AIM project on software change effort. Collecting effort data has traditionally been very problematic. Often, it is stored in financial databases that track project budget allocations. These data do not always accurately reflect the effort spent for several reasons. Budget allocations are based on estimates of the amount of work to be done. Some projects exceed their budgets while others underuse them. Rather than changing budget allocations, the tendency of management is to charge work on projects that have exceeded budgets to those that haven't.

We chose not to consider financial data, but rather to infer the key effort drivers based on the number and type of source code changes each developer makes. This has the advantage of being finer grain than project-level effort data. Analyzing change-level effort could yield trends that would be washed out at the project level because of aggregation. In addition, our approach requires no data to be collected from developers. We use existing information from the change management system such as change size, the developer who made the change, time of the change, and purpose of the change, to infer the effort to make a change.

The 5ESS source code is organized into subsystems with each subsystem further subdivided into a set of modules. Each module contains a number of source code files. The change history of the files are maintained using the Extended Change Management System (ECMS) [9], for initiating and tracking changes, and the Source Code Control System (SCCS) [13], for managing different versions of the files.

Each logically distinct change request is recorded as a Modification Request (MR) by the ECMS. Each MR is owned by a developer, who makes changes to the necessary files to implement the MR. The lines in each file that were added, deleted and changed are recorded as one or more ``deltas'' in SCCS. While it is possible to implement all MR changes restricted to one file by a single delta, but in practice developers often perform multiple delta on a singe file, especially for larger changes. For each delta, the time of change, the login of the developer who made the change, the number of lines added and deleted, the associated MR, and several other pieces of information are all recorded in the ECMS database. This delta information is then aggregated for each MR. A more detailed description of how to construct change measures is provided in [10].

We inferred the MR's purpose from the textual description that the developer recorded for the the MR [12]. In addition to the three primary reasons for changes (repairing faults, adding new functionality, and improving the structure of the code; [15]), we used a class for changes that implement code inspection suggestions, since this class was easy to separate from others and had distinct size and interval properties.

We also obtained a complete list of identifiers of MRs that were done using AIM technology, by taking advantage of the way AIM was implemented. In the 5ESS source, a special directory path was created to store all the new screen specification files created by AIM. We refer to that path as the AIM path. The source code to the previously used screen specification files also had a specific set of directory paths. We refer to those paths as pre-AIM paths. Based on those sets of paths we classified all MRs into three classes:

To understand the effects of AIM better, we need to know the effort associated with each change. Since the direct measurement of such effort would be impractical, we developed an algorithm to estimate it from the widely available change management data. We briefly describe the algorithm here. For more detail see [8]. Let us consider one developer. From the change data, we obtained the start and ending month of each of the developer's MRs. This helps us partially to create a table as in Table 1 (a typical MR takes a few days to complete, but to simplify the illustration we show only three MRs over five months).

**Table 1:** An example table of effort-per-MR-per-month breakdown for one developer. The ?'s represent blank cells, whose values are initially unknown.
	Jan	Feb	Mar	Apr	May	$\cdots$	Total
${\mbox{MR}}_1$	0	?	?	0	0	$\cdots$	?
${\mbox{MR}}_2$	?	?	0	0	0	$\cdots$	?
${\mbox{MR}}_3$	0	0	?	?	?	$\cdots$	?
$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$		$\vdots$
Total	1	1	1	1	1	$\cdots$	12

We assume that monthly developer effort to be one technical headcount month so we have 1's as the column sums in the bottom row. The only exception are months with no MR activity which have zero effort. We then proceed to iteratively fit values into the blank cells, initially dividing the column sums evenly over all blank cells. The row sums are then calculated from these initial cell values. A regression model of effort is fitted on the row sums. The cell values in each row are rescaled to sum up to the fitted values of the regression model. Then the cell values in each column are rescaled to sum up to the column sums. A new set of row sums is calculated from these adjusted cell values and the process of model fitting and cell value adjustments is repeated. This is done until convergence occurs. The code to perform the analysis is published in [11].

We outline here a general framework for analyzing domain engineering projects and describe how we applied this to the AIM project. The analysis framework consists of five main steps.

The basic characteristic measures of software changes include: identity of the person performing the change; the files, modules, and actual lines of code involved in the change; when the change was made; the size of the change measured both by the number of lines added or changed and the number of deltas; the complexity of the change measured by the number of files touched; and the purpose of the change including whether the purpose of the change was to fix a bug or to add new functionality. Many change management systems record data from which such measures can be collected using software described in [10].

In real life situations developers work on several projects over the course of a year and it is important to identify which changes they using the environment provided by DE. There are several ways to identify these changes. In our example the domain engineered features were implemented in a new set of code modules. In other examples we know of a new domain-specific language is introduced. In such a case the DE changes may be identified by looking at the language used in the changed file. In yet another DE example, a new library was created to facilitate code reuse. To identify DE changes in such a case we need to look at the function calls used in the modified code to determine if those calls involve the API of the new library.

Finally we need to identify changes that were done prior to DE. In our example the entire subsystem contained pre-DE code. Since the older type of development co-existed in parallel with the post-DE development, we could not use time as the factor to determine if the change was post-DE. In practice it might happen often that initially only some of the projects use a new methodology until it has been proven to work.

Inspection of change measures

To understand the differences between DE and pre-DE changes that might affect the effort models, we first inspect the change measures to compare interval, complexity, and size between DE and pre-DE changes. The part of the product the AIM project was targeting has a change history starting from 1986. We did not consider very old MRs (before 1993) in our analysis, to avoid other possible changes to the process or to the tools that might have happened more than six years ago.

The following table gives average measures for AIM and pre-AIM MRs. Most measures are log-transformed to make use of a t-test more appropriate. Although differences appear small, they are all significant at 0.05 (using two-sample t-test) because of the very large sample size (19450 pre-AIM MRs and 1677 AIM MRs).

**Table 2:** Measures of Changes
measure	units	pre-AIM	AIM
interval	$\log(\mbox{days})$	$\log(1.7)$	$\log(1.5)$
complexity	$\log(\mbox{\char93 files})$	$\log(1.5)$	$\log(2)$
size	$\log(\mbox{\char93 delta})$	$\log(2)$	$\log(2.2)$
size	$\log(\mbox{\char93 lines})$	$\log(13)$	$\log(10)$

As shown in Table 2, AIM MRs do not take any longer to complete than pre-AIM MRs. The change complexity, as measured in number of files touched by the MR, appears to have increased, but we note that instead of modifying specification files for screens, the developers are modifying smaller specification files for individual screen attributes. In addition, all changes are done through the GUI, which handles the updating of individual files, hence the changes. The table also shows that more deltas are needed to implement an MR. The increased number of deltas might be a result of the MRs touching more files. The numbers of lines are an artifact of the language used and since the language in the new system is different, they are not directly comparable here. The differences probably due to the fact that AIM measures are measuring the output of a tool, where in the pre-AIM measures are measuring the manual output. We use the number of delta as the MR size measure in the final model partly because the differences among groups are the smallest in this dimension. It is possible to adjust the size measure for the AIM (or pre-AIM) MRs when fitting the model, however we have not done so in the subsequent analysis.

First among variables that we recommend always including in the model is a developer effect. Other studies have found substantial variation in developer productivity [7]. Even when we have not found significant differences and even though we do not believe that estimated developer coefficients constitute a reliable method of rating developers, we have left developer coefficients in the model. The interpretation of estimated developer effects is problematic. Not only could differences appear because of differing developer abilities, but the seemingly less productive developer could have more extensive duties outside of writing code.

Naturally, the size and complexity of a change has a strong effect on the effort required to implement it. We have chosen the number of lines added, the number of files touched, and the number of deltas that were part of the MR as measures of the size and complexity of an MR.

We found that the purpose of the change (as estimated using the techniques of [12]) also has a strong effect on the effort required to make a change. In most of our studies, changes that fix bugs are more difficult than comparably sized additions of new code. The difficulty of changes classified as ``perfective'' varies across different parts of the code, while implementing suggestions from code inspections is easy.

Eliminating collinearity with predictors that might affect effort Since developer identity is the largest source of variability in software development (see, for example, [2,7]), we first select a subset of developers that had a substantial number of MRs both in AIM and elsewhere so that the results would not be biased by the developer effects. If we had some developers that made only AIM (or only pre-AIM) changes, the model could not distinguish between their productivity and the AIM effect.

To select developers of similar experience and ability, we chose developers that had completed between 150 and 1000 MRs on the considered product in their careers. We chose only developers who completed at least 15 AIM MRs. The resulting subset contained ten developers. Itemization of their MRs is given in Table 3.

Table 3: Breakdown of MRs.

Type of MR	Dev1	Dev2	Dev3	Dev4	Dev5	Dev6	Dev7	Dev8	Dev9	Dev10
other	109	118	92	121	312	174	152	408	93	197
pre-AIM	82	56	32	46	209	70	55	351	48	14
AIM	16	27	30	30	21	20	23	81	12	91

Models and Interpretation In the fourth step we are ready to fit the set of candidate models and interpret the results. Our candidates models are derived from the full model (specified next) by omitting sets of predictors. The full model that included measures that we expect to affect the change effort was:

$\displaystyle E(\mbox{effort})$	=	$\displaystyle \mbox{\char93 delta}^{\alpha_1} \times \mbox{\char93 files}^{\alpha_2} \times \mbox{\char93 lines}^{\alpha_3}$
	$\textstyle \times$	$\displaystyle \beta_{\mbox{BugFix}} \times \beta_{\mbox{Perfective}} \times \beta_{\mbox{Inspection}}$
	$\textstyle \times$	$\displaystyle \gamma_{\mbox{AIM}} \times \gamma_{\mbox{PreAIM}}$
	$\textstyle \times$	$\displaystyle \prod_i \delta_{\mbox{Developer}_i}$

**Table 4:** Full Model
	estimate	p-val	95% CI
$\alpha_1$	0.69	0.000	[0.43,.95]
$\alpha_2$	-0.04	0.75	[-0.3,0.22]
$\alpha_3$	-0.10	0.09	[-0.21,.01]
$\beta_{\mbox{BugFix}}$	1.9	0.003	[1.3,2.8]
$\beta_{\mbox{Perfective}}$	0.7	0.57	[0.2,2.4]
$\beta_{\mbox{Inspection}}$	0.6	0.13	[0.34,1.2]
$\gamma_{\mbox{AIM}}$	0.25	0.000	[0.16,0.37]
$\gamma_{\mbox{PreAIM}}$	1.03	0.85	[0.7,1.5]

The following MR measures were used as predictors: $\mbox{\char93 delta}$ -- the number of deltas, $\mbox{\char93 lines}$ -- the number of lines added, $\mbox{\char93 files}$ -- the number of files touched, indicator if the change is a bug fix, a perfective change, an inspection change, indicator if it was AIM or pre-AIM MR, and an indicator for each developer. Only the coefficients for number of deltas, indicator of the bug fix, and indicator of AIM were significant at 0.05 level.

The coefficients reflect how many times a particular factor affects the change effort in comparison with the base factors $\beta_{\mbox{New}}$ and $\gamma_{\mbox{other}}$ which are assumed to be 1 for reference purposes.

Inspecting the table, we see that the only measure of size that was important in predicting change effort was $\mbox{\char93 delta}$ (p-value of $\alpha_1$ is 0.000). Although we refer to it as a size measure it measures both size and complexity. For example, it is impossible to change two files with a single delta, so the number of delta has to be no less than the number of files. It also measures the size of the change. A large number of new lines is rarely incorporated without preliminary testing which would lead to additional deltas.

From the type measures we see that the bug fixes are almost twice as hard as new code ( $\beta_{\mbox{BugFix}} = 1.9$ ). Other types of MRs (perfective and inspection) are not significantly less difficult than new code changes (p-values of $\beta_{\mbox{Perfective}}$ and $\beta_{\mbox{Inspection}}$ are 0.57 and 0.13, respectively). This result is consistent with past work [8].

Finally, the model indicates that pre-AIM MRs are indistinguishable from other MRs (p-value of $\gamma_{\mbox{PreAIM}}$ is 0.85). This result was expected since there was no reason why pre-AIM MRs should be different from other MRs. More importantly, AIM MRs are significantly easier than other MRs ( $\gamma_{\mbox{AIM}} = 0.25$ , with p-value of 0.000). Consequently, AIM reduced effort per change by a factor of four.

Next we report the results of a reduced model containing only predictors found significant in the full model. The simple model was:

$\displaystyle E(\mbox{effort})$	=	$\displaystyle \char93 delta^\alpha \times \beta_{\mbox{BugFix}} \times \gamma_{\mbox{AIM}}$
		$\displaystyle \times \prod_i \delta_{\mbox{Developer}_i},$

Those results are almost identical to the results from the full model indicating the robustness of the particular set of predictors.

The models above provide us with the amount of effort spent at the fine change level. To estimate the effectiveness of the DE we must integrate those effort savings over all changes and convert effort savings to cost savings. Also, we need an assessment of the cost involved in creating a new language, training developers and other overhead associated with the implementation of the AIM project.

To obtain the cost savings first the total effort spent doing DE MRs is estimated. Then we use the cost saving coefficient from the fitted model to predict the hypothetical total effort for the same set of features as if the domain engineering has not taken place. ¹

The effort savings would then be the difference between the latter and the former. Finally, the effort savings are converted to cost and compared with additional expenses incurred by DE. The calculations that follow are intended to provide approximate bounds for the cost savings based on the assumption that the same functionality would have been implemented without the AIM.

Although we know that AIM MRs are four times easier, we need to ascertain if they implement functionality comparable to that of pre-AIM MRs. In the particular product, new functionality was implemented as software features. The features were the best definition of functionality available in the considered product. We assume that on average, all features implement similar amount of functionality. This is a reasonable assumption since we have a large number of features under both conditions and we do not have any reason to believe that the definition of a feature changed over the considered period. Consequently, even a substantial variation of functionality among features should not bias the results.

We determined the software features for each MR (MR implementing new functionality) using an in-house database. We had 1677 AIM MRs involved in implementation of 156 distinct software features and 21127 pre-AIM MRs involved in implementation of 1195 software features, giving 11 and 17 as the MR per feature ratio. Based on this analysis AIM MRs appear to implement 60 percent more functionality per MR than pre-AIM MRs.

Consequently the functionality in 1677 AIM MRs would approximately equal the functionality implemented by 2650 pre-AIM MRs. The effort spent on 1677 AIM MRs would approximately equal the effort spent on 420 hypothetical pre-AIM MRs if we use the estimated 75% savings in change cost obtained from the models above. This leaves total cost savings expressed as 2230 pre-AIM MRs.

To convert the effort savings from pre-AIM MRs to technical headcount years (THCY) we obtained the average productivity of all developers in terms of MRs per THCY. To obtain this measure we took a sample of relevant developers, i.e., those who performed AIM MRs. Then we obtained the total number of MRs each developer started in the period between January 1993 and June 1998. No AIM MRs were started in this period. To obtain the number of MRs per THCY, the total number of MRs obtained for each developer was divided by the interval (expressed in years) that developers worked on the product. This interval was approximated by the interval between the first and the last delta each developer did in the period between January 1993 and June 1998. The average of the resulting ratios was 36.5 pre-AIM MRs per THCY.

Using MR per THCY ratio the effort savings expressed as 2230 pre-AIM MRs are equal to approximately 61 technical headcount years. Hence the total savings in change effort would be between $6,000,000 and $9,000,000 in 1999 US dollars. This assumes technical headcount costs varying between $100K and $150K per year in the Information Technology industry.

To obtain expenses associated with the AIM domain engineering effort we used internal company memoranda summarizing expenses and benefits of the AIM project. The total expenses related to the AIM project were estimated to be 21 THCY. This shows that the first nine months of applying AIM saved around three times (61/21) more effort than was spent on implementing AIM itself.

We also compared our results with internal memoranda predicting effort savings predictions in AIM and several other DE projects performed on the 5ESSTMsoftware. Our results were in line with the approximate predictions given in the internal documents indicating that DE interval reduction to one third to one fourth of pre-DE levels.

This work is an application of the change effort estimation technique originally developed by Graves and Mockus [8]. This technique has been refined further in their more recent work [11]. The method was applied to evaluate the impact of a version editing tool in [1]. In this paper we focus on a more general problem of domain engineering impact, where the software changes before and after the impact often involve completely different languages and different programming environments.

This technique is very different in approach and purpose from traditional cost estimation techniques (such as COCOMO and Delphi [3]), which make use of algorithmic or experiential models to estimate project effort for purposes of estimating budget and staffing requirements. Our approach is to estimate effort after actual development work has been done, using data primarily from change management systems. We are able to estimate actual effort spent on a project, at least for those phases of development that leave records on the change management system. This is useful for calibrating traditional cost models for future project estimation. In addition, our approach is well-suited for quantifying the impact of introducing technology and process changes to existing development processes.

We present methodology to obtain cost savings from Domain Engineering exemplified by a case study of one project. We find that the change effort is reduced three to four times in the considered example.

The methodology is based on measures of software changes and is easily applicable to other software projects. We described in detail all steps of the methodology so that anyone interested can try it. The key steps are,

We expect that this methodology will lead to more widespread quantitative assessment of Domain Engineering and other software productivity improvement techniques. We believe that most software practitioners will save substantial effort of trials and usage of ineffective technology, once they have the ability to screen new technologies based on a quantitative evaluation of their use on other projects. Tool developers and other proponents of new (and existing) technology should be responsible to perform such quantitative evaluation. It will ultimately benefit software practitioners who will be able to evaluate appropriate productivity improvement techniques based on quantitative information.

Acknowledgements

We would like to thank Mark Ardis, Todd Graves, and David Weiss for their valuable comments on earlier drafts of this paper. We also thank Nelson Arnold, Xinchu Huang and Doug Stoneman for their patience in explaining the AIM project.

	estimate	p-val	95% CI
$\alpha$	0.51	0.000	[0.34,0.67]
$\beta_{\mbox{BugFix}}$	2.1	0.002	[1.3,3.2]
$\gamma_{\mbox{AIM}}$	0.27	0.000	[0.17,0.43]

Measuring Domain Engineering Effects on Software Change Cost

Abstract:

Inspection of change measures

Acknowledgements

Bibliography

Footnotes