
An Exercise in Estimating Refactoring Time

Łukasz Bownik


17 Feb 2021


How to estimate refactoring time

This tip describes an exercise in finding a formula that estimates the time required to refactor code from the number of lines of the original, pre-refactored code.

Introduction

This tip describes an exercise in finding a formula that estimates the time required to refactor code from the number of lines of the original, pre-refactored code. The result is very rough and subjective, because very little data has been gathered, and it should not be treated as definitive.

Studying the Classics

Fred Brooks, in “The Mythical Man-Month”, ponders how much effort it takes to develop a computer program. He writes that “numbers, although not for strictly comparable problems, suggest that effort goes as a power of size even when no communication is involved except that of a [single] man with his [own] memories. Results reported from a study done by Nanus and Farr at System Development Corporation […] show an exponent of 1.5; that is:

effort = (constant) * (number of instructions)^1.5”.

Since effort is roughly equivalent to the time a programmer spends developing the program, the formula for a single developer can equivalently be expressed as:

development time = (constant) * (number of instructions)^1.5.
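As a small illustration of what an exponent of 1.5 implies, the sketch below computes how much the development time grows when the program grows by a given factor. The function name `relative_effort` is made up for this example; the constant cancels out because only ratios are compared.

```python
def relative_effort(size_ratio: float, exponent: float = 1.5) -> float:
    """Factor by which effort grows when program size grows by size_ratio.

    The (unknown) constant cancels out, so only the exponent matters.
    """
    return size_ratio ** exponent


# Compare a few growth factors under the exponent of 1.5 quoted above.
for ratio in (2, 4, 10):
    print(f"{ratio}x the instructions -> {relative_effort(ratio):.2f}x the effort")
```

Under this assumption, a program four times as large takes about eight times as long to develop, not four.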

Development vs Refactoring

The question is as follows: “Is the time of refactoring performed by a single developer governed by a similar formula?” If the answer is “yes”, what is the approximate value of the “constant”?

The time required to refactor a codebase comprising n lines of code must be bounded from below by the function time = constant * n and from above by the function time = constant * n². The linear function implies that a change to a single line of code does not require a change to any other line of code. This would require each line of code to be totally independent of every other line, which rules out custom function calls and the use of variables. This never happens in practice. The quadratic function, on the other hand, implies that a change to a single line of code requires changes to all other lines of code. Such a situation also never happens in practice. The function must therefore take the form time = constant * n^x, where 1 < x < 2. In other words, the real function must lie above the green (linear) line and below the red (quadratic) line shown in Figure 1, which means that it may match the formula governing development time.

Figure 1. Refactoring time boundaries.
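The boundaries can also be checked numerically. The sketch below is only an illustration: the constant of 1.0 and the sample sizes are arbitrary assumptions, and `refactoring_time` is a hypothetical helper, not a measured model.

```python
def refactoring_time(n_lines: float, constant: float, exponent: float) -> float:
    """Evaluate time = constant * n^exponent for a codebase of n_lines lines."""
    return constant * n_lines ** exponent


C = 1.0  # arbitrary constant, chosen only to compare the shapes of the curves

for n in (10, 100, 1000):
    lower = refactoring_time(n, C, 1.0)   # linear lower bound
    middle = refactoring_time(n, C, 1.5)  # a candidate power curve, 1 < x < 2
    upper = refactoring_time(n, C, 2.0)   # quadratic upper bound
    assert lower < middle < upper
    print(f"n={n}: linear={lower:.0f}, n^1.5={middle:.0f}, quadratic={upper:.0f}")
```

Note that this ordering holds for n greater than 1 with the same constant on all three curves. If the size is expressed in KLOC and is smaller than 1, the powers are ordered the other way round.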

The Three Numbers

Finding the power curve function requires at least three data points. To obtain these data, I refactored three programs (consisting of pure spaghetti code) that were available to me. The table below characterizes these programs.

Program description	Size in KLOC (calculated with cloc)	Refactored into patterns	Refactoring time in hours
A web frontend written in PHP. Reads data from a MySQL database (populated by a background service) and presents it to the user. Contains an action button to delete a record from the database.	0.3	Data access object, Page controller, Transforming view, Strategy	5
A background service written in Java. Retrieves data from Monit, transforms it and publishes it through ZeroMQ.	0.75	Gateway, Adapter, Controller, Inversion of control	28
A background service written in Java. Reads data from an SQLite database using a received geographic location (geographic data from http://www.naturalearthdata.com) and publishes it through ZeroMQ.	1.8	Gateway, Data access object, Adapter, Controller, Inversion of control	88

All refactoring kept the programming language of the original program. The refactoring followed the method described in the book "Working Effectively with Legacy Code" by Michael Feathers (which I highly recommend), which can be summarized as:

Refactor the legacy code just enough to put it under a test harness (isolate concurrency, isolate networking, parametrize configuration and database location)
Capture the existing behavior of the code with tests (use a code coverage tool for help)
Refactor the code under tests (and the tests if code interfaces change)
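The second step can be sketched as a characterization test. The example below is a hypothetical illustration in Python (the refactored programs themselves were written in PHP and Java); `legacy_format_record` is a made-up stand-in for untested legacy code, and the expected strings are simply what the code returns today, quirks included.

```python
def legacy_format_record(name, value):
    # Hypothetical stand-in for legacy code that has no tests yet.
    return name.upper() + ":" + str(value) + ";"


def test_characterize_format_record():
    # The expected values are not a specification. They were copied from the
    # current output, so any behavior change during refactoring is detected.
    assert legacy_format_record("cpu", 7) == "CPU:7;"
    # Odd inputs are pinned down as well, even if the result looks wrong.
    assert legacy_format_record("", None) == ":None;"


test_characterize_format_record()
```

Once such tests pass against the untouched code, they act as a safety net for the third step.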

In order to achieve the first step, I used the following techniques described in the book:

Exposure of public constructors + dependency injection
Interface extraction
Reference encapsulation (accessing fields through protected methods)
Parametrization through constructor
Parametrization through public fields
Sub-classing and overriding
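A few of these techniques can be sketched together. The example below is a hypothetical Python illustration, not code from the refactored programs: `Publisher`, `Service`, `FakePublisher` and `TestableService` are invented names. It combines parametrization through the constructor (with dependency injection), reference encapsulation, and sub-classing and overriding.

```python
from datetime import datetime


class Publisher:
    """Hypothetical stand-in for a real network publisher."""

    def send(self, message: str) -> None:
        raise RuntimeError("would touch the network")


class Service:
    def __init__(self, publisher=None):
        # Parametrization through constructor: a test can inject a fake,
        # while callers that pass nothing still get the real publisher.
        self._publisher = publisher if publisher is not None else Publisher()

    def _now(self) -> str:
        # Reference encapsulation: the clock is reached only through this
        # overridable method, never directly from the business logic.
        return datetime.now().isoformat()

    def run(self, value: int) -> None:
        self._publisher.send(f"{self._now()} value={value * 2}")


class FakePublisher:
    """Test double that records messages instead of sending them."""

    def __init__(self):
        self.sent = []

    def send(self, message: str) -> None:
        self.sent.append(message)


class TestableService(Service):
    # Sub-classing and overriding: replace the clock with a fixed value.
    def _now(self) -> str:
        return "T0"


fake = FakePublisher()
TestableService(fake).run(21)
assert fake.sent == ["T0 value=42"]
```

With the network and the clock isolated this way, the behavior of `run` can be captured by a test without changing its logic.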

Results

Using the three data points, I used an Excel spreadsheet to estimate the formula for refactoring time as:

refactoring time in hours = 40*KLOC^1.3.
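For readers who want to experiment with the data, the sketch below shows one common way to fit a power curve: ordinary least squares on the logarithms of both columns of the table. This is an assumption about the method, since the article does not say how the spreadsheet was set up, so the fitted values need not match the rounded figures above. The function name `fit_power_law` is made up for this example.

```python
import math

# (size in KLOC, refactoring time in hours), taken from the table above
data = [(0.3, 5.0), (0.75, 28.0), (1.8, 88.0)]


def fit_power_law(points):
    """Least-squares fit of ln(time) = ln(constant) + exponent * ln(size).

    Returns (constant, exponent) for time = constant * size^exponent.
    """
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(hours) for _, hours in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    constant = math.exp(mean_y - exponent * mean_x)
    return constant, exponent


constant, exponent = fit_power_law(data)
print(f"refactoring time in hours = {constant:.1f} * KLOC^{exponent:.2f}")
```

With only three points, the fit is very sensitive to each measurement, which is one more reason to treat any resulting formula as rough.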

Additionally, putting the original code under a test harness consistently took about 20% of the total refactoring time.
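The formula and the 20% share can be combined into a small estimator. The sketch below is only an illustration of the arithmetic; the function name `estimate_refactoring_hours` and its parameters are made up for this example, and the defaults are the rough values reported above.

```python
def estimate_refactoring_hours(kloc: float,
                               constant: float = 40.0,
                               exponent: float = 1.3,
                               harness_share: float = 0.20):
    """Return (total hours, hours spent on the test harness).

    Uses time = constant * KLOC^exponent; harness_share is the fraction of
    the total spent putting the original code under a test harness.
    """
    total = constant * kloc ** exponent
    return total, total * harness_share


for size in (0.5, 1.0, 2.0):
    total, harness = estimate_refactoring_hours(size)
    print(f"{size} KLOC: about {total:.0f} h in total, {harness:.0f} h for the harness")
```

For a 1 KLOC program the defaults give 40 hours in total, 8 of which go to the test harness. Passing other values for `constant` and `exponent` makes it easy to see how wide the estimate becomes when the parameters are uncertain.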

The obtained result is very rough and highly subjective. The main deficiency of the method is that there was no definite stop condition (I stopped refactoring based on a subjective feeling that the code was good enough), which makes the recorded refactoring times imprecise. Based on these results, I suspect that the “constant” value could generally be within the range of 50 +/- 20 and the exponent within the range of 1.5 +/- 0.2.

I encourage readers who have refactoring data of their own to post them in the comments for later incorporation.

History

16th February, 2021: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)