Tracking Hidden Duplicate Code

Issam Lahlali

4.33/5 (2 votes)

21 Oct 2014CPOL5 min read

8.4K

Tracking hidden duplicate code

It’s known that the presence of duplicate code has negative impacts on software development and maintenance. Indeed a major drawback is when an instance of duplicate code is changed for fixing bugs or adding new features, its correspondents have to be changed simultaneously.

The most popular reason of duplicate code is the Copy/Paste operations, and in this case the source code is exactly similar in two or more places, this practice is discouraged in many articles, books, and web sites, however sometimes it’s not easy to practice the recommendations, and as usual in the real world, there are many constraints. Here’s a story to show one of the possible constraints:

Some years ago, a friend who works as developer had a bug to resolve, after investigation he found that some lines of code from a function(FA) must be copy/pasted to another function(FB). The first reflex was to make in common this code in a function and call it from the two other ones, technically it was very simple, but the problem was he didn't have the permission to modify the FA function. He has to inform his manager who would contact the FA function maintainer and be sure that no problem would occur after these changes, and later have the permission to change the FA function. The process was time consuming. Finally, he chose the easy solution: the Copy/Paste method.

There are many tools to detect these kind of cloned code, CCFinderX is one of the interesting available open source tools. CCFinderX is a code-clone detector, which detects code clones (duplicated code fragments) from source files written in Java, C/C++, COBOL, VB, C#. It enables a user-side customization of a preprocessor, and provides an interactive analysis based on metrics.

Using the appropriate tool makes easy the detection of the duplicate code from the copy/paste operations, however there are some cases where cloned code is not trivial to detect.

Hidden Duplicate Code

Case1: Modified Copy/Pasted Code

As described before, the major problem of a copy/pasted code is when an instance of duplicate code is changed, its correspondents have to be changed simultaneously. Unfortunately, it’s not always the case and the duplicate code instances became different.

To avoid these kind of hidden duplicate code, don’t hesitate to use a tool like CFinderX to discover the duplicate code instances, and at least tag them by adding comments if you don’t have time to refactor your code. This operation is very useful when a developer tries to change a duplicate code instance, he will notice that other places have the same code. However, if the developer is not informed, he will change it in only one place, and it will be very difficult in the future to detect the modified duplicate code.

Case 2: Similar Functionality

The copy/paste operations are not the only origin of duplicate code, another reason is when a similar functionality is implemented.

Here’s from wikipedia a brief description of this second duplicate code origin:

Functionality that is very similar to that in another part of a program is required and a developer independently writes code that is very similar to what exists elsewhere. Studies suggest that such independently rewritten code is typically not syntactically similar.

Tracking Hidden Duplicate Code

In case of duplicate code not exactly the same, no tool could give you a reliable results, it could report only suspicious duplicate code, and it’s the responsibility of developers to check if it really concerns a cloned code or just a false positive result.

Each tool uses a specific algorithm to track these kind of duplicate code, we didn't test any of these tools but I think that most of them could be interesting to check at least once, it could give you some interesting results that could help you to improve the design and implementation of your code, as we will discover later in this post.

In our case, we will talk about an algorithm introduced by NDepend tool. It consists in defining sets of methods that are using the same members, i.e., calling the same methods, reading the same fields, writing the same fields. We call these sets, suspect-sets. Suspect-sets are sorted by the number of same members used.

CppDepend also implements this algorithm as a CppDepend Power-Tool. Power-Tools are a set of open-source tools based on CppDpend.API. The source code of Power-Tools can be found in $CppDependInstallPath$\ CppDepend.PowerTools.SourceCode\ CppDepend.PowerTools.sln.

Let’s discover the efficiency of this algorithm by searching the duplicate code in the Irrlicht 3D engine code base using the CppDepend PowerTools.

Case Study: Irrlicht 3D Engine

The Irrlicht Engine is an open-source high-performance realtime 3D engine written in C++. It is completely cross-platform, using D3D, OpenGL and its own software renderers.

And here are two of the suspicious duplicate code detected:

1. Exact Code Duplicate

In this case, 18 methods detected are using the same 3 methods, reading the same 2 fields and writing same 9 fields.

After checking the source code of these methods, it concerns the exact code duplicated, however other tools are more interesting to detect these kind of duplicates, and the algorithm has no added value when it concerns the exact code cloned.

2. Similar Functionaity

Here’s a second suspicious duplicate code, it concerns four methods using the same 11 methods, reading the same 6 fields and writing the same 2 fields.

After checking the source code if these four methods, it’s not exactly the same code, however they implement a unique layout algorithm. So here I’d vote for a factorization.

To better explain this case, here’s a relation between the classes concerned by the duplicate code:

OnSetConstants is declared in the IShaderConstantSetCallBack interface and implemented by all the derived classes. All the four implementations have the same layout algorithm and in such cases, the template method pattern is a good soultion to refactor the existing implementation.

When testing this algorithm in many C++ open source projects, we were very surprised that a lot of duplicate code is similar to this case, and the template method pattern is rarely used.

Conclusion

Tracking duplicate code is very useful to improve both the implementation and the design of your projects. Fortunately, many tools exist to detect the cloned code, and it’s recommended to execute periodically one of these tools and at least tag the duplicate instances.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)