Introduction
Tools to detect duplicate code can add up to code quality effectively. In the context of C#; various tools and options are available to achieve this. Visual Studio’s (Ultimate or Premium) Code Clone Detection and Atomiq’s Code Similarity Finder are few notable options. This tip discusses how to build your own code similarity finder using the principle of Levenshtein distance. Between two string
s, it is the measure of single character edits required to change one string
to the other. Lesser the value, more similar are the two string
s and vice versa. The tip also employs Roslyn, the .NET complier platform to parse the target code. The tool presented here builds a “similarity matrix” for a target C# class, which is essentially a square matrix, who’s each row is the Levenshtein distances between the codes of a method and all methods in the class. The diagonal of the matrix will have 0 entries as it would be distance between the same codes of the same method. The following table depicts the similarity matrix:
Code Similarity Matrix
|
Method 1
|
Method 2
|
…
|
Method n
|
Method 1
|
0
|
10
|
|
2
|
Method 2
|
10
|
0
|
|
12
|
…
|
|
|
|
|
Method n
|
2
|
12
|
|
0
|
Here, entries like [1,2]; [2,1], … [1,n];[n,1] essentially have the same values as they are representing the distance between the same pair of string
s. This matrix gives a snapshot of the code similarity in a C# class with the entries of lesser values indicating code similarity or duplicate code. The tool discussed here writes this matrix into an Excel file for convenience.
Background
The aim of this tip is not to discuss the implementation of the Levenshtein distance algorithm, but to show how it can effectively be applied for finding code similarities. A sample implementation from the dotnetperls.com is picked and the code is re-used. It also not the intent of the tip to discuss the performance implications of a certain implementation of the algorithm.
Using the Code
The first requirement of the tool is to provide the ability to parse the code of the target class to find all methods. The MethodsAnalyzer
class provides this infrastructure. The method Parse
of this class takes the string
of the code of the target class and uses the Roslyn’s Syntax tree to get all the methods. It builds two dictionaries: indexMap
and methodsCache
which indexes the method name and method code respectively. These dictionaries are used to build the Code Similarity matrix later.
public void Parse(string code)
{
var tree = CSharpSyntaxTree.ParseText(code);
var syntaxRoot = tree.GetRoot();
var methods = syntaxRoot.DescendantNodes().OfType<MethodDeclarationSyntax>();
var index = 0;
foreach (var method in methods)
{
var methodName = method.Identifier.ToString();
var body = method.Body.ToString();
indexMap[index] = methodName;
methodsCache[index++] = body;
}
}
The Consolidate
method loops through the methodsCache
to build Code Similarity matrix similarityMatrix
. It uses the LevenshteinDistance
class’s Compute
method which takes the string
s of the method bodies of two methods and computes the distance. The Consolidate
method does this for each method versus all methods in the class to build the Code Similarity matrix.
public void Consolidate()
{
similarityMatrix = new int[methodsCache.Count, methodsCache.Count];
foreach (var index in methodsCache.Keys)
{
var referenceCode = methodsCache[index];
var allCodes = methodsCache.Values.ToList();
for (var j = 0; j < allCodes.Count; j++)
{
similarityMatrix[index, j] = LevenshteinDistance.Compute(referenceCode, allCodes[j]);
}
}
}
The Driver
project uses the CodeAnalyzer
project to build the Code Similarity matrix for a target class and then writes it to an Excel file. The sample code is written in the Program
class of the Driver project. The Main
method first calls the GenerateSimilarityMatrix
method which in turn uses the MethodsAnalyzer
’s Parse
and Consolidate
methods to build the Code Similarity matrix. Then the Main
method calls the GenerateReport
method to write the matrix into an Excel file. The details of the GenerateSimilarityMatrix
and GenerateReport
are omitted in the discussion and left in the source code with detailed documentation for the reader.
static void Main(string[] args)
{
var program = new Program();
var methodsAnalyzer = new MethodsAnalyzer();
program.GenerateSimilarityMatrix(args[0], methodsAnalyzer);
program.GenerateReport(args[1], methodsAnalyzer);
Console.WriteLine("... Analysis completed. Press any key to exit");
Console.ReadKey();
}
The source code also includes an example target class ContactsProcessor
for which the analysis is performed. The target class is hosted in a separated project called Target
. Following is a sample out of the tool run:
|
Add |
Remove |
GetAddresses |
SendMails |
SendMail |
Add |
0 |
35 |
177 |
160 |
215 |
Remove |
35 |
0 |
176 |
154 |
211 |
GetAddresses |
177 |
176 |
0 |
69 |
122 |
SendMails |
160 |
154 |
69 |
0 |
144 |
SendMail |
215 |
211 |
122 |
144 |
0 |
The analysis shows that methods Add
; Remove
and GetAddresses
; SendMails
have relatively low values of the computed Levenshtein distance and thus are indicative of having similar code.
Points of Interest
The tool just handles a single class. The future work can be to extend it for a project and then for the whole solution. Also, the computed distances can be scaled relative to the maximum in the matrix, which would provide better readability and quick identification of similarity between methods if the matrix is sufficiently large.
History
- 25th January, 2016: Initial version