Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / MSIL

Code Analysis Tools - A collection of IL code analysis tools

4.96/5 (41 votes)
1 Sep 2006CPOL24 min read 1   1.6K  
This tool analyses the IL of a list of assemblies, looking for types, methods, and fields that are not used by another list of assemblies. This lets you see if you have unused legacy code lying around that should be cleaned up.

Main form with available code analysis tools

Introduction

This article started out as a description of one tool. But I’ve decided to turn it into a collection of code analysis tools. I’ve got several tools already written that I’ll plug in, and I’ve gotten some great feedback from other people as to possible code analysis tools. So I’ll be, bit by bit, expanding this tool collection with new tools as they become written and tested.

I’m not sure why, but code tools, code analysis and data mining have been something that has always intrigued me. I started out writing tools, like Visual Studio add-ins, to automate mundane coding tasks. One day I got ambitious and started trying to write a tool that implements a group of simple code refactorings (see Martin Fowler’s book, Refactoring).

But after looking around a bit, I noticed that there were already several commercial products that did just this, and when Visual Studio 2005 came out, it included seven fairly simple ones. So, writing another tool to do the same thing just seemed redundant. But it got me thinking more about code refactoring.

Basically, code refactoring consists of two broad tasks. Finding places that need to be refactored, and then actually refactoring the code. The second one is relatively easy. There are several (really good) books available that tell you, step by step, how to accomplish this.

But it is this first step that intrigues me. Fowler seemed to coin the term “Code Smells”. Basically, a code smell is an artifact in the code that (might) be an indicator that something could be wrong. Finding these smells, in my opinion, is the hard part, especially if your code base encompasses thousands of classes. These smells just kind of get lost in the noise and clutter, especially after several years of continuous development. They just kind of get hidden. But that doesn't mean they shouldn't be fixed.

So over the past few years, I’ve been fairly engrossed in the idea of analysis tools that sniffed around for these smells, and that is the basis of this collection of tools. This article will give a description of each of these tools, their architecture, and how to use them.

The following is a list of tools that are included in this collection of tools (as well as tools I am working on or considering):

  • Not Used Analysis: searches through your code in order to find classes, methods, and fields that are not used anymore
  • Visibility Analysis: searches for any classes or methods that are more exposed than they need to be
  • Duplicate Code: searches through your code to find duplicate, and near duplicate functions
  • Circular Dependency: searches for any hierarchy of classes that have a circular reference dependency (coming soon)

Microsoft.Cci.dll

This tool depends on a DLL that ships with FxCop. You will have to install FxCop (or just the Microsoft.Cci.dll), and you'll most likely have to reset the assembly reference in this project for Microsoft.Cci.dll based on where you installed it.

So the neat thing about .NET (as everyone knows or should know by now) is that no matter what language you are working in, it all gets compiled down to IL. Nothing new there, right? Correct. So, if I'm going to write a code analysis tool, would it make sense to write one that parses C#? Nope, because then I'd have to write one that parses VB.NET too... or the new Python.Net. That whole route would just be kind of stupid because then I'd have to maintain a parser for each language that I want to support. No fun!

But if I accessed and used the compiled IL for my code analysis, then my tool can work on any .NET application, regardless of the language it is written in. OK, so how do I parse the actual IL of a compiled assembly? There are actually several different ways.

You could do it manually. There is an article here written by Sorin Serban (Parsing the IL of a Method Body) that does just this. But it gets quite complex. You could use the API that comes with Reflector to get at the IL. Microsoft Research has a project called Phoenix that also will give you an API for digging through the IL. There is also the Mono project's Cecil which will load an assembly and provide an object graph that represents the assembly contents. I've used this quite a bit, but it sometimes doesn’t play well with Visual Studio compiled DLLs, and is always in a state of change (a good and a bad thing).

But the API I like the best is called the Common Compiler Infrastructure, and ships with FxCop (Microsoft.Cci.dll). It provides an object hierarchy / graph of the contents of the assembly, very much like Cecil does, but I've never had it throw an error with any assembly that I've ever loaded.

To start playing with CCI, create a project, and add Microsoft.Cci.dll to your references (it's located wherever you have installed FxCop), and add a using statement for Microsoft.Cci. The root class that you'll probably use is the AssemblyNode class, and its static GetAssembly method. This method takes a path to an assembly, and returns an AssemblyNode instance. Among many of the properties exposed by AssemblyNode is one called Types. This is a TypeNodeList which contains every Type within the assembly.

C#
AssemblyNode assembly = AssemblyNode.GetAssembly(someAssemblyPath);
foreach (TypeNode type in assembly.Types)
    CheckType(type);

Now, a TypeNode has a Members property which is a MemberList. This list holds every member of that type; fields, events, methods, properties, sub types, and namespace. A Method class is a sub-class of Member, so as you spin through the type's Members, look for one that is of type Method.

C#
foreach (Member member in type.Members)
{
    Method method = member as Method;
    if (method != null)
    {
         this.ParseMethod(method);
        continue;
    }
}

Now that you have a method, you can look at its parameters, attributes, and… its IL instruction list. This is where it gets cool! Get it? You can programmatically spin through an assembly's IL code and look for specific scenarios. This is pretty much exactly what FxCop does. If you've ever written a custom FxCop rule, then you’ve most likely worked with the Method object's IL instruction list.

C#
InstructionList instList = method.Instructions;
for (int index = 0; index < instList.Length; index++)
{
  this.ParseInstructionForType(instList[index]);
  this.ParseInstructionForMember(instList, index);
}

Each Instruction in the InstructionList has an OpCode property, and a Value property (among many other properties). I'm not going to get into an IL tutorial mode, but the OpCode property is the instruction. It tells the computer what operation to do. The Value property is what the OpCode is going to operate on. For example, if the program is going to load a field onto the stack, the OpCode would be ldfld, and the Value would be a Field instance. Using all this, it makes it fairly easy to figure out what types, methods, and fields are being used by your IL code.

"Not Used" Analysis

Not Used Analysis UI

Several times throughout my career, I have worked on large software projects that contain dozens of projects in a solution, thousands of classes, and tens of thousands of methods. It is always a chore to keep all this code straight, and eventually after a few years of updates and code changes, some classes and methods become deprecated...unused. But they are still left in the code because no one knows about them.

This is generally considered bad because it clutters your code up, exposes more than you intend to, takes longer to compile, and takes up more memory when the DLL is loaded.

Recently, I had to work in an area of code that I wasn’t familiar with. As I was going through the code, I would look to see how and where each class was being used. I started to become frustrated when I found several classes that were no longer used, but were still hanging out. This is one of my pet peeves.

So I got on a role and started searching for more unused code. I found fields, methods, and classes that were no longer used. Yeah! There is nothing more fun than deleting unused code! But the annoying thing about trying to find unused code is that the only way to check that it is truly not used anywhere in your solution is to compile (unless you are using reflection, at which point you can only do a string search), which, if you have 32 projects, can take a while.

So in order to speed up this process, I set out to write a tool that walks through all my code, looking for where each class, method, and field is referenced or used.

"Not Used" Architecture

The architecture for this tool is fairly simple. There are basically two main classes; a cataloger class and an analyzer class. I created an AssemblyCataloger class to rip through a list of assemblies and catalog every type, field, and method (constructors and properties are included in this) defined in the assembly. I then have a UsageAnalyzer class which accepts the AssemblyCataloger instance and a list of assemblies to check the catalog against.

What the analyzer does is spin through each assembly in the list and look for any type, field, or method that is used or called in any way. When I find a type, method, or field that exists in the catalog, I mark that type, method, or field in the catalog as being called. At the end of the analysis, I then spin through the catalog and look for the types, methods, and fields in the catalog that don't have any callers. These are then considered unused.

I originally used the Visitor model that comes with the Microsoft.Cci.dll in order to "visit" every type, field, and method in order to catalog them. The way you do this is create a class that inherits from Microsoft.Cci.StandardVisitor. You then override various VisitXXX methods like VisitField, VisitTypeNode, VisitMethod, etc. When you tell your visitor class to run against an assembly, it would spin thorough the assembly and call each overridden visitor when it finds a member that corresponds with the overridden visitor. For example, when it finds a field, it would call VisitField, and your overridden method would get called. So, when the overridden VisitXXX method is called, I would catalog that item.

This worked great, was really simple, and cut down on the amount of code I thought I'd have to write in order to traverse the entire assembly to get every field, method, and type. The only problem was that it took a long time to catalog an entire assembly, much less 30+ assemblies.

I then wrote another cataloger that manually traversed the assembly, and found that I could catalog an assembly in 30% of the time that the StandardVisitor class took. So obviously, that’s the model I ended up using.

"Not Used" Rules

The analyzer's structure is very similar to the cataloger, in that it spins through an assembly in the same manner. Basically, it loads an assembly that you want to check the catalog against. It looks at every type, method, field, event, attribute, and parameter too see if they use or reference anything in the catalog. This means getting down to the IL level. Yeah!

The following three sections are a list of the rules I came up with to determine what types, fields, and methods are used by an assembly. It actually turned out to be more complex than I thought it would. There are two categories of rules. "Don’t Catalog" rules: those rules that tell you not to even catalog a specific type, method, or field. The other type of rules are called "Mark as Used" rules: those rules that tell you that a type, field, or method in the catalog is used.

Don't Catalog Rules for Types

  • Don’t catalog any type that starts with "<": If you open any assembly up in Reflector, the very first namespace you'll find is called; and it will always contain a type called <Module>, and sometimes another type called <PrivateImplementationDetails>. These are internally used types that get auto generated. For this reason, I filter out (don’t catalog) any type that starts with the character "<". This works well because the CLR doesn’t allow you to manually define types with this character anyway.

Mark as Used Rules for Types

  • Look for an interface a type inherits from in the catalog and remove.
  • Look for a type's base type in the catalog and remove.
  • Look for any attribute that decorates a type in the catalog and remove.
  • Look for a field's type in the catalog and remove.
  • Look for any attributes that decorate a field in the catalog and remove.
  • Look for an event's handler type (delegate) in the catalog and remove.
  • Look for any attribute that decorates an event in the catalog and remove.
  • Look for a property's return type in the catalog and remove.
  • Look for any attributes that decorate a property in the catalog and remove.
  • Look for any attribute that decorates a method in the catalog and remove.
  • Look for the type of a method's arguments in the catalog and remove.
  • Look for any attributes that decorate a method's arguments in the catalog and remove.
  • Look for the return type of a method in the catalog and remove.
  • Look for any attribute that decorates a method's return type in the catalog and remove.
  • The first opcode in any method is a list of local variables that are needed by the method. Look at each local variable's type in the catalog and remove it.
  • If the opcode is one of the following (OpCode.Castclass, OpCode._Catch, OpCode.Newarr, OpCode.Box, OpCode.Initobj, OpCode.Isinst, OpCode.Unbox, OpCode.Unbox_Any, OpCode.Ldtoken), look for the opcode's value in the type catalog and remove.
  • If the opcode is one of the following (OpCode.Ldfld, OpCode.Ldflda, OpCode.Ldsfld, OpCode.Ldsflda, OpCode.Stfld, OpCode.Ldftn, OpCode.Stsfld, OpCode.Ldtoken) and the opcode value is a Field, then look for the field's type in the catalog and remove.
  • If the opcode is one of the following (OpCode.Call, OpCode.Callvirt, OpCode.Calli, OpCode.Newobj, OpCode.Ldvirtftn, OpCode.Ldftn, OpCode.Tail_) and the opcode value is a Method, then look for the method's declaring type in the catalog and remove.

Don't Catalog Rules for Methods

  • Don’t catalog static constructors: if a type has a static field, then the compiler will generate a static constructor to initialize the field. Also, static constructors are explicitly called anyway, they are called the first time any member of a type is called.
  • Don’t catalog default type constructors (constructors with no parameters): every type must have one of these, so don’t catalog them.
  • Don’t catalog instance constructors that have the same parameter list that the type's base type constructor has: If type B inherits from type A, and A has a constructor that takes one string, and B wants to expose a constructor that allows the caller to pass a string to the base, it needs to expose this constructor overload as well. While this is not explicitly needed if it's never actually called, it's considered "good form" to provide it if needed.
  • Don’t catalog Finalize methods: type finalizers are not explicitly called, they are called by the GC, so don’t catalog them.

Mark as Used Rules for Methods

  • If a method is not in a catalog, but the type that it belongs to implements an interface that defines this method, then keep the method in the catalog since it's needed by the interface.
  • If the opcode is one of the following (OpCode.Call, OpCode.Callvirt, OpCode.Calli, OpCode.Newobj, OpCode.Ldvirtftn, OpCode.Ldftn, OpCode.Tail_) and the opcode value is a Method, then look for the method in the method catalog and remove.
  • If the method found in the previous rule was found and removed from the catalog, and is also abstract or virtual, then walk up and down methods declaring type inheritance hierarchy, and see if one of those types declares a method with the same signature. When found, remove that from the method catalog as well.

Don't Catalog Rules for Fields

  • Don’t catalog fields that are constants: this is because the value of the constant field just gets hard-coded into the IL at compile time. There is no way to know if it ever gets used or not.
  • Don’t catalog fields that have the name "value__": I don’t know why the compiler puts these in there, but every enumeration defined has a private field defined by the compiler that is called "value__".

Mark as Used Rules for Fields

  • If the opcode is one of the following (OpCode.Ldfld, OpCode.Ldflda, OpCode.Ldsfld, OpCode.Ldsflda, OpCode.Stfld, OpCode.Ldftn, OpCode.Stsfld, OpCode.Ldtoken) and the opcode value is a Field, then look for the field in the field catalog and remove.

Known False Positives

Methods, types, and fields invoked via reflection: since reflection invocations often are initiated with the string name of the method or field, this type of analysis would not catch any type, method, or field that is called via reflection.

There are also reference chains that could give you a false positive, which I currently don't analyze for. For example, take the following reference chain: class A is only used by class B, and class B is only used by method x() of class C, and method x is not called anywhere. Method C.x would be marked as unused, but classes A and B would not be marked as unused. Now, once you have removed method C.x, then class B would be marked as unused. And then once B is removed, then class A would be marked as unused. But this is a lot of steps to manually perform.

Eventually, I'd like to fix both of these deficiencies.

Visibility Analysis

One of the analysis suggestions that I received recently was to check to see if a type, method, or field's scope is more open or visible than it needs to be. For example, if a method is marked as public, but no other assembly calls it, it could be marked as internal. Along those same lines, if that method is called only by other members defined in the same class, it could be marked as private.

This has several benefits such as reducing the 'API surface', reducing documentation and testing requirements, and reducing possible security vulnerabilities. Not to mention it's just good practice to limit the scope of your members to the lowest required visibility.

Well, this style of analysis is very easy to do considering all the data I gather when I catalog and analyze the assemblies during the Not Used analysis, so I added it in.

Visibility Analysis UI

Note: so far, I only looked for public, private, and internal scopes. I haven't analyzed for protected and protected internal (yet).

Not Used and Visibility Analysis Usage and Results

To begin with either analysis (or both), click the "Pick files to get checked..." link. This will popup a file selection dialog. Pick the assemblies that you want to catalog.

Then, click the "Pick files to check against..." link, and again select one or more assemblies. These are the assemblies that you will use to see if the cataloged items are ever used. Alternately, you can click the ">>" button, which will bring over all the assemblies in your "to catalog" list over to your "to analyze" list.

Next, select the analysis options. You can analyze for any combination of classes, methods, fields, and visibility (in order to analyze for visibility, you have to check at least one of the three from the classes, methods, and fields).

Then, click the "Run Analysis" button to actually run the analysis. Depending on how many assemblies you are cataloging and analyzing, this could take up to a minute. My biggest analysis was with 42 DLLs and EXEs, and it took about 50 seconds (though most of that time was spent populating the ListViews to show the results).

On the lower half of the UI, there are four tabs where the results of the analysis are shown. Each tab shows how many members were cataloged and how many were found to be not used. The ListView below that is a member by member breakdown of each member not used. If you double click on any row, you will get the details about that member. This is useful so you can hunt down the offending member and terminate its existence!

I ran my analysis tool against FxCop, cataloging two of FxCop's internal DLLs, and checking them against all seven of its assemblies; I got the following results:

  • Out of 110 classes cataloged, 2 were not used.
  • Out of 916 methods cataloged, 173 were not used.
  • Out of 244 fields cataloged, 10 were not used.
  • Out of 1270 total members analyzed, 71 were shown to have too much visibility.

I also ran the tool against Reflector, which only has one assembly. These are the results I got:

  • Out of 363 classes cataloged, 15 were not used.
  • Out of 2500 methods cataloged, 1313 were not used.
  • Out of 454 fields cataloged, 0 were not used (very good!).
  • Out of 3317 total members analyzed, 983 were shown to have too much visibility (not so good; but Reflector does expose an add-in API, so that would explain why there are so many public members not being used).

What Does All This Mean?

So if no one calls your method, does that mean you should delete it? It depends. If you are defining a public API for a product and it’s a useful method for your customer, then no, don’t delete. If it’s a private method or field, then yes, it's probably left over from legacy code and not used anymore, so blow it away. Basically, this tool just sniffs out a specific type of smell, but it's up to you to refactor it correctly.

Duplicate Code Analysis

One code smell that is really stinky is duplicate code. But it can be hard to find when, like I’ve said before, you have thousands of classes in your project. When you need a piece of functionality that already exists in one class, to be in another class, it’s just too easy to copy and paste the code. We all know this is bad, and we have all done it at one point or another.

But this again is just added clutter and memory overhead, not to mention a breeding ground for possible bugs. What happens if a function, over the years, is copied to five different classes, and then it's determined that, that function has a bug in it. Well, hopefully, you remember all the other places you copied the code to, so you can go fix it there too. But what about all the places Joey or Stu copied it to that you don’t know about?

That is the motivation behind this tool. Basically it scans your assemblies for methods that are either an exact duplicate or a pretty close near duplicate of other methods. It then (as shown above) reports back all the different clusters of duplicate functions that it found. Then it’s up to you to clean up the code.

Duplicate Code Analysis

In order to determine if one function is a duplicate of another, I spin through each function, and build up a string that consists of each OpCode / OpCode-operand pair in the function appended together. The OpCode operand is the thing that the OpCode is operating upon. For instance, the call OpCode could be operating upon the method System.Console.WriteLine(string), or the ldfld opcode operating on a class field.

Once I have this string built up for a function, I then run it through an MD5 hash to get a 16 byte hash of the function. I check to see if this hash already exists in a dictionary of hashes. If it exists, I’ve found a duplicate and report it back to the UI. If I didn’t, I add it to the dictionary to be checked against other functions. The reason I hash the IL/ IL operands is that the strings get fairly large and if I used the string as my key in the dictionary, I could easily run out of memory if my application was large enough (like I did with when analyzing the entire .NET framework).

There are a few OpCodes that I don’t worry about when building the string of OpCodes: these are the nop OpCode and the _Locals OpCode. The nop OpCode is used in debug builds for syncing break points up with the code in Visual Studio. The _Locals OpCode is the first OpCode of every function that holds information about all the variables that the function declares.

Lots of Duplicates! More filtering

When I first ran this tool against my code, I found a disturbing number of duplicate methods… hundreds of duplicate methods. I mean, I knew I had copied a few methods here and there, once in a while. But not that many! In looking closer at the results, I found two culprits: basic getter / setter properties, and default constructors. Basic getter /setter properties are those properties that do nothing else but return a field, or set a field. Hmmm… this is OK right? Yup. I’m not going worry about those. Default constructors are those constructors that do nothing but just call the constructor on its base class, and are usually inserted into your assemblies by the C# compiler. Again, just noise that can be ignored.

Near Duplicate Code Analysis

This worked pretty good, but I thought to my self… "Self, what about methods that do the same processing steps, but against different types and methods?" These could be indicators of methods that might be candidates for refactoring down to a base class, or a helper method.

For example, what if you had a function that did a series of steps against a FileStream, and another function that did the exact same steps against a MemoryStream. The find duplicates algorithm would not find this because it uses the OpCode operand when calculating the hash-code; namely the methods of FileStream and MemoryStream. But what if you only included the OpCodes in your hash, and not the operands? You might actually find more candidates for refactoring!

Turns out that I was correct. And the existing algorithm made it simple to add this option. Below are two screenshots. The first one is the duplicate code check against the FxCop binaries, and the second one is a near duplicate code check against the FxCop binaries. Notice that it found more candidates for refactoring with the near duplicate analysis.

Duplicate Code Analysis

Duplicate Code Analysis

Another type of duplicate code analysis that I’ve been working on (but not ready for public eyes yet) is one that only looks at smaller groups of IL OpCodes vs. the entire method. This would allow you to find duplicate blocks of code buried in a function. For instance, say you have a method that does a bunch of stuff. Then it opens a file stream, and reads the data out into a byte array or a string. Then the method goes on to do more stuff. A month later you write a new method. This one does some stuff that needs to open and read a file, and then do more stuff.

These two methods might do drastically different things, but have a duplicate block of code in them; namely the code that opens and reads a file. This sounds like another candidate for refactoring! This type of analysis could also look for duplicate blocks of code within the same function as well. Ever seen one of those 500 line functions written by an old school ASP developer? Where there are code blocks after code blocks just copied and pasted over and over again? This type of analysis should be able to find those as well.

Ideas Please?

I am always looking for new ideas for code analysis tools, and writing them is a passion of mine. So if you can think of other rules that I forgot to check in any of these existing tools, or any other type of code analysis, please leave a comment, or send me an email, and I'll try to work it into the tools.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)