Using NRefactory for analyzing C# code

11 Aug 2012
Use NRefactory to write a standalone application that refactors your C# code.

Introduction  

NRefactory is the C# analysis library used in the SharpDevelop and MonoDevelop IDEs. It allows applications to easily analyze both the syntax and the semantics of C# programs. It is quite similar to Microsoft's Roslyn project, except that it is not a full compiler: NRefactory only analyzes C# code; it does not generate IL code.

This article describes NRefactory 5, a complete rewrite that was started in July 2010. The goals of the rewrite were to have a more accurate syntax tree (including exact token positions), and to integrate the IDE-specific semantic analysis features and refactorings into the NRefactory library. This allows us to share them between SharpDevelop and MonoDevelop. 

NRefactory 5.0 has recently shipped with MonoDevelop 3.0. NRefactory is also used in the ILSpy decompiler, in Unity's IL-to-Flash converter, and as a front-end in the C#-to-JavaScript compiler Saltarelle.

Overview 

NRefactory offers APIs for accessing syntax trees, semantic information, and the type system.


To understand the different stages, consider this simple piece of code:  

obj.DoStuff(0) 

As C# code, this is simply a string. Parsing the string into a syntax tree tells us that it is an invocation expression, which has a member reference as target. Using the NRefactory API, this syntax tree could be built manually like this:

new InvocationExpression {
    Target = new MemberReferenceExpression {
        Target = new IdentifierExpression("obj"),
        MemberName = "DoStuff"
    },
    Arguments = {
        new PrimitiveExpression(0)
    }
}

Note that the syntax tree doesn't tell us at all what obj or DoStuff is. DoStuff most likely is an instance method, and obj is a local variable, a parameter or a field of the current class. Or obj could be a class name and DoStuff a static field containing a delegate. The NRefactory resolver can tell which of those it is. The output of the resolver in this case is a semantic tree similar to this:

new InvocationResolveResult {
    TargetResult = new LocalResolveResult('TestClass obj;'),
    Member = 'public void TestClass.DoStuff(int)',
    Arguments = {
        new ConstantResolveResult('System.Int32', 0)
    }
}

This is pseudo-code; the strings in single quotes here are not strings, but references into the type system.

Syntax Tree 

To parse C# code, use the ICSharpCode.NRefactory.CSharp.CSharpParser class:

CSharpParser parser = new CSharpParser();
SyntaxTree syntaxTree = parser.Parse(programCode);

Syntax errors detected while parsing can be retrieved using the parser.Errors collection.
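As a minimal sketch of how the errors might be reported: the class and method names below are made up for illustration, and the sketch assumes that each error object exposes Region and Message properties, as in the NRefactory 5.x sources.

```csharp
using System;
using ICSharpCode.NRefactory.CSharp;

static class ParseErrorDemo
{
    // Parses the given source text, prints every syntax error,
    // and returns the number of errors found.
    public static int CountSyntaxErrors(string programCode)
    {
        CSharpParser parser = new CSharpParser();
        parser.Parse(programCode);
        int count = 0;
        // Assumption: each error carries a source region and a message.
        foreach (var error in parser.Errors) {
            Console.WriteLine(error.Region + ": " + error.Message);
            count++;
        }
        return count;
    }
}
```

Calling ParseErrorDemo.CountSyntaxErrors("class A {") should report at least one error for the missing closing brace, while a well-formed file reports none.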

The syntax tree can be visualized using the NRefactory.Demo application: 

The base class of the syntax tree is AstNode: every item in the demo application's tree view is one AstNode. (The term "AST" stands for Abstract Syntax Tree.)

Every node has a list of children, and each child has a role. A child node's role describes the relation between the parent node and the child - it explains where in the parent the node appears.

The titles describing the nodes in the demo's tree view follow the pattern "node.Role: node.GetType().Name". In the screenshot, you can see that the selected IndexerExpression "args[0]" has the child identifier "args" in the Target role, and the child "0" in the Argument role. For multidimensional arrays, there would be multiple children in the Argument role, separated by comma tokens.

The tokens themselves are AstNodes as well. For example, the opening bracket is a CSharpTokenNode in the Roles.LBracket role. This flexible AST structure allows us to add comments at the correct positions - for example, parsing the code "args[/*i*/0]" would result in an IndexerExpression that has an additional Comment node between the "[" and the Argument.
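The "role: type" titles used by the demo's tree view can be approximated in a few lines. This is only a sketch: TreeDump is a hypothetical helper (not part of NRefactory), and the exact text printed for each role depends on how the Role class formats itself.

```csharp
using System.Collections.Generic;
using ICSharpCode.NRefactory.CSharp;

static class TreeDump
{
    // Returns one indented "role: node type" line per node in the tree.
    public static List<string> Describe(AstNode root)
    {
        var lines = new List<string>();
        Describe(root, 0, lines);
        return lines;
    }

    static void Describe(AstNode node, int depth, List<string> lines)
    {
        lines.Add(new string(' ', depth * 2) + node.Role + ": " + node.GetType().Name);
        // Children include tokens and comments when the tree comes from the parser.
        foreach (AstNode child in node.Children)
            Describe(child, depth + 1, lines);
    }
}
```

For a tree produced by the parser, the output also lists the token nodes (brackets, commas) between the expression children; a manually constructed tree contains only the nodes that were explicitly added.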

However, this flexible syntax tree is rather inconvenient - filtering the children by role gets rather verbose if you need to do it all over the place. Also, it is not always clear which roles can appear in a given construct. For this reason, the syntax tree API contains additional helper properties. The following three lines are all equivalent: 

var arguments = indexer.Children.Where(c => c.Role == Roles.Argument).Cast<Expression>();
var arguments = indexer.GetChildrenByRole(Roles.Argument);
var arguments = indexer.Arguments;

The convenience properties also have an additional benefit: they never return null. If the indexer's target expression is missing (which might happen in incomplete code that produces parse errors), the null node Expression.Null will be returned instead. This is the null object pattern. To test if a node is a null node, check the IsNull property.
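The null object pattern can be observed without any parsing by constructing nodes manually. In this small sketch (NullNodeDemo is a hypothetical helper), a return statement without an expression still returns a non-null Expression, and IsNull tells the two cases apart:

```csharp
using ICSharpCode.NRefactory.CSharp;

static class NullNodeDemo
{
    // True when the statement is a plain "return;" without an expression.
    public static bool ReturnsNothing(ReturnStatement statement)
    {
        // Expression is never null; a missing child is represented by a null node.
        return statement.Expression.IsNull;
    }
}
```

ReturnsNothing(new ReturnStatement()) yields true, while ReturnsNothing(new ReturnStatement(new PrimitiveExpression(0))) yields false. In neither case is a null check on statement.Expression required.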

Traversing The Syntax Tree

If you want to find some construct in the syntax tree, you will need to traverse it, e.g. to find all constructs of a specific type. This can be done easily using the AstNode.Descendants property, e.g. syntaxTree.Descendants.OfType<InvocationExpression>().

However, for more complex operations it is usually better to use the visitor pattern:

syntaxTree.AcceptVisitor(new FindInvocationsVisitor());
 
class FindInvocationsVisitor : DepthFirstAstVisitor
{
    public override void VisitInvocationExpression(InvocationExpression invocationExpression)
    {
        if (LooksLikeIndexOfCall(invocationExpression)) {
        	...
        }
        // Call the base method to traverse into nested invocations
        base.VisitInvocationExpression(invocationExpression);
    }
}

There is also a generic version of the DepthFirstAstVisitor available which allows returning a value from the visit method. This can be useful for implementing a more complex analysis of the source code. 
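As a rough illustration of the generic visitor, the following sketch counts invocation expressions by letting every visit method return the count for its subtree. It assumes that VisitChildren is the overridable hook called by the default Visit* implementations (as in the NRefactory 5.x sources); InvocationCounter and InvocationCounterDemo are names invented for this example.

```csharp
using ICSharpCode.NRefactory.CSharp;

// Each Visit method returns the number of invocations in the visited subtree.
class InvocationCounter : DepthFirstAstVisitor<int>
{
    // Assumption: the base implementation visits the children but returns
    // default(int), so we override it to add up the children's results.
    protected override int VisitChildren(AstNode node)
    {
        int sum = 0;
        foreach (AstNode child in node.Children)
            sum += child.AcceptVisitor(this);
        return sum;
    }

    public override int VisitInvocationExpression(InvocationExpression invocationExpression)
    {
        // One for this invocation, plus any invocations nested in its target and arguments.
        return 1 + base.VisitInvocationExpression(invocationExpression);
    }
}

static class InvocationCounterDemo
{
    public static int Count(string programCode)
    {
        SyntaxTree syntaxTree = new CSharpParser().Parse(programCode);
        return syntaxTree.AcceptVisitor(new InvocationCounter());
    }
}
```

For the input "class C { void M() { A(B()); } }", the nested call B() is counted as well as the outer call A(...).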

Identifying Code Patterns

When analyzing C# code to look for certain issues, it is often necessary to identify if a given piece of code matches a syntactic pattern.  

For example, consider the pattern "X a = new X(...);". A refactoring engine might propose to replace X with var. However, identifying such a construct using the syntax tree API can get very tedious. In our example, we need to check that:

  • The variable declaration statement declares only a single variable. 
  • The variable is initialized with a 'new' expression.
  • The new expression uses the same type as the variable declaration. 

In code: 

bool CanBeSimplified(VariableDeclarationStatement varDecl)
{
    if (varDecl.Variables.Count != 1)
        return false;
    VariableInitializer v = varDecl.Variables.Single();
    ObjectCreateExpression oce = v.Initializer as ObjectCreateExpression;
    if (oce == null)
        return false;
    // It is not clear yet how to compare two AST nodes for equality
    // Equals() would just use reference equality
    return ?AreEqualTypes?(varDecl.Type, oce.Type);
}

While not too terrible in this case, such imperative condition-testing code quickly gets unreadable when checking more complex constructs. Fortunately, NRefactory provides a declarative alternative: pattern matching.

A pattern is a syntax tree that contains special pattern nodes. Patterns work similarly to regular expressions in .NET, except that they deal with syntax nodes instead of characters.

In our example, we can use this pattern: 

var pattern = new VariableDeclarationStatement {
    Type = new AnyNode("type"),
    Variables = {
        new VariableInitializer {
            Name = Pattern.AnyString,
            Initializer = new ObjectCreateExpression {
                Type = new Backreference("type"),
                Arguments = { new Repeat(new AnyNode()) }
            }
        }
    }
};

To use this pattern, call the IsMatch or Match extension methods. The Match method returns an object that can be used to retrieve details of the match, such as the contents of the named capture group:

Match m = pattern.Match(someNode);
if (m.Success) {
    // Replace redundant type name with 'var'
    m.Get<AstType>("type").Single().ReplaceWith(new SimpleType("var"));
}

A pattern doesn't have to contain any special pattern nodes: any normal syntax tree can also be used as a pattern. Normal nodes will only match other nodes that are syntactically identical (but whitespace and comments are ignored). So pattern matching also answers the question of how to compare the two type nodes: we could have written 'return varDecl.Type.IsMatch(oce.Type);'.

Patterns are strict - they only match syntactically identical nodes; any variations must be explicitly specified. In fact, our pattern is not equivalent to the imperative code earlier: it will fail to match List<int> x = new List<int> { 0 };, because the pattern does not account for object/collection initializers. To fix the pattern, we can insert 'Initializer = new OptionalNode(new AnyNode())' into the ObjectCreateExpression.

However, this strictness can also be an advantage. There's a second difference between our pattern and the imperative code: the pattern rejects const int x = new int();. This is a valid constant declaration, but const var would be invalid C# code, so rejecting this code was the right thing to do! To fix our original CanBeSimplified method, we would need an explicit test:

if (varDecl.Modifiers != Modifiers.None)
    return false;

On the other hand, our pattern could be made to accept constant declarations as well using Modifiers = Modifiers.Any in the pattern's initializer. 
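Putting the OptionalNode fix into the pattern gives roughly the following. This is a sketch rather than the code from the sample application: VarPatternDemo and CanUseVar are hypothetical names, and the pattern deliberately leaves Modifiers unset so that const declarations are still rejected.

```csharp
using ICSharpCode.NRefactory.CSharp;
using ICSharpCode.NRefactory.PatternMatching;

static class VarPatternDemo
{
    // Matches "X a = new X(...);" with or without an object/collection initializer.
    static readonly VariableDeclarationStatement VarPattern = new VariableDeclarationStatement {
        Type = new AnyNode("type"),
        Variables = {
            new VariableInitializer {
                Name = Pattern.AnyString,
                Initializer = new ObjectCreateExpression {
                    Type = new Backreference("type"),
                    Arguments = { new Repeat(new AnyNode()) },
                    // The fix described above: allow an optional initializer.
                    Initializer = new OptionalNode(new AnyNode())
                }
            }
        }
    };

    public static bool CanUseVar(AstNode node)
    {
        return VarPattern.IsMatch(node);
    }
}
```

A declaration whose variable type and created type differ (for example "X a = new Y();") does not match, because the Backreference requires the second type node to be syntactically identical to the captured one.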

This makes pattern matching a great tool when implementing refactorings, as your refactoring won't accidentally touch cases you haven't thought about. In my experience working on ILSpy, it is much easier to write patterns than to write the code that manually checks all the boundary conditions.

Summary

This concludes our discussion of syntax trees. If you want to know more, please download the demo application. It is a great tool for learning how NRefactory represents a given piece of C# code.

Type System 

Before we can talk about semantic analysis of C# code, we will need to provide NRefactory with the necessary information. A C# code file cannot be analyzed on its own; we need to know which classes and methods exist. This depends on other code files in the same project, and on the referenced assemblies and projects. In NRefactory, all this information taken together is called the type system. It contains not only types, but also methods, properties, events and fields.

There actually are two type systems in NRefactory: the unresolved type system and the resolved type system.

The unresolved type system is intended as a kind of "staging ground" for building a type system. The unresolved type system essentially contains the same information as the declarations in the syntax tree, but represents it using language-independent interfaces. It does not contain any information about method bodies, and is significantly more memory-efficient than keeping all syntax trees around. 

This diagram shows the architecture of NRefactory with its two type systems:  

Red boxes signify that the API is language-independent; blue boxes signify C#-specific APIs. Currently C# is the only language supported by NRefactory 5, but we plan to add VB support in the future.

Note that the unresolved type system is serializable (using either the .NET BinaryFormatter or the FastSerializer included in NRefactory). SharpDevelop and MonoDevelop use this to speed up loading solutions: parsing a whole project takes a bit of time (on my machine, NRefactory parses 70,000 lines per second).

What's In the Type System  

The type system provides roughly the same information as System.Reflection.  

The root object of the type system is the ICompilation. A compilation consists of a main assembly (the assembly being compiled) and a set of referenced assemblies. Each assembly has a set of type definitions and assembly attributes. Type definitions contain other type definitions and members. This is the hierarchy of entities - class and member definitions.

The other part of the type system is, of course, the types. A type in NRefactory is one of the following:

  • A type definition: class/struct/enum/delegate/interface types. Examples: int, IDisposable, List<T>
  • An array type. Example: string[]
  • A parameterized type. Example: List<string>
  • A pointer type. Example: int* 
  • A managed reference type. Example: ref int 
  • A special type: dynamic, unknown type, or the type of the null literal
  • A type parameter. Example: T

The IType.Kind property can be used to distinguish between the different kinds of types. The special "unknown type" is used to represent types that could not be resolved (e.g. compiler error due to a missing class). ITypes provided by NRefactory are never null; SpecialType.UnknownType is used as a null-object.  

This table describes the relations between syntax tree classes and the two type systems:

C# Syntax Tree | Unresolved Type System | Resolved Type System
AstType | ITypeReference | IType
TypeDeclaration | IUnresolvedTypeDefinition | ITypeDefinition
EntityDeclaration | IUnresolvedEntity | IEntity
FieldDeclaration | IUnresolvedField | IField
PropertyDeclaration / IndexerDeclaration | IUnresolvedProperty | IProperty
MethodDeclaration / ConstructorDeclaration / OperatorDeclaration / Accessor | IUnresolvedMethod | IMethod
EventDeclaration | IUnresolvedEvent | IEvent
Attribute | IUnresolvedAttribute | IAttribute
Expression | IConstantValue | ResolveResult
PrivateImplementationType | IMemberReference | IMember
ParameterDeclaration | IUnresolvedParameter | IParameter
Accessor | IUnresolvedMethod | IMethod
NamespaceDeclaration | UsingScope | ResolvedUsingScope
- | - | INamespace
SyntaxTree | IUnresolvedFile | -
- | IUnresolvedAssembly | IAssembly
- | IProjectContent | ICompilation

Most code using NRefactory only needs to deal with the syntax tree and the resolved type system.

Building the Type System  

To build a type system, we need to provide NRefactory with approximately the same information as one would need to start the Microsoft C# compiler on the command-line: all code files in the project, the list of referenced assemblies, and some of the project settings.

NRefactory itself does not contain any code to read that information from .sln or .csproj files. Fortunately, the sample application attached to this article does; and you're free to copy that into your own projects.

The sample application contains a small project model consisting of the classes Solution, CSharpProject and CSharpFile.

For each project, we use MSBuild to open the .csproj file, and figure out the relevant compiler settings. We create one CSharpProjectContent instance for each project - this is the root object of the unresolved type system. Note that this class is immutable, so whenever we add some elements/set a property, we need to use the return value.

For each code file in the project, we parse it using CSharpParser, and then add it to the project content as follows:

pc = pc.AddOrUpdateFiles(syntaxTree.ToTypeSystem());

The sample application also stores the full syntax tree in the CSharpFile class. This makes our sample application simpler, but can cause significant memory usage when loading a large solution. In the SharpDevelop IDE, we usually only hold the syntax tree of the currently open file in memory. For the other files, we only keep the much smaller type system around. 

As for the referenced assemblies, we use the Microsoft.Build library (part of the .NET framework) to run the MSBuild ResolveAssemblyReferences task. This task determines the full path to the assembly files. This will get us the correct version of the assemblies based on the project's target framework. 

To load those assemblies into NRefactory, we can use the CecilLoader class: 

foreach (string assemblyFile in ResolveAssemblyReferences(p)) {
    IUnresolvedAssembly assembly = new CecilLoader().LoadAssemblyFile(assemblyFile);
    pc = pc.AddAssemblyReferences(assembly);
}

The final step, after the project has been loaded, is to create the resolved type system:

compilation = pc.CreateCompilation();
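For experiments it can be handy to skip the MSBuild part entirely. The following sketch builds a type system from a single in-memory code file and only references the core library of the running process. That is a simplification, not what the sample application does: TypeSystemDemo and the file name "demo.cs" are invented for illustration, and depending on the NRefactory version CecilLoader may live in a separate ICSharpCode.NRefactory.Cecil assembly.

```csharp
using System.Collections.Generic;
using ICSharpCode.NRefactory.CSharp;
using ICSharpCode.NRefactory.TypeSystem;

static class TypeSystemDemo
{
    public static ICompilation CreateCompilation(string programCode)
    {
        IProjectContent pc = new CSharpProjectContent();
        // Simplification: reference only the core library of the current process.
        pc = pc.AddAssemblyReferences(
            new CecilLoader().LoadAssemblyFile(typeof(object).Assembly.Location));

        SyntaxTree syntaxTree = new CSharpParser().Parse(programCode);
        // Assumption: ToTypeSystem() needs a file name on the syntax tree.
        syntaxTree.FileName = "demo.cs";
        pc = pc.AddOrUpdateFiles(syntaxTree.ToTypeSystem());
        return pc.CreateCompilation();
    }

    // Lists the top-level types of the main assembly as "kind full-name" strings.
    public static List<string> DescribeTypes(ICompilation compilation)
    {
        var result = new List<string>();
        foreach (ITypeDefinition type in compilation.MainAssembly.TopLevelTypeDefinitions)
            result.Add(type.Kind + " " + type.FullName);
        return result;
    }
}
```

Because the project content is immutable, every Add call returns a new instance; forgetting to assign the return value back to pc silently drops the file or reference.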

In the description above, I've ignored project references. The class CSharpProjectContent implements IAssemblyReference, so it is possible to create a project reference by passing a project content to AddAssemblyReferences like we did with the assembly file references. However, in our solution loading logic we don't know if the referenced project is already loaded - it might come later in the list of projects. And since CSharpProjectContent is immutable, using a project content directly would reference that exact version of the project content - this means it is impossible to create cyclic references this way.

To avoid this problem, NRefactory provides the ProjectReference class. Only when a compilation is being created, this will look up the correct version of the referenced project. This indirection allows us to build the unresolved type system in any order, and we can even represent cyclic dependencies. The list of available project contents in the solution can be provided to NRefactory through the DefaultSolutionSnapshot class. Take a look at the sample application code for more details.

Semantic Analysis 

Armed with a type system, we have all the information we need to perform semantic analysis.
To retrieve the semantics for a C# AST node, we can use the class CSharpAstResolver:

CSharpAstResolver resolver = new CSharpAstResolver(compilation, syntaxTree, unresolvedFile);
ResolveResult result = resolver.Resolve(node); 

To create a C# AST resolver, we need to provide the resolved type system (compilation) and the root node of the C# syntax tree. Optionally, we can also provide the IUnresolvedFile that was created from the syntax tree and registered in the type system. If provided, it is used to figure out the mapping between the AST declarations and the type system. Otherwise, the resolver will use the member signatures to figure out this mapping. This tends to be slower and might fail if there are errors in the C# program (e.g. if the method signature cannot be determined due to missing types). So it is a good idea to store the IUnresolvedFile when creating the type system, so that we can provide it to the resolver.

After a resolver instance is created, the Resolve() method is used to determine the semantics of any node within the syntax tree. The CSharpAstResolver has an internal cache of already resolved nodes, so if you use the same CSharpAstResolver to resolve both IndexOf() calls in the following example, the type of the variable tmp will be resolved only once. 

var tmp = a.complex().expression;
int x = tmp.IndexOf("a");
int y = tmp.IndexOf("b");

A ResolveResult is a tree of objects that represent the semantics of an expression. You can use the "Resolve" button in NRefactory.Demo to view this tree:

Within the semantic tree, all operations are fully resolved: "args" is known to be a parameter, "Length" is known to refer to System.Array.Length, etc.

The semantic tree may have fewer nodes than the syntax tree. For example, parentheses are ignored, and constants get folded where possible.

On the other hand, the semantic tree may have additional nodes that do not appear in the syntax tree. In this example, the ConversionResolveResult is an additional node that represents the implicit conversion from int to double. Such conversions appear in the semantic tree only when resolving the parent node; resolving only the "args.Length" expression would result in the MemberResolveResult.

The methods CSharpAstResolver.GetConversion() and GetExpectedType() can be used to retrieve the conversion that is being applied to a given node.

Finally, the CSharpAstResolver also provides methods that return the resolver's state (local variables in scope etc.) at a given AST node. This can be used to create a second CSharpAstResolver that analyzes a small code fragment in the context of a different file. For example, SharpDevelop uses this feature to resolve watch expressions in the debugger in the context of the current instruction pointer. 
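To tie the pieces together, here is an end-to-end sketch that parses a single code file, builds a type system for it, and resolves the first invocation it finds. As before, this is an illustration under simplifying assumptions: ResolveDemo and "demo.cs" are made-up names, only the core library of the running process is referenced, and CecilLoader may live in a separate assembly depending on the NRefactory version.

```csharp
using System.Linq;
using ICSharpCode.NRefactory.CSharp;
using ICSharpCode.NRefactory.CSharp.Resolver;
using ICSharpCode.NRefactory.CSharp.TypeSystem;
using ICSharpCode.NRefactory.Semantics;
using ICSharpCode.NRefactory.TypeSystem;

static class ResolveDemo
{
    // Returns the full name of the member called by the first invocation in the code,
    // or null if the invocation could not be resolved to a member.
    public static string ResolveFirstInvocation(string programCode)
    {
        SyntaxTree syntaxTree = new CSharpParser().Parse(programCode);
        syntaxTree.FileName = "demo.cs"; // assumption: required by ToTypeSystem()
        CSharpUnresolvedFile unresolvedFile = syntaxTree.ToTypeSystem();

        IProjectContent pc = new CSharpProjectContent();
        pc = pc.AddAssemblyReferences(
            new CecilLoader().LoadAssemblyFile(typeof(object).Assembly.Location));
        pc = pc.AddOrUpdateFiles(unresolvedFile);
        ICompilation compilation = pc.CreateCompilation();

        // Passing the unresolved file lets the resolver map declarations directly.
        var resolver = new CSharpAstResolver(compilation, syntaxTree, unresolvedFile);
        InvocationExpression invocation =
            syntaxTree.Descendants.OfType<InvocationExpression>().First();
        var rr = resolver.Resolve(invocation) as InvocationResolveResult;
        return rr != null ? rr.Member.FullName : null;
    }
}
```

With the input "class C { int M(string s) { return s.IndexOf(\"a\"); } }", the resolved member is expected to be System.String.IndexOf, provided the core library could be loaded.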

Sample: Finding IndexOf() Invocations

The string.IndexOf() method is difficult to use correctly: if you forget to specify a string comparison, it uses a culture-aware comparison. Culture-aware comparisons often work against a programmer's expectation, causing bugs. For example, a programmer might expect that a call to 'text.Substring(text.IndexOf("file://") + 7)' would always return the text after the second '/'. However, IndexOf will also match 'fi' with the ligature 'ﬁ' (a single Unicode character), and by adding 7, the code ends up skipping the first character of the file name. There are plenty of other special characters in various languages that can cause similar problems.

Thus, it is a good guideline to always use StringComparison.Ordinal. Ordinal comparisons work like programmers expect, and are the right choice when dealing with URLs, file names and pretty much everything else. 
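The difference is easy to try out with plain .NET code. In this sketch (IndexOfDemo is a hypothetical helper), the input uses the escape \uFB01 for the 'ﬁ' ligature; the result of the culture-aware call depends on the runtime and its globalization data, so no particular value is promised for it here.

```csharp
using System;

static class IndexOfDemo
{
    // Ordinal search: compares UTF-16 code units one by one.
    public static int FindOrdinal(string text, string value)
    {
        return text.IndexOf(value, StringComparison.Ordinal);
    }

    // Culture-aware search: what IndexOf(string) does when no comparison is given.
    public static int FindCultureAware(string text, string value)
    {
        return text.IndexOf(value, StringComparison.CurrentCulture);
    }
}
```

For the text "\uFB01le://readme.txt" and the value "file://", the ordinal search returns -1 because the ligature is a different character than 'f' followed by 'i'. The culture-aware search may report a match at index 0, which is exactly the surprising behavior described above.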

We will use NRefactory to find all invocations of string.IndexOf() that are missing a StringComparison argument. 

The process is relatively straightforward: after parsing the code and setting up the type system as described above, we resolve all invocation expressions in each code file, and determine which overload of IndexOf() is getting called. 

var astResolver = new CSharpAstResolver(compilation, file.SyntaxTree, file.UnresolvedFile);
foreach (var invocation in file.SyntaxTree.Descendants.OfType<InvocationExpression>()) {
    // Retrieve semantics for the invocation:
    var rr = astResolver.Resolve(invocation) as InvocationResolveResult;
    if (rr == null) {
        // Not an invocation resolve result - could be an UnknownMemberResolveResult instead
        continue;
    }
    if (rr.Member.FullName != "System.String.IndexOf") {
        // Invocation isn't a string.IndexOf call
        continue;
    }
    if (rr.Member.Parameters.First().Type.FullName != "System.String") {
        // Ignore the overload that accepts a char, as that doesn't take a StringComparison.
        // (looking for a char always performs the expected ordinal comparison)
        continue;
    }
    if (rr.Member.Parameters.Last().Type.FullName == "System.StringComparison") {
        // Already using the overload that specifies a StringComparison
        continue;
    }
    Console.WriteLine(invocation.GetRegion() + ": " + invocation.GetText());
    file.IndexOfInvocations.Add(invocation);
}

Modifying C# Code  

We will now extend our sample application to automatically fix the programs by inserting ", StringComparison.Ordinal" in the correct spot. 

There are two separate approaches to code modification with NRefactory. The easiest one is to modify the syntax tree, and then generate code for the whole file from the modified tree. However, this has a major disadvantage: the whole file is reformatted (the syntax tree preserves comments, but not whitespace). Also, if there were syntax errors parsing the file, the parsed syntax tree might be incomplete; so outputting it again could remove portions of the file. 

We will instead use a more sophisticated approach: using the locations stored in the AST, we apply edits to the original source code file. NRefactory already contains a helper class for this purpose: DocumentScript.  

var document = new StringBuilderDocument(file.OriginalText);
var formattingOptions = FormattingOptionsFactory.CreateAllman();
var options = new TextEditorOptions();
using (var script = new DocumentScript(document, formattingOptions, options)) {
    foreach (InvocationExpression expr in file.IndexOfInvocations) {
        var copy = (InvocationExpression)expr.Clone();
        copy.Arguments.Add(new IdentifierExpression("StringComparison").Member("Ordinal"));
        script.Replace(expr, copy);
    }
}
File.WriteAllText(Path.ChangeExtension(file.FileName, ".output.cs"), document.Text); 

This code first loads the text into a document. An IDocument is a sort of StringBuilder (and in the case of StringBuilderDocument, uses an actual StringBuilder as the underlying data structure), but additionally supports mapping between offsets (integer that counts number of characters from the beginning of the file) and TextLocations (line/column pairs).  
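The mapping between offsets and locations can be sketched as follows. DocumentDemo is a hypothetical helper; the sketch assumes the usual NRefactory conventions that offsets are zero-based while lines and columns start at 1.

```csharp
using ICSharpCode.NRefactory;
using ICSharpCode.NRefactory.Editor;

static class DocumentDemo
{
    // Converts a line/column pair into a character offset.
    public static int OffsetOf(string text, int line, int column)
    {
        IDocument document = new StringBuilderDocument(text);
        return document.GetOffset(new TextLocation(line, column));
    }

    // Converts a character offset into a line/column pair.
    public static TextLocation LocationOf(string text, int offset)
    {
        IDocument document = new StringBuilderDocument(text);
        return document.GetLocation(offset);
    }
}
```

For the two-line text "ab\ncd", the character 'c' sits at line 2, column 1, which corresponds to offset 3 (two characters plus the line break precede it).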

The DocumentScript class allows applying AST-based refactorings to the document. The main feature of this class is that it maintains a mapping between positions in the original document, and positions in the current document. For example, inserting a new statement at the beginning of the document causes all other statements to move down one line, but the AST still contains the original locations. In such cases, DocumentScript automatically takes care of finding the new location of the AST node, so that further replacements work as expected.

Note that DocumentScript expects the replaced AST nodes to belong to the old document state (old state = at the time of the DocumentScript constructor call). Inconsistencies can cause the replacement to fail. To avoid bugs due to accidental modification of the old AST (e.g. a forgotten Clone() call), you can call file.SyntaxTree.Freeze() - this will cause AST modifications to throw exceptions, making such problems much easier to debug.

Instead of replacing the whole invocation, we could also perform a more targeted text insertion in front of the closing parenthesis: 

int offset = script.GetCurrentOffset(expr.RParToken.StartLocation);
script.InsertText(offset, ", StringComparison.Ordinal");

Generating Code from the Type System

Our AST replacement always inserts StringComparison.Ordinal, assuming that using System; exists. We can do better by using the TypeSystemAstBuilder helper class to let NRefactory generate the type reference for us. This will generate a short type reference where possible; or a fully qualified type reference where necessary.

To make this work, we need to extract the resolver context (available usings, declared variables, etc.) for the location where we want to insert code. 

// Generate a reference to System.StringComparison in this context:
var astBuilder = new TypeSystemAstBuilder(astResolver.GetResolverStateBefore(expr));
IType stringComparison = compilation.FindType(typeof(StringComparison));
AstType stringComparisonAst = astBuilder.ConvertType(stringComparison);
 
// Clone a portion of the AST and modify it
var copy = (InvocationExpression)expr.Clone();
copy.Arguments.Add(stringComparisonAst.Member("Ordinal"));
script.Replace(expr, copy); 

Using the TypeSystemAstBuilder, we can convert any type system entity back into C# AST nodes. This works not only for type references, but also for attributes, method declarations, etc. 

Summary  

I hope this article gave you an idea of how NRefactory works. If you want to know more, try playing with the demo application to learn more about syntax and semantic trees. If you have any questions, feel free to use the message board below.

History   

  • late 2009: first ideas for a major refactoring of SharpDevelop's code completion type system
  • 2010-07-26: NRefactory rewrite started
  • 2011-02-11: ILSpy starts using NRefactory 5 
  • 2012-04-14: MonoDevelop 3.0 released
  • 2012-05-18: NRefactory 5.0.1 released
  • 2012-08-11: NRefactory 5.2 released + this article is published   

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
