Introduction
In this article, I will explain how code generators and .NET reflection can heavily reduce code maintenance in industries that operate on large amounts of data and data testing. If you work as a developer at an insurance company, a bank or another financial institution, you surely know the issues these companies have with code maintenance. Chances are that a lot of programs in your company are based on a single but huge data format that changes every once in a while, and when that happens, you know you are in for weeks of pain. This article is meant to help you develop strategies to fight that process. In the end, you may have a single code generator that generates code, inserts it into your project and recompiles the project at the click of a button.
Background
At my workplace, most of my colleagues struggle with the lengthy task of maintaining historically grown code, without even thinking about automating the process to become more efficient. In a recent project, I was given the task of comparing huge structured data files with the results of calculations in Excel workbooks. Said files are around 8000 lines long and contain over 1000 properties to compare with those from the Excel workbooks.
I thought that the process of finding a way to write this code as maintenance-free as possible could be interesting for many developers working in industries that handle this much data, which is why I decided to write this article.
The Problems
At first glance, this may seem like a simple task - but it is not, once you take a closer look at what actually has to be achieved:
- High performance - we deal with around 5000 dynamically created test cases (stored in files that are 8000 lines long), each of which again consists of 20-30 services. The services themselves are stored in different files that are, again, around 8000 lines long.
- Maximum editability - the customers want to define the properties to be compared themselves, as the Excel workbooks and their formulas change almost weekly.
- Maximum configurability - for testing purposes, they also want to define which type of file is compared to which type of Excel workbook.
Everyone knows that high performance and "maximum anything" are usually enemies: the more control you give the user, the more dynamic - that is, the slower - your code has to be. We are trying to find a solution that does not hurt either side too much.
The Big Files with Data
Basically, we need to read and parse them into a C# structure with around 1000 members.
- Firstly, writing such code by hand takes a long time, and eventually a human being will lose track of that huge data monster.
- Secondly, the names of the members may change at any time (they do monthly, at least here) and new members may be introduced at any time.
- Thirdly, string comparisons do not perform well - in fact, they do not even work here, because our file has several headings that in turn contain several sub-members, and those members may have duplicate names at global file scope. That means we would have to do at least amount-of-headings * amount-of-sub-properties string comparisons per line, for 8000 lines. Nope.
The Solution: Code Generators
The solution to the problem is code generators. In this specific case, the generator reads the file, detects headings and sub-properties and generates several C# structures that represent the entire file. The implementation is entirely up to you - I will not elaborate on how to write one in this article, but maybe in another, if requested. Just make sure that along with the private members of the structures, you also generate public C# properties that return the member values. Do this for the actual values but also for the headings! You will soon find out that this is a key part of all the magic that happens later in the article.
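To make this concrete, here is a minimal sketch of what the emitting part of such a generator could look like (EmitStruct and its parameters are illustrative assumptions for this article, not my actual tooling):

using System.Collections.Generic;
using System.Text;

// Emits one C# struct per heading: a private member plus a public
// property for every sub-property found in the data file.
static string EmitStruct(string headingName, IEnumerable<string> propertyNames)
{
    var sb = new StringBuilder();
    sb.AppendLine($"struct {headingName}");
    sb.AppendLine("{");
    foreach (var name in propertyNames)
    {
        sb.AppendLine($"    private double m_{name.ToLower()};");
        sb.AppendLine($"    public double {name} {{ get {{ return m_{name.ToLower()}; }} }}");
    }
    sb.AppendLine("}");
    return sb.ToString();
}

For example, EmitStruct("Heading1", new[] { "Property1" }) produces exactly the kind of structure you will see again in the reflection section below.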
Now that we have a code generator that can regenerate the C# code any time the files change, the maintenance problem is solved. But what about our possibly millions of string comparisons? I came up with a pretty clever solution for this: while iterating through each property within the file, calculate the hash code of the property name and the hash code of the heading that property belongs to. Once you have done this, you can generate a switch statement that replaces the string comparisons with integer comparisons, which speeds the code up significantly. (Note: If you abs() and mod() the heading hashes accordingly (so that they do not overlap), you can force the headings into the range 0 to <num>; the compiler may then generate an actual jump table out of the switch, and you gain another bit of performance.)
The switch statement may look as follows:
switch (headingHash)
{
    case {generator_heading_hash}:
    {
        if (propertyHash == {generator_prop_hash})
            myStructure.myHeading.[...].{generator_prop_name} = currentValue;
        else if (...)
            // ... one branch per property under this heading ...
        break;
    }
}
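A word of caution here: string.GetHashCode() is not guaranteed to return the same value across .NET versions or platforms, so the generator and the parser must agree on the hash function. A small deterministic hash could look like this (a sketch - FNV-1a is my pick for illustration, not a requirement):

// Deterministic FNV-1a hash, used identically at generation time and
// at parse time (any stable hash works as long as both sides agree).
static int StableHash(string s)
{
    unchecked
    {
        uint hash = 2166136261;
        foreach (char c in s)
        {
            hash ^= c;
            hash *= 16777619;
        }
        return (int)hash;
    }
}

Both the generator (when emitting the case labels) and the parser (when hashing the current heading and property) would call StableHash.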
The Excel Workbooks
Our task is to read values from the workbooks and compare them to values from the big files I mentioned earlier. The interface or connection between those two formats must be chosen wisely: if you fail, you may suffer from huge maintenance problems or even performance problems. Before I was assigned the task, many departments of my company had tried to find a suitable solution. All of those approaches, though, were very unmaintainable and slow, since Excel interoperability has its costs and they dealt with the data cell by cell. The first approach of many of my colleagues looked something like this:
excelInput.setCell(1, 1, bigFileStructure.someHeading.[...].someProperty);
excelInput.setCell(1, 2, bigFileStructure.someHeading.[...].anotherProperty);
if (excelOutput.cell(1, 1).CastTo<double>() == bigFileStructure.[...].property)
The Maintenance Solution: XML Files and .NET Reflection
Coming up with this solution actually required putting on the thinking cap. On the one hand, I had to provide reasonable performance for the tests. On the other hand, the customers wanted to manipulate as much as possible in order to control what is tested and how. Eventually, I came up with a solution that fulfills both requirements. To begin with, let us take a look at the XML file I designed. The file is structured like this:
<inputs type="{calculator type}">
    <input row="1" column="1" property="{property tree}"/>
</inputs>
The Property Attribute and Where .NET Reflection Shines
As some of you might know, via .NET reflection you can obtain pointers to functions, pointers to members and some other fancy things by name. How can we possibly use this knowledge to map an actual property in the code to a lousy string in an XML file? Imagine that we have a property tree that is structured like this:
<input ... property="Heading1.Property1"/>
and a C# structure looking like this:
static MyData globalData = new MyData();

struct MyData
{
    private Heading1 m_heading1;
    public Heading1 Heading1 { get { return m_heading1; } }
}

struct Heading1
{
    private double m_property1;
    public double Property1 { get { return m_property1; } }
}
Then, you can finally parse the property tree as follows:
object current = globalData;
string[] entries = tree.Split('.');

// Walk down the property tree (e.g., "Heading1.Property1"),
// resolving every entry but the last one to its value.
for (int i = 0; i < entries.Length - 1; i++)
{
    var nextEntry = entries[i];
    var propInfo = current.GetType().GetProperty(nextEntry);
    if (propInfo != null)
        current = propInfo.GetValue(current);
}

// Bind the getter of the final property to the instance we ended up at.
var finalEntry = entries[entries.Length - 1];
var finalProp = current.GetType().GetProperty(finalEntry);
var method = (Func<double>) Delegate.CreateDelegate(
    typeof(Func<double>), current, finalProp.GetMethod);
listOfMethods.Add(method);
With the awesomeness of .NET reflection, we now have a list of delegates - function pointers, if you will - that we can simply use during the test run:
public class RunData
{
    public class RunEntry
    {
        public int Row { get; set; }
        public int Column { get; set; }
        public Func<double> Method { get; set; }
        public int MinRow { get; set; }
        public int MaxRow { get; set; }
        public int MinColumn { get; set; }
        public int MaxColumn { get; set; }
    }

    public List<RunEntry> Entries { get; } = new List<RunEntry>();
}

public void FillExcel(RunData data)
{
    foreach (var entry in data.Entries)
    {
        excelInput.setCell(entry.Column, entry.Row, entry.Method());
    }
}
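For completeness, wiring the XML file into a RunData instance might look roughly like this (a sketch - CreateGetter is a hypothetical helper that wraps the reflection snippet shown above):

using System.Xml.Linq;

// CreateGetter is assumed to turn a property tree string such as
// "Heading1.Property1" into a bound Func<double> via reflection.
public RunData LoadRunData(string xmlPath)
{
    var data = new RunData();
    foreach (var input in XDocument.Load(xmlPath).Root.Elements("input"))
    {
        data.Entries.Add(new RunData.RunEntry
        {
            Row    = (int)input.Attribute("row"),
            Column = (int)input.Attribute("column"),
            Method = CreateGetter((string)input.Attribute("property"))
        });
    }
    return data;
}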
The advantages of this method are obvious - we never have to recompile the code (unless we decide to update the interface provided by the code generator), so the customer does not have to call us when they want a new property to be compared. They simply modify the underlying XML file to achieve what they want! A win-win situation.
The only drawback is that we have to use a static object in order to pre-parse the data at startup and retrieve the delegates. While that is not a problem in a single-threaded test run, it causes severe issues in multi-threaded test runs. The ugly solution is to simply define multiple static instances of your data class and create delegates for all of them. Keep in mind that this can significantly reduce startup performance.
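If you do go multi-threaded, one way to realize this workaround is to pre-parse one instance (and its delegate list) per worker thread. A sketch, under the assumption of a hypothetical LoadRunDataFor helper:

using System;
using System.Linq;

// One pre-parsed data instance plus its delegate list per worker thread;
// worker i exclusively uses s_data[i] / s_runData[i]. LoadRunDataFor is a
// hypothetical variant of LoadRunData that binds its delegates to the
// given instance instead of the single static one.
static readonly MyData[] s_data =
    Enumerable.Range(0, Environment.ProcessorCount)
              .Select(_ => new MyData())
              .ToArray();
static readonly RunData[] s_runData =
    s_data.Select(d => LoadRunDataFor(d)).ToArray();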
The Performance Solution: Range-based Reading and Writing
While our code is perfectly maintenance-free by now, we still read and write cell by cell, and that is a lengthy task that ate almost 8 seconds of runtime (!) per test case per service. The problem is that for every cell we read or write, an interop call to the Excel assembly has to be issued, and the interop overhead costs a lot more than actually manipulating the cell. If we instead write an entire range at once, the overhead is reduced to a single interop call, which perfectly suits our needs. But range-based reading and writing is not as easy as it may sound: some of the input values are default values and must not be overwritten while writing to the workbook, and the input sheet does not require you to consistently fill something in row by row - there is blank space in between.
Solution: Reading the Default Values at Startup
How about we pre-fetch all the values at startup and write them into a multidimensional array beforehand, so we can modify the array during the test run and write it back into the Excel-worksheet?
private object[,] m_inputValues;
private string m_a1ref;

public void Initialize()
{
    // minrow/mincol/maxrow/maxcol are the extreme coordinates we
    // remembered while parsing the XML file. Convert the R1C1 range
    // they span into an A1-style reference and strip the leading '='.
    m_a1ref = app.ConvertFormula($"=R{minrow}C{mincol}:R{maxrow}C{maxcol}",
        XlReferenceStyle.xlR1C1, XlReferenceStyle.xlA1).ToString().Substring(1);

    // A single interop call reads the whole range; blank cells are null.
    m_inputValues = (object[,]) worksheet.Range[m_a1ref].Value2;
}

public void FillExcel(RunData data)
{
    foreach (var entry in data.Entries)
    {
        // Value2 arrays are 1-based and indexed [row, column];
        // offset the indices if your range does not start at A1.
        m_inputValues[entry.Row, entry.Column] = entry.Method();
    }

    // A single interop call writes everything back.
    worksheet.Range[m_a1ref].Value2 = m_inputValues;
}
Explanation: At startup, we call Initialize. That function reads all default values into a private member variable at once; blank cells will simply be null afterwards. We also do not carelessly read an arbitrary range of unspecified size - while parsing the XML file, we remember the minimum and maximum row/column encountered in the entire file, and those values span the range we need from Excel. When we then insert the data into Excel in FillExcel, we basically iterate through every RunData entry while modifying the multidimensional array we acquired at startup. (Note: You simply call a delegate like any other function in C#, using parentheses.)
With this solution, I scaled the runtime down from ~8 seconds to ~150 milliseconds per test case per service!
Closing Words
Writing code that is maintenance-free and performant at the same time is a hard task, especially if you work in the insurance, finance or banking industry, where data formats continuously change over time. I hope this article helps you counteract this process by writing code generators, using .NET reflection in combination with delegates, and inventing simple but efficient XML formats that allow the customer to maintain the tool's parameters themselves, if ever needed.
If you have any questions, suggestions or criticism, let me know in the comments below. I will try to answer them as soon as possible.
History
- 26th September, 2017: Initial version of the article