Table of Contents
- Introduction
- Prerequisites
- Transfer process
- Removing the unnecessary code
- Preprocessor and conditional compilation
- switch and goto operators
- Time to gather stones
- Preprocessor again and multiple inheritance
- typedef operator
- Pointer arithmetic
- Function pointers
- Isolation of the "problem code"
- Changing compiler
- Making it all work
In this article, I shall describe one of the methods that can be used to transform C/C++ code into C# code with the least amount of effort. The principles laid out here are also suitable for other pairs of languages, though. I want to warn you straight off that this method is not applicable to porting GUI-related code.
What is this useful for? For example, I have used this method to port libtiff, the well-known TIFF library, to C# (and libjpeg, too). This allowed me to reuse the work of the many people who contributed to libtiff, along with the .NET Framework Class Library, in my program. The code examples in this article are taken mainly from the libtiff and libjpeg libraries.
What you will need:
- Original code that you can build "in one click"
- A set of tests, also runnable "in one click"
- A version control system
- Some basic understanding of refactoring principles
The "one-click" requirement for builds and test runs is there to speed up the "change - compile - run tests" cycle as much as possible. The more time and effort each cycle takes, the fewer times it will be executed, which encourages large batches of changes and may lead to massive and complex roll-backs of erroneous changes.
You can use any version control system. I personally use Subversion, but pick whichever one you're comfortable with. Anything other than a set of folders on the hard disk will do.
Tests are required to make sure that the code retains all of its features at any given time. Being safe in the knowledge that no functional changes are introduced into the code is what sets my method apart from the "let's rewrite it from scratch in the new language" approach. Tests are not required to cover 100% of the code, but it's desirable to have tests for all of its main features. The tests shouldn't access the internals of the code; otherwise you'll be constantly rewriting them.
Here's what I used to port LibTiff:
- A set of images in TIFF format
- tiffcp, the command-line utility that converts TIFF images between different compression schemes
- A set of batch scripts that use tiffcp for conversion tasks
- A set of reference output images
- A program that performs binary comparison of output images with the set of reference images
To grasp refactoring concepts, you only need to read one book: Martin Fowler's Refactoring: Improving the Design of Existing Code. Be sure to read it if you haven't already. Any programmer can only gain from knowing refactoring principles. You don't even have to read the entire book; the first 130 pages are enough. That covers the first five chapters and the beginning of the sixth, up to the "Inline Method" section.
It goes without saying that the better you know the source and destination languages, the easier the transformation will go. Please note that deep knowledge of the internals of the original code is not required when you begin. It's enough to understand what the original code does; a deeper understanding of how it does it will come in the process.
The essence of the method is that the original code is simplified through a series of small, simple refactorings. You shouldn't take a large chunk of code and try to transform it all at once. Progress in small steps, run the tests after every change, and save every successful modification. That is: make a small change, test it, and if all is well, commit the change to the VCS repository.
The transfer process can be broken down into three big stages:
- Replacement of everything in the original code that uses language-specific features with something simpler but functionally equivalent. This frequently leads to slower and less neat-looking code, but don't let that concern you at this stage.
- Modification of the altered code so that it can be compiled in the new language.
- Transferal of the tests and making the functionality of the new code match the code in the source language.
Only after completing these stages should you look at the speed and the beauty of the code.
The first stage is the most complex. Its goal is to refactor the C/C++ code into "pure C++" code whose syntax is as close to C# syntax as possible. This stage means getting rid of:
- preprocessor directives
- goto operators
- typedef operators
- pointer arithmetic
- function pointers
- free (non-member) functions
Let us go over these steps in detail.
First of all, we should get rid of the unused code. For instance, in the case of libtiff, I removed the files that were not used to build the Windows version of the library. Then, in the remaining files, I found all the conditional compilation directives ignored by the Visual Studio compiler and removed them as well. Some examples are given below:
#if defined(__BORLANDC__) || defined(__MINGW32__)
# define XMD_H 1
#endif
#if 0
extern const int jpeg_zigzag_order[];
#endif
In many cases, the source code contains unused functions. They should be sent off to greener pastures, too.
Frequently, conditional compilation is used for creating specialized versions of the program. That is, some files contain #define directives, while code in other files is enclosed in #ifdef and #endif. Example:
.....
#define BMP_SUPPORTED
#define GIF_SUPPORTED
.....
....
#ifdef BMP_SUPPORTED
...
#endif
I would suggest selecting what to use straight away and getting rid of conditional compilation. For example, should you decide that BMP format support is necessary, you should remove #ifdef BMP_SUPPORTED from the entire code base.
If you do have to keep the possibility of creating several versions of the program, you should have tests for every version. I suggest keeping the most complete version around and working with it. After the transition is complete, you may add the conditional compilation directives back in.
But we are not done with the preprocessor yet. It's necessary to find preprocessor macros that emulate functions and change them into real functions.
#define CACHE_STATE(tif, sp) do { \
BitAcc = sp->data; \
BitsAvail = sp->bit; \
EOLcnt = sp->EOLcnt; \
cp = (unsigned char*) tif->tif_rawcp; \
ep = cp + tif->tif_rawcc; \
} while (0)
To produce a proper signature for the function, you need to determine the types of all the arguments. Please note that BitAcc, BitsAvail, EOLcnt, cp and ep are assigned within the macro. These variables will become arguments of the new function, and they should be passed by reference. That is, you should use uint32& for BitAcc in the function's signature.
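As a sketch, the CACHE_STATE macro above might become the function below. The struct definitions are reduced to just the fields the macro touches; the real libtiff types contain much more.

```cpp
#include <cstdint>

// Simplified stand-ins for the real libtiff structures; the field
// names follow the macro above, everything else is omitted.
struct Fax3State { uint32_t data; int bit; int EOLcnt; };
struct TIFFFile { unsigned char* tif_rawcp; int tif_rawcc; };

// The macro body, turned into a real function. Every variable the
// macro assigned to becomes a by-reference parameter.
void CacheState(TIFFFile* tif, Fax3State* sp,
                uint32_t& BitAcc, int& BitsAvail, int& EOLcnt,
                unsigned char*& cp, unsigned char*& ep)
{
    BitAcc = sp->data;
    BitsAvail = sp->bit;
    EOLcnt = sp->EOLcnt;
    cp = tif->tif_rawcp;
    ep = cp + tif->tif_rawcc;
}
```

The by-reference parameters are what make the function a faithful replacement: the macro assigned to variables in the caller's scope, and the references preserve that behavior.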
Programmers sometimes abuse the preprocessor. Check out an example of such misuse:
#define HUFF_DECODE(result,state,htbl,failaction,slowlabel) \
{ register int nb, look; \
if (bits_left < HUFF_LOOKAHEAD) { \
if (! jpeg_fill_bit_buffer(&state,get_buffer,bits_left, 0)) {failaction;} \
get_buffer = state.get_buffer; bits_left = state.bits_left; \
if (bits_left < HUFF_LOOKAHEAD) { \
nb = 1; goto slowlabel; \
} \
} \
look = PEEK_BITS(HUFF_LOOKAHEAD); \
if ((nb = htbl->look_nbits[look]) != 0) { \
DROP_BITS(nb); \
result = htbl->look_sym[look]; \
} else { \
nb = HUFF_LOOKAHEAD+1; \
slowlabel: \
if ((result=jpeg_huff_decode(&state,get_buffer,bits_left,htbl,nb)) < 0) \
{ failaction; } \
get_buffer = state.get_buffer; bits_left = state.bits_left; \
} \
}
In the code above, PEEK_BITS and DROP_BITS are also "functions", created similarly to HUFF_DECODE. In this case, the most reasonable approach is probably to inline the code of the PEEK_BITS and DROP_BITS "functions" into HUFF_DECODE to ease the transformation.
You should go on to the next stage of refining the code only when nothing but the most harmless preprocessor directives (like the one below) are left.
#define DATATYPE_VOID 0
You can get rid of goto operators by introducing boolean variables and/or restructuring the function's code. For example, if a function has a loop that uses goto to break out of it, that construction can be changed into setting a boolean variable, a break statement, and a check of the variable's value after the loop.
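A minimal illustration of that transformation (the search loop itself is invented for the example):

```cpp
#include <cstddef>

// Before: "goto found;" jumped out of the loop to a label below it.
// After: a boolean flag plus break, checked once the loop is done.
int findFirstNegative(const int* data, size_t count)
{
    bool found = false;
    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        if (data[i] < 0) {
            found = true;
            pos = i;
            break;          // replaces "goto found;"
        }
    }
    if (!found)             // replaces the code after the label
        return -1;
    return static_cast<int>(pos);
}
```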
My next step is to scan the code for all the switch statements containing a case without a matching break.
switch ( test1(buf) )
{
case -1:
if ( line != buf + (bufsize - 1) )
continue;
default:
fputs(buf, out);
break;
}
This is allowed in C++, but not in C#. Such switch statements can be replaced with if blocks, or you can duplicate the shared code if the fall-through case only takes up a couple of lines.
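A sketch of the duplication approach, on a simplified, invented function (not the tiffcp code above): the case that used to fall through gets its own copy of the shared tail.

```cpp
// Hypothetical example: in the original switch, case 1 ran its own
// check and then fell through into default. The if version duplicates
// the short "default" tail instead.
int classify(int test, int& handled)
{
    // Original C++:
    //   switch (test) {
    //   case 1:
    //       if (handled > 0)
    //           return 0;   // early exit, otherwise fall through
    //   default:
    //       handled++;
    //       break;
    //   }
    if (test == 1) {
        if (handled > 0)
            return 0;
        handled++;          // duplicated "default" tail
        return 1;
    }
    handled++;              // the default branch itself
    return 1;
}
```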
Everything that I have described so far is not supposed to take much time - not compared to what lies ahead. The first massive task we're facing is combining data and functions into classes. What we're aiming for is making every function a method of some class.
If the code was initially written in C++, it will probably contain few free (non-member) functions. In this case, you should look for relationships between the existing classes and the free functions. Usually, it turns out that free functions play an ancillary role for the classes. If a function is only used by one class, it can be moved into that class as a static method. If a function is used by several classes, a new class can be created with the function as its static member.
If the code was created in C, there will be no classes in it. They'll have to be created from the ground up by grouping functions around the data they manipulate. Fortunately, this logical relationship is usually quite easy to figure out, especially if the C code was written with some OOP principles in mind.
Let's examine the example below:
struct tiff
{
char* tif_name;
int tif_fd;
int tif_mode;
uint32 tif_flags;
......
};
...
extern int TIFFDefaultDirectory(tiff*);
extern void _TIFFSetDefaultCompressionState(tiff*);
extern int TIFFSetCompressionScheme(tiff*, int);
...
It's easy to see that the tiff struct begs to become a class, and the three functions declared below it beg to become public methods of that class. So, we change struct to class and turn the three functions into static methods of the class.
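An intermediate sketch of that change (the fields and method names are trimmed and simplified; the real libtiff functions do much more):

```cpp
#include <cstdint>

// The struct becomes a class; the free functions move in as static
// methods that, for now, still take the object as their first argument.
class Tiff
{
public:
    int tif_mode;
    uint32_t tif_flags;

    static void SetDefaultCompressionState(Tiff* tif)
    {
        tif->tif_flags = 0;
    }

    static int SetCompressionScheme(Tiff* tif, int scheme)
    {
        tif->tif_mode = scheme;
        return 1;
    }
};
```

Keeping the methods static at first means every call site changes only minimally (TIFFSetCompressionScheme(t, s) becomes Tiff::SetCompressionScheme(t, s)), which keeps each refactoring step small and testable.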
As most functions become methods of different classes, it becomes easier to understand what to do with the remaining non-member functions. Don't forget that not all free functions will become public methods; there are usually a few ancillary functions not intended for use from the outside, and these will become private methods.
After the free functions have been changed into static methods of classes, I suggest getting down to replacing calls to the malloc/free functions with the new/delete operators and adding constructors and destructors. Then the static methods can be gradually turned into full-blown instance methods. As more and more static methods are converted to non-static ones, it will become clear that at least one of their arguments is redundant: the pointer to the original struct that has become the class. It may also turn out that some arguments of private methods can become member variables.
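A sketch of that step, using an invented class with one field: the pointer argument of the static method becomes "this" and disappears from the signature.

```cpp
// Before: static int SetCompressionScheme(Tiff* tif, int scheme);
// After: the Tiff* argument is redundant - it is now "this".
class Tiff
{
public:
    int tif_mode = 0;

    int SetCompressionScheme(int scheme)
    {
        tif_mode = scheme;   // was tif->tif_mode = scheme;
        return 1;
    }
};
```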
Now that a set of classes has replaced the set of functions and structs, it's time to get back to the preprocessor - that is, to defines like the one below (no other kind should remain by now):
#define STRIP_SIZE_DEFAULT 8192
Such defines should be turned into constants, and you should find or create an owner class for each of them. As with the functions, the newly-created constants may require creating a special new class (perhaps called Constants). And as with the functions, the constants may have to be public or private.
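For instance, the define above could land in a class like this (the owner class name is my own invention; in the real code it would be whichever class actually uses the value):

```cpp
// The #define becomes a class-owned constant. This maps directly to
// "public const int" or "private const int" once the code is in C#.
class TiffStrip
{
public:
    static const int STRIP_SIZE_DEFAULT = 8192;
};
```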
If the original code was written in C++, it may rely on multiple inheritance. This is another thing to get rid of before converting the code to C#. One way to deal with it is to change the class hierarchy so that multiple inheritance is excluded. Another way is to make sure that all the base classes of a class that uses multiple inheritance contain only pure virtual methods and no member variables. For example:
class A
{
public:
virtual bool DoSomething() = 0;
};
class B
{
public:
virtual bool DoAnother() = 0;
};
class C : public A, public B
{ ... };
This kind of multiple inheritance can be easily transferred to C# by declaring A and B classes as interfaces.
Before going over to the next large-scale task (getting rid of pointer arithmetic), we should pay special attention to type synonym declarations (the typedef operator). Sometimes these are used as shorthand for longer types. For instance:
typedef vector<command*> Commands;
I prefer to inline such declarations - that is, to locate every use of Commands in the code, change it to vector<command*>, and delete the typedef.
A more interesting case of typedef usage is this:
typedef signed char int8;
typedef unsigned char uint8;
typedef short int16;
typedef unsigned short uint16;
typedef int int32;
typedef unsigned int uint32;
Mind the names of the types being created. Obviously, typedef short int16 and typedef int int32 are more of a hindrance than a help, so it makes sense to change int16 to short and int32 to int throughout the code. The other typedefs, on the contrary, are quite useful. It's a good idea, however, to rename them so that they match the type names in C#, like so:
typedef signed char sbyte;
typedef unsigned char byte;
typedef unsigned short ushort;
typedef unsigned int uint;
Special attention should be paid to declarations similar to the following one:
typedef unsigned char JBLOCK[64];
This declaration defines JBLOCK as an array of 64 elements of type unsigned char. I prefer to convert such declarations into classes - in other words, to create a JBLOCK class that serves as a wrapper around the array and implements methods to access its individual elements. This makes it easier to understand how arrays of JBLOCKs (particularly 2- and 3-dimensional ones) are created, used and destroyed.
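A minimal wrapper along those lines (the method names are my own; the real class would likely add range checks and bulk operations):

```cpp
// Wraps the fixed-size array from "typedef unsigned char JBLOCK[64];"
// so that creation and element access become explicit, named methods.
class JBlock
{
public:
    static const int Size = 64;

    JBlock()
    {
        for (int i = 0; i < Size; i++)
            data[i] = 0;
    }

    unsigned char Get(int index) const { return data[index]; }
    void Set(int index, unsigned char value) { data[index] = value; }

private:
    unsigned char data[Size];
};
```

A 2-dimensional array of JBLOCKs then becomes an ordinary container of JBlock objects, with no hidden pointer decay to reason about.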
Another large-scale task is getting rid of pointer arithmetic. Many C/C++ programs rely quite heavily on this feature of the language.
E.g.:
void horAcc32(int stride, uint* wp, int wc)
{
if (wc > stride) {
wc -= stride;
do {
wp[stride] += wp[0];
wp++;
wc -= stride;
} while ((int)wc > 0);
}
}
Such functions have to be rewritten, because pointer arithmetic is unavailable in C# by default. You can use it in unsafe code, but unsafe code has its disadvantages, which is why I prefer to rewrite such code using "index arithmetic". It goes like this:
void horAcc32(int stride, uint* wp, int wc)
{
int wpPos = 0;
if (wc > stride) {
wc -= stride;
do {
wp[wpPos + stride] += wp[wpPos];
wpPos++;
wc -= stride;
} while ((int)wc > 0);
}
}
The resulting function does the same job, but uses no pointer arithmetic and can be easily ported to C#. It could also be somewhat slower than the original, but again, this is not our priority for now.
Special attention should be paid to the functions that change pointers passed to them as arguments. Below is an example of such a function:
void horAcc32(int stride, uint* & wp, int wc)
In this case, changing wp inside horAcc32 changes the pointer in the calling function as well. Introducing an index is still a suitable approach here: you just need to define the index in the calling function and pass it to horAcc32 by reference.
void horAcc32(int stride, uint* wp, int& wpPos, int wc)
It is often convenient to turn the wpPos index into a member variable.
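As a sketch, here is the index-by-reference variant of horAcc32 (uint is spelled unsigned int, since the typedef is not shown here); the caller can observe where the position ended up after the call:

```cpp
// Index-based version of horAcc32: the buffer pointer never moves,
// and the caller's position is advanced through the wpPos reference.
void horAcc32(int stride, unsigned int* wp, int& wpPos, int wc)
{
    if (wc > stride) {
        wc -= stride;
        do {
            wp[wpPos + stride] += wp[wpPos];
            wpPos++;
            wc -= stride;
        } while (wc > 0);
    }
}
```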
After pointer arithmetic is out of the way, it is time to deal with function pointers (if there are any in the code). Function pointer usage falls into three cases:
- Function pointers are created and used within one class / function
- Function pointers are created and used by different classes in the program
- Function pointers are created by the users and are passed into the program (program in this case is a dynamically or statically created library)
An example of the first type:
typedef int (*func)(int x, int y);
class Calculator
{
Calculator();
int (*func)(int x, int y);
static int sum(int x, int y) { return x + y; }
static int mul(int x, int y) { return x * y; }
public:
static Calculator* CreateSummator()
{
Calculator* c = new Calculator();
c->func = sum;
return c;
}
static Calculator* CreateMultiplicator()
{
Calculator* c = new Calculator();
c->func = mul;
return c;
}
int Calc(int x, int y) { return (*func)(x,y); }
};
In this case, the functionality of the Calc method varies depending on which of the CreateSummator and CreateMultiplicator methods was called to create the instance of the class. I prefer to create a private enum in the class that describes all the possible choices of functionality, plus a field that holds a value of that enum. Then, instead of a function pointer, I create a method consisting of a switch statement (or several ifs) that selects the necessary function based on the field's value. The changed code:
class Calculator
{
enum FuncType
{ ftSum, ftMul };
FuncType type;
Calculator();
int func(int x, int y)
{
if (type == ftSum)
return sum(x,y);
return mul(x,y);
}
static int sum(int x, int y) { return x + y; }
static int mul(int x, int y) { return x * y; }
public:
static Calculator* createSummator()
{
Calculator* c = new Calculator();
c->type = ftSum;
return c;
}
static Calculator* createMultiplicator()
{
Calculator* c = new Calculator();
c->type = ftMul;
return c;
}
int Calc(int x, int y) { return func(x,y); }
};
You can also choose another way: change nothing for the moment and use delegates when transferring the code to C#.
An example for the second case (function pointers are created and used by different classes of the program):
typedef int (*TIFFVSetMethod)(TIFF*, ttag_t, va_list);
typedef int (*TIFFVGetMethod)(TIFF*, ttag_t, va_list);
typedef void (*TIFFPrintMethod)(TIFF*, FILE*, long);
class TIFFTagMethods
{
public:
TIFFVSetMethod vsetfield;
TIFFVGetMethod vgetfield;
TIFFPrintMethod printdir;
};
This situation is best resolved by turning vsetfield/vgetfield/printdir into virtual methods. Code that used to assign vsetfield/vgetfield/printdir will instead have to create a class derived from TIFFTagMethods with the required implementation of the virtual methods.
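Sketched on a reduced version of the class (the signatures are simplified to plain ints for the example; the real ones take TIFF*, tags and va_list):

```cpp
// Instead of storing function pointers, the class exposes virtual
// methods with default implementations; users derive and override.
class TiffTagMethods
{
public:
    virtual ~TiffTagMethods() {}
    virtual int vsetfield(int tag, int value) { return 0; }
    virtual int vgetfield(int tag) { return 0; }
};

// Code that used to assign a function pointer now supplies a subclass.
class MyTagMethods : public TiffTagMethods
{
public:
    int lastTag = -1;
    int vsetfield(int tag, int value) override
    {
        lastTag = tag;     // stand-in for real tag handling
        return value;
    }
};
```

This shape survives the move to C# almost unchanged: virtual methods and overrides exist in both languages, which is exactly why it beats function pointers here.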
An example of the third case (function pointers are created by users and passed into the program):
typedef int (*PROC)(int, int);
int DoUsingMyProc (int, int, PROC lpMyProc, ...);
Delegates are best suited here. That is, at this stage, while the original code is still being polished, nothing should be done. Later, when the project is transferred to C#, a delegate should be created instead of PROC, and the DoUsingMyProc function should be changed to accept an instance of the delegate as an argument.
The last change to the original code is the isolation of anything that may be a problem for the new compiler. This may be code that actively uses the standard C/C++ library (functions like fprintf, gets, atof and so on) or WinAPI. In C#, such code will have to be rewritten using .NET Framework methods or, if need be, the P/Invoke technique. Take a look at the www.pinvoke.net site in the latter case.
"Problem code" should be localized as much as possible. To this end, you can create a wrapper class for the functions from the C/C++ standard library or WinAPI; later, only this wrapper will have to be changed.
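For example, C runtime calls can be funneled through one thin wrapper class. This is a sketch with a single formatting helper and invented names; a real wrapper would cover file I/O and whatever else the code base touches:

```cpp
#include <cstdio>
#include <string>

// All direct uses of the C runtime go through this class. When
// porting, only this class gets rewritten on top of the .NET BCL
// (here, IntToString would become value.ToString() or String.Format).
class RuntimeWrapper
{
public:
    // Stands in for the sprintf("%d", ...) calls scattered in the code.
    static std::string IntToString(int value)
    {
        char buf[32];
        std::snprintf(buf, sizeof(buf), "%d", value);
        return std::string(buf);
    }
};
```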
This is the moment of truth - the time to bring the changed code into a new project built with the C# compiler. It's quite trivial, but labor-intensive: create a new empty project, add the necessary classes to it, and copy the code from the corresponding original classes into them.
You'll have to remove the ballast at this stage (various #includes, for instance) and make some cosmetic modifications. "Standard" modifications include:
- combining code from .h and .cpp files
- replacing obj->method() with obj.method()
- replacing Class::StaticMethod with Class.StaticMethod
- removing * in func(A* anInstance)
- replacing func(int& x) with func(ref int x)
Most of the modifications are not particularly complex, but some of the code will have to be commented out - mostly the "problem code" discussed above. The main goal here is to get C# code that compiles. It most probably won't work yet, but we'll come to that in due time.
After the converted code compiles, we need to adjust it until its functionality matches the original. For that, we need to create a second set of tests that exercises the converted code. The methods commented out earlier need to be carefully revised and rewritten using the .NET Framework. I think this part needs no further explanation; I just want to expand on a few fine points.
When creating strings from byte arrays (and vice versa), the encoding should be selected carefully. Encoding.ASCII should be avoided due to its 7-bit nature: bytes with values higher than 127 will become "?" instead of proper characters. It's better to use Encoding.Default or Encoding.GetEncoding("Latin1"). The actual choice depends on what happens to the text or the bytes next. If the text is to be displayed to the user, Encoding.Default is the better choice; if the text is to be converted back to bytes and saved into a binary file, Encoding.GetEncoding("Latin1") suits better.
Output of formatted strings (code related to the printf family of functions in C/C++) may present certain problems. The functionality of String.Format in the .NET Framework is both poorer and different in syntax. This problem can be solved in two ways:
- Create a class that mimics the functionality of the printf functions
- Change the format strings so that String.Format shows the same result (not always possible)
Check out "A printf implementation in C#" if you choose the first option.
I prefer the second option. If you choose it too, then a Google search for "c# format specifiers" (without the quotes) and the "Format Specifiers" appendix from C# in a Nutshell may prove useful.
When all the tests that use the converted code pass successfully, we can be sure that the conversion is complete. Now we can return to the fact that the code does not quite conform to C# idioms (for example, it is full of get/set methods instead of properties) and refactor the converted code accordingly. You may also use a profiler to identify bottlenecks in the code and optimize them. But that's quite a different story.
Happy porting!