Table of Contents
- Introduction
- Prerequisites
- Transfer process
- Removing the unnecessary code
- Preprocessor and conditional compilation
- switch and goto operators
- Time to gather stones
- Preprocessor again and multiple inheritance
- typedef operator
- Pointer arithmetic
- Function pointers
- Isolation of the "problem code"
- Changing compiler
- Making it all work
In this article, I shall describe one of the methods that can be used to transform C/C++ code into C# code with the least amount of effort. The principles laid out here are also suitable for other pairs of languages, though. I want to warn you straight off that this method is not applicable to porting GUI-related code.
What is this useful for? For example, I have used this method to port libtiff, the well-known TIFF library, to C# (and libjpeg, too). This allowed me to reuse the work of the many people who contributed to libtiff, along with the .NET Framework Class Library, in my program. The code examples in this article are taken mainly from the libtiff and libjpeg libraries.
What you will need:
- Original code that you can build "in one click"
- A set of tests, also runnable "in one click"
- A version control system
- Some basic understanding of refactoring principles
The "one-click" requirement for builds and test runs is there to speed up the "change - compile - run tests" cycle as much as possible. The more time and effort each cycle takes, the fewer times it will be executed, which encourages large batches of changes and may lead to massive and complex roll-backs of erroneous changes.
You can use any version control system. I personally use Subversion, but pick whichever one you're comfortable with. Anything other than a set of folders on the hard disk will do.
Tests are required to make sure that the code retains all of its features at any given time. Being safe in the knowledge that no functional changes are introduced into the code is what sets my method apart from the "let's rewrite it from scratch in the new language" approach. Tests are not required to cover 100% of the code, but it's desirable to have tests for all of its main features. The tests shouldn't access the internals of the code; otherwise you'll be constantly rewriting them.
Here's what I used to port LibTiff:
- A set of images in TIFF format
- tiffcp, the command-line utility that converts TIFF images between different compression schemes
- A set of batch scripts that use tiffcp for conversion tasks
- A set of reference output images
- A program that performs binary comparison of output images with the set of reference images
To grasp refactoring concepts, you only need to read one book: Martin Fowler's Refactoring: Improving the Design of Existing Code. Be sure to read it if you haven't already. Any programmer can only gain from knowing refactoring principles. You don't even have to read the entire book; the first 130 pages are enough. That covers the first five chapters and the beginning of the sixth, up to the "Inline Method" section.
It goes without saying that the better you know the source and destination languages, the easier the transformation will go. Please note that deep knowledge of the internals of the original code is not required when you begin. It's enough to understand what the original code does; a deeper understanding of how it does it will come in the process.
The essence of the method is that the original code is simplified through a series of small, simple refactorings. You shouldn't take a large chunk of code and try to transform it all at once. Progress in small steps, run the tests after every change, and save every successful modification. That is: make a small change, test it, and if all is well, commit the change to the VCS repository.
The transfer process can be broken down into three big stages:
- Replacement of everything in the original code that uses language-specific features with something simpler but functionally equivalent. This frequently leads to slower and less neat-looking code, but don't let that concern you at this stage.
- Modification of the altered code so that it can be compiled in the new language.
- Transferal of the tests and making the functionality of the new code match the code in the source language.
Only after completing these stages should you look at the speed and the beauty of the code.
The first stage is the most complex. Its goal is to refactor the C/C++ code into "pure C++" code whose syntax is as close to C# syntax as possible. This stage means getting rid of:
- preprocessor directives
- goto operators
- typedef operators
- pointer arithmetic
- function pointers
- free (non-member) functions
Let us go over these steps in detail.
First of all, we should get rid of the unused code. For instance, in the case of libtiff, I removed the files that were not used to build the Windows version of the library. Then, in the remaining files, I found all the conditional compilation directives ignored by the Visual Studio compiler and removed them as well. Some examples are given below:
#if defined(__BORLANDC__) || defined(__MINGW32__)
# define XMD_H 1
#endif
#if 0
extern const int jpeg_zigzag_order[];
#endif
In many cases, the source code contains unused functions. They should be sent off to greener pastures, too.
Frequently, conditional compilation is used for creating specialized versions of the program. That is, some files contain #define directives, while code in other files is enclosed in #ifdef and #endif. Example:
.....
#define BMP_SUPPORTED
#define GIF_SUPPORTED
.....
....
#ifdef BMP_SUPPORTED
...
#endif
I would suggest selecting what to use straight away and getting rid of conditional compilation. For example, should you decide that BMP format support is necessary, you should remove #ifdef BMP_SUPPORTED from the entire code base.
If you do have to keep the possibility of creating several versions of the program, you should have tests for every version. I suggest keeping the most complete version around and working with it. After the transition is complete, you may add the conditional compilation directives back in.
But we are not done with the preprocessor yet. It's necessary to find preprocessor macros that emulate functions and change them into real functions.
#define CACHE_STATE(tif, sp) do { \
BitAcc = sp->data; \
BitsAvail = sp->bit; \
EOLcnt = sp->EOLcnt; \
cp = (unsigned char*) tif->tif_rawcp; \
ep = cp + tif->tif_rawcc; \
} while (0)
To produce a proper signature for the function, you need to determine the types of all the arguments. Please note that BitAcc, BitsAvail, EOLcnt, cp and ep are assigned within the macro. These variables will become arguments of the new function, and they should be passed by reference. That is, you should use uint32& for BitAcc in the function's signature.
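As a sketch, the CACHE_STATE macro above might become the function below. The struct definitions are reduced to just the fields the macro touches; the real libtiff types contain much more.

```cpp
#include <cstdint>

// Simplified stand-ins for the real libtiff structures; the field
// names follow the macro above, everything else is omitted.
struct Fax3State { uint32_t data; int bit; int EOLcnt; };
struct TIFFFile { unsigned char* tif_rawcp; int tif_rawcc; };

// The macro body, turned into a real function. Every variable the
// macro assigned to becomes a by-reference parameter.
void CacheState(TIFFFile* tif, Fax3State* sp,
                uint32_t& BitAcc, int& BitsAvail, int& EOLcnt,
                unsigned char*& cp, unsigned char*& ep)
{
    BitAcc = sp->data;
    BitsAvail = sp->bit;
    EOLcnt = sp->EOLcnt;
    cp = tif->tif_rawcp;
    ep = cp + tif->tif_rawcc;
}
```

The by-reference parameters are what make the function a faithful replacement: the macro assigned to variables in the caller's scope, and the references preserve that behavior.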
Programmers sometimes abuse the preprocessor. Check out an example of such misuse:
#define HUFF_DECODE(result,state,htbl,failaction,slowlabel) \
{ register int nb, look; \
if (bits_left < HUFF_LOOKAHEAD) { \
if (! jpeg_fill_bit_buffer(&state,get_buffer,bits_left, 0)) {failaction;} \
get_buffer = state.get_buffer; bits_left = state.bits_left; \
if (bits_left < HUFF_LOOKAHEAD) { \
nb = 1; goto slowlabel; \
} \
} \
look = PEEK_BITS(HUFF_LOOKAHEAD); \
if ((nb = htbl->look_nbits[look]) != 0) { \
DROP_BITS(nb); \
result = htbl->look_sym[look]; \
} else { \
nb = HUFF_LOOKAHEAD+1; \
slowlabel: \
if ((result=jpeg_huff_decode(&state,get_buffer,bits_left,htbl,nb)) < 0) \
{ failaction; } \
get_buffer = state.get_buffer; bits_left = state.bits_left; \
} \
}
In the code above, PEEK_BITS and DROP_BITS are also "functions", created similarly to HUFF_DECODE. In this case, the most reasonable approach is probably to inline the code of the PEEK_BITS and DROP_BITS "functions" into HUFF_DECODE to ease the transformation.
You should go on to the next stage of refining the code only when nothing but the most harmless preprocessor directives (like the one below) are left.
#define DATATYPE_VOID 0
You can get rid of goto operators by introducing boolean variables and/or restructuring the function's code. For example, if a function has a loop that uses goto to break out of it, that construction can be changed into setting a boolean variable, a break statement, and a check of the variable's value after the loop.
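A minimal illustration of that transformation (the search loop itself is invented for the example):

```cpp
#include <cstddef>

// Before: "goto found;" jumped out of the loop to a label below it.
// After: a boolean flag plus break, checked once the loop is done.
int findFirstNegative(const int* data, size_t count)
{
    bool found = false;
    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        if (data[i] < 0) {
            found = true;
            pos = i;
            break;          // replaces "goto found;"
        }
    }
    if (!found)             // replaces the code after the label
        return -1;
    return static_cast<int>(pos);
}
```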
My next step is to scan the code for all the switch statements containing a case without a matching break.
switch ( test1(buf) )
{
case -1:
if ( line != buf + (bufsize - 1) )
continue;
default:
fputs(buf, out);
break;
}
This is allowed in C++, but not in C#. Such switch statements can be replaced with if blocks, or you can duplicate the shared code if the fall-through case only takes up a couple of lines.
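A sketch of the duplication approach, on a simplified, invented function (not the tiffcp code above): the case that used to fall through gets its own copy of the shared tail.

```cpp
// Hypothetical example: in the original switch, case 1 ran its own
// check and then fell through into default. The if version duplicates
// the short "default" tail instead.
int classify(int test, int& handled)
{
    // Original C++:
    //   switch (test) {
    //   case 1:
    //       if (handled > 0)
    //           return 0;   // early exit, otherwise fall through
    //   default:
    //       handled++;
    //       break;
    //   }
    if (test == 1) {
        if (handled > 0)
            return 0;
        handled++;          // duplicated "default" tail
        return 1;
    }
    handled++;              // the default branch itself
    return 1;
}
```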
Everything that I have described so far is not supposed to take much time - not compared to what lies ahead. The first massive task we're facing is combining data and functions into classes. What we're aiming for is making every function a method of some class.
If the code was initially written in C++, it will probably contain few free (non-member) functions. In this case, you should look for relationships between the existing classes and the free functions. Usually, it turns out that free functions play an ancillary role for the classes. If a function is only used by one class, it can be moved into that class as a static method. If a function is used by several classes, a new class can be created with the function as its static member.
If the code was created in C, there will be no classes in it. They'll have to be created from the ground up by grouping functions around the data they manipulate. Fortunately, this logical relationship is usually quite easy to figure out, especially if the C code was written with some OOP principles in mind.
Let's examine the example below:
struct tiff
{
char* tif_name;
int tif_fd;
int tif_mode;
uint32 tif_flags;
......
};
...
extern int TIFFDefaultDirectory(tiff*);
extern void _TIFFSetDefaultCompressionState(tiff*);
extern int TIFFSetCompressionScheme(tiff*, int);
...
It's easy to see that the tiff struct begs to become a class, and the three functions declared below it beg to become public methods of that class. So, we change struct to class and turn the three functions into static methods of the class.
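An intermediate sketch of that change (the fields and method names are trimmed and simplified; the real libtiff functions do much more):

```cpp
#include <cstdint>

// The struct becomes a class; the free functions move in as static
// methods that, for now, still take the object as their first argument.
class Tiff
{
public:
    int tif_mode;
    uint32_t tif_flags;

    static void SetDefaultCompressionState(Tiff* tif)
    {
        tif->tif_flags = 0;
    }

    static int SetCompressionScheme(Tiff* tif, int scheme)
    {
        tif->tif_mode = scheme;
        return 1;
    }
};
```

Keeping the methods static at first means every call site changes only minimally (TIFFSetCompressionScheme(t, s) becomes Tiff::SetCompressionScheme(t, s)), which keeps each refactoring step small and testable.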
As most functions become methods of different classes, it becomes easier to understand what to do with the remaining non-member functions. Don't forget that not all free functions will become public methods; there are usually a few ancillary functions not intended for use from the outside, and these will become private methods.
After the free functions have been changed into static methods of classes, I suggest getting down to replacing calls to the malloc/free functions with the new/delete operators and adding constructors and destructors. Then the static methods can be gradually turned into full-blown instance methods. As more and more static methods are converted to non-static ones, it will become clear that at least one of their arguments is redundant: the pointer to the original struct that has become the class. It may also turn out that some arguments of private methods can become member variables.
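A sketch of that step, using an invented class with one field: the pointer argument of the static method becomes "this" and disappears from the signature.

```cpp
// Before: static int SetCompressionScheme(Tiff* tif, int scheme);
// After: the Tiff* argument is redundant - it is now "this".
class Tiff
{
public:
    int tif_mode = 0;

    int SetCompressionScheme(int scheme)
    {
        tif_mode = scheme;   // was tif->tif_mode = scheme;
        return 1;
    }
};
```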
Now that a set of classes has replaced the set of functions and structs, it's time to get back to the preprocessor - that is, to defines like the one below (no other kind should remain by now):
#define STRIP_SIZE_DEFAULT 8192
Such defines should be turned into constants, and you should find or create an owner class for each of them. As with the functions, the newly-created constants may require creating a special new class (perhaps called Constants). And as with the functions, the constants may have to be public or private.
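For instance, the define above could land in a class like this (the owner class name is my own invention; in the real code it would be whichever class actually uses the value):

```cpp
// The #define becomes a class-owned constant. This maps directly to
// "public const int" or "private const int" once the code is in C#.
class TiffStrip
{
public:
    static const int STRIP_SIZE_DEFAULT = 8192;
};
```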
If the original code was written in C++, it may rely on multiple inheritance. This is another thing to get rid of before converting the code to C#. One way to deal with it is to change the class hierarchy so that multiple inheritance is excluded. Another way is to make sure that all the base classes of a class that uses multiple inheritance contain only pure virtual methods and no member variables. For example:
class A
{
public:
virtual bool DoSomething() = 0;
};
class B
{
public:
virtual bool DoAnother() = 0;
};
class C : public A, public B
{ ... };
This kind of multiple inheritance can be easily transferred to C# by declaring A and B classes as interfaces.
Before going over to the next large-scale task (getting rid of pointer arithmetic), we should pay special attention to type synonym declarations (the typedef operator). Sometimes these are used as shorthand for longer types. For instance:
typedef vector<command*> Commands;
I prefer to inline such declarations - that is, to locate every use of Commands in the code, change it to vector<command*>, and delete the typedef.
A more interesting case of typedef usage is this:
typedef signed char int8;
typedef unsigned char uint8;
typedef short int16;
typedef unsigned short uint16;
typedef int int32;
typedef unsigned int uint32;
Mind the names of the types being created. Obviously, typedef short int16 and typedef int int32 are more of a hindrance than a help, so it makes sense to change int16 to short and int32 to int throughout the code. The other typedefs, on the contrary, are quite useful. It's a good idea, however, to rename them so that they match the type names in C#, like so:
typedef signed char sbyte;
typedef unsigned char byte;
typedef unsigned short ushort;
typedef unsigned int uint;
Special attention should be paid to declarations similar to the following one:
typedef unsigned char JBLOCK[64];
This declaration defines JBLOCK as an array of 64 elements of type unsigned char. I prefer to convert such declarations into classes - in other words, to create a JBLOCK class that serves as a wrapper around the array and implements methods to access its individual elements. This makes it easier to understand how arrays of JBLOCKs (particularly 2- and 3-dimensional ones) are created, used and destroyed.
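A minimal wrapper along those lines (the method names are my own; the real class would likely add range checks and bulk operations):

```cpp
// Wraps the fixed-size array from "typedef unsigned char JBLOCK[64];"
// so that creation and element access become explicit, named methods.
class JBlock
{
public:
    static const int Size = 64;

    JBlock()
    {
        for (int i = 0; i < Size; i++)
            data[i] = 0;
    }

    unsigned char Get(int index) const { return data[index]; }
    void Set(int index, unsigned char value) { data[index] = value; }

private:
    unsigned char data[Size];
};
```

A 2-dimensional array of JBLOCKs then becomes an ordinary container of JBlock objects, with no hidden pointer decay to reason about.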
Another large-scale task is getting rid of pointer arithmetic. Many C/C++ programs rely quite heavily on this feature of the language.
E.g.:
void horAcc32(int stride, uint* wp, int wc)
{
if (wc > stride) {
wc -= stride;
do {
wp[stride] += wp[0];
wp++;
wc -= stride;
} while ((int)wc > 0);
}
}
Such functions have to be rewritten, because pointer arithmetic is unavailable in C# by default. You can use it in unsafe code, but unsafe code has its disadvantages, which is why I prefer to rewrite such code using "index arithmetic". It goes like this:
void horAcc32(int stride, uint* wp, int wc)
{
int wpPos = 0;
if (wc > stride) {
wc -= stride;
do {
wp[wpPos + stride] += wp[wpPos];
wpPos++;
wc -= stride;
} while ((int)wc > 0);
}
}
The resulting function does the same job, but uses no pointer arithmetic and can be easily ported to C#. It could also be somewhat slower than the original, but again, this is not our priority for now.
Special attention should be paid to the functions that change pointers passed to them as arguments. Below is an example of such a function:
void horAcc32(int stride, uint* & wp, int wc)
In this case, changing wp inside horAcc32 changes the pointer in the calling function as well. Introducing an index is still a suitable approach here: you just need to define the index in the calling function and pass it to horAcc32 by reference.
void horAcc32(int stride, uint* wp, int& wpPos, int wc)
It is often convenient to turn the wpPos index into a member variable.
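As a sketch, here is the index-by-reference variant of horAcc32 (uint is spelled unsigned int, since the typedef is not shown here); the caller can observe where the position ended up after the call:

```cpp
// Index-based version of horAcc32: the buffer pointer never moves,
// and the caller's position is advanced through the wpPos reference.
void horAcc32(int stride, unsigned int* wp, int& wpPos, int wc)
{
    if (wc > stride) {
        wc -= stride;
        do {
            wp[wpPos + stride] += wp[wpPos];
            wpPos++;
            wc -= stride;
        } while (wc > 0);
    }
}
```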
After pointer arithmetic is out of the way, it is time to deal with function pointers (if there are any in the code). Function pointer usage falls into three cases:
- Function pointers are created and used within one class / function
- Function pointers are created and used by different classes in the program
- Function pointers are created by the users and are passed into the program (program in this case is a dynamically or statically created library)
An example of the first type:
typedef int (*func)(int x, int y);
class Calculator
{
Calculator();
int (*func)(int x, int y);
static int sum(int x, int y) { return x + y; }
static int mul(int x, int y) { return x * y; }
public:
static Calculator* CreateSummator()
{
Calculator* c = new Calculator();
c->func = sum;
return c;
}
static Calculator* CreateMultiplicator()
{
Calculator* c = new Calculator();
c->func = mul;
return c;
}
int Calc(int x, int y) { return (*func)(x,y); }
};
In this case, the functionality of the Calc method varies depending on which of the CreateSummator and CreateMultiplicator methods was called to create the instance of the class. I prefer to create a private enum in the class that describes all the possible choices of functionality, plus a field that holds a value of that enum. Then, instead of a function pointer, I create a method consisting of a switch statement (or several ifs) that selects the necessary function based on the field's value. The changed code:
class Calculator
{
enum FuncType
{ ftSum, ftMul };
FuncType type;
Calculator();
int func(int x, int y)
{
if (type == ftSum)
return sum(x,y);
return mul(x,y);
}
static int sum(int x, int y) { return x + y; }
static int mul(int x, int y) { return x * y; }
public:
static Calculator* createSummator()
{
Calculator* c = new Calculator();
c->type = ftSum;
return c;
}
static Calculator* createMultiplicator()
{
Calculator* c = new Calculator();
c->type = ftMul;
return c;
}
int Calc(int x, int y) { return func(x,y); }
};
You can also choose another way: change nothing for the moment and use delegates when transferring the code to C#.
An example for the second case (function pointers are created and used by different classes of the program):
typedef int (*TIFFVSetMethod)(TIFF*, ttag_t, va_list);
typedef int (*TIFFVGetMethod)(TIFF*, ttag_t, va_list);
typedef void (*TIFFPrintMethod)(TIFF*, FILE*, long);
class TIFFTagMethods
{
public:
TIFFVSetMethod vsetfield;
TIFFVGetMethod vgetfield;
TIFFPrintMethod printdir;
};
This situation is best resolved by turning vsetfield/vgetfield/printdir into virtual methods. Code that used to assign vsetfield/vgetfield/printdir will instead have to create a class derived from TIFFTagMethods with the required implementation of the virtual methods.
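Sketched on a reduced version of the class (the signatures are simplified to plain ints for the example; the real ones take TIFF*, tags and va_list):

```cpp
// Instead of storing function pointers, the class exposes virtual
// methods with default implementations; users derive and override.
class TiffTagMethods
{
public:
    virtual ~TiffTagMethods() {}
    virtual int vsetfield(int tag, int value) { return 0; }
    virtual int vgetfield(int tag) { return 0; }
};

// Code that used to assign a function pointer now supplies a subclass.
class MyTagMethods : public TiffTagMethods
{
public:
    int lastTag = -1;
    int vsetfield(int tag, int value) override
    {
        lastTag = tag;     // stand-in for real tag handling
        return value;
    }
};
```

This shape survives the move to C# almost unchanged: virtual methods and overrides exist in both languages, which is exactly why it beats function pointers here.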
An example of the third case (function pointers are created by users and passed into the program):
typedef int (*PROC)(int, int);
int DoUsingMyProc (int, int, PROC lpMyProc, ...);
Delegates are best suited here. That is, at this stage, while the original code is still being polished, nothing should be done. Later, when the project is transferred to C#, a delegate should be created instead of PROC, and the DoUsingMyProc function should be changed to accept an instance of the delegate as an argument.
The last change to the original code is the isolation of anything that may be a problem for the new compiler. This may be code that actively uses the standard C/C++ library (functions like fprintf, gets, atof and so on) or WinAPI. In C#, such code will have to be rewritten using .NET Framework methods or, if need be, the P/Invoke technique. Take a look at the www.pinvoke.net site in the latter case.
"Problem code" should be localized as much as possible. To this end, you can create a wrapper class for the functions from the C/C++ standard library or WinAPI; later, only this wrapper will have to be changed.
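For example, C runtime calls can be funneled through one thin wrapper class. This is a sketch with a single formatting helper and invented names; a real wrapper would cover file I/O and whatever else the code base touches:

```cpp
#include <cstdio>
#include <string>

// All direct uses of the C runtime go through this class. When
// porting, only this class gets rewritten on top of the .NET BCL
// (here, IntToString would become value.ToString() or String.Format).
class RuntimeWrapper
{
public:
    // Stands in for the sprintf("%d", ...) calls scattered in the code.
    static std::string IntToString(int value)
    {
        char buf[32];
        std::snprintf(buf, sizeof(buf), "%d", value);
        return std::string(buf);
    }
};
```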
This is the moment of truth - the time to bring the changed code into a new project built with the C# compiler. It's quite trivial, but labor-intensive: create a new empty project, add the necessary classes to it, and copy the code from the corresponding original classes into them.
You'll have to remove the ballast at this stage (various #includes, for instance) and make some cosmetic modifications. "Standard" modifications include:
- combining code from .h and .cpp files
- replacing obj->method() with obj.method()
- replacing Class::StaticMethod with Class.StaticMethod
- removing * in func(A* anInstance)
- replacing func(int& x) with func(ref int x)
Most of the modifications are not particularly complex, but some of the code will have to be commented out - mostly the "problem code" discussed above. The main goal here is to get C# code that compiles. It most probably won't work yet, but we'll come to that in due time.
After the converted code compiles, we need to adjust it until its functionality matches the original. For that, we need to create a second set of tests that exercises the converted code. The methods commented out earlier need to be carefully revised and rewritten using the .NET Framework. I think this part needs no further explanation; I just want to expand on a few fine points.
When creating strings from byte arrays (and vice versa), the encoding should be selected carefully. Encoding.ASCII should be avoided due to its 7-bit nature: bytes with values higher than 127 will become "?" instead of proper characters. It's better to use Encoding.Default or Encoding.GetEncoding("Latin1"). The actual choice depends on what happens to the text or the bytes next. If the text is to be displayed to the user, Encoding.Default is the better choice; if the text is to be converted back to bytes and saved into a binary file, Encoding.GetEncoding("Latin1") suits better.
Output of formatted strings (code related to the printf family of functions in C/C++) may present certain problems. The functionality of String.Format in the .NET Framework is both poorer and different in syntax. This problem can be solved in two ways:
- Create a class that mimics the functionality of the printf functions
- Change the format strings so that String.Format shows the same result (not always possible)
Check out "A printf implementation in C#" if you choose the first option.
I prefer the second option. If you choose it too, then a Google search for "c# format specifiers" (without the quotes) and the "Format Specifiers" appendix from C# in a Nutshell may prove useful.
When all the tests that use the converted code pass successfully, we can be sure that the conversion is complete. Now we can return to the fact that the code does not quite conform to C# idioms (for example, it is full of get/set methods instead of properties) and refactor the converted code accordingly. You may also use a profiler to identify bottlenecks in the code and optimize them. But that's quite a different story.
Happy porting!