ARM Assembly for eVC with the Mono Jit Macros

14 Jul 2007

Introduction

Microsoft's eMbedded Visual C++ (eVC) compilers offer no way to write inline assembly for the ARM family of microprocessors other than emitting the raw opcodes. The macros of the Mono Jit for ARM make it fairly easy to do that in a more convenient manner. This article only deals with assembly for ARMV4 versions of Windows CE, which means versions before Windows Mobile 5. The methods, in the exact form presented here, may not work on ARMV4I and ARMV4T platforms; that has not been tested.

Background

Code written in machine language is often still faster than code written in C, but we generally no longer use assembly to compete with C on speed. In the past couple of years, however, bytecode languages such as Lua and CIL have become increasingly important, and they can be sped up dramatically with Jits. A Jit first translates each bytecode, which would otherwise be interpreted "bureaucratically" one at a time, into one or more machine instructions, lays those instructions out in a row, and only then executes them. This is one of the reasons that Jits start up slowly but run very fast afterwards.

I feel that abstract bytecode and Jits are the way to go for the future, because the combination is theoretically platform-independent and simpler to use for the common programmer. Unfortunately, no generally used standard for bytecode has yet surfaced: there are (way too) many variations on the same theme around, but hopefully this will change in time.

I discovered that the macros of the Mono ARM Jit can easily be used with eVC by looking at the source code of an OpenGL ES implementation for Windows CE that can be downloaded from SourceForge. Its author seems to have developed more general wrappers around the macros. I will not use them here, although I think that an abstract machine language could be very useful in itself. It might even turn out to be a precondition for developing a bytecode standard.

Using the Code

The Mono ARM assembly macros live in a couple of header files, most notably arm-codegen.h and arm_dpimacros.h, and optionally arm-dis.h for disassembling the generated code, which I found really neat. I used Mono version 1.2.4 and corrected a bug for backward branches:

#define ARM_DEF_BR(offs, l, cond) \
((offs) | ((l) << 24) | (ARM_BR_TAG) | (cond << ARMCOND_SHIFT))

in arm-codegen.h should be:

#define ARM_DEF_BR(offs, l, cond) \
(((offs) & 0x00FFFFFF) | ((l) << 24) | (ARM_BR_TAG) | ((cond) << ARMCOND_SHIFT))

because ARM branch offsets are 24 bits wide. A negative offset passed to the branch macros is sign-extended to 32 bits, so unless its upper eight bits are masked off they overwrite the condition and opcode fields of the instruction. Possibly Mono uses backward branches in a different way; I looked into that too briefly to tell. Branch offsets are counted in instructions: the branch instruction itself is regarded as being at offset -2, the next instruction at -1, the previous one at -3, and so on.
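The offset rule and the effect of the mask can be checked with a few lines of plain C. This is only a sketch: the names BR_TAG, COND_SHIFT, COND_AL, branch_offset() and encode_branch() are made up for the illustration and are not the ones in arm-codegen.h, the constants follow the ARM B/BL encoding (condition, 101, link bit, 24-bit offset), and stdint.h types are used instead of the Windows CE ones so that it compiles on a desktop compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants from the ARM B/BL encoding: cond | 101 | L | offset24.
   Illustrative names, not the definitions in arm-codegen.h. */
#define BR_TAG      (5u << 25)
#define COND_SHIFT  28
#define COND_AL     0xEu        /* "always" condition */

/* Offset, in instructions, for a branch emitted at index `from` that
   targets index `to`. The branch itself counts as -2, so a jump to the
   branch's own address has offset -2. */
static int32_t branch_offset(int32_t from, int32_t to)
{
    return to - (from + 2);
}

/* Encode B (link = 0) or BL (link = 1); the offset is masked to 24 bits
   so that a negative value cannot spill into the upper eight bits. */
static uint32_t encode_branch(int32_t offs, uint32_t link, uint32_t cond)
{
    assert(offs >= -(1 << 23) && offs < (1 << 23));
    return ((uint32_t)offs & 0x00FFFFFFu) | (link << 24) | BR_TAG
         | (cond << COND_SHIFT);
}
```

With these helpers, branch_offset(5, 0) gives -7 and branch_offset(9, 0) gives -11, which are the two offsets used in the Fibonacci program further down. Without the mask, -7 (0xFFFFFFF9) would set every bit in the top byte and the result would no longer be a BL instruction.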

Some of the more complicated basic ARM operations are not implemented as macros but as functions, such as arm_mov_reg_imm32(). In these functions the pointer into the instruction array is a local copy, so the caller's pointer is not updated automatically; the function returns the new value instead. I put these functions and the disassembler functions in a library named ArmJit.lib. To use the macros, include the appropriate header files in your source code; if you need the functions as well, link with the lib.
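The calling pattern for such a function might look like the sketch below. Note that emit_mov_imm32() is a hypothetical stand-in written for this illustration, not Mono's arm_mov_reg_imm32(), and that it simply builds the constant one byte at a time with a MOV followed by three ORRs; the real function may generate different code. It uses stdint.h types so that it can be tried on a desktop compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, NOT Mono's arm_mov_reg_imm32(): loads a 32-bit
   constant into register rd with MOV plus three ORRs, one byte each.
   It writes through its own copy of the pointer and returns the updated
   value; the caller's pointer is left unchanged. */
static uint32_t *emit_mov_imm32(uint32_t *p, uint32_t rd, uint32_t value)
{
    assert(rd < 16);
    /* MOV rd, #byte0 */
    *p++ = 0xE3A00000u | (rd << 12) | (value & 0xFFu);
    /* ORR rd, rd, #byte1 << 8   (imm8 rotated right by 24: rotate field 12) */
    *p++ = 0xE3800000u | (rd << 16) | (rd << 12) | (12u << 8)
         | ((value >> 8) & 0xFFu);
    /* ORR rd, rd, #byte2 << 16  (rotated right by 16: rotate field 8) */
    *p++ = 0xE3800000u | (rd << 16) | (rd << 12) | (8u << 8)
         | ((value >> 16) & 0xFFu);
    /* ORR rd, rd, #byte3 << 24  (rotated right by 8: rotate field 4) */
    *p++ = 0xE3800000u | (rd << 16) | (rd << 12) | (4u << 8)
         | ((value >> 24) & 0xFFu);
    return p;
}
```

A caller would write pins = emit_mov_imm32(pins, 0, 0x12345678); so that the next macro continues after the four emitted instructions. Forgetting the assignment makes the next instruction overwrite the first one.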

To make the procedure easier to follow, I made a small test program that implements the simple Fibonacci benchmark in 13 ARM instructions. For fun, its speed is compared to the commonly used C implementation, and the ARM version turns out to be over 30% faster; but, as stated, competing with C is generally no longer the purpose of using assembly. Here is the program; my comments follow below it:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   /* atoi(), exit() */
#include <arm-codegen.h>
#include <arm-dis.h>

unsigned long fib_c(unsigned long n) {
    if (n < 2)
        return 1;
    else
        return fib_c(n - 2) + fib_c(n - 1);
}

void setup_fib_jit (unsigned int *pins) {

    /* label1 */
    ARM_CMP_REG_IMM8 (pins, ARMREG_R0, 2);                  /* is n < 2 ? */
    ARM_MOV_REG_IMM8_COND (pins, ARMREG_R0, 1, ARMCOND_LO); /* if yes return value is 1 */
    ARM_MOV_REG_REG_COND (pins, ARMREG_PC, ARMREG_LR, ARMCOND_LO);
                                        /* if yes return address in PC; */
                                        /* and exit to main or previous recursive call */
    ARM_PUSH2 (pins, ARMREG_R0, ARMREG_LR);                 /* save n and return address to the stack */
    ARM_SUB_REG_IMM8 (pins, ARMREG_R0, ARMREG_R0, 2);       /* n = n-2 */
    ARM_BL (pins, -7);                                      /* recurse to label1 for fib(n-2) */

    ARM_LDR_IMM (pins, ARMREG_R1, ARMREG_SP, 0);            /* load n from the stack */
    ARM_STR_IMM (pins, ARMREG_R0, ARMREG_SP, 0);            /* store result fib(n-2) */

    ARM_SUB_REG_IMM8 (pins, ARMREG_R0, ARMREG_R1, 1);       /* n = n-1 */
    ARM_BL (pins, -11);                                     /* recurse to label1 for fib(n-1) */
    ARM_POP2 (pins, ARMREG_R1, ARMREG_LR);                  /* pop result fib(n-2) and return address */

    ARM_ADD_REG_REG (pins, ARMREG_R0, ARMREG_R0, ARMREG_R1); /* add both results */

    ARM_MOV_REG_REG (pins, ARMREG_PC, ARMREG_LR);
                                        /* return address in PC; */
                                        /* and exit to main or previous recursive call */
}

int main (int argc, char *argv[]) {

    unsigned int n, ins[500], *pins = ins;
    unsigned long (*fib_jit)(int n) = (unsigned long (*)(int n)) ins;
    unsigned long r1, r2, t0, t1, t2;

    setup_fib_jit (pins);
    _armdis_dump (stdout, ins, 56);

    if (argc <= 2) {
        if (argc == 1)
            n = 1;
        else
            n = atoi (argv[1]);
        t0 = GetTickCount();
        r1 = fib_c (n);
        t1 = GetTickCount();
        r2 = fib_jit (n);
        t2 = GetTickCount();
    }
    else {
        fprintf (stderr, "%s: Wrong number of arguments\n", argv[0]);
        exit (-1);
    }

    /* n is unsigned int and the results are unsigned long: use %u and %lu */
    printf ("  fib_c(%u) result: %lu\n\texecution time: %f\n", n, r1, (t1-t0) / 1000.0);
    printf ("fib_jit(%u) result: %lu\n\texecution time: %f\n", n, r2, (t2-t1) / 1000.0);

    return 0;
}

Using the macros presupposes a good deal of knowledge of machine language programming in general, and some basic knowledge of the ARM microprocessor family in particular. The assembly instructions in the function setup_fib_jit() are commented in the program, and I will not go into them here; that is beyond the scope of this article. Comparing the comments with the C version above it should give an adequate impression of what goes on: it is virtually a one-to-one translation of the C algorithm. I will concentrate instead on setting up the code and using the macros in practice.

First we need an array for the actual opcodes; in this case the array is named ins[]. ARM opcodes are 32 bits each on the ARMV4 platform, which corresponds to unsigned int on my version of Windows CE. Use UINT32 as the element type instead if you want to be safe. The macros also need a pointer into this array, pins, to determine where to put the next opcode. They advance this pointer themselves, so we can use them without having to take care of that. Note that, compared to normal programming, the assembly function is being "set up" rather than written directly: the actual code is not "in" setup_fib_jit(), but in the array ins[]!
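The mechanism is easy to reproduce on a small scale. In the sketch below, EMIT_WORD() and emit_return_one() are hypothetical names invented for the illustration; the real Mono macros assemble the opcode from their arguments, whereas here the two opcodes (mov r0, #1 and mov pc, lr) are written out as constants. uint32_t from stdint.h stands in for UINT32 so that the sketch compiles on a desktop compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a Mono-style macro: store one opcode at the
   write pointer and advance the pointer to the next free slot. */
#define EMIT_WORD(p, w) do { *(p)++ = (uint32_t)(w); } while (0)

/* Fill buf with "mov r0, #1 ; mov pc, lr" and return the number of
   instructions emitted. The code ends up in buf, not in this function. */
static size_t emit_return_one(uint32_t *buf, size_t capacity)
{
    uint32_t *p = buf;          /* write pointer, advanced by EMIT_WORD */
    assert(capacity >= 2);
    EMIT_WORD(p, 0xE3A00001u);  /* mov r0, #1  : return value 1 */
    EMIT_WORD(p, 0xE1A0F00Eu);  /* mov pc, lr  : return to the caller */
    return (size_t)(p - buf);   /* pointer difference = instruction count */
}
```

The pointer difference at the end is a convenient way to find out how many instructions, and therefore how many bytes (four per instruction), have been generated so far.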

After the opcode array has been filled with the instructions, we make it callable by casting the array to a function pointer. That is done with the declaration:

unsigned long (*fib_jit)(int n) = (unsigned long (*)(int n)) ins;

We can then call fib_jit() like an ordinary function. The n argument is passed to it in ARM register 0, which is usual for the first parameter of a __cdecl function on the ARM WinCE platform. Register 0 is also used to pass the return value back to main() when the function exits. I think the rest of main() is self-explanatory. The benchmark reads its input from and writes its output to the console, so you need one to run this program. I personally use PocketConsole, which is available on the Internet; I adapted it to work in VGA, hence hidpi.res in the project.

Points of Interest

Once I got started, I found using the Mono code surprisingly easy and very interesting. I have long been planning to write a C interpreter for my Toshiba E800 Pocket PC, and maybe this tool will get me to actually do it. Don't count on it though; rather write it yourself ;).

History

The original zip file and the article have had a minor update: the ARM function of the test program has been optimized for speed.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
