In this tip, you will find an implementation of the Fast Integer Square Root algorithm (described by Ross M. Fossler in Microchip's TB040), in 8051 assembly.
Introduction
Some microcontroller (MCU) appications need to compute the integer square root (sqrt
) function, quickly (i.e. avoiding division), and using a small number of instructions. This tip shows the implementation of 'Fast Integer Square Root' algorithm, described by Ross M. Fossler in Microchip's application note TB040.
While such application note reports the source code for the Microchip P18C252
MCU, here, the implementation for the 8051
MCU is shown.
Note, just a sketch of the algorithm is presented here, since the exact specification is given by the (provided) C source code. Anyway, I warmly encourage you reading the original Fossler paper.
The Goal
The goal is finding the integer approximation of the square root of a 16-bit
unsigned number, expressed as an unsigned 8-bit
(byte
) number.
Namely, the 16-bit number will be the argument of the assembly routine, the approximation of the number square root will be the routine return value.
The Algorithm
The algorithm is very simple: Iteratively, starting from most significant position (bit 7
of the byte), an 'additional' bit is added to a guess
number, then the squared guess
is compared with the routine argument (say arg
):
- If the
squared guess
is less than arg
, then we keep the added bit in the guess
, shift the additional bit right and goto next iteration. - On the other hand, if the
squared guess
is greater than arg
, then we remove the added bit from the guess
, shift the additional bit right and goto the next iteration.
The process terminates either when the squared guess
is equal to the arg
or there is no more additional bit (it was shifted out of its byte).
The Code
The algorithm can be easily formulated using C
code.
uint8_t fast_sqrt_c(uint16_t n)
{
uint8_t g = 0x80; uint8_t b = 0x80; for (;;)
{
uint16_t g2 = g*g;
if ( g2 == n )
break; if ( g2 > n )
g ^= b; b >>= 1; if ( b == 0 )
break; g |= b; }
return g;
}
The C
is useful because it
- represents a precise specification for the algorithm.
- can be compared with the
sqrt
function of the standard C
library, in order to fully establish the correctness of the algorithm. - can be compiled with a 'suitable' compiler in order to generate
8051
assemebly. Then, the produced assembly can be compared with the hand crafted algorithm implementation (for correctness, speed and MCU usage).
Step 2 was actually performed using GCC
on a Linux
box (the source fast_sqrt_c_test.c code is provided).
Step 3 was actually performed using the SDCC
compiler.
The resulting code (provided as fast_sqrt_c_sdcc.asm is a bit cluttered, so here is reported a 'rearranged' listing, somehow cleaned up:
_fast_sqrt_c:
mov r6,dpl
mov r7,dph
mov r5,#0x80
mov r4,#0x80
C_Loop:
mov a,r5
mov b,a
mul ab
mov r2,a
mov r3,b
cjne a,ar6,C_NotEqual
mov a,r3
cjne a,ar7,C_NotEqual
sjmp C_TheEnd
C_NotEqual:
clr c
mov a,r6
subb a,r2
mov a,r7
subb a,r3
jnc C_NotGreater
mov a,r4
xrl ar5,a
C_NotGreater:
mov a,r4
clr c
rrc a
mov r4,a
jz C_TheEnd
mov a,r4
orl ar5,a
sjmp C_Loop
C_TheEnd:
mov dpl,r5
ret
SDCC
implementation is, in my opinion, pretty good. Anyway, we can do better with hand-crafted assembly (source provided as fast_sqrt_hand_crafted.asm):
_fast_sqrt_hc:
mov r6, dpl
mov r7, dph
mov r4, #0x80
mov r5, #0x80
mov a,r5
HC_Loop:
mov b,a
mul ab
clr c
subb a,r6
xch a,b
subb a,r7
jc HC_Less
orl a,b
jz HC_TheEnd
mov a, r5
xrl a, r4
mov r5, a
HC_Less:
clr c
mov a,r4
rrc a
jz HC_TheEnd
mov r4,a
orl a, r5
mov r5,a
sjmp HC_Loop
HC_TheEnd:
mov dpl, r5
ret
Some Numbers
The SDCC
produced and the hand crafted one were assembled and compared using the EdSim51DI
simulator.
The results are shown in the following table:
Routine | Code size | Average number of MCU cycles |
SDCC produced | 48 bytes | 170.9 |
Hand crafted | 39 bytes | 140.4 |
Useful 8051 Resources
History
- 27th December, 2020: First release