Fast Integer Square Root For 8051

CPallini

5.00/5 (7 votes)

28 Dec 2020CPOL2 min read

10.6K

Fast integer square root computation in 8051 assembly

In this tip, you will find an implementation of the Fast Integer Square Root algorithm (described by Ross M. Fossler in Microchip's TB040), in 8051 assembly.

Download assembly and C sources - 4 KB

Introduction

Some microcontroller (MCU) appications need to compute the integer square root (sqrt) function, quickly (i.e. avoiding division), and using a small number of instructions. This tip shows the implementation of 'Fast Integer Square Root' algorithm, described by Ross M. Fossler in Microchip's application note TB040.

While such application note reports the source code for the Microchip P18C252 MCU, here, the implementation for the 8051 MCU is shown.

Note, just a sketch of the algorithm is presented here, since the exact specification is given by the (provided) C source code. Anyway, I warmly encourage you reading the original Fossler paper.

The Goal

The goal is finding the integer approximation of the square root of a 16-bit unsigned number, expressed as an unsigned 8-bit (byte) number.

Namely, the 16-bit number will be the argument of the assembly routine, the approximation of the number square root will be the routine return value.

The Algorithm

The algorithm is very simple: Iteratively, starting from most significant position (bit 7 of the byte), an 'additional' bit is added to a guess number, then the squared guess is compared with the routine argument (say arg):

If the squared guess is less than arg, then we keep the added bit in the guess, shift the additional bit right and goto next iteration.
On the other hand, if the squared guess is greater than arg, then we remove the added bit from the guess, shift the additional bit right and goto the next iteration.

The process terminates either when the squared guess is equal to the arg or there is no more additional bit (it was shifted out of its byte).

The Code

The algorithm can be easily formulated using C code.

uint8_t fast_sqrt_c(uint16_t n)
{
  uint8_t g = 0x80; // initial guess
  uint8_t b = 0x80; // 'additional' bit is in position 7
  for (;;)
  {
    uint16_t g2 = g*g;
    if ( g2 == n )
      break;        // an exact match is found
    if ( g2 > n )
      g ^= b;       // here g*g > n, remove the added bit
    b >>= 1;        // shift right the 'addional' bit
    if ( b == 0 )
      break;        // the 'additional' bit was shifted out, the iteration is complete
    g |= b;         // add the current 'additional bit'
  }
  return g;
}

The C is useful because it

represents a precise specification for the algorithm.
can be compared with the sqrt function of the standard C library, in order to fully establish the correctness of the algorithm.
can be compiled with a 'suitable' compiler in order to generate 8051 assemebly. Then, the produced assembly can be compared with the hand crafted algorithm implementation (for correctness, speed and MCU usage).

Step 2 was actually performed using GCC on a Linux box (the source fast_sqrt_c_test.c code is provided).

Step 3 was actually performed using the SDCC compiler.

The resulting code (provided as fast_sqrt_c_sdcc.asm is a bit cluttered, so here is reported a 'rearranged' listing, somehow cleaned up:

ASM

; Implementation of Ross M. Fossler 'Fast Integer Square Root' algorithm 
; (Microchip's application note TB040)
;------------------------------------------------------------
;Allocation info for local variables in function 'fast_sqrt_c'
;------------------------------------------------------------
;n                         Allocated to registers r6-r7 
;g                         Allocated to registers r5 
;b                         Allocated to registers r4 
;g2                        Allocated to registers r2-r3 
;------------------------------------------------------------
;  function fast_sqrt_c
; -----------------------------------------
_fast_sqrt_c:
  mov r6,dpl      ; function argument low byte
  mov r7,dph      ; function argument high byte
  mov r5,#0x80    ; g = 0x80
  mov r4,#0x80    ; b = 0x80

C_Loop:
  mov a,r5
  mov b,a
  mul ab          ; a-b = g*g

  mov r2,a        ; here start the comparison of g2 and n ( g2 == n)
  mov r3,b        ; g2 = g*g
  cjne  a,ar6,C_NotEqual
  mov a,r3
  cjne  a,ar7,C_NotEqual
  sjmp  C_TheEnd  ; break if g2 matches n

C_NotEqual:
  clr c           ; here start the check to establish if g2 is bigger or smaller than n
  mov a,r6
  subb  a,r2
  mov a,r7
  subb  a,r3
  jnc C_NotGreater
  mov a,r4
  xrl ar5,a       ;  here (g2 > n), emove the 'added' bit

C_NotGreater:
  mov a,r4
  clr c
  rrc a           ; shift the additional bit right
  mov r4,a
  jz  C_TheEnd    ; no more 'additional' bit to add, exit
  mov a,r4
  orl ar5,a       ; here the 'additional' bit is actually added
  sjmp  C_Loop
C_TheEnd:

  mov dpl,r5
  ret ; return g;
;------------------------------------------------------------

SDCC implementation is, in my opinion, pretty good. Anyway, we can do better with hand-crafted assembly (source provided as fast_sqrt_hand_crafted.asm):

ASM

; Hand crafted Implementation of Ross M. Fossler 'Fast Integer Square Root' algorithm 
; (Microchip's application note TB040)
; function fast_sqrt_hc
; r6-r7 is n, the function argument
; r5 is g, the guess
; r4 is b, the 'additional' bit
; r2-r3 is g2 = g*g
;
_fast_sqrt_hc:
mov r6, dpl
mov r7, dph
mov r4, #0x80
mov r5, #0x80     ; initialization: function argument n = r6-r7, g = 0x80, b = 0x80

mov a,r5

HC_Loop:
  mov b,a
  mul ab          ; a-b is g2
  clr c
  subb a,r6
  xch a,b
  subb a,r7
  jc HC_Less      ; this means g2 < n
  orl a,b
  jz HC_TheEnd    ; jump to end if (g2 == n)

  ; here g2 > n, remove the added bit
  mov a, r5
  xrl a, r4       ; the XOR operration removes the bit
  mov r5, a
HC_Less:
  clr c
  mov a,r4
  rrc a           ; here the shift of the 'additional' bit is performed
  jz HC_TheEnd    ; if a is 0 then there is no more 'additional' bit in the byte
  mov r4,a 
  orl a, r5       ; here the 'additional bit' is added
  mov r5,a
  sjmp HC_Loop
HC_TheEnd:
  mov dpl, r5
  ret

Some Numbers

The SDCC produced and the hand crafted one were assembled and compared using the EdSim51DI simulator.

The results are shown in the following table:

Routine	Code size	Average number of MCU cycles
SDCC produced	48 bytes	170.9
Hand crafted	39 bytes	140.4

Useful 8051 Resources

History

27^th December, 2020: First release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)