Introduction
In 1998, I wrote an article on fixed block memory allocators for Embedded Systems Programming magazine. I received $750 for the piece. Now, articles are written for free on websites such as Code Project. Oh, how times keep a-changin’.
One thing that hasn’t changed is the usefulness of fixed block allocators. It’s a no-no using the global heap on some devices. Throughout my career, I’ve written numerous fixed block memory allocators to provide high-speed dynamic-like allocation in memory constrained systems. A generic, well-designed fixed block allocator opens up application implementation possibilities by offering dynamic memory allocations in systems where heap usage is forbidden.
On Code Project, I’ve documented various C++ allocator implementations (see the Reference Articles section). This time, I’ll present a C language version with unique features suitable for embedded devices or otherwise.
The solution presented here will:
- be faster than the global heap
- be easy to use
- be thread safe
- support fixed block versions of malloc, free, realloc and calloc
- use minimal code space
- execute in fixed time
- eliminate heap fragmentation memory faults
- require no additional storage overhead (except for a few bytes of static memory)
- handle memory alignment
- automatically dispense variable block size based on demand (a la malloc)
Two simple C modules that dispense and reclaim memory will provide all of the aforementioned benefits, as I'll show.
CMake is used to create the build files. CMake is free and open-source software. Windows, Linux and other toolchains are supported. See the CMakeLists.txt file for more information.
See GitHub for latest source code:
Background
Custom fixed block memory allocators are used to solve at least two types of memory related problems. First, global heap allocations/deallocations can be slow and nondeterministic. You never know how long the memory manager is going to take. Secondly, to eliminate the possibility of a memory allocation fault caused by a fragmented heap – a valid concern, especially on mission-critical type systems.
Even if the system isn't considered mission-critical, some embedded systems are designed to run for weeks or years without a reboot. Depending on allocation patterns and heap implementation, long-term heap use can lead to heap faults.
fb_allocator
Each fb_allocator
instance handles a single block size. The interface is shown below:
void ALLOC_Init(void);
void ALLOC_Term(void);
void* ALLOC_Alloc(ALLOC_HANDLE hAlloc, size_t size);
void* ALLOC_Calloc(ALLOC_HANDLE hAlloc, size_t num, size_t size);
void ALLOC_Free(ALLOC_HANDLE hAlloc, void* pBlock);
ALLOC_Init()
is called one time at startup and ALLOC_Term()
at shutdown. Each of the remaining APIs operate in the same manner as the CRT counterparts; ALLOC_Alloc()
allocates memory and ALLOC_Free()
deallocates.
An fb_allocator
instance is created using the ALLOC_DEFINE
macro at file scope. The TestAlloc
allocator below defines a fixed block allocator with a maximum of five 16-byte blocks.
ALLOC_DEFINE(TestAlloc, 16, 5);
Allocating a 16-byte block is simple.
void* data = ALLOC_Alloc(TestAlloc, 16);
Deallocate the block when done.
ALLOC_Free(TestAlloc, data);
x_allocator
The x_allocator
module handles multiple memory block sizes using two or more fb_allocator
instances; one fb_allocator
per block size. During allocation, x_allocator
returns a block sized from one of the fb_allocator
instances based on the caller’s requested size. The x_allocator
API is shown below:
void* XALLOC_Alloc(XAllocData* self, size_t size);
void XALLOC_Free(void* ptr);
void* XALLOC_Realloc(XAllocData* self, void *ptr, size_t new_size);
void* XALLOC_Calloc(XAllocData* self, size_t num, size_t size);
Users of x_allocator
typically create a thin wrapper module that (a) defines two or more fb_allocator
instances and (b) provides a custom API to access the x_allocator
memory. It’s easier to explain with a simple example.
my_allocator
Let’s say we want a fixed block allocator to dispense two block sizes: 32 and 128. We’ll call it my_allocator
and the API is shown below:
void* MYALLOC_Alloc(size_t size);
void MYALLOC_Free(void* ptr);
void* MYALLOC_Realloc(void *ptr, size_t new_size);
void* MYALLOC_Calloc(size_t num, size_t size);
The implementation creates multiple fb_allocator
instances; one to handle each desired block size. In this case, we’ll allow at most ten 32-byte blocks and five 128-byte blocks.
#include "my_allocator.h"
#include "x_allocator.h"
#define MAX_32_BLOCKS 10
#define MAX_128_BLOCKS 5
#define BLOCK_32_SIZE 32 + XALLOC_BLOCK_META_DATA_SIZE
#define BLOCK_128_SIZE 128 + XALLOC_BLOCK_META_DATA_SIZE
ALLOC_DEFINE(myDataAllocator32, BLOCK_32_SIZE, MAX_32_BLOCKS)
ALLOC_DEFINE(myDataAllocator128, BLOCK_128_SIZE, MAX_128_BLOCKS)
static ALLOC_Allocator* allocators[] = {
&myDataAllocator32Obj,
&myDataAllocator128Obj
};
#define MAX_ALLOCATORS (sizeof(allocators) / sizeof(allocators[0]))
static XAllocData self = { allocators, MAX_ALLOCATORS };
Now, simple one line wrapper functions provide access to the underlying x_allocator
module.
void* MYALLOC_Alloc(size_t size)
{
return XALLOC_Alloc(&self, size);
}
void MYALLOC_Free(void* ptr)
{
XALLOC_Free(ptr);
}
void* MYALLOC_Realloc(void *ptr, size_t new_size)
{
return XALLOC_Realloc(&self, ptr, new_size);
}
void* MYALLOC_Calloc(size_t num, size_t size)
{
return XALLOC_Calloc(&self, num, size);
}
When the caller calls MYALLOC_Alloc()
with a size between 1 to 32, a 32-byte block is returned. If the requested size is between 33 and 128, a 128-byte block is provided. MYALLOC_Free()
returns the block to the originating fb_allocator
instance. In this way, a collection of fixed block memory allocators are grouped together providing variable sized memory blocks at runtime based on application demand. The sample wrapper pattern is used again and again offering groups of memory blocks for specific purposes within the system.
Implementation Details
Most of the allocator implementation is relatively straight forward. However, I’ll explain a few details to assist with key concepts.
fb_allocator Free-List
This is a handy technique for linking blocks together in the free-list without consuming any extra storage for the pointers. After the user calls ALLOC_Free()
, a fixed memory block is no longer being utilized and is freed to be used for other things, like a next pointer. Since the fb_allocator
module needs to keep the deleted blocks around, we put the list's next pointer in that currently unused block space. When the block is reused by the application, the pointer is no longer needed and will be overwritten by the user object. This way, there is no per-instance storage overhead incurred linking blocks together.
The free-list is actually implemented as a singly linked list, but the code only adds/removes from the head so the behavior is that of a stack. Using a stack makes the allocation/deallocations really fast and deterministic. No loop searching for a free block – just push or pop a block and go.
Using freed object space as the memory to link blocks together means the object must be large enough to hold a pointer. The ALLOC_BLOCK_SIZE
macro ensures that the minimum size is met.
fb_allocator Memory Alignment
Some embedded systems require memory to be aligned on a particular byte boundary. Since the allocator’s memory is a contiguous static
byte array, having blocks start on an unaligned boundary could cause a hardware exception on some CPUs. For instance, 13-byte blocks will cause a problem if 4-byte alignment is required. Change ALLOC_MEM_ALIGN
to the byte boundary desired. The block size will be rounded up to the next nearest aligned boundary.
x_allocator Meta Data
The x_allocator
implementation adds 4-bytes of meta data per block. For instance, if 32-byte blocks are required by the user, the x_allocator
actually uses 36-byte blocks. The extra 4-bytes are used to hide an fb_allocator
pointer inside the block (assuming the pointer is 4-bytes in size).
When deleting memory, x_allocator
needs the original fb_allocator
instance so the deallocation request can be routed to the correct fb_allocator
instance for processing. Unlike XALLOC_Alloc()
, XALLOC_Free()
does not take a size and only uses a void*
argument. Therefore, XALLOC_Alloc()
actually hides a pointer to the fb_allocator
within an unused portion of the memory block by adding an additional 4-bytes to the request. The caller gets a pointer to the block’s client region where the hidden fb_allocator
pointer is not overwritten.
void* XALLOC_Alloc(XAllocData* self, size_t size)
{
ALLOC_Allocator* pAllocator;
void* pBlockMemory = NULL;
void* pClientMemory = NULL;
ASSERT_TRUE(self);
pAllocator = XALLOC_GetAllocator(self, size);
if (pAllocator)
{
pBlockMemory = ALLOC_Alloc(pAllocator, size + XALLOC_BLOCK_META_DATA_SIZE);
if (pBlockMemory)
{
pClientMemory = XALLOC_PutAllocatorPtrInBlock(pBlockMemory, pAllocator);
}
}
else
{
ASSERT();
}
return pClientMemory;
}
When XALLOC_Free()
is called, the allocator pointer is extracted from the memory block so the correct fb_allocator
instance can be called to deallocate the block.
void XALLOC_Free(void* ptr)
{
ALLOC_Allocator* pAllocator = NULL;
void* pBlock = NULL;
if (!ptr)
return;
pAllocator = XALLOC_GetAllocatorPtrFromBlock(ptr);
if (pAllocator)
{
pBlock = XALLOC_GetBlockPtr(ptr);
ALLOC_Free(pAllocator, pBlock);
}
}
Benchmarking
Benchmarking the allocator performance vs. the global heap on a Windows PC shows just how fast the code is. A basic test of allocating and deallocating 20000 4096 and 2048 sized blocks in a somewhat interleaved fashion tests the speed improvement. See the attached source code for the exact algorithm.
Windows Allocation Times in Milliseconds
Allocator | Mode | Run | Benchmark Time (mS) |
Global Heap | Release | 1 | 36.3 |
Global Heap | Release | 2 | 33.8 |
Global Heap | Release | 3 | 32.8 |
fb_allocator | Static Pool | 1 | 22.6 |
fb_allocator | Static Pool | 2 | 3.7 |
fb_allocator | Static Pool | 3 | 4.9 |
x_allocator | Static Pool | 1 | 33.9 |
x_allocator | Static Pool | 2 | 6.9 |
x_allocator | Static Pool | 3 | 7.7 |
Windows uses a debug heap when executing within the debugger. The debug heap adds extra safety checks slowing its performance. The release heap is much faster as the checks are disabled. The debug heap can be disabled within Visual Studio by setting _NO_DEBUG_HEAP=1
in the Debugging > Environment project option.
This benchmark test is very simplistic and a more realistic scenario with varying blocks sizes and random new/delete intervals might produce different results. However, the basic point is illustrated nicely; the memory manager is slower than allocator and highly dependent on the platform's implementation.
The fb_allocator
uses a static memory pool and doesn't rely upon the heap. This has a fast execution time of around 4ms once the free-list is populated with blocks. The 22.6ms on fb_allocator
Run 1 accounts for dicing up the fixed memory pool into individual blocks on the first run.
The x_allocator
used within the bm_allocator
module is a bit slower at ~7ms since it has overhead to allocate/deallocate multiple sized blocks whereas fb_allocator
only supports one block size.
In comparison to the Windows global heap, the fb_allocator
is about 8 times faster and x_allocator
is about 5 times faster. On embedded devices, I've seen as high as a 15x speed increase over the global heap.
Thread Safety
The LK_LOCK
and LK_UNLOCK
macros within the LockGuard
module implement the software locks needed for thread safety. Update the lock implementation as required for your platforms operating system.
Conclusion
The C-based fixed block memory allocator presented here is suitable for any C or C++ system. For a C++ specific implementation with its own unique features, see the referenced articles.
Use the fb_allocator
whenever you need to allocate a single block size. Use x_allocator
when dispensing multiple block sizes is desired. Create multiple x_allocator
wrappers to segregate memory pools depending on intended usage.
If you have an application that really hammers the heap and is causing slow performance, or if you’re worried about a fragmented heap fault, integrating fb_allocator
and x_allocator
may help solve those problems. The implementation was kept to a minimum facilitating use on even the smallest of embedded systems.
History
- 23rd December, 2018
- 24th December, 2018
- Added benchmark section to article
- Updated attached source code