Introduction
IPC (Inter-Process Communication) has been covered by many other articles, so there is no real need for 'another' article showing how to implement IPC. However, there is a severe lack of information on how to implement a fast IPC class. That is the purpose of this article.
Because plenty of information on IPC implementation is already available, I will not dive too deeply into the inner workings of IPC itself, but will concentrate on how to make it very fast.
There are several ways to implement an IPC; here are just a few:
- Shared memory
- TCP
- Named Pipe
- File Mapping
- Mailslots
- MSMQ (Microsoft Message Queuing)
With so many choices, how is one to know which is fastest? The answer is simple: there is no single ideal solution. Each has its advantages and disadvantages, but one most definitely stands out above the others: shared memory.
Not only is shared memory one of the easiest approaches to implement, it's also one of the fastest. Why, I hear you ask? Minimal overhead. Overhead is generated whenever you make a call into another component, be it the kernel or a library; if your IPC makes no such calls on the hot path, you've done away with a large bottleneck. A shared memory IPC has no inherent requirement for third-party function calls when transferring data.
So we now know one of the best and most widely used IPC mechanisms; now we just need to find one of the fastest ways of implementing it.
Updated
This article has been updated. The latest changes introduce a circular linked list with read and write pointers. They were made to combat the LIFO (Last In, First Out) ordering problem: the previous implementation could only operate as a LIFO posting system due to its singly linked list design. This has now been rectified, with the added benefit of an increased transfer rate. If you would like to skip the initial article, jump ahead to the next 'Updated' header.
(Thanks for the ideas, Jean!)
Background
In any IPC implementation, it's best to adopt a server/client approach to communication. Also, for simplicity, it's wise to make the communication one way, normally towards the server. It's very easy to extend an IPC to two-way communication: simply create an IPC server in each process. Keeping the communication one way lets you concentrate on performance issues.
To write a fast shared memory IPC, you will need to implement several things.
Firstly, you need multiple blocks within the allocated memory. The reason for more than one block is that, while a thread has CPU execution time, it can post many blocks of data to the other process without blocking. Batching requests is one of the most effective optimizations: posting a single block and then waiting for a context switch so the server can process it would be extremely inefficient. My implementation uses the following block structure:
struct Block
{
volatile LONG Next;
DWORD Amount;
BYTE Data[IPC_BLOCK_SIZE];
};
Secondly, you need some form of inter-process thread synchronization. Without it, multiple threads could write to the same block, resulting at best in data corruption and at worst in race conditions that lock up at 100% CPU usage. For my implementation, I use events, as these are very fast, much faster than semaphores or mutexes:
HANDLE m_hMapFile;
HANDLE m_hSignal;
HANDLE m_hAvail;
Finally, you need some way of blocking thread execution on the server thread while it waits for data. One of the fastest is the WaitForSingleObject function call. My implementation uses this function and waits on the named events.
Most shared memory IPCs are implemented along the following lines (a minimal sketch of the setup calls follows the two lists):
Server
- Create named shared memory with N number of blocks of fixed size X
- Create inter-process sync objects used to prevent simultaneous access to blocks and avoid race conditions
- Wait for a signal that blocks need to be processed
- Process the blocks and flag them as available again
- Go to step 3
Client
- Open named shared memory
- Wait for blocks to become available
- Write to blocks
- Signal that blocks need to be processed
- Go to step 2
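The article assumes the shared memory and events already exist. For readers who have not set these up before, here is a minimal sketch of the server and client setup calls. The object names, the helper-function names, and the forward reference to MemBuff (the shared layout defined later in the article) are assumptions for illustration only, not the article's exact code; error checking is omitted.

// Minimal setup sketch (illustrative only, not the article's exact code).
#include <windows.h>

struct MemBuff;   // the shared-memory layout, shown later in the article

static const TCHAR *s_szMemName    = TEXT("Local\\FastIPC_Mem");
static const TCHAR *s_szSignalName = TEXT("Local\\FastIPC_Signal");   // "data waiting"
static const TCHAR *s_szAvailName  = TEXT("Local\\FastIPC_Avail");    // "blocks free"

// Server side: create the named mapping (backed by the page file) and the events.
MemBuff* serverCreate(DWORD dwMemSize, HANDLE &hMap, HANDLE &hSignal, HANDLE &hAvail)
{
    hMap    = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                0, dwMemSize, s_szMemName);
    hSignal = CreateEvent(NULL, FALSE, FALSE, s_szSignalName);   // auto-reset event
    hAvail  = CreateEvent(NULL, FALSE, FALSE, s_szAvailName);    // auto-reset event
    return (MemBuff*)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
}

// Client side: open the same named objects.
MemBuff* clientOpen(HANDLE &hMap, HANDLE &hSignal, HANDLE &hAvail)
{
    hMap    = OpenFileMapping(FILE_MAP_ALL_ACCESS, FALSE, s_szMemName);
    hSignal = OpenEvent(EVENT_ALL_ACCESS, FALSE, s_szSignalName);
    hAvail  = OpenEvent(EVENT_ALL_ACCESS, FALSE, s_szAvailName);
    return (MemBuff*)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
}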
Fast Synchronization
One of the biggest problems with using shared memory is preventing multiple threads accessing blocks at the same time. We, therefore, need a way of only allowing one thread to access a block at a time.
The first hurdle to overcome is deciding how to organize your blocks. You must effectively provide two groups of blocks: one set available for new data, and a second set requiring processing. One very effective way of doing this is to create two doubly linked lists, one containing the blocks available for new data and the other containing the blocks that require processing. The programmer then only needs to add critical sections to protect the two lists. This method works very well, and was one of my very first IPC implementations. However, after implementing it, I was very disappointed with the speed/bandwidth. (A rough sketch of this lock-based design follows the figures below.)
Using critical sections and doubly linked lists, I could only achieve 40-50k block transfers per second on a 3.2 GHz P4 before hitting 100% CPU usage. Not very impressive.
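For comparison, here is a rough sketch of the general shape of that first lock-based design; it is not the article's original code. It is simplified to a single process, with std::deque standing in for doubly linked lists of block indices embedded in the shared buffer, and a CRITICAL_SECTION standing in for the locking (a true cross-process version would need a named mutex, which is slower still). The point is simply that every block transfer pays for at least two lock/unlock round trips.

// Rough sketch of the slow, lock-based design (illustrative only).
#include <windows.h>
#include <deque>

CRITICAL_SECTION g_csAvail, g_csFilled;          // one lock per list
std::deque<LONG> g_availList;                    // indices of blocks free for writing
std::deque<LONG> g_filledList;                   // indices of blocks awaiting processing

// Client side: claim a free block index, fill it, then queue it for the server.
bool postOneBlock()
{
    EnterCriticalSection(&g_csAvail);            // lock/unlock round trip #1
    if (g_availList.empty()) { LeaveCriticalSection(&g_csAvail); return false; }
    LONG blockIndex = g_availList.front();
    g_availList.pop_front();
    LeaveCriticalSection(&g_csAvail);

    // ... copy the payload into block 'blockIndex' here ...

    EnterCriticalSection(&g_csFilled);           // lock/unlock round trip #2
    g_filledList.push_back(blockIndex);          // FIFO order is preserved
    LeaveCriticalSection(&g_csFilled);
    return true;
}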
So, what's the bottleneck here? Synchronization. Most of the wasted CPU time goes into entering and leaving critical sections, blocked CPU execution, and thread context switching. One of the best pieces of advice I ever read came from graphics engine design: "The fastest polygon is a polygon that isn't drawn". Applying this to the current context, the fastest critical section is no critical section at all. But if we don't use critical sections, how can we protect our lists? The answer is simple: use lists that are multi-thread safe by construction, i.e., singly linked lists manipulated with atomic operations. Enter InterlockedCompareExchange.
Now, originally, I used the InterlockedPopEntrySList and InterlockedPushEntrySList functions with singly linked lists. However, I ran into quite a serious problem. At first, it seemed to work perfectly, as I was testing the IPC server and IPC client within the same process; but when testing the class across two separate processes, I ran into a big dilemma: virtual memory.
Every block within a singly linked list has to have a pointer to the next block. Now, if we dig into SLIST_ENTRY, we see the following problem:
typedef struct _SLIST_ENTRY {
struct _SLIST_ENTRY* Next;
} SLIST_ENTRY, *PSLIST_ENTRY;
The standard Windows interlocked lists use memory pointers to the next block, but this is a serious problem when you use multiple processes, as the other process has a completely different memory address space. While the memory block pointer is valid under one thread context, as soon as it switches to the other, it's no longer valid, leading to access violations.
InterlockedPopEntrySList can't be used. But the concept can; we just need to rewrite the functions so that they don't use pointers. This is where my block structure comes into play. If you look back at the definition, you'll notice that it has this:
volatile LONG Next;
The volatile keyword tells the compiler not to cache this variable's value (in a register, for example): without it, the compiler could assume Next still holds a previously read value even though another thread has changed it in the meantime. Also, note that the variable type is a LONG. Instead of a raw pointer, it stores the position of the next block as an offset within the mapped memory (an index into the block array, as the code below shows), so the link remains valid in every process's address space. We now simply need to write our own InterlockedPushEntrySList and InterlockedPopEntrySList equivalents for this new block structure. These are my implementations of the functions:
void osIPC::Client::postBlock(Block *pBlock)
{
LONG blockIndex = (PointerConverter(pBlock).ptrLong -
PointerConverter(&m_pBuf->m_Blocks).ptrLong)
/ sizeof(Block);
for (;;) {
LONG iFirst = pBlock->Next = m_pBuf->m_Filled;
if (InterlockedCompareExchange(&m_pBuf->m_Filled,
blockIndex, iFirst) == iFirst)
break;
}
SetEvent(m_hSignal);
};
osIPC::Block* osIPC::Client::getBlock(void)
{
for (;;) {
LONG blockIndex = m_pBuf->m_Available;
if (blockIndex == 0) return NULL;
Block *pBlock = m_pBuf->m_Blocks + blockIndex;
if (InterlockedCompareExchange(&m_pBuf->m_Available,
pBlock->Next, blockIndex) == blockIndex)
return pBlock;
}
};
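The PointerConverter helper used in postBlock() isn't defined in the article. A minimal sketch of what such a helper might look like is below: a small union that exposes a pointer's numeric value so two addresses can be subtracted to obtain a block index. This is an assumption for illustration only; note that on a 64-bit build a plain LONG would truncate the address, so LONG_PTR is used here.

// Sketch of a possible PointerConverter helper (an assumption, not the article's code).
#include <windows.h>

union PointerConverter
{
    const void *ptr;        // the raw pointer
    LONG_PTR    ptrLong;    // the same bits viewed as an integer

    PointerConverter(const void *p) : ptr(p) {}
};

// Example: recover a block's index from its address.
// LONG blockIndex = (LONG)((PointerConverter(pBlock).ptrLong -
//                           PointerConverter(&m_pBuf->m_Blocks).ptrLong) / sizeof(Block));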
m_pBuf is a pointer to the mapped memory, which has the following structure:
struct MemBuff
{
volatile LONG m_Available;
volatile LONG m_Filled;
Block m_Blocks[IPC_BLOCK_COUNT];
};
The server will use very similar functions, but on opposing lists.
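The article doesn't list the server-side versions, but following the sentence above they would look something like the sketch below: the same pop/push pattern, just operating on the opposite lists and signalling the 'available' event instead. The function names here simply mirror the client-side ones and are assumptions.

// Sketch of the server-side counterparts (assumed, mirroring the client code above).
osIPC::Block* osIPC::Server::getBlock(void)
{
    for (;;) {
        LONG blockIndex = m_pBuf->m_Filled;                  // head of the "to process" list
        if (blockIndex == 0) return NULL;                    // nothing waiting
        Block *pBlock = m_pBuf->m_Blocks + blockIndex;
        if (InterlockedCompareExchange(&m_pBuf->m_Filled,
            pBlock->Next, blockIndex) == blockIndex)
            return pBlock;                                   // this thread now owns the block
    }
};

void osIPC::Server::retBlock(Block *pBlock)
{
    LONG blockIndex = (PointerConverter(pBlock).ptrLong -
                       PointerConverter(&m_pBuf->m_Blocks).ptrLong) / sizeof(Block);
    for (;;) {
        LONG iFirst = pBlock->Next = m_pBuf->m_Available;    // push onto the free list
        if (InterlockedCompareExchange(&m_pBuf->m_Available,
            blockIndex, iFirst) == iFirst)
            break;
    }
    SetEvent(m_hAvail);                                      // wake a client waiting for space
};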
WaitForSingleObject and Signals
We are almost there. We now have two lists that both the client and the server can manipulate at the same time, with no possibility of multithreading issues. We now simply need a way of telling the client that blocks are available for writing, and telling the server that blocks are available for processing.
This is done simply with events and WaitForSingleObject with appropriate timeouts. Once you have these and the lists, you have a very fast IPC that, so long as both threads are getting sufficient CPU time, should rarely block or make external function calls.
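Put together, a server processing loop built on these pieces might look like the sketch below. The one-second timeout, the processData() callback, and the hSignal parameter (standing in for the named signal event, which would normally live inside the class) are illustrative assumptions.

// Sketch of a server processing loop (illustrative only).
void processData(const BYTE *pData, DWORD cbData);      // user-supplied handler (assumed)

void serverProcessingLoop(osIPC::Server &server, HANDLE hSignal)
{
    for (;;)
    {
        osIPC::Block *pBlock;
        while ((pBlock = server.getBlock()) != NULL)     // drain every filled block without blocking
        {
            processData(pBlock->Data, pBlock->Amount);   // handle the payload
            server.retBlock(pBlock);                     // return the block to the free list
        }
        WaitForSingleObject(hSignal, 1000);              // sleep until the client signals new data
    }
}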
Speed
By now, you must be wondering how fast this implementation can really go. It's fast, very fast! Not only is the blocks per second rate very fast, but the bandwidth can be huge as long as you keep your code optimized.
On my 3.2 GHz P4 laptop with 1024 MB of RAM, I recently measured these speeds:
Bandwidth Test (Packet Size: 3000 Bytes) -> 800,000,000 Bytes/Sec
Rate Test (Packet Size: 100 Bytes) -> 650,000 Packets/Sec
These figures really do speak for themselves; I've not seen any implementation that comes close. I've tested it on a dual-core computer, with the client and server on separate cores, and got bandwidths of 1,000,000,000 Bytes/Sec and rates of 2,000,000 Packets/Sec!
The most interesting performance gains are when you stress test it with multiple clients. As long as the block count is scaled with the number of clients, the total bandwidth is hardly affected with a moderate number of clients.
Problems?
You may have spotted one glaring design flaw in this implementation: the lack of FIFO (First In, First Out) ordering. Doubly linked lists are used in other implementations for a very good reason: they allow the programmer to add blocks to the end of the list, ensuring that the first block added is the first block processed. Singly linked lists only allow adding blocks to the front, meaning the first block added will be the last block processed (FILO). If blocks are added faster than they are processed, the block that was added first could sit in the list for quite some time while newer blocks get processed instead. If you know and understand this limitation, that's OK; but what if it's unacceptable, for example when you are trying to duplicate the behavior of a local TCP connection? TCP connections have to guarantee FIFO.
The answer is simple. When the server processes blocks, simply create another singly linked list and fill this first. When the server pops a block from the 'to be processed' list, it adds this block to the intermediate list before it's processed. As long as the server takes all the blocks out of the 'to be processed' list before going on to process the intermediate list as usual, it will process blocks in a FIFO fashion. Try to work it out in your head and you will see how it works; a short sketch of the idea follows.
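Here is a small sketch of that idea; the names are illustrative, and a local container stands in for the intermediate singly linked list. Note, though, that the Updated section below explains why this approach is still not a complete fix.

// Sketch of the intermediate-list idea (illustrative only).
#include <vector>

void processAllInFIFO(osIPC::Server &server)
{
    std::vector<osIPC::Block*> intermediate;         // stands in for the second singly linked list

    // 1. Drain the whole 'to be processed' list first; blocks come out newest-first (LIFO).
    osIPC::Block *pBlock;
    while ((pBlock = server.getBlock()) != NULL)
        intermediate.push_back(pBlock);

    // 2. Walk the local list in reverse: the reversal of a reversal restores the original
    //    posting order, so blocks are now handled oldest-first (FIFO).
    for (size_t i = intermediate.size(); i > 0; --i)
    {
        osIPC::Block *p = intermediate[i - 1];
        // ... process p->Data / p->Amount here ...
        server.retBlock(p);                          // hand the block back to the free list
    }
}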
Happy coding!
Updated
Introduction
The need for FIFO ordering has resulted in this update. As mentioned in the Problems section above, the existing implementation has a serious flaw: it is LIFO, always reading the block most recently added to the list. Blocks are not only reversed but can also come out jumbled, which means data posted to the IPC can arrive out of order; for most programs this is an unacceptable restriction.
My first plan to rectify the problem was to use another singly linked list and push all the blocks into it before processing them, which reverses the order back again. Although this method would have worked to some degree, it still has a major problem: if a block is added to the filled list before the second list is fully populated, the new block can appear anywhere between the start and the end, resulting in neither FIFO nor LIFO ordering.
Comments from Jean and others pointed me towards another implementation I could try: a circular linked list with read and write cursors. The IPC class needed a complete rewrite to work with this model, so let's jump right into the concept.
Concept
First, let me explain what's being proposed. A doubly linked circular list is the equivalent of a doubly linked list, but with the start and end of the list pointing to each other. Effectively, if you were to traverse such a list, you would circle through every entry forever. E.g.:
We now introduce some read and write pointers. They must all follow some set rules:
- All pointers must never pass each other
- End pointers may equal start pointers (indicating zero data)
- Read and write pointers must never equal each other
This design means we can have many blocks being written and many blocks being read at the same time. As long as there is space in front of the write start cursor, writes will never block; likewise, as long as there is space in front of the read cursor, reads will never block.
This design also handles multiple readers and writers. The number of threads that can work on the same IPC (readers and writers combined) is equal to the number of blocks held in it. The areas where reading and writing are taking place (the green and red areas on the diagram) only contain blocks owned by individual threads, so they are multi-thread safe: once a block has been retrieved, that thread effectively owns it. The main advantage of this method is that the system now operates in a FIFO manner.
For performance reasons, just like the existing implementation, we do not want to use any critical sections, mutexes, or other heavyweight thread synchronization. This means all operations that move the pointers must be performed atomically, i.e., they appear to complete in a single instruction from every other thread's perspective. This ensures the pointers are not corrupted by multiple threads trying to move the same read or write cursor forward.
OK. Let's dive into the code:
First, we must modify our data structures and setup the circular linked list. This is all done once the IPC server is created.
struct Block
{
LONG Next;
LONG Prev;
volatile LONG doneRead;
volatile LONG doneWrite;
DWORD Amount;
DWORD _Padding;
BYTE Data[IPC_BLOCK_SIZE];
};
We've made a few changes to this structure. Firstly, we've added a previous pointer (Prev) that points to the previous block in the circular list. Secondly, we've added doneRead and doneWrite flags; more on these later. We've also aligned everything to a 64-bit boundary, which is an optimization for 64-bit machines.
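A quick compile-time check of that alignment claim is shown below; it assumes IPC_BLOCK_SIZE is itself a multiple of 8, and uses the C++11 static_assert (the older C_ASSERT macro would do the same job).

// Compile-time layout check (assumes IPC_BLOCK_SIZE is a multiple of 8).
// The four LONGs plus Amount and _Padding occupy 24 bytes, so Data starts on
// an 8-byte boundary and the whole Block stays 64-bit aligned.
#include <cstddef>
static_assert(offsetof(Block, Data) % 8 == 0, "Data must start on an 8-byte boundary");
static_assert(sizeof(Block) % 8 == 0, "Block must be a multiple of 8 bytes");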
struct MemBuff
{
Block m_Blocks[IPC_BLOCK_COUNT];
volatile LONG m_ReadEnd;
volatile LONG m_ReadStart;
volatile LONG m_WriteEnd;
volatile LONG m_WriteStart;
};
We've moved the blocks array to the front of the shared memory buffer. This is a performance optimization so that we remove the offset that needs adding every time we reference a block.
All the singly linked lists have been removed and replaced with read and write cursors (a start and an end cursor for each).
Let's now build the circular linked list and set the default cursor positions (server initialization code).
int N = 1;
m_pBuf->m_Blocks[0].Next = 1;
m_pBuf->m_Blocks[0].Prev = (IPC_BLOCK_COUNT-1);
for (;N < IPC_BLOCK_COUNT-1; N++)
{
m_pBuf->m_Blocks[N].Next = (N+1);
m_pBuf->m_Blocks[N].Prev = (N-1);
}
m_pBuf->m_Blocks[N].Next = 0;
m_pBuf->m_Blocks[N].Prev = (IPC_BLOCK_COUNT-2);
m_pBuf->m_ReadEnd = 0;
m_pBuf->m_ReadStart = 0;
m_pBuf->m_WriteEnd = 1;
m_pBuf->m_WriteStart = 1;
The first block in the array is linked to the second and last blocks; likewise, the last block is linked to the first and second-to-last blocks. This completes the circle.
We then set the read and write cursors. Start and end are equal because no read or write operations are currently taking place, and the read cursor sits right behind the write cursor because no data has been added yet.
Now, we need to implement a method for adding data. This should move the writeStart pointer forward (but only if there's space to expand), thus increasing the gap between the writeStart and writeEnd cursors. Once this gap has been expanded by the thread, the resulting block can be accessed, allowing data to be written to it.
// Signature inferred: this is the client-side getBlock, mirroring the
// server-side getBlock shown later in the article.
osIPC::Block* osIPC::Client::getBlock(DWORD dwTimeout)
{
    for (;;)
    {
        // Record the current cursor position for the atomic exchange below
        LONG blockIndex = m_pBuf->m_WriteStart;
        Block *pBlock = m_pBuf->m_Blocks + blockIndex;
        if (pBlock->Next == m_pBuf->m_ReadEnd)
        {
            // Buffer is full: wait for the server to free a block, or time out
            if (WaitForSingleObject(m_hAvail, dwTimeout) == WAIT_OBJECT_0)
                continue;
            return NULL;
        }
        // Claim the block by advancing writeStart atomically
        if (InterlockedCompareExchange(&m_pBuf->m_WriteStart,
            pBlock->Next, blockIndex) == blockIndex)
            return pBlock;
        continue;
    }
};
First, we need to ensure the operation is atomic. To do this, the block index is recorded to be used by the InterlockedCompareExchange later on.
Next, we make sure the writeStart cursor will not pass any outstanding read operations (readEnd). If it would, then we need to wait for some space to become available. A timeout is provided in case the user does not wish to wait. This situation would occur if the IPC buffers have been completely filled with data.
The cursor is then moved forward using an InterlockedCompareExchange call. This makes sure the cursor was not modified by another thread while the checks took place; the previously recorded value (blockIndex) is used as the reference. If the call is successful, the block is then returned.
// Signature inferred to mirror the server-side read() shown later.
DWORD osIPC::Client::write(void *pBuff, DWORD amount, DWORD dwTimeout)
{
    Block *pBlock = getBlock(dwTimeout);
    if (!pBlock) return 0;
    // Copy the caller's data into the block, clamped to the block size
    DWORD dwAmount = min(amount, IPC_BLOCK_SIZE);
    memcpy(pBlock->Data, pBuff, dwAmount);
    pBlock->Amount = dwAmount;
    // Hand the block over so it can be read
    postBlock(pBlock);
    return dwAmount;   // report how many bytes were actually written
};
This block of code does the actual data writing itself. Once the data is copied into the block, the thread gives up ownership of the block using the 'postBlock()' call. This will move the writeEnd cursor up, allowing read operations to take place on the data we've posted.
void osIPC::Client::postBlock(Block *pBlock)
{
pBlock->doneWrite = 1;
for (;;)
{
DWORD blockIndex = m_pBuf->m_WriteEnd;
pBlock = m_pBuf->m_Blocks + blockIndex;
if (InterlockedCompareExchange(&pBlock->doneWrite, 0, 1) != 1)
{
return;
}
InterlockedCompareExchange(&m_pBuf->m_WriteEnd,
pBlock->Next, blockIndex);
if (pBlock->Prev == m_pBuf->m_ReadStart)
SetEvent(m_hSignal);
}
};
Remember the flags we added to the block structure earlier? These are now used in the 'postBlock()' function call, one flag for reading and one for writing. This particular case uses the 'doneWrite' flag. The reason for the use of these flags is explained in detail below.
Without the 'doneWrite' flag, there is a serious problem. Imagine this situation: a thread posts back a block it owns, but that block is not the first block in the 'write' section of the memory (i.e., it's not the block at m_WriteEnd). We therefore cannot move the writeEnd cursor forward; to do so would mean the thread that actually owns the first block loses its exclusive ownership, not to mention the problems that would occur if that thread then tried to post back the very block that has already been counted. We therefore have to delay moving the writeEnd cursor forward, leaving responsibility for the cursor with the thread that owns the first block. To signal that we've given up ownership of our block, we set the 'doneWrite' flag.
Without this 'doneWrite' flag, when it later comes to actually moving the writeEnd cursor forward, there would be no way of knowing which of the following blocks have already been returned; hence the need for the flag. The flag is initialized to zero; once a block is returned, it's set to one. When the block is made available again by moving the writeEnd pointer forward, the flag is zeroed again (via InterlockedCompareExchange).
The writeEnd pointer will be moved forward as far as the last completed block, i.e., across all sequential blocks that have their 'doneWrite' flag set to one. Notice that, by design, this also ensures that the writeEnd cursor can never pass the writeStart pointer.
Finally, we signal any threads that are waiting to read data.
We now have a way of filling blocks in the buffer; we just need a way of reading those blocks back again. On to the read operations:
osIPC::Block* osIPC::Server::getBlock(DWORD dwTimeout)
{
for (;;)
{
LONG blockIndex = m_pBuf->m_ReadStart;
Block *pBlock = m_pBuf->m_Blocks + blockIndex;
if (pBlock->Next == m_pBuf->m_WriteEnd)
{
if (WaitForSingleObject(m_hSignal,
dwTimeout) == WAIT_OBJECT_0)
continue;
return NULL;
}
if (InterlockedCompareExchange(&m_pBuf->m_ReadStart,
pBlock->Next, blockIndex) == blockIndex)
return pBlock;
continue;
}
};
void osIPC::Server::retBlock(osIPC::Block* pBlock)
{
pBlock->doneRead = 1;
for (;;)
{
DWORD blockIndex = m_pBuf->m_ReadEnd;
pBlock = m_pBuf->m_Blocks + blockIndex;
if (InterlockedCompareExchange(&pBlock->doneRead, 0, 1) != 1)
{
return;
}
InterlockedCompareExchange(&m_pBuf->m_ReadEnd,
pBlock->Next, blockIndex);
if (pBlock->Prev == m_pBuf->m_WriteStart)
SetEvent(m_hAvail);
}
};
DWORD osIPC::Server::read(void *pBuff, DWORD buffSize, DWORD dwTimeout)
{
Block *pBlock = getBlock(dwTimeout);
if (!pBlock) return 0;
DWORD dwAmount = min(pBlock->Amount, buffSize);
memcpy(pBuff, pBlock->Data, dwAmount);
retBlock(pBlock);
return dwAmount;
};
Notice that this implementation is essentially identical to the write-side operations, except that the cursors and flags are different. This makes sense, as we are effectively doing the same thing: moving a region of blocks forward sequentially. We could, if we wanted, add another cursor somewhere between the read and write cursors that performs some other action. This particular implementation has no need for one, but it could be useful to others.
That's basically all there is to it. We have the ability to write to blocks and the ability to read from them. All the functions are multi-thread safe, use minimal overhead, and don't block or give up CPU execution unless they have to.
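To show how the pieces fit together, here is a hedged usage sketch. The osIPC::Server and osIPC::Client constructors and the exact write() signature aren't shown in the article, so they are assumptions here; read() is the function listed above.

// End-to-end usage sketch (constructors and write() signature are assumptions).
#include <cstdio>

void serverLoop()
{
    osIPC::Server server;                                      // creates the shared memory and events
    char buffer[IPC_BLOCK_SIZE];
    for (;;)
    {
        DWORD got = server.read(buffer, sizeof(buffer), 1000); // wait up to 1s for data
        if (got != 0)
            printf("received %lu bytes\n", (unsigned long)got);
    }
}

void clientLoop()
{
    osIPC::Client client;                                      // opens the same named objects
    const char msg[] = "hello from the client";
    for (int i = 0; i < 1000000; i++)
        client.write((void*)msg, sizeof(msg), 1000);           // only blocks when the buffer is full
}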
We now have all the functionality needed to perform IPC communication quickly and in a FIFO fashion. Let's see how it performs!!!
Performance
Bandwidth Test:
Transfer Rate Test:
The numbers all go up, so that's good! Pay special attention to the increased transfer rate: over two million block transfers a second! That's a huge transfer potential. Bandwidth is mainly restricted by the speed at which data can be moved into and out of a block, i.e., the time it takes to complete the 'memcpy' call, but even with huge memory transfers, the IPC achieves well over 7.5 GBit/s. That's getting rather close to the physical RAM writing speed. To put this in perspective, try putting a 'memcpy' call in a simple loop and see what transfer rates you get. If you compare the overhead of this IPC implementation to the rest of a program, you will see it adds almost none. The potential for more performance is only limited by the hardware itself.
With the best BSP tree I could write, I can only achieve two million BSP tree lookups a second, and with my best hash table, lookups of four million a second were observed. A simple loop that does almost nothing at all could only perform ten million iterations per second. The use of this IPC implementation can be regarded as almost CPU free.
In terms of memory consumption, only about 2 MB of RAM is used regardless of how heavily you use the IPC. This can be reduced further by lowering the block count (IPC_BLOCK_COUNT) or the block size (IPC_BLOCK_SIZE). Memory fragmentation is impossible, as all the blocks sit in one contiguous piece of memory (which also benefits the CPU cache).
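As a back-of-the-envelope check (the article's actual constants aren't given here, so the values below are assumptions), 512 blocks of 4 KB plus their small headers comes to roughly 2 MB:

// Footprint estimate with assumed constants; the real IPC_BLOCK_COUNT /
// IPC_BLOCK_SIZE values may differ.
#include <cstddef>
const size_t blockCount = 512;
const size_t blockSize  = 4096;                    // payload bytes per block
const size_t headerSize = 24;                      // Next, Prev, doneRead, doneWrite, Amount, _Padding
const size_t cursors    = 4 * sizeof(long);        // the four read/write cursors
const size_t total = blockCount * (blockSize + headerSize) + cursors;
// total = 512 * 4120 + 16 = 2,109,456 bytes, i.e. roughly 2 MB of shared memory.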
All this extra performance is a major bonus considering all we wanted to do was make the IPC operate in a FIFO nature!
I hope this code is of benefit to others. Happy coding!