Introduction
Developing high-performance applications often requires a fast memory allocator. Using distributed memory pools may improve
the performance of multi-threaded applications on Multi-core PCs [1-3].
Distributed pools attempt to minimize the impact of "false sharing" and thread contention on a Multi-core PC. Instead of having all threads access
a global memory store concurrently (which can become a bottleneck), per-thread memory pools can be used, balancing
private and global pools to leverage fast non-locking algorithms. Such distributed solutions are not simple. Recent versions of Microsoft
Windows provide optimized synchronization primitives, and we might want to take advantage of them to design simple, but still fast, pool solutions.
The simplest distributed pool is an array of pools. Application threads are "distributed" among these pools according to some
distribution rule. The rule may depend on how threads are created and terminated: it can be a uniform distribution established
at application start-up, or it can be a "hash-based" distribution. To try these techniques out, we introduce a template class, Pool_t,
representing an element of a pool array. Then we implement a class, Array_of_pools_t, representing a distributed pool.
For a Pool_t object, we use a straightforward implementation leveraging the Windows Interlockedxxx
API for singly linked lists:
Figure 1. Pool_t, an element of a pool array.
template <typename Entry_t, size_t pool_size>
class CACHE_ALIGN Pool_t
{
public:
    // Pops an entry off the lock-free list and returns a pointer to its payload.
    inline Entry_t* pop(void)
    {
        List_entry* plist = static_cast<List_entry*>(interlocked_pop_entry_slist(&m_head));
        return (plist) ? reinterpret_cast<Entry_t*>(reinterpret_cast<char*>(plist) +
            data_offset) : NULL;
    }

    // Converts the payload pointer back to its list entry and pushes it onto the list.
    inline void push(Entry_t* data)
    {
        List_entry* plist =
            reinterpret_cast<List_entry*>(reinterpret_cast<char*>(data) - data_offset);
        interlocked_push_entry_slist(&m_head, plist);
    }

    Pool_t()
    {
        ::InitializeSListHead(&m_head);
        // Pre-populate the free list with every entry of the embedded array.
        for(size_t i = 0; i < pool_size; i++)
        {
            interlocked_push_entry_slist(&m_head, &m_pool[i].m_list_entry);
        }
    }

    virtual ~Pool_t()
    {
    }

private:
    typedef SLIST_HEADER CACHE_ALIGN SLIST_HEADER_;

    struct CACHE_ALIGN List_entry: public SLIST_ENTRY
    {
    };

    // An entry as stored in the pool: the list hook followed by the payload.
    struct CACHE_ALIGN _Internal_entry_t
    {
        CACHE_PADDING(1);
        List_entry m_list_entry;
        Entry_t m_value;
    };

    // Distance between the list hook and the payload within an entry.
    static const int data_offset = offsetof(_Internal_entry_t, m_value) -
                                   offsetof(_Internal_entry_t, m_list_entry);

    CACHE_PADDING(1);
    _Internal_entry_t CACHE_ALIGN m_pool[pool_size];
    CACHE_PADDING(2);
    SLIST_HEADER_ m_head;
};
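The CACHE_ALIGN and CACHE_PADDING macros are not shown in the listings. One possible definition (an assumption, not the article's code, targeting MSVC and a 64-byte cache line) might look like this:

// Hypothetical helper macros, assuming MSVC and a 64-byte cache line.
#define CACHE_LINE_SIZE  64
#define CACHE_ALIGN      __declspec(align(64))
// 'n' keeps the padding member names unique within one class.
#define CACHE_PADDING(n) char m_cache_padding##n[CACHE_LINE_SIZE]

The intent is that each frequently updated member (in particular, the list head) occupies its own cache line, so that writes to one pool do not invalidate cache lines used by another.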
This pool implementation is thread-safe. It is fast on a uniprocessor PC. On a Multi-core PC, however, sharing the head
of the list, m_head, incurs a performance penalty. The impact depends on the hardware and may be quite significant. By introducing a distributed pool,
we want different threads to update list heads at different memory locations, so that thread contention and "false sharing"
become less significant. If all threads are created at application start-up, a simple solution is to distribute the threads uniformly within an array of pools:
Figure 2. Simple distributed pool Array_of_pools_t.
template <typename Entry_t, size_t pool_size, size_t number_of_pools>
class CACHE_ALIGN Array_of_pools_t
{
public:
    inline Entry_t* pop(size_t index)
    {
        return m_pools[index].pop();
    }

    inline void push(Entry_t* data, size_t index)
    {
        m_pools[index].push(data);
    }

    // Hands out pool indices round-robin; each thread calls this once and
    // keeps the returned index for all subsequent pop/push calls.
    inline size_t get_pool_index()
    {
        volatile static LONG m_counter = 0L;
        return (::InterlockedIncrement(&m_counter) - 1) % number_of_pools;
    }

    Array_of_pools_t()
    {
    }

private:
    Pool_t<Entry_t, pool_size> m_pools[number_of_pools];
};
When a thread is created, it first obtains and stores index = get_pool_index(). This index addresses a pool within the array,
which the thread will access in all subsequent pop/push calls. It is important for a thread to use the same index all the time, as the sketch below illustrates.
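For illustration, here is one way a worker thread might use the class (a sketch only: Buffer_t, the sizes, and the thread routine are hypothetical and not part of the article's attached test code):

#include <windows.h>

struct Buffer_t { char data[256]; };                 // hypothetical payload type

typedef Array_of_pools_t<Buffer_t, 1024, 4> Pools_t; // sizes chosen arbitrarily

DWORD WINAPI worker_thread(LPVOID arg)
{
    Pools_t* pools = static_cast<Pools_t*>(arg);
    const size_t index = pools->get_pool_index();    // obtained once, reused below

    for(int i = 0; i < 1000000; i++)
    {
        Buffer_t* buf = pools->pop(index);
        if(buf != NULL)
        {
            // ... fill the buffer, compute the CRC, etc. ...
            pools->push(buf, index);                 // always the same index
        }
    }
    return 0;
}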
Another approach to distributing threads is to use a hash as the pool index, for example:
Figure 4. Using a multiplicative hash to distribute threads within an array of pools.
template <typename Entry_t, size_t pool_size, size_t number_of_pools>
inline size_t Array_of_pools_t<Entry_t, pool_size, number_of_pools>::get_pool_index()
{
    return hash(::GetCurrentThreadId());
}

template <typename Entry_t, size_t pool_size, size_t number_of_pools>
long Array_of_pools_t<Entry_t, pool_size, number_of_pools>::hash(long key)
{
    // Multiplicative hash: take the fractional part of key * golden ratio
    // and scale it into [0, number_of_pools).
    const float golden = 0.6180339f;
    float val = (float)key * golden;
    int ival = (int)val;
    return (long)((val - (float)ival) * (float)number_of_pools);
}
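As a quick sanity check, the hash can be exercised in isolation (a standalone sketch, not part of the article's code, with number_of_pools fixed at 4):

#include <stdio.h>

static long hash(long key)
{
    const float golden = 0.6180339f;
    float val = (float)key * golden;
    int ival = (int)val;
    return (long)((val - (float)ival) * 4.0f);  // number_of_pools == 4
}

int main()
{
    // For instance, a thread id of 5180 maps to pool index 1:
    // 5180 * 0.6180339 = 3201.4156, fraction 0.4156, 0.4156 * 4 = 1.66 -> 1.
    printf("%ld\n", hash(5180));
    return 0;
}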
The test results for a single Pool_t object and an Array_of_pools_t object are summarized in Figure 5. Eight threads (4 writers and 4 readers)
concurrently accessed the pools on a Quad-core PC. Each writer thread popped a buffer from the pool, filled the buffer with random bytes,
calculated the CRC, and then pushed the buffer back. Each reader thread popped a buffer from the pool, calculated the CRC,
and compared it with the CRC saved by the writer (an exception was to be thrown on a CRC error, which would
indicate an incorrect pool implementation). The test application is attached to this article.
Figure 5. Performance comparison of a single Pool_t and a distributed Array_of_pools_t object when running 8 threads
(4 writers and 4 readers) on a Quad-core PC: Intel Q6700, 2.66 GHz, Hyper-Threading disabled, 64-bit Windows Vista Home Premium SP1,
compiled with VC++ 2008 Express (32-bit and 64-bit builds). Duration per operation in microseconds.
The average duration per read/write in microseconds was calculated over 50 tests, each including 1,000,000 and 4,000,000
iterations per writer and per reader, respectively. The results show that the distributed pool Array_of_pools_t outperforms
a single pool, with a considerable improvement (up to ~5-6 times) when the number of pools increases from 1 to 4 on a Quad-core PC.
When the number of threads increases from 8 to 32, the improvement is even larger, up to 10 times. Two of the factors
affecting the performance of these pools are "false sharing" and thread contention. To reduce the impact of thread contention,
we may try a "per processor" distribution rule in place of the "per thread" rule. Figure 6 shows the corresponding
modifications of the Pool_t and Array_of_pools_t code:
Figure 6. "Per processor" distributed pool.
template <typename Entry_t, size_t pool_size>
class CACHE_ALIGN Pool_t
{
public:
    inline Entry_t* pop(void)
    {
        List_entry* plist =
            static_cast<List_entry*>(interlocked_pop_entry_slist(&m_head));
        return (plist) ? reinterpret_cast<Entry_t*>(reinterpret_cast<char*>(plist) +
            data_offset) : NULL;
    }

    inline void push(Entry_t* data)
    {
        List_entry* plist =
            reinterpret_cast<List_entry*>(reinterpret_cast<char*>(data) - data_offset);
        interlocked_push_entry_slist(&m_head, plist);
    }

    // Recovers the id of the pool that owns the given entry, so that push
    // can return the entry to its home pool.
    inline static size_t get_pool_id(Entry_t* data)
    {
        List_entry* plist =
            reinterpret_cast<List_entry*>(reinterpret_cast<char*>(data) - data_offset);
        return plist->m_pool_id;
    }

    // Stamps every entry of this pool with the owning pool's id.
    void set_pool_id(size_t id)
    {
        for(size_t i = 0; i < pool_size; i++)
        {
            m_pool[i].m_list_entry.m_pool_id = id;
        }
    }

    Pool_t()
    {
        ::InitializeSListHead(&m_head);
        for(size_t i = 0; i < pool_size; i++)
        {
            m_pool[i].m_list_entry.m_pool_id = 0;
            interlocked_push_entry_slist(&m_head, &m_pool[i].m_list_entry);
        }
    }

    virtual ~Pool_t()
    {
    }

private:
    typedef SLIST_HEADER CACHE_ALIGN SLIST_HEADER_;

    struct CACHE_ALIGN List_entry: public SLIST_ENTRY
    {
        size_t m_pool_id;    // id of the pool owning this entry
    };

    struct CACHE_ALIGN _Internal_entry_t
    {
        CACHE_PADDING(1);
        List_entry m_list_entry;
        Entry_t m_value;
    };

    static const int data_offset = offsetof(_Internal_entry_t, m_value) -
                                   offsetof(_Internal_entry_t, m_list_entry);

    CACHE_PADDING(1);
    _Internal_entry_t CACHE_ALIGN m_pool[pool_size];
    CACHE_PADDING(2);
    SLIST_HEADER_ m_head;
};
template <typename Entry_t, size_t pool_size, size_t number_of_pools>
class CACHE_ALIGN Array_of_pools_t
{
public:
    // Pops from the pool associated with the processor the calling
    // thread is currently running on.
    inline Entry_t* pop()
    {
        return m_pools[::GetCurrentProcessorNumber() % number_of_pools].pop();
    }

    // Pushes the entry back to its home pool, whichever processor the
    // calling thread is on now.
    inline void push(Entry_t* data)
    {
        m_pools[Pool_t<Entry_t, pool_size>::get_pool_id(data)].push(data);
    }

    Array_of_pools_t()
    {
        for(size_t i = 0; i < number_of_pools; i++)
        {
            m_pools[i].set_pool_id(i);
        }
    }

private:
    Pool_t<Entry_t, pool_size> m_pools[number_of_pools];
};
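With this variant, the caller no longer tracks an index: pop selects a pool by the current processor number, while push uses the stored pool id. A minimal sketch, reusing the hypothetical Buffer_t from the earlier example:

Array_of_pools_t<Buffer_t, 1024, 4> pools;

Buffer_t* buf = pools.pop();    // pool chosen by the current processor number
if(buf != NULL)
{
    // ... use the buffer ...
    pools.push(buf);            // returned to its home pool via m_pool_id
}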
The idea of the "per processor" distribution is to reduce the number of threads that can access the shared members of a pool object
at the same time on a Multi-core PC. This should further minimize thread contention on the pool heads, and thus improve performance.
But the test results in Figure 5 show little difference in performance between the "per processor"
and "per thread" distributions; the "per thread" distributed pools are not slower. To see the difference in contention
between the "per thread" and "per processor" distributions, we may introduce test versions of the Interlockedxxx functions for singly
linked lists with a "contention" counter. The 32-bit implementation below is simple and perhaps not too far from what Microsoft
does for the 32-bit singly linked list API:
Figure 7. SLIST pop/push testing functions with a "contention" counter for 32-bit code.
#if defined(TEST_CONTENTION_COUNT)

#define interlocked_pop_entry_slist  test_interlocked_pop_entry_slist32
#define interlocked_push_entry_slist test_interlocked_push_entry_slist32

extern volatile long g_contention_count;

PSLIST_ENTRY WINAPI
test_interlocked_pop_entry_slist32(__inout PSLIST_HEADER ListHead)
{
    SLIST_HEADER cmp_header, prev_header;
    prev_header.Alignment = ListHead->Alignment;
    _read_compiler_barrier_();
    long count = -1;      // counts failed compare-exchange attempts
    do
    {
        count++;
        if ((prev_header.Next).Next == 0)
        {
            return NULL;  // the list is empty
        }
        // Build the new header: advance the head and bump the sequence
        // number (the ABA guard).
        SLIST_HEADER exch = {0};
        (exch.Next).Next = (prev_header.Next).Next->Next;
        exch.Sequence = prev_header.Sequence + 1;
        cmp_header.Alignment = prev_header.Alignment;
        prev_header.Alignment = ::_InterlockedCompareExchange64(
            reinterpret_cast<__int64*>(&(ListHead->Alignment)),
            exch.Alignment,
            cmp_header.Alignment);
        __asm{pause}
    }
    while(prev_header.Alignment != cmp_header.Alignment);
    ::_InterlockedExchangeAdd(&g_contention_count, count);
    return prev_header.Next.Next;
}

PSLIST_ENTRY WINAPI test_interlocked_push_entry_slist32(
    __inout PSLIST_HEADER ListHead,
    __inout PSLIST_ENTRY ListEntry)
{
    PSLIST_ENTRY prev_entry;
    PSLIST_ENTRY first_entry;
    prev_entry = ListHead->Next.Next;
    _read_compiler_barrier_();
    long count = -1;      // counts failed compare-exchange attempts
    do
    {
        count++;
        first_entry = prev_entry;
        ListEntry->Next = prev_entry;
        prev_entry = reinterpret_cast<PSLIST_ENTRY>(
            ::_InterlockedCompareExchange(
                reinterpret_cast<long*>(&(ListHead->Next.Next)),
                reinterpret_cast<long>(ListEntry),
                reinterpret_cast<long>(first_entry)));
        __asm{pause}
    } while (first_entry != prev_entry);
    ::_InterlockedExchangeAdd(&g_contention_count, count);
    return prev_entry;
}

#else

#define interlocked_pop_entry_slist  ::InterlockedPopEntrySList
#define interlocked_push_entry_slist ::InterlockedPushEntrySList

#endif
Test results are summarized in Figure 8. The "contention" counter is expressed as a percentage: counter = (g_contention_count / total_number_of_iterations) * 100%.
The counter is higher (~20%) for the pools with the "per thread" distribution as compared with the "per processor" distributed pools.
Figure 8. "Contention" counter for a single Pool_t and distributed Array_of_pools_t objects when running 8 threads (4 writers and 4 readers)
on a Quad-CPU PC, Intel Q6700, 2.66 GHz, Hyper-Threading disabled, 64-bit Windows Vista Home Premium SP1, compiled in VC++ 2008 Express (x32-bit application).
Despite the higher values of the "contention" counter, the "per thread" distributed pools are not slower, but even slightly faster
than the pools with the "per processor" distribution (see Figure 5). If a distributed pool uses fast synchronization primitives
(such as the Windows SList API), thread contention does not appear to be the only factor affecting performance.
Moreover, the test results for the "per thread" distributed pools in Figure 9 show that increasing the number of pools beyond
4 does not further improve performance, even when the number of threads grows from 8 to 32.
Figure 9. Performance and "contention" counter for "per thread" distributed pools when running 8, 16, and 32 threads
on a Quad-core PC: Intel Q6700, 2.66 GHz, Hyper-Threading disabled, 64-bit Windows Vista Home Premium SP2, compiled with VC++ 2008 Express (32-bit application).
If the number of threads is 32 and the number of pools is 4, eight threads are concurrently accessing each pool in the array.
In this case, we might expect performance to improve with a further increase in the number of pools. But, as Figure 9 shows,
both the "contention" counter and the performance stay the same on a Quad-core PC once the number of pools reaches 4. In practice, this means
there is no point in creating a private memory pool for each application thread (as is sometimes suggested). Depending on the hardware,
creating a distributed pool with a size (number of pools) greater than the number of processors/cores might also be a waste of resources.
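A practical upper bound for the number of pools is therefore the processor count reported by the system. A sketch of how it could be queried (note that number_of_pools is a compile-time template parameter, so a run-time value can only serve as a cross-check or to select among prebuilt configurations):

#include <windows.h>

// Returns the number of processors/cores; a reasonable upper bound for
// number_of_pools (4 on the Q6700 used in these tests).
inline DWORD max_useful_pools()
{
    SYSTEM_INFO si;
    ::GetSystemInfo(&si);
    return si.dwNumberOfProcessors;
}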
Conclusion
- Using a simple array of pools may considerably improve the performance of 32-bit and 64-bit applications developed with the Microsoft
VC++ 9.0 compiler for Microsoft Windows Vista and running on a Multi-core PC.
- Creating more pools in an array than the number of processors/cores may be a disadvantage.
References
1. Charles Leiserson, "Multicore Storage Allocation", http://www.cilk.com/multicore-blog/bid/7904/Multicore-Storage-Allocation
2. Barry Tannenbaum, "Miser - A Dynamically Loadable Memory Allocator for Multi-Threaded Applications", http://www.cilk.com/multicore-blog/?Tag=memory%20management
3. "The Hoard Memory Allocator", http://www.hoard.org/
Acknowledgments
Many thanks to Dmitriy V'jukov and Serge Dukhopel. It was their idea to try "per processor" distributed pools on a Multi-core PC,
including the methods of thread distribution within pools. Dmitriy V'jukov also made valuable comments on the usage and limitations of the code in this article.