Imagine if memcache were single-threaded and UDP-centric. I explore this idea with a C++ server and a .NET client, showing how to do UDP programming and LRU cache development in C++, and how to do packet processing in .NET.
Introduction
In this article, I will show you how to do UDP network programming, how to write a Least-Recently-Used (LRU) cache, and how to do packet processing in .NET. These programming skills are applied to writing a network memory cache, like memcache. It is UDP-centric, uses an LRU cache, and is single-threaded. I then show how to access this cache from a .NET client.
UDP Programming
User Datagram Protocol (UDP) programming is a way of sending small pieces of information between computers over a network. Unlike Transmission Control Protocol (TCP), there is no guarantee that data is delivered, that it arrives in the order it was sent, or that it was not mangled along the way.
In UDP programming, there is a server and a client. The server binds to a port and listens for datagrams (network packets) that are sent by clients, and (usually) responds to the client with a datagram response of some sort. Clients connect to servers by address and the server's port and send requests to the server, and receive responses from the server. Unlike the full duplex persistent streams of TCP, UDP is more like playing catch with a baseball.
The most widely used UDP-based system is the Domain Name System (DNS), responsible for turning domains like michaelballoni.com into network addresses. In DNS, the client sends the domain name to the server, and the server sends the network address back to the client, using basic datagrams with none of the persistent connections that TCP uses. This keeps the protocol lightweight, without the processing or resource overhead of maintaining connections.
Besides being unreliable, the other big limitation of UDP is that only about 500 bytes can be sent to or received from a UDP server over the internet in a single datagram. If you can work within these restrictions, sending little things back and forth and tolerating or compensating for unreliable networking, then UDP is right for you.
UDP Server Implementation
In UDP programming, the server code typically creates a socket, binds to a port, and goes into a receive-request / process-request / send-response loop.
Create UDP Socket
my_socket = ::socket(AF_INET, SOCK_DGRAM, 0);
If my_socket came back < 0, you'd do some error handling.
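For example, here's a minimal sketch of that error handling; the bool return and the error reporting are assumptions that depend on your surrounding code:
// Sketch: bail out if socket creation failed
if (my_socket < 0)
{
    // on Windows, ::WSAGetLastError() has the details; on POSIX, check errno
    return false; // or throw, as fits your error handling scheme
}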
Bind UDP Socket to Port
sockaddr_in servaddr;
memset(&servaddr, 0, sizeof(servaddr));
servaddr.sin_family = AF_INET;
servaddr.sin_addr.s_addr = INADDR_ANY;
servaddr.sin_port = ::htons(my_port);
result = ::bind(my_socket, (const sockaddr*)&servaddr, sizeof(servaddr));
If result came back non-zero, you would do some error handling and clean up my_socket from above.
Receive Request Datagrams and Send Responses
Here is a basic request / response UDP server loop.
const size_t max_packet_size = 508;
while (true)
{
    sockaddr_in clientaddr;
    int addr_len = sizeof(clientaddr);

    uint8_t input_buffer[max_packet_size];
    int read_amount =
        ::recvfrom(my_socket, (char*)input_buffer,
                   (int)max_packet_size, 0, (sockaddr*)&clientaddr, &addr_len);
    if (read_amount <= 0)
        continue; // nothing read, nobody to respond to

    // ...process the request in input_buffer,
    //    writing the response into output_buffer...
    uint8_t output_buffer[max_packet_size];
    size_t output_len = 0;

    if (::sendto(my_socket, (const char*)output_buffer,
                 (int)output_len, 0, (sockaddr*)&clientaddr, addr_len) < 0)
    {
        // handle the error, perhaps just logging it and carrying on
    }
}
UDP Client Programming
UDP client programming is much like UDP server programming, just with the request sent and the response received, and a bit simpler. The style I use here is modeled on TCP's connect / send / recv functions, as this provides the useful illusion of a persistent connection.
Create UDP Socket, Just Like the Server
my_socket = ::socket(AF_INET, SOCK_DGRAM, 0);
If my_socket came back < 0, you'd do some error handling.
Connect to the UDP Server
sockaddr_in servaddr;
memset(&servaddr, 0, sizeof(servaddr));
servaddr.sin_family = AF_INET;
servaddr.sin_port = ::htons(port);
servaddr.sin_addr.s_addr = ::inet_addr(server);
result = ::connect(my_socket, (sockaddr*)&servaddr, sizeof(servaddr));
If result != 0, you'd handle the error and close my_socket.
Interact with the UDP Server
uint8_t request_buffer[max_packet_size];
size_t request_len = 0;
// ...fill request_buffer and set request_len...
result = ::send(my_socket, (const char*)request_buffer, (int)request_len, 0);

uint8_t response_buffer[max_packet_size];
int response_len = ::recv(my_socket, (char*)response_buffer, (int)max_packet_size, 0);
And that's it for UDP programming, client and server. Well... you should really set a timeout on that recv call. As written, if no response were delivered to the client for whatever reason, be it server error or network error, the client application would hang. Oh, and you'd want to validate the response in case it got garbled on the network. Packet loss, packet damage... handle those, and that should do it.
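Here is a minimal sketch of setting such a receive timeout. This assumes a Windows / Winsock socket, where the SO_RCVTIMEO option value is a DWORD of milliseconds; on POSIX systems, the option takes a struct timeval instead:
// Make recv() fail after 5 seconds instead of hanging forever
DWORD timeout_ms = 5000;
if (::setsockopt(my_socket, SOL_SOCKET, SO_RCVTIMEO,
                 (const char*)&timeout_ms, sizeof(timeout_ms)) != 0)
{
    // handle the error and close my_socket
}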
Network Cache Implementation
The attached source code contains a "packetcache" solution. The mission of this solution is two-fold: first, develop a cache server the likes of memcache; second, develop cache clients to prove out performance differences between packetcache and memcache.
To this end, there is a C# class library and two C# applications: a smoke test POC packetapp client, and a performance benchmark packetstress application. There is also C++ code for a static library with all the guts of the C++ UDP server, and a C++ POC UDP client used only for development.
You can dig into all that C# and C++ code and draw your own conclusions on the performance profiles of packetcache vs. memcache. Here, I want to lay out what an LRU cache can look like in C++.
LRU Cache Overview
A cache maps keys to values. Values for keys are added to the cache as needed, then accessed "for free" by other users of the cache. A Least-Recently-Used (LRU) cache maintains the most-recently used cache entries, trimming, you guessed it, the least recently used cache entries, whenever the cache takes up too much memory.
The LRU cache I created is composed of two data structures: first, a list of cache entries ordered most to least recently used; second, a hash table mapping keys to cache entries. The trick is that the cache entry objects are chained together to form the list, while the hash table holds pointers to those same cache entries.
LRU Cache Class
The cache class contains the linked list (cache_list) and a hash table (std::unordered_map). cache_list has head and tail pointers to objects of type cache_entry. The hash table maps std::vector<uint8_t> to cache_entry*. Having a std::vector<uint8_t> hash table key is really no different than having a std::string key. I did not want to deal with character encodings, or even assume that clients want to cache by string. Keys can be any binary value of any length, taking UDP packet size into account.
Here's the trick for allowing std::vector<uint8_t> hash table keys:
#include "MurmurHash2.h" // https://github.com/aappleby/smhasher
struct byte_array_hasher
{
std::size_t operator()(const std::vector<uint8_t>& v) const noexcept
{
return MurmurHash2(v.data(), int(v.size()), 0);
}
};
Then you can declare the hash table in the cache class like so:
std::unordered_map<std::vector<uint8_t>, cache_entry*, byte_array_hasher> m_cache;
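To see how the hash table and the list cooperate, here is a minimal sketch (not the actual packetcache code) of what the cache class's get operation could look like; the m_list member name and the pointer-return convention are assumptions for illustration, and expiration checking is omitted:
// Look the key up in the hash table; on a hit, let the list promote
// the entry to most-recently-used and hand back a pointer to its data.
const std::vector<uint8_t>* get(const std::vector<uint8_t>& key)
{
    auto it = m_cache.find(key);
    if (it == m_cache.end())
        return nullptr; // miss

    return &m_list.get(it->second); // cache_list::get moves it to the front
}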
Cache Entry Class
Each cache entry is an object of class cache_entry:
time_t expiration;
time_t last_access;
cache_entry* prev;
cache_entry* next;
std::vector<uint8_t> data;
The cache entry holds the data value, the expiration, and the last-access timestamp, and also serves as a linked list node with previous and next pointers.
The cache entry constructor elevates itself to the head of the list:
cache_entry(time_t _expiration, const std::vector<uint8_t>& _data, cache_entry*& head)
: expiration(_expiration)
, last_access(::time(nullptr))
, prev(nullptr)
, next(nullptr)
, data(_data)
{
    if (head != nullptr)
    {
        head->prev = this;
        next = head;
        prev = nullptr;
    }
    head = this;
}
Cache List Class
The cache_list class manages the list of cache_entry objects. It only has head, tail, and size member variables, but it runs the show with this API:
cache_entry* cache_list::add(time_t expiration, const std::vector<uint8_t>& data)
Create a new cache entry, making it the new head, and possibly tail, of the list.
cache_entry* new_entry = new cache_entry(expiration, data, m_head);
if (m_tail == nullptr)
    m_tail = new_entry;
m_size += new_entry->size();
return new_entry;
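The cache_entry::size() function used here is not shown in the article; a plausible sketch, assuming it just accounts for the object overhead plus the bytes of data it holds:
// Assumed sketch: roughly how many bytes this entry contributes to the cache
size_t size() const
{
    return sizeof(cache_entry) + data.size();
}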
const std::vector<uint8_t>& cache_list::get(cache_entry* entry)
When a cache entry is hit, move it to the front of the list and return its data.
entry->last_access = ::time(nullptr);

// already at the front? nothing to do
if (m_head == entry)
    return entry->data;

// splice the entry out of its current position...
auto prev_neighbor = entry->prev;
auto next_neighbor = entry->next;
if (prev_neighbor != nullptr)
    prev_neighbor->next = next_neighbor;
if (next_neighbor != nullptr)
    next_neighbor->prev = prev_neighbor;

// ...and put it at the head of the list
if (m_head != nullptr)
{
    m_head->prev = entry;
    entry->next = m_head;
    entry->prev = nullptr;
}
m_head = entry;

// fix up the tail if the entry or one of its old neighbors was the tail
if (next_neighbor != nullptr && next_neighbor->next == nullptr)
    m_tail = next_neighbor;
else if (prev_neighbor != nullptr && prev_neighbor->next == nullptr)
    m_tail = prev_neighbor;

return entry->data;
void cache_list::set(cache_entry* entry, time_t expiration, const std::vector<uint8_t>& data)
When a cache entry gets a new value, update the entry and update the size of the cache.
NOTE: I don't move the entry to the front of the list here, as it is not recently used yet.
entry->last_access = ::time(nullptr);
size_t old_size = entry->data.size();

entry->expiration = expiration;
entry->data = data;

// adjust the overall cache size, guarding against underflow
int64_t size_change = int64_t(data.size()) - old_size;
if (-size_change > int64_t(m_size))
    m_size = 0;
else
    m_size += size_change;
void cache_list::remove(cache_entry* old_entry)
Remove an entry from the cache.
if (m_head == old_entry)
    m_head = old_entry->next;
if (m_tail == old_entry)
    m_tail = old_entry->prev;

if (old_entry->prev != nullptr)
    old_entry->prev->next = old_entry->next;
if (old_entry->next != nullptr)
    old_entry->next->prev = old_entry->prev;

if (old_entry->size() > m_size)
    m_size = 0;
else
    m_size -= old_entry->size();

delete old_entry;
void cache_list::trim(size_t target_size)
Remove entries until the cache size dips below a target size.
while (m_size > target_size && m_tail != nullptr)
    remove(m_tail);
Over on the other side... C# Packet Processing
On the C# side of things, we want to implement a memcache-like interface like this:
public interface ICache
{
    Task<string?> GetAsync(string key);
    Task<bool> SetAsync(string key, string value, int cacheSeconds);
    Task<bool> DelAsync(string key);
}
To this end, we write a central class for working with request and response data, Packet. Packet has the following member variables:
public CacheOp Op;
public long TtlSeconds;
public MemoryStream Key = new MemoryStream(MaxPacketSize);
public MemoryStream Value = new MemoryStream(MaxPacketSize);
The core purpose of the Packet class is to pack Packet members into a buffer, and to parse Packet members from a buffer. One key to good .NET performance is to avoid memory allocations, so the C# client code that processes network data maintains one set of Packet objects and buffers for the long life of the object.
Here is the Pack function for taking a Packet object and writing it into an output buffer, reporting the amount written through an out parameter:
public static void Pack(Packet p, byte[] output, out int idx)
{
    // packet layout: 1-byte op + 4-byte TTL + 2-byte key length + key bytes
    //                + 2-byte value length + value bytes + 4-byte CRC
    if
    (
        1 +
        4 +
        2 + p.Key.Length +
        2 + p.Value.Length +
        4
        > MaxPacketSize
    )
    {
        throw new PacketException("Too much data");
    }

    idx = 0;

    output[idx++] = (byte)p.Op;

    // TTL in network byte order
    var ttl_bytes =
        BitConverter.GetBytes(IPAddress.HostToNetworkOrder((int)p.TtlSeconds));
    output[idx++] = ttl_bytes[0];
    output[idx++] = ttl_bytes[1];
    output[idx++] = ttl_bytes[2];
    output[idx++] = ttl_bytes[3];

    // key length, then the key bytes
    short key_len = (short)p.Key.Length;
    var key_len_bytes = BitConverter.GetBytes(IPAddress.HostToNetworkOrder(key_len));
    output[idx++] = key_len_bytes[0];
    output[idx++] = key_len_bytes[1];
    Buffer.BlockCopy(p.Key.GetBuffer(), 0, output, idx, key_len);
    idx += key_len;

    // value length, then the value bytes
    short value_len = (short)p.Value.Length;
    var value_len_bytes =
        BitConverter.GetBytes(IPAddress.HostToNetworkOrder(value_len));
    output[idx++] = value_len_bytes[0];
    output[idx++] = value_len_bytes[1];
    Buffer.BlockCopy(p.Value.GetBuffer(), 0, output, idx, value_len);
    idx += value_len;

    // CRC of everything packed so far, written in reverse byte order
    var crc_bytes = Utils.Crc32(output, idx);
    output[idx++] = crc_bytes[3];
    output[idx++] = crc_bytes[2];
    output[idx++] = crc_bytes[1];
    output[idx++] = crc_bytes[0];
}
The Parse function takes an input buffer and populates the members of a Packet:
public static void Parse(byte[] input, Packet p)
{
    p.Reset();

    if (input.Length < MinPacketSize)
        throw new PacketException("Not enough data");
    else if (input.Length > MaxPacketSize)
        throw new PacketException("Too much data");
    short input_len = (short)input.Length;

    // validate the trailing CRC before trusting anything else
    byte[] crc_computed_bytes = Utils.Crc32(input, input_len - 4);
    if
    (
        crc_computed_bytes[0] != input[input_len - 1]
        ||
        crc_computed_bytes[1] != input[input_len - 2]
        ||
        crc_computed_bytes[2] != input[input_len - 3]
        ||
        crc_computed_bytes[3] != input[input_len - 4]
    )
    {
        throw new PacketException("Checksum mismatch");
    }

    int idx = 0;

    p.Op = (CacheOp)input[idx++];

    p.TtlSeconds = IPAddress.HostToNetworkOrder(BitConverter.ToInt32(input, idx));
    idx += 4;

    // key length, then the key bytes
    short key_len = IPAddress.HostToNetworkOrder(BitConverter.ToInt16(input, idx));
    idx += 2;
    if (key_len > 0)
    {
        if (idx + key_len >= input_len)
            throw new PacketException("Invalid key length");
        p.Key.Write(input, idx, key_len);
        p.Key.Seek(0, SeekOrigin.Begin);
        idx += key_len;
    }

    // value length, then the value bytes
    short value_len = IPAddress.HostToNetworkOrder(BitConverter.ToInt16(input, idx));
    idx += 2;
    if (value_len > 0)
    {
        if (idx + value_len >= input_len)
            throw new PacketException("Invalid value length");
        p.Value.Write(input, idx, value_len);
        p.Value.Seek(0, SeekOrigin.Begin);
        idx += value_len;
    }

    // all that should remain is the 4-byte CRC
    if (idx != input_len - 4)
        throw new PacketException("Invalid packet format");
}
With the Parse and Pack functions doing the heavy lifting, the packet processing code's core "do a cache operation" function is simply:
int request_len;
Packet.Pack(m_requestPacket, m_requestBuffer, out request_len);
if (await m_client.SendAsync(m_requestBuffer, request_len) != request_len)
    throw new PacketException("Not all data sent");

byte[] response = (await m_client.ReceiveAsync()).Buffer;
Packet.Parse(response, m_responsePacket);
Apples and Oranges
I was interested in stacking my little packetcache up against the proven memcache. I'm a Windows guy, so I looked for a memcache port / installer for Windows. I found this site and used it for my initial tests. StackOverflow has other advice at this link.
In my all-on-my-PC tests, packetcache did about 60K ops per second vs. memcache's 40K. 50% better, that's cool. But the memcache server seemed unstable as I simulated more clients with more client threads, and I wanted to bring things closer to apples vs. apples, so I went to AWS and fired up an ElastiCache memcache node and a basic Windows EC2 instance at the same price, loaded packetcache onto the EC2, and accessed both from a common Windows EC2 in the same region. The observed performance was near identical at 20K ops per second. I can live with that: packetcache on a weak Windows EC2 does at least as well at what it does as memcache does on not-Windows and dedicated hardware. There are still apples / oranges issues, such as that memcache can be configured to do UDP. My point is that this approach is not worlds and worlds worse than the champ.
Conclusion and Points of Interest
I've shown you UDP programming, LRU cache implementation, and .NET packet packing and parsing. I invite you to take the code, build it on Linux with dedicated hardware, and draw your own conclusions.
History
- 28th November, 2022: Initial version