Requirements
This article expects the reader to be familiar with C++, the Winsock 2.0 API,
MFC, and multithreading.
Windows NT/2000 or later: Requires Windows NT 3.5 or later
Windows 95/98/Me: Unsupported
Motivation
This article attempts to deal with the thorny issue of using Completion Ports
with Windows Sockets. It also addresses some concerns raised by readers of the
previous article. Portions of the code have been reengineered, so it's worth
downloading again if you haven't already done so.
I have recently been working on a project that required me to develop a
high-performance TCP/IP server, typically a server similar to a Web Server,
where a large number of clients can connect and exchange data.
The initial design of my server used one thread per TCP/IP client connection.
I initially thought this was a good solution, until I read an article on
high-load servers which suggested that the server could get into a state of
"Thread Thrashing" as the threads wake to service their client connections,
and that the operating system could run out of system resources.
Another problem: I was using WSAAsyncSelect for each client, and the problem
here is that Winsock is limited to 64 event handles per wait. Whoops. The
solution to both problems was to develop a server with I/O Completion Ports.
During my research into I/O Completion Ports, I found very few articles and
code samples covering real-world applications, especially ones demonstrating
writing data back to a client. This prompted me to write this article.
Design
Instead of creating one thread per client (1000 clients would mean 1000
threads), we create a pool of worker threads to service our I/O events. I will
discuss the worker threads in more detail later in the article.
To begin using completion ports we need to create a Completion Port. When
creating it, you specify the number of concurrent threads, that is, the maximum
number of threads the operating system allows to run at the same time while
processing completions for the port (not to be confused with the number of
worker threads). See the function prototype below.
HANDLE CreateIoCompletionPort(
    HANDLE    FileHandle,
    HANDLE    ExistingCompletionPort,
    ULONG_PTR CompletionKey,
    DWORD     NumberOfConcurrentThreads
);
Specifying zero for NumberOfConcurrentThreads allows as many concurrently
running threads as there are CPUs on the system. You can change this value to
experiment with performance, but for the purpose of this article and code we
will use the default value of zero.
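As a minimal sketch of the creation step, the snippet below creates a port that is not yet associated with any handle. The names EffectiveConcurrency and CreatePort are hypothetical and not part of the article's source; EffectiveConcurrency only models the documented meaning of zero so the idea can be checked on any platform, and the Windows call is guarded so the sketch still compiles elsewhere.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: models how the port interprets the
// NumberOfConcurrentThreads argument (zero means "one per CPU").
inline unsigned EffectiveConcurrency(unsigned requested, unsigned cpuCount) {
    assert(cpuCount > 0);
    return requested == 0 ? cpuCount : requested;
}

#ifdef _WIN32
// Hypothetical wrapper: creates a new completion port with no handle attached.
// Passing INVALID_HANDLE_VALUE and a NULL existing port asks for a new port.
// Returns NULL on failure (call GetLastError() for the reason).
inline HANDLE CreatePort(DWORD concurrency = 0) {
    return CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, concurrency);
}
#endif
```

On a four-CPU machine, EffectiveConcurrency(0, 4) gives 4, while an explicit request such as EffectiveConcurrency(2, 4) gives 2.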
Once the Completion Port has been created, the next step is to associate every
accepted socket with it. The call that does this is, again,
CreateIoCompletionPort. This is somewhat confusing, so it's probably better to
wrap it in a function such as AssociateSocketWithCompletionPort to do the job
for you. Here's what AssociateSocketWithCompletionPort looks like:
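BOOL CClientListener::AssociateSocketWithCompletionPort(SOCKET socket,
                                                        HANDLE hCompletionPort,
                                                        ULONG_PTR dwCompletionKey)
{
    // The completion key is pointer-sized (ULONG_PTR), so it can carry a
    // ClientContext pointer on both 32-bit and 64-bit Windows.
    HANDLE h = CreateIoCompletionPort((HANDLE) socket, hCompletionPort, dwCompletionKey, 0);
    // On success the handle of the existing port is returned.
    return h == hCompletionPort;
}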
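You'll notice that AssociateSocketWithCompletionPort requires a completion
key. The completion key is a pointer-sized value of your choosing that the port
hands back with every completion for that socket. Here it is essentially an
OVERLAPPED structure together with any other data you want to associate with
the completion port and socket. Examine the structure below: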
struct ClientContext
{
    OVERLAPPED   m_Overlapped;
    LastClientIO m_LastClientIo;
    SOCKET       m_Socket;
    CBuffer      m_ReadBuffer;
    CBuffer      m_WriteBuffer;
    WSABUF       m_wsaInBuffer;
    BYTE         m_byInBuffer[8192];
    WSABUF       m_wsaOutBuffer;
    HANDLE       m_hWriteComplete;
    LONG         m_nMsgIn;
    LONG         m_nMsgOut;
};
The reason a ClientContext is associated with a socket and completion port is
so that we can keep track of the socket when the I/O is dequeued in the worker
threads.
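The round trip from context to completion key and back can be sketched as follows. This is a portable illustration, not the article's code: KeyType, FakeOverlapped and MiniContext are stand-ins I have made up so the sketch compiles anywhere. On Windows you would use ULONG_PTR, OVERLAPPED and the real ClientContext instead.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Stand-ins (assumptions): same roles as ULONG_PTR and OVERLAPPED.
using KeyType = std::uintptr_t;
struct FakeOverlapped { std::uintptr_t internal[4]; };

// Simplified, hypothetical context with the OVERLAPPED-like member first.
struct MiniContext {
    FakeOverlapped overlapped;
    int            socketId;
    long           messagesIn;
};
static_assert(std::is_standard_layout<MiniContext>::value,
              "first-member cast below relies on standard layout");

// Pack the context pointer into the pointer-sized completion key.
inline KeyType ToCompletionKey(MiniContext* ctx) {
    return reinterpret_cast<KeyType>(ctx);
}

// Recover the context from the key handed back when an I/O is dequeued.
inline MiniContext* FromCompletionKey(KeyType key) {
    assert(key != 0);
    return reinterpret_cast<MiniContext*>(key);
}

// Recover the context from the per-I/O structure pointer. This is valid
// because `overlapped` is the first member of a standard-layout struct.
inline MiniContext* FromOverlapped(FakeOverlapped* ov) {
    return reinterpret_cast<MiniContext*>(ov);
}
```

Either route leads back to the same context, which is why a worker thread can find the socket and buffers for a completed I/O without any lookup table.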
Now that the socket has been associated with the Completion Port, we can
discuss the worker threads in detail.
We create the worker threads when we create the completion port. The worker
threads' handles are closed immediately after creation, as they are not needed.
The worker threads now wait on GetQueuedCompletionStatus. When an I/O request
has been serviced, its completion is queued to the Completion Port; the last
worker thread to call GetQueuedCompletionStatus is woken, and the I/O can be
processed. See GetQueuedCompletionStatus below. Notice that it returns a
completion key; with this we can keep track of our associated socket.
BOOL GetQueuedCompletionStatus(
    HANDLE       CompletionPort,
    LPDWORD      lpNumberOfBytes,
    PULONG_PTR   lpCompletionKey,
    LPOVERLAPPED *lpOverlapped,
    DWORD        dwMilliseconds
);
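To make the return values more concrete, here is a rough sketch of a worker loop built around that call. The Completion enum, Classify and WorkerThread are names I have invented for illustration and are not taken from the sample project. Classify is kept free of Windows types so it can be checked on any platform, and the loop itself is guarded so it is only compiled on Windows. Treating a NULL OVERLAPPED as a stop signal is a design assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <winsock2.h>
#include <windows.h>
#endif

// What a worker should do after one wakeup (hypothetical names).
enum class Completion { Stop, IoFailed, Closed, IoDone };

// ok            : return value of GetQueuedCompletionStatus
// hasOverlapped : whether *lpOverlapped came back non-NULL
// bytes         : value written to *lpNumberOfBytes
inline Completion Classify(bool ok, bool hasOverlapped, std::uint32_t bytes) {
    if (!hasOverlapped) return Completion::Stop;     // nothing dequeued, or a NULL packet was posted
    if (!ok)            return Completion::IoFailed; // a packet for a failed I/O was dequeued
    if (bytes == 0)     return Completion::Closed;   // for a receive: the peer closed the connection
    return Completion::IoDone;
}

#ifdef _WIN32
// Worker thread body; `param` is assumed to be the completion port handle.
DWORD WINAPI WorkerThread(LPVOID param) {
    HANDLE port = static_cast<HANDLE>(param);
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;       // the value given to CreateIoCompletionPort
        LPOVERLAPPED ov = NULL;
        BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE);
        Completion c = Classify(ok != FALSE, ov != NULL, bytes);
        if (c == Completion::Stop) break;
        assert(key != 0);        // this sketch assumes every socket got a non-zero key
        // IoFailed / Closed: release the client identified by `key`.
        // IoDone: process `bytes` bytes, then issue the next I/O request.
    }
    return 0;
}
#endif
```

With this design, a shutdown routine could wake each worker by calling PostQueuedCompletionStatus(port, 0, 0, NULL) once per thread.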
A rule of thumb is to use 2 worker threads per CPU on the system. This is a
heuristic value and is explained in detail by Jeffrey Richter in "Programming
Server-Side Applications for Microsoft Windows 2000". The source code sample
also includes a dynamic thread pooling algorithm (this is not implemented in
the example), and you can experiment with the following values (remember to
adjust NumberOfConcurrentThreads accordingly).
m_nThreadPoolMin
m_nThreadPoolMax
m_nCPULoThreshold
m_nCPUHiThreshold
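The sizing rule and the four values above could be combined along the following lines. This is only a sketch: the field names come from the list above, but the meaning I give them (grow the pool when CPU usage is above the high threshold, shrink it when usage is below the low threshold) is an assumption, and PoolLimits, InitialWorkerCount and AdjustWorkerCount are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Field names follow the article; their exact semantics here are assumed.
struct PoolLimits {
    int m_nThreadPoolMin;   // never keep fewer workers than this
    int m_nThreadPoolMax;   // never keep more workers than this
    int m_nCPULoThreshold;  // CPU % below which the pool may shrink
    int m_nCPUHiThreshold;  // CPU % above which the pool may grow
};

// Rule of thumb from the text: two workers per CPU, kept within the limits.
inline int InitialWorkerCount(int cpuCount, const PoolLimits& lim) {
    assert(cpuCount > 0 && lim.m_nThreadPoolMin <= lim.m_nThreadPoolMax);
    return std::max(lim.m_nThreadPoolMin,
                    std::min(2 * cpuCount, lim.m_nThreadPoolMax));
}

// One adjustment step: returns the new desired number of workers.
inline int AdjustWorkerCount(int current, int cpuPercent, const PoolLimits& lim) {
    if (cpuPercent > lim.m_nCPUHiThreshold && current < lim.m_nThreadPoolMax)
        return current + 1;  // busy: add a worker if allowed
    if (cpuPercent < lim.m_nCPULoThreshold && current > lim.m_nThreadPoolMin)
        return current - 1;  // idle: retire a worker if allowed
    return current;          // otherwise leave the pool alone
}
```

With limits of 2 to 8 workers, a dual-CPU machine starts with 4 workers, which matches the rule of thumb, and the pool never leaves the configured range no matter how the CPU load moves.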
Now that we have the process in place, it's time to show the Completion Port
architecture in diagram form below:
The worker threads must issue an I/O request, either by WSARecv or WSASend;
they then wait on GetQueuedCompletionStatus for the I/O to complete. Once the
I/O is completed, GetQueuedCompletionStatus returns and the data can be
processed.
So on a dual-processor box we could quite comfortably handle 2000+ clients
(depending on data throughput, workload, etc.) with only 4 threads.
In my IOCP_Server example I have a class, CListener, which accepts TCP/IP
clients and associates them with a Completion Port. CListener also holds a list
of ClientContexts (for stats/referencing).
I have created my own data protocol for incoming and outgoing data packets: a
4-byte (integer) header containing the size of the packet, followed by the
actual packet, e.g. 0500HELLO. This protocol is used to exchange data to and
from the client.
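Because TCP delivers a byte stream, a single read may contain part of a packet or several packets, so the receiver has to reassemble them using the header. The sketch below shows one way to do that. It assumes the header holds the payload size only and is stored little-endian; neither detail is confirmed by the text above, and EncodePacket and TryDecodePacket are hypothetical names rather than functions from the project.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Prefix a payload with its length as a 4-byte little-endian integer
// (byte order and "payload size only" are assumptions of this sketch).
inline std::vector<std::uint8_t> EncodePacket(const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu);
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::vector<std::uint8_t> out;
    out.reserve(4 + payload.size());
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<std::uint8_t>((n >> (8 * i)) & 0xFF));
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Try to take one complete packet from the front of `buffer`.
// Returns true and removes the packet if it is complete; returns false
// and leaves the buffer untouched if more bytes are still needed.
inline bool TryDecodePacket(std::vector<std::uint8_t>& buffer, std::string& payload) {
    if (buffer.size() < 4) return false;            // header incomplete
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
        n |= static_cast<std::uint32_t>(buffer[i]) << (8 * i);
    if (buffer.size() - 4 < n) return false;        // body incomplete
    payload.assign(buffer.begin() + 4, buffer.begin() + 4 + n);
    buffer.erase(buffer.begin(), buffer.begin() + 4 + n);
    return true;
}
```

A read handler would append each chunk of received bytes to the buffer and call TryDecodePacket in a loop until it returns false.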
The Project
Included in the project for completeness are a CBuffer class to hold incoming
and outgoing data, and a CCpuUsage class used for thread pool
allocation/deallocation.
Our code includes a map to route the requests to function handlers, see below:
BEGIN_IO_MSG_MAP()
    IO_MESSAGE_HANDLER(ClientIoInitializing, OnClientInitializing)
    IO_MESSAGE_HANDLER(ClientIoRead, OnClientReading)
    IO_MESSAGE_HANDLER(ClientIoWrite, OnClientWriting)
END_IO_MSG_MAP()

bool OnClientInitializing (ClientContext* pContext, DWORD dwSize = 0);
bool OnClientReading (ClientContext* pContext, DWORD dwSize = 0);
bool OnClientWriting (ClientContext* pContext, DWORD dwSize = 0);
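For readers unfamiliar with message maps, the idea behind the macros can be sketched in plain C++ as a table that pairs each I/O type with a member function. This is not the expansion of the actual macros, which is not shown here. MiniServer, ProcessIOMessage and the trimmed ClientContext are hypothetical, and treating the three ClientIo names as values of the LastClientIO enumeration is an assumption.

```cpp
#include <cassert>

// I/O types named in the map above (assumed to be enum values).
enum LastClientIO { ClientIoInitializing, ClientIoRead, ClientIoWrite };

// Trimmed stand-in for the article's ClientContext.
struct ClientContext {
    long m_nMsgIn = 0;
    long m_nMsgOut = 0;
};

class MiniServer {
public:
    // Route a completed I/O to the handler registered for its type.
    bool ProcessIOMessage(LastClientIO op, ClientContext* ctx, unsigned long size) {
        assert(ctx != nullptr);
        typedef bool (MiniServer::*Handler)(ClientContext*, unsigned long);
        struct Entry { LastClientIO op; Handler fn; };
        static const Entry map[] = {
            { ClientIoInitializing, &MiniServer::OnClientInitializing },
            { ClientIoRead,         &MiniServer::OnClientReading },
            { ClientIoWrite,        &MiniServer::OnClientWriting },
        };
        for (const Entry& e : map)
            if (e.op == op) return (this->*e.fn)(ctx, size);
        return false;  // no handler registered for this I/O type
    }

private:
    bool OnClientInitializing(ClientContext*, unsigned long) { return true; }
    bool OnClientReading(ClientContext* c, unsigned long) { ++c->m_nMsgIn; return true; }
    bool OnClientWriting(ClientContext* c, unsigned long) { ++c->m_nMsgOut; return true; }
};
```

A worker thread would call ProcessIOMessage with the I/O type stored in the context once a completion has been dequeued, so adding a new I/O type only means adding one table entry and one handler.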
Example Project
Well, the best thing to do is fire up the examples and play with them. There
are plenty of comments throughout the code.
For example, set the client up so that it sends 99999 "Test Item " messages;
it takes around 3 seconds and the CPU usage hardly flinches. Wow.
The example MFC project contains the Server code which displays clients
accepting/connecting and any incoming data read from the IO Port. It also allows
data to be sent to a specified connected client.
Also included is an MFC client, which sends and receives data and has a flood
option for sending the same string repeatedly.
This should be a good jumpstart for anybody wanting to create a High
performance Client/Server application for Windows NT/2000.
The server listens on port 999; please change this in the client and server
programs if it conflicts with your system.
If you have any corrections, enhancements or suggestions, please don't
hesitate to contact me.
Credits
Firstly, I'd like to thank Ulf Hedlund for taking the time to fix some of the
subtle problems with the code, and I'd also like to thank the many other
readers who have sent in comments and suggestions.