Introduction
The Win32 SDK provides a set of APIs (Application Programming Interfaces) to manage the creation, destruction, and synchronization of threads. Frequent creation and destruction of threads is time- and resource-consuming, and therefore hurts the performance of server applications. The Windows 2000 operating system was designed with the server market in mind, and to capture that market it introduced scalability features such as thread pooling and I/O Completion Ports. A thread pool is a pool of worker threads maintained by the system, freeing the server developer from the hassle of creating and destroying threads. The pool is accessed through the thread pooling APIs, which make thread creation, destruction, and general management easier.
Server applications make extensive use of thread pooling to delegate client requests to worker threads and thereby achieve better throughput. After serving a client request, a thread returns to the pool, ready to serve the next pending request. Throughput improves because less time is wasted creating and destroying threads; the application already has an idle pool of them. The number of threads in the pool can vary with the load on the server.
The purpose of this article is to cover a design that can be used to develop scalable network servers without sacrificing performance. A basic technique widely used in designing a scalable server application is an asynchronous mechanism through which the server's main thread defers/delegates a client's request to a pool of worker threads for further processing. On Windows 2000, the thread pool consists of four components, accessed through the thread pooling APIs. The four types of threads are as follows (a code sketch follows the list):
- Non-I/O threads: The purpose of these threads is to invoke the work items queued to them. These work items should not issue asynchronous I/O calls, because these threads do not wait for the completion of such requests; the completion notification of the I/O operation may therefore be lost.
- I/O threads: These threads are suitable for work items that issue asynchronous I/O requests. An I/O thread never dies while it has a pending I/O request, i.e., it can wait on the signal posted on completion of the request.
- Timer thread: This thread is responsible for invoking callback functions on the expiration of a specified time; the timer can also expire periodically. By default, such work items are queued to the non-I/O component's threads. The WT_EXECUTEINTIMERTHREAD flag causes the timer component's thread to enter an alertable wait, waiting for the waitable timer to queue an APC to it. Such a work item should execute quickly; otherwise it blocks the timer component's thread and degrades the performance of the whole timer component.
- Wait threads: These threads are responsible for invoking a queued work item when a kernel object gets signaled. Once the object becomes signaled, the work item is queued to the non-I/O component's threads by default. The WT_EXECUTEINWAITTHREAD flag ensures that the work item is executed on the wait thread itself.
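The following minimal sketch (an illustration under stated assumptions, not code from this article) shows how a work item is directed to each of the four components; the callback bodies are placeholders:

#include <windows.h>

// Placeholder work item; runs on a pool thread.
DWORD WINAPI WorkCallback(PVOID pvContext)
{
    return 0;
}

// Shared timer/wait callback; keep the body short when it executes
// on the timer or wait thread itself.
VOID CALLBACK TimerOrWaitCallback(PVOID pvContext, BOOLEAN fTimerOrWaitFired)
{
}

int main()
{
    // Non-I/O component (the default) and I/O component.
    QueueUserWorkItem(WorkCallback, NULL, WT_EXECUTEDEFAULT);
    QueueUserWorkItem(WorkCallback, NULL, WT_EXECUTEINIOTHREAD);

    // Timer component: a periodic callback executed on the timer thread itself.
    HANDLE hTimerQueue = CreateTimerQueue();
    HANDLE hTimer = NULL;
    CreateTimerQueueTimer(&hTimer, hTimerQueue, TimerOrWaitCallback, NULL,
                          1000, 1000, WT_EXECUTEINTIMERTHREAD);

    // Wait component: the callback runs on the wait thread when the event signals.
    HANDLE hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
    HANDLE hWait = NULL;
    RegisterWaitForSingleObject(&hWait, hEvent, TimerOrWaitCallback, NULL,
                                INFINITE, WT_EXECUTEINWAITTHREAD);

    SetEvent(hEvent);
    Sleep(3000); // let the pool threads run before the process exits
    return 0;
}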
The Devil in the Thread Pooling Environment: TLS
A developer has to be careful when using thread local storage (TLS) in a thread pooling environment. The purpose of TLS is to provide a private storage space for each thread in a multi-threaded environment. Consider an example where a server process's main execution thread invokes the QueueUserWorkItem() function on receiving a client request. The context passed to the thread function is of the type CThreadContext shown below.
The server has a default handler, which is a callback function invoked by the pool's worker threads:
DWORD WINAPI defaultHandler(PVOID pvContext);
The purpose of this handler is to invoke a function exposed by a specified module for further processing, depending on the type of image the client requested. Each module provides a set of helper functions that perform different operations on the requested image.
The defaultHandler function checks szModuleName: if it is not empty, it loads the module into the process's address space and invokes the function named by szFunctionName for further processing.
struct CThreadContext
{
    char szModuleName[256];   // DLL to load, or empty if the handler lives in the EXE
    char szFunctionName[256]; // helper function to invoke from the module
};
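A minimal sketch of the handler described above (the helper function's signature and the casts are assumptions; error handling is omitted for brevity):

#include <windows.h>

typedef DWORD (WINAPI *PFN_HELPER)(PVOID); // assumed helper signature

DWORD WINAPI defaultHandler(PVOID pvContext)
{
    CThreadContext* pCtx = (CThreadContext*)pvContext;

    if (pCtx->szModuleName[0] == '\0')
    {
        // The request is served by a function in the executable itself;
        // no DLL loading is required.
    }
    else
    {
        // The first load triggers DllMain with DLL_PROCESS_ATTACH
        // on this calling thread.
        HMODULE hMod = LoadLibraryA(pCtx->szModuleName);
        if (hMod != NULL)
        {
            PFN_HELPER pfnHelper =
                (PFN_HELPER)GetProcAddress(hMod, pCtx->szFunctionName);
            if (pfnHelper != NULL)
                pfnHelper(pvContext);
        }
    }
    return 0;
}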
Consider a scenario where the first request comes to the server and the thread pool manager hands it to thread-1. In the defaultHandler() function, the callback for this request is exposed by the executable module itself, so there is no need to load a DLL; szModuleName is empty for this type of request. Now suppose a second request arrives while thread-1 is busy. It is given to thread-2, which must load a DLL because the handler function is exposed by the module named in szModuleName. Thread-2 calls LoadLibrary(), which causes the DLL's entry point to be invoked with DLL_PROCESS_ATTACH as an argument; that handler is responsible for allocating TLS for the calling thread. The callback function named by szFunctionName and exposed by the DLL then uses the TLS by invoking the TlsGetValue() function.
Meanwhile, if a third request comes to the server and thread-1 has already completed its work, the thread pool manager hands the request to thread-1. Because this thread existed before the DLL was loaded, there is no way to initialize TLS for it: the DLL entry point is never invoked on its behalf. Accessing TLS in the callback function then causes an access violation. These scenarios must be considered when working with TLS in a thread pooling environment; the table below summarizes them.
| Thread Execution | Handler Function | DLL Attach Delivered | Thread Attach Delivered | Action |
| --- | --- | --- | --- | --- |
| Thread-1 | defaultHandler | No | No | As defaultHandler is present in the executable module, no DLL loading is required. |
| Thread-2 | DLL::FunctionName | Yes | No | As the function is present in the DLL, the DLL is loaded by the thread, which causes the entry point to be invoked with DLL_PROCESS_ATTACH on the loading thread. Here we should call TlsSetValue() to initialize the TLS slot for that thread. |
| Thread-1 | DLL::FunctionName | No | No | As this thread existed before the DLL was loaded, neither DLL_PROCESS_ATTACH nor DLL_THREAD_ATTACH is delivered for it. This causes a crash when the function exposed by the DLL tries to access TLS via the TlsGetValue() function. |
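One common defensive pattern (a suggestion sketched here, not taken from the original text) is to have the DLL's callback initialize its TLS slot lazily instead of relying on attach notifications, so pool threads that existed before the DLL was loaded remain safe:

#include <windows.h>

// Assumed to be allocated once in DllMain on DLL_PROCESS_ATTACH.
static DWORD g_dwTlsIndex = TLS_OUT_OF_INDEXES;

DWORD WINAPI FunctionName(PVOID pvContext)
{
    PVOID pvData = TlsGetValue(g_dwTlsIndex);
    if (pvData == NULL)
    {
        // This pool thread existed before the DLL was loaded, so no
        // attach notification ever ran for it; initialize the slot now.
        pvData = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, 256);
        TlsSetValue(g_dwTlsIndex, pvData);
    }
    // ... use pvData safely here ...
    return 0;
}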
Scalable Server Based on IO Completion Port
Windows provides a combination of overlapped I/O and I/O Completion Ports for designing scalable servers. Threads consume resources because each has its own stack, so a one-thread-per-client design is not advisable for large servers. A scalable server should be able to handle many clients with a handful of threads that constitute a thread pool.
The select() API provides a way to deal with multiple I/O streams in a single thread, but it is still not an ideal basis for a scalable server. Overlapped I/O is a mechanism by which I/O operations are initiated asynchronously and the operating system notifies the application on their completion. A completion port is a queue maintained by the operating system onto which the system posts notifications of completed I/O operations. Worker threads polling the I/O Completion Port are responsible for processing the posted completion notifications. An I/O Completion Port is associated with the non-I/O threads by default. A sketch of such a worker loop follows.
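A minimal sketch of the worker loop (the port handle g_hIOCP and the NULL-packet shutdown convention are illustrative assumptions):

#include <windows.h>

// Created elsewhere via CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0).
HANDLE g_hIOCP;

DWORD WINAPI WorkerThread(PVOID /*pvContext*/)
{
    DWORD        dwBytes;
    ULONG_PTR    ulCompletionKey;
    LPOVERLAPPED pOverlapped;
    for (;;)
    {
        // Blocks until the system posts a completed I/O notification.
        BOOL bOk = GetQueuedCompletionStatus(g_hIOCP, &dwBytes,
                                             &ulCompletionKey,
                                             &pOverlapped, INFINITE);
        if (pOverlapped == NULL)
            break;    // NULL packet posted via PostQueuedCompletionStatus(): shutdown
        if (!bOk)
            continue; // the completed I/O operation itself failed
        // ulCompletionKey carries the per-handle data; pOverlapped
        // locates the per-operation data for this completion.
    }
    return 0;
}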
A server's main thread is responsible for creating a listening socket that accepts connections from clients. The listening socket is associated with a network event, a WSAEVENT, registered for the FD_ACCEPT notification. The event is created with WSACreateEvent() and associated with the socket via the WSAEventSelect() API provided by the Winsock library [Ws2_32.lib]. This is a way to synchronize a pending connection request on the listening socket with AcceptEx(), which accepts pending connections on a socket. AcceptEx() is invoked when the event associated with the listening socket gets signaled.
WSAEVENT wsaEvent = WSACreateEvent();
WSAEventSelect(g_ServerSocket, wsaEvent, FD_ACCEPT);
WSAWaitForMultipleEvents(1, &wsaEvent, TRUE, WSA_INFINITE, FALSE);
The thread function associated with a worker thread is responsible for calling AcceptEx(). The event object associated with FD_ACCEPT gets signaled when a connection request is pending, waiting to be accepted on the server socket. The AcceptEx() function gives better performance for scalable servers because the accept socket can be created before the connection occurs, speeding up connection establishment. AcceptEx() posts an accept request on the listening socket, and once the accept request completes, an I/O completion packet is posted to the port associated with the listening socket.
AcceptEx() makes use of overlapped I/O, which makes it suitable for scalable servers, as multiple clients can be served by a small pool of threads. On completion of an accept request, the I/O completion packet is dispatched to the port, where the pool threads process the packet and register the newly accepted socket with the I/O Completion Port. This design provides scalability at two levels: different I/O operations can be in flight on multiple handles associated with the IOCP, and multiple I/O operations can be in flight on a single handle. The processing threads must be able to distinguish these operations. Per-handle completion data is associated with the completion key when a file handle is registered with the IOCP, while per-operation data travels with the OVERLAPPED structure; both structures are shown below.
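For illustration (assuming a port handle g_hIOCP and a pSocketData pointer to the per-socket structure defined below), registering a newly accepted socket passes that structure as the completion key:

// Illustrative: associate the accepted socket with the completion port,
// using the per-socket data as the completion key.
CreateIoCompletionPort((HANDLE)pSocketData->acceptSocket, g_hIOCP,
                       (ULONG_PTR)pSocketData, 0);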
typedef struct __PER_SOCKET_DATA
{
    SOCKET listenSocket; // the listening socket the client connected through
    SOCKET acceptSocket; // the accepted client socket

    __PER_SOCKET_DATA()
    {
        listenSocket = INVALID_SOCKET;
        acceptSocket = INVALID_SOCKET;
    }
} PER_SOCKET_DATA, *PPER_SOCKET_DATA;

typedef struct __PER_IO_DATA
{
    WSAOVERLAPPED Overlapped;            // first member, so a completed LPOVERLAPPED
                                         // can be cast back to PER_IO_DATA
    char          Buffer[MAX_BUFF_SIZE];
    WSABUF        wsabuf;
    int           nTotalBytes;
    int           nSentBytes;
    IO_OPERATION  opCode;                // application-defined enum identifying the operation
    SOCKET        activeSocket;
} PER_IO_DATA, *PPER_IO_DATA;
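As a hedged sketch using the structures above (the function name PostAccept and its details are assumptions), posting an overlapped accept might look like this; the buffer receives only the local and remote addresses here, no connection data:

#include <winsock2.h>
#include <mswsock.h> // AcceptEx; link with mswsock.lib

void PostAccept(SOCKET listenSocket, PPER_IO_DATA pIoData)
{
    // Create the accept socket up front, as described above.
    pIoData->activeSocket = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
                                      NULL, 0, WSA_FLAG_OVERLAPPED);
    // pIoData->opCode would record here that this is an accept operation.

    ZeroMemory(&pIoData->Overlapped, sizeof(WSAOVERLAPPED));
    DWORD dwBytes = 0;
    // Zero data bytes requested: the buffer holds only the two addresses,
    // each needing sizeof(SOCKADDR_IN) + 16 bytes.
    AcceptEx(listenSocket, pIoData->activeSocket, pIoData->Buffer, 0,
             sizeof(SOCKADDR_IN) + 16, sizeof(SOCKADDR_IN) + 16,
             &dwBytes, (LPOVERLAPPED)&pIoData->Overlapped);
}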
This sample shows how multiple clients can be served by a small set of worker threads constituting a thread pool. This design is scalable compared to one thread per client, where scalability is inherently restricted.