Preamble
This article is based on a micro-benchmark that has nothing to do with real-world applications! The code shown here only speeds up one small function (_getptd_noexit) in the CRT. The general improvements in the VC2005 compiler and CRT are very good, and I strongly recommend moving to VS2005 as soon as you can!
Introduction
In a micro-benchmark of a single CRT function that uses TLS (Thread Local Storage), VC2005 shows a performance decrease compared to VC2003. The implementation can be improved, which increases the performance of _getptd() by about 18 %.
This article shows the possible improvements that can be made inside the _getptd_noexit() function, which is called by _getptd(). To use these changes, you either need to link against the static CRT or recompile the whole CRT.
This function plays a major role in TLS (or FLS) access. TLS (or rather FLS, Fiber Local Storage) is used in many places inside the CRT, for example by all functions that need to store internal data for subsequent calls (like strtok) and by others that depend on locale settings. The CRT stores a pointer to an internal data structure in the FLS to keep all this information separate for each fiber/thread. So it is very important for this function to be very fast!
If you want to have this performance improvement in a service pack or future release of VC, then please vote for it!
Changes to the _getptd_noexit() function
After analyzing the call to _getptd_noexit(), I found that TlsGetValue is called twice without any purpose. This is the point that the following code changes improve.
Changes to tidtable.c
A few changes need to be made in the file tidtable.c in the CRT source directory (normally located in %ProgramFiles%\Microsoft Visual Studio 8\VC\crt\src). You need to replace lines 546-553 with the following code:
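#ifndef _M_AMD64
    PVOID flsGetValue;
    TL_LastError = GetLastError();
    /* Read the cached FlsGetValue pointer from TLS once */
    flsGetValue = FLS_GETVALUE;
    if (!flsGetValue)
    {
        /* Not cached yet for this thread: decode it and store it in TLS */
        flsGetValue = _decode_pointer(gpFlsGetValue);
        TlsSetValue(__getvalueindex, flsGetValue);
    }
    /* Call through the local copy instead of reading TLS a second time */
    if ((ptd = ((PFLS_GETVALUE_FUNCTION)flsGetValue)(__flsindex)) == NULL)
    {
#else
    TL_LastError = GetLastError();
    if ( (ptd = FLS_GETVALUE(__flsindex)) == NULL ) {
#endif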
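The idea behind the change can be illustrated without the CRT at all. The following is a minimal, portable sketch, not the CRT source: all names (read_fn_slot, get_data_two_reads, get_data_one_read, reads_used) are hypothetical, and the "slot" is a plain global that merely counts how often it is read. It compares a getter that reads the slot twice with one that keeps the value in a local variable.

```c
#include <stddef.h>

/* Hypothetical stand-in for a function pointer kept in a per-thread slot. */
typedef void *(*get_value_fn)(int index);

static get_value_fn g_fn_slot;    /* simulated slot, empty at start       */
static int          g_slot_reads; /* number of simulated slot reads       */
static int          g_data = 42;  /* the data the getter finally returns  */

static void *real_get_value(int index)
{
    (void)index;
    return &g_data;
}

/* Every read of the slot is counted, like one TlsGetValue-style call. */
static get_value_fn read_fn_slot(void)
{
    ++g_slot_reads;
    return g_fn_slot;
}

/* Reads the slot once to check/initialize it and a second time to call. */
void *get_data_two_reads(int index)
{
    if (read_fn_slot() == NULL)
        g_fn_slot = real_get_value;
    return read_fn_slot()(index);
}

/* Reads the slot once and reuses the local copy for the call. */
void *get_data_one_read(int index)
{
    get_value_fn fn = read_fn_slot();
    if (fn == NULL)
    {
        fn = real_get_value;
        g_fn_slot = fn;
    }
    return fn(index);
}

/* Returns how many slot reads a single call of the given getter needs. */
int reads_used(void *(*getter)(int))
{
    int before = g_slot_reads;
    (void)getter(0);
    return g_slot_reads - before;
}
```

Calling reads_used(get_data_two_reads) and reads_used(get_data_one_read) shows the difference in slot reads per call; in the real CRT each of these reads corresponds to a TlsGetValue call.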
In general, you have two options for making this change, so that you can use it in your project:
- Rebuild the CRT
- Link against the static CRT and just recompile the tidtable.c file.
The recompilation of the CRT is explained in an article by Michael S. Kaplan and will not be discussed here. Here, I'll explain the (simple) steps to add the tidtable.c file to your project and use your modified version of it. To do this, you need to do the following:
- Create a Win32-console application.
- Change the project settings to use the static CRT (not the DLL version!).
- Remove "Precompiled-Headers" from your project.
- Change the linker settings to use no default libraries.
- Add libcmt.lib (Release) or libcmtd.lib (Debug) to the additional libs.
- Copy tidtable.c from the CRT-source directory into your project directory and add it to your project.
- Make the modification in tidtable.c as explained above.
- Add the following at the top of the tidtable.c file:
#define _CRTBLD
- If you have a UNICODE build, then you also need to replace "LoadLibrary" with "LoadLibraryA" and "GetModuleHandle" with "GetModuleHandleA" in the whole file.
- Right-click on the tidtable.c file and select Properties.
- Add an additional include path to this file: "$(DevEnvDir)\..\..\VC\crt\src".
- Make the above changes for both the Debug and Release project settings.
- Rebuild all.
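Because the modified tidtable.c only takes effect with the static CRT, it can help to verify that setting in code. The following is a small sketch under one assumption about the toolchain: the Microsoft compiler predefines _DLL when the DLL version of the CRT (/MD or /MDd) is selected. The function name uses_dll_crt is hypothetical.

```c
/* Returns 1 if this file was compiled for the DLL version of the CRT,
 * 0 otherwise. Only meaningful with the Microsoft compiler, which
 * predefines _DLL for /MD and /MDd; other compilers always yield 0. */
int uses_dll_crt(void)
{
#ifdef _DLL
    return 1;
#else
    return 0;
#endif
}
```

A project that depends on the modified file could check this value at startup, or turn the same #ifdef into an #error in one of its own source files so that a wrong setting fails the build.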
You can download the sample project where all these steps are done for you (except for steps 6-9, due to copyright restrictions).
I hope this (or a similar) modification finds a place in the next service pack of VS8...
Origin of this article
Starting from a German newsgroup-thread (Schleifenlaufzeit VS2003 vs. VS2005 (Mon, 28 Nov 2005 17:01:02 +0100) from Kai Huebner), I had to dig deeper to find the reason why the following (micro-benchmark) code was slow when compiled with VC2005 compared to VC2003:
double d = 0;
for (int i=0; i<5000000; i++)
    d += rand();
The complete analysis can be found here.
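To reproduce the comparison yourself, the loop can be wrapped in a small timing harness. This is a minimal sketch, not the code used for the original measurements: the function names sum_rand and time_sum_rand are hypothetical, and the time is simply whatever clock() reports.

```c
#include <stdlib.h>
#include <time.h>

/* Runs the loop from the article and returns the accumulated sum. */
double sum_rand(int iterations)
{
    double d = 0;
    int i;
    for (i = 0; i < iterations; i++)
        d += rand();
    return d;
}

/* Times one run of the loop with clock() and returns the elapsed seconds.
 * The sum is stored in *out so the compiler cannot drop the loop. */
double time_sum_rand(int iterations, double *out)
{
    clock_t start = clock();
    *out = sum_rand(iterations);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_sum_rand(5000000, &sum) from main and printing the result gives one number per compiler or CRT variant; run it several times, since a single run of a micro-benchmark is noisy.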
The origin of this solution
After finding out what was going "wrong" and why the implementation had changed, I tried to improve the code by reducing the number of calls to TlsGetValue and the number of instructions. My first version was improved (better "inlined") by "Ted". The resulting code is the subject of this article...
History
- 2005-12-01
- 2005-12-06
- Added a link to the lady-bug entry.