Introduction
This article is about string formatting. The standard way to do it in C is the old sprintf function, which has various flaws and is showing its age. C++ and the STL introduced iostreams and the << operator. While convenient for simple tasks, their formatting features are clunky and underpowered. The .NET Framework, on the other hand, has the String class with its String.Format function, which is safer and easier to use than sprintf but can only be called from managed code. This article shows the main problems of sprintf and offers an alternative that can be used from native C++ code.
What are the Problems with sprintf?
sprintf is Prone to Buffer Overflows
There are different versions of sprintf that provide different degrees of buffer overflow protection. The basic flavor of sprintf provides none: it will happily write past the end of the given buffer and will probably crash the program. The _snprintf function will not write past the end of the buffer, but it also will not put a terminating zero at the end if there is no room for it. The program will not crash immediately but will most likely crash later, when the unterminated string is read. The newer sprintf_s function fixes the buffer overflow problems, but it is only available in Visual Studio 2005 and later.
String.Format allocates the output buffer itself from the managed heap and can make it as big as it needs to be.
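The guarantee that the plain sprintf and _snprintf lack (never write past the buffer, always leave a terminating zero) can be sketched with the standard vsnprintf, which has provided it since C99/C++11. BoundedFormat is a hypothetical name used only for this illustration; it is not part of the article's code, and it is still not type-safe.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>

// Hypothetical helper: formats into dst, never writes more than 'size'
// characters, and always leaves dst zero-terminated.
// Returns true if the whole result fit, false if it was truncated.
bool BoundedFormat( char *dst, int size, const char *format, ... )
{
    assert(dst && size>0 && format);
    va_list args;
    va_start(args,format);
    // A conforming vsnprintf writes at most size-1 characters plus the
    // terminator and returns the length the full result would have had.
    int needed=vsnprintf(dst,(size_t)size,format,args);
    va_end(args);
    if (needed<0) { dst[0]=0; return false; } // encoding error
    return needed<size;
}
```

A caller can use the return value to detect truncation instead of discovering it later as a crash.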
sprintf is Not Type-safe
The sprintf function uses the ellipsis syntax (...) to accept a variable number of arguments. The downside is that the function has no direct information about the types of the arguments and cannot validate them. It assumes that the argument count and types match the format string. This can lead to hard-to-spot bugs. For example:
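char buf[100];
std::string userName("user1");
int userData=0;
// BUG: the arguments are swapped - %d receives a pointer and %s an integer
sprintf(buf,"user %d, data %s",userName.c_str(),userData);
// BUG: std::string is passed directly instead of userName.c_str()
sprintf(buf,"user %s, data %d",userName,userData);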
In String.Format, the format specifiers are optional. A string argument is printed as a string and a numeric argument is printed as a number:
String.Format("user {0}, data {1}",userName,userData);
sprintf has Localization Problems
The sprintf function requires that the order of the arguments matches the order of the format specifiers exactly. Unfortunately, different languages have different word order, so the program must supply the arguments in a different order to accommodate each language. For example:
sprintf(buf,"The population of %s is %d people.","New York",20000000);
sprintf(buf,"%d people live in %s.",20000000,"New York");
String.Format wins in this case too. Its format items explicitly specify which argument to use and can do that in any order.
String.Format("The population of {0} is {1} people.","New York",20000000);
String.Format("{1} people live in {0}.","New York",20000000);
The FormatString Function
The FormatString function is a smart and type-safe alternative to sprintf that can be used by native C++ code. It is used like this:
FormatString(buffer, buffer_size_in_characters, format, arguments...);
The function has two versions: a char version and a wchar_t version.
The format string contains items similar to those of String.Format:
{index[,width][:format][@comment]}
index is the zero-based index of the argument in the argument list. If the index is past the last argument, FormatString will assert.
width is the optional width of the result. If width is less than zero, the result is left-aligned. The width can also be given as '*<index>', where <index> is the index of another argument in the list that provides the width value.
format is the optional format of the result. The available formats depend on the argument type. If the format is not supported for the given argument, FormatString will assert.
comment is ignored. It can be a hint that describes the meaning of the argument, or it can provide examples to aid the localization of the format string.
The result of FormatString always fits in the provided buffer and is always zero-terminated. Special cases, such as the buffer ending in the middle of a double-byte character or in the middle of a surrogate pair, are also handled.
Since the { and } characters are used to define format items, they need to be escaped in the format string as {{ and }}.
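To make the item syntax concrete, here is a minimal sketch of how one {index[,width][:format][@comment]} item could be split into its parts. FormatItem and ParseFormatItem are hypothetical names, not the article's implementation, and the '*<index>' width form is left out for brevity.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical holder for the parts of one format item.
struct FormatItem
{
    int index;          // zero-based argument index
    int width;          // 0 if no width was given; negative means left-aligned
    std::string format; // text after ':' (empty if none)
};

// Parses one item. 'p' points at the opening '{'. On success fills 'item'
// and returns a pointer just past the closing '}'; returns NULL if the
// item is malformed. The '*<index>' width form is not handled in this sketch.
const char *ParseFormatItem( const char *p, FormatItem &item )
{
    assert(p && *p=='{');
    p++;
    if (*p<'0' || *p>'9') return NULL;      // the index is mandatory
    char *end;
    item.index=(int)strtol(p,&end,10);
    p=end;
    item.width=0;
    item.format.clear();
    if (*p==',')
    {
        item.width=(int)strtol(p+1,&end,10);
        if (end==p+1) return NULL;          // ',' with no number after it
        p=end;
    }
    if (*p==':')
    {
        p++;
        while (*p && *p!='@' && *p!='}') item.format+=*p++;
    }
    if (*p=='@')
        while (*p && *p!='}') p++;          // the comment is ignored
    return (*p=='}')?p+1:NULL;
}
```

A full formatter would call something like this whenever it meets a single '{' and would copy '{{' and '}}' through as literal braces.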
Available Formats
For 8, 16, 32 and 64 bit integers, including 32 and 64 bit pointers
- c - a character. It is an ANSI or UNICODE character depending on the type of the format string
- d[+][0] - a signed integer. '+' will force the + sign for positive values. '0' will add leading zeros
- u[0] - unsigned integer. '0' will add leading zeros
- x[0] - lower case hex integer. '0' will add leading zeros
- X[0] - upper case hex integer. '0' will add leading zeros
- n - localized integer number (uses GetNumberFormat but with no fractional digits)
- f - localized file size (uses StrFormatByteSize)
- k - localized file size in KB (uses StrFormatKBSize)
- t[<number>] - localized time interval in ms (uses StrFromTimeInterval with optional number of significant digits between 1 and 6)
The default format for signed integers is 'd' and for unsigned integers is 'u'.
For floats and doubles
- f[<number>] - fixed point (with optional number of fractional digits)
- f*<index> - fixed point. <index> is an index of another argument that provides the number of fractional digits
- e or E - exponential format. Supports the number of fractional digits in the same way as the 'f' format
- g or G - chooses between 'f' and 'e'/'E', whichever is shorter. The same rules apply for the fractional digits
- $ - localized currency (uses GetCurrencyFormat)
- n[<number>] or n*<index> - localized number (uses GetNumberFormat with optional number of fractional digits)
The default format for floats or doubles is 'f'.
For ANSI strings, including std::string
The char version of FormatString doesn't support any formats for ANSI strings. The wchar_t version supports:
- <number> - a code page to be used when converting the ANSI string to UNICODE
- *<index> - index of another argument that provides the code page
If a code page is not given, the default (CP_ACP) is used.
For UNICODE strings, including std::wstring
The wchar_t version of FormatString doesn't support any formats for UNICODE strings. The char version supports:
- <number> - a code page to be used when converting the UNICODE string to ANSI
- *<index> - index of another argument that provides the code page
If a code page is not given, the default (CP_ACP) is used.
For SYSTEMTIME (Passed as const SYSTEMTIME &)
- d[l/f][format] - short date format (uses GetDateFormat). 'l' converts the time from UTC to local. 'f' is the same as 'l' but uses the file system rules *. format is an optional format string passed to GetDateFormat
- D[l/f][format] - long date format
- t[l/f][format] - time format, no seconds (uses GetTimeFormat)
- T[l/f][format] - time format
* 'l' uses SystemTimeToTzSpecificLocalTime to convert from UTC to local time. 'f' uses FileTimeToLocalFileTime instead. The difference is that FileTimeToLocalFileTime uses the current daylight saving settings instead of the settings at the given date. This is incorrect but is more consistent with the way Windows displays local file times. If STR_USE_WIN32_TIME is not defined, the localtime function is used regardless of whether 'l' or 'f' is specified. localtime produces results consistent with the file system (and with FileTimeToLocalFileTime). The reason the file system behaves this way is explained in The Old New Thing: "Why Daylight Savings Time is nonintuitive".
The default format for SYSTEMTIME is 'd'.
Examples
char buf[100];
FormatString(buf,100,"{1} people live in {0}.","New York",20000000);
-> 20000000 people live in New York.
FormatString(buf,100,"{0}",-1);
-> -1
FormatString(buf,100,"{0}",(unsigned int)-1);
-> 4294967295
FormatString(buf,100,"{0}, 0x{0,8:X0}",1);
-> 1, 0x00000001
FormatString(buf,100,"{0}",L"test");
-> test
FormatString(buf,100,"{0:n}",12345678);
-> 12,345,678
FormatString(buf,100,"{0:t3}",12345678);
-> 3 hr, 25 min
FormatString(buf,100,"{0}",12345.678);
-> 12345.678000
FormatString(buf,100,"{0:n*1}",12345.678,2);
-> 12,345.68
SYSTEMTIME st;
GetSystemTime(&st);
FormatString(buf,100,"{0:dl} {0:tl}",st);
-> 11/25/2006 1:26 PM
FormatString(buf,100,"{0:ddddd',' MMM dd yy}",st);
-> Saturday, Nov 25 06
How It Works
The FormatString function has 10 optional arguments, arg1 through arg10, of type const CFormatArg &, like this:
class CFormatArg
{
public:
CFormatArg( void );
CFormatArg( char x );
CFormatArg( unsigned char x );
CFormatArg( short x );
CFormatArg( unsigned short x );
..........
enum
{
TYPE_NONE=0,
TYPE_INT=1,
TYPE_UINT=2,
.....
};
union
{
int i;
__int64 i64;
double d;
const char *s;
const wchar_t *ws;
const SYSTEMTIME *t;
};
int type;
static CFormatArg s_Null;
};
int FormatString( char *string, int len, const char *format,
const CFormatArg &arg1=CFormatArg::s_Null, ...,
const CFormatArg &arg10=CFormatArg::s_Null );
The CFormatArg class contains a constructor for each of the supported types. Each constructor sets the type member according to the type of its argument. When FormatString is called with an actual argument, a temporary CFormatArg object is created that stores the value and the type of that argument. Arguments that are not provided keep the default CFormatArg::s_Null, so FormatString can determine how many arguments were passed and has access to their types and values.
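The technique can be sketched in a few dozen lines. The code below is a deliberately reduced illustration, not the article's implementation: MiniArg and MiniFormat are hypothetical names, only three arguments and three types are supported, and items are limited to a single-digit index with no width or format.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Simplified stand-in for CFormatArg: a tagged union filled in by
// implicit converting constructors.
class MiniArg
{
public:
    enum { TYPE_NONE, TYPE_INT, TYPE_DOUBLE, TYPE_STRING };

    MiniArg( void ) : type(TYPE_NONE) { i=0; }
    MiniArg( int x ) : type(TYPE_INT) { i=x; }
    MiniArg( double x ) : type(TYPE_DOUBLE) { d=x; }
    MiniArg( const char *x ) : type(TYPE_STRING) { s=x; }

    union
    {
        int i;
        double d;
        const char *s;
    };
    int type;
    static MiniArg s_Null; // default value for arguments that are not passed
};

MiniArg MiniArg::s_Null;

// Replaces each "{N}" (single digit, no width or format) with argument N.
std::string MiniFormat( const char *format,
    const MiniArg &arg0=MiniArg::s_Null,
    const MiniArg &arg1=MiniArg::s_Null,
    const MiniArg &arg2=MiniArg::s_Null )
{
    const MiniArg *args[]={&arg0,&arg1,&arg2};
    int count=0; // arguments are filled left to right; stop at the first empty one
    while (count<3 && args[count]->type!=MiniArg::TYPE_NONE) count++;

    std::string result;
    for (const char *p=format;*p;p++)
    {
        if (p[0]=='{' && p[1]>='0' && p[1]<='9' && p[2]=='}')
        {
            int index=p[1]-'0';
            assert(index<count); // an index past the last argument is a bug
            const MiniArg &arg=*args[index];
            char num[64];
            switch (arg.type)
            {
                case MiniArg::TYPE_INT:
                    std::snprintf(num,sizeof(num),"%d",arg.i); result+=num; break;
                case MiniArg::TYPE_DOUBLE:
                    std::snprintf(num,sizeof(num),"%f",arg.d); result+=num; break;
                case MiniArg::TYPE_STRING:
                    result+=arg.s; break;
            }
            p+=2; // skip the rest of the item
        }
        else
            result+=*p;
    }
    return result;
}
```

Because each argument carries its own type tag, the formatter picks the right conversion itself, and the caller is free to reference the arguments in any order.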
Dynamically Allocated Strings
Often you don't want to use a buffer of a fixed size, but one that is dynamically allocated. Use the FormatStringAlloc function instead:
char *string=FormatStringAlloc(allocator, format, arguments);
The first parameter is an object with a virtual member function responsible for allocating and growing the string buffer:
class CFormatStringAllocator
{
public:
virtual bool Realloc( void *&ptr, int size );
static CFormatStringAllocator g_DefaultAllocator;
};
bool CFormatStringAllocator::Realloc( void *&ptr, int size )
{
void *res=realloc(ptr,size);
if (ptr && !res) free(ptr);
ptr=res;
return res!=NULL;
}
The Realloc member function must reallocate the buffer pointed to by ptr to the given size (in bytes) and set ptr to the new address. The allocator will be called approximately every 256 characters to enlarge the buffer. The first time, Realloc is called with ptr=NULL. If an error occurs, Realloc must free the memory pointed to by ptr and either return false or throw an exception. If Realloc returns false, FormatStringAlloc terminates and returns NULL.
The default allocator uses the realloc function from the C run-time heap. To free the returned string, you need to call free(string). You can write your own allocator that uses a different heap or some other means of allocating memory. See further below for one example.
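As an illustration of the Realloc contract, here is a sketch of a custom allocator that refuses to grow the buffer past a fixed limit. The base class is a stand-in that mirrors the interface shown above so the sketch compiles on its own (the virtual destructor is an addition); CLimitedAllocator is a hypothetical name.

```cpp
#include <cassert>
#include <cstdlib>

// Stand-in that mirrors the interface shown above.
class CFormatStringAllocator
{
public:
    virtual ~CFormatStringAllocator( void ) {}
    virtual bool Realloc( void *&ptr, int size );
};

bool CFormatStringAllocator::Realloc( void *&ptr, int size )
{
    void *res=realloc(ptr,size);
    if (ptr && !res) free(ptr);
    ptr=res;
    return res!=NULL;
}

// Hypothetical allocator that caps the size of the formatted string.
class CLimitedAllocator: public CFormatStringAllocator
{
public:
    explicit CLimitedAllocator( int maxBytes ) : m_MaxBytes(maxBytes), m_Calls(0) {}
    virtual bool Realloc( void *&ptr, int size )
    {
        m_Calls++;
        if (size>m_MaxBytes)
        {
            free(ptr);   // per the contract: release the buffer on failure
            ptr=NULL;
            return false;
        }
        return CFormatStringAllocator::Realloc(ptr,size);
    }
    int m_MaxBytes;
    int m_Calls;         // how many times the buffer was (re)allocated
};
```

Passed as the first parameter of FormatStringAlloc, such an allocator would presumably make the call return NULL once the result exceeds the limit. Strings it does return still come from the C run-time heap and are released with free.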
Output to Stream
Often you don't want to output the formatted string to a buffer, but to a file, a text console, the Visual Studio debug window, etc. Use the FormatStringOut function instead:
bool success=FormatStringOut(output, format, arguments);
The first parameter is an object with a virtual member function responsible for outputting portions of the result. There are separate classes for char and wchar_t:
class CFormatStringOutA
{
public:
virtual bool Output( const char *text, int len );
static CFormatStringOutA g_DefaultOut;
};
bool CFormatStringOutA::Output( const char *text, int len )
{
for (int i=0;i<len;i++)
if (putchar(text[i])==EOF) return false;
return true;
}
class CFormatStringOutW
{
public:
virtual bool Output( const wchar_t *text, int len );
static CFormatStringOutW g_DefaultOut;
};
bool CFormatStringOutW::Output( const wchar_t *text, int len )
{
for (int i=0;i<len;i++)
if (putwchar(text[i])==WEOF) return false;
return true;
}
The Output member function is called with each portion of the result. The len parameter is the number of characters. Note that the text is not guaranteed to be zero-terminated. Output must return false or throw an exception if there is an error. If Output returns false, FormatStringOut terminates and returns false.
The default implementations just use putchar/putwchar to send the text to the console. You can write your own output class for iostream, FILE*, a Win32 HANDLE, etc.
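A custom output class only needs to override Output. The sketch below collects the portions in a std::string. The base class is a stand-in that mirrors the interface shown above (made pure virtual and given a virtual destructor for the sake of a self-contained example), and CStringCollector is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Stand-in that mirrors the interface shown above.
class CFormatStringOutA
{
public:
    virtual ~CFormatStringOutA( void ) {}
    virtual bool Output( const char *text, int len )=0;
};

// Hypothetical output class that appends every portion to a std::string.
class CStringCollector: public CFormatStringOutA
{
public:
    virtual bool Output( const char *text, int len )
    {
        assert(text && len>=0);
        // Use the explicit length: the text may not be zero-terminated.
        m_Result.append(text,(size_t)len);
        return true;
    }
    std::string m_Result;
};
```

The same pattern applies to a FILE* (fwrite with len as the count) or a Win32 HANDLE (WriteFile). In every case the length must come from len, never from strlen.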
Additional Functionality
Support for FILETIME, time_t and OLE time
The CFormatTime class derives from CFormatArg and lets you pass other date/time types. You use it like this:
time_t t=time(NULL);
FormatString(buf, 100, "local time: {0:dl} {0:tl}", CFormatTime(t));
-> local time: 11/25/2006 1:26 PM
You can create your own classes that derive from CFormatArg to support more data types or add more formatting options.
Passing CFormatArg Argument List to Other Functions
FormatString.h defines 3 macros to be used with the argument list: FORMAT_STRING_ARGS_H, FORMAT_STRING_ARGS_CPP and FORMAT_STRING_ARGS_PASS. You can use them to create other functions that take a variable argument list and call FormatString. For example, let's create a MessageBox function that can format the message:
int MessageBox( HWND parent, UINT type, LPCTSTR caption,
LPCTSTR format, FORMAT_STRING_ARGS_H );
int MessageBox( HWND parent, UINT type, LPCTSTR caption,
LPCTSTR format, FORMAT_STRING_ARGS_CPP )
{
TCHAR *text=FormatStringAlloc(CFormatStringAllocator::g_DefaultAllocator,
format,
FORMAT_STRING_ARGS_PASS);
int res=MessageBox(parent,text,caption,type);
free(text);
return res;
}
Calling with No Variable Arguments
If FormatString and its siblings are called with no variable arguments, the format string is copied directly to the output. In the example above, you can call MessageBox(parent, type, caption, text) and the text will be displayed in the message box directly, without being parsed for any format items.
The CString Classes
The sample sources provide the simple string container classes CStringA and CStringW. The strings stored in them have a reference count in the 4 bytes directly preceding the first character. When such an object is copied, the string is not duplicated; only the reference count is incremented (so-called copy-on-write with reference counting). When the object is destroyed, the reference count is decremented, and if it reaches 0, the memory is freed. The reference count is modified with InterlockedIncrement and InterlockedDecrement to be thread-safe.
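The memory layout can be sketched portably as follows. This is not the code of CStringA: CRefString is a hypothetical, immutable class that only shows the shared buffer with the counter in front of the characters, and it uses std::atomic<int> in place of InterlockedIncrement and InterlockedDecrement so that it compiles outside Windows.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical minimal reference-counted string. The counter lives in the
// same heap block, directly before the first character.
class CRefString
{
public:
    explicit CRefString( const char *text )
    {
        assert(text);
        size_t len=strlen(text);
        void *block=malloc(sizeof(std::atomic<int>)+len+1);
        if (!block) throw std::bad_alloc();
        new (block) std::atomic<int>(1);             // counter starts at 1
        m_Text=(char*)block+sizeof(std::atomic<int>);
        memcpy(m_Text,text,len+1);
    }
    CRefString( const CRefString &other ) : m_Text(other.m_Text)
    {
        Count().fetch_add(1);                        // share, don't duplicate
    }
    CRefString &operator=( const CRefString &other )
    {
        if (m_Text!=other.m_Text)
        {
            other.Count().fetch_add(1);
            Release();
            m_Text=other.m_Text;
        }
        return *this;
    }
    ~CRefString( void ) { Release(); }

    const char *c_str( void ) const { return m_Text; }
    int RefCount( void ) const { return Count().load(); }

private:
    std::atomic<int> &Count( void ) const
    {
        return *(std::atomic<int>*)(m_Text-sizeof(std::atomic<int>));
    }
    void Release( void )
    {
        // fetch_sub returns the previous value; 1 means this was the last owner
        if (Count().fetch_sub(1)==1)
            free(m_Text-sizeof(std::atomic<int>));
    }
    char *m_Text;
};
```

Keeping the counter in the same block means a string needs a single allocation, which is why an allocator for such strings has to reserve a few extra bytes in front of the text.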
The CString type is set to CStringA in ANSI configurations and to CStringW in UNICODE configurations. This allows you to use the configuration-dependent CString, while still being able to mix the ANSI and UNICODE types as needed.
The CString classes have a Format member function that formats a string and assigns the result to the object. This is done by calling FormatStringAlloc with a special allocator that allocates 4 bytes more than requested to store the reference count. The CString classes also define a conversion operator to CFormatArg, so they can be used directly as arguments to FormatString:
CString s;
s.Format(_T("{0}"),"test");
FormatStringOut(CFormatStringOutA::g_DefaultOut,"s=\"{0}\"\n",s);
-> s="test"
The behavior of CString is very similar to that of the ATL/MFC strings. The classes are provided here merely to demonstrate the use of custom memory allocators for FormatStringAlloc and the use of the CFormatArg conversion operator. To use them in a real application, you may wish to add more functionality, like comparison operators, conversion operators/constructors between CStringA and CStringW, string manipulation functionality, etc. Or simply use the existing classes std::string or ATL::CString.
StringUtils.h
The source files contain a set of string utilities that can be used independently from FormatString. Most of them are wrappers for the system string functions. The functions come in pairs, one for ANSI and one for UNICODE, like this:
inline int Strlen( const char *str ) { return (int)strlen(str); }
inline int Strlen( const wchar_t *str ) { return (int)wcslen(str); }
int Strcpy( char *dst, int size, const char *src );
int Strcpy( wchar_t *dst, int size, const wchar_t *src );
The advantage of this approach over _tcslen and _tcscpy is that you can easily mix ANSI and UNICODE code and always use the same function name.
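To show how such a pair can share one body, here is a sketch of a bounded copy written once as a template and exposed through the two overloads. It is an assumption about how Strcpy might behave, not the article's implementation: the text does not say what the function returns, so the sketch returns the number of characters copied, and StrcpyT is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

inline int Strlen( const char *str ) { return (int)strlen(str); }
inline int Strlen( const wchar_t *str ) { return (int)wcslen(str); }

// Assumption: 'size' counts characters including the terminator, and the
// return value is the number of characters copied.
template<class T> int StrcpyT( T *dst, int size, const T *src )
{
    assert(dst && src && size>0);
    int len=Strlen(src);              // overload chosen by the character type
    if (len>size-1) len=size-1;       // truncate instead of overflowing
    for (int i=0;i<len;i++) dst[i]=src[i];
    dst[len]=0;                       // always zero-terminated
    return len;
}

inline int Strcpy( char *dst, int size, const char *src )
{
    return StrcpyT(dst,size,src);
}

inline int Strcpy( wchar_t *dst, int size, const wchar_t *src )
{
    return StrcpyT(dst,size,src);
}
```

Code that calls Strlen or Strcpy compiles unchanged for char and wchar_t buffers, including inside templates, which the _tcs macros cannot offer within a single build configuration.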
Other wrappers provide safe versions of strncpy, sprintf, strcat, etc. that don't write past the provided buffer and always leave the result zero-terminated. They all compile cleanly under VC 6.0, VS 2003 and VS 2005.
Output to STL Strings
These functions output the formatted result to an STL string:
std::string FormatStdString( const char *format, ... );
std::wstring FormatStdString( const wchar_t *format, ... );
void FormatStdString( std::string &string, const char *format, ... );
void FormatStdString( std::wstring &string, const wchar_t *format, ... );
Output to STL Streams
You can output a formatted string to STL streams like this:
stream << StdStreamOut(format, parameters) << ...;
The Source Code
To use the source code, just drop the .h and .cpp files into your project:
- StringUtils.h/StringUtils.cpp - a set of string helper functions. They can be used on their own.
- FormatString.h/FormatString.cpp - the string formatting functionality. Requires StringUtils
- CString.h/CString.cpp - the string container classes. Requires StringUtils and FormatString
Configuring the Source Code
StringUtils.h defines several macros that can be used to enable or disable parts of the functionality:
- STR_USE_WIN32_CONV - If this macro is defined, the code will use the Win32 functions WideCharToMultiByte and MultiByteToWideChar to convert between char and wchar_t strings. Otherwise, it will use wcstombs and mbstowcs. The advantage of the Win32 functions is that they support conversions between Unicode and different code pages, including UTF-8.
- STR_USE_WIN32_NLS - If this macro is defined, the FormatString functions will use the Win32 functionality for formatting numbers, dates and times. Otherwise, they will try to simulate that functionality to some extent.
- STR_USE_WIN32_TIME - If this macro is defined, the FormatString functions will support the time types time_t, SYSTEMTIME, FILETIME and DATE. Otherwise, only time_t will be supported.
- STR_USE_WIN32_DBCS - If this macro is defined, the code will use IsDBCSLeadByte to handle DBCS characters. Otherwise, isleadbyte will be used.
- STR_USE_STL - If this macro is defined, the FormatString functions will support std::string and std::wstring as input parameters. FormatStdString and StdStreamOut will also be defined; they output to std::string, std::wstring, std::ostream and std::wostream.
With these macros, you can selectively enable only the functionality that you need and that is supported by your compiler or platform.
History
- Nov, 2006 - First version
  - FormatString implementation for char and wchar_t
  - Support for numbers, strings and time formats
  - Formatting to fixed sized buffers, dynamically allocated buffers and output streams
- Dec, 2006 - Better portability and more functionality
  - Added configuration macros
  - Added support for STL strings and streams
  - Added support for different sizes of wchar_t
  - Added more robust handling of numeric formats thanks to Mihai Nita's suggestion
- Feb, 2007
  - Added conversion from UTC time to local time that is consistent with the file system (to be used with file times)