Introduction
The EsString class described below is an attempt (one of many that people have made) to create a lightweight string class, independent of other libraries, that also provides clone-on-write: several string objects may refer to the same actual string until one of them needs to change its contents.
In brief, what this string class is:
- copy-on-write string with chunk memory allocation (compiled conditionally).
- doesn't depend on ATL, MFC, STL.
- internally relies solely on CRT calls and, if Windows support is enabled, on the Win API.
- may be used (with some restrictions) as a buffer of items rather than a zero-terminated string.
- provides conversion from single char to wide char and vice versa (though the conversion is quite dumb).
- implements concatenation operators.
- string comparison, in form of operators as well as in functional form; case insensitive comparison included.
- string case manipulation: ToLower\ToUpper.
- substring extraction.
- left|right|whole string trimming by specified char pattern.
- string replacement.
- string search methods:
  - for a specified char, forward and reverse, starting from a specified position.
  - for a substring, forward and reverse, starting from a specified position.
  - for the first char that's in\not in a char pattern, forward and reverse, starting from a specified position.
- provides C-style formatting method.
- provides simple numerical-to-string conversion for double and int types.
- can represent an arbitrary binary buffer as a hex string.
- if used with ES_WINDOWS defined, it additionally lets you:
  - load strings from Windows resources
  - obtain textual information about system error codes
  - convert from BSTR strings
...and what it is not so far:
- it isn't thread-safe, though that is planned.
- it doesn't support MBCS at all; don't rely on it if you need proper handling of MBCS locales.
Background
I believe programmers like using string classes. As for myself, I'm a lazy developer: I hate keeping track of endless string buffer (re)(de)allocation, as well as working out how long each buffer should be to hold all the necessary data without overflowing. That's what one does when using only CRT calls to manipulate strings. One should also keep in mind that if A, B, and C point to the same memory block filled with (string) data, and some changes are later made to this block via B, then A and C will obviously see the changed sequence as well, which may not make their owner happy. Well, let's see what's on the standard menu. Basically, there are:
1. ATL|MFC string,
2. VCL (the one from C++ Builder) string,
3. the quite stand-alone _bstr_t string,
4. STL string.
Strings (1)-(3) use the clone-on-write (or copy-on-write) approach. In a nutshell, you may have many string object instances that internally refer to one shared string buffer. Copy assignments are fast: no memory allocation or buffer copying is needed, just a change to the shared data's reference count. Only when you change one of these string objects does it internally create an exact copy of the referenced buffer, release the previously referenced buffer, and become the sole referrer of the new one. This makes it practical to pass such string objects by value: the object itself is relatively small, and no actual string copy occurs.
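To make the mechanics concrete, here is a minimal sketch of copy-on-write with a shared reference count. It is not the EsString implementation: SharedBuf, CowString, setAt, and isUnique are hypothetical names invented for this illustration, and the sketch does no bounds checking and is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical shared buffer: character data plus a reference count.
struct SharedBuf {
    std::size_t refs;
    std::size_t len;
    char* data;
};

// Hypothetical copy-on-write string (illustration only, not EsString).
class CowString {
public:
    explicit CowString(const char* s) : buf_(make(s, std::strlen(s))) {}
    // Copying only shares the buffer and bumps the reference count.
    CowString(const CowString& other) : buf_(other.buf_) { ++buf_->refs; }
    CowString& operator=(const CowString& other) {
        ++other.buf_->refs;  // increment first, so self-assignment is safe
        release();
        buf_ = other.buf_;
        return *this;
    }
    ~CowString() { release(); }

    const char* c_str() const { return buf_->data; }
    bool isUnique() const { return buf_->refs == 1; }

    // Every mutating call detaches from the shared buffer first.
    void setAt(std::size_t i, char c) {
        detach();
        buf_->data[i] = c;
    }

private:
    static SharedBuf* make(const char* s, std::size_t len) {
        SharedBuf* b = new SharedBuf;
        b->refs = 1;
        b->len = len;
        b->data = new char[len + 1];
        std::memcpy(b->data, s, len);
        b->data[len] = '\0';
        return b;
    }
    void release() {
        if (--buf_->refs == 0) {
            delete[] buf_->data;
            delete buf_;
        }
    }
    void detach() {
        if (buf_->refs > 1) {
            SharedBuf* copy = make(buf_->data, buf_->len);
            --buf_->refs;  // the other referrers keep the old buffer
            buf_ = copy;
        }
    }
    SharedBuf* buf_;
};
```

With this sketch, `CowString b = a;` leaves both objects pointing at the same buffer, and the first `b.setAt(...)` is the moment the real copy happens.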
STL string (4) is plain and straightforward: unless you use references or pointers to string objects, you will have as many copies of the string contents as were created via assignment operators or copy constructors. Safe, but dumb. But... the STL string has one advantage: it is actually not a string but a buffer of characters, so you may have an STL string of length n containing n binary zeroes.
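The "buffer of characters" point is easy to check with std::string itself. The helper name makeZeroBuffer below is made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Builds a std::string holding n zero bytes; its length is n, not 0.
inline std::string makeZeroBuffer(std::size_t n) {
    return std::string(n, '\0');
}
```

A C-style view of the same data (`std::strlen(s.c_str())`) stops at the first zero, while `s.size()` still reports all n items.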
So, my intention was to use the copy-on-write approach found in (1)-(3), and have the ability to store binary zeroes (but the latter wasn't the main goal). Also, I tried to keep away from using (and thus depending on) something like ATL, MFC, VCL (for God's sake), or STL, unless it's really needed.
Using the code
The code consists of two header files with self-explanatory names: EsRefCounted.hpp and EsString.hpp. The former contains:
class EsRefCounted;
template <class ValueT>
class EsRefCountedPtrT;
EsRefCounted is the base class for the shared data holder; it is what EsRefCountedPtrT-derived wrapper classes expect to hold. The actual string data holder and manipulator class, EsStringValueT, described below, is derived from EsRefCounted.
The EsRefCountedPtrT class template provides all the basic refcounting logic and an overloaded operator =, so that all = assignments between EsRefCountedPtrT-derived objects of the same type use it.
EsString.hpp contains several helper templates, the refcounted string data holder, and the actual string class template derived from EsRefCountedPtrT.
template <bool IsWide, typename CharT>
struct EsCharTraitsBaseT;
template <typename CharT>
struct EsCharTraitsBaseT<false, CharT>
template <typename CharT>
struct EsCharTraitsBaseT<true, CharT>
template <typename CharT>
struct EsIsWideCharTypeT
{
  enum {Yes = sizeof(CharT) > sizeof(char), No = !Yes};
};
template <typename CharT>
class EsCharTraitsT : EsCharTraitsBaseT<
  EsIsWideCharTypeT<CharT>::Yes, CharT >
First, why the hell are these helpers needed at all? String manipulation internally uses CRT function calls. Of course, these functions have different names depending on whether they work on single-byte or wide strings\characters. OK, one may say, why not use the uniform TCHAR mappings? TCHAR mappings are fine unless you have to use single-byte and wide strings at the same time, and I wanted each particular string template instantiation to "decide" which branch of the CRT (and, in some cases, WinAPI) functions to use. Also, some methods have to work differently internally, BSTR-to-string conversions for example, while these differences should be hidden from the string class itself. The "topmost" helper template, EsCharTraitsT, provides uniform static methods for generic CharT sequence manipulation, as well as a (static const) member equal to the byte size of the CharT used for the current template instantiation. Why static? EsCharTraitsT is just a placeholder for internal code specific to a concrete CharT type. It never needs to be instantiated as an object. All it is used for are EsCharTraitsT<>::SomeMethod() calls inside string methods.
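As a rough illustration of the idea (not the actual EsCharTraitsT code), a traits template can route generic code to the matching CRT branch through static methods. CharOps, Length, Compare, CharSize, and RawByteLength are hypothetical names chosen for this sketch, and it uses plain full specializations rather than the bool-selected base class shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical traits template; only the specializations are defined.
template <typename CharT> struct CharOps;

// Single-byte branch: forwards to the str* family of CRT functions.
template <> struct CharOps<char> {
    static const std::size_t CharSize = sizeof(char);
    static std::size_t Length(const char* s) { return std::strlen(s); }
    static int Compare(const char* a, const char* b) { return std::strcmp(a, b); }
};

// Wide branch: forwards to the wcs* family of CRT functions.
template <> struct CharOps<wchar_t> {
    static const std::size_t CharSize = sizeof(wchar_t);
    static std::size_t Length(const wchar_t* s) { return std::wcslen(s); }
    static int Compare(const wchar_t* a, const wchar_t* b) { return std::wcscmp(a, b); }
};

// Generic code calls the static methods and never creates a CharOps object.
template <typename CharT>
std::size_t RawByteLength(const CharT* s) {
    return CharOps<CharT>::Length(s) * CharOps<CharT>::CharSize;
}
```

Because the choice is made per instantiation, char and wchar_t strings can be used side by side in one translation unit, which is exactly what TCHAR-style macros cannot offer.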
template <typename CharT>
class EsStringValueT : public EsRefCounted
The EsStringValueT class provides (inlined where appropriate) methods for string search, manipulation, formatting, extraction, etc. The referrer wrapper class basically delegates its calls to the corresponding methods of this object, maintaining string buffer uniformity if needed and extending the EsStringValueT methods where appropriate.
template <typename CharT>
class EsStringT : public EsRefCountedPtrT< EsStringValueT<CharT> >
This is the main string "worker", followed by its implementation. The EsString.hpp file also contains typedefs for the single-byte and wide char instantiations, as well as some UNICODE mappings:
typedef EsStringT<char> EsStringA;
typedef EsStringT<wchar_t> EsStringW;
#ifdef _LIST_
typedef std::list<EsStringA> EsAStrings;
typedef std::list<EsStringW> EsWStrings;
#endif //_LIST_
#ifdef _UNICODE
#define EsString EsStringW
#ifdef _LIST_
#define EsStrings EsWStrings
#endif //_LIST_
#else
#define EsString EsStringA
#ifdef _LIST_
#define EsStrings EsAStrings
#endif //_LIST_
#endif //_UNICODE
So, there are two strings defined, for char and wchar_t, plus a "standard" string that is char- or wchar_t-based depending on the _UNICODE flag. In addition, if the STL <list> header is included somewhere before the EsString.hpp header, string lists based on std::list become available. The latter is a legacy thing: the string header was extracted from the project I'm currently working on, and I just decided to leave it as is.
Features:
What follows is the interface provided by the EsString class; the ES_WINDOWS symbol includes\excludes the Windows-dependent parts:
EsStringT()
EsStringT( const CharT *pStr, size_t nCount = 0,
bool bAddZeroTerminator = true )
EsStringT( CharT cCh, size_t nCount = 1 )
template <typename OtherCharT>
EsStringT( const EsStringT<OtherCharT>& crefOther )
#ifdef ES_WINDOWS
explicit EsStringT( BSTR pStr, bool bReleaseBSTR = true )
#endif //ES_WINDOWS
inline size_t GetRawLen() const
inline const char* GetRaw() const
inline size_t GetLen() const
inline const CharT* c_str() const
inline const CharT& At(size_t nIdx) const
inline CharT& At(size_t nIdx)
inline const CharT& operator[] (size_t nIdx) const
inline CharT& operator[] (size_t nIdx)
inline bool IsZeroTerminated() const
inline int GetPos( CharT cCh, int iFrom = 0 ) const
inline int GetRPos( CharT cCh, int iFrom ) const
inline int FindFirstIn( const CharT* strPattern,
int iFrom = 0 ) const
inline int RFindFirstIn( const CharT* strPattern, int iFrom ) const
inline int FindFirstNotIn( const CharT* strPattern,
int iFrom = 0 ) const
inline int RFindFirstNotIn( const CharT* strPattern,
int iFrom ) const
inline int GetPos( const EsStringT<CharT> strPattern,
int iFrom = 0 ) const
inline int GetRPos( const EsStringT<CharT> strPattern, int iFrom ) const
inline int Compare( const EsStringT<CharT>& crefOther ) const
inline int CompareIC( const EsStringT<CharT>& crefOther ) const
inline void Add( const EsStringT<CharT>& crefOther )
void Replace( const EsStringT<CharT>& crefPattern,
const EsStringT<CharT>& crefReplaceBy )
inline void TrimLeft(const CharT* strPattern)
inline void TrimRight(const CharT* strPattern)
inline void Trim(const CharT* strPattern)
inline EsStringT SubString(size_t nStart, int iCount = -1) const
inline const EsStringT& ToLower()
inline const EsStringT& ToUpper()
void Format(const EsStringT<CharT> strFormat, ...)
inline const EsStringT& BinToHex( const char* pBuff,
size_t nLen, const CharT* strHexPfx = NULL )
inline const BaseValT* GetValue() const
template <class OtherCharT>
inline void ConvertFrom( const EsStringT<OtherCharT>& crefSrc )
#ifdef ES_WINDOWS
inline void ConvertFrom(const BSTR pSrc, bool bReleaseBSTR = true)
#endif //ES_WINDOWS
template <class OtherCharT>
inline void operator= (const EsStringT<OtherCharT>& crefSrc)
inline void ToString( double dVal )
inline void ToString( int iVal )
inline EsStringT operator+ ( const EsStringT<CharT>& crefOther )
inline void operator+= ( const EsStringT<CharT>& crefOther )
inline bool operator< ( const EsStringT<CharT>& crefOther ) const
inline bool operator== ( const EsStringT<CharT>& crefOther ) const
inline bool operator> ( const EsStringT<CharT>& crefOther ) const
inline bool operator!= ( const EsStringT<CharT>& crefOther ) const
inline bool operator<= ( const EsStringT<CharT>& crefOther ) const
inline bool operator>= ( const EsStringT<CharT>& crefOther ) const
static EsStringT IncludeTrailingChar( const EsStringT<CharT> strSrc,
CharT chTrail )
inline bool IsUnique() const
inline void Unique()
#ifdef ES_WINDOWS
static EsStringT GetErrorDescription(int iErrorCode,
DWORD nLangId = MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT) )
inline void AssignErrorDescription(int iErrorCode,
DWORD nLangId = MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT) )
static EsStringT GetResourceString( UINT uID,
HINSTANCE hInstance = NULL, size_t nSizeMax = 1024 )
inline void AssignResourceString( UINT uID,
HINSTANCE hInstance = NULL, size_t nSizeMax = 1024 )
#endif //ES_WINDOWS
History
- 11.24.2004 - Added compilation flag and code for chunk memory allocation, fixed several minor bugs and typos. Demo project updated to allow for performance counting.
- 11.10.2004 - First release.
Compatibility & Performance
First, I've tried to make the source compatible with the three compilers that I use often. It was developed primarily under VC++ .NET; the other two are BCC from Borland C++ Builder 6 and a recent GCC. Unfortunately, MS compilers from earlier Visual Studio releases don't "understand" partial template specialization, so that branch was omitted; I just didn't want to spend time working around cl.exe bugs. The code as it stands should compile and run under BCC and .NET's cl; GCC may also work, but I didn't run a GCC-compiled demo.
As for performance testing, so far I've implemented simple code that makes 10^7 character summations, repeated for 10 iterations for statistics, measuring the time taken by each iteration as well as the memory allocated within it. The "memory used" figure is good only for a very rough estimate, because it's rather inaccurate. If you have experience in getting exact values for the memory allocated by a process, please advise. The times shown below were measured on a PM-1400 notebook with 512 MB of RAM.
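For readers who want to reproduce the timing part, a test loop of this kind might look like the sketch below. It is not the demo project's code: timeAppends is a hypothetical helper, it assumes that a "summation" means appending one character, it uses std::string and std::chrono (C++11) instead of the string classes and timers used in the demo, and it does not attempt the memory measurement.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>

// Appends `count` characters one at a time and returns the elapsed seconds.
// `outLen` receives the final length so the loop cannot be optimized away.
inline double timeAppends(std::size_t count, std::size_t& outLen) {
    const auto start = std::chrono::steady_clock::now();
    std::string s;
    for (std::size_t i = 0; i < count; ++i) {
        s += static_cast<char>('a' + i % 26);
    }
    const auto stop = std::chrono::steady_clock::now();
    outLen = s.size();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling it ten times with count set to 10000000 and printing each result would give a table shaped like the ones below.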
Borland C++ Builder 6:
AnsiString, 10000000 summations
#0 Run for: 2.58 sec, used mem: 10035200 bytes
#1 Run for: 2.39 sec, used mem: 10010624 bytes
#2 Run for: 2.39 sec, used mem: 10039296 bytes
#3 Run for: 2.39 sec, used mem: 9981952 bytes
#4 Run for: 2.39 sec, used mem: 9973760 bytes
#5 Run for: 2.39 sec, used mem: 10047488 bytes
#6 Run for: 2.39 sec, used mem: 9969664 bytes
#7 Run for: 2.40 sec, used mem: 10059776 bytes
#8 Run for: 2.38 sec, used mem: 9969664 bytes
#9 Run for: 2.39 sec, used mem: 9994240 bytes
EsString, 10000000 summations:
#0 Run for: 1.47 sec, used mem: 9273344 bytes
#1 Run for: 1.30 sec, used mem: 9670656 bytes
#2 Run for: 1.32 sec, used mem: 9707520 bytes
#3 Run for: 1.31 sec, used mem: 9752576 bytes
#4 Run for: 1.32 sec, used mem: 9576448 bytes
#5 Run for: 1.30 sec, used mem: 9859072 bytes
#6 Run for: 1.31 sec, used mem: 10010624 bytes
#7 Run for: 1.32 sec, used mem: 9719808 bytes
#8 Run for: 1.30 sec, used mem: 10010624 bytes
#9 Run for: 1.31 sec, used mem: 10043392 bytes
STL string, 10000000 summations: bad_alloc after a 30 sec run, used mem > 1 GB.
MS VC++ .NET
MFC\ATL string, 10000000 summations:
#0 Run for: 38.15 sec, used mem: 9039872 bytes
#1 Run for: 38.55 sec, used mem: 9969664 bytes
#2 Run for: 38.56 sec, used mem: 9977856 bytes
#3 Run for: 38.11 sec, used mem: 25337856 bytes
#4 Run for: 38.80 sec, used mem: 5672960 bytes
#5 Run for: 38.77 sec, used mem: 2985984 bytes
#6 Run for: 38.79 sec, used mem: 10039296 bytes
#7 Run for: 38.92 sec, used mem: 6062080 bytes
#8 Run for: 38.76 sec, used mem: 10522624 bytes
#9 Run for: 38.78 sec, used mem: 9994240 bytes
EsString, 10000000 summations:
#0 Run for: 1.06 sec, used mem: 10055680 bytes
#1 Run for: 1.05 sec, used mem: 10047488 bytes
#2 Run for: 1.05 sec, used mem: 9998336 bytes
#3 Run for: 1.05 sec, used mem: 10027008 bytes
#4 Run for: 1.05 sec, used mem: 10022912 bytes
#5 Run for: 1.04 sec, used mem: 9977856 bytes
#6 Run for: 1.04 sec, used mem: 10043392 bytes
#7 Run for: 1.06 sec, used mem: 10027008 bytes
#8 Run for: 1.04 sec, used mem: 10027008 bytes
#9 Run for: 1.05 sec, used mem: 10006528 bytes
EsString, without chunk allocation
#0 Run for: 41.74 sec, used mem: 9322496 bytes
#1 Run for: 42.10 sec, used mem: 9830400 bytes
#2 Run for: 42.13 sec, used mem: 9973760 bytes
#3 Run for: 42.14 sec, used mem: 9912320 bytes
#4 Run for: 42.14 sec, used mem: 9986048 bytes
#5 Run for: 42.14 sec, used mem: 10006528 bytes
#6 Run for: 42.14 sec, used mem: 9654272 bytes
#7 Run for: 42.12 sec, used mem: 10080256 bytes
#8 Run for: 42.11 sec, used mem: 10027008 bytes
#9 Run for: 42.12 sec, used mem: 9953280 bytes
These tests show that the EsString class performs quite well with chunk memory allocation switched on. It clearly outperforms the standard strings during massive concatenation while using roughly the same amount of memory. The closest rival is, to my surprise, Borland's AnsiString, which is only about 1.8 times slower under the same compiler (roughly 2.39 sec against 1.31 sec per run). MS's CString seems surprisingly slow; I believe its memory allocation policy is responsible. For comparison, I ran the same test without chunk memory allocation, and EsString showed run times relatively close to CString's, though in that case the latter was slightly faster. I didn't test the small-string case, because I believe the relative results would be the same, except for CString, which uses a statically (on-stack) allocated buffer for short strings, and that might boost its performance.
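Why chunking matters so much for this test can be seen from a small model. The sketch below is not taken from EsString: roundUpToChunk and countReallocs are hypothetical helpers, and the chunk sizes are arbitrary values picked for illustration. It only counts how often a buffer would have to be reallocated if capacity grew in whole chunks.

```cpp
#include <cassert>
#include <cstddef>

// Rounds a requested size up to the next multiple of chunkSize (chunkSize > 0).
inline std::size_t roundUpToChunk(std::size_t needed, std::size_t chunkSize) {
    return ((needed + chunkSize - 1) / chunkSize) * chunkSize;
}

// Counts the reallocations caused by appending `count` single characters
// when capacity only ever grows to a whole number of chunks.
inline std::size_t countReallocs(std::size_t count, std::size_t chunkSize) {
    std::size_t capacity = 0;
    std::size_t reallocs = 0;
    for (std::size_t len = 1; len <= count; ++len) {
        if (len > capacity) {
            capacity = roundUpToChunk(len, chunkSize);
            ++reallocs;
        }
    }
    return reallocs;
}
```

With a chunk size of 1 (that is, no chunking) every append reallocates; with a chunk of 256 items, 1000 appends need only 4 reallocations.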
As for the STL string, as I said at the beginning of this article, its logic is quite dumb and straightforward, so if you are looking for assignment and concatenation performance, don't use it unless you absolutely have to. Alternatively, if you know exactly how many characters you expect to add, reserve the string's capacity beforehand.
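The reserve advice can be sketched as follows; appendWithReserve is a hypothetical helper written for this illustration. It reserves once and then counts how many times the capacity changes while appending.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends `count` characters after reserving the capacity once up front.
// Returns how many times capacity() changed during the loop.
inline std::size_t appendWithReserve(std::size_t count) {
    std::string s;
    s.reserve(count);
    std::size_t capChanges = 0;
    std::size_t cap = s.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        s += 'x';
        if (s.capacity() != cap) {
            cap = s.capacity();
            ++capChanges;
        }
    }
    return capChanges;
}
```

Since reserve guarantees a capacity of at least count, the loop never has to grow the buffer, so the function returns 0.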
Comments
The ES_ASSERT(x) macro is used in the string code, and it is defined in the precompiled header of the demo project as:
#ifndef ES_ASSERT
#if !defined(_DEBUG) && !defined(NDEBUG)
#define NDEBUG
#endif
#include <assert.h>
#define ES_ASSERT(x) assert(x)
#endif //ES_ASSERT
When used in some project, EsString and related headers may be included in the precompiled header after the ES_ASSERT define, as is done in the demo:
#include "EsRefCounted.hpp"
#include "EsString.hpp"
The source archive contains additional files: stdafx.h and its .cpp. That's because the former has several helper defines, as well as the sketchy EsException class used in the EsString code.
Any code (and performance) improvements, bug fixes, etc. are welcome.
Plans for further development: make this class thread-safe, with conditional defines to exclude the thread-safety locks for the sake of performance. Optionally, if project development plans demand it, add an EsString-based string stream that makes use of EsString's good concatenation performance.