
Strings UNDOCUMENTED

6 Jan 2004
A detailed look at the implementation of strings in .NET

Introduction

This is an in-depth, behind-the-scenes look at strings in the Common Language Runtime and the .NET Framework. It provides detailed information on the implementation of strings and describes efficient and inefficient ways of using them.

This is the first in a planned series of articles exploring the underlying implementations of various fundamental features in C# and the framework. I will continue the series if The Code Project community finds this information helpful and offers suggestions for this piece.

Most of the information presented here cannot be found anywhere else: not in MSDN, not in any book, and not in any article. My intent was to understand how to use C# efficiently to develop a serious commercial application. I am a former Microsoft developer on Excel who has started his own software company, developing applications that employ artificial intelligence.

Background

Strings, as you know, are a fundamental type in the .NET world. They are one of a small set of exclusive objects that the CLR and the JIT compiler have intimate knowledge of. The others include, of course, the primitive data types, StringBuilder, Array, Type, Enum, delegates, and a few Reflection classes such as MethodInfo.

In .NET version 1.0, each heap-allocated object consists of a 4-byte object header and a 4-byte pointer to a method table. The object header provides five bit flags, one of which is reserved for the garbage collector to mark an object as reachable rather than free space. The remaining bits form a 27-bit index called the syncindex, which may refer to another table. This index has multiple purposes. As its name implies, it is used for synchronization whenever the "lock" keyword is used. It is also used as the default hash code for Object.GetHashCode(). It does not provide the best distribution properties for a hash code, but it meets the minimum requirements of a hash code for reference equality. The syncindex remains constant throughout the life of the object.

The String class and the Array class (including descendant classes) are the only two variable length objects.

Internally, strings resemble OLE BSTRs: an array of Unicode characters preceded by length data and followed by a final null character. A string occupies additional space because it includes the following members, in order. Later, I shall explain how to access and change some of these internal variables, with and without using reflection.

| Variable | Type | Description |
| --- | --- | --- |
| m_arrayLength | int | The actual number of characters allocated for the string. This is equal to the logical string length + 1 (for the null character) for a normally created string, but it can reach up to twice the string length for a string returned by StringBuilder. |
| m_stringLength | int | The logical length of the string, the one returned by String.Length. Because a number of high bits are used for additional flags to enhance performance, the maximum length of the string is constrained to a limit much smaller than UInt32.MaxValue on 32-bit systems. Some of these flags indicate that the string contains only simple characters such as plain ASCII and will not require invoking complex Unicode algorithms for sorting and comparison tests. |
| m_firstChar | char | The first character of the string, or the null character for an empty string. |

Strings are always terminated with a null character, even though it is valid for a string to contain embedded null characters. This facilitates interoperability with unmanaged code and the traditional Win32 API.

Altogether, a string occupies 16 bytes of memory + 2 bytes per character allocated + 2 bytes for the final null character. As described in the table, the number of characters allocated may be up to twice the string length if StringBuilder is used to create the string.

Extremely Efficient StringBuilder

Closely related to Strings is the StringBuilder class. Although StringBuilder is buried within the System.Text namespace, this class is not an ordinary class, but one that is specially handled by the runtime and the JIT compiler. As a result it may not be possible to write an equivalent StringBuilder class that is as efficient.

StringBuilder will construct a string object (which, you thought, was immutable) and modify it directly. By default, StringBuilder will create a string of 16 characters. (However, it might have been more appropriate to use an odd number like 15: the string is allocated with space for a null character, which wastes room for another character that would otherwise go unused because objects are aligned on 4-byte boundaries when created.)

| Variable | Type | Description |
| --- | --- | --- |
| m_MaxCapacity | int | The maximum capacity of the string. |
| m_currentThread | int | The id of the thread in which the string was created. |
| m_StringValue | string | The actual string to modify. |

If the string needs to grow beyond its capacity, a new string will be constructed with the greater of twice the previous capacity or the new desired size, subject to the MaxCapacity constraint. This approach takes O(3n), which is linear time. The alternative approach of growing the string by a fixed amount rather than a percentage results in quadratic time performance, O(n^2).
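To make the linear-versus-quadratic claim concrete, the sketch below counts how many characters are copied during reallocations while appending n characters one at a time under each growth policy. It is a simulation only: GrowthDemo and CountCopies are hypothetical names, and the model ignores MaxCapacity and everything else the real StringBuilder does.

```csharp
using System;

static class GrowthDemo
{
    // Counts the characters copied during reallocation while appending n
    // characters one at a time. When doubling is true the capacity doubles
    // each time the buffer fills; otherwise it grows by a fixed step.
    // Hypothetical model for illustration, not the CLR's implementation.
    public static long CountCopies(int n, bool doubling, int step)
    {
        long copies = 0;
        int capacity = 16;   // default StringBuilder capacity cited in the article
        int length = 0;
        for (int i = 0; i < n; i++)
        {
            if (length == capacity)
            {
                copies += length;   // existing characters move to the new buffer
                capacity = doubling ? capacity * 2 : capacity + step;
            }
            length++;
        }
        return copies;
    }

    static void Main()
    {
        int n = 1000000;
        Console.WriteLine("doubling:  {0} characters copied", CountCopies(n, true, 0));
        Console.WriteLine("fixed +16: {0} characters copied", CountCopies(n, false, 16));
    }
}
```

With doubling, the total copied stays below 2n no matter how large n gets, whereas with a fixed step it grows with the square of n.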

When StringBuilder.ToString() returns a string, the actual string being modified is returned. However, if the capacity of the string exceeds twice the string length, StringBuilder.ToString() will construct a new compact version of the string. The same string is returned by multiple calls to ToString().

Modifying the StringBuilder after a ToString() results in copy-on-write: an entirely new copy of the string is used, so as not to change the previously returned immutable string.

StringBuilder costs 16 bytes not including the memory used by the string. However, the same StringBuilder object can be reused multiple times to create several strings by setting Length to 0 and Capacity to the desired size, so the cost of a StringBuilder is incurred just once.

As you can see, creating Strings using StringBuilder is an extremely efficient operation.

Other performance tips:

  • When concatenating strings with +, such as " a + b + c ", the compiler will call a version of Concat(a,b,c) with that many arguments, thereby eliminating extra copies of strings.
  • StringBuilder is, of course, faster than building a string through repeated concatenation.
  • If you know the capacity of the string beforehand, you can set that in the constructor to avoid unnecessary copying. If there is also an upper bound (MaxCapacity), that can be set in the constructor as well--but only in the constructor.
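The tips on presizing and reuse can be sketched as follows. BuilderReuseDemo and Join are hypothetical names chosen for the illustration, and the capacities are arbitrary; only the StringBuilder members used are real. Whether ToString() shares or copies the internal buffer depends on the runtime version, so treat the allocation savings as something to measure.

```csharp
using System;
using System.Text;

static class BuilderReuseDemo
{
    // Joins the items with commas, reusing the caller's StringBuilder
    // instead of allocating a new one per call.
    public static string Join(StringBuilder sb, string[] items)
    {
        sb.Length = 0;   // discard the previous contents before reuse
        for (int i = 0; i < items.Length; i++)
        {
            if (i > 0) sb.Append(',');
            sb.Append(items[i]);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Initial capacity 64 and an upper bound of 1024; the upper bound
        // (MaxCapacity) can only be supplied through the constructor.
        StringBuilder sb = new StringBuilder(64, 1024);
        Console.WriteLine(Join(sb, new string[] { "a", "b", "c" }));
        Console.WriteLine(Join(sb, new string[] { "x", "y" }));
    }
}
```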

Minimizing Garbage Collection


Utilizing StringBuilder to construct strings can significantly reduce allocations. However, a number of quick tests have demonstrated that even a full garbage collection can occur in a fraction of a second -- an imperceptible amount of time. It may not be worth avoiding garbage collection without profiling the application first. On the other hand, frequent garbage collection could be detrimental to performance. I have sometimes noticed a short unexplained pause in .NET applications; it's hard to say if it is due to the JIT compiler, the garbage collector, or some other factor. Many old-fashioned applications, such as the Windows shell, Word, and Internet Explorer, have unexplained pauses as well.

.NET uses a three-generation approach to collecting memory, based on the heuristic that newly allocated memory tends to be freed more frequently than older allocations, which tend to be more permanent. Gen 0 is the youngest generation and, after a garbage collection, any survivors go on to Gen 1. Likewise, any survivors of a Gen 1 collection go on to Gen 2. Usually garbage collection will occur only on Gen 0, and only after it has reached some limit.
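Promotion between generations can be observed with the GC class. The sketch below is only an experiment (GenerationDemo is a hypothetical name): it forces collections with GC.Collect(), which production code should rarely do, and the generation numbers printed may vary with the runtime and its settings.

```csharp
using System;

static class GenerationDemo
{
    static void Main()
    {
        object o = new object();
        Console.WriteLine("after allocation:      gen {0}", GC.GetGeneration(o));

        GC.Collect();   // o survives because it is still referenced below
        Console.WriteLine("after one collection:  gen {0}", GC.GetGeneration(o));

        GC.Collect();
        Console.WriteLine("after two collections: gen {0}", GC.GetGeneration(o));

        Console.WriteLine("highest generation:    {0}", GC.MaxGeneration);
        GC.KeepAlive(o);   // keeps o reachable up to this point
    }
}
```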

The cost of allocating memory on the heap under garbage collection is less than that under the C runtime heap allocator. Until memory is exhausted, the cost of allocating each new object is that of incrementing a pointer--which is close to the performance of advancing the stack pointer. According to Microsoft, the cost of a garbage collection of generation 0 is equivalent to a page fault--from 0 to 10 milliseconds. The cost of a generation 1 collection is from 10 to 30 ms, while a generation 2 collection depends on the working set. My own profiling indicated that generation 0 collections occur 10-100 times more frequently than collections of the other two generations.

One book by Jeffrey Richter that I have read suggested hypothetical limits of 250Kb for Gen 0, 2Mb for Gen 1 and 10Mb for Gen 2. However, my own investigation into the Rotor shared source CLI indicated that the threshold appears to be initially 800Kb for Gen 0 and 1Mb for Gen 1; of course, these are undocumented and subject to change. The thresholds are adjusted dynamically according to actual program allocations. If very little memory is being freed in Gen 0 and most of it survives to Gen 1, the threshold is expanded.

Suboptimal string functions

Many of the functions provided by the String class generate needless allocations that increase the frequency of garbage collections.

For example, the functions ToUpper and ToLower will generate a newly allocated string, whether or not any changes were actually made to the string. A more efficient implementation would return the original immutable string. Likewise, Substring will return a new string even if the entire string or an empty string is returned. It would have been better for String.Empty to be returned in the latter case.
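A wrapper along the following lines avoids the allocation when nothing would change. CaseHelper and ToUpperIfNeeded are hypothetical names, and the per-character check with char.ToUpper is an assumption that holds for simple text; culture-specific casing rules may warrant a more careful test.

```csharp
using System;

static class CaseHelper
{
    // Returns s itself when no character would change under char.ToUpper,
    // so the common "already uppercase" case allocates nothing.
    public static string ToUpperIfNeeded(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.ToUpper(s[i]) != s[i])
                return s.ToUpper();   // at least one character differs: allocate once
        }
        return s;
    }

    static void Main()
    {
        string a = "HELLO";
        Console.WriteLine(object.ReferenceEquals(a, ToUpperIfNeeded(a)));
        Console.WriteLine(ToUpperIfNeeded("Hello"));
    }
}
```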

It is very difficult, if not impossible, to escape the numerous hidden allocations occurring with the class library. For example, whenever any numeric data type such as int or float is formatted as string (for example, through String.Format or Console.WriteLine), a new hidden string is created. In such cases, it is possible but inconvenient to write our own code to format strings.

When using Console.WriteLine with a value type, the value type object will be boxed and then converted to a string, resulting in two allocations. For example, Console.WriteLine(format, myint) is effectively equivalent to Console.WriteLine(format, ((object)myint).ToString()). You can save an extra allocation by calling ToString explicitly with Console.WriteLine(format, myint.ToString()). Since WriteLine includes overloads for common primitive types, it's more of an issue for custom value types or calls to WriteLine with many arguments.
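The two call shapes look like this side by side. BoxingDemo and its methods are hypothetical names; both methods produce the same text, and the difference is only in whether the int is boxed on the way in. The allocation counts are the article's claim, not something this sketch measures.

```csharp
using System;

static class BoxingDemo
{
    // Passing the int directly boxes it to object before it is formatted.
    public static string FormatBoxed(int n)
    {
        return string.Format("value {0}", n);
    }

    // Calling ToString() first avoids the box; only the string is passed.
    public static string FormatExplicit(int n)
    {
        return string.Format("value {0}", n.ToString());
    }

    static void Main()
    {
        int count = 42;
        Console.WriteLine(FormatBoxed(count));
        Console.WriteLine(FormatExplicit(count));
    }
}
```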

Other parts of the library exhibit these inefficiencies, too. In the Windows Forms Library, for example, the Control Text property always returns a new string; this may actually be understandable, because the property may not be cached and therefore it must call into Windows API with the function GetWindowText to retrieve the value of the control.

GDI+ is the worst abuser of the garbage collector, because a new string must be constructed for each call to MeasureString or DrawString.

One good string function is GetHashCode, which produces an integer code with a very good distribution and takes every character into account.

Direct Modifications of Strings

1) in-place modification of strings

public static unsafe void ToUpper(string str)
{
    fixed (char *pfixed = str)
    for (char *p=pfixed; *p != 0; p++)
        *p = char.ToUpper(*p);
}

The example above demonstrates how to change an immutable string through the use of unsafe pointers. A good example of the efficiency gained through this approach is str.ToUpper(), which returns a newly constructed string that is the uppercase version of the original. An entirely new string is created whether any changes were made or not; with the code above, the same string is modified in place.

The fixed keyword pins the string in the heap so that it cannot move during a garbage collection and allows the address of the string to be converted to a pointer. The new address points to the start of the string (or if an index is included as in &str[index], to the location referred to by the index), which is guaranteed to be null-terminated.

The functions below provide a more complete set of functions for modifying a string, without reflection.

public static unsafe int GetCapacity(string s)
{
    fixed(char *p = s)
    {
        int *pcapacity = (int *)p - 2;
        return *pcapacity;
    }    
}

public static unsafe int GetLength(string s)
{
    // This function is redundant, because it accomplishes
    // the same role as s.Length,
    // but it does demonstrate some of the precautions
    // that must be taken to
    // recover the length variable.
    fixed(char *p = s)
    {
        int *plength = (int *)p - 1;
        int length = *plength & 0x3fffffff;
        Debug.Assert(length == s.Length);
        return length;
    }
}

public static unsafe void SetLength(string s, int length)
{
    fixed(char *p = s)
    {
        int *pi = (int *)p;
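        // The array length includes the terminating null, so the logical
        // length must stay strictly below it.
        if (length<0 || length >= pi[-2])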
            throw( new ArgumentOutOfRangeException("length") );
        pi[-1] = length;
        p[length] = '\0';
    }
}

public static unsafe string NewString(int capacity)
{
    return new string('\0', capacity);
}

public static unsafe void SetChar(string s, int index, char ch)
{
    if (index<0 || index >= s.Length)
        throw( new ArgumentOutOfRangeException("index") );
    fixed(char *p = s)
        p[index] = ch;
}

Modifying the values will indeed change the string. If strings can be changed, why then are they immutable? The fact that strings are immutable allows them to be passed around as easily as regular integers. Strings can be initialized with null and can be blitted from one structure to the next, without having to use an elaborate mechanism for copy-on-write.

Capacity refers to the array length of the string, which cannot be changed. However, the logical string length can be changed through the 32-bit integer just before the string. Examining or modifying this value requires caution. The string length must always be less than the array length of the string. The high two bits contain flags that indicate whether the string consists entirely of 7-bit ASCII text, in which case it can be sorted and compared quickly. A value of zero for both of these bits indicates that the state is indeterminate. When reading the length value, the length must be ANDed with 0x3fffffff to mask off the two bits. Modifying the length is okay as long as the high two bits are clear, which is normally the case.

Future implementations or current non-Windows implementations of the CLR may change the underlying implementation of the string. Fortunately, the version of the runtime that you built with will be the one selected when your executable is started.

System.Reflection also allows programmers to access hidden fields, members, and properties--yes, even those marked with internal and private. This reflection-based approach performs much more slowly than the manual approach above, but does not require unsafe code and should be more robust in the face of changing versions of the runtime. The following line demonstrates the ability to change the length of an immutable string through reflection:

typeof(string).GetField("m_stringLength",
BindingFlags.NonPublic|BindingFlags.Instance).SetValue(s, 5);

I have written a short test function to demonstrate the use of the various string functions described above.

/// <summary>
/// The main entry point for the application.
/// </summary>
[STAThread]
static unsafe void Main(string[] args)
{
    StringBuilder sb = new StringBuilder();
    sb.Append("Good morning");

    string test = "How are you";
    SetLength(test, 5);
    SetChar(test, 1, 'a');

    string [] sarray = new String[] { String.Empty, "hello", "goodbye", 
                                      sb.ToString(), test};
    foreach (string s in sarray)
        Console.WriteLine("'{0}' has capacity {1}, length {2}", s,
            GetCapacity(s),
            GetLength(s));

    Console.ReadLine();
}

The test code above results in the following output, demonstrating the ease with which strings can actually be changed.

'' has capacity 1, length 0
'hello' has capacity 6, length 5
'goodbye' has capacity 8, length 7
'Good morning' has capacity 17, length 12
'Haw a' has capacity 12, length 5

2) Retrieving the working copy of string in StringBuilder

For the faint-hearted, who prefer a less intimate, less low-level interaction with strings, it's possible to recover the actual string that is being used by StringBuilder. Through reflection, we can gain access to the hidden internal m_StringValue variable.

string test = (string) sb.GetType().GetField("m_StringValue", BindingFlags.NonPublic|BindingFlags.Instance).GetValue(sb);

The returned string "test" will continue to reflect changes made within StringBuilder, unless a modification requires more capacity than the string holds. In that case, StringBuilder abandons the original string and constructs an entirely new, larger working string.

One caveat is that StringBuilder will never reduce the capacity of the string, without a call to ToString. ToString basically ensures that the current string length is at least half the capacity. Without it, the current string will only grow, never shrink, always staying at its peak size.

A future article on "Objects UNDOCUMENTED" will demonstrate how to access hidden members without using reflection.

String alternatives

To construct strings without impacting the garbage collector, you can use one of the following alternatives.

All of these alternatives use unsafe features, which have no real implications if you are writing a standalone application. Keep in mind, managed C++ is unsafe. I don't really recommend these approaches, but they are mentioned for your awareness.

1) stack-based character array

char *array = stackalloc char[arraysize ];

2) special struct for fixed-sized strings

[StructLayout(LayoutKind.Explicit, Size=514)]
public unsafe struct Str255
{
    [FieldOffset(0)] char start;
    [FieldOffset(512)] byte length;

    #region Properties
    public int Length
    {
        get { return length; }
        set
        {
            length = Convert.ToByte(value);
            fixed(char *p=&start) p[length] = '\0';
        }
    }

    public char this[int index]
    {
        get
        {
            if (index<0 || index>=length)
                throw(new IndexOutOfRangeException());
            fixed(char *p=&start)
                return p[index];
        }
        set
        {
            if (index<0 || index>=length)
                throw(new IndexOutOfRangeException());
            fixed(char *p=&start) p[index] = value;
        }
    }
    #endregion
}

Str255 is a stack-allocated value type that allows string operations to be performed without impacting the garbage collector. It can handle up to 255 characters and includes a length byte.

A reference to the structure can be directly passed to a Windows API call, because the start of the structure is the first character of a C-string. Of course, additional functions need to be written for editing strings. Careful attention would also be needed to ensure that the string is null-terminated for Windows interoperability.

Also, a conversion routine would be necessary to convert the struct to a .NET string, so that other CLR functions can use it. If this is used to create a .NET string, it would be superior to StringBuilder in one respect: Str255 would require fewer allocations.

Range Check Elimination

Indexing an array or string normally includes range-checking. According to Microsoft, the compiler performs a special optimization that improves the performance of iterating through an array or string. First, compare the following three approaches to iterating through a string. Which is fastest?

1) standard iteration

int hash = 0;
for (int i=0; i< s.Length; i++)
{
    hash += s[i];
}

2) iterative loop with saved length variable

int hash = 0;
int length = s.Length;
for (int i=0; i< length; i++)
{
    hash += s[i];
}

3) foreach iteration

int hash = 0;
foreach (char ch in s)
{
    hash += ch;
}

Surprisingly, with the .NET v1.0 JIT compiler, the first example, which repeatedly calls string.Length, actually produces the fastest code, while the third "foreach" example produces the slowest. In later versions of the compiler, the foreach example will be special-cased for strings to provide the same performance as example 1 or better.

Why is example 1 faster than example 2? This is because the compiler recognizes the pattern for (int i=0; i<s.Length; i++) for both strings and arrays. Strings are immutable; they have constant length. Since both strings and arrays have fixed lengths, the compiler simply stores away the length, so that no function call is made on each iteration. (The JIT compiler, which automatically inlines non-virtual method calls that consist of simple control flow and less than 32 bytes of IL instructions, may actually be inlining references to the string length.)

In addition, the compiler eliminates all range-check tests on any instance of s[i] within the loop, because i is guaranteed in the for condition to be within the range 0 <= i < length. Normally, any indexing of a string results in a range-check being performed; this is why attempting to save time by stashing away the length variable in example 2 actually results in slower code than in example 1.

NOTE: In version 1.1, the C# team special-cased foreach to behave like the first loop, so there should be no difference.

There is a fourth approach that should be even faster, although I have not verified it; however, it does require unsafe code to be emitted.

fixed (char *pfixed = s)
{
     // Walk to the terminating null; stops early at an embedded null
     for (char *p = pfixed; *p != 0; p++)
       hash += *p;
}

Efficient String Switches

C# supports switches on strings. In doing so, it uses an efficient mechanism for switching.

    
public void Test(string test)
{
    switch(test)
    {
        case "a": break;
        case "b": break;
        ...
    }
}

For a small number of cases, the compiler will generate code that looks at the internal string intern table. The compiler will first call String.IsInterned on the switch value. String.IsInterned actually returns a string instead of a boolean, returning null for failure and the interned value of the string for success.

Every string constant in a case label is known at compile time and is therefore guaranteed to be interned. At that point, the compiler will compare the interned value of the string to each string constant using reference equality.
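The return convention of String.IsInterned is easy to check directly. In this sketch (InternDemo is a hypothetical name) a string is built at run time so that it is a distinct object from the pooled instance, and a GUID string stands in for a value that was never interned.

```csharp
using System;

static class InternDemo
{
    static void Main()
    {
        // Built at run time, so it is a different object from the pooled "abc".
        string built = new string(new char[] { 'a', 'b', 'c' });
        string pooled = string.Intern("abc");

        // IsInterned hands back the pooled instance rather than a bool.
        string found = string.IsInterned(built);
        Console.WriteLine(found != null);
        Console.WriteLine(object.ReferenceEquals(found, pooled));
        Console.WriteLine(object.ReferenceEquals(found, built));

        // A value that was never interned yields null.
        string unique = Guid.NewGuid().ToString();
        Console.WriteLine(string.IsInterned(unique) == null);
    }
}
```

Because the result is the pooled instance, a reference comparison against a literal is enough to decide equality, which is what makes the switch cheap.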

    
public void Test(string test)
{
  // Typed as object so that == below is a reference comparison
  object interned = String.IsInterned(test);
  if (interned != null)
  {
    if (interned == "a")
    {
        ; // Handle case a
    }
    else if (interned == "b")
    {
        ; // Handle case b
    }
  }
}

Here is the sample IL for this code.

.method public hidebysig instance void  Test(string test) cil managed
{
  // Code size       34 (0x22)
  .maxstack  2
  .locals ([0] string CS$00000002$00000000)
  IL_0000:  ldstr      "a"
  IL_0005:  leave.s    IL_0007
  IL_0007:  ldarg.1
  IL_0008:  dup
  IL_0009:  stloc.0
  IL_000a:  brfalse.s  IL_001f
  IL_000c:  ldloc.0
  IL_000d:  call       string [mscorlib]System.String::IsInterned(string)
  IL_0012:  stloc.0
  IL_0013:  ldloc.0
  IL_0014:  ldstr      "a"
  IL_0019:  beq.s      IL_001d
  IL_001b:  br.s       IL_001f
  IL_001d:  br.s       IL_0021
  IL_001f:  br.s       IL_0021
  IL_0021:  ret
} 

If the number of cases is large (in this example it was 14), the compiler generates a sparse hashtable constructed with a loadfactor of .5 and with twice the capacity. [[ In actuality, a hashtable with a loadfactor of .5 is generated with nearly a 3-1 ratio, since the Hashtable will try to make sure that the ratio of used buckets to available buckets in the table is at most (loadfactor * .72), where the default loadfactor is 1. The magic number .72 represents the optimal ratio to balance speed and memory as determined by Microsoft performance tests. ]]

The hashtable will map each string to a corresponding index as shown in the C# illustration of what the compiler generates below.

    
private static Hashtable CompilerGeneratedHash;

public void Test(string test)
{
    if (CompilerGeneratedHash==null)
    {
        CompilerGeneratedHash=new Hashtable(28, 0.5f);
        CompilerGeneratedHash.Add("a", 0);
        CompilerGeneratedHash.Add("b", 1);
        ...
    }
    object result = CompilerGeneratedHash[test];
    if (result != null)
    {
        switch( (int) result )
        {
            case 0:
            case 1:
            ...
        }
    }
}

The actual IL version is shown below.

.method public hidebysig instance void  Test(string test) cil managed
{
  // Code size       373 (0x175)
  .maxstack  4
  .locals ([0] object CS$00000002$00000000)
  IL_0000:  volatile.
  IL_0002:  ldsfld     class [mscorlib]System.Collections.Hashtable '<PrivateImplementationDetails>'::'$$method0x6000106-1'
  IL_0007:  brtrue     IL_0100
  IL_000c:  ldc.i4.s   28
  IL_000e:  ldc.r4     0.5
  IL_0013:  newobj     instance void [mscorlib]System.Collections.Hashtable::.ctor(int32, float32)
  IL_0018:  dup
  IL_0019:  ldstr      "a"
  IL_001e:  ldc.i4.0
  IL_001f:  box        [mscorlib]System.Int32
  IL_0024:  call       instance void [mscorlib]System.Collections.Hashtable::Add(object, object)
  .....
  IL_00f9:  volatile.
  IL_00fb:  stsfld     class [mscorlib]System.Collections.Hashtable '<PrivateImplementationDetails>'::'$$method0x6000106-1'
  IL_0100:  ldarg.1
  IL_0101:  dup
  IL_0102:  stloc.0
  IL_0103:  brfalse.s  IL_0172
  IL_0105:  volatile.
  IL_0107:  ldsfld     class [mscorlib]System.Collections.Hashtable '<PrivateImplementationDetails>'::'$$method0x6000106-1'
  IL_010c:  ldloc.0
  IL_010d:  call       instance object [mscorlib]System.Collections.Hashtable::get_Item(object)
  IL_0112:  dup
  IL_0113:  stloc.0
  IL_0114:  brfalse.s  IL_0172
  IL_0116:  ldloc.0
  IL_0117:  unbox      [mscorlib]System.Int32
  IL_011c:  ldind.i4
  IL_011d:  switch     ( 
                        IL_0158,
                        IL_015a,
                        IL_015c,
                        IL_015e,
            ....
                        IL_0170)
  IL_0156:  br.s       IL_0172
    ....
  IL_0174:  ret
}

Whidbey

The next version of the C# compiler will be introducing a number of improvements for strings. Strings will have an improved hashing function with better distributional properties, so you don't want to rely on the current behavior.

Additional new functions in the string classes include the following: the static method IsNullOrEmpty, functions to convert to lowercase or uppercase in the invariant culture (ToLowerInvariant, ToUpperInvariant), and additional string normalization functions (Normalize, IsNormalized). Hashtable constructors now directly support several forms of case-insensitive string searches.

These are available in the current pre-release versions of Whidbey, but there will likely be more changes prior to beta.

Conclusion

This concludes my discourse on strings for now. I will continue to update this article with new source code and actual benchmarks in the future. Be sure to watch for updated versions of this page.

As a result of the enthusiasm that this article has generated, I will continue to develop more UNDOCUMENTED articles. My next article will be a discussion of the implementations of arrays and collections. I hope to publish a couple dozen articles when I am done with the series.

My sources include various books Microsoft publishes, the shared source CLI, MSDN, magazine articles, interviews, inside sources, developer conference presentations, and good old disassembly. All of this behind-the-scenes information takes some amount of work to research and obtain, so, if you enjoyed this article, don't forget to vote.

Version History

| Version | Description |
| --- | --- |
| Dec 26, 2003 | Updates and Whidbey |
| Dec 31, 2002 | Added functions for extracting and modifying hidden information about strings, with and without reflection. |
| Nov 16, 2002 | Original article on strings |

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
