Title: Inside C#, Second Edition
Authors: Tom Archer, Andrew Whitechapel
Publisher: Microsoft Press
Published: Apr 2002
ISBN: 0735616485
Price: US 49.99
Pages: 912
String Handling and Regular Expressions, Part I
This two-part series on string handling and regular expressions is based on Chapter 10 of Inside C#, Second Edition. As the chapter was split into two sections, I've done the same here for ease of readability. Note that while each article presents a single primary class (String and Regex, respectively) and a set of ancillary classes, there is a great deal of overlap between the two: in most situations, you have the option to use all string methods, all regular expression operations, or some of each. Also worth noting is that code based on the string functionality tends to be easier to understand and maintain, while code using regular expressions tends to be more flexible and powerful.
Having said that, this article will start by examining the String class, some of its simple methods, and its range of formatting specifiers. We'll then look at the relationship between strings and other .NET Framework classes (including Console, the basic numeric types, and DateTime) and how culture information and character encoding can affect string formatting. We'll also look at the StringBuilder support class and, under the hood, at string interning.
String Handling with C# and .NET
The .NET Framework System.String class (or its alias, string) represents an immutable string of characters: immutable because its value can't be modified once it's been created. Methods that appear to modify a string actually return a new string containing the modification. Besides the string class, the .NET Framework classes offer StringBuilder, String.Format, StringCollection, and so on. Together these offer comparison, appending, inserting, conversion, copying, formatting, indexing, joining, splitting, padding, trimming, removing, replacing, and searching methods.
Consider this example, which uses Replace, Insert, and ToUpper:
using System;

public class TestStringsApp
{
    public static void Main(string[] args)
    {
        string a = "strong";
        // Replace returns a new string; a itself is unchanged.
        string b = a.Replace('o', 'i');
        Console.WriteLine(b);
        string c = b.Insert(3, "engthen");
        string d = c.ToUpper();
        Console.WriteLine(d);
    }
}
The output from this application will be:
string
STRENGTHENING
The String class has a range of comparison methods, including Compare and overloaded operators, as this continuation of the previous example shows:
if (d == c)
{
Console.WriteLine("same");
}
else
{
Console.WriteLine("different");
}
The output from this additional block of code is:
different
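The paragraph above mentions Compare without showing it. As a minimal sketch, the following compares the same two values while ignoring case. The class name CompareSketchApp and the helper SameIgnoringCase are hypothetical names introduced for illustration, and the sketch assumes a runtime that has the StringComparison overloads, which are newer than the framework version this article was written for.

```csharp
using System;

public class CompareSketchApp
{
    // Hypothetical helper: true when the two strings match, ignoring case,
    // using an ordinal (culture-independent) comparison.
    public static bool SameIgnoringCase(string left, string right)
    {
        return String.Compare(left, right,
            StringComparison.OrdinalIgnoreCase) == 0;
    }

    public static void Main()
    {
        string c = "strengthening";
        string d = c.ToUpperInvariant();

        Console.WriteLine(d == c);                  // False
        Console.WriteLine(SameIgnoringCase(d, c));  // True
        // CompareOrdinal returns a negative number when the first
        // string sorts before the second.
        Console.WriteLine(String.CompareOrdinal("a", "b") < 0);  // True
    }
}
```

The == operator compares exact character values, so a case-insensitive test has to go through Compare or an Equals overload.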
Note that the string variable a in the first example isn't changed by the Replace operation. However, you can always reassign a string variable if you choose. For example:
string q = "Foo";
q = q.Replace('o', 'i');
Console.WriteLine(q);
The output is:
Fii
You can combine string objects with conventional char arrays and even index into a string in the conventional manner:
string e = "dog" + "bee";
e += "cat";
string f = e.Substring(1,7);
Console.WriteLine(f);
for (int i = 0; i < f.Length; i++)
{
Console.Write("{0,-3}", f[i]);
}
Here's the output:
ogbeeca
o  g  b  e  e  c  a
If you want a null string, declare one and assign null to it. Subsequently, you can reassign it with another string, as shown in the following example. Because the assignment to g from f.Remove is in a conditional block, the compiler will reject the Console.WriteLine(g) statement unless g has been assigned either null or some valid string value.
string g = null;
if (f.StartsWith("og"))
{
g = f.Remove(2,3);
}
Console.WriteLine(g);
This is the output:
ogca
If you're familiar with the Microsoft Foundation Classes (MFC) CString, the Windows Template Library (WTL) CString, or the Standard Template Library (STL) string class, the String.Format method will come as no surprise. Furthermore, Console.WriteLine uses the same format specifiers as the String class, as shown here:
int x = 16;
decimal y = 3.57m;
string h = String.Format(
"item {0} sells at {1:C}", x, y);
Console.WriteLine(h);
Here's the output (the currency symbol follows the current culture, UK English in this case):
item 16 sells at £3.57
If you have experience with Microsoft Visual Basic, you won't be surprised to find that you can concatenate a string with any other data type using the plus sign (+). This is because all types have at least inherited object.ToString. Here's the syntax:
string t =
"item " + 12 + " sells at " + '\xA3' + 3.45;
Console.WriteLine(t);
And here's the output:
item 12 sells at £3.45
String.Format has a lot in common with Console.WriteLine. Both methods include an overload that takes an open-ended (params) array of objects as the last argument. The following two statements will therefore produce the same output:
Console.WriteLine(
"Hello {0} {1} {2} {3} {4} {5} {6} {7} {8}",
123, 45.67, true, 'Q', 4, 5, 6, 7, '8');
string u = String.Format(
"Hello {0} {1} {2} {3} {4} {5} {6} {7} {8}",
123, 45.67, true, 'Q', 4, 5, 6, 7, '8');
Console.WriteLine(u);
The output follows:
Hello 123 45.67 True Q 4 5 6 7 8
Hello 123 45.67 True Q 4 5 6 7 8
String Formatting
Both String.Format and WriteLine formatting are governed by the same formatting rules: the format parameter is embedded with zero or more format specifications of the form
"{ N [, M ][: formatString ]}", arg1, ... argN,
where:
- N is a zero-based integer indicating the argument to be formatted.
- M is an optional integer indicating the width of the region to contain the formatted value, padded with spaces. If M is negative, the formatted value is left-justified; if M is positive, the value is right-justified.
- formatString is an optional string of formatting codes.
- argN is the expression to use at the equivalent position inside the quotes in the string.
If argN is null, an empty string is used instead. If formatString is omitted, the ToString method of the argument specified by N provides formatting. For example, the following three statements produce the same output:
public class TestConsoleApp
{
public static void Main(string[] args)
{
Console.WriteLine(123);
Console.WriteLine("{0}", 123);
Console.WriteLine("{0:D3}", 123);
}
}
Here's the output:
123
123
123
We'd get exactly the same results using String.Format directly:
string s = string.Format("123");
string t = string.Format("{0}", 123);
string u = string.Format("{0:D3}", 123);
Console.WriteLine(s);
Console.WriteLine(t);
Console.WriteLine(u);
Therefore:
- The comma (,M) determines the field width and justification.
- The colon (:formatString) determines how to format the data, such as currency, scientific notation, or hexadecimal.
For example, these statements use the comma to set the field width:
Console.WriteLine("{0,5} {1,5}", 123, 456);
Console.WriteLine("{0,-5} {1,-5}", 123, 456);
The output is:
  123   456
123   456
Of course, you can combine them, putting the comma first, then the colon:
Console.WriteLine("{0,-10:D6} {1,-10:D6}", 123, 456);
Here's the output:
000123     000456
We could use these formatting features to output data in columns with appropriate alignment. For example:
Console.WriteLine("\n{0,-10}{1,-3}", "Name","Salary");
Console.WriteLine("----------------");
Console.WriteLine("{0,-10}{1,6}", "Bill", 123456);
Console.WriteLine("{0,-10}{1,6}", "Polly", 7890);
This is the output:
Name      Salary
----------------
Bill      123456
Polly       7890
Format Specifiers
Standard numeric format strings are used to return strings in commonly used formats. They take the form X0, in which X is the format specifier and 0 is the precision specifier. The format specifier can be one of the nine built-in format characters that define the most commonly used numeric format types, as shown in Table 10-1.
Table 10-1 - String and WriteLine Format Specifiers
Character | Interpretation
C or c | Currency
D or d | Decimal (decimal integer; don't confuse with the .NET Decimal type)
E or e | Exponent
F or f | Fixed point
G or g | General
N or n | Number (with group separators)
P or p | Percentage
R or r | Round-trip (for floating-point values only); guarantees that a numeric value converted to a string will be parsed back into the same numeric value
X or x | Hex
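Let's see what happens if we have a string format for an integer value using each of the format specifiers in turn. The comments in the following code show the output under a UK English culture on the .NET Framework; the culture-sensitive results (C, N, and P in particular) vary with the current culture and runtime version.
public class FormatSpecApp
{
    public static void Main(string[] args)
    {
        int i = 123456;
        Console.WriteLine("{0:C}", i); // £123,456.00
        Console.WriteLine("{0:D}", i); // 123456
        Console.WriteLine("{0:E}", i); // 1.234560E+005
        Console.WriteLine("{0:F}", i); // 123456.00
        Console.WriteLine("{0:G}", i); // 123456
        Console.WriteLine("{0:N}", i); // 123,456.00
        Console.WriteLine("{0:P}", i); // 12,345,600.00 %
        Console.WriteLine("{0:X}", i); // 1E240
    }
}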
The precision specifier controls the number of significant digits or zeros to the right of a decimal:
Console.WriteLine("{0:C5}", i);
Console.WriteLine("{0:D5}", i);
Console.WriteLine("{0:E5}", i);
Console.WriteLine("{0:F5}", i);
Console.WriteLine("{0:G5}", i);
Console.WriteLine("{0:N5}", i);
Console.WriteLine("{0:P5}", i);
Console.WriteLine("{0:X5}", i);
The R (round-trip) format works only with floating-point values: the value is first tested using the general format, with 15 digits of precision for a Double and seven digits of precision for a Single. If the value is successfully parsed back to the same numeric value, it's formatted using the general format specifier. On the other hand, if the value isn't successfully parsed back to the same numeric value, the value is formatted using 17 digits of precision for a Double and nine digits of precision for a Single. Although a precision specifier can be appended to the round-trip format specifier, it's ignored.
double d = 1.2345678901234567890;
Console.WriteLine("Floating-Point:\t{0:F16}", d);
Console.WriteLine("Roundtrip:\t{0:R16}", d);
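To make the round-trip guarantee concrete, here is a minimal sketch that formats a value, parses the text back, and checks whether the original value survives. The class RoundTripSketchApp and the helper SurvivesRoundTrip are hypothetical names introduced for illustration; the invariant culture is used so the decimal separator doesn't depend on the machine's settings.

```csharp
using System;
using System.Globalization;

public class RoundTripSketchApp
{
    // Hypothetical helper: formats the value with the given format string,
    // parses the text back, and reports whether the same value results.
    public static bool SurvivesRoundTrip(double value, string format)
    {
        string text = value.ToString(format, CultureInfo.InvariantCulture);
        double parsed = Double.Parse(text, CultureInfo.InvariantCulture);
        return parsed == value;
    }

    public static void Main()
    {
        double third = 1.0 / 3.0;
        Console.WriteLine(SurvivesRoundTrip(third, "R"));   // True
        Console.WriteLine(SurvivesRoundTrip(third, "F4"));  // False
    }
}
```

The fixed-point format with four decimals discards digits, so the parsed value differs from the original; the R format keeps enough digits to recover it.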
If the standard formatting specifiers aren't enough for you, you can use picture format strings to create custom string output. Picture format definitions are described using placeholder strings that identify the minimum and maximum number of digits used, the placement or appearance of the negative sign, and the appearance of any other text within the number, as shown in Table 10-2.
Table 10-2 - Custom Format Specifiers
Format Character | Purpose | Description
0 | Display zero placeholder | Results in a nonsignificant zero if a number has fewer digits than there are zeros in the format
# | Display digit placeholder | Replaces the pound symbol (#) with only significant digits
. | Decimal point | Displays a period (.)
, | Group separator | Separates number groups, as in 1,000
% | Percent notation | Displays a percent sign (%)
E+0 E-0 e+0 e-0 | Exponent notation | Formats the output of exponent notation
\ | Literal character | Used with traditional formatting sequences such as "\n" (newline)
'ABC' "ABC" | Literal string | Displays any string within quotes or apostrophes literally
; | Section separator | Specifies different output if the numeric value to be formatted is positive, negative, or zero
Let's see the strings that result from a set of customized formats, using first a positive integer, then using the negative value of that same integer, and finally using zero:
int i = 123456;
Console.WriteLine();
Console.WriteLine("{0:#0}", i);
Console.WriteLine("{0:#0;(#0)}", i);
Console.WriteLine("{0:#0;(#0);<zero>}", i);
Console.WriteLine("{0:#%}", i);
i = -123456;
Console.WriteLine();
Console.WriteLine("{0:#0}", i);
Console.WriteLine("{0:#0;(#0)}", i);
Console.WriteLine("{0:#0;(#0);<zero>}", i);
Console.WriteLine("{0:#%}", i);
i = 0;
Console.WriteLine();
Console.WriteLine("{0:#0}", i);
Console.WriteLine("{0:#0;(#0)}", i);
Console.WriteLine("{0:#0;(#0);<zero>}", i);
Console.WriteLine("{0:#%}", i);
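As a minimal sketch of the section separator at work, the following wraps the three-section format from the code above in a helper and applies it to a positive value, a negative value, and zero. The class SectionFormatSketchApp and the helper Describe are hypothetical names introduced for illustration; the invariant culture is passed so the result doesn't depend on local settings.

```csharp
using System;
using System.Globalization;

public class SectionFormatSketchApp
{
    // Hypothetical helper: applies a three-section picture format
    // (positive;negative;zero) to an integer.
    public static string Describe(int value)
    {
        return String.Format(CultureInfo.InvariantCulture,
            "{0:#0;(#0);<zero>}", value);
    }

    public static void Main()
    {
        Console.WriteLine(Describe(123456));   // 123456
        Console.WriteLine(Describe(-123456));  // (123456)
        Console.WriteLine(Describe(0));        // <zero>
    }
}
```

The second section formats the absolute value, which is why the negative result appears in parentheses without a minus sign.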
Objects and ToString
Recall that all data types, both predefined and user-defined, inherit from the System.Object class in the .NET Framework, which is aliased as object:
public class Thing
{
public int i = 2;
public int j = 3;
}
public class objectTypeApp
{
public static void Main()
{
object a;
a = 1;
Console.WriteLine(a);
Console.WriteLine(a.ToString());
Console.WriteLine(a.GetType());
Console.WriteLine();
Thing b = new Thing();
Console.WriteLine(b);
Console.WriteLine(b.ToString());
Console.WriteLine(b.GetType());
}
}
Here's the output:
1
1
System.Int32

objectType.Thing
objectType.Thing
objectType.Thing
From the foregoing code, you can see that the statement
Console.WriteLine(a);
is the same as
Console.WriteLine(a.ToString());
The reason for this equivalence is that the ToString method has been overridden in the Int32 type to produce a string representation of the numeric value. By default, however, ToString will return the name of the object's type (the same as GetType): a name composed of the enclosing namespace or namespaces and the class name. This equivalence is clear when we call ToString on our Thing reference. We can, and should, override the inherited ToString for any nontrivial user-defined type:
public class Thing
{
public int i = 2;
public int j = 3;
override public string ToString()
{
return String.Format("i = {0}, j = {1}", i, j);
}
}
The relevant output from this revised code is:
i = 2, j = 3
i = 2, j = 3
objectType.Thing
Numeric String Parsing
All the basic types have a ToString method, which is inherited from the Object type, and all the numeric types have a Parse method, which takes the string representation of a number and returns its equivalent numeric value. For example:
public class NumParsingApp
{
public static void Main(string[] args)
{
int i = int.Parse("12345");
Console.WriteLine("i = {0}", i);
int j = Int32.Parse("12345");
Console.WriteLine("j = {0}", j);
double d = Double.Parse("1.2345E+6");
Console.WriteLine("d = {0:F}", d);
string s = i.ToString();
Console.WriteLine("s = {0}", s);
}
}
The output from this application is shown here:
i = 12345
j = 12345
d = 1234500.00
s = 12345
Certain non-digit characters in an input string are allowed by default, including leading and trailing spaces, commas and decimal points, and plus and minus signs. Therefore, the following Parse statements are equivalent:
string t = " -1,234,567.890 ";
// double g = double.Parse(t);  // Equivalent: these styles are allowed by default
double g = double.Parse(t,
    NumberStyles.AllowLeadingSign |
    NumberStyles.AllowDecimalPoint |
    NumberStyles.AllowThousands |
    NumberStyles.AllowLeadingWhite |
    NumberStyles.AllowTrailingWhite);
Console.WriteLine("g = {0:F}", g);
The output from this additional code block is shown next:
g = -1234567.89
Note that to use NumberStyles you must add a using statement for System.Globalization. Then you can either use a combination of the various NumberStyles enum values or use NumberStyles.Any for all of them. If you also want to accommodate a currency symbol, you need the third Parse overload, which takes a NumberFormatInfo object as a parameter. You then set the CurrencySymbol property of the NumberFormatInfo object to the expected symbol before passing it as the third parameter to Parse, which modifies the Parse behavior:
string u = "£ -1,234,567.890 ";
NumberFormatInfo ni = new NumberFormatInfo();
ni.CurrencySymbol = "£";
double h = Double.Parse(u, NumberStyles.Any, ni);
Console.WriteLine("h = {0:F}", h);
The output from this additional code block is shown here:
h = -1234567.89
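Parse throws an exception when the input isn't a valid number. As a hedged sketch of a more forgiving approach, the following uses Double.TryParse with the same NumberStyles values and falls back to a default on failure. The class SafeParseSketchApp and the helper ParseOrDefault are hypothetical names introduced for illustration, and the invariant culture is an assumption made to keep the separators fixed.

```csharp
using System;
using System.Globalization;

public class SafeParseSketchApp
{
    // Hypothetical helper: returns the parsed value, or the fallback
    // when the text isn't a valid number under the invariant culture.
    public static double ParseOrDefault(string text, double fallback)
    {
        double value;
        bool ok = Double.TryParse(text, NumberStyles.Any,
            CultureInfo.InvariantCulture, out value);
        return ok ? value : fallback;
    }

    public static void Main()
    {
        double g = ParseOrDefault("  -1,234,567.890  ", 0.0);
        Console.WriteLine(g.ToString("F", CultureInfo.InvariantCulture));  // -1234567.89

        double bad = ParseOrDefault("12 apples", -1.0);
        Console.WriteLine(bad.ToString("F", CultureInfo.InvariantCulture));  // -1.00
    }
}
```

Whether a silent fallback is appropriate depends on the application; for user input it is often better to report the failure than to substitute a value.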
In addition to NumberFormatInfo, we can use the CultureInfo class. CultureInfo represents information about a specific culture, including the names of the culture, the writing system, and the calendar used, as well as access to culture-specific objects that provide methods for common operations, such as formatting dates and sorting strings. The culture names follow the RFC 1766 standard in the format <languagecode2>-<country/regioncode2>, in which <languagecode2> is a lowercase two-letter code derived from ISO 639-1 and <country/regioncode2> is an uppercase two-letter code derived from ISO 3166. For example, U.S. English is "en-US", UK English is "en-GB", and Trinidad and Tobago English is "en-TT". We could, for instance, create a CultureInfo object for English in the United States and convert an integer value to a string based on this CultureInfo:
int k = 12345;
CultureInfo us = new CultureInfo("en-US");
string v = k.ToString("c", us);
Console.WriteLine(v);
This example would produce a string like this:
$12,345.00
Note that we're using a ToString overload that takes a format string as its first parameter and an IFormatProvider interface implementation (in this case, a CultureInfo reference) as its second parameter. Here's another example, this time for Danish in Denmark:
CultureInfo dk = new CultureInfo("da-DK");
string w = k.ToString("c", dk);
Console.WriteLine(w);
The output is:
kr 12.345,00
Strings and DateTime
A DateTime object has a property named Ticks that stores the date and time as the number of 100-nanosecond intervals since 12:00 AM January 1, 1 A.D. in the Gregorian calendar. For example, a ticks value of 31241376000000000L has the string representation "Friday, January 01, 0100 12:00:00 AM". Each additional tick increases the time interval by 100 nanoseconds.
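The tick value quoted above can be checked directly. As a minimal sketch, the following builds a DateTime from that tick count and prints its parts; the class TicksSketchApp and the helper FromTicks are hypothetical names introduced for illustration.

```csharp
using System;

public class TicksSketchApp
{
    // Hypothetical helper: wraps the DateTime constructor that takes ticks.
    public static DateTime FromTicks(long ticks)
    {
        return new DateTime(ticks);
    }

    public static void Main()
    {
        DateTime dt = FromTicks(31241376000000000L);
        Console.WriteLine("{0}-{1}-{2} {3}",
            dt.Year, dt.Month, dt.Day, dt.DayOfWeek);  // 100-1-1 Friday

        // One tick is 100 nanoseconds, so a second holds ten million ticks.
        Console.WriteLine(TimeSpan.TicksPerSecond);    // 10000000
        Console.WriteLine(dt.AddTicks(1).Ticks - dt.Ticks);  // 1
    }
}
```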
DateTime values are formatted using standard or custom patterns stored in the properties of a DateTimeFormatInfo instance. To modify how a value is displayed, the DateTimeFormatInfo instance must be writeable so that custom patterns can be saved in its properties.
using System.Globalization;
public class DatesApp
{
public static void Main(string[] args)
{
DateTime dt = DateTime.Now;
Console.WriteLine(dt);
Console.WriteLine("date = {0}, time = {1}\n",
dt.Date, dt.TimeOfDay);
}
}
This code will produce the following output:
23/06/2001 17:55:10
date = 23/06/2001 00:00:00, time = 17:55:10.3839296
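The requirement that the DateTimeFormatInfo instance be writeable can be sketched as follows: the invariant instance is read-only, so the sketch clones it and changes the short date pattern on the copy. The class CustomPatternSketchApp and the helper FormatShortDate are hypothetical names introduced for illustration.

```csharp
using System;
using System.Globalization;

public class CustomPatternSketchApp
{
    // Hypothetical helper: formats a date with the "d" specifier after
    // replacing the short date pattern on a writable copy of InvariantInfo.
    public static string FormatShortDate(DateTime value, string pattern)
    {
        DateTimeFormatInfo dtfi =
            (DateTimeFormatInfo)DateTimeFormatInfo.InvariantInfo.Clone();
        dtfi.ShortDatePattern = pattern;
        return value.ToString("d", dtfi);
    }

    public static void Main()
    {
        DateTime dt = new DateTime(2002, 1, 3);
        // The shared invariant instance can't be modified directly.
        Console.WriteLine(DateTimeFormatInfo.InvariantInfo.IsReadOnly);  // True
        Console.WriteLine(FormatShortDate(dt, "yyyy-MM-dd"));            // 2002-01-03
    }
}
```

Cloning leaves the shared instance untouched, so other code that relies on the invariant patterns is unaffected.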
Table 10-3 lists the standard format characters for each standard pattern and the associated DateTimeFormatInfo property that can be set to modify the standard pattern.
Table 10-3 - DateTime Formatting
Format Character | Format Pattern | Associated Property/Description
d | MM/dd/yyyy | ShortDatePattern
D | dddd, MMMM dd, yyyy | LongDatePattern
f | dddd, MMMM dd, yyyy HH:mm | Full date and time (long date and short time)
F | dddd, MMMM dd, yyyy HH:mm:ss | FullDateTimePattern (long date and long time)
g | MM/dd/yyyy HH:mm | General (short date and short time)
G | MM/dd/yyyy HH:mm:ss | General (short date and long time)
m, M | MMMM dd | MonthDayPattern
r, R | ddd, dd MMM yyyy HH':'mm':'ss 'GMT' | RFC1123Pattern
s | yyyy-MM-dd'T'HH:mm:ss | SortableDateTimePattern (conforms to ISO 8601) using local time
t | HH:mm | ShortTimePattern
T | HH:mm:ss | LongTimePattern
u | yyyy-MM-dd HH:mm:ss'Z' | UniversalSortableDateTimePattern (conforms to ISO 8601) using universal time
U | dddd, MMMM dd, yyyy HH:mm:ss | Full date and time (long date and long time) using universal time
y, Y | MMMM, yyyy | YearMonthPattern
The DateTimeFormatInfo.InvariantInfo property gets the default read-only DateTimeFormatInfo instance that's culture independent (invariant). You can also create custom patterns. Note that the InvariantInfo isn't necessarily the same as the current locale info: Invariant equates to U.S. standard. Also, if you pass null as the second parameter to DateTime.ToString, the DateTimeFormatInfo will default to CurrentInfo, as in:
// Here dtfi refers to DateTimeFormatInfo.InvariantInfo.
Console.WriteLine(dt.ToString("d", dtfi));
// Passing null falls back to DateTimeFormatInfo.CurrentInfo.
Console.WriteLine(dt.ToString("d", null));
Console.WriteLine();
Here's the output:
06/23/2001
23/06/2001
Compare the results of choosing InvariantInfo with those of choosing CurrentInfo:
DateTimeFormatInfo dtfi;
Console.Write("[I]nvariant or [C]urrent Info?: ");
if (Console.Read() == 'I')
    dtfi = DateTimeFormatInfo.InvariantInfo;
else
    dtfi = DateTimeFormatInfo.CurrentInfo;
Console.WriteLine(dt.ToString("D", dtfi));
Console.WriteLine(dt.ToString("f", dtfi));
Console.WriteLine(dt.ToString("F", dtfi));
Console.WriteLine(dt.ToString("g", dtfi));
Console.WriteLine(dt.ToString("G", dtfi));
Console.WriteLine(dt.ToString("m", dtfi));
Console.WriteLine(dt.ToString("r", dtfi));
Console.WriteLine(dt.ToString("s", dtfi));
Console.WriteLine(dt.ToString("t", dtfi));
Console.WriteLine(dt.ToString("T", dtfi));
Console.WriteLine(dt.ToString("u", dtfi));
Console.WriteLine(dt.ToString("U", dtfi));
Console.WriteLine(dt.ToString("d", dtfi));
Console.WriteLine(dt.ToString("y", dtfi));
Console.WriteLine(dt.ToString("dd-MMM-yy", dtfi));
Here's the output:
[I]nvariant or [C]urrent Info?: I
01/03/2002
03/01/2002
Thursday, 03 January 2002
Thursday, 03 January 2002 12:55
Thursday, 03 January 2002 12:55:03
01/03/2002 12:55
01/03/2002 12:55:03
January 03
Thu, 03 Jan 2002 12:55:03 GMT
2002-01-03T12:55:03
12:55
12:55:03
2002-01-03 12:55:03Z
Thursday, 03 January 2002 12:55:03
01/03/2002
2002 January
03-Jan-02
[I]nvariant or [C]urrent Info?: C
03/01/2002
03/01/2002
03 January 2002
03 January 2002 12:55
03 January 2002 12:55:47
03/01/2002 12:55
03/01/2002 12:55:47
03 January
Thu, 03 Jan 2002 12:55:47 GMT
2002-01-03T12:55:47
12:55
12:55:47
2002-01-03 12:55:47Z
03 January 2002 12:55:47
03/01/2002
January 2002
03-Jan-02
Encoding Strings
The System.Text namespace offers an Encoding class. Encoding is an abstract class, so you can't instantiate it directly. However, it does provide a range of methods and properties for converting arrays and strings of Unicode characters to and from arrays of bytes encoded for a target code page. These properties actually resolve to returning an implementation of the Encoding class. Table 10-4 shows some of these properties.
Table 10-4 - String Encoding Classes
Property | Encoding
ASCII | Encodes Unicode characters as single, 7-bit ASCII characters. This encoding supports only character values between U+0000 and U+007F.
BigEndianUnicode | Encodes each Unicode character as two consecutive bytes, using big endian (code page 1201) byte ordering.
Unicode | Encodes each Unicode character as two consecutive bytes, using little endian (code page 1200) byte ordering.
UTF7 | Encodes Unicode characters using the UTF-7 encoding. (UTF-7 stands for UCS Transformation Format, 7-bit form.) This encoding supports all Unicode character values and can be accessed as code page 65000.
UTF8 | Encodes Unicode characters using the UTF-8 encoding. (UTF-8 stands for UCS Transformation Format, 8-bit form.) This encoding supports all Unicode character values and can be accessed as code page 65001.
For example, you can convert a simple sequence of bytes into a conventional ASCII string, as shown here:
class StringEncodingApp
{
static void Main(string[] args)
{
byte[] ba = new byte[]
{72, 101, 108, 108, 111};
string s = Encoding.ASCII.GetString(ba);
Console.WriteLine(s);
}
}
This is the output:
Hello
If you want to convert to something other than ASCII, simply use one of the other Encoding properties. The following example has the same output as the previous example:
byte[] bb = new byte[]
{0,72, 0,101, 0,108, 0,108, 0,111};
string t = Encoding.BigEndianUnicode.GetString(bb);
Console.WriteLine(t);
The System.Text namespace also includes several classes derived from (and therefore implementing) the abstract Encoding class. These classes offer similar behavior to the properties in the Encoding class itself:
- ASCIIEncoding
- UnicodeEncoding
- UTF7Encoding
- UTF8Encoding
You could achieve the same results as those from the previous example with the following code:
ASCIIEncoding ae = new ASCIIEncoding();
Console.WriteLine(ae.GetString(ba));
UnicodeEncoding bu =
new UnicodeEncoding(true, false);
Console.WriteLine(bu.GetString(bb));
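The examples above convert bytes to strings. As a minimal sketch of the opposite direction, the following encodes one string with three of the Encoding properties and compares the byte counts. The class EncodingBytesSketchApp and the helper ByteCount are hypothetical names introduced for illustration; the test string contains the pound sign (U+00A3), which lies outside the ASCII range.

```csharp
using System;
using System.Text;

public class EncodingBytesSketchApp
{
    // Hypothetical helper: number of bytes the encoding produces for the text.
    public static int ByteCount(Encoding encoding, string text)
    {
        return encoding.GetBytes(text).Length;
    }

    public static void Main()
    {
        string s = "Hi\u00A3";

        Console.WriteLine(ByteCount(Encoding.ASCII, s));    // 3
        Console.WriteLine(ByteCount(Encoding.UTF8, s));     // 4
        Console.WriteLine(ByteCount(Encoding.Unicode, s));  // 6

        // ASCII can't represent the pound sign, so it is replaced by '?'.
        Console.WriteLine(Encoding.ASCII.GetBytes(s)[2]);   // 63

        // UTF-8 preserves every character through a round trip.
        string back = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s));
        Console.WriteLine(back == s);                       // True
    }
}
```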
The StringBuilder Class
Recall that with the String class, methods that appear to modify a string actually return a new string containing the modification. This behavior is sometimes a nuisance because if you make several modifications to a string, you end up working with several generations of copies of the original. For this reason, the people at Redmond have provided the StringBuilder class in the System.Text namespace.
Consider this example, using the StringBuilder methods Replace, Insert, Append, AppendFormat, and Remove:
class UseSBApp
{
static void Main(string[] args)
{
StringBuilder sb = new StringBuilder("Pineapple");
sb.Replace('e', 'X');
sb.Insert(4, "Banana");
sb.Append("Kiwi");
sb.AppendFormat(", {0}:{1}", 123, 45.6789);
sb.Remove(sb.Length - 3, 3);
Console.WriteLine(sb);
}
}
This is the output:
PinXBananaapplXKiwi, 123:45.6
Note that, as with most other types, you can easily convert from a StringBuilder to a String:
string s = sb.ToString().ToUpper();
Console.WriteLine(s);
Here's the output:
PINXBANANAAPPLXKIWI, 123:45.6
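A common place where the copies made by String become costly is a loop that appends repeatedly. As a minimal sketch, the following builds a comma-separated list in a single StringBuilder and converts it to a string once at the end. The class BuildListSketchApp and the helper JoinNumbers are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text;

public class BuildListSketchApp
{
    // Hypothetical helper: returns "1, 2, ..., count" using one StringBuilder
    // rather than creating a new string on every iteration.
    public static string JoinNumbers(int count)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= count; i++)
        {
            if (sb.Length > 0)
            {
                sb.Append(", ");
            }
            sb.Append(i);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(JoinNumbers(5));  // 1, 2, 3, 4, 5
    }
}
```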
Splitting Strings
The String class offers a Split method for splitting a string into substrings, with the splits determined by arbitrary separator characters that you supply to the method. For example:
class SplitStringApp
{
static void Main(string[] args)
{
string s = "Once Upon A Time In America";
char[] seps = new char[]{' '};
foreach (string ss in s.Split(seps))
Console.WriteLine(ss);
}
}
The output follows:
Once
Upon
A
Time
In
America
The separators parameter to String.Split is an array of char; therefore, we can split a string based on multiple delimiters. However, we have to be careful about special characters such as the backslash (\) and single quote ('). The following code produces the same output as the previous example did:
string t = "Once,Upon:A/Time\\In\'America";
char[] sep2 = new char[]{ ' ', ',', ':', '/', '\\', '\''};
foreach (string ss in t.Split(sep2))
Console.WriteLine(ss);
Note that the Split method is quite simple and not too useful if we want to split substrings that are separated by multiple instances of some character. For example, if we have more than one space between any of the words in our string, we'll get these results:
string u = "Once   Upon A   Time In   America";
char[] sep3 = new char[]{' '};
foreach (string ss in u.Split(sep3))
    Console.WriteLine(ss);
Here's the output, in which each extra space produces an empty string:
Once


Upon
A


Time
In


America
In the second article of this two-part series, we'll consider the regular expression classes in the .NET Framework, and we'll see how to solve this particular problem and many others.
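Regular expressions aside, later versions of the framework added a Split overload that takes a StringSplitOptions value, which handles the repeated-separator case directly. This is a hedged sketch assuming a runtime that has that overload; it isn't part of the framework version this article was written for. The class SplitWordsSketchApp and the helper SplitWords are hypothetical names introduced for illustration.

```csharp
using System;

public class SplitWordsSketchApp
{
    // Hypothetical helper: splits on spaces and drops the empty strings
    // that repeated separators would otherwise produce.
    public static string[] SplitWords(string text)
    {
        return text.Split(new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string u = "Once   Upon A   Time In   America";
        foreach (string word in SplitWords(u))
        {
            Console.WriteLine(word);
        }
        Console.WriteLine(SplitWords(u).Length);  // 6
    }
}
```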
Extending Strings
In libraries before the .NET era, it became common practice to extend the String class found in the library with enhanced features. Unfortunately, the String class in the .NET Framework is sealed; therefore, you can't derive from it. On the other hand, it's entirely possible to provide a series of encapsulated static methods that process strings. For example, the String class does offer the ToUpper and ToLower methods for converting to uppercase or lowercase, respectively, but this class doesn't offer a method to convert to proper case (initial capitals on each word). Providing such functionality is simple, as shown here:
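public class StringEx
{
    public static string ProperCase(string s)
    {
        s = s.ToLower();
        string sProper = "";
        char[] seps = new char[]{' '};
        foreach (string ss in s.Split(seps))
        {
            // Skip the empty strings that Split returns for repeated spaces.
            if (ss.Length == 0)
                continue;
            sProper += char.ToUpper(ss[0]);
            sProper +=
                (ss.Substring(1, ss.Length - 1) + ' ');
        }
        // Drop the space appended after the final word.
        return sProper.TrimEnd();
    }
}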
class StringExApp
{
static void Main(string[] args)
{
string s = "the qUEEn wAs in HER parLOr";
Console.WriteLine("Initial String:\t{0}", s);
string t = StringEx.ProperCase(s);
Console.WriteLine("ProperCase:\t{0}", t);
}
}
This will produce the output shown here. (In the second part of this two-part series, we'll see how to achieve the same results with regular expressions.)
Initial String: the qUEEn wAs in HER parLOr
ProperCase: The Queen Was In Her Parlor
Another classic operation that doubtless will appear again is a test for a palindromic string, that is, a string that reads the same backwards and forwards. The following method, added to the StringEx class, performs the test:
public static bool IsPalindrome(string s)
{
    // Compare characters from both ends, moving toward the middle.
    // An empty string is treated as a palindrome.
    for (int i = 0; i < s.Length / 2; i++)
    {
        if (s[i] != s[s.Length - 1 - i])
        {
            return false;
        }
    }
    return true;
}
static void Main(string[] args)
{
Console.WriteLine("\nPalindromes?");
string[] sa = new string[]{
"level", "minim", "radar",
"foobar", "rotor", "banana"};
foreach (string v in sa)
Console.WriteLine("{0}\t{1}",
v, StringEx.IsPalindrome(v));
}
Here's the output:
Palindromes?
level True
minim True
radar True
foobar False
rotor True
banana False
For more complex operations (such as conditional splitting or joining, extended parsing or tokenizing, and sophisticated trimming) in which the String class doesn't offer the power you want, you can turn to the Regex class. That's what we'll look at next in the follow-up article to this one.
String Interning
One of the reasons strings were designed to be immutable is that this arrangement allows the system to intern them. During the process of string interning, all the constant strings in an application are stored in a common place in memory, thus eliminating unnecessary duplicates. This practice clearly saves space at run time but can confuse the unwary. For example, recall that the equivalence operator (==) will test for value equivalence for value types and for address (or reference) equivalence for reference types. Therefore, in the following application, when we compare two reference type objects of the same class with the same contents, the result is False. However, when we compare two string objects with the same contents, the result is True:
class StringInterningApp
{
public class Thing
{
private int i;
public Thing(int i) { this.i = i; }
}
static void Main(string[] args)
{
Thing t1 = new Thing(123);
Thing t2 = new Thing(123);
Console.WriteLine(t1 == t2);
string a = "Hello";
string b = "Hello";
Console.WriteLine(a == b);
}
}
OK, but both strings are actually constants or literals. Suppose we have another string that's a variable? Again, given the same contents, the string equivalence operator will return True:
string c = String.Copy(a);
Console.WriteLine(a == c);
Now suppose we force the run-time system to treat the two strings as objects, not strings, and therefore use the most basic reference type equivalence operator. This time we get False:
Console.WriteLine((object)a == (object)c);
Time to look at the underlying Microsoft intermediate language (MSIL), as shown in Figure 10-1.
Figure 10-1 - MSIL for string equivalence and object equivalence.
The crucial differences are as follows: For the first comparison (t1==t2), having loaded the two Thing object references onto the evaluation stack, the MSIL uses opcode ceq (compare equal), thus clearly comparing the references, or address values. However, when we load the two strings onto the stack for comparison with ldstr, the MSIL for the second comparison (a==b) is a call operation. We don't just compare the values on the stack; instead, we call the String class equivalence operator method, op_Equality. The same process happens for the third comparison (a==c). For the fourth comparison, (object)a==(object)c, we're back again to ceq. In other words, we compare the values on the stack, in this case the addresses of the two strings.
Note that Chapter 13 of Inside C# illustrates exactly how the String class can have its own equivalence operator method via operator overloading. For now, it's enough to know that the system will compare strings differently than other reference types.
What happens if we compare the two original string constants and force the use of the most primitive equivalence operator? Take a look:
Console.WriteLine((object)a == (object)b);
You'll find that the output from this is True. Proof, finally, that the system is interning strings: the MSIL opcode used is again ceq, but this time it results in equality because the two strings were assigned a constant literal value that was stored only once. In fact, the Common Language Infrastructure guarantees that the result of two ldstr instructions referring to two metadata tokens with the same sequence of characters returns precisely the same string object.
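Strings created at run time aren't interned automatically, but the String.Intern method returns the shared instance for a given sequence of characters. As a minimal sketch, the following builds a separate copy of a string, shows that it is a different object, and then shows that interning both values yields one object. The class InternSketchApp and the helper SameObject are hypothetical names introduced for illustration.

```csharp
using System;

public class InternSketchApp
{
    // Hypothetical helper: true only when both references point
    // to the very same object.
    public static bool SameObject(string left, string right)
    {
        return Object.ReferenceEquals(left, right);
    }

    public static void Main()
    {
        string a = "Hello";
        // Build a distinct string object with the same characters.
        string c = new string(a.ToCharArray());

        Console.WriteLine(a == c);            // True  (same contents)
        Console.WriteLine(SameObject(a, c));  // False (different objects)

        // Intern returns the single pooled instance for these characters.
        Console.WriteLine(
            SameObject(String.Intern(a), String.Intern(c)));  // True
    }
}
```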
Summary
In this article, we examined the String class and a range of ancillary classes that modify and support string operations. We explored the use of the String class methods for searching, sorting, splitting, joining, and otherwise returning modified strings. We also saw how many other classes in the .NET Framework support string processing (including Console, the basic numeric types, and DateTime) and how culture information and character encoding can affect string formatting. Finally, we saw how the system performs sneaky string interning to improve runtime efficiency. In the next article, you'll discover the Regex class and its supporting classes (Match, Group, and Capture) for encapsulating regular expressions.