Full Lectures Set
- C# Lectures - Lecture 1: Primitive Types
- C# Lectures - Lecture 2: Work with text in C#: char, string, StringBuilder, SecureString
- C# Lectures - Lecture 3: Designing Types in C#. Basics You Need to Know About Classes
- C# Lectures - Lecture 4: OOP basics: Abstraction, Encapsulation, Inheritance, Polymorphism by C# example
- C# Lectures - Lecture 5: Events, Delegates, Delegates Chain by C# example
- C# Lectures - Lecture 6: Attributes, Custom attributes in C#
- C# Lectures - Lecture 7: Reflection by C# example
- C# Lectures - Lecture 8: Disaster recovery. Exceptions and error handling by C# example
- C# Lectures - Lecture 9: Lambda expressions
- C# Lectures - Lecture 10: LINQ introduction, LINQ to Objects Part 1
- C# Lectures - Lecture 11: LINQ to Objects Part 2. Nondeferred Operators
Introduction
In this article I want to focus on text handling in C#. I will describe the most common ways text data is represented in C# and look at the types used to work with it.
char
The simplest way to store text, or rather a single symbol, is the char type. Char is a primitive type (primitive types are covered in Lecture 1 of this series). A char variable holds exactly one symbol. Under the hood, a char is a 16-bit numeric value (a UTF-16 code unit) that the compiler lets you write as a character literal.
char cVar = 's';
Console.WriteLine("Code of character s = " + (int)cVar);
char cVar1 = (char)115;
char cVar2 = '\x0073';
char cVar3 = '\u0073';
Console.WriteLine("cVar=" + cVar + " cVar1=" + cVar1 + " cVar2=" + cVar2 + " cVar3=" + cVar3 +"\n");
All char symbols are divided into categories, such as lowercase letters, uppercase letters, currency symbols, punctuation, math symbols and several others. To get the category of a symbol, use the static method char.GetUnicodeCategory. Below is an example of its usage, with the result of each call shown in a comment:
Console.WriteLine(char.GetUnicodeCategory('c')); // LowercaseLetter
Console.WriteLine(char.GetUnicodeCategory('C')); // UppercaseLetter
Console.WriteLine(char.GetUnicodeCategory('$')); // CurrencySymbol
Console.WriteLine(char.GetUnicodeCategory(',')); // OtherPunctuation
Console.WriteLine(char.GetUnicodeCategory('+') + "\n"); // MathSymbol
Each Unicode category has its own two-letter code; the list of codes is available here. At unicode.org you can also find the list of all characters as a txt file, in which the third column is the category ID. Char has a set of static methods that tell you whether a character belongs to a particular category. They are based on the same category information as GetUnicodeCategory and return true or false. These include IsDigit, IsLetter, IsWhiteSpace, IsUpper, IsLower, IsPunctuation, IsLetterOrDigit, IsControl, IsNumber, etc. Each of them accepts either a single character or a string together with the index of a character in that string.
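To make the two call shapes concrete, here is a minimal sketch. The class name CharClassificationDemo and the helper CountDigits are invented for this illustration; only the char.Is* methods come from the framework.

```csharp
using System;

static class CharClassificationDemo
{
    // Hypothetical helper: counts decimal digits using the (string, index) overload.
    public static int CountDigits(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsDigit(text, i))
            {
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        // Single-character overloads.
        Console.WriteLine(char.IsLetter('x'));       // True
        Console.WriteLine(char.IsWhiteSpace(' '));   // True
        Console.WriteLine(char.IsPunctuation('+'));  // False: '+' is a math symbol

        // String-plus-index overload, wrapped in the helper above.
        Console.WriteLine(CountDigits("Room 42B"));  // 2
    }
}
```

The string-plus-index form saves you from extracting the character first when you scan a string in a loop.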
Besides querying the category, char offers convenient static functions for converting characters. ToUpperInvariant and ToLowerInvariant change the case of a symbol without taking regional settings (culture) into account. To change the case using the culture assigned to the calling thread, use ToUpper and ToLower. If the input symbol is already in the requested case, the output is the same as the input. To apply a specific culture, pass a CultureInfo object to ToUpper or ToLower as the second argument. The code below demonstrates these functions:
Console.WriteLine(char.ToUpperInvariant('s')); // S
Console.WriteLine(char.ToLowerInvariant('P')); // p
Console.WriteLine(char.ToLowerInvariant('p')); // p
Console.WriteLine(char.ToLowerInvariant('Ф')); // ф
Console.WriteLine(char.ToLower('Ф'));          // ф
There are a few more static functions of type char that you should know and use where appropriate. GetNumericValue converts the input character to a double-precision floating-point number and returns -1 if the character is not numeric. You can use it to find out whether a character from any script represents a number.
Equals checks whether two input characters are equal.
The following example shows their usage:
Console.WriteLine(char.GetNumericValue('s')); // -1
Console.WriteLine(char.GetNumericValue('4')); // 4
Console.WriteLine(char.Equals('s','s'));         // True
Console.WriteLine(char.Equals('s', 'p'));        // False
Console.WriteLine(char.Equals('s', 'S') + "\n"); // False
You can use char arrays to manipulate text. This is a valid way to work with complex text and it gives the developer a lot of flexibility, but it is not very convenient because you have to control everything yourself and write your own text-handling algorithms. Usually there is no need for that: C# wraps the work with char arrays in a type called string, which makes you more efficient. Still, you should know that char arrays are a widely used way to represent text data. A number of APIs return text as char arrays, so you should not be afraid of them and should know how to deal with them.
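As a small illustration of working with a char array directly, the sketch below reverses a piece of text. The class CharArrayDemo and its Reverse method are made up for this example, and the approach simply reverses 16-bit code units, so it is not suitable for text that contains surrogate pairs.

```csharp
using System;

static class CharArrayDemo
{
    // Hypothetical helper: reverses text by editing a char array copy in place.
    // It reverses UTF-16 code units, so surrogate pairs would be broken.
    public static string Reverse(string text)
    {
        char[] chars = text.ToCharArray(); // copy of the string's characters
        Array.Reverse(chars);              // in-place manipulation of the array
        return new string(chars);          // wrap the result back into a string
    }

    static void Main()
    {
        Console.WriteLine(Reverse("notepad")); // dapeton
    }
}
```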
string
String is the best known and most commonly used type for text data in C#. String is a primitive type that derives directly from Object rather than from ValueType. This means a string is a reference type and always lives on the heap, never on the stack. You should also know that a string is an ordered, immutable sequence of symbols. Because strings are immutable, you do not need to synchronize threads when they only read a shared string. In addition, the CLR can store equal string literals as a single object so that all references point to the same place (see the Interning section below). This makes work with strings optimized and fast in .NET. To preserve these guarantees Microsoft made the string class sealed, so you cannot derive from it.
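The immutability described above can be observed with a short sketch. The names ImmutabilityDemo and Shout are invented for illustration; the point is only that an operation on a string produces a new object and leaves the original untouched.

```csharp
using System;

static class ImmutabilityDemo
{
    // Hypothetical helper: returns a NEW string; the argument is never modified.
    public static string Shout(string text)
    {
        return text.ToUpperInvariant() + "!";
    }

    static void Main()
    {
        string original = "hello";
        string alias = original;      // second reference to the same object

        original = Shout(original);   // 'original' now points to a new string

        Console.WriteLine(alias);     // hello
        Console.WriteLine(original);  // HELLO!
        Console.WriteLine(object.ReferenceEquals(alias, original)); // False
    }
}
```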
Declaration
Because string is a primitive type in C#, there is a shorthand for declaring string instances: string s = "String text";
You can also instantiate a string with one of its overloaded constructors. Note that on the .NET Framework none of these constructors takes a string as input, so the following code does not compile there (newer .NET versions add a ReadOnlySpan<char> constructor to which a string converts implicitly):
string s = new string("dsfds");
You can also create a new string by calling the string.Copy method. It creates a new string object with the same content as the input string. The Clone method, by contrast, returns a reference to the same string object on which you call it.
The following code demonstrates different ways of declaring and initializing strings:
string s = "This is the string";
s = @"C:\Windows\System32\notepad.exe";
string s1 = "C:\\Windows\\System32\\notepad.exe";
char[] charArray = {'s','d','s','d','s','a'};
s1 = new string(charArray);
Console.WriteLine(s1);
string s2 = "dfdefsd";
unsafe
{
    fixed (char* pChar = s2)
    {
        string s3 = new string(pChar);
        Console.WriteLine(s3);
    }
}
string s4 = string.Copy(s1);
Console.WriteLine(s4);
s4 = (string)s1.Clone();
Console.WriteLine(Object.ReferenceEquals(s4, s1)); // True: Clone returns the same object
s2 = s4.ToString();
Console.WriteLine(Object.ReferenceEquals(s2, s1)); // True: ToString returns the same object
Character combinations consisting of a backslash (\) followed by a letter or by a combination of digits are called "escape sequences". To represent a newline character, single quotation mark, or certain other characters in a character constant, you must use escape sequences (you can read more about them here). C# also lets you declare a string in which all symbols between the quotes are treated as part of the string. Such strings are called verbatim strings. To declare a verbatim string, put the @ symbol before the string constant. The following declarations give the same result:
string s = @"C:\Windows\System32\notepad.exe";
string s1 = "C:\\Windows\\System32\\notepad.exe";
There is a set of operations that you can perform on strings. Some of them are members of the string type, and some are done using operators and other types.
Interning
If your application uses a lot of strings and there is a chance that they repeat, .NET gives you a handy mechanism to deal with that: string interning. What does it mean? When an application domain is created, the CLR creates an internal hash table whose keys are strings and whose values are references to string objects. To work with this hash table you use two methods: Intern and IsInterned. Intern looks for the string in the hash table; if it is present you receive the reference to the string object that holds it, otherwise the string is added to the hash table and a reference to it is returned. IsInterned only performs the lookup and returns null if the string is not in the table. Normally, once a string object is no longer referenced by the application, the garbage collector reclaims its memory. Interned strings are an exception: they stay in memory until the application domain is unloaded. As a result, repeated text data is stored only once, and comparing interned strings by reference is very fast.
Console.WriteLine("--INTERNING");
s = string.Intern(s);
Console.WriteLine("Is interned: " + string.IsInterned(s));
Console.WriteLine("Is interned: " + string.IsInterned(s1));
string s5 = "Hello";
string s6 = new StringBuilder().Append("He").Append("llo").ToString();
Console.WriteLine(Object.ReferenceEquals(s5, s6)); // False: s6 was built at run time
s5 = string.Intern(s5);
s6 = string.Intern(s6);
Console.WriteLine(Object.ReferenceEquals(s5, s6));
Concatenation
Let's review the first operation that you can perform on strings, concatenation:
- You can use the "+" operator. Remember that a string is an immutable sequence of characters, so the result of each concatenation is a completely new allocation in memory. That is why the "+" operator is best suited for string literals: in that case the concatenation is done at compile time and the final string is built into the metadata of your module. Using "+" repeatedly on variables, for example in a loop, is wasteful because each concatenation creates a new string in memory.
- You can also use the static string.Concat method. It is very convenient for collections because it concatenates the string representations of all their members. Concat has many overloads, and before implementing concatenation manually I recommend checking whether one of them fits your case.
- You can use the Join method to concatenate the elements of an array or collection with a specific separator between them. It has several overloads, so check for the right one before using it.
Below is an example of the concatenation approaches just mentioned:
Console.WriteLine("--CONCATENATION");
s = "String1 " + "String2 ";
Console.WriteLine(s);
s1 = "String3 ";
s = s + s1;
Console.WriteLine(s);
s = string.Concat("test1", " test2", " test3");
Console.WriteLine(s);
List<int> iList = new List<int>();
for (int i = 0; i < 10; i++)
{
    iList.Add(i);
}
s = string.Concat(iList);
Console.WriteLine(s);
string[] array = { "one", "two", "three" };
s = string.Join("||", array);
Console.WriteLine(s);
Comparison
There are several methods for string comparison:
- Compare and CompareTo - compare two string objects, or two substrings of the specified string objects, based on their sort order. You can apply different rules by passing different arguments to the Compare overloads.
- Equals - determines whether the current instance and another instance have the same value.
When you compare strings, a CultureInfo object may be passed to the comparison function. By passing a culture you tell the runtime which culture's linguistic rules to apply to the comparison. This is a separate topic, and I recommend reading Jeffrey Richter's book "CLR via C#" on it. Its section about text handling describes this very well, and in my view a developer who needs to compare strings in different languages must be familiar with this part of comparison. The following code shows some samples of the comparison functions:
Console.WriteLine("--COMPARISON");
s = "String";
s1 = "String";
Console.WriteLine(string.Compare(s,s1));
Console.WriteLine(s.Equals(s1));
s = "A String";
s1 = "B String";
Console.WriteLine(string.Compare(s, s1).ToString() + s.CompareTo(s1).ToString());
Console.WriteLine(s.Equals(s1));
s = "C String";
Console.WriteLine(string.Compare(s, s1).ToString() + s.CompareTo(s1).ToString());
s1 = "c string";
Console.WriteLine(string.Compare(s, s1, true).ToString());
Console.WriteLine(s.CompareTo(s1).ToString());
Console.WriteLine(s.Equals(s1,StringComparison.Ordinal));
Console.WriteLine(s.Equals(s1,StringComparison.OrdinalIgnoreCase));
You can find a few more details about string comparison here.
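When culture rules are not needed, the StringComparison enumeration lets you request a plain ordinal (code-point) comparison instead. The sketch below is only an illustration; the class ComparisonDemo and the helper OrdinalSign are names invented here, and the result is reduced to its sign because only the sign of Compare is meaningful.

```csharp
using System;

static class ComparisonDemo
{
    // Hypothetical helper: returns -1, 0 or 1 for an ordinal comparison.
    public static int OrdinalSign(string a, string b, bool ignoreCase)
    {
        StringComparison mode = ignoreCase
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;
        return Math.Sign(string.Compare(a, b, mode));
    }

    static void Main()
    {
        // Ordinal: 'a' (97) comes after 'B' (66), so "apple" sorts after "Banana".
        Console.WriteLine(OrdinalSign("apple", "Banana", false)); // 1

        // Ignoring case, 'a' is compared with 'b' as equals-cased letters.
        Console.WriteLine(OrdinalSign("apple", "Banana", true));  // -1

        Console.WriteLine(string.Equals("C String", "c string",
            StringComparison.OrdinalIgnoreCase));                 // True
    }
}
```

Ordinal comparison does not depend on the current culture, which makes it a reasonable default for identifiers, file names and protocol keywords, while culture-aware comparison is meant for text shown to users.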
Characters and substrings
String has a set of functions for searching for a substring or a specific character, extracting parts of a string, and so on. Below is a little more about each function:
- Contains - returns true if the substring is present in the string.
- CopyTo - copies a specified number of characters, starting at a specific position, into an output array at a specific position there.
- EndsWith - returns true if the string ends with the input string. Has several overloads for culture and comparison options.
- StartsWith - returns true if the string starts with the input string. Has several overloads for culture and comparison options.
- IndexOf - returns the first index of a specific character or string, or -1 if it is not found. Has 9 overloads that offer various options.
- IndexOfAny - returns the first index in the string of any character from the input array.
- LastIndexOf - returns the last index of a specific character or string, or -1 if it is not found. Has 9 overloads that offer various options.
- LastIndexOfAny - returns the last index in the string of any character from the input array.
- Split - splits the string into substrings based on the input separators. Has several overloads for char and string separator arrays.
- Substring - returns a new string that is a substring of the source string. The substring contains all characters from a specific position to the end, or a specific number of characters from that position.
- Remove - returns a new string created from the source by removing all characters from a specific position to the end, or a specific number of characters from that position.
The code below demonstrates the functions listed above:
Console.WriteLine("--CHARACTERS AND SUBSTRINGS");
s = "string that contains substring";
s1 = "substring";
Console.WriteLine(s.Contains(s1));
Console.WriteLine(s.Contains("asdfds"));
char[] destination = new char[6];
s.CopyTo(0, destination, 0, 6);
Console.WriteLine(destination);
Console.WriteLine(s.EndsWith("dsas"));
Console.WriteLine(s.EndsWith("ring"));
Console.WriteLine(s.StartsWith("dsas"));
Console.WriteLine(s.StartsWith("str"));
Console.WriteLine(s.IndexOf("that"));
Console.WriteLine(s.IndexOf("rerwe"));
Console.WriteLine(s.LastIndexOf("in"));
Console.WriteLine(s.LastIndexOf("qqq"));
char[] search_chars = { 'i' };
Console.WriteLine(s.IndexOfAny(search_chars));
Console.WriteLine(s.LastIndexOfAny(search_chars));
char[] search_chars1 = { 'q' };
Console.WriteLine(s.IndexOfAny(search_chars1));
Console.WriteLine(s.LastIndexOfAny(search_chars1));
char[] split_char = { ' ' };
string[] out_strings = s.Split(split_char);
foreach (string splitted in out_strings)
{
    Console.WriteLine(splitted);
}
string[] split_string = { "in" };
out_strings = s.Split(split_string, StringSplitOptions.None);
foreach (string splitted in out_strings)
{
    Console.WriteLine(splitted);
}
s = "very important string";
Console.WriteLine(s.Substring(5));
Console.WriteLine(s.Substring(5,9));
Console.WriteLine(s.Remove(5));
Console.WriteLine(s.Remove(5, 2));
Formatting
String supports a set of functions that help to format a string and change its content. NOTE: a string is immutable, so each such operation returns a new string object. You may use the following functions for formatting:
- Format - builds a new string from the string representations of the input object(s), which replace the format items in the format string. Has a set of overloads.
- ToLower, ToLowerInvariant - return a new string with all characters in lower case.
- ToUpper, ToUpperInvariant - return a new string with all characters in upper case.
- Insert - inserts a string into the current one at a specific position and returns a new object that holds the result.
- PadRight, PadLeft - left-align or right-align the string by padding it with spaces (or another character) on the right or on the left until it reaches the requested total width.
- Trim, TrimStart, TrimEnd - remove all occurrences of the characters in the input set from the beginning and/or end of the string; if no characters are given, whitespace is removed.
- Replace - returns a new string in which all instances of a specific character or string are replaced by another character or string.
The code below shows examples of the formatting functions:
Console.WriteLine("--FORMATTING");
s = string.Format("First argument is: {0} and second argument is {1}", 10, 11);
Console.WriteLine(s);
s = string.Format("Percents are: {0:0.0%}", 0.75);
Console.WriteLine(s);
DateTime date = new DateTime(2015, 10, 5);
TimeSpan time = new TimeSpan(15, 15, 30);
decimal temp = 10.5m;
s = String.Format("Temperature on {0:d} at {1,10} is {2} degrees", date, time, temp);
Console.WriteLine(s);
s = "one plus";
s = s.Insert(8, " one");
Console.WriteLine(s);
s = "some string";
s1 = s.ToUpper();
Console.WriteLine(s1);
s = s1.ToLower();
Console.WriteLine(s);
s1 = s.ToUpperInvariant();
Console.WriteLine(s1);
s = s1.ToLower();
Console.WriteLine(s);
s = " string with extra whitespaces ";
char[] split_chars = { 's', 't', 'r', 'c', 'e', ' ' };
Console.WriteLine(s.Trim());
Console.WriteLine(s.TrimStart());
Console.WriteLine(s.TrimEnd());
Console.WriteLine(s.Trim(split_chars));
s = "some string";
Console.WriteLine(s.PadLeft(15));
Console.WriteLine(s.Replace('s','S'));
Console.WriteLine(s.Replace("str", "STR"));
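PadRight and the alignment component of a format item can be combined to produce fixed-width columns. The following is a small sketch of that idea; the class PaddingDemo, the helper Row and the column widths 8 and 4 are arbitrary choices made for this illustration.

```csharp
using System;

static class PaddingDemo
{
    // Hypothetical helper: name left-aligned in 8 columns, count right-aligned in 4.
    public static string Row(string name, int count)
    {
        return name.PadRight(8) + count.ToString().PadLeft(4);
    }

    static void Main()
    {
        Console.WriteLine("[" + Row("apples", 12) + "]"); // [apples    12]
        Console.WriteLine("[" + Row("kiwi", 7) + "]");    // [kiwi       7]

        // The same layout with alignment in the format items:
        // a negative width left-aligns, a positive width right-aligns.
        Console.WriteLine(string.Format("[{0,-8}{1,4}]", "kiwi", 7)); // [kiwi       7]
    }
}
```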
Other
There is a set of other functions and properties that you can use with strings. Below is a brief description of each:
- Length - this property returns the length of the string in characters.
- ToCharArray - copies the content of the whole string or part of it to a char array.
- IsNullOrEmpty - returns true if the string object is null or an empty string. Very convenient for validating strings.
- IsNullOrWhiteSpace - returns true if the string object is null, an empty string, or contains only whitespace.
The following code demonstrates their usage:
Console.WriteLine("--OTHER");
Console.WriteLine(s.Length);
char[] chArray = s4.ToCharArray();
s = null;
Console.WriteLine(string.IsNullOrEmpty(s));
Console.WriteLine(string.IsNullOrWhiteSpace(s));
s = " ";
Console.WriteLine(string.IsNullOrEmpty(s));
Console.WriteLine(string.IsNullOrWhiteSpace(s));
s = "sss";
Console.WriteLine(string.IsNullOrEmpty(s));
Console.WriteLine(string.IsNullOrWhiteSpace(s));
StringBuilder
StringBuilder is a class whose objects are used to create and change strings without creating a new instance in memory every time. The books I have read do not recommend using StringBuilder as a replacement for string, for example by passing it as an argument to a function; you should still rely on string in that case. StringBuilder is designed for string formatting and manipulation. Internally, StringBuilder holds a char array, and when it needs more space for new string data it extends the storage by allocating a new array and copying the data there. This approach lets StringBuilder use memory efficiently instead of creating a new object after each manipulation, as String does. Once you have finished manipulating the data in a StringBuilder, you can get it as a string by calling the ToString method. ToString returns a new string every time it is called.
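A typical situation where this pays off is building text piece by piece in a loop. The sketch below assumes a made-up helper, JoinNumbers, in a made-up class BuilderDemo; it appends to a single StringBuilder and converts to a string only once at the end.

```csharp
using System;
using System.Text;

static class BuilderDemo
{
    // Hypothetical helper: builds "0,1,2,..." without creating a new string per step.
    public static string JoinNumbers(int count)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            if (i > 0)
            {
                sb.Append(',');   // separator between items
            }
            sb.Append(i);         // Append has an overload for int
        }
        return sb.ToString();     // single conversion to string at the end
    }

    static void Main()
    {
        Console.WriteLine(JoinNumbers(5)); // 0,1,2,3,4
    }
}
```

Note that the method still accepts and returns ordinary values and strings; the StringBuilder stays an internal detail, in line with the advice above.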
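You can create a StringBuilder object in several ways:
StringBuilder sb = new StringBuilder();             // empty, default capacity
sb = new StringBuilder(25);                         // initial capacity
sb = new StringBuilder("input string");             // initial content
sb = new StringBuilder(25, 225);                    // initial and maximum capacity
sb = new StringBuilder("some string", 20);          // initial content and capacity
sb = new StringBuilder("some string", 0, 4, 20);    // substring (start 0, length 4) and capacity
Console.WriteLine(sb.ToString());                   // some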
StringBuilder has following important properties and methods:
Properties
- Capacity - gets or sets the size of the char array where StringBuilder stores its data. The default capacity is 16 characters. Once the string data exceeds the current capacity, the capacity grows automatically, typically by doubling.
- MaxCapacity - read-only property that returns the maximum number of symbols that can be stored in the StringBuilder object.
- Length - gets or sets the length of the current StringBuilder object in characters.
The following code demonstrates the StringBuilder properties:
sb = new StringBuilder("some string", 0, 4, 20);
Console.WriteLine(sb.Capacity);
Console.WriteLine(sb.MaxCapacity);
Console.WriteLine(sb.Length);
Methods
- Append - appends the string representation of the input value to the data that the StringBuilder object holds. Has about 20 overloads for different input types.
- AppendFormat - appends a string that is formatted on the fly. The format string contains one or more format items, each of which is replaced by the string representation of the corresponding argument. You can pass a format provider to control the formatting. Has a set of overloads; choose the one that matches your requirements.
- Clear - removes all characters from the StringBuilder instance.
- AppendLine - appends a line terminator to the current StringBuilder instance. There is also an overload that takes a string and appends it followed by the line terminator, which in my opinion is much more convenient.
- CopyTo - copies a segment of characters from the current StringBuilder instance to a destination char array.
- EnsureCapacity - makes sure the capacity of the StringBuilder instance is at least the input value, growing it if necessary, and returns the resulting capacity.
- Insert - similar to Append: has a set of overloads and inserts the string representation of different data types at a specific character position in the StringBuilder.
- Remove - removes a range of characters, starting at a specific index and of a specific length.
- Replace - replaces all occurrences of a specific character or string with a new one.
- ToString - returns all or part of the data held in the StringBuilder instance as a string.
The code below demonstrates these methods:
sb = new StringBuilder();
sb.Append(1);
sb.Append('s');
sb.Append(" some string data ");
sb.Append(true);
sb.Append('\t');
Object o = new Object();
sb.Append(o);
sb.Append('\t');
sb.Append(123.435345);
Console.WriteLine(sb.ToString());
sb.Clear();
sb.AppendFormat("Appends first digit: {0} and second bool: {1}", 123, false);
Console.WriteLine(sb.ToString());
sb.Clear();
sb.Append("first line");
sb.AppendLine();
sb.Append("second line");
Console.WriteLine(sb.ToString());
sb.Clear();
sb.AppendLine("first line");
sb.Append("second line");
Console.WriteLine(sb.ToString());
char[] cArray = new char[5];
sb.CopyTo(0, cArray, 0, 5);
Console.WriteLine(cArray);
Console.WriteLine(sb.EnsureCapacity(5));
Console.WriteLine(sb.EnsureCapacity(105));
try
{
    sb.EnsureCapacity(int.MaxValue);
}
catch (System.OutOfMemoryException)
{
    Console.WriteLine("We caught an exception because we tried to grow the StringBuilder beyond the maximum size");
}
sb.Clear();
sb.Insert(0, "first ");
sb.Insert(6, "second ");
sb.Insert(13, true);
sb.Insert(17, " ");
sb.Insert(18, 123.2435d);
Console.WriteLine(sb.ToString());
try
{
    sb.Insert(121, "");
}
catch (System.ArgumentOutOfRangeException)
{
    Console.WriteLine("We caught an exception because we tried to insert at a position beyond the current length of the StringBuilder");
}
sb.Remove(18, 7);
Console.WriteLine(sb.ToString());
sb.Replace('5', '!');
Console.WriteLine(sb.ToString());
sb.Replace("fir", "FIR");
Console.WriteLine(sb.ToString());
As you can see, class String offers much richer functionality than StringBuilder, but whenever you build or modify text in many steps you should prefer StringBuilder, because it uses resources more efficiently.
SecureString
All strings are stored on the heap, so there is a good chance that any string used in your application can be recovered by analyzing the memory allocated for the process. Many attacks have been carried out by analyzing memory dumps of applications that stored user passwords or other secret or private information. In addition, government and security organizations have strict rules for applications that run on their systems. To address these problems Microsoft designed a special string type whose data is kept in a dedicated, encrypted, unmanaged memory area that the garbage collector does not touch. This type is System.Security.SecureString. SecureString has several methods for manipulating its text data. When you call them, the relevant part of the string is decrypted and then encrypted again after the manipulation. This happens very quickly, so the data stays unencrypted only for a very short time, which keeps its exposure to a minimum. Class SecureString implements the IDisposable interface. When the application no longer needs the data, call the Dispose method or use a using statement to erase the text stored in the special buffer that holds the SecureString data.
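A minimal usage sketch follows. The class SecureStringDemo and the helpers FromChars and ToPlainText are invented for this illustration. Converting the content back to an ordinary string, as ToPlainText does, puts the secret on the managed heap again, so it is shown only to make the example observable and should be avoided in real code.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Security;

static class SecureStringDemo
{
    // Hypothetical helper: fills a SecureString one character at a time.
    public static SecureString FromChars(char[] chars)
    {
        var secure = new SecureString();
        foreach (char c in chars)
        {
            secure.AppendChar(c);
        }
        secure.MakeReadOnly(); // no further modification allowed
        return secure;
    }

    // Hypothetical helper, for demonstration only: exposes the secret as a string.
    public static string ToPlainText(SecureString secure)
    {
        IntPtr ptr = IntPtr.Zero;
        try
        {
            ptr = Marshal.SecureStringToGlobalAllocUnicode(secure);
            return Marshal.PtrToStringUni(ptr);
        }
        finally
        {
            if (ptr != IntPtr.Zero)
            {
                Marshal.ZeroFreeGlobalAllocUnicode(ptr); // wipe and free the copy
            }
        }
    }

    static void Main()
    {
        char[] typed = { 'p', '@', 's', 's' };
        using (SecureString secret = FromChars(typed))
        {
            Array.Clear(typed, 0, typed.Length);      // wipe the source buffer
            Console.WriteLine(secret.Length);         // 4
            Console.WriteLine(secret.IsReadOnly());   // True
            Console.WriteLine(ToPlainText(secret));   // p@ss
        } // Dispose erases the protected buffer here
    }
}
```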
Listing
The full listing of a sample application that demonstrates working with text is available as an attachment.
Sources
- Jeffrey Richter - CLR via C#
- Andrew Troelsen - Pro C# 5.0 and the .NET 4.5 Framework
- MSDN
- http://www.introprogramming.info/english-intro-csharp-book/read-online/chapter-13-strings-and-text-processing/
- http://www.c-sharpcorner.com/UploadFile/mahesh/WorkingWithStringsP111232005042550AM/WorkingWithStringsP1.aspx