
Escaping in C#: characters, strings, string formats, keywords, identifiers

Andreas Gieriet


21 Aug 2012

Different possibilities to escape literals and names/keywords.

Introduction

Everybody knows how to escape specific characters in a C# string. So why bother with this?

This tip shows the quirks involved with escaping in C#:

- character literal escaping: e.g. `'\''`, `'\n'`, `'\u20AC'` (the Euro € currency sign), `'\x9'` (equivalent to `'\t'`)
- literal string escaping: e.g. `"...\t...\u0040...\U00000041...\x9..."`
- verbatim string escaping: e.g. `@"...""..."`
- `string.Format` escaping: e.g. `"...{{...}}..."`
- keyword escaping: e.g. `@if` (for `if` as identifier)
- identifier escaping: e.g. `i\u0064` (for `id`)

Table of Contents

- Introduction
- Escaping - what for?
- Escaping in character and string literals
  - Summary
- Escaping in verbatim strings
  - Summary
- string.Format escaping
  - Summary
  - Bonus
- Escaping identifiers
  - Summary
- Escaping in Regular Expressions
  - Summary
  - Bonus
- Links
- History

Escaping - what for?

Again, everybody knows this, or at least has a feeling for it. Nonetheless, I'd like to recall what escaping is good for.

Escaping gives a character an alternative to its "normal" meaning. "Normal" is a matter of what is commonly used. There is no absolute reference for what is "normal", so each escape mechanism defines what is "normal" and what the escape for it is.

E.g. a string literal is enclosed in double quotes "...". The meaning of the double quotes is to enclose the string literal - this is the normal meaning of double quotes for strings. If you now want to include a double quote in a string literal, you must tell the compiler that this double quote does not have the normal meaning. E.g. "..."..." would terminate the string at the second double quote, whereas "...\"..." escapes the second double quote from being interpreted as terminating the string literal.

There is a variety of established escaping mechanisms. The motivations for escaping vary as well. Some motivations to employ escaping:

In string and character literals:
- One must be able to embed the terminators, like single or double quote.
- One needs to enter special characters that have no character symbol associated, like a horizontal tabulator.
- One needs to enter a character that has no direct key on the keyboard, like the Yen currency symbol (¥).
- etc.
In identifiers:
- One needs to enter names with characters that have no equivalent key on the keyboard, like the German umlaut Ä (Unicode 0x00C4).
- One needs to generate C# code that may use identifiers that clash with the C# keywords, like yield.
- etc.
In string formatting:
- One must be able to enter a literal { or } in a string.Format(...), like in
  Console.WriteLine("...{...", ...).
In Regular Expressions:
- One must match characters that would otherwise have a control meaning, like matching the character [, etc.
etc.

So, let's discuss the various mechanisms for escaping the normal behavior.

Escaping in character and string literals

Let's first look at the strings. A string is a sequence of characters. A character (char) holds one UTF-16 encoded value. A character therefore is a two-byte value.

E.g. the UTF-16 code decimal 64 (hexadecimal 40) is the @ character.

Note: There are some "characters" which cannot be encoded directly in these two bytes. These characters occupy 4 bytes, i.e. a pair of UTF-16 values. Such a pair is called a UTF-16 surrogate pair.

So, the string is a sequence of two-byte characters.

E.g. the string "abc" is represented in the running program as a sequence of the three UTF-16 values 0x0061, 0x0062, 0x0063. Likewise, the Euro currency sign is the Unicode character 0x20AC (€) and the Yen currency sign is the Unicode character 0x00A5 (¥).

How to write that in C#?

char euro = '\u20ac';
char yen  = '\u00a5';

The item \uxxxx denotes a single UTF-16 code with the hexadecimal value xxxx.

As an alternative, one can write \u.... as \x, followed by one to four hex characters. The above example can also be written as

char euro = '\x20ac';
char yen  = '\xa5';

Note: the \x sequence tries to match as much as possible, i.e. "\x68ello" results in "ڎllo" and not in "hello" (the \x68e terminates after three hex characters since the following character, l, is not a possible hex character). As a consequence, \u... is safer than \x...: in the first case the length is fixed, whereas in the second case the longest match is taken, which may fool you.
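The longest-match behavior of \x is easy to check. The following is a minimal sketch, written as C# top-level statements (this assumes C# 9 or later; with an older compiler, put the statements into a Main method). The variable names are made up for this illustration.

```csharp
using System;

// \x consumes up to four hex digits, so "\x68e" becomes ONE character (U+068E).
string greedy = "\x68ello";
// \u always takes exactly four hex digits, so this is 'h' followed by "ello".
string fixedLength = "\u0068ello";

Console.WriteLine(greedy.Length);               // 4
Console.WriteLine((int)greedy[0] == 0x068E);    // True
Console.WriteLine(fixedLength.Length);          // 5
Console.WriteLine(fixedLength == "hello");      // True
```

If \x is really wanted, padding to four digits ("\x0068ello") avoids the surprise, since the match stops after the fourth hex digit.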

Notes:

Please note that the upper case \Uxxxxxxxx item denotes a full Unicode code point (up to 0010FFFF). A code point above FFFF is stored as a surrogate pair. Since a surrogate pair requires a pair of UTF-16 characters, such a code point cannot be stored in one C# character.
\u must be followed by exactly four hexadecimal characters
\U must be followed by exactly eight hexadecimal characters
\x must be followed by one to four hexadecimal characters
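The relation between \U and surrogate pairs can be made visible with a few checks. This is only a sketch under the same assumption as elsewhere (C# top-level statements, C# 9 or later); the code point U+2A6D6 is picked arbitrarily as an example of a character outside the 16-bit range.

```csharp
using System;

// U+2A6D6 does not fit into 16 bits, so the string holds two UTF-16 values.
string s = "\U0002A6D6";
Console.WriteLine(s.Length);                              // 2
Console.WriteLine(char.IsHighSurrogate(s[0]));            // True
Console.WriteLine(char.IsLowSurrogate(s[1]));             // True
Console.WriteLine(s == "\ud869\uded6");                   // True
Console.WriteLine(char.ConvertToUtf32(s, 0) == 0x2A6D6);  // True

// A \U value that fits into 16 bits is still a single character.
char at = '\U00000040';
Console.WriteLine(at == '@');                             // True
```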

Ah, yes, since it is common knowledge to everyone, I almost forgot to provide the short character escape notation of some often used special characters like \n, etc.:

Short Notation	UTF-16 character	Description
`\'`	`\u0027`	allows entering a `'` in a character literal, e.g. `'\''`
`\"`	`\u0022`	allows entering a `"` in a string literal, e.g. `"this is the double quote (\") character"`
`\\`	`\u005c`	allows entering a `\` character in a character or string literal, e.g. `'\\'` or `"this is the backslash (\\) character"`
`\0`	`\u0000`	allows entering the character with code 0
`\a`	`\u0007`	alarm (usually the HW beep)
`\b`	`\u0008`	back-space
`\f`	`\u000c`	form-feed (next page)
`\n`	`\u000a`	line-feed (next line)
`\r`	`\u000d`	carriage-return (move to the beginning of the line)
`\t`	`\u0009`	(horizontal-) tab
`\v`	`\u000b`	vertical-tab
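The table can be cross-checked in code. The following is a small sketch (C# top-level statements, assuming C# 9 or later; variable names are illustrative only) that compares a few short notations with their \u counterparts.

```csharp
using System;

// Each short notation is just another spelling of the same UTF-16 value.
bool tabSame = '\t' == '\u0009';
bool allSame = "\a\b\f\n\r\t\v\0"
            == "\u0007\u0008\u000c\u000a\u000d\u0009\u000b\u0000";
int backslashCode = '\\';   // implicit char-to-int conversion

Console.WriteLine(tabSame);         // True
Console.WriteLine(allSame);         // True
Console.WriteLine(backslashCode);   // 92 (hexadecimal 5c)
```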

Summary

characters are two-byte UTF-16 codes
UTF-16 surrogate pairs are stored in a pair of C# characters
the escape character \ introduces escaping
what follows the \ character is
- one of the short notations characters (\\, \", \', \a, ...)
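- a Unicode character code (\u20ac, \u00a5, ...)
- a Unicode code point written with eight hex digits (\U0002A6D6, ...); a code point above FFFF becomes a surrogate pair (here \ud869\uded6), which can only be stored in a string but not in a single character.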
- a hex sequence of 1 to 4 hex characters (\xa5, ...)

Escaping in verbatim strings

What are verbatim strings? They are syntactic sugar for entering strings in C#.

E.g. storing a Windows file path like

string path = "C:\\Program Files\\Microsoft Visual Studio 10.0\\";

can be considered as awkward or ugly. A more convenient version is the verbatim string:

string path = @"C:\Program Files\Microsoft Visual Studio 10.0\";

A verbatim string (@"...") takes the content as-is without any interpretation of any character. Well almost; there is exactly one character that can be escaped: an embedded " must be escaped as "". E.g.

string xml = @"<?xml version=""1.0""?>
<Data>
...
</Data>";

Note: As mentioned above, the verbatim string literal is a convenient way to enter a string literal in C#. The resulting memory image of the strings is the same. E.g. these are all identical string contents (provided the source file uses Windows line endings, i.e. the line break inside v3 is stored as \r\n):

string v1 = "a\r\nb";
string v2 = "\u0061\u000d\u000a\u0062";
string v3 = @"a
b";
Console.WriteLine("v1 = \"{0}\"\nv2 = \"{1}\"\nsame = {2}", v1, v2, v1 == v2);
Console.WriteLine("v1 = \"{0}\"\nv3 = \"{1}\"\nsame = {2}", v1, v3, v1 == v3);

results in

v1 = "a
b"
v2 = "a
b"
same = True
v1 = "a
b"
v3 = "a
b"
same = True

Summary

verbatim string literals and normal string literals are two ways to define string content
verbatim strings take all given characters as-is, including new lines, etc.
the only escape sequence in a verbatim string literal is "" to denote an embedded " character
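To see the summary in action, here is a minimal sketch (C# top-level statements, assuming C# 9 or later; the sample text and variable names are invented for this illustration) that compares a normal literal with its verbatim twin.

```csharp
using System;

// The same content, once with backslash escapes and once as verbatim string.
string regular  = "She said \"hi\" and left C:\\temp\\";
string verbatim = @"She said ""hi"" and left C:\temp\";

Console.WriteLine(regular == verbatim);   // True
Console.WriteLine(@"""".Length);          // 1: the string consists of a single " character
Console.WriteLine(@"\n".Length);          // 2: a backslash and an n, not a line-feed
```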

string.Format escaping

Format strings are interpreted at run time (not at compile time) to replace {...} by the respective arguments. E.g.

Console.WriteLine("User = {0}", Environment.UserName);

But what if you want to have a { or } embedded in the format string? Is it \{ or {{? Think about it!

Clearly the second. Why? Let's elaborate on that.

The format string is a string like any other. You can enter it as

string.Format("...", a, b);
string.Format(@"...", a, b);
string.Format(s.GetSomeFormatString(), a, b);

If C# allowed \{ or \} in a string literal, it would be stored in the string as { and } respectively.
The string.Format function then reads this string to decide whether a formatting instruction such as {0} is to be interpreted. Since the \{ would have resulted in a plain { character in the string, the string.Format function could not tell whether it is meant as a format instruction or as a literal {.
The alternative is some other escaping. The established way is to double the character to escape.
So, string.Format treats {{ as literal {. Analogously, }} stands for }.
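The reasoning can be tried out directly. The sketch below (C# top-level statements, assuming C# 9 or later; variable names are illustrative) also shows that a compile-time escape such as \u007b does not help: once the literal is compiled, string.Format sees an ordinary {.

```csharp
using System;

string braces  = string.Format("{{{0}}}", "abc");        // {{ -> {, {0} -> abc, }} -> }
string literal = string.Format("{{0}}", "abc");          // only escaped braces, no placeholder
string unicode = string.Format("\u007b0\u007d", "abc");  // the compiler turns this into "{0}"

Console.WriteLine(braces);    // {abc}
Console.WriteLine(literal);   // {0}
Console.WriteLine(unicode);   // abc

// A single, unescaped { is a malformed format string.
bool threw = false;
try { string.Format("{", "abc"); }
catch (FormatException) { threw = true; }
Console.WriteLine(threw);     // True
```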

Summary

string.Format(...) escaping is interpreted at runtime only, not at compile time
the two characters that have a special meaning in string.Format(...) are { and }
the escaping for these two characters is the doubling of the specific character: {{ and }} (e.g. Console.WriteLine("{{{0}}}", "abc"); results in console output {abc})

Bonus

The following code scans the C# string format text and returns all argument ids:

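// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
public static IEnumerable<int> GetIds(string format)
{
    // "{", then the numeric argument id, then anything up to the closing "}"
    string pattern = @"\{(\d+)[^\}]*\}";
    var ids = Regex.Matches(format, pattern, RegexOptions.Compiled)
                   .Cast<Match>()
                   .Select(m=>int.Parse(m.Groups[1].Value));
    return ids;
}
foreach (int n in GetIds("a {0} b {1 } c {{{0,10}}} d {{e}}")) Console.WriteLine(n);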

Passing "a {0} b {1 } c {{{0,10}}} d {{e}}" results in

0
1
0

Escaping identifiers

Why would one escape identifiers? I guess this is not really intended for daily use. It is probably only useful for automatically generated C# code. Nonetheless, there are two situations that call for escaped identifiers:

define an identifier that would clash with keywords
define an identifier that contains characters which have no equivalent on the keyboard

Option A: prefix an identifier by @, e.g.

int @yield = 10;

Option B: use UTF-16 escape sequences (\uxxxx) as described for string literals above, e.g.

int \u0079ield = 10;

Notes:

A keyword must stay unescaped, i.e. if an identifier is written as @xxx it is always an identifier (i.e. never a keyword).
The same holds for identifiers that contain UTF-16 escape sequences
You can mix and match escaped identifiers, e.g. the following are identical:
```
while (@a > 0) \u0061 = a - 1;
while (a > 0) a = a - 1;
```
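Both options can be combined in one small experiment. This is a sketch only (C# top-level statements, assuming C# 9 or later); the variable names are chosen for illustration, with while standing in for any reserved keyword.

```csharp
using System;

int @while = 3;      // Option A: the keyword while used as a variable name
int i\u0064 = 4;     // Option B: declares a variable named id

// An identifier containing an escape sequence is never a keyword,
// so wh\u0069le refers to the same variable as @while.
int sum = wh\u0069le + id;

Console.WriteLine(sum);              // 7
Console.WriteLine(nameof(@while));   // while
Console.WriteLine(nameof(i\u0064));  // id
```

The @ prefix and the escape sequences are only spelling; the name seen by the compiler (and by nameof) is the plain one.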

Summary

identifier escaping is available in C#
identifiers can be prefixed by @ to avoid keyword clashes
identifier characters can be encoded by using UTF-16 character escape sequences
the escaped identifiers must still be from the legal character sets - you cannot define an identifier containing a dot, etc.
numbers, operators, and punctuation cannot be escaped (e.g. 1.0f, etc. cannot be escaped)
My opinion: escaping identifiers is not intended for daily use - e.g. don't ever attempt to prefix any identifiers by a @! This is meant for automatically generated code only, i.e. no user should ever see such an identifier...

Escaping in Regular Expressions

Regex pattern strings are also interpreted at run time, like string.Format(...). The Regex syntax contains instructions that are introduced by \. E.g. \d stands for a single digit character. I won't go into the Regex syntax in this tip, but rather show how to conveniently put such a Regex pattern into a C# string.

Since the Regex pattern most likely contains some \, it is more convenient to write Regex patterns as verbatim string. The result is that the \ does not need to be escaped in the pattern. E.g. the following patterns are identical for the Regex pattern \d+|\w+ (decide yourself which one is more convenient):

var match1 = Regex.Matches(input, "\\d+|\\w+");
var match2 = Regex.Matches(input, @"\d+|\w+");

There is a gotcha: double quotes look a bit odd in a verbatim string, since each one must be doubled. In the end it's your choice which way you enter the pattern, as normal string literal or as verbatim string literal.
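The double-quote gotcha looks like this in practice. The following is a minimal sketch (C# top-level statements, assuming C# 9 or later); the pattern for a quoted word and the sample input are invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The Regex pattern "[^"]*" (a double-quoted text), entered in both ways.
string regular  = "\"[^\"]*\"";    // normal literal: each " needs a backslash
string verbatim = @"""[^""]*""";   // verbatim literal: each " is doubled

string found = Regex.Match("say \"hi\" now", verbatim).Value;

Console.WriteLine(regular == verbatim);          // True
Console.WriteLine(found);                        // "hi" (including the quotes)
Console.WriteLine("\\d+|\\w+" == @"\d+|\w+");    // True
```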

Summary

Regex patterns are conveniently entered as verbatim string @"..."

Bonus

The following code shows tokenizing C#. Try to understand the escaping ;-) :

string strlit  = @"""(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}|\\x[0-9a-fA-F]{1,4}|\\.|[^""])*""";
string verlit  = @"@""(?:""""|[^""])*"""; // or: "@\"(?:\"\"|[^\"])*\""
string charlit = @"'(?:\\u[0-9a-fA-F]{4}|\\x[0-9a-fA-F]{1,4}|\\.|[^'])'";
string hexlit  = @"0[xX][0-9a-fA-F]+[ulUL]?";
string number1 = @"(?:\d*\.\d+)(?:[eE][-+]?\d+)?[fdmFDM]?";
string number2 = @"\d+(?:[ulUL]?|(?:[eE][-+]?\d+)[fdmFDM]?|[fdmFDM])";
string ident   = @"@?(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}|\w)+";
string[] op3   = new string[] {"<<="};
string[] op2   = new string[] {"!=","%=","&&","&=","*=","++","+=","--","-=","/=",
                               "::","<<","<=","==","=>","??","^=","|=","||"};
string rest = @"\S";

string skip = @"(?:"+ string.Join("|", new string[]
{
    @"[#].*?\n",                                     // C# pre processor line
    @"//.*?\n",                                      // C# single line comment
    @"/[*][\s\S]*?[*]/",                             // C# block comment
    @"\s",                                           // white-space
}) + @")*";
string pattern = skip + "(" + string.Join("|", new string[]
{
    strlit,                                          // C# string literal
    verlit,                                          // C# verbatim literal
    charlit,                                         // C# character literal
    hexlit,                                          // C# hex number literal
    number1,                                         // C# real literal
    number2,                                         // C# integer or real literal
    ident,                                           // C# identifiers
    string.Join("|",op3.Select(t=>Regex.Escape(t))), // C# three-letter operator
    string.Join("|",op2.Select(t=>Regex.Escape(t))), // C# two-letter operator
    rest,                                            // C# one-letter operator and any other one char
}) + @")" + skip;

string f = @"..."; // enter your path to the C# file to parse
string input = File.ReadAllText(f);
var matches = Regex.Matches(input, pattern, RegexOptions.Singleline|RegexOptions.Compiled).Cast<Match>();
foreach (var token in from m in matches select m.Groups[1].Value)
{
    Console.Write(" {0}", token);
    if ("{};".Contains(token)) Console.WriteLine();
}

Have fun!

Links

The following links may provide additional information:

History

V1.0	2012-04-23	Initial version.
V1.1	2012-04-23	Fix broken formatting.
V1.2	2012-04-25	Fix typos, add more links, fix HTML unicode literals in the text, update some summaries.
V1.3	2012-08-21	Fix `\x...` description. Make some tables of class ArticleTable (looks a bit nicer)

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
