Introduction
Everybody knows how to escape specific characters in a C# string. So why bother with this?
This tip shows the quirks involved with escaping in C#:
- character literal escaping, e.g. '\'', '\n', '\u20AC' (the Euro € currency sign), '\x9' (equivalent to '\t')
- literal string escaping, e.g. "...\t...\u0040...\U00000041...\x9..."
- verbatim string escaping, e.g. @"...""..."
- string.Format escaping, e.g. "...{{...}}..."
- keyword escaping, e.g. @if (for if as identifier)
- identifier escaping, e.g. i\u0064 (for id)
Escaping - what for?
Again, everybody knows this, or at least has a feeling for it. Nonetheless, I'd like to recall what escaping is good for.
Escaping gives an alternative meaning to the "normal" meaning. "Normal" is a matter of what is commonly used. There is no absolute reference for what is "normal", so, each escape mechanism defines what is "normal" and what is the escape for it.
E.g. a string literal is enclosed in double quotes: "...". The meaning of the double quotes is to enclose the string literal; this is the normal meaning of double quotes for strings. If you want to include a double quote in a string literal, you must indicate that this double quote does not have the normal meaning. E.g. "..."..." would terminate the string at the second double quote, whereas "...\"..." escapes the second double quote from being interpreted as terminating the string literal.
There are a variety of established escaping mechanisms. The motivations for escaping vary as well. Some motivations to employ escaping:
- In string and character literals:
  - One must be able to embed the terminators, like the single or double quote.
  - One needs to enter special characters that have no character symbol associated, like a horizontal tabulator.
  - One needs to enter a character that has no direct key on the keyboard, like the Yen currency symbol (¥).
  - etc.
- In identifiers:
  - One needs to enter names with characters that have no equivalent key on the keyboard, like the German umlaut Ä (Unicode 0x00C4).
  - One needs to generate C# code that may use identifiers that clash with the C# keywords, like yield.
  - etc.
- In string formatting:
  - One must be able to enter a literal { or } in a string.Format(...) call, like in Console.WriteLine("...{...", ...).
- In Regular Expressions:
  - One must match characters that would otherwise have a control meaning, like matching the character [, etc.
- etc.
So, let's start discussing the various mechanisms for escaping the normal behavior.
Escaping in character and string literals
Let's first look at strings. A string is a sequence of characters. A character is a type that holds a UTF-16 encoded value. A character is therefore a two-byte value.
E.g. the UTF-16 code decimal 64 (hexadecimal 40) is the @ character.
Note: Some "characters" cannot be encoded directly in these two bytes. They occupy 4 bytes, i.e. a pair of UTF-16 values. Such a pair is called a UTF-16 surrogate pair.
So, the string is a sequence of two-byte characters.
E.g. the string "abc" results in the executed program in a sequence of the three UTF-16 values 0x0061, 0x0062, 0x0063. Or the Euro currency sign is the Unicode character 0x20AC (€) and the Yen currency sign is the Unicode character 0x00A5 (¥).
How to write that in C#?
char euro = '\u20ac';
char yen = '\u00a5';
The item \uxxxx denotes a UTF-16 code.
As an alternative to \u...., one can write \x followed by one to four hex characters. The above example can also be written as
char euro = '\x20ac';
char yen = '\xa5';
Note: the \x sequence tries to match as much as possible, i.e. "\x68ello" results in "ڎllo" and not in "hello" (the \x68e terminates after three hex characters since the following character is not a possible hex character). As a consequence, \u... is safer than \x...: in the first case the length is fixed, whereas in the second case the longest match is taken, which may fool you.
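The longest-match behavior of \x can be checked with a few lines. This is only a minimal sketch, written as top-level statements (assuming C# 9 or later; in older versions the body goes into a Main method), and the variable names are made up for illustration:

```csharp
using System;

// \x consumes up to four hex digits, so the 'e' is swallowed by the escape.
string greedy = "\x68ello";    // parsed as \x68e followed by "llo"
// \u always takes exactly four hex digits, so "ello" stays plain text.
string exact = "\u0068ello";   // parsed as \u0068 followed by "ello"

Console.WriteLine(greedy.Length);              // 4
Console.WriteLine((int)greedy[0] == 0x068E);   // True
Console.WriteLine(exact == "hello");           // True
```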
Notes:
- Please note that the upper-case \Uxxxxxxxx item denotes a full Unicode code point. A code point above U+FFFF is stored as a surrogate pair. Since a surrogate pair requires a pair of UTF-16 characters, it cannot be stored in one C# character.
- \u must be followed by exactly four hexadecimal characters
- \U must be followed by exactly eight hexadecimal characters
- \x must be followed by one to four hexadecimal characters
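To see how a \U escape ends up in a string, the following sketch may help. It uses U+2A6D6 purely as a sample code point above U+FFFF and is written as top-level statements (assuming C# 9 or later):

```csharp
using System;

// U+2A6D6 lies above U+FFFF, so it needs two UTF-16 code units.
string s = "\U0002A6D6";

Console.WriteLine(s.Length);                               // 2
Console.WriteLine(char.IsHighSurrogate(s[0]));             // True
Console.WriteLine(char.IsLowSurrogate(s[1]));              // True
Console.WriteLine("{0:X4} {1:X4}", (int)s[0], (int)s[1]);  // D869 DED6
Console.WriteLine("{0:X}", char.ConvertToUtf32(s, 0));     // 2A6D6

// A code point that fits into 16 bits yields a single character.
string a = "\U00000041";
Console.WriteLine(a == "A");                               // True
```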
Ah, yes, since it is common knowledge, I almost forgot to list the short character escape notation of some often-used special characters like \n, etc.:
Short Notation | UTF-16 character | Description |
---|---|---|
\' | \u0027 | allows entering a ' in a character literal, e.g. '\'' |
\" | \u0022 | allows entering a " in a string literal, e.g. "this is the double quote (\") character" |
\\ | \u005c | allows entering a \ character in a character or string literal, e.g. '\\' or "this is the backslash (\\) character" |
\0 | \u0000 | allows entering the character with code 0 |
\a | \u0007 | alarm (usually the HW beep) |
\b | \u0008 | back-space |
\f | \u000c | form-feed (next page) |
\n | \u000a | line-feed (next line) |
\r | \u000d | carriage-return (move to the beginning of the line) |
\t | \u0009 | (horizontal-) tab |
\v | \u000b | vertical-tab |
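A few rows of the table can be verified directly. The snippet is a small sketch with illustrative variable names, written as top-level statements (assuming C# 9 or later):

```csharp
using System;

char tab = '\t';
char lineFeed = '\n';
string backslash = "a\\b";       // three characters: a, \, b
string quote = "say \"hi\"";     // embedded double quotes

Console.WriteLine(tab == '\u0009');       // True
Console.WriteLine(lineFeed == '\x0a');    // True
Console.WriteLine(backslash.Length);      // 3
Console.WriteLine(quote);                 // say "hi"
```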
Summary
- characters are two-byte UTF-16 codes
- UTF-16 surrogate pairs are stored in a pair of C# characters
- the escape character \ introduces escaping
- what follows the \ character is
  - one of the short notation characters (\\, \", \', \a, ...)
  - a Unicode character code (\u20ac, \u00a5, ...)
  - a code point above U+FFFF (\U0002A6D6, ...), which becomes a surrogate pair and can only be stored in a string but not in a single character
  - a hex sequence of 1 to 4 hex characters (\xa5, ...)
Escaping in verbatim strings
What are verbatim strings? They are syntactic sugar for entering strings in C#.
E.g. storing a Windows file path like
string path = "C:\\Program Files\\Microsoft Visual Studio 10.0\\";
can be considered awkward or ugly. A more convenient version is the verbatim string:
string path = @"C:\Program Files\Microsoft Visual Studio 10.0\";
A verbatim string (@"...") takes the content as-is without any interpretation of any character. Well, almost: there is exactly one character that can be escaped. An embedded " must be escaped as "". E.g.
string xml = @"<?xml version=""1.0""?>
<Data>
...
</Data>";
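Note: As mentioned above, the verbatim string literal is a convenient way to enter a string literal in C#. The resulting memory image of the strings is the same. E.g. these are all identical string contents (assuming the source file uses Windows CR LF line endings, since the line break in v3 is taken from the file as-is):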
string v1 = "a\r\nb";
string v2 = "\u0061\u000d\u000a\u0062";
string v3 = @"a
b";
Console.WriteLine("v1 = \"{0}\"\nv2 = \"{1}\"\nsame = {2}", v1, v2, v1 == v2);
Console.WriteLine("v1 = \"{0}\"\nv3 = \"{1}\"\nsame = {2}", v1, v3, v1 == v3);
results in
v1 = "a
b"
v2 = "a
b"
same = True
v1 = "a
b"
v3 = "a
b"
same = True
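The two rules for verbatim strings (backslashes are taken as-is, and "" stands for an embedded ") can also be checked without any line breaks involved. The path below is a made-up example, and the sketch is written as top-level statements (assuming C# 9 or later):

```csharp
using System;

// Hypothetical path, used only to compare the two notations.
string regular  = "C:\\Temp\\log.txt";
string verbatim = @"C:\Temp\log.txt";
Console.WriteLine(regular == verbatim);       // True

// The doubled quote is the only escape inside a verbatim string.
string quoted = @"say ""hi""";
Console.WriteLine(quoted);                    // say "hi"
Console.WriteLine(quoted == "say \"hi\"");    // True

// \n is not an escape here: it is a backslash followed by the letter n.
string notNewline = @"\n";
Console.WriteLine(notNewline.Length);         // 2
```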
Summary
- verbatim string literals and normal string literals are two ways to define string content
- verbatim strings take all given characters as-is, including new lines, etc.
- the only escape sequence in a verbatim string literal is "" to denote an embedded " character
string.Format escaping
Format strings are interpreted at runtime (not at compile time) to replace {...} by the respective arguments. E.g.
Console.WriteLine("User = {0}", Environment.UserName);
But what if you want to have a { or } embedded in the format string? Is it \{ or {{? Think about it!
Clearly the second. Why? Let's elaborate on that.
- The format string is a string like any other. You can enter it as
string.Format("...", a, b);
string.Format(@"...", a, b);
string.Format(s.GetSomeFormatString(), a, b);
- If C# allowed \{ or \} as escape sequences, they would be stored in the string as { and } respectively.
- The string.Format function then reads this string to decide if a formatting instruction is to be interpreted, e.g. {0}. Since the \{ resulted in a { character in the string, the string.Format function could not decide whether this is to be taken as a format instruction or as a literal {.
- The alternative was some other escaping. The established way is to double the character to escape.
- So, string.Format treats {{ as a literal {. Analogously, }} stands for }.
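The doubling rule, and the fact that it is enforced at runtime only, can be tried out with the following sketch. The variable names are illustrative, and the code is written as top-level statements (assuming C# 9 or later):

```csharp
using System;

// {{ and }} produce literal braces around the formatted argument.
string braces = string.Format("{{{0}}}", "abc");
Console.WriteLine(braces);     // {abc}

// Fully escaped: no argument is inserted at all.
string literal = string.Format("{{0}}", "abc");
Console.WriteLine(literal);    // {0}

// A lone brace compiles fine but is rejected when the format string is parsed.
string bad = "{";
bool threw = false;
try { string.Format(bad, "abc"); }
catch (FormatException) { threw = true; }
Console.WriteLine(threw);      // True
```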
Summary
- string.Format(...) escaping is interpreted at runtime only, not at compile time
- the two characters that have a special meaning in string.Format(...) are { and }
- the escaping for these two characters is the doubling of the specific character: {{ and }} (e.g. Console.WriteLine("{{{0}}}", "abc"); results in console output {abc})
Bonus
The following code scans the C# string format text and returns all argument ids:
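// Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
public static IEnumerable<int> GetIds(string format)
{
    // Match "{", capture the leading digits, then skip everything up to the closing "}".
    string pattern = @"\{(\d+)[^\}]*\}";
    return Regex.Matches(format, pattern, RegexOptions.Compiled)
                .Cast<Match>()
                .Select(m => int.Parse(m.Groups[1].Value));
}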
foreach (int n in GetIds("a {0} b {1 } c {{{0,10}}} d {{e}}")) Console.WriteLine(n);
Passing "a {0} b {1 } c {{{0,10}}} d {{e}}"
results in
0
1
0
Escaping identifiers
Why would one escape identifiers? I guess this is not really intended for daily use. It is probably only useful for automatically generated C# code. Nonetheless, there are two reasons to escape identifiers, each with its own mechanism:
- define an identifier that would clash with keywords
- define an identifier that contains characters which have no equivalent on the keyboard
Option A: prefix an identifier by @, e.g.
int @yield = 10;
Option B: use Unicode escape sequences (\u or \U, as described above for string literals; \x is not available in identifiers), e.g.
int \u0079ield = 10;
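Both options name the same identifier as the unescaped spelling would. The sketch below illustrates this with made-up variables, written as top-level statements (assuming C# 9 or later):

```csharp
using System;

int @class = 1;      // "class" is a keyword; the @ prefix makes it usable as a name
cl\u0061ss += 1;     // the same variable again: \u0061 is the letter a

int i\u0064 = 5;     // declares a variable named id
id += 1;             // the plain spelling refers to the same variable

Console.WriteLine(@class);   // 2
Console.WriteLine(id);       // 6
```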
Summary
- identifier escaping is available in C#
- identifiers can be prefixed by @ to avoid keyword clashes
- identifier characters can be encoded by using UTF-16 character escape sequences
- the escaped identifiers must still be from the legal character sets - you cannot define an identifier containing a dot, etc.
- numbers, operators, and punctuation cannot be escaped (e.g. 1.0f, etc. cannot be escaped)
- My opinion: escaping identifiers is not intended for daily use - e.g. don't ever attempt to prefix any identifier by a @! This is meant for automatically generated code only, i.e. no user should ever see such an identifier...
Escaping in Regular Expressions
Regex pattern strings are also interpreted at runtime, like string.Format(...). The Regex syntax contains instructions that are introduced by \. E.g. \d stands for a single digit character such as 0...9. I don't go into the Regex syntax in this tip, but rather show how to conveniently put such a Regex pattern into a C# string.
Since the Regex pattern most likely contains some \, it is more convenient to write Regex patterns as verbatim strings. The result is that the \ does not need to be escaped in the pattern. E.g. the following two C# literals both yield the Regex pattern \d+|\w+ (decide for yourself which one is more convenient):
var match1 = Regex.Matches(input, "\\d+|\\w+");
var match2 = Regex.Matches(input, @"\d+|\w+");
There is a gotcha: entering double quotes looks a bit odd in a verbatim string, since each one must be doubled. In the end, it's your choice how you enter the pattern, as a normal string literal or as a verbatim string literal.
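The two notations, including the double-quote gotcha, can be compared side by side. This is a minimal sketch with illustrative names and a made-up input, written as top-level statements (assuming C# 9 or later):

```csharp
using System;
using System.Text.RegularExpressions;

// Both literals produce the same pattern text: \d+|\w+
string p1 = "\\d+|\\w+";
string p2 = @"\d+|\w+";
Console.WriteLine(p1 == p2);    // True

// Pattern for double-quoted text: "[^"]*"
string q1 = "\"[^\"]*\"";       // normal literal: every " becomes \"
string q2 = @"""[^""]*""";      // verbatim literal: every " becomes ""
Console.WriteLine(q1 == q2);    // True

string found = Regex.Match("say \"hi\" now", q2).Value;
Console.WriteLine(found);       // "hi"
```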
Summary
- Regex patterns are conveniently entered as verbatim strings @"..."
Bonus
The following code shows how to tokenize C# source text. Try to understand the escaping:
string strlit = @"""(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}|\\x[0-9a-fA-F]{1,4}|\\.|[^""])*""";
string verlit = @"@""(?:""""|[^""])*""";
string charlit = @"'(?:\\u[0-9a-fA-F]{4}|\\x[0-9a-fA-F]{1,4}|\\.|[^'])'";
string hexlit = @"0[xX][0-9a-fA-F]+[ulUL]?";
string number1 = @"(?:\d*\.\d+)(?:[eE][-+]?\d+)?[fdmFDM]?";
string number2 = @"\d+(?:[ulUL]?|(?:[eE][-+]?\d+)[fdmFDM]?|[fdmFDM])";
string ident = @"@?(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}|\w)+";
string[] op3 = new string[] {"<<="};
string[] op2 = new string[] {"!=","%=","&&","&=","*=","++","+=","--","-=","/=",
"::","<<","<=","==","=>","??","^=","|=","||"};
string rest = @"\S";
string skip = @"(?:"+ string.Join("|", new string[]
{
@"[#].*?\n",
@"//.*?\n",
@"/[*][\s\S]*?[*]/",
@"\s",
}) + @")*";
string pattern = skip + "(" + string.Join("|", new string[]
{
strlit,
verlit,
charlit,
hexlit,
number1,
number2,
ident,
string.Join("|",op3.Select(t=>Regex.Escape(t))),
string.Join("|",op2.Select(t=>Regex.Escape(t))),
rest,
}) + @")" + skip;
string f = @"...";
string input = File.ReadAllText(f);
var matches = Regex.Matches(input, pattern, RegexOptions.Singleline|RegexOptions.Compiled).Cast<Match>();
foreach (var token in from m in matches select m.Groups[1].Value)
{
Console.Write(" {0}", token);
if ("{};".Contains(token)) Console.WriteLine();
}
Have fun!
History
V1.0 | 2012-04-23 | Initial version. |
V1.1 | 2012-04-23 | Fix broken formatting. |
V1.2 | 2012-04-25 | Fix typos, add more links, fix HTML unicode literals in the text, update some summaries. |
V1.3 | 2012-08-21 | Fix \x... description. Make some tables of class ArticleTable (looks a bit nicer) |