This article discusses various character data types, their purpose, history and scenarios of use.
Introduction
The topic of this article is something that is considered pretty basic in most programming languages: character and string types. However, this is not a beginner-level article on the usage of character data in C and C++. My goal here is to shed some light on various character data types, their purpose, history, and scenarios of use.
People who might benefit from reading this text include:
- programmers with experience in other programming languages who are interested in learning more about C and C++
- C and C++ programmers of various levels of experience who are not 100% certain when or how to use different character and string types
- programming enthusiasts who may find it interesting to learn some of the history behind the development of character and string data types in C and C++
The Original Char Data Type
The original char type was invented at New Jersey-based Bell Labs in 1971, when Dennis Ritchie started extending the B programming language to add types. char was the very first type he added to the language, which was first called NB ("new B") and was later renamed to C.
char is a type that represents the smallest addressable unit of the machine that can contain the basic character set. In addition to character data, it is often used as the "byte" type - for instance, binary data is often declared as char arrays.
For developers who are used to more recent programming languages such as Java or C#, the C version of char looks too flexible and under-specified. For instance, in Java, the char type is a primitive integral type, guaranteed to hold 16-bit unsigned integers representing UTF-16 code units. In contrast, the C/C++ char type:
- has the size of one byte, but the number of bits is only guaranteed to be at least 8. There are, or at least used to be, real-life platforms where char had more than 8 bits - as many as 32;
- can be either signed or unsigned. The default usually depends on the platform, and it can typically be changed via a compiler flag. signed char and unsigned char also exist and are guaranteed to be signed and unsigned respectively, but they are both distinct types from char;
- can contain a character in any single-byte character set, or one byte of a string encoded in a multi-byte character set.
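To check these properties on a particular platform, a small snippet like the following can query them via <limits.h> (a sketch; the exact output is implementation-defined):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* CHAR_BIT is guaranteed to be at least 8, but may be larger */
    printf("bits in a char: %d\n", CHAR_BIT);
    /* CHAR_MIN is negative if plain char is signed, zero if it is unsigned */
    printf("plain char is %s\n", CHAR_MIN < 0 ? "signed" : "unsigned");
    return 0;
}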
To illustrate the previous points - in Java:
char a = (char)65;
char b = (char)174;
char c = (char)1046;
We know that a contains a 16-bit unsigned integral value of 65, which represents the Latin capital letter A; b is a 16-bit unsigned integral value of 174, which represents the registered trademark sign; c is a 16-bit unsigned integral value of 1046, which represents the Cyrillic capital letter zhe (Ж).
In C or C++, we have no such guarantees.
char a = (char)65;
a would have at least 8 bits, so we can be certain it will have a value of 65, regardless of the "signedness" of char. What that value represents is open to interpretation. For most ASCII-derived character sets, a would represent the Latin capital letter A, just like in Java. However, on an IBM mainframe system which uses EBCDIC encoding, it will not represent anything meaningful.
char b = (char)174;
Assuming char is 8 bits wide, b could have a value of either 174, or it would overflow into something like -82, depending on whether the char type is signed or unsigned. Furthermore, the actual character it represents will differ even between various ASCII-derived character sets. For instance, in ISO 8859-1 (Latin-1), it represents the registered trademark sign (just like in Java); in ISO 8859-2 (Latin-2), it represents "Ž" (Z with caron); in ISO 8859-5 (Latin/Cyrillic), it represents "Ў" (short U), etc.
char c = (char)1046;
c is all but guaranteed to overflow on modern hardware architectures, and its value will be meaningless.
In practice, we rarely assign integer values to chars; we use character literals instead. For instance:
char a = 'A';
That will work and assign an integral value to a. Now, can you guess what the actual numerical value stored in the variable will be? That depends on the compiler, its options, and the encoding of the source file. For most platforms in use nowadays, the value stored in a will be 65. If you compile the code on an IBM mainframe, chances are it will be 193, but that also depends on the compiler settings.
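An easy way to check on your own platform is to print the value directly (a trivial sketch; the values in the comment are typical, not guaranteed):

#include <stdio.h>

int main(void)
{
    printf("%d\n", 'A');   /* typically 65 on ASCII-derived systems, 193 on EBCDIC */
    return 0;
}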
How about:
char b = '®';
Depending on the compiler and the source file encoding, it may end up successfully compiling and storing something like 174 into b, or it may cause a compiler error. For instance, clang 14.0, which I use on Linux, expects UTF-8 source files and reports:
error: character too large for enclosing character literal type
Something like:
char c = 'Ж';
could theoretically work if the source file was saved in the ISO 8859-5 code page *and* the compiler was set up to use that encoding. Clang again fails with the same error, for the same reason.
Another interesting characteristic of character literals is that in C their type is int - not char. For instance, sizeof('a') is likely to return something like 4. In C++, the type of the literal is char, and sizeof('a') is guaranteed to be 1.
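A single source file can demonstrate the difference, assuming a typical platform where int is four bytes: compiled as C it will most likely print 4, while compiled as C++ it must print 1.

#include <stdio.h>

int main(void)
{
    /* In C, 'a' has type int, so this prints sizeof(int) - typically 4;
       in C++, 'a' has type char, so this is guaranteed to print 1 */
    printf("%zu\n", sizeof('a'));
    return 0;
}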
C-Style Strings
Obviously, character data is most often used as strings of characters rather than as individual characters. In C, a string is an array of characters terminated by a character with the value 0. There are some C libraries that implement so-called "Pascal-style" strings, where the character array is prefixed by its length, but they are rare, as the language itself favors null-terminated strings. For instance, the type of the string literal "abc" will be char[4] (in C; in C++ it is const char[4]) - it includes space for the trailing zero.
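A quick sketch showing the difference between the array size, which includes the terminating zero, and the length reported by strlen(), which does not count it:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* sizeof counts the trailing zero, strlen stops right before it */
    printf("%zu %zu\n", sizeof("abc"), strlen("abc"));   /* prints: 4 3 */
    return 0;
}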
In the simplest case, a C-style string will be encoded with a single-byte character set, and each char then corresponds to a "letter" that can be displayed on the screen. Obviously, a char array does not carry information about the character set, so it has to be supplied separately for the text to be displayed correctly.
C-style strings can work well with multi-byte encodings as long as they do not require embedded zeros. In that case, a single char contains a byte that corresponds either to an individual character or to a part of a multi-byte encoded one.
The C Standard Library assumes strings are zero-terminated. For instance, a naive implementation of the strlen() function could look like this:
size_t strlen(const char *str)
{
    const char *c = str;
    while (*c != 0)
        ++c;
    return (c - str);
}
Let's look at how the above-mentioned characteristics of the char type affect strings of data. If we look at the strlen example, we can see that:
- it is not affected by the size of char. Regardless of the number of bits, it will correctly count until it hits the 0;
- it is not affected by the sign of char. It will return the same value regardless of whether char is signed or unsigned;
- depending on the character set, the value returned by strlen may or may not be what the caller expected. Namely, the function always returns the number of chars in the array, and that is usually the number of characters if the character set is single-byte. For multi-byte character sets, the number of chars returned will often differ from the user-perceived number of characters, as the sketch below illustrates.
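For instance, assuming the source file and the execution character set are both UTF-8, the following sketch reports the number of bytes rather than the number of user-perceived characters:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "héllo";      /* 'é' occupies two bytes in UTF-8 */
    printf("%zu\n", strlen(s));   /* prints 6, although a user sees 5 characters */
    return 0;
}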
C++ String Class(es)
The C++ Standard Library provides the class template std::basic_string<>, which can be instantiated for various character types. It is declared in the <string> header, along with typedefs for the standard character type instantiations. The typedef for the char instantiation, std::basic_string<char>, is std::string.
Historically, the string class predates both templates and namespaces in C++, and before the C++ standard was adopted in 1998, it was just one of many string classes in use. In its early days, the string class was often implemented with copy-on-write semantics, which led to various problems, especially in multi-threaded environments, and that approach was eventually prohibited by the C++11 standard.
Nowadays, std::basic_string is widely adopted and used. Its implementations usually contain the "small string optimization" - a technique where a small buffer inside the string object itself is used for short strings, avoiding a heap allocation. With the adoption of move semantics, strings play well with the C++ containers without introducing unnecessary copies. Using char* for strings in modern C++ rarely makes sense, except at the API level.
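As a short illustration (a sketch, not tied to any particular implementation), std::string manages its own storage, grows as needed, and interoperates with C APIs through c_str():

#include <cstdio>
#include <string>

int main()
{
    std::string s = "Hello";
    s += ", world";   // the string grows automatically; no manual memory management needed
    std::printf("%s (%zu characters)\n", s.c_str(), s.size());
    return 0;
}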
wchar_t
In the late 1980s, an initiative was started to introduce a universal character set that would replace all the legacy character encodings based on 8-bit code units. The idea was to extend the popular ASCII character set from 7 to 16 bits, which was considered enough to cover the characters of all world languages. The new encoding standard was called Unicode, and the first version was published in late 1991.
To support the new, "wide" characters, a new type was added to the C90 standard - wchar_t. It was defined as "an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales". The "w" in wchar_t means "wide", to emphasize that wchar_t is generally (but not necessarily!) wider than char. The "_t" part comes from the fact that in C wchar_t is not a distinct compiler type but a typedef to another integral type, such as unsigned short. wchar_t is declared in the wchar.h header, along with various functions for working with wide strings, such as wcslen(), wprintf(), etc.
In pre-standard C++, wchar_t also started as a typedef, but it was soon decided that it had to be a distinct compiler type. Even after standardization, it remained tied to an "underlying type", which is one of the other integral types.
In practice (but not by the letter of any standard), wchar_t is always unsigned and comes in two sizes:
- on Microsoft Windows and IBM AIX, it is 16 bits
- on virtually every other platform, it is 32 bits
This difference in sizes is an unfortunate historical accident - the early adopters of Unicode went with the 16-bit size, which was compatible with the Unicode 1.0 standard. After later versions of the Unicode standard introduced supplementary planes, wchar_t ended up being used for the UTF-16 encoding form on Windows and AIX, and for the UTF-32 encoding form on the other platforms, where Unicode was adopted later.
Along with wchar_t, new literals for wide characters were introduced: L'' for a wide character and L"" for a wide string.
The C++ Standard defines the class std::wstring as the instantiation of the basic_string class template that uses wchar_t as the character type.
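A minimal sketch of wide-character usage; the sizes printed will differ between platforms, as discussed above:

#include <cstdio>
#include <cwchar>
#include <string>

int main()
{
    const wchar_t *ws = L"wide";
    std::wstring w = ws;
    // wcslen() and size() count wchar_t units; sizeof(wchar_t) is 2 on Windows/AIX, 4 elsewhere
    std::printf("%zu %zu %zu\n", std::wcslen(ws), w.size(), sizeof(wchar_t));
    return 0;
}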
C11 / C++11 Character Types
In C11, two new character types were introduced: char16_t and char32_t. Both are declared in the <uchar.h> header and are typedefs to unsigned integral types. The former is used to store 16-bit characters and has to be at least 16 bits wide; the latter is used for 32-bit characters and has to be at least 32 bits wide.
Just like with wchar_t, new literals were introduced: u'' for char16_t and U'' for char32_t. Unlike with wchar_t, there are no new string functions equivalent to the ones for char that work with the new types; for instance, there is no strlen() for char16_t.
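If such a function is needed, it has to be written by hand. A minimal sketch of such a helper (the name c16len is mine, not part of any standard) mirrors the strlen() implementation shown earlier:

#include <cstddef>

// Counts char16_t code units up to (but not including) the terminating zero.
// Like strlen(), it counts code units, not user-perceived characters.
std::size_t c16len(const char16_t *str)
{
    const char16_t *c = str;
    while (*c != 0)
        ++c;
    return static_cast<std::size_t>(c - str);
}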
The third new literal prefix introduced is u8. It works with the old char type and is used for the UTF-8 encoding form. (In C11 and C++11, the prefix applies only to string literals, u8""; u8'' character literals were added later, in C++17 and C23.)
With C11, wchar_t becomes basically useless (although not officially deprecated). The character types are meant to be used in the following scenarios:
- char for the UTF-8 Unicode encoding form, for various single-byte and multi-byte legacy encodings, and as a byte type
- char16_t for the UTF-16 Unicode encoding form
- char32_t for the UTF-32 Unicode encoding form
Unsurprisingly, C++11 introduced two identically named character types. Unlike in C, they are distinct built-in types (and new keywords) rather than typedefs.
Two new typedefs for std::basic_string instantiations with the two new types were introduced in C++11:
- std::u16string - a typedef for std::basic_string<char16_t>
- std::u32string - a typedef for std::basic_string<char32_t>
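Together with the u"" and U"" string literal prefixes, they can be used like this (a small sketch):

#include <string>

int main()
{
    std::u16string s16 = u"UTF-16 text";
    std::u32string s32 = U"UTF-32 text";
    // size() returns the number of code units stored, not user-perceived characters
    return (s16.size() == 11 && s32.size() == 11) ? 0 : 1;
}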
C++20 char8_t
C++20 introduces a new character type specifically for UTF-8 encoded character data: char8_t. It has the same size and signedness as unsigned char but is a distinct type. The u8'' character literal and the u8"" string literal have been changed to produce the new type. A new typedef, std::u8string, for std::basic_string<char8_t> was introduced.
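A brief sketch of the C++20 usage (it requires a compiler and standard library with char8_t support):

#include <string>

int main()
{
    const char8_t *p = u8"UTF-8 text";   // in C++20, u8"" yields an array of const char8_t
    std::u8string s = p;
    // all code units here are ASCII, so the code unit count matches the character count
    return s.size() == 10 ? 0 : 1;
}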
The upcoming C standard (probably C23) includes a proposal for char8_t, which is a typedef to unsigned char.
Conclusion
C and C++ character and string types reflect the long history of the languages.
The original char type is still the most widely used. In new code, it should be used for legacy single-byte and multi-byte encodings and for non-character binary data. It also works well with UTF-8 encoded strings and can be used for them, especially with compilers that do not support the char8_t type.
wchar_t turned out to be a victim of changing Unicode specifications. There is no good reason to use it in new code today, even with ancient compilers.
char16_t should be used for UTF-16 encoded strings in places where various "widechar" typedefs have been used in the past.
char32_t should be used for UTF-32 encoded strings. Although it is very rare to see strings encoded as UTF-32 due to its memory inefficiency, individual code points are frequently UTF-32 encoded, and char32_t is the ideal type for that purpose.
char8_t has only recently been introduced to C++ and only proposed for C. It is not clear whether its advantages over the plain old char for encoding UTF-8 strings will be enough for it to see widespread use.
History
- 25th November, 2022: Initial version