This article continues the series on UTF-8 in Windows and shows a multi-lingual implementation for functions tolower and toupper.
Introduction
In the good old times when everybody spoke English and ANSI was the only game in town [1], converting from upper case to lower case involved only flipping a bit. Just OR any upper case ASCII letter with 0x20 and you go from "A" (0x41) to "a" (0x61). Life was simpler and all were happy.
When you start playing with multi-lingual character sets, changing between upper case and lower case is more complicated. Just as an example, my name should be written "Neacșu" and in all caps that makes "NEACȘU". The "Latin capital letter S with cedilla", as Unicode calls it, has code 0x15E and the "Latin small letter s with cedilla" has code 0x15F. It happens to be just one bit of difference, but it is a different bit.
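A small sketch makes the point. The helper name `or_0x20` is made up for this illustration and is not part of the library; the checks only restate the arithmetic from the two paragraphs above.

```cpp
// Illustration only: the classic ASCII trick of setting bit 5 (0x20).
constexpr char32_t or_0x20 (char32_t c)
{
  return c | 0x20;
}

// Works for ASCII letters: 0x41 ("A") becomes 0x61 ("a")
static_assert (or_0x20 (0x41) == 0x61, "ASCII upper to lower");

// The two "s with cedilla" codes differ in bit 0, not in bit 5...
static_assert ((0x15E ^ 0x15F) == 0x01, "only the lowest bit differs");

// ...so the ASCII trick lands on 0x17E, which is a different letter altogether
static_assert (or_0x20 (0x15E) == 0x17E, "not the code we wanted");
```

Outside the ASCII range there is no single bit to flip, which is why a lookup table is needed.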
This article continues the series on UTF-8 in Windows and shows a multi-lingual implementation for functions tolower and toupper. The latest version of the code can always be downloaded from my GitHub site.
Background
Knowing which letters are upper case and which are lower case in all the alphabets of all the languages in the world is a daunting task for any programmer. Luckily, the Unicode Consortium, the body that administers Unicode, has taken a break from creating emojis and created a long list of case folding codes. You can download the list from this link. The document describes four types of case folding: common, full, simple and Turkic. My functions implement only the common and simple cases.
Implementation
The basic idea of the implementation is quite simple. Convert the Unicode document into two tables of equal size, one with the upper case letters and the other one with the matching lower case ones. The upper case table is sorted to allow for binary searching. If a code is found in the upper case table, it is replaced with the code at the same index in the lower case table.
Here is the code for the tolower function:
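#include "uppertab.c"

std::string tolower (const std::string& str)
{
  // Work on UTF-32 so that each element is one code point
  u32string wstr = runes (str);
  for (auto ptr = wstr.begin (); ptr != wstr.end (); ++ptr)
  {
    // Binary search in the sorted table of upper case codes
    char32_t *f = lower_bound (begin (u2l), end (u2l), *ptr);
    if (f != end (u2l) && *f == *ptr)
      *ptr = lc[f - u2l];   // same index in the lower case table
  }
  return narrow (wstr);
}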
The "uppertab.c" file is generated by a short program that reads the Unicode case folding file and produces the two tables, u2l and lc. Here is a short sample of it:
static char32_t u2l [1411] = {
0x00041, 0x00042, 0x00043, 0x00044, ....
0x1e91f, 0x1e920, 0x1e921};
static char32_t lc [1411] = {
0x00061, 0x00062, 0x00063, 0x00064, 0x00065, 0x00066, 0x00067, 0x00068,
0x00069, 0x0006a, 0x0006b, 0x0006c, 0x0006d, 0x0006e, 0x0006f, 0x00070,
...
0x1e939, 0x1e93a, 0x1e93b, 0x1e93c, 0x1e93d, 0x1e93e, 0x1e93f, 0x1e940,
0x1e941, 0x1e942, 0x1e943};
The input string is converted to UTF-32 by calling the utf8::runes function. Each character is then searched in the u2l table and, if found, it is replaced with the matching character from the lc table. Searching is done using the lower_bound function that performs a binary search. The resulting string is converted back to UTF-8 and that is that.
The toupper function is very similar except that it uses the tables l2c and uc defined in the "lowertab.c" file.
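To make the symmetry concrete, here is a minimal sketch of the upper case direction. It is not the library code: the function `toupper32` and the tables `demo_lower` and `demo_upper` are hypothetical names with only six hand-picked entries, and the function works directly on UTF-32 so that the sketch does not depend on the conversion functions.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Tiny stand-in tables for illustration only.
// Lower case codes, sorted ascending so they can be binary-searched.
static const char32_t demo_lower[] = { U'a', U'b', U's', 0x00E9, 0x0219, 0x03B1 };
// Matching upper case codes, stored at the same indexes.
static const char32_t demo_upper[] = { U'A', U'B', U'S', 0x00C9, 0x0218, 0x0391 };

// Hypothetical helper: upper-case a UTF-32 string using the two tables
std::u32string toupper32 (std::u32string s)
{
  for (auto& ch : s)
  {
    // Binary search for the code in the sorted lower case table
    auto f = std::lower_bound (std::begin (demo_lower), std::end (demo_lower), ch);
    if (f != std::end (demo_lower) && *f == ch)
      ch = demo_upper[f - std::begin (demo_lower)];   // same index, other table
    // codes that are not in the table are left unchanged
  }
  return s;
}
```

Calling `toupper32 (U"abs")` with these tables gives `U"ABS"`, while a character that is missing from the tables, such as `U'z'`, passes through untouched.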
For the sake of completeness, both functions also have an in-place variant:
void tolower (std::string& str);
void toupper (std::string& str);
Another function is utf8::icompare that performs case-insensitive string comparison:
int icompare (const std::string& s1, const std::string& s2);
It behaves just like std::string::compare, returning a negative value if s1, converted to lower case, precedes s2 converted to lower case, or a positive value if it follows it. If the lower case versions of the two strings are equal, the function returns 0.
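The comparison logic can be sketched as follows. This is an assumption about the general shape, not the library implementation: `icompare32` and `fold` are hypothetical names, the function takes UTF-32 strings to stay self-contained, and `fold` handles only ASCII where the real code would use the table lookup.

```cpp
#include <string>

// Stand-in for the table lookup: folds ASCII letters only
static char32_t fold (char32_t c)
{
  return (c >= U'A' && c <= U'Z') ? static_cast<char32_t> (c + 0x20) : c;
}

// Hypothetical helper: case-insensitive comparison of two UTF-32 strings.
// Returns a negative value, 0 or a positive value, like std::string::compare.
int icompare32 (const std::u32string& s1, const std::u32string& s2)
{
  auto p1 = s1.begin ();
  auto p2 = s2.begin ();
  while (p1 != s1.end () && p2 != s2.end ())
  {
    char32_t c1 = fold (*p1++);
    char32_t c2 = fold (*p2++);
    if (c1 != c2)
      return (c1 < c2) ? -1 : 1;   // first difference decides
  }
  // One string is a prefix of the other: the shorter one comes first
  if (p1 != s1.end ())
    return 1;
  if (p2 != s2.end ())
    return -1;
  return 0;
}
```

Folding one character at a time lets the comparison stop at the first difference instead of converting both strings in full.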
The table generation program "gen_casetab.cpp" is also straightforward. It reads and parses the case folding text file and produces first the "uppertab.c" file and then the "lowertab.c" file. In between, it has to re-order the table that was sorted by upper case codes to make it ordered by lower case codes.
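For readers curious about the parsing step, here is a rough sketch. It is not the actual "gen_casetab.cpp": the names `case_pair` and `parse_folding` are made up, and the sketch assumes data lines shaped like `0041; C; 0061; # LATIN CAPITAL LETTER A`, with a status letter of C, F, S or T in the second field.

```cpp
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using case_pair = std::pair<char32_t, char32_t>;   // (upper, lower)

// Hypothetical parser: collect the common (C) and simple (S) foldings
std::vector<case_pair> parse_folding (std::istream& in)
{
  std::vector<case_pair> pairs;
  std::string line;
  while (std::getline (in, line))
  {
    if (line.empty () || line[0] == '#')
      continue;                        // blank or comment line

    std::istringstream fields (line);
    std::string code, status, mapping;
    if (!std::getline (fields, code, ';')
     || !std::getline (fields, status, ';')
     || !std::getline (fields, mapping, ';'))
      continue;                        // not a data line

    // Skip full (F) and Turkic (T) foldings
    if (status.find_first_of ("CS") == std::string::npos)
      continue;

    // strtoul skips the leading blank and stops after the hex digits
    pairs.emplace_back (
      static_cast<char32_t> (std::strtoul (code.c_str (), nullptr, 16)),
      static_cast<char32_t> (std::strtoul (mapping.c_str (), nullptr, 16)));
  }
  return pairs;
}
```

The resulting vector can be sorted by its first member to emit the upper case table, then re-sorted by its second member for the lower case one.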
Using the Code
All functions are in the utf8 namespace. Because these and many other functions in this namespace have the same name as standard C/C++ functions, my recommendation is not to use a "using" directive. Below is a short example showing how to call these functions:
#include <utf8.h>
...
string all_caps = utf8::toupper (u8"Neacșu");   // returns a converted copy
string greek {u8"αλφάβητο"};
utf8::toupper (greek);                          // in-place variant
Points of Interest
One interesting point is that more than one upper case code can be folded into the same lower case code. In my implementation, the second code is dropped from the table. It seems to work OK but the whole idea is somewhat troubling: it means there is no unique way to convert a lower case string to its equivalent upper case one. Personally, I think there are some strange choices that have been baked into the Unicode Consortium table. For instance, both the upper case Latin letter K (code 0x4B) and the degrees Kelvin symbol (code 0x212A) are folded into the lower case letter k (code 0x6B)[2].
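One way to drop the extra codes while re-ordering the table is sketched below. This is a guess at the approach rather than the actual generator code: `invert_table` and `case_pair` are hypothetical names, the input is assumed to be sorted by upper case code, and "second" is taken to mean the entry with the larger upper case code.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using case_pair = std::pair<char32_t, char32_t>;   // (upper, lower)

// Hypothetical helper: re-order pairs by lower case code and keep only
// the first upper case code found for each lower case one.
std::vector<case_pair> invert_table (std::vector<case_pair> pairs)
{
  // stable_sort keeps the incoming order among entries with equal lower codes
  std::stable_sort (pairs.begin (), pairs.end (),
    [] (const case_pair& a, const case_pair& b) { return a.second < b.second; });

  // unique keeps the first entry of each run of equal lower case codes
  auto last = std::unique (pairs.begin (), pairs.end (),
    [] (const case_pair& a, const case_pair& b) { return a.second == b.second; });
  pairs.erase (last, pairs.end ());
  return pairs;
}
```

With the pairs (0x4B, 0x6B), (0x53, 0x73) and (0x212A, 0x6B) as input, the Kelvin entry is discarded and 0x6B maps back to the plain letter K.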
Also note that these functions are not so lightweight: each pair of conversion tables takes over 10 KB (the two 1411-entry tables shown earlier come to about 11 KB with 4-byte char32_t values), but I guess this is the price you have to pay if you need multi-lingual case awareness.
History
- 17th February, 2020 - Initial version
Footnotes
[1] There were never such good times: it just happened that most of the nerds who worked in the field spoke English, ANSI code was called ASCII and was used by everyone except for a big blue company who insisted on using EBCDIC. This was so wasteful: EBCDIC used 8 bits for what ASCII could do with 7!
[2] I would argue that the degrees Kelvin symbol does not have a lower case form and my science teacher would probably agree with me.