Doing UTF-8 in Windows, Part 2 - Tolower or Not to Lower

Mircea Neacsu

18 Feb 2020

Case folding for UTF-8 code

This article continues the series on UTF-8 in Windows and shows a multi-lingual implementation for functions tolower and toupper.

Cyrillic letters in cursive - source: Wikipedia

Introduction

In the good old times when everybody spoke English and ANSI was the only game in town ^[1], converting from upper case to lower case involved only flipping a bit. Just OR any upper case ASCII letter with 0x20 and you go from "A" (0x41) to "a" (0x61). Life was simpler and all were happy.
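As a reminder of how the ASCII trick works, here is a minimal sketch. The function names are made up for this illustration and are not part of the library discussed in this article.

```cpp
// Range checks: the bit trick is only valid for the 26 unaccented ASCII letters
constexpr bool is_ascii_upper (char c) { return c >= 'A' && c <= 'Z'; }
constexpr bool is_ascii_lower (char c) { return c >= 'a' && c <= 'z'; }

// Setting bit 5 (0x20) turns an upper case ASCII letter into its lower case form
constexpr char ascii_tolower (char c)
{
  return is_ascii_upper (c) ? static_cast<char> (c | 0x20) : c;
}

// Clearing bit 5 goes the other way
constexpr char ascii_toupper (char c)
{
  return is_ascii_lower (c) ? static_cast<char> (c & ~0x20) : c;
}
```

The range checks matter: without them, OR-ing "@" (0x40) with 0x20 would silently turn it into "`" (0x60).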

When you start playing with multi-lingual character sets, changing between upper case and lower case is more complicated. Just as an example, my name should be written "Neacșu" and in all caps that makes "NEACȘU". The "Latin small letter s with cedilla", as Unicode calls it, has code 0x15F and the "Latin capital letter S with cedilla" has code 0x15E. It happens to be just one bit of difference but it is another bit.

The latest version of the code described in this article can always be downloaded from my GitHub site.

Background

Knowing which letters are upper case and which are lower case in every alphabet of every language in the world is a daunting task for any programmer. Luckily, the Unicode Consortium, the body that administers Unicode, has taken a break from creating emojis and created a long list of case folding codes. You can download the list from this link. The document describes four types of case folding: common, full, simple and Turkic. My functions implement only the common and simple cases.
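To make the selection of common and simple entries concrete, here is a rough sketch of how one line of the case folding file could be parsed. It assumes the "code; status; mapping; # name" layout used by the Unicode file. The struct and function names are hypothetical and this is not the code of the actual table generator.

```cpp
#include <sstream>
#include <string>

// One usable record from the case folding file (hypothetical type)
struct fold_entry
{
  char32_t code;    // code point being folded
  char32_t folded;  // code point it folds to
};

// Parse a line such as "0041; C; 0061; # LATIN CAPITAL LETTER A".
// Returns true and fills 'out' only for status C (common) or S (simple).
bool parse_fold_line (const std::string& line, fold_entry& out)
{
  if (line.empty () || line[0] == '#')
    return false;                       // blank line or comment

  std::istringstream in (line);
  std::string code, status, mapping;
  if (!std::getline (in, code, ';')
   || !std::getline (in, status, ';')
   || !std::getline (in, mapping, ';'))
    return false;                       // not a data line

  std::size_t pos = status.find_first_not_of (' ');
  if (pos == std::string::npos)
    return false;
  char st = status[pos];
  if (st != 'C' && st != 'S')
    return false;                       // skip full (F) and Turkic (T) entries

  // std::stoul skips the leading blank and throws on malformed hex
  out.code = static_cast<char32_t> (std::stoul (code, nullptr, 16));
  out.folded = static_cast<char32_t> (std::stoul (mapping, nullptr, 16));
  return true;
}
```

Full (F) entries are skipped because they can map one code to several, which would not fit a one-to-one table.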

Implementation

The basic idea of the implementation is quite simple. Convert the Unicode document into two tables of equal size, one with the upper case letters and the other with the matching lower case ones. The upper case table is sorted to allow for binary searching. If a code is found in the upper case table, it is replaced with the code at the same index in the lower case table.

Here is the code for the tolower function:

C++

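#include <algorithm>  // std::lower_bound
#include <iterator>   // std::begin, std::end
#include <string>

//definition of 'u2l' and 'lc' tables
#include "uppertab.c"

std::string tolower (const std::string& str)
{
  std::u32string wstr = runes (str);      // UTF-8 to UTF-32
  for (auto ptr = wstr.begin (); ptr != wstr.end (); ++ptr)
  {
    // binary search in the sorted upper case table
    const char32_t *f = std::lower_bound (std::begin (u2l), std::end (u2l), *ptr);
    if (f != std::end (u2l) && *f == *ptr)
      *ptr = lc[f - u2l];                 // same index in the lower case table
  }
  return narrow (wstr);                   // UTF-32 back to UTF-8
}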

The "uppertab.c" is generated by a short program that reads the Unicode case folding file and produces the two tables, u2l and lc. Here is a short sample of it:

C++

//Upper case table
static char32_t u2l [1411] = { 
  0x00041, //  LATIN CAPITAL LETTER A
  0x00042, //  LATIN CAPITAL LETTER B
  0x00043, //  LATIN CAPITAL LETTER C
  0x00044, //  LATIN CAPITAL LETTER D
....
  0x1e91f, //  ADLAM CAPITAL LETTER ZAL
  0x1e920, //  ADLAM CAPITAL LETTER KPO
  0x1e921};//  ADLAM CAPITAL LETTER SHA

//Lower case equivalents
static char32_t lc [1411] = { 
  0x00061, 0x00062, 0x00063, 0x00064, 0x00065, 0x00066, 0x00067, 0x00068, 
  0x00069, 0x0006a, 0x0006b, 0x0006c, 0x0006d, 0x0006e, 0x0006f, 0x00070, 
... 
  0x1e939, 0x1e93a, 0x1e93b, 0x1e93c, 0x1e93d, 0x1e93e, 0x1e93f, 0x1e940, 
  0x1e941, 0x1e942, 0x1e943};

The input string is converted to UTF-32 by calling the utf8::runes function. Each character is then searched for in the u2l table and, if found, replaced with the matching character from the lc table. Searching is done with the lower_bound function, which performs a binary search. The resulting string is converted back to UTF-8, and that is that.

The toupper function is very similar except that it uses the tables l2u and uc defined in the "lowertab.c" file.
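To illustrate the reverse lookup, here is a minimal sketch that works directly on UTF-32 strings. The two four-entry tables and the function name are stand-ins invented for this example. The real tables are generated and have 1411 entries each, and the real function takes and returns UTF-8.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Tiny stand-in tables: lower case codes sorted ascending, and their
// upper case equivalents at the same index
static const char32_t lower_codes[] = { 0x0061, 0x0062, 0x015F, 0x03B1 };
static const char32_t upper_codes[] = { 0x0041, 0x0042, 0x015E, 0x0391 };

std::u32string toupper_sketch (std::u32string s)
{
  for (char32_t& ch : s)
  {
    // binary search for the code in the sorted lower case table
    const char32_t *f = std::lower_bound (std::begin (lower_codes),
                                          std::end (lower_codes), ch);
    if (f != std::end (lower_codes) && *f == ch)
      ch = upper_codes[f - lower_codes];  // replace with the code at the same index
  }
  return s;
}
```

Codes that are not in the table, such as digits or punctuation, pass through unchanged.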

For the sake of completeness, both functions also have an in-place variant:

C++

void tolower (std::string& str);
void toupper (std::string& str);

Another function is utf8::icompare that performs case-insensitive string comparison:

C++

int icompare (const std::string& s1, const std::string& s2);

It behaves just like std::string::compare, returning a negative value if s1, converted to lower case, precedes s2, converted to lower case, or a positive value if it follows it. If the lower case versions of the two strings are equal, the function returns 0.
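The contract described above can be sketched in a few lines. To stay self-contained, this example uses an ASCII-only stand-in for utf8::tolower. Both helper names are hypothetical and the library's actual implementation may well compare the strings differently, for instance one code point at a time.

```cpp
#include <string>

// ASCII-only stand-in for utf8::tolower (assumption made for this sketch)
static std::string ascii_lower (std::string s)
{
  for (char& c : s)
    if (c >= 'A' && c <= 'Z')
      c = static_cast<char> (c | 0x20);
  return s;
}

// Lower-case both strings, then defer to std::string::compare
int icompare_sketch (const std::string& s1, const std::string& s2)
{
  return ascii_lower (s1).compare (ascii_lower (s2));
}
```

Comparing the UTF-8 bytes after conversion gives the same ordering as comparing code points, because UTF-8 preserves code point order.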

The table generation program "gen_casetab.cpp" is also straightforward. It reads and parses the case folding text file and produces first the "uppertab.c" file and then the "lowertab.c" file. In between, it has to re-order the table that was sorted by upper case codes to make it ordered by lower case codes.
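The re-ordering step might look roughly like the sketch below. The type and function names are hypothetical. Dropping every upper case code after the first one that maps to the same lower case code is an assumption about where the generator removes duplicates, not a description of the actual "gen_casetab.cpp".

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// (upper case code, lower case code)
using case_pair = std::pair<char32_t, char32_t>;

// Input is sorted by upper case code; output is sorted by lower case code
// with one entry per lower case code.
std::vector<case_pair> order_by_lower (std::vector<case_pair> pairs)
{
  // stable sort keeps the original (upper case) order among equal lower codes
  std::stable_sort (pairs.begin (), pairs.end (),
    [] (const case_pair& a, const case_pair& b) { return a.second < b.second; });

  // keep only the first upper case code for each lower case code
  pairs.erase (
    std::unique (pairs.begin (), pairs.end (),
      [] (const case_pair& a, const case_pair& b) { return a.second == b.second; }),
    pairs.end ());
  return pairs;
}
```

A unique, sorted lower case column is what makes a binary search in the second table possible.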

Using the Code

All functions are in the utf8 namespace. Because these and many other functions in this namespace have the same name as standard C/C++ functions, my recommendation is not to use a "using" directive. Below is a short example showing how to call these functions:

C++

#include <utf8.h>
...
std::string all_caps = utf8::toupper (u8"Neacșu"); // all_caps should be "NEACȘU"
std::string greek {u8"αλφάβητο"};
utf8::toupper (greek);                             // greek should now be "ΑΛΦΆΒΗΤΟ"

Points of Interest

One interesting point is that more than one upper case code can be folded into the same lower case code. In my implementation, the second code is dropped from the table. It seems to work OK but the whole idea is somewhat troubling: it means there is no unique way to convert a lower case string to its equivalent upper case one. Personally, I think there are some strange choices that have been baked into the Unicode Consortium table. For instance, both the upper case Latin letter K and the degrees Kelvin symbol (K) are folded into the lower case letter k^[2].

Also note that these functions are not so lightweight: each pair of conversion tables takes over 10 KB (two tables of 1411 four-byte entries each), but I guess this is the price you have to pay if you need multi-lingual case awareness.

History

17th February, 2020 - Initial version

Footnotes

[1] There were never such good times: it just happened that most of the nerds who worked in the field spoke English, ANSI code was called ASCII and was used by everyone except for a big blue company that insisted on using EBCDIC. This was so wasteful: EBCDIC used 8 bits for what ASCII could do with 7!

[2] I would argue that the degrees Kelvin symbol does not have a lower case form and my science teacher would probably agree with me.

License

This article, along with any associated source code and files, is licensed under The MIT License