First, you did not show how your objects named
char…
are declared. You need to do all the calculations on 32-bit unsigned integers; anything narrower is not wide enough to represent a
code point beyond the BMP.
I did not check the UTF-16 part, but at least one thing is missing: there should be two different branches, one for UTF-16LE and another for UTF-16BE. In each case, you first check whether you are reading a
surrogate pair and, if so, calculate your internal representation of the
code point out of the pair, in the form of an unsigned 32-bit integer. The byte order applies to every 16-bit word, including both words of a surrogate pair: for big endian, the two bytes of each word come in the opposite order compared to little endian. Any other code point is encoded as a single 16-bit word, and the unsigned integer interpretation of that word is arithmetically equal to the code point value. Please see:
https://en.wikipedia.org/wiki/Endianness,
https://en.wikipedia.org/wiki/UTF-16.
The goal of the first stage is to interpret the UTF-16 encoding character by character, and each character should be represented as a 32-bit unsigned value which is arithmetically equal to the code point. Here, you need to realize that Unicode code points are a mathematical abstraction representing cardinal values; they are abstracted from the bitwise representation of the data and from any kind of computer representation. They are just abstract mathematical values.
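To make the first stage more concrete, here is a minimal sketch in Python. The function name utf16_to_code_points and the little_endian parameter are my own choices for illustration, not anything from your code. It assumes the BOM (if any) is already removed and the input is well-formed; it does not validate lone surrogates. Python integers have unlimited size, so the 32-bit concern does not show up here; in C or C++ you would hold each result in something like uint32_t.

```python
import struct

def utf16_to_code_points(data, little_endian):
    """Turn UTF-16 bytes (no BOM, assumed well-formed) into a list of code points."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must contain an even number of bytes")
    # "<" selects little endian, ">" big endian; "H" is one unsigned 16-bit word.
    fmt = ("<" if little_endian else ">") + "H" * (len(data) // 2)
    words = struct.unpack(fmt, data)

    code_points = []
    i = 0
    while i < len(words):
        w = words[i]
        is_first = 0xD800 <= w <= 0xDBFF
        has_second = i + 1 < len(words) and 0xDC00 <= words[i + 1] <= 0xDFFF
        if is_first and has_second:
            # Surrogate pair: 10 payload bits from each word, plus the 0x10000 offset.
            cp = 0x10000 + ((w - 0xD800) << 10) + (words[i + 1] - 0xDC00)
            code_points.append(cp)
            i += 2
        else:
            # Single word: its value is the code point itself.
            # (A lone surrogate would pass through unchecked in this sketch.)
            code_points.append(w)
            i += 1
    return code_points
```

For example, the bytes 3D D8 00 DE (little endian) and D8 3D DE 00 (big endian) both give the single code point 0x1F600.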
Now, UTF-8 is also a variable-width encoding. It uses a pretty cunning algorithm with very low redundancy. It is fully described, for example, here:
https://en.wikipedia.org/wiki/UTF-8.
Just follow the algorithm description. I don't think it's anything too complicated.
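As an illustration of that algorithm, here is a minimal Python sketch that encodes one code point into UTF-8 bytes. The name code_point_to_utf8 is hypothetical, and refusing surrogate values (0xD800 to 0xDFFF) is my assumption about the desired behavior; you may decide otherwise.

```python
def code_point_to_utf8(cp):
    """Encode one Unicode code point as UTF-8 bytes (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption: surrogate values are not valid code points to encode.
        raise ValueError("surrogate values cannot be encoded")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For example, code point 0x20AC becomes the three bytes E2 82 AC, and 0x1F600 becomes the four bytes F0 9F 98 80.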
There is another optional feature of UTF-16 and UTF-8 streams: the BOM. This marker is optional, so you need to decide what to do with text where it is absent. You can refuse to process the text if the marker is not found, or you can provide another function where the expected encoding is specified. That should be your design decision. Please see:
http://unicode.org/faq/utf_bom.html.
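Both options for the BOM can be combined in one function, roughly like the Python sketch below. The name detect_bom and the default parameter are hypothetical. It only knows the UTF-8 and UTF-16 markers; note that a UTF-32LE marker also begins with FF FE, so if UTF-32 input is possible you would need an extra check, which is outside this sketch.

```python
BOM_UTF8 = b"\xef\xbb\xbf"
BOM_UTF16_LE = b"\xff\xfe"
BOM_UTF16_BE = b"\xfe\xff"

def detect_bom(data, default=None):
    """Return (encoding name, data without the BOM).

    If no BOM is present, fall back to `default` (the encoding the caller
    expects); if no default is given either, refuse to process the data.
    """
    if data.startswith(BOM_UTF8):
        return "utf-8", data[len(BOM_UTF8):]
    if data.startswith(BOM_UTF16_LE):
        return "utf-16-le", data[len(BOM_UTF16_LE):]
    if data.startswith(BOM_UTF16_BE):
        return "utf-16-be", data[len(BOM_UTF16_BE):]
    if default is None:
        raise ValueError("no BOM found and no expected encoding specified")
    return default, data
```

Calling detect_bom(data) gives the "deny processing" behavior for unmarked text, while detect_bom(data, default="utf-16-le") gives the "expected encoding is specified" behavior.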
And finally, one delicate point: a stream in either encoding can contain
invalid sequences. In your particular problem, UTF-8 is never a source, so all the problems you may have are with UTF-16. If, for example, you face the second member of a surrogate pair before the first one is encountered, this is invalid data. If you have only one member of a surrogate pair surrounded by non-surrogate words, this is invalid data. So, you have to decide what to do with such cases, and this is purely your own decision. It should be by your design.
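Just to show what such a decision might look like, here is a Python sketch with two possible policies: reject the data, or substitute U+FFFD (the Unicode replacement character). The name combine_surrogates and the strict flag are hypothetical, and these two policies are only examples; the input is a sequence of 16-bit words already read with the right byte order.

```python
REPLACEMENT = 0xFFFD  # Unicode replacement character

def combine_surrogates(words, strict=True):
    """Turn 16-bit words into code points, detecting broken surrogate pairs.

    strict=True  -> raise ValueError on invalid data
    strict=False -> put U+FFFD in place of each unpaired surrogate
    """
    out = []
    i = 0
    n = len(words)
    while i < n:
        w = words[i]
        if 0xD800 <= w <= 0xDBFF:
            # First member of a pair: the second one must follow immediately.
            if i + 1 < n and 0xDC00 <= words[i + 1] <= 0xDFFF:
                out.append(0x10000 + ((w - 0xD800) << 10) + (words[i + 1] - 0xDC00))
                i += 2
                continue
            problem = "first member of a surrogate pair without the second one"
        elif 0xDC00 <= w <= 0xDFFF:
            problem = "second member of a surrogate pair without the first one"
        else:
            out.append(w)  # ordinary non-surrogate word
            i += 1
            continue
        if strict:
            raise ValueError("%s at word index %d" % (problem, i))
        out.append(REPLACEMENT)
        i += 1
    return out
```

With strict=False, the words 0x41, 0xDC00, 0x42 come out as 0x41, 0xFFFD, 0x42; with strict=True, the same input raises an error.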
I hope I did all you wanted: no complete code, but now you have all the ins and outs. Is it clear?
—SA