Convert UTF-16 to UTF-8 in C

Question

I have a university assignment I need some help with. Don't give me the solution; hints or small portions of code would be appreciated.

So, my university project is all about Unicode. To be exact, I have to write code that takes character input in UTF-16 format, converts it to UTF-8 and writes it to the appropriate output (terminal console or file.txt), while I also do the following:
1) Don't use arrays
2) Use putchar
3) Use getchar

Note: I am in my second year there, but it would be best if I did not use pointers and scanf.

I'd rather not post code unless necessary, in case my professor is watching the forums.
Here's my start:

#include <stdio.h>

int main(void)
{
	int char1, char2;

	/* Read big-endian 16-bit code units, two bytes at a time, until EOF. */
	while ((char1 = getchar()) != EOF && (char2 = getchar()) != EOF)
	{
		char1 <<= 8;
		char1 += char2;

		if (char1 >= 0xD800 && char1 <= 0xDBFF) {
			/* High surrogate: read the next code unit, which should be a low surrogate. */
			char2 = getchar();
			int tempchar = getchar();
			if (char2 == EOF || tempchar == EOF)
				break;
			char2 <<= 8;
			char2 += tempchar;

			if (char2 >= 0xDC00 && char2 <= 0xDFFF)
			{
				char1 -= 0xD800;
				char2 -= 0xDC00;
				char1 <<= 10;
				char1 += char2;
				char1 += 0x010000;
				//write code that converts to utf 8
			}
		}
		else if (char1 <= 0xD7FF || char1 >= 0xE000) {
			//write code that converts to utf 8
		}
	}
	return 0;
}

Is my code up to this point correct? Is the shifting right? If not, please explain how I could make it work.

Posted 4-Dec-15 8:20am

Kobayashi Porcelain

Updated 4-Dec-15 9:49am

Comments

[no name] 4-Dec-15 14:56pm

Only a note:
"that takes character input in utf-16 format,"

If you are using .NET, it is very hard to get a char that is anything other than Unicode encoded as UTF-16.

Kobayashi Porcelain 4-Dec-15 14:58pm

I am talking about using a file.txt as input through unix commands. Terminal style. But ok sure :)

[no name] 4-Dec-15 14:59pm

Ok. Thanks for this, now it makes much more sense ;)

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Accepted Answer · 2015-12-04T09:59:00

First, you did not show how your objects named char… are declared. You need to do all the calculations in a 32-bit unsigned integer; anything narrower is not wide enough to represent a code point beyond the BMP.

I did not check the UTF-16 part, but at least one thing is missing: there should be two different branches, one for UTF-16LE and another for UTF-16BE. In each case, you first check whether you are reading a surrogate pair and, if so, calculate your internal representation of the code point from the pair, as an unsigned 32-bit integer. Endianness only swaps the two bytes inside each 16-bit word; the order of the surrogates (high first, then low) is the same in both. Every other code point is composed from a single 16-bit word, whose unsigned integer interpretation is arithmetically equal to the code point value. Please see:
https://en.wikipedia.org/wiki/Endianness,
https://en.wikipedia.org/wiki/UTF-16.
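To make the two branches concrete, here is a minimal sketch, not a full solution. The function names make_unit and combine_surrogates and the big_endian flag are my own, purely for illustration; the bytes are assumed to come from getchar and to have been checked against EOF already.

```c
#include <assert.h>
#include <stdint.h>

/* Build one 16-bit code unit from two bytes given in stream order.
   big_endian != 0 means UTF-16BE, 0 means UTF-16LE (hypothetical flag). */
uint32_t make_unit(int first, int second, int big_endian)
{
    assert(first >= 0 && first <= 0xFF && second >= 0 && second <= 0xFF);
    return big_endian ? ((uint32_t)first << 8) | (uint32_t)second
                      : ((uint32_t)second << 8) | (uint32_t)first;
}

/* Combine a high and a low surrogate into a code point beyond the BMP. */
uint32_t combine_surrogates(uint32_t high, uint32_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For example, the pair D83D DE00 combines to U+1F600, whichever byte order the two words arrived in.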

The goal of the first stage is to interpret the UTF-16 encoding character by character, and each character should be represented as a 32-bit unsigned value that is arithmetically equal to the code point. Here, you need to realize that Unicode code points are a mathematical abstraction: they are just integer values, independent of any bitwise or other computer representation.

Now, UTF-8 is also a variable-width encoding. It uses a pretty cunning algorithm with very low redundancy. It is fully described, for example, here: https://en.wikipedia.org/wiki/UTF-8.

Just follow the algorithm description. I don't think it's anything too complicated.
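As a hint of the shape this takes, here is one possible sketch of the encoding step using only putchar, as the assignment requires. The name put_utf8 and its byte-count return value are my own choices, and it assumes the code point has already been decoded and validated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write one code point to stdout as UTF-8; return the number of bytes written. */
int put_utf8(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {                      /* 0xxxxxxx */
        putchar((int)cp);
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        putchar((int)(0xC0 | (cp >> 6)));
        putchar((int)(0x80 | (cp & 0x3F)));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        putchar((int)(0xE0 | (cp >> 12)));
        putchar((int)(0x80 | ((cp >> 6) & 0x3F)));
        putchar((int)(0x80 | (cp & 0x3F)));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    putchar((int)(0xF0 | (cp >> 18)));
    putchar((int)(0x80 | ((cp >> 12) & 0x3F)));
    putchar((int)(0x80 | ((cp >> 6) & 0x3F)));
    putchar((int)(0x80 | (cp & 0x3F)));
    return 4;
}
```

With this sketch, U+20AC (the euro sign) comes out as the three bytes E2 82 AC.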

There is another optional feature of UTF-16 and UTF-8 streams: the BOM (byte order mark). You need to decide what to do with text that has no BOM. You can refuse to process it, or you can provide another function where the expected encoding is specified. That should be your design decision. Please see: http://unicode.org/faq/utf_bom.html.
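A small sketch of the BOM check, under the assumption that the first two bytes of the stream were read with getchar and are passed in stream order. The name detect_bom and its return codes are hypothetical; what to do when it reports no BOM is still your design decision.

```c
/* Classify the first two bytes of a UTF-16 stream.
   Returns 1 for a UTF-16BE BOM (FE FF), 0 for a UTF-16LE BOM (FF FE),
   and -1 when no BOM is present (including EOF in either byte). */
int detect_bom(int first, int second)
{
    if (first == 0xFE && second == 0xFF)
        return 1;
    if (first == 0xFF && second == 0xFE)
        return 0;
    return -1;
}
```

If the result is -1, the two bytes are ordinary text and still have to be decoded as the first code unit.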

And finally, one delicate point: both encodings can contain invalid sequences. In your particular problem, UTF-8 is never a source, so all the problems you may have are with UTF-16. If, for example, you meet the second member of a surrogate pair before the first one, this is invalid data. If you have only one member of a surrogate pair, surrounded by non-surrogate words, this is invalid data too. So you have to decide what to do in such cases, and that decision is entirely yours. It should be by your design.

I hope I did all you wanted: hints rather than a complete solution, and now you have all the ins and outs. Is it clear?
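One possible policy, shown only as a sketch: substitute U+FFFD (the Unicode replacement character) for any surrogate that has no valid partner. Stopping with an error would be an equally legitimate design. All names here are hypothetical.

```c
#include <stdint.h>

#define REPLACEMENT 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

int is_high(uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low(uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Return the code point for `unit`; `next` is examined only when `unit`
   is a high surrogate. A result >= 0x10000 means `next` was consumed as
   the low surrogate; otherwise the caller must still process `next`. */
uint32_t decode_or_replace(uint32_t unit, uint32_t next)
{
    if (is_high(unit)) {
        if (is_low(next))
            return 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        return REPLACEMENT;   /* high surrogate with no partner */
    }
    if (is_low(unit))
        return REPLACEMENT;   /* low surrogate with no preceding high */
    return unit;              /* ordinary BMP code point */
}
```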

—SA