C#
public char UnicodeToCharacter()
{
    // Requires: using System.Text; ("decimal" is a reserved keyword, so renamed)
    int code = 44;
    byte[] bytes = new byte[2];
    bytes[0] = (byte)code;    // low byte of the UTF-16LE code unit
    bytes[1] = 0;             // high byte fixed at zero: this is why it stops at 255
    char[] chars = Encoding.Unicode.GetChars(bytes);

    return chars[0];
}


What I have tried:

The above code works only for characters with a decimal value up to 255, but I want to get the output for decimal value 3516.
Comments
BillWoodruff 27-Mar-16 0:31am    
A C# Unicode Char is a wrapper over a ushort integer (two bytes), which means you have only 65,536 possible values. Transformations from a Char to an integer, and from an integer to a Char, are straightforward.
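For example (a minimal sketch of those transformations):

char c = (char)3516;   // int to char: works for any value up to 0xFFFF
int i = c;             // char to int: implicit widening conversion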

What is your goal in dealing with a single byte of a Unicode Char ?
Sergey Alexandrovich Kryukov 27-Mar-16 1:31am    
It depends on what that decimal is and how you want to interpret it. Unicode is defined for only 1,112,064 different characters, but decimal has a range of roughly -7.9 × 10^28 to 7.9 × 10^28. So, in the general case, it's wrong. :-)

Your primary mistake is that you probably think Unicode is some kind of encoding. It is not.

—SA

1 solution

If by "Unicode" you mean a character code point, you can use Char.ConvertFromUtf32 Method (Int32) (System)[^].

But the result cannot always be a single character; that's why the method returns a string. A single .NET character is, strictly speaking, not a Unicode character: it can be part of a surrogate pair encoding a code point beyond the BMP. I recently tried to explain it here:
How to change one font to other font in the term of key maping[^].

But surrogates occupy the reserved code points in the range U+D800 to U+DFFF, so if you want to work in the range up to 3516 = 0x0DBC, you will always get a 1-character string, which you can interpret as a "character"; problem solved.
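For example (a minimal sketch; the variable names are mine):

int codePoint = 3516;                        // 0x0DBC, inside the BMP
string s = char.ConvertFromUtf32(codePoint); // a 1-character string here
char c = s[0];                               // safe for non-surrogate BMP code points
Console.WriteLine(c);                        // prints the character at U+0DBC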

—SA
 
Comments
[no name] 27-Mar-16 11:36am    
A 5. Please allow me a question about something I'm still not clear on:

I read in MSDN, among other places, that the BMP can represent 2^16 code points. Do we not need to subtract the 1,024 code points of the low-surrogate range? I still get a headache trying to understand these Unicode things completely :(
Thank you in advance.
Bruno
Sergey Alexandrovich Kryukov 27-Mar-16 12:10pm    
Here is the delicate point: yes, 2^16 code points, but not 2^16 characters. Can you see the difference?

This is exactly what was standardized by Unicode: the range U+D800 to U+DFFF was reserved, never to be used for characters. It can be used for surrogate pairs only. When the Unicode committee needs to introduce a new character, it considers putting it into some unused region of code points not yet assigned to characters. The choice depends on the semantic/cultural value of the new character. If, for example, it is a new character related to an already standardized alphabet, an attempt can be made to add it to the range previously reserved for that alphabet.

In all cases, a new character can be added either to the BMP or beyond it. Adding to the BMP means adding it to the 0 to U+FFFF domain, but excluding the already reserved subset of code points, notably U+D800 to U+DFFF (so at most 65,536 − 2,048 = 63,488 BMP code points can ever be characters).

Alternatively, the character can be added beyond the BMP. It then involves three code points: one normal code point greater than U+FFFF, and two code points (a pair) in the surrogate range U+D800 to U+DFFF to allow a UTF-16 representation. But taking up the pair does not need any decision or reservation; it happens "automatically", according to the UTF-16 algorithm: Description; see also the table "UTF-16 decoder".
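For a sketch of that encoding step (the helper name is mine; the constants come from the UTF-16 specification):

static (char high, char low) ToSurrogatePair(int codePoint)
{
    int v = codePoint - 0x10000;             // leaves a 20-bit value
    char high = (char)(0xD800 + (v >> 10));  // top 10 bits make the high surrogate
    char low = (char)(0xDC00 + (v & 0x3FF)); // bottom 10 bits make the low surrogate
    return (high, low);
}
// ToSurrogatePair(0x1F600) yields (0xD83D, 0xDE00), matching char.ConvertFromUtf32(0x1F600).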

—SA
[no name] 27-Mar-16 12:29pm    
First of all, thank you very much again for your answer to a "comment question".

"Can you see the difference?": Seems I can't yet, sorry...

I will now read the rest of your answer in detail and try to ask a proper question at CP, also so that I can award some points.

Meanwhile I feel really stupid for not understanding all this Unicode stuff :(

Thank you again
Bruno

How can the range for surrogates be counted as code points?
Sergey Alexandrovich Kryukov 27-Mar-16 12:48pm    
The difference is simple. First of all, code points are pure mathematical entities: abstract integer numbers, abstracted from their computer representation. They are integer numbers as they are understood in mathematics. Some of these numbers are reserved by the Unicode standard to point to some cultural entities. Nobody says that every code point is reserved to point to a character. Actually, the objects (as cultural entities) pointed to by code points are classified into 1) characters, 2) low surrogates, 3) high surrogates. Is it clearer now?

These surrogates (not pairs of them) can be considered non-characters, or "technical characters" used to represent other characters in UTF-16. One interesting consequence: it is possible to have invalid UTF-16 text (call it "non-text"), text which cannot be interpreted as Unicode at all. Surrogates, if present, should only appear in proper pairs; if you have a single surrogate word without its counterpart, it makes the whole text invalid. By the way, the UTF-8 algorithm is more robust in this respect; it's a really cunning algorithm, a stroke of genius maybe...
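To see it in action (a small sketch; the replacement shown is .NET's default encoder fallback):

string broken = "\uD800";                       // a high surrogate with no counterpart
byte[] utf8 = Encoding.UTF8.GetBytes(broken);   // the lone surrogate is replaced by U+FFFD
Console.WriteLine(BitConverter.ToString(utf8)); // EF-BF-BD, the replacement character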

Note that a .NET character is not, strictly speaking, a Unicode character; it is not even a character in the sense of UTF-16 (the internal .NET representation of a string in memory is UTF-16LE). This is because a low or high surrogate is technically treated as a character. In particular, string.Length is not, in general, a correct length in characters; it is only a length in 16-bit words. Even in .NET, Unicode should be handled with care. Say, if you write text search in an editor and use string.Length to compute the position of the caret, the caret will be placed incorrectly if there is a non-BMP character in between...
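A quick illustration of that pitfall (sketch):

string s = char.ConvertFromUtf32(0x1F600);      // one character beyond the BMP
Console.WriteLine(s.Length);                    // 2: the length in 16-bit words, not characters
Console.WriteLine(char.IsHighSurrogate(s[0]));  // True
var info = new System.Globalization.StringInfo(s);
Console.WriteLine(info.LengthInTextElements);   // 1: the actual character count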

—SA
[no name] 28-Mar-16 10:12am    
I'm reading this in detail at the moment and trying to understand it. It looks like it resolves a lot of my doubts. I was always afraid to ask questions about something like this... idiot me ;)


Regarding "If you have a single surrogate word without its counterpart, it makes the whole text invalid":

That was a question for me from the very beginning (when I tried to understand Unicode vs. UTF-16). But a statement like yours cannot be found (or I have not found it until now) in MSDN.

++++5. Thank you again.
Bruno

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
