The question is not whether your result is correct; the whole idea of this "translation" between languages makes no sense as stated. You could make it meaningful if you explained the ultimate purpose of your character calculations (which look strange). Apart from some application context, the question does not make any sense.
Here is why: you are doing seemingly similar operations on very different objects.
Your C characters are 8-bit objects. Moreover, you use signed characters, but as long as you only do bitwise operations, that does not matter. And you use the complement operator '~'. The idea of a complement makes no sense without specifying "complement to what". If you typecast the value to a wider type first, the complement is taken against a wider all-bits-set value, and you get a different result. With the C char type, the complement means a bitwise complement against the value 0xFF. For example, if your character is a blank space (char source = ' ';), the complement gets the value -33, which corresponds to 0xDF in unsigned char form.
In .NET, a character is a Unicode character. In memory, it is represented using the encoding UTF-16LE, which uses one 16-bit word to express a character in the Basic Multilingual Plane (BMP) and a pair of such words (a surrogate pair) to express a character outside the BMP. When you calculate the complement of that very same blank space, you get a "character" 0xFFDF, which is not standardized as a character:
http://www.unicode.org/charts/PDF/UFF00.pdf.
Please see: http://www.unicode.org.
Now, since you wrap all intermediate results to a byte, both will give the identical result: 0xDF. So, up to this point everything is "correct" (if this is really what you want to get), and the problem could be somewhere else. Perhaps your input file is not actually all "Western European", or you interpret it incorrectly. So, to go further, let's see exactly which characters are "wrong". You could easily run this code under the debugger to see the calculation on some specific characters. Please see my comment to the question and answer my question.
As to your idea to "do this without even worrying about the encoding", it strongly resembles the thinking of Monsieur Jourdain, a character in Molière's play Le Bourgeois gentilhomme. This fellow was proud to learn that he had been speaking in prose all along, after his teacher explained it to him. :-)
Please see:
http://en.wikipedia.org/wiki/Prose,
http://en.wikipedia.org/wiki/Le_Bourgeois_gentilhomme.
[EDIT]
Anyway, I decided to try it out. First of all, let me rewrite the code in a literate way (but that does not mean it should work correctly):
using System.IO;
using System.Text;

class Program {
    const string fileName = "input.txt";
    const string outFileName = "output.txt";
    static void Main(string[] args) {
        using (StreamReader reader = new StreamReader(fileName, Encoding.GetEncoding(1252)))
        using (StreamWriter writer = new StreamWriter(outFileName, false, Encoding.GetEncoding(1252))) {
            while (true) {
                int value = reader.Read(); // returns -1 at end of stream
                if (value < 0) break;
                char character = (char)value;
                if (character != '\r' && character != '\n')
                    character = (char)(byte)(~(int)(byte)character); // complement the low byte only
                writer.Write(character);
            }
        }
    }
}
Your text sample is "converted" like this:
??????? ??ßÂßÝ????????Ý
??????? ??ßÂßÝÎÏÏÝß
where each question mark is really a question mark (code point 0x003F). The reason is this: working at the character level through a non-Unicode encoding is incorrect in principle. In this case, your complement operation produces an image of the source character that falls outside the set of code points valid for the encoding, so the encoder replaces it with a question mark.
Here is the background: C characters are not really characters; they are signed bytes and are processed in a bitwise manner, ignoring their cultural meaning. .NET, by contrast, follows the Unicode standard.
Let me tell you that your "1251", as well as the whole idea of a "code page", no longer exists, in a way; they survive only as Microsoft legacy. Look at the result of System.Text.Encoding.GetEncoding — this is the real encoding object. Also, all non-Unicode encodings are only good for legacy purposes (such as ASCII, as a subset of Unicode). If you use any encoding except one of the Unicode UTFs on arbitrary text, a correct result is not guaranteed.
Now, to reproduce the effect of your C code, you need to work with binary bytes, as suggested in Solution 4. This is the only way.
Then again, this is a kind of "obfuscation" which makes no sense whatsoever. If you need encryption, use encryption (and again: why?).
—SA