
Binary Obfuscation

25 Dec 2014 | CPOL | 16 min read
How to add obfuscation at binary level to protect your technology

Introduction

Binary obfuscation is a technique that aims to hide the real application code, making it difficult for an outsider who does not have access to your sources to understand what your program does. Obfuscation does not make your application unbreakable: with enough effort, anyone can gain access to the decrypted code. This is because the CPU cannot (yet) execute encrypted instructions, so sooner or later you have to hand it unencrypted ones. In the end, obfuscation is just an effort to make the execution of your application hard to follow, and to gain time before your program is cracked.

Obfuscation primarily hides your program from unattended, automated tools. Any scan of your binary for a known byte pattern will find nothing. If someone runs your tool in a virtual environment, it will run to completion without error, and inspecting its memory can reveal the unencrypted code. Such operations are usually performed by someone (not something) and can be very boring and costly (in terms of person-hours), so in the end an attacker may well leave things as they are and look for a different route.

In this article, we introduce the basics of obfuscation. There are various methods, some more advanced than others. The most basic one is to keep the whole code encrypted in the released binary and decrypt it at runtime, modifying only the image in virtual memory. With this technique, the file on disk is never modified, and the code never appears in clear form on the file system.

In the shipped examples, the protection of the virtual memory is changed to allow reads and writes, and the code is then decrypted in place so that it can be executed. There are other techniques, not explained here, that decrypt the procedure on the heap or on the stack, but these are more advanced methods.

Background

The following points must be clear to the reader:

First, and most important, you have to know how a compiler and linker lay out a binary. We will work on Windows, so the format used will be the Portable Executable (PE). It is necessary to know what PE sections are, and how the operating system image loader uses them.

Second, you must have an idea of what the CPU actually processes. What you write in the editor is translated by the compiler into CPU code. This code is bound to the CPU, so 32 and 64 bits matter and change the cards on the table. Changing the CPU architecture means changing some parts of the presented project to make it work again.

Third, you must have a basic grounding in cryptography and security. I will touch lightly on some cryptographic methods without going into details. You will be able to read the document, but I advise you to deepen your knowledge by looking at some more technical articles.

Last, but not least, you should have a very good knowledge of C, Assembly and CPU instruction sets.

Tools

I will be using the Visual Studio development tools for this article. Visual Studio is an integrated development environment (IDE) that allows building applications for Windows and, through nmake projects, for other operating systems.

Obfuscation Introduction

Now we have a project, but we do not want it to be easily readable. How can we do it?

Well, the first and most useful tool we have is the size and complexity of our code. Think about it: it is already difficult to read a project that someone else started. Everyone has their own coding style, which is something more than the simple syntax of the language used. We are speaking about your habits in using local variables (defined at the start or scattered through your code), how the project is structured (everything in one big file, or organized in a more hierarchical structure), and how you use preprocessor directives. Now scale that up to a bigger project (from 50 to hundreds of files), which uses your own abstractions (or not) over the multiple technologies needed to make the application work.

Many libraries can be compiled directly into the application instead of being shipped as .dll files, which may or may not be present on the machine. Instead of checking for prerequisites, you can link the code from a static library into the application and avoid the DLL Hell problem. All this code, including code designed by third-party companies or developers, adds complexity to your binary. Depending on the patience of the hacker who wants to break your application, he may decide that cracking it is not worth the time.

Now try to imagine inserting some obfuscation in that huge mess. This will be even more frustrating for the professionals who decided to attack your implementation, and this is what we are trying to achieve. Remember: nothing lasts forever; it is just a matter of gaining enough time to release a new and more complex obfuscation, or to develop a new technology that makes the one decrypted by the hacker obsolete.

Structure of the Target Binary

The binaries prepared for this article are small and have very little complexity. They are just utilities measured in kilobytes, which makes them easy prey for anyone who wants to sniff your technology. If you want to protect these applications a little, then some obfuscation is necessary.

The project performs obfuscation on a separate code section. With such an architecture, we are free to write new code and decide where to put it: in the clear section or in the obfuscated one. Keep in mind that the decryptor has to be in clear form, otherwise the CPU cannot run it to decrypt your data. If the decryptor is "surrounded" by other code, it will be difficult to isolate, especially if it is rarely used.

This idea can be applied to any section, so you can also have a separate data section (with connection strings, passwords or game cheats) isolated from low-interest data (normal variable values or output strings).

At the end of compilation, the structure of the utility binaries is the following:

+------------------+ File start
| DOS header       |
+------------------+
| DOS stub         |
+------------------+
| NT header        |
+------------------+
| Section headers  | Information about the binary sections
+------------------+
|       ...        |
+------------------+
| .dummy           | Section with code to obfuscate
+------------------+
|                  |
|                  |
|                  |
|       ...        |
|                  |
|                  |
|                  |
+------------------+ EOF

The .text area usually contains the application code that needs a lower level of security. The other section, .dummy, will contain the technology we want to protect. This is not the only possible approach to obfuscation (you can decide to mix obfuscated code with clear code), but it is certainly the one most compatible with C compilers. Mixing obfuscated code with clear code usually means that you have to use a lot of Assembly, and that you have to structure your project in a way that can get in the way of a team (other programmers can break the code you wrote). Maintaining a separate obfuscated section helps in terms of organization.

HINT: Have you ever heard about multi-stage obfuscation? Imagine having three different parts of your code (here called sections), each obfuscated with a different technology and a different key, and each depending on the section that comes before it. You can have, for example, .dummy1, .dummy2 and .dummy3. The decryptor for .dummy3 resides in .dummy2, which is decrypted with a technology/key that is in .dummy1. Chaining obfuscations adds some complexity to your project, which makes the hack more difficult for an external attacker.

HINT: Who said you have to ship the decryptor at all? Can you imagine someone looking desperately through your code for something that is not there? The section can remain there, silent, until you decide to decipher it with an external tool, shipped as a .dll (for example).
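The text does not show how a procedure ends up in the .dummy section, so here is one way it might be done, offered as a sketch rather than as the project's actual source. With Visual C++, the code_seg pragma places the functions that follow it in a named section; with GCC-compatible compilers targeting ELF, the section attribute plays the same role. The function name protected_increment is hypothetical; its body (return the argument plus one) is my reading of the mov eax,[ebp+8] / add eax,1 bytes in the dumps shown later.

```c
/* Place one procedure in a separate ".dummy" code section. */
#if defined(_MSC_VER)

#pragma code_seg(push, ".dummy")
int protected_increment(int value)
{
    return value + 1;
}
#pragma code_seg(pop)

#elif defined(__GNUC__) && defined(__ELF__)

__attribute__((section(".dummy"), noinline))
int protected_increment(int value)
{
    return value + 1;
}

#else

/* Unknown toolchain: fall back to the default code section. */
int protected_increment(int value)
{
    return value + 1;
}

#endif
```

Everything outside the push/pop pair (or without the attribute) stays in .text, which is what lets you decide, procedure by procedure, what goes in the clear section and what goes in the obfuscated one.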

Naive Section Obfuscation

Naive does not mean stupid. It is just a matter of choosing the right protection for the right technology; you do not want to apply a 2048-bit RSA key to your fancy Soraka avatar picture file. Losing it is not so terrible in the end. The naive methods answer the question: what is the first idea I can apply to obfuscate my technology without losing a month of development?

Here we start by XORing the entire section with a simple one-byte key. A one-byte key is sufficient to defeat automatic scanning of your code, but is easily spotted by an expert eye. Changing the key with each release of your application renews the obfuscation. This forces any hacker to repeat the work to decipher your code again.

So, what happens when you choose this obfuscation in the project? The prepared routines locate the .dummy section in the target executable. Once its offset from the start of the file and its size are known, the obfuscator applies the key to a copy of the content in a separate buffer, and finally replaces the section in the binary executable with the encoded one.
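The steps above can be sketched as follows. This is not the binobf source: find_section, xor_single and section_span are names I made up, and the code reads the PE headers by raw offset (e_lfanew at 0x3C, a 20-byte file header after the "PE\0\0" signature, 40-byte section headers) instead of using the windows.h structures, so that it compiles anywhere. It assumes the whole file has already been read into a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t raw_offset; /* PointerToRawData: offset from file start */
    uint32_t raw_size;   /* SizeOfRawData: bytes stored in the file  */
} section_span;

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Looks up a section by name in a PE image held in memory.
 * Returns 0 and fills `out` on success, -1 otherwise. */
int find_section(const uint8_t *img, size_t size, const char *name,
                 section_span *out)
{
    char padded[8] = {0};            /* names are 8 bytes, NUL padded */
    size_t n = strlen(name);

    assert(out != NULL);
    memcpy(padded, name, n > 8 ? 8 : n);

    if (size < 0x40 || img[0] != 'M' || img[1] != 'Z')
        return -1;
    uint32_t nt = rd32(img + 0x3c);  /* e_lfanew */
    if (nt > size || size - nt < 24 || memcmp(img + nt, "PE\0\0", 4) != 0)
        return -1;

    uint16_t count = rd16(img + nt + 6);      /* NumberOfSections     */
    uint16_t opt_size = rd16(img + nt + 20);  /* SizeOfOptionalHeader */
    size_t table = (size_t)nt + 24 + opt_size;

    for (uint16_t i = 0; i < count; i++) {
        size_t hdr = table + (size_t)i * 40;
        if (hdr + 40 > size)
            return -1;
        if (memcmp(img + hdr, padded, 8) != 0)
            continue;
        out->raw_size = rd32(img + hdr + 16);
        out->raw_offset = rd32(img + hdr + 20);
        if (out->raw_offset > size || out->raw_size > size - out->raw_offset)
            return -1;
        return 0;
    }
    return -1;
}

/* Applies the one-byte key; calling it twice restores the data. */
void xor_single(uint8_t *data, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        data[i] ^= key;
}
```

After find_section succeeds, calling xor_single(img + span.raw_offset, span.raw_size, key) on the buffer and writing the buffer back to disk gives the obfuscated executable.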

Let us take a quick look at the structure of the executable just before and after the encryption:

The naiveA project .dummy section:

55 8b ec 81 ec c0 00 00 00 53 56 57 8d bd 40 ff
ff ff b9 30 00 00 00 b8 cc cc cc cc f3 ab 8b 45
08 83 c0 01 5f 5e 5b 8b e5 5d c3 cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ... and so on until section end.

NaiveA project .dummy section after obfuscation:

00 de b9 d4 b9 95 55 55 55 06 03 02 d8 e8 15 aa
aa aa ec 65 55 55 55 ed 99 99 99 99 a6 fe de 10
5d d6 95 54 0a 0b 0e de b0 08 96 99 99 99 99 99 <-- 0x99 strange pattern recognized.
99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 ... and so on until section end.

The difference between the two sections is clear. When an experienced reverse engineer takes a quick look at such binary code, he will immediately recognize that something is wrong with the section. Initially, he recognizes that the section contains code because the flags in the section header say so. If he looks at the code, in this case (other keys can produce different output), he cannot locate a valid procedure (every calling convention has its own prologue and epilogue, which do not change). This is strange but not unusual: it is not rare to write custom assembly code for some performance-critical operation, so he may infer that some kind of custom assembly has been written.

The strange thing about this section is the 0x99 filling pattern that you can see until the end of the section. Toolchains usually pad the section up to an alignment boundary. What is the meaning of having so many cwd/cdq instructions in the code? The engineer will immediately recall that the filling pattern is usually 0x00 or 0xcc, depending on what the compiler decided. If he assumes that the filling pattern is 0xcc and runs a test with the help of a very basic script, he can discover the key (0x99 XOR 0xcc = 0x55 in this case), and replace the section with the unencrypted one.

As you can see, proceeding as described takes some time, and requires the reader of the binary to have some reverse engineering skill and some intuition. This sort of intuition usually comes from experience and knowledge that a common PC user does not have, so for an application with little technology to defend, this obfuscation is enough.

HINT: This very basic obfuscation technique is enough for your executable to bypass automated scanners, and the section can be encrypted again with a different key every time you run it (the section can be easily located and modified in memory, as you saw). You can create a copy of the executable and replace the old one at every execution, so you will never have the same key (and the same binary pattern).

In the end, the ratio between the time required for us to obfuscate the code and the time required for an external person to translate/read it is hugely in our favor.
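The "very basic script" of the reverse engineer could look something like the sketch below. The function name guess_single_byte_key is hypothetical, and the approach rests on one assumption stated in the text: the most frequent byte of the obfuscated section is the encrypted filling pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Guesses a single-byte XOR key by assuming that the most frequent
 * byte in the section is the encrypted form of `assumed_fill`. */
uint8_t guess_single_byte_key(const uint8_t *data, size_t len,
                              uint8_t assumed_fill)
{
    size_t counts[256] = {0};
    size_t best = 0;

    assert(data != NULL && len > 0);

    for (size_t i = 0; i < len; i++)
        counts[data[i]]++;

    for (size_t value = 1; value < 256; value++) {
        if (counts[value] > counts[best])
            best = value;
    }

    /* fill XOR key = most frequent byte, so key = byte XOR fill. */
    return (uint8_t)(best ^ assumed_fill);
}
```

Run over the obfuscated naiveA dump with an assumed fill of 0xcc, the most frequent byte is 0x99 and the guess comes out as 0x55. If the guess is wrong (for instance because the fill was 0x00), the decrypted section will not show a valid prologue and the attacker simply tries the next candidate.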

Next Step: Can a Longer Key Solve Our Problem?

Intuitively, using a longer key makes the program more secure, because even if the reverse engineer successfully recovers one byte of our key, he has to keep working to obtain the full, readable binary code. What will change if we try such an approach? Another project, naiveB, helps answer this question, and as in the previous chapter, I will show you the effect of this obfuscation on your code section.

The naiveB project .dummy section:

55 8b ec 81 ec c0 00 00 00 53 56 57 8d bd 40 ff
ff ff b9 30 00 00 00 b8 cc cc cc cc f3 ab 8b 45
08 83 c0 01 5f 5e 5b 8b e5 5d c3 cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ... and so on until section end.

NaiveB project .dummy section after obfuscation:

55 8a ee 82 e8 c5 06 07 08 5a 5c 5c 81 b0 4e f0
ef ee ab 23 14 15 16 af d4 d5 d6 d7 ef b6 95 5a
08 82 c2 02 5b 5b 5d 8c ed 54 c9 c7 c0 c1 c2 c3
dc dd de df d8 d9 da db d4 d5 d6 d7 d0 d1 d2 d3
cc cd ce cf c8 c9 ca cb c4 c5 c6 c7 c0 c1 c2 c3 <-- Repeating pattern visible here.
dc dd de df d8 d9 da db d4 d5 d6 d7 d0 d1 d2 d3
cc cd ce cf c8 c9 ca cb c4 c5 c6 c7 c0 c1 c2 c3
dc dd de df d8 d9 da db d4 d5 d6 d7 d0 d1 d2 d3 ... and so on until section end.

As you can see, the single byte 0x99 is no longer there, thanks to our longer key. However, another pattern has appeared, and it recurs every two lines (32 bytes). This means that, with XOR encryption, the repeating pattern is as long as the chosen key, whatever key you decide to apply. This is a well-known weakness to a reverse engineer. In this case, he can replace the key in his script and decrypt your obfuscated section with no more time spent than in our first attempt. So making the key longer does not work as well as we first thought.
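Both sides of this story fit in a few lines. The sketch below uses hypothetical names (xor_repeating, find_period, recover_key_from_fill) and is not taken from the naiveB project. Comparing the two dumps above, the key appears to be the 32 bytes 0x00 to 0x1f; that is my inference from the data, not something the project documents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Obfuscator side: XOR with a key that repeats every key_len bytes. */
void xor_repeating(uint8_t *data, size_t len,
                   const uint8_t *key, size_t key_len)
{
    assert(key_len > 0);
    for (size_t i = 0; i < len; i++)
        data[i] ^= key[i % key_len];
}

/* Attacker side: smallest p such that the padding repeats every p
 * bytes. Returns 0 if no period up to max_period fits. */
size_t find_period(const uint8_t *padding, size_t len, size_t max_period)
{
    for (size_t p = 1; p <= max_period && p < len; p++) {
        size_t i = 0;
        while (i + p < len && padding[i] == padding[i + p])
            i++;
        if (i + p == len)
            return p;
    }
    return 0;
}

/* Attacker side: with the period known and the fill byte assumed,
 * one period of padding gives the key back. `padding` must start
 * at an offset that is a multiple of the period. */
void recover_key_from_fill(const uint8_t *padding, size_t period,
                           uint8_t assumed_fill, uint8_t *key_out)
{
    for (size_t i = 0; i < period; i++)
        key_out[i] = (uint8_t)(padding[i] ^ assumed_fill);
}
```

The attacker needs only a stretch of padding at least two periods long. The cost of the attack grows with the key length, but only linearly, which is why the longer key buys so little.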

Next Step: Avoid Any Recurring Pattern in the Obfuscated Section

Well, there actually is a method known in cryptography to obtain an unbreakable encryption, but it comes with some restrictive conditions. A one-time pad is a key as long as the data to encrypt, generated by a genuine random number generator (which is not as easy to do as you might think without specialized hardware). The idea is to have a pool of random numbers that you use to encrypt your data; if these numbers are random enough, there are no recurring patterns, and the reverse engineer has no repeated data that he can use to break your code. It is like choosing a different key for every byte of the data, always changing the key at random.

The problem with this approach is that you now have other data to protect or hide: the key itself. Let us see what happens when we decide to apply such a key to the section we want to obfuscate:

The naiveC project .dummy section:

55 8b ec 81 ec c0 00 00 00 53 56 57 8d bd 40 ff
ff ff b9 30 00 00 00 b8 cc cc cc cc f3 ab 8b 45
08 83 c0 01 5f 5e 5b 8b e5 5d c3 cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ... and so on until section end.

NaiveC project .dummy section after obfuscation:

8e bf 9c 59 96 b5 92 03 85 b9 fa d1 45 bf 44 1e
31 15 19 60 f6 53 fe 12 05 b9 ca 45 78 f8 b6 ee
18 c2 fb 99 8d fa 04 0e a5 a7 02 b5 94 d6 a4 62
bb 70 72 f6 bc f7 8c 22 10 56 b6 b3 7e 0a ff 2f
d4 d6 29 32 b4 e4 bc b2 19 7b c2 c8 ac c4 46 48
8a 92 61 22 01 70 36 c2 53 3d 57 7d a9 1e 56 c5
54 0b 98 9e 58 45 e7 7a 23 e5 b0 a2 c4 99 1d e7
2f 1b 9b 78 f8 92 5e 1c 75 4d 83 aa 01 cd 16 2f
f7 82 be 10 9c 81 36 39 f8 95 3d cd b6 44 68 a6
39 e3 6e 1f 01 64 bd 31 1f 9e b3 2b de 16 97 f6
6a 75 e9 2f 1e 32 88 ce 80 82 9a cf 17 e5 a1 c6
e8 a2 b4 59 0d ee 33 91 58 a1 d8 b1 97 29 4a 19
48 dc 9b 7d 8e ef bc 6a 2c dc 58 71 9a 0c 5f 18
d6 51 0c 8c f4 98 68 7b 69 15 38 a2 1c 67 0d b2
b7 95 23 40 05 89 24 65 54 64 5d bb dc 1a b2 41
b0 08 ae d1 95 0b 05 18 61 52 c5 ce 57 7f b9 37
ff 52 19 71 43 27 df 1f d7 d1 fb b5 f8 3e 59 cc
39 25 89 b8 81 ce 18 b1 99 09 f5 4f 2d 49 c7 d9
9a 2a 3d 40 77 51 95 27 de bb a1 c6 24 51 8e 38
e7 da 9f 41 f0 41 e3 b4 89 99 a2 fb 01 66 25 58
45 f6 e7 8c b9 28 ee 77 e8 73 d6 bf 98 92 21 fc
0a 9e bf 63 81 3d 8c 41 e8 9a 43 a4 48 65 b5 8c
ba 6d a7 ef 2b 8a 1a c5 36 30 e4 31 6e 71 30 b1
a3 6a 41 e8 64 79 bd 4a 57 1d 48 90 fd c6 ee 2d ... and so on until section end, without patterns.

As you can see, even with a mechanism as simple as the rand() procedure, the result is such a mess that directly decrypting your code (without some specialized application) would take the reverse engineer too much time. Strictly speaking, rand() is a pseudo-random generator, so this only approximates a true one-time pad. If he figures out that you are using a one-time pad and not a cipher such as RSA, he will change his objective and search for the key instead. If he finds it, he will be able to decrypt your code.

HINT: A dedicated key such as a one-time pad can be a pain to manage; it all depends on what you have to protect and how much time you can dedicate. Maybe just selecting a longer key is enough for us. Imagine using part of your own application as the key: the .text section will be changed by the loader during relocation (to reach the data or imported functions your utility needs), but the .data section is usually static at startup. You can use it to apply the XOR encryption to your obfuscated section (the whole section as a key).

Now, all the naive methods approach the obfuscation of the code in a simple and direct way: just encrypt it. As you have seen, the results range from easy to difficult to hack. What all the naive methods have in common is that the whole section is encrypted and will be recognized as such by any reverse engineer. Even the one-time pad output is clearly unreadable, without any logical recurring structure (which usually indicates encryption). In the end, we do not try to hide the section; we just rely on the encryption applied.
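A pad built on rand() could be generated along these lines. This is only a sketch under my own assumptions (the names make_pad and xor_with_pad, and the use of a seed, are not taken from the naiveC project). It also shows why the key problem does not go away: whoever learns the seed, or finds the stored pad, can rebuild everything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fills `pad` with pseudo-random bytes. The same seed always gives
 * the same pad on the same C library, so the seed is the real key. */
void make_pad(uint8_t *pad, size_t len, unsigned int seed)
{
    assert(pad != NULL);
    srand(seed);
    for (size_t i = 0; i < len; i++) {
        /* RAND_MAX is at least 32767, so bits 7..14 always exist;
         * the low bits of rand() are weak on some implementations. */
        pad[i] = (uint8_t)((rand() >> 7) & 0xff);
    }
}

/* XORs the data with a pad of the same length (one key byte per
 * data byte). Applying it twice restores the original data. */
void xor_with_pad(uint8_t *data, const uint8_t *pad, size_t len)
{
    for (size_t i = 0; i < len; i++)
        data[i] ^= pad[i];
}
```

For anything you seriously want to protect, a cryptographic random source from the operating system would be a better choice than rand(); the structure of the code stays the same.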

Next Step: Smart Obfuscation

Well, not really so smart. This is just the first example that comes to my mind when thinking of binary obfuscation. Let's summarize what we encountered during our previous tests: XOR encryption (everything changes if you choose another encryption algorithm) gives a binary obfuscation that prevents automatic scanners from finding a binary string that could lead to the detection of your technology. On the other hand, this basic encryption is easily detected and then decrypted by a reverse engineer who takes some time to read your application. From here, you can choose between two ways to make the reverse engineer work harder: improve your encryption, or hide your technology better.

Improving the encryption is a good choice, but it usually takes a lot of development time. In this final example, we will look at an idea for hiding your technology better. We go back to the first example: we want to use a single-byte key for XOR encryption, but we do not want our key to be easily exposed. We also do not want the reverse engineer to easily recognize our encrypted procedures.

The idea is to obfuscate only what needs to be obfuscated, and to leave alone what is useless to protect. Every procedure has a prologue and an epilogue, which depend on the calling convention used. We will use these tokens to activate and deactivate the encrypting routine. The goal is to have routines that any tool recognizes as legal, with obfuscated code inside. When someone looks at them with a reverse engineering tool (like IDA), he will not be able to make sense of the logic of such routines, but it will be hard to think of them as obfuscated because they seem valid.

The .dummy section before obfuscation:

55 8b ec 81 ec c0 00 00 00 53 56 57 8d bd 40 ff
ff ff b9 30 00 00 00 b8 cc cc cc cc f3 ab 8b 45
08 83 c0 01 5f 5e 5b 8b e5 5d c3 cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ... and so on until section end.

The .dummy section after obfuscation:

55 8b ec d4 b9 95 55 55 55 06 03 02 d8 e8 15 aa <-- the inner part of the procedure is encrypted.
aa aa ec 65 55 55 55 ed 99 99 99 99 a6 fe de 10
5d d6 95 54 0a 0b 0e 8b e5 5d c3 cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ... and so on until section end.

As you can see, the 0x99 filling pattern is gone, so the section looks legal. In addition, avoiding that pattern avoids exposing our key. The prologue and epilogue identify this as a true procedure, valid for the standard calling convention. It is a valid procedure in a valid code section of your binary. Identifying it as encrypted takes some time, and decrypting it is a little harder (because you do not know what resides inside the procedure). Now imagine extending this heuristic as we did with the naive projects. Applying a multibyte key will make it challenging enough to discourage any junior reverse engineer. You can have a 512-byte key that repeats only after 512 bytes of binary code have been encrypted, which can be the size of two or three procedures (one would usually assume that the key restarts at the beginning of every procedure, but that is not the case).
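The selective encryption shown in the dump above can be sketched like this. The function name xor_procedure_bodies is hypothetical, and the byte patterns are the ones visible in the dump: 55 8b ec (push ebp / mov ebp,esp) as prologue and 8b e5 5d c3 (mov esp,ebp / pop ebp / ret) as epilogue, which fit a 32-bit debug build. Other compilers, optimization levels or calling conventions would need other patterns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XORs only the bytes between each prologue and the next epilogue,
 * leaving both markers and the padding untouched.
 * Returns the number of procedure bodies processed. */
size_t xor_procedure_bodies(uint8_t *code, size_t len, uint8_t key)
{
    static const uint8_t prologue[] = {0x55, 0x8b, 0xec};
    static const uint8_t epilogue[] = {0x8b, 0xe5, 0x5d, 0xc3};
    size_t processed = 0;
    size_t i = 0;

    assert(code != NULL);

    while (i + sizeof prologue <= len) {
        if (memcmp(code + i, prologue, sizeof prologue) != 0) {
            i++;
            continue;
        }

        size_t body = i + sizeof prologue;
        size_t end = body;
        while (end + sizeof epilogue <= len &&
               memcmp(code + end, epilogue, sizeof epilogue) != 0)
            end++;
        if (end + sizeof epilogue > len)
            break;                   /* no epilogue: leave as is */

        for (size_t j = body; j < end; j++)
            code[j] ^= key;
        processed++;
        i = end + sizeof epilogue;   /* resume after the epilogue */
    }
    return processed;
}
```

One caveat of this sketch: the same routine is used to decrypt, and it finds the end of a body by scanning the encrypted bytes, so an encrypted body that happens to contain the epilogue pattern would be cut short. A more careful tool would record the encrypted ranges in a table instead of rediscovering them at runtime.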

Testing the Code

Well, if you want to use the shipped code, you are free to do so. I usually test the code while reading an article, to make clear what the author is showing. In my code, you will find an obfuscator (the binobf project) which can be modified to encrypt the desired section of the desired target utility. You can also dump the output of binobf to a log file to examine how the sections change during the computation steps (usually with the command binobf.exe > log.txt). In the main source file of the obfuscator, you will find various procedures that you can call to switch between the examples proposed in this article. The comments in the project should be clear enough to let you run your own tests on the dummy utilities.

If you do not understand something in the code... well, maybe you should look back at the background requirements. The program is easy enough to allow everyone to choose the example to run. If you have questions, just comment on the article and wait for an answer.

To test the project, you can either use the command line or work entirely within Visual Studio. It depends on how deeply you want to examine it.

History

  • 12/24/2014 - Initial commit of the article

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)