Technical Detail
This article aims to provide material for modern-day decompiling of
applications written in C++. We assume you have a solid understanding of C++, x86
assembly, and Windows.
Overview and contents
- Why is C++ Decompiling possible?
- Intro
- Modern Day Example
- Compiler Specific
- C++ Protocols
- Intro
- Global Variables
- Expressions
- Return Values
- Function calls and the stack
- Local Variables
- C++ Keywords
- Intro
- If statement
- For Loop
- Structures
- Technical Algorithms
- Practical Decompiling
- Intro to Decompiling Windows application
- Decompiling a sample application
Special Case: Compiler Specific
Compiler Specific:
Each compiler is different. Their CrtlStartUp routines, the assembly they emit
for statements (switch, if, while), and numerous other details make each
compiler generate different code. Even if you compile the same C++ code on two
compilers, the end result will be different. Because of this I will stick with
one and only one compiler: the Visual C++ compiler.
Visual C++ is produced by Microsoft and currently delivers some of the fastest
and most optimized code available. That is not to say all the information in
this book applies only to Visual C++; I am just saying some of it may only work
on Visual C++.
If you don't have Visual C++ that is fine. There are many other compilers
available, and most of this information is also accurate for them.
Chapter 1: Why is C++ Decompiling possible?
1.1 Intro:
I have been asked many times whether C++ decompiling is even possible, not only
because of the complexity of a compiler but because of the massive amount of
information lost in compiling: comments, include files, and macros, just to name
a few. So one often wonders whether this is even worth pursuing. I want to start
with the topic of what is totally lost when you compile a program and what stays
there. Refer to table 1.1.1 to see what we lose and what remains.
What is lost | What remains
Templates | Function calls
Classes | Dynamic linking calls
Macros | Switch statements
Include files | Local variables
Comments | Parameters
That is not to say everything in the "What remains" column is 100% there; it
just means it is simple and practical to reverse engineer. Because of this I
chose to deal with the "What remains" column first, since it's much easier.
As we progress through this book, keep in mind that reverse engineering is
rarely straightforward and takes lots of practice. It's harder to reverse
engineer something than to create it in the first place.
A good way to start out with reverse engineering is to decompile your own
programs and see how each C++ construct specifically works, then apply that
knowledge in other areas, because looking at thousands of lines of unfamiliar
assembly code is not really fun.
1.2 Modern Day Examples:
Now as you're reading this book you might start to think that "anything
translated into a different language can be translated back into the same
language", right? Well, this is not the case in reverse engineering: a lot of
things will be lost, and a lot of things you must make up (assume) along the
way.
So I wanted to make sure I provided some practical examples of reverse
engineering at the beginning of the book, to give you a sense of hope.
To begin reverse engineering, I decided to start with the main C++ function:
int main(int argc, char * argv[])
It seems we can easily find this function in any executable file, because the
PE format tells us where the executable starts: we can simply read the PE header
of a specific executable and get its start address. Or can we?
This is where the C Runtime Library (CRTL) comes in. When you compile a C++
program, most compilers (this is compiler-specific) produce an executable that
runs in the following order:
CrtlStartUp()
int main(int argc, char * argv[])
CrtlCleanUp()
This means we can't look into the PE file and get the start of our code; we
can only get the start of the CrtlStartUp() code. We have two choices: reverse
engineer the CrtlStartUp code or skip over it. I like the latter, and we will
deal with the C Runtime Library later.
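To make the "start address" idea concrete, here is a minimal sketch of reading the entry point out of a PE image that has already been loaded into a byte buffer. The helper names read_u32 and find_entry_point are made up for illustration, and the sketch assumes a 32-bit (PE32) image; the field offsets used are those of the standard PE header layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Read a little-endian 32-bit value from the buffer (hypothetical helper).
static std::uint32_t read_u32(const std::vector<std::uint8_t>& image, std::size_t offset)
{
    return  static_cast<std::uint32_t>(image[offset])
         | (static_cast<std::uint32_t>(image[offset + 1]) << 8)
         | (static_cast<std::uint32_t>(image[offset + 2]) << 16)
         | (static_cast<std::uint32_t>(image[offset + 3]) << 24);
}

// Returns ImageBase + AddressOfEntryPoint, or 0 if the buffer does not look
// like a PE32 file. This address is where the runtime startup code begins,
// not where main() begins.
std::uint32_t find_entry_point(const std::vector<std::uint8_t>& image)
{
    if (image.size() < 0x40 || image[0] != 'M' || image[1] != 'Z')
        return 0;                                     // no DOS header
    const std::uint32_t pe = read_u32(image, 0x3C);   // e_lfanew
    if (pe > image.size() || image.size() - pe < 24 + 32)
        return 0;                                     // headers do not fit
    if (read_u32(image, pe) != 0x00004550)            // "PE\0\0" signature
        return 0;
    const std::size_t opt = pe + 4 + 20;              // signature + file header
    if (image[opt] != 0x0B || image[opt + 1] != 0x01) // optional header magic 0x10B
        return 0;
    const std::uint32_t entry_rva  = read_u32(image, opt + 16); // AddressOfEntryPoint
    const std::uint32_t image_base = read_u32(image, opt + 28); // ImageBase
    return image_base + entry_rva;
}
```

Feeding the returned address to a disassembler lands you in the startup routine, which is exactly why the next step is finding main by other means.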
Chapter 2: C++ Protocols:
2.1 Intro
One of the main reasons C++ is so well designed is that it follows strict
protocols in the assembly it generates. Some patterns are very static: when you
return an integer or pointer value it is always placed in the EAX register, and
function calls nearly always use the stack. Reverse engineers can attack these
static patterns and get a head start.
The first thing we should deal with is global variables, because if you're
coming from a high-level language you might have some misconceptions.
2.2 Global Variables
Many books say that memory is allocated at random places on the computer. This
is true for the most part, but the memory your application uses for global
variables is quite static. Each time you run your program, your statically
allocated variables will end up in the same place within the loaded image.
Another interesting fact is that a pointer variable doesn't hold the data
itself; it holds the address of where the data is stored.
Here is a C++ Example:
#include "stdafx.h"
#include "windows.h"
char * globalvar = "Whats Up";
int APIENTRY WinMain(HINSTANCE hInstance,
HINSTANCE hPrevInstance,
LPSTR lpCmdLine,
int nCmdShow)
{
globalvar = (char *)0x400000;
return 0;
}
Here is an in-depth look at the disassembly:
00405030: global_var dd 405034h
00405034: global_var_value db 'Whats Up',0
mov global_var,400000h
OK, this shows that the pointer variable does not hold the string itself. As
you can see, the compiler automatically initializes our global_var pointer to
the address of global_var_value.
So far we know that a pointer variable just holds the address of a value, so we
can change where the variable points, right? Yes we can, with
mov global_var, 400000h. Whenever the program accesses global_var, it will look
at the value stored at 405030h and come up with 400000h.
If you're confused, remember that global_var is stored at 405030h, and refer to
picture 2.
This picture is pretty self-explanatory. If you're still confused about how
everything works, I suggest you get a good assembly book and learn what
indirect addressing is.
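The same point can be checked from C++ itself. The following is only a sketch with made-up function names: it compares the address where the pointer variable lives with the address stored inside it, and shows that re-pointing the variable leaves the original characters alone.

```cpp
const char* globalvar = "Whats Up";   // pointer variable: it stores an address

// The pointer's own storage (like 00405030 in the listing) and the string it
// points to (like 00405034) are two different places in memory.
bool pointer_and_data_are_separate()
{
    const void* where_pointer_lives = &globalvar;
    const void* where_data_lives    = globalvar;
    return where_pointer_lives != where_data_lives;
}

// Re-pointing the variable only overwrites the stored address, which is what
// "mov global_var, <address>" does; the original characters are untouched.
bool repoint_leaves_data_untouched()
{
    static const char other[] = "Other";
    const char* original = globalvar;
    globalvar = other;
    const bool intact = (original[0] == 'W') && (globalvar[0] == 'O');
    globalvar = original;             // restore for later callers
    return intact;
}
```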
We have just dealt with a pointer variable. Let's deal with a plain variable,
because this is much simpler.
#include "stdafx.h"
#include "windows.h"
char globalvar[] = "Whats Up";
int APIENTRY WinMain(HINSTANCE hInstance,
HINSTANCE hPrevInstance,
LPSTR lpCmdLine,
int nCmdShow)
{
globalvar[0] = 'A';
globalvar[5] = 'U';
return 0;
}
Which when compiled becomes
00405030 global_var db 'Whats Up',0
mov global_var, 'A'
mov global_var + 5, 'U'
We instantly see that regular variables are a lot simpler than pointer
variables: all we have to do is refer to an address in memory which holds our
data. Of course in machine code we can't see pretty names like global_var, so
here is a pure disassembly:
00405030 'Whats Up',0
mov byte ptr [00405030], 'A'
mov byte ptr [00405035], 'U'
As you can see, we aren't doing anything special, just modifying the values
stored at 00405030 and 00405035.
You should have variables and pointer variables down pat, since this
information will not be explained again. If there is something you don't
understand, read it over.
2.3 Expressions
OK, as we all know, C++ has a near-English syntax in which we can program.
x86 assembly code doesn't. For example, take a look at the following
statement:
int s = 3 + 4 + 1 + 5 + 9;
How can we calculate something like this in assembly? Simple: look at the
following C++ example.
#include "stdafx.h"
#include "windows.h"
int s1 = 3;
int s2 = 4;
int s3 = 1;
int s4 = 1;
int APIENTRY WinMain(HINSTANCE hInstance,
HINSTANCE hPrevInstance,
LPSTR lpCmdLine,
int nCmdShow)
{
s1 = s2 + s3 - s4 + 34;
return 0;
}
Which when compiled becomes
00405030 s1 dd 3
00405034 s2 dd 4
00405038 s3 dd 1
0040503C s4 dd 1
mov eax, s2
00401008 add eax, s3
0040100E sub eax, s4
00401014 add eax, 34
00401017 mov s1, eax
OK, the compiler optimizes the code a little bit, but it's still very easy to
understand.
- The first thing the compiler does is load eax with the value of s2, using mov eax, s2.
- Now eax holds 4. Next we add s3 to eax.
- Now eax holds 5. After that we subtract s4 from eax.
- Now eax holds 4. After that we add 34 to eax.
- Now eax holds 38.
- Then we finish up by moving eax, which holds 38, into s1. Now s1 holds 38.
You will often see the compiler use registers instead of variables in
expressions, because registers are faster.
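The walk-through above can be replayed in C++ by treating eax as an ordinary int. This is just a sketch: the function name is made up, and the initial values are the ones shown in the disassembly listing (s4 is 1).

```cpp
// Globals with the values from the disassembly listing.
int s1 = 3;
int s2 = 4;
int s3 = 1;
int s4 = 1;

// Each line mirrors one instruction of the listing.
int evaluate_like_the_listing()
{
    int eax = s2;   // mov eax, s2   eax = 4
    eax += s3;      // add eax, s3   eax = 5
    eax -= s4;      // sub eax, s4   eax = 4
    eax += 34;      // add eax, 34   eax = 38
    s1 = eax;       // mov s1, eax
    return s1;
}
```

Calling evaluate_like_the_listing() leaves 38 in s1, the same value the C++ expression s2 + s3 - s4 + 34 produces.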
From this we can conclude that the compiler maps each mathematical operator
to a specific x86 instruction. Here is a table:
C++ Operator | X86 Instruction
* (Multiply) | mul (imul for signed integers, fmul for floating point)
/ (Division) | div (idiv for signed integers, fdiv for floating point)
- (Subtraction) | sub
+ (Addition) | add
As you can see, we can easily decipher most statements in C++ using the table
above.
For a test we will look at a sample disassembly dump and decompile it by hand
to C++.
00000000 2
00000001 3
00000002 4
00000003 1
00000004 0
00000005 mov al, [00000000]
add al, [00000001]
mov ch, [00000002]
mul ch
mov [00000003], ax
OK, the first thing we do is try to figure out what types of variables are
being used. From what we can see they are using al and ch, which are 8-bit
registers, so whenever a variable is referenced through an 8-bit register, the
variable is a char type.
Further down you see mov [00000003], ax, and since ax is a 16-bit register the
variable type is short int.
Here is a small table, so you can map registers to variable types:
X86 Registers | C++ Variable Types
8-bit registers (AL, AH) | char
16-bit registers (AX) | short int
32-bit registers (EAX) | int
So far we see 4 references to memory addresses, so we know we have 4
variables. The first one, [00000000], is obviously a char type variable, since
we see mov al, [00000000] and al is an 8-bit register.
So let's give [00000000] the name s1. We also see that [00000000] through
[00000002] are all referenced through 8-bit registers, meaning they are also
char type. The last one, [00000003], which is used like mov [00000003], ax, is a
short int type since ax is 16 bits.
OK, let's create another table, which will hold variable names or aliases for
the addresses. Although we can never get the original variable names, we can
always create our own.
Addresses | Variable names/aliases | Variable type
0000:0000 | s1 | char
0000:0001 | s2 | char
0000:0002 | s3 | char
0000:0003 | s4 | short int
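You might wonder why s4 covers both 00000003 and 00000004. A short int is two
bytes wide, and Intel x86 is a little-endian machine: the low-order byte of a
value is stored at the lowest address. A short int holding 1 is therefore stored
as the byte 1 at 00000003 followed by the byte 0 at 00000004.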
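As a small illustration of little-endian byte order, the sketch below encodes and decodes a 16-bit value by hand, so it behaves the same on any host. The function names store_le16 and load_le16 are made up for this example.

```cpp
#include <array>
#include <cstdint>

// Lay out a 16-bit value the way x86 stores it: low-order byte first.
std::array<std::uint8_t, 2> store_le16(std::uint16_t value)
{
    return { static_cast<std::uint8_t>(value & 0xFF),
             static_cast<std::uint8_t>(value >> 8) };
}

// Rebuild the 16-bit value from its two bytes.
std::uint16_t load_le16(const std::array<std::uint8_t, 2>& bytes)
{
    return static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8));
}
```

With this layout, store_le16(1) puts 1 in the first byte (the lower address) and 0 in the second, which is how a dw 1 appears in a hex dump.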
Now the next thing we should do is rewrite the code above with the aliases we
created:
s1 db 2
s2 db 3
s3 db 4
s4 dw 1
mov al, s1
add al, s2
mov ch, s3
mul ch
mov s4, ax
Now the first thing we do is mov al, s1. OK, al now holds a value of 2. The
next thing we do is add al, s2. Now al has a value of 5, since s2 had a value of
3 in it. The next thing we do is mov ch, s3. Now ch has a value of 4. After that
we mul ch, so ax has the value of al * ch. Since al had a value of 5 in it and
ch had a value of 4 in it, ax has the value of 20.
OK, we can start to decipher the C++ expression, which is
(s1 + s2) * s3
The parentheses matter: the addition happens before the multiplication. After
that we see mov s4, ax, so the complete C++ statement is
s4 = (s1 + s2) * s3;
As you can see, we just went through a whole bunch of mess to come up with a
simple C++ statement, and this only works for global variables, not local
variables or structure members. Things will only get harder, so I suggest you
read carefully, and if you don't understand something read it over and over
until you do.
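To double-check a hand decompilation like this one, it helps to replay the listing with fixed-width types and compare the result with the candidate C++ expression. The sketch below does that; run_listing is a made-up name, and al, ch and ax are modeled as plain local variables.

```cpp
#include <cstdint>

std::uint8_t  s1 = 2;   // db 2
std::uint8_t  s2 = 3;   // db 3
std::uint8_t  s3 = 4;   // db 4
std::uint16_t s4 = 0;   // 16-bit destination

// Replays the five instructions of the listing.
void run_listing()
{
    std::uint8_t al = s1;                                    // mov al, s1
    al = static_cast<std::uint8_t>(al + s2);                 // add al, s2 (8-bit add)
    std::uint8_t ch = s3;                                    // mov ch, s3
    std::uint16_t ax = static_cast<std::uint16_t>(al * ch);  // mul ch: ax = al * ch
    s4 = ax;                                                 // mov s4, ax
}
```

After run_listing() the variable s4 holds 20, which equals (s1 + s2) * s3. Without parentheses, s1 + s2 * s3 would evaluate to 14, so the grouping has to be made explicit in the recovered source.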
2.4. Return Values
One of the major fundamentals of C++ is returning values from function calls.
This is actually a very simple procedure, because it simply involves placing a
value into the eax register.
So when you have a statement like this
c = (char *) malloc (0xFF);
the first thing the compiler does is call malloc, and then it assigns eax to c
with mov c, eax.
For example, if you have a statement like return 5; what you are really
saying is
__asm
{
Mov eax, 5
Ret
}
Let's have a little practice with a full disassembly dump:
Mov eax,5
Add eax,2
Sub eax,1
Ret
And the C++ equivalent is
return 5 + 2 - 1;
This, although simple, is one of the most important concepts a C++ reverse
engineer can learn.
2.5 Function Calls and the Stack
Now it's time to get to the blood and guts of C++: function calls.
Function calls are fairly simple for the most part, because to an assembly
programmer they are just labels. For example:
int func () { return 1; }
func();
would compile into
func:
mov eax, 1
ret
call func
From this we can conclude one thing: function names are like variable names.
They are just references to some address, which is the same as a label.
Here is a full disassembly dump for practice
0000:0000 0
0000:0001 0
0000:0002 0
0000:0003 0
0000:0005 mov eax,1
0000:0009 ret
0000:0010 call 0000:0005 ; code starts here
0000:0015 mov [0000:0000],eax
OK, the first thing we see is that at address 0000:0015 we assign the value of
a 32-bit register to a memory address, which means we have a 32-bit variable at
hand, or an int type variable to be more exact.
So let's create an alias for the addresses 0000:0000 through 0000:0003, which
will be s1.
Now let's create a new disassembly with this added information:
S1 dd 0
0000:0005 mov eax,1
0000:0009 ret
0000:0010 call 0000:0005 ; code starts here
0000:0015 mov S1,eax
OK, the second thing we see is that the code starts at 0000:0010 and the first
instruction is call 0000:0005.
Following the call to 0000:0005, we can see that the code moves a value into
eax and then returns. Now we are back at address 0000:0015, where we move eax
into s1.
So we can now reverse engineer this whole program back into C++:
int s1 = 0;
int some_function()
{
return 1;
}
s1 = some_function();
Now what do we do when functions have parameters? Well, things get pretty
complicated, because the compiler uses the stack to handle parameters.
It pushes parameters right to left, meaning the last parameter goes in first
and the first parameter goes in last.
For example, the C++ function call
func (1, 2);
would compile into
push 2
push 1
call func
Now let's take an imaginary stack whose top is at address 32.
The first thing we realize is that ESP = 32. Every push first subtracts 4 from
ESP and then stores the value at the new ESP. With that in mind, look at the
table below.
X86 Instruction | Memory address stored at | Stack pointer value afterwards
Push 2 | [28] | ESP = 28
Push 1 | [24] | ESP = 24
Call func | [20] | ESP = 20
Push ebp | [16] | ESP = 16
Remember that when you issue a call instruction on x86 machines, the processor
stores the return address on the stack so it knows the location it should
return to.
Now that the parameters are on the stack, let's look at the function itself:
int func (int a, int b)
{
return a + b;
}
- The first thing the function does is push ebp, which leaves ESP at 16.
- Next it executes mov eax, [ESP + 8]. Since ESP equals 16, this reads [24], where the first parameter is stored.
- Then it executes add eax, [ESP + 12]. Since ESP equals 16, this reads [28], where the second parameter is stored.
- The last things it does are pop ebp and ret.
So the full compilation would be
func:
push ebp
mov eax, [ESP + 8]
add eax, [ESP + 12]
pop ebp
ret
A neat little reverse engineering tip: since every stack slot has a fixed
width of 4 bytes, you can easily tell which parameter is being accessed once the
function has set up EBP.
[EBP] = Saved EBP
[EBP + 4] = Return address
[EBP + 8] = First parameter
[EBP + 12] = Second parameter
[EBP + 16] = Third parameter
[EBP + 20] = Fourth parameter
And so on.
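The offsets in this tip can be verified with a toy model of the stack. The sketch below is not how a real CPU is implemented; ToyStack and call_func_1_2 are made-up names, the return address is an arbitrary placeholder, and push is modeled with its real semantics (ESP is decreased by 4 first, then the value is stored at the new ESP).

```cpp
#include <cstdint>
#include <map>

// A toy model of a 32-bit stack: address -> 4-byte slot.
struct ToyStack {
    std::uint32_t esp = 32;
    std::uint32_t ebp = 0;
    std::map<std::uint32_t, std::uint32_t> memory;

    void push(std::uint32_t value) { esp -= 4; memory[esp] = value; }
    std::uint32_t at(std::uint32_t address) const { return memory.at(address); }
};

// Simulates: push 2 / push 1 / call func / push ebp / mov ebp, esp
ToyStack call_func_1_2()
{
    ToyStack s;
    s.push(2);            // second argument is pushed first
    s.push(1);            // first argument is pushed last
    s.push(0x00401234);   // return address pushed by call (placeholder value)
    s.push(s.ebp);        // push ebp
    s.ebp = s.esp;        // mov ebp, esp
    return s;
}
```

After the simulated prologue, the slot at ebp + 4 holds the return address, ebp + 8 holds the first argument (1) and ebp + 12 holds the second argument (2).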
2.6 Local Variables
We just learned that parameters are stored on the stack. Now it is time to
learn about local variables, which are also stored on the stack, but quite
differently.
Here is an example:
int func ()
{
int a = 5;
return a;
}
OK, to compile this code, the compiler must first reserve space on the stack
with sub ESP, 4, since 4 bytes is the size of an int variable. Of course the
compiler must first back up the esp register, and it does this with
mov ebp, esp. But wait, before that the compiler must back up ebp, and it does
this with push ebp. So the very first thing the compiler does is:
Setting up the stack frame
push ebp
mov ebp, ESP
sub ESP, 4
Note: Without optimizations, the compiler emits code like "Setting up the stack
frame" in every function, whether or not you use local variables, and it uses
ebp to reference parameters and local variables.
In the "Function Calls and the Stack" section I used esp to reference
parameters and left out the full stack frame setup for clarity's sake.
Now the second thing the compiler does is
mov dword ptr [ebp - 4], 5
mov eax, [ebp - 4]
If we had a second local variable we could simply use
mov dword ptr [ebp - 8], 5. Of course the compiler would then use sub ESP, 8
instead of sub ESP, 4.
The last thing the compiler does is restore the stack frame and return:
mov ESP, ebp
pop ebp
ret
Note: The compiler executes this "Cleaning up the stack frame" code in every
function, so we can detect a function by looking for similar code. I also left
out the full cleanup in the "Function Calls and the Stack" section for clarity's
sake.
Here is a full disassembly dump, for practice
0000:0000 0
0000:0004 push ebp
0000:0003 mov ebp,esp
0000:0005 sub esp, 8
0000:0010 mov [ebp -4], 5
0000:0015 add [ebp - 4], [ebp + 8]
0000:0016 mov eax,[ebp - 4]
0000:0018 mov esp, ebp
0000:0020 pop ebp
0000:0021 ret
0000:0022 push ebp
0000:0023 mov ebp,esp
0000:0025 add [ebp + 8] , [0000:0000]
0000:0030 add [ebp + 8] , [ebp + 12]
0000:0031 mov eax,[ebp +8]
0000:0032 mov esp, ebp
0000:0035 pop ebp
0000:0036 ret
0000:0037 push ebp
0000:0038 mov ebp, esp
0000:0040 push 1
0000:0044 call 0000:0004
0000:0049 mov [0000:0000],eax
0000:0050 push 4
0000:0051 push 3
0000:0052 call 0000:0022
0000:0056 add [0000:0000],eax
0000:0058 mov esp, ebp
0000:0059 pop ebp
OK, the first thing we see is that memory address [0000:0000] is being
written from eax a lot, meaning we have a 32-bit variable, which is an int type.
The next thing we notice is that we set up the stack frame 3 times and clean it
up 3 times, which means we have 3 functions (and yes, int main(...) also sets up
the stack frame and cleans it up).
So we have
Func1 ()
Func2 ()
Main ()
Next we see that Func1 is at address 0000:0004 and accepts one 32-bit
parameter, because at address 0000:0040 we push 1 onto the stack and then at
address 0000:0044 we call 0000:0004. So we can set up the func1 declaration:
0000:0004 Func1 (int a)
Now whenever func1 does anything to [ebp + 8], we know that it is doing
something to its first parameter. Looking into func1's code, we see that it has
1 local variable, because it references [ebp - 4].
Now let's take a look at address 0000:0049, which is mov [0000:0000], eax, so
we know that the original C++ code is something like
[0000:0000] = func1 (1);
Next we see at address 0000:0050 that we are pushing 4 onto the stack, after
that we are pushing 3 onto the stack, and then we call 0000:0022.
Now we can set up the Func2 declaration:
0000:0022 Func2(int a, int b)
At address 0000:0056 we see add [0000:0000], eax, which means the original C++
code is something like
[0000:0000] += Func2(3,4)
Remember we pushed 4 onto the stack first and 3 onto the stack second, because
parameters are passed right to left.
Now that we have a lot of information, let's make a new disassembly, one with
aliases for all local variables and parameters in Func1 and Func2. We know that
whenever the code uses something like [ebp + ...] it's a parameter, and when it
uses something like [ebp - ...] it's a local variable.
0000:0000 s1 dd 0
0000:0004 func1(int param_1): push ebp
{ local : local_var_1}
0000:0003 mov ebp,esp
0000:0005 sub esp, 8
0000:0010 mov local_var_1, 5
0000:0015 add local_var_1 , param_1
0000:0016 mov eax,local_var_1
0000:0018 mov esp, ebp
0000:0020 pop ebp
0000:0021 ret
0000:0022 func2 (int param_1 , int param_2) : push ebp
0000:0023 mov ebp,esp
0000:0025 add param_1, s1
0000:0030 add param_1, param_2
0000:0031 mov eax,param_1
0000:0032 mov esp, ebp
0000:0035 pop ebp
0000:0036 ret
0000:0037 main(): push ebp
0000:0038 mov ebp, esp
0000:0040 push 1
0000:0044 call func1
0000:0049 mov s1,eax
0000:0050 push 4
0000:0051 push 3
0000:0052 call func2
0000:0056 add s1,eax
0000:0058 mov esp, ebp
0000:0059 pop ebp
OK, I know, I made up a little assembly syntax such as func1(int param_1) and
{local : local_var_1}. This is for clarity's sake, that's all.
Now let's start with func1. At address 0000:0010 we see that it is moving 5
into local_var_1, which in C++ is saying
int local_var_1 = 5;
Next we see add local_var_1, param_1, which in C++ is saying
local_var_1 += param_1;
The last thing we see before we clean up the stack is mov eax, local_var_1,
which in C++ is saying
return local_var_1;
So the full reverse-engineered function is
int func1(int param_1)
{
int local_var_1 = 5;
local_var_1 += param_1;
return local_var_1;
}
Now let's go to func2. At address 0000:0025 we see add param_1, s1, which in
C++ is saying
param_1 += s1;
After that we see add param_1, param_2, which in C++ is saying
param_1 += param_2;
The last thing we see before we clean up the stack is mov eax, param_1, which
in C++ is saying
return param_1;
So the full reverse-engineered function is
int func2(int param_1, int param_2)
{
param_1 += s1;
param_1 += param_2;
return param_1;
}
Now we are able to reverse engineer the whole program:
int s1 = 0;
int func1(int param_1)
{
int local_var_1 = 5;
local_var_1 += param_1;
return local_var_1;
}
int func2(int param_1, int param_2)
{
param_1 += s1;
param_1 += param_2;
return param_1;
}
int main()
{
s1 = func1(1);
s1 += func2(3,4);
}
This chapter might be a little hard to comprehend at first, since I presented
a lot of "straight to the point" information. Again, if you don't understand
something read it over, and if you still don't understand, email vbmew@hotmail.com with your question.
Chapter 3: C++ Keywords
3.1 Intro:
What we have been doing so far is the easy stuff. It's time to deal with C++
keywords, complex expressions, and some practical real-world examples.
3.2 If Statement
One of the main statements people use is the if statement, which logically
compares values. Using this statement we can choose which path of execution our
program should take.
An if statement can be very simple or very, very complex. Take a look at the
following examples.
if (i == 0)
Now what if we had something like this:
if (i == 0)
{
int i2 = 0;
}
i2 = 3;
The compiler rejects the last line because i2 is out of scope, so the compiler
must generate a separate stack frame for each if statement with braces, right?
Wrong! In reality i2 lives in the stack frame of the enclosing function (main,
for example); the compiler only hides the name outside the block. The reason I'm
telling you this is that to reverse engineer if statements you must completely
understand them.
The second example is
if ( (i == 0) || ( (i2 == 1) && (i3 == 2) ) )
The logic for this is: if i equals 0, or if i2 equals 1 and i3 equals 2.
Another example would be
if ( (c = (char *) malloc(0xFF) ) == NULL)
This is saying c = malloc(0xFF), and if malloc returns NULL this condition is
true.
Yet another example is
if (malloc(0xFF))
Last but not least:
if (!malloc(0xFF))
Thankfully all these if statements can be reverse engineered back into
(almost) exactly the way they were.
Now the if statement maps directly to the x86 instruction cmp. With this in
mind, take a look at the following C++ program:
int main()
{
int i = 0;
if (i == 34)
i += 23;
return 1;
}
This compiles into the following
push ebp
mov ebp,esp
sub esp, 4
mov [ebp - 4],0
cmp [ebp - 4], 34
jnz continue_program
add [ebp - 4],23
continue_program:
mov eax,1
mov esp, ebp
pop ebp
ret
Yes, I know, I decided to give you a complete disassembly to see if you
remember the stack frame and that [ebp - 4] means the first local variable
created. And yes, int main has to set up the stack frame like every other
function.
Now let's learn how to turn this program back into C++.
The first thing we do is look at mov [ebp - 4], 0, which tells us that the
program is initializing a variable to 0.
Next we see a cmp instruction comparing [ebp - 4] with 34. Because of this we
know the program is using an if statement, something like "if [ebp - 4] == 34".
What we should do now is create an alias for [ebp - 4]; we will use
local_var_1. Next we see the instruction jnz, which is the same as jne. It is
saying: if local_var_1 is not 34, skip over the body of the if statement and
jump to continue_program.
add [ebp - 4], 23, or add local_var_1, 23, is saying local_var_1 += 23.
After that we mov eax, 1, clean up the stack frame, and then return.
Now let's look at an if statement with multiple logical conditions:
if ( (i == 0) || (i2 == 23) && (i3 == 21) )
if_block_check1:
cmp i, 0
jne if_block_check2
jmp do_if
if_block_check2:
cmp i2, 23
jne skip_if
cmp i3, 21
jne skip_if
do_if:
; body of the if statement goes here
skip_if:
OK, the first thing we see is that in an if statement with multiple logical
conditions, when one condition of an || fails, execution jumps to the next
logical expression to see whether that will evaluate to true, as shown in
figure 3.2.1.
With the && operator, if part of the expression succeeds we continue to
evaluate the rest of the expression until something is false.
With the || operator, if one part of the expression is true we quit
evaluating the entire expression and the if statement evaluates as true.
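The jump structure described above can be written in C++ with goto, which makes the short-circuit behavior easy to observe. This is only an illustrative sketch: if_taken is a made-up function, and the comparisons counter exists purely to show how many cmp instructions would execute.

```cpp
// Mirrors the jumps for: if ((i == 0) || ((i2 == 23) && (i3 == 21)))
// 'comparisons' receives the number of compare steps that actually ran.
bool if_taken(int i, int i2, int i3, int& comparisons)
{
    comparisons = 0;

    ++comparisons;
    if (i == 0) goto do_if;        // cmp i, 0: true side of || skips the rest

    ++comparisons;
    if (i2 != 23) goto skip_if;    // cmp i2, 23 / jne skip_if

    ++comparisons;
    if (i3 != 21) goto skip_if;    // cmp i3, 21 / jne skip_if

do_if:
    return true;                   // body of the if statement would run here
skip_if:
    return false;
}
```

When i is 0 only one comparison runs; when the first half of the && fails, the third comparison is never reached.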
3.3 For Loop
The for loop is not only one of the most interesting things about C++, it is
also one of the most used statements.
What makes the for loop interesting is its ability to evaluate 3 expressions:
for ( <expression 1>; <expression 2>; <expression 3>)
The expressions are usually
for ( <assignment>; <conditional>; <increment | decrement>)
Reverse engineering the for statement is not hard, because in most cases it is
really an if statement:
if (i < 4)
{
i++;
}
Now for the for loop equivalent:
for (int i = 0; i < 4; i++)
{
}
OK, let's look at a simple disassembly of the for loop:
mov [ebp - 4], 0
jmp condition
increment:
add [ebp - 4], 1
condition:
cmp [ebp - 4], 4
jge done
loop:
jmp increment
done:
As you can see, the for loop is nothing more than a high-level if statement.
The first thing we do is initialize the local variable on the stack; after that
we check the condition. Then we run the loop body, jump back to increment, fall
through to the condition label again, and repeat until the condition is
false.
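Here is the same control flow written in C++ with labels and goto, so each line can be matched against the disassembly. It is a sketch for illustration; run_loop is a made-up name and the body just counts how many times it executes.

```cpp
// Equivalent of: for (int i = 0; i < 4; i++) { /* body */ }
// Returns how many times the loop body ran.
int run_loop()
{
    int body_runs = 0;
    int i = 0;                 // mov [ebp - 4], 0
    goto condition;            // jmp condition
increment:
    i += 1;                    // add [ebp - 4], 1
condition:
    if (i >= 4) goto done;     // cmp [ebp - 4], 4 / jge done
    ++body_runs;               // loop body
    goto increment;            // jmp increment
done:
    return body_runs;
}
```

The body runs for i = 0, 1, 2 and 3, so run_loop() returns 4, exactly as the for loop would.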
3.4 Structures
Structures are very useful in C++ because of their ability to contain
members. A structure lets you define a variable of any size. For example:
struct test1
{
int member1;
int member2;
};
This creates a 64-bit (8-byte) variable in memory. So in a sense structures
are regular variables that allow us to access certain parts of the variable
independently from the others.
This makes them very useful. If you were to use char test1[8]; you would
create exactly the same layout in memory as struct test1, only it would be much
harder to access the 4-byte members individually.
Here is an example of using test1 as a local variable:
sub ESP, 8
mov [ebp - 4], 45
mov [ebp - 8], 12
As you can see, structures may look reversed on the stack. You might think
that member1 would be the last on the stack, but it turns out it is the first:
it sits at the lowest address, [ebp - 8], with member2 above it at [ebp - 4].
For a global variable the compiler would simply reserve 8 bytes in the
executable and reference each part individually based on the member you have
chosen.
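The layout can be confirmed from C++ with offsetof. The sketch below uses std::int32_t instead of int so the 4-byte member size is guaranteed on any compiler, and read_at_offset is a made-up helper that reads a member through a raw byte offset, the way compiled code does.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct test1 {
    std::int32_t member1;
    std::int32_t member2;
};

// Members are laid out in declaration order, lowest address first.
static_assert(sizeof(test1) == 8, "two 4-byte members, no padding");
static_assert(offsetof(test1, member1) == 0, "member1 is at the start");
static_assert(offsetof(test1, member2) == 4, "member2 follows member1");

// Read a 4-byte member using only the object's address plus a byte offset.
std::int32_t read_at_offset(const test1& object, std::size_t offset)
{
    std::int32_t value;
    std::memcpy(&value,
                reinterpret_cast<const unsigned char*>(&object) + offset,
                sizeof value);
    return value;
}
```

For a local struct that starts at [ebp - 8], offset 0 corresponds to [ebp - 8] and offset 4 to [ebp - 4].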
3.6 Technical Algorithms
I am providing some algorithms to prove, and help you understand, some of the
theory I presented in this book.
The following example proves that variables inside an if block are truly
accessible to the whole function.
#include "stdafx.h"
#include "iostream.h"
int main(int argc, char* argv[])
{
__asm mov dword ptr [ebp -4], 23
if(true)
{
int i;
cout << i << endl;
}
return 1;
}
The output should be 23 even though we never initialize i. If you're confused,
remember that since i is the first and only local variable, its location
is [ebp - 4].
This next example proves that structures are just regular variables with the
added ability to be accessed in parts instead of as a whole.
#include "stdafx.h"
#include "iostream.h"
struct test1
{
int member1;
int member2;
int member3;
};
int main(int argc, char* argv[])
{
test1 local_struct;
local_struct.member1 = 1;
local_struct.member2 = 1;
local_struct.member3 = 1;
__asm
{
add dword ptr [ ebp - 12],55 ; member1
add dword ptr [ ebp - 8] , 100 ; member2
add dword ptr [ ebp - 4] , 23 ; member3
}
cout << "member 1: " << local_struct.member1 << endl;
cout << "member 2: " << local_struct.member2 << endl;
cout << "member 3: " << local_struct.member3 << endl;
return 1;
}
Output should be
member 1: 56
member 2: 101
member 3: 24
Chapter 4: Practical Decompiling
This chapter aims to provide practical decompiling knowledge. In this chapter
we will learn to use a disassembler and to decompile real-world
applications.
4.1 Intro to Windows decompiling
Windows decompiling is not that difficult, since Windows programmers follow a
strict programming pattern, such as calling CreateWindowEx or CreateDialog, and
all windows have message loops which you can easily find. Before we really
start getting into decompiling, let's go over the basics. In the vast world of
Windows there are many types of applications, and many more types of
technology.
All of it is too much to cover in one tutorial. On top of that, this
information only applies to applications that use the basic window functions,
such as CreateWindowEx and CreateDialog. Applications made in Visual Basic or
Delphi use their own engines, and those engines will not be covered. There is
also MFC, which is simply a class wrapper around API calls but can greatly
complicate things. We will be working on an application I made in pure Win32
API. All it does is show a window, but we all know showing a window requires a
significant amount of work.
1. Create the window class
From this we can get the window procedure, in which all messages are handled.
The lpfnWndProc member of the WNDCLASSEX structure contains the address of the
window procedure.
2. Create the window itself
We can retrieve every single constant by name, and most of the time the exact
C/C++ equivalent.
3. The message loop
All we have to do is look for a reference to GetMessage(...).
We start with the basic skeleton first, then move on to more complex stuff.
It's important to learn the basics first because they give you an idea of how
the application is designed. We will be using PVDasm, which you can get from my
site -
This is a very nice free disassembler.
4.2 Decompiling a sample application
First load up PVDasm, and your screen should look similar to Figure 4.2.1.
(Figure 4.2.1)
Grab CreateWindow2 (the program we are going to decompile by hand) and open
it in the disassembler. Your screen should look similar to Figure 4.2.2.
(Figure 4.2.2)
We see our entry point, but this is CRTL code (C Runtime Library). How can we
find the WinMain function? By references. We know that in WinMain we have a call
to CreateWindowEx or RegisterClassEx. If we can find where the program is
calling these functions, we can then begin to map out the program. You see,
when you compile a program, a linker links it with libraries or DLLs (dynamic
link libraries). The functions you get from these DLLs are called imports.
PVDasm can list all the imports a program has and show you the addresses from
where they are called. To use this feature press Ctrl+N or press the import
button. Your screen should look similar to Figure 4.2.3.
- Step 1. Click the import button or press Ctrl+N.
- Step 2. You should see a window with a list of imports; scroll down until you see CreateWindowEx.
Now we must find the start of the function. This is pretty easy if we follow
these rules. The start of a function:
1. Consists of
push ebp
mov ebp, esp
sub esp, <X>
2. Comes right after
mov esp, ebp
pop ebp
ret <X>
If we scroll up to address 0040104C we should see
0040104C push ebp
0040104D mov ebp, esp
0040104F sub esp, 50h
After that we see
mov dword ptr ss:[ebp- 30],0000030
mov dword ptr ss:[ebp-2c],0000000003
OK, so we know we have local variables, and they look like the members of a
structure. To find the WNDCLASSEX structure we need a reference point. A good
reference to look for is LoadCursor. Almost every application uses this call,
so simply press the import button or Ctrl+N and select LoadCursor.
Once you have selected LoadCursor you should see something similar to
00401092 call ds:LoadCursorA
00401098 mov [ebp-14], eax
OK, we know that the return value of a function is stored in the eax
register, and we know that the hCursor member of WNDCLASSEX is being set
(because we are loading a cursor). So where is hCursor in memory? It is at
ebp-14h (yes, that is 14 hex, not decimal). With this information we can
figure out where all the other members are too. Let us take a quick look at
the WNDCLASSEX structure:
typedef struct WNDCLASSEX {
UINT cbSize;
UINT style;
WNDPROC lpfnWndProc;
int cbClsExtra;
int cbWndExtra;
HINSTANCE hInstance;
HICON hIcon;
HCURSOR hCursor;
HBRUSH hbrBackground;
LPCSTR lpszMenuName;
LPCSTR lpszClassName;
HICON hIconSm;
} WNDCLASSEX;
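As you can see, it is easy to calculate the address of each structure member.
Every member of WNDCLASSEX is 4 bytes in a 32-bit program, so starting from
hCursor at ebp-14h we subtract 4 for each member declared above it (hIcon is
at ebp-18h, hInstance at ebp-1Ch, and so on up to cbSize at ebp-30h) and add 4
for each member declared below it (hbrBackground is at ebp-10h, and so on down
to hIconSm at ebp-4). Now that we know the memory location of every member, we
can begin to really understand how the window class is created.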
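The offset arithmetic can be written down as a small table builder. This is only a sketch under the assumptions stated in the text: a 32-bit build where every WNDCLASSEX member is 4 bytes, and one member whose stack slot is already known. The names MemberSlot and layout_wndclassex are invented for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: a structure member and its offset from ebp.
struct MemberSlot {
    std::string name;
    int ebp_offset;  // negative values lie below the frame pointer
};

// Given one member with a known ebp offset, derive the offsets of all others.
std::vector<MemberSlot> layout_wndclassex(const std::string& known_member,
                                          int known_offset) {
    const std::vector<std::string> members = {
        "cbSize", "style", "lpfnWndProc", "cbClsExtra", "cbWndExtra",
        "hInstance", "hIcon", "hCursor", "hbrBackground", "lpszMenuName",
        "lpszClassName", "hIconSm"};
    const int member_size = 4;  // assumption: 32-bit target, 4 bytes each

    int known_index = -1;
    for (std::size_t i = 0; i < members.size(); ++i) {
        if (members[i] == known_member) known_index = static_cast<int>(i);
    }

    std::vector<MemberSlot> slots;
    if (known_index < 0) return slots;  // unknown member name: nothing to derive
    for (std::size_t i = 0; i < members.size(); ++i) {
        int distance = static_cast<int>(i) - known_index;
        slots.push_back({members[i], known_offset + distance * member_size});
    }
    return slots;
}
```

Calling layout_wndclassex("hCursor", -0x14) places cbSize at ebp-30h and hIconSm at ebp-4, which agrees with the instructions examined in the steps that follow.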
The first thing we do is get the value of all the members in the structure,
starting with the cbSize member.
1. cbSize
The first thing we see is mov dword ptr ss:[ebp-30],00000030, and we know that
ebp-30h is the location of cbSize. So what we are really saying is
mov dword ptr ss:[cbSize],30h. We can go a step further: 30h is the size of
WNDCLASSEX (12 members of 4 bytes each), and cbSize is supposed to hold the
size of WNDCLASSEX, so we can fully decompile this line to
wc.cbSize = sizeof(WNDCLASSEX)
2. style
mov dword ptr ss:[ebp-2c],00000003
OK, what style is the program using? To figure this out we need to look into
windows.h and get all the style values. We could do a bit-by-bit comparison by
hand, but we don't have time for that, so I made a small program called
WinDasmRef. All we need to do is choose the type of section we want to look
up, in our case style from WNDCLASSEX, then enter a value, and it returns the
names of the constants that make up the value we entered.
Refer to screenshot 4.2.5 for more information.
You can get this program from http://www.crackingislife.com/modules.php?name=Downloads&d_op=getit&lid=1
- Step 1. Select a section
- Step 2. Enter a value
- Step 3. It will do a bit by bit compare for you and find all the values.
This program is nowhere near finished, but it is more than enough for this
book.
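For readers who want to see what a bit-by-bit compare looks like, here is a minimal sketch. It is not the source of WinDasmRef, whose internals are not shown in this book; the names StyleFlag and decode_class_style are invented, and the table is a partial list of the class style constants from the Windows headers.

```cpp
#include <cstdint>
#include <string>

// Hypothetical table entry: one class style bit and its name.
struct StyleFlag {
    std::uint32_t bit;
    const char* name;
};

// Turns a WNDCLASSEX style value into the names of the bits it contains.
std::string decode_class_style(std::uint32_t value) {
    static const StyleFlag flags[] = {
        {0x0001, "CS_VREDRAW"},  {0x0002, "CS_HREDRAW"},
        {0x0008, "CS_DBLCLKS"},  {0x0020, "CS_OWNDC"},
        {0x0040, "CS_CLASSDC"},  {0x0080, "CS_PARENTDC"},
        {0x0200, "CS_NOCLOSE"},  {0x0800, "CS_SAVEBITS"},
    };
    if (value == 0) return "0";

    std::string out;
    for (const StyleFlag& flag : flags) {
        if ((value & flag.bit) != 0) {
            if (!out.empty()) out += " | ";
            out += flag.name;
            value &= ~flag.bit;  // clear the bit we just named
        }
    }
    if (value != 0) {  // bits that are not in the partial table
        if (!out.empty()) out += " | ";
        out += std::to_string(value) + " (unknown bits, decimal)";
    }
    return out;
}
```

For the value 3 found at ebp-2c this yields CS_VREDRAW | CS_HREDRAW, since CS_VREDRAW is 1 and CS_HREDRAW is 2.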
3. lpfnWndProc
mov dword ptr ss:[ebp-28],00401000
This is the most important and interesting member, because it holds the
address of the window procedure, the function that handles the window's
messages. From this we can tell that the window procedure is located at
address 00401000 (in hex, of course).
4. cbClsExtra
mov dword ptr ss:[ebp-24],00000000
We are simply setting wc.cbClsExtra to 0.
5. cbWndExtra
mov dword ptr ss:[ebp-20],00000000
We are simply setting wc.cbWndExtra to 0.
6. hInstance
mov eax,dword ptr ss:[ebp+8] //hInstance parameter
mov dword ptr ss:[ebp-1C],eax //wc.hInstance
Remember the declaration of the main function is
WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow)
The first parameter (hInstance) is stored at ebp+8, and the second parameter
(hPrevInstance) is stored at ebp+0Ch (ebp+12 in decimal). Once eax holds the
value of hInstance, we simply transfer that value to [ebp-1C], which is
wc.hInstance. In other words, we are saying wc.hInstance = hInstance
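The parameter offsets follow a fixed pattern that is easy to sketch. The two helper names below are made up for illustration, and the sketch assumes a standard ebp frame (saved ebp at [ebp], return address at [ebp+4]) with every parameter taking 4 bytes.

```cpp
// Hypothetical helper: offset from ebp of the parameter with the given
// zero-based position in the declaration.
// [ebp] holds the saved ebp and [ebp+4] the return address, so the first
// parameter starts at ebp+8.
int param_ebp_offset(int index) {
    return 8 + 4 * index;
}

// Hypothetical helper: number of bytes a stdcall function removes from the
// stack with its "ret <X>" instruction.
int stdcall_ret_bytes(int param_count) {
    return 4 * param_count;
}
```

For WinMain this gives ebp+8 for hInstance, ebp+0Ch for hPrevInstance, ebp+10h for lpCmdLine and ebp+14h for nCmdShow, and four parameters account for 10h bytes on return.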
7. hIcon
mov dword ptr ss:[ebp-18],00000000
We are simply setting wc.hIcon to 0.
8. hCursor
push 00007F00
mov ecx,DWORD ptr SS:[ebp+08]
push ecx
call USER32!LoadCursorA
mov dword ptr ss:[ebp-14],eax
OK, the first thing we do is look at the declaration of LoadCursorA and find
that it is
LoadCursor (HINSTANCE hInstance, LPCSTR cursorname)
The last parameter is pushed first, so cursorname is the first value being
pushed, which is 7F00.
If the user is not using a custom cursor (most don't) we can look up its
value in WinDasmRef. And yes, you can enter hex values in WinDasmRef; just
make sure you enter 0x7F00, not 7F00. Refer to Figure 4.2.6.
(Figure 4.2.6)
Note: If you're wondering why LoadCursor.cursorname wasn't in the first
picture, it is because I'm writing this program as I'm typing this book.
mov ecx,DWORD ptr SS:[ebp+08]
push ecx
Next we move SS:[ebp+8], which is hInstance, into ecx, and then we push ecx
onto the stack. The stack now contains hInstance on top and 7F00 below it.
Then we see call USER32!LoadCursorA, which we can turn back into the complete
original line of source:
LoadCursor(hInstance,IDC_ARROW)
We know that LoadCursor returns the handle to the cursor in the eax register,
and in mov dword ptr ss:[ebp-14],eax the address ebp-14 is the position of
hCursor. Now let's decompile the entire statement:
wc.hCursor = LoadCursor(hInstance,IDC_ARROW)
9. hbrBackground
push 01
CALL GDI32!GetStockObject
mov dword ptr ss:[ebp-10],eax
OK, first we push 01 onto the stack and call GetStockObject. If we look at the
declaration of GetStockObject, which is GetStockObject(int brush), we know
that the 01 selects a stock brush, so load up WinDasmRef and type 1 in. Refer
to Figure 4.2.7 for more information.
So we know the call is GetStockObject(LTGRAY_BRUSH). After that we see
mov dword ptr ss:[ebp-10],eax. Here eax holds the handle to the brush returned
by GetStockObject, and ebp-10 is the memory location of hbrBackground, so the
fully decompiled statement is
wc.hbrBackground = GetStockObject(LTGRAY_BRUSH)
10. lpszMenuName
mov dword ptr ss:[ebp-0C],0000000
We simply set lpszMenuName to 0.
11. lpszClassName
mov edx,dword ptr ds:[0040603C]
mov dword ptr ss:[ebp-08],edx
At address 0040603C is a pointer to our class name. How can I tell? Easy: the
address is surrounded by brackets, so the instruction is reading a value from
0040603C. We can use any hex editor to look at address 0040603C, as long as
we know the image base.
The image base is the address at which the program is loaded into memory. To
see the image base, press Ctrl+P in PVDasm. A window similar to Figure 4.2.8
should come up.
(Figure 4.2.8)
We subtract the image base, which is 400000 in hex, from 0040603C, and we are
left with 603C. (Strictly speaking this is an offset relative to the image
base. It equals the file offset only when the section sits at the same offset
in the file as in memory; otherwise it has to be translated through the
section table.) If we go to offset 603C in the file we will see 30. We must
read 3 more bytes because the x86 uses 32-bit addresses, so the full value is
30 60 40 00.
These bytes are in little-endian order, which the x86 uses, so to read the
address we reverse the order of the bytes, like this: 00406030. If we subtract
the image base from that we get 6030, and if we look at offset 6030 we will
see a "D". If we keep reading up to the null terminator we will see DECOMPILE.
Now that we have the name of our class, we can fully decompile the statement
like this:
static char * szClass = "DECOMPILE"
wc.lpszClassName = szClass
This follows because mov dword ptr ss:[ebp-08],edx stores edx, which holds the
value of szClass (the address of the string), and ebp-8 is the memory location
of lpszClassName.
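The address arithmetic used to find the class name can be sketched in a few lines. This is an illustration only: the function names are made up, the file is assumed to be already loaded into a byte vector, and, as in the walk-through, the sketch assumes that subtracting the image base gives a valid file offset, which is not true for every executable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reads a 32-bit little-endian value: the byte at the lowest offset is the
// least significant one.
std::uint32_t read_le32(const std::vector<std::uint8_t>& file, std::size_t offset) {
    assert(offset + 4 <= file.size());
    return static_cast<std::uint32_t>(file[offset]) |
           (static_cast<std::uint32_t>(file[offset + 1]) << 8) |
           (static_cast<std::uint32_t>(file[offset + 2]) << 16) |
           (static_cast<std::uint32_t>(file[offset + 3]) << 24);
}

// Reads bytes until a null terminator (or the end of the buffer).
std::string read_c_string(const std::vector<std::uint8_t>& file, std::size_t offset) {
    std::string text;
    while (offset < file.size() && file[offset] != 0) {
        text.push_back(static_cast<char>(file[offset]));
        ++offset;
    }
    return text;
}

// Follows a pointer stored at pointer_va and returns the string it points to.
// Assumption: file offset == virtual address - image base.
std::string string_behind_pointer(const std::vector<std::uint8_t>& file,
                                  std::uint32_t image_base,
                                  std::uint32_t pointer_va) {
    std::uint32_t string_va = read_le32(file, pointer_va - image_base);
    return read_c_string(file, string_va - image_base);
}
```

With an image base of 400000h and the pointer at 0040603C, the bytes 30 60 40 00 decode to 00406030, and the string read from offset 6030 is the class name.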
12. hIconSm
mov dword ptr ss:[ebp-4],00000000
This is simply setting hIconSm to 0.
Now that we are done with our whole window class, let's have an overview of
all the values:
WNDCLASSEX wc;
wc.cbSize = sizeof(WNDCLASSEX);
wc.style = CS_HREDRAW | CS_VREDRAW;
wc.lpfnWndProc = WndProc;
wc.cbClsExtra = 0;
wc.cbWndExtra =0;
wc.hInstance = hInstance;
wc.hIcon =0;
wc.hCursor = LoadCursor(hInstance,IDC_ARROW);
wc.hbrBackground = (HBRUSH) GetStockObject(LTGRAY_BRUSH);
wc.lpszMenuName = NULL;
wc.lpszClassName = szClass;
wc.hIconSm = NULL;
As you can see, we have practically decompiled this back to the exact source
code. Next we see the following code:
lea eax,dword ptr ss:[ebp-30]
push eax
call USER32!RegisterClassExA
and eax,0000FFFF
test eax,eax
jnz 004010E4
push 0
push 00406054
push 0040605C
push 0
Call USER32!MessageBoxA
xor eax,eax
jmp 00401172
Let's begin with
lea eax,dword ptr ss:[ebp-30]
push eax
call USER32!RegisterClassExA
Here ss:[ebp-30] is the address of the WNDCLASSEX structure, because [ebp-30]
is the first member of the structure, cbSize. The lea instruction loads that
address into eax; we then push it onto the stack and call
USER32!RegisterClassExA. If we look at the declaration of RegisterClassEx,
ATOM WINAPI RegisterClassExA(CONST WNDCLASSEX *)
We see that it returns the type ATOM, which is 16 bits, and because of that
we see and eax,0000FFFF, which masks off the upper 16 bits so that we don't
test a full 32-bit value. After that we see
test eax,eax
jnz 004010E4
This is simply saying: if eax is not zero, jump to 004010E4. The exact C++
code for this is
if(!RegisterClassEx(&wc))
{
//bad code here
}
//else continue (004010E4)
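To see why the mask matters, the three instructions can be modelled as a tiny function. This is only a sketch: register_failed is a made-up name, and the raw eax value is passed in as a plain integer rather than coming from a real call.

```cpp
#include <cstdint>

// Models: and eax,0000FFFF / test eax,eax / jnz success
// Returns true when the branch is NOT taken, i.e. registration failed.
bool register_failed(std::uint32_t eax_after_call) {
    // ATOM is 16 bits wide, so only the low word of eax is meaningful.
    std::uint16_t atom = static_cast<std::uint16_t>(eax_after_call & 0xFFFFu);
    return atom == 0;
}
```

A value such as ABCD0000h has junk in the upper half but a zero ATOM, so it still counts as a failure, while 0000C001h is a valid ATOM and the jump to 004010E4 is taken.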
Remember, the "!" is saying: if RegisterClassEx returns 0, execute the bad
code. As we continue on, we see that the program displays a message box if
registration fails:
push 0
push 00406054
push 0040605C
push 0
Call USER32!MessageBoxA
If we look at the declaration of MessageBox
MessageBoxA(HWND hWnd, LPCSTR lpText, LPCSTR lpCaption, UINT uType)
we can match up the pushes, remembering that the last parameter is pushed
first:
- the first push 0 is the message box type (uType)
- push 00406054 is lpCaption, the address of the ASCII string "crap"
- push 0040605C is lpText, the address of the ASCII string "Can't register class"
- the last push 0 is the hWnd parameter, specifying that we have none
To see what type 0 is, let's crack open WinDasmRef.
Refer to Figure 4.2.9 for more information.
So we can decompile the whole line into
MessageBox(NULL,"Can't register class","crap",MB_OK)
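Because the last parameter is pushed first, mapping pushes back to arguments is just a reversal. A minimal sketch of that idea follows; the function name is invented for this example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Takes the push operands in the order they appear in the listing and
// returns them in the order the parameters are declared.
std::vector<std::uint32_t> args_in_declaration_order(std::vector<std::uint32_t> pushes) {
    std::reverse(pushes.begin(), pushes.end());
    return pushes;
}
```

Feeding in the four pushes seen before the MessageBoxA call (0, 00406054, 0040605C, 0) gives hWnd = 0, lpText = 0040605C, lpCaption = 00406054 and uType = 0.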
After that we see
xor eax,eax
jmp 00401172
xor eax,eax sets eax to 0, and if we go see what's at address 00401172, we
will find
mov esp,ebp
pop ebp
ret 10
which is the function's exit code, so we can decompile these lines to
return 0. The full original code is
if(!RegisterClassEx(&wc))
{
MessageBox(NULL,"Can't register class","crap",MB_OK);
return 0;
}
As you can see, decompiling is quite simple for this basic Windows code, so I
am not going to bore you with the rest. If you have any questions, please
check out our forums at http://www.eliteproxy.com/modules.php?name=Forums
More to come
Visual Basic 6.0 is next
Credit
This paper is made possible by your donations. If you would like to continue
to support Opcodevoid, please donate.
Disclaimer
This book is provided as is; no warranty is implied or granted. What is
presented in this book is copyrighted by Opcodevoid with all rights reserved.
No information or algorithms may be copied, reproduced, or distributed in any
way without written permission from Opcodevoid or Opcodevoid Inc.