Virtualization for System Programmers

31 Dec 2018, CPOL, 17 min read
Curious about how to create a hypervisor app? Read more!

(now with VS solution and automatic compilation/ISO generation for bochs)

The Infamous Trilogy: Part 2

This article targets the user who has first read my ASM tutorial (http://www.codeproject.com/Articles/45788/The-Real-Protected-Long-mode-assembly-tutorial-for) and wants to learn how Virtualization works. You want to create your own VMWare workstation? Let's go!

Background

Required items:

  • A complete understanding of how the CPU works in protected and long mode - read my article: http://www.codeproject.com/KB/system/asm.aspx.
  • Bochs source, recompiled with VMX extensions enabled. VMWare (or other virtualizers) won't work. Also, chances are that my code uses features not found in your CPU version - but you can try. Oh, and you can test it on a raw DOS PC if you are really brave. The github source includes Bochs for your convenience.
  • Very good Assembly knowledge
  • Flat Assembler (http://flatassembler.net/)
  • FreeDos (or any other DOS you might have licensed). The source includes a FreeDos floppy drive for your convenience.
  • LOTS OF PATIENCE

If you are a beginner programmer, quit right now.

If you are an advanced programmer, quit right now.

If you are an expert programmer, quit right now. When you start reading this article, you will feel like a beginner anyway.

BUT, since I am a beginner too, you will be eventually able to read what I have to say because I felt the same after I read the virtualization manuals. So keep reading!

Startup

We will create an application that prepares the CPU for virtualization, creates a guest, enters it and exits. All this will be done in x64 mode for simplicity. In x86, it is also possible, but we will focus on the x64 architecture to avoid unnecessary overhead in the code.

The code demonstrates only the basic VMX features and it might not work on your own CPU. However, you can use Bochs with virtualization enabled to test my code.

Terminology

  • VMM (Virtual Machine Monitor): The hosting application
  • VM (Virtual Machine): The guest application
  • Root Operation: The code/context the VMM runs
  • Non Root Operation: The code/context the VM runs
  • VMX Transition: Going from host to guest (VMEntry) or from guest to host (VMExit)
  • VMCS: A structure to control a VM and VMX transitions.
  • VM Entry: A transition from the host application to the guest.
  • VM Exit: A transition from the guest to the host due to some reason.

Life Cycle of VMX Operations

  • VMM checks for CPU virtualization (CPUID) and enables it (CR4 and VMXON)
  • VMM initializes a control structure, called VMCS, for each VM. Initialize a VMCS with VMCLEAR, tell the CPU which VMCS is the current one with VMPTRLD (VMPTRST reads the current pointer back), and read/write its fields with VMREAD and VMWRITE.
  • VMM enters a VM using VMLAUNCH or VMRESUME
  • VM exits to the VMM (a VMExit)
  • Do all the above over and over again
  • VMM eventually shuts itself down with VMXOFF

Does My CPU have Virtualization Support?

Yes (or you wouldn't be reading this by now anyway), but if you still want to verify, check bit 5 of ECX after executing CPUID with EAX = 1:

mov eax,1
cpuid
bt ecx,5
jc VMX_Supported
jmp VMX_NotSupported    

Once you know that your CPU supports VMX operations, you should read the IA32_VMX_BASIC MSR (index 0x480) to get implementation-specific information for your CPU:

mov ecx, 0480h
rdmsr

This 64-bit MSR has a lot of information, but at the moment, we are interested in 2 fields:

  • Bits 0 - 30: 31-bit VMX Revision Number (bit 31 is always 0, so the low dword can be stored as is)
  • Bits 32 - 44: Number of bytes (up to 4096) that a VMXON region or a VMCS should be.

The VMX revision (4 bytes) should be put at the start of every VMCS/VMXON structure so the processor knows the format that should be used to store data in it. Each VMCS/VMXON structure should be exactly the number of bytes indicated by bits 32-44 (max 4096).
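The two fields above are easy to get wrong by one bit, so here is a minimal sketch of the extraction. It is written in C rather than assembly only so that the arithmetic can be checked in user mode; the function names are hypothetical and do not appear in VMX.ASM. It assumes `msr` is the 64-bit value you assembled from EDX:EAX after the rdmsr above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (the names are illustrative, not from VMX.ASM).
   msr = ((uint64_t)edx << 32) | eax, as returned by rdmsr with ECX = 0x480. */

/* Low dword: the value to store in the first 4 bytes of a VMXON region or VMCS. */
uint32_t vmx_basic_revision(uint64_t msr)
{
    return (uint32_t)(msr & 0xFFFFFFFFu);
}

/* Bits 32-44: number of bytes a VMXON region or a VMCS should be. */
uint32_t vmx_basic_region_size(uint64_t msr)
{
    uint32_t size = (uint32_t)((msr >> 32) & 0x1FFFu);
    assert(size <= 4096);   /* the text above gives 4096 as the maximum */
    return size;
}
```

For example, an MSR value with 0x400 in bits 32-44 and 4 in the low dword describes 1024-byte regions with revision 4.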

Enabling VMX Operations

  • Enter Long Mode.
  • Set CR4's bit 13 to 1. This bit enables the VMX operations.
  • Set CR0's bit 5 to 1 (NE) - this is required for the VMXON to succeed.
  • Initialize a VMXON region.
  • Execute the VMXON instruction.

A VMCS is a 4-KB aligned memory area used to support VM operations. It consists of 3 parts: 4 bytes that hold the revision number (as returned by the 0x480 MSR), 4 bytes that are used for VMX Abort data (more on this later), and the rest, which is a collection of six groups of fields that control the VM operations.

A VMXON region has the same size and alignment as a VMCS, but you only need to initialize its revision number: put the revision returned by the 0x480 MSR above in its first 4 bytes.

The VMXON instruction requires an address (e.g., VMXON [rdi]). This address should contain the 64-bit physical address of the VMXON region (4-KB aligned) and the first 4 bytes of that region should contain the VMX revision.

File: VMX.ASM
Func: VMX_Enable 

CR4 bit set for VMX operations:

mov rax,cr4
bts rax,13
mov cr4,rax

Enable VMX:

mov [rdi],ebx ; Put the revision. rdi points to the VMXON region and ebx holds the revision
VMXON [rsi]  ; rsi points to a 64-bit variable that holds the physical address of the VMXON region
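The preparation that has to happen before VMXON can be sketched in C (again only so it can be run and checked outside ring 0). All names here are hypothetical, and the sketch assumes you already have a buffer whose physical address you know.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VMX_REGION_MAX_BYTES 4096u   /* hypothetical name; upper bound from IA32_VMX_BASIC */

/* The physical address handed to VMXON must be 4-KB aligned. */
int vmx_region_is_aligned(uint64_t phys_addr)
{
    return (phys_addr & 0xFFFu) == 0;
}

/* Zero the region and store the revision in its first 4 bytes
   (written byte by byte in little-endian order, as x86 stores a dword). */
void vmx_region_init(uint8_t *region, uint32_t size, uint32_t revision)
{
    assert(size >= 8 && size <= VMX_REGION_MAX_BYTES);
    memset(region, 0, size);
    region[0] = (uint8_t)(revision & 0xFFu);
    region[1] = (uint8_t)((revision >> 8) & 0xFFu);
    region[2] = (uint8_t)((revision >> 16) & 0xFFu);
    region[3] = (uint8_t)((revision >> 24) & 0xFFu);
}

/* CR4 bit 13 enables VMX operations; CR0 bit 5 (NE) must also be 1. */
uint64_t cr4_with_vmxe(uint64_t cr4) { return cr4 | (1ull << 13); }
uint64_t cr0_with_ne(uint64_t cr0)   { return cr0 | (1ull << 5); }
```

The same initialization applies to a VMCS, since both structures start with the revision.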

The VMCS Groups

It was easy so far, but here starts your hell. The rest of the VMCS (that is, everything after the first 8 bytes: revision + VMX Abort) is divided into 6 groups:

  • Guest State
  • Host State
  • Execution controls (non-root operation)
  • VMExit controls
  • VMEntry controls
  • VMExit information

Each of the above groups holds important information: how the VM starts (the state after a VMEntry), what the host state is after a VMExit, when a VMExit will occur, and so on.

File: VMX.ASM
Func: VMX_TryGuest and VMX_TryGuest2

The Guest State

This contains the following information (the size in bits is given in parentheses):

  • CR0,CR3,CR4,DR7,RSP,RIP,RFLAGS (64 each)
  • For each of CS,SS,DS,ES,FS,GS,LDTR,TR:
    • Selector (16)
    • Base address (64)
    • Segment limits (32)
    • Access rights (32)
  • For GDTR and IDTR:
    • Base address (64)
    • Limit (32)
  • IA32_DEBUGCTL (64)
  • IA32_SYSENTER_CS (32)
  • IA32_SYSENTER_ESP (64)
  • IA32_SYSENTER_EIP (64)
  • IA32_PERF_GLOBAL_CTRL (64)
  • IA32_PAT (64)
  • IA32_EFER (64)
  • SMBASE (32)
  • Activity State (32) - 0 Active, 1 Inactive (HLT executed), 2 Shutdown (a triple fault occurred), 3 Waiting for startup IPI (SIPI).
  • Interruptibility state (32) - a state that defines some features that should be blocked in the VM - more on that later.
  • Pending debug exceptions (64) - to facilitate hardware breakpoints with DR7 - more on that later.
  • VMCS Link pointer (64) - reserved, set to 0xFFFFFFFFFFFFFFFF.
  • *VMX Preemption timer value (32) - more on this later.
  • *Page Directory pointer table entries (4x64) - pointers to pages - more on this later.

The guest state describes the values of the registers that the CPU has after a VMEntry. Because you can totally control the registers, you can start a VM in any mode (real, protected, long, etc.). But even if you are to start a real mode VM (as my code does), you have to initialize the segment registers as normal p-mode selectors, with proper limits, access rights, etc.

The values that are used for the segment registers (limits, base address, selector, access rights and flags) are the same as those used in ordinary protected mode, so for example, you will see my code adding a 0x92 access flag for a DS read/write data segment.
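To see where a value like 0x92 comes from, here is a small C sketch that packs the access-rights field bit by bit. The function name and its parameter order are my own invention for illustration; the bit positions are the ordinary descriptor ones, with the flags nibble (AVL, L, D/B, G) placed in bits 12-15.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: packs a VMCS segment access-rights value.
   bits 0-3 type, bit 4 S (1 = code/data), bits 5-6 DPL, bit 7 present,
   bit 13 L (64-bit code), bit 14 D/B (32-bit default), bit 15 G (4-KB granularity). */
uint32_t vmcs_access_rights(unsigned type, unsigned s, unsigned dpl,
                            unsigned present, unsigned l, unsigned db, unsigned g)
{
    assert(type < 16 && dpl < 4);
    return (uint32_t)type
         | ((uint32_t)(s & 1u) << 4)
         | ((uint32_t)dpl << 5)
         | ((uint32_t)(present & 1u) << 7)
         | ((uint32_t)(l & 1u) << 13)
         | ((uint32_t)(db & 1u) << 14)
         | ((uint32_t)(g & 1u) << 15);
}
```

With type 2 (read/write data), S = 1, DPL 0 and present = 1, this yields the 0x92 mentioned above; a flat 32-bit code segment (type 0xB, D/B = 1, G = 1) comes out as 0xC09B.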

The Host State

This contains the following information (the size in bits is given in parentheses):

  • CR0,CR3,CR4,RSP,RIP (64 each)
  • CS,SS,DS,ES,FS,GS,TR selectors (16 each)
  • FS,GS,TR,GDTR,IDTR base addresses (64 each)
  • IA32_SYSENTER_CS (32)
  • IA32_SYSENTER_ESP (64)
  • IA32_SYSENTER_EIP (64)
  • *IA32_PERF_GLOBAL_CTRL (64)
  • *IA32_PAT (64)
  • *IA32_EFER (64)

The host state tells the CPU how to return to the VMM after a VMExit.

Execution Control Fields

These fields essentially tell the CPU what is allowed to be executed in the VM and what is not. Everything not allowed causes a VMExit. The sections are:

  • Pin-Based (32b) : Interrupts
  • Processor-Based (2x32b)
    • Primary: Single Step, TSC, HLT, INVLPG, MWAIT, CR3, CR8, DR0, I/O Bitmaps
    • Secondary: EPT, Descriptor Table Change, Unrestricted Guest and others
  • Exception bitmap (32b): One bit for each exception. If bit is 1, the exception causes a VMExit.
  • I/O bitmap addresses (2x64b): Controls when IN/OUT cause VMExit.
  • Time Stamp Counter offset
  • CR0/CR4 guest/host masks
  • CR3 Targets
  • APIC Access
  • MSR Bitmaps

My code only uses the pin-based and the processor-based controls for simplicity, but these fields are your real Swiss army knife; you can control entirely what the VM is and is not allowed to perform.
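My code does not use the bitmaps, but as a rough illustration of how they work, here is a C sketch that marks an I/O port and an exception vector as "causes a VMExit". The function names are hypothetical. It assumes two zeroed 4-KB pages for the I/O bitmaps (A covers ports 0x0000-0x7FFF, B covers 0x8000-0xFFFF), and the bitmaps are only consulted when the "use I/O bitmaps" primary processor-based control is set.

```c
#include <assert.h>
#include <stdint.h>

#define IO_BITMAP_BYTES 4096u   /* each I/O bitmap is one 4-KB page */

/* Hypothetical helper: make IN/OUT on `port` cause a VMExit. */
void io_bitmap_trap_port(uint8_t *bitmap_a, uint8_t *bitmap_b, uint16_t port)
{
    uint8_t *bitmap = (port < 0x8000u) ? bitmap_a : bitmap_b;
    unsigned bit = port & 0x7FFFu;          /* bit position inside the chosen page */
    assert(bit / 8 < IO_BITMAP_BYTES);
    bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
}

/* Hypothetical helper: bit n of the exception bitmap set means
   exception vector n causes a VMExit. */
uint32_t exception_bitmap_trap(uint32_t bitmap, unsigned vector)
{
    assert(vector < 32);
    return bitmap | (1u << vector);
}
```

Trapping port 0x60 (the keyboard data port) sets bit 0 of byte 12 in bitmap A, and trapping page faults (vector 14) gives an exception bitmap of 0x4000.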

VM-Exit Control Fields

These fields tell the CPU what to load and what to discard in case of a VMExit:

  • VMExit Controls (32b)
  • VMExit Controls for MSRs

VM-Entry Control Fields

  • VMEntry Controls (32b)
  • VMEntry Controls for MSRs
  • VMEntry Controls for event injection

This event injection is your second weapon. When a VM exits, you can inject an event so the VM believes that the exception was generated by its code. Yes, a VMM can become really mighty.
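To make the injection less abstract, here is a sketch in C of how the 32-bit "interruption information" value for an injected event is laid out: vector in bits 0-7, event type in bits 8-10, "deliver error code" in bit 11 and "valid" in bit 31. The enum and function names are hypothetical; my code does not inject anything, so treat this as an illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Event types for bits 8-10 (the names are illustrative). */
enum {
    EVENT_EXTERNAL_INTERRUPT = 0,
    EVENT_NMI                = 2,
    EVENT_HARDWARE_EXCEPTION = 3,
    EVENT_SOFTWARE_INTERRUPT = 4,
    EVENT_SOFTWARE_EXCEPTION = 6
};

/* Hypothetical helper: builds the value to write into the VMEntry
   interruption-information field before the next VMRESUME. */
uint32_t vmentry_interruption_info(unsigned vector, unsigned type, int has_error_code)
{
    assert(vector < 256 && type < 8);
    return 0x80000000u                        /* bit 31: valid */
         | (has_error_code ? 0x800u : 0u)     /* bit 11: deliver error code */
         | ((uint32_t)type << 8)
         | (uint32_t)vector;
}
```

Injecting a general protection fault (vector 13, hardware exception, with error code) gives 0x80000B0D; the error code itself goes into a separate VMEntry field.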

VM-Exit Information (Read only) Fields

  • Basic information
    • Exit Reason (32)
    • Exit Qualification (64)
    • Guest Linear Address (64)
    • Guest Physical Address (64)
  • Vectored exit information
  • Event delivery exits
  • Instruction execution exits
  • Error field

The VMCS Initialization

To mark a VMCS for further reading/writing with VMREAD or VMWRITE, you would first initialize its first 4 bytes to the revision (as with the VMXON structure above), then execute a VMCLEAR with its address (VMLAUNCH requires a cleared VMCS), and then execute a VMPTRLD with its address.

Appendix H of the 3B Intel Manual has a list of all indices. For example, the index of the RIP of the guest is 0x681e. To write the value 0 to that field, we would use:

mov rax,0681eh  ; VMCS field index: guest RIP
mov rbx,0       ; value to write
vmwrite rax,rbx ; CF or ZF is set if the write fails

This means that, after a successful VM Entry, the guest will start with RIP set to 0.
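The indices are not random numbers: each one encodes the field's width and group. Here is a small C sketch that decodes an index (hypothetical function names; the bit layout is bit 0 = access type, bits 1-9 = index within the group, bits 10-11 = group, bits 13-14 = width).

```c
#include <assert.h>
#include <stdint.h>

/* Groups (bits 10-11) and widths (bits 13-14); the names are illustrative. */
enum { FIELD_CONTROL = 0, FIELD_EXIT_INFO = 1, FIELD_GUEST_STATE = 2, FIELD_HOST_STATE = 3 };
enum { WIDTH_16 = 0, WIDTH_64 = 1, WIDTH_32 = 2, WIDTH_NATURAL = 3 };

/* Position of the field inside its group. */
unsigned vmcs_field_index(uint32_t encoding)
{
    assert((encoding >> 15) == 0);   /* the upper bits are reserved and must be 0 */
    return (encoding >> 1) & 0x1FFu;
}

unsigned vmcs_field_group(uint32_t encoding) { return (encoding >> 10) & 3u; }
unsigned vmcs_field_width(uint32_t encoding) { return (encoding >> 13) & 3u; }
```

Decoding 0x681e gives a natural-width field in the guest-state group, which is what you would expect for the guest RIP; 0x4402 (the exit reason) decodes as a 32-bit read-only exit information field.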

Giving Memory to your VM

You would think you are done? Hahahah. Not so fast. You have to give your new Virtual Machine some memory to work with, and you have to configure the EPT. EPT is a mechanism that translates guest physical addresses to host physical addresses. Fortunately for you, it uses the same 4-level table layout as the known long mode paging mechanism (only the permission bits in the entries differ), so you can review it in my article.

Originally, the VMX capabilities of the CPU required guests to start in paged protected mode, and VMM applications usually put the virtual CPU into VM86 mode to allow OSes (which expect a clean real mode boot) to work. Later CPUs introduced the "Unrestricted Guest" flag (bit 7 in the secondary processor-based execution controls) that allows a guest to start in real mode. However, putting the virtual CPU in real mode means we have to map the lower 640 KB, so we have to use EPT.

If your CPU doesn't allow the "unrestricted guest" mode, then you can setup a protected mode guest using similar code, because my code creates protected mode style segments anyway. The github project, which uses Bochs, automatically creates a protected mode guest.

Of course, depending on your guest's initial state (for example, if you'd want to start a guest in long mode), you would also need to configure guest PAE, paging, a proper CR4 and so on. But our little application will configure a real mode guest, so it needs to map a region of host memory to guest physical addresses. EPT translation uses the lower 48 bits of the address (as current CPUs do for normal paging - the entire 64-bit range is not used).
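As a rough C sketch of what the EPT tables contain (the names are hypothetical, and the tables in my code are built in assembly): each level is indexed by 9 bits of the guest physical address, an entry carries read/write/execute permissions in bits 0-2, a leaf entry also carries the memory type in bits 3-5, and the EPT pointer written to the VMCS holds the physical address of the top table plus the walk length.

```c
#include <assert.h>
#include <stdint.h>

#define EPT_READ        (1ull << 0)
#define EPT_WRITE       (1ull << 1)
#define EPT_EXEC        (1ull << 2)
#define EPT_MEMTYPE_WB  6ull        /* write-back memory type */

/* Index into the table at `level` (3 = PML4, 2 = PDPT, 1 = PD, 0 = PT). */
unsigned ept_index(uint64_t guest_phys, unsigned level)
{
    assert(level <= 3);
    return (unsigned)((guest_phys >> (12 + 9 * level)) & 0x1FFu);
}

/* Non-leaf entry: points to the next table, all permissions granted. */
uint64_t ept_table_entry(uint64_t next_table_phys)
{
    assert((next_table_phys & 0xFFFu) == 0);
    return next_table_phys | EPT_READ | EPT_WRITE | EPT_EXEC;
}

/* Leaf entry for one 4-KB page of host memory, write-back, all permissions. */
uint64_t ept_page_entry(uint64_t host_phys)
{
    assert((host_phys & 0xFFFu) == 0);
    return host_phys | (EPT_MEMTYPE_WB << 3) | EPT_READ | EPT_WRITE | EPT_EXEC;
}

/* EPT pointer: top table address, 4-level walk (length - 1 = 3), write-back. */
uint64_t ept_pointer(uint64_t pml4_phys)
{
    assert((pml4_phys & 0xFFFu) == 0);
    return pml4_phys | (3ull << 3) | EPT_MEMTYPE_WB;
}
```

A "see-through" mapping simply uses `ept_page_entry(gpa)` for every guest physical page `gpa`, so guest physical and host physical addresses are identical.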

The code currently tests a protected mode guest. Initialization for this VM is in VMX_Initialize_Guest2. This time, CR0 is set to be in protected paged mode, and CR3 is loaded with the page directory (the very same used for normal protected mode since our EPT is a see-through). File guest32.asm has the entry point for our protected mode guest and this time, the selectors are ready to go. It merely sets a flag and exits to the VMM with VMCall.

Launch It!

Having initialized the VMCS properly (ok, that's a joke, but I have to say "properly" anyway - prepare for LOTS of failures here), the VMLAUNCH opcode will start the execution of the virtual machine (from the CS:RIP set in the VMCS guest state). If the entry fails, either the carry flag (there is no valid current VMCS) or the zero flag (an error number is available in the VMCS) will be set immediately after execution of VMLAUNCH.

This is where BOCHS will help you. After VMLAUNCH fails, the bochs debugger window will show you a message depending on what went wrong, so you will get an idea what to fix in the VMCS.

If VMLAUNCH succeeds, control will not return to the host until a VM Exit occurs. When a VM exit occurs, control is transferred to the VMM's exit routine (as configured in the VMCS host state fields). In my code, the exit routine merely checks a flag set by the guest's entry code to know whether that code was successfully executed.

Note that, even if VMLAUNCH succeeds, starting the VM might immediately cause a VMExit due to any fault (page faults, EPT misconfigurations, etc). In that case, VMLAUNCH will succeed but control will immediately return to your exit routine without any guest code having been executed.
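The failure check after VMLAUNCH can be sketched in C as a function of the RFLAGS value you saved right after the instruction (CF is bit 0, ZF is bit 6). The names are hypothetical, and only a few of the error numbers are listed; the number itself is read with VMREAD from the VM-instruction error field (index 0x4400) and is only meaningful when ZF was set.

```c
#include <assert.h>
#include <stdint.h>

enum vmx_result { VMX_OK, VMX_FAIL_INVALID, VMX_FAIL_VALID };

/* Hypothetical helper: classify the outcome of a VMX instruction from RFLAGS. */
enum vmx_result vmx_result_from_rflags(uint64_t rflags)
{
    if (rflags & (1ull << 0)) return VMX_FAIL_INVALID;   /* CF: no valid current VMCS */
    if (rflags & (1ull << 6)) return VMX_FAIL_VALID;     /* ZF: error number available */
    return VMX_OK;
}

/* Hypothetical helper: text for a few common VM-instruction error numbers. */
const char *vm_instruction_error_text(enum vmx_result result, uint32_t error)
{
    assert(result == VMX_FAIL_VALID);   /* the field means nothing otherwise */
    switch (error) {
    case 4:  return "VMLAUNCH with non-clear VMCS";
    case 5:  return "VMRESUME with non-launched VMCS";
    case 7:  return "VM entry with invalid control field(s)";
    case 8:  return "VM entry with invalid host-state field(s)";
    default: return "other (see the Intel manual)";
    }
}
```

Errors 7 and 8 are the ones you will most likely meet first. An invalid guest state is reported differently: as a VMExit with the "entry failure" bit set in the exit reason.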

Launched, Now What?

Nothing. The VM executes as if nothing is present, unless you make something present. You now need to implement your own BIOS, copy it into the virtual memory at the proper address (so execution starts from 0xF000:0xFFF0), and write your drivers to transfer data between the actual hardware and memory and the virtual hardware and memory you may have allowed within the VM. Yup, that's why VMWare Workstation is some 500 MB in size; it contains BIOS, drivers and communication protocols to allow, e.g., a virtual screen (which is seen as an actual device by the guest) to be shown on your actual screen within a window. The same goes for USB hardware, which is duplicated from the actual system to the virtual system.

For a simple test, one might think that it should be easy to copy the actual BIOS to the virtual memory so, for example, DOS can boot. Right, and from what device will DOS boot since there is none in the VM? That's why you have to duplicate an actual device into the VM using your custom BIOS in order to communicate with the host with a specific protocol, then emulate the allowed devices in order for the VM to function properly.

My application simply forwards memory in a see-through style, so calling BIOS and DOS from the VM is possible. But in real life, you don't want to do that, as then the VM can ruin the VMM because they share the same memory.

In real life also, if the "unrestricted guest" feature isn't available, you have to start the guest in paged protected mode with VM86 and, if the guest wants to switch itself to protected/long mode (as an OS does), you must catch the VMExit (which would occur when the guest software attempts to execute LGDT, for example) and emulate all the operations that would otherwise fail (LGDT, LIDT, CR0 writes, paging initialization, etc.) so the guest can assume that its operations were successful.

VM Exits

A VMExit can occur for various reasons: because you requested it in the VMCS execution control fields, because the VM entered a shutdown state (for example, a ring 0 crash that would reset the CPU if it ran on actual, non-virtualized hardware), or for anything else. Execution resumes at the CS:RIP saved in the VMCS host state, and you can read the VMCS exit information (read only) fields to detect the reason for the exit.

Use VMRESUME to resume the VM after an exit.
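What an exit routine does with the exit reason can be sketched in C as follows. This is not what my VMX.ASM does (it only checks a flag); the names and the chosen actions are hypothetical. The 32-bit exit reason holds the basic reason in bits 0-15, and bit 31 is set when the VMEntry itself failed.

```c
#include <assert.h>
#include <stdint.h>

/* A few basic exit reasons (the names are illustrative). */
enum {
    EXIT_EXCEPTION_OR_NMI    = 0,
    EXIT_TRIPLE_FAULT        = 2,
    EXIT_CPUID               = 10,
    EXIT_HLT                 = 12,
    EXIT_VMCALL              = 18,
    EXIT_IO_INSTRUCTION      = 30,
    EXIT_INVALID_GUEST_STATE = 33,
    EXIT_EPT_VIOLATION       = 48,
    EXIT_EPT_MISCONFIG       = 49
};

enum exit_action { ACTION_RESUME_GUEST, ACTION_STOP_GUEST, ACTION_REPORT_ERROR };

uint32_t exit_basic_reason(uint32_t exit_reason)   { return exit_reason & 0xFFFFu; }
int      exit_is_entry_failure(uint32_t exit_reason) { return (int)((exit_reason >> 31) & 1u); }

int exit_reason_is(uint32_t exit_reason, uint32_t basic)
{
    assert(basic <= 0xFFFFu);   /* basic reasons occupy 16 bits */
    return exit_basic_reason(exit_reason) == basic;
}

/* Hypothetical policy: decide what the VMM does next. */
enum exit_action exit_dispatch(uint32_t exit_reason)
{
    if (exit_is_entry_failure(exit_reason))
        return ACTION_REPORT_ERROR;          /* the guest never ran */
    switch (exit_basic_reason(exit_reason)) {
    case EXIT_VMCALL:
    case EXIT_HLT:
        return ACTION_STOP_GUEST;            /* the guest is done or asks for the host */
    case EXIT_CPUID:
    case EXIT_IO_INSTRUCTION:
        return ACTION_RESUME_GUEST;          /* emulate, advance guest RIP, VMRESUME */
    default:
        return ACTION_REPORT_ERROR;
    }
}
```

Remember that for exits caused by an instruction, the guest RIP still points at that instruction, so the VMM has to advance it before VMRESUME or the guest will exit again immediately.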

VMCall

Some systems know that they run under virtualization (for example, VMWare drivers) and they do want to jump back to their host in order to exchange information. The VMCall opcode causes a VMExit to the host, and the virtualized system can exchange information with the host. My code also uses VMCall to exit to the host.

Of course, if VMCall is executed outside VMX operation, an invalid opcode exception (#UD) is raised.

Control MSRs

For simplicity, my code doesn't check for all features (that's the most probable reason it won't work on your raw DOS machine), but you should check the VMX MSRs for available features before using them. Intel's 3B Appendix G contains all these MSRs. To read an MSR, you put its number in ECX and execute the rdmsr opcode. The 64-bit result is returned in EDX:EAX (high dword in EDX, low dword in EAX).

  • IA32_VMX_BASIC (0x480): Basic VMX information including revision, VMCS size, memory types and others.
  • IA32_VMX_PINBASED_CTLS (0x481): Allowed settings for pin-based VM execution controls.
  • IA32_VMX_PROCBASED_CTLS (0x482): Allowed settings for processor based VM execution controls.
  • IA32_VMX_PROCBASED_CTLS2 (0x48B): Allowed settings for secondary processor based VM execution controls.
  • IA32_VMX_EXIT_CTLS (0x483): Allowed settings for VM Exit controls.
  • IA32_VMX_ENTRY_CTLS (0x484): Allowed settings for VM Entry controls.
  • IA32_VMX_MISC (0x485): Miscellaneous data, such as the rate of the VMX preemption timer relative to the TSC, supported activity states and others.
  • IA32_VMX_CR0_FIXED0 (0x486) and IA32_VMX_CR0_FIXED1 (0x487): Indicate which CR0 bits are fixed in VMX operation: a bit set in FIXED0 must be 1 in CR0, and a bit clear in FIXED1 must be 0.
  • IA32_VMX_CR4_FIXED0 (0x488) and IA32_VMX_CR4_FIXED1 (0x489): Same for CR4.
  • IA32_VMX_VMCS_ENUM (0x48A): enumerator helper for VMCS.
  • IA32_VMX_EPT_VPID_CAP (0x48C): provides information for capabilities regarding VPIDs and EPT.
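How these MSRs are meant to be used can be sketched in C (hypothetical names; my code skips most of these checks). For the four "allowed settings" control MSRs, the low dword lists the bits that must be 1 and the high dword lists the bits that may be 1. For the FIXED0/FIXED1 pairs, a bit set in FIXED0 must be 1 and a bit clear in FIXED1 must be 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: force a desired control value into what the CPU allows.
   capability_msr is e.g. IA32_VMX_PINBASED_CTLS, as ((uint64_t)edx << 32) | eax. */
uint32_t vmx_adjust_controls(uint32_t desired, uint64_t capability_msr)
{
    uint32_t must_be_one = (uint32_t)capability_msr;
    uint32_t may_be_one  = (uint32_t)(capability_msr >> 32);
    assert((must_be_one & ~may_be_one) == 0);   /* a mandatory bit is always allowed */
    return (desired | must_be_one) & may_be_one;
}

/* Hypothetical helper: is every bit in bit_mask allowed to be 1? */
int vmx_control_supported(uint32_t bit_mask, uint64_t capability_msr)
{
    return ((uint32_t)(capability_msr >> 32) & bit_mask) == bit_mask;
}

/* Hypothetical helper: apply the CR0 or CR4 FIXED0/FIXED1 constraints. */
uint64_t vmx_apply_fixed_bits(uint64_t cr, uint64_t fixed0, uint64_t fixed1)
{
    return (cr | fixed0) & fixed1;
}
```

For example, checking bit 7 of the high dword of IA32_VMX_PROCBASED_CTLS2 with `vmx_control_supported` tells you whether "Unrestricted Guest" can be enabled. If `vmx_adjust_controls` silently drops a bit you asked for, the feature is not available and the guest setup has to change accordingly.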

Creating the Hypervisor Virus

So far, we are interested in the VM science all right, but the programmer's soul will always contain notorious feelings like killing, revenging, cheating, cracking and all that sort of stuff.

Now, we'll take into account the fact that you are evil (or you wouldn't make it up to here) and discuss Blue Pill. Blue Pill is a hypervisor virus that controls the entire OS. For this to work, you would simply map the entire memory as a see-through and start the VM with Windows in it, while configuring almost anything to cause a VMExit. Now, whatever Windows tries will be reported to your hypervisor via VMExits, and using the injection technology, you can fake any response - and since Intel doesn't have a (known) way to detect if an application is running under virtualization, you will never get caught. Never? Who knows - but if you ever get caught, let me assure you that I know nothing about it. :)

But wait! CR4's bit 13 should be 1 inside a Virtual Machine, so if that bit is 0, you definitely know you are not virtualized! But if this bit is 1, do you really know if you are under a VMM? Who knows. If anybody gets the Windows Loader source and finds out that a mov eax,cr4 - test eax,0x2000 - jnz WE_ARE_OWNED sequence is there, let me know.

Another possible option to test if you are owned is the VMCall, which would raise an exception that you can catch. However, who assures you that there wasn't an exit and that your host didn't inject an exception for you to catch, so that you assume you are free?

Another possible option applies if the CPU does not support unrestricted guests and you were started in VM86 mode. If you see that you are running in VM86 mode, then chances are that you are virtualized. But whoops - did we forget EMM386.EXE? Still, Windows NT-based OSes do not load any DOS drivers, so if the NT loader checks for VM86 mode and finds it enabled, it may assume it is under virtualization.

Conclusion

As you saw, virtualization is not initially a very complex subject, but to make something that really works, you need to implement a BIOS, drivers, etc. That's why not many programmers really try such a thing and that's why only a few applications support virtualization. VMWare has completed a great deal of work to make their Workstation actually do the job.

The code is imported from my previous article and organized in 6 files. It is rather dirty, but it works.

Try it and tell me. If it doesn't work, tell me and help me to improve it. Either way, the fact that you are reading up to here is appreciated.

If you aren't disappointed by now, I urge you to apply to a Virtualization Software company like VMWare for a job - you will do very well. And tell them you have read my article, they might hire me as well. :)

GOOD LUCK!

History

  • 31-12-2018: Happy new year, include VMX code in the main github project
  • 11-01-2015: Happy new year, some formatting
  • 02-07-2012: Added protected mode guest and fixed some minimal bugs
  • 26-06-2011: First release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)