Preface
This article was written back in October 2005 but was never published. This is the final draft submission as edited by Dr. Dobb’s Journal. During that period I submitted three articles, and two of them were published, as linked below. This one never was, and with DDJ now closing its doors it never will be. Since the article was already written, I decided to publish it here for archive purposes, even though its context is now 10 years in the past.
Sharing Memory with the Virtual Machine
Introduction to Power Debugging
I have not modified this article since I originally wrote it in October 2005. There are several things I would write differently today, but since the article was not written for today I have left it unchanged. Read it from the perspective of that time: we had single-core processors, some of which were x64 capable, yet they were still commonly used with operating systems such as 32-bit Windows XP. This was a fun article on a "what-if" scenario, comparing DOS Extenders, which bridged 16-bit and 32-bit, to the possibility of a "Windows Extender" bridging 32-bit and 64-bit.
Introduction
It wasn’t that long ago that most PCs were running 16-bit operating systems on 32-bit processors. In those days, MS-DOS reigned, “Ralf Brown” and “DPMI” were familiar names, and everyone knew that port 3DAh was for vertical retrace. Okay, so there were a few people who used 32-bit operating systems such as OS/2 or a flavor of UNIX. Developers, though, generally had to target 16-bit operating systems, thereby losing out on the benefits of 32 bits. DOS Extenders, libraries that enabled Protected mode and let applications use 32-bit instructions without a 66h instruction prefix, addressed this problem. They also let you access up to 4 GB of memory, depending on the characteristics of the particular extender you were using. The most popular DOS Extenders were PROT (see “Roll Your Own DOS Extender,” by Al Williams; DDJ, October 1990), DOS/4GW (Rational Systems’ 32-bit extender), Phar Lap’s DOS Extender, and Tran’s PMODE (which integrated nicely with Watcom’s C/C++ compiler). Most extenders simply used the DOS Protected Mode Interface (DPMI), which abstracted the implementation of Protected mode behind an interrupt. They also provided an easy method for executing 16-bit BIOS interrupts, so you did not need to implement things like switching video modes yourself. Others implemented 32-bit Protected mode themselves, while some just did “Big Real mode” (also known as “Unreal mode”).
Today, we face a similar situation with 64-bit capable machines running 32-bit operating systems. Granted, this time around we won’t have to wait as long for more powerful mainstream operating systems to emerge. Still, being able to use the features of 64-bit processors is something we’d like to do. To that end, I present in this article a driver that extends Windows to take advantage of a CPU’s 64-bit features, in much the same way that DOS Extenders did for 16- and 32-bit systems.
This Windows driver sets the processor into 64-bit mode, letting you run 64-bit applications. The driver saves the operating system state so that it can be restored afterward. Granted, this driver is simple. It doesn’t do any scheduling, for instance. Nevertheless, it lets you take advantage of the processor’s powerful features using Windows XP without installing a new operating system.
What Is Long Mode?
The processor mode that enables native 64-bit is called “Long mode.” It has two submodes: Native-64 mode and Compatibility mode. Native-64 mode allows the execution of 64-bit instructions and defines new behaviors. Compatibility mode executes 32-bit and 16-bit applications, akin to Protected mode. There are differences between Protected mode and Long mode aside from simply extending the bits. These can be found in the AMD64 processor architecture manuals (http://www.amd.com/). Here, I provide a quick overview of some of the more interesting differences. At this writing, I have not compared AMD64 with Intel’s EM64T.
No Virtual 8086 Mode. The AMD64 does not support V86 mode when the processor is in Long mode. V86 is a mode that allows the operating system to isolate and schedule applications written for Real mode while running in a Protected mode environment. In short, this is processor support for virtualization of 16-bit legacy applications. The processor still supports V86 as a submode when in Protected mode. Again, the processor only supports two submodes when in Long mode: Native-64 and Compatibility mode. In Native-64 mode, the selector references a Long mode descriptor and executes 64-bit instructions. In Compatibility mode, the selector references a legacy descriptor and executes 16- or 32-bit instructions.
Must Enable Paging. The Protected mode architecture did not require you to implement paging, which made Protected mode quite simple for DOS Extenders to implement. This is not true of Long mode, which not only requires paging but implements it as an extension of Protected mode’s Physical Address Extensions (PAE).
No Segmentation. The base address specified in the descriptor is not used when in Native-64-bit Long mode. The virtual address is the linear address in this model. There are no intermediate translations occurring on the virtual address before the paging translation. The majority of the descriptor is actually not used when the Long mode bit is set. If the Long mode bit is not set in the descriptor, then the system is in Compatibility mode and the descriptors are interpreted the same way as they are in Protected mode.
More General-Purpose Registers. There are eight new registers, R8 through R15, which come in 8-, 16-, 32-, and 64-bit forms. These registers are available when the processor is in Native-64-bit mode.
PAE Paging Architecture. The architecture used in Physical Address Extensions is the basis for the paging implementation in Long mode. It needs to be understood before you can write a Windows Extender. Figure 1 shows a graphical outline of the 4K paging model in 32-bit Protected mode.
The processor actually supports 2-MB, 4-MB, and 4-KB paging implementations, which can also be intermixed. The Virtual Address used in software is first added to the base address of the segment descriptor, which lives in the descriptor tables maintained by the operating system. The result is a linear address that is then subject to the paging mechanism. The bits in the linear address index into several OS-implemented tables. The CR3 register points to the first table in the hierarchy for the current process. In non-PAE paging, this register simply points to a single Page Directory. PAE introduces a third table named “Page Directory Pointers” that contains four pointers, each referencing a complete Page Directory. The Page Directory itself is a table that references Page Tables in the 4K paging implementation. Each entry in a Page Table references the base of a 4K page of physical memory, which is then indexed by the final bits of the linear address.
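The indexing just described can be sketched in a few lines. This is only an illustration of the PAE 4K layout (2 bits for the Page Directory Pointers, 9 bits each for the Page Directory and Page Table, and a 12-bit offset); the function name and the sample address are made up for the example and are not taken from the driver.

```python
def split_pae_address(linear):
    """Split a 32-bit linear address into its PAE 4-KB paging fields."""
    if not 0 <= linear <= 0xFFFFFFFF:
        raise ValueError("linear address must fit in 32 bits")
    return {
        "pdpt_index": (linear >> 30) & 0x3,    # bits 31-30: one of 4 Page Directory Pointers
        "pd_index":   (linear >> 21) & 0x1FF,  # bits 29-21: one of 512 Page Directory entries
        "pt_index":   (linear >> 12) & 0x1FF,  # bits 20-12: one of 512 Page Table entries
        "offset":     linear & 0xFFF,          # bits 11-0: byte within the 4-KB page
    }


# Hypothetical address in the high (kernel) part of the address space.
fields = split_pae_address(0xF8A01234)
```

Each index selects one 8-byte entry in its table, which is why the Page Directory and Page Table indices are 9 bits wide: 512 entries of 8 bytes fill exactly one 4K page.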
Figure 2 shows the complete breakdown in 4K paging mode. The implementation of PAE lets the operating system access physical memory using addresses up to 36 bits wide. The virtual address space, however, is not increased and remains 32 bits. This lets the processor spread more physical memory across multiple processes, since each has an isolated virtual address space. The operating system can then take advantage of systems with more than 4 GB of RAM to optimize paging. History also repeats itself, as the operating system can now implement “views” of virtual memory that let applications map more than 4 GB of address space. This brings us back to the days of XMS/EMS, although there does not actually need to be over 4 GB of physical memory to offer this feature to applications, because the operating system can simply lie. The Physical Address Extensions were expanded in the 64-bit architecture with the introduction of the Page Map Level 4 table. This table allows indexing of multiple Page Directory Pointer Tables.
Figure 4 illustrates the Virtual Address in Long mode with the new paging. The paging bits are actually identical to PAE, aside from the PDP index being expanded up to bit 38 and the addition of the PML4 index followed by a sign extension. The descriptor table also plays no role in calculating the linear address when the submode is Native-64. The sign extension repeats the top implemented address bit through bit 63, so it is either all 1s or all 0s. The upper half of the address space therefore behaves like a range of negative addresses, so that mathematical operations on an address still result in a correct mapping.
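As a rough sketch of this layout, assuming the 48-bit virtual addresses of the AMD64 implementations of the time (PML4 index in bits 47-39, sign extension in bits 63-48), the split and the sign-extension check look like this. The function names are illustrative only.

```python
def is_canonical(vaddr):
    """True if bits 63-48 are all copies of bit 47 (the sign extension)."""
    top = vaddr >> 47              # bit 47 plus the 16 sign-extension bits
    return top == 0 or top == 0x1FFFF


def split_long_mode_address(vaddr):
    """Split a 64-bit virtual address into its Long mode 4-KB paging fields."""
    if not 0 <= vaddr <= 0xFFFFFFFFFFFFFFFF:
        raise ValueError("virtual address must fit in 64 bits")
    if not is_canonical(vaddr):
        raise ValueError("sign extension must be all 0s or all 1s")
    return {
        "pml4_index": (vaddr >> 39) & 0x1FF,  # bits 47-39: Page Map Level 4 entry
        "pdp_index":  (vaddr >> 30) & 0x1FF,  # bits 38-30: Page Directory Pointer entry
        "pd_index":   (vaddr >> 21) & 0x1FF,  # bits 29-21: Page Directory entry
        "pt_index":   (vaddr >> 12) & 0x1FF,  # bits 20-12: Page Table entry
        "offset":     vaddr & 0xFFF,          # bits 11-0: byte within the 4-KB page
    }


# The first address of the upper ("negative") half of the address space.
upper_half = split_long_mode_address(0xFFFF800000000000)
```

Compared with the 32-bit PAE split, only the PDP index grows (from 2 bits to 9) and the PML4 index is new; the lower three fields are unchanged.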
The Concept Architecture
Again, my main goal was to create a basic 64-bit Windows Extender. The application running in user mode simply sends an IOCTL to the driver with a buffer that contains the 64-bit code. In turn, the driver thunks the current operating system, then executes this code. Once the code finishes, the driver restores the operating system and completes the IRP. The first step was to write a basic driver shell and application. The second phase was to divide the project into stages and, at each stage, run a basic test that validates that the code is working correctly. Since I only have one AMD64, the machine I use for my testing is an Athlon64 3200+ running Windows XP. Most of the project runs outside the scope of the normal operating system. This means that if something goes wrong, the machine most likely reboots. In the absence of any hardware tools, you need to implement your own debugging aids. This could mean writing to the screen, adding code incrementally, or selectively enabling code and adding lots of debug messages. Here is a short list of work items and how I verified that I passed each phase:
Save and Restore Operating System. The starting point was to save the operating system state and attempt to restore it. This would help to validate that the code was correct and would be fairly easy to do. The saving of the state focused on anything that would be changed by our code during the thunk. This included the descriptor tables, the control registers, and the selectors.
Create the Global Descriptor Table. The Global Descriptor Table is the only table that is mandatory. The interrupt table can be bypassed simply by disabling interrupts. To build up the descriptor table, I referred to the AMD64 architecture manuals and created structures for legacy and Long mode descriptors. I needed to create both, and at first did not know whether I needed one GDT or two when switching from Protected mode to Long mode. It turns out that one is enough: you can use the same GDT and populate it with both your legacy and your new descriptors. This also required mirroring the legacy descriptor tables while creating my own. The test was then simply to load the descriptor table and set the selectors. Setting and resetting the selectors causes the hidden portions of the segment registers to be flushed and reloaded from the new GDT. The descriptor table is loaded using the LGDT assembly instruction and stored into a memory location using the SGDT instruction.
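To make the single-GDT idea concrete, here is a minimal sketch that packs a legacy code descriptor, a data descriptor, and a Long mode code descriptor into one table. The bit layout is the standard 8-byte segment descriptor format; the constant names, the flat base/limit values, and the ordering of the entries are assumptions for illustration, not the layout the driver uses.

```python
import struct

# Access byte (descriptor bits 47-40): present, ring 0, code/data segment.
ACCESS_CODE = 0x9A   # executable, readable
ACCESS_DATA = 0x92   # writable data

# Flags nibble (descriptor bits 55-52).
FLAG_GRANULARITY = 0x8   # limit counted in 4-KB units
FLAG_DEFAULT_32  = 0x4   # D/B bit: 32-bit default operand size
FLAG_LONG        = 0x2   # L bit: 64-bit code segment


def make_descriptor(base, limit, access, flags):
    """Pack one 8-byte segment descriptor in little-endian form."""
    value  = limit & 0xFFFF                   # limit bits 15-0
    value |= (base & 0xFFFFFF) << 16          # base bits 23-0
    value |= (access & 0xFF) << 40            # access byte
    value |= ((limit >> 16) & 0xF) << 48      # limit bits 19-16
    value |= (flags & 0xF) << 52              # G, D/B, L, AVL
    value |= ((base >> 24) & 0xFF) << 56      # base bits 31-24
    return struct.pack("<Q", value)


# One table holding both kinds of descriptor (selectors 08h, 10h, 18h).
gdt = b"".join([
    struct.pack("<Q", 0),                                                        # null descriptor
    make_descriptor(0, 0xFFFFF, ACCESS_CODE, FLAG_GRANULARITY | FLAG_DEFAULT_32),  # legacy code
    make_descriptor(0, 0xFFFFF, ACCESS_DATA, FLAG_GRANULARITY | FLAG_DEFAULT_32),  # data
    make_descriptor(0, 0xFFFFF, ACCESS_CODE, FLAG_GRANULARITY | FLAG_LONG),        # Long mode code
])
```

The only difference between the legacy and the Long mode code descriptor here is which of the D/B and L bits is set, which is what lets both live side by side in the same table.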
Implement Page Tables. There are two implementations of paging available in Protected mode: legacy paging and the PAE mechanism. The code that is implemented for paging should mirror the method being used by the operating system. The operating system I tested on uses PAE, so I was not able to test my legacy paging implementation. The testing of the Page Tables was simply to swap out the operating system’s CR3 value and swap in my own. To validate the test, however, you need to clear the Page Global Enable (PGE) bit of CR4. The processor lets the OS mark pages as Global in the Page Tables, a hint that their cached translations should be kept when CR3 changes. If the code you are executing is on a Global page when you swap CR3, then it is likely that you would not crash even if your Page Tables were wrong. The solution is to clear this bit in CR4, thus disabling the global page caching before you provide your new pages. You can also validate that different Page Tables are working by writing into the memory locations and later validating them with the debugger.
Disable Paging. Paging needs to be disabled before you can enter 64-bit mode, and this requires the creation of identity-mapped pages. Identity-mapped pages are pages whose virtual address is equal to their physical address. This allows paging to be enabled or disabled without the current instruction being transported to a random location in memory. The test is simply to load our own Page Tables, because we have more control over the mapping, and then disable paging altogether. This helps validate that our paging is working correctly; however, it may require some tricks to implement. Paging is disabled or enabled via the high bit of CR0. In Windows, driver memory is in a very high address space, so you either need to jump to another address in the code or map your descriptor’s base address with a negative number.
Set Up a Stack. I decided that it would be best to have my own stack while I am setting up and jumping into 64-bit mode. The stack then needs to be properly mapped and can easily be tested by pushing values onto it, restoring the old stack and the operating system, and finally verifying the memory locations in the debugger. I usually attempt to write 0DEADBEEFh to a memory location and validate later that it was tagged.
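The following toy model shows why the identity mapping matters. Nested dictionaries stand in for the PAE Page Directory Pointers, Page Directories, and Page Tables; the class, its methods, and both addresses are hypothetical and exist only to illustrate the idea, not to mirror the driver’s data structures.

```python
PAGE_SIZE = 0x1000
PRESENT, WRITABLE = 0x1, 0x2


class PaeTables:
    """Toy PAE tables: nested dicts stand in for the PDPT, PDs, and PTs."""

    def __init__(self):
        self.pdpt = {}

    def map_page(self, virt, phys):
        """Map the 4-KB page containing virt to the page containing phys."""
        pd = self.pdpt.setdefault((virt >> 30) & 0x3, {})
        pt = pd.setdefault((virt >> 21) & 0x1FF, {})
        pt[(virt >> 12) & 0x1FF] = (phys & ~0xFFF) | PRESENT | WRITABLE

    def identity_map(self, start, length):
        """Map every page touched by [start, start + length) onto itself."""
        first = start & ~0xFFF
        last = (start + length - 1) & ~0xFFF
        for page in range(first, last + PAGE_SIZE, PAGE_SIZE):
            self.map_page(page, page)

    def translate(self, virt):
        """Walk the tables; None stands in for a page fault."""
        try:
            entry = self.pdpt[(virt >> 30) & 0x3][(virt >> 21) & 0x1FF][(virt >> 12) & 0x1FF]
        except KeyError:
            return None
        return (entry & ~0xFFF) | (virt & 0xFFF)


tables = PaeTables()
tables.map_page(0xF8A01000, 0x00345000)   # hypothetical high driver address
tables.identity_map(0x00345000, 0x2000)   # the same physical pages, mapped 1:1
```

With these tables, `tables.translate(0xF8A01234)` and `tables.translate(0x00345234)` both yield physical address 345234h. Code running at the high address would suddenly be fetching from F8A01234h as a physical address once paging is turned off, whereas code that has first jumped to the identity-mapped alias keeps executing from the same bytes.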
Switch to 64 Bit. This is the exciting part, but before you even validate that the 64-bit code itself is working, you want to test Compatibility mode, which is the mode that the processor starts out in when you enter Long mode. The switch to Long mode is done through the processor’s Extended Feature Enable Register (EFER). I took one shortcut when performing the setup for Long mode, which can be seen in the code: I simply reused all the existing Page Tables. Again, the 64-bit implementation is simply an extension of 32-bit PAE. There is one change that I needed to make: create my own copy of the Page Directory Pointer Table. In Legacy mode, bit 1 of each entry is reserved and must be 0; in 64-bit mode, however, it is the read/write bit and can be 1. These are not compatible implementations, and it only takes one bit to reboot the machine.
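The order of operations for the switch can be sketched with a toy register model. The bit positions (CR0.PG, CR4.PAE, CR4.PGE, EFER.LME, EFER.LMA) are the architectural ones, but the `ToyCpu` class, the `enter_long_mode` helper, and the register values are invented for illustration; the real driver does this in assembly, and its exact sequence may differ.

```python
CR0_PG   = 1 << 31   # paging enable
CR4_PAE  = 1 << 5    # Physical Address Extensions
CR4_PGE  = 1 << 7    # global pages
EFER_LME = 1 << 8    # Long mode enable (requested by software)
EFER_LMA = 1 << 10   # Long mode active (set by the processor)


class ToyCpu:
    """Models only the checks relevant to activating Long mode."""

    def __init__(self, cr0, cr3, cr4, efer=0):
        self.cr0, self.cr3, self.cr4, self.efer = cr0, cr3, cr4, efer

    def write_efer(self, value):
        if ((value ^ self.efer) & EFER_LME) and (self.cr0 & CR0_PG):
            raise RuntimeError("fault: EFER.LME changed while paging is enabled")
        self.efer = (value & ~EFER_LMA) | (self.efer & EFER_LMA)

    def write_cr0(self, value):
        enabling = bool(value & CR0_PG) and not (self.cr0 & CR0_PG)
        if enabling and (self.efer & EFER_LME):
            if not (self.cr4 & CR4_PAE):
                raise RuntimeError("fault: Long mode requires CR4.PAE")
            self.efer |= EFER_LMA          # processor activates Long mode
        if not (value & CR0_PG):
            self.efer &= ~EFER_LMA         # no paging, no Long mode
        self.cr0 = value


def enter_long_mode(cpu, pml4_phys):
    """Return the saved state so the caller can restore the OS afterward."""
    saved = (cpu.cr0, cpu.cr3, cpu.cr4, cpu.efer)
    cpu.write_cr0(cpu.cr0 & ~CR0_PG)                 # 1. paging off (identity-mapped code)
    cpu.cr4 = (cpu.cr4 | CR4_PAE) & ~CR4_PGE         # 2. PAE on, global pages off
    cpu.cr3 = pml4_phys                              # 3. point CR3 at the PML4
    cpu.write_efer(cpu.efer | EFER_LME)              # 4. request Long mode
    cpu.write_cr0(cpu.cr0 | CR0_PG)                  # 5. paging on: Long mode active
    return saved


cpu = ToyCpu(cr0=0x8001003B, cr3=0x00039000, cr4=0x000006F9)   # made-up OS state
saved_state = enter_long_mode(cpu, pml4_phys=0x00200000)
```

After step 5 the processor is in Compatibility mode; a far jump through a Long mode code descriptor is what finally starts executing 64-bit instructions.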
Execute 64-Bit Code. I have provided a few sample 64-bit applications (available electronically) that you can execute. To run these, type mutinyapp <file.bin>. The current driver does not support systems that are not using Physical Address Extensions, so you must ensure that you have booted into this mode. This can be done by editing your BOOT.INI to use the /PAE switch.
The sample code uses 0D8000000h, which is where my card places its linear frame buffer for its video display. The driver maps this and the standard VGA video location for application use. The standard VGA location is 0A0000h, and this mirrors the first 64K of 0D8000000h but only for the visible area of the screen. There is a “stride” (also known as “pitch”) involved with my card, and anyone who has used DirectDraw understands what this means. Essentially, the video memory is not linear per line: each line in memory is longer than the visible line of the video mode. This means that when you reach the end of a visible line, you need to use the stride, rather than the visible width, to get to the next visible scan line. I found that on my card, the total line length matches the maximum resolution of the display card. The video location of your card can be found in the resources listed for your display driver in Device Manager. You can use this to figure out where to map the memory for your specific card, and you could change the driver to use it.
To build applications, you simply need to ensure that any memory you need mapped is available. If you need to increase the pool of allocated memory, then do so in the driver. You will also need to use the RMDHR application to remove any executable headers, as this driver executes raw 64-bit assembly code. You can see examples of how to build applications in the BUILD.BAT file. This concept actually has a real-world use: virtualization. It could be applied to allow execution and virtualization of a 64-bit operating system running on a 32-bit operating system!
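The stride arithmetic for the frame buffer can be written out as follows. The mode and pitch below (a 640-pixel-wide, 32-bit mode on a card whose lines are 1600 pixels long in memory) are made-up numbers for illustration; only the frame buffer base 0D8000000h comes from the text above, and it will differ on other cards.

```python
FRAME_BUFFER = 0xD8000000   # linear frame buffer base on the author's card


def pixel_address(x, y, pitch_bytes, bytes_per_pixel, base=FRAME_BUFFER):
    """Address of pixel (x, y) when each line occupies pitch_bytes in memory."""
    return base + y * pitch_bytes + x * bytes_per_pixel


# Hypothetical mode: 640 visible pixels per line, 4 bytes per pixel,
# but each line in video memory is 1600 pixels (6400 bytes) long.
VISIBLE_WIDTH, BYTES_PER_PIXEL, PITCH = 640, 4, 1600 * 4

start_of_line_1 = pixel_address(0, 1, PITCH, BYTES_PER_PIXEL)
padding_per_line = PITCH - VISIBLE_WIDTH * BYTES_PER_PIXEL
```

In this example the second scan line starts 6400 bytes after the first, not 2560, and the 3840 bytes in between are never displayed. Code that steps by the visible width instead of the pitch produces a sheared image.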
Enhancements
There are plenty of enhancements you could make to this proof-of-concept implementation, starting with a 64-bit Interrupt Descriptor Table. This would be a rich enhancement in terms of keyboard support. It would also let you implement your own debugger, which would be helpful in finding/fixing problems and adding stability to the system.
There is a point in the code that I call “No Man’s Land.” What happens in No Man’s Land is that your computer reboots when an error occurs. In the current implementation, the entire thunking and execution of the program is in No Man’s Land. The operating-system kernel debugger can still be used until you start thunking the operating system. The implementation of your own debugger would limit the scope of No Man’s Land, allowing problems to be found and debugged. The goal would be to only have a small amount of code that would be in this gray area of execution.
The prototype definitely needs more testing and enhancements. This calls for testing on other systems, and perhaps even implementing non-PAE support. The code that currently switches to 64-bit mode sits in memory laid out by the operating system, and I make no attempt to handle the case where it straddles a page boundary. The identity pages need to be physically contiguous in order to prevent faults, so the ideal implementation would allocate (or at least attempt to allocate) suitable memory whenever the identity-mapped code crosses a page boundary onto a physically noncontiguous page. The driver should also know when it cannot switch to 64-bit mode. Multiprocessor support could also be added to stop and utilize both processors in 64-bit mode.
Enhancements could range as far as implementing your own virtual machine for 64-bit operating systems. The engine could also just be used to write and test your 256-byte 64-bit demo effects.
Written October 2005