The Low Level M3ss: DOS Multicore Mode Interface

8 Jan 2019, CPOL, 8 min read
An all-in-one article about raw CPU technologies: access multiple cores and protected or long mode from DOS, while still having access to DOS interrupts.

Get Ready

Anyone who has already read my "infamous trilogy" would want to combine all the stuff in one nice application. Here is such a combination, along with some new tips/techniques not discussed in the previous articles. It is implemented as a TSR which other apps can call for true multithreading in real, protected or long mode in raw DOS.

Using this code, you can create a DOS app that can:

  • Use all your CPUs together
  • Lock/Unlock mutexes
  • Start threads in real, protected, long and virtualized mode

You need flat assembler and a FreeDOS installation in some virtualization environment that can expose multiple cores. VMware works for everything except the virtualized mode. DOSBox doesn't work, because it doesn't expose ACPI. Bochs works in its special SMP edition for real, protected and long mode, with virtualization. VirtualBox support is not yet complete. My GitHub project includes all these setups for your convenience.

Background

  • 1024 assembly books
  • 4.023 x 10^23 C++ lines written
  • 1 << 62 free space in your mind. The upper bits are reserved for the kernel.
  • Lots of patience and humor :)

Locking the Mutex

Yes, in Win32 you have the nice mutex functions. But what about raw DOS?

First, a word about spin loops. When a Win32 thread calls WaitForSingleObject, the kernel checks whether the object is signaled and, if not, it does not schedule the thread for resuming. If there is no thread to be scheduled, the kernel halts the CPU core with the HLT instruction until later. In our little program, we own the system and there is no scheduler. So the code will simply spin in a loop until the mutex is available.

Therefore, one would expect code like this:

ASM
; BL is the index of this CPU
.Loop1:
CMP [shared_var],0xFF ; shared_var would be 0xFF if mutex is released
JZ .MutexIsFree
JMP .Loop1

.MutexIsFree:
MOV [shared_var],BL ; Lock it

Not so. The problem is that, when the mutex is released, another CPU might lock the variable before this code does. That is, something might be executed on another CPU after our JZ instruction but before our MOV instruction, and then both CPUs would believe they own the mutex.

Therefore, we have to use some atomic operation to achieve the lock:

ASM
; BL is the index of this CPU

CMP [shared_val],BL ; Perhaps it is locked to us anyway
JZ .OutLoop2
.Loop1:
CMP [shared_val],0xFF ; Free
JZ .OutLoop1 ; Yes
pause ; equal to rep nop.
JMP .Loop1 ; Else, retry

.OutLoop1:

; Lock is free, grab it
MOV AL,0xFF
LOCK CMPXCHG [shared_val],BL
JNZ .Loop1 ; Write failed

.OutLoop2: ; Lock Acquired

The magic here is simple. We use the CMPXCHG instruction which, along with the LOCK prefix, atomically tests whether shared_val is still 0xFF (the value in AL) and, if so, writes BL to it and sets the ZF. If another CPU has grabbed the mutex in the meantime, the ZF is cleared, BL is not written to shared_val, and AL receives the current value instead. Most convenient.

The other interesting thing is the pause opcode, a hint to the CPU that we are inside a spin loop. This improves performance: the CPU avoids the pipeline flush that a tight loop of speculative reads would cause when the lock is finally released, it uses less power while waiting, and it leaves more execution resources to a sibling hyper-thread.
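
Releasing the mutex is the easy half. Here is a minimal sketch, assuming the same convention as above (shared_val is a byte, 0xFF means free, BL holds the index of this CPU). The label UnlockMutex is made up for the illustration and is not taken from the driver:

```asm
; Sketch only: release the spin lock taken above.
; Assumes shared_val is a byte, 0xFF = free, BL = index of this CPU.
UnlockMutex:
    CMP [shared_val],BL         ; do we actually own it?
    JNZ .NotOwner               ; no, leave it alone
    MOV byte [shared_val],0xFF  ; yes, mark it free again
.NotOwner:
    RET
```

No LOCK prefix is needed here. Only the owner ever writes the "free" value, and an aligned byte store on x86 is atomic and is not reordered before the earlier stores of the same CPU.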

Waking the CPUs

As we saw in the trilogy, we send the INIT and the SIPI. The CPU must start at a 4096-byte-aligned address (the SIPI carries a page number, not a full address), so I've filled an array with NOPs and adjusted the startup address accordingly. The CPU starts in real mode.

Therefore, a "SipiStart" routine would look like this:

ASM
SipiStart:
    
db 4096 dup (144) ; fill with NOPs (0x90)

CLI
mov di,DATA16
mov ds,di
lidt fword [ds:RealIDT]; Load real mode interrupts in case they are not loaded
STI

call FAR CODE16:EnterUnreal; Far call because CS is not CODE16 at this point

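; Enable APIC
MOV EDI,[DS:LocalApic]
ADD EDI,0x0F0 ; Spurious Interrupt Vector register
MOV EDX,[FS:EDI]; unreal mode, FS:EDI works.
OR EDX,0x1FF ; bit 8 = APIC software enable, bits 0-7 = spurious vector 0xFF
MOV [FS:EDI],EDX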

mov di,StartSipiAddrOfs ; a dd that contains pre-configured jump to the actual routine for this CPU
jmp far [ds:di]

To access the APIC, I have to enter unreal mode, so I call EnterUnreal. Note the FAR call: the segment in which EnterUnreal lives is not the same as the CS that is loaded during the SIPI. A newly awoken CPU must also set the spurious vector and software-enable its APIC, as we have seen earlier. Finally, the code jumps far to the 'startup' address for the CPU, depending on the CPU index.
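
For completeness, the sending side can be sketched as below. This is an illustration, not the driver's code. It assumes we are already in unreal mode with FS at base 0 and DS = DATA16, like SipiStart above. SipiStartLinear (a numeric constant: the page-aligned linear address of the startup code, below 1 MB) and DelayMs (busy-waits CX milliseconds and preserves all registers) are hypothetical names. The ICR offsets (0x300 low, 0x310 high) and the command values come from the Intel manual.

```asm
; Sketch only: wake the CPU whose APIC ID is in BL.
WakeCpu:
    MOV EDI,[DS:LocalApic]
    MOVZX EAX,BL
    SHL EAX,24                           ; destination APIC ID -> bits 24..31
    MOV [FS:EDI+0x310],EAX               ; ICR high: destination
    MOV dword [FS:EDI+0x300],0x00004500  ; ICR low: INIT, level assert
    MOV CX,10
    CALL DelayMs                         ; hypothetical helper, about 10 ms
    MOV [FS:EDI+0x310],EAX               ; destination again
    MOV EAX,SipiStartLinear shr 12       ; SIPI vector = page number of the entry point
    OR EAX,0x00004600                    ; delivery mode = Start-Up
    MOV [FS:EDI+0x300],EAX               ; writing ICR low sends the SIPI
    RET
```

The usual recipe sends a second SIPI after a short delay in case the first one was missed. It is omitted here for brevity.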

Interprocessor Interrupts

The APIC provides us a way to send a message to another CPU. Apart from INIT and SIPI, which we saw earlier, the local APIC can be used to send a 'normal' interrupt, i.e., merely executing INT XX in the context of the target CPU. We have to take into consideration the following:

  • If the CPU is in HLT state, the interrupt awakes it, and when the interrupt returns, the CPU resumes with the instruction after the HLT opcode. If interrupts are also disabled (CLI), then we must send an NMI (a delivery mode in the APIC Interrupt Command Register) to wake the CPU.
  • If the CPU is in HLT state and we send again an INIT and a SIPI, the CPU starts all over again from real mode.
  • The interrupt must exist in the target processor. For example, in protected mode, the interrupt must have been defined in IDT.
  • Every CPU has its own Local APIC, but they all appear at the same memory address, and the area used to pass data to the target CPU is shared. Therefore, we must take a mutex before we issue the interrupt.
  • Because the registers cannot be passed from CPU to CPU, we have to write all the registers (that will be used for the interrupt, if any) in a separate memory area.
  • The interrupt might fail. I don't know why, but that's what they say. So, you have to rely on some inter-CPU communication (via shared memory and mutexes) to verify the delivery. I'm doing that in my code with a simple flag.
  • Finally, the handler of the interrupt must tell its own Local APIC that there is an "End of Interrupt". Remember out 020h,al in the past? Now we write to the EOI register (LocalApic + 0xB0) the value 0.
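
The points above can be sketched as follows. This is a minimal illustration under the same assumptions as SipiStart (unreal mode with FS at base 0, LocalApic stored in DATA16). The labels SendIpi and IpiHandler are invented for the example, and the mutex and the delivery flag mentioned above are left out.

```asm
; Sketch only: raise interrupt vector AL on the CPU whose APIC ID is in BL.
; Expects DS = DATA16.
SendIpi:
    MOV EDI,[DS:LocalApic]
.WaitIdle:
    TEST dword [FS:EDI+0x300],0x1000  ; delivery status (bit 12): previous IPI still pending?
    JZ .Idle
    PAUSE
    JMP .WaitIdle
.Idle:
    MOVZX EDX,BL
    SHL EDX,24
    MOV [FS:EDI+0x310],EDX            ; ICR high: destination APIC ID
    MOVZX EDX,AL                      ; fixed delivery mode (000) + vector
    OR EDX,0x4000                     ; level = assert (OR in 0x400 as well for an NMI)
    MOV [FS:EDI+0x300],EDX            ; writing ICR low sends the IPI
    RET

; Sketch only: skeleton of the handler on the target CPU (real mode).
IpiHandler:
    PUSH DS
    PUSH EDI
    MOV DI,DATA16
    MOV DS,DI
    ; ... the real work goes here ...
    MOV EDI,[DS:LocalApic]
    MOV dword [FS:EDI+0xB0],0         ; End Of Interrupt
    POP EDI
    POP DS
    IRET
```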

CPU Real Mode

If the CPU will be running in real mode, you may want to call DOS. It will work, provided that no other CPU calls DOS at the same time, which of course cannot be assumed in our simple app. Therefore, you have to use int 0xF0 function 5 to manage mutexes. The thread starts automatically in unreal mode, with its stack and FS already set up, and terminates with retf. If you call DOS through interrupt 0xF0 function 4, then locking is provided automatically.

This is the real mode thread in dmmic.asm:

ASM
rt1:

sti
push cs
pop ds
mov dx,m1
mov ax,0x0900
int 0x21

; unlock mut
push cs
pop es
mov di,mut1
mov ax,0x0503
int 0xF0

retf

CPU Protected Mode

This thread runs in 32-bit full 4GB protected mode. GS is pointing to base-0 32-bit data. It uses int 0xF0 to call DOS, then exits:

ASM
; ---- Protected Mode Thread
SEGMENT T32 USE32 
rt2:

; Int 0xF0 works also in protected mode
mov ax,0
int 0xF0

; DOS call
mov ax,0x0421 ; al = interrupt number
mov bp,0x0900 ; bp = new AX when DOS will be called
xor esi,esi
mov si,MAIN16 ; upper ESI = new DS
shl esi,16
mov dx,m2
int 0xF0

; Unlock mutex
mov ax,0x0503
linear edi,mut1,MAIN16
int 0xF0

retf
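
The linear macro used above is not shown in this article. Presumably it loads a register with the linear address of a real mode seg:ofs pair, that is, segment * 16 + offset. Under that assumption (my guess at the expansion, not the author's actual macro), a hand-written equivalent of "linear edi,mut1,MAIN16" would be:

```asm
; Sketch only: what "linear edi,mut1,MAIN16" is assumed to compute.
    XOR EDI,EDI
    MOV DI,MAIN16   ; real mode segment value
    SHL EDI,4       ; segment * 16
    ADD EDI,mut1    ; + offset of mut1 inside that segment
```

This is why the protected and long mode variants of the mutex functions take a single linear address in EDI/RDI instead of ES:DI.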

CPU Long Mode

As I said in the trilogy, long mode can be entered directly from real mode, because the instructions RDMSR and WRMSR are available there. This is also implemented in two pieces. One to prepare the long mode by:

  • Loading the GDT.
  • Preparing a see-through page table for the first 1GB and mapping the Local APIC to a fixed position (1GB - 2MB), because the Local APIC is usually located at 0xFEE00000, which means it won't be visible in our 1GB see-through. Alternatively, preparing a 4GB page table with 1GB pages, if your system supports 1GB pages. Most do.
  • Enabling PAE, PSE, and long mode.

And one to enter long mode by enabling paging, enabling interrupts with int 0xf0 accessible, then jumping to the code. Remember long mode is flat 64 bit and CS,DS,ES,SS have no meaning. Or so they say, I still had to set the SS to page64_idx in Bochs. Perhaps a Bochs bug?
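
Put together, the two pieces can be sketched like this. It is an illustration under assumptions, not the driver's code. Gdt64Ptr (a GDT pseudo-descriptor), Pml4Linear (the 4096-aligned linear address of a ready PML4 table), Code64Sel (a 64-bit code selector) and Entry64Linear (the linear address of the 64-bit entry point) are hypothetical names. The bit positions are the ones documented by Intel and AMD.

```asm
; Sketch only: switch from real mode straight into long mode.
    CLI                      ; no interrupts until a long mode IDT is loaded
    LGDT fword [DS:Gdt64Ptr]
    MOV EAX,CR4
    OR EAX,0x30              ; PSE (bit 4) + PAE (bit 5)
    MOV CR4,EAX
    MOV EAX,Pml4Linear
    MOV CR3,EAX              ; page tables must be in place before PG is set
    MOV ECX,0xC0000080       ; EFER MSR
    RDMSR
    OR EAX,0x100             ; LME (bit 8)
    WRMSR
    MOV EAX,CR0
    OR EAX,0x80000001        ; PG (bit 31) + PE (bit 0): long mode becomes active
    MOV CR0,EAX
    db 0x66,0xEA             ; far JMP ptr16:32, hand-encoded for 16-bit code
    dd Entry64Linear
    dw Code64Sel             ; loading CS with a 64-bit code segment enters 64-bit mode
```

The far jump is written as raw bytes only to keep the encoding unambiguous in a 16-bit segment.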

ASM
; ---- Long Mode Thread
SEGMENT T64 USE64 
rt3:

nop
nop
nop
nop
nop

; Int 0xF0 works also in long mode
mov ax,0
int 0xF0

; DOS call
mov rax,0x0421
mov rbp,0x0900
xor rsi,rsi
mov si,MAIN16
shl rsi,16
mov rdx,m2
;int 0xF0; Whoops, DOS still buggy here

; Unlock mutex
mov ax,0x0503
linear rdi,mut1,MAIN16
int 0xF0

ret

CPU Virtualized Protected Mode

This thread runs in 32-bit full 4GB virtualized protected mode. It can still call DOS. This mode is very useful since, whatever your thread might do, it can never crash the entire PC, only exit with a VMEXIT procedure.

ASM
v1:

; Int 0xF0 works also in protected mode
mov ax,0
int 0xF0

; DOS call
mov ax,0x0421 ; al = interrupt number
mov bp,0x0900 ; bp = new AX when DOS will be called
xor esi,esi
mov si,MAIN16 ; upper ESI = new DS
shl esi,16
mov dx,m2
int 0xF0

; Unlock mutex
mov ax,0x0503
linear edi,mut1,MAIN16
int 0xF0

retf; or even VMCALL

The DMMI

I've called it DOS Multicore Mode Interface. It is a driver which helps you develop 32 and 64 bit applications for DOS, using int 0xF0. This interrupt is accessible from real, protected and long mode. Put the function number in AH.

To check for existence, check the vector for INT 0xF0. It should not be pointing to 0 or to an IRET and, with the vector in ES:BX, ES:BX+2 should point to a dword 'dmmi'.
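
A minimal real mode sketch of that check follows. DetectDmmi is an invented label. The code assumes that the dword at ES:BX+2 holds the four characters 'd','m','m','i' in that order in memory, which is how flat assembler stores the constant 'dmmi'. It then confirms with function AH = 0, described below.

```asm
; Sketch only: returns CF clear and DL = total CPUs if the driver is present,
; CF set otherwise.
DetectDmmi:
    MOV AX,0x35F0               ; DOS: get interrupt vector 0xF0 -> ES:BX
    INT 0x21
    MOV AX,ES
    OR AX,BX
    JZ .Missing                 ; vector is 0:0
    CMP byte [ES:BX],0xCF       ; handler is a bare IRET
    JE .Missing
    CMP dword [ES:BX+2],'dmmi'  ; signature two bytes into the handler (assumed layout)
    JNE .Missing
    XOR AX,AX                   ; AH = 0: verify existence
    INT 0xF0
    CMP AX,0xFACE
    JNE .Missing
    CLC
    RET
.Missing:
    STC
    RET
```

Checking the vector first matters: calling INT 0xF0 blindly on a machine without the driver would jump to whatever the vector happens to contain.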

Int 0xF0 provides the following functions to all modes (real, protected, long):

  • AH = 0, verify existence. Return values, AX = 0xFACE if the driver exists, DL = total CPUs. This function is accessible from real, protected and long mode.
  • AH = 1, begin thread. BL is the CPU index (1 to max-1). The function creates a thread, depending on AL:
    • 0, begin (un)real mode thread. ES:DX = new thread seg:ofs. The thread is run with FS capable of unreal mode addressing, must use RETF to return.
    • 1, begin 32 bit protected mode thread. EDX is the linear address of the thread. The thread must return with RETF.
    • 2, begin 64 bit long mode thread. EDX holds the linear address of the code to start in 64-bit long mode. The thread must terminate with RET.
    • 3, begin virtualized thread. BH contains the virtualization mode (currently only mode 2 = protected mode virtualization is supported), and EDX the virtualized linear stack. The thread must return with RETF or VMCALL.
  • AH = 5, mutex functions.
    • AL = 0 => initialize mutex to ES:DI (real) , EDI linear (protected), RDI linear (long).
    • AL = 1 => Lock mutex
    • AL = 2 => Unlock mutex
    • AL = 3 => Wait for mutex
  • AH = 4, execute real mode interrupt. AL is the interrupt number, BP holds the AX value and BX,CX,DX,SI,DI are passed to the interrupt. DS and ES are loaded from the high 16 bits of ESI and EDI.
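
As a usage illustration of the list above, this is how a real mode caller might start the rt1 thread from dmmic.asm on CPU 1. It is a sketch that only uses the register conventions listed here, and it assumes that rt1 lives in the caller's code segment.

```asm
; Sketch only: begin an (un)real mode thread on CPU index 1.
    PUSH CS
    POP ES          ; ES:DX = seg:ofs of the thread
    MOV DX,rt1
    MOV BL,1        ; CPU index (1 to max-1)
    MOV AX,0x0100   ; AH = 1 begin thread, AL = 0 (un)real mode
    INT 0xF0
```

For a protected or long mode thread, AL becomes 1 or 2 and EDX takes the linear address of the code instead of ES:DX.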

If you have more than one CPU, your DOS game can now directly access all 2^64 bytes of memory and all your CPUs, while still being able to call DOS directly. Isn't that fun?

INT 0x21 Redirection

In order to avoid calling int 0xF0 directly from assembly and to make the driver compatible with higher level languages, an INT 0x21 redirection handler is installed. If you call INT 0x21 from the main thread, INT 0x21 is executed directly. If you call INT 0x21 from protected or long mode thread, then INT 0xF0 function AX = 0x0421 is executed automatically.

So with a bit of luck, you can use your favorite stdio functions from a C function in another thread directly!

The Code

Once you run entry.exe with /r, the library installs as a TSR and int 0xf0 is available. DMMIC.asm shows example calls. 

ToDo

  • Add more virtualization modes

History

  • 08-1-2018:  Added virtualization capabilities
  • 07-1-2018:  Fixed Long mode int 0xF0 call
  • 06-1-2018: Updated DMMI to my new github project
  • 22-5-2015: Thanks to Brendan for the synchronization tip
  • 18-5-2015: Fixed multiple call bug with End of Interrupt write
  • 17-5-2015: First release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)