I’m troubleshooting an Intel network interface card (NIC) driver on Phar Lap that fails to send Ethernet packets. I tracked the problem down to the following code:
struct Registers
{
u32 DeviceCtrl;
..
};
#define DEVICE_RESET (1 << 26)
Registers* m_pReg;
..
m_pReg->DeviceCtrl |= DEVICE_RESET;
If the following code is used instead, the problem is gone.
m_pReg->DeviceCtrl = m_pReg->DeviceCtrl | DEVICE_RESET;
Frankly, I can’t tell any difference from the former code. It should always hold that...
a |= b
...equals:
a = a | b
No?
So I turn on the compiler flag that outputs assembly (MSVC 6.0, /FAs).
#1
m_pReg->DeviceCtrl = m_pReg->DeviceCtrl | DEVICE_RESET;
Assembly
mov eax, DWORD PTR [esi+180] ; Load m_pReg to eax
mov eax, DWORD PTR [eax] ; Load DeviceCtrl to eax
or eax, 67108864 ; Or eax with (1<<26)
mov ecx, DWORD PTR [esi+180] ; Load m_pReg to ecx
mov DWORD PTR [ecx], eax ; Store eax to update DeviceCtrl
This is straightforward.
#2
m_pReg->DeviceCtrl |= DEVICE_RESET;
Assembly
mov eax, DWORD PTR [esi+180]
pop ecx
or BYTE PTR [eax+3], 4
What?
This surprises me! Where do the magic numbers 3, 4 come from? Can you come up with that?
Wow~
It becomes clear later when I stare at the number 67108864. ORing a 32-bit value with 67108864 (which is 0x04000000) gives the same result as ORing just the highest byte with 0x04, because ORing the other three bytes with zeros is a no-op. So the compiler tries to improve I/O performance by writing a single byte on the data bus instead of four. Since the CPU I’m using (Intel i5-440) is little endian, the highest byte of this 32-bit register sits at offset 3.
Reasonable optimization, isn’t it?
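To convince myself of the arithmetic, here is a minimal sketch in C that models the single-byte OR on a little-endian register image and compares it with the full 32-bit OR. The helper name or_single_byte_le is made up for this illustration, and the code assumes a C99 compiler with <stdint.h> (which MSVC 6.0 does not ship).

```c
#include <assert.h>
#include <stdint.h>

#define DEVICE_RESET (1u << 26)   /* 0x04000000, i.e. 67108864 */

/*
 * Hypothetical helper: models "or BYTE PTR [reg+offset], mask" on the
 * little-endian image of a 32-bit value. Byte k of the image holds
 * bits 8k..8k+7 of the value.
 */
static uint32_t or_single_byte_le(uint32_t value, unsigned offset, uint8_t mask)
{
    uint8_t bytes[4];
    uint32_t result = 0;
    unsigned i;

    assert(offset < 4);

    /* Split the value into its little-endian bytes. */
    for (i = 0; i < 4; ++i)
        bytes[i] = (uint8_t)(value >> (8 * i));

    /* Touch only the one byte, as the optimized instruction does. */
    bytes[offset] |= mask;

    /* Reassemble the 32-bit value. */
    for (i = 0; i < 4; ++i)
        result |= (uint32_t)bytes[i] << (8 * i);

    return result;
}
```

For any 32-bit value v, or_single_byte_le(v, 3, 0x04) should equal v | DEVICE_RESET. That is why the compiler treats the two forms as interchangeable for ordinary memory.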
But why does this optimization cause the hardware problem? After writing the register, I read it back and get the old value. Obviously, the write is rejected by the hardware.
My first thought is that the address put on the address bus is incorrect, which causes the hardware not to accept the access. Usually, the NIC (I believe many other PCI/PCIe devices behave like this) checks the address bus and rejects or ignores the access if the address is not an expected one. In this case, the address/offset of DeviceCtrl is 0, and the next valid register starts at offset 4. Obviously, offset 3 (the highest byte) is not a valid address of a 32-bit register that the NIC recognizes.
Is that the reason why the access is rejected?
Wait a minute.. No way..
Actually, on a 32-bit x86 platform the address driven onto the bus is always aligned to 4 bytes. Even if the instruction targets a non-aligned address (e.g. 3), the CPU always generates a properly aligned address. In this case, it’s 0. But one thing is true: the CPU won’t drive the low 3 bytes on the data bus. The reason? Whatever it wrote for those bytes would be wrong, since it is only given the value of the highest byte.
Byte Enable
On i5-440, there are four Byte Enable signals corresponding to four bytes of the data bus. A Byte Enable is signalled to enable the transferring of a specific byte, effectively telling the memory that this part of the data bus is used. In our example, only the highest Byte Enable is on and only that byte is transferred.
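The relationship between the access address, the aligned bus address and the byte enables can be sketched as below. This is a simplified model rather than a description of the real bus protocol: the function names are made up, the mask is treated as active-high for readability, and accesses that cross a 4-byte boundary are not handled.

```c
#include <assert.h>
#include <stdint.h>

/* Address placed on the bus: the access address rounded down to 4 bytes. */
static uint32_t aligned_address(uint32_t addr)
{
    return addr & ~(uint32_t)3;
}

/*
 * Byte enable mask for an access of `size` bytes at `addr`.
 * Bit k set means byte lane k of the 32-bit data bus carries data.
 */
static unsigned byte_enable_mask(uint32_t addr, unsigned size)
{
    unsigned first_lane = (unsigned)(addr & 3u);

    /* This sketch only covers accesses inside one aligned dword. */
    assert(size >= 1 && first_lane + size <= 4);

    return ((1u << size) - 1u) << first_lane;
}

/*
 * Hypothetical model of a device that only accepts register writes
 * with all four byte enables set.
 */
static int nic_accepts_write(unsigned mask)
{
    return mask == 0xFu;
}
```

With this model, the optimized byte write at offset 3 gives aligned_address(3) == 0 and byte_enable_mask(3, 1) == 0x8, while the full 32-bit write gives byte_enable_mask(0, 4) == 0xF. Only the second one passes nic_accepts_write.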
Is it true that the NIC monitors Byte Enable signals and accepts the access only if all the four signals are enabled?
Yes! It’s proven by the data sheet of the NIC.
The device has limited support of read and write requests when only
part of the byte enable bits are set as described later in this
section. Partial writes to the MSI-X table are supported. All other
partial writes are ignored and silently dropped.
More About the Optimization
In MSVC 6.0, this optimization takes place even if all optimizations are disabled (/Od). But VC2012 doesn’t do it even with full optimization enabled (/Ox). Maybe Microsoft decided it was a premature optimization and dropped it.
Takeaways
When troubleshooting low-level drivers, try disabling all optimizations first. Alternatively, run the debug version (should always be the first choice, right?), since most (if not all) optimizations are automatically turned off in the debug version.
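Another common precaution, not part of the original driver, is to go through a volatile 32-bit pointer and spell out the read-modify-write, so every register access is written as one full-width load and one full-width store. A minimal sketch follows. The helper name reg_set_bits32 and the u32 typedef are assumptions for illustration, and whether a given compiler (MSVC 6.0 included) really keeps the access 32 bits wide should still be checked in the /FAs listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;             /* assumed to match the driver's u32 */

#define DEVICE_RESET (1u << 26)

/*
 * Hypothetical helper: set bits in a 32-bit device register using an
 * explicit full-width read followed by a full-width write.
 */
static void reg_set_bits32(volatile u32 *reg, u32 bits)
{
    u32 value;

    assert(reg != NULL);

    value = *reg;     /* one 32-bit read  */
    value |= bits;    /* modify in a local variable */
    *reg = value;     /* one 32-bit write */
}
```

In the driver, this would be called as reg_set_bits32(&m_pReg->DeviceCtrl, DEVICE_RESET), assuming DeviceCtrl is declared as (or cast to) a volatile u32.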