These articles are written by Codalogic empowerees as a way of sharing knowledge with the programming community. They do not necessarily reflect the opinions of Codalogic.

Exploring PowerPC's "read-modify-write" Operations

By: Pete, December 2022

In multi-threaded systems, the ability of a thread to read a value from memory, modify it and write the result back, without another thread being able to corrupt the operation part-way through, is very important. Such operations are the basis for mutexes, locks and indexes into message queues, to name a few.
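To make the problem concrete, here is a minimal sketch that replays, in a single thread, one unlucky interleaving of two threads that each perform an unprotected "read, add 1, write". The function name lost_update_demo and the variable names are made up for illustration; this is a deterministic replay of the interleaving, not real concurrent code.

```cpp
#include <cassert>

// Hypothetical helper: replays one bad interleaving of two threads
// (A and B) that each try to increment the same shared counter.
int lost_update_demo(int start)
{
    int shared = start;

    int seen_by_a = shared;      // thread A reads the counter
    int seen_by_b = shared;      // thread B reads it before A has written
    shared = seen_by_a + 1;      // thread A writes its result
    shared = seen_by_b + 1;      // thread B overwrites A's update

    assert(shared == start + 1); // two increments ran, but only one survived
    return shared;
}
```

Calling lost_update_demo(5) returns 6 rather than 7, which is the kind of corruption an atomic read-modify-write operation prevents.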

For example, if we look at the following C++ code (https://godbolt.org/z/9E9zf1735):

#include <atomic>

int a;
std::atomic<int> b;

int main()
{
    ++a;
    ++b;
}

The generated code with GCC using -O1 optimisation for x64 is as follows:

main:
        add     DWORD PTR a[rip], 1
        lock add        DWORD PTR b[rip], 1
        xor     eax, eax
        ret

Here you can see that the generated code uses the x64 lock instruction prefix to implement the atomic increment (the encoding begins f0 83 05 instead of just 83 05).

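This tells the processor to "lock" the address and data bus. Once the read has been performed, the processor does not relinquish the bus to other processors or bus masters until the write is complete. (That is the conceptual model at least; modern x64 processors usually achieve the same effect by locking the relevant cache line rather than the whole bus.)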

The PowerPC does not have such bus locking operations. (There is possibly also quite a penalty for implementing such operations with memory like DDR4/5, where the address transfers and data transfers are offset.)

Instead, PowerPC uses the lwarx (Load Word and Reserve Indexed) and stwcx. (Store Word Conditional Indexed) instructions.

The format for the lwarx instruction is lwarx rD,rA,rB and the pseudo-code for the instruction is:

if rA=0 then b=0
else b=rA
EA=b+rB
RESERVE=1
RESERVE_ADDR=func(EA)
rD=MEM(EA,4)

The format for the stwcx. instruction is stwcx. rS,rA,rB and the pseudo-code for the instruction is (slightly tweaked for clarity):

if rA=0 then b=0
else b=rA
EA=b+rB
if RESERVE then
   MEM(EA,4)=rS
   RESERVE=0
   CR[EQ]=1
else
   CR[EQ]=0

I found the description of these instructions quite limited in my MPC601 PowerPC 601 User's Manual and it took me a while to understand what was going on here (hence the reason for this blog post).

For example, there is no RESERVE or RESERVE_ADDR register in the programmer model. So what is going on?

As the pseudo-code indicates, when the lwarx instruction is executed an internal RESERVE bit is set, and the physical address that the MMU translates the logical address to (that is, the address that appears on the bus) is stored in RESERVE_ADDR. RESERVE and RESERVE_ADDR are deep inside the processor implementation and are not accessible to the programmer.

Once RESERVE and RESERVE_ADDR are set, the processor monitors the address bus for write operations matching the captured RESERVE_ADDR address. This is a snooping operation much the same as is performed to maintain cache coherency across multiple processors (in the PowerPC 601 case, using the MESI protocol).

If such a write is detected the RESERVE bit is cleared.

When the stwcx. instruction is executed, as the pseudo-code shows, the RESERVE bit is checked. If it is still set, the processor performs the store and sets the EQ flag in the condition register to 1. If the RESERVE bit has been cleared by a write operation since the lwarx instruction, the processor does not perform the store and sets the EQ flag in the condition register to 0.

Hence the code can know whether the store was successful or not by looking at the EQ flag in the condition register.
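The behaviour described so far can be captured in a small toy model. The sketch below is only an illustration of the pseudo-code, not a description of how the 601 is built: the ReservationModel type, its member names and the snooped_write function are all invented here, memory is a simple map, and, like the pseudo-code, stwcx does not compare addresses.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy single-processor model; every name here is hypothetical.
struct ReservationModel {
    std::map<std::uint32_t, std::uint32_t> memory; // toy word-addressed memory
    bool reserve = false;                          // the hidden RESERVE bit
    std::uint32_t reserve_addr = 0;                // the hidden RESERVE_ADDR

    // lwarx: load the word and record a reservation on its address.
    std::uint32_t lwarx(std::uint32_t ea) {
        reserve = true;
        reserve_addr = ea;
        return memory[ea];
    }

    // A write by another processor, as seen by the snooping logic.
    void snooped_write(std::uint32_t ea, std::uint32_t value) {
        memory[ea] = value;
        if (reserve && ea == reserve_addr) reserve = false;
    }

    // stwcx.: store only if the reservation survived. The return value
    // plays the role of the EQ flag in the condition register.
    bool stwcx(std::uint32_t ea, std::uint32_t value) {
        if (!reserve) return false;
        memory[ea] = value;
        reserve = false;
        return true;
    }
};

void reservation_demo()
{
    ReservationModel cpu;
    cpu.memory[0x1000] = 5;

    std::uint32_t v = cpu.lwarx(0x1000);
    assert(cpu.stwcx(0x1000, v + 1));    // undisturbed: the store succeeds
    assert(cpu.memory[0x1000] == 6);

    v = cpu.lwarx(0x1000);
    cpu.snooped_write(0x1000, 100);      // another processor writes first
    assert(!cpu.stwcx(0x1000, v + 1));   // reservation lost: store refused
    assert(cpu.memory[0x1000] == 100);   // the other processor's value stands
}
```

In the second half of reservation_demo the competing write clears the reservation, so the conditional store is refused and the stale value v + 1 never reaches memory.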

Typically, if the store operation was unsuccessful, the code would loop back to the lwarx instruction and try the operation again.

For example, to implement the atomic increment in PowerPC code you would do something like the following (where r1 contains the logical address of the memory location to be modified and r2 is used as the intermediate value):

loop: lwarx   r2,0,r1
      addi    r2,r2,1
      stwcx.  r2,0,r1
      bne     loop
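For comparison, the closest portable C++ analogue of this loop is probably a retry loop around std::atomic's compare_exchange_weak, whose permission to fail spuriously suits load-reserve/store-conditional machines. The sketch below is an approximation rather than an exact equivalent: a compare-exchange only checks that the value is unchanged, whereas stwcx. fails after any intervening write to the reserved location. The function name increment_with_retry is made up for illustration.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper mirroring the four-instruction loop above.
int increment_with_retry(std::atomic<int>& counter)
{
    int old_value = counter.load();   // like lwarx: read the current value
    int new_value;
    do {
        new_value = old_value + 1;    // like addi: compute the new value
        // Like stwcx. followed by bne: attempt the store and go round
        // again if it failed. On failure compare_exchange_weak refreshes
        // old_value with the value currently in memory.
    } while (!counter.compare_exchange_weak(old_value, new_value));
    return new_value;
}

void increment_demo()
{
    std::atomic<int> counter{41};
    assert(increment_with_retry(counter) == 42);
    assert(counter.load() == 42);
}
```

In practice you would simply write ++counter or counter.fetch_add(1) and let the compiler choose the instruction sequence; the explicit loop is only there to show the retry structure.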

This approach turns out to be quite flexible as essentially arbitrary code can be executed between the lwarx and stwcx. instructions, making it more powerful than the x64-style lock opcode prefixes.
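As a rough illustration of that flexibility in portable C++, the sketch below applies an arbitrary caller-supplied function between the load and the conditional store, using a compare_exchange_weak retry loop as a stand-in for the lwarx/stwcx. pair. The names atomic_update and saturating_increment_demo are hypothetical, and the compare-exchange is only an approximation of the reservation mechanism.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically replaces target with f(old value)
// and returns the old value. f may be called more than once if the
// update has to be retried, so it should have no side effects.
template <typename F>
int atomic_update(std::atomic<int>& target, F f)
{
    int old_value = target.load();
    while (!target.compare_exchange_weak(old_value, f(old_value))) {
        // old_value now holds the latest value; recompute and retry.
    }
    return old_value;
}

void saturating_increment_demo()
{
    // An operation with no single-instruction equivalent:
    // increment, but never go above 10.
    auto capped = [](int v) { return v < 10 ? v + 1 : v; };

    std::atomic<int> level{9};
    assert(atomic_update(level, capped) == 9);   // 9 becomes 10
    assert(atomic_update(level, capped) == 10);  // already at the cap
    assert(level.load() == 10);
}
```

The same helper could carry out a maximum, a bit-field update or any other pure function of the old value.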
