These articles are written by Codalogic empowerees as a way of sharing knowledge with the programming community. They do not necessarily reflect the opinions of Codalogic.

Comparing Register Renaming and Tomasulo's Algorithm

By: Pete, November 2022

To improve performance, modern processors allow multiple instructions to executed at the same time. With instructions that take different numbers of clock cycles to execute this means that instructions can complete execution in a different order to which they were started.

This is called "out-of-order" execution.

This completion in a different order to the assembly code instruction order means that without special care the temporal dependencies between register values in the initial assembly code may be lost, resulting in the code not behaving as desired.

These situations are called "hazards".

For example, take the following instructions sequence:

mul r0, r1, r2
add r3, r0, r3
sub r0, r4, r5

The first two instructions are a DSP-like multiply-accumulate set of instructions that might be used on a processor that doesn't have a dedicated MAC instruction (such as MADD on the Aarach64). r1 and r2 are being multiplied and the result accumulated into r3.

mul instructions typically take many more clock cycles to execute than adds and substracts. Therefore, if the mul instruction is issued and then the add instruction is issued, the value of r0 won't be ready in time for the add to use it. Hence without special handling the add instruction will accumulate the wrong value.

This is a "Read-after-Write" or "RAW" hazard.

Similarly, the sub instruction is writing to r0. We don't want the r0 value created by the mul instruction to overwrite this and we don't want the add instruction to use the r0 value created by the sub instruction. These are "Write-after-Write" and "Write-after-Read" hazards respectively.

For small processors (such as Arm LITTLE core processors) this can be handled by delaying the dispatch of dependent instructions. For example, the add instruction could be delayed until the result of the mul is in r0, thus enforcing in-order execution. However, this means stalling the processor pipeline and potentially wasting clock cycles.

One way to avoid this is to do "register renaming".

One scheme for this is to provision an additional set of registers that will be used as alternative destinations for instruction results. In some processors there may be up to 10 times more such registers than registers in the programmer's model. I'll call these "Rename Registers" and reference them in the modifed assembly code below as rnr1, rnr2 etc.

The register rename logic will conceptionally replace the contents of a main architectural or programmer model register with the identity of a rename register. Each time the assembly code has a main register written to, the rename logic will allocate a new rename register and store the identity of that rename register in the main register set.

The register renaming logic will then re-write the above assembly code to look something like:

mul rnr10, r1, r2
add rnr30, rnr10, r3
sub rnr40, r4, r5

Note that the rename registers are allocated to registers before their values are known. For this reason rename registers will contain extra house keeping information to record whether they contain valid data.

When the add instruction is dispatched it will be told to wait until the data in rename register rnr10 is valid before executing. When rnr10 is written to by the mul instruction an implementation typically broadcasts to all functional units that rename register rnr10 has been updated and any instructions waiting on that should go and get its value.

Similarly, the sub instruction has been modified to write to rename register rnr40 and no longer risks being read by the add instruction or being overwritten to by the mul instruction.

One issue that remains is deciding when a value in a rename register is no longer needed. We've gone to great efforts to decouple the lifetime of the rename register values from the original instruction order and now we need to come up with a scheme for working out when we no longer need the rename register values. This is tyically done by a re-order buffer or "ROB" that keeps track of the order in which the instructions were initially dispatched and retires them, writing back their data to the main architectural programmer model registers, in the order they appeared in the original instruction sequence.

An alternative approach to the overall problem of handling out-of-order execution is to use "Tomasulo's Algorithm".

Here, each set of execution units has "reservation stations" to which the dispatcher sends instructions to. If an instruction doesn't have the data it needs to execute immediately the reservation station is told what data it needs to wait for. However, unlike previously where the units are waiting for data to be written to a rename register, the reservation station is told to wait for the data from a particular execution unit.

When each execution unit finishes it operation it broadcasts to all reservation stations a message equivalent to: "I am execution unit XYZ, my data is ABC". If a reservation station is waiting on data from that execution unit, it will capture the data and, if all the data it needs is now available, start executing its functionality.

The main programmer model registers will also be told to wait on results of execution units and update their values when the data is broadcast. In fact, rather than storing just values, the main registers can also store the identity of an execution unit to wait on to get the value from and this can be copied to a reservation station in place of a specific value.

So, for example, if execution unit mul2 is scheduled to execute the first multiply, execution unit alu3 is scheduled to execute the add and execution alu0 the subtract, our original assembly code of:

mul r0, r1, r2
add r3, r0, r3
sub r0, r4, r5

might get interpreted as:

mul announce as mul2, r1, r2     ; set r0 = wait on mul2
add announce as alu3, r0 -> listen on mul2, r3 ; set r3 = wait on alu3
sub announce as alu0, r4, r5     ; set r0 = wait on alu0

Those instructions will be dispatched in-order even though they might complete out-of-order. The add looks at the status of r0 and sees that it has to wait for the announcement of execution unit mul2. Register r3 is set to capture the data associated with the announcement from alu3. When alu3 makes the announcement, r3 will capture the relevant data. At the end of the sequence, r0 will be listening for the completion of alu0 rather than the completion of mul2 and consequently the result of the mul will not overwrite the result of the sub.

As a result the values we want end up in r0 and r3 so the hazards are avoided.

In conclusion, rather than listening for changes to rename registers, Tomasulo's Algorithm moves the the broadcast functionality to the execution units themselves. This avoids the need for the separate set of rename registers and in so doing removes the need to work out when a rename register is no longer required, simplifying the implemenation.

Personally I think Tomasulo's Algorithm is one of the neatest computer solutions I've seen. It takes a complex and mindboggling problem and solves it in a simple and elegant way.

Alas, Tomasulo's Algorithm does have some weaknesses. The allocation of a reservation station for an instruction that is not able to execute due to dependencies can block an execution unit being used by a subsequent instruction that is able to execute. Some decoupling of reservation stations and executions units plus some instruction ticketing may be able to address this.

But most importantly, by itself it does not handle speculative execution rollback. Speculative execution as part of branch prediction and the attendant handling of mis-predicted cases is an essential part of any modern high performance out-of-order processor. Some sort of Reorder Buffer is required to handle this.

Articles

February 2023