These articles are written by Codalogic empowerees as a way of sharing knowledge with the programming community. They do not necessarily reflect the opinions of Codalogic.
To improve performance, modern processors allow multiple instructions to executed at the same time. With instructions that take different numbers of clock cycles to execute this means that instructions can complete execution in a different order to which they were started.
This is called "out-of-order" execution.
This completion in a different order to the assembly code instruction order means that without special care the temporal dependencies between register values in the initial assembly code may be lost, resulting in the code not behaving as desired.
These situations are called "hazards".
For example, take the following instructions sequence:
mul r0, r1, r2
add r3, r0, r3
sub r0, r4, r5
The first two instructions are a DSP-like multiply-accumulate set of instructions that might be used on a processor
that doesn't have a dedicated MAC instruction (such as MADD
on the Aarach64). r1
and r2
are being multiplied and the result accumulated into r3
.
mul
instructions typically take many more clock cycles to execute than adds and substracts.
Therefore, if the mul
instruction is issued and then the add
instruction is issued, the
value of r0
won't be ready in time for the add
to use it. Hence without special handling
the add
instruction will accumulate the wrong value.
This is a "Read-after-Write" or "RAW" hazard.
Similarly, the sub
instruction is writing to r0
. We don't want the r0
value created by the mul
instruction to overwrite this and we don't want the add
instruction to use the r0
value created by the
sub
instruction. These are "Write-after-Write" and "Write-after-Read" hazards respectively.
For small processors (such as Arm LITTLE core processors) this can be handled by delaying the
dispatch of dependent instructions. For example, the add
instruction could be delayed until the result of the mul
is in r0
, thus enforcing in-order
execution. However, this means stalling the processor pipeline and potentially wasting clock cycles.
One way to avoid this is to do "register renaming".
One scheme for this is to provision an additional set of registers that will be used as alternative destinations for
instruction results. In some processors there may be up to 10 times more such registers than registers in the programmer's model.
I'll call these "Rename Registers" and reference them in the modifed assembly code below as rnr1
, rnr2
etc.
The register rename logic will conceptionally replace the contents of a main architectural or programmer model register with the identity of a rename register. Each time the assembly code has a main register written to, the rename logic will allocate a new rename register and store the identity of that rename register in the main register set.
The register renaming logic will then re-write the above assembly code to look something like:
mul rnr10, r1, r2
add rnr30, rnr10, r3
sub rnr40, r4, r5
Note that the rename registers are allocated to registers before their values are known. For this reason rename registers will contain extra house keeping information to record whether they contain valid data.
When the add
instruction is dispatched it will be told to wait until the data in rename register rnr10
is
valid before executing. When rnr10
is written to by the mul
instruction an implementation typically
broadcasts to all functional units that rename register rnr10
has been updated and any instructions waiting on that
should go and get its value.
Similarly, the sub
instruction has been modified to write to rename register rnr40
and no longer risks being read
by the add
instruction or being overwritten to by the mul
instruction.
One issue that remains is deciding when a value in a rename register is no longer needed. We've gone to great efforts to decouple the lifetime of the rename register values from the original instruction order and now we need to come up with a scheme for working out when we no longer need the rename register values. This is tyically done by a re-order buffer or "ROB" that keeps track of the order in which the instructions were initially dispatched and retires them, writing back their data to the main architectural programmer model registers, in the order they appeared in the original instruction sequence.
An alternative approach to the overall problem of handling out-of-order execution is to use "Tomasulo's Algorithm".
Here, each set of execution units has "reservation stations" to which the dispatcher sends instructions to. If an instruction doesn't have the data it needs to execute immediately the reservation station is told what data it needs to wait for. However, unlike previously where the units are waiting for data to be written to a rename register, the reservation station is told to wait for the data from a particular execution unit.
When each execution unit finishes it operation it broadcasts to all reservation stations a message equivalent to: "I am execution unit XYZ, my data is ABC". If a reservation station is waiting on data from that execution unit, it will capture the data and, if all the data it needs is now available, start executing its functionality.
The main programmer model registers will also be told to wait on results of execution units and update their values when the data is broadcast. In fact, rather than storing just values, the main registers can also store the identity of an execution unit to wait on to get the value from and this can be copied to a reservation station in place of a specific value.
So, for example, if execution unit mul2
is scheduled to execute the first multiply, execution unit alu3
is scheduled to execute the
add
and execution alu0
the subtract, our original assembly code of:
mul r0, r1, r2
add r3, r0, r3
sub r0, r4, r5
might get interpreted as:
mul announce as mul2, r1, r2 ; set r0 = wait on mul2
add announce as alu3, r0 -> listen on mul2, r3 ; set r3 = wait on alu3
sub announce as alu0, r4, r5 ; set r0 = wait on alu0
Those instructions will be dispatched in-order even though they might complete out-of-order.
The add
looks at the status of r0
and sees that it has to wait for the announcement of execution unit mul2
.
Register r3
is set to capture the data associated with the announcement from alu3
. When alu3
makes the announcement,
r3
will capture the relevant data.
At the end of the sequence, r0
will be listening for the completion of alu0
rather than the completion of mul2
and consequently the result of the mul
will
not overwrite the result of the sub
.
As a result the values we want end up in r0
and r3
so the hazards are avoided.
In conclusion, rather than listening for changes to rename registers, Tomasulo's Algorithm moves the the broadcast functionality to the execution units themselves. This avoids the need for the separate set of rename registers and in so doing removes the need to work out when a rename register is no longer required, simplifying the implemenation.
Personally I think Tomasulo's Algorithm is one of the neatest computer solutions I've seen. It takes a complex and mindboggling problem and solves it in a simple and elegant way.
Alas, Tomasulo's Algorithm does have some weaknesses. The allocation of a reservation station for an instruction that is not able to execute due to dependencies can block an execution unit being used by a subsequent instruction that is able to execute. Some decoupling of reservation stations and executions units plus some instruction ticketing may be able to address this.
But most importantly, by itself it does not handle speculative execution rollback. Speculative execution as part of branch prediction and the attendant handling of mis-predicted cases is an essential part of any modern high performance out-of-order processor. Some sort of Reorder Buffer is required to handle this.
February 2023
January 2023
December 2022
November 2022
October 2022
September 2022
August 2022
November 2021
June 2021
May 2021
April 2021
March 2021
October 2020
September 2020
September 2019
March 2019
June 2018
June 2017
August 2016