These articles are written by Codalogic empowerees as a way of sharing knowledge with the programming community. They do not necessarily reflect the opinions of Codalogic.
To improve performance, modern processors allow multiple instructions to executed at the same time. With instructions that take different numbers of clock cycles to execute this means that instructions can complete execution in a different order to which they were started.
This is called "out-of-order" execution.
This completion in a different order to the assembly code instruction order means that without special care the temporal dependencies between register values in the initial assembly code may be lost, resulting in the code not behaving as desired.
These situations are called "hazards".
For example, take the following instructions sequence:
mul r0, r1, r2 add r3, r0, r3 sub r0, r4, r5
The first two instructions are a DSP-like multiply-accumulate set of instructions that might be used on a processor
that doesn't have a dedicated MAC instruction (such as
MADD on the Aarach64).
are being multiplied and the result accumulated into
mul instructions typically take many more clock cycles to execute than adds and substracts.
Therefore, if the
mul instruction is issued and then the
add instruction is issued, the
r0 won't be ready in time for the
add to use it. Hence without special handling
add instruction will accumulate the wrong value.
This is a "Read-after-Write" or "RAW" hazard.
sub instruction is writing to
r0. We don't want the
r0 value created by the
instruction to overwrite this and we don't want the
add instruction to use the
r0 value created by the
sub instruction. These are "Write-after-Write" and "Write-after-Read" hazards respectively.
For small processors (such as Arm LITTLE core processors) this can be handled by delaying the
dispatch of dependent instructions. For example, the
add instruction could be delayed until the result of the
mul is in
r0, thus enforcing in-order
execution. However, this means stalling the processor pipeline and potentially wasting clock cycles.
One way to avoid this is to do "register renaming".
One scheme for this is to provision an additional set of registers that will be used as alternative destinations for
instruction results. In some processors there may be up to 10 times more such registers than registers in the programmer's model.
I'll call these "Rename Registers" and reference them in the modifed assembly code below as
The register rename logic will conceptionally replace the contents of a main architectural or programmer model register with the identity of a rename register. Each time the assembly code has a main register written to, the rename logic will allocate a new rename register and store the identity of that rename register in the main register set.
The register renaming logic will then re-write the above assembly code to look something like:
mul rnr10, r1, r2 add rnr30, rnr10, r3 sub rnr40, r4, r5
Note that the rename registers are allocated to registers before their values are known. For this reason rename registers will contain extra house keeping information to record whether they contain valid data.
add instruction is dispatched it will be told to wait until the data in rename register
valid before executing. When
rnr10 is written to by the
mul instruction an implementation typically
broadcasts to all functional units that rename register
rnr10 has been updated and any instructions waiting on that
should go and get its value.
sub instruction has been modified to write to rename register
rnr40 and no longer risks being read
add instruction or being overwritten to by the
One issue that remains is deciding when a value in a rename register is no longer needed. We've gone to great efforts to decouple the lifetime of the rename register values from the original instruction order and now we need to come up with a scheme for working out when we no longer need the rename register values. This is tyically done by a re-order buffer or "ROB" that keeps track of the order in which the instructions were initially dispatched and retires them, writing back their data to the main architectural programmer model registers, in the order they appeared in the original instruction sequence.
An alternative approach to the overall problem of handling out-of-order execution is to use "Tomasulo's Algorithm".
Here, each set of execution units has "reservation stations" to which the dispatcher sends instructions to. If an instruction doesn't have the data it needs to execute immediately the reservation station is told what data it needs to wait for. However, unlike previously where the units are waiting for data to be written to a rename register, the reservation station is told to wait for the data from a particular execution unit.
When each execution unit finishes it operation it broadcasts to all reservation stations a message equivalent to: "I am execution unit XYZ, my data is ABC". If a reservation station is waiting on data from that execution unit, it will capture the data and, if all the data it needs is now available, start executing its functionality.
The main programmer model registers will also be told to wait on results of execution units and update their values when the data is broadcast. In fact, rather than storing just values, the main registers can also store the identity of an execution unit to wait on to get the value from and this can be copied to a reservation station in place of a specific value.
So, for example, if execution unit
mul2 is scheduled to execute the first multiply, execution unit
alu3 is scheduled to execute the
add and execution
alu0 the subtract, our original assembly code of:
mul r0, r1, r2 add r3, r0, r3 sub r0, r4, r5
might get interpreted as:
mul announce as mul2, r1, r2 ; set r0 = wait on mul2 add announce as alu3, r0 -> listen on mul2, r3 ; set r3 = wait on alu3 sub announce as alu0, r4, r5 ; set r0 = wait on alu0
Those instructions will be dispatched in-order even though they might complete out-of-order.
add looks at the status of
r0 and sees that it has to wait for the announcement of execution unit
r3 is set to capture the data associated with the announcement from
alu3 makes the announcement,
r3 will capture the relevant data.
At the end of the sequence,
r0 will be listening for the completion of
rather than the completion of
mul2 and consequently the result of the
not overwrite the result of the
As a result the values we want end up in
r3 so the hazards are avoided.
In conclusion, rather than listening for changes to rename registers, Tomasulo's Algorithm moves the the broadcast functionality to the execution units themselves. This avoids the need for the separate set of rename registers and in so doing removes the need to work out when a rename register is no longer required, simplifying the implemenation.
Personally I think Tomasulo's Algorithm is one of the neatest computer solutions I've seen. It takes a complex and mindboggling problem and solves it in a simple and elegant way.
Alas, Tomasulo's Algorithm does have some weaknesses. The allocation of a reservation station for an instruction that is not able to execute due to dependencies can block an execution unit being used by a subsequent instruction that is able to execute. Some decoupling of reservation stations and executions units plus some instruction ticketing may be able to address this.
But most importantly, by itself it does not handle speculative execution rollback. Speculative execution as part of branch prediction and the attendant handling of mis-predicted cases is an essential part of any modern high performance out-of-order processor. Some sort of Reorder Buffer is required to handle this.