These articles are written by Codalogic empowerees as a way of sharing knowledge with the programming community. They do not necessarily reflect the opinions of Codalogic.
(Code for this mini-series can be downloaded from Github)
When running programs its handy to print out numerical values.
puthex prints of the
value in the
x0 register in hexdecimal format.
In addition to being a useful function we will also see some
interesting Aarch64 instructions, including
Compared to the earlier subroutines, this has turned out to be quite long but hopefully it will all make sense.
I want to prefix the displayed number with
0x. To do that I defined an ASCII
string. I put it in the
.text program segment because I didn't want it to be modifiable.
hex_preamble: .ascii "0x"
The prefix is only 16-bits long. Instructions need to be aligned to 32-bit boundaries so
it is necessary to tell the assembler to re-align its output using the
In this case the
2 tells the assembler to align to a
2^2 byte boundary.
.align 2 // 4 byte boundary
Starting the subroutine, we do the regular stack from preamble sequence.
puthex: // void puthex( int64 x0 /* v */ ) stp fp, lr, [sp,#-16]! mov fp, sp
We want to use some register while the subroutine is running, specifically
x19 as a temporary store and
x20 as a loop counter, so we
store their initial values on the stack for later recovery.
stp x19, x20, [sp,#-16]!
x1 to call subroutines so we need to put the input
value somewhere where we can retrieve it after a subroutine call. I've chosen
to put it in
mov x19, x0
Next we print the preamble string mentioned earlier using our
// Print "0x" to make it clear it's a hex value ldr x0, =hex_preamble mov x1, 2 bl write
I want to skip leading zeros, but I don't want to print nothing if the
whole value is zero. To handle this case a test is made to see if
the input is zero. This is done using a
cmp instruction. If the input is
not zero the "branch not equal"
b.ne instruction wil branch to the next section.
If the input is zero we use
putc to output two zeros.
Note that each time we call
putc we must re-initialise
because called functions are allowed to overwrite the contents of
x7. When this special case is handled we branch
to the exit. (A label beginning with
.L is a local label and won't be exported
in the object file. Local labels still have file wide scope so I have chosen to
name local labels using the format:
.L_<subroutine name>_<local name>.)
// If the input value is 0, print "00" then jump to return cmp x19, 0 b.ne .L_puthex_1 mov x0, #'0' bl putc mov x0, #'0' bl putc b .L_puthex_exit
Now here's a neat Aarch64 instruction! We want the highest
order byte to be processed and displayed first. We therefore use the
rev instruction to reverse the order of the bytes in our
input register. If the input was
result would be
.L_puthex_1: // Reverse the order of the bytes in x19 rev x19, x19
A 64-bit register contains 8 bytes so we need to run the next loop
8 times. We're using
x20 to keep track of how many more loops we
need to do.
// There are 8 bytes in x19 so we have to do the below // operation 8 times mov x20, #8
I want to skip leading zeros.
If the least significant byte in
x19 is non-zero
branch to the main display code to display it.
.L_puthex_2: tst x19, #0x0f bne .L_puthex_3
Otherwise skip the zero byte by shifting in the next byte and decrementing the loop count. Then branch back to the check to see if the next byte is zero.
lsr x19, x19, #8 subs x20, x20, 1 bne .L_puthex_2 b .L_puthex_exit // Defensive - Shouldn't be possible // to get here as value can't be zero
A byte contains two hex nibbles. To avoid code duplication I have added an extra subroutine to
output a 4 bit nibble. I will describe that later. But first
we have to get the top nibble into the low nibble using an
lsr logical shift right instruction. We can then call the
puthexnibble nibble output subroutine, reload
x0 with the output value and
output the low nibble.
.L_puthex_3: // Output top nibble of byte (note lsr) mov x0, x19 lsr x0, x0, #4 bl puthexnibble // Output bottom nibble of byte mov x0, x19 bl puthexnibble
Having output the hex for a byte, we move the next byte into the lower
byte position using another
lsr logical shift right.
lsr x19, x19, #8
Time to work out if we have done enough loops. We subtract
1 from our
loop count stored in
x20. We have seen the
sub instruction before
but this time we use the
subs instruction. Unlike
sub, this updates the
status register with the result of the subtraction. If the result is zero
z zero flag will be set in the status register and we can test that
using the branch if not equal
b.ne instruction. If the count hasn't got to zero
we loop back to display the next byte.
subs x20, x20, 1 b.ne .L_puthex_3
If the count is zero then all the bytes have been output. All that remains to
recover the registers we put aside for safe keeping, do the stack frame
postamble and do the
ret subroutine return.
.L_puthex_exit: ldp x19, x20, [sp] mov sp, fp ldp fp, lr, [sp], #16 ret
posthexnibble instruction mentioned earlier highlights another interesting
feature of the Aarch64 instruction set.
As you know, hex values are represented by the ASCII characters
f. We can convert a number in the range
9 to its ASCII value
by adding the ASCII value of
0 to it. However, this does not work for a
number is the range
f. In this latter case we have to add the
ASCII value of
10. We need to do a test to see which of these
two cases our input number falls into.
In the code below we compare the value to
10 using a
This updates the status register. Rather than perform a branch on the result of this
test we perform both of the above modifications on our number, storing the
result of the first (
x0 + ASCII value of
x1 and the second (
x0 + ASCII value of
x2. Becuase we did these calculations using
add instructions rather
adds instruction, they didn't modify the status register which means the
result of the earlier
cmp instruction is preserved.
This allows us to use the 'conditional select'
csel instruction. This has an
'less than' condition attached to it. Therefore, if the earlier
yielded a 'less than' result, the contents of
x1 will be stored in
x0, otherwise the
x2 will be stored in
Performing conditional selection operations like this avoids having to use
branches that can dramatically slow down the execution of the code. Depending on
the specific Arm core used the two
add instructions could be dispatched to separate
execution pipelines and be performed in parallel meaning no time was lost.
puthexnibble: // void puthexnibble( int64 x0 /* v */ ) and x0, x0, #0x0f cmp x0, #10 add x1, x0, #'0' add x2, x0, #'a'-10 csel x0, x1, x2, LT b putc // As putc is the last and only subroutine called we can do a // jump rather than a subroutine. The ret in putc will take // us back to the calling function // If we could guarentee putc always followed puthexnibble we // wouldn't even need the branch!
Oh, and I snuck in a little
putnl() subroutine to just put out a new line
in order to make life simpler!
Observe that because there is no stack frame
manipulations and the branch to
putc is the last thing done in this routine,
b branch can be done to
putc rather than a
use the link register to directlly return control back to the calling function.
putnl: // void putnl() mov x0, #'\n' b putc
Examples of calling
puthex are below.
This highlights another feature of the instruction set. Because all instructions are coded into 32-bit words it's impossible for an instruction to encode a 64-bit immediate value. The largest immediate value that can be inserted into a register in one instruction is 16-bits.
Therefore the following 16-bit values can be loaded in one instruction.
_start: mov x0, #0 bl puthex bl putnl mov x0, #0x89cd bl puthex bl putnl
Immediate moves of negative values will be automatically assembled to use the
movn 'Move wide with NOT' instruction. This zeros the register, inserts the immediate
value with an optional shift and then inverts the result.
mov x0, #-100 bl puthex bl putnl
This means that the immediate range in one instruction is roughly
+/-65536. I say roughly
because the assembler will try to use shifted values to load values beyond this range if it can.
There are two ways around this 16-bit limit.
Firstly, you can use the 'zero and insert' move instruction
movz together with the
'keep and insert'
movk instruction. These instructions allow the specified 16-bit
immediate value to be left shifted by
48 bits before inserting the
value into a register. The
movz (zero) instruction will set the register to zero before
inserting the shifted immediate value and the
movk (keep) instruction will not zero the
register, keeping its initial value, and insert the shifted immediate value.
Combinations of these instructions allow any 64-bit (or smaller) immediate value to be
placed in a register.
// Loads 0x01234567 movz x0, #0x0123, LSL 16 movk x0, #0x4567 bl puthex bl putnl // Loads 0x0123456789abcdef movz x0, #0x0123, LSL 48 movk x0, #0x4567, LSL 32 movk x0, #0x89ab, LSL 16 movk x0, #0xcdef bl puthex bl putnl
The second is to use some assembler magic with the special assembler
ldr x0, =0xfedcba9876543210 form. This
puts the value in the
.text area of the program and automagically defines a pointer
to it which is inserted in place of the specified value. Using
objdump -d puthex
will give you a better idea of how this works. The result of this is shown after the
ldr x0, =0xfedcba9876543210 bl puthex bl putnl bl exit
The relevant part of the
obj -d puthex output is here. You can see
0x0123456789abcdef value has been put in memory
in little endian order at location
4001d8 and the
has been modified to read
x0 from that location.
4001bc: 580000e0 ldr x0, 4001d8 <_start+0x5c> 4001c0: 97ffffb0 bl 400080 <puthex> 4001c4: 97ffffd7 bl 400120 <putnl> 4001c8: 97ffffea bl 400170 <exit> 4001cc: 00000000 .inst 0x00000000 ; undefined 4001d0: 00400078 .word 0x00400078 4001d4: 00000000 .word 0x00000000 4001d8: 76543210 .word 0x76543210 4001dc: fedcba98 .word 0xfedcba98
And that's it. We're finally done. Although long, this exercise has shown a
number of interesting aspects of the Aarch64 instruction set. We've seen
and conditinal branches, the difference between
for loading large immediate values and the special instructions
I hope you enjoyed it.
The whole subroutine is:
puthex: // void puthex( int64 x0 /* v */ ) stp fp, lr, [sp,#-16]! mov fp, sp // We need x0 and x1 to call subroutines so put input x0 // somewhere safe stp x19, x20, [sp,#-16]! mov x19, x0 // Print "0x" to make it clear it's a hex value ldr x0, =hex_preamble mov x1, 2 bl write // If the input value is 0, print "00" then jump to return cmp x19, 0 b.ne .L_puthex_1 mov x0, #'0' bl putc mov x0, #'0' bl putc b .L_puthex_exit .L_puthex_1: // Reverse the order of the bytes in x19 rev x19, x19 // There are 8 bytes in x19 so we have to do the below // operation 8 times mov x20, #8 .L_puthex_2: // We want to skip leading zeros. // If the least significant byte in x19 is non-zero // branch to display it, else shift in the next byte // and decrement the loop count tst x19, #0x0f b.ne .L_puthex_3 lsr x19, x19, #8 subs x20, x20, 1 bne .L_puthex_2 b .L_puthex_exit // Defensive - Shouldn't be possible // to get here as value can't be zero .L_puthex_3: // Output top nibble of byte (note lsr) mov x0, x19 lsr x0, x0, #4 bl puthexnibble // Output bottom nibble of byte mov x0, x19 bl puthexnibble // See if we've finished lsr x19, x19, #8 subs x20, x20, 1 bne .L_puthex_3 .L_puthex_exit: ldp x19, x20, [sp] mov sp, fp ldp fp, lr, [sp], #16 ret puthexnibble: // void puthexnibble( int64 x0 /* v */ ) and x0, x0, #0x0f cmp x0, #10 add x1, x0, #'0' add x2, x0, #'a'-10 csel x0, x1, x2, LT b putc // As putc is the last and only subroutine called we can do a // jump rather than a subroutine. The ret in putc will take // us back to the calling function // If we could guarentee putc always followed puthexnibble we // wouldn't even need the branch! putnl: // void putnl() mov x0, #'\n' b putc
As usual, using the script, the program can be assembled and run using: