These articles are written by Codalogic empowerees as a way of sharing knowledge with the programming community. They do not necessarily reflect the opinions of Codalogic.

Aarch64 Stack Frames Again

By: Pete, October 2022

In a previous blog I talked about stack frames and presented what I consider a "Traditional" stack frame layout.

The stack frame entry code I presented looked like:

    stp     fp, lr, [sp,#-16]!
    mov     fp, sp
    sub     sp, sp, #160

The frame pointer (AKA x29) and the link register (AKA x30) are pushed onto the stack, the updated stack pointer is copied into the frame pointer, and then space is made on the stack for any local and temporary variables.

And the postamble looked like this:

    mov     sp, fp
    ldp     fp, lr, [sp], #16
    ret
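
To make the bookkeeping concrete, here is a minimal C++ sketch that models the traditional pre- and post-amble as arithmetic on a toy machine. The Machine struct and the traditional_enter / traditional_leave names are hypothetical, invented purely for illustration; they are not taken from any compiler or library.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical toy machine: three registers plus sparse 8-byte memory slots.
struct Machine {
    std::uint64_t sp, fp, lr;
    std::map<std::uint64_t, std::uint64_t> mem;
};

// Models: stp fp, lr, [sp,#-16]! ; mov fp, sp ; sub sp, sp, #locals
void traditional_enter(Machine& m, std::uint64_t locals) {
    m.sp -= 16;                 // pre-indexed writeback
    m.mem[m.sp]     = m.fp;     // caller's fp goes at the lower address
    m.mem[m.sp + 8] = m.lr;     // return address sits just above it
    m.fp = m.sp;                // fp now points at the saved pair
    m.sp -= locals;             // room for locals and temporaries
    assert(m.sp % 16 == 0);     // AAPCS64 keeps sp 16-byte aligned
}

// Models: mov sp, fp ; ldp fp, lr, [sp], #16
void traditional_leave(Machine& m) {
    m.sp = m.fp;                // discard locals without knowing their size
    m.fp = m.mem[m.sp];         // restore caller's fp
    m.lr = m.mem[m.sp + 8];     // restore return address
    m.sp += 16;                 // post-indexed writeback
}
```

Calling traditional_enter(m, 160) on a machine whose sp starts at 0x1000 leaves fp at 0x0FF0 and sp at 0x0F50, and traditional_leave(m) then puts all three registers back. The point to notice is that the post-amble never needs to know the value 160.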

I noticed that the stack frame pre- and post-amble generated by GCC, Clang and MSVC didn't look like this.

I therefore wrote the following short program and inspected the generated assembly using Compiler Explorer.

#include <string>
#include <iostream>

std::string merge( std::string a, std::string b, std::string c )
{
    std::string d = a + b;
    std::string e = a + d + b;
    return d + e;
}

int main()
{
    merge( "a", "b", "c" );
}

The motivation here is to create a function that requires more data than can fit in the processor's registers and hence has to allocate stack space.

The stack pre-amble generated by armv8-a Clang (available at: https://godbolt.org/z/3hYzxYG1r) looked as follows:

merge(std::__cxx11::basic_string<char, std::char_traits<char>, ...
        sub     sp, sp, #176
        stp     x29, x30, [sp, #160]            // 16-byte Folded Spill
        add     x29, sp, #160

Here the stack is grown first (towards lower memory) and then the stp reaches back to the top of the allocated region to store the frame pointer (fp/x29) and link register (x30/lr). The address at which they were stored is then computed and placed in the new frame pointer.

The resulting stack ends up similar to my "traditional" layout but computed in a different way. It looks like this:

|                     |
+---------------------+
|          lr         |
+---------------------+
|    original fp      | <- fp
+---------------------+
|                     |
|                     |
|     ...space...     |
|                     |
|                     | <- sp
+---------------------+

The Clang stack post-amble in this case is:

        ldp     x29, x30, [sp, #160]            // 16-byte Folded Reload
        add     sp, sp, #176
        ret

The code reaches back to retrieve the frame pointer and link register and then computes what the stack pointer would have been before the function was entered. Note that it doesn't use the frame pointer to do this.
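
The same idea can be sketched in C++ as arithmetic on a toy machine. This is only an illustrative model of the six instructions shown above: the Machine struct and the clang_enter / clang_leave names are hypothetical, and the constants 160 and 176 are simply copied from the listing.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical toy machine: three registers plus sparse 8-byte memory slots.
struct Machine {
    std::uint64_t sp, fp, lr;
    std::map<std::uint64_t, std::uint64_t> mem;
};

// Models: sub sp, sp, #176 ; stp x29, x30, [sp, #160] ; add x29, sp, #160
void clang_enter(Machine& m) {
    m.sp -= 176;                  // grow the whole frame in one step
    assert(m.sp % 16 == 0);       // AAPCS64 keeps sp 16-byte aligned
    m.mem[m.sp + 160] = m.fp;     // reach back up to save caller's fp
    m.mem[m.sp + 168] = m.lr;     // and the return address above it
    m.fp = m.sp + 160;            // fp points at the saved pair
}

// Models: ldp x29, x30, [sp, #160] ; add sp, sp, #176
void clang_leave(Machine& m) {
    m.fp = m.mem[m.sp + 160];     // addressed from sp, not from fp
    m.lr = m.mem[m.sp + 168];
    m.sp += 176;                  // undo the allocation by a known constant
}
```

In this model, overwriting m.fp between clang_enter and clang_leave makes no difference to the result, because the post-amble only ever addresses memory relative to sp. That is the sense in which the frame pointer is not used to unwind the frame.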

With GCC the following pre-amble is used (available at: https://godbolt.org/z/PMKWzP93Y):

merge(std::__cxx11::basic_string<char, std::char_traits<char>,...:
        stp     x29, x30, [sp, -160]!
        mov     x29, sp

Here the frame pointer and link register end up stored at the bottom of the allocated stack space. That seems unusual to me!

|                     |
+---------------------+
|                     |
|                     |
|     ...space...     |
|                     |
|                     |
+---------------------+
|          lr         |
+---------------------+
|    original fp      | <- fp, sp
+---------------------+

The post-amble is below. Note that, again, the frame pointer is not used; it relies on the compiler keeping track of how many bytes it has allocated on the stack (easy for a compiler to do but not so easy for a person).

        ldp     x29, x30, [sp], 160
        ret

The GCC approach does avoid requiring an additional sub sp, sp, ? instruction so it makes sense in that respect - if you can easily keep track of how much space you've allocated on the stack.
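
The difference between the Clang and GCC layouts can be summarised as a few compile-time offsets. The sketch below is only my own bookkeeping of the two listings above: FrameShape, clang_shape, gcc_shape and bytes_below_record are hypothetical names, and the numbers are read straight off the generated instructions.

```cpp
#include <cstdint>

// Hypothetical summary of where a pre-amble leaves things, as byte offsets
// from the value sp had on function entry.
struct FrameShape {
    std::int64_t fp_offset;       // new fp relative to entry sp
    std::int64_t sp_offset;       // new sp relative to entry sp
    int          prologue_insns;  // instructions used to build the frame
};

// Clang: sub sp,sp,#176 ; stp x29,x30,[sp,#160] ; add x29,sp,#160
constexpr FrameShape clang_shape() { return { -176 + 160, -176, 3 }; }

// GCC: stp x29,x30,[sp,-160]! ; mov x29,sp
constexpr FrameShape gcc_shape() { return { -160, -160, 2 }; }

// Bytes of the frame that lie below the saved fp/lr pair.
constexpr std::int64_t bytes_below_record(FrameShape s) {
    return s.fp_offset - s.sp_offset;
}

static_assert(bytes_below_record(clang_shape()) == 160,
              "Clang: saved pair sits at the top of the frame");
static_assert(bytes_below_record(gcc_shape()) == 0,
              "GCC: saved pair sits at the bottom of the frame");
```

With Clang the saved pair sits 16 bytes below the entry sp with 160 bytes of space beneath it, whereas with GCC fp and sp coincide and all the remaining space is above the saved pair, at the cost of one fewer instruction.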

MS Visual Studio does the following. It seems to use some security cookies to protect the stack from (presumably) ROP attacks. It pushes the frame pointer and link register onto the stack, reserving 32 bytes rather than 16, and then reaches back to store x19 in the extra space. It stores the revised stack pointer in the frame pointer and then allocates stack space for the local data.

MSVC (https://godbolt.org/z/4q3EoTKhj)

... merge(std::basic_string<char,std::char_traits<char>...
        stp         fp,lr,[sp,#-0x20]!
        str         x19,[sp,#0x10]
        mov         fp,sp
        bl          __security_push_cookie
        sub         sp,sp,#0xA0
        mov         x19,sp

The stack ends up looking like this:

|                     |
+---------------------+
|       (unused)      |
+---------------------+
|         x19         |
+---------------------+
|         lr          |
+---------------------+
|    original fp      | <- fp
+---------------------+
|                     |
|                     |
|     ...space...     |
|                     |
|                     | <- sp
+---------------------+

If we modify the MSVC code to remove the impact of the x19 based security cookie then the code looks like the traditional pre-amble:

... modified merge(std::basic_string<char,std::char_traits<char>...
        stp         fp,lr,[sp,#-0x10]!
        mov         fp,sp
        sub         sp,sp,#0xA0

The MSVC post-amble is:

        ldr         x0,[x19,#8]
        add         sp,sp,#0xA0
        bl          __security_pop_cookie
        ldr         x19,[sp,#0x10]
        ldp         fp,lr,[sp],#0x20
        ret

Again, the security cookie changes things, but as with the other compilers, it is relying on keeping track of how much the stack pointer has been changed rather than relying on the frame pointer.

In summary, it's interesting how the different compilers solve the same problem. Personally, I would use my "traditional" stack frame for hand-crafted code and use the frame pointer in the post-amble unless the slight added efficiency of the GCC approach was demonstrably beneficial.