• Write x86_64 thunk to invoke any lambda function

    From Frederick Virchanza Gotham@21:1/5 to All on Sat Apr 15 17:22:19 2023
    In the C++ programming language, there are lambda functions. Some of them have captures, and some of them don't have captures. The lambda functions without captures can implicitly convert to a normal function pointer. The lamda functions with captures
    cannot implicitly convert to a normal function pointer.

    I started a thread yesterday on comp.lang.c++ in which I wanted to write a thunk in x86_64 assembler that would be able to invoke any lambda. I will copy-past that post here and then continue writing:

    Since C++11, there has been an implicit conversion from a lambda to a
    function pointer so long as the lambda has no captures. If the lambda
    has captures, the implicit conversion is disabled. However it's easy to
    get a function pointer from a lambda-with-captures if we use global
    variables or the heap, something like:

    std::function<void(void)> f; // global object

    void Func(void)
    {
    f(); // invokes the global object
    }

    void Some_Library_Func(void (*const pf)(void))
    {
    pf();
    }

    int main(int argc, char **argv)
    {
    auto mylambda = [argc](void) -> void
    {
    cout << "Hello " << argc << "!" << endl;
    };

    f = mylambda;

    Some_Library_Func(Func);
    }

    It is possible though to procure a normal function pointer from a lambda-with-captures without making use of global variables or the heap
    -- it can all be kept on the stack.

    To invoke a capture lambda, we need two pieces of data:
    Datum A: The address of the lambda object
    Datum B: The address of the 'operator()' member function

    Datum A is a pointer into data memory.
    Datum B is a pointer into code memory.

    The technique described in this post will only work on CPU's where the
    program counter can be set to an address in data memory, and therefore
    we will use 'void*' for Datum B rather than 'void(*)(void)'. I'm open
    to correction here but I think this technique will work on every
    implementation of C++ in existence today, even on microcontrollers such
    as the Texas Instruments F28069 and the Arduino Atmel sam3x8e.

    We will define a simple POD struct to hold these two pieces of data:

    struct LambdaInfo {
    void *data, *code;
    };

    Let's write a function that invokes a capture lambda, passing the 'this' pointer as the first argument to the member function:

    void InvokeLambda(LambdaInfo const *const p)
    {
    void (*pf)(void*) = (void (*)(void*))p->code;

    return pf(p->data);
    }

    And now let's check what this got compiled to on an x86_64 computer:

    mov rdx,QWORD PTR [rdi]
    mov rax,QWORD PTR [rdi+0x8]
    mov rdi,rdx
    jmp rax

    What we've got here is four instructions. To see a little more clearly
    what's going on here, I'm going to replace the function arguments with numerical constants:

    void InvokeLambda(void)
    {
    void (*pf)(void*) = (void (*)(void*))0x1122334455667788;

    return pf( (void*)0x99aabbccddeeffee );
    }

    gets compiled to:

    movabs rdi,0x99aabbccddeeffee
    movabs rax,0x1122334455667788
    jmp rax

    What we've got here now is three simple instructions. Here's the
    assembler alongside the machine code:

    movabs rdi,0x99aabbccddeeffee ---- 48 bf ee ff ee dd cc bb aa 99
    movabs rax,0x1122334455667788 ---- 48 b8 88 77 66 55 44 33 22 11
    jmp rax ---- ff e0

    What we have here is 22 bytes worth of CPU instructions, which we can
    put into a byte array as follows:

    char unsigned instructions[22u] = {
    0x48, 0xBF,
    0xEE, 0xFF, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x99,
    0x48, 0xB8,
    0x88, 0x77, 0x66, 0x55, 0x44, 0x33, 0x22, 0x11,
    0xFF, 0xE0,
    };

    This 22-byte array can be our thunk. I've written C++ code to manage
    this thunk up on GodBolt, and it works fine:

    https://godbolt.org/z/r84hEsG1G

    This works fine for a function with no return value and no parameters. So next I wanted to work on a solution that would work for _any_ lambda function even it had 27 parameters and a return value greater than 87 bytes. So I followed up my post to comp.
    lang.c++ with the following:

    The first thunk that I wrote accommodated a function that returned void
    and took no arguments. The machine code of the thunk was 22 bytes.

    Now I want to write a thunk that will work for _any_ lambda,
    irrespective of how many parameters it has or how big the return type
    is. On an x86_64 computer, here's how a function call works:

    The return address is pushed onto the stack
    Int/Ptr arguments 1-6 go in registers RDI, RSI, RDX, RCX, R8, R9
    Int/Ptr arguments 7 and above go onto the stack (right to left)
    Floating point arguments 1-8 go in registers XMM0 - XMM7
    Floating point arguments 9 and above go onto the stack
    The callee stores its return value in RAX
    If the return value is > 8 bytes, the next 8 bytes go in RDX
    If the return value is > 16 bytes, extra bytes go onto the stack
    The callee jumps back to the return address

    So let's say we have a lambda function whose signature is something like
    the following:

    bool Func(int a, long b, float c, int *p);

    If this lambda function has captures, then there will need to be a
    hidden 'this' pointer:

    bool Func(void *this, int a, long b, float c, int *p);

    The first function signature without the 'this' pointer would be called
    as follows:
    Store a in RDI
    Store b in RSI
    Store c in XMM0
    Store d in RDX

    However the second function signature would be called as follows:
    Store the 'this' pointer in RDI
    Store a in RSI
    Store b in RDX
    Store c in XMM0
    Store d in RCX

    So in effect, when we introduce a hidden 'this' pointer as the first
    parameter, all of the other registers get moved down one, so inside the
    our thunk, we need to write assembler to do the following:

    push R9 onto stack
    move R8 -> R9
    move RCX -> R8
    move RDX -> RCX
    move RSI -> RDX
    move RDI -> RSI
    store the 'this' pointer in RDI
    invoke the lambda
    pop r9 off the stack
    jump back to the return address

    So the Assembler for the thunk should be something like:

    push r9
    mov r9,r8
    mov r8,rcx
    mov rcx,rdx
    mov rdx,rsi
    mov rsi,rdi
    movabs rdi, 0x1122334455667788 // the 'this' pointer
    movabs rax, 0x99aabbccddeeff22 // address of lambda code
    call rax
    add rsp,8 // pop r9 off stack
    ret

    This new and improved thunk is 44 bytes. This will work fine for a
    function that has any number of parameters, so long as the return
    value isn't more than 16 bytes. See it working up on GodBolt:

    https://godbolt.org/z/xTfEYo5jE

    So next what we need to do is try get it to work with a function that
    returns a value greater than 16 bytes. The thunk we've written so far
    is problematic in this situation because:
    Step 1) We push r9 onto the stack
    Step 2) We invoke the lambda function
    Step 3) We pop r9 off the stack

    Step 3 will cause a problem because instead of popping r9 off the stack,
    it will pop the extra bytes of the return value off the stack. In order
    to get around this, here's what we need to do:
    Step 1) Before invoking the lambda, save the stack pointer
    Step 2) After invoking the lambda, check if stack pointer has changed
    Step 3) If the stack pointer is changed, do a 'memmove' to move
    everything on the stack upwards by 8 bytes

    Our 'memmove' operation would look like:

    void move_stack(char *oldval, char *const newval)
    {
    char *p = oldval,
    *q = oldval - 8u;

    while ( oldval-- > newval )
    {
    *p-- = *q--;
    }
    }

    If we compile that to assembler and use r15,rsp instead of rdi,rsi, here
    is what it looks like:

    start:
    cmp rsp,r15
    jae end
    movzx r13d,BYTE PTR [r15-0x8]
    sub r15,0x1
    mov BYTE PTR [r15+0x1],r13b
    jmp start
    end:

    So now I'm going to try to append this onto the end of the thunk we've
    already written, here goes:

    push r9
    mov r9,r8
    mov r8,rcx
    mov rcx,rdx
    mov rdx,rsi
    mov rsi,rdi
    movabs rdi, 0x1122334455667788 (the 'this' pointer)
    movabs rax, 0x99aabbccddeeff22 (address of lambda code)
    mov r15, rsp
    call rax
    start_of_loop:
    cmp rsp,r15
    jae out_of_loop
    movzx r13d,BYTE PTR [r15-0x8]
    sub r15,0x1
    mov BYTE PTR [r15+0x1],r13b
    jmp start_of_loop
    out_of_loop:
    add rsp,8 ; move stack pointer back 8 bytes
    ret

    This latest thunk comes to 67 bytes. Here is my attempt to implement
    it on GodBolt but it's not working yet:

    https://godbolt.org/z/nzWvE7q34

    I haven't got it working yet but I'm very very close. I realise we'll
    have to figure out how to restore the values of r15 and r13 but sure
    we'll worry about that after we've got this much working.

    Can anyone see what's wrong with my latest thunk? There's something
    not right about where I compare r15 and rsp and then try to move the
    entire stack down 8 bytes.

    Anyone?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Frederick Virchanza Gotham@21:1/5 to All on Sun Apr 16 06:22:14 2023
    On Sunday, April 16, 2023, I wrote:

    The callee stores its return value in RAX
    If the return value is > 8 bytes, the next 8 bytes go in RDX
    If the return value is > 16 bytes, extra bytes go onto the stack


    It looks like it isn't actually this simple. First let's see a function
    that returns a 128-Bit value:

    struct ReturnTypeB {
    long long unsigned a,b;
    };

    ReturnTypeB FuncB(void)
    {
    return { 0x1111111111111111, 0x2222222222222222 };
    }

    This gets compiled to:

    mov rax,0x1111111111111111
    mov rdx,0x2222222222222222
    ret

    This is exactly what I expected. But let's now make it return 24 bytes:

    struct ReturnTypeA {
    long long unsigned a,b,c;
    };

    ReturnTypeA FuncA(void)
    {
    return { 0x1111111111111111, 0x2222222222222222,
    0x3333333333333333 };
    }

    This........ rather unexpectedly...... gets compiled to:

    mov rax,rdi
    mov rdx,0x1111111111111111
    mov QWORD [rdi],rdx
    mov rcx,0x2222222222222222
    mov QWORD [rdi+0x8],rcx
    mov rsi,0x3333333333333333
    mov QWORD [rdi+0x10],rsi
    ret

    At first glance, it looks like this function is receiving a pointer in
    RDI, and that it's storing the three 64-Bit numbers sequentially at the
    address specified by RDI. To be totally sure about this, I took this
    machine code and ran it through a decompiler, which gave me:

    uint64_t *FuncA(uint64_t *const p)
    {
    p[0] = 0x1111111111111111u;
    p[1] = 0x2222222222222222u;
    p[2] = 0x3333333333333333u;
    return p;
    }

    What I've learned here I think can be summed up as follows:
    (a) If the return value <= 8 bytes, it goes in RAX
    (b) If the return value <= 16 bytes, the first 8 go in RAX and the
    second 8 go in RDX
    (c) If the return value >= 16 bytes, RAX and RDX are used differently.
    There is a hidden first parameter in RDI to the function which acts
    as a pointer indicating where the return value is to be stored. This
    hidden pointer is returned from the function in RAX. RDX isn't used.

    So now let's see what happens if I turn this into a member function,
    because I don't know if RDI will be used for the 'this' pointer or for
    the address to store the return value. Let's see, I wrote the
    following:

    void *volatile global;

    struct ReturnTypeA {
    long long unsigned a,b,c;
    };

    struct MyClass {
    ReturnTypeA FuncA(void)
    {
    global = this;
    return { 0x1111111111111111, 0x2222222222222222,
    0x3333333333333333 };
    }
    };

    And compiled it to the following assembler:

    // Next two lines set up the stack frame
    push rbp
    mov rbp,rsp
    // The next line sets return value = first argument
    mov rax,rdi
    // The next line 3 lines appear to get the 'this'
    // pointer from RSI and then store it in 'global'
    mov QWORD PTR [rbp-0x8],rsi
    mov rcx,QWORD PTR [rbp-0x8]
    mov QWORD PTR [rip+0x0],rcx # R_X86_64_PC32 global-0x4
    // The next 2 lines store a 64-Bit number in rdi+0
    mov rcx,0x1111111111111111
    mov QWORD PTR [rdi],rcx
    // The next 2 lines store a 64-Bit number in rdi+8
    mov rcx,0x2222222222222222
    mov QWORD PTR [rdi+0x8],rcx
    // The next 2 lines store a 64-Bit number in rdi+16
    mov rcx,0x3333333333333333
    mov QWORD PTR [rdi+0x10],rcx
    // The next 2 lines restore the frame pointer and return
    pop rbp
    ret

    So I think I've solved the mystery here. If the return type is more than
    16 bytes in size, the hidden 'this' pointer is put inside RSI rather
    than RDI, because RDI is used for the address of where the return value
    should be stored.

    So let's go back and take another look at our original thunk code for
    moving all the registers down:

    push r9
    mov r9,r8
    mov r8,rcx
    mov rcx,rdx
    mov rdx,rsi
    mov rsi,rdi
    mov rdi, 0x1122334455667788 (the 'this' pointer)
    mov rax, 0x99aabbccddeeff22 (address of lambda code)

    Let's try to edit this code to put the 'this' pointer in RSI instead of
    RDI, maybe something like:

    push r9
    mov r9,r8
    mov r8,rcx
    mov rcx,rdx
    mov rdx,rsi
    // REMOVE THIS LINE : mov rsi,rdi
    mov rsi, 0x1122334455667788 (the 'this' pointer)
    mov rax, 0x99aabbccddeeff22 (address of lambda code)

    Ok this might be the correct thunk code, and here it is up on GodBolt:

    https://godbolt.org/z/65YEsaT8o

    It works. Ok so now there's only one thing left to do. We need to
    have two separate thunk templates, Thunk Type A and Thunk Type B.
    Thunk Type A = for any function whose return type <= 16 bytes
    Thunk Type B = for any function whose return type > 16 bytes

    At compile-time, we can try to make this distinction by using 'sizeof'
    on the lambda's return type.... which is what I'm going to try do to
    now, see up on GodBolt:

    https://godbolt.org/z/MzYGxfz9Y

    Take a look yourself.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Frederick Virchanza Gotham@21:1/5 to Frederick Virchanza Gotham on Sun Apr 16 08:16:31 2023
    On Sunday, April 16, 2023 at 2:23:05 PM UTC+1, Frederick Virchanza Gotham wrote:

    https://godbolt.org/z/MzYGxfz9Y

    Take a look yourself.

    On Sunday, April 16, 2023 at 2:23:02 PM UTC+1, Frederick Virchanza Gotham wrote:

    https://godbolt.org/z/MzYGxfz9Y

    Take a look yourself.


    And finally, if the implementation I've described so far isn't available yet on your target architecture, here's a less-efficient implementation that will 'just work' on any system:

    https://godbolt.org/z/sG1rTbWE6

    And here it is copy-pasted:

    #include <cassert> // assert
    #include <cstddef> // size_t
    #include <utility> // forward

    template<typename LambdaType>
    class LambdaThunk {
    protected:

    static thread_local LambdaType *p_lambda_object;

    template <typename ReturnType, typename... Params>
    static ReturnType Actual_Thunk(Params... args)
    {
    assert( nullptr != p_lambda_object );
    return (*p_lambda_object)(std::forward<Params>(args)...);
    }

    template <typename ReturnType, typename... Params>
    static ReturnType (*Get_Thunk_Address(ReturnType (LambdaType::*)(Params...) const))(Params...)
    {
    return Actual_Thunk<ReturnType,Params...>;
    }

    public:

    LambdaThunk(LambdaType &obj)
    {
    p_lambda_object = &obj;
    }

    auto thunk(void) const volatile // yes this could be a static function
    {
    return Get_Thunk_Address(&LambdaType::operator());
    }
    };

    template<typename LambdaType>
    thread_local LambdaType *LambdaThunk<LambdaType>::p_lambda_object = nullptr;

    int Some_Library_Func(int (*const pf)(char const*))
    {
    return pf("monkey");
    }

    #include <iostream>
    using std::cout;
    using std::endl;

    int main(int argc, char **argv)
    {
    auto mylambda = [argc](char const *const p) -> int
    {
    cout << "Hello " << argc << " " << p << "!" << endl;
    return 77;
    };

    int const z = Some_Library_Func( LambdaThunk(mylambda).thunk() );

    cout << "z = " << z << endl;
    }

    I think the only place where this inefficient implementation could fall
    down is if you have a lambda defined inside a recursive function... but
    you'd just need to make sure that the thunk is invoked before the
    function is re-entered (unless of course, upon re-entry, you account
    for the thunk not having been invoked yet).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Frederick Virchanza Gotham on Sun Apr 16 17:04:14 2023
    Frederick Virchanza Gotham <cauldwell.thomas@nospicedham.gmail.com> writes: >The technique described in this post will only work on CPU's where the >program counter can be set to an address in data memory, and therefore
    we will use 'void*' for Datum B rather than 'void(*)(void)'. I'm open
    to correction here but I think this technique will work on every >implementation of C++ in existence today, even on microcontrollers such
    as the Texas Instruments F28069 and the Arduino Atmel sam3x8e.

    On MacOS on Apple Silicon an mmap() with RWX does not work. I read
    some page where Apple describes the hoops they want you to jump though
    to make JIT compilers work, but I forgot the details.

    From some web page I got the impression that some BSD has the same
    misfeature, but I don't have first-hand experience wrt that.

    - anton
    --
    M. Anton Ertl Some things have to be seen to be believed anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen http://www.complang.tuwien.ac.at/anton/home.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Frederick Virchanza Gotham on Mon Apr 17 05:06:21 2023
    Frederick Virchanza Gotham <cauldwell.thomas@nospicedham.gmail.com> schrieb:

    I started a thread yesterday on comp.lang.c++ in which I wanted to
    write a thunk in x86_64 assembler that would be able to invoke any
    lambda. I will copy-past that post here and then continue writing:

    Without going too much into the details (or even knowing much
    about C++): This will be platform-dependent.

    If you are using Linux or another Unix variant that uses the C++
    Itanium ABI, you should have a look at it it and its platform-specific supplement, specifically https://itanium-cxx-abi.github.io/cxx-abi/ and https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/x86-64-psABI-1.0.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)