C++ exception (1) — zero-cost exception handling

This is the first post in a series I am writing on C++ exceptions.

When people say "C++ exceptions are slow", what they really mean is that throwing an exception in C++ is slow. C++ exception handling is zero-cost when no exception is thrown. Let's look at a very simple example.

class Foo {
    public:
        ~Foo() {}
};

class Bar {
    public:
        ~Bar() {}
};

void func(bool b) {
    Bar bar{}; // <--- Bar::~Bar() should be called on stack unwinding
    if (b) {
        throw 1;
    }
}

int main() {
    try {
        Foo foo{};  // <--- Foo::~Foo() should be called on stack unwinding
        func(false);
    } catch (...) {
    }
    return 0;
}

I put throw 1; in the code even though it never executes, so that we can see the machine code generated for it (compiled with clang 13, -std=c++17, no optimization). Here's the assembly (in Intel syntax), which we will read one piece at a time.

func(bool):                               # @func(bool)
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     al, dil
        and     al, 1
        mov     byte ptr [rbp - 1], al
        test    byte ptr [rbp - 1], 1
        je      .LBB0_3
        mov     edi, 4
        call    __cxa_allocate_exception
        mov     rdi, rax
        mov     dword ptr [rdi], 1
        mov     esi, offset typeinfo for int
        xor     eax, eax
        mov     edx, eax
        call    __cxa_throw
        jmp     .LBB0_5
        mov     rcx, rax
        mov     eax, edx
        mov     qword ptr [rbp - 16], rcx
        mov     dword ptr [rbp - 20], eax
        lea     rdi, [rbp - 8]
        call    Bar::~Bar() [base object destructor]
        jmp     .LBB0_4
.LBB0_3:
        lea     rdi, [rbp - 8]
        call    Bar::~Bar() [base object destructor]
        add     rsp, 32
        pop     rbp
        ret
.LBB0_4:
        mov     rdi, qword ptr [rbp - 16]
        call    _Unwind_Resume@PLT
.LBB0_5:
Bar::~Bar() [base object destructor]:                            # @Bar::~Bar() [base object destructor]
        push    rbp
        mov     rbp, rsp
        mov     qword ptr [rbp - 8], rdi
        pop     rbp
        ret
main:                                   # @main
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     dword ptr [rbp - 4], 0
        xor     edi, edi
        call    func(bool)
        jmp     .LBB2_1
.LBB2_1:
        lea     rdi, [rbp - 8]
        call    Foo::~Foo() [base object destructor]
        jmp     .LBB2_4
        mov     rcx, rax
        mov     eax, edx
        mov     qword ptr [rbp - 16], rcx
        mov     dword ptr [rbp - 20], eax
        lea     rdi, [rbp - 8]
        call    Foo::~Foo() [base object destructor]
        mov     rdi, qword ptr [rbp - 16]
        call    __cxa_begin_catch
        call    __cxa_end_catch
.LBB2_4:
        xor     eax, eax
        add     rsp, 32
        pop     rbp
        ret
Foo::~Foo() [base object destructor]:                            # @Foo::~Foo() [base object destructor]
        push    rbp
        mov     rbp, rsp
        mov     qword ptr [rbp - 8], rdi
        pop     rbp
        ret

Zero-cost exception handling

Let's start with the simplest piece.

.LBB0_5:
Bar::~Bar() [base object destructor]:                            # @Bar::~Bar() [base object destructor]
        push    rbp                           # store the caller frame pointer on stack
        mov     rbp, rsp                      # set the frame pointer for the callee
        mov     qword ptr [rbp - 8], rdi      # store `this` pointer on stack (passed in RDI, as dictated by the AMD64 ABI)
        pop     rbp                           # restore the frame pointer
        ret

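The .LBB0_5 label marks the (empty) end of func: it is the target of the jmp that follows the call to __cxa_throw, which is never reached because __cxa_throw does not return. In this listing it sits right in front of the Bar::~Bar() symbol, so the code below it is Bar's destructor. We see the common prologue and epilogue on x86. The code is the same for Foo::~Foo(). Now let's take a look at func.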

func(bool):                               # @func(bool)
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     al, dil       # dil is the low 8 bits of rdi (http://www.tortall.net/projects/yasm/manual/html/arch-x86-registers.html); this copies the `bool` param into al
        and     al, 1
        mov     byte ptr [rbp - 1], al
        test    byte ptr [rbp - 1], 1
        je      .LBB0_3   # jump to .LBB0_3 if `b` is not 1/true
        mov     edi, 4
        call    __cxa_allocate_exception
        mov     rdi, rax
        mov     dword ptr [rdi], 1
        mov     esi, offset typeinfo for int
        xor     eax, eax
        mov     edx, eax
        call    __cxa_throw
        jmp     .LBB0_5
        mov     rcx, rax
        mov     eax, edx
        mov     qword ptr [rbp - 16], rcx
        mov     dword ptr [rbp - 20], eax
        lea     rdi, [rbp - 8]
        call    Bar::~Bar() [base object destructor]
        jmp     .LBB0_4

After the initialization of bar, if b is false, the code just jumps to .LBB0_3, which calls Bar's destructor, restores the stack pointer rsp, and returns.

.LBB0_3:
        lea     rdi, [rbp - 8]                        # store `this` in rdi (dictated by AMD64 ABI)
        call    Bar::~Bar() [base object destructor]  # call the dtor
        add     rsp, 32                               # release the stack frame's local storage
        pop     rbp                                   # restore caller's frame pointer
        ret                                           # return

Now let's take a look at main.

main:                                   # @main
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     dword ptr [rbp - 4], 0        # zero the stack slot clang reserves for main's return value
        xor     edi, edi                      # pass `false` as the first argument in edi (per the AMD64 ABI)
        call    func(bool)
        jmp     .LBB2_1
.LBB2_1:
        lea     rdi, [rbp - 8]               # store `this` foo's addr in rdi
        call    Foo::~Foo() [base object destructor]
        jmp     .LBB2_4
        ...
        
.LBB2_4:
        xor     eax, eax                     # set eax (return value) to zero
        add     rsp, 32                      # release the stack frame's local storage
        pop     rbp                          # restore caller's frame pointer
        ret

The compiler uses xor eax, eax rather than mov eax, 0 to zero a register for efficiency: the encoding is shorter, and modern x86 processors recognize it as a zeroing idiom that can be resolved at the register-renaming stage. Notice the mov dword ptr [rbp - 4], 0 line in main. It zeroes a four-byte stack slot that clang reserves for main's return value when optimization is off, and which is never read here. If you switch from clang to gcc (with the same flags), this line disappears.

If you follow the labels from main to .LBB2_1 to .LBB2_4, all the code does is call func, destruct foo, and return 0. When no exception is thrown, the code runs as if the exception handling logic were not there at all. This is known as zero-cost exception handling. It is an implementation strategy rather than a language feature: common C++ and Java implementations use it, and CPython adopted it in 3.11.
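
To see the two paths through func and main without reading assembly, here is a small instrumented variant of the example above. It is only a sketch, not the code that produced the listing: the events vector and the trace function are hypothetical helpers added for illustration, and they simply record the order in which things happen.

```cpp
#include <string>
#include <vector>

// Hypothetical helper for this sketch: a log of what happened, in order.
static std::vector<std::string> events;

struct Foo {
    ~Foo() { events.push_back("~Foo"); }
};

struct Bar {
    ~Bar() { events.push_back("~Bar"); }
};

void func(bool b) {
    Bar bar{};  // destroyed on normal return and on stack unwinding
    if (b) {
        events.push_back("throw");
        throw 1;
    }
    events.push_back("func returns");
}

// Hypothetical driver: same try/catch shape as main in the post,
// but it returns the recorded events instead of discarding them.
std::vector<std::string> trace(bool b) {
    events.clear();
    try {
        Foo foo{};  // destroyed before the catch block is entered
        func(b);
    } catch (...) {
        events.push_back("caught");
    }
    return events;
}
```

Called from a main of your own, trace(false) should record func returns, ~Bar, ~Foo, while trace(true) should record throw, ~Bar, ~Foo, caught. The same destructors run on both paths; only the throwing path goes through the unwinder.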

You might be thinking: we are not even handling an exception in this code, so of course it's zero-cost. How can we call it exception handling to begin with? Well, the code does set up a handler; the exception just never gets thrown. To see the difference, we can rewrite the code with return codes:

int func(bool b) {
    if (b) { return -1; }  // exceptional case
    else { return 0; }
}

int main() {
    int ret = func(false);
    if (ret != 0) {
        // error handling
    }
}

Notice that even if func never returns -1 here, main needs to check the return value (that is, handle the error). A smart compiler can optimize the check away in this case, but in practice func will fail sometimes, so main always has to check its return value. This check, a form of exception handling, is not free, even though the vast majority of the time there are no errors.

This is essentially how CPython implemented exception handling before 3.11: it checks whether an exception (PyErr) is set all the time and invokes the handling logic if needed, so it is not zero-cost. With C++ exceptions, on the other hand, the path without exceptions is free (if you discount iTLB misses). Binary layout optimizers such as BOLT can further optimize the physical layout of the code to minimize iTLB misses.
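
The contrast between the two styles can be sketched side by side. Everything here is hypothetical and made up for illustration (parse_digit_rc, sum_digits_rc, parse_digit_ex, sum_digits_ex); the point is only where the check on the happy path lives.

```cpp
#include <stdexcept>

// Return-code style: 0 on success, -1 on failure, result via out-parameter.
int parse_digit_rc(char c, int& out) {
    if (c < '0' || c > '9') return -1;
    out = c - '0';
    return 0;
}

int sum_digits_rc(const char* s, int& total) {
    total = 0;
    for (; *s != '\0'; ++s) {
        int d = 0;
        // The caller tests the return value on every call,
        // even when the input never contains a bad character.
        if (parse_digit_rc(*s, d) != 0) return -1;
        total += d;
    }
    return 0;
}

// Exception style: failure is reported by throwing.
int parse_digit_ex(char c) {
    if (c < '0' || c > '9') throw std::invalid_argument("not a digit");
    return c - '0';
}

int sum_digits_ex(const char* s) {
    int total = 0;
    // No error check at the call site; a failure propagates on its own.
    for (; *s != '\0'; ++s) total += parse_digit_ex(*s);
    return total;
}
```

In both versions the callee still has to validate its input. What the exception version drops is the extra branch in the caller after every call, which is the part the return-code version pays for even when nothing goes wrong.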

This strategy is the equivalent of optimistic concurrency control in databases. As you might have guessed, it comes with a cost: when the optimistic prediction is wrong, that is, when an exception is actually thrown, handling it is much more expensive.
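
If you want to get a feel for that cost on your own machine, a rough measurement can be sketched like this. The helpers (fail_by_return, fail_by_throw, the two counters, and elapsed_ns) are hypothetical names introduced for this sketch, and the numbers you get will depend heavily on compiler, flags, and platform, so treat it as a starting point rather than a benchmark.

```cpp
#include <chrono>

// Hypothetical operations that always fail, one per reporting style.
int fail_by_return() { return -1; }
void fail_by_throw() { throw 1; }

// Counts failures reported through return codes.
int count_failures_by_return(int iters) {
    int failures = 0;
    for (int i = 0; i < iters; ++i) {
        if (fail_by_return() != 0) ++failures;
    }
    return failures;
}

// Counts failures reported through exceptions; every iteration throws
// and is caught, so every iteration goes through the unwinder.
int count_failures_by_throw(int iters) {
    int failures = 0;
    for (int i = 0; i < iters; ++i) {
        try {
            fail_by_throw();
        } catch (int) {
            ++failures;
        }
    }
    return failures;
}

// Runs `body` once and returns the elapsed wall-clock time in nanoseconds.
template <typename F>
long long elapsed_ns(F&& body) {
    auto start = std::chrono::steady_clock::now();
    body();
    auto stop = std::chrono::steady_clock::now();
    return static_cast<long long>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
}
```

Wrapping each counter in elapsed_ns with the same iteration count and comparing the two results gives a crude picture of how much more a throw costs than a failed return. With optimization turned on, the compiler may fold the return-code loop away entirely, so it is worth checking the generated assembly before trusting the ratio.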