Learning C++ Memory Model from a Distributed System's Perspective

Learning C++ Memory Model from a Distributed System's Perspective

If C++ standard were reworded using distributed system terms, it might be more readable.

Your single machine is actually a distributed system in disguise

Multiple cachelines inside your multi-core machine make a distributed system. Cacheline coherence is 100% a distributed system problem. C++ tries to provide a high level abstraction for writing software that runs on a distributed system via std::memory_order. It makes a lot of sense to expose ordering as an API, because if there's a single key word in distributed system, it would be ordering.

This abstraction makes so much sense that rust just copied it word for word, even though it's pretty convoluted. Just to give you an idea about how convoluted it is, there are many types of happens-before in C++20 standard – sequenced-before, dependency-ordered-before, inter-thread happens-before, happens-before, simply happens-before, strongly happens-before, and coherence-ordered-before. Happens-before relationship defines a partial order. So given these aforementioned happens-before relationships, there are a few types of orderings – release-consume ordering, release-acquire ordering, sequentially-consistent ordering, etc..

In this post, we will try to make sense of C++'s memory model by looking at it as a distributed database.

std::memory_order

There're

  • memory_order_relaxed
  • memory_order_consume
  • memory_order_acquire
  • memory_order_release
  • memory_order_acq_rel
  • memory_order_seq_cst

Notice that these settings mostly are not describing the requirements on the atomic operation you are performing. They are mostly for non-atomic memory accesses around the atomic operation (except memory_order_seq_cst). All accesses (reads and writes) on a single std::atomic are consistent with the modification order; notice not the evaluation order.

I am going to assume memory_order_consume and consume operation do not exist in the standard (they are discouraged by the standard anyway). That would make things a little bit more manageable.

Happens-before

The happens-before relationship defined in the standard is actually quite simple (if you compile the formal definitions into common sense). It's a partial order that's consistent with

  • evaluation order (e.g. A is evaluated before, a.k.a sequenced-before, B) – a static relationship
  • dependency order (e.g. A synchronizes-with B) – a runtime relationship

synchronizes-with

memory_order_relaxed is not a synchronization operation. E.g. in the following example,

// Thread 1:
r1 = y.load(std::memory_order_relaxed); // A
x.store(r1, std::memory_order_relaxed); // B
// Thread 2:
r2 = x.load(std::memory_order_relaxed); // C 
y.store(42, std::memory_order_relaxed); // D

, it is allowed to produce r1 == r2 == 42. A is sequenced-before B; C is sequenced-before D accordingly to the evaluation order – a static relationship. During program execution, D can precede A and B can precede C in modification order – a runtime relationship. This can happen because there's no synchronization for memory_order_relaxed. To put it another way, synchronizes-with brings two partial orderings, static order (evaluation order) and runtime order, together.

A thread can write to an atomic with relaxed semantic, another thread trying to read it can see either the state before the write or after the write. This is basically eventual-consistency in an asynchronously replicated distributed system. The write from thread A needs to propagate in a sense to the other cachelines.  

All other operations establish synchronizes-with relationships. One of the most useful one is the release-acquire ordering.

If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.

Think of it as a mutex.

Now if you always write to an atomic using release operation, and read from it using acquire operation, you will always see the latest value. From a distributed k/v store's perspective, you are getting the same effect as always doing critical read (or reading from the primary) – while the actual implementation can differ.

memory_order_seq_cst

This is the strongest and also the default memory order. It orders memory the same way as release/acquire. Besides, if you apply memory_order_seq_cst to multiple std::atomics, there exists a total ordering that's consistent with the modification order of all the atomics involved. Basically, you get linearizability across all these atomics, if you stick with memory_order_seq_cst. Within this subset of atomics, you have global linearizability – think of all writes and reads are going to a single MySQL primary instance, unlike acquire/release that reads and writes just need to go to the primary DB that owns its key.