-- Scalable Read Atomic Transaction for Partitioned Datastore
A Story
You are building a messaging app. You start with a single, non-partitioned database, where you store the unseen message count and the actual messages in two different tables. It serves you well ... until more and more people use your app and the data no longer fits on a single host. So you partition the data: put the unseen message count and the messages into two different databases. Works great, until you get bug reports from upset users complaining that they see a big red badge when there are no unseen messages.
The problem
Once you partition the data, you have to deal with Multi-Partition Transactions for certain use cases. The badge count here is just one example. (For the sake of this post, let's assume the badge count is pulled by the phone instead of pushed like iOS's app badge count. You can think of it as a badge inside your app.)
There are other use cases that share the same underlying challenge, e.g. a disaggregated secondary index, or a cross-shard transaction on a sharded database. To abstract the problem, say you have two variables x=0 and y=0, and you update them to x=1 and y=1. If a read is performed while the update is in progress, you don't want to get x==1 and y==0, or x==0 and y==1. You want either x==0 and y==0, or x==1 and y==1.
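To make the fractured read concrete, here is a minimal sketch with plain Python dicts standing in for the two partitions (everything here is illustrative, not a real datastore):

```python
# Two "partitions", each owning one of the variables.
partition_x = {"x": 0}
partition_y = {"y": 0}

def write_transaction():
    # The two updates land on different partitions, so without extra
    # machinery they cannot be applied as one atomic step.
    partition_x["x"] = 1
    # A read arriving here observes x == 1 and y == 0: a fractured read.
    partition_y["y"] = 1

def read_transaction():
    return partition_x["x"], partition_y["y"]
```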
Multi-Partition Transaction
The most important guarantees we care about here are Write Atomicity (all or none of the updates are performed) and Read Atomicity (all or none of a transaction's updates are visible).
Atomic Writes can be achieved with Two Phase Commit. But a two-phase-locking based Atomic Read would be too expensive for a read-heavy workload, as it requires two RTTs and blocks concurrent transactions on overlapping keys. So for a read-heavy workload on a multi-partitioned datastore, how can we perform Atomic Reads at scale?
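To make that cost concrete, here is a rough sketch of a locking atomic read; `lock`, `read`, and `unlock` are hypothetical per-partition calls, shown only to highlight the two round trips and the blocking:

```python
def locking_atomic_read(keys, partitions):
    # Round trip 1: acquire a lock on every participating partition.
    for key in sorted(keys):              # fixed order to avoid deadlock
        partitions[key].lock(key)
    try:
        # Round trip 2: read while holding all the locks; any concurrent
        # transaction touching these keys has to wait.
        return {key: partitions[key].read(key) for key in keys}
    finally:
        for key in keys:
            partitions[key].unlock(key)
```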
RAMP Transaction
Peter Bailis introduced the RAMP transaction to solve exactly this problem. I am going to explain the gist of the idea using the example above. The main idea: with enough metadata about the transaction stored at write time, the client can detect a fractured read and fix it up. And what's really cool is that it requires no coordination (unlike two phase commit) and it allows concurrent read/write transactions.
Write
[step-1] client picks a monotonically increasing and unique id tst. You can imagine it being a tuple of client hostname and timestamp, where two ts values are compared based on their timestamps.
[step-2] set x=1, ts=tst. It would look something like the following. The write is not yet visible to clients by default.
on host0
Key | Value | TS | other_participants |
---|---|---|---|
x | 1 | tst | y |
[step-3] set y=1, ts=tst.
on host1
Key | Value | TS | other_participants |
---|---|---|---|
y | 1 | tst | x |
Assume it used to be
Key | Value | TS | other_participants |
---|---|---|---|
y | 0 | tsp | z |
[step-4] set x=1 to be visible on host0 by setting last_committed_ts to tst.
Key | last_committed_ts |
---|---|
x | tst |
[step-5] set y=1 to be visible on host1. Notice that this assumes tst > tsp. If it were the other way around, last_committed_ts would remain tsp. (A code sketch of the whole write path follows the table below.)
Key | last_committed_ts |
---|---|
y | tst |
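Putting the five steps together, here is a sketch of the write path; `Partition`, `Version`, the routing dict, and the (timestamp, hostname) encoding of tst are all assumptions made for illustration:

```python
import time
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Version:
    value: object
    ts: tuple                 # (timestamp, client_hostname), compared by timestamp first
    participants: frozenset   # the *other* keys written in the same transaction

@dataclass
class Partition:
    versions: dict = field(default_factory=dict)            # key -> {ts: Version}
    last_committed_ts: dict = field(default_factory=dict)   # key -> ts visible by default

    def prepare(self, key, version):
        # steps 2-3: stage the write; not visible to readers yet
        self.versions.setdefault(key, {})[version.ts] = version

    def commit(self, key, ts):
        # steps 4-5: flip visibility; highest timestamp wins, so a smaller
        # ts never overwrites a newer last_committed_ts
        current = self.last_committed_ts.get(key)
        if current is None or ts > current:
            self.last_committed_ts[key] = ts

def ramp_write(writes, partitions, hostname="client-0"):
    """writes: {key: value}; partitions: {key: Partition} routing table."""
    # step 1: a unique, comparable id
    tst = (time.time(), hostname)
    # steps 2-3: prepare every key, recording the other participants
    for key, value in writes.items():
        partitions[key].prepare(key, Version(value, tst, frozenset(writes) - {key}))
    # steps 4-5: make the writes visible, one partition at a time;
    # no rollback path is needed even if this loop dies halfway
    for key in writes:
        partitions[key].commit(key, tst)
    return tst
```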
Read
Let's say a read is performed while the write transaction is in progress, between step-4 and step-5, and the client got something like this: (key: x, value: 1, ts: tst, participants: y) and (key: y, value: 0, ts: tsp, participants: z).
Step-5 above assumes tsp < tst. In this case, the client notices that x's metadata claims y participated in the same transaction at tst. Since tsp < tst, the client knows it is missing (key: y, ts: tst), and can fetch that specific version from host1 before it is fully committed. This means the write can literally "fail" at step-5 and it will be fine. Also notice that if the write were aborted before step-4, no one would ever read any data at tst, so it stays invisible to clients. See? No coordination needed! No rollback. This is effectively the main idea behind the RAMP transaction.
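Here is a sketch of that read-repair logic, reusing the hypothetical Partition and Version shapes from the write sketch above:

```python
def ramp_read(keys, partitions):
    # Round 1: read the latest committed version of each key.
    result = {}
    for key in keys:
        part = partitions[key]
        ts = part.last_committed_ts.get(key)
        result[key] = part.versions[key][ts] if ts is not None else None

    # From the metadata, work out the newest transaction each key must
    # reflect: if x was written at tst together with y, then y has to be
    # read at a version whose ts is at least tst.
    required = {}
    for key, version in result.items():
        if version is None:
            continue
        for sibling in version.participants:
            if sibling in result:
                prev = required.get(sibling)
                if prev is None or version.ts > prev:
                    required[sibling] = version.ts

    # Round 2: fetch the exact missing versions, even if not yet committed.
    for key, ts in required.items():
        if result[key] is None or result[key].ts < ts:
            result[key] = partitions[key].versions[key][ts]

    return {key: (v.value if v is not None else None) for key, v in result.items()}
```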
Wrinkle in the original RAMP
Step-5 assumed tsp < tst. Does that always hold, even if the write at tst (set y=1) happened after the write at tsp (set y=0)?
The answer is NO. The timestamp in RAMP is a client timestamp, which is subject to all kinds of clock skew, or even a bogus local clock. The original RAMP paper's policy is that the highest-timestamped write wins, instead of the latest write. So even if the second write happens later, as long as its timestamp is smaller, it effectively loses to the first write. The data is, in some sense, "lost". This is not an inherent limitation of RAMP transactions though; you can iron out the wrinkle by performing a read-modify-write on each key to ensure ts is monotonically increasing. A naive HLC would work pretty well here. But it's not free, as it will impact both throughput and latency.
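As a sketch of what a "naive HLC" could look like here (an assumption about the approach, not the paper's algorithm), the clock below never goes backwards and can fold in a timestamp observed during the read-modify-write on a key:

```python
import time

class NaiveHLC:
    """Naive hybrid logical clock: (physical_ms, logical) pairs that stay
    monotonic even if the wall clock stalls or jumps backwards."""

    def __init__(self):
        self.physical = 0
        self.logical = 0

    def now(self):
        wall = int(time.time() * 1000)
        if wall > self.physical:
            self.physical, self.logical = wall, 0   # wall clock moved forward
        else:
            self.logical += 1                       # stalled or moved backwards
        return (self.physical, self.logical)

    def observe(self, ts):
        # Fold in the ts read back from the key (the read-modify-write step)
        # so that the next now() is strictly greater than it.
        physical, logical = ts
        if (physical, logical) >= (self.physical, self.logical):
            self.physical, self.logical = physical, logical + 1
```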
Reduce Write Amplification
The Write Amplification of RAMP is not negligible, as it needs to store transaction participants as well as multiple versions of each key.
Can we cache recent writes for each key instead?
I think so. Usually, besides RAMP transactions, you would also need traditional two phase commit to support read-modify-write transactions. If you already have the ability to lock a key, and to do an Atomic Read by locking all the keys, you can keep the multiple versions of recent writes only in a cache, and fall back to two-phase-locking reads if the write at a specific ts is not in the cache or not yet committed. You reduce write amplification at the cost of potentially more contention. Depending on the workload, this may or may not be a good trade-off.
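Here is a sketch of that trade-off; the cache, the `required_ts` map produced by RAMP read repair, and `locking_read` (whatever 2PL read path you already have) are all assumptions for illustration:

```python
from collections import OrderedDict

class RecentVersionCache:
    """Bounded cache of recent writes: (key, ts) -> value."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = OrderedDict()

    def put(self, key, ts, value):
        self.entries[(key, ts)] = value
        self.entries.move_to_end((key, ts))
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry

    def get(self, key, ts):
        return self.entries.get((key, ts))

def read_with_fallback(keys, required_ts, cache, locking_read):
    """required_ts: {key: ts} that RAMP metadata says each key must reflect;
    locking_read: the pre-existing two-phase-locking read path."""
    result = {}
    for key in keys:
        value = cache.get(key, required_ts[key])
        if value is None:
            # The exact version was evicted or never committed: pay the
            # coordination cost instead of storing every version durably.
            return locking_read(keys)
        result[key] = value
    return result
```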