
『DDIA』CH09 - Consistency and Consensus


Consistency Guarantees

Most replicated databases provide at least eventual consistency - if you stop writing to the database and wait for some unspecified length of time, then eventually all read requests will return the same value. This is also called convergence, because we expect all replicas to eventually converge to the same value. It is a weak guarantee, because it says nothing about when the replicas will converge.

Strong consistency doesn’t come for free: systems with stronger guarantees may have worse performance or be less fault-tolerant than systems with weaker guarantees.

Transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.

This chapter starts with one of the strongest consistency models in common use, linearizability, then examines the issue of ordering events in a distributed system, particularly around causality and total ordering. Finally, we’ll explore how to atomically commit a distributed transaction.

Linearizability

Also known as atomic consistency, strong consistency, immediate consistency, or external consistency.

The system appears as if it has only one replica (guaranteeing that the value read is the most recent, up-to-date value, and doesn’t come from a stale cache or replica) and all operations are atomic.

In a linearizable system we imagine that there must be some point in time (between the start and end of the write operation) at which the value of x atomically flips from 0 to 1. Thus, if one client’s read returns the new value 1, all subsequent reads must also return the new value, even if the write operation has not yet completed.
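To make that recency rule concrete, here is a minimal sketch (mine, not from the book) that checks a real-time-ordered trace of read results for a register written once, from an old value to a new one: once any read has returned the new value, no later read may return the old one.

```python
# Minimal sketch of the recency rule described above; names are illustrative.

def satisfies_recency(reads, old_value, new_value):
    """reads: values returned by reads of a single register, in real-time order.

    Once any read has returned new_value, every later read must also return
    new_value; returning old_value again would violate linearizability.
    """
    seen_new = False
    for value in reads:
        if value == new_value:
            seen_new = True
        elif seen_new and value == old_value:
            return False  # a stale read after a fresh one: not linearizable
    return True

assert satisfies_recency([0, 0, 1, 1, 1], old_value=0, new_value=1)
assert not satisfies_recency([0, 1, 0, 1], old_value=0, new_value=1)
```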


  • Serializability: an isolation property of transactions, where every transaction may read and write multiple objects. It guarantees that transactions behave the same as if they had executed in some serial order (each transaction running to completion before the next transaction starts). It is okay for that serial order to be different from the order in which transactions were actually run.

  • Linearizability: a recency guarantee on reads and writes of a register. It doesn’t group operations together into transactions, so it does not prevent problems such as write skew, unless you take additional measures such as materializing conflicts.

Serializable snapshot isolation is not linearizable: by design, it makes reads from a consistent snapshot, to avoid lock contention between readers and writers. The whole point of a consistent snapshot is that it does not include writes that are more recent than the snapshot, and thus reads from the snapshot are not linearizable.

Locking and leader election

A system that uses single-leader replication needs to ensure that there is indeed only one leader, not several (split brain). One way of electing a leader is to use a lock: every node that starts up tries to acquire the lock, and the one that succeeds becomes the leader. No matter how this lock is implemented, it must be linearizable: all nodes must agree which node owns the lock.
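As a toy illustration of that idea (the lock service below is an in-process stand-in, so every name here is hypothetical), leader election through a lock boils down to a single compare-and-set on “who owns the lock”, and that compare-and-set must behave as if there were only one copy of the lock:

```python
# Illustrative sketch only: leader election via a linearizable
# compare-and-set register acting as a lock. In a real system the register
# would be provided by a coordination service; here it is a local stand-in.

class LockService:
    def __init__(self):
        self._owner = None  # stands in for a replicated, linearizable register

    def cas(self, expected, new):
        """Atomically set the owner to `new` iff it currently equals `expected`."""
        if self._owner == expected:
            self._owner = new
            return True
        return False

def try_become_leader(lock_service, node_id):
    # Only the node whose CAS from "no owner" succeeds becomes the leader.
    return lock_service.cas(expected=None, new=node_id)

lock = LockService()
print(try_become_leader(lock, "node-A"))  # True  -> node-A is the leader
print(try_become_leader(lock, "node-B"))  # False -> the lock is already held
```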

Constraints and uniqueness guarantees

Cross-channel timing dependencies

Multiple communication channels, e.g. the image upload and resizing use case. One way to avoid the race is to use read-after-write consistency: a guarantee that if the user reloads the page, they will always see any updates they submitted themselves.

It is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability.

  • If your application requires linearizability, and some replicas are disconnected from the other replicas due to a network problem, then some replicas cannot process requests while they are disconnected: they must either wait until the network problem is fixed, or return an error (either way, they become unavailable).
  • If your application does not require linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas (e.g., multi-leader). In this case, the application can remain available in the face of a network problem, but its behavior is not linearizable. => applications that don’t require linearizability can be more tolerant of network problems.

Example: In a multi-core CPU, there are now several copies of the data (one in main memory, and perhaps several more in various caches), and these copies are asynchronously updated, so linearizability is lost. It makes no sense to use the CAP theorem to justify the multi-core memory consistency model: within one computer we usually assume reliable communication, and we don’t expect one CPU core to be able to continue operating normally if it is disconnected from the rest of the computer. The reason for dropping linearizability is performance, not fault tolerance. The same is true for many distributed databases.

Ordering Guarantees

Causality imposes an ordering on events. Causal order is not a total order. In a linearizable system, we have a total order of operations: for any two operations we can always say which one happened first. We said that two operations are concurrent if neither happened before the other. Put another way, two events are ordered if they are causally related, but they are incomparable if they are concurrent. This means that causality defines a partial order, not a total order: some operations are ordered with respect to each other, but some are incomparable.

We can use sequence numbers or timestamps to order events, e.g. a logical clock, which provides a total order. If there is not a single leader, sequence numbers can be generated in various ways:

  • Each node can generate its own independent set of sequence numbers.
  • Attach a timestamp from a time-of-day clock to each operation. Such timestamps are not sequential, but if they have sufficiently high resolution, they might be sufficient to totally order operations. This fact is used in the last-write-wins conflict resolution method (see the sketch after this list).
  • Preallocate blocks of sequence numbers, e.g. A from 1 to 1000, B from 1001 to 2000…
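To make the last-write-wins point above concrete, here is a minimal sketch, assuming each write carries a (timestamp, node ID) pair taken from a time-of-day clock; the data shapes are illustrative:

```python
# Last-write-wins merge sketch: keep the write with the larger
# (timestamp, node_id) pair; the node ID only breaks ties deterministically.

def lww_merge(a, b):
    """a, b: (value, timestamp, node_id) tuples describing conflicting writes.
    The "losing" write is silently discarded, which is exactly why LWW can
    lose data when clocks are skewed or writes are concurrent."""
    return a if (a[1], a[2]) >= (b[1], b[2]) else b

print(lww_merge(("x=1", 1700000000.120, "A"), ("x=2", 1700000000.119, "B")))
```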

Problem: the sequence numbers they generate are not consistent with causality. The causality problems occur because these sequence number generators do not correctly capture the ordering of operations across different nodes.

=> Lamport timestamps: each node has a unique identifier, and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is simply a pair of (counter, node ID). The key idea about Lamport timestamps, which makes them consistent with causality, is the following: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. As long as the maximum counter value is carried along with every operation, this scheme ensures that the ordering from the Lamport timestamps is consistent with causality, because every causal dependency results in an increased timestamp.
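A minimal sketch of that scheme (class and method names are mine):

```python
# Lamport clock sketch: each node keeps (counter, node_id) and takes the max
# of its own counter and any counter it sees on incoming requests.

class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        """Called for every operation the node processes; returns its timestamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, remote_counter):
        """Called when a request arrives carrying another node's max counter.
        Taking the max is what keeps the ordering consistent with causality."""
        self.counter = max(self.counter, remote_counter)

a, b = LamportClock("A"), LamportClock("B")
t1 = a.local_event()   # (1, "A")
b.observe(t1[0])       # B learns A's counter from the request it received
t2 = b.local_event()   # (2, "B"): causally after t1
print(t1 < t2)         # True: timestamps compare as (counter, node_id) pairs
```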

Version vectors can distinguish whether two operations are concurrent or whether one is causally dependent on the other, whereas Lamport timestamps always enforce a total ordering. From the total ordering, you cannot tell whether two events are concurrent or whether they are causally dependent. The advantage of Lamport timestamps over version vectors is that they are more compact.
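A sketch of the comparison that version vectors make possible (the dict-based representation is only illustrative):

```python
# Version vector comparison: a vector maps node_id -> counter. One vector
# causally precedes another only if it is <= in every component; if neither
# dominates, the two operations are concurrent.

def compare(v1, v2):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(v1) | set(v2)
    le = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
    ge = all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)
    if le and ge:
        return "equal"
    if le:
        return "before"      # v1 causally precedes v2
    if ge:
        return "after"
    return "concurrent"      # neither dominates: the writes conflict

print(compare({"A": 2, "B": 0}, {"A": 2, "B": 1}))  # before
print(compare({"A": 3, "B": 0}, {"A": 2, "B": 1}))  # concurrent
```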

However, total ordering is not sufficient. The problem here is that the total order of operations only emerges after you have collected all of the operations. This idea of knowing when your total order is finalized is captured in the topic of total order broadcast or atomic broadcast.

Total order broadcast is usually described as a protocol for exchanging messages between nodes. Its correctness depends on two properties: reliable delivery (no messages are lost: if a message is delivered to one node, it is delivered to all nodes) and totally ordered delivery (messages are delivered to every node in the same order).

Consensus services such as ZooKeeper implement a total order broadcast protocol. This fact is a hint that there is a strong connection between total order broadcast and consensus.

A linearizable compare-and-set register and total order broadcast are both equivalent to consensus.
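One direction of that equivalence can be sketched: given total order broadcast, a linearizable compare-and-set register follows, because every node applies the same CAS attempts in the same order and therefore resolves each attempt identically. The in-process log below stands in for a real broadcast protocol, so this is a sketch of the idea rather than an implementation:

```python
# Compare-and-set on top of a total order broadcast log (single-process sketch).

class TotalOrderLog:
    """Stand-in for total order broadcast: one shared append-only log that
    every node reads in the same order."""
    def __init__(self):
        self.entries = []

    def broadcast(self, msg):
        self.entries.append(msg)
        return len(self.entries) - 1   # position of msg in the total order

class CasRegister:
    def __init__(self, log):
        self.log = log
        self.applied = 0   # how far this node has applied the log
        self.value = None

    def compare_and_set(self, expected, new):
        pos = self.log.broadcast(("cas", expected, new))
        # Apply the log in order, up to and including our own message.
        while self.applied <= pos:
            _, exp, val = self.log.entries[self.applied]
            if self.value == exp:
                self.value = val
            self.applied += 1
        return self.value == new   # did our CAS win?

log = TotalOrderLog()
register = CasRegister(log)
print(register.compare_and_set(None, "leader-A"))  # True
print(register.compare_and_set(None, "leader-B"))  # False: A already won
```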

Distributed Transactions and Consensus

The goal is simply to get several server nodes to agree on something.

Atomic commit and two-phase commit

A transaction commit must be irrevocable - you are not allowed to change your mind and retroactively abort a transaction after it has been committed. This principle forms the basis of read committed isolation.

Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes, which is a classic algorithm in distributed databases. 2PC uses a new component that does not normally appear in single-node transactions: a coordinator (also known as transaction manager). A 2PC transaction begins with the application reading and writing data on multiple database nodes, as normal. We call these database nodes participants in the transaction. When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the nodes, asking them whether they are able to commit. The coordinator then tracks the responses from the participants:

  • If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends out a commit request in phase 2, and the commit actually takes place.
  • If any of the participants replies “no,” the coordinator sends an abort request to all nodes in phase 2.

But WHY does it work? Let’s break it down in detail:

  • When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator. This transaction ID is globally unique.
  • The application begins a single-node transaction on each of the participants, and attaches the globally unique transaction ID to each single-node transaction. All reads and writes are done in one of these single-node transactions. If anything goes wrong at this stage, the coordinator or any of the participants can abort.
  • When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID. If any of these requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants.
  • When a participant receives the prepare request, it makes sure that it can definitely commit the transaction under all circumstances. In other words, the participant surrenders the right to abort the transaction, but without actually committing it.
  • When the coordinator has received responses to all prepare requests, it makes a definitive decision on whether to commit or abort the transaction. The coordinator must write that decision to its transaction log on disk so that it knows which way it decided in case it subsequently crashes. This is called the commit point.
  • Once the coordinator’s decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. There is no more going back: if the decision was to commit, that decision must be enforced, no matter how many retries it takes. The protocol thus contains two crucial “points of no return”: when a participant votes “yes,” it promises that it will definitely be able to commit later (although the coordinator may still choose to abort); and once the coordinator decides, that decision is irrevocable. Those promises ensure the atomicity of 2PC.
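Putting the steps above together, the coordinator side might look like the following sketch (participants, the network, and the durable log are reduced to illustrative stand-ins):

```python
# Two-phase commit, coordinator side (sketch only).

def two_phase_commit(coordinator_log, participants, txid):
    # Phase 1: ask every participant to prepare, i.e. to promise that it can
    # commit txid under all circumstances.
    try:
        votes = [p.prepare(txid) for p in participants]
    except Exception:
        votes = [False]   # a failed or timed-out prepare counts as a "no" vote

    decision = "commit" if all(votes) else "abort"

    # Commit point: the decision must be durably logged *before* any
    # participant is told, so a recovering coordinator can re-read it and
    # finish the protocol.
    coordinator_log.append((txid, decision))

    # Phase 2: push the decision out. A real coordinator retries forever,
    # because the decision is now irrevocable.
    for p in participants:
        if decision == "commit":
            p.commit(txid)
        else:
            p.abort(txid)
    return decision

class StubParticipant:
    def prepare(self, txid): return True
    def commit(self, txid): print(f"participant commits {txid}")
    def abort(self, txid): print(f"participant aborts {txid}")

print(two_phase_commit([], [StubParticipant(), StubParticipant()], txid=42))
```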

What happens if the coordinator fails?

If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. But once the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally; it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant’s transaction in this state is called in doubt or uncertain. The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants: when the coordinator recovers, it determines the status of all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit record in the coordinator’s log are aborted.

=> 2PC is called a blocking atomic commit protocol.

Database-internal distributed transactions / Heterogeneous distributed transactions

Fault-Tolerant Consensus

One or more nodes may propose values, and the consensus algorithm decides on one of those values. Uniform agreement (no two nodes decide differently) + Integrity (no node decides twice) + Validity (if a node decides value v, then v must have been proposed by some node) + Termination (every node that does not crash eventually decides some value).
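To keep the four properties straight, here is a small check over the outcome of one round (the data shapes are mine, not any algorithm’s API):

```python
# Check a consensus outcome: proposed and decided map node_id -> value.

def check_consensus(proposed, decided, crashed=()):
    decided_values = set(decided.values())
    # Uniform agreement: no two nodes decide differently.
    assert len(decided_values) <= 1
    # Validity: any decided value must have been proposed by some node.
    assert decided_values <= set(proposed.values())
    # Termination: every node that did not crash eventually decides.
    assert all(n in decided for n in proposed if n not in crashed)
    # (Integrity, i.e. no node decides twice, is a property of the run itself
    # and cannot be seen in this final snapshot.)
    return "outcome satisfies agreement, validity, termination"

print(check_consensus(proposed={"A": "x", "B": "y"},
                      decided={"A": "y", "B": "y"}))
```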

Viewstamped Replication, Paxos, Raft, Zab.

Membership and coordination services

Understand how ZooKeeper works. ZooKeeper is modeled after Google’s Chubby lock service, implementing not only total order broadcast, but also an interesting set of other features:

  • Linearizable atomic operations: an atomic compare-and-set operation can be used to implement a lock: if several nodes concurrently try to perform the same operation, only one of them will succeed.
  • Total ordering of operations: when some resource is protected by a lock or lease, you need a fencing token to prevent clients from conflicting with each other in the case of a process pause (see the sketch after this list).
  • Failure detection: heartbeats
  • Change notifications: subscribing to notifications
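To illustrate the fencing-token point from the list above, here is a minimal sketch of a storage service that rejects requests carrying stale tokens (all names are illustrative; in ZooKeeper the zxid or a znode’s version number can serve as the monotonically increasing token):

```python
# Fencing tokens: the lock service hands out monotonically increasing tokens,
# and the storage service refuses writes whose token is older than one it has
# already seen, which protects against paused or expired lock holders.

class FencedStorage:
    def __init__(self):
        self.max_token_seen = -1
        self.data = {}

    def write(self, key, value, token):
        if token < self.max_token_seen:
            raise RuntimeError("stale fencing token: lock holder was paused or expired")
        self.max_token_seen = token
        self.data[key] = value

storage = FencedStorage()
storage.write("file", "v1", token=33)   # old lock holder writes, then pauses
storage.write("file", "v2", token=34)   # new lock holder takes over
try:
    storage.write("file", "v1-late", token=33)   # paused client wakes up
except RuntimeError as e:
    print(e)   # the stale write is rejected
```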