Thesis #117 · gabrielgiussi/cappio


Open
1 of 38 tasks
gabrielgiussi opened this issue Dec 23, 2019 · 24 comments

@gabrielgiussi
gabrielgiussi commented Dec 23, 2019
  • Write about the usage of each abstraction in real products, e.g. CRDTs in Redis and Basho's Riak.
  • Explain why we choose to distribute an algorithm instead of using a single machine: what are the benefits?

State of the art (kinda)

Mesh implements a gossip protocol that provides membership, unicast, and broadcast functionality with eventually-consistent semantics. In CAP terms, it is AP: highly-available and partition-tolerant.

Topics

  • DS intro
  • Book
    • Abstractions
    • System models
    • Model used by cappIO
    • Safety/Liveness
  • Consistency
    • Linearizability
    • Eventual
    • Causal
    • Differences with serializability
  • CAP
  • FLP impossibility result
  • Exactly-once semantics
  • Idempotence
  • Consensus
    • Paxos
    • Raft (used by etcd)
    • 2PC
    • Zab (used by zookeeper)
    • Total order broadcast (related to order)
    • Atomic commit
  • Quorums
  • Clocks & Ordering
    • NTP & Cristian's algorithm
    • Spanner (simply a mention, it is hard to explain)
    • Lamport
    • Vector clocks
    • Version vectors
  • CRDTs
    • pure op based

Optional

  • Byzantine
  • Partitioning (it would end up too disconnected from the rest of the work)
  • Replication (how can I add this? because it's not strictly related either)
    • Single-master
    • Multi-master
    • Leaderless
    • Sync/Async
@gabrielgiussi

Automata

I can explain that the idea behind the book is based on I/O Automata, from Mark Tuttle.

It is important to mention what the goal of this model is: it is meant for proving that an algorithm is correct, which is not our case.

It would be nice to explain briefly how this model helps to prove correctness, but I'm not sure if Tuttle's thesis explains this.

@gabrielgiussi

Eventuate

It is nice to explain that Eventuate uses some of these abstractions, because it is a real-world tool.

@gabrielgiussi

Lack of vocabulary (cappIO motivation)

This is the main reason behind cappIO, the idea is to have a common vocabulary with a well-defined meaning (*).
An example is reliability: if I read about a tool that guarantees reliability when it sends messages, I'm not sure whether it means uniform reliability or not, because we don't share this vocabulary.

It would be really nice to reference some issue discovered by Jepsen. It mainly covers consistency issues, but it is good for explaining the long-term goal. There is one that talks about causal consistency.

(*) Another example is consistency, which is an overused term in different fields (reference Kleppmann page 224). Maybe I should talk about this in the thesis and not in the tool, because I don't use it.

Reference https://medium.com/databasss/on-ways-to-agree-part-1-links-and-flp-impossibility-f6bd8a6a0980

@gabrielgiussi
gabrielgiussi commented Mar 2, 2020

Real use cases for Broadcast

Beyond the examples from the book, it would be nice to find some real program (db, queueing system) that uses broadcast.
I could always use Eventuate (the good thing is that I already know it); maybe Riak?

Check https://github.com/basho/riak_core/wiki/Riak-Core-Broadcast

What is the relation between broadcast and consistency?

Maybe this relation lies in total order broadcast, which combines consensus algorithms with broadcast. (I need to read the corresponding chapter of the book more in depth.)
Beyond that, what is the relation? Can I talk about causal consistency if I only use broadcast? Or should I only talk about causal order? And do I then need causal order to achieve causal consistency? For that I would need to implement a shared-memory abstraction that accepts reads and writes.

Read Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS and Kleppmann's book again.

@gabrielgiussi

Introduction to reliable ... 1.6 Chapter notes

The idea of using multiple, replicated processes for tolerating faults of individual processes links together most algorithms presented in this book. This paradigm can be traced back to the work on the Software-Implemented Fault Tolerance (SIFT) project in 1978, which addressed the challenging problem of building a fault-tolerant computer for aircraft control (Wensley et al. 1978).

@gabrielgiussi

DS mantra

Build reliable systems from unreliable components

@gabrielgiussi
gabrielgiussi commented Mar 5, 2020

Particularity of DS = partial failures!

One process/link fails but the others continue working.

From DDIA page 323

Eventual consistency is hard for application developers because it is so different from the behavior of variables in a normal single-threaded program. If you assign a value to a variable and then read it shortly afterward, you don’t expect to read back the old value, or for the read to fail.

I could use this, but it is not as simple as described here, because in non-distributed systems you can also have multithreaded operations, and if you use a database it is accessed by multiple users, so you still have to deal with consistency semantics (but from the ACID side).

@gabrielgiussi
gabrielgiussi commented Mar 16, 2020

Broadcast

Distributed Systems Concepts and Design

JGroups is a toolkit for reliable group communication written in Java. The toolkit is a part of the lineage of group communication tools that have emerged from Cornell University, building on the fundamental concepts developed in ISIS [Birman 1993], Horus [van Renesse et al. 1996] and Ensemble [van Renesse et al. 1998]. The toolkit is now maintained and developed by the JGroups open source community [www.jgroups.org], which is part of the JBoss middleware community, as discussed in Chapter 8 [www.jboss.org].

jgroups

Atomic Broadcast

@gabrielgiussi
gabrielgiussi commented Apr 3, 2020

FLP impossibility result

I can use the FLP result to describe the importance of stating the system model in which we are going to design the algorithm: FLP says that consensus is impossible to achieve if even one node may fail, but only in the asynchronous model.

It would be nice to show this scenario using cappio (only a screenshot), i.e. when a single node fails and consensus is not achieved.

DDIA page 353

proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash. In a distributed system, we must assume that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms for achieving consensus. What is going on here?
The answer is that the FLP result is proved in the asynchronous system model (see “System Model and Reality” on page 306), a very restrictive model that assumes a deterministic algorithm that cannot use any clocks or timeouts. If the algorithm is allowed to use timeouts, or some other way of identifying suspected crashed nodes (even if the suspicion is sometimes wrong), then consensus becomes solvable [67]. Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [69].
Thus, although the FLP result about the impossibility of consensus is of great theoretical importance, distributed systems can usually achieve consensus in practice.

We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures.

Fun and Profit

This result means that there is no way to solve the consensus problem under a very minimal system model in a way that cannot be delayed forever. The argument is that if such an algorithm existed, then one could devise an execution of that algorithm in which it would remain undecided ("bivalent") for an arbitrary amount of time by delaying message delivery - which is allowed in the asynchronous system model. Thus, such an algorithm cannot exist.
This impossibility result is important because it highlights that assuming the asynchronous system model leads to a tradeoff: algorithms that solve the consensus problem must either give up safety or liveness (1) when the guarantees regarding bounds on message delivery do not hold.

Consensus

DDIA pages 321, 364-375, 554

DDIA page 322

Once you have an implementation of consensus, applications can use it for various purposes. For example, say you have a database with single-leader replication. If the leader dies and you need to fail over to another node, the remaining database nodes can use consensus to elect a new leader. As discussed in “Handling Node Outages” on page 156, it’s important that there is only one leader, and that all nodes agree who the leader is. If two nodes both believe that they are the leader, that situation is called split brain, and it often leads to data loss.

Distributed transactions and consensus (DDIA page 352)

There are a number of situations in which it is important for nodes to agree. For example:

  • Leader election (this is also discussed in "In what circumstances is linearizability useful")
  • Atomic commit: In a database that supports transactions spanning several nodes or partitions, we have the problem that a transaction may fail on some nodes but succeed on others. If we want to maintain transaction atomicity (in the sense of ACID; see “Atomicity” on page 223), we have to get all nodes to agree on the outcome of the transaction: either they all abort/roll back (if anything goes wrong) or they all commit (if nothing goes wrong). This instance of consensus is known as the atomic commit problem

2PC

DDIA page 355

Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes—i.e., to ensure that either all nodes commit or all nodes abort. It is a classic algorithm in distributed databases [13, 35, 75]. 2PC is used internally in some databases and also made available to applications in the form of XA transactions [76, 77] (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP web services [78, 79].
The basic flow of 2PC is illustrated in Figure 9-9. Instead of a single commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two phases (hence the name).
2PC uses a new component that does not normally appear in single-node transactions: a coordinator (also known as transaction manager).
A 2PC transaction begins with the application reading and writing data on multiple database nodes, as normal. We call these database nodes participants in the transaction. When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the nodes, asking them whether they are able to commit. The coordinator then tracks the responses from the participants: If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends out a commit request in phase 2, and the commit actually takes place.
But once the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant’s transaction in this state is called in doubt or uncertain.
Two-phase commit is called a blocking atomic commit protocol due to the fact that 2PC can become stuck waiting for the coordinator to recover. In theory, it is possible to make an atomic commit protocol nonblocking, so that it does not get stuck if a node fails. However, making this work in practice is not so straightforward.
As an alternative to 2PC, an algorithm called three-phase commit (3PC) has been proposed [13, 80].
However, 3PC assumes a network with bounded delay and nodes with bounded response times. In general, nonblocking atomic commit requires a perfect failure detector; for this reason, 2PC continues to be used, despite the known problem with coordinator failure.

For database-internal distributed transactions (not XA), the limitations are not so great—for example, a distributed version of SSI is possible. However, there remains the problem that for 2PC to successfully commit a transaction, all participants must respond. Consequently, if any part of the system is broken, the transaction also fails. Distributed transactions thus have a tendency of amplifying failures, which runs counter to our goal of building fault-tolerant systems.
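
To make the two-phase flow above concrete, here is a minimal sketch of a 2PC coordinator in Python. It is not DDIA's code: the Coordinator/Participant names are mine, everything runs in one process, and the hard parts (durable logging, coordinator crashes, in-doubt participants waiting for a decision) are deliberately left out.

```python
from enum import Enum

class Vote(Enum):
    YES = "yes"
    NO = "no"

class Participant:
    """Illustrative participant: holds tentative writes until told to commit or abort."""
    def __init__(self, name):
        self.name = name
        self.prepared = False

    def prepare(self, txid) -> Vote:
        # Record that we are ready to commit (a real participant would write this
        # to disk). After voting YES it may no longer abort unilaterally: it is
        # "in doubt" until the coordinator's decision arrives.
        self.prepared = True
        return Vote.YES

    def commit(self, txid):
        print(f"{self.name}: commit {txid}")

    def abort(self, txid):
        print(f"{self.name}: abort {txid}")

class Coordinator:
    def run(self, txid, participants):
        # Phase 1: ask every participant whether it can commit.
        votes = [p.prepare(txid) for p in participants]
        # Phase 2: commit only if *all* voted yes, otherwise abort everywhere.
        if all(v is Vote.YES for v in votes):
            for p in participants:
                p.commit(txid)
            return "committed"
        for p in participants:
            p.abort(txid)
        return "aborted"

if __name__ == "__main__":
    coord = Coordinator()
    print(coord.run("tx-1", [Participant("db-a"), Participant("db-b")]))
```

In a real system, a participant that answered YES in phase 1 would have to block until it hears the coordinator's decision, which is exactly the in-doubt state described above.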

Fault-tolerant consensus

DDIA page 365

If you don’t care about fault tolerance, then satisfying the first three properties is easy: you can just hardcode one node to be the “dictator,” and let that node make all of the decisions. However, if that one node fails, then the system can no longer make any decisions. This is, in fact, what we saw in the case of two-phase commit: if the coordinator fails, in-doubt participants cannot decide whether to commit or abort.

most implementations of consensus ensure that the safety properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or there is a severe network problem [92]. Thus, a large-scale outage can stop the system from being able to process requests, but it can not corrupt the consensus system by causing it to make invalid decisions.

Consensus algorithms and total order broadcast (DDIA page 366)

Viewstamped Replication, Raft, and Zab implement total order broadcast directly, because that is more efficient than doing repeated rounds of one-value-at-a-time consensus. In the case of Paxos, this optimization is known as Multi-Paxos.

Epoch numbering and quorums

(I need to read more about this)

DDIA page 368

All of the consensus protocols discussed so far internally use a leader in some form or another, but they don’t guarantee that the leader is unique. Instead, they can make a weaker guarantee: the protocols define an epoch number (called the ballot number in Paxos, view number in Viewstamped Replication, and term number in Raft) and guarantee that within each epoch, the leader is unique.

Limitations of consensus

DDIA page 369

Consensus algorithms are a huge breakthrough for distributed systems: they bring concrete safety properties (agreement, integrity, and validity) to systems where everything else is uncertain, and they nevertheless remain fault-tolerant (able to make progress as long as a majority of nodes are working and reachable). They provide total order broadcast, and therefore they can also implement linearizable atomic operations in a fault-tolerant way.

  • The process by which nodes vote on proposals before they are decided is a kind of synchronous replication. Databases, however, are often configured to use asynchronous replication; in that configuration, some committed data can potentially be lost on failover—but many people choose to accept this risk for the sake of better performance.
  • Consensus systems always require a strict majority to operate. This means you need a minimum of three nodes in order to tolerate one failure. If a network failure cuts off some nodes from the rest, only the majority portion of the network can make progress, and the rest is blocked
  • Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster
  • Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly variable network delays, especially geographically distributed systems, it often happens that a node falsely believes the leader to have failed due to a transient network issue. Although this error does not harm the safety properties, frequent leader elections result in terrible performance because the system can end up spending more time choosing a leader than doing any useful work.

Types of consensus (from IRSDP)

Split brain

DDIA page 367

Some databases perform automatic leader election and failover, promoting a follower to be the new leader if the old leader fails (see “Handling Node Outages” on page 156). This brings us closer to fault-tolerant total order broadcast, and thus to solving consensus.
However, there is a problem. We previously discussed the problem of split brain, and said that all nodes need to agree who the leader is—otherwise two different nodes could each believe themselves to be the leader, and consequently get the database into an inconsistent state.

Fun and profit

Several computers (or nodes) achieve consensus if they all agree on some value. More formally:
Agreement: Every correct process must agree on the same value.
Integrity: Every correct process decides at most one value, and if it decides some value, then it must have been proposed by some process.
Termination: All processes eventually reach a decision.
Validity: If all correct processes propose the same value V, then all correct processes decide V.
The consensus problem is at the core of many commercial distributed systems. After all, we want the reliability and performance of a distributed system without having to deal with the consequences of distribution (e.g. disagreements / divergence between nodes), and solving the consensus problem makes it possible to solve several related, more advanced problems such as atomic broadcast and atomic commit.

The properties described above represent uniform consensus, "which is equivalent to regular consensus in asynchronous systems with unreliable failure detectors" (DDIA page 365).

Equivalent problems

a wide range of problems are actually reducible to consensus and are equivalent to each other (in the sense that if you have a solution for one of them, you can easily transform it into a solution for one of the others). Such equivalent problems include:

  • Linearizable compare-and-set registers
  • Atomic transaction commit
  • Total order broadcast
  • Locks and leases
  • Membership/coordination service
  • Uniqueness constraint

All of these are straightforward if you only have a single node, or if you are willing to assign the decision-making capability to a single node. This is what happens in a single-leader database: all the power to make decisions is vested in the leader, which is why such databases are able to provide linearizable operations

It is interesting to note here that even single-leader databases (e.g. because they use snapshot isolation) (1) or processors (which use several cache layers) don't provide linearizability, for performance reasons.

(1) Check this more carefully, because transaction isolation is not the same as consistency.

Questions

  • What is the relation between FLP and exactly-once semantics?
  • What does it mean to give up liveness (1)? Does it mean that the algorithm will halt and never reach a result?

@gabrielgiussi
gabrielgiussi commented Apr 3, 2020

Abstractions

DDIA page 321

The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.

Fun and profit

TODO 2. Up and down the level of abstraction

@gabrielgiussi
gabrielgiussi commented Apr 4, 2020

Distributed Consistency

DDIA page 323 about eventual consistency

However, this is a very weak guarantee—it doesn’t say anything about when the replicas will converge. Until the time of convergence, reads could return anything or nothing [1]. For example, if you write a value and then immediately read it again, there is no guarantee that you will see the value you just wrote, because the read may be routed to a different replica (see “Reading Your Own Writes” on page 162).

Linearizability

Defined by Herlihy and Wing [1990]

DDIA page 324

(also known as atomic consistency [7], strong consistency, immediate consistency, or external consistency [8]) the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.
In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. Maintaining the illusion of a single copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and doesn’t come from a stale cache or replica. In other words, linearizability is a recency guarantee.

Recency guarantee

TODO

Linearizability Versus Serializability

DDIA page 329

Serializability
Serializability is an isolation property of transactions, where every transaction may read and write multiple objects (rows, documents, records)—see “Single-Object and Multi-Object Operations” on page 228. It guarantees that transactions behave the same as if they had executed in some serial order (each transaction running to completion before the next transaction starts). It is okay for that serial order to be different from the order in which transactions were actually run [12].
Linearizability
Linearizability is a recency guarantee on reads and writes of a register (an individual object). It doesn’t group operations together into transactions, so it does not prevent problems such as write skew (see “Write Skew and Phantoms” on page 246), unless you take additional measures such as materializing conflicts (see “Materializing conflicts” on page 251).
A database may provide both serializability and linearizability, and this combination is known as strict serializability or strong one-copy serializability (strong-1SR)
Implementations of serializability based on two-phase locking (see “Two-Phase Locking (2PL)” on page 257) or actual serial execution (see “Actual Serial Execution” on page 252) are typically linearizable.
However, serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)” on page 261) is not linearizable: by design, it makes reads from a consistent snapshot, to avoid lock contention between readers and writers. The whole point of a consistent snapshot is that it does not include writes that are more recent than the snapshot, and thus reads from the snapshot are not linearizable.

Fun and profit ???

Bailis

Linearizability for read and write operations is synonymous with the term “atomic consistency” and is the “C,” or “consistency,” in Gilbert and Lynch’s proof of the CAP Theorem

Serializability is the traditional “I,” or isolation, in ACID. If users’ transactions each preserve application correctness (“C,” or consistency, in ACID), a serializable execution also preserves correctness

Combining serializability and linearizability yields strict serializability: transaction behavior is equivalent to some serial execution, and the serial order corresponds to real time

TODO (read Burckhardt and Viotti)

In what circumstances is linearizability useful? (DDIA page 330)

  • Locking and leader election

A system that uses single-leader replication needs to ensure that there is indeed only one leader, not several (split brain). One way of electing a leader is to use a lock: every node that starts up tries to acquire the lock, and the one that succeeds becomes the leader [14]. No matter how this lock is implemented, it must be linearizable: all nodes must agree which node owns the lock; otherwise it is useless.
Coordination services like Apache ZooKeeper [15] and etcd [16] are often used to implement distributed locks and leader election. They use consensus algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms later in this chapter, in “Fault-Tolerant Consensus” on page 364). There are still many subtle details to implementing locks and leader election correctly (see for example the fencing issue in “The leader and the lock” on page 301), and libraries like Apache Curator [17] help by providing higher-level recipes on top of ZooKeeper. However, a linearizable storage service is the basic foundation for these coordination tasks.

  • Constraints and uniqueness guarantees

If you want to enforce this constraint as the data is written (such that if two people try to concurrently create a user or a file with the same name, one of them will be returned an error), you need linearizability.

  • Cross-channel timing dependencies (PAGE 331)

If the file storage service is linearizable, then this system should work fine. If it is not linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in Figure 9-5) might be faster than the internal replication inside the storage service. In this case, when the resizer fetches the image (step 5), it might see an old version of the image, or nothing at all.
(you can use alternative approaches similar to what we discussed in “Reading Your Own Writes” on page 162, at the cost of additional complexity.)

Implementing Linearizable Systems (DDIA page 332)

  • the simplest answer would be to really only use a single copy of the data. However, that approach would not be able to tolerate faults

  • The most common approach to making a system fault-tolerant is to use replication.

    • Single-leader replication (potentially linearizable): If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable. Using the leader for reads relies on the assumption that you know for sure who the leader is. With asynchronous replication, failover may even lose committed writes which violates both durability and linearizability.
    • Consensus algorithms (linearizable): consensus protocols contain measures to prevent split brain and stale replicas
    • Multi-leader replication (not linearizable): Systems with multi-leader replication are generally not linearizable, because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes. For this reason, they can produce conflicting writes that require resolution (see “Handling Write Conflicts” on page 171). Such conflicts are an artifact of the lack of a single copy of the data.
    • Leaderless replication (probably not linearizable): READ AGAIN

Linearizability and network delays

The reason for dropping linearizability is performance, not fault tolerance.
The same is true of many distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not so much for fault tolerance [46]. Linearizability is slow—and this is true all the time, not only during a network fault.
Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is no: Attiya and Welch [47] prove that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays, like most computer networks (see “Timeouts and Unbounded Delays” on page 281), the response time of linearizable reads and writes is inevitably going to be high.

Linearizability is stronger than causal consistency

making a system linearizable can harm its performance and availability, especially if the system has significant network delays (for example, if it’s geographically distributed). For this reason, some distributed data systems have abandoned linearizability, which allows them to achieve better performance but can make them difficult to work with.
The good news is that a middle ground is possible. Linearizability is not the only way of preserving causality—there are other ways too. A system can be causally consistent without incurring the performance hit of making it linearizable (in particular, the CAP theorem does not apply). In fact, causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures [2, 42].
In many cases, systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently.

Relation between linearizability and consensus

DDIA page 350

In a formal sense, a linearizable read-write register is an “easier” problem. Total order broadcast is equivalent to consensus [67], which has no deterministic solution in the asynchronous crash-stop model [68], whereas a linearizable read-write register can be implemented in the same system model [23, 24, 25]. However, supporting atomic operations such as compare-and-set or increment-and-get in a register makes it equivalent to consensus [28]. Thus, the problems of consensus and a linearizable register are closely related.

DDIA page 352

it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus [28, 67]. That is, if you can solve one of these problems, you can transform it into a solution for the others. This is quite a profound and surprising insight!

Implementing linearizable storage using total order broadcast

You can implement such a linearizable compare-and-set operation as follows by using total order broadcast as an append-only log [62, 63]:

  1. Append a message to the log, tentatively indicating the username you want to claim.
  2. Read the log, and wait for the message you appended to be delivered back to you.
  3. Check for any messages claiming the username that you want. If the first message for your desired username is your own message, then you are successful: you can commit the username claim (perhaps by appending another message to the log) and acknowledge it to the client

While this procedure ensures linearizable writes, it doesn’t guarantee linearizable reads—if you read from a store that is asynchronously updated from the log, it may be stale. (To be precise, the procedure described here provides sequential consistency [47, 64], sometimes also known as timeline consistency [65, 66], a slightly weaker guarantee than linearizability.) To make reads linearizable, there are a few options:

  • You can sequence reads through the log by appending a message, reading the log, and performing the actual read when the message is delivered back to you. The message’s position in the log thus defines the point in time at which the read happens. (Quorum reads in etcd work somewhat like this [16].)
  • If the log allows you to fetch the position of the latest log message in a linearizable way, you can query that position, wait for all entries up to that position to be delivered to you, and then perform the read. (This is the idea behind ZooKeeper’s sync() operation [15].)
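
As a rough illustration of the append-and-check procedure above, the following Python sketch simulates total order broadcast with a single in-memory log (TotalOrderLog and claim_username are my names, not the book's). A real system would replicate the log, handle lost messages, and make the log reads themselves linearizable.

```python
import itertools

class TotalOrderLog:
    """Stand-in for total order broadcast: every append gets a global sequence
    number and all nodes read the entries in exactly that order."""
    def __init__(self):
        self._entries = []
        self._seq = itertools.count()

    def append(self, entry) -> int:
        n = next(self._seq)
        self._entries.append((n, entry))
        return n

    def read(self):
        return list(self._entries)

def claim_username(log, node, username) -> bool:
    # 1. Append a tentative claim to the log.
    my_seq = log.append({"node": node, "username": username})
    # 2. Read the log back, up to and including our own message.
    delivered = [e for e in log.read() if e[0] <= my_seq]
    # 3. The first claim for this username in log order wins.
    first = next(e for _, e in delivered if e["username"] == username)
    return first["node"] == node

if __name__ == "__main__":
    log = TotalOrderLog()
    print(claim_username(log, "node-A", "gabriel"))  # True: first claim wins
    print(claim_username(log, "node-B", "gabriel"))  # False: already claimed
```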

Implementing total order broadcast using linearizable storage

The easiest way is to assume you have a linearizable register that stores an integer and that has an atomic increment-and-get operation [28]. Alternatively, an atomic compare-and-set operation would also do the job.
The algorithm is simple: for every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.
Note that unlike Lamport timestamps, the numbers you get from incrementing the linearizable register form a sequence with no gaps. Thus, if a node has delivered message 4 and receives an incoming message with a sequence number of 6, it knows that it must wait for message 5 before it can deliver message 6. The same is not the case with Lamport timestamps—in fact, this is the key difference between total order broadcast and timestamp ordering.
How hard could it be to make a linearizable integer with an atomic increment-and-get operation? In general, if you think hard enough about linearizable sequence number generators, you inevitably end up with a consensus algorithm.
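
A small sketch of this construction, assuming a linearizable counter with increment-and-get (faked here with a lock inside one process; in a real system the counter itself would require consensus) and recipients that deliver strictly by sequence number. Names are illustrative.

```python
import threading

class LinearizableCounter:
    """Stand-in for a linearizable register with an atomic increment-and-get."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment_and_get(self) -> int:
        with self._lock:
            self._value += 1
            return self._value

class Recipient:
    """Delivers messages strictly in sequence-number order. Because the counter
    produces a gap-free sequence, waiting for the next expected number is safe."""
    def __init__(self):
        self.next_expected = 1
        self.pending = {}
        self.delivered = []

    def receive(self, seq: int, msg: str):
        self.pending[seq] = msg          # buffer, possibly out of order
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1

if __name__ == "__main__":
    counter = LinearizableCounter()
    seq_hello = counter.increment_and_get()   # 1
    seq_world = counter.increment_and_get()   # 2
    node = Recipient()
    node.receive(seq_world, "world")   # arrives first but is buffered (gap at 1)
    node.receive(seq_hello, "hello")   # fills the gap; both delivered in order
    print(node.delivered)              # ['hello', 'world']
```

With Lamport timestamps the recipient could not do this buffering trick, because it could never know whether a message with a smaller timestamp might still arrive—which is exactly the gap-free property the quote above points at.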

More stuff

@gabrielgiussi
gabrielgiussi commented Apr 4, 2020

Distributed Consistency vs ACID

DDIA page 323

There is some similarity between distributed consistency models and the hierarchy of transaction isolation levels we discussed previously [4, 5] (see “Weak Isolation Levels” on page 233). But while there is some overlap, they are mostly independent concerns: transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.

I would say that it is not even required to talk about delays and faults if you have multi-writer (???) or leaderless replication (Dynamo, Cassandra).

Burckhardt page 135

Note that in the database literature, the abstract consistency model used for transactions is generally called the isolation level, not to be confused with data consistency (meaning compliance with data invariants specified by the schema).

@gabrielgiussi

Why cappIO?

  • DDIA page 324 uses a diagram to explain linearizability

@gabrielgiussi
gabrielgiussi commented Apr 4, 2020

CRDTs

Real use case in production ???

@gabrielgiussi
gabrielgiussi commented Apr 4, 2020

System models

DDIA chapter 8

DDIA page 300

Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed system, we can state the assumptions we are making about the behavior (the system model) and design the actual system in such a way that it meets those assumptions. Algorithms can be proved to function correctly within a certain system model. This means that reliable behavior is achievable, even if the underlying system model provides very few guarantees.

@gabrielgiussi changed the title from Tesis to Thesis on Apr 5, 2020
@gabrielgiussi

Safety and liveness

DDIA page 308

IRSDP

@gabrielgiussi

Zookeeper

DDIA page 370

ZooKeeper is modeled after Google’s Chubby lock service [14, 98], implementing not only total order broadcast (and hence consensus), but also an interesting set of other features that turn out to be particularly useful when building distributed systems:

  • Linearizable atomic operations
  • Total ordering of operations: As discussed in “The leader and the lock” on page 301, when some resource is protected by a lock or lease, you need a fencing token to prevent clients from conflicting with each other in the case of a process pause. The fencing token is some number that monotonically increases every time the lock is acquired. ZooKeeper provides this by totally ordering all operations and giving each operation a monotonically increasing transaction ID (zxid) and version number (cversion) [15].
  • Failure detection
  • Change notifications

Of these features, only the linearizable atomic operations really require consensus.

Fencing tokens

TODO
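
For now, a minimal sketch of the fencing check described in the ZooKeeper notes above: the storage service remembers the highest token it has seen (e.g. a zxid) and rejects writes carrying an older one. Class and method names are mine, not ZooKeeper's.

```python
class FencedStorage:
    """Rejects writes whose fencing token is lower than the highest one seen so far."""
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, key, value, token):
        if token < self.highest_token:
            raise RuntimeError(f"stale token {token}, already saw {self.highest_token}")
        self.highest_token = token
        self.data[key] = value

if __name__ == "__main__":
    storage = FencedStorage()
    storage.write("file", "v1", token=33)   # old lock holder
    storage.write("file", "v2", token=34)   # new lock holder after failover
    try:
        # The old holder resumes after a long GC pause and retries with its stale token.
        storage.write("file", "v1-late", token=33)
    except RuntimeError as e:
        print("rejected:", e)
```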

@gabrielgiussi
gabrielgiussi commented Apr 6, 2020

Introduction

IRSDP ✅

Distributed computing addresses algorithms for a set of processes that seek to achieve some form of cooperation. Besides executing concurrently, some of the processes of a distributed system might stop operating, for instance, by crashing or being disconnected, while others might stay alive and keep operating. This very notion of partial failures is a characteristic of a distributed system. In fact, this notion can be useful if one really feels the need to differentiate a distributed system from a concurrent system. It is in order to quote Leslie Lamport here:
“A distributed system is one in which the failure of a computer you did not even know existed can render your own computer unusable.”
When a subset of the processes have failed, or become disconnected, the challenge is usually for the processes that are still operating, or connected to the majority of the processes, to synchronize their activities in a consistent way. In other words, the cooperation must be made robust to tolerate partial failures and sometimes also adversarial attacks. This makes distributed computing a hard, yet extremely stimulating problem. Due to the asynchrony of the processes, the possibility of failures in the communication infrastructure, and perhaps even malicious actions by faulty processes, it may be impossible to accurately detect process failures; in particular, there is often no way to distinguish a process failure from a network failure, as we will discuss in detail later in the book. The challenge in distributed computing is precisely to devise algorithms that provide the processes that remain operating with enough consistent information so that they can cooperate correctly and solve common tasks.

Client-server ✅

In fact, many programs that we use today are distributed programs. Simple daily routines, such as reading e-mail or browsing the Web, involve some form of distributed computing. However, when using these applications, we are typically faced with the simplest form of distributed computing: client–server computing. In client–server computing, a centralized process, the server, provides a service to many remote clients

TODO explain why this is not what we are interested in.

Why design distributed systems? ✅

  • Fun and profit on Horizontal vs Vertical Scale

Nothing really demands that you use distributed systems. Given infinite money and infinite R&D time, we wouldn't need distributed systems. All computation and storage could be done on a magic box - a single, incredibly fast and incredibly reliable system that you pay someone else to design for you.
However, few people have infinite resources. Hence, they have to find the right place on some real-world cost-benefit curve. At a small scale, upgrading hardware is a viable strategy. However, as problem sizes increase you will reach a point where either the hardware upgrade that allows you to solve the problem on a single node does not exist, or becomes cost-prohibitive. At that point, I welcome you to the world of distributed systems.
Ideally, adding a new machine would increase the performance and capacity of the system linearly. But of course this is not possible, because there is some overhead that arises due to having separate computers. Data needs to be copied around, computation tasks have to be coordinated and so on. This is why it's worthwhile to study distributed algorithms - they provide efficient solutions to specific problems, as well as guidance about what is possible, what the minimum cost of a correct implementation is, and what is impossible.

Fun and profit

So everything starts with size - scalability. Informally speaking, in a scalable system as we move from small to large, things should not get incrementally worse.
Scalability is the ability of a system, network, or process, to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.
What is growth?

  • Size scalability: adding more nodes should make the system linearly faster; growing the dataset should not increase latency
  • Geographic scalability: it should be possible to use multiple data centers to reduce the time it takes to respond to user queries, while dealing with cross-data center latency in some sensible manner.
  • Administrative scalability: adding more nodes should not increase the administrative costs of the system (e.g. the administrators-to-machines ratio).
    A scalable system is one that continues to meet the needs of its users as scale increases. There are two particularly relevant aspects - performance and availability - which can be measured in various ways.

Fault tolerance / Availability (already discussed above in IRSDP): if you only have one process and that process fails, you have nothing. ✅

Distributed systems allow us to achieve desirable characteristics that would be hard to accomplish on a single system. For example, a single machine cannot tolerate any failures since it either fails or doesn't.
Distributed systems can take a bunch of unreliable components, and build a reliable system on top of them.
Availability = the proportion of time a system is in a functioning condition. If a user cannot access the system, it is said to be unavailable.
Availability is in some sense a much wider concept than uptime, since the availability of a service can also be affected by, say, a network outage or the company owning the service going out of business (which would be a factor which is not really relevant to fault tolerance but would still influence the availability of the system). But without knowing every single specific aspect of the system, the best we can do is design for fault tolerance.
What does it mean to be fault tolerant? It is the ability of a system to behave in a well-defined manner once faults occur

  • Improving Latency
    • By distributing the load among more nodes.
    • By placing nodes closer to their users, e.g. one datacenter in America and another in Africa.
    • By partitioning (?) (TODO work on this one)

Some characteristics of distributed systems ✅

  • partial failures
  • network is unreliable
  • different consistency models (see this)
  • unsync clocks
  • no way to distinguish a slow process from a crashed one (at least in async model)
  • no shared memory, only message passing via an unreliable network

(DDIA page 300)

A node in the network cannot know anything for sure—it can only make guesses based on the messages it receives (or doesn’t receive) via the network. A node can only find out what state another node is in (what data it has stored, whether it is correctly functioning, etc.) by exchanging messages with it. If a remote node doesn’t respond, there is no way of knowing what state it is in, because problems in the network cannot reliably be distinguished from problems at a node.

Partial failures (DDIA page 274) ✅

There is no fundamental reason why software on a single computer should be flaky: when the hardware is working correctly, the same operation always produces the same result (it is deterministic). If there is a hardware problem (e.g., memory corruption or a loose connector), the consequence is usually a total system failure.
In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure. The difficulty is that partial failures are nondeterministic: if you try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail. As we shall see, you may not even know whether something succeeded or not, as the time it takes for a message to travel across a network is also nondeterministic!
This nondeterminism and possibility of partial failures is what makes distributed systems hard to work with.

Things break (DDIA page 276) ✅

The bigger a system gets, the more likely it is that one of its components is broken. Over time, broken things get fixed and new things break, but in a system with thousands of nodes, it is reasonable to assume that something is always broken [7].
Even in smaller systems consisting of only a few nodes, it’s important to think about partial failure. In a small system, it’s quite likely that most of the components are working correctly most of the time. However, sooner or later, some part of the system will become faulty, and the software will have to somehow handle it. The fault handling must be part of the software design, and you (as operator of the software) need to know what behavior to expect from the software in the case of a fault.
In distributed systems, suspicion, pessimism, and paranoia pay off.

Reliable from unreliable components (DDIA page 277)

intuitively it may seem like a system can only be as reliable as its least reliable component (its weakest link). This is not the case: in fact, it is an old idea in computing to construct a more reliable system from a less reliable underlying base

  • IP (the Internet Protocol) is unreliable: it may drop, delay, duplicate, or reorder packets. TCP (the Transmission Control Protocol) provides a more reliable transport layer on top of IP: it ensures that missing packets are retransmitted, duplicates are eliminated, and packets are reassembled into the order in which they were sent.

Network is unreliable (DDIA page 278) ✅

The internet and most internal networks in datacenters (often Ethernet) are asynchronous packet networks. In this kind of network, one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send a request and expect a response, many things could go wrong:

  1. Your request may have been lost (perhaps someone unplugged a network cable).
  2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the recipient is overloaded).
  3. The remote node may have failed (perhaps it crashed or it was powered down).
  4. The remote node may have temporarily stopped responding (perhaps it is experiencing a long garbage collection pause; see “Process Pauses” on page 295), but it will start responding again later.
  5. The remote node may have processed your request, but the response has been lost on the network (perhaps a network switch has been misconfigured).
  6. The remote node may have processed your request, but the response has been delayed and will be delivered later (perhaps the network or your own machine is overloaded).

If you send a request to another node and don’t receive a response, it is impossible to tell why. The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive. However, when a timeout occurs, you still don’t know whether the remote node got your request or not (and if the request is still queued somewhere, it may still be delivered to the recipient, even if the sender has given up on it)
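
A tiny sketch of the timeout approach described above, using a plain TCP connection attempt (host and port are placeholders). The point is that a timeout only yields a suspicion, never certainty: a slow node or a slow network is indistinguishable from a crashed node.

```python
import socket

def probe(host, port, timeout_seconds=1.0):
    """Suspect a node as failed if it does not accept a connection in time.
    The suspicion may be wrong: the node (or the network) might just be slow."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return "responding"
    except OSError:
        return "suspected failed (or just slow / unreachable)"

if __name__ == "__main__":
    print(probe("localhost", 2181))  # e.g. a local ZooKeeper, if one is running
```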

Network faults in practice (DDIA page 279) ✅

There are some systematic studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, even in controlled environments like a datacenter operated by one company [14]. One study in a medium-sized datacenter found about 12 network faults per month, of which half disconnected a single machine, and half disconnected an entire rack [15]. Another study measured the failure rates of components like top-of-rack switches, aggregation switches, and load balancers [16]. It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages.
Public cloud services such as EC2 are notorious for having frequent transient network glitches [14], and well-managed private datacenter networks can be stabler environments. Nevertheless, nobody is immune from network problems: for example, a problem during a software upgrade for a switch could trigger a network topology reconfiguration, during which network packets could be delayed for more than a minute [17]. Sharks might bite undersea cables and damage them [18]. Other surprising faults include a network interface that sometimes drops all inbound packets but sends outbound packets successfully [19]: just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction.

TODO See The Network is Reliable. An informal survey of real-world communications failures

TODO there is other stuff to read in DDIA page 280 ‼️

  • Detecting Faults
  • Timeouts and unbounded delays
  • TCP vs UDP
  • Sync vs Async networks
  • Latency and resource utilization

Unreliable clocks (DDIA page 287) ✅

In a distributed system, time is a tricky business, because communication is not instantaneous: it takes time for a message to travel across the network from one machine to another. The time when a message is received is always later than the time when it is sent, but due to variable delays in the network, we don’t know how much later. This fact sometimes makes it difficult to determine the order in which things happened when multiple machines are involved.
Moreover, each machine on the network has its own clock, which is an actual hardware device: usually a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own notion of time, which may be slightly faster or slower than on other machines. It is possible to synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which allows the computer clock to be adjusted according to the time reported by a group of servers.

Unfortunately, our methods for getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might hope—hardware clocks and NTP can be fickle beasts. To give just a few examples:

  • The quartz clock in a computer is not very accurate: it drifts (runs faster or slower than it should). Clock drift varies depending on the temperature of the machine. Google assumes a clock drift of 200 ppm (parts per million) for its servers [41], which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30 seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best possible accuracy you can achieve, even if everything is working correctly.
  • If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the local clock will be forcibly reset [37]. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.
  • If a node is accidentally firewalled off from NTP servers, the misconfiguration may go unnoticed for some time. Anecdotal evidence suggests that this does happen in practice.
  • NTP synchronization can only be as good as the network delay, so there is a limit to its accuracy when you’re on a congested network with variable packet delays. One experiment showed that a minimum error of 35 ms is achievable when synchronizing over the internet [42], though occasional spikes in network delay lead to errors of around a second. Depending on the configuration, large network delays can cause the NTP client to give up entirely.
  • Some NTP servers are wrong or misconfigured, reporting time that is off by hours [43, 44]. NTP clients are quite robust, because they query several servers and ignore outliers. Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you were told by a stranger on the internet.
  • In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that need accurate timekeeping [50]. When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds while another VM is running. From an application’s point of view, this pause manifests itself as the clock suddenly jumping forward [26].

Part of the problem is that incorrect clocks easily go unnoticed. If a machine’s CPU is defective or its network is misconfigured, it most likely won’t work at all, so it will quickly be noticed and fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most things will seem to work fine, even though its clock gradually drifts further and further away from reality. If some piece of software is relying on an accurately synchronized clock, the result is more likely to be silent and subtle data loss than a dramatic crash [53, 54].
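
A quick check of the 200 ppm drift figure quoted above, just multiplying the drift rate by the resynchronization interval:

```python
drift_rate = 200e-6                       # 200 ppm, i.e. 200 microseconds per second

print(drift_rate * 30)                    # 0.006 s  -> ~6 ms if resynced every 30 s
print(drift_rate * 24 * 60 * 60)          # 17.28 s  -> ~17 s if resynced once a day
```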

Timestamps for ordering events (DDIA page 291) ‼️

For example, if two clients write to a distributed database, who got there first? Which write is the more recent one? [there is a figure here] This conflict resolution strategy is called last write wins (LWW), and it is widely used in both multi-leader replication and leaderless databases such as Cassandra [53] and Riak [54] (see “Last write wins (discarding concurrent writes)” on page 186).
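
A minimal LWW register sketch (my own names, not Cassandra's or Riak's) showing why timestamp-based conflict resolution is only as good as the clocks behind it: a write that actually happened later, but came from a node with a lagging clock, is silently discarded.

```python
from dataclasses import dataclass

@dataclass
class Write:
    value: str
    timestamp: float  # wall-clock time at the writing node

class LWWRegister:
    """Keeps only the write with the highest timestamp, discarding the rest."""
    def __init__(self):
        self.current = None

    def apply(self, write: Write):
        if self.current is None or write.timestamp > self.current.timestamp:
            self.current = write

if __name__ == "__main__":
    reg = LWWRegister()
    # Client B really wrote *after* client A, but B's clock lags behind A's,
    # so B's newer write loses: silent data loss caused by clock skew.
    reg.apply(Write("from A", timestamp=100.050))
    reg.apply(Write("from B", timestamp=100.020))
    print(reg.current.value)  # "from A"
```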

@gabrielgiussi

✅

@gabrielgiussi

GC pauses are interesting to add, maybe in the introduction when we discuss things that may go wrong.

See DDIA page 295, Process Pauses

@gabrielgiussi
gabrielgiussi commented Apr 27, 2020

Transactions and ACID

DDIA page 321

The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees. This is the same approach as we used with transactions in Chapter 7: by using a transaction, the application can pretend that there are no crashes (atomicity), that nobody else is concurrently accessing the database (isolation), and that storage devices are perfectly reliable (durability). Even though crashes, race conditions, and disk failures do occur, the transaction abstraction hides those problems so that the application doesn’t need to worry about them.
We will now continue along the same lines, and seek abstractions that can allow an application to ignore some of the problems with distributed systems. For example, one of the most important abstractions for distributed systems is consensus: that is, getting all of the nodes to agree on something. As we shall see in this chapter, reliably reaching consensus in spite of network faults and process failures is a surprisingly tricky problem.

DDIA page 221

In the harsh reality of data systems, many things can go wrong:
• The database software or hardware may fail at any time (including in the middle of a write operation).
• The application may crash at any time (including halfway through a series of operations).
• Interruptions in the network can unexpectedly cut off the application from the database, or one database node from another.
• Several clients may write to the database at the same time, overwriting each other’s changes.
• A client may read data that doesn’t make sense because it has only partially been updated.
• Race conditions between clients can cause surprising bugs.
In order to be reliable, a system has to deal with these faults and ensure that they don’t cause catastrophic failure of the entire system. However, implementing fault-tolerance mechanisms is a lot of work. It requires a lot of careful thinking about all the things that can go wrong, and a lot of testing to ensure that the solution actually works.
For decades, transactions have been the mechanism of choice for simplifying these issues. A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback). If it fails, the application can safely retry. With transactions, error handling becomes much simpler for an application, because it doesn’t need to worry about partial failure—i.e., the case where some operations succeed and some fail (for whatever reason).

Safety guarantees provided by ACID (DDIA page 223)

ACID

ACID, which stands for Atomicity, Consistency, Isolation, and Durability. It was coined in 1983 by Theo Härder and Andreas Reuter [7] in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
However, in practice, one database’s implementation of ACID does not equal another’s implementation. For example, as we shall see, there is a lot of ambiguity around the meaning of isolation [8].
ACID has unfortunately become mostly a marketing term.

Atomic

For example, in multi-threaded programming, if one thread executes an atomic operation, that means there is no way that another thread could see the half-finished result of the operation. The system can only be in the state it was before the operation or after the operation, not something in between.
By contrast, in the context of ACID, atomicity is not about concurrency. It does not describe what happens if several processes try to access the same data at the same time, because that is covered under the letter I, for isolation.
Rather, ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed—for example, a process crashes, a network connection is interrupted, a disk becomes full, or some integrity constraint is violated. If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction.
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity.
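
A small sketch of the abort path, again with Python's sqlite3 (the schema is invented for illustration): a fault after the first write causes the transaction to roll back, and the half-finished write is discarded.

```python
# Minimal sketch of ACID atomicity, assuming sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, unread INTEGER)")
conn.execute("INSERT INTO emails VALUES (1, 0)")
conn.commit()

try:
    with conn:  # transaction: commit on success, rollback on exception
        conn.execute("UPDATE emails SET unread = 1 WHERE id = 1")  # first write succeeds
        raise RuntimeError("simulated fault mid-transaction")
except RuntimeError:
    pass  # the transaction was aborted

# The half-finished write was discarded; the database is back in its prior state.
print(conn.execute("SELECT unread FROM emails WHERE id = 1").fetchone())  # (0,)
```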

Consistency

The word consistency is terribly overloaded:

  • replica consistency and the issue of eventual consistency that arises in asynchronously replicated systems
  • Consistent hashing is an approach to partitioning that some systems use for rebalancing
  • In the CAP theorem the word consistency is used to mean linearizability
  • In the context of ACID, consistency refers to an application-specific notion of the database being in a “good state.”
    The idea of ACID consistency is that you have certain statements about your data (invariants) that must always be true—for example, in an accounting system, credits and debits across all accounts must always be balanced (a small sketch of this follows the list). If a transaction starts with a database that is valid according to these invariants, and any writes during the transaction preserve the validity, then you can be sure that the invariants are always satisfied.
    However, this idea of consistency depends on the application’s notion of invariants, and it’s the application’s responsibility to define its transactions correctly so that they preserve consistency. This is not something that the database can guarantee: if you write bad data that violates your invariants, the database can’t stop you. (Some specific kinds of invariants can be checked by the database, for example using foreign key constraints or uniqueness constraints. However, in general, the application defines what data is valid or invalid—the database only stores it.)
    Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID. (Joe Hellerstein has remarked that the C in ACID was “tossed in to make the acronym work” in Härder and Reuter’s paper [7], and that it wasn’t considered important at the time.)
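
The sketch referenced above, using Python's sqlite3 (schema and helper invented for illustration): the database only enforces the constraints we explicitly declare, while keeping credits and debits balanced is up to how the application writes its transactions.

```python
# Sketch of an application-level invariant: the sum of all balances stays constant.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(amount, src, dst):
    # The application preserves the invariant by always debiting and crediting together;
    # the database only checks the constraint it was told about (balance >= 0).
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))

transfer(30, "alice", "bob")
print(conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0])  # 150, unchanged
```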

Durability

The purpose of a database system is to provide a safe place where data can be stored without fear of losing it. Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.

Isolation

Most databases are accessed by several clients at the same time. That is no problem if they are reading and writing different parts of the database, but if they are accessing the same database records, you can run into concurrency problems (race conditions).
Figure 7-1 is a simple example of this kind of problem. Say you have two clients simultaneously incrementing a counter that is stored in a database. Each client needs to read the current value, add 1, and write the new value back (assuming there is no increment operation built into the database). In Figure 7-1 the counter should have increased from 42 to 44, because two increments happened, but it actually only went to 43 because of the race condition.
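
A sketch of that interleaving with Python's sqlite3 (names invented for illustration); the final statements show one possible fix, expressing the increment as a single atomic read-modify-write statement.

```python
# Sketch of the lost-update race: two clients each read, add 1, and write back.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('hits', 42)")
conn.commit()

def read():
    return conn.execute("SELECT value FROM counters WHERE name = 'hits'").fetchone()[0]

def write(v):
    with conn:
        conn.execute("UPDATE counters SET value = ? WHERE name = 'hits'", (v,))

a = read()     # client A reads 42
b = read()     # client B reads 42, before A has written
write(a + 1)   # A writes 43
write(b + 1)   # B also writes 43: A's increment is lost
print(read())  # 43, not 44

# One fix: make the increment a single atomic statement, so the read-modify-write
# happens inside the database as one isolated operation.
with conn:
    conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'hits'")
print(read())  # 44
```
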
Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each other’s toes. The classic database textbooks formalize isolation as serializability, which means that each transaction can pretend that it is the only transaction running on the entire database. The database ensures that when the transactions have committed, the result is the same as if they had run serially (one after another), even though in reality they may have run concurrently [10].
However, in practice, serializable isolation is rarely used, because it carries a performance penalty. Some popular databases, such as Oracle 11g, don’t even implement it. In Oracle there is an isolation level called “serializable,” but it actually implements something called snapshot isolation, which is a weaker guarantee than serializability [8, 11].
Concurrently running transactions shouldn’t interfere with each other. For example, if one transaction makes several writes, then another transaction should see either all or none of those writes, but not some subset.
https://stackoverflow.com/a/34834483/3517383 (maybe I should refer to this when I talk about weak isolation levels).

The need for multi-object transactions (DDIA page 231)

But do we need multi-object transactions at all? Would it be possible to implement any application with only a key-value data model and single-object operations?
There are some use cases in which single-object inserts, updates, and deletes are sufficient. However, in many other cases writes to several different objects need to be coordinated:
• In a relational data model, a row in one table often has a foreign key reference to a row in another table. (Similarly, in a graph-like data model, a vertex has edges to other vertices.) Multi-object transactions allow you to ensure that these references remain valid: when inserting several records that refer to one another, the foreign keys have to be correct and up to date, or the data becomes nonsensical.
• In a document data model, the fields that need to be updated together are often within the same document, which is treated as a single object—no multi-object transactions are needed when updating a single document. However, document databases lacking join functionality also encourage denormalization (see “Relational Versus Document Databases Today” on page 38). When denormalized information needs to be updated, like in the example of Figure 7-2, you need to update several documents in one go (a small sketch of this follows the list). Transactions are very useful in this situation to prevent denormalized data from going out of sync.
• In databases with secondary indexes (almost everything except pure key-value stores), the indexes also need to be updated every time you change a value. These indexes are different database objects from a transaction point of view: for example, without transaction isolation, it’s possible for a record to appear in one index but not another, because the update to the second index hasn’t happened yet.
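
The sketch mentioned above, with Python's sqlite3 (schema invented, loosely following the Figure 7-2 idea): a message row and a denormalized unread counter are separate objects, so they are written in one transaction to keep them from going out of sync.

```python
# Sketch of keeping denormalized data in sync with a multi-object transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE emails (id INTEGER PRIMARY KEY, user TEXT, body TEXT, read INTEGER);
  CREATE TABLE counters (user TEXT PRIMARY KEY, unread INTEGER);
  INSERT INTO counters VALUES ('alice', 0);
""")

def deliver(user, body):
    # Both writes commit together, or neither does, so the unread counter
    # can never disagree with the mailbox contents.
    with conn:
        conn.execute("INSERT INTO emails (user, body, read) VALUES (?, ?, 0)", (user, body))
        conn.execute("UPDATE counters SET unread = unread + 1 WHERE user = ?", (user,))

deliver("alice", "hi!")
print(conn.execute("SELECT unread FROM counters WHERE user = 'alice'").fetchone())  # (1,)
```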

Weak isolation levels (DDIA page 233)

TODO
Is this somehow equivalent to this tree? Check #117 (comment)
https://jepsen.io/consistency

NoSQL and Distributed transactions (DDIA page 223)

In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to improve upon the relational status quo by offering a choice of new data models (see Chapter 2), and by including replication (Chapter 5) and partitioning (Chapter 6) by default. Transactions were the main casualty of this movement: many of this new generation of databases abandoned transactions entirely, or redefined the word to describe a much weaker set of guarantees than had previously been understood [4].
With the hype around this new crop of distributed databases, there emerged a popular belief that transactions were the antithesis of scalability, and that any large-scale system would have to abandon transactions in order to maintain good performance and high availability [5, 6]. On the other hand, transactional guarantees are sometimes presented by database vendors as an essential requirement for “serious applications” with “valuable data.” Both viewpoints are pure hyperbole.
The truth is not that simple: like every other technical design choice, transactions have advantages and limitations.

Spanner

BASE systems (DDIA page 223)

(Systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency [9]. This is even more vague than the definition of ACID. It seems that the only sensible definition of BASE is “not ACID”; i.e., it can mean almost anything you want.)

@gabrielgiussi
Copy link
Owner Author
gabrielgiussi commented Apr 27, 2020

Where can cappIO help?

  • A race condition between two clients concurrently incrementing a counter (DDIA page 226)
