This is the repository for the Artifact Evaluation of the OSDI 2025 paper "Picsou: Enabling Replicated State Machines to Communicate Efficiently".
For all questions about the artifact, please e-mail Reginald Frank (reginaldfrank77@berkeley.edu).
- General Information
- High Level Claims
- Artifact Organization
- Validating the Claims
- Step by Step Guide
This artifact contains, and allows reproduction of, experiments for (almost*) all figures included in the paper "Picsou: Enabling Replicated State Machines to Communicate Efficiently".
*Cross-country experiments have been excluded due to their high GCP credit cost. Kafka results are also available, but their collection is not automated.
It contains a prototype implementation of Picsou, a Crash and Byzantine Fault Tolerant Cross-Cluster Consistent Broadcast (C3B) protocol. The prototype is intended to connect two separate consensus instances; there is currently support for RAFT (Etcd's Raft implementation) as well as FILE (an "infinitely fast" simulated consensus protocol). The prototype can simulate both crash and Byzantine failures and remains resilient to both. While the Picsou protocol can tolerate arbitrary failure patterns, the prototype does not attempt to simulate all possible behaviors. For example, while the protocol can handle crash failures, if a machine is shut down during a test the prototype will abort after detecting the unresponsive machine (to avoid collecting results with an unexpected crash).
[NOTE] The Picsou prototype codebase is henceforth called "Scrooge". Throughout this document you will find references to Scrooge, and many configuration files are named accordingly. All these occurrences refer to the Picsou prototype.
Picsou's codebase (BFT-RSM) was modified to make collecting results for artifact evaluation easier. While the takeaways remain consistent, individual performance results may differ slightly across the benchmarks (better performance in some cases), as other minor modifications to the codebase were necessary to support these changes.
In addition to Picsou, this artifact contains prototype implementations of five baselines:
- One-Shot (OST): In OST, a message is sent by a single sender to a single receiver. OST is only meant as a performance upper bound; it does not satisfy C3B, as message delivery cannot be guaranteed.
- All-To-All (ATA): In ATA, every replica in the sending RSM sends all messages to all receiving replicas (O(N^2) message complexity). As in Scrooge, every correct receiver is guaranteed to eventually receive the message.
- Leader-To-Leader (LL): The leader of the sending RSM sends a message to the leader of the receiving RSM, who then internally broadcasts the message. This protocol does not guarantee eventual delivery when leaders are faulty.
- OTU: Modeling the C3B strategy in the GeoBFT paper (Suyash et al.), OTU breaks an RSM down into a set of sub-RSMs. Much like LL, OTU has the leader of the sending RSM send its messages to at least (#faulty_receivers) + 1 receiving RSM replicas, each of which then internally broadcasts these messages. When the leader is faulty, replicas time out and request a resend. OTU thus guarantees eventual delivery after at most (#faulty_senders) + 1 resends in the worst case, for O(faulty^2) = O(N^2) total messages; see the sketch after this list.
- KAFKA: Apache Kafka is the de facto industry standard for exchanging data between services. Producers write data to a Kafka cluster, while consumers read data from it. Internally, Kafka uses Raft to reliably disseminate messages to consumers. We use Kafka 2.13-3.7.0.
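To make OTU's worst-case behavior concrete, here is a minimal sketch of the round and message counts implied by the description above. This is our illustration, not artifact code, and the function names are hypothetical:

```python
# Illustrative model of OTU's worst case, per the description above.
# Not part of the artifact; names are hypothetical.

def otu_rounds_until_delivery(faulty_senders: int) -> int:
    """Each failed round triggers a timeout and a resend, so at most
    (#faulty_senders) + 1 rounds are needed before a correct sender
    is in charge and the message is delivered."""
    return faulty_senders + 1

def otu_worst_case_messages(faulty_senders: int, faulty_receivers: int) -> int:
    """Each round, the current sender transmits to (#faulty_receivers) + 1
    receivers, guaranteeing at least one correct receiver who can then
    broadcast internally. With f = O(N), this totals O(f^2) = O(N^2) sends."""
    return otu_rounds_until_delivery(faulty_senders) * (faulty_receivers + 1)

if __name__ == "__main__":
    f = 6  # e.g., 6 faulty replicas on each side of a 19-node network
    print(otu_rounds_until_delivery(f))   # -> 7 rounds in the worst case
    print(otu_worst_case_messages(f, f))  # -> 49 cross-RSM sends
```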
For more information, read Section 6 (Evaluation) of the paper.
- Main claim 1: For all local baseline tests (Throughput vs Message Size @ 19-node network; Throughput vs Message Size @ 4-node network; Throughput vs Network Size @ 100B messages; Throughput vs Network Size @ 1MB messages), Picsou's throughput lies between All-To-All (ATA) and One-Shot (OST). Expected performance is 2.5-15x higher than ATA, depending on network size.
- Main claim 2: For all failure tests (Byzantine Failures: Throughput vs Network Size; Crash Failures: Throughput vs Network Size), Picsou has higher performance than its BFT (ATA) and CFT (ATA, LL, OTU) comparative baselines. Expected performance is between 2.5x and 7x higher than ATA, with LL and OTU performance comparatively similar to ATA.
- Main claim 3: For all geo-distributed tests (Data Reconciliation: Throughput vs Message Size; Disaster Recovery: Throughput vs Message Size), as the size of the sent messages increases, the throughput of all strategies increases. Critically, ATA, LL, and OTU will never exceed a system-wide throughput of 50 MBps (the throughput of sending a message between two nodes cross-region). In contrast, OST and Picsou will both be able to reach 70 MBps (the maximum throughput of Etcd Raft on these machines); see the sketch below.
- Supplementary: All other microbenchmarks reported realistically represent Picsou.
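The 50 MBps vs. 70 MBps gap in Main claim 3 follows from simple arithmetic. Below is a back-of-the-envelope model, using only the numbers quoted in the claim; the per-strategy link-usage rationale in the comments is our interpretation, not artifact code:

```python
# Back-of-the-envelope model for Main claim 3 (numbers from the claim above).
CROSS_REGION_LINK_MBPS = 50  # throughput of one cross-region node pair
RAFT_MAX_MBPS = 70           # maximum throughput of Etcd Raft on these machines

def system_throughput_mbps(useful_parallel_links: int) -> int:
    """System-wide throughput is capped both by the consensus layer and by the
    aggregate bandwidth of cross-region links carrying *distinct* data."""
    return min(useful_parallel_links * CROSS_REGION_LINK_MBPS, RAFT_MAX_MBPS)

# LL and OTU funnel all data through the leader's link, and ATA sends the same
# data redundantly on every link, so each gets one link's worth of goodput:
print(system_throughput_mbps(1))  # -> 50 MBps
# OST and Picsou spread distinct messages across links, so they hit the Raft cap:
print(system_throughput_mbps(2))  # -> 70 MBps
```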
- Example artifact output: https://drive.google.com/file/d/1lXxlT_wlib-EFfCos_lPbRLkvaWvwtHG/view?usp=sharing
The artifact spans the following GitHub repositories. Please check out the corresponding branch when validating claims for a given system.
- BFT-RSM: Contains the testing infrastructure and the implementations of Picsou and baselines 1-4 (all but Kafka).
  - Baseline implementations:
    - `Code/system/scrooge.cpp` -- Scrooge implementation
    - `Code/system/one_to_one.cpp` -- OST implementation (one_to_one unfair variant)
    - `Code/system/all_to_all.cpp` -- ATA implementation
    - `Code/system/geobft.cpp` -- OTU implementation
    - `Code/system/leader_leader.cpp` -- LL implementation
  - Testing infrastructure:
    - `Code/experiments/results` -- location where all test results are stored
    - `Code/util_graph.py` -- Python file containing the mapping from each graph to visualize to the experiments to run for each of its datapoints
    - `Code/auto_run_all_experiments.py` -- the main testing framework (sketched below), which:
      - Loads all experiments needed to visualize graphs from `util_graph.py`
      - Checks `Code/experiments/results` to see which experiments remain to be run
      - Spawns the required network for testing using `Code/auto_make_nodes.sh`
      - Builds and runs the required tests using `Code/auto_build_exp.sh` and `Code/auto_run_exp.sh`
      - Deletes the created network using `Code/auto_delete_nodes.sh`
    - `Code/experiments/experiment_scripts/eval_all_graphs.py` -- script which uses `Code/util_graph.py` to load all desired graphs, pulls in results from `Code/experiments/results`, and outputs multiple `.png` graph files plus a text representation of all results collected
- scrooge-kafka: Contains a producer/consumer shim to use Apache Kafka as a Cross-Cluster Consistent Broadcast protocol.
- go-algorand: Contains the edited version of Algorand which sends all blocks to Scrooge to communicate with other consensus clusters.
- raft-application: Contains the edited version of Etcd Raft which sends all blocks to Scrooge to communicate with other consensus clusters. Additionally has edits for disaster recovery and key reconciliation.
- resdb-application: Contains the edited version of ResDB which sends all blocks to Scrooge to communicate with other consensus clusters.
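For orientation, here is a minimal sketch of the driver loop that `Code/auto_run_all_experiments.py` implements, per the description above. This is our illustration, not the real code: the loader API and the arguments the helper scripts take are assumptions.

```python
# Hypothetical sketch of the auto_run_all_experiments.py driver loop.
import os
import subprocess

CODE_DIR = os.path.expanduser("~/BFT-RSM/Code")
RESULTS_DIR = os.path.join(CODE_DIR, "experiments", "results")

def run_all(experiments):
    """`experiments` is the list of experiment names that util_graph.py maps
    each graph's datapoints to (the real loader's API may differ)."""
    for exp in experiments:
        if os.path.isdir(os.path.join(RESULTS_DIR, exp)):
            continue  # results already collected; delete the folder to re-run
        subprocess.run(["./auto_make_nodes.sh"], check=True, cwd=CODE_DIR)  # spawn network
        try:
            # Whether these scripts take the experiment name as an argument
            # is an assumption made for this sketch.
            subprocess.run(["./auto_build_exp.sh", exp], check=True, cwd=CODE_DIR)
            subprocess.run(["./auto_run_exp.sh", exp], check=True, cwd=CODE_DIR)
        finally:
            subprocess.run(["./auto_delete_nodes.sh"], check=True, cwd=CODE_DIR)  # tear down
```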
All our experiments were run on Google Cloud. To reproduce our results and validate our claims with our testing infrastructure, you will need to 1) get access to our Google Cloud project, 2) create a VM using our provided worker image, 3) run the `Code/auto_run_all_experiments.py` script to collect results, and 4) run `Code/experiments/experiment_scripts/eval_all_graphs.py` to visualize all trends. If you wish to use our code outside of GCP (e.g., CloudLab), that is fine, but our protocol parameters may need to change for the new machines/network, and running cross-region experiments may not be possible. Let us know if this is your desired method and we can provide additional instructions.
- Email reginaldfrank77@berkeley.edu to get access to our Google Cloud project.
- Create a new VM with the `osdi-artifact-eval.img` image:
  - Navigate to `Virtual Machines` > `VM instances` > `Create instance`.
  - Ensure the region is `us-west1`; the zone can be any.
  - For Machine type, select `c2-standard-8 (8 vCPUs, 32 GB Memory)`.
  - Under `OS and storage`, change the image to `Custom image` > `osdi-artifact-evaluation-image`.
  - Create the VM and wait for it to be created.
- SSH onto the VM.
  - Execute `sudo -su scrooge` (this will need to be done any time you SSH onto your created VM).
- Run the experiments:
  - Execute `cd ~/BFT-RSM/Code` to move to the directory with all the testing infrastructure.
  - Execute `tmux new -s <your session name>` to enter a new tmux session.
  - Execute `./auto_run_all_experiments.py` to start all experiments.
  - You can follow along with the Python script output via the logs written to `/tmp`, or look at node outputs in `~/BFT-RSM/Code/experiments/results`.
  - You can also find the IP addresses of the machines in the experiments by navigating to the GCP website > `Instance groups`. Feel free to SSH onto them and monitor node performance using tools like `htop` (CPU usage) and `nload` (network usage).
  - If you want to re-run an experiment, delete its corresponding folder in `~/BFT-RSM/Code/experiments/results` and re-run `./auto_run_all_experiments.py` (a scripted version of this cleanup appears below).
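If you prefer to script the re-run cleanup described above, a minimal sketch (the folder name here is a placeholder for whichever experiment you want to repeat):

```python
# Delete one experiment's results so ./auto_run_all_experiments.py re-runs it.
import os
import shutil

RESULTS_DIR = os.path.expanduser("~/BFT-RSM/Code/experiments/results")
experiment = "<experiment_folder_name>"  # placeholder: pick a folder in RESULTS_DIR

target = os.path.join(RESULTS_DIR, experiment)
if os.path.isdir(target):
    shutil.rmtree(target)  # then re-run ./auto_run_all_experiments.py
```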
- Visualize the results:
  - Execute `cd ~/BFT-RSM/Code/experiments/results` to navigate to the directory containing all result files.
  - Execute `./eval_all_graphs.py` to load all result files and run the analysis on them.
  - View the outputted text for numerical results, and view the graphs in the multiple `*.png` files now in the results directory.
Finally, thanks for viewing/evaluating our artifact; let us know if you have any questions!