Distributed consensus reading list

Since its inception in the 1980s, distributed consensus and the related areas of atomic broadcast, state machine replication and byzantine fault tolerance have been the subjects of extensive academic research. This file contains a list of papers (and other works) relating to distributed consensus.

Many of the papers listed below fit into more than one section. However, for simplicity, each paper is listed only in the most relevant section. Where possible, open access links for each paper have been provided. Contributions are welcome.

Key: acmdl = ACM Digital Library

The sections are as follows:

Theoretical results

This section lists theoretical results relating to distributed consensus.

  • Time, Clocks, and the Ordering of Events in a Distributed System, CACM 1978 [acmdl,pdf]
  • The implementation of reliable distributed multiprocess systems, Computer Networks 1978 [pdf]
  • Impossibility of Distributed Consensus with One Faulty Process, JACM 1985 [acmdl,pdf]
    • also known as the FLP result
  • The Byzantine Generals Problem, TOPLAS 1982 [acmdl,pdf]
  • On the Minimal Synchronism Needed for Distributed Consensus, JACM 1987 [acmdl,pdf]
  • Unreliable Failure Detectors for Reliable Distributed Systems, JACM 1996 [acmdl,pdf]
  • The Weakest Failure Detector for Solving Consensus, JACM 1996 [acmdl,pdf]
  • Omega Meets Paxos: Leader Election and Stability without Eventual Timely Links, DISC 2005 [acmdl,pdf]
  • Lower Bounds for Asynchronous Consensus, Distributed Computing 2006 [acmdl,pdf]
  • The Heard-Of Model: Computing in Distributed Systems with Benign Failures, Distributed Computing 2009 [acmdl,pdf]
  • Virtually Synchronous Methodology for Dynamic Service Replication, MS Tech report 2010 [pdf]
  • CAP Twelve Years Later: How the "Rules" Have Changed, Computer Magazine 2012 [html]

Surveys

This section lists surveys, tutorials, book chapters and systemisation of knowledge papers covering distributed consensus algorithms.

  • A Modular Approach to Fault-Tolerant Broadcasts and Related Problems, Tech Report 1994 [acmdl,pdf]
  • How to Build a Highly Available System Using Consensus, WDAG 1996 [acmdl,pdf]
  • Revisiting the PAXOS algorithm, WDAG 1997 [acmdl,pdf]
  • The ABCD’s of Paxos, PODC 2001 [acmdl,pdf]
  • Paxos Made Simple, SIGACT News 2001 [pdf]
    • describes single-decree Paxos as well as Multi-Paxos for SMR
    • much more readable than the original paper
    • featured in the morning paper
  • Deconstructing Paxos, SIGACT News 2003 [pdf,acmdl]
  • Total order broadcast and multicast algorithms: Taxonomy and survey, CSUR 2004 [acmdl,pdf]
  • Readings in Database Systems (5th Edition), Book 2015 [pdf]
  • Vive La Difference: Paxos vs. Viewstamped Replication vs. Zab, TDSC 2015 [pdf]
  • The Alpha of Indulgent Consensus, Comp Journal 2007 [acmdl,pdf]
  • The Paxos Register, SRDS 2007 [acmdl,pdf]
  • Classic Paxos vs. Fast Paxos: Caveat Emptor, HotDep 2007 [acmdl,pdf]
  • Introduction to Reliable and Secure Distributed Programming, Book 2011 [acmdl,website]
  • Tutorial Summary: Paxos Explained from Scratch, OPODIS 2013 [acmdl,pdf]
  • Paxos Made Moderately Complex, CSUR 2015 [acmdl,pdf]
  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, Book 2017 [website,amazon]
  • On the Parallels between Paxos and Raft, and how to Port Optimizations, PODC 2019 [acmdl,pdf]
  • Paxos vs Raft: Have we reached consensus on distributed consensus?, PaPoC 2020 [acmdl,arxiv]
  • 60 Years of Mastering Concurrent Computing through Sequential Thinking, SIGACT News 2020 [acmdl]
  • What's Live? Understanding Distributed Consensus, Unpublished 2020 [arxiv]

Algorithms for consensus

This section lists papers describing algorithms for distributed consensus.

  • Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols, PODC 1983 [acmdl]
    • aka Ben-Or algorithm
  • Reliable communication in the presence of failures, TOCS 1987 [acmdl,pdf]
  • Consensus in the Presence of Partial Synchrony, JACM 1988 [acmdl,pdf]
  • Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems, PODC 1988 [acmdl,pdf]
  • Efficient Message Ordering in Dynamic Networks, PODC 1996 [acmdl,pdf]
  • The Part-Time Parliament, TOCS 1998 [acmdl,pdf]
  • Disk Paxos, DISC 2000 [acmdl,pdf]
    • This paper describes how to replace acceptors in Paxos with disks
    • Each disk is divided into blocks, one for each proposer. Each proposer may only write to its own block and read from other blocks, which they do in each of the two usual Paxos phases
    • Each block contains the rough equivalent of the last promised ballot number and last accepted proposal for the assigned proposer
  • Specifying and Using a Partitionable Group Communication Service, TOCS 2001 [acmdl,pdf]
  • Active Disk Paxos with infinitely many processes, PODC 2002 [acmdl,pdf]
    • This paper makes Disk Paxos more “Paxos like” by assuming the disks support more operations e.g. conditional write
    • ADP claims that Disk Paxos requires a fixed set of proposers and that ADP fixes this.
  • Cheap Paxos, DSN 2004 [acmdl,pdf]
  • Generalized Consensus and Paxos, Tech Report 2005 [pdf]
    • introduces the idea of deciding a partial ordering of operations for SMR instead of a total ordering
  • Fast Paxos, Distributed Computing 2006 [pdf]
    • proposers can bypass the leader, as multiple values may be proposed in the same ballot
    • requires stronger quorum intersection, e.g. fast quorums of 3/4 for the same liveness guarantees as classic Paxos
  • Consensus on Transaction Commit, TODS 2006 [acmdl,pdf]
  • Multicoordinated Paxos, PODC 2007 [acmdl,pdf]
    • Instead of having one leader, have a group of leaders. Clients send operations to all leaders, and they all propose values to the acceptors. Acceptors only accept a value if they have received proposals for it from a quorum of leaders. This is similar to the non-equivocation phase in BFT. Liveness no longer depends on a single leader.
  • Stoppable Paxos, Tech Report 2008 [pdf]
  • Dynamic atomic storage without consensus, JACM 2011 [acmdl,pdf]
  • Fast Genuine Generalized Consensus, SRDS 2011 [acmdl,pdf]
  • Viewstamped Replication Revisited, Tech Report 2012 [pdf]
  • On Collision-fast Atomic Broadcast, AINA 2014 [pdf]
  • Paxos Quorum Leases: Fast Reads Without Sacrificing Writes, SOCC 2014 [acmdl,pdf]
    • extends the idea of master read leases to allow the master to promise to use a specified subset of acceptors in every majority quorum. Acceptors in this quorum can then serve reads locally.
    • similar to master read leases, it relies on clock synchrony.
  • Consus: Taming the Paxi, Unpublished 2016 [arxiv]
  • Flexible Paxos: Quorum Intersection Revisited, OPODIS 2016 [pdf]
  • CASPaxos: Replicated State Machines without logs, Unpublished 2018 [pdf]
  • Fast Flexible Paxos: Relaxing Quorum Intersection for Fast Paxos, ICDCN 2021 [arxiv]
  • Paxos Made Practical, Unpublished [pdf]
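
The Fast Paxos note above (fast quorums of 3/4 versus classic majorities) can be checked with a short counting sketch. It assumes the usual statement of the requirement, namely that any classic quorum and any two fast quorums must share at least one acceptor, and it assumes classic quorums are simple majorities. The function names are illustrative and do not come from any paper in this list.

```python
from itertools import combinations, product


def classic_quorum(n: int) -> int:
    """Size of a simple majority of n acceptors (assumed classic quorum)."""
    return n // 2 + 1


def fast_quorum(n: int) -> int:
    """ceil(3n/4), computed with integer arithmetic (assumed fast quorum)."""
    return (3 * n + 3) // 4


def min_overlap(n: int, *sizes: int) -> int:
    """Smallest possible intersection of subsets of n acceptors with these sizes."""
    return max(0, sum(sizes) - (len(sizes) - 1) * n)


def brute_force_min_overlap(n: int, *sizes: int) -> int:
    """Same quantity by exhaustive enumeration; only feasible for small n."""
    families = [[set(c) for c in combinations(range(n), s)] for s in sizes]
    return min(len(set.intersection(*choice)) for choice in product(*families))


if __name__ == "__main__":
    n = 5
    qc, qf = classic_quorum(n), fast_quorum(n)
    # One classic quorum and two fast quorums must always intersect.
    print(n, qc, qf, min_overlap(n, qc, qf, qf))
    # Using plain majorities as fast quorums does not guarantee this.
    print(min_overlap(n, qc, qc, qc))
```

With five acceptors, majorities of three combined with fast quorums of four always leave one acceptor in common, whereas three majorities can have an empty intersection.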

Consensus for specialist hardware

This section lists papers describing consensus algorithms using specialist hardware such as SDN, IP-multicast or RDMA.

  • Ring Paxos: A high-throughput atomic broadcast protocol, DSN 2010 [pdf,code]
  • Multi-Ring Paxos, DSN 2012 [acmdl,pdf]
  • NetPaxos: consensus at network speed, SOSR 2015 [acmdl,pdf]
  • Taming uncertainty in distributed systems with help from the network, Eurosys 2015 [acmdl,pdf]
  • DARE: High-Performance State Machine Replication on RDMA Networks, HPDC 2015 [acmdl,pdf]
  • Paxos Made Switch-y, CCR 2016 [acmdl,pdf]
  • Consensus in a Box: Inexpensive Coordination in Hardware, NSDI 2016 [acmdl,pdf]
  • Distributed Consensus and Implications of NVM on Database Management Systems, ACM Queue 2016 [acmdl,html]
  • AllConcur: Leaderless Concurrent Atomic Broadcast, HPDC 2017 [acmdl,pdf]
  • APUS: Fast and Scalable Paxos on RDMA, SoCC 2017 [acmdl,pdf]
  • When Raft Meets SDN: How to Elect a Leader and Reach Consensus in an Unruly Network, APNet 2017 [acmdl,pdf]
  • P4xos: Consensus as a Network Service, Tech Report 2018 [pdf]
    • P4xos is also evaluated in The Case For In-Network Computing On Demand [acmdl,pdf,code]
  • Derecho: Fast State Machine Replication for Cloud Services, TOCS 2019 [acmdl,pdf,code]
    • Derecho: Group Communication at the Speed of Light, Unpublished [pdf]
    • Groups, Subgroups and Auto-Sharding in Derecho: A Customizable RDMA Framework for Highly Available Cloud Services, Unpublished [pdf]
  • NetChain: Scale-Free Sub-RTT Coordination, NSDI 2018 [acmdl,pdf]
  • Kernel Paxos, SRDS 2018 [pdf]
  • Partitioned Paxos via the Network Data Plane, Tech Report 2019 [pdf]
  • The Impact of RDMA on Agreement, PODC 2019 [pdf]
  • HovercRaft: Achieving Scalability and Fault-tolerance for microsecond-scale Datacenter Services, Eurosys 2020 [acmdl]
  • Microsecond Consensus for Microsecond Applications, OSDI 2020 [arxiv]

Consensus for geo-distributed systems

This section lists papers describing consensus algorithms for WANs and/or geo-replicated systems.

  • Mencius: Building Efficient Replicated State Machines for WANs, OSDI 2008 [acmdl,pdf]
  • Scalable Consistency in Scatter, SOSP 2011 [acmdl,pdf]
  • MDCC: Multi-Data Center Consistency, Eurosys 2013 [acmdl,pdf]
  • There Is More Consensus in Egalitarian Parliaments, SOSP 2013 [acmdl,pdf]
  • Geo-replicated storage with scalable deferred update replication, DSN 2013 [acmdl,pdf]
  • Low-Latency Multi-Datacenter Databases using Replicated Commit, VLDB 2013 [acmdl,pdf]
  • Be General and Don’t Give Up Consistency in Geo-Replicated Transactional Systems, OPODIS 2014 [pdf]
  • CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems, FAST 2015 [acmdl,pdf]
  • GlobalFS: A Strongly Consistent Multi-Site File System, SRDS 2016 [pdf]
  • Canopus: A Scalable and Massively Parallel Consensus Protocol, CoNEXT 2017 [acmdl,pdf]
  • Multileader WAN Paxos: Ruling the Archipelago with Fast Consensus, Tech report 2017 [pdf]
  • WPaxos: Wide Area Network Flexible Consensus, Unpublished 2017 [pdf]
  • Speeding up Consensus by Chasing Fast Decisions, DSN 2017 [pdf]
    • implements an optimization to EPaxos
  • DPaxos: Managing Data Closer to Users for Low-Latency and Mobile Applications, SIGMOD 2018 [acmdl,pdf]
  • SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines, SoCC 2018 [acmdl,pdf]
  • FleetDB: Follow-the-workload Data Migration for Globe-Spanning Databases, Tech report 2018 [pdf]
  • Session Guarantees with Raft and Hybrid Logical Clocks, ICDCN 2019 [acmdl]
  • Near-Optimal Latency Versus Cost Tradeoffs in Geo-Distributed Storage, NSDI 2020 [pdf]
  • State-Machine Replication for Planet-Scale Systems, Eurosys 2020 [acmdl,arxiv]
  • Low-Latency Geo-Replicated State Machines with Guaranteed Writes, PaPoC 2020 [acmdl]

Consensus in production

This section lists papers describing experiences of deploying distributed consensus in production.

  • The Chubby lock service for loosely-coupled distributed systems, OSDI 2006 [acmdl,pdf]
  • Paxos Made Live - An Engineering Perspective, PODC 2007 [acmdl,pdf]
  • ZooKeeper: Wait-free coordination for Internet-scale systems, ATC 2010 [acmdl,pdf]
  • Windows Azure Storage: a highly available cloud storage service with strong consistency, SOSP 2011 [acmdl,pdf]
  • Megastore: Providing Scalable, Highly Available Storage for Interactive Services, CIDR 2011 [pdf]
    • seems to use an unusual definition of Multi-Paxos where each instance is distinct but the 1a/1b messages for slot i are piggybacked onto the 2a/2b messages for slot i-1
    • uses SMR with witnesses (replicas which participate in log replication but do not run a state machine) and read-only replicas (which only run a state machine)
  • Zab: High-performance broadcast for primary-backup systems, DSN 2011 [acmdl,pdf]
    • featured in the morning paper
    • Widely used, Apache-licensed open source project written in Java (project website)
      • Apache Kafka uses ZooKeeper, as well as its own replication protocol, described here
    • Architecture is similar to Google's Chubby but unlike Chubby is described in detail and is open source
    • Writes are linearizable, reads may be stale
    • Note: calling sync before a read doesn't make it linearizable
    • Clients may have multiple outstanding requests, which are handled in FIFO order
    • Uses primary-backup replication instead of state machine replication
  • Large-scale cluster management at Google with Borg, Eurosys 2015 [acmdl,pdf]
  • PaxosStore: High-availability Storage Made Practical in WeChat, VLDB 2017 [acmdl,pdf]
  • Bizur: A Key-value Consensus Algorithm for Scalable File-systems, Unpublished 2017 [pdf]
  • SLOG: Serializable, Low-latency, Geo-replicated Transactions, VLDB 2019 [acmdl,pdf]
  • CockroachDB: The Resilient Geo-Distributed SQL Database, SIGMOD 2020 [acmdl]
  • Millions of Tiny Databases, NSDI 2020 [pdf]
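
The ZooKeeper notes above (updates applied in order, reads that may be stale) can be illustrated with a toy primary-backup model. This is a minimal sketch under strong assumptions: it is not ZooKeeper's API, it ignores quorum acknowledgements, failures and recovery, and the class names ToyPrimary and ToyBackup are hypothetical.

```python
from collections import deque


class ToyBackup:
    """Hypothetical backup that applies the primary's updates in FIFO order."""

    def __init__(self):
        self.store = {}
        self.pending = deque()  # updates received but not yet applied

    def apply_one(self):
        """Apply the oldest pending update, if there is one."""
        if self.pending:
            key, value = self.pending.popleft()
            self.store[key] = value

    def read(self, key):
        """Local read: may return old state while updates are still pending."""
        return self.store.get(key)


class ToyPrimary:
    """Hypothetical primary that orders all writes and forwards them to backups."""

    def __init__(self, backups):
        self.store = {}
        self.backups = backups

    def write(self, key, value):
        self.store[key] = value
        for backup in self.backups:
            backup.pending.append((key, value))


if __name__ == "__main__":
    backup = ToyBackup()
    primary = ToyPrimary([backup])
    primary.write("x", 1)
    primary.write("x", 2)
    print(backup.read("x"))  # stale: nothing applied yet
    backup.apply_one()
    backup.apply_one()
    print(backup.read("x"))  # caught up with the primary
```

A read served locally by the backup lags the primary until the pending updates are applied, which is the sense in which reads may be stale in this model.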

Implementations of consensus

This section lists papers describing implementations of distributed consensus algorithms.

  • Replication and fault-tolerance in the ISIS system, Tech Report 1985 [acmdl,pdf]
  • The ISIS project: real experience with a fault tolerant programming system, OSR 1991 [acmdl,pdf]
  • Replication in the Harp File System, SOSP 1991 [acmdl,pdf]
  • Boxwood: Abstractions as the Foundation for Storage Infrastructure, OSDI 2004 [acmdl,pdf]
  • The Farsite Project: A Retrospective, OSR 2007 [acmdl,pdf]
  • Paxos for System Builders: An Overview, LADIS 2008 [acmdl,pdf]
  • Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore, VLDB 2011 [acmdl,pdf]
  • Paxos replicated state machines as the basis of a high-performance data store, NSDI 2011 [acmdl,pdf]
  • Granola: Low-Overhead Distributed Transaction Coordination, ATC 2012 [acmdl,pdf]
  • S-Paxos: Offloading the Leader for High Throughput State Machine Replication, SRDS 2012 [acmdl,pdf]
  • Calvin: Fast Distributed Transactions for Partitioned Database Systems, SIGMOD 2012 [acmdl,pdf]
  • Commodifying Replicated State Machines with OpenReplica, Tech report 2012 [pdf]
  • Optimizing Paxos with batching and pipelining, Theoretical Computer Science 2013 [acmdl,pdf]
  • Tango: Distributed data structures over a shared log, SOSP 2013 [acmdl,pdf]
  • CORFU: A Distributed Shared Log, TOCS 2013 [acmdl,pdf]
  • Scalable State-Machine Replication, DSN 2014 [acmdl,pdf]
  • When Paxos Meets Erasure Code: Reduce Network and Storage Cost in State Machine Replication, HPDC 2014 [acmdl]
  • In Search of an Understandable Consensus Algorithm (Extended Version), ATC 2014 [acmdl,pdf]
  • Consensus: Bridging Theory and Practice, PhD Thesis 2014 [pdf]
    • PhD thesis describing Raft consensus in more detail
  • Paxos made transparent, SOSP 2015 [acmdl,pdf]
  • Designing Distributed Systems Using Approximate Synchrony in Data Center Networks, NSDI 2015 [acmdl,pdf]
  • No compromises: distributed transactions with consistency, availability, and performance, SOSP 2015 [acmdl,pdf]
  • Building Consistent Transactions with Inconsistent Replication, SOSP 2015 [acmdl,pdf]
  • MetaSync: File Synchronization Across Multiple Untrusted Storage Services, ATC 2015 [pdf,acmdl]
  • Making Fast Consensus Generally Faster, DSN 2016 [pdf]
  • Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics, SIGMOD 2017 [acmdl,pdf]
  • Leader or Majority: Why have one when you can have both? Improving Read Scalability in Raft-like consensus protocols, HotCloud 2017 [pdf,acmdl,slides]
  • Bolt-On Global Consistency for the Cloud, SoCC 2018 [acmdl,pdf]
  • Stable and Consistent Membership at Scale with Rapid, ATC 2018 [pdf]
    • uses Fast Paxos to decide on membership changes. Conflicts are rare because the proposed value is the output of a membership algorithm, so proposers usually propose the same value.
    • Fast Paxos implementation is here and here
  • Aegean: Replication beyond the client-server model, SOSP 2019 [acmdl]
    • also supports BFT
  • Exploiting Commutativity For Practical Fast Replication, NSDI 2019 [acmdl,pdf]
  • Unifying Consensus and Atomic Commitment for Effective Cloud Data Management, VLDB 2019 [acmdl,pdf]
  • Linearizable Quorum Reads in Paxos, HotStorage 2019 [pdf,slides]
    • two-phase quorum read algorithm which does not require the leader
    • does not rely on bounded clock drift like leases
  • RMWPaxos: Fault-Tolerant In-Place Consensus Sequences, Unpublished 2020 [arxiv]
  • Bipartisan Paxos: A Modular State Machine Replication Protocol, Unpublished [pdf]
  • Compartmentalized Consensus: Agreeing With High Throughput, Unpublished [pdf]
  • Scalog: Seamless Reconfiguration and Total Order in a Scalable Shared Log, NSDI 2020 [pdf]
  • Hermes: A Fast, Fault-Tolerant and Linearizable Replication Protocol, ASPLOS 2020 [acmdl,arxiv]
  • PigPaxos: Devouring the communication bottlenecks in distributed consensus, Unpublished 2020 [arxiv]
  • CRaft: An Erasure-coding-supported Version of Raft for Reducing Storage Cost and Network Cost, FAST 2020 [pdf]

Evaluations of consensus

This section lists papers describing standalone evaluations of consensus algorithms.

  • The Performance of Paxos in the Cloud, SRDS 2014 [acmdl,pdf]
  • Consensus in the Cloud: Paxos Systems Demystified, Tech report 2016 [pdf]
  • Spectrum: A Framework for Adapting Consensus Protocols, Unpublished 2019 [pdf]
  • Dissecting the Performance of Strongly-Consistent Replication Protocols, SIGMOD 2019 [acmdl,pdf]
  • Blockchains and Distributed Databases: a Twin Study [arxiv]
    • performance analysis of 5 consensus systems, 3 non-byzantine algorithms (including etcd) and 2 byzantine consensus algorithms

State machine replication

This section lists papers on the application of consensus to State Machine Replication (SMR/RSMs) and Linearizability.

  • Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial, CSUR 1990 [acmdl,pdf]
  • Linearizability: A Correctness Condition for Concurrent Objects, TOPLAS 1990 [acmdl,pdf]
  • Implementing Linearizability at Large Scale and Low Latency, SOSP 2015 [acmdl,pdf]
  • Cheap and Available State Machine Replication, ATC 2016 [acmdl,pdf]
  • Fine-Grained Replicated State Machines for a Cluster Storage System, NSDI 2020 [pdf]

Reconfiguration

This section lists papers on reconfiguration.

  • The SMART Way to Migrate Replicated Stateful Services, EuroSys 2006 [acmdl,pdf]
  • Vertical Paxos and Primary-Backup Replication, PODC 2009 [acmdl,pdf]
  • Reconfiguring a State Machine, SIGACT News 2010 [acmdl,pdf]
  • Dynamic Reconfiguration of Primary/Backup Clusters, ATC 2012 [acmdl,pdf]
  • Unbounded Pipelining in Dynamically Reconfigurable Paxos Clusters, Unpublished 2016 [pdf]
  • Matchmaker Paxos: A Reconfigurable Consensus Protocol, Unpublished [pdf]

Weaker consistency models

This section lists papers which discuss alternative consistency models to linearizability or systems which depend upon clocks for correctness.

  • Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency, SOSP 1989 [acmdl,pdf]
    • This paper introduced the idea of leases for distributed caches. This idea is used in master leases and read quorum leases.
  • Towards Robust Distributed Systems, PODC 2000 [acmdl,pdf]
    • PODC keynote in which Eric Brewer proposed the now infamous CAP theorem
  • Chain replication for supporting high throughput and availability, OSDI 2004 [acmdl,pdf]
  • Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 [acmdl,pdf]
  • Bigtable: A Distributed Storage System for Structured Data, TOCS 2008 [acmdl,pdf]
  • What consistency does your key-value store actually provide?, HotDep 2010 [acmdl,pdf]
    • offline consistency checking of key-value traces
  • Cassandra - A Decentralized Structured Storage System, OSR 2010 [acmdl,pdf]
  • Benchmarking Cloud Serving Systems with YCSB, SoCC 2010 [acmdl,pdf]
    • Popular benchmarking tool for key-value stores
    • Actively maintained open source project with support for various data stores
  • Spanner: Google’s Globally-Distributed Database, OSDI 2012 [acmdl,pdf]
    • Provides linearizability but it assumes a bounded clock drift
    • Google implement this using TrueTime, GPS and atomic clocks in their data centers instead of NTP
    • Closed source but now available as a cloud service, Cloud Spanner
  • TAO: Facebook’s Distributed Data Store for the Social Graph, ATC 2013 [acmdl,pdf]
  • Eventual Consistency Today: Limitations, Extensions, and Beyond, ACM Queue 2013 [acmdl,pdf]
  • Quantifying eventual consistency with PBS, CACM 2014 [acmdl,pdf]
  • Existential Consistency: Measuring and Understanding Consistency at Facebook, SOSP 2015 [acmdl,pdf]
  • Minimizing coordination in replicated systems, PaPoC 2015 [acmdl,pdf]
  • Consistency in Non-Transactional Distributed Storage Systems, CSUR 2016 [acmdl,pdf]
  • Just say NO to Paxos Overhead: Replacing Consensus with Network Ordering, OSDI 2016 [acmdl,pdf]
  • The many faces of consistency, DE 2016 [pdf]
  • Spanner, TrueTime & The CAP Theorem, Tech Report 2017 [pdf]
  • Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases, SIGMOD 2017 [acmdl]
  • Fine-grained consistency for geo-replicated systems, ATC 2018 [pdf]
  • Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes, SIGMOD 2018 [acmdl]
  • Sharding the Shards: Managing Datastore Locality at Scale with Akkio, OSDI 2018 [acmdl,pdf]
  • On mixing eventual and strong consistency: Bayou revisited, PODC 2019 [arxiv,pdf]
  • Harmonia: Near-Linear Scalability for Replicated Storage with In-Network Conflict Detection, VLDB 2020 [pdf]
    • implemented on programmable switches
    • comes with TLA+ spec in tech report
  • Strong and Efficient Consistency with Consistency-Aware Durability, FAST 2020 [pdf]

Failures

This section lists papers which analyze real-world failures of distributed systems.

  • Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications, SIGCOMM 2011 [acmdl,pdf]
  • The Network is Reliable: An informal survey of real-world communications failures, ACM Queue 2014 [acmdl,pdf]
  • What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, SOCC 2014 [acmdl,pdf]
  • All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI 2014 [acmdl,pdf]
  • Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions, FAST 2017 [acmdl,pdf]
  • An Analysis of Network-Partitioning Failures in Cloud Systems, OSDI 2018 [acmdl,pdf]
  • CrashTuner: Detecting Crash-Recovery Bugs in Cloud Systems via Meta-Info Analysis, SOSP 2019 [acmdl]
  • The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure, SOSP 2019 [acmdl]
  • Toward a Generic Fault Tolerance Technique for Partial Network Partitioning, OSDI 2020 [pdf]

Clocks

The liveness of distributed consensus depends on some degree of clock synchronization. The following section lists papers on the topic of clock synchronization.

  • IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, Standard 1588-2008 [ieee]
  • Globally Synchronized Time via Datacenter Networks, SIGCOMM 2016 [acmdl,pdf]
  • Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization, NSDI 2018 [acmdl,pdf]
  • Sundial: Fault-tolerant Clock Synchronization for Datacenters, OSDI 2020 [pdf]

Correctness of consensus algorithms

This section lists papers on proving or testing the correctness of consensus algorithms.

  • Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers, Book 2002 [acmdl,pdf,website,amazon]
    • directory of TLA+ specs, including many for consensus algorithms: TLA+ Examples
  • A Proof of Correctness for Egalitarian Paxos, Tech report 2013 [pdf]
  • Verdi: A framework for implementing and formally verifying distributed systems, PLDI 2015 [acmdl,pdf]
  • IronFleet: Proving Practical Distributed Systems Correct, SOSP 2015 [acmdl,pdf]
  • Lineage-driven Fault Injection, SIGMOD 2015 [acmdl,pdf]
  • How Amazon web services uses formal methods, CACM 2015 [acmdl,html]
  • PSYNC: A partially synchronous language for fault-tolerant distributed algorithms, POPL 2016 [acmdl,pdf]
  • Ivy: safety verification by interactive generalization, PLDI 2016 [acmdl,pdf,code]
  • Brief Announcement: A Family of Leaderless Generalized-Consensus Algorithms, PODC 2016 [acmdl,pdf]
  • Paxos Made EPR: Decidable Reasoning about Distributed Protocols, OOPSLA 2017 [acmdl,pdf]
  • Growing a protocol, HotCloud 2017 [acmdl,pdf]
  • Teaching Rigorous Distributed Systems With Efficient Model Checking, EuroSys 2019 [acmdl,pdf]
  • FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems, Eurosys 2019 [acmdl,pdf]
  • Proving the Correctness of Disk Paxos in Isabelle/HOL, Unpublished 2019 [pdf]
  • I4: Incremental Inference of Inductive Invariants for Verification of Distributed Protocols, SOSP 2019 [acmdl]
  • Scaling symbolic evaluation for automated verification of systems code with Serval, SOSP 2019 [acmdl]
  • WormSpace: A Modular Foundation for Simple, Verifiable Distributed Systems, SoCC 2019 [acmdl]

Quorum systems

This section lists papers on quorum systems.

  • A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases, TODS 1979 [acmdl,pdf]
  • Weighted Voting for Replicated Data, SOSP 1979 [acmdl,pdf]
  • How to Assign Votes in a Distributed System, JACM 1985 [acmdl,pdf]
  • A √N algorithm for mutual exclusion in decentralized systems, TOCS 1985 [acmdl,pdf]
  • A Quorum-Consensus Replication Method for Abstract Data Types, TOCS 1986 [acmdl]
  • The Reliability of Voting Mechanisms, TC 1987 [acmdl,pdf]
  • The Tree Quorum Protocol: An Efficient Approach for Managing Replicated Data, VLDB 1990 [pdf]
  • An Efficient and Fault-tolerant Solution for Distributed Mutual Exclusion, TOCS 1991 [acmdl,pdf]
  • Hierarchical Quorum Consensus: A New Algorithm for Managing Replicated Data, TC 1991 [acmdl,pdf]
  • The Generalized Tree Quorum Protocol: An Efficient Approach for Managing Replicated Data, TODS 1992 [acmdl,pdf]
  • The Grid Protocol: A High Performance Scheme for Maintaining Replicated Data, TKDE 1992 [acmdl,pdf]
  • Enhancing concurrency and availability for database systems, Thesis [acmdl]
  • The Availability of Quorum Systems, Tech report 1993 [acmdl,pdf]
  • Crumbling Walls: A Class of Practical and Efficient Quorum Systems, PODC 1995 [acmdl,pdf]
  • Evaluating quorum systems over the Internet, PODC 1996 [acmdl]
  • An Adaptive Data Replication Algorithm, TODS 1997 [acmdl]
  • The Load, Capacity, and Availability of Quorum Systems, SIAM 1998 [acmdl,pdf]
  • Optimal availability quorum systems: Theory and practice, IPL 1998 [pdf]
  • Are Quorums an Alternative for Data Replication?, TODS 2003 [acmdl]
  • Coterie Availability in Sites, DISC 2005 [acmdl,pdf]
  • The virtue of dependent failures in multi-site systems, HotDep 2005 [acmdl,pdf]

Byzantine fault tolerance in distributed consensus

This section lists papers on Byzantine Fault Tolerance (BFT), often used as the basis of permissioned blockchains.

  • The Byzantine Generals Problem, ACM TOPLAS 1982 [acmdl,pdf]
  • Asynchronous consensus and broadcast protocols, JACM 1985 [acmdl,pdf]
  • Byzantine quorum systems, STOC 1997 [acmdl,pdf]
  • The load and availability of Byzantine quorum systems, PODC 1997 [acmdl]
  • Practical Byzantine Fault Tolerance, OSDI 1999 [acmdl,pdf]
  • Separating agreement from execution for byzantine fault tolerant services, SOSP 2003 [acmdl,pdf]
  • Byzantine disk paxos: optimal resilience with byzantine shared memory, PODC 2004 [acmdl,pdf]
  • Fault-Scalable Byzantine Fault-Tolerant Services, SOSP 2005 [acmdl]
    • requires 5f+1 instead of 3f+1
  • Fast Byzantine Consensus, IEEE TDSC 2006 [acmdl,pdf]
    • aka FaB
  • HQ Replication: A Hybrid Quorum Protocol for Byzantine Fault Tolerance, OSDI 2006 [acmdl,pdf]
  • Zyzzyva: speculative byzantine fault tolerance, SOSP 2007 [acmdl,pdf]
  • Attested Append-Only Memory: Making Adversaries Stick to their Word, SOSP 2007 [acmdl,pdf]
  • Tolerating Byzantine Faults in Transaction Processing Systems using Commit Barrier Scheduling, OSR 2007 [pdf,acmdl]
  • Bosco: One-Step Byzantine Asynchronous Consensus, DISC 2008 [acmdl,pdf]
    • byzantine consensus in 1 round (instead of the usual three) using quorums of 4f+1 from 5f+1
  • Matrix Signatures: From MACs to Digital Signatures in Distributed Systems, DISC 2008 [pdf]
  • Upright cluster services, SOSP 2009 [acmdl,pdf]
    • develops BFT forks of ZooKeeper and HDFS; source code is available but does not seem to be used/maintained
  • Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults, NSDI 2009 [acmdl,pdf]
  • TrInc: Small Trusted Hardware for Large Distributed Systems, NSDI 2009 [pdf,acmdl]
  • Zzyzx: Scalable Fault Tolerance through Byzantine Locking, DSN 2010 [pdf]
  • Leaderless Byzantine Paxos, DISC 2011 [pdf]
  • Byzantizing Paxos by Refinement, DISC 2011 [acmdl,pdf]
  • Byzantine Chain Replication, OPODIS 2012 [pdf]
  • Automatic Reconfiguration for Large-Scale Reliable Storage Systems, TDSC 2012 [pdf]
    • describes an approach to reconfigure BFT systems
  • The Next 700 BFT Protocols, TOCS 2015 [acmdl,pdf]
  • Consensus in the Age of Blockchains, Unpublished 2017 [arxiv]
  • Hardening Cassandra Against Byzantine Failures, OPODIS 2017 [pdf]
  • Revisiting Fast Practical Byzantine Fault Tolerance, Unpublished 2017 [arxiv]
    • describes bugs in Zyzzyva and FaB
  • HotStuff: BFT Consensus with Linearity and Responsiveness, PODC 2019 [acmdl,arxiv]
  • The latest gossip on BFT consensus, Unpublished 2018 [arxiv]
  • SBFT: a Scalable and Decentralized Trust Infrastructure, Unpublished 2019 [arxiv]
  • Stellar Consensus by Instantiation, DISC 2019 [pdf]
    • Includes Isabelle/HOL proof in AFP
  • Fast and secure global payments with Stellar, SOSP 2019 [acmdl]
    • Formal verification in Ivy and Isabelle/HOL
  • Flexible Byzantine Fault Tolerance, CCS 2019 [acmdl,pdf]
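
The replica counts quoted in the notes above (3f+1, 5f+1, quorums of 4f+1) follow from a simple counting argument, sketched here. The sketch assumes the common formulation in which any two quorums must overlap in at least f+1 replicas, so that at least one replica in the overlap is correct. It is not taken from any single paper in this list, and the function names are illustrative.

```python
def quorum_overlap(n: int, q: int) -> int:
    """Minimum number of replicas shared by any two quorums of size q out of n."""
    return max(0, 2 * q - n)


def overlap_has_correct_replica(n: int, q: int, f: int) -> bool:
    """True if any two quorums share at least one replica outside the f Byzantine ones."""
    return quorum_overlap(n, q) >= f + 1


def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3


if __name__ == "__main__":
    for f in range(1, 4):
        n, q = 3 * f + 1, 2 * f + 1
        # With n = 3f+1 and q = 2f+1 the overlap is exactly f+1.
        print(f, n, q, quorum_overlap(n, q), overlap_has_correct_replica(n, q, f))
        # The larger configuration quoted above: quorums of 4f+1 out of 5f+1.
        print(f, 5 * f + 1, 4 * f + 1, quorum_overlap(5 * f + 1, 4 * f + 1))
```

Dropping to n = 3f replicas with quorums of 2f leaves an overlap of only f, which could consist entirely of faulty replicas.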

Alternative fault models in distributed consensus

Most of the papers in this list handle crash faults. The previous section considers byzantine faults. This section considers the fault models between crash and byzantine.

  • Practical Hardening of Crash-Tolerant Systems, ATC 2012 [acmdl,pdf]
  • Visigoth Fault Tolerance, EuroSys 2015 [acmdl,pdf]
  • Protocol-Aware Recovery for Consensus-Based Storage, FAST 2018 [acmdl,pdf]
    • featured in the morning paper
    • enables nodes which lose persistent storage (e.g. due to corruption) to rejoin consensus systems without reconfiguration
    • Implemented & evaluated in LogCabin and Zookeeper, but no source code is available
    • Best paper award at FAST 2018
    • authors claim to have model checked with TLA+ but no spec is available
  • XFT: Practical Fault Tolerance beyond Crashes, OSDI 2016 [acmdl,pdf]

Misc

Blog posts, talks etc...

  • FaunaDB: An Architectural Overview [pdf]
  • Distributed Coordination Engine (DConE) [pdf]
  • Communication Costs in Real-world Networks [html]
  • Modeling Paxos and Flexible Paxos in Pluscal and TLA+ [html]
  • Waltz: A Distributed Write-Ahead Log [html]
  • Open-sourcing LogDevice, a distributed data store for sequential data [html]

Future reading list

The following lists contain places to watch for new writings in the field of distributed consensus.

Blogroll

Reading lists

Academic conferences & symposiums

  • USENIX Symposium on Networked Systems Design and Implementation (NSDI) [2019,2020]
  • USENIX Conference on File and Storage Technologies (FAST) [2019,2020]
  • European Conference on Computer Systems (EuroSys) [2019,2020]
  • IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) [2019,2020]
  • ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) [website,2019]
  • ACM SIGMOD/PODS Conference [2019,2020]
  • ACM SIGMETRICS / IFIP Performance [2019,2020]
  • ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) [2019,2020]
  • ACM Symposium on Theory of Computing (STOC) [2019,2020]
  • ACM Symposium on Principles of Distributed Computing (PODC) [website]
  • IEEE International Conference on Distributed Computing Systems (ICDCS) [2019,2020]
  • USENIX Annual Technical Conference (ATC) [2019,2020]
  • ACM Annual Conference of the Special Interest Group on Data Communication (SIGCOMM) [website,2019,2020]
  • International Conference on Very Large Data Bases (VLDB) [2019,2020]
  • USENIX Symposium on Operating Systems Design and Implementation (OSDI) [website,2018,2020]
    • Biennial, even years only
  • International Symposium on Reliable Distributed Systems (SRDS) [website,2019]
  • International Symposium on Distributed Computing (DISC) [website,2019]
  • International Conference on Principles of Distributed Systems (OPODIS) [2019]
  • ACM Symposium on Operating Systems Principles (SOSP) [2019]
    • Biennial, odd years only
  • ACM Symposium on Cloud Computing (SoCC) [2019]
  • Conference on Innovative Data Systems Research (CIDR) [2021]

Dan Tsafrir maintains a useful list of systems conferences by deadline.

Academic workshops

  • Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC) [2019]
  • ACM SIGOPS Workshop on Large-Scale Distributed Systems and Middleware (LADIS) [website]
  • USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage) [2019]
  • Workshop on Hot Topics in Operating Systems (HotOS) [2019]
  • ACM Workshop on Hot Topics in Networks (HotNets) [2019]
  • USENIX Workshop on Hot Topics in Cloud Computing (HotCloud) [2019]
  • USENIX Workshop on Hot Topics in Edge Computing (HotEdge) [2019]
  • International Workshop on Distributed Cloud Computing (DCC) [2019]

Academic journals & magazines

  • ACM Transactions on Computer Systems (TOCS) [website]
  • Journal of the ACM (JACM) [website]
  • Communications of the ACM (CACM) [website]
  • SIGOPS Operating Systems Review (OSR) [website]
  • ACM Computing Surveys (CSUR) [website]
  • ACM Transactions on Database Systems (TODS) [website]
  • ACM Queue [website]
  • ACM SIGACT News [website]
  • IEEE Transactions on Dependable and Secure Computing (TDSC) [website]
  • IEEE Transactions on Parallel and Distributed Systems (TPDS) [website]
  • IEEE Transactions on Computers (TC) [website]
  • IEEE Transactions on Software Engineering [website]
  • Journal of Systems Research (JSys) [website]
