Reliable by Design: Applying Formal Methods to Distributed Systems

Building fast, scalable, and robust distributed systems such as Elasticsearch requires choosing the right algorithms for the right tasks. While many algorithms are readily available, often they need to be extended or adapted to fit the requirements of a real-world implementation. For instance, customizing or combining algorithms in non-trivial ways can easily break safety properties. Using formal methods in the development process of a distributed system helps to catch bugs in the design phase and reveal fundamental issues that testing might not easily uncover.

Learn how the Elasticsearch team is making use of formal methods in the design of distributed algorithms. They'll discuss their specification methodology, toolset, and their experiences applying it to the data replication and cluster consensus algorithms in Elasticsearch. In particular, they'll focus on their recent work on these two core algorithms using the TLA+ toolbox (based on Lamport's Temporal Logic of Actions) and the Isabelle/HOL theorem prover system.

David Turner

David is a distributed systems engineer at Elastic.

Yannick Welsch

Yannick Welsch is an Elasticsearch engineer based in Luxembourg. He works on the distributed bits of Elasticsearch, applying his experience in running large clusters as well as making use of his background in formal specification languages.