Apache ZooKeeper is an open source distributed coordination service originally developed at Yahoo and now at Apache. It is the core coordination service that we use here at Elastic Cloud. Critical data is stored in ZooKeeper and having a reliable backup system for it is vital. However, backing up and restoring a ZooKeeper cluster can be very tricky to do correctly.
A ZooKeeper database contains two broad categories of data: persistent and ephemeral. Persistent data is generally the same type of data stored in traditional datastores (RDBMSes, etc.). It is data that is accessed via traditional CRUD activity. Ephemeral data, however, is unique to ZooKeeper and is usually associated with state machine semantics. The presence of the ephemeral data implies a specific state.
ZooKeeper in production is deployed on multiple processes (typically 3 or 5) each of which maintains its own database. ZooKeeper's consistency guarantee is that
(n/2)+1 processes will receive writes. The important implication of this is that, at any given point in time, it must be assumed that one or more of the individual ZooKeeper databases are in a different consistent state than the others.
Regardless of whether the data is persistent or ephemeral, data stored in ZooKeeper can be:
- Transient: caches, reproducible data, etc.
- Source of truth: the data in ZooKeeper is the data identity
- Stateful: data that implies a state machine at a given point in time. E.g. locks, registries, etc.
Note: this is different than traditional datastores where data is nearly always source of truth.
ZooKeeper clients maintain a session with the server. The server implements this session as a special type of transaction that is stored in its database. Ephemeral nodes are tied to sessions and expire when sessions expire. ZooKeeper sessions are durable and fault tolerant. They are managed by the internal ZooKeeper leader instance.
Distributed State Machine
The transient and stateful data in ZooKeeper extends to data structures held in client processes to form a distributed state machine composed of ZooKeeper servers, client applications, etc. For example, five clients might be involved in a leader election meaning they all have created ephemeral nodes in ZooKeeper and the leader is executing some action. Thus, there are ephemeral nodes in ZooKeeper and objects in client memory that form a single state machine.
With source of truth datastores backups usually are done by periodic copying of a transaction log or some similar type of on-disk representation of the data. These datastores have built-in mechanisms to help/support backup and restore. In the best cases, restores can be achieved with no transaction losses. In the worst cases, only the most recent transactions are lost.
ZooKeeper creates a new class of backup/restore issues due to its use for transient and stateful data. Further, it has no built in support for backup and restore.
As stated above, at any point in time transient or stateful data combine with client data structures to create a distributed state machine. Thus, it is irrational to restore transient and stateful data as it's not possible to recreate the corresponding client data structures.
Back In Time
An improperly restored ZooKeeper can appear to go back in time to a previous state.
Imagine the following scenario:
- Three clients A, B and C attempt to acquire a ZooKeeper lock in the standard manner (each create an ephemeral sequential node, the client that gets the lowest sequence number acquires the lock).
- Client A acquires the lock and does some work.
- Client A releases the lock by deleting its ephemeral node.
- Client B then becomes the lock holder.
- A catastrophic fault occurs and all ZooKeeper server nodes are lost. However, the clients are not lost and go into a waiting state until the ZooKeeper ensemble is available.
- A restore is done of the most recent ZooKeeper transaction logs. However, the logs being restored were captured just prior to Client A deleting its ephemeral node.
- When the ZooKeeper ensemble comes back online, Client A's ephemeral node comes back from the dead and the distributed state machine is in an insane state. Client B will still think it is the lock holder but the corresponding ephemeral nodes will not imply this.
This is just one example. Copious other scenarios can be contemplated.
Backup of ZooKeeper by copying its transaction and snapshot logs is a reasonable method. It must be understood that data loss will occur for writes that happened past the time of last log copy.
These logs should never be restored without filtering. All ephemeral node transactions should appear to be deleted nodes from the logs before the ensemble is restored. This can be accomplished by removing all session transactions from the logs. Dealing with transient persistent nodes is more difficult as it's not possible to automatically identify these nodes. They could be identified by known path prefixes or some other method. Or, hopefully, there are no cases and this can be ignored.
Filtering of ZooKeeper's transaction log is very easily done. There are existing classes in the ZooKeeper library that can be used for this (e.g. https://github.com/phunt/zk-txnlog-tools).
Alternatively, if possible, all clients should be closed prior to restoring an ensemble from a backup. This would have a similar effect of ending all sessions and implying deletion of all ephemeral nodes.