ZooKeeper server theory of operation
ZooKeeperServer is designed to work in standalone mode and also
be extensible so that it can be used to implement the quorum based
version of ZooKeeper.
ZooKeeper maintains ordering when processing requests:
- All requests will be processed in order.
- All responses will return in order.
- All watches will be sent in the order that the update takes place.
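To make these guarantees concrete, here is a minimal client-side sketch using
the standard org.apache.zookeeper client API; the connect string, timeout, and
paths are assumptions for the example. The two asynchronous updates are applied
by the server, and their callbacks delivered, in submission order.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class OrderingDemo {
        public static void main(String[] args) throws Exception {
            // Connect to an assumed local server; the watcher just logs events.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000,
                    (WatchedEvent e) -> System.out.println("event: " + e));

            zk.create("/demo", "a".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT);
            // Two asynchronous updates issued back to back: the second is
            // applied after the first, and the callbacks fire in the same
            // order the requests were submitted.
            zk.setData("/demo", "b".getBytes(), -1,
                    (rc, path, ctx, stat) -> System.out.println("update 1 done"), null);
            zk.setData("/demo", "c".getBytes(), -1,
                    (rc, path, ctx, stat) -> System.out.println("update 2 done"), null);

            Thread.sleep(1000); // crude wait for the async callbacks
            zk.close();
        }
    }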
We will explain the three aspects of ZooKeeperServer: request processing, data
structure maintenance, and session tracking.
Request processing
Requests are received by the ServerCnxn. Demarshalling of a request is
done by ClientRequestHandler. After a request has been demarshalled,
ClientRequestHandler invokes the relevant method in ZooKeeper and marshals
the result.
If the request is just a query, it will be processed by ZooKeeper and returned.
Otherwise, the request will be validated and a transaction will be generated
and logged. The request will then wait until the transaction has been logged
before processing continues.
Requests are logged as a group. Transactions are queued up, and the SyncThread
processes them at a predefined interval (currently 20ms). The SyncThread
interacts with ZooKeeperServer through the txnQueue: transactions are added to
the txnQueue of the SyncThread via queueItem. When a transaction has been
synced to disk, its callback is invoked, which causes request processing to
complete.
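The sketch below illustrates this group-commit pattern. The txnQueue and
queueItem names mirror the description above, but the code is a simplified
stand-in, not the actual SyncThread implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.LinkedBlockingQueue;

    // Simplified stand-in for the SyncThread/txnQueue mechanism.
    class SyncThreadSketch extends Thread {
        interface Txn { void onSynced(); } // callback run once the txn is durable

        private final LinkedBlockingQueue<Txn> txnQueue = new LinkedBlockingQueue<>();

        // Request processing hands its transaction (with a completion
        // callback) to the sync thread, then waits for onSynced.
        void queueItem(Txn t) { txnQueue.add(t); }

        @Override public void run() {
            List<Txn> batch = new ArrayList<>();
            while (!isInterrupted()) {
                try {
                    Thread.sleep(20);             // the predefined interval (20ms)
                } catch (InterruptedException e) {
                    return;
                }
                txnQueue.drainTo(batch);          // grab everything queued so far
                if (batch.isEmpty()) continue;
                appendAndFlushToLog(batch);       // one group append + one flush
                for (Txn t : batch) t.onSynced(); // resume each waiting request
                batch.clear();
            }
        }

        private void appendAndFlushToLog(List<Txn> batch) {
            // In the real server this serializes the transactions to the log
            // file and forces them to disk before any callback runs.
        }
    }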
Data structure maintenance
ZooKeeper data is stored in memory. Each znode is stored in a DataNode object.
This object is accessed through a hash table that maps paths to DataNodes.
DataNodes also organize themselves into a tree. This tree is only used for
serializing nodes.
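A minimal sketch of this dual organization follows; the names are
hypothetical, and the real DataNode carries much more state (stat, ACLs,
and so on).

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Flat path -> node hash table for lookups, plus parent/child links
    // that are only walked when the tree is serialized.
    class DataTreeSketch {
        static class Node {
            byte[] data;
            final Set<String> children = new HashSet<>(); // child name components
        }

        private final Map<String, Node> nodes = new ConcurrentHashMap<>();

        DataTreeSketch() {
            nodes.put("/", new Node()); // root
        }

        // Normal reads and writes go through the hash table: one lookup
        // per path, no tree traversal.
        Node lookup(String path) { return nodes.get(path); }

        // Assumes the parent already exists, as a real server would verify.
        void create(String path, byte[] data) {
            String parent = path.substring(0, Math.max(1, path.lastIndexOf('/')));
            String name = path.substring(path.lastIndexOf('/') + 1);
            Node n = new Node();
            n.data = data;
            nodes.put(path, n);
            nodes.get(parent).children.add(name); // tree link, used for serialization
        }
    }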
We guarantee that changes to nodes are stored to non-volatile media before
responding to a client. We do this quickly by writing changes as a sequence
of transactions in a log file. Even though we flush transactions as a group,
we need to avoid seeks as much as possible. Also, since the server can fail
at any point, we need to be careful of partial records.
We address the above problems by
- Pre-allocating 1M chunks of file space. This allows us to append to the
file without causing seeks to update file size. It also means that we need
to check for the end of the log by looking for a zero length transaction
rather than simply the end of the file.
- Writing a signature at the end of each transaction. When processing
transactions, we only use transactions that have a valid signature at the end.
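To make the recovery rules concrete, here is a sketch of a log reader under
an assumed record layout (length prefix, payload, trailing checksum as the
signature); the real on-disk format differs, but the stopping conditions are
the ones described above: a zero-length record marks the end of the
pre-allocated log, and a bad signature marks a partial record.

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.Adler32;

    class LogReaderSketch {
        static void replay(String file) throws IOException {
            try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
                while (true) {
                    int len = in.readInt();
                    if (len == 0) break;             // zero length: end of log, not EOF
                    byte[] txn = new byte[len];
                    in.readFully(txn);
                    long sig = in.readLong();        // signature written after each txn
                    Adler32 ck = new Adler32();
                    ck.update(txn, 0, len);
                    if (ck.getValue() != sig) break; // partial record from a crash
                    apply(txn);                      // valid transaction: apply it
                }
            } catch (EOFException ignored) {
                // A truncated tail also ends the usable portion of the log.
            }
        }

        static void apply(byte[] txn) { /* hand off to the data tree */ }
    }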
As the server runs, the log file will grow quite large. To avoid long startup
times we periodically take a snapshot of the tree of DataNodes. We cannot
take the snapshot synchronously as the data takes a while to write out, so
instead we asynchronously write out the tree. This means that we end up
with a "fuzzy" snapshot of the data tree. More formally, if we define T
to be the real snapshot of the tree at the time we begin taking the snapshot
and l as the sequence of transactions that are applied to the tree between
the time the snapshot begins and the time the snapshot completes, we write
to disk T+l' where l' is a subset of the transactions in l. While we do not
have a way of figuring out which transactions make up l', it doesn't really
matter. T+l'+l = T+l since the transactions we log are idempotent (applying
the transaction multiple times has the same result as applying the transaction
once). So when we restore the snapshot we also play all transactions in the log
that occur after the snapshot was begun. We can easily figure out where to
start the replay because we start a new logfile when we start a snapshot. Both
the snapshot file and log file have a numeric suffix that represents the
transaction id at which the respective file was created.
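The sketch below shows why this recovery is safe, under the assumption that
each logged transaction writes an absolute value to a path, which is what
makes replay idempotent. All names here are hypothetical.

    import java.util.HashMap;
    import java.util.Map;

    class FuzzyRestoreSketch {
        // A logged transaction: the zxid orders it, the value is absolute.
        record Txn(long zxid, String path, byte[] value) {}

        // Restore = load the fuzzy snapshot (T + l'), then replay every
        // logged txn from the point the snapshot began (all of l). Txns
        // already reflected in l' are applied a second time, harmlessly.
        static Map<String, byte[]> restore(Map<String, byte[]> fuzzySnapshot,
                                           Iterable<Txn> logSinceSnapshotStart) {
            Map<String, byte[]> tree = new HashMap<>(fuzzySnapshot);
            for (Txn t : logSinceSnapshotStart) {
                tree.put(t.path(), t.value()); // idempotent: last write wins
            }
            return tree;
        }
    }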
Session tracking
Rather than tracking sessions exactly, we track them in batches that are
processed at fixed intervals. This is easier to implement than exact
session tracking and it is more efficient in terms of performance. It also
provides a small grace period for session renewal.
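A sketch of this batching scheme follows, assuming expiration times are
rounded up to a fixed interval so that sessions fall into shared buckets;
rounding up is also what produces the grace period mentioned above. The
class and method names are hypothetical.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    class SessionBucketsSketch {
        private final long tickMs;
        private final Map<Long, Set<Long>> buckets = new HashMap<>(); // expiry -> session ids

        SessionBucketsSketch(long tickMs) { this.tickMs = tickMs; }

        // Round an absolute time up to the next tick boundary.
        private long roundUp(long timeMs) {
            return (timeMs / tickMs + 1) * tickMs;
        }

        // Called on every heartbeat: the session moves to a later bucket.
        // (A real tracker would also index session -> bucket to avoid the scan.)
        void touch(long sessionId, long nowMs, long sessionTimeoutMs) {
            long expiry = roundUp(nowMs + sessionTimeoutMs);
            for (Set<Long> bucket : buckets.values()) bucket.remove(sessionId);
            buckets.computeIfAbsent(expiry, k -> new HashSet<>()).add(sessionId);
        }

        // Called once per tick: expire whole batches whose deadline has passed.
        Set<Long> expireBatch(long nowMs) {
            Set<Long> expired = new HashSet<>();
            Iterator<Map.Entry<Long, Set<Long>>> it = buckets.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Long, Set<Long>> e = it.next();
                if (e.getKey() <= nowMs) {
                    expired.addAll(e.getValue());
                    it.remove();
                }
            }
            return expired;
        }
    }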