Deterministic simulation testing -- the next step
Distributed systems are a hard problem. Networks, hardware, and even the universe itself all conspire to leave your system in an inconsistent state, which you will have the express joy of debugging in the middle of the night. When faced with the notion of trying to constrain how systems can fail, engineers generally turn to testing, with a focus on integration tests that ensure the system as a whole works as expected even in the most dire of circumstances.
While a good idea, in practice integration tests are seldom as comprehensive as they set out to be. Thanks to their complex setup requirements, they are the first to suffer from bit rot and tech debt, and debugging them can be just as difficult as debugging the actual failures themselves.
When designing a cluster controller for a new data store, I was faced with this exact problem. I wanted to build a robust system with a focus on correctness, and I wanted to validate this by running the system under a multitude of scenarios and asserting on the expected behavior. Luckily for me, this is a problem that engineers have been thinking about for a long time.
This article is an exploration of how combining state machines built with sans-io practices and deterministic simulation testing leads to a lot of joy.
I cannot thank Predrag Gruevski enough for pointing me in the right direction and helping me understand the more complex pieces.
The state machine§
The main focus of this controller is to be highly available with an emphasis on correctness, as this service will be the sole arbiter for keeping the data store instances in a consistent state. The best way to ensure that the controller does not end up in an inconsistent state is to make those states unrepresentable, which can be done by using state machines. There are many ways to design state machines in Rust, from complex typestates to plain match statements. Ana Hobden has an excellent article on the various ways of implementing state machines in Rust.
For our use case, the state machine only transitions from one state to the next when an input is provided, so we have a core step method, which is a function of the form (CurrentState, Input) -> NewState, implemented internally via a match statement.
struct Machine {
    state: State,
}

enum State {
    Initialized,
    // ...
    Failure,
}

enum Input {
    Initialized,
    // ...
    ClusterEnabledState(bool),
}

struct Endpoint {
    uri: String,
}

enum Message {
    CheckClusterEnabled(Endpoint),
}

impl Machine {
    fn step(&mut self, input: Input) -> &State {
        self.state = match (&self.state, input) {
            // All possible transitions here
            _ => State::Failure,
        };
        &self.state
    }
}
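To make this shape concrete, here is a minimal, runnable sketch of such a step method. The states and inputs are invented stand-ins for illustration, not the controller's real ones; the point is the (CurrentState, Input) match with the Failure escape hatch:

```rust
#[derive(Debug, PartialEq)]
enum DemoState {
    Initialized,
    ClusterChecked { enabled: bool },
    Failure,
}

enum DemoInput {
    ClusterEnabledState(bool),
}

struct DemoMachine {
    state: DemoState,
}

impl DemoMachine {
    // (CurrentState, Input) -> NewState, with Failure as the escape hatch.
    fn step(&mut self, input: DemoInput) -> &DemoState {
        self.state = match (&self.state, input) {
            (DemoState::Initialized, DemoInput::ClusterEnabledState(enabled)) => {
                DemoState::ClusterChecked { enabled }
            }
            // Any transition we did not explicitly model is a Failure.
            _ => DemoState::Failure,
        };
        &self.state
    }
}
```

Feeding the same input twice exercises both paths: the first call is a modeled transition, the second falls through to Failure.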
Sans-IO driver§
The match
statement allows us to handle all possible states and their
transitions, plus gives us an escape hatch in the form of the Failure
state
which is returned when an unexpected state transition occurs. State transitions
can have side effects, such as querying the cluster state or issuing a cluster
command. These side effects are not generated for every transition, so state
machine provides an API via a poll_next_message
method to check whether or
not such a request exists. This is why the step
method is designed to be
driven externally, via a method which might look something like this:
impl Machine {
    // step is the same
    fn poll_next_message(&self) -> Option<Message> { /* ... */ }
}

fn run_state_machine() {
    let mut machine = Machine {
        state: State::Initialized,
    };
    let mut current_input = Input::Initialized;
    loop {
        let new_state = machine.step(current_input);
        if new_state.is_actionable() {
            // Perform action
        }
        if let Some(message) = machine.poll_next_message() {
            current_input = handle_message_and_get_input(message);
        }
    }
}
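A self-contained toy version of this loop may help make the control flow concrete. Everything here (the states, the tick counter, the message handler) is invented for illustration; the shape to notice is that the driver pulls messages from the machine, fulfils them, and feeds the results back in until no further side effects are requested:

```rust
#[derive(Debug, PartialEq)]
enum LoopState {
    Initialized,
    Running,
    Done,
    Failure,
}

enum LoopInput {
    Initialized,
    Tick,
}

// A side-effect request the driver must fulfil.
struct LoopMessage;

struct LoopMachine {
    state: LoopState,
    ticks: u32,
}

impl LoopMachine {
    fn step(&mut self, input: LoopInput) {
        self.state = match (&self.state, input) {
            (LoopState::Initialized, LoopInput::Initialized) => LoopState::Running,
            (LoopState::Running, LoopInput::Tick) if self.ticks < 2 => {
                self.ticks += 1;
                LoopState::Running
            }
            (LoopState::Running, LoopInput::Tick) => LoopState::Done,
            _ => LoopState::Failure,
        };
    }

    // Only ask for side effects while running.
    fn poll_next_message(&self) -> Option<LoopMessage> {
        matches!(self.state, LoopState::Running).then_some(LoopMessage)
    }
}

// Stand-in for performing the requested side effect and reporting back.
fn handle_message(_message: LoopMessage) -> LoopInput {
    LoopInput::Tick
}

fn run_to_completion() -> LoopState {
    let mut machine = LoopMachine { state: LoopState::Initialized, ticks: 0 };
    let mut current_input = LoopInput::Initialized;
    loop {
        machine.step(current_input);
        match machine.poll_next_message() {
            Some(message) => current_input = handle_message(message),
            // No pending side effects: the machine has reached a terminal state.
            None => return machine.state,
        }
    }
}
```

Because the driver, not the external world, decides when each message is handled, the same sequence of inputs always produces the same sequence of states.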
This notion of querying the state machine for any messages that can cause side-effects comes from the principles of building a sans-io based system. See this article from FireZone to learn more about it, but by employing this pattern we get the following advantages:
- Synchronous core: The core state machine itself is not performing any actions that might require asynchronous operations (think network, FS, time, etc.).
- Deterministic execution: We will expand on this later, but by inverting control of external interactions, the code driving the state machine has complete control over when the action can be performed.
In order to map the state machine Input and Output types to specific actions, and their results, I have the IoCore type. By making this type generic over a simulator implementation, we can get the desired behavior we want. The Simulator trait itself will be covered in the next section.
struct IoCore<S> {
    simulator: S,
}

impl<S> IoCore<S>
where
    S: Simulator,
{
    pub fn handle_message_and_get_input(&self, message: Output) -> Result<Input, IoCoreError> {
        match message {
            // Handle various message types, for example an election message
            Output::CheckClusterEnabled(endpoint) => self
                .simulator
                .check_cluster_enabled(&endpoint.uri)
                .map(Input::ClusterEnabledState)
                .map_err(Into::into),
        }
    }
}
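A stripped-down, compilable version of this shape might look like the following. The message and input variants, the trait signature, and the fake simulator (including its endpoint string) are all invented for the sketch:

```rust
// Hypothetical message and input types mirroring the article's shapes.
enum Output {
    CheckClusterEnabled { uri: String },
}

#[derive(Debug, PartialEq)]
enum Input {
    ClusterEnabledState(bool),
}

#[derive(Debug, PartialEq)]
struct IoCoreError(String);

trait Simulator {
    fn check_cluster_enabled(&self, uri: &str) -> Result<bool, String>;
}

struct IoCore<S> {
    simulator: S,
}

impl<S: Simulator> IoCore<S> {
    // Map a side-effect request to a simulator call, and its result back to an Input.
    fn handle_message_and_get_input(&self, message: Output) -> Result<Input, IoCoreError> {
        match message {
            Output::CheckClusterEnabled { uri } => self
                .simulator
                .check_cluster_enabled(&uri)
                .map(Input::ClusterEnabledState)
                .map_err(IoCoreError),
        }
    }
}

// A fake simulator: the cluster is "enabled" only at one known endpoint.
struct FakeSimulator;

impl Simulator for FakeSimulator {
    fn check_cluster_enabled(&self, uri: &str) -> Result<bool, String> {
        Ok(uri == "http://primary:6379")
    }
}
```

Swapping FakeSimulator for any other Simulator implementation requires no changes to IoCore, which is exactly the property the generic buys us.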
This type is then passed to the driver, which can use it to drive the state machine.
Deterministic Simulation Testing§
We have the following pieces so far:
- A state machine.
- A driver to run the machine.
- An abstraction to call simulator APIs.
In order to close the loop, we bring in the concepts of deterministic simulation testing (DST). DST provides guidelines and techniques for building systems that can fully simulate external interactions, rather than prescribing specific implementation steps.
There's a lot of literature on this topic, and it provides excellent context about the problem space and the general guidelines. I'm someone who can internalize a concept only after building it, so it took me some time to wrap my head around all of the pieces that come together to make a system that is self-describing and replayable.
The neat thing is, DST systems work extremely well with state machines and systems built with sans-IO in mind, primarily because the core of the system has already been abstracted away from any source of non-determinism. And that's the main principle of DST -- non-determinism. We want to ensure that all interactions which can introduce non-determinism have been removed, such that there is only one logical series of steps that leads to a particular outcome. The simplest way of achieving this is to give your state machine, or the state machine driver, a single interface for all non-deterministic interactions.
This interface in my case was the Simulator trait, which housed all external-system methods:
pub trait Simulator {
    // Random value generation
    fn gen_node_id(&self, prefix: Option<String>) -> ElectionNodeId;

    // Cluster operations
    fn is_cluster_enabled(&self, endpoint: String) -> bool;
    fn failover(&self, leader: String, replica: String) -> Result<(), Error>;

    // Election operations
    // Can be made async fns if required
    fn create_node(&self, node: ElectionNodeId) -> impl Future<Output = Result<CreateState, Error>> + Send;

    // and so much more
}
By abstracting these operations behind the Simulator interface, we can have multiple implementations for specialized purposes. In my case I have two simulators: ConcreteSimulator and TestSimulator.
ConcreteSimulator§
This, as the name suggests, contains the concrete implementations that make the required calls into the external systems. Internally, it is a fairly thin shim over the API calls themselves, with the addition of tracing functionality. This is what makes the system self-describing. The ConcreteSimulator has functionality to serialize the inputs and outputs of every method that opts into tracing. This requires the input and output types of the methods to be serializable, including the error types, as we want to capture the full values. This is a hard requirement in order to be of use in the TestSimulator.
The values are written to a file by a background task that is started when a ConcreteSimulator instance is initialized. The file name, and whether or not recording is enabled, are easily configurable. While there are many ways of handling the serialization format, I went for simplicity and have ron-encoded input and output values along with some metadata like the method name. A single method invocation has the metadata along with the inputs and outputs serialized into a string, which is then sent to the worker thread to be written to the file. While it is possible to move the serialization to the worker thread as well, this system will only be running locally and on specific test clusters, so I'm fine with the additional cost of serialization on the main task, at the benefit of simplicity.
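The recording pipeline can be sketched with the standard library alone. This is a simplified stand-in, not the real implementation: where the real recorder ron-encodes the values and appends them to a file, this version formats a plain string per invocation and the worker buffers lines in memory, but the division of labor (serialize on the caller's task, write on a background worker fed by a channel) is the same:

```rust
use std::sync::mpsc;
use std::thread;

struct Recorder {
    sender: mpsc::Sender<String>,
    worker: thread::JoinHandle<Vec<String>>,
}

impl Recorder {
    fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<String>();
        // Background task: drains records and "writes" them (here: buffers them).
        let worker = thread::spawn(move || receiver.into_iter().collect());
        Recorder { sender, worker }
    }

    // Serialization happens on the caller's task; only the write is offloaded.
    fn record(&self, method: &str, input: &str, output: &str) {
        let line = format!("{method}({input}) -> {output}");
        let _ = self.sender.send(line);
    }

    fn finish(self) -> Vec<String> {
        drop(self.sender); // close the channel so the worker drains and exits
        self.worker.join().expect("recorder worker panicked")
    }
}
```

Dropping the sender closes the channel, which is what lets the worker's iterator terminate cleanly instead of blocking forever.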
TestSimulator§
The TestSimulator takes the files created by ConcreteSimulator and provides the replayability for the testing aspect of DST. The implementation of TestSimulator contains facilities for parsing the scenario files created by the ConcreteSimulator. It provides a core method, next_operation, which is responsible for deserializing the method invocation metadata as well as the input and output values of the method invocation.
fn next_operation<I, O>(&self, method_name: &'static str) -> Result<(I, O), SimulatorError>
where
    for<'a> I: serde::Deserialize<'a>,
    for<'b> O: serde::Deserialize<'b>,
{
    // Read the next chunk and deserialize the recorded method name
    let expected_method_name: String = ron::from_str(scenario_file)?;
    if method_name != expected_method_name {
        panic!(
            "Test failure because unexpected method called: {}, expected: {}",
            method_name, expected_method_name
        );
    }
    // Deserialize the input and output values
    Ok((input, output))
}
Every method in the Simulator implementation of TestSimulator simply calls the next_operation method with the expected input and output types:
impl Simulator for TestSimulator {
    fn is_cluster_enabled(&self, endpoint: String) -> bool {
        let (input, output) = self
            .next_operation::<String, bool>("is_cluster_enabled")
            .unwrap();
        if input != endpoint {
            panic!(
                "is_cluster_enabled called with unexpected endpoint: {}, expected: {}",
                endpoint, input
            );
        }
        output
    }
    // And so on
}
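The replay idea can be shown end-to-end in a self-contained sketch. Here the recorded operations are assumed to be already parsed into a queue (the real implementation deserializes them lazily from the ron scenario file), and the type and method names are simplified stand-ins:

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

// A recorded invocation, already parsed from the scenario file.
struct Operation {
    method: &'static str,
    input: String,
    output: String,
}

struct ReplaySimulator {
    // Interior mutability so methods can take &self, matching the trait.
    operations: RefCell<VecDeque<Operation>>,
}

impl ReplaySimulator {
    fn next_operation(&self, method_name: &str) -> (String, String) {
        let op = self
            .operations
            .borrow_mut()
            .pop_front()
            .expect("scenario exhausted: more calls were made than were recorded");
        // Order mismatch means a bug or a behavior change: fail loudly.
        assert_eq!(
            op.method, method_name,
            "unexpected method called: {method_name}, expected: {}",
            op.method
        );
        (op.input, op.output)
    }

    fn is_cluster_enabled(&self, endpoint: &str) -> bool {
        let (input, output) = self.next_operation("is_cluster_enabled");
        assert_eq!(input, endpoint, "called with an endpoint the recording did not see");
        output == "true"
    }
}
```

A call sequence that matches the recording replays the recorded outputs; any deviation in order or arguments panics, which is precisely the behavior-validation property described above.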
And with this, we get replayability along with behavior validation. If the order of method calls does not match the order present in the scenario file, then it is either a bug or a behavior change, and therefore needs to be investigated. If an input does not match the expected input, that is treated similarly.
Authoring tests now becomes as simple as instantiating the controller with the TestSimulator, which is given a scenario file to run through, and driving the state machine until an expected state is reached.
And the best part is, this abstraction is not limited to these two implementations; we could just as easily implement a concrete simulator that injects random failures driven by a seeded PRNG. The controller is continuously run with the simulator recording every method invocation, and if a failure occurs, we can simply run the recording through the TestSimulator to identify the exact sequence of steps and the resulting state that caused the failure.
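The key requirement for such a fault-injecting simulator is that its "randomness" be a pure function of the seed. A hand-rolled linear congruential generator is enough to sketch the idea without pulling in a crate (the constants, failure rate, and function names below are illustrative choices, not the real implementation):

```rust
// A tiny linear congruential generator: not suitable for real randomness,
// but fully deterministic and dependency-free, which is all fault injection needs.
struct Lcg {
    state: u64,
}

impl Lcg {
    fn new(seed: u64) -> Self {
        Lcg { state: seed }
    }

    fn next(&mut self) -> u64 {
        // Multiplier and increment from Knuth's MMIX LCG.
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.state
    }

    // Return true with probability of roughly `percent` in 100.
    fn should_fail(&mut self, percent: u64) -> bool {
        self.next() % 100 < percent
    }
}

// Given the same seed, the simulator injects the same failures in the same
// order, so any crash it provokes can be reproduced exactly from the seed.
fn failure_schedule(seed: u64, calls: usize) -> Vec<bool> {
    let mut rng = Lcg::new(seed);
    (0..calls).map(|_| rng.should_fail(10)).collect()
}
```

Logging the seed alongside the recorded scenario is then sufficient to replay the exact failure schedule that triggered a bug.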
Conclusion§
This technique is extremely flexible and extensible, allowing integration tests to be driven by scenarios while abstracting the entire outside world away from the core system. This allows for quicker feedback, and doesn't require the tests to set up the environment for each scenario individually. On top of this, the techniques can be built upon to automatically create new scenarios to test.
An aside§
One particularly helpful realization I had when implementing this system was that the functionally pure core is better architected as a pull-based system instead of a push-based one. This allows the core (in my case the state machine and its driver) to decide when to check external systems, instead of being notified of changes that can happen at any point in time. Inverting control this way not only makes the core logic simpler, it also makes the simulator implementations a lot cleaner.
A concrete (pun only slightly not intended) example of this is checking for election state and cluster state notifications. The controller needs to act on various notifications such as:
- Leader node disconnected
- Data store cluster node addition/failure
- Data store node primary failover
- A config file on the file system changing
- etc.
The sources of these events are disparate, and provide different mechanisms for delivering the actual notifications. By inverting control, the core system dictates when it wants to check on these notifications and act on them, which by extension makes it deterministic. The Simulator interface now only needs methods like fn check_cluster_notifications(&self) -> ClusterEvent and fn check_election_notification(&self) -> ElectionEvent, and therefore the implementors only need to care about serializing and deserializing the actual values, and not the entire mechanism of pushing notifications to the controller.
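The inversion can be sketched as a buffer between the push-based sources and the pull-based core. The event variants are invented for illustration, and this sketch takes &mut self where the real trait methods take &self, but the shape is the point: watchers append whenever they like, and the core drains one serializable value per call, at a time of its own choosing:

```rust
use std::collections::VecDeque;

// Hypothetical cluster events; the real set would match the bullet list above.
#[derive(Debug, PartialEq)]
enum ClusterEvent {
    NodeAdded(String),
    PrimaryFailover { from: String, to: String },
    None,
}

// Push sources (watchers, callbacks) append into this buffer as events
// arrive; the core never sees them until it asks.
struct NotificationBuffer {
    events: VecDeque<ClusterEvent>,
}

impl NotificationBuffer {
    fn new() -> Self {
        NotificationBuffer { events: VecDeque::new() }
    }

    fn push(&mut self, event: ClusterEvent) {
        self.events.push_back(event);
    }

    // The pull side: the core calls this when *it* is ready, which makes each
    // interaction a single serializable value -- easy to record and replay.
    fn check_cluster_notifications(&mut self) -> ClusterEvent {
        self.events.pop_front().unwrap_or(ClusterEvent::None)
    }
}
```

Because every call returns exactly one value, recording and replaying the notification stream reduces to serializing ClusterEvent, with no callback machinery to simulate.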