Thursday, June 29, 2017

Using TLA + to Understand Your System

A couple years ago I read How Amazon Web Services Uses Formal Methods  which discussed how to use Temporal Logic Analysis to find very infrequent, but persistent, errors in distributed system. Although it sounded intriguing, I shelved it for lack of an appropriate use case (& time).

The appropriate situation and time to pursue it finally arrived. Using TLA+ was an interesting experience, reminding me of Rate Monotonic Analysis (RMA).  RMA similarly assures that, if the analysis is correct, aka the system has been described correctly, certain system characteristics are guaranteed. In the RMA case, the guarantee is that the time between two events will not exceed a specified limit. In the TLA case, the system will not deadlock etc.

Such guarantees are the core advantage of these techniques: telling you that if the system does not abide by the guarantee, e.g., hangs, time window exceeded, there is definitely something wrong, either with the system or with your understanding of the system. By refining your debugging/analysis these techniques allow you to focus upon the specific part that is the cause of the problem.

This is what I was prepared for going in. The interesting thing that I found in using TLA was that, as opposed to RMA, it forced a very different way of thinking about the system design. This is partly because I used them in different situations: RMA was used to analyze a small distributed system that performed repetitive processing, with all the processes residing in a cabinet. Distributed, yes, but also hard real time, to debug a frequently occurring failure — the longest analysis window was 24 ms (with 100 microsecond granularity). TLA, on the other hand, was for a multi cloud operation that failed infrequently with the long time window, high expected variance environment that implies.

In this situation, TLA proved useful just while writing the spec. One of the most difficult questions involved in specifying the system is determining the level of detail to pursue. The rule of thumb I normally follow is that if the spec doesn’t match the behavior, you need more detail. This works for debugging, which was all I was doing in either case.

For the relatively simple system I was considering: a number of independent processes making calls to a (throttled) cloud service, parsing the results and storing them for analysis. The TLA spec was starting as a few queues with an overall throttle. Queue analysis usually has some sort of model of queue arrival and service times which drives overall system behavior. TLA doesn’t offer that kind of facility, which puzzled me for a bit. I finally realized that the lack of such a facility is implicit in the goal of TLA, since TLA’s goal is to explore the problem’s entire state space. Inevitably a state will occur that’s triggered by something will take way too much or way too little time and the queue will fill.

Just thinking through the TLA spec forced me to think through: what happens then? In my case the answer was: drop the requests on the floor, which then raised the followup question: what value does the queue add? The answer was not much, prompting it’s removal and the attendant simplification of the design.

This highlights a useful side benefit of TLA: exploring the state space is a much different mental exercise than covering your core use cases. I’d call it outlier in rather than modal out — useful for pressure testing your designs, even if you only take TLA to a coarse level of detail.

Explore the TLAplus google group for more info on TLA