Thursday, June 29, 2017

Using TLA+ to Understand Your System

A couple of years ago I read How Amazon Web Services Uses Formal Methods, which discussed how to use the Temporal Logic of Actions (TLA) to find very infrequent, but persistent, errors in distributed systems. Although it sounded intriguing, I shelved it for lack of an appropriate use case (& time).

The appropriate situation and time to pursue it finally arrived. Using TLA+ was an interesting experience, reminding me of Rate Monotonic Analysis (RMA). RMA similarly assures that, if the analysis is correct, aka the system has been described correctly, certain system characteristics are guaranteed. In the RMA case, the guarantee is that the time between two events will not exceed a specified limit; in the TLA case, that the system will not deadlock, etc.

Such guarantees are the core advantage of these techniques: they tell you that if the system does not abide by the guarantee, e.g., hangs, or exceeds the time window, there is definitely something wrong, either with the system or with your understanding of the system. By refining your debugging/analysis, these techniques allow you to focus on the specific part that is the cause of the problem.

This is what I was prepared for going in. The interesting thing I found in using TLA was that, as opposed to RMA, it forced a very different way of thinking about the system design. This is partly because I used them in different situations: RMA was used to analyze a small distributed system that performed repetitive processing, with all the processes residing in a cabinet. Distributed, yes, but also hard real time, used to debug a frequently occurring failure — the longest analysis window was 24 ms (with 100 microsecond granularity). TLA, on the other hand, was for a multi-cloud operation that failed infrequently, with the long time window, high expected variance environment that implies.

In this situation, TLA proved useful just while writing the spec. One of the most difficult questions involved in specifying the system is determining the level of detail to pursue. The rule of thumb I normally follow is that if the spec doesn’t match the behavior, you need more detail. This works for debugging, which was all I was doing in either case.

The system I was considering was relatively simple: a number of independent processes making calls to a (throttled) cloud service, parsing the results, and storing them for analysis. The TLA spec started as a few queues with an overall throttle. Queue analysis usually has some sort of model of queue arrival and service times which drives overall system behavior. TLA doesn't offer that kind of facility, which puzzled me for a bit. I finally realized that the lack of such a facility is implicit in the goal of TLA: since TLA's goal is to explore the problem's entire state space, inevitably a state will occur in which something takes way too much or way too little time, and the queue will fill.
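The exhaustive-exploration point can be sketched outside TLA+ as well. The following Java fragment is my own illustration (the bounded-queue model and its capacity are assumptions, not the actual spec): it enumerates every queue depth reachable when arrivals and services may interleave arbitrarily, with no timing model at all.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class QueueStates {
    // Explore every reachable state of a bounded queue where, at each step,
    // either an arrival or (if non-empty) a service may occur -- the moral
    // equivalent of a nondeterministic next-state relation in TLA+.
    static Set<Integer> reachableDepths(int capacity) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> frontier = new ArrayDeque<>();
        frontier.add(0); // start with an empty queue
        while (!frontier.isEmpty()) {
            int depth = frontier.pop();
            if (!seen.add(depth)) continue;                // already explored
            if (depth < capacity) frontier.add(depth + 1); // arrival
            if (depth > 0) frontier.add(depth - 1);        // service
        }
        return seen;
    }

    public static void main(String[] args) {
        // With no timing assumptions, the full state is always reachable,
        // so the spec must say what happens when the queue fills.
        System.out.println(QueueStates.reachableDepths(5).contains(5)); // prints "true"
    }
}
```

Since every depth up to the capacity is reachable, "the queue fills" is not an edge case to be argued away; it's a state the spec has to handle.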

Just thinking through the TLA spec forced me to ask: what happens then? In my case the answer was: drop the requests on the floor, which then raised the follow-up question: what value does the queue add? The answer was not much, prompting its removal and the attendant simplification of the design.

This highlights a useful side benefit of TLA: exploring the state space is a much different mental exercise than covering your core use cases. I'd call it outlier-in rather than modal-out — useful for pressure testing your designs, even if you only take TLA to a coarse level of detail.

Explore the TLA+ Google group for more info on TLA.

Thursday, July 2, 2015

NSURLSession & CoreData

I just moved one of my apps from using an NSURLConnection to using an NSURLSession for REST requests, as this permits background data updates (since my app provides ambient cueing, it's critical that it run in the background). I had more difficulty doing this than expected. This was partly because it involved moving from a single-threaded to a multi-threaded operational environment, but also because I didn't completely understand the purpose of an NSManagedObjectContext, nor the difference between that and the NSPersistentStoreCoordinator (my lack of understanding might be due, in part, to the API's dynamic nature — from Apple's doc: This is a preliminary document for an API or technology in development.)

My concerns were simple: what code drives the process, when does that code have control, and what governs the impact of the code's execution (which turns out to be key in this situation, since the impact is NOT necessarily governed by the process that has control).

My app has simple, relatively standard, behaviors around web services and CoreData (CoreData storage is in a SQLite database). The sequence is as follows:

On startup, the app initializes some persistent CoreData structures as necessary.
It then updates configuration data from a web service (& persists it). A remnant of a previous design, it uses NSURLConnection.
A background thread is then started running an NSURLSession that gets new data from a web service, analyzes it, persists the data and should update the model used by the GUI.

My goal was to get the GUI to display the new data (in a timely manner, of course). I tried a few different strategies:

Just trying it with my current configuration settings didn't work. Data initialized OK. Data from the background thread hit SQLite, but "spontaneous updating" of the data on the main GUI thread didn't happen. (I started using SQLite Pro a few months ago; its ability to show the data that's hit the db has been a tremendous time saver.)

I tried changing the merge policy of the main thread's NSManagedObjectContext to NSMergeByPropertyStoreTrumpMergePolicy, which seemed like the behavior I wanted; that didn't help either (I reverted the change to keep myself in a known state).

It turns out that the accepted way to do this is to take the values provided by the save call that commits the changes, and then send them to the context that requires updating. There is a strong caveat, however: the receiving context must be set to NSMergeByPropertyStoreTrumpMergePolicy. I had a bit of trouble accepting this at first. After all, the call to mergeChangesFromContextDidSaveNotification is coming from another context, not from the data store itself.

However, the implementation pattern belies that understanding: the source context is not necessarily responsible for the message to the "stale" context. Any observer registered with the NotificationCenter for NSManagedObjectContextDidSaveNotification could be the source for the update, as the notification is posted when the save to the persistent store completes. A recommended pattern is for the thread responsible for committing the changes to add/remove itself as an observer immediately around the save, as shown below.

NSNotificationCenter.defaultCenter().addObserver(self, selector: "notifySiblingContext:", name: NSManagedObjectContextDidSaveNotification, object: context)

NSNotificationCenter.defaultCenter().removeObserver(self, name: NSManagedObjectContextDidSaveNotification, object: context)

Where notifySiblingContext is the function that updates the stale context(s).

@objc func notifySiblingContext(notification: NSNotification){
    // mainContext stands in for whatever stale context needs updating;
    // perform the merge on that context's own queue.
    mainContext.performBlock {
        self.mainContext.mergeChangesFromContextDidSaveNotification(notification)
    }
}

In summary, the context holding the modified objects and initiating the save should be set to NSMergeByPropertyObjectTrumpMergePolicy, while the context receiving the update should be set to the complementary NSMergeByPropertyStoreTrumpMergePolicy. The processing of the updates is effected by adding/removing the updating process as an observer of the save notification immediately around the save.

Notification of a successful save kicks off the update process for the sibling context(s).

Debugging tip: if the merge doesn't appear to be doing anything, check that the context getting the mergeChangesFromContextDidSaveNotification message is the same as the context you're observing.

Friday, June 26, 2015

Java: Leak, or Memory Pressure?

I recently encountered a memory allocation issue that I hadn’t seen before in (many) years of programming in GC languages (primarily Lisp/Java). The initial indication of a problem occurred a few months ago when a long running process (fetch data from web service, store data in mysql db, analyze data, store results in mysql) ran into a GC overhead error:

java.lang.OutOfMemoryError: GC overhead limit exceeded
running with a -Xmx ~ 200M. 

I did some profiling, found a few leaks, but couldn't make the problem disappear. I also looked at it again in NetBeans, profiling with allocation stack traces recorded. Although I did find a leak caused by some of my Hibernate query constructs (repetitively generating a string rather than using a named query), a complete diagnosis evaded me.

This week, as part of optimizing the process for final hosting on AWS, I revisited the issue: tightening the memory allocation to more quickly induce failure, and watching the GC logs. The GC logs yielded this surprising result:

11665.495: [Full GC (Ergonomics) 77812K->73371K(79872K), 0.0841821 secs]
11665.602: [Full GC (Ergonomics) 77812K->73371K(79872K), 0.0847730 secs]
11665.708: [Full GC (Ergonomics) 77812K->73395K(79872K), 0.0770183 secs]
11665.800: [Full GC (Ergonomics) 77812K->73371K(79872K), 0.0828496 secs]
11665.905: [Full GC (Ergonomics) 77812K->73369K(79872K), 0.0871271 secs]
11666.019: [Full GC (Ergonomics) 77812K->73369K(79872K), 0.0756355 secs]

There was no memory leak (explaining, in part, why it was so hard to find…)! The process ran for ~ 2 hours before finally hitting

java.lang.OutOfMemoryError: GC overhead limit exceeded

This isn't a surprise, given the log messages above: six Full GCs in ~ half a second — the process was spending a lot of time in the GC.
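The pause arithmetic can be checked mechanically. This Java sketch (the class and regex are mine, not part of the original analysis) parses Full GC lines like those above and computes the fraction of wall-clock time spent paused:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcOverhead {
    // Matches lines like:
    // 11665.495: [Full GC (Ergonomics) 77812K->73371K(79872K), 0.0841821 secs]
    static final Pattern LINE = Pattern.compile(
        "(\\d+\\.\\d+): \\[Full GC .*?, (\\d+\\.\\d+) secs\\]");

    // Fraction of the wall-clock window covered by the listed pauses.
    static double pauseFraction(List<String> lines) {
        double first = -1, last = 0, lastPause = 0, total = 0;
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) continue;
            double start = Double.parseDouble(m.group(1)); // JVM uptime, secs
            double pause = Double.parseDouble(m.group(2)); // pause duration
            if (first < 0) first = start;
            last = start;
            lastPause = pause;
            total += pause;
        }
        double window = (last + lastPause) - first;
        return total / window;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "11665.495: [Full GC (Ergonomics) 77812K->73371K(79872K), 0.0841821 secs]",
            "11665.602: [Full GC (Ergonomics) 77812K->73371K(79872K), 0.0847730 secs]",
            "11665.708: [Full GC (Ergonomics) 77812K->73395K(79872K), 0.0770183 secs]",
            "11665.800: [Full GC (Ergonomics) 77812K->73371K(79872K), 0.0828496 secs]",
            "11665.905: [Full GC (Ergonomics) 77812K->73369K(79872K), 0.0871271 secs]",
            "11666.019: [Full GC (Ergonomics) 77812K->73369K(79872K), 0.0756355 secs]");
        System.out.printf("fraction of time in GC: %.0f%%%n", pauseFraction(log) * 100);
    }
}
```

For the six lines above this works out to roughly 82% of the elapsed window spent inside Full GC pauses, heading toward the (default) 98% GC-time threshold at which the JVM throws the overhead error.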

Poking around the web a bit I found a thread on StackOverflow suggesting a GC-related JVM flag. With this setting, the process reached the final "tenured memory" value quickly and stayed there for a while. However, the process began running very slowly, and the GC still couldn't keep up, eventually yielding a straight-up, vanilla

java.lang.OutOfMemoryError: Java heap space

Again, this happened even though the tenured heap memory was stable and the value hadn’t increased for hours.

So, what’s going on? Memory pressure. Memory pressure indicates the ability of the memory management infrastructure to keep up with the processes that depend on it. In this case the management infrastructure consists of GC, marking (+ potentially compacting) free areas, and putting them on the free list. The dependent process is the Java analysis program. 

For background, here are some memory pressure links:

I think we all realize that this would be possible in the abstract, although I've never personally encountered it before in my code.

At the moment my solution is "punt & restart the process every few hours", so this post is a work in progress. Since the tenured memory value is stable for each run, but varies with the -Xmx value, I assume not only that there are other tunable values in the system that might impact this performance, but also that I can reduce memory pressure through more clever caching and retrieval methods.

Longer term, the options break down into:

Run Slower: reduce the frequency of the demands that the Java process is making on the memory subsystem, aka slow down the repetition rate for updates

Run Leaner: reduce the impact of the demands that the Java process is making on the memory subsystem: create less garbage by creating fewer transient objects, etc.
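A toy illustration of "Run Leaner" in Java (the formatting step is a hypothetical stand-in for a hot spot in the analysis loop, not the actual code): produce the same result while allocating fewer transient objects for the GC to chase.

```java
public class RunLeaner {
    // Churny version: each call allocates several intermediate Strings,
    // all of which become garbage immediately.
    static String formatChurny(long id, double value) {
        String s = "";
        s += "id=" + id;
        s += ",value=" + value;
        return s;
    }

    // Lean version: one reused buffer; only the final String is allocated.
    // NOTE: the shared buffer makes this single-threaded only.
    static final StringBuilder SCRATCH = new StringBuilder();

    static String formatLean(long id, double value) {
        SCRATCH.setLength(0);
        SCRATCH.append("id=").append(id).append(",value=").append(value);
        return SCRATCH.toString();
    }

    public static void main(String[] args) {
        // Same output, different allocation profile.
        System.out.println(formatChurny(7, 1.5).equals(formatLean(7, 1.5))); // prints "true"
    }
}
```

The reused buffer trades thread-safety for allocation rate, which is fine in a single-threaded loop; the same move applies to collections, parsers, and formatters.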


PS: I tried using -XX:+UseParallelGC, and the process crashed in < 2 minutes, which was spectacular.

Wednesday, December 17, 2014

Twitter Hashtags: the ideal knowledge management tool?

I loosely define knowledge management as the ability to find (project-)relevant information from other projects, current or historic (I'm also using the term information very loosely here: I don't mean information in the information-theoretic sense, but in the colloquial sense: unvetted data).

Relevance is hard: it's easy enough to create PDFs and dump them into a document store. That renders them findable; but finding the relevant document (or portion thereof) is difficult-to-impossible in most corporate environments. This is partly because there isn't that much linking going on, so you can't do PageRank tricks, and partly because the data being addressed isn't commensurable.

The two axes of finding and relevance should be correlated, but aren't in practice. This is because the terms used for retrieval are "the same" but incommensurable between the documents. The search terms either
  • Are not specific enough,
  • Have changed through usage, or
  • Rest on background assumptions that have changed over time.

The latter is especially true in clinical trials. E.g., can you compare results between two studies, one of which had a two-value (M/F) dropdown for sex while the other had a three-value (M/F/U) dropdown? The clear answer is maybe. The more maybes your searches turn up, the less incentive there is to even bother looking.

Similar situations occur when the context of the documents change e.g.,
  • SOP’s change
  • Standards for acceptable data change
  • Drug target changes disease area
  • etc.

My experience in watching scientific systems develop/evolve (devolve?) over time is that you start off with a small ball of coherent thought. This coherence is the result of a good development process: one that vets and resolves differences in vocabulary use with all the stakeholders in the project.

Such balls of coherence are inherently unstable: without strong curation, changes in personnel and funding levels cause entropic decline. Of course, curatorial work is hard to protect when funding gets reduced (and funding always eventually gets reduced), since its loss rarely has a negative impact on short term corporate goals. 

This is even true in regulated environments: the regulatory goal is to assure that all the data within a submission is coherent; assuring that the data is coherent with other submissions is less important.

The result is that these carefully crafted balls of coherence decay from a coherent set into a fuzzy idea.

Twitter hashtags don't seem to do that. Worst case, they seem to go from one coherent set to another: two temporally displaced uses of the same tag that designate different sets of things, which may or may not be correlated.

There appear to be three main causes for this:
  1. Each tagged tweet only has a few tags, usually only one (there are only 140 characters/tweet; if you're not careful, your tweet will consist entirely of tags).
  2. People using a hashtag search Twitter for other tweets that use the same hashtag; if there's a conflict, the situation often autocorrects.
  3. The combination of these two factors assures that the tag targets a specific sense that the community has in its head at a particular time. Once the sense of the items starts to shift, a new tag must be used, or members of the community cannot easily identify the relevant content.

The net result is something that's very computationally friendly: hashtags are tightly targeted -- if our search goal encompasses more than one hashtag, it is easy to group items with different hashtags together by setting up a "synonym table" that gathers all the results with hashtags A or B or E. On the other hand, standard document search in a corporate environment is more of a situation where A, B and E were all marked A (same term, different contexts) and now we need to somehow distinguish them. The phrase "near impossible" comes to mind.
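A minimal sketch of that synonym table in Java (class, tag, and item names are invented for illustration): map each tag to a canonical tag, then group items under the canonical tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SynonymTable {
    // Group tagged items under a canonical tag via a synonym table,
    // e.g., results tagged A, B, or E all roll up to A.
    static Map<String, List<String>> group(Map<String, String> synonyms,
                                           Map<String, String> itemToTag) {
        Map<String, List<String>> out = new TreeMap<>();
        itemToTag.forEach((item, tag) -> {
            // A tag absent from the synonym table is its own canonical form.
            String canonical = synonyms.getOrDefault(tag, tag);
            out.computeIfAbsent(canonical, k -> new ArrayList<>()).add(item);
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> synonyms = Map.of("#B", "#A", "#E", "#A");
        Map<String, String> items = Map.of(
            "tweet1", "#A", "tweet2", "#B", "tweet3", "#E", "tweet4", "#C");
        System.out.println(group(synonyms, items).keySet()); // prints "[#A, #C]"
    }
}
```

The hard part isn't the lookup; it's that tightly targeted tags make the synonym table short and maintainable, whereas incommensurable corporate search terms would need one entry per context.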

This makes me think of a tagging UI, which allows up to 80 characters of # indicated tags, and dynamically shows the search results of these tags against the relevant content repository, cutting results when the tag hasn't been used for a period of time. 

Wednesday, August 20, 2014

This took me an inordinately long time, so I thought I'd share: this is the log4j2.xml file, placed on the classpath (where you might also have your hibernate.cfg.xml).

I left the status as TRACE to assist anyone debugging the configuration, e.g., if you see CLASS_NOT_FOUND you might be missing a jar in your classpath.

The output appears in the tomcat logs directory; without the "../" it appears in the bin directory.

Hope this is helpful and saves those who find it some time — Part of the reason getting it right was so hard was distinguishing log4j vs log4j2 hints.

Note:  Running OSX 10.9.4

<?xml version="1.0" encoding="UTF-8"?>

<Configuration status="TRACE">
    <Properties>
        <Property name="dest">etherios_digester_log4j.log</Property>
    </Properties>
    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <PatternLayout pattern="%d{YYYY/MM/dd:HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
        </Console>
        <File name="rdfsgFile" fileName="../logs/etherios_digester_log4j.log">
            <PatternLayout pattern="%d %p %C{1.} [%t] %m%n"/>
        </File>
    </Appenders>
    <Loggers>
        <Logger name="com.rdfsg.aca.etherios.CoreDataInitializer" level="TRACE">
            <AppenderRef ref="rdfsgFile"/>
        </Logger>
        <Root level="ERROR">
            <AppenderRef ref="rdfsgFile"/>
        </Root>
    </Loggers>
</Configuration>

Monday, November 4, 2013

Objective C Checklist

As I move closer to production with an app that may get a lot of uptake, I've developed the following good-practices checklist:


From Effective Objective C 2.0

  • Use @class forward declarations when possible — minimizes leakage (remember there's no real (compiler) enforcement of private) in Objective-C
  • Use literal constructors, e.g., rather than NSNumber *someNumber = [NSNumber numberWithInteger:1], use NSNumber *someNumber = @1; — searching for "with" might be an effective way to find candidates
  • Use static constants rather than #define, e.g., rather than #define ANIMATION_DURATION 0.3 use static const NSTimeInterval kAnimationDuration = 0.3;
  • @property (nonatomic, readwrite, copy)
    • Sort of obvious ones except for copy -- use copy for strings (which may or may NOT be mutable -- NSMutableString might get passed in, which could lead to weird unexpected behavior), assign works for scalars, but there's also strong and weak (for avoiding circular references that would confuse ARC)
    • One other caveat: if in the "readwrite" slot, retain is also an option -- retain should be used for pointers (and copy, etc. become irrelevant), e.g., @property (nonatomic, strong) NSObject *aThing; retain has been deprecated (replaced with strong) under ARC
  • Implement the description method, e.g., as below (although, if you don't have any instance variables/properties that are informative, it's probably not worthwhile, & you don't, don't, don't want to overwrite the autogenerated CoreData model code)

  -(NSString *) description{
      return [NSString stringWithFormat:@"<%@: %p, \"%@ %@\">",
              [self class], self, _firstName, _lastName];
  }


  • Define private instance variables in the implementation file. This is done by defining another @interface in the .m file, and prevents leaking -- although I understand this at the level of "a reasonable workaround for deficiencies in the language", I do find it a bit off-putting: I like ALL definitions in the .h files (although, being a Java/Lisp guy, I hate .h files). BTW the syntax for this is (the parens matter!!!):

@interface HCSituationEvaluator ()
// private properties and instance variables go here
@end

  • Use NSCache rather than NSDictionary for caching (didn’t realize this class even existed)

From iOS Programming, The Big Nerd Ranch Guide

  • Set breakpoint to break on all errors



    • Consider using removeObjectIdenticalTo, rather than removeObject (the right answer depends upon the circumstance: removeObject calls isEqual, so it doesn’t require exact instance identity)
    • Check setDelegate in the xml parser — the recommendation is to have delegates for each sub-node in the xml parse tree. This makes sense, and makes for more maintainable code than building your own stack-based state machine. However, I'm more skeptical of rolling all the parsing into the class associated with that node, primarily because the parsed data may be stored in CoreData, or in a custom datastore. In these cases, I tend to have side classes that can handle "out of band" operations, and put the parsing there, using the pattern CoreDataClassNameHelper
      • HOWEVER (& critically important): the delegate is a weak reference, so you need to hold onto the value somewhere else or the memory will be reused!


Some "older" writeups advise you to avoid storyboards. This seems to be out of date with Xcode 5, which doesn't present you with a non-storyboard option.


    • Penultimate step: "Apple reserves the right to use all two-letter class prefixes"
    • Final step: be sure all situation eval is done via notifications and timed events, e.g., performSelector:@selector(aSelector) withObject:nil afterDelay:0.5

Monday, July 29, 2013

Integrating Sensors and Prompts/Effectors

I’m putting together an architecture that looks like this:

Sensors, signals, and actuators

The application is a smart-home/safe-home system supporting a combination of

  • Place-based objects: primarily sensors, e.g., stovetop monitors, water overflow sensors, etc.
  • Person-based objects: either prompting/facilitating devices (flashing lights, feedback tones, vibration), or sensors facilitating a quantified-self paradigm (steps, heart rate, blood pressure).

At first glance this architecture looks overly complicated: why have the ZigBee network at all, wouldn’t it be possible to achieve the same result without it, and if so, wouldn’t that be preferable?


ZigBee Sensor Network Advantages

I was driven to this architecture from two directions. First, on the ZigBee side: the ZigBee communication fabric is designed for high-capacity sensor networks -- in this case high capacity means many sensors, rather than high bandwidth. Considerations included:


  • ZigBee is used in the Philips Hue lighting system -- which can control up to 50 (!) bulbs from a single gateway. Their website contains a paper which discusses potential interference issues, demonstrating how unlikely it is that interference will be an issue in practice
  • ZigBee is low power -- the radio can run off of batteries for a "long time" (battery life varies with radio settings). A single radio can report on up to 4 different sensors without requiring an Arduino or other microcontroller. This not only increases battery life but also substantially reduces costs (any Arduino is more costly than an XBee radio, and uses substantially more power).

And finally: 

Data Cloud


Once you have the ZigBee sensing network in place, getting the data up to a cloud is straightforward using one of Digi’s gateways (I’m using a ConnectPort X2 at the moment). Digi runs their own cloud service geared towards sensor networks. Etherios provides the service at no cost for a small number of gateway nodes -- a nice way to get going and see if your ideas have any traction.


With the sensor data securely stored in a cloud service, caregiver facing applications can access this data via a REST interface. 



Smartphone Prompting/Effector Network


Output device connectivity is an area where “smartphones” shine. I currently only work with iPhones, since I’m familiar with iOS development, and the libraries provided for building user interfaces are complete, well tested and constantly improving. Smartphones also provide a rich set of information about the user situation, with GPS, accelerometer data, etc..


Smartphones have become the default target for consumer-facing add-on devices as they support bluetooth, wifi and cellular data connections. The net effect is that an iPhone app can use bluetooth to integrate with a Sphero robot, and wifi/cellular data to access the data cloud and control Philips Hue lightbulbs.


In addition, the phone’s GPS and geo fencing capabilities allow checks to be run before leaving the house (stove off, backdoors locked, etc.), making the system potentially attractive to many people, some of whose only “cognitive disability” is having a hectic life. 


On a more speculative level, multiple new sensor types are becoming available in the quantified-self space. There are a number of startups developing tools to track and measure your activities, along with associated apps and APIs. I expect this area to evolve rapidly as companies with a track record of developing high-quality consumer products acquire these technologies. Just to pick one example: Jawbone recently acquired BodyMedia.



Network Partitioning


Although the functionality-to-network partitioning is flexible, it isn't arbitrary. Low-power sensors requiring reliable, robust transmission will gravitate towards the ZigBee network. The network's "self-healing mesh" topology gives a higher level of assurance, and has the advantageous knock-on effect that adding nodes increases rather than decreases reliability.


Aside: As I was writing this up, I read an article in the August issue of Computer describing Washington State’s CASAS project, which has similar characteristics