Monday, December 20, 2010

Adding Structure to Data

This post demonstrates how to add further structure to data after the initial items have been (uniquely) identified and committed to your persistent store. The core idea here is that once you have items uniquely identified, you can overlay a structure (or any number of structures) upon them as desired.


These structural overlays can also be made to interact as much (or as little) as necessary to address the question currently under consideration.


For example, given four cell lines (this is taken from a brochure from the Charles River labs web site):


Cell Line    Species   Organ     ID
SW780        Human     Bladder   CL-1
Hep3B        Human     Liver     CL-2
B16          Mouse     Skin      CL-3
Madison109   Murine    Lung      CL-4


(note: not all relationships need to be specifically listed in the parent table)


This gives us the following structure:



Upon which we can overlay a set of relationships showing the source organ:




Or source species 
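
To make this concrete, here is a minimal sketch in Java (the class and variable names are mine, not from any real system) of overlays living apart from the identified items: the cell lines are stored once, keyed by their opaque IDs, and each overlay is just a separate mapping over those IDs that can be combined when a question requires it.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CellLineOverlays {
    public static void main(String[] args) {
        // The uniquely identified items: opaque ID -> cell line name.
        Map<String, String> cellLines = Map.of(
            "CL-1", "SW780", "CL-2", "Hep3B", "CL-3", "B16", "CL-4", "Madison109");

        // One overlay: source organ, keyed by the same opaque IDs.
        Map<String, String> organ = Map.of(
            "CL-1", "Bladder", "CL-2", "Liver", "CL-3", "Skin", "CL-4", "Lung");

        // A second, independent overlay: source species.
        Map<String, String> species = Map.of(
            "CL-1", "Human", "CL-2", "Human", "CL-3", "Mouse", "CL-4", "Murine");

        // The overlays interact only when a question needs them to,
        // e.g. "which human cell lines came from the liver?"
        List<String> humanLiverLines = cellLines.keySet().stream()
            .filter(id -> "Human".equals(species.get(id)) && "Liver".equals(organ.get(id)))
            .map(cellLines::get)
            .collect(Collectors.toList());
        System.out.println(humanLiverLines);   // prints [Hep3B]
    }
}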



Now, let's say we add a new cell line:


Cell Line    Species   Organ     ID
SW780-1      Human     Bladder   CL-5


Giving us



We may later realize that CL-5 was derived from CL-1 and simply use a separate parent-child relationship table to store that information:


Parent   Child   Relationship
CL-1     CL-5    "derived"


(note: "Root" cell lines are those that do not appear in the Child column or do not appear in this table at all) 




This sort of thing can be generally extended and need not be a strict tree: 


Parent   Parent Table   Child   Child Table   Relationship
CL-1     Cell_Line      CL-5    Cell_Line     fusion-parent
CL-2     Cell_Line      CL-5    Cell_Line     fusion-parent


Obviously the richer the relationship, the more likely you are to move to a table specifically designed to capture that information. 


Mixture Component   Mixture Component Table   Mixture   Mixture Table   Amt
C-1                 Compound                  C-9       Compound        0.1
C-1                 Compound                  C-9       Compound        0.1
R-2                 Reagent                   C-9       Compound        0.1


As these structures build up, it becomes easy to interrogate the information about our available cells.


Query: What mammalian cell lines do we have? Procedure: Traverse from the mammalian node and collect all cell line instances




Query: What cell lines are derived from CL-1? Procedure: Find cell lines derived from CL-1, find cell lines derived from them (recursively), and collect all cell line instances.
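
Here is a minimal sketch of that second query in Java, treating the parent/child table above as a plain edge list (the class, the helper method, and the extra CL-7 edge are mine, added purely for illustration):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DerivedCellLines {
    // One row of the parent/child relationship table.
    record Edge(String parent, String child, String relationship) {}

    // Collect every cell line reachable from start by following parent -> child
    // edges, i.e. all direct and indirect derivatives.
    static Set<String> derivedFrom(String start, List<Edge> table) {
        Set<String> found = new LinkedHashSet<>();
        Deque<String> toVisit = new ArrayDeque<>(List.of(start));
        while (!toVisit.isEmpty()) {
            String current = toVisit.pop();
            for (Edge e : table) {
                if (e.parent().equals(current) && found.add(e.child())) {
                    toVisit.push(e.child());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Edge> table = List.of(
            new Edge("CL-1", "CL-5", "derived"),
            new Edge("CL-5", "CL-7", "derived"));   // CL-7 is a made-up grandchild, for illustration
        System.out.println(derivedFrom("CL-1", table));   // prints [CL-5, CL-7]
    }
}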



The overall pattern is pretty straightforward and can be processed with standard graph algorithms.


See also: Considerations in developing a middle distance ontology

Monday, November 8, 2010

Knowing Where Your Bytes Are

I thought I'd post a pointer to this analysis of the Foursquare/MongoDB outage (for completeness, here is Foursquare's description of the outage). I'm not a Foursquare user, so I didn't know anything about the outage until I was catching up on my RSS reading last week, but I think the lessons are broadly applicable.

The key point is that even on a modern, cutting-edge platform, low-level details of how your data (and processing) map onto the underlying hardware can be the source of major headaches.

In general, the number of possible details that could bring a system to its knees is too large to keep in one's head at all times. In practice, what is required is the ability to track down why the system isn't responding to your changes as expected. In the Foursquare case the question became: "why didn't RAM usage decrease when 5% of the data was moved to another shard?"

The solution(s) are always obvious in retrospect, and the Foursquare/MongoDB team appears to have determined the root cause with admirable speed. In my mind the key to this speed was their ability to get quickly from the question "why didn't my system speed up when I added another shard?" to the question "why didn't my RAM usage decrease?"

At some level this goes back to my post last year on the Instruments performance analysis tool. A good substrate of these tools is critical: if you don't have a performance meter that's pegged (in this case the swapping/paging meter), getting a handle on what's going on is difficult, to say the least.

I'm glad I didn't personally experience that particular debugging session.

Monday, August 30, 2010

Middle Distance Ontologies -- an Intermediate Summary

I've posted a number of design exercises using middle distance ontologies, including the antibodies and assays posts below.

At this point I think it's worthwhile to look back and see what, if anything, is different and useful about the middle distance approach.

What struck me most as I was thinking through the examples was how similar the process was to object oriented class design. Using opaque identifiers allows moving a clump of stuff (attributes, identifiers, functionality, behavior) to another class so that the objects in the system correspond to objects in the world. They can then be analyzed as a coherent whole (aka modularization). Pragmatically, such partitioning allows developers to become intimately familiar with how the stuff within that partition operates and evolves over time, permitting them to develop a high level of fluency in the domain.

A key component of this approach proved to be using the criterion of "what could change" to drive the splitting off of chunks of stuff. As such, it is an extension to, and refinement of, existing design techniques rather than a replacement for them.

So, how does middle distance thinking extend current techniques?

Extension to DB design: A focus on "objects in the world," rather than just cardinality, results in a finer-grained partition of the problem space. All things which are one-to-many or many-to-one are necessarily distinct and have different potential rates of change, but sometimes things which are one-to-one are also distinct, and should be separated to accommodate future change.

Refinement of OO design: middle distance ontologies are problem space oriented rather than software oriented. As such, they are less concerned with factoring out the appropriate superclasses than OO designs are, because the software design criteria that push such refactorings are "within the system" rather than "in the world." There is nothing in the middle distance approach that pushes common functionality up to a common implementation.

One other aspect of the middle distance approach is that it allows you to pull attributes out of the design and hide them behind the opaque identifiers. This modularization allows you to change the definition of such things as validity as your knowledge of what constitutes validity in your problem domain changes.

In summary, I think the middle distance approach is useful as a design principle, but is not a distinct design technique per se. Assessing its utility in practice must await an opportunity to apply it to a real world problem.

Monday, August 2, 2010


Andy Siegel of Genzyme, Hemant Virkar of Digital Infuzion, and I gave a talk, ArchiMate as a Communication Tool in Launching an EA Effort, at the Open Group's Boston Conference. It discussed our experiences using ArchiMate as a communication tool for rolling out an EA Effort at Genzyme.

As I've mentioned before, I have found ArchiMate to be a practical, useful framework. This presentation delineates some of the reasons why I came to that conclusion.

Monday, July 19, 2010

Middle Distance Ontologies: assays

This analysis of assays is the companion to my previous post on antibodies.

A core bifurcation within assays is between in-vivo and in-vitro assays. I'm entirely ignoring clinical trials etc. since they are a completely different conceptual space.

The main differences between in-vivo and in-vitro assays are that, in the in-vivo case, the measurement is more indirect and variable and the delta between planned and actual measurements is much greater.

In both we have
  • the system under test
  • the test response(s) being measured
  • the measurement event (with a potential planned vs. actual component to each)
  • the entity whose impact is being assessed
  • the way this entity was introduced into the system (most important for in-vivo assays)

The system under test

This captures the technology maintaining the experimental conditions, the SOP, the "target", and the readout.
It would therefore seem useful to use four opaque identifiers here (sketched in code after this list):
  • one for the SOP
  • one for the particular technology or system being used for the measurement, e.g., the animal

  • one for the measurement device

  • one for the receptor/disease
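
As a sketch (the field names are mine, and this is just one possible shape), the system under test then reduces to a small record of opaque identifiers; everything behind each identifier can change without disturbing the record itself:

// A sketch only: each field is an opaque identifier whose detail lives elsewhere.
public record SystemUnderTest(
        String sopId,          // the SOP
        String testSystemId,   // the technology/system doing the measuring, e.g. the animal
        String deviceId,       // the measurement device
        String targetId) {}    // the receptor/disease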

The test response(s) being measured.

In normal practice, response types are behind opaque identifiers, e.g., %INH, with an additional qualifier as to the response units. Middle distance thinking does nothing to change this. When it comes to derived data (see also my post on necessary attributes), there are two options which I call
  • "resulted in" (this value "resulted in" this derived result) design.

  • "resulted from" (this value "resulted from" an operation on these results) design in which the transformation that calculated the value is designated by an opaque identifier.

I've seen a number of systems work well in which a more basic value points to a result derived from it -- a "resulted in" design. However, my preference is for the "resulted from" design as it allows the transformation to be more open about the algorithms used and the data points which served as sources of the value. This design allows the result to point back to its source data points (via the opaque identifier), rather than forcing the source data points to designate the derived result. It also permits a many-to-many relationship rather than the many-to-one coerced by the "resulted in" design, albeit with an attendant increase in complexity.
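
Here is a minimal sketch of the "resulted from" shape (class and field names are mine): the derived result carries an opaque transformation identifier and a list of source result identifiers, which gives the many-to-many relationship described above without the source results having to know anything about their derivatives.

import java.util.List;

// "Resulted from": the derived value designates the transformation that produced it
// and the results it was computed from; the source results know nothing about it.
public record DerivedResult(
        String resultId,
        double value,
        String transformationId,        // opaque: algorithm, parameters, software version, ...
        List<String> sourceResultIds) { // many-to-many: several sources, themselves reusable by other derivations

    public static void main(String[] args) {
        DerivedResult average = new DerivedResult(
            "R-100", 0.42, "T-7", List.of("R-1", "R-2", "R-3"));   // all IDs are hypothetical
        System.out.println(average.sourceResultIds());             // prints [R-1, R-2, R-3]
    }
}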

The measurement event.

(The measurement event may include an indication that the actual measurement event deviated in some way from the planned measurement event.)

This one is surprisingly different when viewed from a middle distance perspective. As opposed to the techniques which I'm familiar with from either conventional transactional systems or warehousing efforts, the middle distance approach suggests two factors:
  • hiding the details of the measurement behind an opaque identifier (including equipment operator, time of measurement, deviation from plan)
  • surfacing a flag (again an opaque identifier) to indicate if there were any problems of significance with this measurement.
This delegates the determination of error significance (and its type) to processes more familiar with the unique characteristics of the measurement.

The entity whose impact is being assessed.

Normal practice maps these to opaque identifiers that tie back to sample lots, be they compounds, mixtures, formulations, or natural products.

The way in which this entity was introduced into the system.

In some systems this may be covered by the SOP for the response being measured. However in more complex (in-vivo) systems it is worthwhile to explicitly call this out, since it is easy to imagine the same SOP being performed either with multiple injections or an implantable device.

In summary it appears that "almost all" of the detail is hidden behind opaque identifiers.

Monday, June 14, 2010

Middle Distance Ontologies: antibodies

I want to push on the whole concept of "Middle Distance Ontology" a bit harder and see how it plays out -- my current plan is to concentrate on the discovery space with two entities: Assay Results and Antibodies.

I'll cover antibodies in this post, assay results in the next.

Now, there are many different perspectives from which to view antibodies, to name a few:
  • As a biologist investigating antibody action

  • As a vendor producing antibodies to meet a specification
  • As a pharmaceutical company procuring antibodies from a vendor to use in an assay

For this exercise, I'll take the perspective of a pharmaceutical company storing/analyzing assay results, since it is the viewpoint I understand best. Antibodies are something that I'm not intimately familiar with, so my approach will be to generate a list of attributes and then evaluate them for inclusion/exclusion/opaque identification.

Here are some antibody attributes I came up with. Many were taken from an interesting white paper from Pierce Biotechnology/Thermo Scientific:
  • Basic Attributes: Primary/Secondary; Monoclonal/Polyclonal; Antigen; Vendor Location; Batch

  • IgG Fragments: IgG Whole Molecule; Gamma Chain of IgG; Fc Fragment of IgG; F(ab′)2 Fragment of IgG

  • IgM Fragments: IgM Whole Molecule; Fc5μ Fragment of IgM; Mu Chain of IgM

  • Light Chains of Immunoglobulins

We certainly care about the primary/secondary antibody distinction. This captures both the fact that a secondary antibody was used (i.e., the primary antibody does not carry a fluorescent tag or equivalent) and the characteristics of that secondary antibody if it appears. Interestingly, a quick search turned up a reference to tertiary antibodies, so the principles outlined in the scenario analysis section call for us to provide for these in the design of the core ontology, even if they are unlikely to be used.

In our situation, the factor that couples the primary and secondary (and tertiary) antibodies is their co-occurrence in the assay. A priori, there is nothing that requires there to be anything other than the primary antibody, nor is there any necessity for the antibodies to be able to bind to each other (after all, mistakes happen). We might want to say that the primary, secondary, and tertiary should (must) be able to bind to each other. However, it would be inappropriate to include this as part of the core ontology, since we are trying to capture the run of an assay and a run may be erroneous.

Therefore, for each experimental run we will have a primary antibody and perhaps one or more secondary (or tertiary) antibodies. In addition, we might multiplex the experiment and run more than one antibody set per "container" per experiment. A quick search for "multiple antibody assay" suggested that "multiplex antibody" was the appropriate search term; it returns more than 2,000 hits, which indicates that this is indeed done.

Antigen is something that we (likely) provide or at minimum specify to the vendor. Although it should consist of a unique sequence, understanding its meaning and role within the overall program would require the ability to support an arbitrary level of complexity. This clearly calls for an opaque identifier.

At the vendor level we will need to track some vendor identifier (opaque), shipment information (again opaque), and some sort of vendor lot/group identifier (opaque). The scenario that we wish to be able to support is one in which the vendor ships multiple lots per shipment or spreads one lot across multiple shipments. Tracking the most fine-grained vendor location as the opaque identifier at this level protects you from mergers/divestitures and new-location startup issues, all of which could be permanently hidden by the use of a larger-grained identifier (just think if you only had ONE identifier for all of Thermo!).

When it comes to the more specific characteristics of the antibodies, e.g., fragments and chains, these are not attributes that present distinctions important to the analysis of the results from the perspective of a pharmaceutical company. In summary (a code sketch follows the list):

  • Opaque identifiers
    • Vendor facility

    • Vendor lot/shipment

    • antigen

  • Primary identifiers
    • quantity

    • monoclonal/polyclonal
    • antibody

  • Elided Completely (stored separately)
    • antigen/antibody hierarchy
    • antibody fragments and chains
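
Pulling that summary together as a sketch (all names are mine): the opaque identifiers and primary attributes travel with each experimental use of an antibody, while the elided detail is stored separately, keyed by the antibody identifier.

import java.util.List;

public class AntibodySketch {
    enum Clonality { MONOCLONAL, POLYCLONAL }

    // What travels with each experimental run.
    record AntibodyUse(
            String antibodyId,         // primary identifier
            Clonality clonality,       // primary attribute
            double quantity,           // primary attribute (units handled elsewhere)
            String antigenId,          // opaque identifier
            String vendorFacilityId,   // opaque identifier
            String vendorLotId) {}     // opaque identifier (lot/shipment)

    // Elided detail (fragments, chains, the antigen/antibody hierarchy) is stored
    // separately, keyed by antibodyId, and joined in only on demand.
    record AntibodyDetail(String antibodyId, List<String> fragmentsAndChains) {}
}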

Wednesday, May 26, 2010

Necessary Attributes and Opaque Identifiers

At first blush, the use of necessary attributes and opaque identifiers looks a lot like database normalization, with necessary attributes being the columns and opaque identifiers being the foreign keys. I will admit that there are some similarities, although I maintain there are also some strong differences.

The most obvious difference is that many of the attributes (columns) that you would put in a database table would not appear in the ontology. In particular, you would avoid those attributes that are dependent upon specific business processes or technology. This includes most everything that relates to a hierarchy and any inter-object relationships that are only germane to one particular scenario.

Nullable columns are a mixed bag: The immediate reaction might be that if an attribute can be NULL then, a priori, it would not be ontologically necessary. However, pure ontological necessity can be at cross purposes with the goal of stopping our analysis at a Middle Distance. A good example of something that can be NULL but would still be included is the "derived from" relation. If we stop our analysis at the instrument (which is likely), some results, having been loaded directly from an instrument, will have no source ("derived from") results. I find no compelling reason to eliminate the "derived from" attribute: independent of business processes, most results will derive from a combination or analysis of other results, and an indication of this is necessary for determining dependencies; it is just that some results will have no antecedent.

Note: It is an open question at the moment whether the "results" of an average should be treated as a single result or as a set of results of different types, all of which are created by the same operation. In any case, given that our analysis truncates at the instrument, even if we considered the instrument a transformation that created the result(s), there would be no result(s) that served as inputs to the transformation.

By the same token, some attributes that we might think of as always being present may be elided from the model since they are hidden behind opaque identifiers.
A good example here would be the use of geoposition rather than address. Using a Middle Distance approach we would structure the location of an address as an identity preserving opaque identifier (geoposition) rather than as a set of columns containing foreign keys that reference other items (in other tables in a relational model) which hold the street/city/state/country values. This opaque identifier allows us to ignore all political boundaries, variations in street names etc. when designating our location. If required, these values can be derived on a "just in time" basis.
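
A sketch of the geoposition example (the types and the resolver interface are mine, not any particular geocoding API): the entity stores only the identity-preserving geoposition, and street/city/state/country are derived on demand.

public class GeopositionSketch {
    // Identity-preserving opaque identifier for a location: no political boundaries,
    // no street-name variations, just enough to disambiguate the place.
    record Geoposition(double latitude, double longitude) {}

    // The entity carries the opaque identifier only.
    record Site(String siteId, Geoposition location) {}

    // "Just in time" expansion: street/city/state/country derived only when a
    // report needs them, e.g. via reverse geocoding (not implemented here).
    interface AddressResolver {
        String streetAddress(Geoposition position);
    }
}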

This is the core Middle Distance ontological question:
Given the way we use an entity, can the entity exist without having a value for the attribute under consideration?
If the answer is no, then some flavor of the attribute must be brought forward and attached to the entity. The question of whether or not to represent this attribute as an opaque identifier has to do both with the complexity of the attribute and with its variation in practice. I think that experimental conditions are another paradigmatic example of something that should be hidden behind an opaque identifier, since the level of detail that is important (e.g., should it include the instrument, the SOP, the lab location?) changes depending upon the particular type of experiment conducted and the variability of these conditions within the organization -- any and all of which might change over time.

However, the fact that an experiment will have SOME experimental conditions will be invariant.

This hints at a general rule: if the information about a thing requires one or more ancillary tables/objects to represent it, it is best to wrap the information in an opaque identifier which is designed to be sufficient to disambiguate the reference, but does not contain any detail. This identifier can be expanded out to a "report specific" level of detail on demand.

My next post(s) will work through some examples.

Monday, May 10, 2010

Considerations in developing a middle distance ontology

In my mind there are three essential considerations when developing a middle distance ontology:

  1. What are the entities under discussion?

  2. What constitutes the necessary attributes of these entities?

  3. Should these attributes be hidden behind opaque identifiers or should they be an integral part of the entity under consideration?

The first question "What entities are under discussion?" is the easiest to answer: These are the entities that you discuss when performing your activities. If something has never come up as a factor in your activities (and isn't obviously on the horizon) there is no need to consider it.

Patients, trials, compounds, assays etc. are both important and are definitely "ready to hand" in the Heideggerian sense.

The second and third questions, "what count as explicit attributes" and "what are the modifiers captured by opaque identifiers," are more subtle and domain specific.

This highlights a core point about the middle distance ontology viewpoint: what's important is what matters to the activity that you are performing. If it doesn't impact what you are doing it should not be modeled in detail. Truncating the detail is what keeps the model's complexity under control.

However, there is one caveat to this "what you know is all you need to know" approach: it is critical to evaluate the likely potential changes to your current situation. Doing this well requires identifying the scenarios that might impact your operation in the near future and thinking them through in some detail, using them to pressure-test your decisions.

Such a scenario analysis is needed since the ontology (obviously) constitutes a deep structural commitment and any changes at this level are usually both costly and painful.

I would posit the following classifications of the potential changes:

  • Changes in the science: These can be very unpredictable, but often there are precursors consisting of some new "interesting results" in an area. Although the exact resolution of the controversy may not be known, even an outline of its structure can help highlight areas of necessary flexibility.

  • Changes in the environment (mergers, etc.): do others in the field think of things similarly? If not, what are the most significant differences?

  • Changes in the business structure: are there any "nearby" functions that would require support in the face of an internal restructuring?

  • Changes in the technology: there are two parts to this:
    • Changes in the computer technology: most likely won't impact your ontology unless you're pushing systems to their limits (more and more unlikely in my experience).

    • Changes in the technology of the systems which you are analyzing: e.g., reactions now produce ten similar but not identical compounds rather than a single compound, or photos suddenly become tagged with GPS information. Another hint is when you start to hear the words "high throughput" in a context in which you've never heard them before.

I will admit that a difficulty of doing this is that it spans all architectural disciplines from application to enterprise, but I don't see any way around it.

My next post will focus on when to hide (attributes) behind an opaque identifier.

Monday, April 26, 2010

MacBookPro Sleeping Problem: Solved.

If you have a MacBookPro that's having trouble waking up from sleep correctly: spinning beachball, infinitely long login check etc., you might want to try the fix suggested on macrumors by jmora71.

It appears to apply primarily to systems with Solid State Disks and a fair amount of RAM (the author mentions 8 GB; I have 6 GB).

Since performing the fix, I haven't had problems in over a week. I used to have issues daily.

For various reasons I had thought the cause was Parallels, but I was happily mistaken.

Monday, April 19, 2010

Ontologies in Practice

This is an extension of an old post, An Extensible System for Discovery Data. As I've been thinking more about what constitutes the Simplest Building Blocks, I've begun to realize that they designate something very close to an ontology in the "middle distance." That is, it isn't an ontology down to the fundamental constituents of matter, nor is it about specifying things in sufficient detail to adequately compare and track what's going on between organizations (see Barry Smith's presentation for a discussion of these issues); rather, it is an ontology of the stuff we deal with in our day-to-day activities.

For example, an experiment has (a code sketch follows the list):

  • A protocol consisting of:
    • a set of initial conditions

    • one or more intermediate steps

      • each step may have a set of operators, equipment, constituents/ingredients etc.

  • A result set consisting of one or more members,
    • any of which might be invalid for one or more reasons

  • A set of analysis results
    • with methods

    • parameters

    • derived results (results that are based upon this result)

    • supporting results (results upon which this result is based, such as calibration curves)
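
Here is the sketch promised above (all names are mine): the experiment's middle-distance shape, with the points where analysis stops represented as opaque identifiers.

import java.util.List;

public class ExperimentSketch {
    record Step(String stepId, List<String> operatorIds, List<String> equipmentIds,
                List<String> ingredientIds) {}                        // detail hidden behind opaque IDs

    record Protocol(String initialConditionsId, List<Step> steps) {}  // conditions are opaque

    record Result(String resultId, String validityFlagId) {}          // reasons for invalidity are opaque

    record AnalysisResult(String analysisId, String methodId, String parametersId,
                          List<String> supportingResultIds,           // e.g. calibration curves
                          List<String> derivedResultIds) {}           // results based upon this one

    record Experiment(String experimentId, Protocol protocol,
                      List<Result> results, List<AnalysisResult> analyses) {}
}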

My basis for claiming that these are constituents of a "middle distance" ontology is twofold:
  1. Each component is ontologically necessary. That is, an experiment cannot exist without these components.

  2. When analyzing these constituents we do not need to go into further detail. We can hide that detail behind an opaque identifier and need not give it further meaning. This opacity allows us to stop our analysis at that point; we don't have to analyze down to the constituent quarks (or chiral forms, for that matter, if they don't have any impact on our current goals).

In future posts I'll cover the entities and relationships that I take as being important in this middle distance and how to identify them (which, like most modularization efforts, is more of an art than a science).

Monday, March 29, 2010

Architects As Service Providers

This paper by Roland Faber of Siemens Healthcare recently appeared in IEEE Software. It talks pointedly about how it is more effective for architecture to be structured as a service that provides value by interacting closely with project developers, rather than as a function that produces documentation to be followed by the projects. It even advocates that architects perform some hands-on coding in the projects (mirabile dictu).

It would be impossible for me to agree more, including the part about hands-on coding, my fondness for which is pretty obvious given my blog posts.

The article posits that this close, ongoing interaction is a good way of assuring both that projects understand what the architects are trying to accomplish with the architecture and also that the architects develop an appreciation for the practical issues involved in building working software. There is also a side effect of this kind of interaction that they don't mention: its value in preventing the obsolescence of the person doing the architecture.

The standard scenario in this industry is that a person spends a number of years learning their craft and refining their practice, at which point, if they're good, they become an architect or manager and stop coding. The architect's (or manager's) skills stay relevant for a few years (five years seems to be a recurring number), after which they become a pointy-headed character in a Dilbert cartoon.

As an industry we should be learning from this anti-pattern.

It seems to me a truism that coding keeps you grounded and current, and keeps this Dilbertization from happening. Certainly at some point other duties (e.g., architecture or management) require that you take yourself off the critical coding path, otherwise the success of the project is put in jeopardy, but new technologies or core utilities that are not on a critical-path timeline are all fair game.

I strongly believe in this approach and this is the first article I've seen that reflects my personal practice.

Monday, March 15, 2010

Low Level Virtual Machine (LLVM)

LLVM, as described in this article on AppleInsider, stands for Low Level Virtual Machine. It is an open source project that is used and partly supported by Apple.

One of the most interesting things about LLVM is a quote at the bottom of page 1 of the article
Apple also uses LLVM in the OpenGL stack in Leopard, leveraging its virtual machine concept of common IR to emulate OpenGL hardware features on Macs that lack the actual silicon to interpret that code. Code is instead interpreted or JIT on the CPU.

This approach makes it very likely that developers will use the hardware optimized instructions. Most other approaches impose significant costs upon the developers, e.g., the need to write additional code to cover every possible hardware configuration. With the LLVM there is no coding penalty, therefore using the optimized routines becomes a no-brainer, resulting in faster code for people with beefier hardware (who are also those who tend to be most worried about performance) and usable code for everyone else.

As background, the article points to a presentation by Chris Lattner, but I prefer his paper with Vikram Adve, LLVM: a compilation framework for lifelong program analysis, because it talks in terms I can understand (like "Static Single Assignment").

So here's what's cool: LLVM eats a code representation that is very amenable to optimization and analysis. It optimizes this input and outputs machine code (potentially tuned for the actual hardware which will run the code) decorated to allow low-overhead runtime profiling.



This approach permits repeated optimizations based upon recent run-time data rather than generalized heuristics -- it is reminiscent of "hot spot" with larger scope but less immediacy.

I'd be remiss not to mention the strategic implications of this: it allows Apple to radically shift hardware configurations while restricting the software impact to a relatively small chunk of code (cf. the iPad).

Update 21 Aug 2010: Just noticed LLVM got a SIGPLAN award -- well deserved!

Wednesday, March 3, 2010

Hubs & Connectors

I recently stumbled upon the Composite Software site and was impressed by their architecture. It is a virtualized/federated solution that reminds me of the Hub/Connector system which I had proposed as a data integration model for the drug discovery/cheminformatics space.

The advantages of such an architecture over a conventional data warehouse include:

  • There is no requirement to perform a complete mapping of the data. This allows focus upon solutions that address the particular problem at hand and the mappings required to solve it. Such a focus is especially important when the data structure and mapping rules are in a state of flux for part of the system. It allows the high flux areas to be avoided.

  • The target data store need not have a structure capable of holding all of the data simultaneously. For example, a target table that would hold all of your CDISC SDTM SUPPQUAL values could require upward of 1000 columns reaching the limits of many common relational databases. On the other hand, the solution for an incremental data set would be an order of magnitude smaller.

  • Only the data of interest is accessed/moved. In systems that only analyze a small set of the data at a time, server size can be reduced substantially.

  • Data need not be moved to a central repository, minimizing duplicative storage space.

Of course there are disadvantages

  • A warehouse allows the precalculation of complex results, imposing little operational delay in retrieving these results.

  • Warehouses can be more easily structured to handle analyses which involve large portions of the dataset.

In scientific domains, it isn't uncommon for new assays, results, etc. to break your current mappings. A virtualized approach minimizes the impact of these problems upon your system and is certainly something to look at if this sounds like your situation.
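
As a minimal sketch of the hub/connector idea (the interfaces and names are mine, not Composite Software's API): the hub holds a set of connectors, pushes a question out to each source, and merges only the rows of interest, so nothing has to be copied into a central repository ahead of time.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FederationSketch {
    // A connector answers a question against one underlying source.
    interface Connector {
        List<Map<String, Object>> query(String question);
    }

    // The hub fans the question out and merges the rows of interest;
    // nothing is copied into a central repository ahead of time.
    static class Hub {
        private final List<Connector> connectors;

        Hub(List<Connector> connectors) {
            this.connectors = connectors;
        }

        List<Map<String, Object>> ask(String question) {
            List<Map<String, Object>> merged = new ArrayList<>();
            for (Connector c : connectors) {
                merged.addAll(c.query(question));   // only the data of interest moves
            }
            return merged;
        }
    }
}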

Monday, February 15, 2010

Seam 2.2 + JBoss 5.1: a resolution

A few weeks ago I posted about my difficulties migrating to new versions of JBoss and Seam.

I was finally able to resolve these difficulties by following these steps:

  • I first took the suggestion of this post on and used seam-gen to generate a basic Seam application that worked with my existing database.

  • I then replaced all the libraries with the ones from the newly generated application.

  • As a penultimate step I followed the procedure outlined in the migration guide.

I was now able to search for differences between the new and the old version.
The primary differences were in the * classes

The new version of the code followed this pattern:

private static final String[] RESTRICTIONS = { "lower( like concat(lower(#{}),'%')", };   // the property names inside lower() and #{} were lost when this post was published
private static final String EJBQL = "select shows from Shows shows";

public ShowsList() {
    // constructor body reconstructed; the key call, per the note below, is setRestrictionExpressionStrings
    setEjbql(EJBQL);
    setRestrictionExpressionStrings(Arrays.asList(RESTRICTIONS));
}

The pattern in the old version was:

private static final String[] RESTRICTIONS = { "lower( like concat(lower(#{}),'%')", };
private Shows shows = new Shows();

public String getEjbql() {
    return "select shows from Shows shows";
}

public Integer getMaxResults() {
    return 25;
}

The primary difference is that the new version sets its restrictions via the setRestrictionExpressionStrings method. As one of the commenters on the post mentioned, it would have been good to have discussed this in the migration document.

Not a tremendously big deal but, as I said previously, disappointing.

Monday, February 1, 2010

popViewControllerAnimated & message sent to deallocated instance

This is just an informational posting -- it is so odd I thought I'd share.

If you have any of the following:

  • controllerDidChangeContent

  • didChangeSection

  • didChangeObject

  • controllerWillChangeContent

in the same class as a save that does a popViewControllerAnimated:, then when the save completes you will get a

controllerWillChangeContent:]: message sent to deallocated instance (insert your address here)

I'm not exactly sure how this happens, but essentially these methods hold onto the old copy of self in some manner.

Here's an example:

(gdb) p self
$1 = (GroupItemSelectViewController *) 0x3a34f70
Current language: auto; currently objective-c
(gdb) c

Now pop back to the old screen and then do some editing:

2010-01-05 20:34:36.052 GoodToGo[62202:207] (GroupItemSelectViewController:didSelectRowAtIndexPath) setting Item to YES
2010-01-05 20:34:37.960 GoodToGo[62202:207] (GroupController:numberOfRowsInSection) rows: 3
2010-01-05 20:34:39.902 GoodToGo[62202:207] (GroupController:numberOfRowsInSection) rows: 2
2010-01-05 20:34:42.690 GoodToGo[62202:207] (GroupController:numberOfRowsInSection) rows: 3
2010-01-05 20:34:43.561 GoodToGo[62202:207] (didSelectRowAtIndexPath) Selecting row 0
2010-01-05 20:34:43.563 GoodToGo[62202:207] (GroupItemSelectViewController:viewDidLoad) fetchedItemCount:5, holderCount: 5

What is the current value of self?

(gdb) p self
$2 = (GroupItemSelectViewController *) 0x3c84820
(gdb) c
2010-01-05 20:34:51.690 GoodToGo[62202:207] (GroupItemSelectViewController:didSelectRowAtIndexPath): setting Item to NO
2010-01-05 20:34:53.546 GoodToGo[62202:207] (GroupController:numberOfRowsInSection) rows: 3
2010-01-05 20:34:53.547 GoodToGo[62202:207] *** -[GroupItemSelectViewController controllerWillChangeContent:]: message sent to deallocated instance 0x3a34f70

As you can see 0x3a34f70 is the previous value of self, not its current value.

I do consider this a bug in the framework.