Wednesday, May 26, 2010

Necessary Attributes and Opaque Identifiers

At first blush, the use of necessary attributes and opaque identifiers looks a lot like database normalization, with necessary attributes being the columns and opaque identifiers being the foreign keys. I will admit that there are some similarities, although I maintain there are also some strong differences.

The most obvious difference is that many of the attributes (columns) that you would put in a database table would not appear in the ontology. In particular, you would avoid those attributes that are dependent upon specific business processes or technology. This includes most everything that relates to a hierarchy and any inter-object relationships that are only germane to one particular scenario.

Nullable columns are a mixed bag. The immediate reaction might be that if an attribute can be NULL then, a priori, it is not ontologically necessary. However, pure ontological necessity can be at cross purposes with the goal of stopping our analysis at a Middle Distance. A good example of something that can be NULL but would still be included is the "derived from" relation. If we stop our analysis at the instrument (which is likely), some results, having been loaded from an instrument, will have no source ("derived from") results. I find no compelling reason to eliminate the "derived from" attribute: independent of business processes, most results will derive from a combination or analysis of other results, and an indication of this is necessary for determining dependencies etc. It is just that some results will have no antecedent.
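To make the "derived from" point concrete, here is a minimal sketch in Python. The names (Result, derived_from, all_antecedents) and the sample identifiers are my own inventions for illustration, not part of any existing model. The relation is always present on a result, but it is simply empty for results that come straight off an instrument.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class Result:
    """A result whose "derived from" relation is always present but may be empty."""
    result_id: str                       # opaque identifier for this result
    derived_from: Tuple[str, ...] = ()   # empty for results loaded from an instrument


def all_antecedents(result_id: str, results: Dict[str, Result]) -> Set[str]:
    """Collect every result the given result depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(results[result_id].derived_from)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(results[current].derived_from)
    return seen


# Hypothetical sample data: two instrument readings, an average, a normalization.
results = {
    r.result_id: r
    for r in (
        Result("raw-1"),
        Result("raw-2"),
        Result("mean-1", derived_from=("raw-1", "raw-2")),
        Result("normalized-1", derived_from=("mean-1",)),
    )
}
```

Asking for the antecedents of "normalized-1" walks back to the two raw results, while asking for the antecedents of "raw-1" returns an empty set: the analysis truncates at the instrument without needing a special case.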

Note: It is open at the moment whether the "results" of an average should be treated as a single result or as a set of results of different types, all of which are created by the same operation. In any case, given that our analysis truncates at the instrument, even if we considered the instrument a transformation that created the result(s), there would be no result(s) that served as inputs for the transformation.

By the same token, some attributes that we might think of as always being present may be elided from the model since they are hidden behind opaque identifiers.
A good example here would be the use of geoposition rather than address. Using a Middle Distance approach, we would represent the location of an address as an identity-preserving opaque identifier (geoposition) rather than as a set of columns containing foreign keys that reference other items (in other tables in a relational model) which hold the street/city/state/country values. This opaque identifier allows us to ignore all political boundaries, variations in street names, etc. when designating our location. If required, these values can be derived on a "just in time" basis.
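A rough sketch of the geoposition idea, again in Python. Everything here is assumed for illustration: the GeoPosition and Site classes, the resolver function, and the made-up address data. In practice the "just in time" derivation would be whatever geocoding or lookup service is available; the point is only that the entity itself carries no street/city/country detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class GeoPosition:
    """Opaque location identifier: pins down the place, carries no political detail."""
    latitude: float
    longitude: float


@dataclass(frozen=True)
class Site:
    name: str
    location: GeoPosition   # no street/city/state/country attributes on the entity


# Hypothetical resolver type: turns a position into address fields on demand.
AddressResolver = Callable[[GeoPosition], Dict[str, str]]


def describe_site(site: Site, resolve: AddressResolver) -> str:
    """Derive address detail only at the moment a report needs it."""
    address = resolve(site.location)
    return f"{site.name}: {address['street']}, {address['city']}, {address['country']}"


def demo_resolver(position: GeoPosition) -> Dict[str, str]:
    """Stand-in for a real lookup; the table below is invented sample data."""
    lookup = {
        GeoPosition(10.0, 20.0): {
            "street": "1 Example Street",
            "city": "Exampleville",
            "country": "Nowhere",
        },
    }
    return lookup[position]


lab = Site("Lab A", GeoPosition(10.0, 20.0))
print(describe_site(lab, demo_resolver))
```

If a street is renamed or a border moves, only the resolver's answer changes; the Site and its geoposition stay exactly as they were.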

This is the core Middle Distance ontological question:
Given the way we use an entity, can the entity exist without having a value for the attribute under consideration?
If the answer is no, then some flavor of the attribute must be brought forward and attached to the entity. The question of whether or not to represent this attribute as an opaque identifier depends on both the complexity of the attribute and its variation in practice. I think that experimental conditions are another paradigmatic example of something that should be hidden behind an opaque identifier, since the level of detail that is important (e.g., whether to include the instrument, SOP, lab location etc.) changes depending upon the particular type of experiment conducted and the variability of these conditions within the organization, any and all of which might change over time.

However, the fact that an experiment will have SOME experimental conditions will be invariant.

This hints at a general rule: if the information about a thing requires one or more ancillary tables/objects to represent it, it is best to wrap the information in an opaque identifier which is designed to be sufficient to disambiguate the reference, but does not contain any detail. This identifier can be expanded out to a "report specific" level of detail on demand.
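The general rule can be sketched as follows, using experimental conditions as the example. This is an illustration under my own assumptions: the class names, the CONDITION_DETAILS store, the field names, and the sample values are all hypothetical. The experiment always has a conditions identifier, the identifier itself holds no detail, and each report asks for only the fields it cares about.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ConditionsId:
    """Opaque identifier: disambiguates a set of conditions, holds no detail."""
    value: str


@dataclass(frozen=True)
class Experiment:
    name: str
    conditions: ConditionsId   # invariant: every experiment has SOME conditions


# Hypothetical detail store; its shape can change without touching Experiment.
CONDITION_DETAILS: Dict[ConditionsId, Dict[str, str]] = {
    ConditionsId("cond-001"): {
        "instrument": "plate reader 7",
        "sop": "SOP-12",
        "lab_location": "building 3",
    },
}


def expand(conditions: ConditionsId, fields: Iterable[str]) -> Dict[str, str]:
    """Expand the identifier to a report-specific level of detail on demand."""
    detail = CONDITION_DETAILS[conditions]
    return {field: detail[field] for field in fields if field in detail}


run = Experiment("assay run", ConditionsId("cond-001"))
print(expand(run.conditions, ["instrument", "sop"]))
```

One report might ask for the instrument and SOP, another only for the lab location. Neither choice is baked into the experiment entity, which is what keeps the ontology stable when the interesting level of detail shifts.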

My next post(s) will work through some examples.

Monday, May 10, 2010

Considerations in developing a middle distance ontology

In my mind there are three essential considerations when developing a middle distance ontology:

  1. What are the entities under discussion?

  2. What constitutes the necessary attributes of these entities?

  3. Should these attributes be hidden behind opaque identifiers or should they be an integral part of the entity under consideration?

The first question "What entities are under discussion?" is the easiest to answer: These are the entities that you discuss when performing your activities. If something has never come up as a factor in your activities (and isn't obviously on the horizon) there is no need to consider it.

Patients, trials, compounds, assays etc. are both important and definitely "ready to hand" in the Heideggerian sense.

The second and third questions, "what counts as an explicit attribute" and "what are the modifiers captured by opaque identifiers", are more subtle and domain-specific.

This highlights a core point about the middle distance ontology viewpoint: what's important is what matters to the activity that you are performing. If it doesn't impact what you are doing it should not be modeled in detail. Truncating the detail is what keeps the model's complexity under control.

However, there is one caveat to this "what you know is all you need to know" approach: it is critical to evaluate the likely potential changes to your current situation. Doing this well requires an identification of the scenarios that might impact your operation in the near future and thinking them through in some detail, using the scenarios to pressure test your decisions.

Such a scenario analysis is needed since the ontology (obviously) constitutes a deep structural commitment and any changes at this level are usually both costly and painful.

I would posit the following classifications of the potential changes:

  • Changes in the science: These can be very unpredictable, but often there are precursors consisting of some new "interesting results" in an area. Although the exact resolution of the controversy may not be known, an outline of the structure of these results can help highlight areas where flexibility is necessary.

  • Changes in the environment (mergers etc.): do others in the field think of things similarly? If not, what are the most significant differences?

  • Changes in the business structure: are there any "nearby" functions that would require support in the face of an internal restructuring?

  • Changes in the technology: there are two parts to this:
    • Changes in the computer technology: most likely won't impact your ontology unless you're pushing systems to their limits (more and more unlikely in my experience).

    • Changes in the technology of the systems which you are analyzing: e.g., reactions now produce ten similar but not identical compounds rather than a single compound, or photos suddenly become tagged with GPS information. Another hint is if you're starting to hear the words "high throughput" in a context in which you've never heard them before.

I will admit that a difficulty of doing this is that it spans all architectural disciplines from application to enterprise, but I don't see any way around it.

My next post will focus on when to hide (attributes) behind an opaque identifier.