Wednesday, May 26, 2010

Necessary Attributes and Opaque Identifiers

At first blush, the use of necessary attributes and opaque identifiers looks a lot like database normalization, with necessary attributes being the columns and opaque identifiers being the foreign keys. I will admit that there are some similarities, although I maintain there are also some strong differences.

The most obvious difference is that many of the attributes (columns) that you would put in a database table would not appear in the ontology. In particular, you would avoid those attributes that are dependent upon specific business processes or technology. This includes most everything that relates to a hierarchy and any inter-object relationships that are only germane to one particular scenario.

Nullable columns are a mixed bag. The immediate reaction might be that if an attribute can be NULL, it is, a priori, not ontologically necessary. However, pure ontological necessity can be at cross purposes with the goal of stopping our analysis at a Middle Distance. A good example of something that can be NULL but would still be included is the "derived from" relation. If we stop our analysis at the instrument (which is likely), some results, having been loaded directly from an instrument, will have no source ("derived from") results. I find no compelling reason to eliminate the "derived from" attribute: independent of business processes, most results will derive from a combination or analysis of other results, and an indication of this is necessary for determining dependencies. It is just that some results will have no antecedent.
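To make the "derived from" point concrete, here is a minimal sketch in Python. The names (Result, derived_from, antecedents) and the sample identifiers are hypothetical and chosen only for illustration. The attribute is always present on a result, but it is empty for results loaded straight from an instrument.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Result:
    """A result carrying only its necessary attributes (hypothetical model)."""
    result_id: str                      # opaque identifier for this result
    derived_from: Tuple[str, ...] = ()  # ids of antecedent results; empty at the instrument


def antecedents(result_id: str, results: Dict[str, Result]) -> List[str]:
    """Return the ids of every result that result_id depends on, directly or not."""
    seen: List[str] = []
    stack = list(results[result_id].derived_from)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(results[current].derived_from)
    return sorted(seen)


results = {
    "r1": Result("r1"),                    # loaded from an instrument, no antecedent
    "r2": Result("r2"),                    # loaded from an instrument, no antecedent
    "avg": Result("avg", ("r1", "r2")),    # derived from r1 and r2
    "report": Result("report", ("avg",)),  # derived from avg
}

print(antecedents("report", results))  # ['avg', 'r1', 'r2']
print(antecedents("r1", results))      # []
```

The dependency walk never needs a special case for instrument results: an empty "derived from" simply ends the chain.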

Note: It is open at the moment whether the "results" of an average should be treated as a single result or as a set of results of different types, all of which are created by the same operation. In any case, given that our analysis truncates at the instrument, even if we considered the instrument a transformation that created the result(s), there would be no result(s) that served as inputs to that transformation.

By the same token, some attributes that we might think of as always being present may be elided from the model since they are hidden behind opaque identifiers.
A good example here would be the use of geoposition rather than address. Using a Middle Distance approach, we would structure the location of an address as an identity-preserving opaque identifier (geoposition) rather than as a set of columns containing foreign keys that reference other items (in other tables in a relational model) holding the street/city/state/country values. This opaque identifier allows us to ignore all political boundaries, variations in street names, etc. when designating our location. If required, these values can be derived on a "just in time" basis.
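A rough sketch of the geoposition idea in Python follows. Everything here is an assumption made for illustration: the GeoPosition and Site types, the coordinate-based token format, and the ADDRESS_LOOKUP dictionary, which stands in for whatever service would actually derive address parts on demand.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class GeoPosition:
    """Opaque location identifier; the model compares it but never parses it."""
    token: str


@dataclass(frozen=True)
class Site:
    name: str
    location: GeoPosition  # no street/city/state/country columns on the entity


def make_geoposition(lat: float, lon: float) -> GeoPosition:
    """Build the identifier from coordinates (one possible encoding, assumed here)."""
    return GeoPosition(f"{lat:.5f},{lon:.5f}")


# Hypothetical stand-in for a "just in time" lookup of address details.
ADDRESS_LOOKUP: Dict[str, Dict[str, str]] = {
    "42.36010,-71.05890": {"city": "Boston", "state": "MA", "country": "US"},
}


def resolve_address(position: GeoPosition) -> Optional[Dict[str, str]]:
    """Derive address parts only when a report actually needs them."""
    return ADDRESS_LOOKUP.get(position.token)


lab = Site("Lab A", make_geoposition(42.3601, -71.0589))
same_place = make_geoposition(42.3601, -71.0589)

print(lab.location == same_place)     # True: identity is preserved
print(resolve_address(lab.location))  # address parts derived on demand
```

If a street is renamed or a boundary is redrawn, only the lookup changes; the Site and its identifier stay as they are.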

This is the core Middle Distance ontological question:
Given the way we use an entity, can the entity exist without having a value for the attribute under consideration?
If the answer is no, then some flavor of the attribute must be brought forward and attached to the entity. The question of whether or not to represent this attribute as an opaque identifier has to do both with the complexity of the attribute and with its variation in practice. I think that experimental conditions are another paradigmatic example of something that should be hidden behind an opaque identifier, since the level of detail that is important (e.g. whether to include the instrument, SOP, lab location, etc.) changes depending upon the particular type of experiment conducted and the variability of these conditions within the organization -- any and all of which might change over time.

However, the fact that an experiment will have SOME experimental conditions will be invariant.

This hints at a general rule: if the information about a thing requires one or more ancillary tables/objects to represent it, it is best to wrap the information in an opaque identifier that is designed to be sufficient to disambiguate the reference but does not contain any detail. This identifier can be expanded out to a "report specific" level of detail on demand.
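As a minimal sketch of this rule, applied to experimental conditions, consider the Python below. The store, the field names, and the expand function are all hypothetical. The point is only that the experiment holds an identifier, and each report decides how much detail to pull out of it.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable

# Hypothetical backing store; in practice this could be several ancillary tables.
CONDITIONS_STORE: Dict[str, Dict[str, Any]] = {
    "cond-001": {
        "instrument": "HPLC-7",
        "sop": "SOP-112",
        "lab_location": "Building 4",
        "temperature_c": 21.5,
    },
}


@dataclass(frozen=True)
class Experiment:
    experiment_id: str
    conditions_id: str  # opaque: enough to disambiguate, carries no detail


def expand(conditions_id: str, fields: Iterable[str]) -> Dict[str, Any]:
    """Expand an opaque identifier to the level of detail a given report asks for."""
    detail = CONDITIONS_STORE[conditions_id]
    return {name: detail[name] for name in fields if name in detail}


experiment = Experiment("exp-42", "cond-001")

# Two reports, two levels of detail, one unchanged entity.
print(expand(experiment.conditions_id, ["instrument", "sop"]))
print(expand(experiment.conditions_id, ["lab_location"]))
```

Adding a new kind of condition later would touch the store and the reports that care about it, while Experiment itself would not change.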

My next post(s) will work through some examples.
