Sunday, July 6, 2008

Temporal Data

As part of fleshing out the design for the Flexible Drug Discovery (FDD) platform, I'm deciding upon the level of support for temporal data. The simplest decision, based upon the existing "webtwo" infrastructure, would be to have an insert_time and update_time attribute for each item. I think that these times are the bare minimum required to understand/debug system operation.
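The insert_time/update_time approach can be sketched in a few lines; this is a minimal illustration (table and column names are my own, not from the FDD design), using SQLite for brevity:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE item (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        insert_time TEXT NOT NULL,   -- set once, when the row is created
        update_time TEXT NOT NULL    -- refreshed on every modification
    )
""")

def now():
    return datetime.now(timezone.utc).isoformat()

def insert_item(name):
    t = now()
    cur = conn.execute(
        "INSERT INTO item (name, insert_time, update_time) VALUES (?, ?, ?)",
        (name, t, t))
    return cur.lastrowid

def update_item(item_id, name):
    conn.execute(
        "UPDATE item SET name = ?, update_time = ? WHERE id = ?",
        (name, now(), item_id))

item_id = insert_item("aspirin")
update_item(item_id, "acetylsalicylic acid")
row = conn.execute(
    "SELECT name, insert_time, update_time FROM item WHERE id = ?",
    (item_id,)).fetchone()
```

Note that this records only the time of the *latest* change; the previous value and all earlier update times are lost, which is exactly why it is only a debugging aid rather than real temporal support.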

However, in the clinical domain I've become accustomed to thinking about the data in terms of "what did we know when?", so that it is possible to reconstruct the understanding of a trial at a given point in time. This obviously requires much more extensive tracking.

I recently came across an excellent book on the topic: Temporal Data and the Relational Model by Date, Darwen and Lorentzos. It presents a detailed analysis of the issues involved in working with temporal information using a refreshingly simple example consisting of a few tables of data about parts and their suppliers.

Many systems use begin and end dates on each row to track when the data changed, supporting the type of use most relevant to clinical/scientific analysis. However, this technique does not handle some interesting situations in the business domain. For example, p 166 of the book shows that, given an item with the attributes name, status, and city, answering simple questions such as "how long has a supplier been at that address?" or "how long has a supplier had that name?" requires a begin/end date for each attribute. Thinking through the implications of this issue leads to refactoring the model into irreducible components (aka sixth normal form), as described on p 173.
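A sketch of the decomposition, using plain Python tuples in place of tables (the supplier numbers, values, and years are made up for illustration): each attribute gets its own history, so the two questions above can be answered independently.

```python
# Hypothetical per-attribute history "tables" for the book's supplier
# example: each irreducible fact carries its own begin/end interval.
supplier_city = [
    # (sno, city, begin, end)  -- end=None means "still current"
    ("S1", "London", 2001, 2004),
    ("S1", "Paris",  2004, None),
]
supplier_name = [
    ("S1", "Smith", 2001, None),   # name unchanged since 2001
]

def current_since(history, sno):
    """Return the begin point of the currently valid row for sno."""
    for key, value, begin, end in history:
        if key == sno and end is None:
            return begin
    return None

since_city = current_since(supplier_city, "S1")  # at current address since 2004
since_name = current_since(supplier_name, "S1")  # current name since 2001
```

With a single begin/end pair on the combined (name, status, city) row, the 2004 address change would have forced a new row and obscured the fact that the name dates from 2001.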

As implied by the term sixth normal form, using the temporal behavior of the data as a design axis can have extensive implications, e.g.,
  • splitting quantities out in a LIMS system
  • splitting out names (especially last names!) in a system that tracks employees, etc.

This implies that it is important to consider the temporal behavior of the data even if a temporal model is not planned for the system, since doing so helps drive scenarios for evaluating the system's response to expected changes; e.g., "high flux" items may require optimized interfaces, surface special reporting requirements, etc.

Other noteworthy topics in the book include merging intervals: e.g., the two facts that attribute A had value 3 from t1 to t3 and has had value 3 from t3 to now should be merged into the single fact that A has had value 3 from t1 to now.
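The merge (the book calls this coalescing) is straightforward to sketch; here is an illustrative version using half-open [begin, end) intervals and integer time points standing in for t1, t3, etc.:

```python
def coalesce(intervals):
    """Merge adjacent intervals (begin, end, value) that abut and
    carry the same value, e.g. [1,3)=3 and [3,7)=3 become [1,7)=3."""
    merged = []
    for begin, end, value in sorted(intervals):
        if merged and merged[-1][2] == value and merged[-1][1] == begin:
            # Same value and no gap: extend the previous interval.
            merged[-1] = (merged[-1][0], end, value)
        else:
            merged.append((begin, end, value))
    return merged

result = coalesce([(1, 3, 3), (3, 7, 3)])  # -> [(1, 7, 3)]
```

Without coalescing, the same fact ends up represented in multiple equally valid ways, which breaks naive equality comparisons between temporal relations.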

There is also a discussion of the time-from/time-to in the persistent store vs. the time-from/time-to in the world. Although this distinction is important when developing system requirements, it doesn't appear to require analysis different in character from what is conventionally performed. My view is that world and storage times are disjoint: in scientific systems there is rarely a reason to worry about world times, other than referencing the date upon which an operation was performed.
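Keeping both kinds of time on a fact is what makes "what did we know when?" answerable; a minimal sketch (field names and the example timeline are my own, not the book's):

```python
from dataclasses import dataclass
from typing import Optional

# A bitemporal fact: valid_* is when it held in the world,
# tx_* is when the store believed it.
@dataclass
class Fact:
    value: str
    valid_from: int
    valid_to: Optional[int]   # None = still valid in the world
    tx_from: int
    tx_to: Optional[int]      # None = current belief

facts = [
    # Recorded at store time 5: status "active" from world time 1, open-ended.
    Fact("active", 1, None, 5, 10),
    # At store time 10 the record was corrected: it actually ended at 8.
    Fact("active", 1, 8, 10, None),
]

def as_of(facts, tx_time):
    """'What did we know at tx_time?' -- replay the store's belief."""
    return [f for f in facts
            if f.tx_from <= tx_time and (f.tx_to is None or tx_time < f.tx_to)]

belief_then = as_of(facts, 7)   # still believed open-ended
belief_now = as_of(facts, 12)   # corrected: ended at world time 8
```

Querying by tx_time reconstructs the earlier (wrong) belief; querying only by valid time would silently show the corrected history.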

Again, an interesting read, highly recommended (despite the authors' frequent exhortations on how to read the book, e.g., the chapters "are definitely meant to be read in sequence as written" (p 51), or "note carefully"; "carefully" is used an inordinately large number of times in the text).

As storage becomes cheaper, the downside of not having a temporal capability will more frequently exceed its implementation cost.
