Monday, February 11, 2008

An Extensible System for Discovery Data

I’ve been thinking about how to make discovery informatics tools significantly more flexible, extensible and perhaps even more maintainable than they currently are. I don’t think of this as being a software problem per se (at least not one that requires any new thinking on my part): however imperfectly, the required underlying software capabilities already exist. That is, although we certainly need systems that are more configurable, more workflow oriented and more flexible in their ability to mix elements on a page (via mash-ups etc.), and although most scientific software is unquestionably deficient in these areas, the state of the art in software development in 2008 affords a clear set of proven practices to satisfy these needs.

The key issue for scientific software is that the science we are trying to support changes over time in two significant, albeit essentially different, ways.

The first is that the set of business areas we need to support or integrate with expands over time, e.g., starting with in-vitro testing and moving (in both directions) to support synthesis and in-vivo testing, perhaps eventually into clinical trials. It is inevitable that over time the business changes, the strategy changes, the organizational structure changes -- in any case, once-solid (organizational) boundaries become permeable. These changes in the business have a strong impact on the design of scientific software, adding anything from the need to support radioisotopes and formulations to tracking freeze/thaw cycles.

The second, and more interesting, axis of change is that the science itself changes: new tests come on board that have radically different cardinality.
  • The nature of the data changes: test results start as point averages, evolve into time series and then become two-dimensional vectors, e.g., %INH, FLIPR, Imaging;
  • Our understanding of the biology changes, e.g., one protein -> one gene;
  • Our ability to simplify the problem (a simplification that may have been unconscious) is shattered, e.g., genetic variations of targets, protein pathways and post-translational modifications begin to impact the data which we are gathering today.
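To make the cardinality point concrete, here is a minimal sketch (my own illustration, not a schema from any real system; all table and column names are hypothetical) of a result model that survives the point-average → time-series → two-dimensional progression without schema changes, by storing each result as a set of coordinate/value points rather than a fixed value column:

```python
import sqlite3

# Hypothetical sketch: each result owns a set of (coordinate, value) points.
# A point average is one point with no coordinates, a time series is many
# points keyed by coord_1 (time), and a 2-D readout (e.g., imaging) would
# simply use both coordinate columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE result (
    result_id INTEGER PRIMARY KEY,
    assay     TEXT NOT NULL
);
CREATE TABLE result_point (
    result_id INTEGER NOT NULL REFERENCES result(result_id),
    coord_1   REAL,              -- e.g., time; NULL for a point average
    coord_2   REAL,              -- e.g., spatial position; NULL unless 2-D
    value     REAL NOT NULL
);
""")

# A point-average %INH result: a single point, no coordinates.
conn.execute("INSERT INTO result VALUES (1, 'PCT_INH')")
conn.execute("INSERT INTO result_point VALUES (1, NULL, NULL, 42.5)")

# A FLIPR-style time series: many points keyed by time, same tables.
conn.execute("INSERT INTO result VALUES (2, 'FLIPR')")
conn.executemany("INSERT INTO result_point VALUES (2, ?, NULL, ?)",
                 [(0.0, 1.0), (0.5, 3.2), (1.0, 2.7)])

n = conn.execute(
    "SELECT COUNT(*) FROM result_point WHERE result_id = 2").fetchone()[0]
print(n)  # 3
```

When FLIPR arrives, the %INH data and its consuming queries are untouched; only the number of rows per result changes.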

A common solution is to build the system to support the most complex case. However, this doesn’t work if:
  • We don’t know the most complex case;
  • The most complex case is a superset of all possible cases, only one of which would ever occur in a particular system;
  • The group currently being supported can’t supply the information that will enable the distinctions at a later date -- so the information collected is all the same.

In my mind the best way to solve this is to modularize the domain into its simplest building blocks and then build up the necessary complexity from those blocks. I will admit that this is what we think we’ve been doing all these years, but I don’t think it’s true. I challenge you to look at the actual tables (or objects) in your systems and ask yourself whether all of the columns (attributes) are strictly necessary in all (or even most) situations, or whether they reflect the diffusion of business processes into our base design. My proposal is simply the following: for each building block, scale the attributes of each entity back to the absolutely necessary minimum, paying special attention to items which could potentially change the cardinality.

The temptation in doing this type of analysis is to start at the “most fundamental” level and work your way up. The problem I have with doing things that way is that I find the prospect of describing a mouse in terms of its constituent quarks to be both daunting and without obvious value. My current approach is therefore to start at a well-grounded middle level (similar to the “middle distance”) and stub out to one level beyond the current need.

The goal is to be able to support multiple overlapping hierarchies, so that in different situations we can classify assays by technology, gene, both or neither.

The diagrams below indicate the kind of modeling that I envision.

The defining characteristic of such a “primal entity” is the absence of any business-process (bp) foreign keys, i.e., direct references to entities related solely as an aspect of the business process. These business-process relationships are moved into relationship tables particular to the situation at hand, reflecting the particular hierarchy of the business process currently being addressed. The foreign keys which are allowed under this no-bp-foreign-key constraint involve items that must be present for any entity of this sort, e.g., all synthesis batches must have components, a method of synthesis (even if ‘random’), an operator (even if ‘unknown’) and a unit of measure; in the frames that most of us work in, these can be taken as relatively fixed.
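The distinction can be sketched in a few lines of schema (a minimal illustration under my own naming assumptions; the tables and IDs are hypothetical): the primal entity keeps only its intrinsic foreign keys, while the business-process relationship lives in a separate table that can be reorganized without touching the entity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The "primal" entity: only intrinsic foreign keys, i.e., things that must
-- exist for ANY synthesis batch (method, operator, unit of measure).
CREATE TABLE synthesis_batch (
    batch_id    INTEGER PRIMARY KEY,
    method_id   INTEGER NOT NULL,   -- even if 'random'
    operator_id INTEGER NOT NULL,   -- even if 'unknown'
    amount      REAL NOT NULL,
    unit_id     INTEGER NOT NULL
);
-- The business-process relationship lives OUTSIDE the entity, in a
-- relationship table specific to this situation. A new or conflicting
-- hierarchy is a new table, not a schema change to the entity.
CREATE TABLE batch_project (
    batch_id   INTEGER NOT NULL REFERENCES synthesis_batch(batch_id),
    project_id INTEGER NOT NULL
);
""")

conn.execute("INSERT INTO synthesis_batch VALUES (1, 10, 20, 5.0, 1)")
conn.execute("INSERT INTO batch_project VALUES (1, 100)")

# A reorganization moves the batch to a new project structure by editing
# only the relationship table; the batch row itself is untouched.
conn.execute("UPDATE batch_project SET project_id = 200 WHERE batch_id = 1")
project = conn.execute(
    "SELECT project_id FROM batch_project WHERE batch_id = 1").fetchone()[0]
print(project)  # 200
```

Had `project_id` been a column on `synthesis_batch` itself, the same reorganization would have rippled through every consumer of the batch table.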

The advantage here is critical: your system, and more importantly your data, can survive in the face of a significant unforeseen change in the scientific or business environment. The reason is straightforward: the model is now capable of supporting multiple conflicting hierarchies of relationships, so the introduction of a new, conflicting hierarchy doesn’t break your model. This works in a manner similar to the way in which an RDF export of your system can assist you in modeling its ontology, since RDF can support multiple, potentially conflicting ontologies.

My next post will focus on components of the software architecture.
