Tuesday, February 26, 2008

Extensible System: Core Software Requirements

In my last post I promised to detail the software requirements that an extensible system for discovery data shares with other Web 2.0 systems -- here they are:

Workflow Orientation: Support for a complete workflow beyond that which is offered by JSF navigation. This workflow must allow the orchestration of multiple events without requiring additional user interaction. Supported workflows may involve sequenced, conditional interaction with multiple back-end systems (which will, obviously, exist on multiple platforms). For sanity and maintainability, the workflow language should be BPEL (or a slight extension of it) and should provide the ability to extend predicates and actions using a well-designed and widely understood language (Java, C#, etc.).
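As a concrete (if drastically simplified) illustration of the orchestration requirement, here is a sketch in Java. The Workflow class and step names are hypothetical; a real system would delegate this sequencing to a BPEL engine, but the predicates and actions would be plain Java just as here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// A minimal sketch of the requirement, not BPEL itself: steps are
// orchestrated in sequence, each guarded by a predicate, with no user
// interaction between events. Predicates and actions are plain Java,
// mirroring the extension language called for above.
public class Workflow<C> {
    private record Step<C>(String name, Predicate<C> guard, Consumer<C> action) {}

    private final List<Step<C>> steps = new ArrayList<>();

    public Workflow<C> step(String name, Predicate<C> guard, Consumer<C> action) {
        steps.add(new Step<>(name, guard, action));
        return this;
    }

    // Runs each step whose guard holds against the shared context, in order,
    // and reports which steps actually fired.
    public List<String> run(C context) {
        List<String> executed = new ArrayList<>();
        for (Step<C> s : steps) {
            if (s.guard().test(context)) {
                s.action().accept(context);
                executed.add(s.name());
            }
        }
        return executed;
    }
}
```

The conditional, sequenced character of the workflow lives entirely in the ordered list of guarded steps; swapping the in-memory list for a BPEL process definition changes the engine, not the idea.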

Integratable (mashable) data: The data stored by the application (and the results of any analyses performed on the data) are available for repurposing in other applications. Repurposing should be supported in a fine-grained manner so as to place as few restrictions as possible upon its use.

This has two implications: the first is stable identifiers; the second is RESTful interfaces which allow data to be retrieved by referencing a static URL.
Note: RESTful interfaces have a number of nice side effects: the Seam *From trick mentioned below would be much more difficult without one. Additionally, the speed of rapid prototyping/development of web pages is greatly increased if one can directly access a 'deep' page without having to manually negotiate multiple precursor pages.
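To make the stable-identifier point concrete, here is a hypothetical sketch in Java (the /assay/{id} path scheme and class names are invented for illustration): because the identifier never changes, the derived URL can be bookmarked, deep-linked, or mashed into another application without any session state:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: stable identifiers resolved through a RESTful path.
// The entity keeps one immutable id forever; the URL is derived from it,
// so any result can be repurposed by simply referencing that URL.
public class StableResolver {
    private final Map<String, String> store = new HashMap<>();

    public String register(String id, String payload) {
        store.put(id, payload);
        return "/assay/" + id;           // the stable, deep-linkable URL
    }

    // GET /assay/{id} -- retrieval needs only the static URL, nothing else.
    public String get(String path) {
        String id = path.substring(path.lastIndexOf('/') + 1);
        return store.get(id);
    }
}
```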

Event queues: Good workflow/system interaction is facilitated by message queues for guaranteed message delivery. Queuing systems also provide good interface points for logging and analysis tools.
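The two properties claimed above -- guaranteed delivery and a natural interface point for logging -- can be sketched in a few lines. This is not a real JMS/AMQP broker; the class and log format are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Predicate;

// A toy sketch of the queueing idea: a message is acknowledged only when
// its handler succeeds, a failed delivery is re-queued rather than dropped,
// and every hand-off passes one point that can feed logging/analysis tools.
public class DurableQueue {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(64);
    public final List<String> log = new ArrayList<>();  // logging/analysis tap

    public void publish(String msg) {
        queue.add(msg);
        log.add("ENQUEUE " + msg);
    }

    // Makes up to maxAttempts passes over the queue; a rejected message is
    // re-queued, approximating guaranteed delivery.
    public void drain(Predicate<String> handler, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts && !queue.isEmpty(); attempt++) {
            int pending = queue.size();
            for (int i = 0; i < pending; i++) {
                String msg = queue.poll();
                if (handler.test(msg)) {
                    log.add("ACK " + msg);
                } else {
                    log.add("REDELIVER " + msg);
                    queue.add(msg);
                }
            }
        }
    }
}
```

The `log` list stands in for the logging/analysis interface point: every enqueue, acknowledgement, and redelivery is visible there without instrumenting the producers or consumers themselves.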

Rules engines: In its most general form, a rules engine is a piece of code that evaluates a set of antecedent-consequent pairs, i.e., if antecedent then consequent. Given this abstract definition, rules engines need to be distributed at a number of places within the product. I see four distinct areas, each with its own role:

1: Display within a page, e.g., should a particular element be displayed (corresponding to the 'rendered' predicate in JSF)? Rules involve availability of data appropriate for display and authentication/authorization restrictions.

2: Predicates involving page flow (JSF, REST, etc.). Rules involve which page gets displayed next.
The JBoss Seam pages have a very nice convention in which the presence of a *From attribute/value pair allows an editing action, upon completion, to return to the page from which it was launched. Here is an example using peopleFrom:

"/#{empty peopleFrom ? 'PeopleList' : peopleFrom}.xhtml"

which will return to the PeopleList page by default, or to the page named in peopleFrom, when the editing action has been completed.
3: Security rules for CRUD operations. Rules involve accessing and modifying data.
4: Back-end BPEL operations, e.g., if the request has been outstanding for more than a week, then notify customer support. Rules involve the overall operation of the system.
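The abstract antecedent-consequent definition above fits in a few lines of Java. This sketch is hypothetical (production engines add working memory, conflict resolution, etc.), and the example rule is the area-4 case of notifying customer support about a request outstanding for more than a week:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// A rules engine in its most general form: an ordered set of
// antecedent-consequent pairs evaluated against a fact. Only the core
// "if antecedent then consequent" idea is shown.
public class RulesEngine<F> {
    private record Rule<F>(Predicate<F> antecedent, Consumer<F> consequent) {}

    private final List<Rule<F>> rules = new ArrayList<>();

    public void rule(Predicate<F> antecedent, Consumer<F> consequent) {
        rules.add(new Rule<>(antecedent, consequent));
    }

    // Fires the consequent of every rule whose antecedent holds for the fact.
    public void evaluate(F fact) {
        for (Rule<F> r : rules) {
            if (r.antecedent().test(fact)) {
                r.consequent().accept(fact);
            }
        }
    }
}
```

Because the engine is generic over the fact type, the same core can sit behind any of the four areas above; only the facts and the rule vocabulary differ.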

Logging: Effective debugging of complex systems requires the ability to gather an integrated log for each activity in the chain of events that produces a given result. Supporting this requirement in an operational setting requires that all relevant logs can be time-aligned and assembled into a single report for analysis.
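The time-alignment step can be sketched directly. The Entry shape, source names, and report format are hypothetical stand-ins for whatever each subsystem actually logs; the point is that an activity id plus a timestamp is enough to assemble one integrated report:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: each subsystem writes (timestamp, activityId,
// source, message) entries; a single report per activity is assembled by
// filtering on the activity id and sorting on the timestamp.
public class LogMerger {
    public record Entry(long timestamp, String activityId, String source, String message) {}

    public static List<String> report(String activityId, List<List<Entry>> systemLogs) {
        List<Entry> merged = new ArrayList<>();
        for (List<Entry> log : systemLogs)
            for (Entry e : log)
                if (e.activityId().equals(activityId)) merged.add(e);
        merged.sort(Comparator.comparingLong(Entry::timestamp));
        List<String> lines = new ArrayList<>();
        for (Entry e : merged)
            lines.add(e.timestamp() + " [" + e.source() + "] " + e.message());
        return lines;
    }
}
```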

Monitoring and management: Business rules should be capable of being extended to monitor system operation (server load, queue depth, latency, etc.), allowing the system to be 'self-monitoring.' The use of a common tool permits the maximal number of people to understand its operation.

In addition, interfaces should be provided to allow information to be updated (recached) without bouncing the server. JMX is a reasonable example.
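A sketch of the recache idea (the class and method names are invented; in practice the recache operation would be exposed as a JMX MBean operation): readers always see a complete snapshot, and a management call swaps in freshly loaded data while the server keeps running:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of recache-without-restart. Readers never observe a
// half-loaded state: the snapshot reference is swapped atomically by the
// management operation that a JMX console would invoke.
public class ReloadableCache {
    private final AtomicReference<Map<String, String>> snapshot =
        new AtomicReference<>(Map.of());

    public String lookup(String key) {
        return snapshot.get().get(key);
    }

    // The operation a JMX MBean (or similar management interface) would
    // expose; the server process is never restarted.
    public void recache(Map<String, String> freshData) {
        snapshot.set(Map.copyOf(freshData));
    }
}
```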

Security: Obviously any enterprise system must provide for some level of security, minimally with LDAP support and hopefully with out-of-the-box support for OpenID and SAFE. In practice, I would caution against making security/access overly fine-grained, since it must survive people changing their roles in the organization, changes in business processes, etc. The more fine-grained your access model, the more thought is required to get it right and the greater the probability of getting it wrong.

I have personally found it useful to distinguish reading, writing, and editing data: opening up the reading and dissemination of the information while restricting writing and editing to the specific tools provided for specific stages of the process.

For example, given a standard lab workflow for data collection, analysis, upload and “publication” (to the persons requesting the tests and then the company at large): there is one tool for collecting, analyzing and uploading the data; there is a second set of tools for integrating and viewing the data in a larger context; and there may be a third set of tools for curating and editing data which has been found discrepant.
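The coarse-grained model just described might look like the following in Java. The tool names and grants are hypothetical, chosen to match the lab-workflow example above:

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the coarse-grained model: reading is open to
// everyone, while writing and editing are granted only to the tool that
// owns the corresponding stage of the workflow.
public class StagePermissions {
    public enum Op { READ, WRITE, EDIT }

    // tool -> operations it may perform (illustrative stage names)
    private static final Map<String, Set<Op>> GRANTS = Map.of(
        "collector", EnumSet.of(Op.READ, Op.WRITE),   // collect/analyze/upload
        "viewer",    EnumSet.of(Op.READ),             // integrate and view
        "curator",   EnumSet.of(Op.READ, Op.EDIT));   // fix discrepant data

    // Unknown tools still get READ: dissemination stays open by default.
    public static boolean allowed(String tool, Op op) {
        return GRANTS.getOrDefault(tool, EnumSet.of(Op.READ)).contains(op);
    }
}
```

Note how few distinctions there are to get wrong: three operations and one grant set per tool, rather than per-user, per-column entitlements.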

This rounds out the software requirements for a practical production system. Although these requirements appear (and are) extensive, most, if not all, of them appear in a number of enterprise-level toolkits. As I said at the beginning of this post: there are clear best practices. A PowerPoint that covers both these posts is available.

Monday, February 11, 2008

An Extensible System for Discovery Data

I’ve been thinking about how to make discovery informatics tools significantly more flexible, extensible, and perhaps even more maintainable than they currently are. I don’t think of this as being a software problem per se (at least not one requiring any new thinking on my part): however imperfectly, the required underlying software capabilities already exist. That is, although we certainly need systems that are more configurable, more workflow-oriented, and more flexible in their ability to mix elements on a page (via mash-ups etc.), and most scientific software is unquestionably deficient in these areas, the state of the art in software development in 2008 affords a clear set of proven practices to satisfy these needs.

The key issue for scientific software is that the science which we are trying to support changes over time in two significant albeit essentially different ways.

The first is that the areas of the business that we need to support/integrate with expand over time, e.g., starting at in-vitro testing and moving (in both directions) to support synthesis and in-vivo testing, perhaps eventually reaching into clinical trials. It is inevitable that over time the business changes, the strategy changes, the organizational structure changes -- in any case, once-solid (organizational) boundaries become permeable. These changes in the business have a strong impact on the design of scientific software, adding anything from the need to support radioisotopes and formulations to tracking freeze/thaw cycles.

The second, and more interesting, axis of change is that the science itself changes: new tests come on board that have a radically different cardinality:
  • The nature of the data changes: test results start as point averages, evolve into time series, and then become two-dimensional vectors, e.g., %INH, FLIPR, imaging;
  • Our understanding of the biology changes, e.g., one protein -> one gene;
  • Our ability to simplify the problem (a simplification that may have been unconscious) has been shattered, e.g., genetic variations of targets, protein pathways, and post-translational modifications begin to impact the data which we are gathering today.

A common solution is to build the system to support the most complex case. However, this doesn’t work if:
  • We don’t know the most complex case.
  • The most complex case is a superset of all possible cases, only one of which would ever occur in a particular system.
  • The group currently being supported can’t supply the information that will enable the distinctions at a later date -- so the information collected is all the same.

In my mind the best way to solve this is to modularize the domain into its simplest building blocks and then build up the necessary complexity using these building blocks. I will admit that this is what we think we’ve been doing all these years, but I don’t think it’s true. I challenge you to look at the actual tables (or objects) in your systems and ask yourself whether all of the columns (attributes) are strictly necessary in all (or even most) situations, or whether they (the columns/attributes) reflect the diffusion of business processes into our base design. My proposal is simply the following: for each building block, we scale back the attributes of each entity to the absolutely necessary minimum, paying special attention to items which could potentially change the cardinality.

The temptation in doing this type of analysis is to start at the “most fundamental” level and work your way up. The problem I have with doing things that way is that I find the prospect of describing a mouse in terms of its constituent quarks to be both daunting and without obvious value. My current approach is therefore to start at a well-grounded middle level (similar to the “middle distance”) and stub out to one level beyond the current need.

The goal is to be able to support multiple overlapping hierarchies, so that in different situations we can classify assays by technology, by gene, by both, or by neither.

The diagrams below indicate the kind of modeling that I envision.

The characteristics of such a “primal entity” include the absence of any bp (business-process) foreign keys, i.e., direct references to entities related solely as an aspect of the business process. These business-process relationships are moved into relationship tables particular to the situation at hand, reflecting the particular hierarchy of the business process currently being addressed. The foreign keys which are allowed under this no-bp-foreign-key constraint involve items that must be present for any entity of this sort: e.g., all synthesis batches must have components, a method of synthesis (even if ‘random’), an operator (even if ‘unknown’), and units of measure; in the frames that most of us work in, these can be taken as relatively fixed.
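A hypothetical in-memory sketch of this constraint (a real system would use relationship tables in the database; the Assay and Classification names are invented): the base entity carries no business-process foreign keys, and each hierarchy lives in its own relationship rows, so conflicting classifications coexist without touching the entity itself:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "primal entity" idea: Assay has no bp foreign keys; every
// business-process hierarchy (by gene, by technology, ...) is a separate
// set of relationship rows, so hierarchies can overlap or conflict freely.
public class PrimalModel {
    public record Assay(String id, String name) {}                        // no bp keys
    public record Classification(String assayId, String hierarchy, String value) {}

    private final List<Classification> relations = new ArrayList<>();

    public void classify(String assayId, String hierarchy, String value) {
        relations.add(new Classification(assayId, hierarchy, value));
    }

    // The same assay can be read through any hierarchy, or through none.
    public List<String> valuesFor(String assayId, String hierarchy) {
        List<String> out = new ArrayList<>();
        for (Classification c : relations)
            if (c.assayId().equals(assayId) && c.hierarchy().equals(hierarchy))
                out.add(c.value());
        return out;
    }
}
```

Adding a new, conflicting hierarchy is just a new value in the hierarchy column; neither the Assay entity nor any existing classification changes.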

The advantage here is critical: your system, and more importantly your data, can survive in the face of a significant unforeseen change in the scientific or business environment. The reason for this is straightforward: the model is now capable of supporting multiple conflicting hierarchies of relationships, so the introduction of a new, conflicting hierarchy doesn’t break your model. This works in a manner similar to the way in which an RDF export of your system can assist you in modeling its ontology, since RDF can support multiple, potentially conflicting ontologies.

My next post will focus on components of the software architecture.