Monday, January 22, 2007

RDF vs Ontologies

The way I look at it, RDF talks about what you have; ontologies talk about what you can have. The combination of ontology and data can then be fed into various reasoning engines to tease out the implications of your data.

This is pretty scary. Given my assumption that science advances and that the thinking about what you can have will change over time, incorporating inferred “facts” leaves one open to fundamental system instability.

My classic example here is “sorry, we don’t really mean one protein per gene anymore.” The ontology and the implications drawn from inferencing upon the data are wrecked, but the identifiers for gene, protein, transcription, etc. are unaffected.

RDF, at its most basic, gives you stable identifiers for what you have and allows the declaration of “stable” relationships between these objects. This allows you to communicate clearly about what you have and (possibly) to version it easily when the time comes to change (see below). These statements should remain valid even if the ontology in which they are thought to be embedded changes radically.
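To make that concrete, here is a minimal sketch in Java of statements as plain subject/predicate/object triples built from stable identifiers; the URIs and predicate names are invented for illustration.

```java
import java.util.List;

public class TripleExample {

    // A statement is just three identifiers; it carries no ontology with it.
    record Triple(String subject, String predicate, String object) {}

    public static void main(String[] args) {
        List<Triple> statements = List.of(
            new Triple("urn:example:gene/BRCA1",
                       "urn:example:predicate/hasTranscript",
                       "urn:example:transcript/BRCA1-201"),
            new Triple("urn:example:transcript/BRCA1-201",
                       "urn:example:predicate/encodes",
                       "urn:example:protein/P38398"));

        // The statements remain valid even if the surrounding ontology
        // (e.g., "one protein per gene") is later revised.
        statements.forEach(System.out::println);
    }
}
```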


Thinking at the RDF triple level also allows a low-overhead means of versioning your information, in a manner analogous to the ZFS file system and its built-in “copy-on-write” facility.

If I remember correctly, this copy-on-write is performed at the disk block level rather than at the file level. This is thought to be the basis for Apple’s “Time Machine” capability (in the next release of OS X). The system can just look back, determine the valid blocks at a particular time, and reconstitute the file to appear as it did at that time.

A similar functionality could be made to work at the RDF triple level. Triples that changed would be “overwritten,” but the old information would still be available with valid-from/valid-to timestamps. It is easy to overlay some provenance information on top (similar techniques are used in data warehouses to allow clear tracking of when information was updated/corrected).
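A rough sketch of what that could look like, assuming a simple in-memory store that keeps one current value per subject/predicate pair; the class and field names are illustrative, not a real triple-store API.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class VersionedTripleStore {

    record VersionedTriple(String subject, String predicate, String object,
                           Instant validFrom, Instant validTo) {}

    private final List<VersionedTriple> rows = new ArrayList<>();

    // "Overwrite" never deletes: it closes the old row's validity interval
    // and appends a new row, so history remains queryable.
    public void assertTriple(String s, String p, String o) {
        Instant now = Instant.now();
        for (int i = 0; i < rows.size(); i++) {
            VersionedTriple t = rows.get(i);
            if (t.validTo() == null && t.subject().equals(s) && t.predicate().equals(p)) {
                rows.set(i, new VersionedTriple(t.subject(), t.predicate(), t.object(),
                                                t.validFrom(), now));
            }
        }
        rows.add(new VersionedTriple(s, p, o, now, null));
    }

    // Reconstitute the view of the data as it appeared at a given time.
    public List<VersionedTriple> asOf(Instant when) {
        return rows.stream()
                   .filter(t -> !t.validFrom().isAfter(when)
                             && (t.validTo() == null || t.validTo().isAfter(when)))
                   .toList();
    }
}
```

The incremental cost of a change is one extra row, independent of how large the overall graph is, which is the same property the file-system analogy below relies on.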

This turns the fine-grained structure of RDF into a feature that provides an advantage similar to what is seen in these file systems: the incremental disk space required for versioning can be very small -- keeping multiple versions of a document requires incremental disk space roughly equal to the size of the changes, independent of the document size.

I have to admit a certain hesitance in saying “RDF is good” (aside from the rdf represented by my initials). I have not implemented an RDF-based system; I plan to do one in the next month or so and will update.

Monday, January 15, 2007

Data Integration and Ontologies


It is useful to think about three types of data integration:

  • Type 1. Document level -- the user can determine what documents might have information of interest
  • Type 2. Term level -- the user can build reports using items from multiple documents/systems e.g., each cell in a spreadsheet can come from different systems.
  • Type 3. Inference level -- terms from one or more documents/systems can be combined to derive information (new terms) in the system being examined.

Both the functionality and the ontological commitment increase from Type 1 to Type 3 systems.

The increasing level of ontological commitment, as perceived by a user of the system, appears as follows:

  • Type 1: There is something here which may be meaningful.
  • Type 2: If something does exist, it is meaningful.
  • Type 3: The implications of a thing’s existence are meaningful.

Rather than attempting to determine the costs of these systems a priori, let’s look at some examples.

The simplest Type 1 system involves simply placing documents in a file system; with some attention to naming and structure, this allows the contents to be easily retrieved and accurately assessed. However, anecdotal and personal experience show that the retrievability of the information degrades over time and that it does not scale beyond small collections of items. This degradation stems in part from the fact that these systems allow only a single axis of retrieval, based upon the heuristics embedded in the (path)names of the files.

The next level up in complexity for Type 1 systems is the web and file/URL tagging systems. Such systems continue to make few a priori claims for the utility of the retrieved items, but the use of search engines and URL tagging allows multiple axes of query, based upon either the algorithms embedded in the search engines or the tags and the sources of those tags. Local file systems supporting tags allow users to (eventually) recover their tag definitions, either via introspection or by examining other documents containing the suspected tags.

Some of the terminology limitations of free-text tags are alleviated by the fact that the items being tagged are URLs/files and therefore unique and retrievable. Retrieving and examining the tagged information allows one to assess the information content (retrieval power) of each tag in the context of the current search. Tagging has been getting a lot of traction on the web, with sites such as Shadows and Flickr appearing as popular tools for gathering and sharing tags (social bookmarking). Similarly, modern file systems allow tagging of files and directories, and applications for images and other media content allow sorting and management of media files via tags.
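As a toy illustration of why tags give multiple axes of retrieval where a path name gives only one, here is a small Java sketch; the tags and URLs used with it would be whatever a user invents, nothing here is a real tagging API.

```java
import java.util.*;

public class TagIndex {

    // tag -> set of tagged URLs
    private final Map<String, Set<String>> urlsByTag = new HashMap<>();

    public void tag(String url, String... tags) {
        for (String t : tags) {
            urlsByTag.computeIfAbsent(t, k -> new HashSet<>()).add(url);
        }
    }

    // Intersecting tags narrows the result set, much like combining
    // tags on a social-bookmarking site.
    public Set<String> find(String... tags) {
        Set<String> result = null;
        for (String t : tags) {
            Set<String> urls = urlsByTag.getOrDefault(t, Set.of());
            if (result == null) result = new HashSet<>(urls);
            else result.retainAll(urls);
        }
        return result == null ? Set.of() : result;
    }
}
```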

Type 2 systems: Term-level integration is the province of what is commonly called enterprise integration, which allows reporting and integration of applications within the enterprise (Enterprise Application Integration -- EAI). In practice, achieving integration requires stability of the term referents and their use in communication between systems. This stability is what might be termed “stability of use.” Commonly, the more general the use, the more restricted the interface. This involves a conscious decision to “narrow” the functionality when moving from the internal data model to the published interface, often restricting the interface to data transfer objects with a limited number of attributes.

Examples include Service Oriented Architectures (SOAs) with well-defined semantics and ways of modifying them over time. Changes to public information require explicit revision control and/or verification with all stakeholders that any changes will operate as expected. In general, the “wider” the interface, the more frequently the verification is required, with a concomitant drag on system agility.

Type 3 systems: Semantic/inference-level integration allows inference of new data from existing or newly added information, e.g., IF A AND B THEN C can be inferred. This can cascade into IF B AND C THEN D, etc. This is a very strong ontological commitment that requires understanding the implications of the complete set of constraints and inferential mechanisms in the system. The payoff is substantial in that it becomes possible to infer a great deal from just a few additional pieces of information. This does, however, represent a significant “widening” of the interface, with potentially severe implications for system verification and evolution.
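A toy forward-chaining sketch of that cascade, with placeholder facts and rules; real inference engines are of course far more sophisticated, but the shape of the commitment is the same.

```java
import java.util.*;

public class ForwardChainer {

    // "IF all of these facts THEN that fact"
    record Rule(Set<String> ifAll, String then) {}

    public static Set<String> infer(Set<String> facts, List<Rule> rules) {
        Set<String> known = new HashSet<>(facts);
        boolean changed = true;
        while (changed) {                      // keep cascading until nothing new appears
            changed = false;
            for (Rule r : rules) {
                if (known.containsAll(r.ifAll()) && known.add(r.then())) {
                    changed = true;
                }
            }
        }
        return known;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule(Set.of("A", "B"), "C"),
            new Rule(Set.of("B", "C"), "D"));
        // Asserting just A and B yields C, and then D, as inferred "facts".
        System.out.println(infer(Set.of("A", "B"), rules));
    }
}
```

Note that a change to a single rule can invalidate a whole chain of downstream conclusions, which is exactly the analysis burden discussed below.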

Practical Implications

Type 1 systems imply no ontological fanout from a local commitment, so it is possible to spontaneously evolve the “definition in use” (“in use” signifies that there is no requirement for an analytic definition), since the definitions are mostly manual or derived from manual definitions.

On the other hand, for change to occur in Type 2 and Type 3 systems, the implications of the change must be understood for downstream systems that rely upon this changed information as an integral part of an automated process.

In Type 2 systems there is a restricted ontological commitment, which requires that changes be verified with the systems that couple to the system being changed. Since the change occurs through a restricted interface, the analysis is similarly constrained. All else being equal, the greater the functionality and use of the data, the greater the analysis that must be performed.

Type 3 systems, with their ability to build inferences upon new data, have the largest analysis burden of any of these systems. This implies that they will be the least amenable to revision. There is a possibility that the use of ontologies will assure that the implications of changes are accounted for a priori; thus any change consistent with the ontology would be easily integrated. The problem is that there is no historical precedent for developing stable systems of this type.

Wednesday, January 10, 2007

Aspects of a Platform Architecture: Part 2 - Evolution of a Platform

Again, the goal is to allow a small number of applications that share some core processes/entities to interact in a loose way and to ship and evolve as independently as possible in the face of changing science, user needs, infrastructure, and developer/application allocation.

In the last section, I talked about what a platform architecture looks like as a static entity. However, the reason for having a platform architecture is the evolution of the application suite over time.

With stable identifiers and the architectural components discussed previously, I have found that middleware can be used as a tool for coherent platform development. Again, I’m basing everything on stable identifiers. Without stable identifiers everything is very hard; with them, some things are possible, sometimes even easy. In the absence of stable identifiers it is hard for any common substrate to get a leverage point that provides a clear value-add to all of the products under development.

My definition of middleware is a bit broader than that in Wikipedia: “Middleware is the enabling technology of enterprise application integration. It describes a piece of software that connects two or more software applications so that they can exchange data.”

For me, middleware is also the place to incorporate the cross-application business logic that allows users to interact with and see the data in a consistent manner. The middleware also shields custom applications from the hidden details of the database(s) or other persistent storage, etc.

Some real-life examples that I’ve seen in the life sciences include one situation in which the same substance may have two different identifiers, and another in which the data is to be combined using a non-trivial algorithm, e.g., a geometric mean with outlier removal. In both cases the middleware served as the foundation for consistency, since it was critical that all of the applications present the same information to users and that there be only one implementation of the retrieval/calculation methods (for any non-trivial application, the results given by two different implementations can skew over time).
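As an illustration of keeping exactly one implementation of such a calculation in the middleware, here is a hedged sketch. The particular outlier rule (drop values whose log lies more than two standard deviations from the mean of the logs) is my own assumption to make the example concrete, not the algorithm used in the systems described above.

```java
import java.util.Arrays;
import java.util.List;

public final class SharedCalculations {

    private SharedCalculations() {}

    // Geometric mean with a simple outlier filter; assumes all values are positive.
    public static double geometricMeanWithoutOutliers(List<Double> values) {
        double[] logs = values.stream().mapToDouble(Math::log).toArray();
        double mean = Arrays.stream(logs).average().orElse(Double.NaN);
        double sd = Math.sqrt(Arrays.stream(logs)
                                    .map(x -> (x - mean) * (x - mean))
                                    .average().orElse(0.0));
        // Keep values whose log is within two standard deviations of the mean log
        // (an assumed rule, chosen only for illustration).
        double keptMean = Arrays.stream(logs)
                                .filter(x -> Math.abs(x - mean) <= 2 * sd)
                                .average().orElse(Double.NaN);
        return Math.exp(keptMean);
    }
}
```

Because every application calls this one method, a change to the rule changes every display at once, instead of drifting implementation by implementation.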

Some of the considerations around method naming, signatures, etc. are shared with library design and development. The best resource I know for addressing those considerations is Framework Design Guidelines: Conventions, Idioms, and Patterns for Reusable .NET Libraries by Krzysztof Cwalina and Brad Abrams.

A conventional diagram of such a system appears below

A more accurate diagram, given the goal of supporting rapid system evolution is

Here the red links show ad-hoc connections that support rapid development. My preference is to have the middleware be the responsibility of a single person, as it is the key leverage point for the long-term evolution of the architecture.

This person is given the time not only to evaluate architectural ideas that come in from other members of the team, who may have implemented solutions that should be made available to others in a slightly generalized fashion, but also to examine what’s going on in the industry as far as standards, toolkits, etc. that will help long-term product evolution. In addition, for the ad-hoc connections and implementations to be capable of being moved into the middleware as described below, it is important for this person to have influence upon the design of these interfaces. I’ve found it uniformly tempting for application owners to embed too much information in their interfaces so as to simplify their short-term development, thereby hindering long-term development.

Still, what I’ve described so far sounds a bit static -- how does it play out over time?

Evolution proceeds as follows:

Even if we start with the Platonic “conventional diagram” shown above, it will quickly evolve into something along the lines of the more realistic version, which shows some ad-hoc connections that have evolved over time to give the individual applications the flexibility to meet their requirements.

The next release of the middleware (1.1) is picked up by the Ensemble Review application. In this case the release supports functionality that had required an “outside the box” access solution for the Ensemble Review, so that access can now occur through the middleware. Green arrows show functionality that has moved to the middleware with the release shown.

The arrow from the Small Group Drill Down application to the Ensemble Review app is shown as now going through the middleware, since my practice was to ship the middleware as a labeled jar rather than a web service. Although this had the downside of increasing the footprint of each application, it did allow the interface to remain very transparent.

The next release of the middleware (1.2) is picked up by the New Data Requests application. We now have three versions of the middleware in production; each application has shipped independently, but they are all moving in the same direction and there has been no forking for feature support -- forking for bug fixes is of course possible.

And of course, as shown below, an application can pick up the latest version without requiring any of its improvements.

And then the cycle repeats. The only time a “synchronized ship” (that is, when all applications ship to production simultaneously with the same version of the middleware) is required is when there is an incompatible structural change to the shared data structures or to a core business process/algorithm. At this point everyone picks up the same version of the middleware, the shared data store is migrated, and extensive testing occurs.

The advantages of this sort of approach include:
The testing time for each application is reduced. An application need not pick up a new version of the middleware if that version doesn’t provide any functionality or bug fixes that it requires (an underlying assumption is that there is adequate regression test coverage to assure that the functionality required by the application is not broken in the new release).

When an application needs the new functionality, it upgrades to the current revision.
This also means that an application that does not depend upon the new middleware functionality is not governed by the middleware’s shipping timeline.

This has the additional benefit of allowing middleware releases to be more focused on the needs of a particular product, or to engage the product architect in early testing of a feature that will help them.

Aspects of a Platform Architecture: Part 1

How does one create a “reasonable” platform architecture? By which I mean, how does one put a system in place that allows a small number of (~10) related products to evolve and ship independently while building upon an infrastructure substrate that allows for common policies, consistent data access, and presentation?

First, to be clear on the high-level goals, what is an enterprise platform architecture?
Multiple products
  • different developers, different use cases (users may overlap), same general business area
Multi-year timelines
  • Developers, resourcing, budgets, and technologies will all change; even the rate of change will change.
Enterprise level
  • Multiple applications serving a particular group of areas within a business; the issues are different from those involved in building an industry platform such as Windows, Spring, or Linux.
Common views upon the data
  • When the expectation around the data is the same, the result used/displayed should be the same, no matter how complex the processing is to derive it.
  • Ability to deviate when the expectation is unique or novel. Achieving the responsiveness required for iterative/agile techniques requires the ability to “special case” data access and processing methods. This allows these special cases to be well grounded before they are moved into the substrate (as appropriate).
Products should normally be able to ship independently
Products should be able to communicate with each other so that users can switch between applications (web-based in this case) when desired, with some minimal context being maintained.
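One lightweight way to maintain that minimal context is to pass nothing more than a stable identifier in the link between applications. A sketch follows; the host names and parameter name are invented for illustration, not part of any system described here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ContextHandoff {

    // Build a link into another product carrying only a stable identifier.
    public static URI linkTo(String targetApp, String stableId) {
        String encoded = URLEncoder.encode(stableId, StandardCharsets.UTF_8);
        return URI.create("https://" + targetApp + ".example.internal/view?id=" + encoded);
    }

    public static void main(String[] args) {
        // Jumping from a review app to a drill-down app with the same entity in focus.
        System.out.println(linkTo("drilldown", "urn:example:substance/0001234"));
    }
}
```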

Building this requires creation of a system consisting of a
Data Architecture
  • How do you identify what you’re looking for, where do you find it?
Application Roadmap
  • What application should own the functionality, and when will it be deployed?
Technology Architecture
  • What are the technologies that we’ve settled upon, how are we exploring new ones, how do we decide what to explore/who does it.
Functional Architecture
  • How are we structuring the functionality and identifying common components, how much of the implementation can we hide, what are the hiding requirements (timeliness etc.)

The first step: a Data Architecture foundation.
Select a small set of entities and assure that they have stable, anonymous identifiers. This is surprisingly difficult when building systems out of existing products in scientific domains.

I have been involved in frequent discussions with users who wish to embed domain information in the identifiers so that they can easily identify what they are holding in their hands without needing to go to a computer to find out what it is, or, worst case, wait for an application to be developed that will let them go to a computer to find out what it is -- a.k.a. every user’s worst nightmare. The best solution I’ve come up with is to print both when necessary.

This selection is also constrained by existing/off-the-shelf systems and what they can support. Selecting internal identifiers from the core of these systems and building synonym tables is a reasonable compromise.
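A sketch of such a synonym table, mapping a stable internal identifier to the identifiers used by the surrounding systems; the system names and identifiers are hypothetical.

```java
import java.util.*;

public class SynonymTable {

    // internal id -> (external system name -> external id)
    private final Map<String, Map<String, String>> synonyms = new HashMap<>();

    public void addSynonym(String internalId, String system, String externalId) {
        synonyms.computeIfAbsent(internalId, k -> new HashMap<>())
                .put(system, externalId);
    }

    // Resolve the identifier an off-the-shelf system knows this entity by.
    public Optional<String> resolve(String internalId, String system) {
        return Optional.ofNullable(synonyms.getOrDefault(internalId, Map.of()).get(system));
    }
}
```

The internal identifier stays anonymous and stable; anything human-readable (or printable on a label) lives in the synonym table rather than in the identifier itself.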

The issues of the remaining Data Architecture, plus the other aspects -- the Application Roadmap, Technology Architecture, and Functional Architecture -- are important. However, given our multi-product, multi-year, frequent-revision goals, the issues are as much about building an infrastructure that supports a culture as about building an architecture. Such a culture involves setting minimal contracts and some processes around their evolution, more than instantiating any particular set. The core expectation is that the platform will outlive any application, and it is at the platform level that practices around technology adoption, quality, and user interaction should be set. Note: this is not meant to imply that the practices should be uniform across the applications, but the pattern of use categories vs. development strategy should be set at the platform level.