Saturday, December 22, 2007

JBoss Seam?

Lately I have felt the need to get familiar with a new toolset for “industrial strength” applications. There are a couple of factors driving this:

1. The methods that I currently use to build Web 2.0 applications revolve around RoR. However, RoR doesn’t have a strong user community in my customer base (at least on deployed applications). In addition, I’m much more experienced with the Java stack and would be more confident that my proposed web 2.0 (and web 3.0) capabilities would work in production if I knew how to build them using a Java based/oriented toolkit.

2. In thinking about a ‘next generation’ flexible product for the pharmaceutical/discovery space (more on this in a future post) I feel that one of the key components is a good workflow engine (see FootNote). For a workflow engine to be effective in a scientific environment, it is important to be able to drop down into a standard language to perform any special-case processing that might be required. This matters more in a scientific setting than in a general business application, since a portion of the flow will likely depend upon an algorithmic analysis of the data -- using algorithms that have been newly defined to analyze this dataset. This requires a solid, well-defined language suitable for general use, rather than an ad hoc scripting language possibly designed by someone without a background in designing computer languages. My early heuristic in this area was that if Guy Steele hadn’t written on a language or been involved with it, you shouldn’t use it. This heuristic doesn’t appear as useful as it once was -- e.g., I have heard good things about C#. However, if you read through some of the language specs that Steele has been involved with, you see what I mean: functionality that is essentially Turing complete, well thought out exception handling, and a clear explanation of operation order, especially around object and class instantiation.
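As a sketch of what “dropping down into a standard language” can look like, here is a toy engine where most steps are registered declaratively but any step is plain Ruby (every name here is invented; this is not any real workflow product):

```ruby
# Illustrative sketch only -- a toy "workflow" where steps are registered
# declaratively, but any step can drop down into plain Ruby for special-case
# algorithmic processing. All names are invented, not from any real engine.
class Workflow
  def initialize
    @steps = []
  end

  # Register a named step; the block holds arbitrary Ruby logic.
  def step(name, &block)
    @steps << [name, block]
    self
  end

  # Run the steps in order, threading the data through each one.
  def run(data)
    @steps.each { |_name, block| data = block.call(data) }
    data
  end
end

flow = Workflow.new
flow.step(:normalize) { |xs| xs.map { |x| x.to_f } }
# A newly defined, dataset-specific analysis written directly in the host
# language -- the kind of special case an ad hoc scripting language fights:
flow.step(:drop_outliers) do |xs|
  median = xs.sort[xs.size / 2]
  xs.select { |x| x <= median * 10 }
end

p flow.run([1, 2, 3, 100])  # => [1.0, 2.0, 3.0]
```

The point is not the engine itself, but that the special-case step gets the full language: exceptions, libraries, and well-defined semantics.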

Given these core requirements, an obvious candidate is the JBoss Seam toolkit. Seam has one other interesting feature: the concept of a conversation to handle user context. From the docs: “But suppose we suddenly discover a system requirement that says that a user is allowed to have multiple concurrent conversations, halfway through the development of the system.”

Conversations appear to allow the support of users who want to have multiple browser windows open to look at different result sets, perform different activities etc. This is definitely a ‘nice to have’ for scientific work -- I have found that users often want to look at and drill down on multiple data sets for comparison and analysis. A similar capability was developed in my group at Millennium by Vlado and David (two very talented guys) with their PageDataServlets framework. It is satisfying to note that PageDataServlets were developed back in 1999 and are still in use. Although it isn’t something that you would use in a new product (1999 was a while ago), it’s a testimony to their quality and depth that they are still in use and provide a capability that you see in very few web sites today.

Seam is open source and has some unusual characteristics for an open-source project:
1. Attractive Demos that work ‘out of the box’ (& I’m running an Intel Mac with Leopard, so that is saying something)
2. Extensive documentation

It also has the requisite active user and developer community -- aka it has enough momentum that it won’t die soon.

So as a test, I’m going to try to get a RoR project ported to Seam and also develop a de novo web 2.0 project in it -- more in my next post.

FootNote: I have been saying that workflow tools are important for a while -- my group shipped a workflow-based application built upon BEA’s Process Integrator platform back around the turn of the century. The workflow space has taken much longer to mature than I expected. The standards groups appear to have split at one point and merged again. BPEL is starting to be a standard feature in product brochures, so I’m hoping that it is not premature to start to use it again.

Wednesday, June 20, 2007

Using Semantic Tools with real data

In one of my previous posts I mentioned the question “what do you gain from having the instance data present and available for processing by semantic web tools?” The core idea is that this would simplify the detection of modeling errors and highlight domain misunderstandings, but I couldn’t be sure until I had tried it with data from a real system.

I’ve finally been able to review the results from one such system and have convinced myself that it is useful.

A bit of context -- doing this in a way that I considered valid required data from a system that I understood well. The best available system was an “art submissions” application that tracks available artwork and its attributes.

The data was extracted using my rdf_rails utility, which generates RDF and OWL files from a RoR application.
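As a rough sketch of the idea behind such an extraction (this is not the actual rdf_rails code; the namespace, helper name, and mapping rules are invented for illustration), each table becomes a class, each row a typed instance, and each column a property:

```ruby
# Hypothetical sketch only -- not the actual rdf_rails implementation.
# BASE and all names below are invented for illustration.
BASE = "http://example.org/app#"

# rows: plain hashes standing in for what an ActiveRecord model would return
def to_ntriples(klass, rows)
  rows.flat_map do |row|
    subject = "<#{BASE}#{klass}_#{row[:id]}>"
    # One rdf:type triple per row...
    type_triple = "#{subject} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <#{BASE}#{klass}> ."
    # ...plus one property triple per non-id column.
    prop_triples = row.reject { |k, _| k == :id }.map do |prop, val|
      "#{subject} <#{BASE}#{klass}_#{prop}> \"#{val}\" ."
    end
    [type_triple] + prop_triples
  end
end

artworks = [{ id: 1, width_inches: 8, framed: true }]
puts to_ntriples("Artwork", artworks)
```

The real utility also emits the schema (RDFS/OWL) alongside the instance triples, which is what makes the reasoning below possible.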

After trying this out for a bit, I have come to the conclusion there is utility here.

The utility can be best demonstrated with a simple example (all examples are in RacerPro; see my previous post).

(retrieve (?x ) (and (?x Artwork) (neg (?x shipable-art)) (neg (?x unshipable-art))))*

This query retrieves all artworks that are neither shippable nor unshippable.

Since the shipable-art and unshipable-art concepts were intended to completely cover the Artwork category, any non-null result indicates either an error in domain modeling/understanding or a data cleanliness problem, either of which should be detected prior to rolling out the model and the application that embodies it.

I admit that dichotomous coverings are a simple case, but I think it clearly demonstrates the value of the approach.

While on the topic of semantic web tools, it would be remiss of me not to mention the swtools spreadsheet, which is a handy and thorough catalog of available semantic web tools.

Published by Michael K. Bergman

* Note from the RacerPro documentation: neg is a unary constructor, the negation-as-failure (NAF) negation. The argument is a query body.

The concepts shipable-art and unshipable-art are defined as follows:

(define-concept shipable-art
  (or (boolean= Artwork_framed #T)
      (some Artwork_weight_pounds (> racer-internal%has-real-value 10))
      (some Artwork_width_inches (> racer-internal%has-real-value 10))
      (some Artwork_height_inches (> racer-internal%has-real-value 10))
      (some Artwork_depth_inches (> racer-internal%has-real-value 10))))

Monday, June 18, 2007


I’ve been doing some ontology analysis using RacerPro. I decided to try out RacerPro primarily because my current modeling project uses numeric relationships to determine class membership. My quick scan of the available tools determined that RacerPro offered the best support in this space (any suggestions about other tools supporting such operations would be appreciated).

RacerPro supports the use of numerics in two ways.

The first is as a simple query e.g.,
(retrieve (?x) (?x (and (<= Artwork_width_inches 10) (<= Artwork_height_inches 10))))
;; aka retrieve every object for which the role (property) Artwork_width_inches <= 10 and Artwork_height_inches <= 10

The second is as a class/concept

(define-concept can-ship
  (and (some Artwork_width_inches (<= racer-internal%has-real-value 10))
       (some Artwork_height_inches (<= racer-internal%has-real-value 10))))

This is essentially the same, only this time defining a concept rather than issuing a query.

It is fair to say that the racer-internal%has-real-value term did not leap out from the documentation. However, the Racer technical support was both very accurate and extremely responsive, getting me up and running pretty quickly.

All in all, I’m happy with the product.

Tuesday, May 29, 2007

rdf_rails on rubyforge

I just posted a project, rdf_rails, up on RubyForge that will take a RoR application and convert it to OWL/RDF (including all of the instance data).

This is very much an alpha level project -- I only had one RoR application available for testing.

If you’d like to give it a quick spin I’d appreciate any feedback.

I’m interested in two types of feedback

The first is conventional software testing: bugs, missing features, inadequate documentation, etc.

The second revolves around the utility of the function -- what do you gain from having the instance data present and available for processing by semantic web tools? My postulate is that this would allow a more thorough understanding of the data constituting the domain. An application could then be initially deployed with minimal constraints while the domain is still under exploration. Then, as experience is gained and the system has been populated with data, intuitions about the ontology of the domain could be validated against the data.

I haven’t fully convinced myself of this as yet. It still appears possible, but the compelling example hasn’t yet surfaced. I would like to hear any (positive or negative) experiences that you’ve had in this area.

Wednesday, May 2, 2007

Architecture and Spreadsheets

I gave a presentation on May 1st at BioIT World 2007 as part of the panel Interoperability and Standards: Progress towards an Industry Architecture. In it, I developed an example of the evolution and increasing scope of a spreadsheet to show the ramifications of software engineering principles, software architecture, and eventually industry architecture upon bringing timely data to a cell in your spreadsheet.

Here’s a link to the PDF

Thursday, April 12, 2007

RDF and RubyOnRails

Haven’t posted in a while -- I’ve been working on a Ruby utility that generates RDF and RDFS files from a Ruby on Rails application.

The goal is to enable the use of semantic tools to explore the underlying structure of your data, using the built-in capabilities of these tools to quickly discover contradictions between the postulated structure and the actual data, e.g., “I think that every event only has a single location.”

In this scenario, one uses the utility to build a base ontology (including instances) from the production system. An application could then be deployed with relatively loose constraints, with the constraints being refined after the application has had the opportunity to accumulate real data. This is especially useful where the character of the data is not well understood/controversial at the time of the initial deployment.

The project has a useful side benefit of allowing me to gain a better understanding of both Ruby and the RoR framework (see the note on class loading below)

The good news is that it was relatively straightforward to get .rdf and .rdfs files generated so that I could read them into Protege (which may not be the best tool for this, but it is one with which I’m familiar and has a substantial amount of documentation)

The bad news is that I couldn’t add any refinements to an rdf/rdfs project in Protege -- it requires a ‘real ontology’, i.e., an .owl file (which is obvious in retrospect but.....)

I’m now in the process of generating the .owl file.

Class Loading: The classes in a Ruby on Rails application are not loaded when the console is started. ActiveRecord subclasses are loaded on demand via a hook added to const_missing.

The way it is handled is interesting, so I’m reproducing it here (from /usr/local/lib/ruby/gems/1.8/gems/activesupport-1.4.1/lib/active_support/dependencies.rb in my installation):

class Module #:nodoc:
  # Rename the original handler so we can chain it to the new one
  alias :rails_original_const_missing :const_missing

  # Use const_missing to autoload associations so we don’t have to
  # require_association when using single-table inheritance.
  def const_missing(class_id)
    Dependencies.load_missing_constant self, class_id
  end

  def unloadable(const_desc = self)
    # ...
  end
end

Monday, February 5, 2007

Microformats, RDF and Life Sciences

I attended the microformats session at mashupcamp3 the week of Jan 17th which had an interesting sidebar about how microformats are not RDF.

For those unfamiliar with microformats, the core reference appears to be the microformats.org site. Microformats consist of ~10 specifications plus approximately the same number of draft specifications.

The sidebar centered upon the fact that a lot of real work can get done with microformats, but some people, particularly in the biology/life sciences domain, are very heavily committed to RDF. The question arose as to why. There wasn’t enough time remaining (and probably not enough interest) to explore the why.

Looking at the adopted and proposed formats, I am struck by a few things. The first is that they each capture a nice nugget of functionality: calendars, contact information, news feeds, etc. The second is that they are designed to capture the common cases while avoiding the complexity of handling the uncommon situations, which I think is a good thing.

The nice side effect of this aesthetic is that you can be up and running quickly doing real work, exchanging information etc. The bad side effect is that if you need to do something more complicated the hooks don’t exist to allow you to describe what’s going on.

In general, the more unstructured you are willing to be, the easier it is to capture all of the information. The difficulty arises when you try to curate it or use it in another context. As an example, think of the initial attempt that often appears in a database design: a single table consisting of N text fields. It can work and can hold pretty much anything. Detecting duplicates and understanding the structure come later if at all. However, depending upon the scale and use of the information this may just be fine.

In my opinion, at the enterprise level we sometimes overemphasize the scale and structural integrity issues. Scale can be a big deal if you’re trying to achieve perfect reconciliation of information. If “good enough” is OK, large-scale integration can be achieved in practice with very unstructured data. A good example of this is Hype Machine (which was demo’d at mashupcamp3) which mines music blogs effectively -- something that I would have thought impossible, given all the issues around spelling, new band names etc. It works partly because the problem is to find something rather than to find everything with complete accuracy.

At a deeper level what is more characteristic of the areas addressed by microformats is that we can develop a good understanding of what’s going on from our intuitions of how the world works and these intuitions should be able to cover a good number of the situations that we will actually encounter.

In life sciences this is simply not the case. Our intuitions are often wrong, and in the clinical area the number of potential confounding factors is immense (e.g., the MedDRA term count has been listed as 65,872), so there is an understandable push to design formats that are extensible and can capture information in well-defined and reusable ways.

However, microformats and mashups in general do raise the question “Is this a case of the good driving out the best?” Despite my bias towards ‘scalability’ and ‘enterprise solutions’, it is hard to argue with standing up an application in a few hours that provides some utility to end users (and some real data on use etc.), even if it requires some real work to migrate to a more scalable application when it comes online.

Monday, January 22, 2007

RDF vs Ontologies

The way I look at it, RDF talks about what you have, Ontologies talk about what you can have. The combination of ontology and the data can then be fed into various reasoning engines to tease out the implications of your data.

This is pretty scary. Given my assumption that science advances and that thinking about what you can have will change over time, incorporating inferred “facts” leaves one open to fundamental system instability.

My classic example here is “sorry, we don’t really mean one protein per gene anymore.” The ontology and the implications drawn from inferencing upon the data are wrecked, but the identifiers for gene, protein, transcript, etc. are unaffected.

RDF, at its most basic, gives you stable identifiers for what you have and allows the declaration of “stable” relationships between these objects. This allows you to communicate clearly about what you have and (possibly) to version it easily when the time comes to change (see below). These statements should remain valid even if the ontology in which they are thought to be embedded changes radically.


Thinking at the RDF triple level also allows a low-overhead means of versioning your information, in a manner analogous to the ZFS file system and its built-in “copy-on-write” facility.

If I remember correctly, this copy-on-write is performed at the disk block level rather than at the file level. This is thought to be the basis for Apple’s “Time Machine” capability (in the next release of OS X). The system can just look back, determine the valid blocks at a particular time, and reconstitute the file to appear as it did at that time.

A similar functionality could be made to work at the RDF triple level. Triples that changed would be “overwritten” but the old information would still be available with a timestamp of valid from/to dates. It is easy to overlay some provenance information on top (similar techniques are used in data warehouses to allow clear tracking of when information was updated/corrected).

This turns the fine-grained structure of RDF into a feature that provides an advantage similar to what is seen in these file systems: the incremental disk space required for versioning can be very small -- keeping multiple versions of a document requires incremental disk space roughly equal to the size of the changes, independent of the document size.
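A minimal sketch of what such triple-level versioning might look like (the class, method names, and gene example are invented for illustration): a change never overwrites in place; it closes the old triple’s validity interval and appends a new one, so any past state can be reconstituted.

```ruby
# Illustrative sketch of triple-level "copy-on-write" with valid from/to dates.
class VersionedStore
  Triple = Struct.new(:s, :p, :o, :valid_from, :valid_to)

  def initialize
    @triples = []
  end

  # Assert a new value, retiring any currently valid triple for (s, p).
  def assert(s, p, o, at)
    @triples.each do |tr|
      tr.valid_to = at if tr.s == s && tr.p == p && tr.valid_to.nil?
    end
    @triples << Triple.new(s, p, o, at, nil)
  end

  # Reconstitute the value as it appeared at a given time.
  def value_at(s, p, at)
    match = @triples.find do |tr|
      tr.s == s && tr.p == p &&
        tr.valid_from <= at && (tr.valid_to.nil? || at < tr.valid_to)
    end
    match && match.o
  end
end

store = VersionedStore.new
store.assert("gene42", "codes_for", "protein_A", 2000)
store.assert("gene42", "codes_for", "protein_A_or_B", 2005)  # the science moved on
p store.value_at("gene42", "codes_for", 2003)  # => "protein_A"
p store.value_at("gene42", "codes_for", 2006)  # => "protein_A_or_B"
```

Provenance (who asserted what, and when) layers on top of the same interval structure.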

I have to admit a certain hesitance in saying “RDF is good” (aside from the rdf represented by my initials). I have not implemented an RDF-based system; I plan to do one in the next month or so and will update.

Monday, January 15, 2007

Data Integration and Ontologies


It is useful to think about three types of data integration

  • Type 1. Document level -- the user can determine what documents might have information of interest
  • Type 2. Term level -- the user can build reports using items from multiple documents/systems e.g., each cell in a spreadsheet can come from different systems.
  • Type 3. Inference level -- terms from one or more documents/systems can be combined to derive information (new terms) in the system being examined.

Both the functionality and the ontological commitment increase from Type 1 to Type 3 systems.

The increasing level of ontological commitment, as perceived by a user of the system, appears as follows:

  • Type 1: There is something here which may be meaningful
  • Type 2: If something does exist, it is meaningful
  • Type 3: The implications of a thing’s existence are meaningful

Rather than attempting to determine the costs of these systems a priori, let’s look at some examples.

The simplest Type 1 system involves simply placing documents in a file system; with some attention to naming and structure, this allows the contents to be easily retrieved and accurately assessed. However, anecdotal and personal experience shows that the retrievability of the information degrades over time and does not scale beyond small collections of items. This degradation stems in part from the fact that these systems allow only a single axis of retrieval, based upon the heuristics embedded in the (path)name of the files.

The next level up in complexity for Type 1 systems is the web and file/URL tagging systems. Such systems continue to make few a priori claims for the utility of the retrieved items, but search engines and URL tagging allow multiple axes of query, based upon either the algorithms embedded in the search engines or the tags and the sources of those tags. Local file systems supporting tags allow users to (eventually) recover their tag definitions, either via introspection or via an examination of other documents containing suspected tags.

Some of the terminology limitations of free-text tags are alleviated by the fact that the items being tagged are URLs/files and are therefore unique and retrievable. Retrieving and examining the tagged information allows one to assess the information content (retrieval power) of each tag in the context of the current search. Tagging has been getting a lot of traction on the web, with sites such as Shadows and Flickr appearing as popular tools for gathering and sharing tags (social bookmarking). Similarly, modern file systems allow tagging of files and directories, and applications for images and other media content allow sorting and management of media files via tags.

Type 2 Systems: Term-level integration is the province of what is commonly called enterprise integration, which allows reporting and integration of applications within the enterprise (Enterprise Application Integration -- EAI). In practice, achieving integration requires stability of the term referents and their use in communication between systems -- what might be termed “stability of use.” Commonly, the more general the use, the more restricted the interface. This involves a conscious decision to “narrow” the functionality when moving from the internal data model to the published interface, often restricting the interface to data transfer objects with a limited number of attributes.
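A sketch of that narrowing, with invented types (not from any real system): the internal model carries many attributes, but the published data transfer object exposes only the few that other systems are allowed to couple to.

```ruby
# Illustrative only: internal model vs. the narrowed, published DTO.
InternalArtwork = Struct.new(:id, :title, :owner, :cost, :location_history, :notes)
ArtworkDTO      = Struct.new(:id, :title)  # the published shape

def to_dto(artwork)
  ArtworkDTO.new(artwork.id, artwork.title)
end

art = InternalArtwork.new(7, "Untitled #3", "jsmith", 1200, [], "fragile")
p to_dto(art)  # consumers never see owner, cost, or notes
```

The internal attributes can then change freely; only the two published ones require the verification described below.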

Examples include Service Oriented Architectures (SOAs) with well-defined semantics and ways of modifying them over time. Changes to public information require explicit revision control and/or verification with all stakeholders that any changes will operate as expected. In general, the “wider” the interface, the more frequently verification is required, with a concomitant drag on system agility.

Type 3 Systems -- Semantic/Inference-level integration: allows inference of new data from existing/newly added information, e.g., IF A AND B THEN C can be inferred. This can cascade into IF B AND C THEN D, etc. This is a very strong ontological commitment that requires understanding the implications of the complete set of constraints and inferential mechanisms in the system. The payoff is substantial, in that it becomes possible to infer a great deal from just a few additional pieces of information. This does, however, represent a significant “widening” of the interface, with potentially severe implications for system verification and evolution.
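The cascade can be sketched with a toy forward-chainer (the facts and rules are invented; real systems use description-logic reasoners, not this loop):

```ruby
# Toy forward chaining: rules fire against known facts, and inferred facts
# can trigger further rules in turn.
def forward_chain(facts, rules)
  facts = facts.dup
  loop do
    inferred = rules
      .select { |premises, _| premises.all? { |pr| facts.include?(pr) } }
      .map { |_, conclusion| conclusion }
      .reject { |c| facts.include?(c) }
    break if inferred.empty?
    facts.concat(inferred)
  end
  facts
end

rules = [
  [[:a, :b], :c],  # IF A AND B THEN C
  [[:b, :c], :d],  # IF B AND C THEN D -- only fires once C is inferred
]
p forward_chain([:a, :b], rules)  # => [:a, :b, :c, :d]
```

Note how :d depends on the inferred :c, not on anything directly asserted -- this is exactly the “widening” that makes revision analysis hard.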

Practical Implications

Type 1 systems imply no ontological fanout from a local commitment, and so it is possible to spontaneously evolve the “definition in use” (“in use” signifying that there is no requirement for an analytic definition), since the definitions are mostly manual or derived from manual definitions.

On the other hand, for change to occur in Type 2 and Type 3 systems, the implications of the change must be understood for downstream systems that rely upon this changed information as an integral part of an automated process.

In Type 2 systems there is a restricted ontological commitment, which requires that changes be verified with the systems that couple to the system being changed. Since the change occurs through a restricted interface, the analysis is similarly constrained. All else being equal, the greater the functionality and use of the data, the greater the analysis that must be performed.

Type 3 systems, with their ability to build inferences upon new data, have the largest analysis burden of any of these systems. This implies that they will be the least amenable to revision. There is a possibility that the use of ontologies will ensure that the implications of changes are accounted for a priori; any change consistent with the ontology would then be easily integrated. The problem is that there is no historical precedent for developing stable systems of this type.

Wednesday, January 10, 2007

Aspects of a Platform Architecture: Part 2 - Evolution of a Platform

Again, the goal is to allow a small number of applications that share some core processes/entities to interact in a loose way, and to ship and evolve as independently as possible in the face of changing science, user needs, infrastructure, and developer/application allocation.

In the last section, I talked about what a platform architecture looks like as a static entity. However the reason for having a platform architecture is the evolution of the application suite over time.

With stable identifiers and the architectural components discussed previously, I have found that middleware can be used as a tool for coherent platform development. Again, I’m basing everything on stable identifiers. Without stable identifiers everything is very hard; with them some things are possible, sometimes even easy. In the absence of stable identifiers it is hard for any common substrate to get a leverage point that allows a clear value-add to all of the products under development.

My definition of middleware is a bit broader than that in Wikipedia: “Middleware is the enabling technology of enterprise application integration. It describes a piece of software that connects two or more software applications so that they can exchange data.”

For me middleware is also a place to incorporate the cross application business logic that allows users to interact with and see the data in a consistent manner. The middleware also shields custom applications from the hidden details of the database(s) or other persistent storage etc..

Some real-life examples that I’ve seen in the life sciences include one situation in which the same substance may have two different identifiers, and another in which data must be combined using a non-trivial algorithm, e.g., a geometric mean with outlier removal. In both cases the middleware served as the foundation for consistency, since it was critical that all of the applications present the same information to users and that there be only one implementation of the retrieval/calculation methods (for any non-trivial application, the results given by two different implementations can skew over time).
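A sketch of what centralizing such a calculation might look like (the outlier rule here -- drop values more than 10x away from the median -- is invented for illustration, not the rule we actually used):

```ruby
# One shared implementation in the middleware, so every screen shows the same
# number for the same data. The outlier rule is made up for this sketch.
def geometric_mean_without_outliers(values)
  median = values.sort[values.size / 2]
  kept = values.select { |v| v >= median / 10.0 && v <= median * 10.0 }
  Math.exp(kept.sum { |v| Math.log(v) } / kept.size)
end

# Every application calls this one method rather than re-implementing it:
p geometric_mean_without_outliers([2.0, 4.0, 8.0, 1000.0])  # ~4.0 (the 1000.0 is dropped)
```

Two independent re-implementations of even this small function would disagree the first time someone changed the outlier threshold in only one of them.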

Some of the considerations around method naming, signatures etc. are shared with library design and development. The best resource I know for addressing those considerations is Framework Design Guidelines: Conventions, Idioms, and Patterns for Reusable .NET Libraries by Krzysztof Cwalina and Brad Abrams.

A conventional diagram of such a system appears below

A more accurate diagram, given the goal of supporting rapid system evolution is

The red links show ad-hoc connections that exist to support rapid development. My preference is to have the middleware be the responsibility of a single person, as it is the key leverage point for the long-term evolution of the architecture.

This person is given the time not only to evaluate architectural ideas that come in from other members of the team (who may have implemented solutions that should be made available to others in a slightly generalized fashion), but also to examine what’s going on in the industry as far as standards, toolkits, etc. that will help long-term product evolution. In addition, for the ad-hoc connections and implementations to be capable of being moved into the middleware as described below, it is important for this person to have influence upon the design of these interfaces. I’ve found it uniformly tempting for application owners to embed too much information into their interfaces to simplify their short-term development, thereby hindering long-term development.

Still, what I’ve described so far sounds a bit static -- how does it play out over time?

Evolution proceeds as follows:

Even if we start with the Platonic “conventional diagram” shown above, it will quickly evolve into something along the lines of the more realistic version, which shows some ad-hoc connections that have evolved over time to give the individual applications the flexibility to meet their requirements.

The next release of the middleware (1.1) is picked up by the Ensemble Review application. In this case the release supports functionality that had required an “outside the box” access solution for Ensemble Review, so that access can now occur through the middleware. Green arrows show functionality that has moved into the middleware with the release shown.

The arrow from the Small Group Drill Down application to the Ensemble Review app is shown as now going through the middleware, since my practice was to ship the middleware as a labeled jar rather than as a web service. Although this had the downside of increasing the footprint of each application, it did allow the interface to remain very transparent.

The next release of the middleware (1.2) is picked up by the New Data Requests application. We now have three versions of the middleware in production; each application has shipped independently, but they are all moving in the same direction and there has been no forking for feature support -- forking for bug fixes is of course possible.

And of course, as shown below, an application can pick up the latest version even when it doesn’t require any of the improvements.

And then the cycle repeats. The only time a “synchronized ship” (that is, when all applications ship to production simultaneously with the same version of the middleware) is required is when there is an incompatible structural change to the shared data structures or to a core business process/algorithm. At this point everyone picks up the same version of the middleware, the shared data store is migrated, and extensive testing occurs.

The advantages of this sort of approach include:
The testing time for each application is reduced. An application need not pick up a new version of the middleware if it doesn’t provide any functionality or bug fixes that the application requires (an underlying assumption is that there is adequate regression test coverage to assure that the functionality required by the application is not broken in the new release).

When an application needs the new functionality, it upgrades to the current revision.
This also means that an application that does not depend upon new middleware functionality is not governed by the middleware’s shipping timelines.

This has the additional benefit of allowing middleware releases to be more focused on the needs of a particular product, or to engage the product architect to support early testing of a feature that will help them.

Aspects of a Platform Architecture: Part 1

How does one create a “reasonable” platform architecture? By which I mean: how does one put a system in place that allows a small number (~10) of related products to evolve and ship independently, while building upon an infrastructure substrate that allows for common policies, consistent data access, and consistent presentation?

First, to be clear on the high-level goals, what is an enterprise platform architecture?
Multiple products
  • different developers, different use cases (users may overlap), same general business area
Multi-year timelines
  • developers, resourcing, budgets, and technologies will all change; even the rate of change will change.
Enterprise level
  • Multiple applications serving a particular group of areas within a business; this is different from the issues involved in building an industry platform (Windows, Spring, Linux, etc.).
Common views upon the data
  • When the expectation around the data is the same, the result used/displayed should be the same, no matter how complex the processing is to derive it.
  • Ability to deviate when the expectation is unique or novel. Achieving the responsiveness required for iterative/agile techniques requires the ability to “special case” data access and processing methods. This allows these special cases to be well grounded before they are moved into the substrate (as appropriate).
Products should be normally able to ship independently
Products should be able to communicate with each other so that users can switch between applications (web-based in this case) when desired, with some minimal context being maintained.

Building this requires creation of a system consisting of a
Data Architecture
  • How do you identify what you’re looking for, where do you find it?
Application Roadmap
  • What application should own the functionality, and when will it be deployed?
Technology Architecture
  • What are the technologies that we’ve settled upon, how are we exploring new ones, how do we decide what to explore/who does it.
Functional Architecture
  • How are we structuring the functionality and identifying common components, how much of the implementation can we hide, what are the hiding requirements (timeliness etc.)

The first step, a Data Architecture foundation.
Select a small set of entities and assure that they have stable, anonymous identifiers. This is surprisingly difficult when building systems out of existing products in scientific domains.

I have been involved in frequent discussions with users who wish to embed domain information in the identifiers so that they can easily identify what they are holding in their hands without needing to go to a computer to find out what it is -- or, worst case, waiting for an application to be developed that will let them go to a computer to find out what it is, a.k.a. every user’s worst nightmare. The best solution I’ve come up with is to print both when necessary.

This selection is also constrained by existing/off-the-shelf systems and what they can support. Selecting internal identifiers from the core of these systems and building synonym tables is a reasonable compromise.
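A sketch of that compromise (the registry class and the sample name are hypothetical): internal identifiers stay anonymous and stable, a synonym table carries the domain-laden names users want, and labels print both.

```ruby
# Illustrative sketch only -- anonymous internal ids plus a synonym table.
require "securerandom"

class IdentifierRegistry
  def initialize
    @synonyms = Hash.new { |h, k| h[k] = [] }  # internal id => known names
  end

  # Mint a stable, anonymous internal identifier.
  def mint
    SecureRandom.uuid
  end

  def add_synonym(internal_id, name)
    @synonyms[internal_id] << name
  end

  # "Print both when necessary": the anonymous id plus the familiar name.
  def label(internal_id)
    "#{internal_id} (#{@synonyms[internal_id].join(', ')})"
  end
end

reg = IdentifierRegistry.new
id = reg.mint
reg.add_synonym(id, "CMPD-1047-B3")  # hypothetical legacy identifier
puts reg.label(id)
```

The anonymous id is what the platform couples to; the synonyms can be renamed, merged, or retired without destabilizing anything downstream.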

The issues of the remaining Data Architecture, plus the other aspects (Application Roadmap, Technology Architecture, and Functional Architecture), are important. However, given our multi-product, multi-year, frequent-revision goals, the issues are as much about building an infrastructure that supports a culture as about building an architecture. Such a culture involves setting minimal contracts and some processes around their evolution, more than instantiating any particular set. The core expectation is that the platform will outlive any of the applications, and it is at the platform level that practices around technology adoption, quality, and user interaction should be set. Note: this is not meant to imply that the practices should be uniform across the applications, but the patterns of use categories vs. development strategy should be set at the platform level.