Monday, February 5, 2007

Microformats, RDF and Life Sciences

I attended the microformats session at mashupcamp3 (the week of Jan 17th), which included an interesting sidebar about how microformats are not RDF.

For those unfamiliar with microformats, the core reference appears to be http://microformats.org/. Microformats comprise roughly ten adopted specifications plus approximately the same number of draft specifications.

The sidebar centered on the fact that a lot of real work can get done with microformats, but some people, particularly in the biology/life sciences domain, are very heavily committed to RDF. The question arose as to why. There wasn’t enough time remaining (and probably not enough interest) to explore the why.

Looking at the adopted and proposed formats, I am struck by a few things. The first is that they each capture a nice nugget of functionality: calendars, contact information, news feeds, etc. The second is that they are designed to capture the common cases while avoiding the complexity of handling the uncommon situations, which I think is a good thing.
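To make the "nugget of functionality" point concrete, here is a minimal sketch of reading an hCard-style contact (the markup and the property list are illustrative, not a full implementation of the spec), using only Python's standard-library HTML parser:

```python
from html.parser import HTMLParser

# A minimal hCard: ordinary HTML, with contact info flagged by class names.
HCARD = """
<div class="vcard">
  <span class="fn">Jane Doe</span>
  <span class="tel">+1-555-0100</span>
</div>
"""

class HCardParser(HTMLParser):
    """Collects the text of elements whose class is an hCard property."""
    PROPS = {"fn", "tel", "url", "org"}  # small subset, for illustration

    def __init__(self):
        super().__init__()
        self.current = None   # the property we are currently inside, if any
        self.card = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        hits = self.PROPS.intersection(classes)
        if hits:
            self.current = hits.pop()

    def handle_data(self, data):
        if self.current and data.strip():
            self.card[self.current] = data.strip()
            self.current = None

parser = HCardParser()
parser.feed(HCARD)
print(parser.card)   # {'fn': 'Jane Doe', 'tel': '+1-555-0100'}
```

The appeal is exactly the common-case focus described above: a few agreed-on class names embedded in ordinary HTML, and a screen's worth of code to consume them.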

The nice side effect of this aesthetic is that you can be up and running quickly, doing real work, exchanging information, etc. The bad side effect is that if you need to do something more complicated, the hooks don’t exist to allow you to describe what’s going on.

In general, the more unstructured you are willing to be, the easier it is to capture all of the information. The difficulty arises when you try to curate it or use it in another context. As an example, think of the initial attempt that often appears in a database design: a single table consisting of N text fields. It can work and can hold pretty much anything. Detecting duplicates and understanding the structure come later, if at all. However, depending upon the scale and use of the information, this may be just fine.
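A few lines make the trade-off visible (the table and data here are made up, using Python's built-in sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The classic first cut: one table, N text fields, no structure imposed.
conn.execute("CREATE TABLE stuff (f1 TEXT, f2 TEXT, f3 TEXT)")

# Anything fits -- a contact, an event -- the structure lives in our heads.
rows = [
    ("Jane Doe",      "+1-555-0100", "contact"),
    ("jane doe",      "555-0100",    "contact"),   # same person? the table can't say
    ("Staff meeting", "2007-02-05",  "event"),
]
conn.executemany("INSERT INTO stuff VALUES (?, ?, ?)", rows)

# Capture was trivial; curation is where the cost shows up.
# Naive duplicate detection misses the variant spellings above:
dupes = conn.execute(
    "SELECT f1, COUNT(*) FROM stuff GROUP BY f1 HAVING COUNT(*) > 1"
).fetchall()
print(dupes)   # [] -- 'Jane Doe' and 'jane doe' never match exactly
```

Everything went in without complaint; getting reliable answers back out is the part that requires the structure we skipped.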

In my opinion, at the enterprise level we sometimes overemphasize the scale and structural-integrity issues. Scale can be a big deal if you’re trying to achieve perfect reconciliation of information. If “good enough” is OK, large-scale integration can be achieved in practice with very unstructured data. A good example is Hype Machine (http://hype.non-standard.net/, which was demo’d at mashupcamp3), which mines music blogs effectively -- something I would have thought impossible, given all the issues around spelling, new band names, etc. It works partly because the problem is to find something rather than to find everything with complete accuracy.

At a deeper level, what is more characteristic of the areas addressed by microformats is that we can develop a good understanding of what’s going on from our intuitions about how the world works, and these intuitions cover a good number of the situations we will actually encounter.

In life sciences this is simply not the case. Our intuitions are often wrong, and in the clinical area the number of potential confounding factors is immense (e.g. http://gforge.nci.nih.gov/docman/view.php/53/2278/MedDRA_Source_Information.html lists the MedDRA term count as 65,872), so there is an understandable push to design formats that are extensible and can capture information in well-defined and reusable ways.
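The extensibility that RDF buys can be caricatured in a few lines. Here plain Python tuples stand in for real RDF triples, and all the identifiers and predicates are invented for illustration:

```python
# RDF reduces everything to (subject, predicate, object) triples, so a
# newly discovered factor is just a new predicate -- no schema change.
triples = set()

def add(s, p, o):
    triples.add((s, p, o))

add("patient:42", "takes",         "drug:aspirin")
add("patient:42", "hasCondition",  "meddra:0000000")   # hypothetical term id
# Later, someone decides smoking status is a confounding factor:
add("patient:42", "smokingStatus", "former")

def query(s=None, p=None, o=None):
    """Pattern match over the triples; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in sorted(triples)
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(query(s="patient:42", p="smokingStatus"))
# [('patient:42', 'smokingStatus', 'former')]
```

The fixed-schema version of this would need a new column (and a migration) for every such factor; the triple version just needs agreement on what the new predicate means, which is where the RDF community spends its effort.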

However, microformats and mashups in general do raise the question: is this a case of the good driving out the best? Despite my bias towards ‘scalability’ and ‘enterprise solutions’, it is hard to argue with standing up an application in a few hours that provides some utility to end users (and some real data on usage, etc.), even if it requires some real work to migrate to a more scalable application when it comes online.