Wednesday, March 26, 2008

Taxonomies, Ontologies and the Semantic Web

A couple of weeks ago I attended C-SHALS 2008 (Conference on Semantics in Healthcare and Life Sciences). One aspect of it that I found striking was the number of people who conflated taxonomies with ontologies -- my initial reaction was to post a remark about the confusion and highlight the distinctions (see this for a short set of descriptions of these and related terms).

I’ve instead come to view this conflation as reflecting the pragmatic bias of these systems: if the difference between taxonomies and ontologies isn’t apparent to you, the difference doesn’t matter for what you are trying to do (modulo the assumption that the speakers were competent, which did appear to be the case). The implication is that such systems require no significant machine-based inference across organizations. Significant inference, in this context, would involve something beyond the use of term matching to gather locally related terms/individuals (local vis-à-vis the terms being matched). Note: although I categorize this as ‘non-significant’, that’s only from the standpoint of inference -- these systems do cover most of the Business Intelligence/analysis use cases being implemented today.
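To make the distinction concrete, here is a minimal sketch of the kind of ‘non-significant’, term-matching-style gathering described above: walking a simple is-a hierarchy to collect related terms, with no logical inference involved. The terms and hierarchy are invented for illustration, not drawn from any real taxonomy.

```python
# Toy is-a taxonomy, child -> parent. Terms are invented for illustration.
IS_A = {
    "myocardial infarction": "heart disease",
    "heart disease": "cardiovascular disease",
    "stroke": "cardiovascular disease",
    "cardiovascular disease": "disease",
}

def ancestors(term):
    """Walk parent links upward -- pure term matching, no inference."""
    out = []
    while term in IS_A:
        term = IS_A[term]
        out.append(term)
    return out

def descendants(term):
    """All terms that transitively roll up to `term`."""
    return [t for t in IS_A if term in ancestors(t)]
```

For example, `ancestors("myocardial infarction")` yields `["heart disease", "cardiovascular disease", "disease"]` -- useful for grouping and roll-up reports, but nothing a description-logic reasoner would call inference.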

As you might expect, given this characterization, these presentations involved the aggregation of data from multiple sites, using RDF or taxonomies such as SNOMED to link data between sites. This is a good thing -- as I’ve mentioned a number of times, having stable identifiers across systems is the key to integration. The systems presented demonstrated that useful integration is possible even when the same term -- e.g., the same SNOMED code -- has slightly different meanings in the different organizations. (See How Doctors Think for an anecdotal study of physicians classifying patients.)
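A minimal sketch of this identifier-based integration, assuming the simplest possible setup: two sites report counts tagged with a shared SNOMED-style concept code, the local labels differ, and the merge trusts only the code. The records and codes below are invented, and real systems would use RDF stores rather than dicts, but the principle is the same.

```python
from collections import defaultdict

# Invented observation counts from two sites. The local labels differ
# slightly, but the concept code is the stable cross-site identifier.
site_a = [
    {"code": "22298006", "label": "MI", "count": 12},
    {"code": "38341003", "label": "HTN", "count": 40},
]
site_b = [
    {"code": "22298006", "label": "myocardial infarction", "count": 7},
    {"code": "73211009", "label": "diabetes mellitus", "count": 15},
]

def merge_by_code(*sites):
    """Aggregate counts on the shared code -- no attempt to reconcile
    local labels or meanings; only the identifier is trusted."""
    totals = defaultdict(int)
    for site in sites:
        for rec in site:
            totals[rec["code"]] += rec["count"]
    return dict(totals)

merged = merge_by_code(site_a, site_b)
# → {'22298006': 19, '38341003': 40, '73211009': 15}
```

The merge is ‘rough and quick’ in exactly the sense discussed: if the two sites apply code 22298006 slightly differently, that noise flows straight into the aggregate -- and the presentations suggested the extra data is worth it anyway.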

This is an interesting result: although fully vetted, 100% one-to-one mappings would obviously be preferable, in these systems the value of more data outweighs the penalty imposed by the increased noise. Rough, quick integration is proving more valuable than detailed integration requiring a thorough analysis of all the systems involved -- probably because the difference between ‘rough, quick’ and ‘thorough, slow’ is measured in months, if not years.

This is related to a discussion at the conference on the contrast between developing ‘problem specific’ ontologies vs. ‘general use’ ontologies. That is: does taking the time to ‘get it right’ add any value? This is roughly equivalent to the old AI scruffy vs. neat distinction.

Although I wouldn’t go so far as to claim that a general-purpose ontology is impossible (at least in some limited domain), I am skeptical that it can be achieved. My concern centers on the fact that when you are constructing a general-use ontology it is hard to know where to stop: e.g., given a small-molecule bioactive compound you should represent the formula and chirality, but what about the (possibly fractional) salt form? Or the formulation? What about radioisotopes and their decay rates? Subnuclear particles, etc.? I understand pragmatic stopping points for modeling these issues, but I don’t know how to determine principled ones.
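The ‘where to stop’ problem can be seen in even a toy schema. The sketch below is purely illustrative -- the field names are invented, not a real chemistry model -- but notice how each level of detail invites the next, with no principled place to draw the line:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Compound:
    """Illustrative only: each field is a plausible level of detail for a
    small-molecule compound, and every one invites another."""
    formula: str                        # clearly needed
    chirality: Optional[str] = None     # probably needed
    salt_form: Optional[str] = None     # sometimes needed (may be fractional)
    formulation: Optional[str] = None   # needed for some use cases
    isotopes: dict = field(default_factory=dict)  # rarely needed -- stop here?
```

A problem-specific ontology stops wherever its use cases stop; a general-use one has to justify every omission, which is exactly the difficulty described above.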

It’s reassuring to see a number of researchers finding pragmatically useful parts of the semantic web without the need for perfect definitions/ontologies. This, to me, is the take-home message: there are a number of useful tools and techniques in the semantic web space; don’t be put off by the thought of merging ontologies and developing a grand unified theory of everything.
