Monday, January 15, 2007

Data Integration and Ontologies

Ontologies

It is useful to think about three types of data integration:

  • Type 1. Document level -- the user can determine what documents might have information of interest
  • Type 2. Term level -- the user can build reports using items from multiple documents/systems e.g., each cell in a spreadsheet can come from different systems.
  • Type 3. Inference level -- terms from one or more documents/systems can be combined to derive information (new terms) in the system being examined.

Both functionality and ontological commitment increase from Type 1 to Type 3 systems.

The increasing level of ontological commitment, as perceived by a user of the system, appears as follows:

  • Type 1: There is something here which may be meaningful.
  • Type 2: If something does exist, it is meaningful.
  • Type 3: The implications of a thing’s existence are meaningful.

Rather than attempting to determine the costs of these systems a priori, let’s look at some examples.

The simplest Type 1 system involves simply placing documents in a file system; with some attention to naming and structure, this system allows the contents to be easily retrieved and accurately assessed. However, anecdotal evidence and personal experience show that the retrievability of the information degrades over time and that the approach does not scale beyond small collections of items. This degradation stems in part from the fact that these systems allow only a single axis of retrieval, based upon the heuristics embedded in the (path)name of the files.

The next level up in complexity for Type 1 systems is the web and file/URL tagging systems. Such systems continue to make few a priori claims for the utility of the retrieved items, but search engines and URL tagging allow queries along multiple axes, based upon either the algorithms embedded in the search engines or the tags and the sources of those tags. Local file systems that support tags allow users to (eventually) recover their tag definitions, either via introspection or by examining other documents that carry the suspected tags.

Some of the terminology limitations of free text tags are alleviated by the fact that the items being tagged are URLs/files and are therefore unique and retrievable. Retrieving and examining the tagged information allows one to assess the information content (retrieval power) of each tag in the context of the current search. Tagging has been getting a lot of traction on the web, with sites such as del.icio.us http://del.icio.us/ , shadows http://shadows.com/ and flickr http://www.connotea.org/ appearing as popular web tools for gathering and sharing tags (social bookmarking). Similarly, modern file systems allow tagging of files and directories, and applications for images and other media content allow sorting and management of media files via tags.
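The difference between a single axis of retrieval (the path name) and multiple axes (tags) can be sketched in a few lines. This is only an illustration: the TagIndex class, its method names, and the example URLs and tags are all made up for this post and do not describe how any particular tagging site works.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical index mapping free-text tags to the URLs/files they label."""

    def __init__(self):
        self._items_by_tag = defaultdict(set)

    def tag(self, item, *tags):
        # An item may carry any number of tags; each tag is one axis of retrieval.
        for t in tags:
            self._items_by_tag[t.lower()].add(item)

    def find(self, *tags):
        # Return the items that carry every requested tag.
        sets = [self._items_by_tag.get(t.lower(), set()) for t in tags]
        return set.intersection(*sets) if sets else set()


index = TagIndex()
index.tag("http://example.org/a", "ontology", "integration")
index.tag("http://example.org/b", "ontology", "photos")
index.tag("/home/me/notes.txt", "integration")

print(index.find("ontology"))
print(index.find("ontology", "integration"))
```

Note that nothing in the index says what "ontology" means. The only way to find out is to retrieve the tagged items and look, which is exactly the weak commitment that characterizes a Type 1 system.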


Type 2 Systems: Term level integration is the province of what is commonly called enterprise integration, which allows reporting and integration of applications within the enterprise (Enterprise Application Integration -- EAI). In practice, achieving integration requires stability of the term referents and of their use in communication between systems. This stability is what might be termed “stability of use.” Commonly, the more general the use, the more restricted the interface. This involves a conscious decision to “narrow” the functionality when moving from the internal data model to the published interface, often restricting the interface to data transfer objects with a limited number of attributes.
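A minimal sketch of this “narrowing”, assuming a made-up customer model: CustomerRecord, CustomerDTO, to_dto and all of their fields are hypothetical names chosen for illustration, not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class CustomerRecord:
    """Hypothetical internal model: free to change at any time."""
    customer_id: int
    name: str
    credit_score: int
    internal_notes: str


@dataclass(frozen=True)
class CustomerDTO:
    """Hypothetical published interface: few attributes, stable meaning."""
    customer_id: int
    name: str


def to_dto(record: CustomerRecord) -> CustomerDTO:
    # Only the terms whose referents are kept stable cross the boundary.
    return CustomerDTO(customer_id=record.customer_id, name=record.name)


record = CustomerRecord(42, "Ada", 710, "prefers email")
dto = to_dto(record)
print(dto)
```

The internal record can gain, lose or redefine fields without any downstream verification, as long as the two published attributes keep their meaning.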

Examples include Service Oriented Architectures (SOAs) with well defined semantics and ways of modifying them over time. Changes to public information require explicit revision control, verification with all stakeholders that the changes will operate as expected, or both. In general, the “wider” the interface, the more frequently verification is required, with the concomitant drag on system agility.

Type 3 Systems: Semantic/Inference level integration allows inference of new data from existing or newly added information, e.g., given the rule IF A AND B THEN C, C can be inferred once A and B are known. This can cascade: with IF B AND C THEN D, D follows as well, and so on. This is a very strong ontological commitment that requires understanding the implications of the complete set of constraints and inferential mechanisms in the system. The payoff is substantial in that it becomes possible to infer a great deal from just a few additional pieces of information. This does, however, represent a significant “widening” of the interface, with potentially severe implications for system verification and evolution.
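The cascade can be made concrete with a naive forward-chaining loop. This is a toy sketch, not a description of any real inference engine: forward_chain and the (premises, conclusion) rule format are assumptions made for the example, and the rules are the two from the paragraph above.

```python
def forward_chain(facts, rules):
    """Apply (premises, conclusion) rules until nothing new can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known.
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


rules = [
    (("A", "B"), "C"),  # IF A AND B THEN C
    (("B", "C"), "D"),  # IF B AND C THEN D
]

derived = forward_chain({"A", "B"}, rules)
print(sorted(derived))
```

Adding a single fact or rule can change what is derivable anywhere in the system, which is why the verification burden grows with the rule set rather than with the size of one interface.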

Practical Implications

Type 1 systems imply no ontological fanout from a local commitment, so it is possible to spontaneously evolve the “definition in use” (“in use” signifies that there is no requirement for an analytic definition), since the definitions are mostly manual or derived from manual definitions.

On the other hand, for change to occur in Type 2 and Type 3 systems, the implications of the change must be understood for downstream systems that rely upon this changed information as an integral part of an automated process.
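One way to picture the analysis burden is as a walk over the graph of consumers. The sketch below assumes a hypothetical map from each system to the systems that read its data; downstream_of and the system names are invented for illustration.

```python
from collections import deque


def downstream_of(changed, consumers):
    """Return every system that directly or transitively consumes `changed`."""
    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, ()):
            if consumer not in affected and consumer != changed:
                affected.add(consumer)
                queue.append(consumer)
    return affected


# Hypothetical coupling: key = producing system, value = systems consuming it.
consumers = {
    "orders": ["billing", "reporting"],
    "billing": ["ledger"],
    "reporting": [],
}

print(sorted(downstream_of("orders", consumers)))
```

Every system in the returned set is one whose behaviour has to be checked before the change can ship.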

In Type 2 systems there is a restricted ontological commitment, which requires that changes be verified with the systems that couple with the system being changed. Because the change occurs through a restricted interface, the analysis is similarly constrained. All else being equal, the greater the functionality and use of the data, the greater the analysis that must be performed.

Type 3 systems, with their ability to build inferences upon new data, have the largest analysis burden of any of these systems. This implies that they will be the least amenable to revision. There is a possibility that the use of ontologies will ensure that the implications of changes are accounted for a priori, so that any change consistent with the ontology can be integrated easily. The problem is that there is no historical precedent for developing stable systems of this type.

1 comment:

David said...

Interesting post. My firm is just starting to get into RDF (via RDFa in XHTML). We are also launching a new business document standards body here in Canada (casrai.org) for the research administration domain. I'm going to follow your blog.