Monday, April 20, 2009

Java Concurrency

A predictable side effect of having (way too) many years of experience in Java is that certain "new" features escape your notice. This is particularly true if the IDEs don't pressure you into changing your previously successful and still functional patterns (the way they do with generics).

I realized this when reading Java Concurrency in Practice. It's a very good book -- I can't say it really opened my eyes on concurrency, since I had done some work on multi-master VME-based real-time systems years ago, but it is spot on, well written, and a nice refresher. In addition, it made me aware of the thread/concurrency capabilities available in newer versions of Java, such as ThreadPoolExecutor.

I recently built a file crawler/hash-calculator/storage system as part of my namedData work using an ArrayBlockingQueue and explicitly created threads. ThreadPoolExecutor appeared to allow an easier approach with cleaner shutdown/interrupt semantics.

Java tips has a clear example -- the primary change that I would make to this example is to size the thread pool based upon the number of processors available (on my laptop this returns the number of cores).
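A minimal sketch of that sizing change, written in more recent Java syntax than was available when this post was written. It is not the Java tips example itself: the class name PoolSizingSketch and the process and processAll methods are hypothetical stand-ins, with process substituting for the real per-file hashing work. Executors.newFixedThreadPool returns a ThreadPoolExecutor under the covers, and Runtime.getRuntime().availableProcessors() supplies the processor count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class PoolSizingSketch {

    // Hypothetical stand-in for the real per-item work (e.g. hashing a file).
    static int process(String item) {
        return item.hashCode();
    }

    // Runs process() on every item using a pool sized to the processor count.
    // Results come back in the same order as the input list.
    static List<Integer> processAll(List<String> items)
            throws InterruptedException, ExecutionException {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String item : items) {
                Callable<Integer> task = () -> process(item);
                pending.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pending) {
                results.add(future.get()); // blocks until that task finishes
            }
            return results;
        } finally {
            // Stop accepting new work, then give running tasks time to finish.
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow(); // interrupt anything still running
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processAll(Arrays.asList("a.txt", "b.txt", "c.txt")));
    }
}
```

One pool thread per processor is a reasonable starting point for CPU-bound work such as hashing; a crawler that spends most of its time waiting on disk or network may do better with a larger pool.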

It took me less than an hour to make this change, test the code, etc. The final product is a lot cleaner, has better shutdown behavior, and even feels like it runs faster. Definitely the right way to go.

Monday, April 6, 2009

owl:sameAs is a very strong assertion

There's been an interesting discussion on the public-semweb-lifesci mailing list with the subject "blog: semantic dissonance in uniprot" which, appropriately enough, was spurred by a blog post entitled semantic dissonance in uniprot. The post describes a UniProt entry that listed a Drosophila (fruit fly) protein sequence as having been isolated from "a young sporophyte contained within a seed."

The point is that although one doesn't find fruit fly genes in plants, following the owl:sameAs link leads directly to that conclusion. This generated a very long, fairly thoughtful, and minimally flame-based conversation on owl:sameAs and identity in general.

As the discussion progressed, the problem with associating identity across graphs (ontologies/systems of data developed by different organizations) was noted, e.g., (in pseudo annotation) mySystem:itemA owl:sameAs yourSystem:itemX. The issue is that the use of the terms is usually subtly (and often not so subtly) different between the two systems. This problem is especially apparent when making assertions about real objects which exist independently out in the world. For example: "gold" may have a property, but does the property belong to a single atom or to a group of gold atoms, and if a group, what characterizes a group of the appropriate size? Given, for instance:
  • A nanotechnology view of gold (still under development)
  • A semiconductor view of gold (probably reasonably well characterized)
  • A jewelry view of gold

what are the precise boundaries of their applicability? The issue doesn't arise in a system developed solely for nanotechnology, semiconductors, or jewelry. The problems surface only when these systems are linked together.

My thought is that the difficulty centers on the extreme power of owl:sameAs, which asserts that things are identical in all contexts. In the physical world, however, not only is context everything, but context is also inherently incompletely specified.

In practice, many of us heuristically treat identity in the physical world as meaning "indistinguishable in this context," with the context implicitly dependent upon the issue being considered. I would claim that this is the only reasonable way to proceed when reasoning in a practical manner about what is true of particular objects in the world (abstractions can obviously satisfy stronger conditions since they are abstractions -- with the context factored out to any level desired).

In the physical world, we cannot assure that even the ability to track a particular item with unlimited precision would allow us to make statements about that item which would hold through time. For example, although we might make assertions about a particular atom (#0x177FFEAA) of gold and its behavior, some if not all of the assertions may fail under unexpected conditions, e.g., after an event that alters the structure of the nucleus (nuclear collisions, extremely high temperatures etc.). Exhaustively specifying all of these conditions is impractical at best -- which is one of the reasons the phrase ceteris paribus has remained with us for so long.

In my own work, since I never worry about tracking individual atoms, I gravitate toward weak rather than strong assertions of identity, trying to be very attentive to context. This is very much in the spirit of the middle distance as developed in Brian Cantwell Smith's On The Origin of Objects. Smith's point is that our intuitions are well tuned to objects about our size that we interact with frequently. In data integration and architecture work (I had to get there eventually), this implies that integrating across fields that interact to some degree in the "world" is going to be more feasible than integrating across those that don't interact. The give and take of practical interaction has allowed us to identify the particular features of each item that are important in context.