Tuesday, June 30, 2009

iPhone: changing the way we think

I'm struck by how the iPhone has changed the way we think about what can be done with software-based devices assisting us as beings-in-the-world. I'm making a Heidegger reference here because the iPhone is more than just ubiquitous computing: a device always at my side that could answer those important questions like:
  • Is there good coffee close by?

  • What's the weather going to be like later?

  • How old was Kennedy when he was elected?


Although it certainly is that, it has become a lot more, changing both the economics of software delivery and what it means for software to be delivered.

It's not just that there are a billion apps (or so it seems) in the App Store, but the economics of iPhone software are such that a small gaming company can build a novel game (e.g., tying rope around wooden blocks), get traction with it, and make money. That didn't sound that amazing until I read an interview with the developers in Gamasutra that reminded me how hard it was to make money in computer games pre-iPhone. Not that it is easy now, but compared to the stories I heard when I attended a few Game Developer conferences earlier in the decade, it is trivial. Let's just say that the economics of doing a platform (Xbox, PlayStation, Wii) or PC-based game were daunting, to say the least, and the likelihood of getting paid for your game was minimal, even if the game was successful.

The core of the iPhone's difference is as a platform that is easy to use, location aware and ready-to-hand -- more like a hammer than a computer.

As a platform it is sufficiently distinct that it is also affecting the way we think about delivering healthcare. Looking at this list highlights core features that are "new," not "new" in the sense of being completely unheard of, but new in the sense of being practically available for use by the overwhelming bulk of the user community -- sort of like the difference between having a generator kit/knowing about electricity and having an electric grid that you can plug your device into.

As a user-assistant, the iPhone allows me to fully exploit the affordances of my current location. I can see where I am on a map, look at overhead imagery of my current neighborhood to see if there is something that I want to photograph, and, if there is, use a small application to grab the geo coordinates so I can later tag the photos I took with my (non-GPS enabled) camera.

The end result is something that is always with you, knows who you are, knows where you are, has connectivity both up (3G/internet) and down (bluetooth to local devices) while providing a simple effective mechanism to easily add functionality in small increments.

I think this makes it the biggest game changer since the rollout of the internet to the general public. However, I also realize that this means that it is time to code up a small test application for the iPhone.

PS: I don't have any experience with the Google Android platform or the Palm Pre; these observations may apply equally well to them.


Monday, June 1, 2009

Linked Data

Finally, thanks to a discussion with Eric Neumann a few weeks ago, I'm beginning to understand what Linked Data is all about. First a caveat -- although I credit Eric for helping me see how linked data fits into what I'm doing, the following interpretation is strictly my own as are errors of omission, commission or orthogonality, although I think my view is supported by the Design Issues document.

The short story is that linked data provides stable identifiers for stuff (a more abstract form of things). These stable identifiers then allow you to say things about this (particular) stuff without necessarily making a strong ontological commitment.

I like this. It provides for interoperability and integration. It does not provide any inference guarantees, which is fine by me, and something that I have been advocating for a while. The Linked Data site also has links to a number of datasets which publish stable identifiers for useful stuff. The site also gives examples of how to publish your own data.
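To make the integration point concrete, here is a minimal sketch of my own (not something taken from the Linked Data site). It assumes two independent sources that happen to use the same stable identifier. The example.org identifier, the ex: property names, and the LinkedDataSketch class are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LinkedDataSketch {

    // Groups (subject, property, value) statements by subject identifier.
    // No schema or ontology is consulted; statements are simply collected.
    // If two statements share a subject and property, the later one wins.
    static Map<String, Map<String, String>> merge(List<List<String[]>> sources) {
        Map<String, Map<String, String>> bySubject = new TreeMap<>();
        for (List<String[]> source : sources) {
            for (String[] statement : source) {
                bySubject
                    .computeIfAbsent(statement[0], k -> new TreeMap<>())
                    .put(statement[1], statement[2]);
            }
        }
        return bySubject;
    }

    // Small helper so each statement reads as subject, property, value.
    static String[] statement(String subject, String property, String value) {
        return new String[] {subject, property, value};
    }

    public static void main(String[] args) {
        // Hypothetical identifier and property names, for illustration only.
        String shop = "http://example.org/id/coffee-shop-1";

        List<String[]> reviewSite = Arrays.asList(
            statement(shop, "ex:rating", "4.5"),
            statement(shop, "ex:reviewCount", "120"));
        List<String[]> placesSite = Arrays.asList(
            statement(shop, "ex:city", "Boston"),
            statement(shop, "ex:openSince", "2004"));

        // Both sources talk about the same identifier, so their statements
        // line up without either side agreeing on a shared model first.
        System.out.println(merge(Arrays.asList(reviewSite, placesSite)).get(shop));
    }
}
```

The only thing the two sources have to agree on is the identifier itself, which is the weak commitment I find attractive.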

Hopefully data.gov will provide its data in this form in the near future.



Sunday, May 17, 2009

Wolfram Alpha

Wolfram Alpha is supposed to be launching in the next few days and has been getting a lot of publicity. For background, there is a short YouTube demo of Wolfram Alpha and a NY Times article, and Doug Lenat has a nice post on his impressions.

From what I can see (and I don't have access) even though it doesn't live up to some of the early hype, it achieves a very interesting result: it allows retrieval of general computable information using a simple natural language processing (NLP) interface.

This allows for analysis similar to that permitted by a data warehouse, but within a different design space. The design goals of WolframAlpha, unlike those of a data warehouse, preclude prestructuring the data in marts to allow rapid querying of the data in relatively well defined ways. However, as in the mart/warehouse situation, you must still provide a speedy response to the quantitative queries to prevent users from drifting away while waiting for an answer.

The question is: how is this done? Rumors on the net indicate that the underlying data is an RDF triple store, which makes a lot of sense since RDF triples constitute a vertical, model-free storage approach. In operation, I imagine that the queries provide nice entry points for initiating a spreading-activation fan-out process on the graph. When the activations intersect you can proceed to roll back up to the initiation points suggested by the query, clustering in a bottom-up data-driven fashion along the way. The clustering also affords a natural way to structure the data for presentation to the user.

Although I'll admit that this is just an educated guess as to the mechanism, it does suggest an interesting set of technologies involving fast linking and roll up of data for ad-hoc queries without requiring a lot of effort to tune the data to a specific query.
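As a toy illustration of the guess above (and only that; it says nothing about what Wolfram Alpha actually does), here is a minimal sketch of a spreading-activation fan-out over triples. The triples are held in memory, each query term activates its neighborhood hop by hop, and the nodes where two activations meet are the candidates to roll up from. The ex: names, the SpreadingActivationSketch class, and the fixed hop limit are all assumptions of mine.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SpreadingActivationSketch {

    // Undirected adjacency built from (subject, predicate, object) triples.
    // The predicate is ignored during traversal to keep the sketch small.
    private final Map<String, Set<String>> neighbors = new HashMap<>();

    void addTriple(String subject, String predicate, String object) {
        neighbors.computeIfAbsent(subject, k -> new TreeSet<>()).add(object);
        neighbors.computeIfAbsent(object, k -> new TreeSet<>()).add(subject);
    }

    // Returns every node reachable from the start node within maxHops.
    Set<String> activate(String start, int maxHops) {
        Set<String> seen = new TreeSet<>();
        seen.add(start);
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        for (int hop = 0; hop < maxHops; hop++) {
            Deque<String> next = new ArrayDeque<>();
            for (String node : frontier) {
                for (String n : neighbors.getOrDefault(node, Collections.emptySet())) {
                    if (seen.add(n)) {
                        next.add(n);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    // Nodes where the activations from two entry points intersect.
    Set<String> intersect(String a, String b, int maxHops) {
        Set<String> result = new TreeSet<>(activate(a, maxHops));
        result.retainAll(activate(b, maxHops));
        return result;
    }

    public static void main(String[] args) {
        SpreadingActivationSketch graph = new SpreadingActivationSketch();
        // Hypothetical triples, for illustration only.
        graph.addTriple("ex:England", "ex:hasStatistic", "ex:AgeDistEngland");
        graph.addTriple("ex:AgeDistEngland", "ex:measures", "ex:AgeDistribution");
        graph.addTriple("ex:UK", "ex:hasStatistic", "ex:AgeDistUK");
        graph.addTriple("ex:AgeDistUK", "ex:measures", "ex:AgeDistribution");
        graph.addTriple("ex:England", "ex:partOf", "ex:UK");

        // Entry points suggested by a query like "age distribution of England".
        System.out.println(graph.intersect("ex:England", "ex:AgeDistribution", 1));
        // prints [ex:AgeDistEngland]
    }
}
```

A real system would presumably weight the activations and use the predicates, but even this version shows why no query-specific tuning of the data is needed.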

Generating a set of vetted and annotated data is a different problem, but hopefully would not require a significantly greater level of effort than the ETL portion of current warehousing efforts.

Wolfram Alpha therefore constitutes another factor leading me to be more vertical in my storage designs. In the coming months, I'm hoping to run some benchmarks on production hardware/datasets so as to ground the practicality of this approach and then get permission to publish the results.


Update 18 May 2009: I did try Wolfram Alpha today and it failed on my first try "age distribution of England vs UK," not so much from any idiosyncrasies in parsing my query, but because it appears to be encoded with the identity "England == UK." This just goes to show how important it is to be spot-on with your identity information aka "synonym tables are easy, antonym tables on the other hand......aren't."

Wednesday, May 6, 2009

launchd

I just upgraded to a new laptop (driven mostly by the need for more RAM -- hopefully 6G will be adequate for a couple of years). It got me thinking: even though it's great that the Mac will copy all of your old apps over effortlessly to your new machine, it also happily copies all your old unused cruft over to your new machine, and that's not so great.

So, in the spirit of good hygiene (and H1N1 preparedness), I decided to open up the console and look to see what I might find. I discovered that I had a couple of launchd jobs that referenced executables which didn't exist on my system anymore, e.g., Carbon Copy Cloner.

I have been able to rid myself of all the launchd issues by cleaning up the LaunchDaemons/LaunchAgents folders under the Library folder, but I still haven't been able to rid myself of all of these:
/Applications/Safari.app/Contents/MacOS/Safari[54428]: Warning: accessing obsolete X509Anchors.


This is even after searching the web a couple of times. I think the problem starts up after I open an article from NewsFire, but I'm not completely sure. This is definitely a space in which I believe correlation is not causality.

If anyone has any ideas on how to fix this, I'd appreciate it.

BTW it is really nice to have a built-in tool like Console: it is simple and effective with just that little bit of extra functionality (string filtering) that makes all the difference in usability.

Monday, April 20, 2009

Java Concurrency

A predictable side effect of having (way too) many years of experience in Java is that certain "new" features escape your notice. This is particularly true if the IDEs don't pressure you into changing your previously successful, and still functional, patterns (the way they do with generics).

I realized this when reading Java Concurrency in Practice. It's a very good book -- I can't say it really opened my eyes on concurrency since I had done some work on multi-master VME-based real-time systems years ago, but it is spot on, well written, and a nice refresh. In addition, it made me aware of the thread/concurrency capabilities available in newer versions of Java, such as ThreadPoolExecutor.

I recently built a file crawler/hash-calculator/storage system as part of my namedData work using an ArrayBlockingQueue and explicitly created threads. ThreadPoolExecutor appeared to allow an easier approach with cleaner shutdown/interrupt semantics.

Java tips has a clear example -- the primary change that I would make to this example is to size the thread pool based upon the number of processors available (on my laptop this returns the number of cores).

It took me less than an hour to make this change, test the code, etc. The final product is a lot cleaner, has better shutdown behavior, and even feels like it runs faster. Definitely the right way to go.
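For reference, here is a minimal sketch of the pattern (not my actual crawler code): a fixed pool sized to the available processors, with an orderly shutdown in a finally block. The class and method names are made up for illustration, and the syntax assumes Java 8 or later. Executors.newFixedThreadPool hands back a ThreadPoolExecutor underneath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class PoolSizingSketch {

    // Runs each task on a pool sized to the available processors and
    // returns the results in submission order.
    static <T> List<T> runAll(List<Callable<T>> tasks) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // invokeAll blocks until every task has completed.
            List<Future<T>> futures = pool.invokeAll(tasks);
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            // Stop accepting work, wait briefly, then interrupt stragglers.
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in tasks; in the crawler each task would hash one file.
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 1; i <= 4; i++) {
            final int n = i;
            tasks.add(() -> n * n);
        }
        System.out.println(runAll(tasks)); // prints [1, 4, 9, 16]
    }
}
```

The finally block is where the cleaner shutdown behavior comes from: there is no hand-rolled poison pill or interrupt bookkeeping, as there was with the explicit threads and queue.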

Monday, April 6, 2009

owl:sameAs is a very strong assertion

There's been an interesting discussion on the public-semweb-lifesci mailing list with the subject "blog: semantic dissonance in uniprot" which, appropriately enough, was spurred by a blog post entitled semantic dissonance in uniprot. This post talks about a UniProt entry which listed a Drosophila (fruit fly) protein sequence as having been isolated from "a young sporophyte contained within a seed."

The point being that although one doesn't find fruit fly genes in plants, following the owl:sameAs link leads directly to that conclusion. This generated a very long, fairly thoughtful and minimally flame-based conversation on owl:sameAs and identity in general.

As the discussion progressed, the problem with associating identity across graphs (ontologies/systems of data developed by different organizations) was noted, e.g., (in pseudo annotation) mySystem:itemA owl:sameAs yourSystem:itemX, the issue being that the use of the terms is usually subtly (and often not so subtly) different between the two systems. This problem is especially apparent when making assertions about real objects which exist independently out in the world. For example: "gold" may have a property, but does the property adhere to a single atom, or to a group of gold atoms, and if so, what characterizes a group of the appropriate size? Given:
  • A nanotechnology view of gold (still under development)

  • A semiconductor view of gold (probably reasonably well characterized)
  • A jewelry view of gold

what are the precise boundaries of their applicability? The issue doesn't arise in a system developed for nanotechnology, semiconductors, or jewelry. The problems surface only when these systems are linked together.

My thought is that the difficulty centers around the extreme power of owl:sameAs, which indicates that things are identical in all contexts. However, in the physical world not only is context everything, but context is also inherently incompletely specified.

In practice many of us heuristically treat identity in the physical world as operating as if identity means indistinguishable in this context, with the context being implicitly dependent upon the issue being considered. I would claim that this is the only reasonable way to proceed when reasoning in a practical manner about what is true about particular objects in the world (abstractions can obviously satisfy stronger conditions since they are abstractions -- with the context factored out to any level desired).
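One way to picture "indistinguishable in this context" is the small sketch below. It is my own illustration, not anything defined by OWL: a context is reduced to the set of properties it cares about, and two descriptions count as the same when they agree on exactly those properties. The property names and values are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class ContextualIdentitySketch {

    // Two descriptions are "the same" in a context when they agree on every
    // property the context considers relevant. A property missing from both
    // descriptions counts as agreement.
    static boolean sameInContext(Map<String, String> a,
                                 Map<String, String> b,
                                 Set<String> relevantProperties) {
        for (String property : relevantProperties) {
            if (!Objects.equals(a.get(property), b.get(property))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical descriptions from two different systems.
        Map<String, String> jewelryGold = new HashMap<>();
        jewelryGold.put("element", "Au");
        jewelryGold.put("purity", "18 karat");

        Map<String, String> semiconductorGold = new HashMap<>();
        semiconductorGold.put("element", "Au");
        semiconductorGold.put("purity", "99.999%");

        Set<String> chemistry = new HashSet<>(Arrays.asList("element"));
        Set<String> fabrication = new HashSet<>(Arrays.asList("element", "purity"));

        System.out.println(sameInContext(jewelryGold, semiconductorGold, chemistry));   // prints true
        System.out.println(sameInContext(jewelryGold, semiconductorGold, fabrication)); // prints false
    }
}
```

An owl:sameAs assertion corresponds to claiming agreement for every possible set of relevant properties, which is exactly the part that cannot be checked for physical things.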

In the physical world, we cannot assure that even the ability to track a particular item with unlimited precision would allow us to make statements about that item which would hold through time. For example, although we might make assertions about a particular atom (#0x177FFEAA) of gold and its behavior, some if not all of the assertions may fail under unexpected conditions, e.g., after an event that alters the structure of the nucleus (nuclear collisions, extremely high temperatures etc.). Exhaustively specifying all of these conditions is impractical at best -- which is one of the reasons the phrase ceteris paribus has remained with us for so long.

In my own work, since I never worry about tracking individual atoms, I gravitate toward weak rather than strong assertions of identity, trying to be very attentive to context. This is very much in the spirit of the middle distance as developed in Brian Cantwell Smith's On The Origin of Objects. Smith's point is that our intuitions are well tuned to objects about our size that we interact with frequently. In data integration and architecture work (I had to get there eventually) it implies that integrating across fields that interact to some degree in the "world" is going to be more feasible than integrating across those that don't interact. The give and take of the practical interaction has allowed us to identify the particular features of each item that are important in context.

Monday, March 23, 2009

OSX Performance Analysis: Instruments

I started working with OSX's Instruments performance analysis tool, partly out of curiosity and partly because I had just fixed a performance problem in an application using an ad hoc a priori analysis. It happened to solve the problem, but I have enough experience with performance issues to know that the a priori guess is often wrong.

Instruments is heavily related to DTrace and shares a lot of its core attributes. The key attributes are that it is low overhead and works with (almost) anything running on your system (OSX apparently has the capability for some applications to turn off monitoring for security/DRM reasons).

There's a lot to like here: you can easily get it up and going on your system and the analysis section is very user friendly:

[Screenshot: blog__instruments_screen.jpg]


Especially nice features include
  • Low overhead: the peak CPU usage I saw for the tool was ~ 16%
  • The ability to display exactly what is going on under the read head (the upside down triangle above the graph)
  • Being able to display parameters that you didn't think of turning on during the run. All parameters are captured. The selection only impacts the display -- a godsend for anyone who has had to rerun a test because they forgot to capture a parameter

That said, I couldn't get any particular instrument to focus only on the process specified. As you can see, all of the instruments capture all of the activity, even though they were set to focus on different processes. Additionally, the "default action" kept resetting whenever I dragged a new instrument onto the display.

It is still a very worthwhile tool, but if anyone has any tips as to how to get around these issues, I'd appreciate it.