Sunday, May 17, 2009

Wolfram Alpha

Wolfram Alpha is supposed to be launching in the next few days and has been getting a lot of publicity. For background, here's a link to a short YouTube demo of Wolfram Alpha and a NY Times article. Doug Lenat also has a nice post on his impressions.

From what I can see (and I don't have access), even though it doesn't live up to some of the early hype, it achieves a very interesting result: it allows retrieval of general computable information using a simple natural language processing (NLP) interface.

This allows for analysis similar to that permitted by a data warehouse, but within a different design space. The design goals of Wolfram Alpha, unlike those of a data warehouse, preclude prestructuring the data into marts that allow rapid querying in relatively well-defined ways. However, as in the mart/warehouse situation, you must still respond quickly to quantitative queries to keep users from drifting away while waiting for an answer.

The question is: how is this done? Rumors on the net indicate that the underlying data is an RDF triple store, which makes a lot of sense since RDF triples constitute a vertical, model-free storage approach. In operation, I imagine that the terms in a query provide nice entry points for initiating a spreading-activation fan-out process on the graph. When the activations intersect, you can proceed to roll back up to the initiation points suggested by the query, clustering in a bottom-up, data-driven fashion along the way. The clustering also affords a natural way to structure the data for presentation to the user.
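To make the guess concrete, here is a minimal Python sketch of a spreading-activation fan-out over a tiny in-memory set of triples. Everything in it is an assumption for illustration: the triples, the predicate names, and the functions `build_graph`, `spread`, and `intersect` are hypothetical, and nothing here reflects how Wolfram Alpha actually works. It only shows the idea of fanning out from each query term and keeping the nodes where the activations meet.

```python
from collections import defaultdict, deque

# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = [
    ("England", "part_of", "United Kingdom"),
    ("Scotland", "part_of", "United Kingdom"),
    ("England", "has_stat", "england_age_dist"),
    ("United Kingdom", "has_stat", "uk_age_dist"),
    ("england_age_dist", "measures", "age distribution"),
    ("uk_age_dist", "measures", "age distribution"),
]


def build_graph(triples):
    """Index triples as an undirected adjacency map so activation flows both ways."""
    graph = defaultdict(set)
    for subj, _pred, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def spread(graph, seed, max_hops):
    """Breadth-first fan-out from one seed; returns {node: hops from seed}."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for nbr in graph[node]:
            if nbr not in hops:
                hops[nbr] = hops[node] + 1
                queue.append(nbr)
    return hops


def intersect(graph, seeds, max_hops=1):
    """Nodes reached from every seed, ranked by total hop distance."""
    reached = [spread(graph, s, max_hops) for s in seeds]
    common = set(reached[0]).intersection(*reached[1:])
    return sorted(common, key=lambda n: (sum(r[n] for r in reached), n))


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    # The two query terms act as entry points; the meeting node is the candidate answer.
    print(intersect(graph, ["England", "age distribution"], max_hops=1))
```

A real store would weight edges by predicate and decay activation with distance; the plain hop count here is just the simplest stand-in for that.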

Although I'll admit that this is just an educated guess as to the mechanism, it does suggest an interesting set of technologies involving fast linking and roll up of data for ad-hoc queries without requiring a lot of effort to tune the data to a specific query.

Generating a set of vetted and annotated data is a different problem, but hopefully would not require a significantly greater level of effort than the ETL portion of current warehousing efforts.

Wolfram Alpha therefore constitutes another factor leading me to be more vertical in my storage designs. In the coming months, I'm hoping to run some benchmarks on production hardware/datasets so as to ground the practicality of this approach and then get permission to publish the results.

Update 18 May 2009: I tried Wolfram Alpha today and it failed on my first query, "age distribution of England vs UK." The failure came not so much from any idiosyncrasy in parsing my query as from what appears to be an encoded identity, "England == UK." This just goes to show how important it is to be spot-on with your identity information. In other words: "synonym tables are easy; antonym tables, on the other hand... aren't."
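As a rough illustration of the distinction, the sketch below keeps name aliases (true synonyms) separate from a containment relation, so that "England" and "UK" resolve to different entities that are related rather than identical. The tables and the functions `resolve` and `relation` are hypothetical and exist only for this example; I have no knowledge of how Wolfram Alpha encodes its identity data.

```python
# Hypothetical alias table: surface names that really are the same entity.
ALIASES = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
    "england": "England",
    "scotland": "Scotland",
}

# Hypothetical containment table: related entities that must NOT be merged.
PART_OF = {
    "England": "United Kingdom",
    "Scotland": "United Kingdom",
}


def resolve(name):
    """Map a surface name to its canonical entity; raises KeyError if unknown."""
    return ALIASES[name.strip().lower()]


def relation(a, b):
    """Classify how two names relate: same, part_of, contains, or distinct."""
    ea, eb = resolve(a), resolve(b)
    if ea == eb:
        return "same"
    if PART_OF.get(ea) == eb:
        return "part_of"
    if PART_OF.get(eb) == ea:
        return "contains"
    return "distinct"


if __name__ == "__main__":
    print(relation("England", "UK"))
    print(relation("U.K.", "United Kingdom"))
    print(relation("England", "Scotland"))
```

With the two tables kept apart, a comparison query such as "England vs UK" has two distinct operands to work with instead of collapsing to a single entity.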
