Wednesday, July 16, 2008

Structuring Database Tables

Continuing with the "what should the storage actually look like" line of thought from my last post, I just read a paper entitled Storage and Querying of E-Commerce Data by Agrawal, Somani, and Xu. The paper discusses the trade-offs between storing data in "wide" tables (up to 1K columns) and storing it in sets of narrow (vertical) tables. The data under consideration consists of sparsely populated data sets (lots of null values); varying definitions of "sparse" are used to generate the results.
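To make the contrast concrete, here is a minimal sketch of the two layouts in SQLite (the paper's own implementation was on DB2; the product attributes below are made up for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Horizontal ("wide") layout: one column per attribute; for sparse
    # data most of these cells end up NULL.
    conn.execute("""
        CREATE TABLE product_wide (
            oid      INTEGER PRIMARY KEY,
            color    TEXT,
            capacity TEXT,
            voltage  TEXT
        )
    """)

    # Vertical (narrow) layout: one row per non-NULL (object, key, value)
    # triple, so the NULLs are simply never stored.
    conn.execute("CREATE TABLE product_vertical (oid INTEGER, key TEXT, val TEXT)")

    # A product with one populated attribute: a mostly-NULL wide row
    # versus a single triple.
    conn.execute("INSERT INTO product_wide VALUES (1, 'red', NULL, NULL)")
    conn.execute("INSERT INTO product_vertical VALUES (1, 'color', 'red')")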

Their measurements clearly support the overall performance advantage of the narrow-table approach, despite the complexity of reassembling the data into "wide" form when (and if) that is required. Their work has been picked up a bit by the column-database and semantic-web crowds, but not to the extent one would expect. I think this reflects the fact that the datasets were fairly small (1000 columns, 20K rows).

A couple of notes about the results: they achieved improved performance in the vertical representation despite representing everything as an (object, key, value) triple and building a translation layer to shield the user from the vertical representation. The triple store was built on top of DB2.
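To give a flavor of what such a translation layer has to generate, here is a sketch, again in SQLite with made-up attribute names, that folds triples back into a wide row using conditional aggregation (self-joins are the other common approach; I'm not claiming this is the exact SQL their layer emits):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (oid INTEGER, key TEXT, val TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
        (1, "color", "red"),
        (1, "capacity", "8GB"),
        (2, "voltage", "110V"),
    ])

    # Conditional aggregation: one MAX(CASE ...) per requested attribute
    # folds an object's triples back into a single wide row. NULLs
    # reappear only at query time, for attributes an object never had.
    wide = conn.execute("""
        SELECT oid,
               MAX(CASE WHEN key = 'color'    THEN val END) AS color,
               MAX(CASE WHEN key = 'capacity' THEN val END) AS capacity,
               MAX(CASE WHEN key = 'voltage'  THEN val END) AS voltage
        FROM triples
        GROUP BY oid
        ORDER BY oid
    """).fetchall()
    print(wide)  # [(1, 'red', '8GB', None), (2, None, None, '110V')]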

Although these findings are both intriguing and encouraging (from the standpoint of wanting to break information up into its primal entities), I wonder about the scaling behavior of a system structured this way, since it radically increases the number of rows per table. After all, every system has limits (e.g. MySQL, PostgreSQL, Oracle, SQL Server), and more importantly every system has an optimal operating region, aka a "sweet spot". A few years ago, when the largest table in one of my systems reached 100 million (wide) rows, running even simple queries against it was painful (which I'm sure could have been alleviated with some clever tuning -- but it was neither the tallest pole in the tent nor the squeakiest wheel on the cart).
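A quick back-of-envelope on that row blow-up, using 5% non-null density as an illustrative assumption (not a number from the paper):

    # With n objects, k possible attributes, and non-NULL density d, the
    # triple table holds roughly n * k * d rows.
    def triple_rows(n_objects, n_attrs, density):
        return int(n_objects * n_attrs * density)

    # Roughly the paper's scale: manageable either way.
    print(triple_rows(20000, 1000, 0.05))      # 1,000,000 triples

    # A 100-million-row wide table at the same sparsity is another story.
    print(triple_rows(100000000, 1000, 0.05))  # 5,000,000,000 triples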

My concern has to do with the risks of getting outside the "sweet spot" of the systems upon which FDD is being constructed. I remember when one of my former employers switched database vendors (a non-trivial project, to say the least). With Vendor A, we were the customer with the largest DB in their installed base (i.e. outside the sweet spot). With Vendor B, we were a "moderately large" installation, but certainly not in the top 100 (i.e. inside the sweet spot). The number of bugs we encountered in Vendor B's database was substantially smaller, I assume because the bleeding-edge users had already stumbled upon them and the vendor had already fixed them by the time we would have hit them.

That, to me, is the prime reason to try to stay in the sweet spot. If you don't, you're the one finding the new bugs, and either fixing them yourself, paying to have them fixed, or hoping the vendor will fix them (and developing a deep understanding of where you sit on the vendor's list of priority customers).

More on this in my next post.
