Wednesday, July 30, 2008

richfaces:dataTable:column sortBy

sortBy is a new RichFaces capability that I mentioned previously. I've started to incorporate it into my system, and what follows is a collection of tips/notes/frustrations.

If the sorting glyphs appear but nothing happens, the dataTable needs to be surrounded by <h:form></h:form> -- sorting doesn't work outside of a form context.

If the glyphs do not appear, sortBy may not be receiving a valid expression to sort on. This may be the result of a simple typo, e.g., sortBy="{}" rather than sortBy="#{}".
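For reference, here is a minimal sketch of the layout that has been working for me -- the bean name (itemBean) and the property (id) are placeholders, not actual names from my system:

  <h:form>
    <!-- sortBy only takes effect when the table sits inside an h:form -->
    <rich:dataTable value="#{itemBean.items}" var="item">
      <rich:column sortBy="#{item.id}">
        <f:facet name="header">
          <h:outputText value="id"/>
        </f:facet>
        <h:outputText value="#{item.id}"/>
      </rich:column>
    </rich:dataTable>
  </h:form>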

I had some very strange behavior in NetBeans/SeaMonkey with this facility. For example, when I clicked on the sort glyph for the id column, the header of the column changed to "comp_ID form ss" -- yes, even though "comp_ID form ss" no longer appeared in the file.

It used to be there, but I had removed it and performed a "build" in NetBeans (rather than a "clean and build"). I'm not sure about the underlying cause, but in my mind the effect straddles the boundary between disconcerting and amusing. The bottom line is that if things start acting strangely, do a "clean and build".

All in all, sortBy is a real step forward: it works in ajax tabPanels and simplifies the xhtml code a great deal. Despite the caveats above, sorting has been working for me as expected.

Wednesday, July 16, 2008

Structuring Database Tables

Continuing with the "what should the storage actually look like" line of thought from my last post, I just read a paper entitled Storage and Querying of E-Commerce Data by Agrawal, Somani, and Xu. The paper discusses the trade-offs between storing data in "wide" (up to 1K column) tables versus storing the data in sets of narrow (vertical) tables. The data under consideration consists of sparsely populated data sets (lots of null values); varying definitions of "sparse" are used to generate the results.

Their measurements clearly support the overall performance advantage of the narrow table approach despite the complexity of reassembling the data into a "wide" form, when (and if) it is required. Their work has been picked up a bit by the column database and semantic web crowds, but not to the extent that one would expect. I think that this reflects the fact that the datasets were fairly small (1000 cols, 20k rows).

A couple of notes about the results: they achieved the improved performance of the vertical representation despite representing everything as an (object, key, value) triple and building a translation layer to shield the user from the vertical representation. The triple store was built on DB2.
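To make the contrast concrete, here is a rough SQL sketch of the two representations (the table and column names, and the sample attributes, are my own illustration rather than the schema used in the paper):

  -- "Wide" (horizontal) representation: one column per attribute,
  -- most of them NULL for any given row.
  CREATE TABLE product_wide (
      oid        INT PRIMARY KEY,
      color      VARCHAR(64),
      weight     VARCHAR(64),
      resolution VARCHAR(64)
      -- ... potentially hundreds more columns, mostly NULL
  );

  -- "Narrow" (vertical) representation: one row per non-null value,
  -- stored as (object, key, value) triples.
  CREATE TABLE product_vertical (
      oid  INT,
      attr VARCHAR(64),
      val  VARCHAR(64),
      PRIMARY KEY (oid, attr)
  );

  -- Reassembling part of a "wide" row from the triples when needed:
  SELECT c.val AS color, w.val AS weight
  FROM   product_vertical c
  JOIN   product_vertical w ON w.oid = c.oid AND w.attr = 'weight'
  WHERE  c.attr = 'color' AND c.oid = 42;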

Although these findings are both intriguing and encouraging (from the standpoint of wanting to break information up into its primal entities), I wonder about the scaling behavior of a system structured in this way, since it radically increases the number of rows per table. After all, all systems have limits (e.g. MySQL, Postgres, Oracle, SQL Server), and more importantly they have optimal operating regions, aka "sweet spots". A few years ago, when the largest table in one of my systems reached 100 million (wide) rows, running even simple queries against that table was painful (which I'm sure could have been alleviated with some clever tuning -- but it was neither the tallest pole in the tent, nor the squeakiest wheel on the cart).

My concern has to do with the risks of getting outside the "sweet spot" of the systems upon which FDD is being constructed -- I remember back when one of my former employers switched database vendors (a non-trivial project, to say the least). With Vendor A, we were the customer with the largest DB in their installed base (aka outside the sweet spot). With Vendor B, we were a "moderately large" installation, but certainly not in the top 100 (aka inside the sweet spot). The number of bugs we encountered in Vendor B's database was substantially smaller. I assume that this was because the bugs had already been stumbled upon by the bleeding-edge users and fixed by the vendor by the time we would have encountered them.

That, to me, is the prime reason to try to stay inside the sweet spot. If you don't, you're the one finding the new bugs and either fixing them, paying for them to be fixed, or hoping for the vendor to fix them (and developing a deep understanding of where you are on the vendor's list of priority customers).

More on this in my next post.

Sunday, July 6, 2008

Temporal Data

As part of fleshing out the design for the Flexible Drug Discovery (FDD) platform, I'm deciding upon the level of support for temporal data. The simplest decision, based upon the existing "webtwo" infrastructure, would be to have an insert_time and update_time attribute for each item. I think that these times are the bare minimum required to understand/debug system operation.
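As a sketch, the bare-minimum version looks something like this (the item table and its non-temporal columns are illustrative, not the actual webtwo/FDD schema):

  CREATE TABLE item (
      item_id     INT PRIMARY KEY,
      name        VARCHAR(255),
      insert_time TIMESTAMP NOT NULL,  -- when the row was first created
      update_time TIMESTAMP NOT NULL   -- when the row was last modified
  );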

However, in the clinical domain I've become accustomed to thinking about the data in terms of "what did we know when?", so that it is possible to reconstruct the understanding of a trial at a given point in time. This obviously requires much more extensive tracking.

I recently came across an excellent book on the topic: Temporal Data and the Relational Model by Date, Darwen, and Lorentzos. It presents a detailed analysis of the issues involved in working with temporal information, using a refreshingly simple example consisting of a few tables of data about parts and their suppliers.

Many systems use begin and end dates for each row to track when the data has changed, which supports the type of use most relevant to clinical/scientific analysis. However, this technique does not support some interesting situations in the business domain. For example, p. 166 of the book shows that, given an item with the attributes name, status, and city, answering simple questions such as "how long has a supplier been at that address?" or "how long has a supplier had that name?" requires a begin/end date for each attribute. Thinking through the implications of this issue results in refactoring the model into irreducible components (aka sixth normal form), as described on p. 173.
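A rough SQL sketch of what that refactoring looks like, in the spirit of the book's suppliers example (the table and column names are mine, as is the end-date sentinel):

  -- One table per attribute, each carrying its own interval.
  CREATE TABLE supplier_name_during (
      supplier_id INT,
      name        VARCHAR(255),
      date_from   DATE NOT NULL,
      date_to     DATE NOT NULL,   -- '9999-12-31' used as "still current"
      PRIMARY KEY (supplier_id, date_from)
  );

  CREATE TABLE supplier_city_during (
      supplier_id INT,
      city        VARCHAR(255),
      date_from   DATE NOT NULL,
      date_to     DATE NOT NULL,
      PRIMARY KEY (supplier_id, date_from)
  );

  -- "How long has supplier 42 been at its current address?"
  SELECT date_from
  FROM   supplier_city_during
  WHERE  supplier_id = 42
    AND  date_to = DATE '9999-12-31';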

As implied by the term sixth normal form, using the temporal behavior of the data as a design axis can have extensive implications, e.g.:
  • splitting quantities out in a LIMS system
  • splitting out names (especially last names!) in a system that tracks employees, etc.

This implies that it is important to consider the temporal behavior of the data even if a temporal model is not planned for the system, as it helps drive scenarios for evaluating the system's response to expected changes -- e.g., "high flux" items may require optimized interfaces, surface special reporting requirements, etc.

Other noteworthy topics in the book include merging intervals -- e.g., the two facts that attribute A had value 3 from t1 to t3 and has had value 3 from t3 to now should be merged into the single fact that A has had value 3 from t1 to now.
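A minimal sketch of detecting the simple two-row case described above (again with my own illustrative table and column names; fully general coalescing takes more machinery than this):

  -- Find pairs of rows for the same item and value whose intervals
  -- meet end-to-start, i.e., candidates to be merged into one fact.
  SELECT a.item_id, a.val, a.date_from, b.date_to
  FROM   attr_a_during a
  JOIN   attr_a_during b
         ON  b.item_id   = a.item_id
         AND b.val       = a.val
         AND b.date_from = a.date_to;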

There is also a discussion of the time-from/time-to in the persistent store vs. the time-from/time-to in the world (transaction time vs. valid time, in the usual temporal-database terminology), which, although important in developing system requirements, doesn't appear to require analysis different in character from what is conventionally performed. My view is that world and storage times are disjoint concerns. In scientific systems there is rarely a reason to worry about world times -- other than referencing the date upon which an operation was performed.

Again, an interesting read, highly recommended -- despite the authors' frequent exhortations on how to read the book, e.g., "are definitely meant to be read in sequence as written" (p. 51), or "note carefully" ("carefully" is used an inordinately large number of times in the text).

As storage becomes cheaper, the downside of not having a temporal capability will more frequently exceed its implementation cost.