Tuesday, December 23, 2008

Discussing Architecture

If you're like me, you are forever grappling with finding the right format for discussing architecture with end users.

There are the standard architectural diagrams, but they really don't capture:
  • How the end users relate to the system
  • How all of the components are strung together to support the overall business processes
  • What artifacts (data, documents, etc.) are produced

I think there may finally be a "better mousetrap" available in ArchiMate which has a suitably small number of patterns/best practices that allow you to capture what's important in a large system at the appropriate level of detail. The most important aspect of ArchiMate is that it is a mixed mode modeling language with a distinctive, simple shape for each type of artifact.

ArchiMate allows for eight different types of things: Services, Processes, Organization, Products, Information, Infrastructure, Applications, and Functions.

As of yet I am not completely clear on the breakdown of these categories and to be honest I'm not sure that it really matters. ArchiMate is a communication tool. If it helps you communicate, it has served its purpose. If your way of breaking down the architecture is slightly different from their recommendations (which I strongly feel could use more examples) it may have no more impact than if your class structure for a domain is slightly different from someone else's. Don't get me wrong: I'm a big fan of standards when they have been properly vetted and heavily used, but until they hit that point it is important to be flexible in using them. Flexibility allows a standard to grow and cover a sufficient portion of the domain; otherwise it will wither from lack of use.

The presentation format that speaks to me the most is the layered diagram as shown on p. 11 (Figure 12) of the Enterprise Architecture Development and Modelling paper. See below:


Here's my simplified take for a clinical trial system used by physicians and patients.


What I like about this format is that all of the elements on a particular layer are at the same level of abstraction and are easily placed in relationship to the few other things that are at that same abstraction level. Simultaneously, you can see what supports (and is supported by) a particular component. Each perspective (which appears on the same diagram) only requires concentrating on a small number of things at a time and can easily be held in your short-term memory.

One of the key things about the layers is that they distinguish externally available from internally consumed data interfaces -- especially highlighting those that cross abstraction boundaries. These external interfaces are distinguished from those that support multiple applications at the same level in the stack. Such "external" interfaces (which are internal to a particular level of abstraction) are more easily altered since they are more tightly coupled organizationally. This nicely foregrounds the implication that care should be taken in designing the external, level crossing APIs and lifecycles since changes to them will be harder to coordinate given the diversity of the interested parties.

Monday, December 8, 2008

Name that data

There is an interesting trend that I feel has the potential to fundamentally shift the way we think about how data is used in networking and applications.

It involves attaching unique names to data, i.e., referring to the data by its SHA1 hash value. Unique identifiers allow a number of things. For example, they allow you to retrieve the data from the network without worrying where it resides on the network, as described in A New Way to look at Networking, in which Van Jacobson discusses breaking out the content of the page from the page itself.

At 42:47 in the talk there's a section on "Dissemination networking," in which data is requested by name using any and all means available (IP, VPN tunnels, zeroconf addresses, multicast, proxies, etc.). Anything that hears the request and has a valid copy of the data can respond. The returned data is signed and optionally secured, so its integrity and association with the name can be validated.

Rather than getting all of the data from a particular location e.g., as specified by a URL, we get the identifier for the content of the data and then let the network supply the data from the location "nearest" (using the appropriate distance metric) to its use.
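As a concrete (and purely illustrative) sketch of the idea, here is a toy content-addressed store in Java: the "name" of a blob is just its hash, and retrieval asks for the name without ever specifying a location. The class and method names are my own, not anything from the talk.

import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Toy content-addressed store: data is named by its SHA-1 digest.
public class NamedDataStore {
    private final Map<String, byte[]> blobs = new HashMap<String, byte[]>();

    // The "name" of the data is derived solely from its content.
    public static String nameOf(byte[] data) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // "put" returns the name; any replica holding the same bytes computes the same name.
    public String put(byte[] data) throws Exception {
        String name = nameOf(data);
        blobs.put(name, data);
        return name;
    }

    // "get" by name; the caller never says where the bytes live.
    public byte[] get(String name) {
        return blobs.get(name);
    }
}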

A similar idea exists in ZFS' implementation of a copy-on-write transactional model:
ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum of the target block which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written.

This design allows small changes in large files to be reflected by changing only the blocks that have been altered rather than by rewriting the whole file to a new location. This simplifies backup procedures, reduces R/W bandwidth requirements, etc. Part of what's significant here is that we're dealing with abstractions of the data, e.g., checksums rather than the data itself. If we use cryptographic hash functions rather than checksums, the ideas become isomorphic: "get me this block from wherever it is."

This has a number of potentially interesting applications depending on the granularity of the "named data". As a simple example, answering the obvious: "am I working with the copy of the file that was emailed to me last Friday, or an older version?" Even with coarse-grained naming it would be possible to create mashups of music already on a user's computer -- just transmitting offsets into and segment durations of existing content gets past DRM issues entirely (morally if not practically -- since much DRM-protected media is encrypted).

Uniquely naming the content is not just about networking, nor is it just about data, but is really about the cross product of the two: the data and its location/retrieval. Data location and retrieval is most of what is involved in computing: unless data is being actively processed by a computational unit (in this case, I'm talking about the integer and floating point units on the chip), the rest of "computation" is about the retrieval and storage of data, e.g., do I put this in a frame buffer, in the cloud or in a shredder?

Imagine an ecosystem of data that represents everything that you own/care about -- you could partition this data into multiple overlapping categories based on any number of attributes such as:
  • temporary: make no copies

  • pieces of a larger whole: applications would minimize the number of named datasets that they change when they perform updates e.g., editing out that first minute from your video won't change every block of the video.

  • number of independently survivable copies required, with what longevity?
  • coupling the data to geographic position: How closely should the data follow me as I move around the planet? Can it stay where I put it, or should it go where I go and be available for high bandwidth processing?

These are just my first-pass ideas. This concept of divorcing the data from its location and higher level structural organization opens up the potential for a whole new set of applications which provide enhanced user functionality by pushing a lot of these data management, replication and caching issues deep into the infrastructure.

Monday, November 24, 2008

Semantic Interoperability: from Mashups to Inference

My last post looked at semantic interoperability from the standpoint of the CDISC BRIDG model. Thinking back on it, I found that writing it left me with more questions than I had before I started.

I have to admit to being unclear as to what is meant by “semantic interoperability,” since I have heard it used in a number of different ways depending upon the audience. (Apparently I’m not the only one: the Wikipedia entry on semantic interoperability has the caveat “All or part of this article may be confusing or unclear.”)

"Semantic interoperability" puts requirements on the data, on the models, and on the processes of using them. How we respond to those requirements implies different interpretations of what it means to be semantically interoperable.

I think that there are three basic ways of using data that are "semantically interoperable".
  • “hands-off” data integration between designated, well-curated systems -- this is the way in which I think it is used most often.

  • “hands-off” data integration between any systems sharing common identifiers e.g., publish an interface and allow anyone to use it.

    • The dual of this: integration with any published interface that provides the data that you're looking for -- which I think is less common, e.g., I'll use any map, or any book information service, rather than Google/Amazon (or Yahoo/BN). I haven't seen this that often and I think it sounds a bit sketchy.

  • Using OWL reasoners etc. for inference across systems to generate new information.

The requirements around these things are pretty different, both in data quality and in the congruence of the requisite component models.

The first case, “hands-off” data integration between systems sharing common identifiers, doesn't really require any similarity of models other than around the key integration point(s). You need the name of the referent of the data, the name of the data item, and the format of the returned data, e.g., "The first president of the United States"; "date of birth" returned in ISO 8601 format. Of course, the more points that you want to make referenceable between the systems, the more the models have to match, e.g., your model of US presidents has to contain a way of dereferencing the person and that person's date of birth. The more independently developed and maintained your systems are, the sooner you want to start using RDF to give you very stable identifiers for your referents.
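To make this concrete, here is a small, hypothetical Java sketch (my own, not from any of the systems discussed): two systems can interoperate "hands-off" as long as they agree on a stable identifier for the referent, the name of the data item, and the format of the value (ISO 8601 for dates). The URN-style identifier below is invented for the example.

import java.util.HashMap;
import java.util.Map;

// Toy illustration: interoperability needs agreement on (1) a stable identifier
// for the referent, (2) the name of the data item, and (3) the value format.
public class SharedIdentifierLookup {
    // referent identifier -> (data item name -> value)
    private final Map<String, Map<String, String>> facts =
            new HashMap<String, Map<String, String>>();

    public void assertFact(String referentId, String dataItem, String value) {
        Map<String, String> items = facts.get(referentId);
        if (items == null) {
            items = new HashMap<String, String>();
            facts.put(referentId, items);
        }
        items.put(dataItem, value);
    }

    public String lookup(String referentId, String dataItem) {
        Map<String, String> items = facts.get(referentId);
        return (items == null) ? null : items.get(dataItem);
    }

    public static void main(String[] args) {
        SharedIdentifierLookup kb = new SharedIdentifierLookup();
        // A URI-style identifier plays the role of the shared referent name.
        kb.assertFact("urn:example:us-president:1", "date of birth", "1732-02-22"); // ISO 8601
        System.out.println(kb.lookup("urn:example:us-president:1", "date of birth"));
    }
}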

If the systems are required to do some curation/analysis of the data, the exported models need to match more closely so that you can derive the correct metrics to perform the analysis and understand the relationships between individual data points. A good example of this comes from Nick Malik who points out
So, if you look in a database and you see a purchase order... has it been approved or not? The answer depends on the business unit that created it.

Your models can be in a number of different forms (UML, OWL, etc.) and be wildly divergent from the underlying reality, but if the delusion is shared you can achieve some synergy.

Inference of course requires (at least) a reasonably complete local OWL ontology, since that's the only modelling language here that permits inference. Models also have to more closely resemble the shared "current best understanding" of reality (which is of course a moving target in a scientific domain) or the resulting inferences will be worthless, or at best amusing.

However, doing an ontology is a big deal (see The Joy of Ontology by Suzanna Lewis for a discussion). The increment of commitment that we're making here is decidedly non-trivial, especially if the domain that we are trying to model is of substantial size.

I think the clinical trial domain is a good example of substantial size. BRIDG took a long time to do, it is still undergoing revision and does not allow inference. I would argue that given the continued refinement of some of the base terms (sex and gender were recently updated), even if there were an ontology, hands-off inference is not something that lies in the near future, simply because the ground doesn't provide a sufficiently firm foundation.

Just for clarity -- this doesn't mean that turning loose an inferencing bot over a sufficiently sized test set would not yield interesting and perhaps even transformative results. It just means that the inferencing would be part of a web research project rather than a production operation.

I could be wrong on this (and gladly so), but I did live through the AI winter and I can no longer utter the phrase "sufficiently smart compiler" without irony.

Tuesday, November 11, 2008

Semantic Interoperability: Adverse Events

When reviewing the Bridg Release 2.0 Static Elements report.RTF I did look in a bit of detail at the adverse event model.

Here's a summary:

The AdverseEvent class is described as having the following connections and attributes:

  • Association link from class PerformedProductInvestigation -- a generalization of PerformedActivity adding an evaluationMethodCode attribute; PerformedActivity captures the duration of the activity.

  • Association link from class Subject -- the clinical subject (an entity of interest, either biological or otherwise).

  • Association link to class AEOutcomeAssessmentRelationship -- links the AE to an observation; for example, recovered/resolved, recovering/resolving, not recovered/not resolved, recovered/resolved with sequelae, fatal or unknown.

  • Association link to class AECausalityAssessmentRelationship -- links the AE to an observation; for example, when an adverse event occurs, a physician may evaluate interventions that may have caused the adverse event.

  • Association link to class AEActionTakenRelationship -- specifies the link between an adverse event and the steps performed to address it; for example, study dose reduced, protocol treatment change, etc.

  • Generalization link to class PerformedObservationResult -- links all observations/protocol deviations etc. together with a report.

The AE itself has the attributes listed below (a rough class sketch follows the list):
  • gradeCode

  • severityCode
  • seriousnessCode

  • occurrencePatternCode
  • unexpectedReasonCode

  • expectedIndicator

  • highlightedIndicator
  • hospitalizationRequiredIndicator

  • onsetDate
  • resolutionDate
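As a rough illustration only, here is what the attributes and associations listed above might look like as a plain Java class. The field types and stub classes are my guesses for readability; they are not BRIDG's normative definitions.

// Stub placeholders for the related classes named in the model (illustrative only).
class PerformedObservationResult { }
class PerformedProductInvestigation { }
class Subject { }
class AEOutcomeAssessmentRelationship { }
class AECausalityAssessmentRelationship { }
class AEActionTakenRelationship { }

// Illustrative sketch: attribute names follow the list above; types are guesses.
class AdverseEvent extends PerformedObservationResult {
    String gradeCode;
    String severityCode;
    String seriousnessCode;
    String occurrencePatternCode;
    String unexpectedReasonCode;
    Boolean expectedIndicator;
    Boolean highlightedIndicator;
    Boolean hospitalizationRequiredIndicator;
    java.util.Date onsetDate;
    java.util.Date resolutionDate;

    // Associations named in the model:
    PerformedProductInvestigation investigation;
    Subject subject;
    java.util.Set<AEOutcomeAssessmentRelationship> outcomeAssessments;
    java.util.Set<AECausalityAssessmentRelationship> causalityAssessments;
    java.util.Set<AEActionTakenRelationship> actionsTaken;
}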

The end result is a well-thought-out formal structure that should cover all situations and permit systems to interoperate.

In my mind, this is not the same as assuring semantic interoperability. For semantic interoperability to really occur, the grade codes must be comparable across sites, hospitalization criteria must be identical (or at least commensurable), etc. Achieving this comparability requires continuing education and harmonization efforts, constant feedback of metrics to practitioners, etc. It therefore represents a much higher bar.

This is not in any way a criticism of BRIDG. You need something like BRIDG -- a well-vetted industry standard -- to even be able to begin such an attempt. However, true semantic interoperability involves not only the structure of the data, but the data itself.

Monday, October 27, 2008


Last month I attended the Boston Area CDISC Users group meeting (BACUN). All of the presentations were interesting and useful. However, I found that the one by Lisa Chatterjee on BRIDG stood out as particularly informative.

The BRIDG Domain Analysis Model is a representation of protocol-driven biomedical/clinical research.

One of the goals of the effort is Semantic Interoperability - I don't think that this means that "following the model" guarantees semantic interoperability, but rather that BRIDG constitutes a starting point from which a semantically commensurable system can be built. The BRIDG team appears to view the model as a foundation for other more problem-specific representations (CDISC/HL7 etc.). The idea being that if you can map BRIDG <-> HL7 and BRIDG <-> CDISC, the HL7 <-> CDISC mapping is (relatively) straightforward.

There is no question that BRIDG represents an excellent starting place for using data in an interoperable fashion. All in all it shows a very inclusive approach -- and a surprising openness to modifying the model to ameliorate difficulties encountered in use.

The core modeling language is UML and spreadsheets are used to track much of the mapping (there already is a draft version of a spreadsheet that maps the BRIDG R2.0 model to RIM2.18).

I have to admit that I haven't examined the model in complete detail. However, from what I've seen almost everything that you need for the target domain is there and the level of abstraction feels right: low enough to be relatively easy to implement, but high enough so that you don't get wedged into a corner from the get-go.

I did look in a bit of detail at the adverse event model, which is represented in the Bridg Release 2.0 Static Elements report.RTF.

What we see in Figure 5: View 4 - Adverse Event is ~90% of the complete domain model with a number of classes added in support of recording adverse events and tracking their eventual analysis/resolution.

Since adverse events raise some issues about semantic interoperability that I want to talk about in detail, I will cover them in my next post.

BTW on my Mac, the only application that could open the .rtf file with the figures was OpenOffice 3. MS Word 2004 elided the figures.

Monday, October 13, 2008

Seam: the ftl advantage

I finished the first pass schema for the flexible drug discovery framework and pointed the most recent (2.02) GA release of Seam at the database.

My hope was that it would produce pages with the latest ajax-friendly table sorting headers -- sadly it didn't. In addition, this version retains the practice of generating pages that use labels and column headers wired to a particular language (English) rather than allowing the headers to be language dependent messages, even though the pages fully support localization (via language specific message_x.properties files in the resources directory).

I was in the process of fixing these issues manually via emacs using these instructions (I find that emacs makes it easier to select the files that I want to edit -- Netbeans picks up too many files and deselecting 70 or so "extra" files is too much work). While making these changes, I was discussing a new system with one of my clients and recommended that they consider an open source solution since they could change it (or hire somebody to change it) if at some point they encountered something that they didn't like. I thought this better than the alternative of waiting for a potentially unresponsive vendor to pay attention to their problem.

As I was saying this, I thought to myself: "Seam is open source, maybe I should try to fix the code rather than edit the result." I'm usually hesitant to go down this path. Sometimes it works, but my a priori estimate is that the process of understanding the code structure, getting the build environment running etc. costs a day (or more) before anything productive comes out the other end.

However, as a proof of principle, I thought I should give it a try and see how it went.

I'm very happy to report that given a combination of good architecture and good tool selection on the part of the Seam team, making these modifications was almost trivial.

Let me explain why, and encourage you to "do this at home."
The core reason is that the seam generate-entities command operates in a number of phases, one of which generates the .xml and .xhtml files using freemarker template files (.ftl). This allows the required changes to be made without either looking at the java involved or developing any understanding of the calling structure, etc.

The .ftl files (in ./seam-gen/view/) are pretty much self-documenting (which is handy, given the level of documentation included in the files) and very easy to change -- errors thrown by the freemarker engine are clear and easy to work with.

A couple of (minor) caveats
  • The .java files in the src/action will need to be removed between runs of seam generate-entities.
  • I think it is a good idea to point the seam generator at a different target directory (specified by ./seam-gen/build.properties -- "workspace.home") while debugging so that you don't accidentally overwrite edits that you've already made (oddly I thought of this BEFORE I ran the generator)

In summary, if you find yourself making a lot of changes to seam generated files, change the .ftl files instead; it can be much, much easier.

Monday, September 29, 2008

Suddenly mysql won't start

Suddenly is a bit of an exaggeration. This happened on my desktop -- I use mysql on my laptop almost daily, but only once every few months on my desktop. They are similar environments: Intel-based Macs running the latest patched Mac OS X 10.5. The mysql on the desktop was transferred from my previous PowerPC-based desktop, but has been used a few times since the transfer.

In any case, the normal startup action
sudo /usr/local/mysql/bin/mysqld_safe

failed with the following output.

Starting mysqld daemon with databases from /usr/local/mysql/data

/usr/local/mysql/bin/mysqld_safe: line 395: /usr/local/var/: Is a directory

/usr/local/mysql/bin/mysqld_safe: line 401: /usr/local/var/: Is a directory

STOPPING server from pid file /usr/local/mysql/data/rdf-8-Tower.local.pid

tee: /usr/local/var/: Is a directory

080910 15:39:05 mysqld ended

tee: /usr/local/var/: Is a directory

At this point I said to myself "well this hasn't been upgraded in a while, I should upgrade mysql" !!bad idea!!

The upgrade didn't solve the problem. After tracing through the script in more detail I did a

sudo ./my_print_defaults

which reminded me that

Default options are read from the following files in the given order:

/etc/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf

/etc/my.cnf contained:

log = /usr/local/var/mysqlLOG.log
#If no specific storage engine/table type is defined in an SQL-Create statement the default type will be used.
max_allowed_packet = 16M
#Enter a name for the error log file. Otherwise a default name will be used.
#Enter a name for the slow query log. Otherwise a default name will be used.


I changed the log-related entries so that they pointed at actual log files rather than at the bare /usr/local/var/ path that was producing the "Is a directory" errors above.

This allowed the db to start but I couldn't log in with any of the user accounts normally available (including root). It appears that using the mysql-5.0.67-osx10.5-x86.dmg file to update caused the db user information to get hosed.

Fortunately I'm pretty neurotic about backups and so I easily recovered just by copying /usr/local/* from my last backup.

NOTE: this backup was to a separate disk performed using SuperDuper -- TimeMachine isn't going to get files in /usr (which helps explain why the disk space used by TimeMachine is smaller than I expected).

Monday, September 15, 2008

Seam on Amazon EC2

I just completed putting up a demo of my seam work on Amazon Web Services EC2 service. I primarily did this to ground my advocacy of EC2 as a good option for small biotechs that may need occasional bursts of compute power but have neither the cash to buy adequate servers for peak compute load nor the staff to maintain them.

I thought that putting up my jboss/seam/mysql demo would also be sufficiently non-trivial to give me a good feel of what it is like.

There are similarities between EC2 and other virtualization options (EC2 is based on XEN after all).

The core differences in my mind revolve around having S3 as a backing store. Since S3 is on Amazon's servers you need to pay more attention to security keys etc.

My recommendation is to go through the Getting Started Guide -- even to the point of saving a modified image. This will assure that you have the proper accounts set up both on EC2 and S3 and you have a bucket set up on S3 for storing your image.

I found it easiest to create a bucket using the Python examples -- even though I have never done much more than a "hello world" program in python (yes one that consists of
#!/usr/bin/env python
print "Hello World"
). The python code is the most self contained and required the fewest downloads of ancillary libraries in my environment (Apple OSX 10.5.4)

Building and saving an image takes a while -- I would recommend doing a reboot of your virtual machine to make sure that all of the changes take hold (boot processes start as designed etc.).

Not to overstate the obvious, but the image that you start with has a tremendous impact on the time it takes you to get up and running. I settled upon an image that already had mysql5 and jboss 4.2.2 installed, and it made things much easier. In general I didn't feel that the images were particularly well documented. Not being that familiar with Fedora, I thought that the difference between the Fedora-core-4 and fedora-core-8 images was how many CPU cores they were optimized for, not the revision number. My initial foray with the fedora-core-4 image stopped when I realized that it had mysql4 rather than mysql 5.

All in all the experience wasn't too bad (I don't think that I could ever call one of these experiences "good" -- if it were good I would just be able to click a "do it" button that would do exactly what I wanted), and would have been much better if I hadn't "lost my keys".

Friday, August 29, 2008

TOGAF and Evaluating Architectures

The old adage is that you always have an enterprise architecture even if you never designed one -- the point being to encourage an organization to spend the time to design one. This is all well and good, but given an ongoing enterprise, what's the best way to determine what enterprise architecture you have, where you want it to go and, most importantly, how to get there?

For various reasons I've started looking at this issue again and have just refamiliarized myself with TOGAF (The Open Group's Architecture Framework). I had forgotten how much I liked it: it is pragmatic, highly tailorable and focused on open cross-organizational solutions. I'm not going to do a detailed analysis of TOGAF vs. other frameworks -- it's not really my interest as I'm definitely in satisficing mode here. What follows is my (long) elevator pitch of the TOGAF take-home message.

The top level graphic of the TOGAF process captures the flavor pretty well


The preliminary phase is key but very easy to overlook. TOGAF suggests that this phase consists of defining the overall objectives and scope

  • Define Objectives
    • Assure that everyone who will be involved in or benefit from this approach is committed to the success of the architectural process

    • Define the architecture principles that will inform the constraints on any architecture work

    • Define the ‘‘architecture footprint’’ for the organization — the people responsible for performing architecture work, where they are located, and their responsibilities

  • Define Scope and Assumptions

    • The business units that are involved

    • The level of detail to be defined

    • The specific architecture domains to be covered (Business, Data, Applications, Technology)

    • The time horizon that should be addressed by the architecture.

What I find attractive about this whole approach is its focus on getting buy-in from the key players in the organization, defining their roles and developing a shared set of expectations around what's going to be done as part of the architecture effort. The preliminary steps give the team some initial criteria for driving the architectural vision, but then TOGAF immediately requires them to ground it as supporting the needs of the business users. I find this grounding critical; often the business thinks that architecture efforts are worthless, and often they are right, because the architectural model hasn't been grounded in the business process. Note: in this case "business process" and "user's scientific process" are equivalent.

The key to the success of an architecture effort is to address current pain points as they will be reflected in the business processes that will be in place when the architecture rolls out. Sorry if the tense of the last sentence was a bit torqued. What I mean is that the architecture needs to hit a mark to support business operations as they will be in the future, not as they are now, and that some of the problems that are being experienced now will only be exacerbated by these planned changes.

The dialog with the Business Unit leader sounds something like

We're planning to do more collaborations in the future, but with our current collaborations we have a terrible time registering new users and tracking responses to our questions about the data. However, if we put an architecture in place which uses our new authentication mechanism that supports OpenId it will radically simplify the process of adding new users.

In addition, if we use vendor X's implementation of the Life Sciences Industry Architecture, queries will be automatically tracked.

Our ability to handle more collaborations on the back end is increased as our new system allows us to share extra capacity across multiple business units, thereby sharing the cost of reserve capacity to meet any unanticipated surges in demand.

Such a "political/operational" model for rolling out an architectural analysis implies that everyone who contributes to the effort should get something out of it (this is a goal, but the closer you can come to meeting the goal, the more self-organizing the system becomes).

As you proceed around the TOGAF loop, you pick and choose what makes sense given the decisions made previously (which of course you are always free to revisit), analyzed to the level of depth that is appropriate.

Think of TOGAF as providing a (partial) checklist of processes to use and things to consider that help you reach the end state of Boundaryless Information Flow (tm).


I think of it as being similar in spirit to the way the Software Engineering Institute's Risk Management Taxonomy provides a comprehensive checklist of things to consider when undertaking a project -- it keeps you from forgetting something that would be obvious in retrospect.

An aside:

My favorite quote from one of their pages:

Another company is developing a flight control system. During system integration testing the flight control system becomes unstable because processing of the control function is not quick enough during a specific maneuver sequence.

The instability of the system is not a risk since the event is a certainty - it is a problem.

Thursday, August 14, 2008

Optimization: Premature and otherwise

My post on structuring database tables made me think again about when (and how) to optimize one's code/design. The caveats about premature optimization are well known and well considered, as are the reasons for not following them slavishly.

In my mind the core questions involve "what are we trying to optimize" and "who cares".

My (obvious?) claim is that we should only spend time optimizing things that have impact upon the high level goals for the project. The reasons for performing an optimization should be articulated and evaluated in this framework. Driving optimizations by focusing on the top level goals sounds obvious but the tradeoffs are difficult to make in practice.

There are legitimate tensions about the proper framework relevant to the analysis. That is, it is all well and good to speak of "strategic business goals," etc., but if the product is unusable, the long term strategy doesn't matter. A strategic focus simply assures that long term impacts are also evaluated when an optimization is considered, e.g., an optimization that greatly increases execution efficiency may or may not be appropriate if it increases the complexity of product installation and set up.

For example, the "narrow table" approach that I've been advocating is designed to support deep changes in the business processes and science over the life of the system, without necessitating deep changes in the data model. The "strategic horizon" for such a project is > 10 years (that is, a total system life of > 10 years) with the expectation that the fundamental data model reflected in the narrow tables will be relatively stable during that time. Even in a rapid prototyping environment with one or more iterations shipping each quarter, the essence of the data model should be fairly stable, since fundamental changes to the data model induce data migration efforts which distract from product improvements.

The question is: are there inefficiencies caused by the narrow table approach that will make it overly difficult to achieve a usable product in the short term? My current intuition is to go with the narrow table approach and let either materialized views (which to my knowledge are most easily obtained in Oracle), special database jobs, or a data grid provide the optimizations.

That said, one of my personal rules of optimization (based on some experience) is that your intuitions are almost always wrong -- if something is taking too long it is usually worthwhile to benchmark it, even if you "know the cause of the problem" (unless the putative fix takes substantially less time than doing the benchmark). The corollary being that if you're worried about something taking too long, and think that you "should be OK," set up a test suite at scale to validate your intuitions going forward, so that you can monitor performance as development proceeds.
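A minimal sketch of what I mean by "benchmark it", assuming nothing about the real workload -- the suspectOperation below is a stand-in you would replace with the code path in question:

// Minimal sketch of "measure before you optimize": time the suspect operation
// at a realistic scale instead of trusting intuition. The workload here is a stand-in.
public class QuickBenchmark {
    public static void main(String[] args) {
        int iterations = 1000;
        long start = System.nanoTime();
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += suspectOperation(i); // replace with the code you suspect is slow
        }
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        System.out.println("iterations=" + iterations
                + " elapsedMs=" + elapsedMs
                + " (checksum=" + checksum + " keeps the work from being optimized away)");
    }

    // Stand-in workload; in practice this would exercise the real code path at scale.
    private static long suspectOperation(int i) {
        long total = 0;
        for (int j = 0; j < 10000; j++) {
            total += (i ^ j);
        }
        return total;
    }
}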

This principle holds even if the bottleneck is development time: you need to determine whether the problem is with the language, the developer, the user/developer interaction, or the developer's manager (e.g., providing more interrupt-driven activity than the developer can handle). As always, measurement is the key.

Wednesday, July 30, 2008

richfaces:dataTable:column sortBy

sortBy is a new richfaces capability that I mentioned previously. I've started to incorporate it into my system and what follows are some tips/notes/frustrations.

If the sorting glyphs appear, but nothing happens, the dataTable needs to be surrounded by <h:form> ... </h:form> -- sorting doesn't work outside of a form context.

If the glyphs do not appear, sortBy may not be being given a valid attribute for sorting. This may be the result of a simple typo, e.g., sortBy="{competition.id}" rather than sortBy="#{competition.id}".

I had some very strange behavior in netbeans/seamonkey with this facility. For example, if I clicked on the sort glyph for id

I got

YES the header of the column changed to "comp_ID form ss" -- even though "comp_ID form ss" no longer appeared in the file.

It used to be there, but I had removed it and performed a "build" in NetBeans (rather than a "clean and build"). I'm not sure about the underlying cause of this, but in my mind the effect straddles the boundary between disconcerting and amusing. The bottom line is that if things start acting strangely do a "clean and build".

All in all sortBy is a real step forward: it works in ajax tabPanels and simplifies the xhtml code a great deal. Despite this warning, sorting has been working for me as expected.

Wednesday, July 16, 2008

Structuring Database Tables

Continuing with the what should the storage actually look like line of my last post, I just read a paper entitled Storage and Querying of E-Commerce Data by Agrawal, Somani, and Xu. This paper discusses the trade-offs between storing data in "wide" (up to 1K column) tables versus storing the data in sets of narrow (vertical) tables. The data under consideration consists of sparsely populated data sets (lots of null values); varying definitions of "sparse" are used to generate the results.

Their measurements clearly support the overall performance advantage of the narrow table approach despite the complexity of reassembling the data into a "wide" form, when (and if) it is required. Their work has been picked up a bit by the column database and semantic web crowds, but not to the extent that one would expect. I think that this reflects the fact that the datasets were fairly small (1000 cols, 20k rows).

A couple of notes about the results -- they achieved improved performance in the vertical representation despite the fact that they represented everything as an object, key, value triple, and built a translation layer to shield the user from the vertical representation. The triple store was built on DB2.
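For flavor, here is a tiny Java sketch of the vertical idea (my own toy version, not the paper's DB2-backed implementation): every non-null attribute becomes one object/key/value row, and a sparse "wide" record is reassembled only when needed.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the vertical idea: each non-null attribute is one (object, key, value) row,
// and a "wide" view is reassembled only when needed.
public class VerticalStore {
    static class Triple {
        final String objectId;
        final String key;
        final String value;
        Triple(String objectId, String key, String value) {
            this.objectId = objectId;
            this.key = key;
            this.value = value;
        }
    }

    private final List<Triple> rows = new ArrayList<Triple>();

    public void insert(String objectId, String key, String value) {
        rows.add(new Triple(objectId, key, value)); // nulls simply never get stored
    }

    // Reassemble the sparse "wide" record for one object.
    public Map<String, String> wideRow(String objectId) {
        Map<String, String> wide = new HashMap<String, String>();
        for (Triple t : rows) {
            if (t.objectId.equals(objectId)) {
                wide.put(t.key, t.value);
            }
        }
        return wide;
    }
}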

Although these findings are both intriguing and encouraging (from the standpoint of wanting to break information up into its primal entities), I wonder about the scaling behavior of a system structured in this way, as it radically increases the number of rows per table. After all, all systems have limits (e.g. MySql, Postgres, Oracle, SQLServer), and more importantly they have optimal operating regions, aka "sweet spots". A few years ago when the largest table in one of my systems reached 100 million (wide) rows, running even simple queries against that table was painful (which I'm sure could have been alleviated with some clever tuning -- but it was neither the tallest pole in the tent, nor the squeakiest wheel on the cart).

My concern has to do with the risks of getting outside of the "sweet spot" of the systems upon which FDD is being constructed -- I remember back when one of my former employers switched database vendors (a non-trivial project to say the least). With vendor A, we were the customer with the largest DB in their installed base (aka outside the sweet spot). With vendor B, we were a "moderately large" installation, but certainly not in the top 100 (aka inside the sweet spot). The number of bugs we encountered in Vendor B's database was substantially smaller. I assume that this was because the bugs had already been stumbled upon by the bleeding edge users and fixed by the vendor by the time we would have encountered them.

That to me is the prime reason to try to stay in the sweet spot. If you don't, you're the one finding the new bugs and either fixing them, paying for them to be fixed, or hoping for the vendor to fix them (and developing a deep understanding of where you are on the vendor's list of priority customers).

More on this in my next post.

Sunday, July 6, 2008

Temporal Data

As part of fleshing out the design for the Flexible Drug Discovery (FDD) platform, I'm deciding upon the level of support for temporal data. The simplest decision, based upon the existing "webtwo" infrastructure, would be to have an insert_time and update_time attribute for each item. I think that these times are the bare minimum required to understand/debug system operation.

However, in the clinical domain I've become accustomed to thinking about the data in terms of what did we know when? so that it is possible to reconstruct the understanding of a trial at a given point in time. This obviously requires much more extensive tracking.

I recently came across an excellent book on the topic: Temporal Data and the Relational Model by Date, Darwen and Lorentzos. It presents a detailed analysis of the issues involved in working with temporal information using a refreshingly simple example consisting of a few tables of data about parts and their suppliers.

Many systems use begin and end dates for each row to track when the data has changed, supporting the type of use most relevant to clinical/scientific analysis. However, this technique does not support some interesting situations in the business domain. For example, p. 166 of the book shows that, given an item with the attributes name, status, and city, answering simple questions such as "how long has a supplier been at that address?" or "how long has a supplier had that name?" requires a begin/end date for each attribute. Thinking through the implications of this issue results in refactoring the model into irreducible components (aka sixth normal form), as described on p. 173.

As implied by the term sixth normal form, using the temporal behavior of the data as a design axis can have extensive implications e.g.,
  • splitting quantities out in a LIMS system
  • splitting out names (especially last names!) in a system that tracks employees, etc..

This implies that it is important to consider the temporal behavior of the data even if a temporal model is not planned for the system, as it helps drive scenarios for evaluating the system's response to expected changes, e.g., "high flux" items may require optimized interfaces, surface special reporting requirements, etc.

Other noteworthy topics in the book include merging intervals, e.g., the two facts that attribute A had value 3 from t1-t3 and has value 3 from t3-now should be merged into a single fact.
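A small sketch of that coalescing rule, in Java rather than the book's notation (representing the open-ended "now" as Long.MAX_VALUE is just my convention for the example):

// Sketch: two facts "value 3 from t1 to t3" and "value 3 from t3 to now"
// coalesce into one fact "value 3 from t1 to now" when they meet and agree.
public class IntervalFact {
    final long from;   // inclusive start
    final long to;     // exclusive end (Long.MAX_VALUE stands in for "now"/open-ended)
    final int value;

    IntervalFact(long from, long to, int value) {
        this.from = from;
        this.to = to;
        this.value = value;
    }

    // Returns the merged fact, or null if the two facts cannot be coalesced.
    static IntervalFact merge(IntervalFact a, IntervalFact b) {
        boolean adjacentOrOverlapping = a.to >= b.from && b.to >= a.from;
        if (a.value == b.value && adjacentOrOverlapping) {
            return new IntervalFact(Math.min(a.from, b.from), Math.max(a.to, b.to), a.value);
        }
        return null;
    }
}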

There is also a discussion of the time-from/time-to in the persistent store, vs the time-from/time-to in the world, which although important in developing systems requirements, doesn't appear to require analysis different in character from what is conventionally performed. My view is that world and storage times are disjoint. In scientific systems there is rarely a reason to worry about world times -- other than referencing the date upon which an operation was performed.

Again, an interesting read, highly recommended (despite their frequent exhortations on how to read the book e.g., "are definitely meant to be read in sequence as written (p51)" or "note carefully" (carefully is used an inordinately large number of times in the text)).

As storage becomes cheaper, the downside of not having a temporal capability will more frequently exceed its implementation cost.

Monday, June 16, 2008

de novo project retrospective

I’m putting the finishing touches on my “webtwo” application and thought I'd post a quick post mortem on the project.

The core set of functionality was straightforward:


  • Users with system defined logins (not relying on database logins) and system specific roles

  • Competitions consisting of user submissions. Each submission can have multiple submission items consisting of text or images.

  • Everything can be tagged and commented upon.

  • Tags can be reused and applied to any object in the system.

Here are my observations


My initial design had ratings as a separate class/table. I decided to push ratings down into both tag and user_text (the generic data type that holds comments) since, upon further consideration, I thought that the semantics of the term “rating” was different in each case. If the rating is attached to a comment, both the rating and the comment refer to some other item, and the comment explains the rating. On the other hand, if the rating refers to a tag, it reflects how strongly the item reflects the tag term.


Even though tagging and commenting (adding user text) are distinct functionalities, much of the persistence operation is the same. I therefore packed persistence into a single class which persists tags and text in distinct methods. At some point I may factor this into three classes: one for the common persistence core and two for the specific tagging and commenting functionality; but this single class is adequate for the current purpose.


As mentioned previously, my initial design allowed fairly large scale image uploads with dynamic resizing for page display. I have since switched to a more conventional architecture in which standard image sizes are generated and stored upon upload. This is a big performance win, and also reduces storage requirements.


Seam is a platform that is evolving in an encouraging way -- when I wanted to add sorting to a table displayed within an ajax tab, I could not see how to do it, but upon further investigation it appears to be supported in an upcoming release.

e.g., in this view I can sort


while in this I cannot

Type X

For this system I had separate "type tables" for each type category e.g., user_text_type holds the allowable types of user_text such as comment, description etc. As I consider building larger systems with similar functionality, I plan to have an item_type table which holds the categories for every table that requires typing. This table would share a similar pattern to that used for user_text (comments, descriptions) and tags which holds the table_name and the id of the referenced object. This will simplify understanding the data model/code and facilitate building curation tools.
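A rough sketch of what such a shared item_type table might look like as a JPA entity -- the class, column names, and example values here are hypothetical, not the actual schema:

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

// Hypothetical sketch of a single item_type table:
// one row per allowable type value, scoped to the table it applies to.
@Entity
public class ItemType {
    @Id
    @GeneratedValue
    private Long id;

    // Which table this type value belongs to, e.g. "user_text" or "tag".
    @Column(name = "table_name", nullable = false)
    private String tableName;

    // The allowable type value itself, e.g. "comment" or "description".
    @Column(name = "type_value", nullable = false)
    private String typeValue;

    // getters/setters omitted for brevity
}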


One additional note about the seam framework: it sometimes feels like a REST interface, since it is easy to bookmark access to a particular object etc. However, it does not automatically carry parameters past its own request: parameters which are not specified for "pass through" in the .page.xml files are stripped out. It is easy enough to add them -- all that is required is to specify a <param name="parameterForPassthrough"/> in the .page.xml for each parameter that should be passed through. This annotation also needs to be in every intervening page.

I'm not particularly bothered by this; I'm just pointing it out. On the good side it forces standard parameter naming so that each page's logic can operate upon standard variable names (assuming all of the necessary parameters have already been incorporated into the .page.xml files) and it prevents the urls from becoming unwieldy. The downside is that all of the intervening pages do have to be modified if a new parameter is required. On balance it seems to be a reasonable decision.

Note: You may ask "when would this apply?". One example: you want to comment upon an item and return to it when the comment is complete. In this case the item/item_id would need to be passed through the comment-editing-page to the comment-verification/completion-page so that the "done/comment-complete" button can return to the right location.

The only real regret that I have about the project is that I didn't know about the yahoo design patterns earlier.

Friday, June 13, 2008

Yahoo Design Patterns

When visiting Yahoo's site for OmniGraffle stencils, I found that Yahoo also provides a nice set of design patterns.

These Web/Web2.0 patterns are clear, concise and have pointers to how they are used within Yahoo. Note: this doesn't constitute a strong endorsement on my part. My approach to patterns is to broadly survey existing patterns with the goal of settling on one, or worst case to modify/create one using the results of the survey.

BTW OmniGraffle is an amazing drawing program -- I've been using simple/mid-level drawing programs for a while and its interface is in a different league entirely. It does things that I haven't seen before, hadn't thought of previously, but are totally obvious to use and highly useful. If you're on a Mac you owe it to yourself to give it a spin.

Friday, June 6, 2008

debugging hibernate/mysql systems

Just a quick note, on debugging hibernate/mysql systems.

In my ‘webtwo’ system I had the following constraint in the db:

constraint fk_submission_item_image foreign key (image_id) references image_data(id) ON DELETE cascade,

However, image_data didn’t exist.

During operation it caused the error

18:34:56,009 WARN [JDBCExceptionReporter] SQL Error:
1452, SQLState: 23000
18:34:56,009 ERROR [JDBCExceptionReporter] Cannot add
or update a child row: a foreign key constraint
fails (`webtwo_dev/submission_item`, CONSTRAINT
`fk_submission_item_image` FOREIGN KEY (`image_id`)
18:34:56,009 ERROR [AbstractFlushingEventListener]
Could not synchronize database state with session
Could not execute JDBC batch update
at org.hibernate.exception.SQLStateConverter.convert
at org.hibernate.exception.JDBCExceptionHelper.convert

I thought this indicated a problem with my Hibernate/java mapping since I’ve been working with the database for a few months without any errors being thrown by MySQL during db creation. I also thought that I had successfully populated all the tables.

In retrospect this wasn’t the case: I had not populated the tables and it is also not surprising that MySQL didn’t signal an error since I “SET FOREIGN_KEY_CHECKS = 0; “ during table creation -- so, my bad.

The moral is that any unpopulated table may be fundamentally misspecified.

This experience does serve to reinforce my heuristic to populate all the tables via an initial data population script, at least for testing purposes. Realistically, this requires two load scripts: one to populate the data required for operation (i.e., roles, menu values, etc.), and a second to do a test load to verify database integrity.

Thursday, May 29, 2008

Modularity & Hygiene II

A similar, but distinct situation involves the environment expected by a module when it is activated:

  • What needs to be set up for the called function
  • What is expected to be unperturbed from a “pristine” environment -- and the definition of that pristine environment.

This places restrictions upon use e.g., the inability to initiate workflows within a page flow for seam.

These restrictions result from an implicit dependency upon the configuration of the calling environment. Since the appropriate configuration is assured if the module was called as planned, there is little checking to verify that the environmental assumptions have been met. Once the nested calling paradigm has been adopted, design choices are biased towards implicit configurations that cannot be easily set without perturbing the calling environment, since B (and A for that matter) can still reference the external environment.

I’ve found a useful way to think of this as being the difference between a linear and a nested calling environment: a “linear environment” would pass parameters in as a single object which contains required values while the nested approach sets up one or more global variables for access by the called functionality.
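Here is a toy Java contrast between the two styles (entirely illustrative; the names are mine):

// Toy contrast between the two calling styles described above.
public class CallingStyles {

    // "Linear": everything the module needs arrives in one context object;
    // nothing outside the call is touched.
    static class RequestContext {
        final String userId;
        final String locale;
        RequestContext(String userId, String locale) {
            this.userId = userId;
            this.locale = locale;
        }
    }

    static String linearModule(RequestContext ctx) {
        return "hello " + ctx.userId + " (" + ctx.locale + ")";
    }

    // "Nested": the module reads globals that the caller is expected to have
    // set up beforehand -- the implicit environment.
    static String currentUserId;
    static String currentLocale;

    static String nestedModule() {
        return "hello " + currentUserId + " (" + currentLocale + ")";
    }

    public static void main(String[] args) {
        System.out.println(linearModule(new RequestContext("alice", "en_US")));

        currentUserId = "alice";   // caller must remember to configure the environment
        currentLocale = "en_US";
        System.out.println(nestedModule());
    }
}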



In the “linear” case nothing in the external environment is perturbed, side effects are minimal and any arguments could be copied, modified and passed on to the next module in a sanitary fashion (with the usual caveats around shared “stream-like” objects).

Admittedly there are some times when the nested approach makes the most sense, usually for “stream-like” variables, e.g. an initialization step to read in a configuration file, initialize connections, etc. The problem arises when there is no way to spawn a new configuration (or initialize one if you’re called in a different context). In lieu of such an idealized situation, it would be useful, at a minimum, to be able to detect that you’re being called in the wrong context. When it is difficult for a module to decide if it is being called in the correct context (or if it elides the context check), it is hard, if not impossible, to provide easy-to-use modules.

Note: I think that the Rails trick of overloading the const_missing exception handler is arguably in this space. The magic underlying this functionality wasn’t easy (for me) to find, and it surfaced the capability in such a way that it couldn’t be used for my purposes, since it had no introspection capability and only covered the case of exceptions generated during system operation. Note: discovering this also helped answer one aspect of the environment that I had previously found opaque: “why does my new code seem to be loaded in some cases but not in others?” The answer being that the new code (class definitions etc.) was retrieved if the class had not been previously loaded. If it had been previously loaded, the const_missing exception would not be generated and the class would not be reloaded.

At some level I have no problem with this implicit ‘environment’ structuring as long as there are ways to determine what environment you’re in and have the ability to spin up a new environment appropriately configured as necessary.

It’s also important to distinguish this from environment set-up and configuration using well-defined files/APIs: it is a perfectly reasonable and necessary practice to set up such things as database access, service locations, etc. in configuration files that are managed by an “environmental processor”.

They are usually:
  • one-shots done at startup, or in response to a specific reconfiguration request
  • accessed through a well-defined API
  • ‘stream-like’ in nature

Again these are issues that I’m experiencing as I spin up my application. I’m not claiming that their solution is either easy or practical, only that it is desirable.

Wednesday, May 14, 2008

Modularity & Hygiene

This post (and the one to follow) discuss the issues of hygiene in coding libraries.
It is prompted by experiences that I’ve had lately in using (primarily xml/jsf) libraries and the pernicious errors that can be introduced by either mixing different libraries or not using them exactly as designed.

A little background: the term hygienic comes from the Lisp community. The short form definition: hygienic code is code that doesn’t have undue interactions with its calling environment.
A lack of hygiene manifests itself in a couple of ways:
  1. Modules trip over each other: The “mix and match” promise of modularity and widget sets is violated
    • modules work at one point in the development cycle but break after the addition of an apparently unrelated piece of code
    • mixing components from different widget sets requires much care and tweaking, if it can be done at all.
  2. Modules have an implicit order in which they must be called; there is no well defined way to kick off either a new “top level process” or a child process that has its own (unshared) context.

Note: this turned into a pretty long post, so I will cover the second item in a subsequent post.

I’m not claiming that hygienic code is always necessary or even that “hygienic code” == “good code.” Situations vary, and in some cases, e.g., device drivers or OS kernels, being hygienic may not be worth it. Also, as I describe below, I don’t think that completely hygienic systems are currently possible, partly due to the limitations of xml.

However, in most situations, the more hygienic the better.

And now on to the discussion.

1 Modules trip over each other
The first issue concerns the problems encountered when modules end up tripping over each other, aka incorporating functionality from one set of modules breaks existing functionality. I have seen this mostly in the javascript space (see this post on integrating UI widget sets in seam).

Although I’m working primarily in jboss/seam for this project, it is not a seam issue per se. If anything, the conventions seam uses for its generated code help to alleviate these issues.

As far as I can tell, this “tripping” arises from an inability to cleanly nest scope in the environment. For example, in xml it is hard to insulate oneself from what goes on in the xml around you as exemplified by the inability to nest comments in an xml file. This deficiency, coupled with the fact that many of the widget sets “compile to xml/html” has arguably had the side effect of diminishing concern for hygienic operation within the development community. Combine this with the silent failure aesthetics of JavaScript and you can produce results that are truly painful to debug.

The contrast with Lisp macros (a radically different idea from C macros; Hall has a very nice page on the differences, plus some nice simple examples of problems and how to avoid them) is striking. Lisp macros represent the most successful instantiation of code-generating behavior that I know of. Lisp macros work because of Lisp’s ability to generate new variable names and then bind incoming values to them so they can be used freely. Achieving a similar result without language support for system-wide unique item naming is (very) hard.

The seam/richfaces framework does much to try to minimize this problem, e.g., if you look at the source of the page that seam sends to the browser you will see a lot of html with the form id=”competition:tagtDecoration:j_id70”, which is a nice try at preventing variable collisions etc. However, without enforced namespace encapsulation or the use of a system-wide symbol generating facility (e.g., gensym), variable capture is still possible. I also have not been able to find documentation on when new bindings are generated etc., which probably means I’ll have to look at the source someday.

These issues usually occur more often in scripting languages (for the sake of argument I’m including xml, html, xhtml as scripting languages) rather than in compiled ones because compiled languages normally restrict these “global” environment accesses to compile time. Run time access in compiled languages to environment variables is more difficult, and inherently has to address multi-core/multi-processor/multi-systems issues. The end result is that the “external environment” is generally harder to get at in compiled languages and is relegated to “systems level” utilities.

I’m hoping that the next big thing in scripting languages revolves around scoping and environments: the ability to specify the particular type of environment code should run in, to set up a safe context for it, etc.

Part II will be covered in my next post.

Saturday, April 26, 2008

Hibernate & Java /Hibernate Annotations

Just a quick post on Hibernate’s @Where annotation usage, plus a note on some odd behavior I’ve seen using Hibernate in Java (my environment: NetBeans IDE 6.0 (Build 200711261600); Java 1.5.0_13; Java HotSpot(TM) Client VM 1.5.0_13-119; Mac OS X 10.5.2 on i386; jboss 4.2.2.GA; seam 2.0.0.GA).

First, annotations:
I had a hard time finding a “pasteable” explanation of the use of @Where. Now that I’ve figured out how to do it, I thought I’d share.

@OneToMany(cascade = CascadeType.ALL, fetch = FetchType.LAZY, mappedBy = "itemId")
@Where(clause = "item_table = 'competition'")
public Set<TagRating> getItemTagRatings() {
    return this.itemTagRatings;
}

public void setItemTagRatings(Set<TagRating> itemTagRatings) {
    this.itemTagRatings = itemTagRatings;
}
The effect of this is to set up an additional condition upon TagRatings, so that only those TagRatings that have the item_table value = ‘competition’ are retrieved (item_table is the name of the column in the db).
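
For context, here is a sketch of the side of the mapping that @Where filters on; the field and class details below are my assumptions (only item_table comes from the clause above), so treat it as illustrative rather than the actual entity.

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;

// Illustrative TagRating entity: @Where(clause = "item_table = 'competition'")
// filters on the item_table column mapped here.
@Entity
public class TagRating {

    @Id
    private Long id;

    @Column(name = "item_table")
    private String itemTable;

    // the association referenced by mappedBy = "itemId" above is omitted here

    public String getItemTable() {
        return itemTable;
    }

    public void setItemTable(String itemTable) {
        this.itemTable = itemTable;
    }
}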

The second item is odd. I cannot determine the underlying cause, so I’m just sharing the symptom and a workaround.

The loop below would only execute once (and occasionally I would get a javax.faces.FacesException with the message: “#{competitionHome.persist}: java.util.ConcurrentModificationException”)

The ConcurrentModificationException was not a particular surprise, but the silent (no visible exception) execution of the loop a single time befuddles me. If @SuppressWarnings weren’t a compile-time operation, that’s where I’d place the blame, but.... (BTW, I have no doubt that I’m doing something deeply wrong in the non-working loop below; my goal is to highlight the relatively opaque effect of the error.)
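
For what it’s worth, here is a self-contained illustration (not the seam code itself) of the usual way a ConcurrentModificationException arises -- structurally modifying a collection while iterating over it -- which is my best guess at the general class of problem, not a diagnosis of the loop below.

import java.util.ArrayList;
import java.util.List;

public class ConcurrentModificationDemo {
    public static void main(String[] args) {
        List<String> tagNames = new ArrayList<String>();
        tagNames.add("red");
        tagNames.add("blue");

        for (String name : tagNames) {
            // Adding to the list being iterated invalidates the iterator;
            // the next call to next() throws ConcurrentModificationException.
            tagNames.add(name + "-copy");
        }
    }
}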

Note: I put ‘working’ in quotes because the query string shown will not work as written; it required some changes, since #{newName} did not evaluate to the appropriate value.

The loop

for (String newName : newTagNames) {
    List tags = (List) em.createQuery("select t from Tag t where lower(t.tagName) = #{newName}")
            .setMaxResults(pageSize).setFirstResult(page * pageSize).getResultList();

    if ((tags == null) || (tags.size() == 0)) {
        newTag = new Tag(newName, user);
        // ...
    }
}

Breaking the loop up into multiple loops worked. The ‘working’ loop(s):

for (String newName : newTagNames) {
    ArrayList tags = null;
    try {
        Query q = em.createQuery("select t from Tag t where lower(t.tagName) = #{newName}");
        q.setMaxResults(pageSize).setFirstResult(page * pageSize);
        tags = (ArrayList) q.getResultList();
    } catch (IllegalStateException e) {
        log.error("error getting tags: " + e);
    } catch (IllegalArgumentException e) {
        log.error("error getting tags: " + e);
    }
    if ((tags == null) || (tags.size() == 0)) {
        Tag newTag = new Tag(newName, currentUser);
        // ...
    }
}

// subsequent loops over the newly created tags (bodies elided in the original)
for (Tag newTag : newTags) {
    // ...
}
for (Tag newTag : newTags) {
    // ...
}

Friday, April 11, 2008

jboss, seam, jbpm

I’ve been working in the jboss/seam framework for a few months now and recently tried to add jbpm workflow functionality to a project. In attempting to do this, I’ve hit some rough patches that have caused me to place the jbpm parts of the project on hold as they were not “mission critical” to this prototype.

Note that these remarks are in the context of a seam 2.0/jboss 4.2.x environment (jboss 4.2+ is required by seam 2+), running on OSX 10.5. The jboss/seam/jbpm combination appears to be the source of the problems; OSX 10.5 appears irrelevant.

Seam offers two ways to utilize jbpm in your applications: the first is pageflow; the second is what the documentation terms an overarching business process.

The seam pageflow model is, as one would expect, a way to map desired page to page transitions using an xml file. It is described as being similar to other pageflow definition languages such as spring web flow but built with a conversation model in mind. The idea is that conversations allow better support for random user navigation e.g., hitting the back button. I cannot comment on that claim, but having used pageflows a bit I’ve found them adequately expressive and useful.

The difficulties that I have had are centered around integrating an overarching business process into my seam application. These business processes may include asynchronous server and user processes executing over long periods of time, e.g., weeks. A simple example is the verification of user identity in a web facing system: A user requests an account on the system, and activates the account by responding to an email sent by the system.

The business process requires the system to send an email to the user following the initial request, validate the response, or, in the absence of a response, either retire the request or send another reminder.

The first issue I had in incorporating this workflow is the classic “who’s on top” problem, since I found that pageflows could not initiate workflows.

My prototyping started with a pageflow, as pageflows are the natural framework for handling the request for a new user account. The pageflow structure makes it easy to verify that the requested username/password, etc. satisfy system requirements. If they don’t, the application can be configured to just sit on the request page and not proceed to subsequent pages.

For example, by enclosing the page redirect in a <navigation> construct, the return of null from the userHome.persist method keeps the application on the page (there is a protocol, not shown, that allows the application to post error messages on the page):

<navigation from-action="#{userHome.persist}">
    <redirect view-id="/User.xhtml"/>
</navigation>

However, as mentioned above, a pageflow cannot initiate a workflow, so it is necessary to enclose the pageflow within a workflow to achieve the desired result. This entails changing the preceding page so that it kicks off a jbpm workflow rather than simply going to the next page.
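
As a rough sketch of what "kicking off a jbpm workflow" looks like from the Java side, this is the pattern I was aiming for; I’m assuming seam’s @CreateProcess annotation and a deployed process definition named accountVerification (the component and process names here are mine, and the annotation’s package differs between seam releases).

import org.jboss.seam.ScopeType;
import org.jboss.seam.annotations.Name;
import org.jboss.seam.annotations.Scope;
import org.jboss.seam.annotations.bpm.CreateProcess;

@Name("accountRequestAction")
@Scope(ScopeType.CONVERSATION)
public class AccountRequestAction {

    // Completing this action starts a new instance of the "accountVerification"
    // business process instead of simply navigating to the next page.
    @CreateProcess(definition = "accountVerification")
    public String requestAccount() {
        // persist the pending account request here
        return "requested";
    }
}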

This is the point at which I realized that I would need tools for monitoring and debugging workflows. The jbpm console appears key to this. Disconcertingly, I could not get the console to run with jboss 4.2.2.GA. This appears to be a general problem. In my case, the instructions on the wiki enabled me to get my workflow initiated without it throwing exceptions but I still was unable to get the console working.

One suggested path was to build the console from source. However, I was unable to get it to build with the level of effort I was willing to expend, and after reading posts such as this, I concluded that building from source may not be the most productive use of my time.

The bottom line is that you’re in an odd space working with seam 2.0 -- it needs jboss 4.2.x, but some of the other redhat/jboss tools don’t yet support jboss 4.2. I’m a bit surprised that this is the case: seam 2.0 GA was released 2007-11-01 06:49 and jboss 4.2.2 was released 2007-05-11.

Since I expect this situation to eventually be rectified, and as I mentioned, workflow was not “mission critical” I decided to wait for a working console application to be posted on the jboss.org site.

In fairness I’m working from the community “minimally supported” version -- the situation could be radically different in the supported versions.

Wednesday, March 26, 2008

Taxonomies, Ontologies and the Semantic Web

A couple of weeks ago I attended C-SHALS 2008 (Conference on Semantics in Healthcare and Life Sciences). One aspect that I found striking was the number of people who conflated taxonomies with ontologies -- my initial reaction was to post a remark about the confusion and highlight the distinctions (see this for a short set of descriptions of these and related terms).

I’ve instead come to view this conflation as reflecting the pragmatic bias of these systems: if the difference between taxonomies and ontologies isn’t apparent to you, the difference doesn’t matter for what you are trying to do (modulo the assumption that the speakers were competent, which did appear to be the case). The implication is that such systems require no significant machine-based inference across organizations. Significant inference, in this context, would involve something beyond the use of term matching to gather locally related terms/individuals (local vis a vis the terms being matched). Note: although I categorize this as ‘non-significant,’ that’s only from the standpoint of inference -- these systems do cover most of the Business Intelligence/analysis use cases being implemented today.

As you might expect, given this characterization, these presentations involved the aggregation of data from multiple sites, using RDF or taxonomies such as Snomed to link data between sites. This is a good thing -- as I’ve mentioned a number of times, having stable identifiers across systems is the key to integration. The system presented demonstrated that useful integration is possible even when the same term, e.g., the same Snomed term, has slightly different meanings in different organizations (see How Doctors Think for an anecdotal study of physicians classifying patients).

This is an interesting result: although fully vetted, 100% one-to-one mappings would obviously be preferable, in these systems the value of more data outweighs the penalty imposed by increased noise. Rough quick integration is proving more valuable than detailed integration requiring a thorough analysis of all systems used -- probably because the difference between ‘rough, quick’ and ‘thorough, slow’ is measured in months, if not years.

This is related to a discussion at the conference on the contrast between developing ‘problem specific’ ontologies vs. ‘general use’ ontologies. That is: does taking the time to ‘get it right’ add any value? This is roughly equivalent to the old AI scruffy vs. neat distinction.

Although I wouldn’t go so far as to claim that a general-purpose ontology is impossible (at least in some limited domain), I am skeptical that it can be achieved. My concern centers on the fact that when you are constructing a general-use ontology it is hard to know where to stop. For example, given a small-molecule bioactive compound, you should represent the formula and chirality, but what about the (possibly fractional) salt form? The formulation? Radioisotopes and their decay rates? Subnuclear particles? I understand pragmatic stopping points for modeling these issues, but I don’t know how to determine principled ones.

It’s reassuring to see a number of researchers finding pragmatically useful parts of the semantic web without the need for perfect definitions/ontologies. This, to me, is the take-home message: there are a number of useful tools and techniques in the semantic web space; don’t be put off by the thought of merging ontologies and developing a grand unified theory of everything.

Wednesday, March 12, 2008

Porting a Ruby on Rails Application to jboss seam

I did finally port my Ruby on Rails application to jboss seam.

The capsule summary is that it took longer than expected (not particularly unusual for software), looks better than it ever did but still needs some performance tuning.

Some specifics

image display
I went with the seam graphicImage tag (note xmlns:s="http://jboss.com/products/seam/taglib"):

<s:graphicImage value="#{artwork.thumb}" rendered="#{not empty artwork.thumb}">
    <s:transformImageSize height="50" maintainRatio="true"/>
</s:graphicImage>

which I found to be slow -- much slower than doing a normal html <img width="x"/> tag. (update -- this is due to the use of the 'transformImageSize' tag -- rdf 24 March 2008)

In the RoR code I conditionally chose one of two versions of the image tag:

<%= if (@artwork.width_inches > @artwork.height_inches)
      image_string = "<img src=\"" + ...(@artwork.id) + "\" width=100/>"
    else
      image_string = "<img src=\"" + ...(@artwork.id) + "\" height=100/>"
    end %>

which rendered much faster.

The s:graphicImage tag doesn’t appear to be intended for rendering up to 20 images on a page -- my next revision will include the equivalent of the RoR code.

data display
I was easily able to get data tables with nicely alternating row colors by adding the following to theme.css (I did it here, since the ‘alternating colors’ should vary with the theme):

.table-even { background-color: #ffffff; }
.table-odd { background-color: #eeeeee; }

and then adding the corresponding rowClasses="table-even,table-odd" attribute to the dataTable in the *List.xhtml files.
I did have some minor problems getting it to work, since I had overwritten a class that applied to each cell and had given it a background color. I forgot that this cell class would take precedence, but I was able to figure it all out with Firebug, an indispensable tool!

seam generator
The generator provided an invaluable starting point and some really nice features e.g., it creates tables with sortable columns for the list view. The table defaults to displaying the ID of nested objects, but it was trivial to change it to display something more appropriate while maintaining the expected sort behavior.

ajax suggestions
My goal was to work consistently within the framework. This occasionally put me at a higher level of abstraction than I was comfortable with (the operational metaphor being “trying to do X while wearing thick gloves”). As a result, building a drop-down suggestion menu was much more difficult than expected, e.g.:

<s:decorate id="roleDecoration" template="layout/edit.xhtml">
    <ui:define name="label">role</ui:define>
    <... value="#{peopleHome.instance.roles}" immediate="true"/>
</s:decorate>

since I was having a bit of a problem finding the exact ‘magic location’ for placing the suggestion box tag relative to its input tag; aka it all ‘just works’ if you have everything placed ‘just right.’

Being at a higher level of abstraction also forced some patterns that I found difficult to work around. For example, in this trail system an Artwork object has a framed? attribute backed by a boolean. The behavior I wanted in the listing ‘query by example’ code was either:
  • to return framed artworks if the framed? checkbox was checked, or
  • to return all artworks if the framed? checkbox was not checked.
However, I could neither come up with a way to have elements on the restriction list take multiple parameters, nor a way to return different restriction lists depending upon the query.

My notes at that point say: if you do this

List restrictionList;
if ((this.artwork != null)
        && (this.artwork.getShipable() != null)
        && this.artwork.getShipable().booleanValue()) {
    restrictionList = Arrays.asList(SHIPABLE_RESTRICTIONS);
} else {
    restrictionList = Arrays.asList(RESTRICTIONS);
}

You break the transaction model

I was able to fix this by adding this line to the RESTRICTIONS:
artwork.framed in (true, #{artworkList.artwork.framed})
which obviously will not generalize beyond the boolean case.


I found the Hibernate documentation very useful (when I took the time to read it in detail, aka RTFM).
The expression language used is described reasonably well here.

When I moved to a new version of richfaces to get suggestion boxes working, it broke other minor portions of the page layouts. Although not a big deal, I found it disconcerting. Building this rich functionality in the browser is cool and all, but it feels fragile and is causing me to think about trying out flex (or air, its latest incarnation).

unit testing
I used HTMLUnit since it tests the complete end-to-end interaction.
Although I appreciate the ability to do faster, more thorough testing via mocks, I found that they gave me yet another thing to configure and wouldn’t give me the full end-to-end functionality that I was looking for.

I think that jboss/seam will likely prove useful. I have one other application that I’m building as a precursor to the extensible discovery system. My biggest area of concern is the ability to do a good UI in this space, which might prompt me to investigate the air/flex framework(s) in the near future.

Tuesday, February 26, 2008

Extensible System: Core Software Requirements

In my last post I promised to detail the software requirements that an extensible system for discovery data shares with other Web 2.0 systems -- here they are:

Workflow Orientation: Support for a complete workflow beyond that which is offered by JSF navigation. This workflow must allow the orchestration of multiple events without requiring additional user interaction. Supported workflows may involve sequenced, conditional interaction with multiple back-end systems (obviously existing on multiple platforms). For sanity and maintainability, the workflow language should be BPEL (or a slight extension) and should provide the ability to extend predicates and actions using a well-designed and understood language (Java, C#, etc.).

Integratable (mashable) data: The data stored by the application (and the results of any analyses performed on the data) should be available for repurposing in other applications. Repurposing should be supported in a fine-grained manner so as to put as few restrictions as possible upon its use.

This has two implications: the first is stable identifiers; the second is restful interfaces which allow data to be retrieved by referencing a static URL.
Note: Restful interfaces have a number of nice side effects: the seam *From trick mentioned below would be much more difficult without a restful interface. Additionally, the speed of rapid prototyping/development of web pages is greatly increased if one can directly access a ‘deep’ page without having to manually negotiate multiple precursor pages.
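
As an illustration of retrieval via a stable identifier and a static URL, here is a minimal JAX-RS-style resource sketch; the resource path, class name, and return format are purely illustrative and not part of the system described above.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

// GET /assays/{assayId} returns a representation of the item with that stable id.
@Path("/assays")
public class AssayResource {

    @GET
    @Path("/{assayId}")
    @Produces("application/xml")
    public String getAssay(@PathParam("assayId") String assayId) {
        // look the item up by its stable identifier and return a representation
        return "<assay id=\"" + assayId + "\"/>";
    }
}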

Event queues: Good workflow/system interaction is facilitated by message queues for guaranteed message delivery. Queuing systems also provide good interface points for logging and analysis tools.
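
A minimal sketch of the queue-based interaction using the standard JMS API; the JNDI names ("ConnectionFactory", "queue/workflowEvents") are placeholders for whatever the container actually binds, and persistent delivery is what provides the guaranteed-delivery property mentioned above.

import javax.jms.DeliveryMode;
import javax.jms.Queue;
import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.jms.QueueSender;
import javax.jms.QueueSession;
import javax.jms.Session;
import javax.naming.InitialContext;

public class WorkflowEventSender {

    public void send(String eventPayload) throws Exception {
        InitialContext ctx = new InitialContext();
        QueueConnectionFactory factory = (QueueConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("queue/workflowEvents");

        QueueConnection connection = factory.createQueueConnection();
        try {
            QueueSession session = connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
            QueueSender sender = session.createSender(queue);
            sender.setDeliveryMode(DeliveryMode.PERSISTENT); // messages survive broker restarts
            sender.send(session.createTextMessage(eventPayload));
        } finally {
            connection.close();
        }
    }
}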

Rules engines: In its most general form, a rules engine is a piece of code that evaluates a set of antecedent-consequent pairs, i.e., if antecedent then consequent. Given this abstract definition, rules engines need to be distributed at a number of places within the product. I see four distinct areas, each with its own role (a minimal sketch of the antecedent-consequent view follows the list below):

1: Display within a page, e.g., should a particular element be displayed (corresponding to the ‘rendered’ predicate in JSF). Rules involve the availability of data appropriate for display and authentication/authorization restrictions.

2: Predicates involving page flow (JSF, REST, etc.). Rules involve which page gets displayed next.
The jboss seam pages have a very nice convention in which the presence of a *From attribute/value pair allows an editing action, upon completion, to return to the page from which it was launched. Here is an example using peopleFrom:
"#{empty peopleFrom ? 'PeopleList' : peopleFrom}.xhtml"

which will return the PeopleList page when the editing action has been completed.
3: Security rules for CRUD operations. Rules involve accessing and modifying data.
4: Back-end BPEL operations, e.g., if the request has been outstanding for more than a week, then notify customer support. Rules involve the overall operation of the system.
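
Here is the minimal antecedent-consequent sketch promised above; the Rule and RuleEngine names are mine and are not tied to any particular rules product -- real engines add conflict resolution, chaining, and so on.

import java.util.ArrayList;
import java.util.List;

// A rule is just an antecedent ("if ...") paired with a consequent ("... then").
interface Rule<F> {
    boolean antecedent(F facts);
    void consequent(F facts);
}

public class RuleEngine<F> {

    private final List<Rule<F>> rules = new ArrayList<Rule<F>>();

    public void addRule(Rule<F> rule) {
        rules.add(rule);
    }

    // Evaluate each rule once against the supplied facts.
    public void run(F facts) {
        for (Rule<F> rule : rules) {
            if (rule.antecedent(facts)) {
                rule.consequent(facts);
            }
        }
    }
}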

Logging: Effective debugging of complex systems requires the ability to gather an integrated log for each activity in the chain of events that produces a given result; supporting this requirement in an operational setting requires that all relevant logs can be time-aligned and assembled into a single report for analysis.
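
One way to make the logs assemblable into a single report is to stamp every line with an activity/correlation id; the sketch below assumes log4j’s MDC (with a %X{activityId} conversion pattern in the appender layout), and the key name "activityId" is my own.

import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

public class ActivityLogging {

    private static final Logger log = Logger.getLogger(ActivityLogging.class);

    // Every log line emitted while 'work' runs carries the same activity id,
    // so logs from different components can be time-aligned and merged later.
    public void runWithActivityId(String activityId, Runnable work) {
        MDC.put("activityId", activityId);
        try {
            log.info("activity started");
            work.run();
            log.info("activity finished");
        } finally {
            MDC.remove("activityId");
        }
    }
}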

Monitoring and management: Business rules should be capable of being extended to monitor system operation -- server load, queue depth, latency, etc. -- allowing the system to be ‘self-monitoring.’ The use of a common tool permits the maximal number of people to understand its operation.

In addition, interfaces should be provided to allow information to be updated (recached) without bouncing the server. JMX is a reasonable example (see also).
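
A sketch of the "recache without bouncing the server" idea using plain JMX; the bean, interface, and ObjectName below are illustrative only.

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Standard MBean: the management interface must be named <ImplementationClass>MBean.
interface ReferenceDataCacheMBean {
    void recache();
    long getLastRefresh();
}

public class ReferenceDataCache implements ReferenceDataCacheMBean {

    private volatile long lastRefresh;

    public void recache() {
        // reload reference data from the backing store here
        lastRefresh = System.currentTimeMillis();
    }

    public long getLastRefresh() {
        return lastRefresh;
    }

    // Expose the operation so a JMX console can trigger a recache at runtime.
    public static void register(ReferenceDataCache cache) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(cache, new ObjectName("myapp:service=ReferenceDataCache"));
    }
}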

Security: Obviously, any enterprise system must provide some level of security: minimally LDAP support, hopefully with out-of-the-box support for OpenID and SAFE. In practice, I would caution against making security/access overly fine-grained, since it must accommodate people changing roles in the organization, changes in business processes, etc. The more fine-grained your access model, the more thought is required to get it right and the greater the probability of getting it wrong.

I have personally found it useful to distinguish reading, writing, and editing data, opening up the reading and dissemination of the information while restricting writing and editing data to specific tools provided for specific stages of the process.

For example, given a standard lab workflow for data collection, analysis, upload and “publication” (to the persons requesting the tests and then the company at large): there is one tool for collecting, analyzing and uploading the data; there is a second set of tools for integrating and viewing the data in a larger context; and there may be a third set of tools for curating and editing data which has been found discrepant.
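
To make the read/write/edit distinction concrete, here is a toy sketch; the stage names map onto the lab workflow above but are otherwise my own invention.

public class AccessPolicy {

    public enum Permission { READ, WRITE, EDIT }

    // Tool stages corresponding to the lab workflow: collect/upload, integrate/view, curate.
    public enum ToolStage { COLLECTION, INTEGRATION, CURATION }

    public boolean isAllowed(ToolStage stage, Permission permission) {
        if (permission == Permission.READ) {
            return true;                            // reading is open everywhere
        }
        if (permission == Permission.WRITE) {
            return stage == ToolStage.COLLECTION;   // only the collection/upload tool writes
        }
        return stage == ToolStage.CURATION;         // only curation tools edit discrepant data
    }
}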

This rounds out the software requirements for a practical production system. Although these requirements appear (and are) extensive, most, if not all, of them appear in a number of enterprise-level toolkits. As I said at the beginning of this post: there are clear best practices. A Powerpoint that covers both these posts is available.