Tuesday, December 23, 2008

Discussing Architecture

If you're like me, you are forever grappling with finding the right format for discussing architecture with end users.

There are the standard architectural diagrams, but they really don't capture:
  • How the end users relate to the system
  • How all of the components are strung together to support the overall business processes
  • What artifacts (data, documents, etc.) are produced

I think there may finally be a "better mousetrap" available in ArchiMate which has a suitably small number of patterns/best practices that allow you to capture what's important in a large system at the appropriate level of detail. The most important aspect of ArchiMate is that it is a mixed mode modeling language with a distinctive, simple shape for each type of artifact.

ArchiMate allows for eight different types of things: Services, Processes, Organization, Products, Information, Infrastructure, Applications, and Functions.

As of yet I am not completely clear on the breakdown of these categories and to be honest I'm not sure that it really matters. ArchiMate is a communication tool. If it helps you communicate, it has served its purpose. If your way of breaking down the architecture is slightly different from their recommendations (which I strongly feel could use more examples) it may have no more impact than if your class structure for a domain is slightly different from someone else's. Don't get me wrong: I'm a big fan of standards when they have been properly vetted and heavily used, but until they hit that point it is important to be flexible in using them. Flexibility allows a standard to grow and cover a sufficient portion of the domain; otherwise it will wither from lack of use.

The presentation format that speaks to me the most is the layered diagram shown on p. 11 (figure 12) of the Enterprise Architecture Development and Modelling paper. See below:


Here's my simplified take for a clinical trial system used by physicians and patients.


What I like about this format is that all of the elements on a particular layer are at the same level of abstraction and are easily placed in relationship to the other few things that are at that same abstraction level. Simultaneously, you can see what supports (and is supported by) a particular component. Each perspective (all of which appear on the same diagram) only requires concentrating on a small number of things at a time, so it can easily be held in your short-term memory.

One of the key things about the layers is that they distinguish externally available data interfaces from internally consumed ones, highlighting in particular those that cross abstraction boundaries. These level-crossing interfaces are distinct from the ones shared by multiple applications at the same level in the stack. The latter are "external" to an application but internal to a particular level of abstraction, and they are more easily altered since their owners are more tightly coupled organizationally. This nicely foregrounds the implication that care should be taken in designing the external, level-crossing APIs and their lifecycles, since changes to them will be harder to coordinate given the diversity of the interested parties.

Monday, December 8, 2008

Name that data

There is an interesting trend that I feel has the potential to fundamentally shift the way we think about how data is used in networking and applications.

It involves attaching unique names to data, i.e., referring to the data by its SHA-1 hash value. Unique identifiers allow a number of things. For example, they allow you to retrieve the data from the network without worrying where it resides on the network, as described in A New Way to look at Networking, in which Van Jacobson discusses breaking out the content of the page from the page itself.

At 42:47 in the talk there's the section on dissemination networking, in which data is requested by name, using any and all means available (IP, VPN tunnels, zeroconf addresses, multicast, proxies, etc.). Anything that hears the request and has a valid copy of the data can respond. The returned data is signed and optionally secured, so its integrity and association with the name can be validated.

Rather than getting all of the data from a particular location e.g., as specified by a URL, we get the identifier for the content of the data and then let the network supply the data from the location "nearest" (using the appropriate distance metric) to its use.
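Here is a minimal sketch of the idea in Python. It assumes the name is nothing more than the SHA-1 hex digest of the content (the names in the talk are signed and richer than this). The functions name_of and fetch_by_name, and the dictionaries standing in for network peers, are hypothetical and only for illustration:

```python
import hashlib


def name_of(data: bytes) -> str:
    """Derive a location-independent name from the content itself."""
    return hashlib.sha1(data).hexdigest()


def fetch_by_name(name: str, sources: list[dict[str, bytes]]) -> bytes:
    """Ask each source in turn; accept the first copy whose hash matches the name."""
    for source in sources:  # assumed to be ordered "nearest" first
        candidate = source.get(name)
        if candidate is not None and name_of(candidate) == name:
            return candidate
    raise KeyError(f"no valid copy of {name} found")


page = b"content of the page"
name = name_of(page)

# Two hypothetical peers: one holds a bad copy, the other a good one.
stale_proxy = {name: b"corrupted or tampered copy"}
nearby_peer = {name: page}

assert fetch_by_name(name, [stale_proxy, nearby_peer]) == page
```

Because the name is derived from the content, the requester does not need to trust whoever answers: a copy either hashes to the requested name or it is discarded.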

A similar idea exists in ZFS' implementation of a copy-on-write transactional model:
ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum of the target block which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written.

This design allows small changes in large files to be reflected by changing only the blocks that have been altered rather than by rewriting the whole file to a new location. This simplifies backup procedures, reduces R/W bandwidth requirements, etc. Part of what's significant here is that we're dealing with abstractions of the data, e.g., checksums, rather than the data itself. If we use cryptographic hash functions rather than checksums, the two ideas become isomorphic: both amount to "get me this block from wherever it is."
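To make the block-level point concrete, here is a rough Python sketch. It is not how ZFS is implemented; it only assumes fixed-size blocks, each named by its SHA-256 digest. The helper names block_names and changed_blocks and the tiny block size are made up for the example:

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size, for illustration only


def block_names(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and name each by its SHA-256 digest."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes) -> list[int]:
    """Return the indices of blocks in `new` whose names differ from `old`."""
    old_names, new_names = block_names(old), block_names(new)
    return [
        i for i, name in enumerate(new_names)
        if i >= len(old_names) or old_names[i] != name
    ]


original = b"aaaabbbbccccdddd"
edited = b"aaaabbXbccccdddd"  # one byte altered in the second block

print(changed_blocks(original, edited))  # → [1]
```

Only the block that changed gets a new name, so only that block needs to be stored or transmitted again. Note that fixed-size blocks only localize in-place changes; an insertion or deletion shifts every later block boundary, so a real system would likely need a smarter chunking scheme.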

This has a number of potentially interesting applications depending on the granularity of the "named data". A simple example is answering the obvious question: "Am I working with the copy of the file that was emailed to me last Friday, or an older version?" Even with coarse-grained naming it would be possible to create mashups of music already on a user's computer -- just transmitting offsets into, and segment durations of, existing content gets past DRM issues entirely (morally if not practically, since much DRM-protected media is encrypted).
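The mashup case could look something like the following Python sketch. Everything here is hypothetical: the Segment record, the assemble function, and the use of byte offsets in place of real audio timestamps are assumptions made to keep the example small.

```python
import hashlib
from typing import NamedTuple


class Segment(NamedTuple):
    name: str    # SHA-1 hex digest of the source track
    offset: int  # start position within the track (bytes here, for simplicity)
    length: int  # number of bytes to take


def assemble(recipe: list[Segment], library: dict[str, bytes]) -> bytes:
    """Rebuild a mashup from content the recipient already holds locally."""
    return b"".join(
        library[seg.name][seg.offset:seg.offset + seg.length] for seg in recipe
    )


# Stand-ins for two tracks already on the recipient's computer.
track_a, track_b = b"AAAAAAAAAA", b"BBBBBBBBBB"
name_a = hashlib.sha1(track_a).hexdigest()
name_b = hashlib.sha1(track_b).hexdigest()
library = {name_a: track_a, name_b: track_b}

# The only thing transmitted is the recipe: names, offsets and lengths.
recipe = [Segment(name_a, 0, 3), Segment(name_b, 5, 2), Segment(name_a, 7, 3)]
print(assemble(recipe, library))  # → b'AAABBAAA'
```

The recipe contains none of the original content, only references into it, which is the point of the example.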

Uniquely naming the content is not just about networking, nor is it just about data; it is really about the cross product of the two: the data and its location/retrieval. Data location and retrieval is most of what is involved in computing. Unless data is being actively processed by a computational unit (here I mean the integer and floating point units on the chip), the rest of "computation" is about the retrieval and storage of data, e.g., do I put this in a frame buffer, in the cloud or in a shredder?

Imagine an ecosystem of data that represents everything that you own/care about -- you could partition this data into multiple overlapping categories based on any number of attributes such as:
  • temporary: make no copies
  • pieces of a larger whole: applications would minimize the number of named datasets that they change when they perform updates, e.g., editing out that first minute from your video won't change every block of the video
  • number of independently survivable copies required, and with what longevity
  • coupling the data to geographic position: how closely should the data follow me as I move around the planet? Can it stay where I put it, or should it go where I go and be available for high-bandwidth processing?

These are just my first-pass ideas. This concept of divorcing the data from its location and higher-level structural organization opens up the potential for a whole new set of applications that provide enhanced user functionality by pushing a lot of these data management, replication and caching issues deep into the infrastructure.