Monday, December 8, 2008

Name that data

There is an interesting trend that I feel has the potential to fundamentally shift the way we think about how data is used in networking and applications.

It involves attaching unique names to data, i.e., referring to the data by its SHA1 hash value. Unique identifiers allow a number of things. For example, they allow you to retrieve the data from the network without worrying where it resides on the network, as described in
A New Way to look at Networking in which Van Jacobson discusses breaking out the content of the page from the page itself.

At 42:47 in the talk there's the section on
Dissemination networking

in which
data is requested, by name using any and all means available (IP, VON tunnels, zeroconf addresses, multicast, proxies, etc),
Anything that hears the request and has a valid copy of the data can respond. The returned data is signed and optionally secured, so its integrity and association with the name name can be validated.

Rather than getting all of the data from a particular location e.g., as specified by a URL, we get the identifier for the content of the data and then let the network supply the data from the location "nearest" (using the appropriate distance metric) to its use.

A similar idea exists in ZFS' implementation of a copy-on-write transactional model:
ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum of the target block which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written.

This design allows small changes in large files to be reflected by changing only the blocks that have been altered rather than by rewriting the whole file to a new location. This simplifies backup procedures, reduces R/W bandwidth requirements, etc. Part of what's significant here is that we're dealing with abstractions of the data, e.g., checksums rather than the data itself. If we use cryptographic hash functions rather than checksums the ideas become isomorphic aka, get me this block from wherever it is.

This has a number of potentially interesting applications depending on the granularity of the "named data". As a simple example, answering the obvious: "am I working with the copy of the file that was emailed to me last Friday, or an older version?" Even with coarse grain naming it would be possible to create mashups of music already on a users computer -- just transmitting offsets into and segment durations of existing content gets past DRM issues entirely (morally if not practically-- since much DRM protected media is encoded).

Uniquely naming the content is not just about networking, nor is it just about data but is really about the cross product of the two: the data and its location/retrieval. Data location and retrieval is most of what is involved in computing: unless data is being actively processed by a computational unit (in this case, I'm talking about the integer and floating point units on the chip) the rest of "computation" is about the retrieval and storage of data e.g., do I put this in a frame buffer, in the cloud or in a shredder?

Imagine an ecosystem of data that represents everything that you own/care about -- you could partition this data into multiple overlapping categories based on any number of attributes such as:
  • temporary: make no copies

  • pieces of a larger whole: applications would minimize the number of named datasets that they change when they perform updates e.g., editing out that first minute from your video won't change every block of the video.

  • number of independently survivable copies required, with what longevity?
  • coupling the data to geographic position: How closely should the data follow me as I move around the planet? Can it stay where I put it, or should it go where I go and be available for high bandwidth processing.

This are just my first pass ideas. This concept of divorcing the data off from its location and higher level structural organization opens up the potential for a whole new set of applications which provide enhanced user functionality by pushing a lot of these data management, replication and caching issues deep into the infrastructure.

No comments: