Wednesday, December 17, 2014

Twitter Hashtags: the ideal knowledge management tool?

I loosely define knowledge management as the ability to find project-relevant information from other projects, current or historic. (I’m also using the term information very loosely here: I don’t mean information in the information-theoretic sense, but in the colloquial sense: unvetted data.)

Relevance is hard: it’s easy enough to create PDFs and dump them into a document store. That renders them findable, but finding the relevant document (or portion thereof) is difficult-to-impossible in most corporate environments. This is partly because there isn't that much linking going on, so you can’t do PageRank tricks, and partly because the data being addressed isn’t commensurable.

The two axes of finding and relevance should be correlated, but aren’t in practice. This is because the terms used for retrieval are “the same” but incommensurable between the documents. The search terms either 
  • Are not specific enough, 
  • Have changed through usage, or 
  • Rest on background assumptions that have changed over time.

The last is especially true in clinical trials. For example, can you compare results between two studies, one of which had a two-value (M/F) dropdown for sex while the other had a three-value dropdown (M/F/U)? The clear answer is maybe. The more maybes your searches turn up, the less incentive there is to even bother looking.

Similar situations occur when the context of the documents changes, e.g.,
  • SOPs change
  • Standards for acceptable data change
  • Drug target changes disease area
  • etc.

My experience in watching scientific systems develop/evolve (devolve?) over time is that you start off with a small ball of coherent thought. This coherence is the result of a good development process: one that vets and resolves differences in vocabulary use with all the stakeholders in the project.

Such balls of coherence are inherently unstable: without strong curation, changes in personnel and funding levels cause entropic decline. Of course, curatorial work is hard to protect when funding gets reduced (and funding always eventually gets reduced), since its loss rarely has a negative impact on short-term corporate goals.

This is even true in regulated environments: the regulatory goal is to assure that all the data submitted is coherent. Assuring that the data is coherent with other submissions is less important.

The result is that these carefully crafted balls of coherence end up going from a coherent set to a fuzzy idea.

Twitter hashtags don’t seem to do that. Worst case, they seem to go from a coherent set to two temporally displaced uses of the same tag that designate different sets of things, which may or may not be correlated.

There appear to be three main causes for this:
  1. Each tagged tweet only has a few tags, usually only one. (There are only 140 characters per tweet; if you’re not careful, your tweet will consist entirely of tags.)
  2. I’ve seen that people using a hashtag search Twitter for other tweets that use the same hashtag. If there’s a conflict, the situation often autocorrects.
  3. The combination of these two factors assures that the tag targets a specific sense that the community has in its head at a particular time. Once the sense of the items starts to shift, a new tag must be used, or members of the community cannot easily identify the relevant content.

The net result is something that’s very computationally friendly: hashtags are tightly targeted. If our search goal encompasses more than one hashtag, it is easy to group items with different hashtags together by setting up a “synonym table” that gathers all the results with hashtags A or B or E. Standard document search in a corporate environment, on the other hand, is more of a situation where A, B and E were all marked A (same term, different contexts) and now we need to somehow distinguish them. The phrase "near impossible" comes to mind.
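
To make the “synonym table” idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tag names, the `SYNONYMS` table, and the helper functions `index_by_tag` and `search` are hypothetical, not part of any Twitter or search-engine API.

```python
from collections import defaultdict

# Hypothetical synonym table: one search concept maps to every hashtag the
# community has used for it (the "A or B or E" case from the text).
SYNONYMS = {
    "concept": {"#a", "#b", "#e"},
}

def index_by_tag(items):
    """Build an inverted index: lowercased hashtag -> list of item ids."""
    index = defaultdict(list)
    for item_id, tags in items:
        for tag in tags:
            index[tag.lower()].append(item_id)
    return index

def search(index, concept, synonyms=SYNONYMS):
    """Return the sorted ids of items carrying any hashtag listed for the concept."""
    hits = set()
    # A concept without a table entry is treated as a literal tag.
    for tag in synonyms.get(concept, {concept}):
        hits.update(index.get(tag, []))
    return sorted(hits)

# Made-up sample data: (item id, hashtags on that item).
items = [
    (1, ["#A"]),
    (2, ["#B"]),
    (3, ["#C"]),
    (4, ["#E", "#A"]),
]
index = index_by_tag(items)
print(search(index, "concept"))  # [1, 2, 4]
```

Grouping is a cheap union over small, well-separated sets. The reverse problem, splitting one overloaded term back into its separate senses, has no equivalent table lookup.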

This makes me think of a tagging UI that allows up to 80 characters of #-indicated tags and dynamically shows the search results for these tags against the relevant content repository, cutting results when a tag hasn't been used for a period of time.
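
The back end of such a UI might look roughly like the sketch below. It is only one reading of the idea: the one-year staleness cutoff is my assumption (the period is left open above), the repository is modelled as a plain list of tuples with lowercase tags, and `parse_tags` and `live_results` are hypothetical names.

```python
import re
from datetime import datetime, timedelta

MAX_TAG_CHARS = 80                 # limit taken from the text
STALE_AFTER = timedelta(days=365)  # assumed cutoff; no period is specified

TAG_RE = re.compile(r"#\w+")

def parse_tags(text):
    """Extract lowercased #tags, rejecting input over the character limit."""
    if len(text) > MAX_TAG_CHARS:
        raise ValueError(f"tag input exceeds {MAX_TAG_CHARS} characters")
    return [t.lower() for t in TAG_RE.findall(text)]

def live_results(text, repository, now):
    """Return sorted ids of documents carrying any entered tag still in use.

    repository: list of (doc_id, tags, tagged_at) tuples, tags in lowercase.
    A tag is dropped when its most recent use is older than STALE_AFTER.
    """
    wanted = set(parse_tags(text))
    last_used = {}
    for _, tags, tagged_at in repository:
        for tag in tags:
            if tag not in last_used or tagged_at > last_used[tag]:
                last_used[tag] = tagged_at
    active = {t for t in wanted
              if t in last_used and now - last_used[t] <= STALE_AFTER}
    return sorted(doc_id for doc_id, tags, _ in repository if active & set(tags))

# Made-up repository contents for illustration.
repo = [
    ("doc1", ["#trial42"], datetime(2014, 11, 1)),
    ("doc2", ["#trial42", "#sop"], datetime(2014, 12, 1)),
    ("doc3", ["#oldtag"], datetime(2010, 1, 1)),
]
now = datetime(2014, 12, 17)
print(live_results("#trial42 #oldtag", repo, now))  # ['doc1', 'doc2']
```

A real UI would call something like `live_results` on every keystroke, so the person tagging sees immediately whether the tag still means what they think it means.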