Monday, December 20, 2010

Adding Structure to Data

This post demonstrates how to add further structure to data after the initial items have been (uniquely) identified and committed to your persistent store. The core idea here is that once you have items uniquely identified, you can overlay a structure (or any number of structures) upon them as desired.


These structural overlays can also be made to interact as much (or as little) as necessary to address the question currently under consideration.


For example, take four cell lines (from a brochure on the Charles River Labs website):


Cell Line    Species  Organ    ID
SW780        Human    Bladder  CL-1
Hep3B        Human    Liver    CL-2
B16          Mouse    Skin     CL-3
Madison109   Murine   Lung     CL-4


(note: not every relationship needs to be listed explicitly in the parent table)


This gives us the following structure:



Upon this we can overlay a set of relationships showing the source organ:




Or the source species:
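As a rough illustration of these overlays, here is a minimal Python sketch. It assumes the cell lines are held in a plain dictionary keyed by ID; the `overlay` helper and the field names are made up for this example and are not part of any particular persistence layer.

```python
from collections import defaultdict

# Cell line records from the table above, keyed by their unique ID.
cell_lines = {
    "CL-1": {"name": "SW780", "species": "Human", "organ": "Bladder"},
    "CL-2": {"name": "Hep3B", "species": "Human", "organ": "Liver"},
    "CL-3": {"name": "B16", "species": "Mouse", "organ": "Skin"},
    "CL-4": {"name": "Madison109", "species": "Murine", "organ": "Lung"},
}


def overlay(items, attribute):
    """Group item IDs under the value each item has for one attribute."""
    groups = defaultdict(list)
    for item_id, record in items.items():
        groups[record[attribute]].append(item_id)
    return dict(groups)


# Two independent overlays on the same identified items.
by_organ = overlay(cell_lines, "organ")
by_species = overlay(cell_lines, "species")

print(by_organ)
print(by_species)
```

The items themselves are never modified; each overlay is just another index built over the same IDs, so any number of them can coexist.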



Now, let's say we add a new cell line:


Cell Line    Species  Organ    ID
SW780-1      Human    Bladder  CL-5


Giving us



We may later realize that CL-5 was derived from CL-1, and can simply record that in a separate parent-child relationship table:


Parent  Child  Relationship
CL-1    CL-5   "derived"


(note: "Root" cell lines are those that do not appear in the Child column or do not appear in this table at all) 
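The note above on root cell lines can be sketched in a few lines of Python. This assumes the relationship table is loaded as a list of (parent, child, relationship) tuples; the `root_ids` helper is a hypothetical name for this example.

```python
# Rows of the parent/child table: (parent, child, relationship).
relationships = [("CL-1", "CL-5", "derived")]

# Every cell line ID known so far.
all_ids = ["CL-1", "CL-2", "CL-3", "CL-4", "CL-5"]


def root_ids(ids, rows):
    """Return the IDs that never appear in the Child column."""
    children = {child for _parent, child, _rel in rows}
    return [item_id for item_id in ids if item_id not in children]


roots = root_ids(all_ids, relationships)
print(roots)
```

IDs that do not appear in the table at all are covered automatically, since they cannot be in the Child column.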




This approach generalizes, and the resulting structure need not be a strict tree:


Parent  Parent Table  Child  Child Table  Relationship
CL-1    Cell_Line     CL-5   Cell_Line    fusion-parent
CL-2    Cell_Line     CL-5   Cell_Line    fusion-parent
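One possible way to hold such a generalized table in code is sketched below. The `Edge` type and `parents_of` helper are assumptions made for illustration; the only point is that a child may have more than one parent, so the structure is a graph and not a tree.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One row of the generalized relationship table."""
    parent: str
    parent_table: str
    child: str
    child_table: str
    relationship: str


edges = [
    Edge("CL-1", "Cell_Line", "CL-5", "Cell_Line", "fusion-parent"),
    Edge("CL-2", "Cell_Line", "CL-5", "Cell_Line", "fusion-parent"),
]


def parents_of(child, child_table, rows, relationship=None):
    """Return (table, id) pairs for every parent of the given child."""
    return [
        (e.parent_table, e.parent)
        for e in rows
        if e.child == child
        and e.child_table == child_table
        and (relationship is None or e.relationship == relationship)
    ]


fusion_parents = parents_of("CL-5", "Cell_Line", edges, "fusion-parent")
print(fusion_parents)
```

Carrying the table name alongside each ID lets the same relationship table link items of different kinds.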


Obviously, the richer the relationship, the more likely you are to move to a table designed specifically to capture that information:


Mixture Component  Mixture Component Table  Mixture  Mixture Table  Amt
C-1                Compound                 C-9      Compound       0.1
C-1                Compound                 C-9      Compound       0.1
R-2                Reagent                  C-9      Compound       0.1


As these structures build up, it becomes easy to interrogate the information about our available cell lines.


Query: What mammalian cell lines do we have?
Procedure: Traverse from the mammalian node and collect all cell line instances.
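A minimal sketch of that traversal follows. The "Mammal" node and its links to the species are a hypothetical taxonomy overlay invented for this example (the tables above only list the species); the species-to-cell-line links come from the cell line tables.

```python
from collections import deque

graph = {
    # Hypothetical taxonomy overlay: class -> species (an assumption).
    "Mammal": ["Human", "Mouse", "Murine"],
    # Species -> cell line IDs, taken from the cell line tables.
    "Human": ["CL-1", "CL-2", "CL-5"],
    "Mouse": ["CL-3"],
    "Murine": ["CL-4"],
}
cell_line_ids = {"CL-1", "CL-2", "CL-3", "CL-4", "CL-5"}


def collect_cell_lines(start, graph, cell_line_ids):
    """Breadth-first walk from start, keeping nodes that are cell lines."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        if node in cell_line_ids:
            found.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(found)


mammalian = collect_cell_lines("Mammal", graph, cell_line_ids)
print(mammalian)
```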




Query: What cell lines are derived from CL-1?
Procedure: Find the cell lines derived from CL-1, then find the cell lines derived from those (recursively), and collect all cell line instances.
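The derivation query can be sketched the same way, starting from CL-1 (the parent in the relationship table). CL-6 is a hypothetical second-generation cell line added here only so that the recursion has something to follow; it does not appear in the tables above.

```python
# (parent, child, relationship) rows; the CL-6 row is hypothetical.
relationships = [
    ("CL-1", "CL-5", "derived"),
    ("CL-5", "CL-6", "derived"),
]


def descendants(root, rows, relationship="derived"):
    """Collect every item reachable from root via the given relationship."""
    children = {}
    for parent, child, rel in rows:
        if rel == relationship:
            children.setdefault(parent, []).append(child)

    found = []
    seen = {root}   # guards against cycles in non-tree data
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


derived_from_cl1 = descendants("CL-1", relationships)
print(derived_from_cl1)
```

An explicit stack is used in place of recursive calls, which avoids recursion depth limits on long derivation chains.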



The overall pattern is pretty straightforward and can be processed with standard graph algorithms.


See also: Considerations in developing a middle distance ontology