Monday, December 20, 2010

Adding Structure to Data

This post demonstrates how to add further structure to data after the initial items have been (uniquely) identified and committed to your persistent store. The core idea here is that once you have items uniquely identified, you can overlay a structure (or any number of structures) upon them as desired.

 

These structural overlays can also be made to interact as much (or as little) as necessary to address the question currently under consideration.

 

For example, given four cell lines (taken from a brochure on the Charles River Labs website):

 

Cell Line    Species  Organ    ID
SW780        Human    Bladder  CL-1
Hep3B        Human    Liver    CL-2
B16          Mouse    Skin     CL-3
Madison109   Murine   Lung     CL-4

 


(note: not all relationships need to be specifically listed in the parent table)

 

This gives us the following structure:

Initial_cell_lines.png

 

Upon which we can overlay a set of relationships showing the source organ:
 

 

Graph_organ.png

 

Or the source species:

Graph_species_partial.png
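
As a rough illustration of the overlay idea, here is a minimal Python sketch. The dictionary layout and the `overlay` helper are assumptions made for this example, not part of any particular persistence layer; the point is only that each overlay is computed from the unique IDs without altering the underlying records.

```python
from collections import defaultdict

# Cell lines from the table above, keyed by their unique ID.
cell_lines = {
    "CL-1": {"name": "SW780", "species": "Human", "organ": "Bladder"},
    "CL-2": {"name": "Hep3B", "species": "Human", "organ": "Liver"},
    "CL-3": {"name": "B16", "species": "Mouse", "organ": "Skin"},
    "CL-4": {"name": "Madison109", "species": "Murine", "organ": "Lung"},
}


def overlay(items, attribute):
    """Group item IDs under each value of one attribute (hypothetical helper)."""
    groups = defaultdict(list)
    for item_id, record in items.items():
        groups[record[attribute]].append(item_id)
    return dict(groups)


# Two independent overlays on the same set of identified items.
by_organ = overlay(cell_lines, "organ")
by_species = overlay(cell_lines, "species")

print(by_organ)
print(by_species)
```

Because both overlays refer to the same IDs, they can be combined or ignored as each question requires.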

 

Now, let's say we add a new cell line:

 

Cell Line    Species  Organ    ID
SW780-1      Human    Bladder  CL-5

 

Giving us

Graph_Added_cell.png

 

We may later realize that CL-5 was derived from CL-1. Rather than changing the cell line table, we can store that information in a separate parent-child relationship table:

 

Parent  Child  Relationship
CL-1    CL-5   "derived"

 

(note: "Root" cell lines are those that do not appear in the Child column or do not appear in this table at all) 

 

Graph_Added_relationship.png
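
The rule for "root" cell lines in the note above can be sketched in a few lines of Python. The tuple layout and the function name `root_cell_lines` are assumptions for illustration; in practice the rows would come from the relationship table in your persistent store.

```python
# IDs of all known cell lines, including the newly added CL-5.
cell_line_ids = ["CL-1", "CL-2", "CL-3", "CL-4", "CL-5"]

# Rows of the parent-child relationship table: (parent, child, relationship).
relationships = [("CL-1", "CL-5", "derived")]


def root_cell_lines(ids, rows):
    """Return the IDs that never appear in the Child column.

    This also covers IDs that do not appear in the table at all.
    """
    children = {child for _parent, child, _relationship in rows}
    return [item_id for item_id in ids if item_id not in children]


roots = root_cell_lines(cell_line_ids, relationships)
print(roots)
```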

 

This sort of thing can be generally extended and need not be a strict tree: 

 

Parent  Parent Table  Child  Child Table  Relationship
CL-1    Cell_Line     CL-5   Cell_Line    fusion-parent
CL-2    Cell_Line     CL-5   Cell_Line    fusion-parent
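
A minimal sketch of how such a generalized table might be represented and queried in Python follows. The `Relationship` type and the `parents_of` helper are hypothetical names chosen for this example; the rows are the two shown above. Since a child may have more than one parent, the result is a list rather than a single value.

```python
from typing import NamedTuple


class Relationship(NamedTuple):
    """One row of the generalized relationship table (illustrative type)."""
    parent: str
    parent_table: str
    child: str
    child_table: str
    relationship: str


rows = [
    Relationship("CL-1", "Cell_Line", "CL-5", "Cell_Line", "fusion-parent"),
    Relationship("CL-2", "Cell_Line", "CL-5", "Cell_Line", "fusion-parent"),
]


def parents_of(child, child_table, rows):
    """Return (parent_table, parent, relationship) for every row pointing at the child."""
    return [
        (r.parent_table, r.parent, r.relationship)
        for r in rows
        if r.child == child and r.child_table == child_table
    ]


fusion_parents = parents_of("CL-5", "Cell_Line", rows)
print(fusion_parents)
```

Carrying the table name alongside each ID is what lets a single relationship table link items of different kinds.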

 

Obviously, the richer the relationship, the more likely you are to move to a table specifically designed to capture that information:

 

Mixture Component  Mixture Component Table  Mixture  Mixture Table  Amt
C-1                Compound                 C-9      Compound       0.1
C-1                Compound                 C-9      Compound       0.1
R-2                Reagent                  C-9      Compound       0.1

 

As these structures build up, it becomes easy to interrogate the information about our available cells.

 

Query: What mammalian cell lines do we have?
Procedure: Traverse from the mammalian node and collect all cell line instances.

 

Graph_species.png
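
The traversal for the mammalian query might look like the following breadth-first sketch. The taxonomy edges under "Mammalian" are assumed for illustration (the actual hierarchy is whatever your species overlay contains); the species-to-cell-line edges come from the tables above. The name `collect_cell_lines` is likewise hypothetical.

```python
from collections import deque

# Overlay edges: parent node -> child nodes.
# The edges under "Mammalian" are an assumption for this example;
# the species -> cell line edges are taken from the tables above.
edges = {
    "Mammalian": ["Human", "Mouse", "Murine"],
    "Human": ["CL-1", "CL-2", "CL-5"],
    "Mouse": ["CL-3"],
    "Murine": ["CL-4"],
}
cell_line_ids = {"CL-1", "CL-2", "CL-3", "CL-4", "CL-5"}


def collect_cell_lines(start, edges, cell_line_ids):
    """Breadth-first traversal from start, keeping only cell line nodes."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        if node in cell_line_ids:
            found.append(node)
        for child in edges.get(node, []):
            if child not in seen:  # guard against revisiting shared nodes
                seen.add(child)
                queue.append(child)
    return sorted(found)


mammalian_lines = collect_cell_lines("Mammalian", edges, cell_line_ids)
print(mammalian_lines)
```

Starting the same traversal from a lower node, such as "Mouse", narrows the answer without any change to the code.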

 

Query: What cell lines are derived from CL-1?
Procedure: Find the cell lines derived from CL-1, then find the cell lines derived from those (recursively), and collect all cell line instances.

Graph_derivation_only.png
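
The recursive procedure can be sketched as below, starting from cell line CL-1 in the relationship table. The CL-6 row is hypothetical and is included only so that the recursion has a second level to follow; the function name `descendants` is also an assumption of this example.

```python
# (parent, child) pairs from the "derived" relationship table.
derived_from = [
    ("CL-1", "CL-5"),  # from the relationship table above
    ("CL-5", "CL-6"),  # hypothetical row, added only to show the recursion
]


def descendants(root, rows, seen=None):
    """Collect every cell line derived, directly or indirectly, from root."""
    seen = set() if seen is None else seen
    for parent, child in rows:
        if parent == root and child not in seen:
            seen.add(child)
            descendants(child, rows, seen)  # follow the child's own derivations
    return seen


derived_lines = sorted(descendants("CL-1", derived_from))
print(derived_lines)
```

The `seen` set keeps the walk from looping if the relationship data ever contains a cycle, and it avoids collecting a line twice when it is reachable through more than one parent.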

 

The overall pattern is pretty straightforward and can be processed with standard graph algorithms.

 

See also: Considerations in developing a middle distance ontology

