Data Subset Replication Use Cases

Allen Luniewski luniew at almaden.ibm.com
Wed May 4 15:55:31 CDT 2005


At the April 27 Data Architecture call, I volunteered to write a use case 
for replication of subsets of data.  Below is my attempt to capture the 
basic idea in a few examples.

Allen


Data Subset Replication

Replication of entire objects (e.g., a file, an entire database) is a 
natural, and obvious, place to start considering replication.  However, 
there is a real need to replicate subsets of data.  Here are a number of 
motivating examples (use cases):

Company A has an employee database containing all information about its 
employees.  The database is multiple terabytes in size and updates are 
frequent.  Suppose that the payroll information is contained in a single 
table in that database.  The payroll department needs to have fast access 
to the payroll table.  This table is only a few tens of gigabytes in size 
and updates are infrequent.  Replicating just this subset reduces storage 
consumption on the payroll system, reduces the bandwidth used to maintain 
the replica and reduces the processing power used to create and handle 
updates to the payroll system.  This is an example of replicating a single 
table instead of an entire database.
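
As a rough illustration of the payroll case, the following sketch copies a 
single table into a separate replica store and leaves the rest of the 
database behind.  It uses SQLite in-memory databases purely for convenience; 
the table names, columns and the replicate_table helper are assumptions made 
for this example and do not describe any particular replication service.

```python
import sqlite3


def replicate_table(conn, table, target_schema="replica"):
    """Copy one table from the main database into the attached replica.

    Hypothetical helper: the table name is interpolated into SQL, so it is
    assumed to come from a trusted source.
    """
    conn.execute(
        f"CREATE TABLE {target_schema}.{table} AS SELECT * FROM main.{table}"
    )
    conn.commit()
    count = conn.execute(
        f"SELECT COUNT(*) FROM {target_schema}.{table}"
    ).fetchone()[0]
    return count


# Source database with an attached, initially empty, replica database.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS replica")

# Illustrative schema: only 'payroll' is of interest to the payroll system.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE payroll (employee_id INTEGER, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO payroll VALUES (?, ?)", [(1, 5000), (2, 6000)])

copied = replicate_table(conn, "payroll")

# The replica holds the payroll table only; 'employees' was never copied.
replica_tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM replica.sqlite_master WHERE type = 'table'"
    )
]
```

A real system would also need to propagate later updates to the replica; 
this sketch covers only the initial copy.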

A Life Sciences example.  Suppose that there is a large file that 
describes the entire human genome held on some server at UCLA.  This file 
is multiple terabytes in size.  A researcher in Paris desires to perform 
some computation on genes 14 through 19.  For efficiency the data being 
processed must be at a server in Paris.  Instead of expending considerable 
resources to move the entire file, relatively few resources are 
used to move just that portion of the file containing genes 14 through 19.  
Now suppose that a second researcher in Paris desires to perform a 
computation on genes 18 and 19.  Instead of moving the entire file, or 
even moving the subset of the file containing genes 18 and 19, that 
researcher can reuse the partial replica already held in Paris since it 
contains genes 14 through 19 and the genes of interest are a subset of 
those.
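
The reuse decision in this example comes down to a containment test: an 
existing partial replica can serve a request if the requested range lies 
entirely within the range it holds.  A minimal sketch follows, assuming that 
a subset can be described by a first and last gene number and that some 
catalog lists the partial replicas at each site.  The PartialReplica class 
and find_reusable_replica function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PartialReplica:
    """A replica holding a contiguous, inclusive range of genes at a site."""
    site: str
    first_gene: int
    last_gene: int

    def covers(self, first: int, last: int) -> bool:
        # True when the requested range is a subset of the held range.
        return self.first_gene <= first and last <= self.last_gene


def find_reusable_replica(
    replicas: Iterable[PartialReplica], site: str, first: int, last: int
) -> Optional[PartialReplica]:
    """Return a replica at `site` that already contains genes first..last."""
    for replica in replicas:
        if replica.site == site and replica.covers(first, last):
            return replica
    return None  # nothing suitable; a new subset would have to be moved


# The first researcher's transfer left genes 14 through 19 in Paris.
catalog = [PartialReplica(site="Paris", first_gene=14, last_gene=19)]

# Second researcher: genes 18 and 19 are covered, so no transfer is needed.
hit = find_reusable_replica(catalog, "Paris", 18, 19)

# A request for genes 12 through 15 is only partly covered, so no reuse.
miss = find_reusable_replica(catalog, "Paris", 12, 15)
```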

A final example from the database world.  Suppose that a multi-site 
hospital keeps a database of its patients and that the database contains 
all patient information including voluminous information such as x-ray 
images and MRI scans.  Thus it is very large.  One of the hospitals in the 
system, located in Boston, is very specialized - it only sees local 
elderly cancer patients.  This hospital needs to have a replica of the 
database that contains only those patients - replicating the entire 
database is, as above, a waste of precious resources.  So the hospital 
needs a replica that contains only those patient records that match a 
query that might look something like: "Patient.age > 65 AND 
DISTANCE(Patient.address, "Boston") < '50 miles' AND (Patient.illnesses 
INCLUDES "cancer" OR Patient.pastIllnesses INCLUDES "cancer")".
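
To make the idea of a query-defined replica concrete, the following sketch 
expresses the same selection as a predicate over patient records and applies 
it to decide which records belong in the Boston replica.  The Patient class 
and its fields are assumptions for illustration.  In particular, the 
distance from Boston is taken as a precomputed field rather than derived 
from an address, since the DISTANCE function above is not specified further.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Patient:
    name: str
    age: int
    miles_from_boston: float  # assumed precomputed; stands in for DISTANCE()
    illnesses: Set[str] = field(default_factory=set)
    past_illnesses: Set[str] = field(default_factory=set)


def in_boston_subset(p: Patient) -> bool:
    """Mirror of the example query: elderly, local, current or past cancer."""
    return (
        p.age > 65
        and p.miles_from_boston < 50
        and ("cancer" in p.illnesses or "cancer" in p.past_illnesses)
    )


def select_subset(patients: List[Patient]) -> List[Patient]:
    """Return only the records that the Boston replica should contain."""
    return [p for p in patients if in_boston_subset(p)]


# Made-up records, for illustration only.
patients = [
    Patient("A", 70, 10.0, illnesses={"cancer"}),
    Patient("B", 80, 30.0, past_illnesses={"cancer"}),
    Patient("C", 40, 5.0, illnesses={"cancer"}),     # too young
    Patient("D", 72, 200.0, illnesses={"cancer"}),   # too far away
    Patient("E", 90, 2.0, illnesses={"flu"}),        # no cancer history
]

subset_names = [p.name for p in select_subset(patients)]
```

The same predicate could also be evaluated against each incoming update to 
decide whether that update needs to be forwarded to the replica.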