Data Subset Replication Use Cases
Allen Luniewski
luniew at almaden.ibm.com
Wed May 4 15:55:31 CDT 2005
At the April 27 Data Architecture call, I volunteered to write a use case
for replication of subsets of data. Below is my attempt to capture the
basic idea in a few examples.
Allen
Data Subset Replication
Replication of entire objects (e.g., a file, an entire database) is a
natural, and obvious, place to start considering replication. However,
there is a real need to replicate subsets of data. Here are a number of
motivating examples (use cases):
Company A has an employee database containing all information about its
employees. The database is multiple terabytes in size and updates are
frequent. Suppose that the payroll information is contained in a single
table in that database. The payroll department needs to have fast access
to the payroll table. This table is only a few tens of gigabytes in size
and updates are infrequent. Replicating just this subset reduces storage
consumption on the payroll system, reduces the bandwidth used to maintain
the replica and reduces the processing power used to create and handle
updates to the payroll system. This is an example of replicating a single
table instead of an entire database.
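To make the single-table case concrete, here is a minimal sketch using
Python's built-in sqlite3 module. The table names, columns and the
replicate_table helper are hypothetical, chosen only for illustration, and
the code takes a one-time snapshot; it does not address how later updates
would be propagated to the replica.

```python
import sqlite3


def replicate_table(source, replica, table):
    """Copy one table (schema and rows) from source to replica.

    Returns the number of rows copied. Illustrative snapshot only.
    """
    # SQLite records the original CREATE TABLE statement in sqlite_master.
    row = source.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if row is None:
        raise ValueError(f"no such table: {table}")

    # Recreate the table on the replica, then copy the rows across.
    replica.execute(f'DROP TABLE IF EXISTS "{table}"')
    replica.execute(row[0])
    rows = source.execute(f'SELECT * FROM "{table}"').fetchall()
    if rows:
        placeholders = ", ".join("?" * len(rows[0]))
        replica.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', rows
        )
    replica.commit()
    return len(rows)


# Hypothetical employee database with a small payroll table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE employee (emp_id INTEGER, name TEXT)")
source.execute("CREATE TABLE payroll (emp_id INTEGER, salary REAL)")
source.executemany(
    "INSERT INTO payroll VALUES (?, ?)", [(1, 5000.0), (2, 6200.0)]
)
source.commit()

# The payroll system receives only the payroll table.
replica = sqlite3.connect(":memory:")
copied = replicate_table(source, replica, "payroll")
replica_tables = replica.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```

Only the payroll table exists on the replica afterwards; the employee
table never leaves the source system.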
A Life Sciences example. Suppose that there is a large file that
describes the entire human genome held on some server at UCLA. This file
is multiple terabytes in size. A researcher in Paris desires to perform
some computation on genes 14 through 19. For efficiency the data being
processed must be at a server in Paris. Instead of expending considerable
resources to move the entire file, a relatively small amount is used to
move just that portion of the file containing genes 14 through 19.
Now suppose that a second researcher in Paris desires to perform a
computation on genes 18 and 19. Instead of moving the entire file, or
even moving the subset of the file containing genes 18 and 19, that
researcher can reuse the partial replica already held in Paris since it
contains genes 14 through 19 and the genes of interest are a subset of
those.
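The reuse decision in this example comes down to a containment test: an
existing partial replica can serve a request if the requested range lies
entirely inside the range it already holds. A rough sketch follows, assuming
a replica catalog keyed by site and an inclusive gene range; the
PartialReplica class and find_reusable_replica function are hypothetical
names, not part of any existing interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialReplica:
    """A replica holding an inclusive range of genes at one site."""
    site: str
    first_gene: int
    last_gene: int

    def covers(self, first, last):
        # True when the requested range is a subset of the held range.
        return self.first_gene <= first and last <= self.last_gene


def find_reusable_replica(replicas, site, first, last):
    """Return a replica at `site` that covers [first, last], or None."""
    for candidate in replicas:
        if candidate.site == site and candidate.covers(first, last):
            return candidate
    return None


# The first researcher's request left genes 14 through 19 in Paris.
catalog = [PartialReplica(site="Paris", first_gene=14, last_gene=19)]

# Genes 18 and 19 are already covered, so nothing needs to be moved.
hit = find_reusable_replica(catalog, "Paris", 18, 19)

# Genes 12 through 15 are only partly covered, so no replica is reusable.
miss = find_reusable_replica(catalog, "Paris", 12, 15)
```

A fuller design might also combine several partial replicas to cover a
request, or fetch only the missing portion; this sketch checks a single
replica at a time.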
A final example from the database world. Suppose that a multi-site
hospital keeps a database of its patients and that the database contains
all patient information including voluminous information such as x-ray
images and MRI scans. Thus it is very large. One of the hospitals in the
system, located in Boston, is very specialized - it only sees local
elderly cancer patients. This hospital needs to have a replica of the
database that contains only those patients - replicating the entire
database is, as above, a waste of precious resources. So the hospital
needs a replica that contains only those patient records that match a
query that might look something like: "Patient.age > 65 AND
DISTANCE(Patient.address, "Boston") < '50 miles' AND (Patient.illnesses
INCLUDES "cancer" OR Patient.pastIllnesses INCLUDES "cancer")"
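The query above is effectively a predicate that decides, record by record,
what belongs in the replica. A minimal sketch of that selection in Python is
shown here. The field names are hypothetical, and the distance from Boston
is assumed to be precomputed in miles rather than derived from an address
by a DISTANCE function.

```python
def matches_boston_subset(patient):
    """Return True if the patient record belongs in the Boston replica."""
    has_cancer = (
        "cancer" in patient["illnesses"]
        or "cancer" in patient["past_illnesses"]
    )
    return (
        patient["age"] > 65
        and patient["miles_from_boston"] < 50
        and has_cancer
    )


def select_subset(patients):
    """Filter the full patient list down to the records to replicate."""
    return [p for p in patients if matches_boston_subset(p)]


# Made-up records, for illustration only.
patients = [
    {"id": 1, "age": 72, "miles_from_boston": 12,
     "illnesses": ["cancer"], "past_illnesses": []},
    {"id": 2, "age": 70, "miles_from_boston": 8,
     "illnesses": ["asthma"], "past_illnesses": ["cancer"]},
    {"id": 3, "age": 40, "miles_from_boston": 5,
     "illnesses": ["cancer"], "past_illnesses": []},
    {"id": 4, "age": 80, "miles_from_boston": 300,
     "illnesses": ["cancer"], "past_illnesses": []},
]

boston_ids = [p["id"] for p in select_subset(patients)]
```

Patients 1 and 2 qualify; patient 3 is too young and patient 4 lives too
far away. Keeping such a replica current would also mean re-evaluating the
predicate when a record changes, since an update can move a patient into
or out of the subset.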