[saga-rg] proposal for extended file IO

Andre Merzky andre at merzky.net
Mon Jun 13 13:54:49 CDT 2005


Hi Jon, 

Quoting [Jon MacLaren] (Jun 13 2005):
> Cc: Hartmut Kaiser <HartmutKaiser at t-online.de>,
> 	'Simple API for Grid Applications WG' <saga-rg at ggf.org>
> From: Jon MacLaren <maclaren at cct.lsu.edu>
> Subject: Re: [saga-rg] proposal for extended file IO
> Date: Mon, 13 Jun 2005 10:11:57 -0500
> To: Andre Merzky <andre at merzky.net>
> 
> 
>> On Jun 13, 2005, at 10:00 AM, Andre Merzky wrote:
>> 
>> Asynchroneousity (eng?) would be provided via the task
>> interface, as before (pseudocode):
>> sync:  File file (url);
>>        file.read (len, buff, &ret_len);
>>
>> async: File            file (url);
>>        FileTaskFactory ftf  = file.createTaskFactory ();
>>        Task            task = ftf.read (len, buff);
>>
>>        task.run  ();
>>        // do some other stuff here
>>        task.wait (&ret_len);
>>
>> There are more methods on the Task interface, for non-blocking
>> checks etc.  The task model basically holds for all saga
>> objects, so would also cover the eRead and eWrite
>> calls.
> 
> But this could take a *long* time, e.g. hours (you have to sort through 1TB
> of data, which is on a disk).  How would a client be able to tell what was
> going on?  

Yes, that can take a long time.  However, the tasks have a
state attached; they are one of:

  Pending
  Running
  Finished
  Cancelled

That state can be queried, so you know at least whether the task
is still alive.  I could imagine specific tasks giving more
detailed state or progress information, but that's not
specified in the strawman currently.  For example, we have
been discussing progress of file transfer: it would be nice if
the task told you how much of the file has been transferred, or
even with what throughput.  But that falls more into the
domain of monitoring, which was left out of the strawman
intentionally, for now.
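
To make the state query concrete, here is a minimal C++ sketch of
a task whose state can be polled without blocking.  It is only an
illustration: the Task class, TaskState enum, state(), cancel() and
poll_until_done() are hypothetical names, not the strawman API, and
the "remote" work is simulated by a local thread.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical stand-in for a SAGA task; the four states mirror the list above.
enum class TaskState { Pending, Running, Finished, Cancelled };

class Task {
public:
    // The work function may check the cancel flag; it returns the bytes read.
    explicit Task(std::function<std::size_t(const std::atomic<bool>&)> work)
        : work_(std::move(work)) {}

    ~Task() { if (worker_.joinable()) worker_.join(); }

    // Start the work in the background (assumption: run() is called once).
    void run() {
        assert(state_.load() == TaskState::Pending);
        state_ = TaskState::Running;
        worker_ = std::thread([this] {
            result_ = work_(cancel_);
            state_ = cancel_ ? TaskState::Cancelled : TaskState::Finished;
        });
    }

    // Non-blocking check: is the task still alive?
    TaskState state() const { return state_.load(); }

    // Ask the work function to stop early; it has to honour the flag itself.
    void cancel() { cancel_ = true; }

    // Blocking wait, as in task.wait (&ret_len).
    std::size_t wait() {
        if (worker_.joinable()) worker_.join();
        return result_;
    }

private:
    std::function<std::size_t(const std::atomic<bool>&)> work_;
    std::atomic<TaskState> state_{TaskState::Pending};
    std::atomic<bool> cancel_{false};
    std::atomic<std::size_t> result_{0};
    std::thread worker_;
};

// Poll until the task leaves the Running state; returns the number of polls.
int poll_until_done(Task& task, std::chrono::milliseconds interval) {
    int polls = 0;
    while (task.state() == TaskState::Running) {
        std::this_thread::sleep_for(interval);  // do some other stuff here
        ++polls;
    }
    return polls;
}
```

A client would call task.run (), poll state () while doing other
work, and call wait () once the state is Finished.  Note that this
only tells you the task is alive; it says nothing about progress.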

Is that what you would expect in terms of feedback?  If not,
can you give an example?

> Can I distinguish between:
>
> a) The remote service is preparing the data for me
>
> b) The network connection to the service has suddenly slowed down or broken,
>    and the data can't get through.
>
> I think if your API looks like:
>
> 1. PrepareData
>
> 2. GetData

I am not sure if that would make much difference: if
PrepareData takes some hours, you are back to the original
problem, aren't you?  Or do I misunderstand something?

Also, if your prepared data is large, or the network is
slow, the read can still take a long time - same situation
again...

Also, you would semantically tie two calls together.  For
example:

  file.prepare ("hyperslab", "([2,3,4][5,6,7])");
  file.read    (20, buffer, &out);

What does 20 mean?  It's specific to the hyperslab; the user
has to put the data together into a convenient structure.
Alternative:

  file.prepare ("hyperslab", "([2,3,4][5,6,7])");
  file.read    ("hyperslab", "([1,3,4][5,6,7])");
  file.read    ("hyperslab", "([2,3,4][5,6,7])"); 
  
  // I know the hs spec is wrong, but YOU know what I mean,
  // right ;-)

Hmm, again, maybe I totally misunderstand you...


> then people are more likely to expect that the data preparation is going to
> take a while.
> 
> I'm not sure that just allowing the first read to take an hour is going to
> encourage people to build clients that can cope well with this.  I'd hit
> <CTRL-C> if I didn't have a better idea of what was going on.

If the first preparation takes an hour...?

Then again, middleware like data cutter can benefit from
preprocessed data (do indexing beforehand, or create an octree
structure beforehand) - that could be done by creating a task
first, which prepares the data, and then doing the read
afterwards.  Would that do what you need?

  // warning: Pseudo Pseudo Code...
  Job  job  ("host_A", "/bin/subsample /data/hige_file_A /tmp/small_file_B");

  // wait for job completion
  // read prepared data
  File file ("gridftp://host_A//tmp/small_file_B");
  file.read (100, buffer, &out);
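
The same two-phase idea can be sketched in plain C++ with
std::async, just to show the control flow.  Everything here is an
assumption for illustration: subsample () stands in for the remote
/bin/subsample job, read_items () for file.read, and the data lives
in memory instead of on host_A.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

// Stand-in for the preparation job: keep every stride-th element.
std::vector<int> subsample(const std::vector<int>& data, std::size_t stride) {
    assert(stride > 0);
    std::vector<int> out;
    for (std::size_t i = 0; i < data.size(); i += stride) out.push_back(data[i]);
    return out;
}

// Stand-in for file.read (len, buffer, &out): copy up to len items,
// return how many were actually copied.
std::size_t read_items(const std::vector<int>& src, std::size_t len,
                       std::vector<int>& buffer) {
    std::size_t n = std::min(len, src.size());
    buffer.assign(src.begin(), src.begin() + static_cast<std::ptrdiff_t>(n));
    return n;
}

// Phase 1: run the preparation as its own task and poll it.
// Phase 2: read from the prepared (small) data once it is ready.
std::size_t prepare_then_read(const std::vector<int>& huge, std::size_t stride,
                              std::size_t len, std::vector<int>& buffer,
                              int& polls) {
    std::future<std::vector<int>> job = std::async(
        std::launch::async, [&huge, stride] { return subsample(huge, stride); });

    polls = 0;
    // While this loop runs, the client knows it is in the preparation
    // phase and not stuck in a slow read.
    while (job.wait_for(std::chrono::milliseconds(10)) !=
           std::future_status::ready) {
        ++polls;
    }

    std::vector<int> small = job.get();  // wait for job completion
    return read_items(small, len, buffer);  // read prepared data
}
```

The point of the split is that the long-running part is a separate
task with its own state, so the read itself stays a plain read.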
  

Cheers, Andre.


> Jon.



-- 
+-----------------------------------------------------------------+
| Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
| Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
| Dept. of Computer Science         | mail: merzky at cs.vu.nl       |
| De Boelelaan 1083a                | www:  http://www.merzky.net |
| 1081 HV Amsterdam, Netherlands    |                             |
+-----------------------------------------------------------------+




