[saga-rg] proposal for extended file IO

Jon MacLaren maclaren at cct.lsu.edu
Mon Jun 13 15:10:32 CDT 2005


>> But this could take a *long* time, e.g. hours (you have to sort
>> through 1TB of data, which is on a disk).  How would a client be
>> able to tell what was going on?
>>
>
> Yes, that can take a long time.  However, the tasks have a
> state attached, they are either:
>
>   Pending
>   Running
>   Finished
>   Cancelled
>
> That state can be queried, so you know at least if the task
> is still alive.  I could imagine specific tasks to give more
> detailed state or progress information, but that's not
> specified in the strawman currently.  For example, we have
> been discussing progress of file transfer: it would be nice if
> the task tells you how much of the file is transferred, or
> even with what throughput.  But that falls more into the
> domain of monitoring, which was left out of the strawman
> intentionally, for now.
>
> Is that what you would expect in terms of feedback?  If not,
> can you give an example?
>

It's not a question about functionality; it's more a comment about
language design and semantics.  You are potentially hiding a large
amount of processing behind a file read.  I don't find that
intuitive.  Should I put code around all eReads to allow for this?
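Just to make the cost concrete: the client-side wrapper I'd need around
every eRead would be a polling loop over the four task states listed
above.  A rough sketch follows; the Task class here is a self-contained
stand-in, not the strawman API (in particular, get_state() and poll()
are my own illustrative names):

```cpp
#include <cassert>

// The four task states from the strawman, as quoted above.
enum class TaskState { Pending, Running, Finished, Cancelled };

// Toy stand-in for an asynchronous read task.  A real task's state
// would change asynchronously in the middleware; here poll() simply
// advances it one step so the loop below terminates.
class Task {
public:
    TaskState get_state() const { return state_; }

    void poll() {
        if (state_ == TaskState::Pending)      state_ = TaskState::Running;
        else if (state_ == TaskState::Running) state_ = TaskState::Finished;
    }

private:
    TaskState state_ = TaskState::Pending;
};

// Block until the task leaves its live states, returning the final
// state.  Real code would sleep between polls instead of spinning --
// and on a supercomputer allocation, this is exactly the hour of
// busy-waiting I am worried about.
TaskState wait_for(Task& t) {
    while (t.get_state() == TaskState::Pending ||
           t.get_state() == TaskState::Running) {
        t.poll();
    }
    return t.get_state();
}
```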

With the explicit prepare, I might send a message to a service to do
the prepare, then start/queue a batch job once the processing was
complete.  If I am sitting on a file read for an hour on a
supercomputer, it's expensive.  That's why I think the decoupling is
better.

But I suppose that I could implement the decoupled prepare/read  
outside of the SAGA API, which is maybe where it belongs.  And the  
API you have is certainly fine for smaller files.

Perhaps that is what you are suggesting at the end of your reply....

> <snip>
> If the first preparation takes an hour...?
>
> Then again, middleware like DataCutter can benefit from
> preprocessed data (do indexing before, or create octree
> structure before) - that could be done by creating a task
> beforehand, which prepares the data, and then do the read
> afterwards.  Would that do what you need?
>
>   // warning: Pseudo Pseudo Code...
>   Job  job  ("host_A", "/bin/subsample /data/huge_file_A /tmp/small_file_B");
>
>   // wait for job completion
>   // read prepared data
>   File file ("gridftp://host_A//tmp/small_file_B");
>   file.read (100, buffer, &out);

I guess we are agreeing...

Jon.
