[saga-rg] proposal for extended file IO

Hartmut Kaiser HartmutKaiser at t-online.de
Mon Jun 13 02:04:44 CDT 2005


 
Andre Merzky wrote:

> Consider the following use case for remote IO.  Given a large 
> binary 2D field on a remote host, the client wants to access a 
> 2D sub-portion of that field.  Depending on the remote file 
> layout, that usually requires more than one read operation, 
> since the standard read (offset, length) is agnostic to the 2D layout.
> 
> For more complex operations (subsampling, get a piece of a 
> jpg file), the number of remote operations grows very fast.
> Latency then strongly discourages that type of remote IO.
> 
> For that reason, I think that the remote file IO as currently 
> specified by SAGA's Strawman will only be usable for a limited 
> and trivial set of remote I/O use cases.
> 
> There are three (basic) approaches:
> 
>   A) get the whole thing, and do ops locally
>      Pro: - one remote op, 
>           - simple logic
>           - remote side doesn't need to know about file
>             structure
>           - easily implementable on application level
>      Con: - getting the header info of a 1GB data file comes
>             with, well, some overhead ;-)
> 
>   B) clustering of calls: do many reads, but send them as a
>      single request.
>      Pro: - transparent to application
>           - efficient
>      Con: - need to know about dependencies of reads
>             (a header read needed to determine size of
>             field), or include explicit 'flushes'
>           - need a protocol to support that
>           - the remote side needs to support that
> 
>   C) data specific remote ops: send a high level command,
>      and get exactly what you want.
>      Pro: - most efficient
>      Con: - need a protocol to support that
>           - the remote side needs to support that _specific_
>             command
> 
> The last approach (C) is the one I have had the best experience
> with.  It is also what GridFTP, as a common file access protocol, 
> supports via its ERET/ESTO operations.
> 
> I want to propose to include a C-like extension to the File 
> API of the strawman, which basically maps well to GridFTP, 
> but should also map to other implementations of C.

Agreed here.
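
To put a number on the latency argument for the 2D use case quoted above,
here is a rough sketch (Python, all names hypothetical) that lists the plain
read (offset, length) calls needed for one sub-block. It assumes a row-major
layout with fixed-size elements and no file header, none of which is given
in the proposal.

```python
def subblock_reads(field_width, elem_size, x, y, w, h):
    """Return the (offset, length) byte ranges covering a w x h sub-block
    whose top-left corner is at column x, row y of a row-major 2D field.

    Assumption: no header, and every element occupies elem_size bytes.
    """
    reads = []
    for row in range(y, y + h):
        # Byte offset of the first requested element in this row.
        offset = (row * field_width + x) * elem_size
        reads.append((offset, w * elem_size))
    return reads


# A 22 x 4 block at column 7, row 8 of a field that is 10000 elements wide.
reads = subblock_reads(field_width=10000, elem_size=8, x=7, y=8, w=22, h=4)
print(len(reads))  # one remote read per row of the sub-block
```

With plain reads the number of round trips grows with the height of the
block, while a single eRead with a suitable emode would need one.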

> That extension would look like:
> 
>       void lsEModes    (out array<string,1> emodes   );
>       void eWrite      (in  string          emode,
>                         in  string          spec,
>                         in  string          buffer,
>                         out long            len_out  );
>       void eRead       (in  string          emode,
>                         in  string          spec,
>                         out string          buffer, 
>                         out long            len_out  );
> 
>       - hooks for gridftp-like opaque ERET/ESTO features
>       - spec:  string for pattern as in GridFTP's ESTO/ERET
>       - emode: string for ident.  as in GridFTP's ESTO/ERET
> 
> EMode:        a specific remote I/O command supported
> lsEModes:     list the EModes available in this implementation
> eRead/eWrite: read/write data according to the emode spec
> 
> Example (in perl for brevity):
> 
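>   # open the remote file and ask which EModes it supports
>   my $file   = SAGA::File->new 
>     ("http://www.google.com/intl/en/images/logo.gif");
>   my @emodes = $file->lsEModes ();
> 
>   # only issue the high level read if the EMode is available
>   if ( grep (/^jpeg_block$/, @emodes) )
>   {
>     my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
>   }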
> 
> I would discourage support for B, since I do not know of any 
> protocol supporting that approach efficiently, and it also 
> needs approximately the same infrastructure setup as C.
> 
> As A is easily implementable on application level, or within 
> any SAGA implementation, there is no need for support on API 
> level -- however, A is insufficient for all but some trivial cases.

This approach is very generic on the API level (that's good) but requires
exact agreement between the client and the server on the command syntax,
which may become problematic. If we go this route we will definitely end up
having to specify at least a minimal command subset to be supported by the
eRead/eWrite commands.
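
As a rough illustration of what such a minimal, specified subset could look
like, here is a sketch in Python. Everything in it is an assumption made for
illustration: the spec grammar (WIDTHxHEIGHT+XOFFSET+YOFFSET) is only guessed
from the "22x4+7+8" string in the quoted example, it is not GridFTP's actual
ERET/ESTO syntax, and the function names are hypothetical.

```python
import re

# Hypothetical grammar for a block spec, modelled on the "22x4+7+8"
# string in the quoted example: WIDTHxHEIGHT+XOFFSET+YOFFSET.
_BLOCK_SPEC = re.compile(r"^(\d+)x(\d+)\+(\d+)\+(\d+)$")


def parse_block_spec(spec):
    """Parse a block spec into a dict, or raise ValueError if malformed."""
    m = _BLOCK_SPEC.match(spec)
    if m is None:
        raise ValueError("malformed block spec: %r" % spec)
    w, h, x, y = (int(g) for g in m.groups())
    return {"width": w, "height": h, "x": x, "y": y}


# The agreed subset: emode name -> parser for that emode's spec string.
# Client and server would both validate requests against this table.
EMODES = {"jpeg_block": parse_block_spec}


def ls_emodes():
    """Counterpart of lsEModes: list the emode names in the subset."""
    return sorted(EMODES)


def check_request(emode, spec):
    """Validate an (emode, spec) pair before it goes over the wire."""
    if emode not in EMODES:
        raise KeyError("unsupported emode: %r" % emode)
    return EMODES[emode](spec)


print(check_request("jpeg_block", "22x4+7+8"))
```

The point is only that the emode names and their spec grammars are written
down in one place, so that a client and a server built independently reject
the same malformed requests.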

I simply fear we'll have the same problems we have with the GAT today. The
GAT API is in principle usable in a broad range of use cases based on a
generic API. The genericity is ensured by using key/value tables in the API
itself, allowing quick adaptation to any concrete need. The problem is the
missing specification of these key/value pairs which makes it difficult to
achieve reusability.

Regards Hartmut





