[saga-rg] Re: proposal for extended file IO - summary

Thorsten Schuett schuett at zib.de
Mon Jun 20 02:00:19 CDT 2005


Of course, I like the idea of adding pattern reads to SAGA. ;-)

At the same time I have the feeling that there must be a second document.
Something like the "The Annotated SAGA Reference Manual", a tutorial or 
sample apps written in SAGA. On the one hand you should document the ideas 
behind the API (why did you include readE, .... ) and on the other hand you 
should show how to solve common problems ("see how easy it is to create a 
module for server-side data processing in SAGA").

Thorsten

On Friday 17 June 2005 21:34, Andre Merzky wrote:
> Hi List,
>
> I went through the IO thread again, and also had a chat with
> John Shalf, and I'd like to summarize the outcome of the
> discussion.  Please consider that as a joint proposal of
> John and me for inclusion in the file IO methods.
>
>   Observations:
>
>    - normal read/write has severe drawbacks on remote IO, if
>      used extensively, both sync and async
>
>    - external preprocessing of data for read can be accomplished
>      by spawning preprocessing jobs
>
>    - async is well covered by the task model
>
>    - there exist various approaches to improve throughput
>      for IO intensive apps, amongst them:
>
>      - (A) gather/scatter (see readv (2))
>      - (B) FALLS (regular patterns on binary data)
>      - (C) eRead (see ERET/ESTO in gridftp)
>
>   Remarks:
>
>    - the options A, B and C offer increasingly powerful
>      expressions, but also require increasing coordination
>      between client and server side.
>
>    - A is, being POSIX, well known
>
>    - B maps to hyperslabs pretty well, a seemingly common
>      access pattern
>
>    - C maps to GridFTP, a commonly used protocol, very well
>
>   Proposal:
>
>    - There seem to be advantages to A, B and C.  Also, the need
>      for more than simple read seems obvious.  Hence we
>      propose to include A, B and C into the SAGA API.
>
>      void readV       (in  array<ivec>     ivec,
>                        out array<string>   buffers  );
>      void writeV      (in  array<ivec>     ivec,
>                        in  array<string>   buffers  );
>
>      void readP       (in  pattern         pattern,
>                        out string          buffer,
>                        out long            len_out  );
>      void writeP      (in  pattern         pattern,
>                        in  string          buffer,
>                        out long            len_out  );
>
>      void lsEModes    (out array<string,1> emodes   );
>      void readE       (in  string          emode,
>                        in  string          spec,
>                        out string          buffer,
>                        out long            len_out  );
>      void writeE      (in  string          emode,
>                        in  string          spec,
>                        in  string          buffer,
>                        out long            len_out  );
>
> We think that adding the 7 calls does not bloat the API (although it
> increases the number of file methods significantly), but will make the
> API much more usable for the targeted use cases.
>
> Please comment :-)
>
> Cheers, Andre.
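
As an illustration of how options A and B in the proposal above relate, here
is a minimal Python sketch (not SAGA code). It assumes that an ivec entry is
an (offset, length) pair in bytes and that the field is stored row by row;
`read_v` and `subblock_ivec` are hypothetical names. A regular 2D sub-block
expands to one segment per row, which a gather read could then fetch in a
single call rather than one remote read per row.

```python
import io
from typing import BinaryIO, List, Tuple

# Assumed meaning of one ivec entry: (offset, length) in bytes.
Segment = Tuple[int, int]


def read_v(f: BinaryIO, ivec: List[Segment]) -> List[bytes]:
    """Gather read: fetch every segment in one call, one buffer per segment."""
    buffers = []
    for offset, length in ivec:
        f.seek(offset)
        buffers.append(f.read(length))
    return buffers


def subblock_ivec(row_len: int, x0: int, y0: int,
                  width: int, height: int) -> List[Segment]:
    """Expand a 2D sub-block of a row-major byte field into one segment per row."""
    return [((y0 + row) * row_len + x0, width) for row in range(height)]


# A 4x8 field holding the bytes 0..31, stored row by row.
field = io.BytesIO(bytes(range(32)))

# Sub-block of width 3 and height 2, starting at column 2 of row 1.
ivec = subblock_ivec(row_len=8, x0=2, y0=1, width=3, height=2)
rows = read_v(field, ivec)
```

Here the local loop stands in for what a remote implementation would do on
the server side; the point is only that the client issues one request for
the whole segment list.
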
>
> Quoting [Andre Merzky] (Jun 12 2005):
> > Hi again,
> >
> > consider the following use case for remote IO.  Given a large
> > binary 2D field on a remote host, the client wants to access
> > a 2D sub portion of that field.  Depending on the remote
> > file layout, that usually requires more than one read
> > operation, since the standard read (offset, length) is
> > agnostic to the 2D layout.
> >
> > For more complex operations (subsampling, get a piece of a
> > jpg file), the number of remote operations grows very fast.
> > Latency then strongly discourages that type of remote IO.
> >
> > For that reason, I think that the remote file IO as
> > specified by SAGA's Strawman as is will only be usable for a
> > limited and trivial set of remote I/O use cases.
> >
> > There are three (basic) approaches:
> >
> >   A) get the whole thing, and do ops locally
> >      Pro: - one remote op,
> >           - simple logic
> >           - remote side doesn't need to know about file
> >             structure
> >           - easily implementable on application level
> >      Con: - getting the header info of a 1GB data file comes
> >             with, well, some overhead ;-)
> >
> >   B) clustering of calls: do many reads, but send them as a
> >      single request.
> >      Pro: - transparent to application
> >           - efficient
> >      Con: - need to know about dependencies of reads
> >             (a header read needed to determine size of
> >             field), or include explicit 'flushes'
> >           - need a protocol to support that
> >           - the remote side needs to support that
> >
> >   C) data specific remote ops: send a high level command,
> >      and get exactly what you want.
> >      Pro: - most efficient
> >      Con: - need a protocol to support that
> >           - the remote side needs to support that _specific_
> >             command
> >
> > The last approach (C) is what I have the best experience with.
> > Also, that is what GridFTP as a common file access protocol
> > supports via ERET/ESTO operations.
> >
> > I want to propose to include a C-like extension to the File
> > API of the strawman, which basically maps well to GridFTP,
> > but should also map to other implementations of C.
> >
> > That extension would look like:
> >
> >       void lsEModes    (out array<string,1> emodes   );
> >       void eWrite      (in  string          emode,
> >                         in  string          spec,
> >                         in  string          buffer,
> >                         out long            len_out  );
> >       void eRead       (in  string          emode,
> >                         in  string          spec,
> >                         out string          buffer,
> >                         out long            len_out  );
> >
> >       - hooks for gridftp-like opaque ERET/ESTO features
> >       - spec:  string for pattern as in GridFTP's ESTO/ERET
> >       - emode: string for ident.  as in GridFTP's ESTO/ERET
> >
> > EMode:        a specific remote I/O command supported
> > lsEModes:     list the EModes available in this implementation
> > eRead/eWrite: read/write data according to the emode spec
> >
> > Example (in perl for brevity):
> >
> >   my $file   = SAGA::File->new
> >                  ("http://www.google.com/intl/en/images/logo.gif");
> >   my @emodes = $file->lsEModes ();
> >
> >   if ( grep (/^jpeg_block$/, @emodes) )
> >   {
> >     my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
> >   }
> >
> > I would discourage support for B, since I do not know of any
> > protocol supporting that approach efficiently, and also it
> > needs approximately the same infrastructure setup as C.
> >
> > As A is easily implementable on application level, or within
> > any SAGA implementation, there is no need for support on API
> > level -- however, A is insufficient for all but some trivial
> > cases.
> >
> > Comments welcome :-))
> >
> > Cheers, Andre.
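
The same lsEModes/eRead flow as in the Perl example can be sketched in
runnable Python. This is a toy, in-memory stand-in and not a SAGA or GridFTP
binding: `ExtendedFile`, `ls_emodes`, `read_e`, the `byte_range` emode and
its "offset+length" spec format are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# An emode handler takes the file contents and an opaque spec string.
Handler = Callable[[bytes, str], bytes]


def _byte_range(data: bytes, spec: str) -> bytes:
    """Hypothetical emode: spec is 'offset+length', e.g. '3+4'."""
    offset, length = (int(part) for part in spec.split("+"))
    return data[offset:offset + length]


class ExtendedFile:
    """Toy stand-in for a file that supports named extended read modes."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        # The set of emodes is implementation specific; only one is registered here.
        self._emodes: Dict[str, Handler] = {"byte_range": _byte_range}

    def ls_emodes(self) -> List[str]:
        """List the emodes this implementation supports."""
        return sorted(self._emodes)

    def read_e(self, emode: str, spec: str) -> Tuple[bytes, int]:
        """Run one extended read and return (buffer, len_out)."""
        if emode not in self._emodes:
            raise ValueError(f"unsupported emode: {emode}")
        buffer = self._emodes[emode](self._data, spec)
        return buffer, len(buffer)


f = ExtendedFile(b"0123456789")
# As in the Perl example: check for the emode before using it.
if "byte_range" in f.ls_emodes():
    buff, length = f.read_e("byte_range", "3+4")
```

The spec string stays opaque to the API, so only the handler for a given
emode needs to understand its format.
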




