[saga-rg] proposal for extended file IO

John Shalf jshalf at lbl.gov
Tue Jun 14 01:25:18 CDT 2005


On Jun 13, 2005, at 11:40 AM, Andre Merzky wrote:

> Hi John,
>
> Quoting [John Shalf] (Jun 13 2005):
>>
>> Hi Andre,
>> I think there is a 4th possibility.  If each of the I/O operations can
>> be requested asynchronously, then you can get the same net effect as
>> the ERET/ESTO functionality of the GridFTP.
>
> I disagree.  You can hide latency to some extent, but your
> throughput suffers, utterly.

That is true if you do a full gather-scatter I/O in which the length of 
each request equals the size of the data item it returns.  Even in such 
a case, as long as the number of outstanding requests matches the 
bandwidth-delay product of the network channel (as per Little's Law), 
you still achieve full throughput.  However, the e-modes approach is 
equally bad, because it simply pushes an enormous amount of complexity 
down a layer into the implementation.  I'm not sure which is worse.
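
To make the Little's Law point concrete, here is a back-of-the-envelope 
sketch in C (all numbers are purely illustrative):

    /* Little's Law for pipelined I/O: bytes in flight = bandwidth * RTT.
     * The numbers below are made up for illustration.                   */
    #include <stdio.h>

    int main(void)
    {
        double bandwidth = 1e9 / 8.0;   /* 1 Gbit/s link, in bytes/s  */
        double rtt       = 0.050;       /* 50 ms round trip           */
        double req_size  = 4096.0;      /* bytes returned per request */

        double bytes_in_flight    = bandwidth * rtt;
        double requests_in_flight = bytes_in_flight / req_size;

        printf("bytes in flight:    %.0f\n", bytes_in_flight);
        printf("requests in flight: %.0f\n", requests_in_flight);
        return 0;
    }

That works out to roughly 6 MB in flight, i.e. on the order of 1500 
concurrent 4 KB requests -- a lot, but nothing a pipelined client 
cannot sustain.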

So the concerns I have are as follows:
	1) The negotiations to find out the available eModes seem to require 
that some complex modules be installed on both the client and the 
server side of a system.  One would hope that you could implement the 
capabilities you need using a smaller subset of elemental operations -- 
for instance, the POSIX readv() and pread() functionality for 
describing gather/scatter-type operations (see the sketch after this 
list).
	2) The proposal looks way too close to one particular data transport 
implementation.  I'm not convinced it is the best thing out there for 
gather-scatter I/O over a high-latency interface.  Again, I'd be 
interested in seeing the advantages/disadvantages of something modeled 
on the POSIX/XPG gather/scatter I/O interface.  That would cover Jon's 
case.
	3) Are the EModes() guaranteed to be stateless?  In the JPEG_Block 
example you provide, it's not clear what the side effects are with 
regard to the file pointer.  If some EModes() have side effects on the 
file pointer state whereas others do not, it's going to be impossibly 
messy.
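
To make point 1 concrete, here is a minimal sketch of how the hyperslab 
case could be served with plain pread() calls; the helper name and 
field geometry are made up for illustration:

    #include <sys/types.h>
    #include <unistd.h>

    /* Read 'rows' rows of 'row_bytes' each from a 2D field whose rows
     * lie 'file_row_bytes' apart in the file, starting at 'origin'.    */
    ssize_t read_hyperslab(int fd, char *dst, off_t origin, size_t rows,
                           size_t row_bytes, size_t file_row_bytes)
    {
        ssize_t total = 0;
        for (size_t r = 0; r < rows; r++) {
            /* pread() carries its own offset, so each request is
             * stateless and could just as well be issued concurrently. */
            ssize_t n = pread(fd, dst + r * row_bytes, row_bytes,
                              origin + (off_t)(r * file_row_bytes));
            if (n < 0)
                return -1;
            total += n;
        }
        return total;
    }

Nothing here needs to be negotiated between client and server; the 
whole pattern is expressed with offsets and lengths.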

So my example wasn't very well thought out, but the higher-level point 
I was trying to make is that I think there are more general ways to 
encode data layout descriptions for remote patterned or gather-scatter 
I/O operations than e-modes.  The arbitrariness of the modes and their 
associated string parsing adds a kind of complexity that is a bit 
daunting at first blush.
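
As one hypothetical alternative -- a sketch, not something in the 
strawman -- a patterned read could be described by a small 
strided-block record in the spirit of iovec or MPI datatypes, with 
nothing to parse on either end:

    #include <sys/types.h>   /* off_t  */
    #include <stddef.h>      /* size_t */

    /* Illustrative strided-block descriptor.  "Every 2nd 4-byte item
     * out of 1000" would be:
     *   { .offset = 0, .block_len = 4, .stride = 8, .count = 500 }     */
    struct strided_read {
        off_t  offset;      /* where the pattern starts in the file     */
        size_t block_len;   /* bytes per contiguous block               */
        size_t stride;      /* distance between block starts, in bytes  */
        size_t count;       /* number of blocks                         */
    };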

> Imagine Jon's use case (it's a worst-case scenario really): You
> have a remote HDF5 file, and want a hyperslab.  The really worst
> case is that you want every second data item.
>
> Now, if you rely on read as is, you have to send one read
> request for every single data item you want to read.  If you
> interleave them asynchronously, you get reasonable latency,
> but your throughput is, well, close to zero.

If the number of outstanding requests (in terms of bytes) is equal to 
the bandwidth-delay product of the connection, then you will reach peak 
throughput.  Sadly, the way I posed the solution would die from the 
excessive overhead of launching the threads.

> If you want to optimize your buffer size, you have to read
> more than one data item CONSECUTIVELY.  Since the use case
> says you are interested in every second data item, you
> effectively have to read ALL data.

You would definitely not want to read them consecutively -- you'd want 
to read all of the data items you need concurrently (thereby 
necessitating that the file pointer offset be encoded in each request).  
I do agree with you that my off-the-cuff proposal for launching one 
async task per data item is not practical due to the excessive software 
overhead.  However, I don't see why you cannot launch as many 
concurrent requests as you need to satisfy Little's Law.
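
For instance, a single client thread can keep an arbitrary number of 
offset-tagged reads in flight with POSIX AIO -- a rough sketch, with 
the queue depth, item size, and helper name as made-up parameters:

    #include <aio.h>
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    #define DEPTH     64     /* requests in flight; pick via Little's Law */
    #define ITEM_SIZE 4096   /* bytes per data item (illustrative)        */

    static struct aiocb cbs[DEPTH];
    static char         bufs[DEPTH][ITEM_SIZE];

    /* Issue one batch of DEPTH strided reads starting at 'base'. */
    int issue_batch(int fd, off_t base, off_t stride)
    {
        for (int i = 0; i < DEPTH; i++) {
            memset(&cbs[i], 0, sizeof cbs[i]);
            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf    = bufs[i];
            cbs[i].aio_nbytes = ITEM_SIZE;
            cbs[i].aio_offset = base + i * stride;  /* offset travels
                                                       with the request */
            if (aio_read(&cbs[i]) != 0)
                return -1;
        }
        /* Reap the completions; they may finish out of order. */
        for (int i = 0; i < DEPTH; i++) {
            const struct aiocb *one[1] = { &cbs[i] };
            while (aio_error(&cbs[i]) == EINPROGRESS)
                aio_suspend(one, 1, NULL);
            if (aio_return(&cbs[i]) < 0)
                return -1;
        }
        return 0;
    }

No threads are spawned at all; the layer underneath is free to service 
the requests in whatever order it likes.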

> Same holds if you want every 10th data item - only the ratio
> gets even worse.
> So, interleaving works only efficiently for sufficiently
> _large_ independent read requests (then it's perfect, of
> course).

That is curious... Interleaving on vector machines is used for 
precisely the opposite purpose (for hundreds of very small independent 
read requests).  Latency hiding and throughput are intimately connected.

I would expect that all of the read requests for a hyperslab are 
independent, provided the file pointer state is encoded in each 
request.  This is precisely what readv()/pread() do.

Can we find a case that actually causes problems for a readv/pread 
model?  The hyperslabbing is clearly not one of those cases.

> I think the task model and the proposed eRead model are
> orthogonal.  The task model provides the asynchrony,
> the eRead provides the efficiency (throughput).

Pipelining is used to achieve throughput.  Pipelining is achieved via 
concurrent async operations.  I agree that launching one task per byte 
is going to be inefficient, but it is inefficient because of the 
software overhead of launching a new task (not because async 
request/response is inefficient).  SCSI disk interfaces and DDR DRAMs 
depend on submitting async requests for data that get fulfilled later 
(sometimes out of order).  They achieve that throughput using a far 
simpler model than ERET/ESTO.  It's worth looking at simpler models for 
defining deeply pipelined remote gather/scatter operations.
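
As a thought experiment (this is not a proposal for a concrete wire 
protocol), the request/response model those devices use needs little 
more than a tag and an offset per request:

    #include <stdint.h>

    /* Hypothetical wire messages for a deeply pipelined read protocol,
     * in the spirit of SCSI tagged queueing.  Purely illustrative.     */
    struct read_req {
        uint64_t tag;      /* matches the reply to the request          */
        uint64_t offset;   /* no shared file-pointer state              */
        uint32_t length;   /* bytes requested                           */
    };

    struct read_reply {
        uint64_t tag;      /* copied from the request                   */
        uint32_t length;   /* bytes actually returned                   */
        /* 'length' bytes of payload follow */
    };

Replies can come back out of order and still be matched to their 
requests, which is all that deep pipelining really requires.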

> Also, as a side note: I know about some of the discussions
> the GridFTP folks had about efficient remote file IO.  They
> were similar to this one, and the ERET/ESTO model was the
> one finally agreed on.

I'm not sure that ERET/ESTO solves the problem at hand; the complexity 
has simply been pushed to a different layer of the software stack.

> Cheers, Andre.
>> The only modification that would be useful to add to the tasking
>> interface is a notion of "readFrom()" and "writeTo()" which allows you
>> to specify the file offset together with the read.  Otherwise, the
>> statefulness of the read() call would make the entire "task" interface
>> useless with respect to file I/O.
>>
>> -john
>>
>> On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
>>> Hi again,
>>>
>>> consider the following use case for remote IO.  Given a large
>>> binary 2D field on a remote host, the client wants to access
>>> a 2D sub-portion of that field.  Depending on the remote
>>> file layout, that usually requires more than one read
>>> operation, since the standard read (offset, length) is
>>> agnostic to the 2D layout.
>>>
>>> For more complex operations (subsampling, getting a piece of a
>>> jpg file), the number of remote operations grows very fast.
>>> Latency then strongly discourages that type of remote IO.
>>>
>>> For that reason, I think that the remote file IO as
>>> specified by SAGA's Strawman as is will only be usable for a
>>> limited and trivial set of remote I/O use cases.
>>>
>>> There are three (basic) approaches:
>>>
>>>  A) get the whole thing, and do ops locally
>>>     Pro: - one remote op,
>>>          - simple logic
>>>          - remote side doesn't need to know about file
>>>            structure
>>>          - easily implementable on application level
>>>     Con: - getting the header info of a 1GB data file comes
>>>            with, well, some overhead ;-)
>>>
>>>  B) clustering of calls: do many reads, but send them as a
>>>     single request.
>>>     Pro: - transparent to application
>>>          - efficient
>>>     Con: - need to know about dependencies of reads
>>>            (e.g. a header read needed to determine the
>>>            size of a field), or include explicit 'flushes'
>>>          - need a protocol to support that
>>>          - the remote side needs to support that
>>>
>>>  C) data specific remote ops: send a high level command,
>>>     and get exactly what you want.
>>>     Pro: - most efficient
>>>     Con: - need a protocol to support that
>>>          - the remote side needs to support that _specific_
>>>            command
>>>
>>> The last approach (C) is the one I have the best experience
>>> with.  It is also what GridFTP, as a common file access
>>> protocol, supports via its ERET/ESTO operations.
>>>
>>> I want to propose including a C-like extension to the File
>>> API of the strawman, which basically maps well to GridFTP,
>>> but should also map to other implementations of approach C.
>>>
>>> That extension would look like:
>>>
>>>      void lsEModes   (out array<string,1> emodes   );
>>>      void eWrite      (in  string          emode,
>>>                        in  string          spec,
>>>                        in  string          buffer,
>>>                        out long            len_out  );
>>>      void eRead       (in  string          emode,
>>>                        in  string          spec,
>>>                        out string          buffer,
>>>                        out long            len_out  );
>>>
>>>      - hooks for gridftp-like opaque ERET/ESTO features
>>>      - spec:  string for pattern as in GridFTP's ESTO/ERET
>>>      - emode: string for ident.  as in GridFTP's ESTO/ERET
>>>
>>> EMode:        a specific remote I/O command supported
>>> lsEModes:     list the EModes available in this implementation
>>> eRead/eWrite: read/write data according to the emode spec
>>>
>>> Example (in perl for brevity):
>>>
>>>  my $file   = SAGA::File->new
>>> ("http://www.google.com/intl/en/images/logo.gif");
>>>  my @emodes = $file->lsEModes ();
>>>
>>>  if ( grep (/^jpeg_block$/, @emodes) )
>>>  {
>>>    my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
>>>  }
>>>
>>> I would discourage support for B, since I do not know of any
>>> protocol supporting that approach efficiently, and it also
>>> needs approximately the same infrastructure setup as C.
>>>
>>> As A is easily implementable at the application level, or
>>> within any SAGA implementation, there is no need for support
>>> at the API level -- however, A is insufficient for all but
>>> some trivial cases.
>>>
>>> Comments welcome :-))
>>>
>>> Cheers, Andre.
>>>
>>>
>>> -- 
>>> +-----------------------------------------------------------------+
>>> | Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
>>> | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
>>> | Dept. of Computer Science         | mail: merzky at cs.vu.nl       |
>>> | De Boelelaan 1083a                | www:  http://www.merzky.net |
>>> | 1081 HV Amsterdam, Netherlands    |                             |
>>> +-----------------------------------------------------------------+
>>>




