[saga-rg] proposal for extended file IO

Tue Jun 14 02:03:59 CDT 2005

Quoting [John Shalf] (Jun 14 2005):
> 
> On Jun 13, 2005, at 11:40 AM, Andre Merzky wrote:
> 
> >Hi John,
> >
> >Quoting [John Shalf] (Jun 13 2005):
> >>
> >>Hi Andre,
> >>I think there is a 4th possibility.  If each of the I/O operations can
> >>be requested asynchronously, then you can get the same net effect as
> >>the ERET/ESTO functionality of the GridFTP.
> >
> >I disagree.  You can hide latency to some extent, but your
> >throughput suffers, utterly.
> 
> If you do a full gather-scatter I/O, then this is true (the length of 
> the request equals the size of the data item returned).  Even in such a 
> case, as long as the number of outstanding requests matches the 
> bandwidth-delay product of the network channel (as per Little's Law), 
> you still achieve full throughput.  However, the e-modes approach is 
> equally bad because it simply pushes an enormous amount of complexity 
> down a layer into the implementation.  I'm not sure which is worse.

:-)

> So the concerns I have are as follows
> 	1) The negotiations to find out the available eModes seem to 
> 	require that some complex modules be installed on both the client and 
> 	the server side of a system.

Only potentially complex, but yes, that's right.

>   One would hope that you could implement the 
>   capabilities you need using a smaller subset of elemental operations.  
>   For instance the stdio readv() and pread() functionality to describe 
>   gather/scatter type operations.

It's just not always possible.  For example, eread would
allow you to specify a subset of a jpeg image.  That cannot
be expressed as read operations at all.  

OTOH, one can argue that such operations allow for even more
semantic uncertainty...  From an application's point of view
it's very useful.

> 	2) The implementation looks way too close to one particular data 
> transport implementation.  I'm not convinced it is the best thing out 
> there for gather-scatter I/O over a high-latency interface.  Again, I'd 
> be interested in seeing the advantages/disadvantages of something 
> related to the POSIX/XPG gather/scatter I/O implementation.  They would 
> cover Jon's case.

It looks close to GridFTP, granted, but the idea _is_
generic.  Basically it says: describe your request as a
string (== opaque), and you get what you want.

I can't see that throughput really would peak - see below.

> 	3) Are the EModes() guaranteed to be stateless?  In the JPEG_Block 
> example you provide, it's not clear what the side-effects are with 
> regard to the file pointer.  If some EModes() have side-effects on the 
> file pointer state, whereas others do not, it's going to be impossibly 
> messy.

Yes, emodes are supposed to be stateless.  They don't
respect and don't move the file pointer.  That would be
messy indeed (jpeg).
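
For comparison, POSIX pread() already gives this kind of statelessness
for plain reads: the offset travels with the call and the file pointer
is left alone.  A minimal Python sketch, assuming a POSIX system where
os.pread is available; the scratch file and its contents are made up
for illustration only:

```python
import os
import tempfile

# Hypothetical scratch file, only for illustration.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
try:
    os.lseek(fd, 2, os.SEEK_SET)                # file pointer now at offset 2
    chunk = os.pread(fd, 3, 6)                  # positional read: 3 bytes at offset 6
    pos_after = os.lseek(fd, 0, os.SEEK_CUR)    # pointer has not moved
    stateful = os.read(fd, 3)                   # ordinary read starts at the pointer
    pos_final = os.lseek(fd, 0, os.SEEK_CUR)    # ... and advances it
finally:
    os.close(fd)
    os.unlink(path)
```

A stateless emode would behave like the pread call here: the request
carries everything needed to locate the data.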

> So my example wasn't very well thought out, but the higher-level point 
> I was trying to make is that I think there are more general ways to 
> encode data layout descriptions for remote patterned or gather-scatter 
> I/O operations than e-modes.  The arbitrariness of the modes and their 
> associated string parsing adds a sort of complexity that is a bit 
> daunting at first blush.

Ha :-)  When I first learned of ERET, I was afraid that people
would start to send large XML formatted data requests, and found
that idea terrible - it sounds like abuse of a data access
mechanism as a multi-purpose protocol.

OTOH, its ease of use for HDF5 hyperslabs is utterly
convincing I think.  The _request_ is just a string
describing a hyperslab.  Whatever intelligence you have in
gather-scatter I/O, the request size for a hyperslab can
easily match the size of the hyperslab itself, or exceed it
(readv needs one offset and one length to read a single byte.  
If your data are scattered bytewise...)
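
To put rough numbers on that, here is a small Python sketch.  The wire
encoding (an 8-byte offset plus an 8-byte length per element) and the
spec string syntax are assumptions chosen for illustration, not
anything GridFTP or the strawman defines:

```python
import struct

def readv_request_size(n_items, stride, item_size=1):
    """Bytes needed to encode one (offset, length) pair per wanted item."""
    pairs = [(i * stride, item_size) for i in range(n_items)]
    # Assumed encoding: two unsigned 64-bit integers per pair.
    encoded = b"".join(struct.pack("<QQ", off, length) for off, length in pairs)
    return len(encoded)

def spec_request_size(n_items, stride, item_size=1):
    """Bytes needed for a hypothetical semantic spec string."""
    spec = f"start=0,stride={stride},count={n_items},size={item_size}"
    return len(spec.encode("ascii"))

n = 100_000                       # wanted items: every second byte of the file
payload = n * 1                   # bytes actually wanted
vector_req = readv_request_size(n, stride=2)
spec_req = spec_request_size(n, stride=2)
print(payload, vector_req, spec_req)
```

Under these assumptions the offset/length list is 16 times larger than
the data it asks for, while the spec string stays a few dozen bytes no
matter how many items are selected.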

> >Imagine Jon's use case (it's a worst case scenario really): You
> >have a remote HDF5 file, and want a hyperslab.  Really worst
> >case is you want every second data item.
> >
> >Now, if you rely on read as is, you have to send one read
> >request for every single data item you want to read.  If you
> >interleave them asynchronously, you get reasonable latency,
> >but your throughput is, well, close to zero.
> 
> If the number of outstanding requests (in terms of bytes) is equal to 
> the bandwidth-delay product of the connection, then you will reach 
> peak.  Sadly, the way I posed the solution would die from the excessive 
> overhead of launching the threads.

I am sure that does not scale.  If a hyperslab describes one
megabyte of scattered data (in byte granularity, say:
subsample a 3D scalar field for lowres volrendering), then I
have 1 million read/write requests on the wire, each one
with its protocol overhead, processing overhead etc etc.
Mathematically you might be right, but in practice that won't
do any good I think.
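
The scaling concern can be made concrete with a back-of-the-envelope
model.  The 64 bytes of per-request overhead is an assumed figure for
headers and framing, picked only to show the trend:

```python
def wire_efficiency(payload_bytes, n_requests, overhead_per_request=64):
    """Fraction of transferred bytes that is useful payload."""
    total = payload_bytes + n_requests * overhead_per_request
    return payload_bytes / total

one_mb = 1_000_000
per_byte = wire_efficiency(one_mb, n_requests=one_mb)   # one request per byte
single = wire_efficiency(one_mb, n_requests=1)          # one eRead-style request
print(f"{per_byte:.4f} vs {single:.6f}")
```

With one request per byte, only about 1.5% of the traffic is payload in
this model; a single request for the same megabyte is almost pure
payload.  Processing cost per request on the remote side is not
modelled at all.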

eread has one small request, and one large response.
If an implementation thinks that the large response is
better split up (udpblast), then fine.  That's possible.  The
other way around is not possible (or much harder).
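
A toy Python sketch of that asymmetry: the selection is evaluated once
on the serving side, and the transport is free to cut the result into
pieces.  The function name, the stride parameter and the chunk size
are all hypothetical:

```python
def eread_strided(data, stride, chunk_size=4):
    """Serve one strided request; yield the result in transport-sized chunks."""
    selected = data[::stride]                 # semantic selection happens once, remotely
    for start in range(0, len(selected), chunk_size):
        yield selected[start:start + chunk_size]

remote_file = bytes(range(20))                # stand-in for the remote data
chunks = list(eread_strided(remote_file, stride=2))
result = b"".join(chunks)                     # client reassembles the pieces in order
```

Going the other direction, i.e. recovering the single semantic request
from a stream of tiny reads, would need the server to guess the
pattern.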

> >If you want to optimize your buffersize, you have to read
> >more than one data item CONSECUTIVELY.  Since the use case
> >says you are interested in every second data item, you
> >effectively have to read ALL data.
> 
> You would definitely not want to read them consecutively -- You'd want 
> to read all of the data items you need concurrently (thereby 
> necessitating that the file pointer offset be encoded in each request). 
>   I do agree with you that my off-the-cuff proposal for launching 
> one async task per data item is not practical due to the excessive 
> software overhead.  However, I don't see why you cannot launch as many 
> concurrent requests as you need to satisfy Little's Law.

Little's Law does not really apply I think.  It assumes that
the items on the wire are identical, and require the same time.
If you read bytewise, that just doesn't apply anymore: the
overhead gets larger than the payload.  So, the law of
course holds, but it's applied to different entities...
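
A small Python sketch of that distinction.  Little's Law gives the
number of items in flight as arrival rate times time in system; what
matters for the application is how much of each item is payload.  Link
speed, round trip time and the 64-byte overhead are assumed values:

```python
def outstanding_requests(bandwidth_bytes_per_s, rtt_s, bytes_per_request):
    """Little's Law: items in flight = arrival rate * time in system."""
    bdp = bandwidth_bytes_per_s * rtt_s          # bandwidth-delay product in bytes
    return bdp / bytes_per_request

def goodput(bandwidth_bytes_per_s, payload, overhead):
    """Useful bytes per second when the pipe is kept full."""
    return bandwidth_bytes_per_s * payload / (payload + overhead)

link = 125_000_000        # assumed 1 Gbit/s link, in bytes per second
rtt = 0.1                 # assumed 100 ms round trip
overhead = 64             # assumed per-response overhead in bytes

for payload in (1, 64 * 1024):
    in_flight = outstanding_requests(link, rtt, payload + overhead)
    print(payload, round(in_flight), round(goodput(link, payload, overhead)))
```

In this model the pipe can be kept full in both cases, but with 1-byte
payloads nearly all of what fills it is overhead.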

> >Same holds if you want every 10th data item - only the ratio
> >gets even worse.
> >So, interleaving works efficiently only for sufficiently
> >_large_ independent read requests (then it's perfect of
> >course).
> 
> That is curious... Interleaving on vector machines is used for 
> precisely the opposite purpose (for hundreds of very small independent 
> read requests). Latency hiding and throughput are intimately connected.
> 
> I would expect that all of the read requests for a hyperslab are 
> independent provided the file pointer state is encoded in the request.  
> This is precisely what the readv()/pread() does.
> 
> Should we find some case that causes problems for a readv/pread model?  
> The hyperslabbing is clearly not one of those cases.

It IS!  Again, for a hyperslab requesting every other byte
from a file, you send at least two bytes of request for every
byte you read: one for the offset, one for the length.
Additionally, you have some protocol overhead.  Additionally,
you force the remote side to process the request like this:
it cannot use hdf5 for efficient hyperslab IO, but has to use
read/seek.  Additionally, the response is equally bloated,
because you need to separate the individual response blocks
again (that can be avoided by matching the response to the
original request I guess).

If you want a more obvious example: jpeg subset.  It's
impossible to express in gather-scatter IO.

> >I think the task model and the proposed eRead model are
> >orthogonal.  The task model provides you asynchronicity,
> >the eRead provides you efficiency (throughput).
> 
> Pipelining is used to achieve throughput.  Pipelining is achieved via 
> concurrent async operations.  I agree that launching one task per byte 
> is going to be inefficient, but it is inefficient because of the 
> software overhead of launching a new task (not because async 
> request/response is inefficient).  SCSI disk interfaces and DDR DRAMs 
> depend on submitting async requests for data that get fulfilled later 
> (sometimes out-of-order).  They are achieving this goal of throughput 
> using a far simpler model than ERET/ESTO.  It's worth looking at simpler 
> models for defining deeply pipelined remote gather/scatter operations.
> 
> >Also, as a side note: I know about some of the discussions
> >the GridFTP folx had about efficient remote file IO.  They
> >have been similar to this one, and the ERET/ESTO model was
> >the one finally agreed on.
> 
> I'm not sure if the ERET/ESTO solves the problem at hand. The 
> complexity has been pushed to a different layer of the software stack.

Yes, right!  That's the point: it allows pushing semantic
information to a level where it can be used efficiently.
All other approaches I know strip the semantic information,
and boil the request down to generic small ops (as readv).

Really, I do not know _any_ implementation which can do
subsampling on remote data efficiently with small ops as
request, instead of a _semantic_ description of the
subsampling.
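
As an illustration of what keeping the semantic description buys, here
is a toy in-memory Python sketch shaped after the proposed
lsEModes/eRead calls.  The class, the "stride" emode and its
"start:stride:count" spec syntax are hypothetical and not part of the
strawman or of GridFTP:

```python
class ToyFile:
    """In-memory stand-in for a remote file with pluggable emodes."""

    def __init__(self, data):
        self._data = data
        # emode name -> handler(data, spec) returning bytes
        self._emodes = {"stride": self._stride}

    def ls_emodes(self):
        """List the emodes this implementation supports."""
        return sorted(self._emodes)

    def e_read(self, emode, spec):
        """Evaluate the spec where the data lives; return (buffer, length)."""
        if emode not in self._emodes:
            raise ValueError(f"unsupported emode: {emode}")
        buf = self._emodes[emode](self._data, spec)
        return buf, len(buf)

    @staticmethod
    def _stride(data, spec):
        # Hypothetical spec syntax "start:stride:count".
        start, stride, count = (int(x) for x in spec.split(":"))
        return data[start:start + stride * count:stride]

f = ToyFile(bytes(range(100)))
if "stride" in f.ls_emodes():
    buf, length = f.e_read("stride", "10:5:4")
```

The handler sees the whole pattern at once, so a real backend could map
it onto whatever it does best (an HDF5 hyperslab selection, say)
instead of replaying a list of small reads.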

Cheers, Andre :-))

> >Cheers, Andre.
> >>The only modification that would be useful to add to the tasking
> >>interface is a notion of "readFrom()" and "writeTo()" which allows you
> >>to specify the file offset together with the read.  Otherwise, the
> >>statefulness of the read() call would make the entire "task" interface
> >>useless with respect to file I/O.
> >>
> >>-john
> >>
> >>On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
> >>>Hi again,
> >>>
> >>>consider the following use case for remote IO.  Given a large
> >>>binary 2D field on a remote host, the client wants to access
> >>>a 2D sub portion of that field.  Depending on the remote
> >>>file layout, that usually requires more than one read
> >>>operation, since the standard read (offset, length) is
> >>>agnostic to the 2D layout.
> >>>
> >>>For more complex operations (subsampling, get a piece of a
> >>>jpg file), the number of remote operations grows very fast.
> >>>Latency then strongly discourages that type of remote IO.
> >>>
> >>>For that reason, I think that the remote file IO as
> >>>specified by SAGA's Strawman as is will only be usable for a
> >>>limited and trivial set of remote I/O use cases.
> >>>
> >>>There are three (basic) approaches:
> >>>
> >>> A) get the whole thing, and do ops locally
> >>>    Pro: - one remote op,
> >>>         - simple logic
> >>>         - remote side doesn't need to know about file
> >>>           structure
> >>>         - easily implementable on application level
> >>>    Con: - getting the header info of a 1GB data file comes
> >>>           with, well, some overhead ;-)
> >>>
> >>> B) clustering of calls: do many reads, but send them as a
> >>>    single request.
> >>>    Pro: - transparent to application
> >>>         - efficient
> >>>    Con: - need to know about dependencies of reads
> >>>           (a header read needed to determine size of
> >>>           field), or include explicit 'flushes'
> >>>         - need a protocol to support that
> >>>         - the remote side needs to support that
> >>>
> >>> C) data specific remote ops: send a high level command,
> >>>    and get exactly what you want.
> >>>    Pro: - most efficient
> >>>    Con: - need a protocol to support that
> >>>         - the remote side needs to support that _specific_
> >>>           command
> >>>
> >>>The last approach (C) is what I have the best experience with.
> >>>Also, that is what GridFTP as a common file access protocol
> >>>supports via ERET/ESTO operations.
> >>>
> >>>I want to propose to include a C-like extension to the File
> >>>API of the strawman, which basically maps well to GridFTP,
> >>>but should also map to other implementations of C.
> >>>
> >>>That extension would look like:
> >>>
> >>>     void lsEModes    (out array<string,1> emodes   );
> >>>     void eWrite      (in  string          emode,
> >>>                       in  string          spec,
> >>>                       in  string          buffer,
> >>>                       out long            len_out  );
> >>>     void eRead       (in  string          emode,
> >>>                       in  string          spec,
> >>>                       out string          buffer,
> >>>                       out long            len_out  );
> >>>
> >>>     - hooks for gridftp-like opaque ERET/ESTO features
> >>>     - spec:  string for pattern as in GridFTP's ESTO/ERET
> >>>     - emode: string for ident.  as in GridFTP's ESTO/ERET
> >>>
> >>>EMode:        a specific remote I/O command supported
> >>>lsEModes:     list the EModes available in this implementation
> >>>eRead/eWrite: read/write data according to the emode spec
> >>>
> >>>Example (in perl for brevity):
> >>>
> >>> my $file   = SAGA::File->new ("http://www.google.com/intl/en/images/logo.gif");
> >>> my @emodes = $file->lsEModes ();
> >>>
> >>> if ( grep (/^jpeg_block$/, @emodes) )
> >>> {
> >>>   my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
> >>> }
> >>>
> >>>I would discourage support for B, since I do not know any
> >>>protocol supporting that approach efficiently, and also it
> >>>needs approximately the same infrastructure setup as C.
> >>>
> >>>As A is easily implementable on application level, or within
> >>>any SAGA implementation, there is no need for support on API
> >>>level -- however, A is insufficient for all but some trivial
> >>>cases.
> >>>
> >>>Comments welcome :-))
> >>>
> >>>Cheers, Andre.
> >>>
> >>>
> >>>-- 
> >>>+-----------------------------------------------------------------+
> >>>| Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
> >>>| Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
> >>>| Dept. of Computer Science         | mail: merzky at cs.vu.nl       |
> >>>| De Boelelaan 1083a                | www:  http://www.merzky.net |
> >>>| 1081 HV Amsterdam, Netherlands    |                             |
> >>>+-----------------------------------------------------------------+
> >>>
