[saga-rg] proposal for extended file IO
John Shalf
jshalf at lbl.gov
Tue Jun 14 01:25:18 CDT 2005
On Jun 13, 2005, at 11:40 AM, Andre Merzky wrote:
> Hi John,
>
> Quoting [John Shalf] (Jun 13 2005):
>>
>> Hi Andre,
>> I think there is a 4th possibility. If each of the I/O operations can
>> be requested asynchronously, then you can get the same net effect as
>> the ERET/ESTO functionality of the GridFTP.
>
> I disagree. You can hide latency to some extent, but your
> throughput suffers, utterly.
If you do a full gather-scatter I/O, then this is true (the length of
the request equals the size of the data item returned). Even in such a
case, as long as the number of outstanding requests matches the
bandwidth-delay product of the network channel (as per Little's Law),
you still achieve full throughput. However, the e-modes approach is
equally bad because it simply pushes an enormous amount of complexity
down a layer into the implementation. I'm not sure which is worse.
So the concerns I have are as follows
1) The negotiation to find out the available eModes seems to require
that some complex modules be installed on both the client and the
server side of a system. One would hope that you could implement the
capabilities you need using a smaller subset of elemental operations,
for instance the POSIX readv() and pread() functionality to describe
gather/scatter type operations.
2) The implementation looks way too close to one particular data
transport implementation. I'm not convinced it is the best thing out
there for gather-scatter I/O over a high-latency interface. Again, I'd
be interested in seeing the advantages/disadvantages of something
related to the POSIX/XPG gather/scatter I/O implementation. They would
cover Jon's case.
3) Are the EModes() guaranteed to be stateless? In the JPEG_Block
example you provide, it's not clear what the side-effects are with
regard to the file pointer. If some EModes() have side-effects on the
file pointer state, whereas others do not, it's going to be impossibly
messy.
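To make the statelessness concern concrete, here is a minimal sketch
(Python on a POSIX system, using the standard os.pread call) of a read
that carries its own offset and leaves the shared file pointer alone,
next to a plain read() that moves it. The name read_from is
hypothetical and only stands in for the kind of call discussed in this
thread; it is not part of any SAGA specification.

```python
import os
import tempfile


def read_from(fd: int, offset: int, length: int) -> bytes:
    """Hypothetical stateless read: the offset is part of the request."""
    return os.pread(fd, length, offset)


def demo() -> tuple:
    with tempfile.TemporaryFile() as f:
        f.write(bytes(range(16)))
        f.flush()
        fd = f.fileno()
        os.lseek(fd, 0, os.SEEK_SET)   # put the shared file pointer at 0

        chunk = read_from(fd, 8, 4)    # positional read of bytes 8..11
        pos_after_pread = os.lseek(fd, 0, os.SEEK_CUR)

        os.read(fd, 4)                 # classic read() advances the pointer
        pos_after_read = os.lseek(fd, 0, os.SEEK_CUR)
    return chunk, pos_after_pread, pos_after_read
```

The positional read does not disturb the file pointer, so any number of
such requests can be issued in any order without interfering with one
another.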
So my example wasn't very well thought out, but the higher-level point
I was trying to make is that I think there are more general ways to
encode data layout descriptions for remote patterned or gather-scatter
I/O operations than e-modes. The arbitrariness of the modes and their
associated string parsing adds a sort of complexity that is a bit
daunting at first blush.
> Imagine Jon's use case (it's a worst case scenario really): You
> have a remote HDF5 file, and want a hyperslab. Really worst
> case is you want every second data item.
>
> Now, if you rely on read as is, you have to send one read
> request for every single data item you want to read. If you
> interleave them asynchronously, you get reasonable latency,
> but your throughput is, well, close to zero.
If the number of outstanding requests (in terms of bytes) is equal to
the bandwidth-delay product of the connection, then you will reach
peak throughput. Sadly, the way I posed the solution would die from the
excessive overhead of launching the threads.
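As a rough worked example of the bandwidth-delay argument, the sketch
below computes how many requests must be in flight to fill a link. The
link figures (1 Gbit/s, 100 ms round trip) and the request sizes are
illustrative assumptions, not measurements, and the function names are
made up for this example.

```python
def bandwidth_delay_product(bandwidth_bytes_per_s: int, rtt_ms: int) -> int:
    """Bytes that must be in flight to keep the link busy."""
    return bandwidth_bytes_per_s * rtt_ms // 1000


def requests_in_flight(bandwidth_bytes_per_s: int, rtt_ms: int,
                       request_bytes: int) -> int:
    """Number of outstanding requests needed to cover the
    bandwidth-delay product, rounded up."""
    bdp = bandwidth_delay_product(bandwidth_bytes_per_s, rtt_ms)
    return -(-bdp // request_bytes)  # ceiling division on integers


GIGABIT = 125_000_000  # 1 Gbit/s expressed in bytes per second (assumed link)

bdp = bandwidth_delay_product(GIGABIT, 100)           # 12_500_000 bytes
per_item = requests_in_flight(GIGABIT, 100, 8)        # 8-byte data items
per_block = requests_in_flight(GIGABIT, 100, 65536)   # 64 KiB requests
```

Under these assumptions, one request per 8-byte item needs about 1.5
million requests in flight, while 64 KiB requests need only 191. The
arithmetic says full throughput is reachable in principle; the
per-request software overhead decides whether it is reachable in
practice.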
> If you want to optimize your buffersize, you have to read
> more than one data item CONSECUTIVELY. Since the use case
> says you are interested in every second data item, you
> effectively have to read ALL data.
You would definitely not want to read them consecutively -- You'd want
to read all of the data items you need concurrently (thereby
necessitating that the file pointer offset be encoded in each request).
I do agree with you that my off-the-cuff proposal for launching
one async task per data item is not practical due to the excessive
software overhead. However, I don't see why you cannot launch as many
concurrent requests as you need to satisfy Little's Law.
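One way to bound the number of concurrent requests without launching a
task per data item is a fixed-size worker pool. The sketch below reads
every second item of a local file through os.pread, with each request
carrying its own offset. The 8-byte item layout, the pool size of 16,
and the names read_items and demo are assumptions made for
illustration; a remote implementation would replace os.pread with a
network request.

```python
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor

ITEM = struct.Struct("<q")  # assumed layout: 8-byte little-endian integers


def read_items(fd, indices, max_in_flight):
    """Read the items at the given indices with at most
    max_in_flight positional reads running at once."""
    def fetch(i):
        raw = os.pread(fd, ITEM.size, i * ITEM.size)  # offset travels with the request
        return ITEM.unpack(raw)[0]

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # pool.map returns results in request order, whatever the
        # order in which the workers finish.
        return list(pool.map(fetch, indices))


def demo(n_items=1000, max_in_flight=16):
    with tempfile.TemporaryFile() as f:
        f.write(b"".join(ITEM.pack(i * 10) for i in range(n_items)))
        f.flush()
        # Worst case from the thread: every second data item.
        return read_items(f.fileno(), range(0, n_items, 2), max_in_flight)
```

The pool size plays the role of the number of outstanding requests in
Little's Law; it can be tuned independently of how many items are
requested.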
> Same holds if you want every 10th data item - only the ratio
> gets even worse.
> So, interleaving works efficiently only for sufficiently
> _large_ independent read requests (then it's perfect of
> course).
That is curious... Interleaving on vector machines is used for
precisely the opposite purpose (for hundreds of very small independent
read requests). Latency hiding and throughput are intimately connected.
I would expect that all of the read requests for a hyperslab are
independent provided the file pointer state is encoded in the request.
This is precisely what the readv()/pread() model does.
Should we find some case that causes problems for a readv/pread model?
The hyperslabbing is clearly not one of those cases.
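To illustrate how a hyperslab reduces to independent positional reads,
the sketch below turns a 2D selection on a row-major field into a list
of (offset, length) pairs and services each with os.pread. The
row-major layout, the helper names hyperslab_requests and
read_hyperslab, and the use of Python range objects to describe the
selection are all assumptions for this example.

```python
import os
import tempfile


def hyperslab_requests(row_len, item_size, rows, cols):
    """Describe a 2D selection as (byte offset, byte length) pairs.
    rows and cols are range objects over a row-major field that has
    row_len items per row."""
    requests = []
    for r in rows:
        if not cols:
            continue
        if cols.step == 1:
            # Contiguous columns: one request per row.
            requests.append(((r * row_len + cols.start) * item_size,
                             len(cols) * item_size))
        else:
            # Strided columns: one request per item.
            requests.extend(((r * row_len + c) * item_size, item_size)
                            for c in cols)
    return requests


def read_hyperslab(fd, requests):
    """Service each request independently; no shared file pointer is used."""
    return b"".join(os.pread(fd, length, offset) for offset, length in requests)


def demo():
    with tempfile.TemporaryFile() as f:
        f.write(bytes(range(24)))      # a 4 x 6 field of 1-byte items
        f.flush()
        fd = f.fileno()
        block = hyperslab_requests(6, 1, range(1, 3), range(2, 5))
        strided = hyperslab_requests(6, 1, range(0, 1), range(0, 6, 2))
        return block, read_hyperslab(fd, block), read_hyperslab(fd, strided)
```

Because every pair is self-describing, the requests can be issued
sequentially as here, or handed to a concurrent scheduler, without
changing the result.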
> I think the task model and the proposed eRead model are
> orthogonal. The task model provides you asynchronicity,
> the eRead provides you efficiency (throughput).
Pipelining is used to achieve throughput. Pipelining is achieved via
concurrent async operations. I agree that launching one task per byte
is going to be inefficient, but it is inefficient because of the
software overhead of launching a new task (not because async
request/response is inefficient). SCSI disk interfaces and DDR DRAMs
depend on submitting async requests for data that get fulfilled later
(sometimes out-of-order). They are achieving this goal of throughput
using a far simpler model than ERET/ESTO. It's worth looking at simpler
models for defining deeply pipelined remote gather/scatter operations.
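The request/response pattern described here, where tagged requests are
submitted up front and replies arrive later and possibly out of order,
can be sketched with Python's asyncio. The network is simulated with
asyncio.sleep, the latencies are invented, and fetch and pipelined_read
are hypothetical names; this only illustrates the pattern, not any
particular transport.

```python
import asyncio


async def fetch(tag, data, offset, length, latency_s):
    """Simulated remote read: the reply carries the request's tag."""
    await asyncio.sleep(latency_s)  # stands in for a network round trip
    return tag, data[offset:offset + length]


async def pipelined_read(data, requests, latencies):
    """Submit all requests at once, then place replies by tag as they
    arrive. Returns the assembled bytes and the completion order."""
    tasks = [
        asyncio.create_task(fetch(tag, data, offset, length, latency))
        for tag, ((offset, length), latency)
        in enumerate(zip(requests, latencies))
    ]
    parts = [b""] * len(tasks)
    completion_order = []
    for reply in asyncio.as_completed(tasks):
        tag, payload = await reply
        completion_order.append(tag)
        parts[tag] = payload        # the tag restores request order
    return b"".join(parts), completion_order


def demo():
    data = bytes(range(32))
    requests = [(0, 4), (8, 4), (16, 4)]
    latencies = [0.15, 0.05, 0.10]  # invented round-trip times in seconds
    return asyncio.run(pipelined_read(data, requests, latencies))
```

All three requests overlap, so the total wait is roughly the longest
single latency rather than the sum, and the tag is all that is needed
to reassemble the data.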
> Also, as a side note: I know about some of the discussions
> the GridFTP folx had about efficient remote file IO. They
> have been similar to this one, and the ERET/ESTO model was
> the one finally agreed on.
I'm not sure if the ERET/ESTO solves the problem at hand. The
complexity has been pushed to a different layer of the software stack.
> Cheers, Andre.
>> The only modification that would be useful to add to the tasking
>> interface is a notion of "readFrom()" and "writeTo()" which allows you
>> to specify the file offset together with the read. Otherwise, the
>> statefulness of the read() call would make the entire "task" interface
>> useless with respect to file I/O.
>>
>> -john
>>
>> On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
>>> Hi again,
>>>
>>> consider following use case for remote IO. Given a large
>>> binary 2D field on a remote host, the client wants to access
>>> a 2D sub portion of that field. Depending on the remote
>>> file layout, that usually requires more than one read
>>> operation, since the standard read (offset, length) is
>>> agnostic to the 2D layout.
>>>
>>> For more complex operations (subsampling, get a piece of a
>>> jpg file), the number of remote operations grows very fast.
>>> Latency then strongly discourages that type of remote IO.
>>>
>>> For that reason, I think that the remote file IO as
>>> specified by SAGA's Strawman as is will only be usable for a
>>> limited and trivial set of remote I/O use cases.
>>>
>>> There are three (basic) approaches:
>>>
>>> A) get the whole thing, and do ops locally
>>> Pro: - one remote op,
>>> - simple logic
>>> - remote side doesn't need to know about file
>>> structure
>>> - easily implementable on application level
>>> Con: - getting the header info of a 1GB data file comes
>>> with, well, some overhead ;-)
>>>
>>> B) clustering of calls: do many reads, but send them as a
>>> single request.
>>> Pro: - transparent to application
>>> - efficient
>>> Con: - need to know about dependencies of reads
>>> (a header read needed to determine size of
>>> field), or include explicit 'flushes'
>>> - need a protocol to support that
>>> - the remote side needs to support that
>>>
>>> C) data specific remote ops: send a high level command,
>>> and get exactly what you want.
>>> Pro: - most efficient
>>> Con: - need a protocol to support that
>>> - the remote side needs to support that _specific_
>>> command
>>>
>>> The last approach (C) is what I have best experiences with.
>>> Also, that is what GridFTP as a common file access protocol
>>> supports via ERET/ESTO operations.
>>>
>>> I want to propose to include a C-like extension to the File
>>> API of the strawman, which basically maps well to GridFTP,
>>> but should also map to other implementations of C.
>>>
>>> That extension would look like:
>>>
>>> void lsEModes (out array<string,1> emodes );
>>> void eWrite (in string emode,
>>> in string spec,
>>> in string buffer,
>>> out long len_out );
>>> void eRead (in string emode,
>>> in string spec,
>>> out string buffer,
>>> out long len_out );
>>>
>>> - hooks for gridftp-like opaque ERET/ESTO features
>>> - spec: string for pattern as in GridFTP's ESTO/ERET
>>> - emode: string for ident. as in GridFTP's ESTO/ERET
>>>
>>> EMode: a specific remote I/O command supported
>>> lsEModes: list the EModes available in this implementation
>>> eRead/eWrite: read/write data according to the emode spec
>>>
>>> Example (in perl for brevity):
>>>
>>> my $file = SAGA::File->new
>>> ("http://www.google.com/intl/en/images/logo.gif");
>>> my @emodes = $file->lsEModes ();
>>>
>>> if ( grep (/^jpeg_block$/, @emodes) )
>>> {
>>> my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
>>> }
>>>
>>> I would discourage support for B, since I do not know any
>>> protocol supporting that approach efficiently, and also it
>>> needs approximately the same infrastructure setup as C.
>>>
>>> As A is easily implementable on application level, or within
>>> any SAGA implementation, there is no need for support on API
>>> level -- however, A is insufficient for all but some trivial
>>> cases.
>>>
>>> Comments welcome :-))
>>>
>>> Cheers, Andre.
>>>
>>>
>>> --
>>> +-----------------------------------------------------------------+
>>> | Andre Merzky | phon: +31 - 20 - 598 - 7759 |
>>> | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
>>> | Dept. of Computer Science | mail: merzky at cs.vu.nl |
>>> | De Boelelaan 1083a | www: http://www.merzky.net |
>>> | 1081 HV Amsterdam, Netherlands | |
>>> +-----------------------------------------------------------------+
>>>