[saga-rg] proposal for extended file IO
John Shalf
jshalf at lbl.gov
Mon Jun 13 12:44:04 CDT 2005
On Jun 13, 2005, at 7:49 AM, Jon MacLaren wrote:
> Hi Andre,
> Coincidentally, I'm looking at a very similar thing right now. I'm
> trying to extend an archive which I've been building here at CCT. In
> the archive currently, we have netCDF files, for the coastal modelers,
> which support this kind of subsetting. We also plan to roll out the
> archive to the physicists, who will want to put their huge HDF5 files
> in the archive, and then do hyperslabbing on these (essentially some
> kind of subset, but with a cool name).
This is very similar to the SRB-HDF5 archive system that they are
developing at SDSC.
http://hdf.ncsa.uiuc.edu/RFC/hdf5srb/Integrating_HDF5_with_SRB_ag_talk.ppt
It's very interesting that so many groups are converging on this sort of
file archiving strategy.
> I had imagined passing some specification to the archive, represented
> by attribute/value pairs, along with the LogicalFileName. The service
> at the other end would prepare the data for me, place it in a
> temporary store, and return the URLs to the prepared file. I would
> then access the file in the normal way.
>
> When your original dataset is 1TB, you have problems. You can't
> simply prepare the data in the time that it takes to do a call and
> reply. You need to go asynchronous. With the solution I've gone for,
> I can simply say "this isn't ready yet, but I'm working on it" rather
> than returning the URLs. The user can check back later (polling), or
> I can tell them when it's ready (notification). Then they access the
> data.
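
The polling/notification pattern described above can be sketched roughly
as follows. This is only an illustration, not the archive's actual
interface: SubsetRequest, fake_prepare, the file name and the staging URL
are all made-up names, and a background thread stands in for the remote
service.

```python
import threading
import time


class SubsetRequest:
    """Hypothetical handle for a subset being prepared in the background."""

    def __init__(self, logical_file_name, spec, prepare):
        self.urls = None                  # filled in once the data is staged
        self._done = threading.Event()
        self._thread = threading.Thread(
            target=self._run, args=(logical_file_name, spec, prepare))
        self._thread.start()

    def _run(self, logical_file_name, spec, prepare):
        self.urls = prepare(logical_file_name, spec)
        self._done.set()                  # set only after urls is assigned

    def ready(self):
        """Polling: returns immediately, True once the URLs are available."""
        return self._done.is_set()

    def wait(self, timeout=None):
        """Notification-style: block until ready or until timeout expires."""
        return self._done.wait(timeout)


def fake_prepare(logical_file_name, spec):
    """Stand-in for the archive service; a real one might take hours."""
    time.sleep(0.2)
    query = "&".join(f"{k}={v}" for k, v in sorted(spec.items()))
    return [f"file:///tmp/staging/{logical_file_name}?{query}"]


request = SubsetRequest("run42.h5", {"dataset": "density", "step": 10},
                        fake_prepare)
while not request.ready():   # "this isn't ready yet, but I'm working on it"
    time.sleep(0.05)
print(request.urls)
```

The caller never blocks inside the request itself; it either polls
ready() or calls wait() and is woken up when the data has been staged.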
>
> How do you make your proposed eRead operation "go asynchronous" if
> things would take a long time? Or would the first read just hang
> until the data was prepared?
>
> Jon.
>
>
> On Jun 13, 2005, at 5:38 AM, Andre Merzky wrote:
>
>> Hallo Hartmut,
>>
>> Quoting [Hartmut Kaiser] (Jun 13 2005):
>>
>>>
>>> Agreed here.
>>>
>>>
>>>> That extension would look like:
>>>>
>>>> void lsEModes (out array<string,1> emodes );
>>>> void eWrite   (in  string emode,
>>>>                in  string spec,
>>>>                in  string buffer,
>>>>                out long   len_out );
>>>> void eRead    (in  string emode,
>>>>                in  string spec,
>>>>                out string buffer,
>>>>                out long   len_out );
>>>>
>>>> - hooks for gridftp-like opaque ERET/ESTO features
>>>> - spec: string for pattern as in GridFTP's ESTO/ERET
>>>> - emode: string for ident. as in GridFTP's ESTO/ERET
>>>>
>>>> EMode: a specific remote I/O command supported
>>>> lsEModes: list the EModes available in this implementation
>>>> eRead/eWrite: read/write data according to the emode spec
>>>>
>>>> Example (in perl for brevity):
>>>>
>>>> my $file = SAGA::File->new
>>>>     ("http://www.google.com/intl/en/images/logo.gif");
>>>> my @emodes = $file->lsEModes ();
>>>>
>>>> if ( grep (/^jpeg_block$/, @emodes) )
>>>> {
>>>>     my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
>>>> }
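
For comparison, here is a minimal, self-contained sketch of how the
lsEModes/eRead pair might look from the implementation side. It is not
part of the proposal: ExtendedFile, register_reader, the "block" emode
and the reading of "WxH+X+Y" as width, height and corner offset are all
assumptions made for the example, and a list of strings stands in for the
remote file.

```python
import re


class ExtendedFile:
    """Hypothetical in-memory stand-in for a SAGA file with extended I/O."""

    def __init__(self, rows):
        self._rows = rows          # data as a list of equal-length strings
        self._readers = {}         # emode name -> function(rows, spec) -> str

    def register_reader(self, emode, reader):
        self._readers[emode] = reader

    def ls_emodes(self):
        """Counterpart of lsEModes: names of the supported extended modes."""
        return sorted(self._readers)

    def e_read(self, emode, spec):
        """Counterpart of eRead: returns (buffer, len_out)."""
        if emode not in self._readers:
            raise ValueError(f"unsupported emode: {emode}")
        buffer = self._readers[emode](self._rows, spec)
        return buffer, len(buffer)


def read_block(rows, spec):
    """Assumed meaning of 'WxH+X+Y': a W-by-H block whose corner is (X, Y)."""
    match = re.fullmatch(r"(\d+)x(\d+)\+(\d+)\+(\d+)", spec)
    if match is None:
        raise ValueError(f"malformed block spec: {spec}")
    width, height, x, y = (int(group) for group in match.groups())
    return "".join(row[x:x + width] for row in rows[y:y + height])


f = ExtendedFile(["abcdef", "ghijkl", "mnopqr", "stuvwx"])
f.register_reader("block", read_block)

if "block" in f.ls_emodes():
    buff, length = f.e_read("block", "3x2+1+1")
    print(buff, length)
```

The API itself only ever sees two opaque strings; everything that gives
them meaning lives in the reader registered for that emode.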
>>>>
>>>> I would discourage support for B, since I do not know any
>>>> protocol supporting that approach efficiently, and also it
>>>> needs approximately the same infrastructure setup as C.
>>>>
>>>> As A is easily implementable on application level, or within
>>>> any SAGA implementation, there is no need for support on API
>>>> level -- however, A is insufficient for all but some trivial cases.
>>>>
>>>
>>> This approach is very generic on the API level (that's good) but
>>> requires exact agreement on the used command syntax for the client
>>> and the server, which may get problematic. If we go this route we
>>> will definitely end up specifying at least a minimal command subset
>>> to be supported by the eRead/eWrite commands.
>>>
>>
>> You are right: complexity does not go away magically, but
>> gets moved to the specification of the eModes.
>>
>> As for a minimal set: I do not think that this is necessary
>> - the eMode is SUPPOSED to be application specific. OTOH, an
>> intuitive example usable for some cases may be helpful.
>> GridFTP's standard ERET example is partial file access (IIRC:
>> filename, offset, length). That is not very useful for
>> SAGA, since that is already covered by the normal read/write
>> operations.
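
To illustrate why a partial-access emode adds nothing over normal
read/write, here is a small sketch. The "partial" emode and its
"offset,length" spec format are invented for the example and are not
GridFTP's actual ERET syntax; an in-memory byte stream stands in for the
file.

```python
import io


def read_partial(stream, offset, length):
    """Read `length` bytes starting at `offset` using ordinary file calls."""
    stream.seek(offset)
    return stream.read(length)


def e_read_partial(stream, spec):
    """Hypothetical 'partial' emode; spec format 'offset,length' is assumed."""
    offset, length = (int(part) for part in spec.split(","))
    return read_partial(stream, offset, length)


data = io.BytesIO(b"0123456789abcdef")
plain = read_partial(data, 4, 6)
extended = e_read_partial(data, "4,6")
print(plain, extended)
```

Both calls return the same bytes, so the extended form only adds a string
to parse on top of a plain seek and read.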
>>
>>
>>
>>> I simply fear we'll have the same problems we have with the GAT
>>> today. The GAT API is in principle usable in a broad range of use
>>> cases based on a generic API. The genericity is ensured by using
>>> key/value tables in the API itself, allowing quick adaptation to
>>> any concrete need. The problem is the missing specification of
>>> these key/value pairs which makes it difficult to achieve
>>> reusability.
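
One way to picture the missing piece is a shared, written-down schema
that both sides check their key/value tables against. The sketch below is
purely illustrative: PREFERENCE_SCHEMA, its keys and
validate_preferences are made up for the example and are not taken from
GAT.

```python
# Hypothetical schema: key -> (expected type, description). Not from GAT.
PREFERENCE_SCHEMA = {
    "protocol": (str, "transfer protocol name"),
    "timeout_s": (int, "timeout in seconds"),
}


def validate_preferences(preferences):
    """Reject keys or value types that the agreed schema does not define."""
    errors = []
    for key, value in preferences.items():
        if key not in PREFERENCE_SCHEMA:
            errors.append(f"unknown key: {key}")
        elif not isinstance(value, PREFERENCE_SCHEMA[key][0]):
            errors.append(f"wrong type for {key}")
    return errors


print(validate_preferences({"protocol": "gridftp", "timeout_s": 30}))
print(validate_preferences({"protcol": "gridftp", "timeout_s": "30"}))
```

Without such an agreed table, a misspelled or differently typed key is
silently accepted by one implementation and ignored by another.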
>>>
>>
>> I absolutely agree that the problem lies right there:
>> semantic overloading of strings. The situation is somewhat
>> better than in GAT though:
>>
>> - the preferences in GAT are really generic, and can be
>>   used for anything. The eModes have a very limited
>>   scope, and are hence much easier to agree on between
>>   different implementations
>>
>> - as the mapping to GridFTP is 1:1, and GridFTP is quite
>>   commonly used, there is at least some other instance
>>   to be used for agreement on the modes. Hence, every
>>   implementation of an eMode can be expected to do the same
>>   thing. At least there is a good chance for that.
>>
>> However, again: you are right. Semantic overloading of
>> strings is not a nice thing to do, and is here only
>> justified by a lack of obvious alternatives.
>>
>> Thanks, Andre.
>>
>>
>>>
>>> Regards Hartmut
>>>
>>>
>>
>>
>>
>> --
>> +-----------------------------------------------------------------+
>> | Andre Merzky | phon: +31 - 20 - 598 - 7759 |
>> | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
>> | Dept. of Computer Science | mail: merzky at cs.vu.nl |
>> | De Boelelaan 1083a | www: http://www.merzky.net |
>> | 1081 HV Amsterdam, Netherlands | |
>> +-----------------------------------------------------------------+
>>
>>
>