[saga-rg] proposal for extended file IO

Mon Jun 13 09:49:59 CDT 2005

Hi Andre,

Coincidentally, I'm looking at a very similar thing right now.  I'm  
trying to extend an archive which I've been building here at CCT.  In  
the archive currently, we have netCDF files, for the coastal  
modelers, which support this kind of subsetting.  We also plan to  
roll out the archive to the physicists, who will want to put their  
huge HDF5 files in the archive, and then do hyperslabbing on these  
(essentially some kind of subset, but with a cool name).

I had imagined passing some specification to the archive, represented  
by attribute/value pairs, along with the LogicalFileName.  The  
service on the other end would prepare the data for me, place it in a  
temporary store, and return the URLs of the prepared file.  I  
would then access the file in the normal way.

When your original dataset is 1TB, you have problems.  You can't  
simply prepare the data in the time that it takes to do a call and  
reply.  You need to go asynchronous.  With the solution I've gone  
for, I can simply say "this isn't ready yet, but I'm working on it"  
rather than returning the URLs.  The user can check back later  
(polling), or I can tell them when it's ready (notification).  Then  
they access the data.
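
The polling variant described above might look roughly like the
following Python sketch.  Everything in it is an assumption made for
illustration: the Archive class, request_subset, status,
wait_for_urls, and the file:// URLs are hypothetical and do not
correspond to the actual CCT archive or to any SAGA call.

```python
import itertools
import time


class Archive:
    """Toy in-memory stand-in for the archive service (hypothetical)."""

    def __init__(self, polls_until_ready=3):
        # Pretend the data becomes ready on the Nth status check.
        self._polls_until_ready = polls_until_ready
        self._ids = itertools.count(1)
        self._requests = {}

    def request_subset(self, logical_file_name, spec):
        """Register a subsetting request and return a ticket immediately."""
        ticket = next(self._ids)
        self._requests[ticket] = {
            "lfn": logical_file_name,
            "spec": dict(spec),  # attribute/value pairs describing the subset
            "polls": 0,
        }
        return ticket

    def status(self, ticket):
        """Return ("pending", []) or ("ready", [urls]) for a ticket."""
        req = self._requests[ticket]
        req["polls"] += 1
        if req["polls"] < self._polls_until_ready:
            return "pending", []
        url = "file:///tmp/archive/%d/%s" % (ticket, req["lfn"])
        return "ready", [url]


def wait_for_urls(archive, ticket, interval=0.0, max_polls=10):
    """Poll until the prepared file is ready, then return its URLs."""
    for _ in range(max_polls):
        state, urls = archive.status(ticket)
        if state == "ready":
            return urls
        time.sleep(interval)
    raise TimeoutError("data not ready after %d polls" % max_polls)
```

The request returns a ticket straight away, so the call-and-reply
stays short no matter how large the original dataset is; only the
later status checks learn whether the URLs exist yet.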

How do you make your proposed eRead operation "go asynchronous" if  
things would take a long time?  Or would the first read just hang  
until the data was prepared?
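
For illustration only, the two behaviours in question (a read that
hangs versus one that goes asynchronous) can be contrasted in a few
lines of Python.  The functions e_read and e_read_async are
hypothetical stand-ins, not the proposed SAGA calls, and the sleep
merely simulates slow preparation on the server side.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def e_read(emode, spec, delay=0.05):
    """Hypothetical blocking read: the caller hangs until data is ready."""
    time.sleep(delay)  # stands in for slow server-side preparation
    buffer = "%s:%s" % (emode, spec)  # fake payload, for the sketch only
    return buffer, len(buffer)


def e_read_async(executor, emode, spec, on_ready=None):
    """Start e_read in the background and return a Future right away."""
    future = executor.submit(e_read, emode, spec)
    if on_ready is not None:
        # Notification: the callback runs once the result is available.
        future.add_done_callback(lambda f: on_ready(*f.result()))
    return future
```

A caller could then either poll future.done(), block on
future.result() when it actually needs the data, or pass on_ready to
be notified.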

Jon.

On Jun 13, 2005, at 5:38 AM, Andre Merzky wrote:

> Hallo Hartmut,
>
> Quoting [Hartmut Kaiser] (Jun 13 2005):
>
>>
>> Agreed here.
>>
>>
>>> That extension would look like:
>>>
>>>       void lsEModes   (out array<string,1> emodes   );
>>>       void eWrite      (in  string          emode,
>>>                         in  string          spec,
>>>                         in  string          buffer,
>>>                         out long            len_out  );
>>>       void eRead       (in  string          emode,
>>>                         in  string          spec,
>>>                         out string          buffer,
>>>                         out long            len_out  );
>>>
>>>       - hooks for gridftp-like opaque ERET/ESTO features
>>>       - spec:  string for pattern as in GridFTP's ESTO/ERET
>>>       - emode: string for ident.  as in GridFTP's ESTO/ERET
>>>
>>> EMode:        a specific remote I/O command supported
>>> lsEModes:     list the EModes available in this implementation
>>> eRead/eWrite: read/write data according to the emode spec
>>>
>>> Example (in perl for brevity):
>>>
>>>   my $file   = SAGA::File->new
>>>     ("http://www.google.com/intl/en/images/logo.gif");
>>>   my @emodes = $file->lsEModes ();
>>>
>>>   if ( grep (/^jpeg_block$/, @emodes) )
>>>   {
>>>     my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
>>>   }
>>>
>>> I would discourage support for B, since I do not know any
>>> protocol supporting that approach efficiently, and also it
>>> needs approximately the same infrastructure setup as C.
>>>
>>> As A is easily implementable on application level, or within
>>> any SAGA implementation, there is no need for support on API
>>> level -- however, A is insufficient for all but some trivial cases.
>>>
>>
>> This approach is very generic on the API level (that's good) but
>> requires exact agreement between the client and the server on the
>> command syntax used, which may become problematic. If we go this
>> route we will definitely end up specifying at least a minimal
>> command subset to be supported by the eRead/eWrite commands.
>>
>
> You are right: complexity does not go away magically, but
> gets moved to the specification of the eModes.
>
> As for a minimal set: I do not think that this is necessary
> - the eMode is SUPPOSED to be application specific.  OTOH, an
> intuitive example usable for some cases may be helpful.
> GridFTP's standard ERET example is partial file access (IIRC:
> filename, offset, length).  That is not very useful for
> SAGA, since that is already covered by the normal read/write
> operations.
>
>
>
>> I simply fear we'll have the same problems we have with the GAT
>> today. The GAT API is in principle usable in a broad range of use
>> cases based on a generic API. The genericity is ensured by using
>> key/value tables in the API itself, allowing quick adaptation to any
>> concrete need. The problem is the missing specification of these
>> key/value pairs which makes it difficult to achieve reusability.
>>
>
> I absolutely agree that the problem lies right there:
> semantic overloading of strings.  The situation is somewhat
> better than in GAT though:
>
>   - the preferences in GAT are really generic, and can be
>     used for anything.  The eModes have a very limited
>     scope, and are hence much easier to agree on between
>     different implementations
>
>   - the mapping to GridFTP is 1:1, and GridFTP is quite
>     commonly used, so there is at least one other instance
>     to be used for agreement on the modes.  Hence, every
>     implementation of an eMode can be expected to do the same
>     thing. At least there is a good chance for that.
>
> However, again: you are right.  Semantic overloading of
> strings is not a nice thing to do, and is here only
> justified by a lack of obvious alternatives.
>
> Thanks, Andre.
>
>
>>
>> Regards Hartmut
>>
>>
>
>
>
> -- 
> +-----------------------------------------------------------------+
> | Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
> | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
> | Dept. of Computer Science         | mail: merzky at cs.vu.nl       |
> | De Boelelaan 1083a                | www:  http://www.merzky.net |
> | 1081 HV Amsterdam, Netherlands    |                             |
> +-----------------------------------------------------------------+
>
>