[SAGA-RG] Fwd (mathijs at cs.vu.nl): Re: speeding up (???) Xterior

Andre Merzky andre at merzky.net
Sat Oct 17 22:56:20 CDT 2009


Hi Mathijs, 

I'll answer on the list if you don't mind, as several items relate
to earlier discussions we had...

Quoting [Thilo Kielmann] (Oct 15 2009):
> 
> food for thought...
> 
> ----- Forwarded message from Mathijs den Burger <mathijs at cs.vu.nl> -----
> 
> > Subject: Re: speeding up (???) Xterior
> > From: Mathijs den Burger <mathijs at cs.vu.nl>
> > To: Thilo Kielmann <kielmann at cs.vu.nl>
> > Cc: Tudor Zaharia <tudor.zaharia at gmail.com>
> > 
> > On Wed, 2009-10-14 at 22:38 +0200, Thilo Kielmann wrote:
> > 
> > > we had just discussed performance problems with getting the file size and
> > > permissions for all entries of a directory.
> > > 
> > > It seems like there might be a fast solution we may have overlooked so far:
> > > the directory (defined in namespace??) has methods for getting file size
> > > and permissions that take a URL parameter denoting a file you get via d.list()
> > > 
> > > Andre suggested that d.list may even cache all the info about its entries
> > > such that d.list() would be the only call talking to the backend.
> > 
> > We did not overlook that. The problem is that 'd.getSize(entryURL)' has
> > to be called for EACH entry you get from d.list(). An adaptor basically
> > has two options for implementing that:
> > 
> > 1. each getSize(URL) call performs a separate remote operation. That is
> > what all adaptors in Java SAGA currently do (e.g. via an FTP or SSH
> > command). With large remote directories, this results in a DoS attack on
> > the remote server, which then shuts you out or simply becomes
> > unresponsive (we see that with our FTP and SSH adaptors)
> > 
> > 2. the adaptor caches all directory information, including file sizes,
> > modification dates etc. It would then have to perform only one remote
> > operation, and the retrieved info can be reused for subsequent
> > getSize(), list(), getLastModificationDate() etc. calls. However, there
> > is no mechanism in SAGA to invalidate or update such a cache, and it may
> > fill up your memory rather quickly. Also, each adaptor has to
> > reimplement caching. This can be avoided by letting the engine
> > perform the caching, but again, there is no general mechanism to
> > invalidate or bypass such a cache.

But well, there are standard ways to invalidate/refresh caches, most
commonly via a time-to-live (TTL) for the cache.  Even if you set
that TTL to only a couple of seconds, you should see exactly the
speedup you are looking for.  I don't think that this is too
complicated, really (pseudo code):

  file.get_size (url u)
  {
    // refresh when the cache is empty or older than the ttl (5 seconds)
    if ( cache.data.empty ||
         time.now () - cache.created > 5 )
    {
      cache.data    = dir.get_sizes (u.get_pwd ());
      cache.created = time.now ();
    }

    return cache.data [u.get_name ()];
  }
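
Fleshed out in Java, an adaptor-side cache along those lines could
look roughly as below.  This is only a minimal sketch: SizeCache and
listSizes() are made-up names, and the local filesystem stands in for
whatever remote backend (FTP, SSH, ...) the adaptor really talks to:

  import java.io.IOException;
  import java.nio.file.*;
  import java.util.HashMap;
  import java.util.Map;

  // TTL-based size cache, as an adaptor might keep it (sketch only)
  class SizeCache
  {
    private static final long TTL_MS = 5_000; // cache lives 5 seconds

    private Map<String, Long> sizes;          // entry name -> size in bytes
    private long              created;        // time the cache was filled

    long getSize(Path entry) throws IOException
    {
      // refresh when the cache is empty or older than the TTL
      if (sizes == null ||
          System.currentTimeMillis() - created > TTL_MS)
      {
        sizes   = listSizes(entry.getParent()); // ONE remote operation
        created = System.currentTimeMillis();
      }
      return sizes.get(entry.getFileName().toString());
    }

    // one bulk operation fetches all sizes in the directory at once
    private Map<String, Long> listSizes(Path dir) throws IOException
    {
      Map<String, Long> result = new HashMap<>();
      try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir))
      {
        for (Path p : ds)
          result.put(p.getFileName().toString(), Files.size(p));
      }
      return result;
    }
  }

No matter how many entries d.list() returned, the backend then sees at
most one listing per TTL window instead of one remote call per entry.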


As for where to cache: engine, adaptor, or some external lib: that
is a tradeoff which is best decided in your implementation IMHO.


> > The solution would be to have method calls like d.getSize(List<URL>
> > entries). The adaptor can then retrieve the file sizes of all entries
> > as efficiently as possible. The current way of specifying such bulk
> > operations is via TaskContainers, which are very tedious to analyse.

That is an implementation problem.  I don't think we should expose
all kinds of calls which are easier/faster to implement on the
application level.  Nobody ever claimed SAGA is easy to implement!
;-)
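
For what it's worth, the 'tedious' analysis is mostly bookkeeping:
group the queued getSize tasks by parent directory, and serve each
group with a single bulk operation.  A rough Java sketch (SizeTask
and listSizes are invented for illustration, not the Java SAGA task
API, and the local filesystem again stands in for the remote backend):

  import java.io.IOException;
  import java.nio.file.*;
  import java.util.*;

  // how an engine could collapse per-entry tasks into bulk operations
  class BulkSizeEngine
  {
    // stand-in for a queued getSize() task in a task container
    static class SizeTask
    {
      final Path entry;
      long result;
      SizeTask(Path entry) { this.entry = entry; }
    }

    static void runAll(List<SizeTask> container) throws IOException
    {
      // group the per-entry tasks by the directory they live in
      Map<Path, List<SizeTask>> byDir = new HashMap<>();
      for (SizeTask t : container)
        byDir.computeIfAbsent(t.entry.getParent(),
                              d -> new ArrayList<>()).add(t);

      // one remote listing per directory serves all its tasks
      for (Map.Entry<Path, List<SizeTask>> group : byDir.entrySet())
      {
        Map<String, Long> sizes = listSizes(group.getKey());
        for (SizeTask t : group.getValue())
          t.result = sizes.get(t.entry.getFileName().toString());
      }
    }

    // local stand-in for the remote bulk listing
    static Map<String, Long> listSizes(Path dir) throws IOException
    {
      Map<String, Long> sizes = new HashMap<>();
      try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir))
      {
        for (Path p : ds)
          sizes.put(p.getFileName().toString(), Files.size(p));
      }
      return sizes;
    }
  }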

Cheers, Andre.


> > Java SAGA does not do that at all (by default, it simply starts a Thread
> > for each Task: an even more effective DoS attack for many remote
> > directory entries).

-- 
Nothing is ever easy.

