[SAGA-RG] SAGA and advert URIs

Tue Sep 8 04:56:12 CDT 2009

Quoting [Bruno Harbulot] (Sep 07 2009):
> 
> Hello,
> 
> Thank you for organising this workshop, it was very interesting and I've 
> enjoyed it. I wasn't sure to which address or mailing list (perhaps SAGA 
> on OGF or SAGA-Users) I should send this e-mail, please feel free to CC 
> it as appropriate.

Happy you liked the workshop! :-)  FWIW, I Cc'ed the OGF
mailing list for the URL discussion - it seems you are
subscribed.

> I'd like to come back on the point I was trying to make about 
> 'advert://' URIs.
> My understanding of how it works is that using 
> "advert://advert-host.example/some/thing" from the API implies that:
> 1. The API will try to find a suitable adapter for this prefix.
> 2. Currently, this adapter is a PostgreSQL client that will try to 
> connect to the PostgreSQL server on this host.
> 3. This PostgreSQL client needs to know the name of the database on the 
> server and its schema. The server needs to be set up accordingly.

Almost.  More precisely, it works like this:

  1. Our SAGA implementation will forward the URL to each
  adaptor which registered for the advert API (i.e. which
  implements the advert API).  Each adaptor can accept the
  URL and act on the API call, or refuse to do so.

  2. Currently, this adapter is a PostgreSQL client that
  will try to connect to the PostgreSQL server on *the host
  specified in the URL*.

  3. This PostgreSQL client needs to know the name of the
  database on the server and its schema. The server needs to
  be set up accordingly.  The db name can, however, be
  specified in the URL (advert://host/path?dbname=foo), or
  via adaptor config files.
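
To make point 3 concrete, here is a minimal sketch of how an adaptor
could pick up the database name.  It is not the actual adaptor code:
parse_query and pick_dbname are hypothetical helpers, and the rule
that the URL takes precedence over the config file value is an
assumption made for illustration.  Fragments ('#...') and URL
escaping are ignored.

```cpp
#include <map>
#include <string>

// Hypothetical helper (not part of SAGA): split the query part of a URL
// such as "advert://host/path?dbname=foo&dbtype=sqlite3" into key/value pairs.
std::map<std::string, std::string> parse_query(const std::string& url)
{
    std::map<std::string, std::string> params;
    std::string::size_type pos = url.find('?');
    if (pos == std::string::npos)
        return params;                         // no query part at all

    std::string query = url.substr(pos + 1);
    std::string::size_type start = 0;
    while (start <= query.size()) {
        std::string::size_type end = query.find('&', start);
        if (end == std::string::npos)
            end = query.size();
        std::string pair = query.substr(start, end - start);
        std::string::size_type eq = pair.find('=');
        if (eq != std::string::npos && eq > 0)  // skip entries without a key
            params[pair.substr(0, eq)] = pair.substr(eq + 1);
        start = end + 1;
    }
    return params;
}

// Assumption: a dbname given in the URL wins; otherwise fall back to
// the value read from the (hypothetical) adaptor config file.
std::string pick_dbname(const std::string& url, const std::string& config_default)
{
    std::map<std::string, std::string> params = parse_query(url);
    auto it = params.find("dbname");
    return it != params.end() ? it->second : config_default;
}
```

With this, pick_dbname ("advert://host/path?dbname=foo", "cfg") yields
"foo", while the short form "advert://host/path" falls back to "cfg".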

> While this can work at a small scale, there are a number of issues with 
> this approach.
> 
> Firstly, if another adapter exists one day for another DBMS (for example 
> MySQL or Oracle), which one will be used? It's not uncommon to have 
> hosts that run both PostgreSQL and MySQL for example.
> It's a problem similar to letting 'any://' guess the protocol. Although 
> by luck 'ssh://host/file' and 'ftp://host/file' are likely to be the 
> same because the underlying file system structure is the same, a 
> PostgreSQL server and a MySQL server running on the same machine won't 
> have the same data at all.

While this is true, it is considered to be a feature, not
a bug.  Along the same lines, one could argue that the 'ftp'
scheme for file access does not uniquely specify the
adaptor to be used.  In fact, 'ftp://' could be accepted by
the gridftp adaptor, by the curl adaptor, and by a
(hypothetical) plain ftp adaptor.  Yes, one or the other may
fail to run the command - then the next in line will be
used.  Adaptor selection can be optimized by configuration,
by heuristics, or otherwise - but that is an implementation
detail hidden from the application.
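
The "next in line" behaviour can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the SAGA
engine interface: the adaptor struct, dispatch and has_scheme are
hypothetical names, and failure is modelled as a thrown exception.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical adaptor model, not the actual SAGA engine interface.
struct adaptor {
    std::string name;
    std::function<bool(const std::string&)> accepts;      // may refuse a URL
    std::function<std::string(const std::string&)> run;   // may throw on failure
};

// Hypothetical helper: true if the URL starts with "<scheme>://".
bool has_scheme(const std::string& url, const std::string& scheme)
{
    return url.compare(0, scheme.size() + 3, scheme + "://") == 0;
}

// Try the adaptors in order.  The first one that accepts the URL and
// completes the call wins; refusals and failures fall through to the
// next adaptor in line.
std::string dispatch(const std::vector<adaptor>& adaptors, const std::string& url)
{
    for (const adaptor& a : adaptors) {
        if (!a.accepts(url))
            continue;                    // adaptor refused the URL
        try {
            return a.name + ": " + a.run(url);
        } catch (const std::exception&) {
            // adaptor failed to run the command, try the next one
        }
    }
    throw std::runtime_error("no adaptor could handle " + url);
}
```

If a gridftp-like adaptor accepts 'ftp://host/file' but throws, and a
curl-like adaptor behind it succeeds, the caller only ever sees the
result of the second one.  The order of the vector stands in for
whatever configuration or heuristics the implementation uses.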

> This is in fact already an issue with respect to the PostgreSQL and the 
> SQLite implementations. If a client is configured for using SQLite and 
> another one is configured for using PostgreSQL, they will get mixed up 
> if they try to read from and write to the same advert URI.

The complete URL is unique:

  advert://user:pass@host/path?dbname=mydb&dbtype=sqlite3

Yes, the short form

  any://host/path

is *not* unique - but it is up to the user to choose between
the convenient short form and the full form.

> Secondly, I'm not sure how security is configured, but if all the 
> clients are configured to use the same schema name, user name and 
> password. I've just been able to connect to the PostgreSQL database we 
> were using during the tutorial and make a select query, simply using the 
> username and password that are in the README file, in the SAGA source 
> code. This relies on everyone playing nice. Even without malicious 
> intent, accidents happen.

Sure, we are aware of this.  But this is a database we use
for tutorials etc - *real* applications would of course use
a different and more secure setup.

Security credentials would be either specified in the full
URL, as shown above, or specified via a saga::context which
needs to be added to the saga::session the operation is
supposed to run in:

  saga::context c ("my_postgres_context");
  c.set_attribute ("UserID", "...");
  c.set_attribute ("UserPass", "...");

  saga::session s;
  s.add_context (c);

  saga::advert::entry ad (s, url);

This code MUST use the context specified above.  Of course,
not all adaptors need to accept that specific context; those
that do not would simply not try to do anything at all.

> Finally, SAGA is an API, but this makes SAGA enter the territory of 
> network protocols. If you addressed the issues above by specifying the 
> database structure and how to query it, you'd end up defining another 
> protocol, which would certainly duplicate the job of protocols that 
> already exist (there are a number of pub/sub protocols, for example one 
> could be using Atom).

No, we do *not* define a protocol.  We simply don't.  Nowhere
in our code do we have a protocol definition.  Nor do we
actually talk at byte level on the connection.  We simply
use existing protocols like ftp, the postgres protocol, etc.

We *specify* a protocol to be used, in the URL scheme, or
specify a wildcard (any) to leave the choice of protocol to
the implementation.
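
As a rough sketch of that distinction (not taken from the SAGA code
base): scheme_of and candidate_protocols are hypothetical helpers, and
the list of supported protocols is whatever the caller passes in.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: return the scheme in front of "://", or an
// empty string if the URL has none.
std::string scheme_of(const std::string& url)
{
    std::string::size_type pos = url.find("://");
    return pos == std::string::npos ? std::string() : url.substr(0, pos);
}

// Return the protocols an implementation may try for a URL:
//  - "any" is the wildcard and leaves the choice open,
//  - every other scheme names exactly one protocol, if it is supported.
std::vector<std::string> candidate_protocols(const std::string& url,
                                             const std::vector<std::string>& supported)
{
    const std::string scheme = scheme_of(url);
    if (scheme == "any")
        return supported;

    std::vector<std::string> result;
    for (const std::string& protocol : supported)
        if (protocol == scheme)
            result.push_back(protocol);
    return result;
}
```

For a supported list of ftp, gsiftp and ssh, 'any://host/file' yields
all three candidates, while 'ftp://host/file' yields only ftp.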

> Having a uniform API for a number of protocols is a good idea, but 
> letting the API guess the protocol will undoubtedly lead to some 
> trouble.

Yes, it may - we are aware of that.  That is explicitly
mentioned in the API specification.  In the cases where that
may lead to trouble, users SHOULD explicitly specify
protocols.  However, the SAGA 80:20 rule applies: using the
wildcards seems ok in the vast majority of cases.  We have
not yet had any serious trouble with it.  And if you do: just
don't use the feature...

> In the case where identifiers are ambiguous and can point to 
> several distinct things, this sounds like a fundamental architectural 
> flaw (once it's released as it's the case for gsiftp URIs, it's almost 
> impossible to fix [*]).

I can give you simpler examples.

  http://host//etc/passwd
  ftp://host//etc/passwd

will usually not refer to the same physical file, but, for
example, to

  file://host//var/http_root/etc/passwd
  file://host//var/pub/etc/passwd

and neither refers to the canonical

  file://host//etc/passwd

Yes, users need to be aware of that.
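
In code, the same observation looks roughly like this.  The roots are
the ones from the example above; real servers configure their own, and
physical_path is a hypothetical helper, not an existing API.

```cpp
#include <map>
#include <string>

// Map (scheme, path) to the physical file a typical server would
// serve.  Roots are taken from the example above and are assumptions;
// an unknown scheme yields an empty string.
std::string physical_path(const std::string& scheme, const std::string& path)
{
    static const std::map<std::string, std::string> roots = {
        {"http", "/var/http_root"},   // web server document root
        {"ftp",  "/var/pub"},         // ftp server public root
        {"file", ""}                  // plain file access, no extra root
    };
    auto it = roots.find(scheme);
    if (it == roots.end())
        return std::string();
    return it->second + path;
}
```

The same path '/etc/passwd' thus ends up at three different physical
locations, depending only on the scheme.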

Best, Andre.

> Best wishes,
> 
> Bruno.
> 
> 
> [*] http://blog.distributedmatter.net/post/2006/12/08/gsiftp-URI-madness

-- 
Nothing is ever easy.