[SAGA-RG] spec

Thilo Kielmann kielmann at cs.vu.nl
Sat Dec 15 08:00:21 CST 2007


I've spent some more time studying RFCs and thinking about this
thread of discussion.

I hope we can at least agree upon our design goals:
whatever we specify into SAGA has to be "simple to use", and as such
has to "do the obvious thing" in its respective context.

The aim of the exercise is to provide "POSIX shell wild cards" for files
(well, actually for name space entries.)
While trying to do so, we came across two sub topics:

a) wild cards possibly in URLs
b) wild cards in strings


About wild cards in URLs.

The valid RFC for URLs is
RFC3986 "Uniform Resource Identifier (URI): Generic Syntax"

It says (Introduction, second paragraph):

   "This document obsoletes [RFC2396], which merged "Uniform Resource
   Locators" [RFC1738] and "Relative Uniform Resource Locators"
   [RFC1808] in order to define a single, generic syntax for all URIs.
   It obsoletes [RFC2732], which introduced syntax for an IPv6 address.
   It excludes portions of RFC 1738 that defined the specific syntax of
   individual URI schemes; those portions will be updated as separate
   documents. ..."

I have checked IETF's site with RFCs and could not find any RFC documents
that would describe new schemes for "file", "ftp", or "http".
This means, RFC3986 describes the general URI syntax, while the relevant
URL types for us (file, ftp, http) are still valid as described in RFC1738.


Having said this, I made the following two observations:

1. RFC3986 says (Introduction, first paragraph, first sentence):
"A Uniform Resource Identifier (URI) provides a simple and extensible
means for identifying a resource." I'd like to put the emphasis here on
"a resource", rather than "a resource or a group of resources".
Besides, RFC3986 does NOT contain the terms "wild card", nor "wildcard",
not even "pattern".

2. In RFC1738, the character '*' does not need to be escaped (while other
special characters from POSIX shell wild cards do).
In a previous discussion we had already ruled out wild card characters
that would need to be escaped as too complicated and non-obvious to use.
However, the URL schemes for "file", "ftp", and "http" do not define any
wild card patterns. (Only the "news" scheme uses the '*' character as a
simple wild card. But this is not relevant for us.)
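
Observation 2 can be checked mechanically. The following is a minimal
sketch, not part of SAGA or of any RFC text: the function name
rfc1738_plain is hypothetical, and the character sets are taken from the
"safe" and "extra" productions in the BNF of RFC1738 (section 5). It reports
whether a character may appear unencoded in a URL without carrying a
reserved meaning.

```cpp
#include <cstring>

// Hypothetical helper (an assumption for illustration, not a SAGA API):
// true if RFC 1738 allows c to appear unencoded in a URL with no
// reserved meaning attached to it.
bool rfc1738_plain(char c)
{
    // RFC 1738 BNF: safe = "$-_.+" and extra = "!*'(),"
    const char* safe_and_extra = "$-_.+!*'(),";

    bool alnum = (c >= 'a' && c <= 'z')
              || (c >= 'A' && c <= 'Z')
              || (c >= '0' && c <= '9');

    // Guard against '\0': strchr would otherwise match the terminator.
    return alnum || (c != '\0' && std::strchr(safe_and_extra, c) != nullptr);
}
```

Under these assumptions the function accepts '*' but rejects '?', '[' and
']': the first is a reserved character, and the brackets are listed as
unsafe in RFC1738.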

From both observations I am drawing the conclusion that we MUST NOT use
any wild cards, not even the '*' character, in URLs. This is because adding
wild-card semantics to these URLs would deviate from the definitions in
both RFC3986 and RFC1738, and also from the "common use" of URLs, namely
"identifying a single resource."


This leaves us with option b) "wild cards in strings".

We do have consensus about using wild cards for name-space entries in
strings. More specifically: in path elements, expressed in strings.
However, we do not yet fully agree on the proposal to limit
these to path elements that are relative to the name space (read: directory)
on which the wild-card enabled functions operate.
Both camps argue with simplicity for the user.

The argument AGAINST restricting strings to relative paths is the possible
confusion caused when syntactically valid paths (absolute ones) are
rejected by the semantic restriction to relative paths.

The argument FOR restricting strings to relative paths is that absolute
paths coincide with URLs, so allowing them would give a second (string)
representation for URLs, this time with wild cards allowed (see the
discussion above). Having two representations for (almost) the same thing
is considered confusing for the user.

Argument by Andre:

> >>>  tmp/data.bin   <-- relative
> >>>  /tmp/data.bin  <-- absolute

Well, I would say that this "absolute" path still is relative, namely to
the base URI "file://localhost/".
Absolute paths on the same machine form a corner case in grids. Really
"absolute" paths identify the machine on which a file/directory resides.

As pointed out by Ceriel, URIs according to RFC 3986 always contain
absolute paths, especially after "normalization" has been applied. This means
that URIs cannot hold relative paths, at least not in the general case. (And
we are asking for problems if we require implementations to NEVER normalize
a URI...)

This argument goes like:
A string with an absolute path coincides with a URI, where wild cards are
not allowed/desirable. A string with a relative path is "sufficiently 
different" from a URL such that it is obvious for the user where wild cards
are allowed and where they are not (in URLs).
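
The distinction drawn here could be enforced by a check like the one below.
This is only a sketch under stated assumptions: is_relative_pattern is a
hypothetical name, not a SAGA call, and "relative" is taken to mean "does
not start with '/' and does not contain a scheme separator '://'".

```cpp
#include <string>

// Hypothetical check (an assumption for illustration, not a SAGA API):
// accept a wild-card pattern only if it is a relative path, i.e. neither
// an absolute path nor a full URL.
bool is_relative_pattern(const std::string& s)
{
    if (s.empty())                          return false; // nothing to match
    if (s[0] == '/')                        return false; // absolute path
    if (s.find("://") != std::string::npos) return false; // scheme://...
    return true;
}
```

With this definition, "tmp/data.bin" and "sub/*/bla[1-9].doc" are accepted,
while "/tmp/data.bin" and "file://localhost/tmp/*" are rejected.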


If we agree to restrict strings to relative paths, which use cases are we
missing? What can NOT be expressed then? We can still do the following:

saga::directory dir(url);
dir.copy("sub/*/bla[1-9].doc", target_url);

For the running application, this can be a third-party copy that honors
wild cards.

I currently cannot think of any use case where it would be a problem
to create the dir object first (as opposed to doing the same copy
with two URLs directly, but then on which directory object???)
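
To illustrate what "honoring wild cards" could mean for the pattern in the
copy call above, here is a minimal sketch that selects entries relative to
a directory using the POSIX fnmatch() function with FNM_PATHNAME. That a
SAGA implementation would match this way is an assumption, and
select_entries is a hypothetical helper, not a SAGA call.

```cpp
#include <fnmatch.h>   // POSIX fnmatch(), FNM_PATHNAME
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration, not a SAGA API):
// returns those entries, given relative to some directory, that match a
// POSIX shell wild-card pattern.
std::vector<std::string> select_entries(const std::string& pattern,
                                        const std::vector<std::string>& entries)
{
    std::vector<std::string> hits;
    for (const std::string& e : entries) {
        // FNM_PATHNAME: '*', '?' and bracket expressions never match '/',
        // so each wild card stays within one path element.
        if (fnmatch(pattern.c_str(), e.c_str(), FNM_PATHNAME) == 0)
            hits.push_back(e);
    }
    return hits;
}
```

For the pattern "sub/*/bla[1-9].doc", an entry such as "sub/a/bla1.doc"
is selected, while "sub/a/bla0.doc" (digit outside the range) and
"sub/a/b/bla2.doc" ('*' would have to span two path elements) are not.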


To summarize:

I hereby propose to limit the use of wild cards to strings, and in there
to relative paths, because this:
- is sufficiently different from absolute URLs to avoid confusion
- is sufficiently expressive



Regards,


Thilo
-- 
Thilo Kielmann                                 http://www.cs.vu.nl/~kielmann/

