[glue-wg] Some questions... [WAS: choosing XML Document structure for GLUE 2.0 rendering]

Mon Dec 17 09:10:23 CST 2007

Hi Sergio,

I've interleaved my comments below.

For the most part, the comments are mildly in favour of using xml:id; but I'm 
concerned that the primary information XML will be unnecessarily hard for the 
information providers.

On Wednesday 12 December 2007 01:11:15 Sergio Andreozzi wrote:
> Paul Millar wrote:
> > First, I see that one of the rules is ID is an element.  [....]
> > [3]   http://www.w3.org/TR/2005/REC-xml-id-20050909
>
> the previous version of the XML rendering proposal had the ID as
> attribute, then after a discussion in the last telecon we agreed to
> change it to element.
> I actually do not have a strong opinion on this.

Yes, I too don't feel this is a big issue.  I think there's an opportunity to 
use an existing standard.  We might get some leverage if GLUE/XML uses the 
attribute-based xml:id ID, but it's certainly not essential.

> As regards your 
> references, the most interesting to me is [3]. I would say that we are
> not reinventing the wheel because we are doing something different.
> [3] defines a way to attach a unique ID (unique within an XML document)
> to an XML element. 

True, although I think the emphasis with xml:id is providing a unique 
reference point in a schema-type invariant way.

GLUE could define an attribute within its namespace (via XSD, as glue:ID, for 
example).  But, by using xml:id, an XML parser (that supports xml:id) can 
infer the attribute has type unique-ID without having to understand the 
definition in the DTD / XSD / RelaxNG etc.  Because of this, simple 
(non-validating) parsers can still identify xml:id as a "global document 
identifier" and treat it accordingly.

The benefit for us is we don't have to provide DTD and XSD and RelaxNG and ... 
for an XML parser to understand what xml:id "means".  The GLUE/XML 
implementation(s) may choose to provide these, but it's optional.
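
As a rough illustration of that point, here is a small Python sketch that 
finds ID-carrying elements using only the reserved xml: namespace, with no 
DTD or XSD loaded.  Some assumptions to be clear about: Python's standard 
xml.etree.ElementTree has no built-in xml:id awareness, so the index (and the 
uniqueness check) is done by hand; the element layout and the ID values are 
made up; and the values are written as simple NCName-style strings, since 
xml:id requires an NCName.  How a URI-valued GLUE-ID would be encoded is left 
open here.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace; no declaration is needed.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document: the layout and the ID values are illustrative only.
DOC = """
<Grid>
  <AdminDomain xml:id="scotgrid-gla">
    <StorageService xml:id="scotgrid-gla-se"/>
  </AdminDomain>
</Grid>
"""

def index_by_xml_id(root):
    """Map each xml:id value to its element, rejecting duplicates."""
    index = {}
    for elem in root.iter():
        value = elem.get(XML_ID)
        if value is None:
            continue
        if value in index:
            raise ValueError("duplicate xml:id: " + value)
        index[value] = elem
    return index

root = ET.fromstring(DOC)
ids = index_by_xml_id(root)
print(sorted(ids))
print(ids["scotgrid-gla-se"].tag)
```

The function never needs to know which GLUE elements carry an ID; the 
attribute name alone is enough.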

> We are defining a property of a Grid concept (ID) 
> which is supposed to be globally unique and is a URI. From a semantical
> viewpoint they are different. They sit in different namespaces,
> therefore there should be no problem for that (if you see problems,
> please let me know).

I'm not sure I completely follow you here.  The two are separate namespaces, 
but have similar properties.  So, isn't mapping the GLUE ID as xml:id a 
choice GLUE/XML is free to make?

The GLUE Schema's ID attribute ("GLUE-ID" to prevent confusion) is a globally 
unique URI: a unique "name" within any aggregation of valid GLUE.  
(Presumably it's a URI to allow easy delegation of the namespace within a 
distributed community.)  The ID attribute is a required value for certain 
GLUE components (Service, UserDomain, AdminDomain, ...).

The current GLUE/XML mapping (as far as I understand it) provides an XML 
element for each major GLUE grid component; in particular, those components that 
require a GLUE-ID are represented as XML elements.

The XML attribute xml:id describes a globally unique string: a unique "name" 
within any aggregation of valid XML.  So, one can injectively map GLUE-ID 
into xml:id; i.e., any valid GLUE-ID can be written as a unique, valid 
xml:id.

Whilst there is no requirement for GLUE/XML to use xml:id (as you say, the two 
are separate), there's also no reason not to.  GLUE/XML mapping is free to 
define that xml:id is to be used or (as currently) to use a schema-specific 
declaration: the ID element.

Here is a list of the advantages and disadvantages I could think of:

Advantages of using xml:id
  o  it's the W3C recommended way of doing "this sort of thing."
  o  ID-like semantics are built into parsers that support xml:id (which might 
not support more general validation),
  o  potential "reuse" of GLUE-ID with other XML software and standards,
  o  There is no GLUE-specific behavior when combining different GLUE XML 
files: no need to hard-code the value or derive behavior from some 
DTD/XSD/...
  o  ..others? ..

Disadvantages of using xml:id:
  o  the mapping between GLUE-ID and xml:id is not surjective: there are valid 
xml:id values that are not valid GLUE-ID values (does this matter?)
  o  xml:id is an attribute rather than an element.
  o  some issues with Canonical XML (although xml:id considers xml-c14n to be 
broken in this and some other respects)
  o  .. others? ..

> > Is the plan to render (nearly) everything as elements rather than
> > attributes?
>
> in the last telecon, we agreed that we'll use attributes only for
> metadata-like properties (basically CreationTime and Validity, see Sec.
> 4.1 of the spec), while all the rest will be mapped to XML elements.

[Maybe section 4.2 ("metadata"), rather than 4.1.]

> > GLUE has many items that have "required" (1) or "optional" (0..1) cardinality
> > and contain no further markup, so I feel they would, for the most part,
> > be better rendered as an XML attributes.
>
> given my experience, this choice is mainly a matter of style. Attributes
> can be only of simple types and single-value.
> Going for elements gives more flexibility for future changes and also is
> probably more usable (people don't have to remember which properties are
> single value, i.e. attributes or multi-value. i.e. elements when writing
> queries).

Sure, this isn't a big deal and is largely a matter of style.  Always using 
elements does tend to inflate the document size, which may matter when 
providing a large amount of information.
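
To give a feel for the size difference, here is a quick Python sketch that 
serialises the same two single-valued properties both ways and compares the 
byte counts.  The property values are borrowed from the StorageResource 
example further down; this is only an illustration, not a measurement of real 
GLUE documents, and compression on the wire would narrow the gap.

```python
import xml.etree.ElementTree as ET

# Two single-valued properties of one hypothetical StorageResource.
PROPS = {"Name": "ScotGrid-GLA DPM instance", "ImplementationName": "DPM"}

def as_elements(props):
    """Render each property as a child element."""
    res = ET.Element("StorageResource")
    for key, value in props.items():
        ET.SubElement(res, key).text = value
    return ET.tostring(res)

def as_attributes(props):
    """Render each property as an attribute on the parent element."""
    return ET.tostring(ET.Element("StorageResource", dict(props)))

elem_bytes = as_elements(PROPS)
attr_bytes = as_attributes(PROPS)
print(len(elem_bytes), len(attr_bytes))
```

The element form repeats each property name in the closing tag, so the 
overhead grows with both the number of properties and the length of their 
names.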

There are some GLUE attributes that could probably be rendered as XML 
attributes, but it's no big deal.

> > [Problem with primary producer having to know too much]
>
> the proposal is intended to be used by both primary services (e.g.,
> OGSA-BES, SRM) which want to advertise their characteristics and by
> information services (both primary publishers and aggregators).
> For primary services, the only constraint is to know the ID of their
> AdminDomain. That's all. They are not supposed to publish other
> AdminDomain attributes.

OK, but the example primary document "A" (P.A option, when voting) contained 
more information than this: it showed a complete hierarchy, as if the service 
were alone in the Grid.

> The AdminDomain ID will be used to perform the aggregation at the
> higher-level.
>
> The reason for which I prefer Option A is because it looks easier to
> make queries by AdminDomain (no need for join). And at the aggregation
> level, you have all info under a certain AdminDomain aggregated under a
> single element.

N.B. Here, I'm referring to my option P.O [4], where the primary information 
is presented as a sub-tree of the full GLUE/XML.  This is analogous to how 
DocBook provides aggregation where files may (individually) contain a Book 
(or Article), Part, Chapter, and so on.  Aggregation happens through "other 
means" (with DocBook this is typically via XInclude; with the toy example [4] 
it is included in the XSLT).

[4] http://www.ogf.org/pipermail/glue-wg/2007-December/000249.html

I'm not sure I follow how it is easier to make queries: the queries (against 
the complete, aggregated GLUE/XML infoset) are just as easy.

However, the problem I see is this: if the storage-element were to provide 
information that is directly queryable (with identical queries as for the 
final GLUE/XML), the info-provider would need to know its ancestor 
hierarchy (parent, parent's parent, etc); specifically, how many domains (and 
of what type) are "above" it.

For example, suppose a Tier-2 site has three AdminDomains within their 
combined Domain, the final (aggregated) published XML would look like:
	<Grid>
		<Domain>
			<Name>SCOTGRID</Name>
			<Description>Scotland's distributed grid site</Description>

			<!-- Further Domain-level information here -->

			<AdminDomain>
				<Name>SCOTGRID-GLA</Name>
				<Description>The ScotGrid site at University of Glasgow</Description>

				<!-- Further AdminDomain-level information here -->

				<StorageService>
					<!-- Further StorageService information here -->

					<StorageResource>
						<ID>glue://gla.scotgrid.ac.uk/SE</ID>
						<Name>ScotGrid-GLA DPM instance</Name>
						<ImplementationName>DPM</ImplementationName>
						<!-- ...etc... -->
					</StorageResource>
				</StorageService>		
			</AdminDomain>
		</Domain>
	</Grid>

So, if I've understood the primary information "A" option (P.A.) correctly, 
the storage service would publish XML like:
	<Grid>
		<Domain>
			<AdminDomain>
				<StorageService>
					<!-- Further StorageService information here -->

					<StorageResource>
						<ID>glue://gla.scotgrid.ac.uk/SE</ID>
						<Name>ScotGrid-GLA DPM instance</Name>
						<ImplementationName>DPM</ImplementationName>
					</StorageResource>
				</StorageService>		
			</AdminDomain>
		</Domain>
	</Grid>

What's bad here is that the info-provider must know its hierarchy: that it 
is inside an AdminDomain, which is itself inside a Domain.  This is ugly; it 
should not need to know this!

In contrast, a Tier-1 site might have no containing Domain.  A storage service 
must then publish information like:    
	<Grid>
		<AdminDomain>
			<StorageService>
				<!-- Storage Service info here -->
			</StorageService>		
		</AdminDomain>
	</Grid>

With an alternative (option P.O, see [4]), this can be avoided: services 
provide only the information they know (by directly examining the software) 
and a hint (the "site-level" GLUE-ID).

In fact the "parent" back-link isn't needed: it just makes configuring the 
site-level aggregation a little easier.  One could configure parent-child 
links explicitly (e.g. Services within AdminDomains) and avoid having to 
specify the Parent within the child.

To me, this makes much more sense: each service is (genuinely) providing only 
the information it knows.

Admin sites would aggregate (as with site-level BDIIs currently) and Domains 
then aggregate from multiple AdminSites, as necessary.
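
A minimal Python sketch of this style of site-level aggregation follows.  It 
is not the XSLT from [4]; the Parent element name, the example AdminDomain ID 
and the decision to drop the hint after aggregation are all assumptions made 
for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical primary document: the service publishes only its own sub-tree
# plus a Parent hint.  The Parent element name and the IDs are assumptions.
PRIMARY = """
<StorageService>
  <Parent>urn:admindomain:example-site</Parent>
  <StorageResource>
    <ID>glue://gla.scotgrid.ac.uk/SE</ID>
    <ImplementationName>DPM</ImplementationName>
  </StorageResource>
</StorageService>
"""

def aggregate(domain_id, primary_docs):
    """Build an AdminDomain element from the fragments naming it as parent."""
    domain = ET.Element("AdminDomain")
    ET.SubElement(domain, "ID").text = domain_id
    for text in primary_docs:
        fragment = ET.fromstring(text)
        hint = fragment.find("Parent")
        if hint is None or hint.text != domain_id:
            continue  # fragment belongs to some other site
        fragment.remove(hint)  # the hint is only needed during aggregation
        domain.append(fragment)
    return domain

site = aggregate("urn:admindomain:example-site", [PRIMARY])
print(ET.tostring(site, encoding="unicode"))
```

The service itself never emits a Grid, Domain or AdminDomain wrapper; only 
the aggregator knows where the fragment ends up.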

> I don't know how MDS 4 performs aggregations at higher 
> level and if this is compatible with its strategies. This is something
> to be investigated.

Yes, it would be interesting to compare: I don't know too much about MDS-4.

> > As an alternative, suppose One-to-Many relationships be represented as
> > either an XML element hierarchy [...]
>
> yep, this is an option as well. Many options are available. Probably, we
> should make one step back and clarify what we want to optimize.
> In my opinion, we should concentrate on giving the final user the easiest
> and more intuitive way to query the properties.

OK.   I've two additional (friendly) amendments:

  a. adjust this to:
	"[easiest and most intuitive way to query] the final, aggregated GLUE/XML 
Schema."

  b. also add:
	"make it easy for components to provide the necessary information."

> For sure, we need more experience on this with a number of queries to be
> written for different approaches.
> One advantage that I like of option A. is that a query would remain
> valid if you query either the primary source of information or the
> aggregated layer.

Whilst I agree this would be nice, do we have a use-case for users querying 
the primary source of information directly?

I skimmed through the use-case document and searched for keywords 
("primary", "source", "provider", etc..), but couldn't find any requirement 
for end-users to query information providers directly.

Given the flexible hierarchy allowed by (potentially) nesting an AdminDomain 
within multiple Domains, this could be difficult to achieve without requiring 
that primary sources of information know something of the global structure.
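
To make this concrete, here is a small Python sketch that runs one fixed path 
against stripped-down versions of the Tier-2 and Tier-1 layouts shown earlier 
in this message.  The documents are simplified to bare element skeletons, and 
the path is written relative to the Grid root because that is how Python's 
ElementTree evaluates paths.

```python
import xml.etree.ElementTree as ET

# Simplified skeletons of the two layouts sketched earlier in this message.
TIER2 = "<Grid><Domain><AdminDomain><StorageService/></AdminDomain></Domain></Grid>"
TIER1 = "<Grid><AdminDomain><StorageService/></AdminDomain></Grid>"

# Path written with the Tier-2 nesting in mind (relative to the Grid root).
FIXED_PATH = "Domain/AdminDomain/StorageService"

def count_matches(doc, path):
    """Return how many elements the path selects under the document root."""
    return len(ET.fromstring(doc).findall(path))

print(count_matches(TIER2, FIXED_PATH))
print(count_matches(TIER1, FIXED_PATH))
```

The first count is 1 and the second is 0: for a query to behave identically 
against a primary source and the aggregate, the primary source has to 
reproduce exactly the nesting that the aggregate will end up with.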

> Consider this for instance. A simple XPath to ask for a service which
> type is org.glite.wms part of a certain adminDomain:
>
> /glue:Grid/AdminDomain[ID='urn:admindomain:t1.infn.it']/Service[Type='org.glite.wms']

[sorry, v. minor point: assuming GLUE provides an XML-namespace, wouldn't the 
query have to specify the namespace prefix at each level?
  /glue:Grid/glue:AdminDomain[glue:ID='urn:...']/glue:Service[glue:Type=...
]
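
A quick way to check this is the Python sketch below, which runs the 
qualified and unqualified forms of the query against a small namespaced 
document.  The namespace URI is made up purely for illustration, the second 
Service is a dummy entry, and the leading /glue:Grid step is dropped because 
ElementTree evaluates paths relative to the root element.

```python
import xml.etree.ElementTree as ET

# Made-up namespace URI, purely for illustration.
NS = {"glue": "http://example.org/glue"}

DOC = """
<glue:Grid xmlns:glue="http://example.org/glue">
  <glue:AdminDomain>
    <glue:ID>urn:admindomain:t1.infn.it</glue:ID>
    <glue:Service>
      <glue:Type>org.glite.wms</glue:Type>
    </glue:Service>
    <glue:Service>
      <glue:Type>org.example.other</glue:Type>
    </glue:Service>
  </glue:AdminDomain>
</glue:Grid>
"""

root = ET.fromstring(DOC)  # root is the glue:Grid element

# Every step and every predicate test carries the prefix.
QUERY = ("glue:AdminDomain[glue:ID='urn:admindomain:t1.infn.it']"
         "/glue:Service[glue:Type='org.glite.wms']")

qualified = root.findall(QUERY, NS)
unqualified = root.findall(
    "AdminDomain[ID='urn:admindomain:t1.infn.it']/Service[Type='org.glite.wms']")
print(len(qualified), len(unqualified))
```

This prints "1 0": once the elements live in a namespace, the unprefixed 
steps select nothing.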

> this query works both at the primary source level and aggregated level
> and is also quite simple to me.

Again, do we really need to provide a service where end-users can query the 
information provided by the primary sources in an identical fashion to the 
complete (aggregated) resource?

I understand it would be nice (mostly for debugging reasons), but I don't see 
how this can be done without *every* primary info-provider within a Grid 
knowing (at least something of) the grid structure, in order to provide the 
correct XML documents.  I feel this would be quite an inflexible solution.

> Of course, we need a larger set of queries to be used for evaluation.

I suspect that XPath will be sufficient to query the aggregated GLUE/XML: once 
you get your head around XPath, it's pretty intuitive and friendly.

> > [XML Schema balance...]
>
> we are trying to find the right balance and mainly preserving ease of
> use. In the rules, I mentioned the option of SubstitutionGroups for
> completeness, but this is not the current selected option.
> At the moment, we prefer to go for the annotation option
>

[snip: agreement on simple XML design over complicated, strongly validating 
design]

> Thanks for your constructive feedback. I hope we can dedicate one more
> call before XMas to XML rendering so that we can refine all these
> choices and align about the rationale behind them.
> Please, keep contributing as opinion from different perspectives help us
> to make better choices.

I'll do my best!

Cheers,

Paul.