[DFDL-WG] Action 287: Find a way to handle a variable path step in DFDL expression - Re: DFDL4S use of wildcard/regex in length path expression
Steve Hanson
smh at uk.ibm.com
Tue May 15 08:57:17 EDT 2018
Hi Michele
Any update for the DFDL WG on this action item ?
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Michele Zundo <michele.zundo at esa.int>
To: Steve Hanson <smh at uk.ibm.com>
Cc: "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, Joaquim Oliveira
<joaquim.oliveira at deimos.com.pt>, Maurizio De Bartolomei
<Maurizio.De.Bartolomei at esa.int>, Mike Beckerle
<mbeckerle.dfdl at gmail.com>, dfdl4s at eopp.esa.int
Date: 13/11/2017 19:27
Subject: Re: [DFDL-WG] Action 287: Find a way to handle a variable
path step in DFDL expression - Re: DFDL4S use of wildcard/regex in length
path expression
Dear Steve,
You will have to bear with us. We have been very busy closing the 2017
contract as well as setting up the one for 2018 covering DFDL4S, which we
have just released, fixing quite a few problems affecting our users.
We plan in the next month to go through the pending new technical issues
(the one you mention is one of them) to see if/how they can fit in DFDL4S
evolution.
Of interest might also be that we have secured funding to implement a C++
version of our library (called UMBRA) to address performance and C++
language compatibility. It will be a native C++ version with an associated
tool to work with schemas. The activity will start in Jan 2018 and run for
2 years.
I will come back when we finalise the wildcard/regex story.
Sent from my iPhone
On 13 Nov 2017, at 20:00, Steve Hanson <smh at uk.ibm.com> wrote:
Michele, Joaquim
Any update on this for the DFDL WG call tomorrow (Tuesday 14th) ?
Regards
Steve Hanson
From: Michele Zundo <michele.zundo at esa.int>
To: Steve Hanson <smh at uk.ibm.com>
Cc: Mike Beckerle <mbeckerle.dfdl at gmail.com>, "dfdl-wg at ogf.org" <
dfdl-wg at ogf.org>, Maurizio De Bartolomei <Maurizio.De.Bartolomei at esa.int>,
Joaquim Oliveira <joaquim.oliveira at deimos.com.pt>
Date: 04/10/2017 09:02
Subject: Re: [DFDL-WG] Action 287: Find a way to handle a variable
path step in DFDL expression - Re: DFDL4S use of wildcard/regex in length
path expression
Dear Steve,
nice to hear from you.
We are in the process of compiling tasks to evolve our DFDL implementation
in 2018 and have been discussing a new task for this. I will check with
others and let you know the status regarding these aspects.
Regards
PS: note that the person responsible on our contractor side has now changed
and is Mr. Oliveira (in copy to this e-mail)
On 3 Oct 2017, at 19:25 , Steve Hanson <smh at uk.ibm.com> wrote:
Hi Michele
This problem was discussed on the DFDL WG call today. We wondered whether
you had made any progress with your contractor?
We also discussed an alternative solution, which is to use XPath 2.0
wildcards. This works as long as the wildcard can only ever match one
element at that level in the infoset. You would need to change your xsd so
that the variability occurred at a specific level in the model. Your
example might then look something like:
"contentLength(/Packet_Data_Field/*/Packet_Secondary_Header, 'bytes')"
Does that work for you?
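
To illustrate the "can only ever match one element" condition, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, whose limited path support includes the * step. The infoset and the MSI_Wrapper element name are invented for illustration, and select_single is a hypothetical helper, not part of any DFDL processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical infoset: the variable name sits at one level (MSI_Wrapper),
# and the fixed name Packet_Secondary_Header sits directly below it.
INFOSET = """
<Packet>
  <Packet_Data_Field>
    <MSI_Wrapper>
      <Packet_Secondary_Header>abcd</Packet_Secondary_Header>
    </MSI_Wrapper>
  </Packet_Data_Field>
</Packet>
"""


def select_single(root, path):
    """Return the one node selected by path, or raise if it is not unique."""
    matches = root.findall(path)
    if len(matches) != 1:
        raise ValueError(f"{path!r} matched {len(matches)} nodes, expected 1")
    return matches[0]


root = ET.fromstring(INFOSET)
header = select_single(root, "./Packet_Data_Field/*/Packet_Secondary_Header")
```

If a second child of Packet_Data_Field also contained a Packet_Secondary_Header, the wildcard step would select two nodes and the helper would raise, which is the situation the schema change is meant to rule out.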
Regards
Steve Hanson
From: Michele Zundo <michele.zundo at esa.int>
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc: "Rui Mestre (DME)" <rui.mestre at deimos.com.pt>, "dfdl-wg at ogf.org"
<dfdl-wg at ogf.org>, Maurizio De Bartolomei <Maurizio.De.Bartolomei at esa.int>
Date: 26/01/2017 04:36
Subject: Re: [DFDL-WG] Action 287: Find a way to handle a variable
path step in DFDL expression - Re: DFDL4S use of wildcard/regex in
length path expression
Sent by: "dfdl-wg" <dfdl-wg-bounces at ogf.org>
Dear Mike,
Thanks for the elaborate explanation and analysis. I will pass this to our
contractor and see what the possibilities and time scales are for any
intervention.
Note that in the past we had discussed the implementation of variables in
our parser but put it on hold due to lack of resources.
I will let you and the rest of the WG know the outcome of the discussion,
although it might take some time.
Regards
Michele Zundo
Sent from my iPhone
On 19 Jan 2017, at 18:16, Mike Beckerle <mbeckerle.dfdl at gmail.com> wrote:
I have analyzed this use of (.*) notation in the
Sentinel2X-bandTMISPData.xsd file.
Below is my discussion of what this is, why it is used, the alternative to
it, and why it is more problematic than it seems at first, as an addition
to DFDL.
There are two instances of (.*) in the Sentinel schema files.
What it expresses is a partial step-name wildcard. This is not a
variation on the XPath * notation, as that matches any step, and can't
match a partial name. In the usage in the Sentinel schema, this wild card
is a regex that matches against a step name, and it matches exactly a
single name. It is not used in a way where it can result in a node-set
instead of a single node.
The question really is why is this (.*) needed - that is what is it trying
to achieve, and whether there is an acceptable alternative for what it is
doing.
I find that this wildcard (.*) is used to achieve a parameterization of
the types TypeISPData and TypeISPData_HKTM. These types are polymorphic,
in that their exact behavior depends on aspects of their surrounding
context. Each of these types incorporates a length of some content that
is outside of its own definition.
As the types are used now, the length comes from a thing outside of them
in the schema that happens to have a particular suffix on its name, which
is "Packet_Secondary_Header" or just "_Secondary_Header".
So these names, while outside of the type, are in some sense being
hard-coded in these types. Even though they are outside of these types,
you cannot change the names of these elements without breaking the ability
of the type to find them via this (.*) notation and a name suffix.
These types, TypeISPData and TypeISPData_HKTM are placed inside other
types, and those are then placed in context with various packet-header
structures. Those structures have various distinguishing prefixes of
sub-elements such as MSI_Packet_Secondary_Header. There are a variety of
different things instead of "MSI_", but there's always something with
suffix "Packet_Secondary_Header" or "Secondary_Header" in it.
The use of this (.*) name wildcard seems convenient, but doesn't offer
anything that isn't better captured by a true parameterization mechanism
which decouples the names used outside the type from those used inside it.
DFDL provides a general mechanism for this sort of parameterization using
variables and dfdl:newVariableInstance.
Let's look at just one of the two instances, TypeISPData.
To use variables, a variable is created which represents this parameter to
the TypeISPData. It is declared in the schema file where TypeISPData is
defined.
<dfdl:defineVariable name="Additional_Content_Length" type="xs:int"/>
The expression within TypeISPData currently contains
"contentLength(/Packet_Data_Field/(.*)Packet_Secondary_Header, 'bytes')"
That part of the expression would instead reference the parameter variable
$Additional_Content_Length
At the point of use, where this packet secondary header element is
combined with the element that contains the TypeISPData, at that location,
a dfdl:newVariableInstance is created and bound like so:
...
<xs:element name="MSI_Packet_Secondary_Header" type="TypePacketData_MSI"/>
<xs:sequence>
  <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
    <dfdl:newVariableInstance
      ref="Additional_Content_Length"
      defaultValue="{ dfdl:contentLength(../MSI_Packet_Secondary_Header, 'bytes') }"/>
  </xs:appinfo></xs:annotation>
  <xs:element name="MSI_User_Data_Field" type="TypeUserData"/>
</xs:sequence>
...
This binds the parameter to the needed length around the point of use of
the type that needs it (which is down inside the TypeUserData).
This seems bulky due to the XML and XSD-annotation notational overheads,
but it is just parameter-binding as would occur in an ordinary programming
language when passing an argument.
Using variables decouples the names of the two parts of the schema
entirely. For example, if MSI_Packet_Secondary_Header were to be renamed
to MSI_PSH, it would change only the point of use where the variable
binding occurs, and would not affect the type definition. This also
facilitates testing the types in isolation. It eliminates the need to
have the complete data structure surrounding them.
Use of (.*) name matching might seem convenient, but really it depends on
an unstated invariant about the way names are chosen in the Sentinel
schema. If the names weren't constructed so uniformly, this notational
trick would be unable to make the necessary distinctions, and you'd have
to fall back on using variables.
So, the above explains an alternative, already in DFDL, that can achieve
the parameterization of types that is needed.
There is an additional issue that makes this (.*) notation problematic.
The issue is the semantics relative to QNames and ordinary XML namespace
management for name conflict control. The Sentinel schema is all a "no
namespace" schema. In that world a path step is just an identifier with no
structure. However, more generally in DFDL, a path step might be a QName,
and those have a namespace prefix part, and a name part. Name collisions
and conflicts can be managed by use of different target namespaces and
namespace prefixes. Using (.*) prefix regex-matching against QNames is
problematic, as it subverts the ability to use namespace prefixes to
eliminate name conflicts. E.g., does (.*)_Secondary_Header match
"foo:My_Secondary_Header" given that it is in the foo namespace? Matching
against things without taking namespace prefixes into consideration breaks
XSD's ability to use prefixes to manage name collisions. So the answer to
whether "(.*)_Secondary_Header" matches "foo:My_Secondary_Header" would have to
be that it does not match, and that if one wanted to match that one would
have to write "bar:(.*)_Secondary_Header" where bar prefix is bound to the
same namespace as foo in the schema file where My_Secondary_Header is
defined. This mixture of some plain-textual regex matching, and some
name-prefix-qualified namespace-sensitive matching is not very attractive.
There is certainly nothing like this in XPath.
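
The matching rule described above can be sketched in Python as follows. This is only an illustration of the semantics under discussion, not anything defined by DFDL or XPath: the step_matches function, the PREFIXES table and the urn:example:foo namespace are all hypothetical, and the sketch assumes the regex part itself contains no colon.

```python
import re

# Hypothetical prefix bindings in the schema file that contains the path.
# Both prefixes are bound to the same (made-up) namespace.
PREFIXES = {"foo": "urn:example:foo", "bar": "urn:example:foo"}


def step_matches(step, element_ns, element_local, prefixes=PREFIXES):
    """Match a step of the form [prefix:]regex against an expanded name.

    The prefix is resolved to a namespace and compared exactly; only the
    local-name part is treated as a regular expression.
    """
    prefix, sep, local_pattern = step.rpartition(":")
    step_ns = prefixes[prefix] if sep else None  # no prefix means no namespace
    if step_ns != element_ns:
        return False
    return re.fullmatch(local_pattern, element_local) is not None


# My_Secondary_Header lives in the namespace that foo and bar are bound to.
unqualified = step_matches("(.*)_Secondary_Header",
                           "urn:example:foo", "My_Secondary_Header")
qualified = step_matches("bar:(.*)_Secondary_Header",
                         "urn:example:foo", "My_Secondary_Header")
```

Here the unqualified step does not match while the bar-qualified one does, which shows the mixture of textual and namespace-sensitive matching that makes the notation unattractive.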
The last issue is implementation complexity:
If one wants to do without the (.*) notation, the cost in terms of making
DFDL4S more complex to implement is that you would need to add variables.
Fortunately variables are really one of the simpler things in DFDL to
implement, at least for parsing, and they are very simple if the
dfdl:setVariable annotation is not implemented, since it is not needed in
this case.
Each variable has a stack. The dfdl:newVariableInstance pushes a new entry
onto the stack, and in this case, populates it immediately with a value.
Reference to the variable from an expression always takes the value of the
top-of-stack location.
Exiting the scope of the annotation where the dfdl:newVariableInstance
appeared pops the stack.
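
The stack behaviour just described can be sketched in a few lines of Python. This is a simplified illustration, not DFDL4S or any other processor's implementation: the VariableStack class and data_length function are hypothetical names, the sample values are invented, and the arithmetic mirrors the dfdl:length expression quoted elsewhere in this thread.

```python
class VariableStack:
    """One stack per DFDL variable (simplified; parsing only)."""

    def __init__(self, name):
        self.name = name
        self._instances = []

    def new_instance(self, value):
        # dfdl:newVariableInstance: push a new entry, populated immediately.
        self._instances.append(value)

    def read(self):
        # A reference such as $Additional_Content_Length reads the top entry.
        if not self._instances:
            raise LookupError(f"variable {self.name} has no instance in scope")
        return self._instances[-1]

    def end_scope(self):
        # Leaving the annotated construct pops the instance.
        self._instances.pop()


def data_length(packet_data_length, var):
    # Packet_Data_Length + 1 - <secondary header length in bytes> - 2
    return packet_data_length + 1 - var.read() - 2


acl = VariableStack("Additional_Content_Length")
acl.new_instance(6)             # hypothetical 6-byte secondary header
length = data_length(100, acl)  # hypothetical Packet_Data_Length of 100
acl.end_scope()
```

No dfdl:setVariable equivalent appears in the sketch, in line with the observation that it is not needed for this use case.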
In summary, given (a) the availability of a robust parameterization
mechanism using parameter variables and newVariableInstance to bind them,
(b) the ease of implementing variables, and (c) the complexity of working
out interactions with XML/XSD namespaces/prefixes and name management, I
would be disinclined to advocate adding something like this (.*) notation
to DFDL unless it was first added to XPath, where all necessary details
were worked out for us.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
On Tue, Sep 13, 2016 at 10:18 AM, Michele Zundo <michele.zundo at esa.int>
wrote:
Dear Steve,
sorry for the delay due to Summer break and other projects.
Here is the .zip.
Please note that we realised that one of our previous replies, dated 25 July
2016 at 17:45:05 GMT+2,
was a misunderstanding on our part and is not applicable.
Regarding your questions please find answers interleaved below
Michele
This message and any attachments are intended for the use of the addressee
or addressees only.
The unauthorised disclosure, use, dissemination or copying (either in
whole or in part) of its
content is not permitted.
If you received this message in error, please notify the sender and delete
it from your system.
Emails can be altered and their integrity cannot be guaranteed by the
sender.
Please consider the environment before printing this email.
Regards
From: Steve Hanson [smh at uk.ibm.com]
Sent: Tuesday, July 26, 2016 3:20 AM
To: Michele Zundo
Cc: Mike Beckerle; rui.mestre at deimos.com.pt
Subject: Re: Fwd: OGF DFDL WG Call Minutes 2016-07-05
….snip
dfdl:length="{/Packet_Primary_Header/Packet_Data_Length + 1 -
contentLength(
/Packet_Data_Field/(.*)Packet_Secondary_Header, 'bytes') - 2}"
Firstly, contentLength is a DFDL function so it needs to be in the DFDL
namespace, eg, dfdl:contentLength().
Yes, we agree with you. We will add the dfdl: prefix in future releases and
modify the applications accordingly.
Secondly, the first argument to dfdl:contentLength() is a path, so you are
effectively still using regular expressions in path steps.
Yes. For now we are using it and expect this to become part of the
standard.
Regards
Steve Hanson
From: Michele Zundo <michele.zundo at esa.int>
To: Steve Hanson/UK/IBM at IBMGB
Cc: Mike Beckerle <mbeckerle at tresys.com>
Date: 25/07/2016 17:10
Subject: Fwd: OGF DFDL WG Call Minutes 2016-07-05
Dear Steve,
Please find below the answer from our developers and example.
Note that we have updated our implementation of DFDL to be as compliant as
we can at this point in time, with the exception noted below.
Michele
Begin forwarded message:
From: "Rui Mestre (DME)" <rui.mestre at deimos.com.pt>
Subject: Re: Fwd: OGF DFDL WG Call Minutes 2016-07-05
Date: 25 July 2016 at 17:45:05 GMT+2
Dear Michele,
I believe that after our DFDL compliance effort the mentioned "use of a
regex in the path step of a DFDL expression" is no longer in place.
Currently the only extension implemented in DFDL4S regarding the use of
regular expressions is that the implementation of dfdl:contentLength is
extended to also support regular expressions when specifying the node.
Please find attached a schema file example containing such extension in
the use of dfdl:contentLength.
Best regards,
Rui
Begin forwarded message:
From: Steve Hanson <smh at uk.ibm.com>
Subject: OGF DFDL WG Call Minutes 2016-07-05
Date: 5 July 2016 at 17:49:13 GMT+2
To: dfdl-wg at ogf.org
Cc: "Mike Beckerle" <mbeckerle at tresys.com>, "Michele Zundo" <
michele.zundo at esa.int>
Please find minutes from the above call at
https://redmine.ogf.org/dmsf_files/13537?download=
@Michele - please can you send to the WG a schema that shows your use of a
regex in the path step of a DFDL expression ?
Next call Aug 2nd
Regards
Steve Hanson
-----------------------------------------
Michele Zundo
Head of Ground System Definition and Verification Office
EOP-PEP
European Space Agency, ESTEC
e-mail: michele.zundo at esa.int
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
#### Sentinel2X-bandTMISPData.xsd moved to MyAttachments Repository V3.8 (
Link) on 23 August 2016 by Steve Hanson.
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg