[DFDL-WG] Fw: Suggest should be optional feature of DFDL - dfdl:utf16Width='variable' and other corner cases (action 290)
Steve Hanson
smh at uk.ibm.com
Thu Nov 15 08:32:06 EST 2018
Hi Mike
I've been looking into IBM DFDL's treatment of property dfdl:utf16Width.
While we claim to support 'variable' and have a few tests that use it,
there are fewer tests than I would expect for testing the property
fully. The intent to support 'variable' is clear in the code, though; for
example, when parsing we check each char for being part of a surrogate
pair and adjust length accordingly. The code uses java.nio.charset for its
encoders & decoders, which we wrap in our own class which notes whether
utf16 is fixed or variable, but this information is not passed to the
encoder/decoder as there is no way to do so. Hmm. We will add some more
tests and see if everything is behaving.
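To make the surrogate-pair check concrete, here is a minimal sketch of how a length in characters could differ between 'fixed' and 'variable'. It is an illustration only, not IBM DFDL code: the class name Utf16Length and the method lengthInCharacters are hypothetical, and it assumes the text has already been decoded into a Java String.

```java
// Hypothetical helper for illustration; not taken from IBM DFDL or Daffodil.
final class Utf16Length {

    /**
     * Counts characters in already-decoded UTF-16 text.
     * variableWidth == false models utf16Width='fixed': every 16-bit code unit is one character.
     * variableWidth == true models utf16Width='variable': a surrogate pair is one character.
     */
    static int lengthInCharacters(String text, boolean variableWidth) {
        if (!variableWidth) {
            return text.length(); // number of 16-bit code units
        }
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // A high surrogate followed by a low surrogate forms one character.
            if (Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                i++; // skip the low surrogate of the pair
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', one supplementary character, 'b'
        System.out.println(lengthInCharacters(s, false)); // 4
        System.out.println(lengthInCharacters(s, true));  // 3
    }
}
```

An unpaired surrogate is counted as one character in this sketch; a real processor would need to decide how that case interacts with its error handling.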
Back to your original question: should 'variable' be an optional feature
of the spec? I have discussed this with implementation team members and we
think that is a sensible thing to do. Handling surrogates does require
extra code to be written, and a minimal implementation should not need to
do that.
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc: "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
Date: 04/04/2017 11:56
Subject: Re: [DFDL-WG] Suggest should be optional feature of DFDL -
dfdl:utf16Width='variable' and other corner cases
Some light on action 291 - see the last sentence of this extract from the
original errata document (experience doc 1):
3.9. Section 12.3.5, 7.3.1, 7.3.2. The spec originally allows lengthKind
‘pattern’ to be used when the representation of the current element, or of
a child element, is binary, but imposes restrictions on the encoding that
can be in force.
Clarify that the encoding property must be defined for the element (else
schema definition error), and that a decoding processing error is possible
if the match of the regex encounters data that does not decode in that
encoding, dependent on the setting of encodingErrorPolicy. Remove section
12.3.5.1.
Same clarifications needed for testKind ‘pattern’ property for asserts and
discriminators.
For consistency, the restriction that a complex element of specified
length and lengthUnits ‘characters’ must have children that are all text
and that have the same encoding as the complex element, is dropped.
So that explains how IBM DFDL's error message CTDV1524E came about: it was
policing a restriction in the original GFD.174 spec, a restriction which
no longer exists. IBM DFDL has not yet implemented the erratum. It wasn't
an extra IBM restriction.
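As an aside on the decoding error mentioned in the extract, the sketch below shows how the two encodingErrorPolicy settings might map onto java.nio.charset decoders. The mapping of 'error' to CodingErrorAction.REPORT and 'replace' to CodingErrorAction.REPLACE is my assumption for illustration, and the class DecodePolicy and its method are hypothetical, not IBM DFDL code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not taken from any DFDL implementation.
final class DecodePolicy {

    /**
     * Decodes UTF-8 bytes.
     * replace == true  : undecodable bytes become U+FFFD (assumed analogue of 'replace').
     * replace == false : undecodable bytes raise an error (assumed analogue of 'error').
     */
    static String decode(byte[] data, boolean replace) {
        CodingErrorAction action =
                replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Stand-in for a DFDL processing error.
            throw new IllegalStateException("decoding processing error", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {0x41, (byte) 0xFF, 0x42}; // 0xFF never appears in valid UTF-8
        System.out.println(decode(data, true).length());
        try {
            decode(data, false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```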
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc: "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
Date: 14/09/2016 08:44
Subject: Re: [DFDL-WG] Suggest should be optional feature of DFDL -
dfdl:utf16Width='variable' and other corner cases
Actions 290 and 291 raised to investigate further - see minutes.
Regards
Steve Hanson
IBM Integration Bus, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>
Cc: "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
Date: 13/09/2016 13:14
Subject: Re: [DFDL-WG] Suggest should be optional feature of DFDL -
dfdl:utf16Width='variable' and other corner cases
Mike
I am assuming that the processing for utf-16 'fixed' or 'variable' is
entirely handled by ICU so there should be no coding overhead.
IBM DFDL works ok for dfdl:lengthKind='explicit' for an element of complex
type with dfdl:lengthUnits='characters' and dfdl:encoding="utf-8". But
there are conditions that the content of the complex type must satisfy,
otherwise an SDE (schema definition error) results, such as:
CTDV1524E : For a complex element, when 'lengthKind' is 'explicit' or
'prefixed', and 'lengthUnits' is characters, all simple child elements
must have text representation, 'lengthUnits' set to 'characters' and the
same encoding.
So we insist that the properties of the children are consistent with the
properties of the parent. If you recall, IBM DFDL does all these kinds of
validation checks in a pre-processing phase.
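The consistency check described by CTDV1524E could be sketched roughly as follows. This is only an illustration of the rule as worded in the message: the Child class, its fields and the method inconsistentChildren are hypothetical and are not the IBM DFDL data model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model for illustration; not the IBM DFDL implementation.
final class ComplexLengthCheck {

    /** The properties of one simple child element that the rule looks at. */
    static final class Child {
        final String name;
        final String representation; // "text" or "binary"
        final String lengthUnits;    // "characters", "bytes" or "bits"
        final String encoding;

        Child(String name, String representation, String lengthUnits, String encoding) {
            this.name = name;
            this.representation = representation;
            this.lengthUnits = lengthUnits;
            this.encoding = encoding;
        }
    }

    /**
     * Returns the names of children that break the rule: text representation,
     * lengthUnits 'characters', and the same encoding as the parent.
     * An empty result means no schema definition error would be raised.
     */
    static List<String> inconsistentChildren(String parentEncoding, List<Child> children) {
        List<String> bad = new ArrayList<>();
        for (Child c : children) {
            boolean ok = "text".equals(c.representation)
                    && "characters".equals(c.lengthUnits)
                    && parentEncoding.equalsIgnoreCase(c.encoding);
            if (!ok) {
                bad.add(c.name);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<Child> children = Arrays.asList(
                new Child("id", "text", "characters", "utf-8"),
                new Child("flags", "binary", "bytes", "utf-8"));
        System.out.println(inconsistentChildren("utf-8", children)); // [flags]
    }
}
```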
That seems a pretty sensible rule but I am not sure if the rule appears in
the spec as such - I just had a quick look but didn't spot anything.
So I guess I don't see a need for these things to be optional features?
Regards
Steve Hanson
IBM Integration Bus, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
Date: 10/08/2016 18:57
Subject: [DFDL-WG] Suggest should be optional feature of DFDL -
dfdl:utf16Width='variable' and other corner cases
Sent by: "dfdl-wg" <dfdl-wg-bounces at ogf.org>
Given the limited set of required encodings for a conforming DFDL
processor, I believe dfdl:utf16Width='variable' should be an optional
feature.
That's just consistency with what is optional already. But it is also
quite hard to implement.
There are other situations that are very hard to implement, probably never
used by real users, yet which are non-optional in the spec:
I would suggest that dfdl:lengthKind='explicit' for elements of complex
type, with dfdl:lengthUnits='characters' and a variable-width encoding
like utf-8 is very problematic to implement. I am pretty sure IBM DFDL has
no implementation of this per email threads, and I know I don't want to
implement this in Daffodil even though we're trying to be very
comprehensive in the implementation eventually.
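One way to see part of the difficulty: with a variable-width encoding, a length given in characters cannot be converted to a byte count without walking the data. The sketch below does that walk for UTF-8. It is an illustration under stated assumptions, not Daffodil or IBM DFDL code: the class Utf8CharLength and its method are hypothetical, one code point is treated as one character, and only lead bytes are inspected (continuation bytes are not validated).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not taken from Daffodil or IBM DFDL.
final class Utf8CharLength {

    /**
     * Returns how many bytes the first charCount characters occupy,
     * starting at offset, or -1 if the data ends before that many
     * characters have been seen.
     */
    static int bytesForCharacters(byte[] data, int offset, int charCount) {
        int pos = offset;
        for (int n = 0; n < charCount; n++) {
            if (pos >= data.length) {
                return -1; // ran out of data
            }
            int lead = data[pos] & 0xFF;
            int width;
            if (lead < 0x80) {
                width = 1;
            } else if ((lead & 0xE0) == 0xC0) {
                width = 2;
            } else if ((lead & 0xF0) == 0xE0) {
                width = 3;
            } else if ((lead & 0xF8) == 0xF0) {
                width = 4;
            } else {
                // Continuation or invalid byte where a lead byte was expected,
                // for example because a child element is binary.
                throw new IllegalArgumentException("not a UTF-8 lead byte at " + pos);
            }
            pos += width;
        }
        return pos > data.length ? -1 : pos - offset;
    }

    public static void main(String[] args) {
        // 1-byte, 2-byte, 3-byte and 4-byte characters in sequence.
        byte[] data = "a\u00E9\u20AC\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytesForCharacters(data, 0, 3)); // 6
        System.out.println(bytesForCharacters(data, 0, 4)); // 10
    }
}
```

For a complex element the scan has to cover the combined content of all children, which is why binary or differently encoded children make the case awkward.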
I think implementations should be free to just not implement this. These
sorts of cases often exist just because we're trying to preserve some
orthogonality of composition in the language. So it's possible to do quite
a few things that probably aren't ever needed by anyone, that reflect
ill-defined data formats, etc.
I'd rather not document a bunch of "non-conformances" for Daffodil or
other implementations for these sorts of things. I'd like to say we don't
implement them, but they're optional, and so that's allowed.
Comments?
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU