[DFDL-WG] Part 1 - Re: Action 307 - Demonstrate implementation interoperability

Tue Oct 9 09:35:11 EDT 2018

Very helpful, Steve H., thanks.

re: UTF-8 and BOM, for UTF-8, the BOM can be viewed as "just a character",
same as it is in UTF-16BE and UTF-16LE.

Only unadorned utf-16 has to actually look for the BOM and, in theory,
strip it if found. Nobody is implementing this, and it's not clear it
matters much.

Today I know that Daffodil just treats UTF-16 as meaning UTF-16BE.

Hence, I suggest we consider making BOM processing optional in DFDL, and
also making utf-16 (unadorned) optional. That removes one small obstacle to
being "standard compliant". This leaves the question of what unadorned
"utf-16" does. The answer, I think, is supposed to be guided by the BOM,
but if that is unimplemented then the behavior is "implementation defined",
i.e., non-portable.
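
To make the BOM distinction concrete, here is a minimal Python sketch. It is
only an illustration, not how Daffodil or IBM DFDL work. The helper name
decode_utf16_unadorned is hypothetical, and the big-endian fallback is an
assumption that mirrors the Daffodil behavior mentioned above.

```python
import codecs


def decode_utf16_unadorned(data: bytes) -> str:
    """Decode bytes declared as plain 'UTF-16' (hypothetical helper).

    A leading BOM selects the byte order and is stripped. Without a BOM
    we assume big-endian, which is an assumption of this sketch.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")


# With an explicit byte order, or with UTF-8, the BOM bytes are not
# interpreted. They decode to the ordinary character U+FEFF.
bom_as_char_utf16be = b"\xfe\xff\x00A".decode("utf-16-be")
bom_as_char_utf8 = b"\xef\xbb\xbfA".decode("utf-8")
```

Here only the unadorned case inspects the first two bytes. The other two
decodes leave U+FEFF in the result as "just a character".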

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>

On Tue, Oct 9, 2018 at 6:25 AM Steve Hanson <smh at uk.ibm.com> wrote:

> Mike, responses in-line below.
>
> Regards
>
> Steve Hanson
>
> IBM Hybrid Integration, Hursley, UK
> Architect, *IBM DFDL*
> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> *smh at uk.ibm.com* <smh at uk.ibm.com>
> tel:+44-1962-815848
> mob:+44-7717-378890
> Note: I work Tuesday to Friday
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Steve Hanson <smh at uk.ibm.com>
> Cc:        DFDL-WG <dfdl-wg at ogf.org>
> Date:        03/10/2018 23:00
> Subject:        Part 1 - Re: [DFDL-WG] Action 307 - Demonstrate
> implementation interoperability
> ------------------------------
>
>
>
> I'm going to reply to this in a few parts.
>
> With respect to:
> - dfdl:binaryBooleanTrueRep with value empty string
> - dfdl:assert on global element and simple type
> - dfdl:discriminator on global element and simple type
> - Multiple xs:appinfo elements within each xs:annotation element
> I think these are minor non-compliances with the DFDL spec, and for
> interoperability testing we can just revise schemas under test to not use
> these constructs.
>
> SMH: Agree.
>
> With respect to:
> - When parsing, the distinction between an element being 'missing',
> having an 'empty representation' and having an 'absent representation', is
> not in accordance with the specification.
> I think time will tell here, that is, there's nothing we can anticipate
> having to do because of this as yet. If this non-compliance does not cause
> interoperability problems for realistic and published DFDL schemas then I
> wouldn't worry about it. Like IBM DFDL, Daffodil does not implement default
> values during parsing, and that's a likely area where this issue of
> missing/empty/absent has effect on behavior. It is quite possible that
> despite this lack of conformance to the DFDL spec., interoperability
> testing would be successful.
>
> SMH: IBM DFDL gives a runtime SDE when parsing if a zero-length
> representation is found for an occurrence AND the element has a default
> value. That prevents a behaviour change when support for default values
> when parsing is implemented. Suggest Daffodil does the same if it does not
> do so already. With that in place, I think we are ok.
>
> With respect to:
> - When encoding is 'UTF-8' or 'UTF-16', byte order marks are not processed
> Daffodil also does not implement byte-order-mark processing. We can dodge
> this issue entirely if we make the UTF-16 charset (specifically UTF-16
> without the BE or LE suffix) encoding an optional DFDL feature. That
> effectively makes byte-order-mark processing also an optional feature, and
> then both IBM DFDL and Daffodil would be compliant and interoperable.
>
> SMH: UTF8 can also have a BOM so that does not solve the problem entirely.
> Needs some more thought.
>
> With respect to:
> - dfdl:encodingErrorPolicy "replace"
> This one is harder. Daffodil doesn't implement encodingErrorPolicy='error'
> so we have no common ground here for interoperability testing.
> Making the entire encodingErrorPolicy property optional - meaning behavior
> in the presence of encoding errors is implementation specified  - that's
> super undesirable to me.
> I suspect that implementing encodingErrorPolicy 'error' will be necessary
> for Daffodil. If we do that then IBM DFDL can continue to document the lack
> of this required feature of DFDL, or we can make 'replace' optional
> in the spec., or IBM could implement 'replace'.
>
> SMH: This is top of the list of missing features for IBM DFDL. I have
> asked in the past if this could be added as it's technically a regression
> when compared to IIB's older text/binary parser (MRM).  I will ask again.
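
As a rough analogy for the two encodingErrorPolicy values when parsing, here
is a small Python sketch. It uses Python's own codec error handlers as a
stand-in and is not taken from either implementation. The function name
decode_with_policy is hypothetical.

```python
def decode_with_policy(data: bytes, charset: str, policy: str) -> str:
    """Decode data, handling malformed bytes per a DFDL-like policy."""
    if policy == "replace":
        # Each undecodable byte sequence becomes U+FFFD.
        return data.decode(charset, errors="replace")
    if policy == "error":
        # Malformed input raises UnicodeDecodeError, standing in for a
        # processing error in this sketch.
        return data.decode(charset, errors="strict")
    raise ValueError("unknown policy: " + policy)


bad = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8
replaced = decode_with_policy(bad, "utf-8", "replace")
```

With 'replace' the parse continues past the bad byte. With 'error' the same
input fails, which is why the two implementations currently have no common
ground on malformed data.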
>
> *Additional Non-portable/Problematic Required Features*
>
> I did an analysis of all DFDL properties, and those that must be
> implemented to meet the minimum functionality that is not optional for a
> DFDL implementation per Section 21 of the spec.
> Starting from a list of all DFDL properties, I eliminated any specific to
> unparsing, and then any that aren't relevant given something optional in
> Section 21.
>
> Here are the remaining properties I found. Restrictions on the values of
> these properties are noted where their full functionality is considered
> optional:
>
>    - length - integer values only
>    - lengthKind - explicit, implicit only
>    - lengthUnits - bytes or characters only
>    - representation - binary only
>    - byteOrder
>    - alignment - number or 'implicit'
>    - alignmentUnits - bytes only
>    - fillByte
>    - leadingSkip
>    - trailingSkip
>    - encoding - 'UTF-8', 'UTF-16', 'UTF-16BE', 'UTF-16LE', 'ASCII', and
>    'ISO-8859-1'
>    - encodingErrorPolicy - (Already discussed above, so not further
>    discussed in this section)
>    - utf16Width - because UTF-16 is allowed for encoding, 'variable' is
>    problematic.
>    - textPadKind
>    - textTrimKind
>    - textStringJustification
>    - textStringPadCharacter
>    - binaryNumberRep - binary only
>    - binaryFloatRep - ieee only
>    - binaryBooleanTrueRep
>    - binaryBooleanFalseRep - IBM DFDL doesn't allow empty string for
>    this. (Minor.)
>    - binaryCalendarRep - binarySeconds, binaryMilliseconds only
>    - binaryCalendarEpoch
>    - occursCountKind - fixed only
>    - occursCount - integer only
>
> Looking at this list, there is only 1 additional issue to
> portability/interoperability this raises today given what I know about the
> Daffodil implementation and the IBM implementation.
>
> *Issue: utf16Width='variable'*
>
>
> This issue can be addressed with a minor change to the DFDL specification.
>
> When the type is xs:string and lengthUnits is 'characters', the length
> in characters should count each surrogate pair found in the UTF-16 data
> as occupying 1 character.
>
> This utf16Width='variable' feature of DFDL should be optional, as Java
> JVM-based implementations will find this extremely difficult to support,
> since JVM standard string representations cannot represent individual
> characters with code points greater than 0xFFFF occupying 1 location in a
> string.
>
> Daffodil does not implement this 'variable' behavior, and we have no good
> pathway to do so. Hence, I would prefer to change the DFDL spec to make
> 'variable' optional. Only 'fixed' would be required. I could even support
> deprecating the whole property.
>
> SMH: This is already captured by action 290, which is waiting for me to do
> some tests with IBM DFDL which claims to have implemented this.
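
To illustrate the counting difference behind utf16Width, here is a short
Python sketch. It assumes that 'fixed' counts every 16-bit code unit as one
character and that 'variable' counts a surrogate pair as one character. The
function name utf16_lengths is hypothetical.

```python
def utf16_lengths(text: str) -> tuple:
    """Return (code_units, code_points) for text.

    code_units is the length when every 16-bit unit counts as a character.
    code_points is the length when a surrogate pair counts as one.
    """
    code_units = len(text.encode("utf-16-be")) // 2
    code_points = len(text)  # Python strings are indexed by code point
    return code_units, code_points


# U+1F600 lies above 0xFFFF, so UTF-16 needs a surrogate pair for it.
units, points = utf16_lengths("a\U0001F600b")
```

For this three-character string the two counts are 4 and 3. A string type
indexed by 16-bit code units, such as the JVM's, naturally yields the first
number.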
>
>
> *Issue: lengthUnits='characters' and variable-width charset encodings*
>
> I believe this is required behavior. I also believe the lack of support
> for this is missing from IBM's list of non-compliances. I recall discussion
> that IBM DFDL requires a fixed width encoding in this situation where
> lengthUnits is 'characters'.  (Please correct me if I am wrong.)
>
> I suggest that making this combination an optional feature of the DFDL
> spec. would resolve the issue.
>
> This complex feature was added to support naive data format conversions
> where data originally had ascii encoding and lengthUnits 'bytes', and is
> changed to 'utf-8' with lengthUnits 'characters'. This is a rational way
> to modernize a data format, adding internationalization capability.
> However, it requires a significant change in runtime behavior because each
> utf-8 character occupies between 1 and 4 bytes.
>
> SMH: IBM DFDL certainly supports lengthUnits="characters" and
> encoding="UTF-8", which is an example of this.
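
The runtime change can be sketched in Python as follows. This is only an
illustration of scanning a field whose length is given in characters over
UTF-8 data, not code from either implementation. The function name
take_utf8_chars is hypothetical.

```python
def take_utf8_chars(data: bytes, n_chars: int) -> tuple:
    """Return (text, bytes_consumed) for the first n_chars characters."""
    pos = 0
    for _ in range(n_chars):
        if pos >= len(data):
            raise ValueError("not enough data")
        lead = data[pos]
        # The lead byte determines how many bytes the character occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        pos += width
    if pos > len(data):
        raise ValueError("truncated character")
    return data[:pos].decode("utf-8"), pos


sample = "n\u00e9\u20ac\U0001F600x".encode("utf-8")  # widths 1, 2, 3, 4, 1
text, consumed = take_utf8_chars(sample, 4)
```

With ascii and lengthUnits 'bytes' the field end is known up front. Here a
length of 4 characters consumes 10 bytes, and that number is only known
after scanning the data.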
>
>
> *Optional Features that are Partially Implemented*
>
> The bigger set of concerns for interoperability is the behavior of a DFDL
> processor for features that are optional by strict interpretation of
> Section 21, and that a specific DFDL implementation does implement, but
> only partially. This is the subject of other email messages, however.
>
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> *www.tresys.com* <http://www.tresys.com>
> Please note: Contributions to the DFDL Workgroup's email discussions are
> subject to the *OGF Intellectual Property Policy*
> <http://www.ogf.org/About/abt_policies.php>
>
>
>
> On Tue, Sep 11, 2018 at 11:33 AM Steve Hanson <*smh at uk.ibm.com*
> <smh at uk.ibm.com>> wrote:
> Action 307 was raised recently and first task is for implementations to
> identify which core spec behaviour is not implemented.
>
> * IBM DFDL *
>
> The following is the list of DFDL 1.0 spec core features that IBM DFDL
> does not yet implement.
>
> - dfdl:encodingErrorPolicy "replace"
> - dfdl:binaryBooleanTrueRep with value empty string
> - dfdl:assert on global element and simple type
> - dfdl:discriminator on global element and simple type
> - Multiple xs:appinfo elements within each xs:annotation element
> - When parsing, the distinction between an element being 'missing',
> having an 'empty representation' and having an 'absent representation', is
> not in accordance with the specification.
> - When encoding is 'UTF-8' or 'UTF-16', byte order marks are not processed
>
> The above list is derived from information at
> <https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/df00150_.htm>
> and covers only those items that apply to core spec features.
>
> Regards
>
> Steve Hanson
> IBM Hybrid Integration, Hursley, UK
> Architect, *IBM DFDL*
> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> *smh at uk.ibm.com* <smh at uk.ibm.com>
> tel:+44-1962-815848
> mob:+44-7717-378890
> Note: I work Tuesday to Friday
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
> --
>   dfdl-wg mailing list
>   *dfdl-wg at ogf.org* <dfdl-wg at ogf.org>
>   *https://www.ogf.org/mailman/listinfo/dfdl-wg*
> <https://www.ogf.org/mailman/listinfo/dfdl-wg>
>