[DFDL-WG] Byte-order-marks - was: Re: Part 1 - Re: Action 307 - Demonstrate implementation interoperability

Mike Beckerle mbeckerle.dfdl at gmail.com
Tue Oct 16 09:56:51 EDT 2018


To avoid changing behavior, then, we will probably need a property to turn
on/off the BOM behavior that strips/generates BOMs. No DFDL implementation
that implements text (IBM DFDL and Daffodil) currently treats BOMs
specially: they neither strip them nor generate them.

I suggest we need byteOrderMarkPolicy="use/ignore", with "ignore" meaning
that the BOM is just treated as a character. Implementation of
byteOrderMarkPolicy="use" would be an optional feature of DFDL. That way
both IBM DFDL and Daffodil can be compliant without implementing this.
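
To make the suggested semantics concrete, here is a minimal Python sketch
(not DFDL, and not how either implementation works today). The function
names parse_text and unparse_text are made up for illustration, and
byteOrderMarkPolicy itself is only the suggestion above, not an existing
DFDL property. The sketch assumes an explicit encoding such as UTF-8,
UTF-16BE, or UTF-16LE.

```python
BOM = "\ufeff"  # U+FEFF, the byte-order-mark character


def parse_text(data: bytes, encoding: str, bom_policy: str = "ignore") -> str:
    """Decode data; bom_policy mirrors the suggested use/ignore values (hypothetical)."""
    # Assumption: encoding is explicit ("utf-8", "utf-16-be", "utf-16-le").
    # Python's plain "utf-16" codec consumes a BOM by itself, so it is avoided here.
    text = data.decode(encoding)
    if bom_policy == "use" and text.startswith(BOM):
        return text[1:]  # "use": strip a leading BOM so it stays out of the result
    return text  # "ignore": the BOM is just a character


def unparse_text(text: str, encoding: str, bom_policy: str = "ignore") -> bytes:
    """Encode text; with "use", a BOM is generated in front of the data."""
    if bom_policy == "use":
        text = BOM + text
    return text.encode(encoding)
```

With "ignore", a schema that models the BOM explicitly, or lets it appear
as a character, sees exactly the same data as before, which is the
point of having the switch.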

(I'd like everything we've collectively been able to live without thus far,
that isn't needed for interoperability testing, to ultimately get onto the
optional features list.)

Or we can simply deprecate the functionality in the spec, say that BOMs must
be modeled, strike the material from the next draft, and post an example
showing how to model BOMs. It sounds heavy-handed, but nobody has asked
for this feature (on the Daffodil project). It was put into the DFDL
spec way back in the early days when we expected BOMs to be popular, but
they never caught on.

However, I'll acknowledge that IBM would be in a better position to decide
whether this feature is needed, given more users in Asia and other places
where UTF-16 may be more popular.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>



On Wed, Oct 10, 2018 at 5:24 AM Steve Hanson <smh at uk.ibm.com> wrote:

> I think the main thing with BOMs is that when support is added by an
> implementation, it should not break existing behaviour for a document that
> starts with a BOM.  So if a user had a schema that explicitly modelled the
> BOM, or was treating BOM as a character so it appeared in the infoset, then
> a BOM aware implementation should not suddenly change that.
>
> Regards
>
> Steve Hanson
>
> IBM Hybrid Integration, Hursley, UK
> Architect, *IBM DFDL*
> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> *smh at uk.ibm.com* <smh at uk.ibm.com>
> tel:+44-1962-815848
> mob:+44-7717-378890
> Note: I work Tuesday to Friday
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Steve Hanson <smh at uk.ibm.com>
> Cc:        DFDL-WG <dfdl-wg at ogf.org>
> Date:        09/10/2018 14:35
> Subject:        Re: Part 1 - Re: [DFDL-WG] Action 307 - Demonstrate
> implementation interoperability
> ------------------------------
>
>
>
> Very helpful, Steve H., thanks.
>
> re: UTF-8 and BOM, for UTF-8, the BOM can be viewed as "just a character",
> same as it is in UTF-16BE and UTF-16LE.
>
> Only utf-16 unadorned has to actually look at, and in theory strip the BOM
> if found. Nobody is implementing this, and it's not clear it matters much.
>
> Today I know that Daffodil just treats UTF-16 as meaning UTF-16BE.
>
> Hence, I suggest we consider just making BOM processing optional in DFDL
> and also making utf-16 (unadorned) optional. That removes one small
> obstacle to being "standard compliant". This leaves the question of what
> unadorned "utf-16" does. The answer, I think, is supposed to be guided by
> the BOM, but if that is unimplemented then the behavior is
> "implementation defined", i.e., non-portable.
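
As a rough illustration of "guided by BOM", the Python sketch below picks a
concrete byte order for data labelled with unadorned UTF-16. The function
name resolve_utf16 and the big-endian fallback are assumptions for the
example; the fallback simply mirrors the statement above that Daffodil
treats UTF-16 as UTF-16BE. It is not taken from either implementation.

```python
import codecs
from typing import Tuple


def resolve_utf16(data: bytes, default: str = "utf-16-be") -> Tuple[str, bytes]:
    """Return (concrete encoding, remaining bytes) for data labelled plain UTF-16."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le", data[2:]
    # No BOM present: fall back to a default byte order (an assumption here).
    return default, data


encoding, rest = resolve_utf16(b"\xff\xfeh\x00i\x00")
print(encoding, rest.decode(encoding))
```

Without the BOM check, only the fallback branch remains, which is where the
"implementation defined" behavior would come from.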
>
>
> On Tue, Oct 9, 2018 at 6:25 AM Steve Hanson <smh at uk.ibm.com> wrote:
> Mike, responses in-line below.
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        Steve Hanson <smh at uk.ibm.com>
> Cc:        DFDL-WG <dfdl-wg at ogf.org>
> Date:        03/10/2018 23:00
> Subject:        Part 1 - Re: [DFDL-WG] Action 307 - Demonstrate
> implementation interoperability
> ------------------------------
>
>
>
> I'm going to reply to this in a few parts.
>
> With respect to:
> - dfdl:binaryBooleanTrueRep with value empty string
> - dfdl:assert on global element and simple type
> - dfdl:discriminator on global element and simple type
> - Multiple xs:appinfo elements within each xs:annotation element
> I think these are minor non-compliances with the DFDL spec, and for
> interoperability testing we can just revise schemas under test to not use
> these constructs.
>
> SMH: Agree.
>
> With respect to:
> - When parsing, the distinction between an element being 'missing',
> having an 'empty representation' and having an 'absent representation', is
> not in accordance with the specification.
> I think time will tell here, that is, there's nothing we can anticipate
> having to do because of this as yet. If this non-compliance does not cause
> interoperability problems for realistic and published DFDL schemas then I
> wouldn't worry about it. Like IBM DFDL, Daffodil does not implement default
> values during parsing, and that's a likely area where this issue of
> missing/empty/absent has effect on behavior. It is quite possible that
> despite this lack of conformance to the DFDL spec., interoperability
> testing would be successful.
>
> SMH: IBM DFDL gives a runtime SDE when parsing if a zero-length
> representation is found for an occurrence AND the element has a default
> value. That prevents a behaviour change when support for default values when
> parsing is implemented. Suggest Daffodil does the same if it does not do so
> already.  With that in place, I think we are ok.
>
> With respect to:
> - When encoding is 'UTF-8' or 'UTF-16', byte order marks are not processed
> Daffodil also does not implement byte-order-mark processing. We can dodge
> this issue entirely if we make the UTF-16 charset (specifically UTF-16
> without the BE or LE suffix) encoding an optional DFDL feature. That
> effectively makes byte-order-mark processing also an optional feature, and
> then both IBM DFDL and Daffodil would be compliant and interoperable.
>
> SMH: UTF-8 can also have a BOM, so that does not solve the problem entirely.
> Needs some more thought.
>
> With respect to:
> - dfdl:encodingErrorPolicy "replace"
> This one is harder. Daffodil doesn't implement encodingErrorPolicy='error'
> so we have no common ground here for interoperability testing.
> Making the entire encodingErrorPolicy property optional - meaning behavior
> in the presence of encoding errors is implementation specified  - that's
> super undesirable to me.
> I suspect that implementing encodingErrorPolicy 'error' will be necessary
> for Daffodil. If we do that then IBM DFDL can continue to document
> 'replace' as a missing required feature of DFDL, or we can make 'replace'
> optional in the spec., or IBM could implement 'replace'.
>
> SMH: This is top of the list of missing features for IBM DFDL. I have
> asked in the past if this could be added as it's technically a regression
> when compared to IIB's older text/binary parser (MRM).  I will ask again.
>
> * Additional Non-portable/Problematic Required Features*
>
> I did an analysis of all DFDL properties to find those that must be
> implemented to meet the minimum, non-optional functionality for a
> DFDL implementation per Section 21 of the spec.
> Starting from a list of all DFDL properties, I eliminated any specific to
> unparsing, and then any that aren't relevant given something optional in
> Section 21.
>
> Here are the remaining properties I found. Where a property's full
> functionality is considered optional, the restriction on its values is
> noted:
>
>    - length - integer values only
>    - lengthKind - explicit, implicit only
>    - lengthUnits - bytes or characters only
>    - representation - binary only
>    - byteOrder
>    - alignment - number or 'implicit'
>    - alignmentUnits - bytes only
>    - fillByte
>    - leadingSkip
>    - trailingSkip
>    - encoding - 'UTF-8', 'UTF-16', 'UTF-16BE', 'UTF-16LE', 'ASCII', and
>    'ISO-8859-1'
>    - encodingErrorPolicy - (Already discussed above, so not further
>    discussed in this section)
>    - utf16Width - because UTF-16 is allowed for encoding, 'variable' is
>    problematic.
>    - textPadKind
>    - textTrimKind
>    - textStringJustification
>    - textStringPadCharacter
>    - binaryNumberRep - binary only
>    - binaryFloatRep - ieee only
>    - binaryBooleanTrueRep
>    - binaryBooleanFalseRep - IBM DFDL doesn't allow empty string for
>    this. (Minor.)
>    - binaryCalendarRep - binarySeconds, binaryMilliseconds only
>    - binaryCalendarEpoch
>    - occursCountKind - fixed only
>    - occursCount - integer only
>
> Looking at this list, there is only 1 additional issue to
> portability/interoperability this raises today given what I know about the
> Daffodil implementation and the IBM implementation.
>
> * Issue: utf16Width='variable'*
>
>
> This issue can be addressed with a minor change to the DFDL specification.
>
> When the type is xs:string and lengthUnits is 'characters', the length
> in characters should count each surrogate pair found in the UTF-16 data
> as occupying 1 character.
>
> This utf16Width='variable' feature of DFDL should be optional, as Java
> JVM-based implementations will find this extremely difficult to support,
> since JVM standard string representations cannot represent individual
> characters with code points greater than 0xFFFF occupying 1 location in a
> string.
>
> Daffodil does not implement this 'variable' behavior, and we have no good
> pathway to do so. Hence, I prefer to change the DFDL spec to make
> 'variable' optional. Only 'fixed' would be required. I could even support
> deprecating the whole property.
> SMH: This is already captured by action 290, which is waiting for me to do
> some tests with IBM DFDL which claims to have implemented this.
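
The difference between the two utf16Width settings comes down to what gets
counted. The Python sketch below is only an illustration: utf16_length is a
made-up helper, and Python is convenient here precisely because its strings
are sequences of code points, unlike JVM strings, which are sequences of
16-bit code units.

```python
def utf16_length(text: str, utf16_width: str = "fixed") -> int:
    """Length in 'characters' of text as it would be counted in UTF-16 data."""
    if utf16_width == "variable":
        # A surrogate pair counts as 1 character: count code points.
        return len(text)
    # "fixed": every 16-bit code unit counts as 1 character,
    # so a surrogate pair counts as 2.
    return len(text.encode("utf-16-be")) // 2


sample = "a\U0001F600b"  # U+1F600 needs a surrogate pair in UTF-16
print(utf16_length(sample, "fixed"), utf16_length(sample, "variable"))
```

For data that contains no code points above 0xFFFF the two counts agree, so
the settings only diverge when surrogate pairs are actually present.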
>
>
> * Issue: lengthUnits='characters' and variable-width charset encodings*
>
> I believe this is required behavior. I also believe the lack of support
> for this is missing from IBM's list of non-compliances. I recall discussion
> that IBM DFDL requires a fixed width encoding in this situation where
> lengthUnits is 'characters'.  (Please correct me if I am wrong.)
>
> I suggest that making this combination an optional feature of the DFDL
> spec would resolve the issue.
>
> This complex feature was added to support naive data format conversions
> where data originally had ascii encoding and lengthUnits 'bytes' is changed
> to 'utf-8' with lengthUnits 'characters'.  This is a rational way to
> modernize a data format adding internationalization capability. It however
> requires a significant change in runtime behavior because utf-8 characters
> occupy between 1 and 4 bytes per character.
> SMH: IBM DFDL certainly supports lengthUnits="characters" and
> encoding="UTF-8", which is an example of this.
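
To show why this combination changes runtime behavior, the Python sketch
below finds how many bytes a field of N characters occupies in UTF-8 data.
The helper names are hypothetical and the code assumes well-formed UTF-8;
it is an illustration, not a description of either implementation.

```python
from typing import Tuple


def utf8_char_width(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead)


def take_characters(data: bytes, n_chars: int) -> Tuple[str, int]:
    """Return (text, bytes consumed) for a field that is n_chars characters long."""
    pos = 0
    for _ in range(n_chars):
        if pos >= len(data):
            raise ValueError("not enough data for the requested length")
        pos += utf8_char_width(data[pos])
    if pos > len(data):
        raise ValueError("data ends inside a multi-byte character")
    return data[:pos].decode("utf-8"), pos


print(take_characters("h\u00e9llo\u20acx".encode("utf-8"), 3))
```

With ASCII and lengthUnits 'bytes' the field boundary is known up front;
here it is only known after scanning the data, one character at a time.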
>
>
> * Optional Features that are Partially Implemented*
>
> The bigger set of concerns for interoperability is the behavior of a DFDL
> processor for features that are optional by strict interpretation of
> Section 21 but that a specific DFDL implementation implements only
> partially. This is the subject of other email messages, however.
>
>
> On Tue, Sep 11, 2018 at 11:33 AM Steve Hanson <smh at uk.ibm.com> wrote:
> Action 307 was raised recently and first task is for implementations to
> identify which core spec behaviour is not implemented.
>
> * IBM DFDL *
>
> The following is the list of DFDL 1.0 spec core features that IBM DFDL
> does not yet implement.
>
> - dfdl:encodingErrorPolicy "replace"
> - dfdl:binaryBooleanTrueRep with value empty string
> - dfdl:assert on global element and simple type
> - dfdl:discriminator on global element and simple type
> - Multiple xs:appinfo elements within each xs:annotation element
> - When parsing, the distinction between an element being 'missing',
> having an 'empty representation' and having an 'absent representation', is
> not in accordance with the specification.
> - When encoding is 'UTF-8' or 'UTF-16', byte order marks are not processed
>
> The above list is derived from information at
> <https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/df00150_.htm>
> and covers the items that apply to core spec features.
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
> --
>   dfdl-wg mailing list
>   *dfdl-wg at ogf.org* <dfdl-wg at ogf.org>
>   *https://www.ogf.org/mailman/listinfo/dfdl-wg*
> <https://www.ogf.org/mailman/listinfo/dfdl-wg>
>
>