[DFDL-WG] Fw: pattern based lengths - suggested revised language
Steve Hanson
smh at uk.ibm.com
Tue Aug 16 09:25:09 CDT 2011
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 16/08/2011 15:28 -----
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Steve Hanson/UK/IBM at IBMGB
Date: 27/07/2011 15:00
Subject: Re: pattern based lengths - suggested revised language
I support what you call the conservative approach. I.e. require text when
patterns are used.
On Jul 27, 2011 5:53 AM, "Steve Hanson" <smh at uk.ibm.com> wrote:
> Hi Mike
>
> I don't think we can reduce the wording that much. The second paragraph
> is needed because it covers the binary case, where encoding is not
> actually used.
>
> I think we either need to be conservative and disallow the combination of
> binary & pattern, or leave the second paragraph as-is and effectively say
> that if you use binary with pattern then that is the behaviour.
>
> If we are to be conservative then:
>
> - For a simple element or simple type, disallow lengthKind="pattern" with
> binary rep.
>
> - For a complex element with lengthKind = "pattern", all children must
> have lengthUnits = "characters" (so text only) and the encoding of the
> children must be the same as the encoding of the parent. (We already have
> a similar rule for complex elements with specified length and lengthUnits
> = "characters").
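The two conservative rules above could be checked mechanically. The following is a minimal sketch only: the `Element` class and `check_conservative_pattern_rules` function are hypothetical names invented for illustration, not part of DFDL or of any implementation, and the sketch assumes "children" means direct children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical, much-simplified stand-in for a DFDL element declaration."""
    name: str
    length_kind: str = "delimited"
    representation: str = "text"       # "text" or "binary"
    length_units: str = "characters"   # "characters" or "bytes"
    encoding: str = "iso-8859-1"
    children: List["Element"] = field(default_factory=list)


def check_conservative_pattern_rules(elem: Element) -> List[str]:
    """Return the rule violations found in elem and everything nested in it."""
    errors: List[str] = []
    if elem.length_kind == "pattern":
        if not elem.children:
            # Simple element: pattern length with binary representation is disallowed.
            if elem.representation == "binary":
                errors.append(f"{elem.name}: lengthKind 'pattern' with binary representation")
        else:
            # Complex element: each direct child must be measured in characters
            # and must use the same encoding as the parent.
            for child in elem.children:
                if child.length_units != "characters":
                    errors.append(f"{child.name}: lengthUnits must be 'characters'")
                if child.encoding.lower() != elem.encoding.lower():
                    errors.append(f"{child.name}: encoding differs from parent {elem.name}")
    # Apply the same rules to nested elements that carry their own pattern.
    for child in elem.children:
        errors.extend(check_conservative_pattern_rules(child))
    return errors
```

An empty result means the element passes both rules; each string in a non-empty result names one violation.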
>
> We also allow asserts and discriminators to carry patterns which are
> applied straight at the current position in the data stream. It would be
> difficult to police the conservative rules here. But we need to say what
> encoding is used and we currently do not. I would say it must be the
> encoding of the element or group that carries the assert/discriminator.
>
> I said on the call that we had extended DFDL regular expressions so that
> raw hex bytes could be specified. However I don't see any evidence of this
> in the DFDL spec. This facility was something we added to IBM MRM for a
> retail format called TLOG which consists of delimited packed decimal data
> with hex indicator bytes, so we needed a way to match the hex indicator
> bytes as part of the regexp. However, I think this was only necessary
> because MRM has neither speculation nor discriminators, and in a DFDL
> version of TLOG I would use a discriminator. So I think my statement was
> in error, and I don't believe raw hex in DFDL regexps is needed.
>
> From: "Mike Beckerle" <mbeckerle.dfdl at gmail.com>
> To: Steve Hanson/UK/IBM at IBMGB
> Date: 26/07/2011 17:30
> Subject: pattern based lengths - suggested revised language
>
>
>
> I suggest this language to tighten up this whole section (replace both
> paragraphs). Given the concerns of Tim, that we make sure DFDL
> implementations don’t have to reimplement regexp matching, I think this is
> sufficient.
>
> 1.1.1.1 Pattern-Based Lengths - Scanability
> Any element (complex, simple text, simple binary) may have a
> dfdl:lengthKind 'pattern'. When an element contains binary data, and
> lengthKind='pattern' is used, then it is a schema definition error if the
> character set encoding is not iso-8859-1.
>
>
> (Possible generalization 1: allow other character sets, e.g., iso-8859-15
> as well. This is ok because 8859-15 still maps all 256 codepoints. But
> this is a slippery slope.)
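The "maps all 256 codepoints" property can be checked directly. This is a small sketch that relies on Python's built-in codecs as a stand-in for a DFDL implementation's character set tables; `decodes_every_byte` is a hypothetical helper name, not something from the spec.

```python
def decodes_every_byte(encoding: str) -> bool:
    """True if each of the 256 byte values decodes to exactly one character
    and that character encodes back to the same byte."""
    for value in range(256):
        raw = bytes([value])
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            return False
        if len(text) != 1 or text.encode(encoding) != raw:
            return False
    return True


for name in ("iso-8859-1", "iso-8859-15", "ascii"):
    print(name, decodes_every_byte(name))
```

With Python's codecs, iso-8859-1 and iso-8859-15 both pass and ascii does not. The two passing encodings still disagree on which character some bytes stand for (0xA4 is the currency sign in 8859-1 and the euro sign in 8859-15), so a pattern written with literal characters is not interchangeable between them.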
>
> (Possible generalization 2: allow any character set, ASCII, EBCDIC,
> utf-16be, etc. Note that using any character encoding other than one which
> maps a valid character to any 8-bit byte creates ambiguity: e.g., the
> regexp “.” is one where we normally think it means “any character”. But
> do we really mean “any byte”? If the character set encoding doesn’t have
> a given byte as a codepoint, then this question really matters.)
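The "any character" versus "any byte" question can be made concrete. The sketch below uses Python's `re` module purely as an illustration; DFDL regular expressions are not Python's, and the function names and sample bytes are made up for the example.

```python
import re

# 'A', a byte that is not a valid ASCII character, then 'B'.
SAMPLE = bytes([0x41, 0x80, 0x42])


def dot_over_bytes(data: bytes) -> int:
    """Match "..." against raw bytes: here "." plainly means "any byte"."""
    match = re.match(rb"...", data, re.DOTALL)
    return match.end() if match else -1


def dot_over_text(data: bytes, encoding: str) -> int:
    """Decode first, then match "..." against characters.
    Returns the number of characters matched, or -1 if the data cannot be
    decoded or fewer than three characters are available."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -1
    match = re.match(r"...", text, re.DOTALL)
    return match.end() if match else -1


def bytes_consumed_by_one_dot(data: bytes, encoding: str) -> int:
    """How many bytes a single "." accounts for after decoding."""
    match = re.match(r".", data.decode(encoding), re.DOTALL)
    return len(match.group(0).encode(encoding))


print(dot_over_bytes(SAMPLE))
print(dot_over_text(SAMPLE, "iso-8859-1"))
print(dot_over_text(SAMPLE, "ascii"))
print(bytes_consumed_by_one_dot(b"\x00A\x00B", "utf-16-be"))
```

Under iso-8859-1 the character match and the byte match agree, which is what makes that encoding safe for binary data. Under ASCII the byte 0x80 has no character at all, and under utf-16be one "." accounts for two bytes, so a pattern length computed in characters no longer equals a length in bytes.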
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU