[DFDL-WG] Fw: pattern based lengths - suggested revised language

Steve Hanson smh at uk.ibm.com
Tue Aug 16 09:25:09 CDT 2011


Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 16/08/2011 15:28 -----

From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Steve Hanson/UK/IBM at IBMGB
Date: 27/07/2011 15:00
Subject: Re: pattern based lengths - suggested revised language



I support what you call the conservative approach, i.e., require text when 
patterns are used.
On Jul 27, 2011 5:53 AM, "Steve Hanson" <smh at uk.ibm.com> wrote:
> Hi Mike
> 
> I don't think we can reduce the wording that much. The second paragraph 
> is needed because it covers the binary case, where encoding is not 
> actually used.
> 
> I think we either need to be conservative and disallow the combination 
> of binary and pattern, or leave the second paragraph as-is and effectively 
> say that if you use binary with pattern then that is the behaviour.
> 
> If we are to be conservative then: 
> 
> - For a simple element or simple type, disallow lengthKind="pattern" 
> with binary rep.
> 
> - For a complex element with lengthKind="pattern", all children must 
> have lengthUnits="characters" (so text only) and the encoding of the 
> children must be the same as the encoding of the parent. (We already 
> have a similar rule for complex elements with specified length and 
> lengthUnits="characters".)
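The two conservative rules above can be sketched as a small checker. This is only an illustration under stated assumptions: the `Element` class and `conservative_errors` function are hypothetical names invented here, not part of DFDL or of any implementation, and the model covers only the properties the rules mention.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical, much-simplified stand-in for a DFDL element declaration."""
    name: str
    length_kind: str = "delimited"
    representation: str = "text"       # "text" or "binary"
    length_units: str = "characters"   # "characters" or "bytes"
    encoding: str = "utf-8"
    children: List["Element"] = field(default_factory=list)


def conservative_errors(elem: Element) -> List[str]:
    """Collect violations of the two conservative rules, recursing into children."""
    errors = []
    if elem.length_kind == "pattern":
        if not elem.children:
            # Rule 1: simple element with a pattern length must not be binary.
            if elem.representation == "binary":
                errors.append(f"{elem.name}: lengthKind 'pattern' with binary representation")
        else:
            # Rule 2: children of a complex element with a pattern length must be
            # measured in characters and share the parent's encoding.
            for child in elem.children:
                if child.length_units != "characters":
                    errors.append(f"{child.name}: lengthUnits must be 'characters'")
                if child.encoding.lower() != elem.encoding.lower():
                    errors.append(f"{child.name}: encoding differs from parent {elem.name}")
    for child in elem.children:
        errors.extend(conservative_errors(child))
    return errors
```

For example, a binary simple element with `length_kind="pattern"` yields one error, while a complex element whose children all use characters and the parent's encoding yields none.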
> 
> We also allow asserts and discriminators to carry patterns, which are 
> applied directly at the current position in the data stream. It would be 
> difficult to police the conservative rules here, but we need to say what 
> encoding is used, and we currently do not. I would say it must be the 
> encoding of the element or group that carries the assert/discriminator.
> 
> I said on the call that we had extended DFDL regular expressions so that 
> raw hex bytes could be specified. However, I don't see any evidence of 
> this in the DFDL spec. This facility was something we added to IBM MRM for 
> a retail format called TLOG, which consists of delimited packed decimal 
> data with hex indicator bytes, so we needed a way to match the hex 
> indicator bytes as part of the regexp. However, I think this was only 
> necessary because MRM has neither speculation nor discriminators, and in a 
> DFDL version of TLOG I would use a discriminator. So I think my statement 
> was in error, and I don't believe raw hex in DFDL regexps is needed.
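To make concrete what "matching hex indicator bytes as part of the regexp" can look like, here is a minimal Python sketch. It is not MRM or DFDL syntax, and the record layout and the 0xFD indicator value are invented for illustration; they are not taken from the TLOG format. It relies on the fact that iso-8859-1 maps each byte to the code point with the same value, so a `\xNN` escape in the pattern matches exactly that byte.

```python
import re

# Hypothetical record: two packed-decimal fields separated by a 0xFD indicator
# byte. The byte values are assumptions made up for this example.
record = bytes([0x12, 0x3C, 0xFD, 0x45, 0x6C])

# Decode byte-for-byte: under iso-8859-1 every byte becomes one character
# whose code point equals the byte value.
text = record.decode("iso-8859-1")

# Match everything up to (but not including) the indicator byte.
match = re.match(r"[^\xfd]*(?=\xfd)", text)

# Because the mapping is one character per byte, the match length in
# characters is also the field length in bytes.
field_length = len(match.group(0))
first_field = text[:field_length].encode("iso-8859-1")
```

With a discriminator available, as suggested above, a format description would not need this kind of byte-level pattern.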
> 
> From: "Mike Beckerle" <mbeckerle.dfdl at gmail.com>
> To: Steve Hanson/UK/IBM at IBMGB
> Date: 26/07/2011 17:30
> Subject: pattern based lengths - suggested revised language
> 
> 
> 
> I suggest this language to tighten up this whole section (replacing both 
> paragraphs). Given Tim's concern that DFDL implementations should not 
> have to reimplement regexp matching, I think this is sufficient.
> 
> 1.1.1.1 Pattern-Based Lengths - Scanability
> Any element (complex, simple text, simple binary) may have 
> dfdl:lengthKind 'pattern'. When an element contains binary data and 
> lengthKind 'pattern' is used, it is a schema definition error if the 
> character set encoding is not iso-8859-1.
> 
> 
> (Possible generalization 1: allow other character sets as well, e.g., 
> iso-8859-15. This is OK because iso-8859-15 still maps all 256 codepoints. 
> But this is a slippery slope.)
> 
> (Possible generalization 2: allow any character set: ASCII, EBCDIC, 
> utf-16be, etc. Note that using any character encoding other than one 
> which maps a valid character to every 8-bit byte creates ambiguity. E.g., 
> we normally think of the regexp “.” as meaning “any character”, but 
> do we really mean “any byte”? If the character set encoding doesn’t 
> have a given byte as a codepoint, then this question really matters.)
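The "any character" versus "any byte" question can be checked with a short Python sketch. It uses Python's codecs and `re` module as stand-ins, which is an assumption; a DFDL processor's decoding and regexp engine may behave differently. The helper names `undecodable_bytes` and `dot_matches_every_byte` are invented for this example.

```python
import re


def undecodable_bytes(encoding):
    """Return the byte values (0-255) that have no character in `encoding`."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def dot_matches_every_byte(encoding):
    """True if the regexp '.' (with DOTALL) can consume each possible byte."""
    for value in range(256):
        # Bytes with no character decode to the empty string here, so '.'
        # has nothing to match: that is the ambiguity in question.
        text = bytes([value]).decode(encoding, errors="ignore")
        if re.fullmatch(".", text, flags=re.DOTALL) is None:
            return False
    return True


for name in ("iso-8859-1", "iso-8859-15", "ascii"):
    print(name, len(undecodable_bytes(name)), dot_matches_every_byte(name))
```

Under these assumptions, iso-8859-1 and iso-8859-15 leave no byte without a character, whereas ASCII leaves the upper 128 byte values unmapped, so "." cannot stand for "any byte" there.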
> 
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number 
> 741598. 
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU