[DFDL-WG] Issues to add to work items list

Wed Jun 15 03:54:26 CDT 2011

Thanks Mike. I have added a couple of comments below...we can discuss 
fully on the calls.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: Steve Hanson/UK/IBM at IBMGB
Date: 15/06/2011 01:59
Subject: Issues to add to work items list

Steve,
Below is the list I have so far of items in the spec that go beyond just 
typos - where re-wording is required or advisable. All but the last of 
these are tied up with the length & delimiters issue, but separate from the 
"known to exist" vs. "missing" topic.

I've not bothered to provide anything about the section on "known to 
exist" and such. That section is already the discussion of a work 
item/topic. 

The last one is just a nit about BOMs.

...mikeb

--------------------------------------------------------------------------------------------

5.2.2      MinLength, MaxLength
These facets are used:
When dfdl:lengthKind=”implicit”. In that case the length is given by the 
value of xs:maxLength. In this case minLength if specified is required to 
be equal to maxLength (schema definition error otherwise).
For validation of variable length string elements.
It is a processing error when a fixed-length string is found to have a 
number of characters not equal to the fixed number. For example, if a 
fixed-length string also has delimiters we might be able to successfully 
separate it from the surrounding elements depending on the delimiter 
specifications; however, if the length of the fixed-length string is not 
equal to the number specified as the fixed length then it is a processing 
error (not simply a validation error).[MB1] 

 [MB1]Contradicts the statement that scanning for delimiters is off (discussed 
where dfdl:lengthKind=’explicit’ is described).
What is a fixed length string?
Clearly if it has lengthKind=”explicit” it is fixed length.
What if it has lengthKind=”implicit” and maxLength=”10”? Is that a fixed 
length string which shuts off delimiter scanning also? If so then this 
paragraph is erroneous and misleading.
<SMH>Yes, Tim and I have noted that this paragraph needs revising 
depending on the outcome of action 139 <SMH> 
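
A minimal Python sketch of one reading of the quoted paragraph, in which a 
fixed-length string also has a terminator: the parser scans for the 
terminator and then checks the length. The function name, the exception 
class and the single-terminator setup are all assumptions made for 
illustration; none of them come from the spec, and the sketch takes no 
position on how action 139 should resolve the contradiction noted above.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def parse_fixed_with_terminator(data: str, fixed_length: int, terminator: str) -> str:
    # Scan for the terminator first. This is exactly the step that the
    # lengthKind='explicit' description says is switched off.
    end = data.find(terminator)
    if end == -1:
        raise ProcessingError("terminator not found")
    value = data[:end]
    # Per the quoted paragraph, a length mismatch is a processing error,
    # not merely a validation error.
    if len(value) != fixed_length:
        raise ProcessingError(
            f"expected {fixed_length} characters, found {len(value)}"
        )
    return value
```

With fixed_length=5 and terminator ";", the input "abcde;rest" yields 
"abcde", while "abc;rest" raises the hypothetical ProcessingError.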

--------------------------------------------------------------------------------------
9.2 DFDL Syntax Grammar

Change to introduce concept of EnclosedItem or ChildItem (I used 
EnclosedItem below):

Sequence = LeftFraming SequenceContent RightFraming
SequenceContent = [ PrefixSeparator EnclosedItem [ Separator EnclosedItem 
]* PostfixSeparator ] FinalUnusedRegion
EnclosedItem = Element | Array | ComplexContent[MB1] 

Choice = LeftFraming ChoiceContent RightFraming
ChoiceContent = [ EnclosedItem ] FinalUnusedRegion[MB2] 

 [MB1]Refactored to share the EnclosedItem concept with Choices also.
Should perhaps be named ChildItem.
This is useful when discussing how parsing, defaulting, etc. work as well. 

<SMH>Seems fine, it is effectively just a renaming<SMH> 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Table 14 Implicit Alignment in bits
Note: Specifying the implicit alignment in bits does not imply that 
dfdl:lengthUnits 'bits' can be specified for all simple types.[MB1] 

 [MB1]I do not understand this comment.  What exactly is the restriction?
<SMH>It is really saying that alignmentUnits and lengthUnits are 
independent and have their own rules for when they are applicable. <SMH> 

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

lengthKind
Enum
Controls how the representation length of the component is determined.
Valid values are: 'explicit', 'delimited', 'prefixed', 'implicit', 
'pattern', 'endOfParent'
A full description of each enumeration is given in the later sections.
'explicit' means the length of the item is given by the dfdl:length 
property
'delimited' means the item is delimited by a terminator or separator[MB1] 
‘prefixed’ means the length of the item is given by an immediately 
preceding prefix field specified using prefixLengthType. 
‘implicit’ means the length is to be determined in terms of the type of the 
element and its schema-specified properties if any.
‘pattern’ means the length of the item is given by a regular expression 
specified using the dfdl:lengthPattern property. 
‘endOfParent’ means that the item is terminated by the termination of the 
containing construct. 
Annotation: dfdl:element, dfdl:simpleType 

 [MB1]To me this is a very strong statement.
It means that an outside-in parse is allowed where for a sequence, we can 
scan and determine its end, and then parse the children.
It requires that “scan” is a well-defined concept for the contents of the 
sequence. 
It means there can be nothing inside which requires the suspension of 
scanning. 
It means contained elements that have length explicit and representation 
binary, are simply not allowed. 
<SMH>Not entirely true, we allow some binary types to have delimited 
lengthKind (eg, BCD and Packed Decimals). There are formats out there that 
require this.  <SMH> 
To me it also means you cannot change the character set encoding, or have 
a contained element that itself uses an overlapping set of delimiters with 
the enclosing  group’s delimiters.  – Is that going too far? – 
Delimited is like pattern. It restricts what is in the data stream 
substantially. 
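
To make the "outside-in" concern concrete, here is a minimal Python sketch, 
assuming a sequence with one separator and one terminator and treating the 
data as already-decoded text. The function name and the sample data are 
hypothetical. It finds the end of the sequence by scanning alone and only 
then splits out the children, so any child whose content happens to contain 
the terminator value cuts the sequence short.

```python
from typing import List


def outside_in_parse(data: str, separator: str, terminator: str) -> List[str]:
    # Step 1: scan for the sequence terminator without looking at the
    # children at all.
    end = data.find(terminator)
    if end == -1:
        raise ValueError("sequence terminator not found")
    body = data[:end]
    # Step 2: only now split the bounded body into child items.
    return body.split(separator)


# Text-only children: the scan finds the intended end of the sequence.
clean = outside_in_parse("a,b,c;tail", ",", ";")

# A child whose raw content contains the terminator value (0x3B is ';'):
# the scan stops inside the second child.
truncated = outside_in_parse("a,\x3b\x01,c;tail", ",", ";")
```

Here clean is ['a', 'b', 'c'], whereas truncated is ['a', ''] because the 
scan stopped at the 0x3B inside the second child.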
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The rules for resolving ambiguity between delimiters are:

1.      When two delimiters have a common prefix, the longest delimiter 
has precedence.
2.      When two delimiters have exactly the same value, the innermost 
(most deeply nested) delimiter has precedence.
3.      When the separator and terminator on a group have the same value, 
the separator has precedence.[MB1] 

 [MB1]By precedence, this must mean it is tried first, but the parser 
might backtrack and assume it to be a terminator instead. 
This seems problematic to me. I’d like to either rule this out and say 
they can’t be the same, or see a use case where this is needed, a 
backtracking parser can parse it, and there’s no more reasonable way to 
structure the schema.
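
The three precedence rules can be sketched in Python as a single ordering 
over the delimiters that match at a given position. The Delimiter record, 
its depth field and the resolve function are hypothetical names, not from 
the spec. The sketch only picks the first choice; it does not model the 
backtracking questioned above.

```python
from typing import List, NamedTuple, Optional


class Delimiter(NamedTuple):
    text: str   # the delimiter's literal value
    depth: int  # nesting depth of the owning group; larger = more deeply nested
    kind: str   # "separator" or "terminator"


def resolve(data: str, pos: int, in_scope: List[Delimiter]) -> Optional[Delimiter]:
    # Keep only the delimiters that actually match at this position.
    matches = [d for d in in_scope if data.startswith(d.text, pos)]
    if not matches:
        return None
    # Rule 1: longest match wins. Rule 2: on equal values, innermost wins.
    # Rule 3: on the same group, separator wins over terminator.
    return max(
        matches,
        key=lambda d: (len(d.text), d.depth, d.kind == "separator"),
    )
```

For example, with both ";" and ";;" in scope, resolve("a;;b", 1, ...) picks 
";;"; with a separator and terminator of the same value on the same group, 
it picks the separator.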
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

12.3.5.1      Pattern-Based Lengths  - Scanability
Any element (complex, simple text, simple binary) may have a 
dfdl:lengthKind 'pattern' as long as the bytes in the content region of 
the element are legal in the stated encoding of that element. Where a 
complex element has children with binary representation in practice this 
means an 8-bit ASCII encoding. [MB1] 

 [MB1]Not necessarily ASCII. An 8-bit encoding such that any byte value is 
valid is the real requirement. The point is no single byte value is 
invalid, and no combinations of adjacent byte values are invalid so that 
any binary data won’t trip up a character conversion and subsequent scan. 
Hmmm. The 8-bit character set used must have a transformation into Unicode 
which is bijective and information preserving. I.e., a unique Unicode 
character for each code point, and no “invalid” characters which have no 
corresponding Unicode value.  However, ASCII-based sets like 8859-1 are 
not, strictly speaking, required.
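
The requirement described above can be checked mechanically. A minimal 
Python sketch, assuming Python's built-in codecs are an acceptable stand-in 
for the encodings a DFDL processor would use (the function name is 
hypothetical): decode all 256 byte values, then confirm that the result has 
256 distinct characters and encodes back to the same bytes.

```python
def is_total_and_lossless(encoding: str) -> bool:
    all_bytes = bytes(range(256))
    try:
        text = all_bytes.decode(encoding)
        round_trip = text.encode(encoding)
    except UnicodeError:
        # Some byte value (or byte combination) is invalid in this encoding.
        return False
    # One distinct character per byte plus an exact round trip means the
    # mapping is one-to-one and information preserving.
    return len(text) == 256 and len(set(text)) == 256 and round_trip == all_bytes


latin1_ok = is_total_and_lossless("iso-8859-1")
ascii_ok = is_total_and_lossless("ascii")
utf8_ok = is_total_and_lossless("utf-8")
```

ISO-8859-1 passes; 7-bit ASCII and UTF-8 do not, since bytes above 0x7F are 
either invalid or only valid in particular combinations. The same check can 
be run against a non-ASCII-based single-byte code page to confirm that an 
ASCII base is not what matters.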
-----------------------------------------------------------------------------------------------------------------------------------------------

12.3.7.1.3    Byte Order Mark
If a byte-order mark codepoint appears at the start of a UTF-8, [MB1] 
UTF-16 or UTF-32 encoded string then the byte-order mark will be included 
as part of the string payload[1]. That is, for the UTF-8, UTF-16 and 
UTF-32 character encodings, a byte-order-mark codepoint is treated as a 
character of the string in DFDL and contributes to the length.
A way of eliminating the byte-order mark so that it does not end up in the 
infoset is that the byte-order mark can be modeled as a separate element 
before the string. This BOM element can be either required or optional 
depending on whether one is expected or optional at the beginning of the 
string. 

[1] Byte-order marks are explicitly stated to be “not characters” in the 
Unicode standard.

 [MB1]No such thing as a BOM codepoint in a UTF-8 string.  A UTF-8 byte 
sequence might encode the character code for a BOM, but this would be a 
meaningless inclusion of a BOM character code in a context where it will 
never be interpreted.
I suggest that we drop the term UTF-8 here, and that BOMs which are 
interpreted as character codes, and translated by the UTF-8 encoding 
algorithm into a multi-byte UTF-8 byte sequence, be handled the same way 
as other non-characters, i.e., the same as whatever we do when a high or 
low surrogate codepoint is present and we’re to encode as UTF-8. I think 
the answer is we run the UTF-8 encode/decode algorithm, and whatever 
Unicode character code it creates is what it creates, and if that happens 
to come out as any of the non-characters (BOMs, surrogates, others 
perhaps), so be it. 
The topic is about Unicode non-characters, not specifically BOMs.
The general topic is encoding/decoding our infoset Unicode character codes 
which have no real representation in the specified encoding. 
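
For illustration, here is how one existing codec library behaves on these 
cases. This is a small sketch of Python's built-in codecs only, not of 
DFDL behaviour; the variable names are arbitrary.

```python
BOM = "\ufeff"

# Encoding U+FEFF as UTF-8 simply produces its three-byte sequence.
payload = (BOM + "abc").encode("utf-8")

# The plain codec keeps U+FEFF as a code point of the decoded string, so it
# counts toward the length, as the quoted spec text describes.
kept = payload.decode("utf-8")

# Python's separate "utf-8-sig" codec strips a leading BOM instead.
stripped = payload.decode("utf-8-sig")

# A lone surrogate is rejected by the strict codec...
try:
    "\ud800".encode("utf-8")
    surrogate_encodes = True
except UnicodeEncodeError:
    surrogate_encodes = False

# ...but the "just run the algorithm" behaviour is available through the
# surrogatepass error handler.
raw = "\ud800".encode("utf-8", "surrogatepass")
```

Here payload is b'\xef\xbb\xbfabc', kept has length 4, stripped is "abc", 
and raw is b'\xed\xa0\x80'. So this library treats the two cases 
differently by default, which is the kind of choice the spec would need to 
make explicit.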

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
