[DFDL-WG] How to determine the length of an element which has text representation

Tue Nov 24 17:05:03 CST 2009

I disagree with the direction this conversation is taking. I don't think 
we need another property - the dfdl:lengthKind property provides all of 
the control which users require.
Let's examine how the parser might behave for *all* of the lengthKind 
enumerations:

explicit
The parser extracts a fixed number of characters/bytes from the input 
document as directed by dfdl:length (which may be a DFDL expression, and 
may resolve to the value of the previous sibling).
prefixed
The parser extracts a fixed number of characters/bytes from the input 
document as directed by the prefix length. Note the similarity with the 
DFDL expression scenario above.
implicit
The parser extracts a fixed number of characters/bytes from the input 
document as directed by the implicit length of the element.
delimited
The parser extracts from the input document all characters between the 
current buffer position and the next unescaped item of in-scope 
terminating markup.
pattern
The parser extracts from the input document all characters which match the 
specified pattern.
endOfParent
The parser extracts from the input document all remaining characters/bytes 
allowed by the representation properties of its parent groups/elements.
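
The six behaviours listed above can be sketched as a single dispatch on lengthKind. This is a minimal illustration, not a DFDL implementation: the function name extract, its parameters, the two-digit decimal prefix, and the single-character escape are all assumptions made for the example, and dfdl:length is assumed to have been evaluated to an integer already.

```python
import re

def extract(data, pos, length_kind, *, length=None, prefix_digits=None,
            terminators=(), escape=None, pattern=None, parent_end=None):
    """Return (text, new_pos) for one element that starts at data[pos]."""
    if length_kind in ("explicit", "implicit"):
        # Fixed count: dfdl:length (already evaluated) or the implicit length.
        end = pos + length
        return data[pos:end], end
    if length_kind == "prefixed":
        # Assumed prefix form: a decimal count occupying prefix_digits characters.
        start = pos + prefix_digits
        count = int(data[pos:start])
        return data[start:start + count], start + count
    if length_kind == "delimited":
        # Scan forward to the next unescaped terminator that is in scope.
        i = pos
        while i < len(data):
            if escape is not None and data[i] == escape:
                i += 2  # the escaped character is content, not markup
                continue
            if any(data.startswith(t, i) for t in terminators):
                break
            i += 1
        i = min(i, len(data))
        return data[pos:i], i
    if length_kind == "pattern":
        match = re.compile(pattern).match(data, pos)
        if match is None:
            raise ValueError("pattern does not match at the current position")
        return match.group(0), match.end()
    if length_kind == "endOfParent":
        # Everything the parent still allows; defaults to the end of the data.
        end = len(data) if parent_end is None else parent_end
        return data[pos:end], end
    raise ValueError(f"unknown lengthKind: {length_kind}")
```

In this sketch only the "delimited" branch ever looks at terminators or escapes. Calling extract("a,b|c", 0, "explicit", length=3) returns the first three characters, comma included, whatever markup is in scope.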

I think there is a consistency issue here. We either make the in-scope 
markup apply to *all* lengthKinds (including prefixed lengths and cases 
where dfdl:length is an expression, which can amount to the same thing), 
or we limit it to lengthKind="delimited". Any in-between position needs a 
very good justification.

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742

From:
Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:
Alan Powell/UK/IBM at IBMGB
Cc:
Stephanie Fetzer <sfetzer at us.ibm.com>, "dfdl-wg at ogf.org" 
<dfdl-wg at ogf.org>, Tim Kimber/UK/IBM at IBMGB
Date:
24/11/2009 20:52
Subject:
Re: [DFDL-WG] How to determine the length of an element which has text 
representation

I like to use enums instead of booleans, so I suggest this property be 
dfdl:textScanningMode, an enum with initial values "scanned" and 
"notScanned". As an enum, it leaves us the ability to add some intelligent 
mixed mode in the future (like "scanExceptFixedLength", if that proves 
useful).

One thought: we might try to think up terminology that is more 
declarative, less parse-centric. These properties about "scanning" would 
affect the output direction also, instructing the unparser not to bother 
inserting escape characters if the logical element contains, say, the 
parent delimiter.

I currently proceed under the assumption that not scanning turns off the 
whole lexical analyzer, so escape sequences detected would also be 
considered to be raw string content. You would still convert code points 
to logical characters but characters would not be interpreted as 
delimiters, escapes, quotation marks....
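
A minimal sketch of that assumption in both directions. The function names parse_text and unparse_text, the single-character delimiters, and the backslash escape are hypothetical choices for illustration. With "notScanned" the text passes through untouched; with "scanned" the parser removes escapes and the unparser inserts them.

```python
def parse_text(raw, mode, escape="\\"):
    """Turn already-extracted raw text into the logical string value."""
    if mode == "notScanned":
        # Lexical analysis is off: escape characters are ordinary content.
        return raw
    if mode != "scanned":
        raise ValueError(f"unknown textScanningMode: {mode}")
    out = []
    chars = iter(raw)
    for ch in chars:
        if ch == escape:
            # Drop the escape and keep the character it protects.
            ch = next(chars, "")
        out.append(ch)
    return "".join(out)

def unparse_text(value, mode, delimiters=(",",), escape="\\"):
    """Turn a logical string value into the text written to the output."""
    if mode == "notScanned":
        # No escaping is inserted, even if the value contains a delimiter.
        return value
    if mode != "scanned":
        raise ValueError(f"unknown textScanningMode: {mode}")
    out = []
    for ch in value:
        if ch == escape or ch in delimiters:
            out.append(escape)
        out.append(ch)
    return "".join(out)
```

Under these assumptions the value a,b is written as a\,b in scanned mode and as a,b in notScanned mode, and each mode round-trips its own output.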

There's lots of potential for schema definition errors here of course. 
E.g., lengthKind='delimited', but textScanningMode="notScanned" clearly 
does not work. 
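
The enum and the conflicting combination could be checked along these lines. This is only a sketch: TextScanningMode and schema_definition_errors are hypothetical names, and the one rule encoded is the one mentioned above.

```python
from enum import Enum

class TextScanningMode(Enum):
    SCANNED = "scanned"
    NOT_SCANNED = "notScanned"
    # A future mixed mode such as "scanExceptFixedLength" would be added here.

def schema_definition_errors(length_kind, mode):
    """Return a list of error messages for one element's properties."""
    mode = TextScanningMode(mode)  # raises ValueError for an unknown value
    errors = []
    if length_kind == "delimited" and mode is TextScanningMode.NOT_SCANNED:
        errors.append(
            "lengthKind='delimited' requires scanning, "
            "but textScanningMode is 'notScanned'"
        )
    return errors
```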

...mike

On Tue, Nov 24, 2009 at 11:48 AM, Alan Powell <alan_powell at uk.ibm.com> 
wrote:

Stephanie 

4. Have a separate property to 'turn off scanning' for 
dfdl:representation='text' 
5. Introduce a new lengthKind, 'fixedLengthDelimited' 

Alan Powell

MP 211, IBM UK Labs, Hursley,  Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM     email: alan_powell at uk.ibm.com  
Tel: +44 (0)1962 815073                  Fax: +44 (0)1962 816898

From: 
Stephanie Fetzer <sfetzer at us.ibm.com> 
To: 
DFDL <mbeckerle.dfdl at gmail.com> 
Cc: 
"dfdl-wg at ogf.org" <dfdl-wg at ogf.org>, Tim Kimber/UK/IBM at IBMGB, 
dfdl-wg-bounces at ogf.org 
Date: 
19/11/2009 15:40
Subject: 
Re: [DFDL-WG] How to determine the length of an element which has text 
representation

Yes - agreed. It makes sense that, when parsing with delimiters in scope, 
we 'turn off scanning' if we hit a non-delimited length.  If everyone is 
agreed on that, then... 

The decision to be made here is how we will handle elements with length 
requirements while parsing when delimiters are in scope: 

1. We can allow and use dfdl:length for components with 
lengthKind="delimited"...in a check that will occur after the element is 
initially parsed (via delimiter) 
2. We can disallow the use of dfdl:length for components with 
lengthKind="delimited"...and require that any length constraints be placed 
on such components via an assert.  An error or a warning will be generated 
if dfdl:length is defined explicitly on a component with 
lengthKind="delimited" 
3. We can ignore the use of dfdl:length for components with 
lengthKind="delimited"...and require that any length constraints be placed 
on such components via an assert. 
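
The three options can be compared with a small sketch. The function parse_delimited and its policy argument are hypothetical, and a single terminator string stands in for the in-scope markup. Option 2 is modelled as an error here, although the text above also allows a warning, and a real processor would presumably report it when the schema is compiled rather than when data is parsed.

```python
def parse_delimited(data, terminator, length=None, policy="check"):
    """Extract a delimited element and apply one policy to dfdl:length."""
    # The extent of the element is always found by scanning for the terminator.
    text = data.split(terminator, 1)[0]
    if length is None:
        return text
    if policy == "check":
        # Option 1: dfdl:length is verified after the scan.
        if len(text) != length:
            raise ValueError(f"expected length {length}, found {len(text)}")
    elif policy == "disallow":
        # Option 2: dfdl:length on a delimited element is rejected.
        raise ValueError("dfdl:length is not allowed with lengthKind='delimited'")
    elif policy == "ignore":
        # Option 3: dfdl:length is silently ignored; use an assert instead.
        pass
    else:
        raise ValueError(f"unknown policy: {policy}")
    return text
```

For the input "abcd,ef" with length=3, "check" fails the parse, "disallow" rejects the schema, and "ignore" returns "abcd".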

Any other options? Which way are we leaning on this? 

Cheers, 
-Steph 

WebSphere Transformation Extender
Industry Packs - Software Engineer

From: 
DFDL <mbeckerle.dfdl at gmail.com> 
To: 
Tim Kimber <KIMBERT at uk.ibm.com> 
Cc: 
"dfdl-wg at ogf.org" <dfdl-wg at ogf.org> 
Date: 
11/18/2009 08:54 PM 
Subject: 
Re: [DFDL-WG] How to determine the length of an element which has text 
representation 
Sent by: 
dfdl-wg-bounces at ogf.org

I support Tim's view here. There needs to be an idiomatic way to shut off 
scanning. Rep='binary' is much too obscure. 

Question: which other length kinds should switch off scanning? Prefixed? 
Implicit? None of these? 

...mikeb 

On Nov 18, 2009, at 12:05 PM, Tim Kimber <KIMBERT at uk.ibm.com> wrote:

I'd like to record what was discussed and raise another point which Alan 
pointed out after the meeting. 

Discussions in the meeting 
- dfdl:lengthKind applies only to the element on which it is specified. It 
has no effect whatever on the parsing of child elements/groups. 
- there may be some value in tolerating simple elements of type xs:string 
with dfdl:representation="binary". Might be useful for schemas where 
dfdl:representation="binary" throughout. 
- Currently, the position of the WG is that parsers should *always* scan 
to extract the text representation if there is any terminating markup in 
scope. Even if lengthKind='explicit'. 
- TK proposed the scheme outlined in his previous email, in which 
dfdl:lengthKind alone specifies how the parser should extract the text 
representation. 
If lengthKind="explicit", scanning is switched off and dfdl:length is 
used. If lengthKind="delimited" the text rep is extracted by scanning and 
length is ignored. 
- A refinement was discussed whereby dfdl:length would be checked after a 
scan has been performed if dfdl:lengthKind="delimited". This would make 
the modeling of some common formats simpler, and avoid the need for a 
dfdl:assert to enforce the length constraint. 
- MB raised the possibility that we could actually disallow dfdl:length if 
lengthKind='delimited'. This is the most conservative position, but 
general opinion was that it would be too restrictive. There still might be 
some value in disallowing dfdl:length for other lengthKinds. 

Discussions after the meeting 
- Alan pointed out that lengthKind="explicit" does not necessarily mean 
that the length of the field is fixed. dfdl:length might be specified as a 
DFDL expression. A common reason for doing that would be to obtain the 
element's length from an earlier integer field. As currently specified, if 
there was any markup in scope, the text rep would be extracted by 
scanning. 
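
Alan's case can be made concrete with a sketch. Everything here is an assumption for illustration: the function parse_length_then_text, a two-digit decimal length field immediately followed by the text field, and the reading that a scanning parser stops at the first terminator it meets.

```python
def parse_length_then_text(data, terminator=",", scan=True, length_digits=2):
    """Parse an integer length field followed by a text field of that length."""
    declared = int(data[:length_digits])
    start = length_digits
    if scan:
        # Current WG position: markup in scope is honoured even though
        # dfdl:length resolves to the value of the earlier field.
        end = data.find(terminator, start)
        if end == -1:
            end = len(data)
    else:
        # Proposed behaviour: lengthKind="explicit" switches scanning off,
        # so exactly 'declared' characters are taken.
        end = start + declared
    return declared, data[start:end]
```

With the input "05a,b|c", scanning yields the text "a" although the length field says 5, while the length-driven parse yields "a,b|c".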

Restatement of my position after today's meeting: 
I'm now even more convinced that dfdl:lengthKind="explicit" should switch 
off scanning. Here's why: 
a) The enumerations of lengthKind are explicit, implicit, prefixed, 
delimited, pattern, endOfParent. The presence of 'delimited' in that list 
means that in some users' minds, the other enumerations are going to be 
interpreted as *alternatives* to 'delimited'. 
b) As currently specified, if there's markup in scope, scanning cannot be 
switched off by any means. Not even by setting lengthKind='explicit' AND 
obtaining dfdl:length from a previous integer field. I think that's very 
counter-intuitive. 

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742  
Internal tel. 246742

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU 

--
dfdl-wg mailing list
dfdl-wg at ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg
