[DFDL-WG] Clarification needed: pad/trim and delimited length

Steve Hanson smh at uk.ibm.com
Mon Nov 12 12:36:59 EST 2012


Decided on last DFDL WG call to leave the behaviour as currently 
specified, as it is possible to code the current behaviour.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848



From:   Steve Hanson/UK/IBM
To:     Mike Beckerle <mbeckerle.dfdl at gmail.com>, 
Cc:     dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
Date:   26/10/2012 13:39
Subject:        Re: [DFDL-WG] Clarification needed: pad/trim and delimited 
length


Mike

For escape blocks, the escape start/end character must be the first/last 
character in the text. The order of processing stated in the spec was an 
attempt to handle the situation where the escape start/end character is 
not the first/last character in the text because of padding, as in 
examples like:

Variable length: "aaa,aaaa" ,"bbbbbbb"   ,"ccccccc"
Fixed length: "aaa,aaaa" "bbbbbbb"   "ccccccc"

From an email discussion several years ago, in answer to Alan's 
question "Should we only look for escapeStartString at the beginning of 
the data?", Mike, you said: "I'd prefer that we respect them anywhere, but 
canonical form when generated is at the beginning of the data. However, if 
we want to be more restrictive/conservative for v1.0 I'm fine with that."

I tested IBM DFDL's implementation. The delimited example above (left 
justified) worked ok - the parser recognised the start quote and switched 
on escaping, correctly escaped the comma, then found the end quote and 
switched off escaping, then found the delimiter. With trimKind 'none' it 
issued an error to the effect that there was text between quote and next 
delimiter. With trimKind 'padChar' it worked ok and trimmed off the pad 
before going on to remove the quotes. However when the scenario was 
right-justified, it got it wrong, which I think is your point.
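The scan described above can be sketched roughly as follows. This is only an illustration in Python, not IBM DFDL's implementation: the function name split_fields is hypothetical, the escape block start/end is assumed to be a double quote, and the quote is honoured anywhere in the field rather than only at the start.

```python
def split_fields(record: str, sep: str = ",", quote: str = '"') -> list[str]:
    """Split a record on sep, ignoring any sep found inside an escape block.

    Assumption: the same character starts and ends the escape block, and it
    toggles escaping wherever it appears in the field.
    """
    fields: list[str] = []
    current: list[str] = []
    in_block = False
    for ch in record:
        if ch == quote:
            in_block = not in_block      # switch escaping on or off
            current.append(ch)
        elif ch == sep and not in_block:
            fields.append("".join(current))  # unescaped delimiter ends the field
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# Left-justified variable-length example from above; padding is still attached.
print(split_fields('"aaa,aaaa" ,"bbbbbbb"   ,"ccccccc"'))
# ['"aaa,aaaa" ', '"bbbbbbb"   ', '"ccccccc"']
```

Each returned field still carries its quotes and any pad characters between the end quote and the delimiter, so trimming and escape removal remain separate later steps.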

The above order of processing leads to the following behaviour when 
trimKind is 'padChar'. Let's say I am exporting CSV data from Excel:
  Data: xx<sp><sp>      Infoset: xx 
  Data: "x,x<sp><sp>"   Infoset: x,x<sp><sp>
Explanation: The second data is the same as the first except that I have 
added a comma, which causes Excel to escape with quotes in its normal way. 
The trimming takes place before the escapes are removed, so the first data 
loses the spaces while the second keeps them in the infoset. I don't think 
this is what a user would expect. (Note that Excel escapes the whole field).
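To make the asymmetry concrete, here is a minimal Python sketch of the trim-then-unescape order for a left-justified field. The function name parse_field is hypothetical, and the space pad character and double-quote escape block are assumptions taken from the CSV example; it is not meant to reproduce any DFDL processor.

```python
def parse_field(raw: str, pad: str = " ", quote: str = '"') -> str:
    """Trim trailing pad characters first, then remove an enclosing escape block."""
    trimmed = raw.rstrip(pad)  # left-justified: padding sits on the right
    if len(trimmed) >= 2 and trimmed.startswith(quote) and trimmed.endswith(quote):
        return trimmed[1:-1]   # remove escape block start/end, keep the contents
    return trimmed


print(repr(parse_field("xx  ")))     # 'xx'
print(repr(parse_field('"x,x  "')))  # 'x,x  '
```

In the quoted case the end quote shields the spaces from the trim step, so they survive into the result.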

It seems to me there are competing requirements here, and we need to decide 
whether they all need to be satisfied by DFDL 1.0. There are several 
possibilities to consider; here are some for starters:
- Keep current rule but only allow it with left-justified fields 
- Keep current rule and trim 'as you go' rather than after extracting the 
data
- Extend current rule and trim before and after escape character removal
- Change rule to trim after escape character removal and handle 
leading/trailing text via delimiters 
- Change rule to trim after escape character removal and allow %WSP; etc 
in escape block start/end strings
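As one possible reading of the third option (trim before and after escape character removal), a rough Python sketch for left-justified fields might look like this. The name parse_trim_both is hypothetical, and the space pad character and double-quote escape block are assumptions carried over from the examples.

```python
def parse_trim_both(raw: str, pad: str = " ", quote: str = '"') -> str:
    """Trim pad characters, remove an enclosing escape block, then trim again."""
    text = raw.rstrip(pad)               # first trim: padding outside the quotes
    if len(text) >= 2 and text.startswith(quote) and text.endswith(quote):
        text = text[1:-1].rstrip(pad)    # second trim: padding inside the quotes
    return text


print(repr(parse_trim_both("xx  ")))         # 'xx'
print(repr(parse_trim_both('"x,x  "')))      # 'x,x'
print(repr(parse_trim_both('"aaa,aaaa" ')))  # 'aaa,aaaa'
```

Under this reading the quoted and unquoted Excel fields are trimmed consistently, at the cost that trailing pad characters inside an escape block can no longer be preserved.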

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848




From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     dfdl-wg at ogf.org, 
Date:   25/10/2012 13:45
Subject:        [DFDL-WG] Clarification needed: pad/trim and delimited 
length
Sent by:        dfdl-wg-bounces at ogf.org




The spec says that pad characters are removed before escape scheme 
processing. 

However, in delimited context, I can't even determine the length of the 
field to trim off the padding unless I can do the escape scheme 
processing.

This is either a Chicken-Egg, or the algorithm for parsing is 
substantially more complex due to padding.

Comments?

-- 
Mike Beckerle | OGF DFDL WG Co-Chair 
Tel:  781-330-0412
--
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  https://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
