[DFDL-WG] Clarification needed: pad/trim and delimited length
Steve Hanson
smh at uk.ibm.com
Mon Nov 12 12:36:59 EST 2012
It was decided on the last DFDL WG call to leave the behaviour as currently
specified, since it is possible to implement the current behaviour.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl at gmail.com>,
Cc: dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
Date: 26/10/2012 13:39
Subject: Re: [DFDL-WG] Clarification needed: pad/trim and delimited
length
Mike
For escape blocks, the escape start/end character must be the first/last
character in the text. The order of processing stated in the spec was an
attempt to handle the situation where the escape start/end character was
not the first/last character in the text due to padding. So for examples
like:
Variable length: "aaa,aaaa" ,"bbbbbbb" ,"ccccccc"
Fixed length: "aaa,aaaa" "bbbbbbb" "ccccccc"
From an email discussion several years ago, in answer to Alan's
question "Should we only look for escapeStartString at the beginning of
the data?", Mike, you said: "I'd prefer that we respect them anywhere, but
canonical form when generated is at the beginning of the data. However, if
we want to be more restrictive/conservative for v1.0 I'm fine with that."
I tested IBM DFDL's implementation. The delimited example above
(left-justified) worked ok: the parser recognised the start quote and
switched on escaping, correctly escaped the comma, then found the end quote
and switched off escaping, then found the delimiter. With trimKind 'none' it
issued an error to the effect that there was text between the quote and the
next delimiter. With trimKind 'padChar' it worked ok and trimmed off the pad
before going on to remove the quotes. However, when the scenario was
right-justified, it got it wrong, which I think is your point.
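The left-justified versus right-justified difference can be illustrated with a small Python sketch. This is not IBM DFDL's code, and it shows only one plausible way a right-justified field could go wrong. The function scan_field and its parameters are hypothetical, and it assumes a comma delimiter, a double quote as escape block start and end, and a space as the pad character.

```python
def scan_field(data, pos, delim=",", quote='"', pad=" ", trim=True):
    """Return (value, next_pos) for the field starting at pos.

    The escape block is honoured only when the start quote is the
    first character of the field (assumption for this sketch).
    """
    if data.startswith(quote, pos):
        close = data.find(quote, pos + 1)
        if close == -1:
            raise ValueError("unterminated escape block")
        value = data[pos + 1:close]
        end = data.find(delim, close + 1)
        if end == -1:
            end = len(data)
        # Text between the end quote and the delimiter: padding is
        # tolerated only when trimming is switched on.
        trailing = data[close + 1:end]
        leftover = trailing.strip(pad) if trim else trailing
        if leftover:
            raise ValueError("text between end quote and delimiter")
        return value, end + 1
    # No escape block at the first character: the next delimiter wins.
    end = data.find(delim, pos)
    if end == -1:
        end = len(data)
    raw = data[pos:end]
    return (raw.strip(pad) if trim else raw), end + 1


# Left-justified: quote is first, pad sits between end quote and delimiter.
print(scan_field('"aaa,aaaa"  ,"bbbbbbb"', 0))
# Right-justified: pad precedes the quote, so the embedded comma is
# taken as the delimiter and the field is split in the wrong place.
print(scan_field('  "aaa,aaaa","bbbbbbb"', 0))
```

With trim=False the left-justified call raises an error instead, which mirrors the trimKind 'none' behaviour described above.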
The above order of processing leads to the following behaviour when
trimKind is 'padChar'. Let's say I am exporting CSV data from Excel:
Data: xx<sp><sp> Infoset: xx
Data: "x,x<sp><sp>" Infoset: x,x<sp><sp>
Explanation: The second data item is the same as the first except that I
have added a comma, which causes Excel to escape the field with quotes in
its normal way. Trimming takes place before the escapes are removed, so the
first item loses the spaces while the second keeps them in the infoset. I
don't think this is what a user would expect. (Note that Excel escapes the
whole field.)
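To make the effect of the ordering concrete, here is a minimal Python sketch. It is not taken from the spec or from any implementation. It assumes left-justified fields (so padding is trimmed on the right), a space as the pad character, and a double quote as both escape block start and end. All helper names are hypothetical.

```python
def strip_escape_block(text, start='"', end='"'):
    """Remove the escape block only if start/end are the first/last characters."""
    if (len(text) >= len(start) + len(end)
            and text.startswith(start) and text.endswith(end)):
        return text[len(start):len(text) - len(end)]
    return text


def trim_then_unescape(raw, pad=" "):
    # Order discussed in this thread: trim pad characters, then remove escapes.
    return strip_escape_block(raw.rstrip(pad))


def unescape_then_trim(raw, pad=" "):
    # Alternative order: remove escapes first, then trim pad characters.
    return strip_escape_block(raw).rstrip(pad)


for raw in ['xx  ', '"x,x  "']:
    print(repr(raw), repr(trim_then_unescape(raw)), repr(unescape_then_trim(raw)))
```

Under the first order the quoted field keeps its trailing spaces, because the end quote shields them from the trim. Under the second order both fields lose them.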
It seems to me there are competing requirements here, and we need to decide
whether they all need to be satisfied by DFDL 1.0. There are several
possibilities to consider; here are some for starters:
- Keep current rule but only allow it with left-justified fields
- Keep current rule and trim 'as you go' rather than after extracting the
data
- Extend current rule and trim before and after escape character removal
- Change rule to trim after escape character removal and handle
leading/trailing text via delimiters
- Change rule to trim after escape character removal and allow %WSP; etc
in escape block start/end strings
From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
To: dfdl-wg at ogf.org,
Date: 25/10/2012 13:45
Subject: [DFDL-WG] Clarification needed: pad/trim and delimited
length
Sent by: dfdl-wg-bounces at ogf.org
The spec says that pad characters are removed before escape scheme
processing.
However, in delimited context, I can't even determine the length of the
field to trim off the padding unless I can do the escape scheme
processing.
This is either a chicken-and-egg problem, or the algorithm for parsing is
substantially more complex due to padding.
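The dependency can be sketched in Python as follows. This is an illustration only, not the algorithm from the spec. Both function names are hypothetical, and the escape-aware scan assumes a double quote toggles the escape block wherever it appears.

```python
def field_end_naive(data, start, delim=","):
    """Index of the first delimiter at or after start, ignoring escapes."""
    i = data.find(delim, start)
    return len(data) if i == -1 else i


def field_end_escape_aware(data, start, delim=",", quote='"'):
    """Index of the first delimiter at or after start outside an escape block."""
    in_block = False
    for i in range(start, len(data)):
        ch = data[i]
        if ch == quote:
            in_block = not in_block
        elif ch == delim and not in_block:
            return i
    return len(data)


data = '"aaa,aaaa"  ,"bbbbbbb"  ,"ccccccc"'
print(repr(data[:field_end_naive(data, 0)]))
print(repr(data[:field_end_escape_aware(data, 0)]))
```

The naive scan stops at the comma inside the quotes, so the padding to trim is only known once the escape scheme has been taken into account when locating the delimiter.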
Comments?
--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU