[DFDL-WG] clarification on when escape characters are needed

Steve Hanson smh at uk.ibm.com
Wed Jun 19 10:03:19 EDT 2013


"There is some discussion currently of whether the last child element 
should require escaping of an infix separator"

I am against automatically escaping infix/prefix separators in the last 
child element of a sequence.  In our old MRM parser we had this behaviour, 
and we had several instances where extraneous fields, which were 
(incorrectly) unmodelled, got scooped into the last child of a sequence 
without any warning. In other words the model was wrong but the parser 
just carried on with corrupt data.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848



From:   Mike Beckerle <mbeckerle.dfdl at gmail.com>
To:     "Garriss Jr., James P." <jgarriss at mitre.org>, 
Cc:     "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
Date:   19/06/2013 14:28
Subject:        Re: [DFDL-WG] clarification on when escape characters are 
needed
Sent by:        dfdl-wg-bounces at ogf.org



The only for-sure fix to this issue I see is that you have to change the 
model. You can't model the = as a separator, because it is legal to appear 
unescaped in the content of the second child element. 

So, you have to model this as a sequence with no separator, of two 
children elements. The first of which has "=" as its terminator. Then 
there is no interpretation of = as framing/syntax in the second element.
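
A minimal sketch of that model, for illustration only. It assumes delimited-length strings, borrows the element names a and b from the password example quoted at the bottom of this thread, and assumes all other required DFDL properties (encoding and so on) are already in scope. It has not been run against any DFDL processor.

```xml
<!-- Sketch only: no separator on the sequence, so "=" is never framing for b. -->
<sequence dfdl:separator="">
 <!-- "=" ends a; it is a terminator of a, not a separator of the group -->
 <element name="a" type="xs:string"
  dfdl:lengthKind="delimited" dfdl:terminator="="/>
 <!-- b runs to the next enclosing delimiter or the end of the data -->
 <element name="b" type="xs:string" dfdl:lengthKind="delimited"/>
</sequence>
```

For the data password=f82+=7&%q the intent is that a receives "password" and b receives "f82+=7&%q", since "=" stops being a delimiter once a has been parsed.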

There is some discussion currently of whether the last child element 
should require escaping of an infix separator, but regardless of the 
outcome of that, the above technique will work, and is preferred in the 
sense that there's no question about it. 

...mike

On Wed, Jun 19, 2013 at 9:01 AM, Garriss Jr., James P. <jgarriss at mitre.org> wrote:
The origin of this issue is the Content-Type header in email, where the 
parameters can be quoted, but sometimes are not:
 
Content-Type: text/html; charset="UTF-8"
Content-Type: text/html; charset=UTF-8
 
This was not a big deal until I ran into a parameter that included an = in 
the value:
 
Content-Type: multipart/alternative; 
boundary="----=_Part_150709_149622714.1370937621731"
 
When confronted with this issue, I was told:
 
>> there's a pretty simple
>> fix: specify an escape scheme that says that anything inside quotes is
>> not a delimiter. And fortunately your DefaultProperties.xsd file
>> actually comes with an escape scheme that does exactly that.
>> 
>> So all you have to do is add this:
>> 
>>       dfdl:escapeSchemeRef="DefaultPropertiesEscapeScheme"
>> 
>> to this:
>> 
>>        <xsd:element name="value" type="xsd:string" />
>> 
 
You may well recognize this scheme, as it’s yours:
 
        <dfdl:defineEscapeScheme name="DefaultPropertiesEscapeScheme">
            <dfdl:escapeScheme escapeBlockEnd="&quot;" escapeBlockStart="&quot;"
                escapeCharacter="&quot;" escapeEscapeCharacter="&quot;"
                escapeKind="escapeBlock"
                extraEscapedCharacters=", %#x0D; %#x0A;"
                generateEscapeBlock="whenNeeded"></dfdl:escapeScheme>
        </dfdl:defineEscapeScheme>
 
I used this solution for the parameters of the Content-Type header, which 
are key/value pairs.
 
        <xsd:sequence dfdl:separator="=">
            <!-- this init is a workaround for Daffodil 0.10 bug (see ContentType element above) -->
            <xsd:element name="key" dfdl:initiator="%WSP*;">
                <xsd:annotation>
                    <xsd:appinfo source="http://www.ogf.org/dfdl/dfdl-1.0/">
                        <dfdl:assert test="{ dfdl:checkConstraints(.) }"
                            message="The parameter key must match one of the values on the enumerated list."/>
                    </xsd:appinfo>
                </xsd:annotation>
                <xsd:simpleType>
                    <xsd:restriction base="xsd:string">
                        <xsd:enumeration value="charset"/>
                        <xsd:enumeration value="name"/>
                        <xsd:enumeration value="boundary"/>
                    </xsd:restriction>
                </xsd:simpleType>
            </xsd:element>
            <!-- Daffodil 0.10.1 fails here if there's an = in the value. -->
            <xsd:element name="value" type="xsd:string"
                dfdl:escapeSchemeRef="DefaultPropertiesEscapeScheme"/>
        </xsd:sequence>
 
Without the scheme, I get an error.  With it, it works great.
 
So is this an inappropriate use of an escape scheme?  
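
The reply from Mike Beckerle at the top of this thread suggests dropping the separator and giving the first child an "=" terminator. Applied to the key/value pair above, that might look like the following sketch. It assumes dfdl:lengthKind="delimited" and the other required properties come from the default format already in scope, and it leaves out the enumeration and assert on key for brevity. Whether the escape scheme is still needed on value once "=" is no longer a separator is an open assumption; it is kept here only so quoted values are treated as before. This has not been tested against Daffodil.

```xml
<!-- Sketch only: the enumeration and assert on key are omitted for brevity. -->
<xsd:sequence dfdl:separator="">
    <!-- "=" terminates key instead of separating the group -->
    <xsd:element name="key" type="xsd:string"
        dfdl:initiator="%WSP*;" dfdl:terminator="="/>
    <!-- the "=" terminator belongs to key, so it is not in scope for value -->
    <xsd:element name="value" type="xsd:string"
        dfdl:escapeSchemeRef="DefaultPropertiesEscapeScheme"/>
</xsd:sequence>
```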
 
From: Steve Hanson [mailto:smh at uk.ibm.com] 
Sent: Wednesday, June 19, 2013 8:47 AM
To: Garriss Jr., James P.
Cc: dfdl-wg at ogf.org; Mike Beckerle
Subject: RE: [DFDL-WG] clarification on when escape characters are needed
 
James 

I don't see how an escape scheme helps here.  The "f82+=7&%q" is all data, 
there's no escape character. 

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 



From:        "Garriss Jr., James P." <jgarriss at mitre.org> 
To:        Steve Hanson/UK/IBM at IBMGB, Mike Beckerle <mbeckerle.dfdl at gmail.com>, 
Cc:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org> 
Date:        19/06/2013 12:33 
Subject:        RE: [DFDL-WG] clarification on when escape characters are 
needed 




> The DFDL 1.0 spec implies the behaviour where you get… 
  
If this is the direction the WG goes, can you please make this explicit 
rather than implicit?  Using Mike’s excellent example below would go a 
long way to making the issue clear.   
  
As for a solution, would it not be better to use an escape scheme, like 
this? 
  
<sequence dfdl:separator="=" dfdl:separatorPosition="infix">
 <element name="a" type="xs:string"/>
 <element name="b" type="xs:string" 
 dfdl:escapeSchemeRef="DefaultPropertiesEscapeScheme"/>
</sequence> 
  
(Cred to Taylor) 
  
If so, it would be helpful to include that in the example. 
  
From: dfdl-wg-bounces at ogf.org [mailto:dfdl-wg-bounces at ogf.org] On Behalf 
Of Steve Hanson
Sent: Wednesday, June 19, 2013 5:29 AM
To: Mike Beckerle
Cc: dfdl-wg at ogf.org
Subject: Re: [DFDL-WG] clarification on when escape characters are needed 
  
The DFDL 1.0 spec implies the behaviour where you get: 

<a>password</a> 
<b>f82+</b> 

followed by a processing error.  There is no special casing of the last 
element in the group. 

Changing the model to the following achieves the desired infoset: 

<sequence dfdl:separator="=" dfdl:separatorPosition="infix">
 <element name="a" type="xs:string"/>
 <sequence dfdl:separator="">
   <element name="b" type="xs:string"/>
</sequence>
</sequence>


Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848 



From:        Tim Kimber/UK/IBM at IBMGB 
To:        dfdl-wg at ogf.org, 
Date:        19/06/2013 09:37 
Subject:        Re: [DFDL-WG] clarification on when escape characters are 
needed 
Sent by:        dfdl-wg-bounces at ogf.org 





In the IBM implementation we have taken the view that the separator 
defines the format for all of the group's content. That means that all 
separators are counted as being significant, even if they occur within the 
content region of the final group member. 
I agree that other interpretations are possible - the MRM parser in 
earlier versions of WebSphere Message Broker takes an infix separator out 
of scope when it encounters the final declared child of a group. 

I intend to address this point when I write up the rules for matching 
string literals and delimiters. 

regards,

Tim Kimber, DFDL Team,
Hursley, UK
Internet:  kimbert at uk.ibm.com
Tel. 01962-816742  
Internal tel. 37246742




From:        Mike Beckerle <mbeckerle.dfdl at gmail.com> 
To:        dfdl-wg at ogf.org, 
Date:        19/06/2013 03:52 
Subject:        [DFDL-WG] clarification on when escape characters are 
needed 
Sent by:        dfdl-wg-bounces at ogf.org 






Suppose I have a sequence. It has an infix separator which is "=".

<sequence dfdl:separator="=" dfdl:separatorPosition="infix">
 <element name="a" type="xs:string"/>
 <element name="b" type="xs:string"/>
</sequence>

Now, consider this data:

password=f82+=7&%q

I want 

<a>password</a>
<b>f82+=7&%q</b>

Notice how the b element contains an '=' which was not escaped in any way 
in the sequence. Element b is statically known to be last, the separator 
is infix; hence, things are unambiguous even if there is no escaping.

However, there is an alternative interpretation, which is that the above 
data should fail, because it produces <a>password</a><b>f82+</b> but then 
does not find the expected stuff next. Rather it finds the '=7&%q' data. 
In other words, the sequence separator divides the sequence content into 3 
content regions, but there aren't 3 things to consume those, so it is a 
processing error. 

Which is correct? 

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com
--
dfdl-wg mailing list
dfdl-wg at ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg 

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
