[DFDL-WG] how to trim inside of escape block?

Mike Beckerle mbeckerle.dfdl at gmail.com
Wed Nov 22 12:40:13 EST 2017


Another related problem:

a, b, notList, c, d
a, b, "list1, list2, list3",c,d

Here the 3rd field is a comma-separated list. It is quoted only if there is
more than one list item.

I think to parse this I have to treat the quotation marks as
initiator/terminator, and set dfdl:separator="", but since the quotes are
optional for the single-list-item case, I'm going to need a choice.

I think the best I can do is

<ignore:ListOf1__XMLSchemaMakesMeHaveThisForUPA/>
<List>notList</List>

and

<ignore:ListOfN__XMLSchemaMakesMeHaveThisForUPA/>
<List>list1</List><List>list2</List><List>list3</List>

as the XML representations.
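
For concreteness, a rough and untested sketch of a choice that could yield
those two representations. The short marker names ListOfN and ListOf1 are
placeholders for the longer ignore: elements above. Giving the markers no
representation via dfdl:inputValueCalc is my assumption, as is a default
format with dfdl:lengthKind="delimited" and an enclosing sequence whose
dfdl:separator is ",".

```xml
<xs:choice>
  <!-- Quoted branch: the quotation marks act as initiator/terminator -->
  <xs:sequence>
    <!-- Marker element (placeholder name); computed, consumes no data -->
    <xs:element name="ListOfN" type="xs:string"
                dfdl:inputValueCalc="{ '' }"/>
    <xs:sequence dfdl:initiator='"' dfdl:terminator='"'
                 dfdl:separator="," dfdl:separatorPosition="infix">
      <xs:element name="List" type="xs:string" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:sequence>
  <!-- Unquoted branch: exactly one list item, no quotes -->
  <xs:sequence>
    <!-- Distinct marker name so both branches do not start with List (UPA) -->
    <xs:element name="ListOf1" type="xs:string"
                dfdl:inputValueCalc="{ '' }"/>
    <xs:element name="List" type="xs:string"/>
  </xs:sequence>
</xs:choice>
```

The quoted branch is tried first. If the opening quote is not found, the
parser should backtrack to the unquoted branch. Whitespace after the commas
inside the quotes is not handled here.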

Are there any better/cleaner solutions?

I did think of the following approach (note: I've omitted xs:annotation and
xs:appinfo for brevity), but it isn't exactly "clean".
This is what I call "modeling syntax as data".

<dfdl:defineVariable name="foundOpenQuote" type="xs:boolean"/>

<xs:group name="optionalOpenQuote">
   <xs:choice>
     <!-- Opening quote present: remember that we saw it -->
     <xs:sequence dfdl:initiator='"'>
          <dfdl:setVariable ref="foundOpenQuote" value="{ fn:true() }"/>
     </xs:sequence>
     <!-- No opening quote -->
     <xs:sequence dfdl:initiator=""/>
   </xs:choice>
</xs:group>

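<xs:group name="matchingCloseQuote">
   <xs:choice>
     <!-- Closing quote is accepted only if an opening quote was seen -->
     <xs:sequence dfdl:terminator='"'>
          <dfdl:discriminator>{ $foundOpenQuote eq fn:true() }</dfdl:discriminator>
     </xs:sequence>
     <!-- No closing quote -->
     <xs:sequence />
   </xs:choice>
</xs:group>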


// The main sequence for the data would then have this as the list element:

<xs:sequence>
   <dfdl:newVariableInstance ref="foundOpenQuote" defaultValue="false"/>
   <xs:sequence dfdl:hiddenGroupRef="optionalOpenQuote"/>
   <xs:sequence dfdl:separator=",">
      <xs:element name="List" type="xs:string" maxOccurs="unbounded"/>
   </xs:sequence>
   <xs:sequence dfdl:hiddenGroupRef="matchingCloseQuote"/>
</xs:sequence>

I'd try this out, except that we haven't got dfdl:newVariableInstance yet.



Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>


On Wed, Nov 22, 2017 at 4:11 AM, Steve Hanson <smh at uk.ibm.com> wrote:

> I don't think there is a way to achieve what you want. As you say,
> trimming pad chars takes precedence over applying escape scheme.
>
> I wondered if you could define the escapeBlockStart and End as "%WSP*;
> and %WSP*;" respectively but the white space entities are not allowed as
> escape character or in escape block start/end.
>
> Regards
>
> Steve Hanson
>
> IBM Hybrid Integration, Hursley, UK
> Architect, *IBM DFDL*
> <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, *OGF DFDL Working Group* <http://www.ogf.org/dfdl/>
> smh at uk.ibm.com
> tel:+44-1962-815848
> mob:+44-7717-378890
>
>
>
> From:        Mike Beckerle <mbeckerle.dfdl at gmail.com>
> To:        "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
> Date:        22/11/2017 01:28
> Subject:        [DFDL-WG] how to trim inside of escape block?
> Sent by:        "dfdl-wg" <dfdl-wg-bounces at ogf.org>
> ------------------------------
>
>
>
>
> I have a CSV file
>
> Some lines look like this
>
> a,b,"   started with spaces, appearing right after the escape block start   ",c,d,e
>
> I reviewed the spec, and I see that pad characters appear outside of the
> quotation marks (escape block start/end).
>
> What I'm trying to do is remove the whitespace after the escape block
> start, and before the escape block end. This is just spurious whitespace,
> appears because some of these CSV files were edited by people.
>
> In my data the quoting characters are not always present. They are only
> there if a comma appears in the data string.
>
> Is there a technique for getting rid of the leading/trailing whitespace
> inside the escape block start/end that I have forgotten?
>
> ...mikeb
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>
>