[DFDL-WG] DFDL Modeling Question

Bradley Sexton bradley.r.sexton at gmail.com
Thu Mar 1 13:05:55 EST 2012


That was the plan, but I'm getting errors when I attempt to parse. When the
parser runs into two separators in succession (e.g. "{\") I get "An unexpected
non-postfix separator } occurs in a postfix position at offset . . .",
and it fails on the first two characters of the file.

The schema is very basic:

<element name="RTF">
    <complexType>
        <sequence dfdl:separator="\ } {">
            <element name="field" type="string" minOccurs="0" maxOccurs="unbounded"/>
        </sequence>
    </complexType>
</element>

If I add separators for "{\", "}}", and other combinations, the file parses
fine. So I'm guessing there is a value or setting somewhere in the format
definition that I have missed or set incorrectly, and that this is causing the
parser to error out instead of seeing an empty record and skipping over it?
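
A minimal Python sketch of the behaviour expected here. It is an illustration only: it is not DFDL, it says nothing about how the IBM parser works internally, and the helper name `split_fields` is made up. Every `\`, `{` and `}` is treated as a separator, and the empty strings that appear between adjacent separators are dropped unless asked for.

```python
import re

# Each of the three characters is a one-character separator.
# Hypothetical helper for illustration; not a DFDL implementation.
SEPARATOR = re.compile(r"[\\{}]")

def split_fields(text, keep_empty=False):
    """Split on backslash and curly braces; optionally keep empty fields."""
    fields = SEPARATOR.split(text)
    return fields if keep_empty else [f for f in fields if f]

sample = r"{\rtf1\ansi{\fonttbl{\f0\froman;}}\par}"

# With keep_empty=True the adjacent separators show up as '' entries.
print(split_fields(sample, keep_empty=True))

# With the empty entries dropped, only the non-empty fields remain.
print(split_fields(sample))  # ['rtf1', 'ansi', 'fonttbl', 'f0', 'froman;', 'par']
```

The empty entries in the first list correspond to the zero-length occurrences between adjacent separators such as "{\" and "}}".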

It seems so straightforward in theory that it's very frustrating to see it
not working : )

Thanks,
Bradley



On Thu, Mar 1, 2012 at 11:48 AM, Steve Hanson <smh at uk.ibm.com> wrote:

> Hi Bradley
>
> I think this would work. Presumably the controlWord element would be
> minOccurs='0', maxOccurs='unbounded'? If so all occurrences are optional,
> and empty optional elements won't be added to the infoset. So you won't
> have unwanted empty elements in the infoset.
>
> Regards
>
> Steve Hanson
> Architect, Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel:+44-1962-815848
>
>
>
> From:        Bradley Sexton <bradley.r.sexton at gmail.com>
> To:        Steve Hanson/UK/IBM at IBMGB
> Cc:        dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
> Date:        01/03/2012 14:48
> Subject:        Re: [DFDL-WG] DFDL Modeling Question
> Sent by:        dfdl-wg-bounces at ogf.org
> ------------------------------
>
>
>
> After some internal discussion I believe we are going to put RTF on the
> shelf for the time being and look at some other formats. One question did
> come up that I was hoping someone here might be able to help with. I was
> asked if there was a way to flat model RTF such that it would work for any
> file size or depth of nested groups, similar to what Steve proposed earlier:
>
>         dfdl:separator="\ }\ }}\ }}}\ {\ }{\ }}{\ }}}{\" dfdl:separatorPosition="prefix"
>
> but suitable for any number of "}" characters before the "\" or "{\". A
> possibility suggested to me was to use:
>
>         dfdl:separator="\ { }"
>
> to treat every instance of these symbols as a separator and, in cases such
> as "}}{\", to treat the value between each pair of adjacent characters as
> empty or null. If you have any thoughts on this method, or alternatives for a
> general flat model, they would be greatly appreciated.
>
> Bradley
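
The "any number of `}`" idea can be pictured with a small Python sketch. This is an assumption-laden illustration, not DFDL: a DFDL separator is a list of literal strings, whereas the regular expression below (and the name `flat_fields`) exists only in this example. It treats any run of `}`, an optional `{`, and the `\` that starts a control word as one prefix separator.

```python
import re

# Hypothetical generalisation of the explicit separator list: zero or more
# "}" characters, then an optional "{", then the backslash that starts a
# control word.
PREFIX_SEPARATOR = re.compile(r"\}*\{?\\")

def flat_fields(rtf):
    """Return the text that follows each prefix separator, in order."""
    parts = PREFIX_SEPARATOR.split(rtf)
    # parts[0] is whatever precedes the first separator; it is empty when
    # the input starts with a separator, as the sample below does.
    return parts[1:]

sample = r"{\rtf1{\fonttbl{\f0\froman Times;}{\f1\fswiss Arial;}}}\pard\par}"
for field in flat_fields(sample):
    print(repr(field))
```

Because the closing brace at the very end of the input is not followed by a `\`, it stays attached to the last field (`'par}'` here), so a flat model along these lines would still need some way to account for the trailing braces. A `{` that is followed directly by text rather than by `\` is also not treated as a separator in this sketch.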
>
>
>
> On Fri, Feb 24, 2012 at 10:31 AM, Bradley Sexton <bradley.r.sexton at gmail.com> wrote:
> Steve,
>
> The order of nested groups is somewhat fluid in RTF, and my concern is
> whether or not modeling everything completely flat would preserve the
> structure and formatting properly. If you modify the text formatting in a
> file, for example by inserting a comment, a new group is created, and any
> data entered within the comment, or any existing text highlighted by the
> comment, is moved into new groups to signify the link.
>
> Feel free to put me down for the WG call, just let me know the time and
> call info.
>
> Thanks,
> Bradley Sexton
>
>
>
> On Thu, Feb 23, 2012 at 4:31 PM, Steve Hanson <smh at uk.ibm.com> wrote:
> Hi Bradley
>
> Yes, dfdl:lengthKind "pattern" is the ideal way to model this.
>
> I'm struggling to find a way to model this that preserves the nested
> groups and separates the trailing data from the control word. However, if
> you were prepared to lose the group structure and treat the trailing data
> as part of the control word, then you could model a completely flat
> structure with the various delimiters interpreted as prefix separators.
>
>         dfdl:separator="\ }\ }}\ }}}\ {\ }{\ }}{\ }}}{\" dfdl:separatorPosition="prefix"
>
> That would give you an infoset like:
>
> <file>
>    <controlWord>rtf1</controlWord>
>    <controlWord>ansi</controlWord>
>    <controlWord>ansicpg1252</controlWord>
>    <controlWord>deff0</controlWord>
>    <controlWord>deflang1033</controlWord>
>    <controlWord>fonttbl</controlWord>
>    <controlWord>f0</controlWord>
>    <controlWord>froman</controlWord>
>    <controlWord>fprq2</controlWord>
>    <controlWord>fcharset0 Times New Roman;</controlWord>
>    <controlWord>f1</controlWord>
>    <controlWord>fswiss</controlWord>
>    <controlWord>fcharset0 Arial;</controlWord>
>    <controlWord>*</controlWord>
>    <controlWord>generator Msftedit 5.41.15.1515;</controlWord>
>    <controlWord>viewkind4</controlWord>
>    <controlWord>uc1</controlWord>
>    <controlWord>pard</controlWord>
>    <controlWord>f0</controlWord>
>    <controlWord>fs24 This is an example document of an RTF file.</controlWord>
>    <controlWord>f1</controlWord>
>    <controlWord>fs20</controlWord>
>    <controlWord>par</controlWord>
>    <controlWord>*</controlWord>
>    <controlWord>passwordhash 010000004c000000010000000480000050c3. . .</controlWord>
> </file>
>
> Not ideal. I'll carry on thinking about the problem.
>
> If you like I'll add you to the invite list for the DFDL WG call next
> Tuesday and we can discuss further?
>
> Regards
>
> Steve Hanson
> Architect, Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> IBM SWG, Hursley, UK
> smh at uk.ibm.com
> tel:+44-1962-815848
>
>
>
> From:        Bradley Sexton <bradley.r.sexton at gmail.com>
> To:        dfdl-wg at ogf.org
> Date:        23/02/2012 19:07
> Subject:        [DFDL-WG] DFDL Modeling Question
> Sent by:        dfdl-wg-bounces at ogf.org
> ------------------------------
>
>
>
>
> Hello,
>
> I've been looking at modeling Rich Text Format (RTF) files using the IBM
> Message Broker DFDL implementation, and ran into an issue. For some
> background, here's a small example of an RTF file:
>
> {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0
> Times New Roman;}{\f1\fswiss\fcharset0 Arial;}}{\*\generator Msftedit
> 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs24 This is an example document of an
> RTF file.\f1\fs20\par{\*\passwordhash
> 010000004c000000010000000480000050c300001400000010000000f89c360d0c9d360d000000008bc29e2f78a2144122ed68a1701e2ea50bbbbeaf7333c40dfe048ccf55f709b8cc7e8b49}}
>
>
> '\' and '\*\' mark the beginning of control words, and the curly braces
> mark the beginning and end of control groups that contain control words and
> data. My issue is that control words and data do not have suitable
> terminators for parsing. The end of a control word is signified by a space
> when trailing data is present, but typically it is ended by the '\'
> signalling the beginning of a new word, or by a curly brace signalling the
> end of the current control group or the beginning of a new one. Similarly,
> data is typically ended by the '}' of the parent control group.
>
> With the exception of a small header, the value and placement of control
> words, groups, and data vary by file.
>
> My issue with modeling this is that I was going to use
> dfdl:lengthKind="pattern" in lieu of suitable delimiters, but this feature
> is not implemented by IBM. I'm looking for an alternative way to model the
> data, and was hoping someone on the mailing list might have suggestions. My
> goal is to model control words and groups in as general a manner as
> possible given IBM's implementation restrictions, since RTF has over 1800
> defined control words and gives you the ability to create your own.
>
> Ideal output for the above sample would be something along these lines:
>
> <file>
>    <controlWord>rtf1</controlWord>
>    <controlWord>ansi</controlWord>
>    <controlWord>ansicpg1252</controlWord>
>    <controlWord>deff0</controlWord>
>    <controlWord>deflang1033</controlWord>
>    <controlGroup>
>        <name>fonttbl</name>
>        <controlGroup>
>            <name>f0</name>
>            <controlWord>froman</controlWord>
>            <controlWord>fprq2</controlWord>
>            <controlWord>fcharset0</controlWord>
>            <data>Times New Roman;</data>
>        </controlGroup>
>        <controlGroup>
>            <name>f1</name>
>            <controlWord>fswiss</controlWord>
>            <controlWord>fcharset0</controlWord>
>            <data>Arial;</data>
>        </controlGroup>
>    </controlGroup>
>    <controlGroup>
>        <name>generator</name>
>        <data>Msftedit 5.41.15.1515;</data>
>    </controlGroup>
>    <controlWord>viewkind4</controlWord>
>    <controlWord>uc1</controlWord>
>    <controlWord>pard</controlWord>
>    <controlWord>f0</controlWord>
>    <controlWord>fs24</controlWord>
>    <text>This is an example document of an RTF file.</text>
>    <controlWord>f1</controlWord>
>    <controlWord>fs20</controlWord>
>    <controlWord>par</controlWord>
>    <controlGroup>
>        <name>passwordhash</name>
>        <data>010000004c000000010000000480000050c3. . .</data>
>    </controlGroup>
> </file>
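
For comparison, the nested structure above can be produced outside DFDL with a short Python sketch. Everything here is an assumption made for illustration: the token pattern is inferred from the sample in this thread only, the names `TOKEN`, `parse` and `dump` are invented, RTF escapes such as `\{` or `\'hh` are not handled (unrecognised characters are skipped), and the `\*` marker is simply ignored.

```python
import re

# Token pattern inferred from the sample in this thread (an assumption).
TOKEN = re.compile(r"""
    (?P<open>\{)                        # start of a group
  | (?P<close>\})                       # end of a group
  | \\(?P<star>\*)                      # the star marker, ignored below
  | \\(?P<word>[A-Za-z]+-?\d*)[ ]?      # control word plus optional space
  | (?P<text>[^\\{}]+)                  # plain data up to the next delimiter
""", re.VERBOSE)

def parse(rtf):
    """Return the children of the outermost group as nested tuples."""
    root = []
    stack = [root]                      # stack[-1] is the list being filled
    for m in TOKEN.finditer(rtf):
        kind = m.lastgroup
        if kind == "open":
            children = []
            stack[-1].append(("controlGroup", children))
            stack.append(children)
        elif kind == "close" and len(stack) > 1:
            stack.pop()
        elif kind == "word":
            # The first word of a nested group names it, as in the ideal output.
            first_in_nested = len(stack) > 2 and not stack[-1]
            label = "name" if first_in_nested else "controlWord"
            stack[-1].append((label, m.group("word")))
        elif kind == "text":
            label = "data" if len(stack) > 2 else "text"
            stack[-1].append((label, m.group("text")))
    # The outermost {...} is the file itself, so return its children.
    if root and root[0][0] == "controlGroup":
        return root[0][1]
    return root

def dump(nodes, indent=0):
    """Print nodes as indented pseudo-XML (no escaping is applied)."""
    pad = "   " * indent
    for label, value in nodes:
        if label == "controlGroup":
            print(f"{pad}<controlGroup>")
            dump(value, indent + 1)
            print(f"{pad}</controlGroup>")
        else:
            print(f"{pad}<{label}>{value}</{label}>")

sample = (r"{\rtf1\ansi{\fonttbl{\f0\froman\fcharset0 Times New Roman;}}"
          r"{\*\generator Msftedit 5.41.15.1515;}\pard\fs24 Hello.\par}")
dump(parse(sample))
```

The sketch keeps a control word and its numeric parameter together (for example `fs24`), matching the element values shown in the ideal output rather than splitting name and parameter.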
>
> IBM Unsupported Features:
> http://publib.boulder.ibm.com/infocenter/wmbhelp/v8r0m0/index.jsp?topic=%2Fcom.ibm.dfdl.editor.messagebroker.doc%2Fdf00150_.html
>
> I know that's a lot of info out of left field, but I wanted to try and
> explain it as thoroughly as possible to avoid any confusion. Thanks in
> advance for any advice you might have and let me know if I've been unclear
> in any areas.
>
> Bradley Sexton
> --
>  dfdl-wg mailing list
>  dfdl-wg at ogf.org
>  https://www.ogf.org/mailman/listinfo/dfdl-wg
>
>
>
> ------------------------------
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU