[DFDL-WG] DFDL Modeling Question

Bradley Sexton bradley.r.sexton at gmail.com
Thu Mar 1 09:47:36 EST 2012


After some internal discussion I believe we are going to put RTF on the
shelf for the time being and look at some other formats. One question did
come up that I was hoping someone here might be able to help with. I was
asked if there was a way to model RTF flat such that it would work for any
size of file and any depth of nested groups, similar to what Steve proposed earlier:

        dfdl:separator="\ }\ }}\ }}}\ {\ }{\ }}{\ }}}{\"
        dfdl:separatorPosition="prefix"

but suitable for any number of "}" characters before the "\" or "{\". A
possibility suggested to me was to use:

        dfdl:separator="\ { }"

to consider all instances of these symbols as separators, and in cases
such as "}}{\" to treat the values between adjacent separator characters as
empty or null. If you have any thoughts on this method, or on alternatives
for a general flat model, they would be greatly appreciated.
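
To make the comparison concrete, here is a rough sketch in Python (not DFDL)
of how the two separator choices would split a shortened RTF sample. The
sample string and the function names are made up for illustration, and the
regular expressions only approximate how a DFDL parser matches separators.
The first pattern generalises the enumerated prefix separators to any number
of "}" characters; the second treats each of "\", "{" and "}" as a separator
on its own.

```python
import re

# Shortened stand-in for an RTF file (hypothetical, for illustration only).
SAMPLE = r"{\rtf1\ansi{\fonttbl{\f0\froman Times New Roman;}{\f1\fswiss Arial;}}\pard\f0 Hello.\par}"

# Generalised prefix separator: any number of '}', an optional '{', then '\'.
# Stands in for the enumerated list  \  }\  }}\  }}}\  {\  }{\  }}{\  }}}{\
PREFIX_SEPARATOR = re.compile(r"\}*\{?\\")

# The alternative: each of '\', '{' and '}' is a separator by itself.
SINGLE_CHAR_SEPARATOR = re.compile(r"[\\{}]")


def split_prefix(rtf):
    """Flat split where each item is preceded by one (possibly long) separator."""
    items = PREFIX_SEPARATOR.split(rtf)
    # Assumption: the input starts with a separator, so the slot before it is dropped.
    items = items[1:]
    # Assumption: closing braces at the very end act like a terminator, not data.
    if items:
        items[-1] = items[-1].rstrip("}")
    return items


def split_single_char(rtf):
    """Flat split on every separator character, keeping the empty items in between."""
    return SINGLE_CHAR_SEPARATOR.split(rtf)


if __name__ == "__main__":
    for item in split_prefix(SAMPLE):
        print("prefix item:", repr(item))
    pieces = split_single_char(SAMPLE)
    print("single-char items:", len(pieces), "of which empty:", pieces.count(""))
```

In this sketch the non-empty items from the single-character split are the
same as the items from the generalised prefix split, so the two approaches
appear to carry the same data. The difference is the empty items wherever two
separator characters are adjacent, which the model would have to allow for or
discard. Neither split records which separator preceded an item, so the group
nesting is lost either way, and escaped "\\", "\{" and "\}" sequences in text
are not handled here.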

Bradley



On Fri, Feb 24, 2012 at 10:31 AM, Bradley Sexton <bradley.r.sexton at gmail.com> wrote:

> Steve,
>
> The order of nested groups is somewhat fluid in RTF, and my concern is
> whether or not modeling everything completely flat would preserve the
> structure and formatting properly. If you modify the text formatting in a
> file, for example by inserting a comment, a new group is created, and any
> data entered within the comment, or any existing text that the comment
> highlights, is moved into new groups to signify the link.
>
> Feel free to put me down for the WG call, just let me know the time and
> call info.
>
> Thanks,
> Bradley Sexton
>
>
>
> On Thu, Feb 23, 2012 at 4:31 PM, Steve Hanson <smh at uk.ibm.com> wrote:
>
>> Hi Bradley
>>
>> Yes, dfdl:lengthKind="pattern" is the ideal way to model this.
>>
>> I'm struggling to find a way to model this that preserves the nested
>> groups and separates the trailing data from the control word. However, if
>> you were prepared to lose the group structure and treat the trailing data
>> as part of the control word, then you could model a completely flat
>> structure with the various delimiters interpreted as a prefix separator.
>>
>>         dfdl:separator="\ }\ }}\ }}}\ {\ }{\ }}{\ }}}{\"
>>         dfdl:separatorPosition="prefix"
>>
>> That would give you an infoset like:
>>
>> <file>
>>    <controlWord>rtf1</controlWord>
>>    <controlWord>ansi</controlWord>
>>    <controlWord>ansicpg1252</controlWord>
>>    <controlWord>deff0</controlWord>
>>    <controlWord>deflang1033</controlWord>
>>    <controlWord>fonttbl</controlWord>
>>    <controlWord>f0</controlWord>
>>    <controlWord>froman</controlWord>
>>    <controlWord>fprq2</controlWord>
>>    <controlWord>fcharset0 Times New Roman;</controlWord>
>>    <controlWord>f1</controlWord>
>>    <controlWord>fswiss</controlWord>
>>    <controlWord>fcharset0 Arial;</controlWord>
>>    <controlWord>*</controlWord>
>>    <controlWord>generator Msftedit 5.41.15.1515;</controlWord>
>>    <controlWord>viewkind4</controlWord>
>>    <controlWord>uc1</controlWord>
>>    <controlWord>pard</controlWord>
>>    <controlWord>f0</controlWord>
>>    <controlWord>fs24 This is an example document of an RTF file.</controlWord>
>>    <controlWord>f1</controlWord>
>>    <controlWord>fs20</controlWord>
>>    <controlWord>par</controlWord>
>>    <controlWord>*</controlWord>
>>    <controlWord>passwordhash 010000004c000000010000000480000050c3. . .</controlWord>
>> </file>
>>
>> Not ideal. I'll carry on thinking about the problem.
>>
>> If you like I'll add you to the invite list for the DFDL WG call next
>> Tuesday and we can discuss further?
>>
>> Regards
>>
>> Steve Hanson
>> Architect, Data Format Description Language (DFDL)
>> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
>> IBM SWG, Hursley, UK
>> smh at uk.ibm.com
>> tel:+44-1962-815848
>>
>>
>>
>> From:        Bradley Sexton <bradley.r.sexton at gmail.com>
>> To:        dfdl-wg at ogf.org
>> Date:        23/02/2012 19:07
>> Subject:        [DFDL-WG] DFDL Modeling Question
>> Sent by:        dfdl-wg-bounces at ogf.org
>> ------------------------------
>>
>>
>>
>> Hello,
>>
>> I've been looking at modeling Rich Text Format (RTF) files using the IBM
>> Message Broker DFDL implementation, and ran into an issue. For some
>> background, here's a small example of an RTF file:
>>
>> {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0
>> Times New Roman;}{\f1\fswiss\fcharset0 Arial;}}{\*\generator Msftedit
>> 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs24 This is an example document of an
>> RTF file.\f1\fs20\par{\*\passwordhash
>> 010000004c000000010000000480000050c300001400000010000000f89c360d0c9d360d000000008bc29e2f78a2144122ed68a1701e2ea50bbbbeaf7333c40dfe048ccf55f709b8cc7e8b49}}
>>
>> '\' and '\*\' mark the beginning of control words, and the curly braces
>> mark the beginning and end of control groups that contain control words and
>> data. My issue is that control words and data do not have suitable
>> terminators for parsing. The end of a control word is signified by a space
>> when trailing data is present, but typically it is ended by the '\'
>> signalling the beginning of a new word, or by a curly brace signalling the
>> end of the current control group or the beginning of a new one. Similarly,
>> data is typically ended by the '}' of the parent control group.
>>
>> With the exception of a small header, the value and placement of control
>> words, groups, and data vary by file.
>>
>> My issue with modeling this is that I was going to use
>> dfdl:lengthKind="pattern" in lieu of suitable delimiters, but this feature
>> is not implemented by IBM. I'm looking for an alternative way to model the
>> data, and was hoping someone on the mailing list might have suggestions. My
>> goal is to model control words and groups in as general a manner as
>> possible given IBM's implementation restrictions, since RTF has over 1800
>> defined control words and gives you the ability to create your own.
>>
>> Ideal output for the above sample would be something along these lines:
>>
>> <file>
>>    <controlWord>rtf1</controlWord>
>>    <controlWord>ansi</controlWord>
>>    <controlWord>ansicpg1252</controlWord>
>>    <controlWord>deff0</controlWord>
>>    <controlWord>deflang1033</controlWord>
>>    <controlGroup>
>>        <name>fonttbl</name>
>>        <controlGroup>
>>            <name>f0</name>
>>            <controlWord>froman</controlWord>
>>            <controlWord>fprq2</controlWord>
>>            <controlWord>fcharset0</controlWord>
>>            <data>Times New Roman;</data>
>>        </controlGroup>
>>        <controlGroup>
>>            <name>f1</name>
>>            <controlWord>fswiss</controlWord>
>>            <controlWord>fcharset0</controlWord>
>>            <data>Arial;</data>
>>        </controlGroup>
>>    </controlGroup>
>>    <controlGroup>
>>        <name>generator</name>
>>        <data>Msftedit 5.41.15.1515;</data>
>>    </controlGroup>
>>    <controlWord>viewkind4</controlWord>
>>    <controlWord>uc1</controlWord>
>>    <controlWord>pard</controlWord>
>>    <controlWord>f0</controlWord>
>>    <controlWord>fs24</controlWord>
>>    <text>This is an example document of an RTF file.</text>
>>    <controlWord>f1</controlWord>
>>    <controlWord>fs20</controlWord>
>>    <controlWord>par</controlWord>
>>    <controlGroup>
>>        <name>passwordhash</name>
>>        <data>010000004c000000010000000480000050c3. . .</data>
>>    </controlGroup>
>> </file>
>>
>> IBM Unsupported Features:
>> http://publib.boulder.ibm.com/infocenter/wmbhelp/v8r0m0/index.jsp?topic=%2Fcom.ibm.dfdl.editor.messagebroker.doc%2Fdf00150_.html
>>
>> I know that's a lot of info out of left field, but I wanted to try and
>> explain it as thoroughly as possible to avoid any confusion. Thanks in
>> advance for any advice you might have and let me know if I've been unclear
>> in any areas.
>>
>> Bradley Sexton