[DFDL-WG] DFDL Modeling Question

Thu Feb 23 16:31:52 EST 2012

Hi Bradley

Yes, dfdl:lengthKind="pattern" is the ideal way to model this.

I'm struggling to find a way to model this that preserves the nested 
groups and separates the trailing data from the control word. However, if 
you are prepared to lose the group structure and treat the trailing data 
as part of the control word, then you could model a completely flat 
structure with the various delimiters interpreted as prefix separators. 

        dfdl:separator="\ }\ }}\ }}}\ {\ }{\ }}{\ }}}{\"
        dfdl:separatorPosition="prefix"

That would give you an infoset like:

<file>
   <controlWord>rtf1</controlWord>
   <controlWord>ansi</controlWord>
   <controlWord>ansicpg1252</controlWord>
   <controlWord>deff0</controlWord>
   <controlWord>deflang1033</controlWord>
   <controlWord>fonttbl</controlWord>
   <controlWord>f0</controlWord>
   <controlWord>froman</controlWord>
   <controlWord>fprq2</controlWord>
   <controlWord>fcharset0 Times New Roman;</controlWord>
   <controlWord>f1</controlWord>
   <controlWord>fswiss</controlWord>
   <controlWord>fcharset0 Arial;</controlWord>
   <controlWord>*</controlWord>
   <controlWord>generator Msftedit 5.41.15.1515;</controlWord>
   <controlWord>viewkind4</controlWord>
   <controlWord>uc1</controlWord>
   <controlWord>pard</controlWord>
   <controlWord>f0</controlWord>
   <controlWord>fs24 This is an example document of an RTF file.</controlWord>
   <controlWord>f1</controlWord>
   <controlWord>fs20</controlWord>
   <controlWord>par</controlWord>
   <controlWord>*</controlWord>
   <controlWord>passwordhash 010000004c000000010000000480000050c3. . .</controlWord>
</file>
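
To make the flat approach concrete, here is a minimal Python sketch that simulates what the prefix-separator list above would do to a shortened RTF sample. It is only an illustration of the splitting behaviour, not DFDL and not how any DFDL processor is implemented. The function name, the shortened sample, and the stripping of the trailing '}' characters are assumptions made for the example; how a DFDL model would deal with the final closing braces is not covered above.

```python
import re

# The eight separators from the dfdl:separator list above.
SEPARATORS = ["\\", "}\\", "}}\\", "}}}\\", "{\\", "}{\\", "}}{\\", "}}}{\\"]


def split_prefix_separated(text):
    """Split text into a flat list, treating each separator as a prefix."""
    # Assumption for this sketch: drop the closing braces at the very end,
    # since none of the separators above would consume them.
    text = text.rstrip("}")
    # Try longer separators first so "}}{\" is preferred over shorter ones.
    alternatives = sorted(SEPARATORS, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in alternatives)
    parts = re.split(pattern, text)
    # With a prefix separator nothing precedes the first separator,
    # so the first part is empty and is discarded.
    return parts[1:]


# Shortened, hypothetical sample in the style of the RTF file in this thread.
sample = (r"{\rtf1\ansi{\fonttbl{\f0\froman\fcharset0 Times New Roman;}"
          r"{\f1\fswiss\fcharset0 Arial;}}\pard\fs24 Hello.\par}")

for item in split_prefix_separated(sample):
    print("<controlWord>" + item + "</controlWord>")
```

As in the infoset above, the group structure is lost and trailing data such as "Times New Roman;" stays attached to the preceding control word.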

Not ideal. I'll carry on thinking about the problem. 

If you like I'll add you to the invite list for the DFDL WG call next 
Tuesday and we can discuss further?

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh at uk.ibm.com
tel:+44-1962-815848

From:   Bradley Sexton <bradley.r.sexton at gmail.com>
To:     dfdl-wg at ogf.org
Date:   23/02/2012 19:07
Subject:        [DFDL-WG] DFDL Modeling Question
Sent by:        dfdl-wg-bounces at ogf.org

Hello,

I've been looking at modeling Rich Text Format (RTF) files using the IBM 
Message Broker DFDL implementation, and ran into an issue. For some 
background, here's a small example of an RTF file:

{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0 
Times New Roman;}{\f1\fswiss\fcharset0 Arial;}}{\*\generator Msftedit 
5.41.15.1515;}\viewkind4\uc1\pard\f0\fs24 This is an example document of 
an RTF file.\f1\fs20\par{\*\passwordhash 
010000004c000000010000000480000050c300001400000010000000f89c360d0c9d360d000000008bc29e2f78a2144122ed68a1701e2ea50bbbbeaf7333c40dfe048ccf55f709b8cc7e8b49}}

'\' and '\*\' mark the beginning of control words, and the curly braces 
mark the beginning and end of control groups that contain control words 
and data. My issue is that control words and data do not have suitable 
terminators for parsing. The end of control words is signified by a space 
when trailing data is present, but typically they are ended by '\' 
signalling the beginning of a new word or a curly brace signalling the end 
of the current or the beginning of a new control group. Similarly, data is 
typically ended by the '}' of the parent control group.
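
The termination rules just described can be written as two regular expressions. The Python sketch below shows them in action; the regexes, the names CONTROL_WORD and DATA, and the helper function are assumptions made for illustration, roughly the kind of expression a length pattern would need, and they ignore RTF escape sequences.

```python
import re

# Hypothetical patterns derived from the rules above:
# a control word runs up to the next space, backslash or curly brace;
# data runs up to the next backslash or curly brace.
CONTROL_WORD = re.compile(r"[^\\{} ]+")
DATA = re.compile(r"[^\\{}]+")


def read_control_word(text, pos):
    """Read one control word whose name starts at text[pos].

    pos is the index just after the leading backslash. Returns
    (word, data, end), where data is None if no trailing data follows
    and end is the index of the first unconsumed character.
    """
    match = CONTROL_WORD.match(text, pos)
    if match is None:
        raise ValueError("no control word at position %d" % pos)
    word, end = match.group(0), match.end()
    data = None
    # A space ends the control word and introduces trailing data.
    if end < len(text) and text[end] == " ":
        end += 1
        data_match = DATA.match(text, end)
        if data_match is not None:
            data, end = data_match.group(0), data_match.end()
    return word, data, end


# The word is ended by a space, and the data by the '}' of the parent group.
print(read_control_word(r"\fcharset0 Times New Roman;}{\f1", 1))
# The word is ended by the '\' that starts the next control word.
print(read_control_word(r"\froman\fprq2", 1))
```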

With the exception of a small header the value and placement of control 
words, groups, and data varies by file.

My issue with modeling this is that I was going to use 
dfdl:lengthKind="pattern" in lieu of suitable delimiters, but this feature 
is not implemented by IBM. I'm looking for an alternative way to model the 
data, and was hoping someone on the mailing list might have suggestions. 
My goal is to model control words and groups in as general a manner as 
possible given IBM's implementation restrictions, since RTF has over 1800 
defined control words and gives you the ability to create your own.

Ideal output for the above sample would be something along these lines:

<file>
   <controlWord>rtf1</controlWord>
   <controlWord>ansi</controlWord>
   <controlWord>ansicpg1252</controlWord>
   <controlWord>deff0</controlWord>
   <controlWord>deflang1033</controlWord>
   <controlGroup>
       <name>fonttbl</name>
       <controlGroup>
           <name>f0</name>
           <controlWord>froman</controlWord>
           <controlWord>fprq2</controlWord>
           <controlWord>fcharset0</controlWord>
           <data>Times New Roman;</data>
       </controlGroup>
       <controlGroup>
           <name>f1</name>
           <controlWord>fswiss</controlWord>
           <controlWord>fcharset0</controlWord>
           <data>Arial;</data>
       </controlGroup>
   </controlGroup>
   <controlGroup>
       <name>generator</name>
       <data>Msftedit 5.41.15.1515;</data>
   </controlGroup>
   <controlWord>viewkind4</controlWord>
   <controlWord>uc1</controlWord>
   <controlWord>pard</controlWord>
   <controlWord>f0</controlWord>
   <controlWord>fs24</controlWord>
   <text>This is an example document of an RTF file.</text>
   <controlWord>f1</controlWord>
   <controlWord>fs20</controlWord>
   <controlWord>par</controlWord>
   <controlGroup>
       <name>passwordhash</name>
       <data>010000004c000000010000000480000050c3. . .</data>
   </controlGroup>
</file>
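
For reference, the nesting in the ideal output above can be produced outside DFDL with a small recursive parser. The Python sketch below is only meant to pin down the target structure, not to suggest a DFDL model. The function names and the tuple representation are assumptions, the sample is shortened, and RTF escape sequences and line breaks are ignored.

```python
def parse_items(text, pos, top_level):
    """Parse items until the closing '}' of the current group.

    Returns (items, new_pos), where items is a list of (tag, value) pairs.
    """
    items = []
    while pos < len(text):
        ch = text[pos]
        if ch == "}":
            return items, pos + 1
        if ch == "{":
            children, pos = parse_items(text, pos + 1, False)
            # The first control word of a nested group becomes its name.
            if children and children[0][0] == "controlWord":
                children[0] = ("name", children[0][1])
            items.append(("controlGroup", children))
        elif ch == "\\":
            pos += 1
            if text.startswith("*\\", pos):
                pos += 2  # skip the rest of a '\*\' prefix
            start = pos
            while pos < len(text) and text[pos] not in "\\{} ":
                pos += 1
            items.append(("controlWord", text[start:pos]))
            if pos < len(text) and text[pos] == " ":
                pos += 1  # the space only separates the word from its data
        else:
            start = pos
            while pos < len(text) and text[pos] not in "\\{}":
                pos += 1
            tag = "text" if top_level else "data"
            items.append((tag, text[start:pos]))
    return items, pos


def parse_file(text):
    """Parse a whole file; the outermost group maps to the <file> element."""
    if not text.startswith("{"):
        raise ValueError("expected the file to start with '{'")
    items, _ = parse_items(text, 1, True)
    return items


# Shortened, hypothetical sample in the style of the RTF file above.
sample = (r"{\rtf1{\fonttbl{\f0\froman\fcharset0 Times New Roman;}}"
          r"{\*\generator Msftedit;}\fs24 Hello.\par}")
for item in parse_file(sample):
    print(item)
```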

IBM Unsupported Features:
http://publib.boulder.ibm.com/infocenter/wmbhelp/v8r0m0/index.jsp?topic=%2Fcom.ibm.dfdl.editor.messagebroker.doc%2Fdf00150_.html

I know that's a lot of info out of left field, but I wanted to try and 
explain it as thoroughly as possible to avoid any confusion. Thanks in 
advance for any advice you might have and let me know if I've been unclear 
in any areas.

Bradley Sexton
--
  dfdl-wg mailing list
  dfdl-wg at ogf.org
  https://www.ogf.org/mailman/listinfo/dfdl-wg
