[DFDL-WG] DFDL Modeling Question

Bradley Sexton bradley.r.sexton at gmail.com
Thu Feb 23 14:06:29 EST 2012


Hello,

I've been looking at modeling Rich Text Format (RTF) files using the IBM
Message Broker DFDL implementation, and ran into an issue. For some
background, here's a small example of an RTF file:

{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0
Times New Roman;}{\f1\fswiss\fcharset0 Arial;}}{\*\generator Msftedit
5.41.15.1515;}\viewkind4\uc1\pard\f0\fs24 This is an example document of an
RTF file.\f1\fs20\par{\*\passwordhash
010000004c000000010000000480000050c300001400000010000000f89c360d0c9d360d000000008bc29e2f78a2144122ed68a1701e2ea50bbbbeaf7333c40dfe048ccf55f709b8cc7e8b49}}

'\' and '\*\' mark the beginning of control words, and curly braces mark
the beginning and end of control groups, which contain control words and
data. My issue is that control words and data do not have suitable
terminators for parsing. A control word is ended by a space when data
follows it, but more typically by a '\' signalling the beginning of a new
word, or by a curly brace signalling the end of the current control group
or the beginning of a new one. Similarly, data is typically ended by the
'}' of the parent control group.
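
To make the termination rules above concrete, here is a minimal Python sketch of a tokenizer that applies them. It is only an illustration of the rules as described in this message, not a DFDL model and not a full RTF reader: the names TOKEN and tokenize are made up for this example, and escapes such as \' or \\ are deliberately rejected rather than handled.

```python
import re

# One alternative per token kind, following the rules described above:
#   '{' / '}'   open and close a control group
#   '\word'     a control word (optionally introduced by '\*\'), with an
#               optional numeric parameter and one optional trailing space
#   data        anything up to the next '\', '{' or '}'
TOKEN = re.compile(
    r"""
      (?P<open>\{)
    | (?P<close>\})
    | \\(?:\*\\)?(?P<word>[A-Za-z]+(?:-?\d+)?)\x20?
    | (?P<data>[^\\{}]+)
    """,
    re.VERBOSE,
)


def tokenize(rtf):
    """Split an RTF string into (kind, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(rtf):
        m = TOKEN.match(rtf, pos)
        if m is None:
            # Escapes such as \' are outside the scope of this sketch.
            raise ValueError(f"unsupported input at offset {pos}: {rtf[pos]!r}")
        if m.group("open"):
            tokens.append(("open", "{"))
        elif m.group("close"):
            tokens.append(("close", "}"))
        elif m.group("word"):
            tokens.append(("word", m.group("word")))
        else:
            tokens.append(("data", m.group("data")))
        pos = m.end()
    return tokens
```

For example, tokenize(r"{\f0\froman Times New Roman;}") yields an open brace, the words f0 and froman, the data "Times New Roman;", and a close brace. The single regular expression is essentially the kind of pattern that a length-by-pattern approach would need to express.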

With the exception of a small header, the value and placement of control
words, groups, and data vary by file.

To model this I was going to use dfdl:lengthKind="pattern" in lieu of
suitable delimiters, but this feature is not implemented by IBM. I'm
looking for an alternative way to model the data, and was hoping someone on
the mailing list might have suggestions. My goal is to model control words
and groups in as general a manner as possible given IBM's implementation
restrictions, since RTF has over 1800 defined control words and gives you
the ability to create your own.

Ideal output for the above sample would be something along these lines:

<file>
   <controlWord>rtf1</controlWord>
   <controlWord>ansi</controlWord>
   <controlWord>ansicpg1252</controlWord>
   <controlWord>deff0</controlWord>
   <controlWord>deflang1033</controlWord>
   <controlGroup>
       <name>fonttbl</name>
       <controlGroup>
           <name>f0</name>
           <controlWord>froman</controlWord>
           <controlWord>fprq2</controlWord>
           <controlWord>fcharset0</controlWord>
           <data>Times New Roman;</data>
       </controlGroup>
       <controlGroup>
           <name>f1</name>
           <controlWord>fswiss</controlWord>
           <controlWord>fcharset0</controlWord>
           <data>Arial;</data>
       </controlGroup>
   </controlGroup>
   <controlGroup>
       <name>generator</name>
       <data>Msftedit 5.41.15.1515;</data>
   </controlGroup>
   <controlWord>viewkind4</controlWord>
   <controlWord>uc1</controlWord>
   <controlWord>pard</controlWord>
   <controlWord>f0</controlWord>
   <controlWord>fs24</controlWord>
   <text>This is an example document of an RTF file.</text>
   <controlWord>f1</controlWord>
   <controlWord>fs20</controlWord>
   <controlWord>par</controlWord>
   <controlGroup>
       <name>passwordhash</name>
       <data>010000004c000000010000000480000050c3. . .</data>
   </controlGroup>
</file>
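
The mapping from RTF structure to the layout above can be sketched in Python as follows. This is a hedged illustration, not a DFDL solution: it assumes the input has already been split into hypothetical (kind, value) pairs, where kind is one of "open", "close", "word" or "data", and the function name build_tree is invented for this example. The first control word inside a nested group becomes its <name>, and data directly under the outermost group becomes <text>, as in the sample output.

```python
import xml.etree.ElementTree as ET


def build_tree(tokens):
    """Build the <file>/<controlGroup> layout from (kind, value) tokens."""
    tokens = list(tokens)
    if not tokens or tokens[0] != ("open", "{"):
        raise ValueError("expected the document to start with '{'")
    root = ET.Element("file")  # the outermost group maps to <file>
    stack = [root]
    for kind, value in tokens[1:]:
        if not stack:
            raise ValueError("content after the document's closing '}'")
        current = stack[-1]
        if kind == "open":
            stack.append(ET.SubElement(current, "controlGroup"))
        elif kind == "close":
            stack.pop()
        elif kind == "word":
            # The first control word of a nested group becomes its <name>.
            is_name = current is not root and len(current) == 0
            tag = "name" if is_name else "controlWord"
            ET.SubElement(current, tag).text = value
        elif kind == "data":
            tag = "text" if current is root else "data"
            ET.SubElement(current, tag).text = value
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    if stack:
        raise ValueError("unbalanced braces: missing '}'")
    return root
```

Calling ET.tostring(build_tree(tokens), encoding="unicode") then gives the serialized XML. A stack of open groups is enough here because control groups nest strictly.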

IBM Unsupported Features:
http://publib.boulder.ibm.com/infocenter/wmbhelp/v8r0m0/index.jsp?topic=%2Fcom.ibm.dfdl.editor.messagebroker.doc%2Fdf00150_.html

I know that's a lot of info out of left field, but I wanted to try and
explain it as thoroughly as possible to avoid any confusion. Thanks in
advance for any advice you might have and let me know if I've been unclear
in any areas.

Bradley Sexton