[DFDL-WG] Example schema that illustrates pathological DFDL behaviors

Mike Beckerle mbeckerle at apache.org
Wed Jan 24 13:42:43 PST 2024


I have been concerned for a while that combinations of the features in DFDL
v1.0 can be used in ways that are just far too powerful for what is needed
for data format expression.

For example, if the input is n bits, how large should the corresponding
infoset be when the data is parsed using a DFDL schema? Clearly at least
O(n) since there are n bits and each bit could create an element in the
infoset.

But should it be able to be O(n^2) ?

I can think of no use case for this. Intuition about data formats suggests
that the infoset should always be linearly related in size to the size of
the input data, and that the constant factor on that should be relatively
small.
Furthermore, the time taken to parse data should be linearly related to the
size of the data. I think lots of people are thinking DFDL parsing
should be implementable in a circuit, and the memory requirements are
proportional to the data size.
It doesn't make sense to try to make hardware implementations of DFDL if
that's not going to be the case.

Turns out, using DFDL v1.0 using dfdl:occursCountKind='expression' and
dfdl:inputValueCalc, can create an infoset from parsing that is
super-exponentially larger than the input data. Yes, it is finite, and
bounded to some function of the size of the input data. But we don't even
have notations for this kind of numeric growth.

The schema I created consumes a 3 bit integer as the data. The size of the
infoset grows like this:

   - input 000 -> infoset size 1
   - input 001 -> infoset size 1
   - input 010 -> infoset size 256
   - input 011 -> infoset size is over 10,000 elements
   - ...
   - input 111 -> infoset size is larger than the number of atoms in the
   universe.

There are TDML tests for this schema that illustrate this (except that last
one)

But it is always finite! No recursion is required to do this. With infinite
space and time available it WILL terminate. I believe the language remains
theoretically decidable.

The README in this schema project discusses this and other pathological
cases, and suggests things we can do to restrict DFDL so that this kind of
pathological behavior is not allowed.

Really we want the largest infoset that can be created from

   - n bits of data
   - a schema of size s

to be O(sn) in size. I have started an argument in the README about why
O(sn) is the right target. Basically there are practical schemas that can
meet O(sn) as a lower bound.

I believe this can be achieved by adding more forward-progress requirements
to DFDL - all arrays should have to make forward progress when parsing, not
just unbounded arrays, and reused groups/types should have to make forward
progress at points of reuse. By doing that, we ensure that to make a giant
infoset, you have to be consuming a giant number of bits.

Right now this stuff is just living in a personal github repo, but I'll
give it an official better home either in DFDLSchemas or as part of Apache
Daffodil, or something later.

If this is of interest, please take a look at this github pull request
below.

---------- Forwarded message ---------
From: Mike Beckerle <notifications at github.com>
Date: Wed, Jan 24, 2024 at 3:50 PM
Subject: [mbeckerle/superexp] Example schema that illustrates pathological
DFDL behaviors (PR #1)
To: mbeckerle/superexp <superexp at noreply.github.com>
Cc: Mike Beckerle <mbeckerle at apache.org>, Your activity <
your_activity at noreply.github.com>


Illustrates DFDL issues where parsing small data can lead to explosively
larger parse output. Super exponentially larger.

The expressive power of DFDL should be restricted so that this is not
possible.

See the README.md for discussion of the issue.
------------------------------
You can view, comment on, or merge this pull request online at:

  https://github.com/mbeckerle/superexp/pull/1
Commit Summary

   - 46dbae6
   <https://github.com/mbeckerle/superexp/pull/1/commits/46dbae6b2dc5acd76d6954109f29cb147d0e68e5>
   First version of this example schema

File Changes

(9 files <https://github.com/mbeckerle/superexp/pull/1/files>)

   - *A* .gitattributes
   <https://github.com/mbeckerle/superexp/pull/1/files#diff-618cd5b83d62060ba3d027e314a21ceaf75d36067ff820db126642944145393e>
   (5)
   - *M* .gitignore
   <https://github.com/mbeckerle/superexp/pull/1/files#diff-bc37d034bad564583790a46f19d807abfe519c5671395fd494d8cce506c42947>
   (7)
   - *M* README.md
   <https://github.com/mbeckerle/superexp/pull/1/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5>
   (96)
   - *A* build.sbt
   <https://github.com/mbeckerle/superexp/pull/1/files#diff-5634c415cd8c8504fdb973a3ed092300b43c4b8fc1e184f7249eb29a55511f91>
   (30)
   - *A* project/build.properties
   <https://github.com/mbeckerle/superexp/pull/1/files#diff-cbc85b9df7818a480f215028e98821bd1c83adb304a9f03658595c44dddff754>
   (1)
   - *A* src/exp.dfdl.xsd
   <https://github.com/mbeckerle/superexp/pull/1/files#diff-9751c75e50c64f92bee47a05b0800aa9c5691bcd5c2b1166a1b7dcfa6e5ebc65>
   (84)
   - *A* src/superexp.dfdl.xsd
   <https://github.com/mbeckerle/superexp/pull/1/files#diff-2c057f270cab45f3f0577abc17a3472f2b86cc934f7bb72f58a911d344613572>
   (72)
   - *A* test/TestSuperexp.scala
   <https://github.com/mbeckerle/superexp/pull/1/files#diff-c6b466d6536f0f670f97aeca9a516d93581c3c8327aa6b9f9fec6657a5e6dcf4>
   (35)
   - *A* test/TestSuperexp.tdml
   <https://github.com/mbeckerle/superexp/pull/1/files#diff-f1b4e40b448ec3ab530fab92fdfebb51a705aa9fc69ea118a565ef046b0db8e0>
   (117)

Patch Links:

   - https://github.com/mbeckerle/superexp/pull/1.patch
   - https://github.com/mbeckerle/superexp/pull/1.diff

—
Reply to this email directly, view it on GitHub
<https://github.com/mbeckerle/superexp/pull/1>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AALUDAZF7ZM5XWD72HZ4Z5DYQFX2NAVCNFSM6AAAAABCJM6FHOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGA4TSMBYGEYDAOI>
.
You are receiving this because you are subscribed to this thread.Message
ID: <mbeckerle/superexp/pull/1 at github.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/html
Size: 8375 bytes
Desc: not available
URL: <https://lists.ogf.org/pipermail/dfdl-wg/attachments/20240124/0a65ad03/attachment.txt>


More information about the dfdl-wg mailing list