[DFDL-WG] Example schema that illustrates pathological DFDL behaviors

Mike Beckerle mbeckerle at apache.org
Wed Feb 7 11:01:16 PST 2024


The example schema that illustrates this super-exponential growth of
infoset size has been moved to
https://github.com/OpenDFDL/examples/tree/master/superexp


On Wed, Jan 24, 2024 at 4:42 PM Mike Beckerle <mbeckerle at apache.org> wrote:

>
> I have been concerned for a while that combinations of the features in
> DFDL v1.0 can be used in ways that are just far too powerful for what is
> needed for data format expression.
>
> For example, if the input is n bits, how large should the corresponding
> infoset be when the data is parsed using a DFDL schema? Clearly at least
> O(n) since there are n bits and each bit could create an element in the
> infoset.
>
> But should it be able to be O(n^2) ?
>
> I can think of no use case for this. Intuition about data formats suggests
> that the infoset should always be linearly related in size to the size of
> the input data, and that the constant factor on that should be relatively
> small.
> Furthermore, the time taken to parse data should be linearly related to
> the size of the data. I think lots of people are thinking DFDL parsing
> should be implementable in a circuit, and the memory requirements are
> proportional to the data size.
> It doesn't make sense to try to make hardware implementations of DFDL if
> that's not going to be the case.
>
> Turns out, using DFDL v1.0 using dfdl:occursCountKind='expression' and
> dfdl:inputValueCalc, can create an infoset from parsing that is
> super-exponentially larger than the input data. Yes, it is finite, and
> bounded to some function of the size of the input data. But we don't even
> have notations for this kind of numeric growth.
>
> The schema I created consumes a 3 bit integer as the data. The size of the
> infoset grows like this:
>
>    - input 000 -> infoset size 1
>    - input 001 -> infoset size 1
>    - input 010 -> infoset size 256
>    - input 011 -> infoset size is over 10,000 elements
>    - ...
>    - input 111 -> infoset size is larger than the number of atoms in the
>    universe.
>
> There are TDML tests for this schema that illustrate this (except that
> last one)
>
> But it is always finite! No recursion is required to do this. With
> infinite space and time available it WILL terminate. I believe the language
> remains theoretically decidable.
>
> The README in this schema project discusses this and other pathological
> cases, and suggests things we can do to restrict DFDL so that this kind of
> pathological behavior is not allowed.
>
> Really we want the largest infoset that can be created from
>
>    - n bits of data
>    - a schema of size s
>
> to be O(sn) in size. I have started an argument in the README about why
> O(sn) is the right target. Basically there are practical schemas that can
> meet O(sn) as a lower bound.
>
> I believe this can be achieved by adding more forward-progress
> requirements to DFDL - all arrays should have to make forward progress when
> parsing, not just unbounded arrays, and reused groups/types should have to
> make forward progress at points of reuse. By doing that, we ensure that to
> make a giant infoset, you have to be consuming a giant number of bits.
>
> Right now this stuff is just living in a personal github repo, but I'll
> give it an official better home either in DFDLSchemas or as part of Apache
> Daffodil, or something later.
>
> If this is of interest, please take a look at this github pull request
> below.
>
> ---------- Forwarded message ---------
> From: Mike Beckerle <notifications at github.com>
> Date: Wed, Jan 24, 2024 at 3:50 PM
> Subject: [mbeckerle/superexp] Example schema that illustrates pathological
> DFDL behaviors (PR #1)
> To: mbeckerle/superexp <superexp at noreply.github.com>
> Cc: Mike Beckerle <mbeckerle at apache.org>, Your activity <
> your_activity at noreply.github.com>
>
>
> Illustrates DFDL issues where parsing small data can lead to explosively
> larger parse output. Super exponentially larger.
>
> The expressive power of DFDL should be restricted so that this is not
> possible.
>
> See the README.md for discussion of the issue.
> ------------------------------
> You can view, comment on, or merge this pull request online at:
>
>   https://github.com/mbeckerle/superexp/pull/1
> Commit Summary
>
>    - 46dbae6
>    <https://github.com/mbeckerle/superexp/pull/1/commits/46dbae6b2dc5acd76d6954109f29cb147d0e68e5>
>    First version of this example schema
>
> File Changes
>
> (9 files <https://github.com/mbeckerle/superexp/pull/1/files>)
>
>    - *A* .gitattributes
>    <https://github.com/mbeckerle/superexp/pull/1/files#diff-618cd5b83d62060ba3d027e314a21ceaf75d36067ff820db126642944145393e>
>    (5)
>    - *M* .gitignore
>    <https://github.com/mbeckerle/superexp/pull/1/files#diff-bc37d034bad564583790a46f19d807abfe519c5671395fd494d8cce506c42947>
>    (7)
>    - *M* README.md
>    <https://github.com/mbeckerle/superexp/pull/1/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5>
>    (96)
>    - *A* build.sbt
>    <https://github.com/mbeckerle/superexp/pull/1/files#diff-5634c415cd8c8504fdb973a3ed092300b43c4b8fc1e184f7249eb29a55511f91>
>    (30)
>    - *A* project/build.properties
>    <https://github.com/mbeckerle/superexp/pull/1/files#diff-cbc85b9df7818a480f215028e98821bd1c83adb304a9f03658595c44dddff754>
>    (1)
>    - *A* src/exp.dfdl.xsd
>    <https://github.com/mbeckerle/superexp/pull/1/files#diff-9751c75e50c64f92bee47a05b0800aa9c5691bcd5c2b1166a1b7dcfa6e5ebc65>
>    (84)
>    - *A* src/superexp.dfdl.xsd
>    <https://github.com/mbeckerle/superexp/pull/1/files#diff-2c057f270cab45f3f0577abc17a3472f2b86cc934f7bb72f58a911d344613572>
>    (72)
>    - *A* test/TestSuperexp.scala
>    <https://github.com/mbeckerle/superexp/pull/1/files#diff-c6b466d6536f0f670f97aeca9a516d93581c3c8327aa6b9f9fec6657a5e6dcf4>
>    (35)
>    - *A* test/TestSuperexp.tdml
>    <https://github.com/mbeckerle/superexp/pull/1/files#diff-f1b4e40b448ec3ab530fab92fdfebb51a705aa9fc69ea118a565ef046b0db8e0>
>    (117)
>
> Patch Links:
>
>    - https://github.com/mbeckerle/superexp/pull/1.patch
>    - https://github.com/mbeckerle/superexp/pull/1.diff
>
>> Reply to this email directly, view it on GitHub
> <https://github.com/mbeckerle/superexp/pull/1>, or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AALUDAZF7ZM5XWD72HZ4Z5DYQFX2NAVCNFSM6AAAAABCJM6FHOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGA4TSMBYGEYDAOI>
> .
> You are receiving this because you are subscribed to this thread.Message
> ID: <mbeckerle/superexp/pull/1 at github.com>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/html
Size: 9081 bytes
Desc: not available
URL: <https://lists.ogf.org/pipermail/dfdl-wg/attachments/20240207/31483664/attachment.txt>


More information about the dfdl-wg mailing list