[DFDL-WG] output value and length (was Re: Fw: Notes from 2007-09-12 call)
Steve Hanson
smh at uk.ibm.com
Wed Oct 31 08:33:00 CDT 2007
Mike
Can we discuss your conclusions (reproduced below), and also the points
raised by Alan and me?
--------------------------
Conclusion:
It does appear that we need outputLengthCalc, which is tantamount to
Steve's concerns that we need input and output variants of many
properties. We need to distinguish input length and output length. In the
above example, dfdl:length is input length, and dfdl:outputLengthCalc is
the property name I'm using for an output length.
Perhaps better naming conventions would be
Use dfdl:length when it's symmetric, dfdl:inputLength and
dfdl:outputLength when it's asymmetric.
Logical value comes from the representation when parsing unless
dfdl:inputValue (formerly dfdl:inputValueCalc) is provided, in which case
the logical value comes from that expression.
Representation comes from the logical value when unparsing unless
dfdl:outputValue is provided (formerly dfdl:outputValueCalc), in which
case representation comes from that computed value instead.
---------------------------
Also, please could you illustrate the issue using a much simpler example
than the box array, eg, a variable length string where its length is given
by a preceding integer field. Length on input is given by the integer,
length on output is given by the actual data. Value of integer is as
supplied on input, value on output is the length of the string. I want to
see whether we really need input & output length properties for a more
typical scenario.
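A minimal sketch of the simpler scenario described above, written in Python rather than DFDL so the asymmetry is easy to see. It assumes a 4-byte big-endian length prefix that counts bytes and a UTF-8 string; those choices and the function names are illustrative assumptions, not anything agreed by the WG. On parse the integer determines the string's length; on unparse the integer is computed from the actual data.

```python
import struct


def parse_prefixed_string(data: bytes) -> str:
    """Parse: the 4-byte big-endian integer gives the string length in bytes."""
    (length,) = struct.unpack_from(">i", data, 0)
    payload = data[4:4 + length]
    if len(payload) != length:
        raise ValueError("data is shorter than the length prefix claims")
    return payload.decode("utf-8")


def unparse_prefixed_string(value: str) -> bytes:
    """Unparse: the integer is derived from the encoded string, not supplied."""
    payload = value.encode("utf-8")
    return struct.pack(">i", len(payload)) + payload
```

Note that the length is measured on the encoded bytes, so a string containing multi-byte characters yields a prefix larger than its character count.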
Personally I think we should drop the use of dfdl:length on a sequence for
1.0, period. That precludes support of box arrays, but I don't have a
problem with that as I don't have a real-life use case.
(I would say however that if the only way to model a box array is as
below, then we are asking an awful lot from our audience to be able to
create such a model.)
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848
Mike Beckerle <beckerle at us.ibm.com>
Sent by: dfdl-wg-bounces at ogf.org
20/09/2007 15:02
To
Alan Powell/UK/IBM at IBMGB
cc
dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
Subject
[DFDL-WG] output value and length (was Re: Fw: Notes from 2007-09-12
call)
We have many use cases to work out for the output direction.
E.g., consider a string of utf-8 characters, stored in a box which must be
N "words" long, i.e., its length will be a multiple of 4 bytes.
Now suppose we have to store the length of the box measured in number of
words, in a field L1. The String is S1.
Some of this stuff might want to be hidden in a real schema, but let's
ignore that for now. So, one might model this without DFDL as:
<sequence>
<element name="L1" type="int" />
<element name="box">
<complexType>
<sequence id="box">
<element name="S1" type="string" />
</sequence>
</complexType>
</element>
</sequence>
So we have the length, a box surrounding the string, and the string S1
itself.
Now we want to annotate this for input parsing. I'm going to leave off all
the dfdl:applies properties to save space:
<sequence>
<element name="L1" type="int" dfdl:length="4" dfdl:lengthUnits="bytes"
dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
/>
<element name="box">
<complexType>
<sequence dfdl:length="{ ../L1 * 4 }"
dfdl:lengthUnits="bytes">
<element name="S1" type="string" dfdl:encoding="utf-8"
dfdl:length="fillAvailableSpace" />
</sequence>
</complexType>
</element>
</sequence>
So far so good. The sequence's length is L1 * 4, and the string fills the
space in that sequence.
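As a rough illustration of what the input annotations above ask a parser to do, here is a Python sketch. It assumes L1 is a 4-byte big-endian integer counting 4-byte words and that S1 simply takes every byte of the box; the function name is hypothetical.

```python
import struct


def parse_box(data: bytes) -> str:
    """Read L1 (in words), then take L1 * 4 bytes as the box holding S1."""
    (words,) = struct.unpack_from(">i", data, 0)
    box_length = words * 4  # corresponds to dfdl:length="{ ../L1 * 4 }"
    box = data[4:4 + box_length]
    if len(box) != box_length:
        raise ValueError("box is shorter than L1 * 4 bytes")
    # S1 fills the available space, so any padding stays in the value.
    return box.decode("utf-8")
```

Under these assumptions any pad characters in the box come back as part of the string, since nothing in the input description says how to strip them.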
Now we want to annotate it for output/unparse. First we put in
outputValueCalc on L1. This seems ok.
<sequence>
<element name="L1" type="int" dfdl:length="4" dfdl:lengthUnits="bytes"
dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
<element name="box">
<complexType>
<sequence dfdl:length="{ ../L1 * 4 }"
dfdl:lengthUnits="bytes">
<element name="S1" type="string" dfdl:encoding="utf-8"
dfdl:length="fillAvailableSpace" />
</sequence>
</complexType>
</element>
</sequence>
The above however appears to be circularly defined. The length of the
sequence inside the box element is defined in terms of the value of L1,
and the output value of L1 is defined in terms of the length of element
box. So really we need to distinguish input length and output length
calculations.
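The circularity can be made concrete by treating each computed quantity as a node in a dependency graph and looking for a cycle. The Python sketch below is only an illustration; the node names such as "box.length" and "box.outputLength" are labels invented for this example, not DFDL syntax.

```python
def find_cycle(deps):
    """Return a list of nodes forming a dependency cycle, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Reached a node already on the current path: that is a cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in deps.get(node, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in deps:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# One length property used in both directions: L1 and the box depend on each other.
single_length = {
    "L1.outputValue": ["box.length"],
    "box.length": ["L1.outputValue"],
}

# Separate input and output lengths: the output side no longer loops back to L1.
split_length = {
    "L1.outputValue": ["box.outputLength"],
    "box.outputLength": ["S1.length"],
    "box.inputLength": ["L1.value"],
}
```

With a single length property the search reports the loop between L1 and the box; with separate input and output lengths it finds none.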
So it seems we need dfdl:outputLengthCalc="{ ceiling(S1.length('bytes'),
4) * 4 }" as an additional rep prop on the box sequence. Notice how we've
had to ask for the length to be presented in a particular kind of units,
and the ceiling and multiply trick rounds up to a multiple of 4 in size.
<sequence>
<element name="L1" type="int" dfdl:length="4" dfdl:lengthUnits="bytes"
dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
<element name="box">
<complexType>
<sequence dfdl:length="{ ../L1 * 4 }"
dfdl:lengthUnits="bytes" dfdl:outputLengthCalc="{
ceiling(S1.length('bytes'), 4) * 4 }">
<element name="S1" type="string" dfdl:encoding="utf-8"
dfdl:length="fillAvailableSpace" />
</sequence>
</complexType>
</element>
</sequence>
But now we still have an issue, which is that the length of S1 on output
might need to be enlarged with padding characters because the output
length of the box is being rounded up to a multiple of 4 bytes.
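The rounding and padding arithmetic for the output direction can be sketched in Python as follows. This assumes a single-byte space as the pad character and reads ceiling(x, 4) as "x divided by 4, rounded up"; both are assumptions made for illustration, and the function name is hypothetical.

```python
import struct


def unparse_box(value: str, pad: bytes = b" ") -> bytes:
    """Write L1 (in words) followed by the string padded to a 4-byte multiple."""
    payload = value.encode("utf-8")
    words = -(-len(payload) // 4)  # ceiling division: bytes -> whole words
    box_length = words * 4         # output length of the box in bytes
    padded = payload + pad * (box_length - len(payload))
    return struct.pack(">i", words) + padded
```

For a 5-byte string this yields L1 = 2 and an 8-byte box, so three pad bytes have to come from somewhere, which is exactly the issue raised above.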
One idea for how to solve this is to use layers. I.e., we need another
string S2 because we can't get all the description we need onto just the
string S1.
<sequence>
<element name="L1" type="int" dfdl:length="4" dfdl:lengthUnits="bytes"
dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
<element name="box">
<complexType>
<sequence dfdl:length="{ ../L1 * 4 }"
dfdl:lengthUnits="bytes" dfdl:outputLengthCalc="{
ceiling(S1.length('bytes'), 4) * 4 }">
<element name="S2" type="string" dfdl:encoding="utf-8"
dfdl:length="fillAvailableSpace"
dfdl:outputValueCalc="{ ../../S1
}" dfdl:padCharacter=" " />
</sequence>
</complexType>
</element>
<element name="S1" type="string" dfdl:inputValueCalc="{ ../box/S2 }" />
</sequence>
In the above we have S2, which is the string that really lives in the
representation.
Now hiding the rep stuff and making it into a reusable type definition:
<complexType name="wordLengthStringType">
<sequence>
<annotation><appinfo><dfdl:hidden>
<element name="rep">
<complexType>
<sequence>
<element name="L1" type="int" dfdl:length="4"
dfdl:lengthUnits="bytes"
dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
<element name="box">
<complexType>
<sequence dfdl:length="{ ../L1 * 4 }"
dfdl:lengthUnits="bytes" dfdl:outputLengthCalc="{
ceiling(../../../S1.length('bytes'), 4) * 4 }">
<element name="S2" type="string" dfdl:encoding="utf-8"
dfdl:length="fillAvailableSpace"
dfdl:outputValueCalc="{
../../../S1 }" dfdl:padCharacter=" " />
</sequence>
</complexType>
</element>
</sequence>
</complexType>
</element>
</dfdl:hidden></appinfo></annotation>
<element name="S1" type="string" dfdl:inputValueCalc="{ ../rep/box/S2
}" />
</sequence>
</complexType>
Now to use it:
<element name="myString" type="wordLengthStringType"/>
Logical expression myString/S1 is the string's value. (Probably should
rename the element "S1" to "value" so this would be myString/value)
In DFDL v1.0 as currently defined, we do not have any way to make this
into a "real string type", because we don't provide a way to define a
complex type as the representation of a simple type. That's ok. We can
consider that later.
Conclusion:
It does appear that we need outputLengthCalc, which is tantamount to
Steve's concerns that we need input and output variants of many
properties. We need to distinguish input length and output length. In the
above example, dfdl:length is input length, and dfdl:outputLengthCalc is
the property name I'm using for an output length.
Perhaps better naming conventions would be
Use dfdl:length when it's symmetric, dfdl:inputLength and
dfdl:outputLength when it's asymmetric.
Logical value comes from the representation when parsing unless
dfdl:inputValue (formerly dfdl:inputValueCalc) is provided, in which case
the logical value comes from that expression.
Representation comes from the logical value when unparsing unless
dfdl:outputValue is provided (formerly dfdl:outputValueCalc), in which
case representation comes from that computed value instead.
We also need the expression language to be able to ask what the length of
the representation of an element is, measured in whatever units we need.
We may need to be able to ask for the inputLength and the outputLength
separately.
--
dfdl-wg mailing list
dfdl-wg at ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg
Comments on lengths <awp> below </awp>
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell at uk.ibm.com
Tel: +44 (0)1962 815073 Fax: +44 (0)1962 816898
Steve Hanson/UK/IBM at IBMGB
Sent by: dfdl-wg-bounces at ogf.org
19/09/2007 14:30
To
dfdl-wg at ogf.org
cc
Subject
[DFDL-WG] Fw: Notes from 2007-09-12 call
More on expressions, <smh>below</smh>
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 19/09/2007 14:14 -----
Mike Beckerle <beckerle at us.ibm.com>
19/09/2007 13:43
To
Steve Hanson/UK/IBM at IBMGB
cc
dfdl-wg at ogf.org, dfdl-wg-bounces at ogf.org
Subject
Re: [DFDL-WG] Notes from 2007-09-12 call
Comments below in BLUE
Steve Hanson <smh at uk.ibm.com>
Sent by: dfdl-wg-bounces at ogf.org
09/19/2007 06:04 AM
To
dfdl-wg at ogf.org
cc
Subject
Re: [DFDL-WG] Notes from 2007-09-12 call
Some thoughts since last week's call:
1) Expression language
We've not thought much about how expressions will work on output. It's
fine to say something like dfdl:length="../count+1" when parsing, but what
happens on output? I think we should not try to reverse engineer
expressions, and rely on the user to set output fields correctly. So,
taking my example, on output we would assume count had been set by the
user, apply the expression to calculate the intended length of data, then
apply padding etc rules as needed. Can we generalise that philosophy
across all our uses of expressions? If we can't then perhaps that places a
bound on the actual uses of expressions that we permit.
Inverting will generally not be possible. Just make the example
dfdl:length="{ ../count * ../scale + 1 }". How do you split up the length
into count and scale?
<smh>Agree</smh>
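To make the non-invertibility concrete, the Python sketch below lists every (count, scale) pair that produces a given length under length = count * scale + 1. The search bound and the function name are assumptions for illustration only.

```python
def candidate_inputs(length: int, limit: int = 20):
    """All (count, scale) pairs up to `limit` with count * scale + 1 == length."""
    return [
        (count, scale)
        for count in range(1, limit + 1)
        for scale in range(1, limit + 1)
        if count * scale + 1 == length
    ]
```

A length of 13 already has six candidate pairs, so a serialiser cannot recover count and scale from the length alone.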
In your example, I would expect the count field to have an
outputValueCalc="{ ../x.length() - 1 }" (I'm assuming the field with the
length calculation formula is named "x".)
<smh>Doesn't outputValueCalc mean that we are deriving the count from the
length of the data supplied for x? That forces the user to pad x to the
correct value, in order to derive count. Which is not how we want things
to work. We want count to define the length of x, so the DFDL serialiser
can pad x according to other DFDL properties. Maybe I'm missing something
about input/outputValueCalc?</smh>
<awp> There are multiple cases to consider. There are some formats that
require the length field to be the physical length of a structure, ie
including padding, code page considerations, etc that it is impossible for
the user to know. For example IMS transaction header has LLbbHeaderData.
DFDL should fill in these lengths.
I would assume that in most cases a field with its length in another field
is variable length, but again it may need to be the physical length.
I tend to agree with Mike that outputValueCalc can be used to set the length
field but how do we get to ignore the dfdl:length specification on the
data field?
I hope this doesn't mean that we need to distinguish between logical and
physical lengths in the expression language.
</awp>
In general, when something uses something else in its calculation (length
or just the value - inputValueCalc), then the inverse is outputValueCalc
on the contributing parts.
2) dfdl:length for sequences
We have three cases here:
a) Empty sequence - we agreed to disallow this
b) Non-empty normal sequence - what does the length mean here?
It means the box is potentially larger than the contents. If it isn't at
least as big it's an error. If these lengths are data dependent it could
be a processing error. Otherwise a schema definition error.
Draft 025 discusses this in the part on sequences with length.
We also could disallow this case if we want for now, knowing we could add
it back if we want.
One can always convert this case into the one below by wrapping the
sequence's child elements in an array element, with array occurrences
determined by "fillAvailableSpace" policy. If we allow this at all, I
think this should be the way we explain the semantics of it. (Though with
the inserted array the paths would all change which is undesirable. - so
we would say it works like this, but without the paths being changed...)
c) Non-empty sequence used as box array - the motivating scenario
I think we should also disallow b). If we are disallowing a) on the
grounds of not using sequence with a length to model opaque data then we
should also disallow b).
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh at uk.ibm.com
Phone (+44)/(0) 1962-815848
Mike Beckerle <beckerle at us.ibm.com>
Sent by: dfdl-wg-bounces at ogf.org
12/09/2007 21:20
To
dfdl-wg at ogf.org
cc
Subject
[DFDL-WG] Notes from 2007-09-12 call
Mike Beckerle, Alan Powell, Steve Hanson, Suman Kalia attended.
Discussed these questions from Alan about expression language.
1. Accessing hidden values - it seems inconsistent to allow access to
hidden values when xpath is used within the DFDL domain but not when used
outside.
2. Where xpath is allowed in the schema - It is currently allowed in an
arbitrary set of properties (initiator, terminator, separator,
occurseparator, null, etc ). Why not allow it everywhere?
W.r.t. (1) we decided this is correct. Path expressions for dfdl properties
can see hidden elements, path expressions in other places (e.g.,
schematron assertions) cannot.
W.r.t. (2) we decided that expressions should be allowed in principle
everywhere for the value of any property; however, there may be exceptions
for certain properties. Particularly, it seems some enum-valued properties
are unlikely to ever want to be expressions. Example: dfdl:representation.
However, it was also pointed out that once we put selectors back into the
language you can interleave multiple formats in the same schema, and for
any enumerated property you could just have one selector-chosen format for
each possible value of the enumerated property.
The reason we don't want a blanket statement that you can have expressions
anywhere you need a property value is that there is some potential that
this makes implementations unnecessarily complex due to the excess
flexibility.
Digression: (This added by MikeB - was not part of the call today.)
Consider
dfdl:byteOrder="{ if (../../x = 'B') then 'bigEndian' else if
(../../x='L') then 'littleEndian' else 'I don't know' }"
DFDL implementations must be prepared to cope with receiving "I don't
know" as the proposed value for the byteOrder. This is a schema definition
error, but it is happening at run time so becomes a processing error. The
only way to rule this out is to treat enumerated property values not as
strings but as an enum type and force the expressions that compute them to
return an enum type, not a string.
This is a kind of type inference I had hoped implementations would not
need.
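The difference between a string-valued and an enum-valued property can be sketched in Python. The code below mimics the byteOrder expression above and converts its string result to an enum at run time, turning a nonsense value into a processing error. The class and function names are hypothetical and stand in for whatever a DFDL implementation would actually do.

```python
from enum import Enum


class ByteOrder(Enum):
    BIG_ENDIAN = "bigEndian"
    LITTLE_ENDIAN = "littleEndian"


def byte_order_expression(x: str) -> str:
    """Mimics the expression above: it returns a plain string."""
    if x == "B":
        return "bigEndian"
    if x == "L":
        return "littleEndian"
    return "I don't know"


def resolve_byte_order(x: str) -> ByteOrder:
    """Convert the string result to the enum, failing at run time if invalid."""
    raw = byte_order_expression(x)
    try:
        return ByteOrder(raw)
    except ValueError as exc:
        raise RuntimeError(f"processing error: invalid byteOrder {raw!r}") from exc
```

The invalid value is only caught when the data is processed; catching it statically would need the expression itself to be typed as returning the enum.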
Selectors have the advantage of being statically verifiable. i.e., each
selected format is known to use a value of the enum that is valid or a
diagnostic could be issued by the DFDL processor. If we allow an arbitrary
expression to return the value of an enumerated property then it
presumably could also return a nonsense value, as in the byteOrder example
in the digression above.
We discussed proposals circulated by MikeB:
Here's an update to the first one. We decided sequences shouldn't be
another way to carry opaque data. Easy and conservative way to fix this is
to require the length of an empty sequence to be zero.
Second proposal to eliminate hexBinary and base64Binary was discussed
lightly. It was suggested that one could have both, and that would make it
easy to explain what the hexBinary type is, because it is a shorthand for
a string with encoding="hex", and similarly for base64Binary. We did not
resolve this issue on the call.
Finally, we discussed regular expression features for DFDL.
There does appear to be need for regexp features to support parsing data
which is delimited by changing data content. E.g. consider "12345Mike
Beckerle". and a two-element sequence. One is a number which continues
until the first non-digit character. The other is a string which begins
with a non-digit character. Regexp length appears to be a good way to
handle this kind of thing.
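A small Python sketch of the "12345Mike Beckerle" case, using the standard re module. The pattern is one possible reading of the description above (a run of digits, then a string starting with a non-digit); the function name is hypothetical.

```python
import re

# A run of digits, then text that begins with a non-digit character.
PATTERN = re.compile(r"(?P<number>\d+)(?P<text>\D.*)", re.DOTALL)


def split_number_and_text(data: str):
    """Split data into (number, text) at the first non-digit character."""
    match = PATTERN.fullmatch(data)
    if match is None:
        raise ValueError("data does not look like <digits><non-digit text>")
    return int(match.group("number")), match.group("text")
```

The boundary between the two elements is found purely from the content, with no delimiter or stored length.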
Alan Powell has the action item to talk with the IBM internal TX product
group. They have a speculative parser and so have fewer regular-expression
features in their language. We want to understand how they deal with the
header, body[], trailer use case. This case is where the data is lines of
text, the header is the first line, the trailer is the last line, the body
records are everything in between and there's no content that can be used
to distinguish the record types. This is handled in some
format-description systems with regexp features. In TX this is handled by
speculative parsing and we want to understand how this comes out and if it
is preferable to adding regexp features.
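For reference, the header, body[], trailer use case can be stated as a purely positional split. The Python sketch below shows only the desired result; it is not how TX's speculative parser or a regexp-based description would arrive at it, and the function name is hypothetical.

```python
def split_records(text: str):
    """Return (header, body_lines, trailer) from line-oriented text."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("need at least a header line and a trailer line")
    # First line is the header, last line is the trailer, the rest is body.
    return lines[0], lines[1:-1], lines[-1]
```

The difficulty for a format description is that the trailer is only recognisable as "the last line", which is not known until the end of the data is reached.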
Mike Beckerle
STSM, Architect, Scalable Computing
IBM Software Group
Information Platform and Solutions
Westborough, MA 01581
direct: voice and FAX 508-599-7148
assistant: Pam Riordan
priordan at us.ibm.com
508-599-7046