[DFDL-WG] possible issue - can't use pattern to validate DFDL content containing newlines

Mike Beckerle mbeckerle.dfdl at gmail.com
Wed Mar 3 18:32:16 EST 2021


Since data can contain any characters, the DFDL infoset allows any characters.

However, XML does not allow all characters.

Furthermore, XML Schema Pattern facets are expressed using this XML Schema
fragment:

<xs:pattern value="...some regex pattern here ..."/>

But XML attribute values are normalized by XML readers/parsers. Line endings
in them are converted to single spaces.

So

<xs:pattern value="abc
def"/>

is equivalent to:

<xs:pattern value="abc def"/>
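The literal line-break case can be reproduced with any conforming XML
reader. Here is a minimal sketch using Python's standard-library ElementTree
as a stand-in (it is not the reader Daffodil uses, and attribute_value is
just an illustrative helper name):

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# The pattern element with a literal line break inside the attribute value.
# The "\n" here is a real newline character in the XML text.
doc = '<xs:pattern xmlns:xs="%s" value="abc\ndef"/>' % XS


def attribute_value(xml_text, name="value"):
    """Return the named attribute as delivered by the parser."""
    return ET.fromstring(xml_text).get(name)


# Attribute-value normalization replaces the literal newline with a space.
print(repr(attribute_value(doc)))  # 'abc def'
```

Printing the repr makes it easy to tell a space from a line ending.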

Furthermore

<xs:pattern value="abc&#xA;def"/>

is also normalized to

<xs:pattern value="abc def"/>

As far as I can tell, there is no alternative notation for this.

This means, if you want to use a pattern facet to specify that a DFDL
infoset string can contain A-Za-z0-9, spaces, and line endings, there is no
way to express this.

This pattern was the example I was dealing with.

<xs:pattern value="[A-Za-z0-9 &#xD;&#xA;]*"/>

If you look at the string for the value attribute of this pattern element,
that string already has the line ending characters converted into spaces.
The attribute value is
"[A-Za-z0-9   ]*" which has 3 spaces before the "]".

I think there is no workaround for this in XML, XSD, or DFDL.
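One way to double-check the attribute value described above, outside of
Daffodil, is to print it exactly as some other XML reader delivers it. The
sketch below uses Python's standard-library ElementTree (expat) purely as a
stand-in. It is not the loader Daffodil uses, so its result need not match
what Daffodil shows. The attribute-value normalization rules in XML 1.0
(section 3.3.3) treat character references separately from literal white
space, which is why the comparison is worth making. pattern_value is a
hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# The pattern element from the message, with CR and LF written as
# character references rather than literal line breaks.
doc = '<xs:pattern xmlns:xs="%s" value="[A-Za-z0-9 &#xD;&#xA;]*"/>' % XS


def pattern_value(xml_text):
    """Return the 'value' attribute as delivered by this parser."""
    return ET.fromstring(xml_text).get("value")


value = pattern_value(doc)
print(repr(value))

# Code points of the three characters just before the closing "]".
# With CPython's expat-based parser these are space, CR, LF.
print([hex(ord(c)) for c in value[-5:-2]])  # ['0x20', '0xd', '0xa']
```

If a reader hands back three spaces here instead, the conversion is
happening somewhere in that reader's own processing.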

I dug into the Daffodil implementation, and in the code that accesses this
attribute you don't even get a NodeSeq containing a mixture of Text and
Entity nodes. You just get a single Text node. So it is pretty much
hopeless without reaching into the guts of the XML parser/reader.

Hence, in DFDL, if you want to use a regex to "validate" a DFDL string whose
content includes line endings, you have to use dfdl:assert with
failureType="recoverableError", testKind="pattern", and testPattern set to
the regex of interest. This is then a DFDL regex, which is a Java regex, so
you can be explicit about which line endings are allowed.
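To illustrate why the distinction matters, here is a small sketch comparing
a regex that names CR and LF explicitly with the same character class after
the line endings have become spaces. It uses Python's re module as a
stand-in for a Java regex (the \r and \n escapes mean the same thing in
both). It does not show the dfdl:assert itself. Requiring the whole string
to match is an assumption about the intent, and matches_whole is a
hypothetical helper name.

```python
import re

# The character class with CR and LF spelled as regex escapes.
intended = re.compile(r"[A-Za-z0-9 \r\n]*")

# The same class once CR and LF have been replaced by spaces
# (three spaces before the closing bracket).
normalized = re.compile(r"[A-Za-z0-9   ]*")


def matches_whole(pattern, text):
    """True if the entire text conforms to the pattern."""
    return pattern.fullmatch(text) is not None


sample = "line one\r\nline two"

print(matches_whole(intended, sample))    # True
print(matches_whole(normalized, sample))  # False
```

The second pattern rejects the sample because neither CR nor LF is in its
character class any longer.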

You can't do it with a pattern facet.

Comments?

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense |
www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>