[DFDL-WG] possible issue - can't use pattern to validate DFDL content containing newlines

Mike Beckerle mbeckerle.dfdl at gmail.com
Wed Mar 3 20:43:54 EST 2021


Never mind. I figured this out. Just slow today I guess.

You can't use &#xA;, but you can use \n in the regex. And similarly \r, \t,
etc.
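
A minimal sketch of why the escape survives where a raw line break does not. It uses Python's xml.etree.ElementTree purely as a stand-in XML reader (an assumption; it is not the reader Daffodil uses), and pattern_value is a hypothetical helper name. To XML, backslash-n is just two ordinary characters, so attribute normalization has nothing to convert.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def pattern_value(facet_xml: str) -> str:
    """Return the value attribute of one xs:pattern element as the parser reports it."""
    return ET.fromstring(facet_xml).get("value")


# "\n" here is a real line break inside the attribute; the XML reader
# normalizes it to a single space.
literal = pattern_value(f'<xs:pattern xmlns:xs="{XS}" value="abc\ndef"/>')

# "\\n" here is a backslash followed by the letter n; XML treats both as
# ordinary characters and passes them through untouched.
escaped = pattern_value(f'<xs:pattern xmlns:xs="{XS}" value="abc\\ndef"/>')

print(repr(literal))
print(repr(escaped))
```

The regex engine, not the XML reader, is what later turns the two-character escape into a line feed.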

The pattern in question is just:

<xs:pattern value="[A-Za-z0-9 \n\r]*"/>
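
As a rough check of what that pattern accepts, here is a sketch using Python's re module. This is an assumption-laden stand-in: Python's regex dialect is not the XSD one, although both treat \n and \r inside a character class as line feed and carriage return. matches_facet is a hypothetical helper, and fullmatch is used because an XSD pattern has to match the entire value.

```python
import re

# The facet value exactly as written in the schema: backslash escapes,
# no raw line breaks.
PATTERN = r"[A-Za-z0-9 \n\r]*"


def matches_facet(text: str) -> bool:
    """Return True if the whole string consists only of characters the pattern allows."""
    return re.fullmatch(PATTERN, text) is not None


print(matches_facet("line one\r\nline two"))  # content with a CRLF line ending
print(matches_facet("tab\there"))             # a tab is not in the character class
```

Adding \t to the character class would admit the second string as well.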

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense |
www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>



On Wed, Mar 3, 2021 at 6:32 PM Mike Beckerle <mbeckerle.dfdl at gmail.com>
wrote:

> Since data can contain any characters, the DFDL infoset allows any characters.
>
> However, XML does not allow all characters.
>
> Furthermore, XML Schema Pattern facets are expressed using this XML Schema
> fragment:
>
> <xs:pattern value="...some regex pattern here ..."/>
>
> But XML attributes are normalized by XML readers/parsers. Line endings in
> them are converted to single spaces.
>
> So
>
> <xs:pattern value="abc
> def"/>
>
> is equivalent to:
>
> <xs:pattern value="abc def"/>
>
> Furthermore
>
> <xs:pattern value="abc&#xA;def"/>
>
> is also normalized to
>
> <xs:pattern value="abc def"/>
>
> As far as I can tell there is no alternate notation to this.
>
> This means, if you want to use a pattern facet to specify that a DFDL
> infoset string can contain A-Za-z0-9 spaces and line endings, there is no
> way to express this.
>
> This pattern was the example I was dealing with.
>
> <xs:pattern value="[A-Za-z0-9 &#xD;&#xA;]*"/>
>
> If you look at the string for the value attribute of this pattern element,
> that string already has the line ending characters converted into spaces.
> The attribute value is
> "[A-Za-z0-9   ]*" which has 3 spaces before the "]".
>
> I think there is no workaround for this in XML, XSD, or DFDL.
>
> I dug into the Daffodil implementation and in the code that accesses this
> attribute, you don't even get a NodeSeq containing a mixture of Text and
> Entity nodes. You just get a single Text node. So it is pretty well
> hopeless without reaching under the XML parser/reader's guts.
>
> Hence, in DFDL if you want to "validate" that a DFDL string contains
> content that includes line-endings with a regex, you have to use
> dfdl:assert with failureType="recoverableError" testKind="pattern" and
> testPattern with the regex of interest. This is then a DFDL regex, which is
> a Java regex, and you can be explicit about line endings allowed.
>
> You can't do it with a pattern facet.
>
> Comments?