[DFDL-WG] BLOB - binary large object proposal - updated

Mike Beckerle mbeckerle.dfdl at gmail.com
Fri Aug 9 10:25:13 EDT 2019


My suggestions based on this thread are:

I think the dfdlx:blob type is problematic, and we should avoid it in favor
of an xs:string with a dfdlx:largeObjectKind property.

I think this should not be a "Type" as in string or hexBinary, because
hexBinary is such a misleading term, suggesting textualization, etc. There
is nothing "hex" about a BLOB, ever.

I suggest dfdlx:largeObjectKind="bytes/chars/none", with "none" as the
default for now and "chars" as a future capability for character large
objects if they prove important.
I could be convinced other enums are better than bytes or chars for this.
E.g., BLOB and CLOB might be better. Or perhaps this is
dfdl:largeObjectRep="binary/text/none", analogous to the dfdl:representation
property?
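
As a rough sketch only: with the property form suggested above, a BLOB
element might be declared something like this. The element name is made up
for illustration, the dfdlx:largeObjectKind property and its values are
just the suggestion in this message and not anything implemented, and all
other DFDL properties are omitted:

  <xs:element name="myBlobData" type="xs:string"
    dfdlx:largeObjectKind="bytes" />

The element stays an ordinary xs:string as far as XML validation tools are
concerned, so no extra schema defining a new type would need to be imported.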

The use of xs:anyURI is unnecessary, and is not a type we have in DFDL as
yet. People should treat this string as opaque. The fact that it is
potentially a meaningful URI is not relevant, and can be an implementation
detail.

I think dfdl:largeObjectDirectory="{ $dfdlx:largeObjectDirectory }" is a
nice idea to save for the future. We may find that numerous other
parameters are required, so I'd prefer not to predefine this one until we
have clearer direction on whether there are others.

The other thing observed on yesterday's DFDL WG call was that this has
some overlap with the offset/pointer stuff. Unparsing from a blob file is
an awful lot like data-source indirection, where the data being unparsed
comes from a scattered data structure that is being gathered. There is
some conceptual similarity anyway. I'm not sure how deep this goes or
whether it is just a superficial observation, and I would not suggest
waiting for that to be figured out before proceeding with this
experimental BLOB feature.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>



On Thu, Aug 8, 2019 at 8:40 AM Lawrence, Stephen <slawrence at tresys.com>
wrote:

> The intention was that this new type would be an internal built-in type
> and so no extra properties could be placed on the new simple type. One
> drawback that I'm realizing as I implement this feature in Daffodil is
> that in order to use non-DFDL-aware XML validation tools to validate the
> XML infoset, you need to provide and xs:import this new DFDL schema that
> defines the dfdlx:blob type, which feels a little awkward to me for
> something that's considered a built-in for DFDL processors.
>
> Maybe an alternative would be to not have a dfdlx:blob type and instead
> allow the use of the xs:anyURI type for simple elements, with the
> implication that we treat the element as if it were xs:hexBinary except
> for the infoset/blob output. This doesn't easily support CLOBs, but a new
> DFDL property could determine how an xs:anyURI should be interpreted, e.g.:
>
>   <xs:element name="myBlobData" type="xs:anyURI"
>     dfdl:largeObjectType="xs:hexBinary" ... />
>
>   <xs:element name="myClobData" type="xs:anyURI"
>     dfdl:largeObjectType="xs:string" ... />
>
> So a type of xs:anyURI implies this is going to be some kind of large
> object representation, and it requires the dfdl:largeObjectType property
> that must reference a simple type that defines how the content should be
> turned into a large object. This might also help to support
> restrictions on the blob data, as well as implicit lengths, e.g.:
>
>   <xs:simpleType name="blob10">
>     <xs:restriction base="xs:hexBinary">
>       <xs:maxLength value="20" />
>     </xs:restriction>
>   </xs:simpleType>
>
>   <xs:element name="data" type="xs:anyURI"
>     dfdl:largeObjectType="blob10" dfdl:lengthKind="implicit" />
>
> DFDL properties could be placed on either the element or the simpleType
> referenced by dfdl:largeObjectType, with the base type of
> dfdl:largeObjectType determining which properties are valid/interpreted,
> rather than the element type (which must be anyURI).
>
> But maybe this all adds unnecessary complexity?
>
>
> Regarding specifying the filename via a DFDL property rather than API,
> we have use cases where each parse would need to output to a different
> directory, so a property might cause problems with this. But perhaps this
> could be handled by a variable, e.g.:
>
>   <xs:element name="data" type="dfdlx:blob"
>     dfdl:blobDirectory="{ $blobDir }" ... />
>
> That said, we had additional use cases where a DFDL blobDirectory
> property would be too restrictive. For example, maybe the blobs should
> be put into a database, or pushed to a data store in the cloud, stored
> in local memory, or not stored anywhere at all but with a special URI
> with offset+length to the original data. We chose to ignore these
> use cases for simplicity, but these different options would probably
> require a flexible API to support. Going with an API to specify the
> output directory makes it a bit easier to support these different
> blob outputs in the future if they are needed.
>
>
> On 8/8/19 5:09 AM, Steve Hanson wrote:
> > Mike
> >
> > Am I allowed to put DFDL properties on the new simple type, or is the
> > new type considered to be a built-in type?  I think the latter is
> > clearer and simpler to implement.  Support for 'clob' would then just
> > add a new simple type restriction 'dfdlx:clob'.
> >
> > Assuming that the feature makes it into a future DFDL 2.0, the schema
> > containing the 'blob' simple type would then be in the standard DFDL
> > namespace. That's the first example of such a schema, as this is the
> > first time we are extending base XML Schema as opposed to defining
> > annotations. If the new type is considered a built-in type, then this
> > schema should be part of the DFDL 2.0 standard and read-only.
> >
> > Any thoughts on allowing the specification of the filename via DFDL
> > property rather than API call?
> >
> > Presumably I could create a local restriction of 'dfdlx:blob'? One
> > motivation for so doing would be to validate the length or content of
> > my binary data. There's a problem with that though - validation works
> > against the infoset, so the allowable facets are those applicable to
> > xs:anyURI and would be applied to the file name, not the binary data.
> > It also means that dfdl:lengthKind 'implicit' can't be used.  I don't
> > see a way round this.
> >
> > Regards
> >
> > Steve Hanson
> >
> > IBM Hybrid Integration, Hursley, UK
> > Architect, IBM DFDL
> > <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> > Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> > smh at uk.ibm.com
> > tel:+44-1962-815848
> > mob:+44-7717-378890
> > Note: I work Tuesday to Friday
> >
> >
> >
> > From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
> > To: DFDL-WG <dfdl-wg at ogf.org>
> > Date: 12/07/2019 18:14
> > Subject: [DFDL-WG] BLOB - binary large object proposal - updated
> > Sent by: "dfdl-wg" <dfdl-wg-bounces at ogf.org>
> >
> >
> --------------------------------------------------------------------------------
> >
> >
> >
> > This concept, which has been discussed before, is in high demand in the
> > Daffodil user community to enable DFDL to be used to parse image file
> > formats. The use case is to provide uniform image-metadata access
> > without getting bogged down in the large byte-array that makes up most
> > of the file and would be very large (and pointless) if rendered into
> > XML or JSON.
> >
> > So our proposal (which will get turned into an official Experimental
> > feature document) has been simplified and revised and is described here:
> >
> > https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Binary+Large+Objects
> >
> >
> > --
> >   dfdl-wg mailing list
> >   dfdl-wg at ogf.org
> > https://www.ogf.org/mailman/listinfo/dfdl-wg