[DFDL-WG] BLOB - binary large object proposal - updated
Lawrence, Stephen
slawrence at tresys.com
Mon Aug 12 07:16:17 EDT 2019
dfdl:largeObjectKind definitely keeps things simple, but does lose
flexibility (e.g. maxLength). But that may not be needed. I'm in favor
of this.
However, one drawback of using xs:string instead of xs:anyURI relates to
our TDML test rig. If we make the type of blobs/clobs an xs:string and
have it be an opaque identifier then it makes it difficult for our TDML
runner to know how to compare actual vs expected blobs. For example:
<data xsi:type="xs:string">some unique identifier</data>
In this case, the TDML test rig must be schema aware to know that data
is not a string and is actually an opaque identifier. And it must know
how to use that unique identifier to look up the bytes to do
expected/actual comparisons.
By making the type an xs:anyURI and requiring that the identifier is a
URI, a TDML runner does not need any knowledge of the schema. Since the
xsi:type is an anyURI, it can infer that this must be a blob/clob, and
then it can open the URI to determine the bytes and easily compare
expected vs actual blobs.
And this applies to anyone accessing the infoset as well--not just our
TDML runner. Using a type of xs:anyURI provides a hint to infoset users
that an element shouldn't be treated like a string, but as a blob handle.
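
As a rough sketch of the comparison described above (this is not the actual
Daffodil TDML runner; the helper names is_blob_handle, read_blob, and
same_value are hypothetical, and it assumes file: URIs and the literal "xs:"
prefix in xsi:type), a schema-unaware check could look something like this
in Python:

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def is_blob_handle(elem):
    # Assumption: the infoset writer always uses the conventional "xs" prefix.
    # ElementTree does not resolve prefixes inside attribute values.
    return elem.get(XSI_TYPE) == "xs:anyURI"


def read_blob(uri):
    # Only file: URIs are handled in this sketch; other schemes would need
    # their own handlers.
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"unsupported blob URI scheme: {parsed.scheme!r}")
    return Path(url2pathname(parsed.path)).read_bytes()


def same_value(expected, actual):
    # No schema is consulted: xsi:type alone decides how to compare.
    if is_blob_handle(expected) and is_blob_handle(actual):
        # The URIs themselves may differ; only the referenced bytes matter.
        return read_blob(expected.text.strip()) == read_blob(actual.text.strip())
    return expected.text == actual.text


def infoset_for(path):
    # Build a one-element infoset whose value is a file: URI to the blob.
    return ET.fromstring(
        '<data xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
        'xmlns:xs="http://www.w3.org/2001/XMLSchema" '
        f'xsi:type="xs:anyURI">{path.as_uri()}</data>'
    )


with tempfile.TemporaryDirectory() as tmp:
    expected_file = Path(tmp, "expected.bin")
    actual_file = Path(tmp, "actual.bin")
    expected_file.write_bytes(b"\x89PNG\r\n")
    actual_file.write_bytes(b"\x89PNG\r\n")
    blobs_match = same_value(infoset_for(expected_file), infoset_for(actual_file))

    actual_file.write_bytes(b"something else")
    blobs_differ = not same_value(infoset_for(expected_file), infoset_for(actual_file))
```

With an xs:string handle, is_blob_handle would instead need a list of blob
element names taken from the schema, which is the schema awareness I'd like
to avoid.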
- Steve
On 8/9/19 10:25 AM, Mike Beckerle wrote:
>
> My suggestions based on this thread are:
>
> I think the dfdlx:blob type is problematic, and we should avoid it in favor of a
> xs:string with a dfdlx:largeObjectKind property.
>
> I think this should not be a "Type" as in string or hexBinary, because hexBinary
> is such a misleading term, suggesting textualization, etc. There is nothing
> "hex" about a BLOB, ever.
>
> I think dfdlx:largeObjectKind="bytes/chars/none" with none the "default" for
> now, and "chars" as a future capability for character large objects if they
> prove important.
> I could be convinced other enums are better than bytes or chars for this. Eg.,
> BLOB, CLOB might be better. Or perhaps this is
> dfdl:largeObjectRep="binary/text/none" analogous to the dfdl:representation
> property?
>
> The use of xs:anyURI is unnecessary, and is not a type we have in DFDL as yet.
> People should treat this string as opaque. The fact that it is potentially a
> meaningful URI is not relevant, and can be an implementation detail.
>
> I think dfdl:largeObjectDirectory="{ $dfdlx:largeObjectDirectory }" is a nice
> idea to save for the future. We may find that numerous other parameters are
> required, so I'd prefer not to predefine this one in advance of clearer
> direction or whether there are others.
>
> The other thing observed on yesterday's DFDL WG call was that this has some
> overlap with the offset/pointer stuff. Unparsing from a blob file is an awful
> lot like data-source indirection where the source of unparsing is coming from a
> scattered data structure that is being gathered. There is some conceptual
> similarity anyway. Not sure how deep this goes or if it is just a superficial
> observation. And I would not suggest waiting for that to be figured out before
> proceeding with this experimental BLOB feature.
>
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
> <http://www.tresys.com>
> Please note: Contributions to the DFDL Workgroup's email discussions are subject
> to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
>
>
>
> On Thu, Aug 8, 2019 at 8:40 AM Lawrence, Stephen <slawrence at tresys.com> wrote:
>
> The intention was that this new type would be an internal built-in type
> and so no extra properties could be placed on the new simple type. One
> drawback that I'm realizing as I implement this feature in Daffodil, is
> that in order to use non DFDL aware XML Validation tools to validate the
> XML infoset, you need to provide and xs:import this new DFDL schema that
> defines the dfdlx:blob type, which feels a little awkward to me for
> something that's considered a built-in for DFDL processors.
>
> Maybe an alternative would be to not have a dfdlx:blob type, allow the
> use of the xs:anyURI type for simple elements, with the implication that
> we treat the element as if it were xs:hexBinary except for the
> infoset/blob output. This doesn't easily support CLOBs, but a new DFDL
> property could determine how an xs:anyURI should be interpreted, e.g.:
>
> <xs:element name="myBlobData" type="xs:anyURI"
> dfdl:largeObjectType="xs:hexBinary" ... />
>
> <xs:element name="myClobData" type="xs:anyURI"
> dfdl:largeObjectType="xs:string" ... />
>
> So a type of xs:anyURI implies this is going to be some kind of large
> object representation, and it requires the dfdl:largeObjectType property
> that must reference a simple type that defines how the content should be
> turned into a large object. This might also help to support
> restrictions on the blob data, as well as implicit lengths, e.g.:
>
> <xs:simpleType name="blob10">
> <xs:restriction base="xs:hexBinary">
> <xs:maxLength value="20" />
> </xs:restriction>
> </xs:simpleType>
>
> <xs:element name="data" type="xs:anyURI" dfdl:largeObjectType="blob10"
> dfdl:lengthKind="implicit" />
>
> DFDL properties could be placed on either the element or the largeObjectType
> simpleType, with the base type of dfdl:largeObjectType determining which
> properties are valid/interpreted, rather than the element type (which
> must be anyURI).
>
> But maybe this all adds unnecessary complexity?
>
>
> Regarding specifying the filename via a DFDL property rather than API,
> we have use cases where each parse would need to output to a different
> directory so a property might cause problems with this. But perhaps this
> could be handled by a variable, e.g.:
>
> <xs:element name="data" type="dfdlx:blob"
> dfdl:blobDirectory="{ $blobDir }" ... />
>
> That said, we had additional use cases where a DFDL blobDirectory
> property would be too restrictive. For example, maybe the blobs should
> be put into a database, or pushed to a data store in the cloud, stored
> in local memory, or not stored anywhere at all but with a special URI
> with offset+length to the original data. We chose to ignore these
> use-cases for simplicity, but these different options would probably
> require a flexible API to support. By going with an API to specify the
> output directory, it makes it a bit easier to support these different
> blob outputs in the future if it was needed.
>
>
> On 8/8/19 5:09 AM, Steve Hanson wrote:
> > Mike
> >
> > Am I allowed to put DFDL properties on the new simple type, or is the new type
> > considered to be a built-in type? I think the latter is clearer and simpler to
> > implement. Support for 'clob' would then just add a new simple type restriction
> > 'dfdlx:clob'.
> >
> > Assuming that the feature makes it into a future DFDL 2.0, the schema containing
> > the 'blob' simple type would then be in the standard DFDL namespace. That's the
> > first example of such a schema, as this is the first time we are extending base
> > XML Schema as opposed to defining annotations. If the new type is considered a
> > built-in type, then this schema should be part of the DFDL 2.0 standard and
> > read-only.
> >
> > Any thoughts on allowing the specification of the filename via DFDL property
> > rather than API call?
> >
> > Presumably I could create a local restriction of 'dfdlx:blob'? One motivation
> > for so doing would be to validate the length or content of my binary data.
> > There's a problem with that though - validation works against the infoset, so
> > the allowable facets are those applicable to xs:anyURI and would be applied to
> > the file name, not the binary data. It also means that dfdl:lengthKind
> > 'implicit' can't be used. I don't see a way round this.
> >
> > Regards
> >
> > Steve Hanson
> >
> > IBM Hybrid Integration, Hursley, UK
> > Architect, IBM DFDL <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> > Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> > smh at uk.ibm.com
> > tel:+44-1962-815848
> > mob:+44-7717-378890
> > Note: I work Tuesday to Friday
> >
> >
> >
> > From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
> > To: DFDL-WG <dfdl-wg at ogf.org>
> > Date: 12/07/2019 18:14
> > Subject: [DFDL-WG] BLOB - binary large object proposal - updated
> > Sent by: "dfdl-wg" <dfdl-wg-bounces at ogf.org>
> >
> >
> --------------------------------------------------------------------------------
> >
> >
> >
> > This concept, which has been discussed before, is in high demand in the
> > Daffodil user community to enable DFDL to be used to parse image file formats.
> > The use case is to provide uniform image-metadata access without getting bogged
> > down in the large byte-array that makes up most of the file and would be very
> > large (and pointless) if rendered into XML or JSON.
> >
> > So our proposal (which will get turned into an official Experimental feature
> > document) has been simplified and revised and is described here:
> >
> >
> > https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Binary+Large+Objects
> >
> >
> > Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
> > <http://www.tresys.com>
> > Please note: Contributions to the DFDL Workgroup's email discussions are subject
> > to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
> > --
> > dfdl-wg mailing list
> > dfdl-wg at ogf.org <mailto:dfdl-wg at ogf.org>
> > https://www.ogf.org/mailman/listinfo/dfdl-wg
> >
> > Unless stated otherwise above:
> > IBM United Kingdom Limited - Registered in England and Wales with number 741598.
> > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
> >
> >
> >
>
>