[DFDL-WG] BLOB - binary large object proposal - updated

Lawrence, Stephen slawrence at tresys.com
Mon Aug 12 07:16:17 EDT 2019


dfdl:largeObjectKind definitely keeps things simple, but does lose
flexibility (e.g. maxLength). But that may not be needed. I'm in favor
of this.

However, one drawback of using xs:string instead of xs:anyURI relates to
our TDML test rig. If we make the type of blobs/clobs an xs:string and
have it be an opaque identifier then it makes it difficult for our TDML
runner to know how to compare actual vs expected blobs. For example:

  <data xsi:type="xs:string">some unique identifier</data>

In this case, the TDML test rig must be schema aware to know that data
is not a string and is actually an opaque identifier. And it must know
how to use that unique identifier to look up the bytes to do
expected/actual comparisons.

By making the type an xs:anyURI and requiring that the identifier is a
URI, a TDML runner does not need any knowledge of the schema. Since the
xsi:type is an anyURI, it can infer that this must be a blob/clob, and
then it can open the URI to determine the bytes and easily compare
expected vs actual blobs.
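
To make that concrete, here is a rough sketch in Python of what a
schema-unaware comparison could look like. This is not the actual TDML
runner code. The function names (is_blob, read_uri, infosets_match) are
made up for illustration, and it assumes the blob URIs use a scheme that
urllib can open (e.g. file:).

```python
import urllib.request
import xml.etree.ElementTree as ET

# Expanded name of the xsi:type attribute as ElementTree reports it.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def is_blob(elem):
    """Assumption: any element whose xsi:type is anyURI is a blob/clob handle."""
    return elem.get(XSI_TYPE, "").split(":")[-1] == "anyURI"


def read_uri(uri):
    """Fetch the bytes the URI points at (file: URIs work with urllib)."""
    with urllib.request.urlopen(uri) as resp:
        return resp.read()


def infosets_match(expected, actual):
    """Compare two infoset elements without consulting the DFDL schema."""
    if expected.tag != actual.tag or len(expected) != len(actual):
        return False
    if len(expected) == 0:
        exp_text = (expected.text or "").strip()
        act_text = (actual.text or "").strip()
        if is_blob(expected) and is_blob(actual):
            # The URIs themselves may differ; only the referenced bytes matter.
            return read_uri(exp_text) == read_uri(act_text)
        return exp_text == act_text
    # Complex element: compare children pairwise, in order.
    return all(infosets_match(e, a) for e, a in zip(expected, actual))
```

With an opaque xs:string identifier, the is_blob check above would have
nothing in the infoset to key off of, which is the schema-awareness
problem described earlier.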

And this applies to anyone accessing the infoset as well--not just our
TDML runner. Using a type of xs:anyURI provides a hint to infoset users
that an element shouldn't be treated like a string, but as a blob handle.

- Steve


On 8/9/19 10:25 AM, Mike Beckerle wrote:
> 
> My suggestions based on this thread are:
> 
> I think the dfdlx:blob type is problematic, and we should avoid it in favor of a 
> xs:string with a dfdlx:largeObjectKind property.
> 
> I think this should not be a "Type" as in string or hexBinary, because hexBinary 
> is such a misleading term, suggesting textualization, etc. There is nothing 
> "hex" about a BLOB, ever.
> 
> I think dfdlx:largeObjectKind="bytes/chars/none" with none the "default" for 
> now, and "chars" as a future capability for character large objects if they 
> prove important.
> I could be convinced other enums are better than bytes or chars for this. Eg., 
> BLOB, CLOB might be better. Or perhaps this is 
> dfdl:largeObjectRep="binary/text/none" analogous to the dfdl:representation 
> property?
> 
> The use of xs:anyURI is unnecessary, and is not a type we have in DFDL as yet. 
> People should treat this string as opaque. The fact that it is potentially a 
> meaningful URI is not relevant, and can be an implementation detail.
> 
> I think dfdl:largeObjectDirectory="{ $dfdlx:largeObjectDirectory }" is a nice 
> idea to save for the future. We may find that numerous other parameters are 
> required, so I'd prefer not to predefine this one in advance of clearer 
> direction or whether there are others.
> 
> The other thing observed on yesterday's DFDL WG call, was that this has some 
> overlap with the offset/pointer stuff. Unparsing from a blob file is an awful 
> lot like data-source indirection where the source of unparsing is coming from a 
> scattered data structure that is being gathered. There is some conceptual 
> similarity anyway. Not sure how deep this goes or if it is just a superficial 
> observation. And I would not suggest waiting for that to be figured out before 
> proceeding with this experimental BLOB feature.
> 
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com 
> <http://www.tresys.com>
> Please note: Contributions to the DFDL Workgroup's email discussions are subject 
> to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
> 
> 
> 
> On Thu, Aug 8, 2019 at 8:40 AM Lawrence, Stephen <slawrence at tresys.com 
> <mailto:slawrence at tresys.com>> wrote:
> 
>     The intention was that this new type would be an internal built-in type
>     and so no extra properties could be placed on the new simple type. One
>     drawback that I'm realizing as I implement this feature in Daffodil, is
>     that in order to use non DFDL aware XML Validation tools to validate the
>     XML infoset, you need to provide and xs:import this new DFDL schema that
>     defines the dfdlx:blob type, which feels a little awkward to me for
>     something that's considered a built-in for DFDL processors.
> 
>     Maybe an alternative would be to not have a dfdlx:blob type, allow the
>     use of the xs:anyURI type for simple elements, with the implication that
>     we treat the element as if it were xs:hexBinary except for the
>     infoset/blob output. This doesn't easily support CLOBs, but a new DFDL
>     property could determine how an xs:anyURI should be interpreted, e.g.:
> 
>        <xs:element name="myBlobData" type="xs:anyURI"
>          dfdl:largeObjectType="xs:hexBinary" ... />
> 
>        <xs:element name="myClobData" type="xs:anyURI"
>          dfdl:largeObjectType="xs:string" ... />
> 
>     So a type of xs:anyURI implies this is going to be some kind of large
>     object representation, and it requires the dfdl:largeObjectType property
>     that must reference a simple type that defines how the content should be
>     turned into a large object. This might also help to support
>     restrictions on the blob data, as well as implicit lengths, e.g.:
> 
>        <xs:simpleType name="blob10">
>          <xs:restriction base="xs:hexBinary">
>          <xs:maxLength value="20" />
>          </xs:restriction>
>        </xs:simpleType>
> 
>        <xs:element name="data" type="xs:anyURI"
>          dfdl:largeObjectType="blob10" dfdl:lengthKind="implicit" />
> 
>     DFDL properties could be placed on either the element or the simple
>     type referenced by dfdl:largeObjectType, with the base type of that
>     simple type determining which properties are valid/interpreted, rather
>     than the element type (which must be anyURI).
> 
>     But maybe this all adds unnecessary complexity?
> 
> 
>     Regarding specifying the filename via a DFDL property rather than API,
>     we have use cases where each parse would need to output to a different
>     directory so a property might cause problems with this. But perhaps this
>     could be handled by a variable, e.g.:
> 
>        <xs:element name="data" type="dfdlx:blob"
>          dfdl:blobDirectory="{ $blobDir }" ... />
> 
>     That said, we had additional use cases where a DFDL blobDirectory
>     property would be too restrictive. For example, maybe the blobs should
>     be put into a database, or pushed to a data store in the cloud, stored
>     in local memory, or not stored anywhere at all but with a special URI
>     with offset+length to the original data. We chose to ignore these
>     use-cases for simplicity, but these different options would probably
>     require a flexible API to support. By going with an API to specify the
>     output directory, it makes it a bit easier to support these different
>     blob outputs in the future if it was needed.
> 
> 
>     On 8/8/19 5:09 AM, Steve Hanson wrote:
>      > Mike
>      >
>      > Am I allowed to put DFDL properties on the new simple type, or is the new
>      > type considered to be a built-in type?  I think the latter is clearer and
>      > simpler to implement.  Support for 'clob' would then just add a new simple
>      > type restriction 'dfdlx:clob'.
>      >
>      > Assuming that the feature makes it into a future DFDL 2.0, the schema
>      > containing the 'blob' simple type would then be in the standard DFDL
>      > namespace. That's the first example of such a schema, as this is the first
>      > time we are extending base XML Schema as opposed to defining annotations.
>      > If the new type is considered a built-in type, then this schema should be
>      > part of the DFDL 2.0 standard and read-only.
>      >
>      > Any thoughts on allowing the specification of the filename via DFDL property
>      > rather than API call?
>      >
>      > Presumably I could create a local restriction of 'dfdlx:blob'? One
>      > motivation for so doing would be to validate the length or content of my
>      > binary data. There's a problem with that though - validation works against
>      > the infoset, so the allowable facets are those applicable to xs:anyURI and
>      > would be applied to the file name, not the binary data. It also means that
>      > dfdl:lengthKind 'implicit' can't be used.  I don't see a way round this.
>      >
>      > Regards
>      >
>      > Steve Hanson
>      >
>      > IBM Hybrid Integration, Hursley, UK
>      > Architect, _IBM DFDL_
>     <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
>      > Co-Chair, _OGF DFDL Working Group_ <http://www.ogf.org/dfdl/>_
>      > __smh at uk.ibm.com_ <mailto:smh at uk.ibm.com <mailto:smh at uk.ibm.com>>
>      > tel:+44-1962-815848
>      > mob:+44-7717-378890
>      > Note: I work Tuesday to Friday
>      >
>      >
>      >
>      > From: Mike Beckerle <mbeckerle.dfdl at gmail.com
>     <mailto:mbeckerle.dfdl at gmail.com>>
>      > To: DFDL-WG <dfdl-wg at ogf.org <mailto:dfdl-wg at ogf.org>>
>      > Date: 12/07/2019 18:14
>      > Subject: [DFDL-WG] BLOB - binary large object proposal - updated
>      > Sent by: "dfdl-wg" <dfdl-wg-bounces at ogf.org <mailto:dfdl-wg-bounces at ogf.org>>
>      >
>      >
>     --------------------------------------------------------------------------------
>      >
>      >
>      >
>      > This concept, which has been discussed before, is in high demand in the
>      > Daffodil user community to enable DFDL to be used to parse image file
>      > formats. The use case is to provide uniform image-metadata access without
>      > getting bogged down in the large byte-array that makes up most of the file
>      > and would be very large (and pointless) if rendered into XML or JSON.
>      >
>      > So our proposal (which will get turned into an official Experimental
>      > feature document) has been simplified and revised and is described here:
>      >
>      >
>     _https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Binary+Large+Objects_
> 
>      >
>      >
>      > Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
>      > _www.tresys.com_ <http://www.tresys.com>
>      > Please note: Contributions to the DFDL Workgroup's email discussions are
>     subject
>      > to the _OGF Intellectual Property Policy_
>      > <http://www.ogf.org/About/abt_policies.php>
>      > --
>      >   dfdl-wg mailing list
>      > dfdl-wg at ogf.org <mailto:dfdl-wg at ogf.org>
>      > https://www.ogf.org/mailman/listinfo/dfdl-wg
>      >
>      > Unless stated otherwise above:
>      > IBM United Kingdom Limited - Registered in England and Wales with number
>     741598.
>      > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>      >
> 


