[DFDL-WG] BLOB - binary large object proposal - updated

Steve Hanson smh at uk.ibm.com
Thu Aug 29 05:25:02 EDT 2019


I prefer use of xs:anyURI; it gives a clear indication that this is a 
reference to the data and not the data itself.

I prefer dfdl:objectKind - the object is not necessarily large, the author 
might want a reference for other reasons. 

Regards
 
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh at uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday 



From:   "Lawrence, Stephen" <slawrence at tresys.com>
To:     "mbeckerle.dfdl at gmail.com" <mbeckerle.dfdl at gmail.com>
Cc:     "dfdl-wg at ogf.org" <dfdl-wg at ogf.org>
Date:   12/08/2019 12:16
Subject:        Re: [DFDL-WG] BLOB - binary large object proposal - updated
Sent by:        "dfdl-wg" <dfdl-wg-bounces at ogf.org>



dfdl:largeObjectKind definitely keeps things simple, but does lose
flexibility (e.g. maxLength). But that may not be needed. I'm in favor
of this.

However, one drawback of using xs:string instead of xs:anyURI relates to
our TDML test rig. If we make the type of blobs/clobs an xs:string and
have it be an opaque identifier then it makes it difficult for our TDML
runner to know how to compare actual vs expected blobs. For example:

  <data xsi:type="xs:string">some unique identifier</data>

In this case, the TDML test rig must be schema aware to know that data
is not a string and is actually an opaque identifier. And it must know
how to use that unique identifier to look up the bytes to do
expected/actual comparisons.

By making the type an xs:anyURI and requiring that the identifier is a
URI, a TDML runner does not need any knowledge of the schema. Since the
xsi:type is an anyURI, it can infer that this must be a blob/clob, and
then it can open the URI to determine the bytes and easily compare
expected vs actual blobs.

And this applies to anyone accessing the infoset as well--not just our
TDML runner. Using a type of xs:anyURI provides a hint to infoset users
that an element shouldn't be treated like a string, but as a blob handle.
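
To make the comparison above concrete, here is a rough Python sketch of a
schema-unaware check. It is not Daffodil's TDML runner; the names leaf_value,
infosets_equal, and demo are made up for illustration, and it assumes blob
handles are file: URIs and that xsi:type carries the literal prefix "xs:".

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def leaf_value(elem):
    """Return a comparable value for a leaf element of the infoset."""
    text = (elem.text or "").strip()
    # Assumption: the prefix is literally "xs:"; a real runner would resolve
    # the prefix against the in-scope namespace declarations.
    if elem.get(XSI_TYPE) == "xs:anyURI":
        # Blob handle: compare the referenced bytes, not the URI text.
        # Assumption: the handle is a file: URI on the local filesystem.
        path = url2pathname(urlparse(text).path)
        return Path(path).read_bytes()
    return text


def infosets_equal(expected, actual):
    """Compare two infoset trees, dereferencing xs:anyURI leaves."""
    if expected.tag != actual.tag or len(expected) != len(actual):
        return False
    if len(expected) == 0:
        return leaf_value(expected) == leaf_value(actual)
    return all(infosets_equal(e, a) for e, a in zip(expected, actual))


def demo():
    """Two different blob files with identical bytes compare as equal."""
    with tempfile.TemporaryDirectory() as tmp:
        expected_blob = Path(tmp, "expected.bin")
        actual_blob = Path(tmp, "actual.bin")
        expected_blob.write_bytes(b"\x89PNG")
        actual_blob.write_bytes(b"\x89PNG")
        template = (
            '<data xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
            'xsi:type="xs:anyURI">{}</data>'
        )
        expected = ET.fromstring(template.format(expected_blob.as_uri()))
        actual = ET.fromstring(template.format(actual_blob.as_uri()))
        return infosets_equal(expected, actual)
```

The point of the sketch is that the xsi:type alone tells the comparison to
dereference the value; with xs:string the same code would have to consult the
schema to know the text is a handle.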

- Steve


On 8/9/19 10:25 AM, Mike Beckerle wrote:
> 
> My suggestions based on this thread are:
> 
> I think the dfdlx:blob type is problematic, and we should avoid it in favor
> of an xs:string with a dfdlx:largeObjectKind property.
> 
> I think this should not be a "Type" as in string or hexBinary, because
> hexBinary is such a misleading term, suggesting textualization, etc. There
> is nothing "hex" about a BLOB, ever.
> 
> I think dfdlx:largeObjectKind="bytes/chars/none" with none the "default" for
> now, and "chars" as a future capability for character large objects if they
> prove important.
> I could be convinced other enums are better than bytes or chars for this.
> E.g., BLOB, CLOB might be better. Or perhaps this is
> dfdl:largeObjectRep="binary/text/none" analogous to the dfdl:representation
> property?
> 
> The use of xs:anyURI is unnecessary, and is not a type we have in DFDL as
> yet. People should treat this string as opaque. The fact that it is
> potentially a meaningful URI is not relevant, and can be an implementation
> detail.
> 
> I think dfdl:largeObjectDirectory="{ $dfdlx:largeObjectDirectory }" is a
> nice idea to save for the future. We may find that numerous other parameters
> are required, so I'd prefer not to predefine this one in advance of clearer
> direction or whether there are others.
> 
> The other thing observed on yesterday's DFDL WG call was that this has some
> overlap with the offset/pointer stuff. Unparsing from a blob file is an
> awful lot like data-source indirection where the source of unparsing is
> coming from a scattered data structure that is being gathered. There is
> some conceptual similarity anyway. Not sure how deep this goes or if it is
> just a superficial observation. And I would not suggest waiting for that to
> be figured out before proceeding with this experimental BLOB feature.
> 
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> www.tresys.com <http://www.tresys.com>
> Please note: Contributions to the DFDL Workgroup's email discussions are
> subject to the OGF Intellectual Property Policy
> <http://www.ogf.org/About/abt_policies.php>
> 
> 
> 
> On Thu, Aug 8, 2019 at 8:40 AM Lawrence, Stephen <slawrence at tresys.com> wrote:
> 
>     The intention was that this new type would be an internal built-in type
>     and so no extra properties could be placed on the new simple type. One
>     drawback that I'm realizing as I implement this feature in Daffodil is
>     that in order to use non-DFDL-aware XML validation tools to validate the
>     XML infoset, you need to provide and xs:import this new DFDL schema that
>     defines the dfdlx:blob type, which feels a little awkward to me for
>     something that's considered a built-in for DFDL processors.
> 
>     Maybe an alternative would be to not have a dfdlx:blob type, and to allow
>     the use of the xs:anyURI type for simple elements, with the implication
>     that we treat the element as if it were xs:hexBinary except for the
>     infoset/blob output. This doesn't easily support CLOBs, but a new DFDL
>     property could determine how an xs:anyURI should be interpreted, e.g.:
> 
>        <xs:element name="myBlobData" type="xs:anyURI"
>          dfdl:largeObjectType="xs:hexBinary" ... />
> 
>        <xs:element name="myClobData" type="xs:anyURI"
>          dfdl:largeObjectType="xs:string" ... />
> 
>     So a type of xs:anyURI implies this is going to be some kind of large
>     object representation, and it requires the dfdl:largeObjectType property,
>     which must reference a simple type that defines how the content should be
>     turned into a large object. This might also help to support
>     restrictions on the blob data, as well as implicit lengths, e.g.:
> 
>        <xs:simpleType name="blob10">
>          <xs:restriction base="xs:hexBinary">
>            <xs:maxLength value="20" />
>          </xs:restriction>
>        </xs:simpleType>
> 
>        <xs:element name="data" type="xs:anyURI" dfdl:objectType="blob10"
>          dfdl:lengthKind="implicit" />
> 
>     DFDL properties could be placed on either the element or the objectType
>     simpleType, with the base type of dfdl:largeObjectType determining which
>     properties are valid/interpreted, rather than the element type (which
>     must be anyURI).
> 
>     But maybe this all adds unnecessary complexity?
> 
> 
>     Regarding specifying the filename via a DFDL property rather than API,
>     we have use cases where each parse would need to output to a different
>     directory, so a property might cause problems with this. But perhaps this
>     could be handled by a variable, e.g.:
> 
>        <xs:element name="data" type="dfdlx:blob"
>          dfdl:blobDirectory="{ $blobDir }" ... />
> 
>     That said, we had additional use cases where a DFDL blobDirectory
>     property would be too restrictive. For example, maybe the blobs should
>     be put into a database, or pushed to a data store in the cloud, stored
>     in local memory, or not stored anywhere at all but with a special URI
>     with offset+length to the original data. We chose to ignore these
>     use cases for simplicity, but these different options would probably
>     require a flexible API to support. By going with an API to specify the
>     output directory, it makes it a bit easier to support these different
>     blob outputs in the future if they are needed.
> 
> 
>     On 8/8/19 5:09 AM, Steve Hanson wrote:
>      > Mike
>      >
>      > Am I allowed to put DFDL properties on the new simple type, or is the
>      > new type considered to be a built-in type?  I think the latter is
>      > clearer and simpler to implement.  Support for 'clob' would then just
>      > add a new simple type restriction 'dfdlx:clob'.
>      >
>      > Assuming that the feature makes it into a future DFDL 2.0, the schema
>      > containing the 'blob' simple type would then be in the standard DFDL
>      > namespace. That's the first example of such a schema, as this is the
>      > first time we are extending base XML Schema as opposed to defining
>      > annotations. If the new type is considered a built-in type, then this
>      > schema should be part of the DFDL 2.0 standard and read-only.
>      >
>      > Any thoughts on allowing the specification of the filename via DFDL
>      > property rather than API call?
>      >
>      > Presumably I could create a local restriction of 'dfdlx:blob'? One
>      > motivation for so doing would be to validate the length or content of
>      > my binary data. There's a problem with that though - validation works
>      > against the infoset, so the allowable facets are those applicable to
>      > xs:anyURI and would be applied to the file name, not the binary data.
>      > It also means that dfdl:lengthKind 'implicit' can't be used.  I don't
>      > see a way round this.
>      >
>      > Regards
>      >
>      > Steve Hanson
>      >
>      >
>      >
>      >
>      > From: Mike Beckerle <mbeckerle.dfdl at gmail.com>
>      > To: DFDL-WG <dfdl-wg at ogf.org>
>      > Date: 12/07/2019 18:14
>      > Subject: [DFDL-WG] BLOB - binary large object proposal - updated
>      > Sent by: "dfdl-wg" <dfdl-wg-bounces at ogf.org>
>      >
>      >
> 
--------------------------------------------------------------------------------
>      >
>      >
>      >
>      > This concept, which has been discussed before, is in high demand in
>      > the Daffodil user community to enable DFDL to be used to parse image
>      > file formats. The use case is to provide uniform image-metadata access
>      > without getting bogged down in the large byte-array that makes up most
>      > of the file and would be very large (and pointless) if rendered into
>      > XML or JSON.
>      >
>      > So our proposal (which will get turned into an official Experimental
>      > feature document) has been simplified and revised and is described
>      > here:
>      >
>      > https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Binary+Large+Objects
>      >
>      >
>      > --
>      >   dfdl-wg mailing list
>      > dfdl-wg at ogf.org
>      > https://www.ogf.org/mailman/listinfo/dfdl-wg
>      >
>      > Unless stated otherwise above:
>      > IBM United Kingdom Limited - Registered in England and Wales with
>      > number 741598.
>      > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire
>      > PO6 3AU
>      >
>      >
>      >
> 
> 




Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU