[DFDL-WG] Agenda for OGF DFDL WG call 8 September 2010 15:00UK (10:00 ET)

Tue Sep 7 10:59:38 CDT 2010

1. Current Actions 

2. xs:minLength
The spec currently states

When an element declaration specifies a default value, and has type 
xs:string, then xs:minLength must be specified and must be 1 or greater. 
It is a schema definition error otherwise.  

The process for defaults and nils means this restriction is no longer 
needed.

3. Is UTF-16 a fixed width or variable width encoding
Appendix A: About UTF-16 and Unicode Character Codes above 0xFFFF
When we define UTF-16 to be a fixed-width double-byte wide character set 
we say that each UTF-16 codepoint is represented by 2 bytes. Notice the 
careful use of the term 'codepoint' here. Unicode/ISO10646 characters can 
have character codes as large as 0x10FFFF which requires 3 bytes to store 
(21 bits actually); however in UTF-16 characters with more than 2 bytes of 
code are encoded as two codepoints, called a surrogate pair; hence, UTF-16 
is fixed-width, 2 bytes per codepoint. It is not 2 bytes per Unicode 
character. UTF-16 is really a variable-width encoding, but the characters 
that require the surrogate-pair treatment are so infrequently used that 
UTF-16 is most often treated like a 16-bit fixed-width character set. It 
is the acknowledgement of the existence of surrogate pairs that leads to 
the 'codepoint' vs. 'character code' distinction.
UTF-32 is a fixed width encoding with a full 4-bytes per character code. 
It represents all of Unicode with the same width per character.
Hence, when we refer to lengths in character strings we will often refer 
to length in characters, but we qualify that it means 2-byte codepoints 
when the character set encoding is UTF-16. Hence, when the property 
lengthUnitKind is 'characters' and the charset is 'UTF-16', then the units 
are actually 16-bit codepoints, not Unicode characters. 
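
The distinction drawn above can be illustrated with a short Python sketch. It is not part of the spec text, and the helper name utf16_code_units is made up for this illustration. It compares the number of Unicode characters in a string with the number of 16-bit units needed to hold it in UTF-16, using one character above 0xFFFF.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to encode text as UTF-16 (no BOM)."""
    return len(text.encode("utf-16-le")) // 2


# Two characters, both at or below 0xFFFF: one 16-bit unit each.
bmp = "A\u20ac"

# 'A' plus U+1D11E, which is above 0xFFFF and needs a surrogate pair.
astral = "A\U0001D11E"

print(len(bmp), utf16_code_units(bmp))        # 2 2
print(len(astral), utf16_code_units(astral))  # 2 3
```

With lengthUnitKind 'characters' and charset 'UTF-16', the wording above implies the second string would be measured as 3 units rather than 2 characters.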

Current Actions:
No.  Action
066
Investigate format for defining test cases
25/11:IBM to see if it is possible to publish its test case format.
04/12: no update
...
17/02: IBM is willing in principle to publish the test case format and 
some of the test cases. May need some time to build a 'compliance suite'
24/03: No progress
03/03: Discussions have been taking place on the subset of tests that will 
be provided.
10/03: work is progressing
17/03: work is progressing
31/03: work is progressing
14/04: An XML test case format has been defined and is being tested.
21/04: Schema for TDML defined. Need to define how this and the test cases 
will be made public.
05/05: Work still progressing
12/05: Work still progressing
02/06: Work still progressing on technical and legal considerations
...
25/08: Will chase to allow Daffodil access to test cases. The WG should 
define how implementations confirm that they 'conform to DFDL v1'.
01/09: IBM still progressing the legal aspect. Intends to publish 100 or 
so tests as soon as it can, ahead of a full compliance suite.
085
ALL: publicise the public comments phase to ensure a good review.
14/04: see minutes
21/04: Press release, OMG and other standards bodies.
05/05: Alan and Steve H have contacted other standards bodies. Will ask 
them to add comments on spec
15/05: still no public comments
02/06: No public comments
16/06: Public comments period has ended with no external comments. Alan 
had posted changes made in draft 041. Steve suggested send a note to the 
WG highlighting these changes.  Steve also suggested requesting an 
extension as other IBM groups may review. We discussed whether this was 
necessary as changes will need to be made during the implementation phase 
anyway. Alan to ask OGF what the process is for changes post public 
comment.
23/06: Still no comments. Alan will contact OGF to understand the rest of 
the process.
30/06: Alan has emailed Joel asking what the process is now public comment 
period is over and can we update the published version with WG updates. No 
response yet.
07/07: No response. Alan will chase up
14/07: No response from Joel. Sent email to Greg Newby but no response.
21/07: Still no response.
04/08: Joel has responded that it is up to the WG to decide if the changes 
are significant enough to need additional review. Alan to contact David 
Martin and Erwin Laure for guidance if we split the specification.
11/08: Received a response from Joel that the WG can decide if a public 
re-review is necessary before becoming a 'proposed recommendation'. 
Alan responded that the WG agreed that a re-review was not necessary. The 
next stage is for the OGF review committee to approve publication.
11/08: Specification is now 'awaiting author changes' before being 
submitted to the OGF technical committee for approval as a 'proposed 
specification'.
Alan would like to have the updated specification complete by Sept 10th. 
The WG needs to complete all actions by then or decide that they do not 
need to be included in this phase of the process.
01/09: Alan and Steve have discussed and propose Sept 30th for completion 
of draft 43 and closure of all actions.
099
Splitting the specification into simpler sections.
07/07: Steve sent a proposal but not discussed. Alan will arrange a 
separate call.
14/07: Discussed Steve's proposal and Suman's and Alan's comments.
Need to add choice, validation, facets.
Also, how does an implementation declare which subsets it supports? 
Suggested levels and/or profiles. Steve highlighted a problem: when a DFDL 
schema from an implementation of just the core functions is moved to a 
full DFDL implementation, what should happen about the missing properties? 
Does the full implementation need to be aware of subsets of functions? 
Should it raise a schema definition error for use of a function not in the 
subset?
21/07: no progress
04/08: Steve had updated the proposed groups of function 
(Subset_proposal_v2.ppt). We discussed whether it is better to have 
discrete sets of functions or expanding levels of function. 
Purpose of subsetting is:
1. Allow simpler implementations.  (main purpose)
2. Simplify tooling
3. Simplify specification. 
Steve to contact previous members of WG to check if we have the correct 
subsets
11/08: Steve sent an email to previous members of the WG asking for 
opinions on splitting the specification. Bob McGrath from the National Center 
for Supercomputing Applications (NCSA) responded that they had implemented about 80% of the 
function. Alejandro will send a description of the function they have 
implemented.
Action will be raised to track the Daffodil implementation
11/08: not discussed
01/09: NCSA implementation description received. Making the unparser 
optional is a good idea (NCSA do not need one). Work will progress on the 
subsets. 
101
Semantics of 'fixed' 
21/07: Discussed whether not matching the 'fixed' value should be a 
validation error or processing error. Decided that for consistency it 
should be a validation error.
It would be useful, however, to avoid having to duplicate facet 
information in an assert, which could become unwieldy for, say, a large 
enumeration.
Suggestions
- a parser option that 'converted all validation errors to processing 
errors'
- a dfdl expression function that  'applied all facets' or 'applied 
specific facet' to a particular element.
Stephanie will produce some examples of how this could be used.
04/08: Stephanie had produced examples but they were not discussed due to 
lack of time
11/08: We started to discuss Stephanie's HIPAA example but ran out of 
time.
25/08: Not discussed
01/09: Discuss next week 
107
teston/testoff dfdl expression functions.
Are these functions still needed? They were introduced to allow individual 
bits to be set in a byte. Steve to look at TLog and ISO 8583 formats that 
use existence flags to see if they are still required.
04/08: Not discussed
11/08: Not discussed
25/08: Not discussed 
01/09: Steve to progress by Sept 30th
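
For background on what setting and testing individual bits in a byte involves, here is a minimal Python sketch. The minutes do not give the signatures or semantics of teston/testoff, so the function names, the bit numbering (0 = least significant) and the flag layout below are all assumptions made for illustration; this is not the DFDL definition and not the actual ISO 8583 or TLog layout.

```python
def bit_is_on(byte: int, position: int) -> bool:
    """True if the bit at position (0 = least significant) is set."""
    return ((byte >> position) & 1) == 1


def set_bit(byte: int, position: int) -> int:
    """Return byte with the bit at position switched on."""
    return byte | (1 << position)


def clear_bit(byte: int, position: int) -> int:
    """Return byte with the bit at position switched off (kept to 8 bits)."""
    return byte & ~(1 << position) & 0xFF


# Hypothetical existence flags: one bit per optional field.
flags = 0b00000000
flags = set_bit(flags, 7)
flags = set_bit(flags, 1)

print(format(flags, "08b"))                      # 10000010
print(bit_is_on(flags, 7), bit_is_on(flags, 0))  # True False
print(format(clear_bit(flags, 7), "08b"))        # 00000010
```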
108
dfdl:hidden 
There has been some discussion on whether the 'hidden' global group should 
be indicated in some way.
04/08: A lively discussion. The specification works as currently 
defined, so the question is whether changes need to be made to make tooling 
easier. There shouldn't be 'conventions' in particular tooling, as tools must 
be able to properly deal with schemas from other tools that would not obey 
those conventions. Steve stated that it is often dangerous to hide too much 
from users when they can see the underlying schema. To be continued.
25/08: There have been some offline discussions about simplifying how 
hidden elements are implemented. The proposal is:
- dfdl:hidden property on xs:element only.
- xs:minOccurs and xs:maxOccurs MUST be 0 when hidden.
- dfdl:minOccurs and dfdl:maxOccurs for hidden elements only.
- An element is 'required' when dfdl:minOccurs > 0, and normal default 
processing occurs.
The schema, without DFDL annotations, must match the infoset, so the 
assumption is that non-DFDL tools, such as mappers, will ignore/not show 
elements with xs:minOccurs and xs:maxOccurs = '0'.
01/09: The above proposal is flawed due to use of maxOccurs = 0 (this was 
identified back in 2008 hence current spec). 
Bob confirmed that NCSA models use hidden in a big way, so punting hidden 
beyond 1.0 is not an option. 
Two candidates:
- As per spec but with syntactic improvements to make it clear that the 
two xs:sequences do not take any dfdl:sequence properties
- Place a flag directly on a local element and force minOccurs to be 0. 
Simpler syntax but the semantics change, as the element *could* be legally 
in the infoset, although a DFDL parser would never put it there.
Steve will circulate the two proposals for next week. 
Bob to talk to Alejandro as the NCSA implementation is currently more 
flexible than the spec, allowing the groupref to point to a choice, and an 
elementref. Are these really needed?
111
Daffodil DFDL parser
11/08: Bob and Alejandro described the new implementation that they have 
developed. It is a new code base and is not based on the Defuddle 
prototype. It is written in Scala and implements approximately 80% of the 
features in the public comments draft of DFDL V1. Alejandro will send a 
list of the features not implemented.
We discussed the scenarios that motivated the development, which were to 
extract data from various sources and transform it into canonical formats.
Bob offered to make Daffodil available for the WG to assess the 
functionality. IBM WG members will seek approval from the company to allow 
them to receive Daffodil.
Bob raised the question that if Daffodil becomes the public implementation 
of DFDL then we will need to work out how that would be funded and 
managed.
It would be helpful if IBM test cases were available to Daffodil. IBM will 
investigate.
25/08: Alejandro had sent a list of the functions that he has implemented 
and Steve had responded, indicating the extra functions he thought were 
essential.
Since then Alejandro has implemented some of the missing functions, such 
as escape schemes, pre-defined variables, binary decimal numbers, etc, and 
will update his list.
Bob is planning to make the parser available on the internet to allow 
testing.
His organisation is being reorganised and he doesn't know what the 
priority of Daffodil will be, so it is essential that we move quickly. It 
would help if IBM could indicate its support for Daffodil in some 
semi-formal way.
01/09: Alejandro updating Daffodil to include escape schemes, unordered 
sequences and ignoreCase.
Daffodil being placed under formal source control in anticipation of 
external release.
Bob has a start-of-October deadline to create a report on what has been done 
for his sponsors.
It would be great if we could get Daffodil on the web and have run some 
IBM tests so it could be highlighted at OGF 30 at the end of October.
112
DFDL certification process
25/08: Discussed how to certify DFDL implementations. Alan to investigate 
if OGF have a defined process.
01/09: In progress. As part of this work, the spec needs to state what 
conformance means.
113
Regular Expressions.
25/08: The DFDL regular expressions should provide lookahead and 
backreferences. Is the current regular expression language sufficient? 
Discussed two aspects: 
a. Is the XML regular expression language the correct one to use? Tim 
asked if DFDL needs to specify a language at all or should leave it to 
implementers to pick one. That would inhibit portability of schemas. 
b. A regular expression property on an assert/discriminator as an 
alternative to the test expression. Either a DFDL expression or a regular 
expression could be specified but not both.
01/09: There are many variations of regexp language. It seems wise to 
specify one that we know contains functions like lookaround, which makes 
it easy to say things like 'give me everything up to but not including x'. 
This rules out XML Schema and POSIX; it needs Perl 5 or Java. 
Tim to convince Steve (via example) that use of regexp in asserts is 
needed in 1.0.
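
As an illustration of the 'everything up to but not including x' case, here is a small Python sketch using a lookahead. Python's re module is used only as a stand-in for a Perl 5 or Java style regexp language, and the helper name up_to_marker is hypothetical. This is not a proposed DFDL syntax.

```python
import re

# Lazy match followed by a lookahead: consume as little as possible and
# stop just before the first 'x', without consuming the 'x' itself.
UP_TO_X = re.compile(r".*?(?=x)", re.DOTALL)


def up_to_marker(data: str):
    """Return the text before the first 'x', or None if there is no 'x'."""
    match = UP_TO_X.match(data)
    return match.group(0) if match else None


print(up_to_marker("abc;defxghi"))  # abc;def
print(up_to_marker("no marker"))    # None
```

The XML Schema regular expression language has no lookaround construct, which is why the same pattern cannot be written there.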
114
OGF 30
25/08: OGF30 takes place on October 25-29 in Brussels.  Should we have a 
WG session?
01/09: Given the emergence of the NCSA implementation and the spec completion 
target of 30th Sept, it makes sense to host a session at OGF 30. 

Regards

Alan Powell

Development - MQSeries, Message Broker, ESB
IBM Software Group, Application and Integration Middleware Software
-------------------------------------------------------------------------------------------------------------------------------------------
IBM
MP211, Hursley Park
Hursley, SO21 2JN
United Kingdom
Phone: +44-1962-815073
e-mail: alan_powell at uk.ibm.com
