[occi-wg] Preliminary findings on query paging
Gary Mazz
garymazzaferro at gmail.com
Wed Nov 14 11:39:34 EST 2012
Hi
This is a preliminary assessment of paging for OCCI/JSON on HTTP/REST.
This is not a completed list of findings.
There was some debate and interest in categorizing the different methods
used to implement a query page mechanism. The I've examined several
specifications and standards sourced from SNIA, IETF, OMG, OGF, W3C ANSI
and others. The odata specification is not considered due to protected
proprietary technologies. Pertinent specifications an excepts are noted
below.
The preliminary findings discovered 9 methods for controlling the
delivery of page results from query operations.
1. HTTP Range/Content-Range header (HTTP 1.1)
2. OFFSET/LIMIT parameters (SPARQL)
3. RANGE predicate (Xquery/Xpath)
4. pagedResultsControl.size/Cookie (LDAP)
5. Links to pages (ATOM)
6. LIMIT<offset>, <rowcount> (SQL)
7. TOP <rowcount> (SQL)
8. WHERE ROWNUM eval Value (SQL)
9. OFFSET <value> (SQL)
We can distill this list down to four (4) basic approaches
1. Query parameters specifying result row Range (start_offset, end_offset)
2. Query parameters specifying result row offset and count
3. Query parameters specifying count, assuming first query and a opaque
handle for context
4. Links to pages
NOTE: We could force further consolidation of items one (1) and two (2)
on the list above. I elected to separate each item based on the
uniqueness of their semantics.
Some additional information:
There seems to be a growing trend in the html5 communities to adopt the
use of the http 1.1 RANGE header to provide query pagination. For
example, the dojo javascript library.
An open, patent free approach to query pagination for URI query
expression does not seems to exist.
If OOCI/JSON elects to adopt that approach, it will need clear
definition describing the behaviors surrounding insertions, deletes and
modification of OCCI information records. I have included below an
example document showing a the level detail we should consider for the
OCCI specification.
Since REST is a context free approach to API and there a no ordering and
sequence constraints defined for the delivery, modification and
insertion of OCCI information, there may be opportunities where
duplicate and missing OCCI information can occur. We should also
consider a well defined approach to reconcile inconsistencies in paged
operations.
I'll be following up with more information so we can develop
recommendations to fully resolve paging issues for URI, HTTP and OCCI
operations in payload data.
cheers,
gary mazzaferro
http://snia.org/sites/default/files/CDMI%20v1.0.2.pdf
Cloud Data Management Interface
(CDMI^(TM)) Version 1.0.2
5.13.3 Range Support
The server shall support HTTP Range headers and partial content
responses (see Section 14.16 of RFC
2616).
http://www.w3.org/TR/DOM-Level-2-Traversal-Range/traversal.html
Document Object Model (DOM) Level 2 Traversal and Range Specification
Version 1.0
W3C Recommendation 13 November, 2000
1. Document Object Model Traversal
1.1. Overview
1.1.1. NodeIterators
1.1.2. NodeFilters
1.1.3. TreeWalker
1.2. Formal Interface Definition NodeIterator, NodeFilter,
TreeWalker, DocumentTraversal
2. Document Object Model Range
2.1. Introduction
2.2. Definitions and Notation
2.2.1. Position
2.2.2. Selection and Partial Selection
2.2.3. Notation
2.3. Creating a Range
2.4. Changing a Range's Position
2.5. Comparing Range Boundary-Points
2.6. Deleting Content with a Range
2.7. Extracting Content
2.8. Cloning Content
2.9. Inserting Content
2.10. Surrounding Content
2.11. Miscellaneous Members
2.12. Range modification under document mutation
2.12.1. Insertions
2.12.2. Deletions
2.13. Formal Description of the Range Interface Range,
DocumentRange, RangeException, RangeExceptionCode
http://www.w3.org/TR/xquery/
XQuery 1.0: An XML Query Language (Second Edition)
W3C Recommendation 14 December 2010 (Link errors corrected 3 January
2011)
2.1 Path Expressions
XQuery path expressions use the abbreviated syntax of XPath,
extended with a new "dereference" operator and a new type of
predicate called a "range predicate" (described below). XPath syntax
is used in several XML-related applications including XSLT [XSLT]
and XPointer [XPointer].
In XQuery, the result of a path expression is an ordered list of
nodes (of course, each node includes its descendant nodes, so the
result can be thought of as an ordered forest.) The top-level nodes
in the path expression result are ordered according to their
position in the original hierarchy, in top-down, left-to-right
order. The result of a path expression may contain duplicate values
(i.e., multiple nodes with the same type and content), but it will
not contain duplicate nodes (i.e., multiple nodes with the same node
identity).
A path expression consists of a series of steps. Each step
represents movement through a document in a particular direction,
and each step can apply one or more predicates to eliminate nodes
that fail to satisfy a given condition. The result of each step is a
list of nodes that serves as a starting point for the next step.
A path expression can begin with an expression that identifies a
specific node, such as the function document(string), which returns
the root node of a named document. A query can also contain a path
expression beginning with "/" or "//" which represents an implicit
root node, determined by the environment in which the query is executed.
A complete discussion of XPath abbreviated syntax can be found in
[XPath]. Briefly, the following symbols are used:
. Denotes the current node.
.. Denotes the parent of the current node.
/ Denotes the root node, or a separator between steps in a path.
// Denotes descendants of the current node.
@ Denotes attributes of the current node.
* Denotes "any" (node with unrestricted name).
[ ] Brackets enclose a Boolean expression that serves as a
predicate for a given step.
[n] When a predicate consists of an integer, it serves to select
the element with the given ordinal number from a list of elements.
......
It is sometimes desirable to isolate a list of elements whose
ordinal numbers span a certain range. For this purpose, XQuery
provides a RANGE predicate that is adapted from XQL [XQL], as
illustrated in the following example. The first element in a list
has ordinal number 1. The ordinal numbers of elements in a list are
not affected by the presence of other types of nodes such as
comments or processing instructions.
(Q2) Find all the figures in chapters 2 through 5 of the document
named "zoo.xml."
document("zoo.xml")/chapter[RANGE 2 TO 5]//figure
In addition to the operators of the XPath abbreviated syntax, XQuery
introduces an operator called the dereference operator ("->"). When
a dereference operator follows an IDREF-type attribute, it returns
the element(s) that are referenced by the attribute. A dereference
operator is followed by a "name test" that specifies the target
element. Following the usual XPath convention, a name test of "*"
allows the target element to be of any type.
2.1.2 Dynamic Context
[Definition: The dynamic context of an expression is defined as
information that is available at the time the expression is
evaluated.] If evaluation of an expression relies on some part of
the dynamic context that has not been assigned a value, a dynamic
error is raised [err:XPDY0002].
The individual components of the dynamic context are summarized
below. Further rules governing the semantics of these components can
be found in C.2 Dynamic Context Components.
The dynamic context consists of all the components of the static
context, and the additional components listed below.
[Definition: The first three components of the dynamic context
(context item, context position, and context size) are called the
focus of the expression. ] The focus enables the processor to keep
track of which items are being processed by the expression.
Certain language constructs, notably the path expression E1/E2 and
the predicate E1[E2], create a new focus for the evaluation of a
sub-expression. In these constructs, E2 is evaluated once for each
item in the sequence that results from evaluating E1. Each time E2
is evaluated, it is evaluated with a different focus. The focus for
evaluating E2 is referred to below as the inner focus, while the
focus for evaluating E1 is referred to as the outer focus. The inner
focus exists only while E2 is being evaluated. When this evaluation
is complete, evaluation of the containing expression continues with
its original focus unchanged.
[Definition: The context item is the item currently being processed.
An item is either an atomic value or a node.][Definition: When the
context item is a node, it can also be referred to as the context
node.] The context item is returned by an expression consisting of a
single dot (.). When an expression E1/E2 or E1[E2] is evaluated,
each item in the sequence obtained by evaluating E1 becomes the
context item in the inner focus for an evaluation of E2.
[Definition: The context position is the position of the context
item within the sequence of items currently being processed.] It
changes whenever the context item changes. When the focus is
defined, the value of the context position is an integer greater
than zero. The context position is returned by the expression
fn:position(). When an expression E1/E2 or E1[E2] is evaluated, the
context position in the inner focus for an evaluation of E2 is the
position of the context item in the sequence obtained by evaluating
E1. The position of the first item in a sequence is always 1 (one).
The context position is always less than or equal to the context size.
[Definition: The context size is the number of items in the sequence
of items currently being processed.] Its value is always an integer
greater than zero. The context size is returned by the expression
fn:last(). When an expression E1/E2 or E1[E2] is evaluated, the
context size in the inner focus for an evaluation of E2 is the
number of items in the sequence obtained by evaluating E1.
Example: XPath="newsSection/news[position() < 6]
http://www.w3.org/TR/sparql11-query/#sparqlOffsetLimit
SPARQL 1.1 Query Language
W3C Proposed Recommendation 08 November 2012
18.2.5.5 OFFSET and LIMIT
If the query contains "OFFSET start" or "LIMIT length"
M := Slice(M, start, length)
start defaults to 0
length defaults to (size(M)-start).
http://www.ietf.org/rfc/rfc2696.txt
Network Working Group
Request for Comments: 2696
LDAP Control Extension for Simple Paged Results Manipulation
2. The Control
This control is included in the searchRequest and searchResultDone
messages as part of the controls field of the LDAPMessage, as defined
in Section 4.1.12 of [LDAPv3]. The structure of this control is as
follows:
pagedResultsControl ::= SEQUENCE {
controlType 1.2.840.113556.1.4.319,
criticality BOOLEAN DEFAULT FALSE,
controlValue searchControlValue
}
The searchControlValue is an OCTET STRING wrapping the BER-encoded
version of the following SEQUENCE:
realSearchControlValue ::= SEQUENCE {
size INTEGER (0..maxInt),
-- requested page size from client
-- result set size estimate from server
cookie OCTET STRING
}
3. Client-Server Interaction
An LDAP client application that needs to control the rate at which
results are returned MAY specify on the searchRequest a
pagedResultsControl with size set to the desired page size and cookie
set to the zero-length string. The page size specified MAY be greater
than zero and less than the sizeLimit value specified in the
searchRequest.
If the page size is greater than or equal to the sizeLimit value, the
server should ignore the control as the request can be satisfied in a
single page. If the server does not support this control, the server
MUST return an error of unsupportedCriticalExtension if the client
requested it as critical, otherwise the server SHOULD ignore the
control. The remainder of this section assumes the server does not
ignore the client's pagedResultsControl.
Each time the server returns a set of results to the client when
processing a search request containing the pagedResultsControl, the
server includes the pagedResultsControl control in the
searchResultDone message. In the control returned to the client, the
size MAY be set to the server's estimate of the total number of
entries in the entire result set. Servers that cannot provide such an
estimate MAY set this size to zero (0). The cookie MUST be set to an
empty value if there are no more entries to return (i.e., the page of
search results returned was the last), or, if there are more entries
to return, to an octet string of the server's choosing,used to resume
the search.
The client MUST consider the cookie to be an opaque structure and
make no assumptions about its internal organization or value. When
the client wants to retrieve more entries for the result set, it MUST
send to the server a searchRequest with all values identical to the
initial request with the exception of the messageID, the cookie, and
optionally a modified pageSize. The cookie MUST be the octet string
on the last searchResultDone response returned by the server.
Returning cookies from previous searchResultDone responses besides
the last one is undefined, as the server implementation may restrict
cookies from being reused.
http://tools.ietf.org/html/rfc2616
RFC 2616 HTTP/1.1 June 1999
13.5.4 Combining Byte Ranges
A response might transfer only a subrange of the bytes of an entity-
body, either because the request included one or more Range
specifications, or because a connection was broken prematurely.
After
several such transfers, a cache might have received several
ranges of
the same entity-body.
If a cache has a stored non-empty set of subranges for an
entity, and
an incoming response transfers another subrange, the cache MAY
combine the new subrange with the existing set if both the following
conditions are met:
- Both the incoming response and the cache entry have a cache
validator.
- The two cache validators match using the strong comparison
function (see section 13.3.3).
If either requirement is not met, the cache MUST use only the most
recent partial response (based on the Date values transmitted with
every response, and using the incoming response if these values are
equal or missing), and MUST discard the other partial information.
14.16 Content-Range
The Content-Range entity-header is sent with a partial
entity-body to
specify where in the full entity-body the partial body should be
applied. Range units are defined in section 3.12.
Content-Range = "Content-Range" ":" content-range-spec
content-range-spec = byte-content-range-spec
byte-content-range-spec = bytes-unit SP
byte-range-resp-spec "/"
( instance-length | "*" )
byte-range-resp-spec = (first-byte-pos "-" last-byte-pos)
| "*"
instance-length = 1*DIGIT
The header SHOULD indicate the total length of the full entity-body,
unless this length is unknown or difficult to determine. The
asterisk
"*" character means that the instance-length is unknown at the time
when the response was generated.
Unlike byte-ranges-specifier values (see section 14.35.1), a byte-
range-resp-spec MUST only specify one range, and MUST contain
absolute byte positions for both the first and last byte of the
range.
A byte-content-range-spec with a byte-range-resp-spec whose last-
byte-pos value is less than its first-byte-pos value, or whose
instance-length value is less than or equal to its last-byte-pos
value, is invalid. The recipient of an invalid byte-content-range-
spec MUST ignore it and any content transferred along with it.
A server sending a response with status code 416 (Requested
range not
satisfiable) SHOULD include a Content-Range field with a byte-range-
resp-spec of "*". The instance-length specifies the current
length of
the selected resource. A response with status code 206 (Partial
Content) MUST NOT include a Content-Range field with a byte-range-
resp-spec of "*".
Examples of byte-content-range-spec values, assuming that the entity
contains a total of 1234 bytes:
. The first 500 bytes:
bytes 0-499/1234
. The second 500 bytes:
bytes 500-999/1234
. All except for the first 500 bytes:
bytes 500-1233/1234
. The last 500 bytes:
bytes 734-1233/1234
When an HTTP message includes the content of a single range (for
example, a response to a request for a single range, or to a request
for a set of ranges that overlap without any holes), this content is
transmitted with a Content-Range header, and a Content-Length header
showing the number of bytes actually transferred. For example,
HTTP/1.1 206 Partial content
Date: Wed, 15 Nov 1995 06:25:24 GMT
Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT
Content-Range: bytes 21010-47021/47022
Content-Length: 26012
Content-Type: image/gif
When an HTTP message includes the content of multiple ranges (for
example, a response to a request for multiple non-overlapping
ranges), these are transmitted as a multipart message. The multipart
media type used for this purpose is "multipart/byteranges" as
defined
in appendix 19.2. See appendix 19.6.3 for a compatibility issue.
A response to a request for a single range MUST NOT be sent
using the
multipart/byteranges media type. A response to a request for
multiple ranges, whose result is a single range, MAY be sent as a
multipart/byteranges media type with one part. A client that cannot
decode a multipart/byteranges message MUST NOT ask for multiple
byte-ranges in a single request.
When a client requests multiple byte-ranges in one request, the
server SHOULD return them in the order that they appeared in the
request.
If the server ignores a byte-range-spec because it is syntactically
invalid, the server SHOULD treat the request as if the invalid Range
header field did not exist. (Normally, this means return a 200
response containing the full entity).
If the server receives a request (other than one including an If-
Range request-header field) with an unsatisfiable Range request-
header field (that is, all of whose byte-range-spec values have a
first-byte-pos value greater than the current length of the selected
resource), it SHOULD return a response code of 416 (Requested range
not satisfiable) (section 10.4.17).
Note: clients cannot depend on servers to send a 416 (Requested
range not satisfiable) response instead of a 200 (OK)
response for
an unsatisfiable Range request-header, since not all servers
implement this request-header.
14.35 Range
14.35.1 Byte Ranges
Since all HTTP entities are represented in HTTP messages as
sequences
of bytes, the concept of a byte range is meaningful for any HTTP
entity. (However, not all clients and servers need to support byte-
range operations.)
Byte range specifications in HTTP apply to the sequence of bytes in
the entity-body (not necessarily the same as the message-body).
A byte range operation MAY specify a single range of bytes, or a set
of ranges within a single entity.
ranges-specifier = byte-ranges-specifier
byte-ranges-specifier = bytes-unit "=" byte-range-set
byte-range-set = 1#( byte-range-spec | suffix-byte-range-spec )
byte-range-spec = first-byte-pos "-" [last-byte-pos]
first-byte-pos = 1*DIGIT
last-byte-pos = 1*DIGIT
The first-byte-pos value in a byte-range-spec gives the byte-offset
of the first byte in a range. The last-byte-pos value gives the
byte-offset of the last byte in the range; that is, the byte
positions specified are inclusive. Byte offsets start at zero.
If the last-byte-pos value is present, it MUST be greater than or
equal to the first-byte-pos in that byte-range-spec, or the byte-
range-spec is syntactically invalid. The recipient of a byte-range-
set that includes one or more syntactically invalid byte-range-spec
values MUST ignore the header field that includes that byte-range-
set.
If the last-byte-pos value is absent, or if the value is greater
than
or equal to the current length of the entity-body, last-byte-pos is
taken to be equal to one less than the current length of the entity-
body in bytes.
By its choice of last-byte-pos, a client can limit the number of
bytes retrieved without knowing the size of the entity.
suffix-byte-range-spec = "-" suffix-length
suffix-length = 1*DIGIT
A suffix-byte-range-spec is used to specify the suffix of the
entity-body, of a length given by the suffix-length value. (That is,
this form specifies the last N bytes of an entity-body.) If the
entity is shorter than the specified suffix-length, the entire
entity-body is used.
If a syntactically valid byte-range-set includes at least one byte-
range-spec whose first-byte-pos is less than the current length of
the entity-body, or at least one suffix-byte-range-spec with a non-
zero suffix-length, then the byte-range-set is satisfiable.
Otherwise, the byte-range-set is unsatisfiable. If the
byte-range-set
is unsatisfiable, the server SHOULD return a response with a status
of 416 (Requested range not satisfiable). Otherwise, the server
SHOULD return a response with a status of 206 (Partial Content)
containing the satisfiable ranges of the entity-body.
Examples of byte-ranges-specifier values (assuming an entity-body of
length 10000):
- The first 500 bytes (byte offsets 0-499, inclusive): bytes=0-
499
- The second 500 bytes (byte offsets 500-999, inclusive):
bytes=500-999
- The final 500 bytes (byte offsets 9500-9999, inclusive):
bytes=-500
- Or bytes=9500-
- The first and last bytes only (bytes 0 and 9999): bytes=0-0,-1
- Several legal but not canonical specifications of the
second 500
bytes (byte offsets 500-999, inclusive):
bytes=500-600,601-999
bytes=500-700,601-999
14.35.2 Range Retrieval Requests
HTTP retrieval requests using conditional or unconditional GET
methods MAY request one or more sub-ranges of the entity, instead of
the entire entity, using the Range request header, which applies to
the entity returned as the result of the request:
Range = "Range" ":" ranges-specifier
A server MAY ignore the Range header. However, HTTP/1.1 origin
servers and intermediate caches ought to support byte ranges when
possible, since Range supports efficient recovery from partially
failed transfers, and supports efficient partial retrieval of large
entities.
If the server supports the Range header and the specified range or
ranges are appropriate for the entity:
- The presence of a Range header in an unconditional GET modifies
what is returned if the GET is otherwise successful. In other
words, the response carries a status code of 206 (Partial
Content) instead of 200 (OK).
- The presence of a Range header in a conditional GET (a request
using one or both of If-Modified-Since and If-None-Match, or
one or both of If-Unmodified-Since and If-Match) modifies what
is returned if the GET is otherwise successful and the
condition is true. It does not affect the 304 (Not Modified)
response returned if the conditional is false.
In some cases, it might be more appropriate to use the If-Range
header (see section 14.27) in addition to the Range header.
If a proxy that supports ranges receives a Range request, forwards
the request to an inbound server, and receives an entire entity in
reply, it SHOULD only return the requested range to its client. It
SHOULD store the entire received response in its cache if that is
consistent with its cache allocation policies.
14.27 If-Range
If a client has a partial copy of an entity in its cache, and wishes
to have an up-to-date copy of the entire entity in its cache, it
could use the Range request-header with a conditional GET (using
either or both of If-Unmodified-Since and If-Match.) However, if the
condition fails because the entity has been modified, the client
would then have to make a second request to obtain the entire
current
entity-body.
The If-Range header allows a client to "short-circuit" the second
request. Informally, its meaning is `if the entity is unchanged,
send
me the part(s) that I am missing; otherwise, send me the entire new
entity'.
If-Range = "If-Range" ":" ( entity-tag | HTTP-date )
If the client has no entity tag for an entity, but does have a Last-
Modified date, it MAY use that date in an If-Range header. (The
server can distinguish between a valid HTTP-date and any form of
entity-tag by examining no more than two characters.) The If-Range
header SHOULD only be used together with a Range header, and MUST be
ignored if the request does not include a Range header, or if the
server does not support the sub-range operation.
If the entity tag given in the If-Range header matches the current
entity tag for the entity, then the server SHOULD provide the
specified sub-range of the entity using a 206 (Partial content)
response. If the entity tag does not match, then the server SHOULD
return the entire entity using a 200 (OK) response.
http://www.ietf.org/rfc/rfc5005.txt
Feed Paging and Archiving
Request for Comments: 5005 September 2007
3. Paged Feeds
A paged feed is a set of linked feed documents that together contain
the entries of a logical feed, without any guarantees about the
stability of each document's contents.
Paged feeds are lossy; that is, it is not possible to guarantee that
clients will be able to reconstruct the contents of the logical feed
at a particular time. Entries may be added or changed as the pages
of the feed are accessed, without the client becoming aware of them.
Therefore, clients SHOULD NOT present paged feeds as coherent or
complete, or make assumptions to that effect.
Paged feeds can be useful when the number of entries is very large,
infinite, or indeterminate. Clients can "page" through the feed,
only accessing a subset of the feed's entries as necessary.
For example, a search engine might make query results available as a
paged feed, so that queries with very large result sets do not
overwhelm the server, the network, or the client.
The feed documents in a paged feed are tied together with the
following link relations:
o "first" - A URI that refers to the furthest preceding document in
a series of documents.
o "last" - A URI that refers to the furthest following document in a
series of documents.
o "previous" - A URI that refers to the immediately preceding
document in a series of documents.
o "next" - A URI that refers to the immediately following document
in a series of documents.
Paged feed documents MUST have at least one of these link relations
present, and should contain as many as practical and applicable.
Example: Atom-formatted Paged Feed
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Example Feed</title>
<link href="http://example.org/"/>
<link rel="self" href="http://example.org/index.atom"/>
<link rel="next" href="http://example.org/index.atom?page=2"/>
<updated>2003-12-13T18:30:02Z</updated>
<author>
<name>John Doe</name>
</author>
<id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
<entry>
<title>Atom-Powered Robots Run Amok</title>
<link href="http://example.org/2003/12/13/atom03"/>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<updated>2003-12-13T18:30:02Z</updated>
<summary>Some text.</summary>
</entry>
</feed>
This specification does not address duplicate entries in paged feeds.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.ogf.org/pipermail/occi-wg/attachments/20121114/83a8a405/attachment-0001.html>
More information about the occi-wg
mailing list