[ogsa-wg] RE: Modeling State: Technical Questions

Wed Apr 6 17:20:16 CDT 2005

And people have submitted hundreds of thousands of jobs at once in LSF
queues, and been delighted by the fact that 'bkill 0' means kill all of
them. :-)

Stepping back from the "is this a good thing" argument a bit.

In order to support _basic_ execution services, I think we should focus on
the fundamental operations required to meet most use cases (which I believe
is control of one job at a time).

As we get some implementation experience, I believe we'll see the need for
additional interfaces which can provide operations on groups of jobs. This
might be something like one call which gives me a handle to a group of jobs
(perhaps generated from a list of resource IDs, or from some kind of query)
and then the "simple" operation can be used to operate on this job group.
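
A rough sketch of that idea in Java, purely as an illustration: the names
JobService, JobGroup, groupOf and suspend are hypothetical and are not taken
from any OGSA, GRAM or LSF interface. One call returns a handle to a group of
jobs built from a list of job IDs, and the group then reuses the single-job
operation for each member.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory job service; every name here is illustrative.
class JobService {
    enum State { RUNNING, SUSPENDED }

    private final Map<String, State> jobs = new LinkedHashMap<>();

    // Register a job as running.
    void submit(String jobId) {
        jobs.put(jobId, State.RUNNING);
    }

    // The "simple" operation: acts on exactly one job.
    void suspend(String jobId) {
        if (!jobs.containsKey(jobId)) {
            throw new IllegalArgumentException("unknown job: " + jobId);
        }
        jobs.put(jobId, State.SUSPENDED);
    }

    State stateOf(String jobId) {
        return jobs.get(jobId);
    }

    // One call that hands back a handle to a group of jobs.
    JobGroup groupOf(List<String> jobIds) {
        return new JobGroup(this, jobIds);
    }
}

// Handle to a group of jobs; applies the single-job operation to each member.
class JobGroup {
    private final JobService service;
    private final List<String> jobIds;

    JobGroup(JobService service, List<String> jobIds) {
        this.service = service;
        this.jobIds = new ArrayList<>(jobIds);
    }

    void suspend() {
        for (String id : jobIds) {
            service.suspend(id);
        }
    }

    int size() {
        return jobIds.size();
    }
}
```

The basic interface stays one-job-at-a-time; the group handle is an optional
layer on top, which is the ordering of work suggested above.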

-- Chris

On 6/4/05 10:15, "Ian Foster" <foster at mcs.anl.gov> wrote:

> For what it's worth, the Globus user community has been running thousands of
> instances of our GRAM job submission service for quite a few years, with many
> many millions of jobs running through them, and as far as I am aware, no-one
> has ever asked for the ability to manage more than one job at a time.
> Certainly the lack of this facility hasn't seemed to stop anyone.
> 
> Lots of caveats can be applied here: maybe people did ask, and I didn't hear;
> maybe they didn't think to ask; maybe our workloads are special (although
> there is a great variety). But it is a data point.
> 
> Ian.
> 
> 
> 
> 
> At 11:59 AM 4/6/2005 +0100, Mark McKeown wrote:
> 
>> Hi Paul,
>>          Moving the question from "can I suspend multiple
>> jobs by sending a single message to a resource (either
>> REST or WS-Resource)?" to whether this is a good thing.
>> 
>> 
>> There is a balance between simplicity and efficiency -
>> using a single message introduces more complexities, as
>> Steve Loughran illustrated, but is potentially more
>> efficient than sending multiple messages.
>> 
>> 
>> Remembering that "premature optimization is the root of all
>> evil" (Knuth) - is adding support for suspending multiple
>> jobs using a single message an example of premature
>> optimisation?
>> 
>> 
>> I would imagine that this should be a straightforward
>> question since there is already considerable experience
>> in using computational grids. Are users demanding the
>> ability to suspend multiple jobs using a single message?
>> Is it for improved efficiency reasons? From my experience
>> no, but others on this list will have considerably more
>> experience.
>> 
>> 
>> Could this be a case of "worse is better", where simplicity
>> is more important than efficiency?
>> 
>> Perhaps there are other reasons for using a single message
>> to interact with multiple jobs?
>> 
>> cheers
>> Mark
>> 
>> 
>> 
>>> Ian,
>>> 
>>> 
>>> 
>>> I agree that this is good progress. So let's bank that and see if we
>>> can agree on one more thing, and then I'll ask a question.
>>> 
>>> 
>>> 
>>> Considering your list of abilities (a, b & c) below, do we agree that in
>>> terms of expressiveness, the ordering is:
>>> 
>>> 
>>> 
>>> c>b>a
>>> 
>>> 
>>> 
>>> i.e. using approach c, a client can request operations on:
>>> 
>>>   a) single jobs: "where (jobid = urn:guid:364)"
>>> 
>>>   b) sets of jobs: "where (jobid = urn:guid:364) or (jobid =
>>> urn:guid:401)"
>>> 
>>> 
>>> 
>>> If there is agreement on this, then we could move on to discussing why
>>> it is felt necessary to provide more than just c for the job submission
>>> service.
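
One way to picture the c>b>a ordering is to treat every request as a
selection criterion over the service's job table. The Java sketch below is
only an illustration: JobRecord, JobSelectors and their fields are
hypothetical names, not part of any proposed interface. Selecting by a single
jobid (a) and by a list of jobids (b) are written as special cases of the
general criterion form (c).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical job record; the fields are illustrative only.
class JobRecord {
    final String jobId;
    final String owner;
    final long runtimeHours;

    JobRecord(String jobId, String owner, long runtimeHours) {
        this.jobId = jobId;
        this.owner = owner;
        this.runtimeHours = runtimeHours;
    }
}

class JobSelectors {
    // (a) a single job, expressed as a criterion: where (jobid = X)
    static Predicate<JobRecord> byId(String jobId) {
        return job -> job.jobId.equals(jobId);
    }

    // (b) a user-supplied list of jobs: where (jobid = X) or (jobid = Y) ...
    static Predicate<JobRecord> byIds(String... jobIds) {
        Set<String> wanted = new HashSet<>(Arrays.asList(jobIds));
        return job -> wanted.contains(job.jobId);
    }

    // (c) a more abstract criterion, e.g. "my jobs that have run too long"
    static Predicate<JobRecord> ownedByAndOlderThan(String owner, long hours) {
        return job -> job.owner.equals(owner) && job.runtimeHours > hours;
    }

    // Evaluate any criterion against the job table, returning matching IDs in table order.
    static List<String> select(List<JobRecord> jobs, Predicate<JobRecord> where) {
        List<String> ids = new ArrayList<>();
        for (JobRecord job : jobs) {
            if (where.test(job)) {
                ids.add(job.jobId);
            }
        }
        return ids;
    }
}
```

A single select entry point can serve all three kinds of request, which is
the sense in which c is the most expressive.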
>>> 
>>> 
>>> 
>>> Regards
>>> 
>>> Paul
>>> 
>>> 
>>> 
>>> Ian wrote...
>>> 
>>>> Savas:
>>>> 
>>>> It seems that we are in agreement, then, that we want the ability to:
>>>> 
>>>> a) Request operations on individual jobs identified by some sort of
>>>> "jobid"
>>>> 
>>>> b) Request operations on sets of jobs identified by a user-supplied
>>>> list of "jobids"
>>>> 
>>>> c) Request operations on sets of jobs identified by more abstract
>>>> criteria
>>>> 
>>>> We also agree that (as I expressed in the email that started this
>>>> discussion) such requests can be expressed in a few different ways,
>>>> with somewhat different characteristics.
>>>> 
>>>> That's progress I hope.
>>>> 
>>>> Ian.
>>> 
>>> 
>>> 
>>> ________________________________
>>> 
>>> From: Ian Foster [mailto:foster at mcs.anl.gov]
>>> Sent: 05 April 2005 17:59
>>> To: Savas Parastatidis; Steve Loughran
>>> Cc: Mark McKeown; Karl Czajkowski; Dennis Gannon; Samuel Meder; ogsa-wg;
>>> dave.pearson at oracle.com; gray at microsoft.com; humphrey at cs.virginia.edu;
>>> grimshaw at virginia.edu; aherbert at microsoft.com; gcf at indiana.edu;
>>> mark.linesch at hp.com; Frank Siebenlist; Tony Hey; Dave Berry; Paul Watson
>>> Subject: RE: [ogsa-wg] RE: Modeling State: Technical Questions
>>> 
>>> 
>>> 
>>> [I'm feeling increasingly bad about sending email to all of the people
>>> CCed here, who may not be interested in these issues at all but got
>>> addressed by Tony long ago...]
>>> 
>>> Savas:
>>> 
>>> It seems that we are in agreement, then, that we want the ability to:
>>> 
>>> a) Request operations on individual jobs identified by some sort of
>>> "jobid"
>>> 
>>> b) Request operations on sets of jobs identified by a user-supplied list
>>> of "jobids"
>>> 
>>> c) Request operations on sets of jobs identified by more abstract
>>> criteria
>>> 
>>> We also agree that (as I expressed in the email that started this
>>> discussion) such requests can be expressed in a few different ways, with
>>> somewhat different characteristics.
>>> 
>>> That's progress I hope.
>>> 
>>> Ian.
>>> 
>>> At 02:44 PM 4/5/2005 +0100, Savas Parastatidis wrote:
>>> 
>>> 
>>> 
>>> 
>>> Dear Ian,
>>> 
>>> 
>>> 
>>> I don't think that the approach I proposed forces the user to do more
>>> than they would have to do anyway if EPRs were used. It is still the
>>> case that someone has to manage the EPRs to the resources in WSRF. This
>>> is similar to what happens in the real world. The online bookstore will
>>> ask for my credit card number (a URI), or the book store will ask for an
>>> ISBN (another URI) or multiple ISBNs if I want to buy multiple books.
>>> The banking service will ask for my bank account number (another URI
>>> perhaps).
>>> 
>>> 
>>> 
>>> Also, there is no reason why a "kill all my jobs" message couldn't also be
>>> supported. But please note that this message is now addressed to the
>>> service (the container of resources) and not, as in the case of WSRF, to
>>> a specific resource. This is no different from what I am advocating.
>>> 
>>> 
>>> 
>>> Also... to Steve's point about partial failure. If one wishes atomic
>>> transaction semantics, I don't see the difference between the two
>>> approaches...
>>> 
>>> 
>>> 
>>> Atomic
>>>   Msg -> resource 1
>>>   Msg -> resource 2
>>>   Msg -> resource 3
>>> End Atomic
>>> 
>>> Vs
>>> 
>>> Msg
>>>   Atomic
>>>     Resource 1
>>>     Resource 2
>>>     Resource 3
>>>   End Atomic
>>> 
>>> 
>>> 
>>> In fact, I would argue that the latter is better because:
>>> 
>>> 
>>> 
>>> 1. It uses fewer messages (and, Steve, I am not assuming only HTTP and
>>> the optimisations that may be supported)
>>> 
>>> 
>>> 
>>> 2.  I can more easily deal with the failures in an application-specific
>>> manner since my atomic TX semantics do not span multiple msgs.
>>> 
>>> 
>>> 
>>> (Anyway... who wants to do atomic TXs over the Web anyway? :-)
>>> 
>>> 
>>> 
>>> Regards,
>>> 
>>> --
>>> Savas Parastatidis
>>> http://savas.parastatidis.name/
>>> 
>>> 
>>> 
>>> 
>>> From: Ian Foster [mailto:foster at mcs.anl.gov]
>>> Sent: Tuesday, April 05, 2005 2:22 PM
>>> To: Steve Loughran; Savas Parastatidis
>>> Cc: Mark McKeown; Karl Czajkowski; Dennis Gannon; Samuel Meder; ogsa-wg;
>>> dave.pearson at oracle.com; gray at microsoft.com; humphrey at cs.virginia.edu;
>>> grimshaw at virginia.edu; aherbert at microsoft.com; gcf at indiana.edu;
>>> mark.linesch at hp.com; Frank Siebenlist; Tony Hey; Dave Berry
>>> Subject: Re: [ogsa-wg] RE: Modeling State: Technical Questions
>>> 
>>> 
>>> 
>>> Steve's note raises a key point for me: do we really want to force the
>>> user (as Savas seems to be advocating) to keep track of jobs running at
>>> a remote site?
>>> 
>>> I'd rather send a request "kill all my jobs" or "kill all my jobs that
>>> have run for more than a day" to the factory than carefully keep track
>>> of all jobs that I have active, and how long they have been running, so
>>> that I can send the big document (or stream) discussed below.
>>> 
>>> Ian.
>>> 
>>> 
>>> At 02:10 PM 4/5/2005 +0100, Steve Loughran wrote:
>>> 
>>> Savas Parastatidis wrote:
>>> 
>>> Dear all,
>>> I think something needs to be clarified with regards to handling
>>> multiple jobs with one message. The beauty of document-oriented
>>> interactions is that you can do things like...
>>> <job-details-request>
>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-001</job-id>
>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-010</job-id>
>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>
>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-029</job-id>
>>> </job-details-request>
>>> Or
>>> <job-suspend-request>
>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>
>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-005</job-id>
>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-008</job-id>
>>> </job-suspend-request>
>>> The schema for the above document can allow anything from 0 to N number
>>> of <job-id> elements.
>>> 
>>> 
>>> The trouble with any bulk operation is you have to handle partial
>>> failure. You need either atomic operations (not long lived transactions
>>> over HTTP Savas, I wouldn't be that daft), or a way of indicating that
>>> only a bit went wrong.
>>> 
>>> Hence the 207 Multi-Status response in WebDAV, the "something failed,
>>> look in the message". WebDAV is still single instance (here a RESTy
>>> URL), but you can set >1 property and so have partial failure.
>>> 
>>> SOAP just has SOAPFault and extensions; no explicit multiple failure
>>> response. WS-RF-ResourceProperties has a similar problem with
>>> SetResourceProperties, but a different failure model in which any
>>> failure to set can result in a WS-BaseFault, indicating which failed,
>>> but providing no apparent information on which worked.
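
To make the partial-failure point concrete, here is a small Java sketch of a
bulk suspend that reports one status per job, loosely in the spirit of a
Multi-Status reply. It is an assumption-laden illustration only: BulkSuspend,
suspendAll, the status strings, and the in-memory sets standing in for the
job service are all hypothetical, and none of this is WebDAV, SOAP or WS-RF
API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical bulk operation that reports success or failure per job.
class BulkSuspend {
    static final String OK = "ok";
    static final String FAILED = "failed: not running";

    // Suspension is simulated by moving an ID from `running` to `suspended`.
    // Returns one entry per requested job, in request order.
    static Map<String, String> suspendAll(List<String> jobIds,
                                          Set<String> running,
                                          Set<String> suspended) {
        Map<String, String> status = new LinkedHashMap<>();
        for (String id : jobIds) {
            if (running.remove(id)) {
                suspended.add(id);
                status.put(id, OK);
            } else {
                status.put(id, FAILED);
            }
        }
        return status;
    }

    // True when some jobs succeeded and some failed, i.e. "only a bit went wrong".
    static boolean isPartialFailure(Map<String, String> status) {
        boolean anyOk = false;
        boolean anyFailed = false;
        for (String s : status.values()) {
            if (s.equals(OK)) {
                anyOk = true;
            } else {
                anyFailed = true;
            }
        }
        return anyOk && anyFailed;
    }
}
```

Unlike a single fault, the per-job map tells the caller both which
suspensions failed and which worked.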
>>> 
>>> It seems to me that if you want to bulk stuff, you do need ways of (a)
>>> handling partial failure and (b) declaring what happens on partial
>>> failure. For the curious, WebDAV's failure mode on file operations
>>> (MOVE, COPY) is explicitly declared to be that of failed file operations
>>> of Win98 on a FAT32 filesystem [1,2].
>>> 
>>> Alternatively, you don't go for bulk operations, either on multiple
>>> jobs or on multiple properties of a job (remember, WS-RF doesn't
>>> declare atomic/transacted property operations, so all you do here is
>>> increase the window of instability, a window that already exists).
>>> Instead you just stream a series of operations over the same HTTP/1.1
>>> connection - assuming that everything is accessible at the same far-end
>>> host - and get a series of (potentially out of order, we are talking
>>> HTTP/1.1) responses.
>>> 
>>> This could be efficient, and you could do better handling of failure.
>>> But you do need a SOAP stack that can keep an HTTP/1.1 channel open for
>>> multiple requests. Axis doesn't, even if you get httpclient to do the
>>> HTTP work; I don't know about .NET/WSE. You also need developers to
>>> model the communication correctly. Manipulating JAXRPC proxies as if
>>> they represent remote objects is *clearly* the wrong way to do it. You'd
>>> almost want to model a queue of requests waiting to be POSTed, a queue
>>> you can fill up then push out. Something like this, in your Java-era
>>> language of choice :-
>>> 
>>> // different queues for SOAP, REST
>>> Queue q = new Soap12RequestQueue();
>>> 
>>> q.add(new StatePut(job1.uri, Job.LIFECYCLE, Job.SUSPENDED));
>>> // let the queue reorder stuff if it wants to
>>> q.add(new StatePut(job2.uri, Job.LIFECYCLE, Job.SUSPENDED), Queue.POSITION_OPTIMAL);
>>> q.add(new StatePut(job3.uri, Job.LIFECYCLE, Job.SUSPENDED), Queue.POSITION_LAST);
>>> 
>>> q.setEventHandler(this);
>>> q.nonBlockingSubmit();
>>> 
>>> No, there is no code behind this example, and I am avoiding any hints as
>>> to what the event handler would look like. I think the key point is that
>>> once you embrace remote operations as async actions, then you can model
>>> the manipulations differently.  Note also that I am representing job
>>> suspension not as an explicit suspend() operation, but as a request to
>>> put a job into the suspended state. This API could work with our friend
>>> REST just as easily as with WS-RF...
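
For readers who want something that compiles, the following is one possible
minimal reading of the queue idea in Java. It is a sketch under stated
assumptions, not the design described in the message: StatePutRequest,
StatePutQueue and the `transport` callback are hypothetical, the transport
merely stands in for a real SOAP or REST call, requests are sent strictly in
the order added (no reordering hints), and the event handler is reduced to a
per-request success callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Hypothetical request: "put the resource at `uri` into state `value`".
class StatePutRequest {
    final String uri;
    final String value;

    StatePutRequest(String uri, String value) {
        this.uri = uri;
        this.value = value;
    }
}

// Hypothetical queue: fill it up, then push it out without blocking the caller.
class StatePutQueue {
    private final List<StatePutRequest> pending = new ArrayList<>();
    private final Predicate<StatePutRequest> transport;
    private BiConsumer<StatePutRequest, Boolean> handler = (request, ok) -> { };

    // `transport` stands in for the real wire call and returns true on success.
    StatePutQueue(Predicate<StatePutRequest> transport) {
        this.transport = transport;
    }

    void add(StatePutRequest request) {
        pending.add(request);
    }

    void setEventHandler(BiConsumer<StatePutRequest, Boolean> handler) {
        this.handler = handler;
    }

    // Sends the queued requests in order on a background thread. The returned
    // future completes once every request has been reported to the handler.
    CompletableFuture<Void> nonBlockingSubmit() {
        List<StatePutRequest> batch = new ArrayList<>(pending);
        pending.clear();
        BiConsumer<StatePutRequest, Boolean> callback = handler;
        return CompletableFuture.runAsync(() -> {
            for (StatePutRequest request : batch) {
                callback.accept(request, transport.test(request));
            }
        });
    }
}
```

Because each request gets its own callback, a failure on one job does not
hide the outcome of the others, which is the failure-handling advantage
claimed for streaming over a single bulk document.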
>>> 
>>> Anyway Savas, to conclude: do you have any evidence that a single
>>> document is suboptimal compared to a sequence of requests over an open
>>> HTTP/1.1 connection? That is, assuming we ignore the SHOULD in the
>>> HTTP/1.1 specification: "Clients SHOULD NOT pipeline requests using
>>> non-idempotent methods or non-idempotent sequences of methods" [3]
>>> 
>>> -Steve
>>> 
>>> 
>>> [1] WebDAV, RFC 2518, Section 8.9.2, http://www.ietf.org/rfc/rfc2518.txt
>>> 
>>> "after encountering an error moving a non-collection
>>>    resource as part of an infinite depth move, the server SHOULD try to
>>>    finish as much of the original move operation as possible."
>>> 
>>> [2]
>>> http://lists.w3.org/Archives/Public/w3c-dist-auth/1997JulSep/0177.html
>>> 
>>> [3] RFC 2616, HTTP/1.1
>>> 
>>> _______________________________________________________________
>>> Ian Foster                    www.mcs.anl.gov/~foster
>>> Math & Computer Science Div.  Dept of Computer Science
>>> Argonne National Laboratory   The University of Chicago
>>> Argonne, IL 60439, U.S.A.     Chicago, IL 60637, U.S.A.
>>> Tel: 630 252 4619             Fax: 630 252 1997
>>>         Globus Alliance, www.globus.org
>>> 
>>> 