[ogsa-wg] RE: Modeling State: Technical Questions

Thu Apr 7 07:35:27 CDT 2005

Chris,

On 6 Apr 2005, at 23:20, Christopher Smith wrote:

> And people have submitted hundreds of thousands of jobs at once in LSF
> queues, and been delighted by the fact that ‘bkill 0’ means kill all of
> them. :-)

And I'm sure a few were seriously distressed, having bkilled a week's  
worth of work :-).

Otherwise +1

> Stepping back from the "is this a good thing" argument a bit.
>
> In order to support _basic_ execution services, I think we should focus on
> the fundamental operations required to meet most use cases (which I believe
> is control of one job at a time).
>
> As we get some implementation experience, I believe we'll see the need for
> additional interfaces which can provide operations on groups of jobs. This
> might be something like one call which gives me a handle to a group of jobs
> (perhaps generated from a list of resource IDs, or from some kind of query)
> and then the "simple" operation can be used to operate on this job group.
>
> -- Chris
>
>
>
> On 6/4/05 10:15, "Ian Foster" <foster at mcs.anl.gov> wrote:
>
>> For what it's worth, the Globus user community has been running thousands of
>> instances of our GRAM job submission service for quite a few years, with many
>> many millions of jobs running through them, and as far as I am aware, no-one
>> has ever asked for the ability to manage more than one job at a time.
>> Certainly the lack of this facility hasn't seemed to stop anyone.
>>
>> Lots of caveats can be applied here: maybe people did ask, and I didn't hear;
>> maybe they didn't think to ask; maybe our workloads are special (although
>> there is a great variety). But it is a data point.
>>
>> Ian.
>>
>>
>>
>>
>> At 11:59 AM 4/6/2005 +0100, Mark McKeown wrote:
>>
>>> Hi Paul,
>>>          Moving the question from can I suspend multiple
>>> jobs by sending a single message to a resource (either
>>> REST or WS-Resource) to whether this is a good thing.
>>>
>>>
>>> There is a balance between simplicity and efficiency -
>>> using a single message introduces more complexities, as
>>> Steve Loughran illustrated, but is potentially more
>>> efficient than sending multiple messages.
>>>
>>>
>>> Remembering that "Early optimisation is the root of all
>>> evil" (Knuth) - is adding support for suspending multiple
>>> jobs using a single message an example of early
>>> optimisation?
>>>
>>>
>>> I would imagine that this should be a straightforward
>>> question since there is already considerable experience
>>> in using computational grids. Are users demanding the
>>> ability to suspend multiple jobs using a single message?
>>> Is it for improved efficiency reasons? From my experience
>>> no, but others on this list will have considerably more
>>> experience.
>>>
>>>
>>> Could this be a case of "worse is better", simplicity
>>> is more important than efficiency?
>>>
>>> Perhaps there are other reasons for using a single message
>>> to interact with multiple jobs?
>>>
>>> cheers
>>> Mark
>>>
>>>
>>>
>>>> Ian,
>>>>
>>>>
>>>>
>>>> I agree that this is good progress. So let's bank that and see if we can
>>>> agree on one more thing, and then I'll ask a question.
>>>>
>>>>
>>>>
>>>> Considering your list of abilities (a, b & c) below, do we agree  
>>>> that in
>>>> terms of expressiveness, the ordering is:
>>>>
>>>>
>>>>
>>>> c>b>a
>>>>
>>>>
>>>>
>>>> i.e. using approach c, a client can request operations on:
>>>>
>>>>   a) single jobs: "where (jobid = urn:guid:364)"
>>>>
>>>>   b) sets of jobs: "where (jobid = urn:guid:364) or (jobid =
>>>> urn:guid:401)"
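
A rough way to picture the ordering c>b>a: if the service accepts an arbitrary selection predicate, then a single jobid and an explicit list of jobids are just special cases of it. The sketch below is illustrative only. The Job class, its fields, and the helper names (byId, byIds, ownedAndOlderThan, select) are assumptions made for the example and are not part of any OGSA, WSRF or GRAM interface.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class JobSelection {
    // Hypothetical job description; the fields are illustrative only.
    static final class Job {
        final String id;
        final String owner;
        final Duration runtime;

        Job(String id, String owner, Duration runtime) {
            this.id = id;
            this.owner = owner;
            this.runtime = runtime;
        }
    }

    // (a) a single job id, written as a predicate
    static Predicate<Job> byId(String id) {
        return job -> job.id.equals(id);
    }

    // (b) an explicit, user-supplied set of job ids, written as a predicate
    static Predicate<Job> byIds(Set<String> ids) {
        return job -> ids.contains(job.id);
    }

    // (c) more abstract criteria: jobs of one owner that ran longer than a limit
    static Predicate<Job> ownedAndOlderThan(String owner, Duration limit) {
        return job -> job.owner.equals(owner) && job.runtime.compareTo(limit) > 0;
    }

    // The service side then needs only one selection operation.
    static List<String> select(List<Job> jobs, Predicate<Job> where) {
        return jobs.stream().filter(where).map(job -> job.id).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
            new Job("urn:guid:364", "paul", Duration.ofHours(2)),
            new Job("urn:guid:401", "paul", Duration.ofHours(30)),
            new Job("urn:guid:512", "ian", Duration.ofHours(40)));

        Set<String> two = new HashSet<>(Arrays.asList("urn:guid:364", "urn:guid:401"));
        System.out.println(select(jobs, byId("urn:guid:364")));
        System.out.println(select(jobs, byIds(two)));
        System.out.println(select(jobs, ownedAndOlderThan("paul", Duration.ofDays(1))));
    }
}
```

The point of the sketch is only that (a) and (b) need no separate code path once (c) exists. Whether a service should nevertheless expose them separately is the question being debated here.
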
>>>>
>>>>
>>>>
>>>> If there is agreement on this, then we could move on to discussing  
>>>> why
>>>> it is felt necessary to provide more than just c for the job  
>>>> submission
>>>> service.
>>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>> Paul
>>>>
>>>>
>>>>
>>>> Ian wrote...
>>>>
>>>>> Savas:
>>>>>
>>>>> It seems that we are in agreement, then, that we want the ability to:
>>>>>
>>>>> a) Request operations on individual jobs identified by some sort of
>>>>> "jobid"
>>>>>
>>>>> b) Request operations on sets of jobs identified by a user-supplied
>>>>> list of "jobids"
>>>>>
>>>>> c) Request operations on sets of jobs identified by more abstract
>>>>> criteria
>>>>>
>>>>> We also agree that (as I expressed in the email that started this
>>>>> discussion) such requests can be expressed in a few different ways,
>>>>> with somewhat different characteristics.
>>>>>
>>>>> That's progress I hope.
>>>>>
>>>>> Ian.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>>
>>>> From: Ian Foster [mailto:foster at mcs.anl.gov]
>>>> Sent: 05 April 2005 17:59
>>>> To: Savas Parastatidis; Steve Loughran
>>>> Cc: Mark McKeown; Karl Czajkowski; Dennis Gannon; Samuel Meder;  
>>>> ogsa-wg;
>>>> dave.pearson at oracle.com; gray at microsoft.com;  
>>>> humphrey at cs.virginia.edu;
>>>> grimshaw at virginia.edu; aherbert at microsoft.com; gcf at indiana.edu;
>>>> mark.linesch at hp.com; Frank Siebenlist; Tony Hey; Dave Berry; Paul  
>>>> Watson
>>>> Subject: RE: [ogsa-wg] RE: Modeling State: Technical Questions
>>>>
>>>>
>>>>
>>>> [I'm feeling increasingly bad about sending email to all of the  
>>>> people
>>>> CCed here, who may not be interested in these issues at all but got
>>>> addressed by Tony long ago...]
>>>>
>>>> Savas:
>>>>
>>>> It seems that we are in agreement, then, that we want the ability  
>>>> to:
>>>>
>>>> a) Request operations on individual jobs identified by some sort of
>>>> "jobid"
>>>>
>>>> b) Request operations on sets of jobs identified by a user-supplied  
>>>> list
>>>> of "jobids"
>>>>
>>>> c) Request operations on sets of jobs identified by more abstract
>>>> criteria
>>>>
>>>> We also agree that (as I expressed in the email that started this
>>>> discussion) such requests can be expressed in a few different ways,  
>>>> with
>>>> somewhat different characteristics.
>>>>
>>>> That's progress I hope.
>>>>
>>>> Ian.
>>>>
>>>> At 02:44 PM 4/5/2005 +0100, Savas Parastatidis wrote:
>>>>
>>>>
>>>>
>>>>
>>>> Dear Ian,
>>>>
>>>>
>>>>
>>>> I don't think that the approach I proposed forces the user to do more
>>>> than they would have to do anyway if EPRs were used. It is still the
>>>> case that someone has to manage the EPRs to the resources in WSRF. This
>>>> is similar to what happens in the real world. The online bookstore will
>>>> ask for my credit card number (a URI), or the book store will ask for an
>>>> ISBN (another URI) or multiple ISBNs if I want to buy multiple books.
>>>> The banking service will ask for my bank account number (another URI
>>>> perhaps).
>>>>
>>>>
>>>>
>>>> Also, there is no reason why a "kill all my jobs" message couldn't also be
>>>> supported. But please note that this message is now addressed to the
>>>> service (the container of resources) and not, as in the case of WSRF, to
>>>> a specific resource. This is no different from what I am advocating.
>>>>
>>>>
>>>>
>>>> Also... to Steve's point about partial failure. If one wishes atomic
>>>> transaction semantics, I don't see the difference between the two
>>>> approaches...
>>>>
>>>>
>>>>
>>>> Atomic
>>>>
>>>>   Msg -> resource 1
>>>>
>>>>   Msg -> resource 2
>>>>
>>>>   Msg -> resource 3
>>>>
>>>> End Atomic
>>>>
>>>>
>>>>
>>>> Vs
>>>>
>>>>
>>>>
>>>> Msg
>>>>
>>>>   Atomic
>>>>
>>>>     Resource 1
>>>>
>>>>     Resource 2
>>>>
>>>>     Resource 3
>>>>
>>>>   End Atomic
>>>>
>>>>
>>>>
>>>> In fact, I would argue that the latter is better because:
>>>>
>>>>
>>>>
>>>> 1. It uses fewer messages (and, Steve, I am not assuming only HTTP and
>>>> the optimisations that may be supported)
>>>>
>>>> 2. I can more easily deal with the failures in an application-specific
>>>> manner since my atomic TX semantics do not span multiple msgs.
>>>>
>>>>
>>>>
>>>> (Anyway... who wants to do atomic TXs over the Web anyway? :-)
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> --
>>>> Savas Parastatidis
>>>> http://savas.parastatidis.name <http://savas.parastatidis.name/>
>>>>
>>>>
>>>>
>>>>
>>>> From: Ian Foster [mailto:foster at mcs.anl.gov]
>>>> Sent: Tuesday, April 05, 2005 2:22 PM
>>>> To: Steve Loughran; Savas Parastatidis
>>>> Cc: Mark McKeown; Karl Czajkowski; Dennis Gannon; Samuel Meder;  
>>>> ogsa-wg;
>>>> dave.pearson at oracle.com; gray at microsoft.com;  
>>>> humphrey at cs.virginia.edu;
>>>> grimshaw at virginia.edu; aherbert at microsoft.com; gcf at indiana.edu;
>>>> mark.linesch at hp.com; Frank Siebenlist; Tony Hey; Dave Berry
>>>> Subject: Re: [ogsa-wg] RE: Modeling State: Technical Questions
>>>>
>>>>
>>>>
>>>> Steve's note raises a key point for me: do we really want to force the
>>>> user (as Savas seems to be advocating) to keep track of jobs running at
>>>> a remote site?
>>>>
>>>> I'd rather send a request "kill all my jobs" or "kill all my jobs that
>>>> have run for more than a day" to the factory than carefully keep track
>>>> of all jobs that I have active, and how long they have been running, so
>>>> that I can send the big document (or stream) discussed below.
>>>>
>>>> Ian.
>>>>
>>>>
>>>> At 02:10 PM 4/5/2005 +0100, Steve Loughran wrote:
>>>>
>>>> Savas Parastatidis wrote:
>>>>
>>>> Dear all,
>>>> I think something needs to be clarified with regards to handling
>>>> multiple jobs with one message. The beauty of document-oriented
>>>> interactions is that you can do things like...
>>>> <job-details-request>
>>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-001</job-id>
>>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-010</job-id>
>>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>
>>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-029</job-id>
>>>> </job-details-request>
>>>> Or
>>>> <job-suspend-request>
>>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>
>>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-005</job-id>
>>>>  <job-id>urn:ogsa:job:guid:bla-bla-bla-008</job-id>
>>>> </job-suspend-request>
>>>> The schema for the above document can allow anything from 0 to N
>>>> <job-id> elements.
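
On the receiving side, a document with 0 to N <job-id> elements can be read with the standard Java DOM parser. This is a minimal sketch under stated assumptions: the element names are taken from the example documents above, while the class name JobIdDocument and the method jobIds are invented for illustration. It does no schema validation and ignores namespaces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class JobIdDocument {
    // Returns the text of every <job-id> element, in document order.
    // An empty request yields an empty list rather than an error.
    static List<String> jobIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("job-id");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(nodes.item(i).getTextContent().trim());
            }
            return ids;
        } catch (Exception e) {
            // Malformed input is reported as an unchecked exception in this sketch.
            throw new IllegalArgumentException("not a well-formed request document", e);
        }
    }

    public static void main(String[] args) {
        String request =
            "<job-suspend-request>"
            + "<job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>"
            + "<job-id>urn:ogsa:job:guid:bla-bla-bla-005</job-id>"
            + "<job-id>urn:ogsa:job:guid:bla-bla-bla-008</job-id>"
            + "</job-suspend-request>";
        System.out.println(jobIds(request));
        System.out.println(jobIds("<job-suspend-request/>"));
    }
}
```
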
>>>>
>>>>
>>>> the trouble with any bulk operation is you have to handle partial
>>>> failure. You need either atomic operations (not long lived  
>>>> transactions
>>>> over HTTP Savas, I wouldn't be that daft), or a way of indicating  
>>>> that
>>>> only a bit went wrong
>>>>
>>>> Hence the 207 Multi-Status response in WebDav, the "something  
>>>> failed,
>>>> look in the message". WebDav is still single instance (here a RESTy
>>>> URL), but you can set >1 property and so have partial failure.
>>>>
>>>> SOAP just has SOAPFault and extensions; no explicit multiple failure
>>>> response. WS-RF-ResourceProperties has a similar problem with
>>>> SetResourceProperties, but a different failure model in which any
>>>> failure to set can result in a WS-BaseFault, indicating which  
>>>> failed,
>>>> but providing no apparent information on which worked.
>>>>
>>>> It seems to me that if you want to bulk stuff, you do need ways of (a)
>>>> handling partial failure and (b) declaring what happens on partial
>>>> failure. For the curious, WebDav's failure mode on file operations
>>>> (MOVE, COPY) is explicitly declared to be that of failed file operations
>>>> of Win98 on a FAT32 filesystem [1,2]
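
One possible reading of (a) and (b), loosely in the spirit of a multi-status reply: apply the operation to each job independently, carry on past failures, and report an outcome per jobid so the caller can see both what failed and what worked. The sketch below is an assumption-laden illustration. BulkSuspend, Outcome and suspendAll are hypothetical names, and the suspendOne callback stands in for whatever actually suspends a job.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class BulkSuspend {
    enum Outcome { SUSPENDED, FAILED }

    // Declared failure policy of this sketch: best effort, no rollback.
    // Every job id is attempted, and each gets its own entry in the result.
    static Map<String, Outcome> suspendAll(List<String> jobIds, Consumer<String> suspendOne) {
        Map<String, Outcome> results = new LinkedHashMap<>();
        for (String id : jobIds) {
            try {
                suspendOne.accept(id);
                results.put(id, Outcome.SUSPENDED);
            } catch (RuntimeException e) {
                // Record the failure and keep going with the remaining jobs.
                results.put(id, Outcome.FAILED);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in for the real operation: pretend job-2 cannot be suspended.
        Consumer<String> fakeSuspend = id -> {
            if (id.equals("job-2")) {
                throw new IllegalStateException("cannot suspend " + id);
            }
        };
        Map<String, Outcome> results =
            suspendAll(Arrays.asList("job-1", "job-2", "job-3"), fakeSuspend);
        results.forEach((id, outcome) -> System.out.println(id + " -> " + outcome));
    }
}
```

An all-or-nothing policy would need the same per-job bookkeeping plus a way to undo the suspensions that had already succeeded, which is where the cost of atomic semantics shows up.
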
>>>>
>>>> Alternatively, you don't go for bulk operations, neither on multiple
>>>> jobs nor on multiple properties of a job (remember, WS-RF doesn't
>>>> declare atomic/transacted property operations, so all you do here is
>>>> increase the window of instability, a window that already exists).
>>>> Instead you just stream a series of operations over the same HTTP1.1
>>>> connection -assuming that everything is accessible at the same far-end
>>>> host, and get a series of (potentially out of order, we are talking
>>>> HTTP1.1) responses.
>>>>
>>>> This could be efficient, and you could do better handling of  
>>>> failure.
>>>> But you do need a SOAP stack that can keep an HTTP1.1 channel open  
>>>> for
>>>> multiple requests. Axis doesn't, even if you get httpclient to do the
>>>> HTTP work; I don't know about .NET/WSE. You also need developers to
>>>> model the communication correctly. Manipulating JAXRPC proxies as if
>>>> they represent remote objects is *clearly* the wrong way to do it.  
>>>> You'd
>>>> almost want to model a queue of requests waiting to be POSTed, a  
>>>> queue
>>>> you can fill up then push out. Something like this, in your Java-era
>>>> language of choice :-
>>>>
>>>> //different queues for SOAP, REST
>>>> Queue q=new Soap12RequestQueue();
>>>>
>>>> q.add(new StatePut(job1.uri,Job.LIFECYCLE,Job.SUSPENDED));
>>>> //let the queue reorder stuff if it wants to
>>>> q.add(new StatePut(job2.uri,Job.LIFECYCLE,Job.SUSPENDED),Queue.POSITION_OPTIMAL);
>>>> q.add(new StatePut(job3.uri,Job.LIFECYCLE,Job.SUSPENDED),Queue.POSITION_LAST);
>>>>
>>>> q.setEventHandler(this);
>>>> q.nonBlockingSubmit();
>>>>
>>>> No, there is no code behind this example, and I am avoiding any hints as
>>>> to what the event handler would look like. I think the key point is that
>>>> once you embrace remote operations as async actions, then you can  
>>>> model
>>>> the manipulations differently.  Note also that I am representing job
>>>> suspension not as an explicit suspend() operation, but as a request  
>>>> to
>>>> put a job into the suspended state. This API could work with our  
>>>> friend
>>>> REST just as easily as with WS-RF...
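
For readers who want something they can run, here is one hypothetical way the queue idea could be fleshed out. It is not the API sketched above, which explicitly has no code behind it. The types StatePut and RequestQueue are re-invented for this example, the transport is a plain Predicate that stands in for a SOAP or REST POST, the reordering hints are left out, and the event handler is simply a callback that receives each request with a success flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

class RequestQueueSketch {
    // "Put this property of this resource into this state", not "call suspend()".
    static final class StatePut {
        final String uri;
        final String property;
        final String value;

        StatePut(String uri, String property, String value) {
            this.uri = uri;
            this.property = property;
            this.value = value;
        }
    }

    // Not thread-safe for concurrent add(); kept simple on purpose.
    static final class RequestQueue {
        private final List<StatePut> pending = new ArrayList<>();
        private final Predicate<StatePut> transport; // returns true when the request succeeded
        private BiConsumer<StatePut, Boolean> handler = (request, ok) -> { };

        RequestQueue(Predicate<StatePut> transport) {
            this.transport = transport;
        }

        void add(StatePut request) {
            pending.add(request);
        }

        void setEventHandler(BiConsumer<StatePut, Boolean> handler) {
            this.handler = handler;
        }

        // Hands the queued requests to a background thread and returns at once.
        // The handler is called once per request, in queue order.
        CompletableFuture<Void> nonBlockingSubmit() {
            List<StatePut> batch = new ArrayList<>(pending);
            pending.clear();
            BiConsumer<StatePut, Boolean> callback = handler;
            return CompletableFuture.runAsync(() -> {
                for (StatePut request : batch) {
                    boolean ok;
                    try {
                        ok = transport.test(request);
                    } catch (RuntimeException e) {
                        ok = false; // a failed send does not stop the rest of the queue
                    }
                    callback.accept(request, ok);
                }
            });
        }
    }

    public static void main(String[] args) {
        // Fake transport: pretend the request for job2 fails.
        RequestQueue q = new RequestQueue(request -> !request.uri.endsWith("job2"));
        q.add(new StatePut("http://example.org/jobs/job1", "lifecycle", "suspended"));
        q.add(new StatePut("http://example.org/jobs/job2", "lifecycle", "suspended"));
        q.add(new StatePut("http://example.org/jobs/job3", "lifecycle", "suspended"));
        q.setEventHandler((request, ok) ->
            System.out.println(request.uri + (ok ? " ok" : " failed")));
        q.nonBlockingSubmit().join(); // join only so the demo waits before exiting
    }
}
```

Because each request reports back on its own, partial failure falls out of the model directly, which is the advantage claimed for streaming individual operations over sending one bulk document.
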
>>>>
>>>> Anyway Savas, to conclude: do you have any evidence that a single
>>>> document is suboptimal compared to a sequence of requests over an open
>>>> HTTP/1.1 connection? That is, assuming we ignore the SHOULD in the
>>>> HTTP1.1 specification " Clients SHOULD NOT pipeline requests using
>>>> non-idempotent methods or non-idempotent sequences of methods" [3]
>>>>
>>>> -Steve
>>>>
>>>>
>>>> [1] WebDav http://www.ietf.org/rfc/rfc2518.txt Section 8.9.2
>>>>
>>>> "after encountering an error moving a non-collection
>>>>    resource as part of an infinite depth move, the server SHOULD try to
>>>>    finish as much of the original move operation as possible."
>>>>
>>>> [2] http://lists.w3.org/Archives/Public/w3c-dist-auth/1997JulSep/0177.html
>>>>
>>>> [3] RFC2616 HTTP1.1
>>>>
>>>> _______________________________________________________________
>>>> Ian Foster                    www.mcs.anl.gov/~foster
>>>> <http://www.mcs.anl.gov/~foster>
>>>> Math & Computer Science Div.  Dept of Computer Science
>>>> Argonne National Laboratory   The University of Chicago
>>>> Argonne, IL 60439, U.S.A.     Chicago, IL 60637, U.S.A.
>>>> Tel: 630 252 4619             Fax: 630 252 1997
>>>>         Globus Alliance, www.globus.org <http://www.globus.org/>
>>>> <http://www.globus.org/>
>>>>
>>>>
>>
>
>
-- 

Take care:

     Dr. David Snelling < David . Snelling . UK . Fujitsu . com >
     Fujitsu Laboratories of Europe
     Hayes Park Central
     Hayes End Road
     Hayes, Middlesex  UB4 8FE

     +44-208-606-4649 (Office)
     +44-208-606-4539 (Fax)
     +44-7768-807526  (Mobile)